Repository: chiphuyen/stanford-tensorflow-tutorials
Branch: master
Commit: 51e53daaa2a3
Files: 116
Total size: 19.7 MB

Directory structure:
gitextract_fr18ta9_/

├── .gitignore
├── 2017/
│   ├── README.md
│   ├── assignments/
│   │   ├── chatbot/
│   │   │   ├── README.md
│   │   │   ├── chatbot.py
│   │   │   ├── config.py
│   │   │   ├── data.py
│   │   │   ├── model.py
│   │   │   └── output_convo.txt
│   │   ├── exercises/
│   │   │   ├── e01.py
│   │   │   └── e01_sol.py
│   │   ├── style_transfer/
│   │   │   ├── readme.md
│   │   │   ├── style_transfer.py
│   │   │   ├── utils.py
│   │   │   └── vgg_model.py
│   │   └── style_transfer_starter/
│   │       ├── readme.md
│   │       ├── style_transfer.py
│   │       ├── utils.py
│   │       └── vgg_model.py
│   ├── data/
│   │   ├── arvix_abstracts.txt
│   │   ├── fire_theft.xls
│   │   ├── friday.tfrecord
│   │   ├── heart.csv
│   │   └── heart.txt
│   ├── examples/
│   │   ├── 02_feed_dict.py
│   │   ├── 02_lazy_loading.py
│   │   ├── 02_simple_tf.py
│   │   ├── 02_variables.py
│   │   ├── 03_linear_regression_sol.py
│   │   ├── 03_linear_regression_starter.py
│   │   ├── 03_logistic_regression_mnist_sol.py
│   │   ├── 03_logistic_regression_mnist_starter.py
│   │   ├── 04_word2vec_no_frills.py
│   │   ├── 04_word2vec_starter.py
│   │   ├── 04_word2vec_visualize.py
│   │   ├── 05_csv_reader.py
│   │   ├── 05_randomization.py
│   │   ├── 07_basic_filters.py
│   │   ├── 07_convnet_mnist.py
│   │   ├── 07_convnet_mnist_starter.py
│   │   ├── 09_queue_example.py
│   │   ├── 09_tfrecord_example.py
│   │   ├── 11_char_rnn_gist.py
│   │   ├── autoencoder/
│   │   │   ├── autoencoder.py
│   │   │   ├── layer_utils.py
│   │   │   ├── layers.py
│   │   │   ├── train.py
│   │   │   └── utils.py
│   │   ├── cgru/
│   │   │   ├── README.md
│   │   │   ├── custom_getter.py
│   │   │   ├── data_reader.py
│   │   │   ├── my_layers.py
│   │   │   └── neural_gpu_v3.py
│   │   ├── data/
│   │   │   ├── arvix_abstracts.txt
│   │   │   ├── fire_theft.xls
│   │   │   ├── heart.csv
│   │   │   └── heart.txt
│   │   ├── deepdream/
│   │   │   ├── deepdream_exercise.py
│   │   │   └── deepdream_solution.py
│   │   ├── graphs/
│   │   │   ├── gist/
│   │   │   │   ├── events.out.tfevents.1499787135.MacBook-Pro
│   │   │   │   ├── events.out.tfevents.1499787150.MacBook-Pro
│   │   │   │   └── events.out.tfevents.1499787321.MacBook-Pro
│   │   │   ├── l2/
│   │   │   │   ├── events.out.tfevents.1499786503.MacBook-Pro
│   │   │   │   └── events.out.tfevents.1499786515.MacBook-Pro
│   │   │   └── linear_reg/
│   │   │       └── events.out.tfevents.1499786822.MacBook-Pro
│   │   ├── kernels.py
│   │   ├── process_data.py
│   │   └── utils.py
│   └── setup/
│       ├── requirements.txt
│       └── setup_instruction.md
├── LICENSE
├── README.md
├── assignments/
│   ├── 01/
│   │   ├── q1.py
│   │   └── q1_sol.py
│   ├── 02_style_transfer/
│   │   ├── load_vgg.py
│   │   ├── load_vgg_sol.py
│   │   ├── style_transfer.py
│   │   ├── style_transfer_sol.py
│   │   └── utils.py
│   ├── chatbot/
│   │   ├── README.md
│   │   ├── chatbot.py
│   │   ├── config.py
│   │   ├── data.py
│   │   ├── model.py
│   │   └── output_convo.txt
│   ├── trump_bot/
│   │   └── trump_tweets.txt
│   └── word_transform/
│       ├── common.en.vocab
│       ├── eval.vocab
│       └── train.vocab
├── examples/
│   ├── 02_lazy_loading.py
│   ├── 02_placeholder.py
│   ├── 02_simple_tf.py
│   ├── 02_variables.py
│   ├── 03_linreg_dataset.py
│   ├── 03_linreg_placeholder.py
│   ├── 03_linreg_starter.py
│   ├── 03_logreg.py
│   ├── 03_logreg_placeholder.py
│   ├── 03_logreg_starter.py
│   ├── 04_linreg_eager.py
│   ├── 04_linreg_eager_starter.py
│   ├── 04_word2vec.py
│   ├── 04_word2vec_eager.py
│   ├── 04_word2vec_eager_starter.py
│   ├── 04_word2vec_visualize.py
│   ├── 05_randomization.py
│   ├── 05_variable_sharing.py
│   ├── 07_convnet_layers.py
│   ├── 07_convnet_mnist.py
│   ├── 07_convnet_mnist_starter.py
│   ├── 07_run_kernels.py
│   ├── 11_char_rnn.py
│   ├── kernels.py
│   ├── utils.py
│   └── word2vec_utils.py
└── setup/
    ├── requirements.txt
    └── setup_instruction.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================

*.pdf

*.SUNet
*.pyc
.env/*
examples/data
examples/graphs/*
examples/checkpoints/*
examples/visualization/*


================================================
FILE: 2017/README.md
================================================
# tf-stanford-tutorials
This repository contains code examples for the 2017 course CS 20SI: TensorFlow for Deep Learning Research.<br>
Detailed syllabus and lecture notes can be found [here](http://cs20si.stanford.edu/2017).<br>
For this year's course, please see [here](http://cs20.stanford.edu)

# Note (as of July 11, 2017)
I've updated the code to TensorFlow 1.2 and Python3, except the code for chatbot. I will update the code for chatbot soon.


## Models include: <br>
### In the folder "examples": <br>
Linear Regression with Chicago's Fire-Theft dataset<br>
Logistic Regression with MNIST<br>
Word2vec skip-gram model with NCE loss<br>
Convnets with MNIST<br>
Autoencoder (by Nishith Khandwala)<br>
Deepdream (by Jon Shlens)<br>
Character-level language modeling <br>
<br>
### In the folder "assignments":<br>
Style Transfer<br>
Chatbot using sequence to sequence with attention<br>
<br>
## Misc<br>
Examples on how to use data readers, TFRecord<br>
Embedding visualization with TensorBoard<br>
Usage of summary ops<br>
Exercises to get familiar with other special TensorFlow ops<br>
Demonstration of the danger of lazy loading <br>
Convolutional GRU (CGRU) (by Lukasz Kaiser)

================================================
FILE: 2017/assignments/chatbot/README.md
================================================
<<<<<<<
A neural chatbot using sequence to sequence model with
attentional decoder. This is a fully functional chatbot.

This is based on Google Translate Tensorflow model 
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen as the starter code for assignment 3,
class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

The detailed assignment handout and information on training time can be found at http://web.stanford.edu/class/cs20si/assignments/a3.pdf 

<h2>Sample conversation</h2>
HUMAN ++++ hi <br>
BOT ++++ hi . what ' s your name ?<br>
HUMAN ++++ i'm chip<br>
BOT ++++ please .<br>
HUMAN ++++ please what?<br>
BOT ++++ i don ' t know what to say .<br>
HUMAN ++++ where are you from?<br>
BOT ++++ california .<br>
HUMAN ++++ do you like california?<br>
BOT ++++ i ' m okay .<br>
HUMAN ++++ i like california a lot<br>
BOT ++++ of what ?<br>
HUMAN ++++ i live here<br>
BOT ++++ and what is it ?<br>
HUMAN ++++ california<br>
BOT ++++ i don ' t know what to say .<br>
HUMAN ++++ i'm a student here<br>
BOT ++++ and get it out of here .<br>
HUMAN ++++ what do you like to do in your free time?<br>
BOT ++++ and what about the gun ?<br>
HUMAN ++++ do you have a gun?<br>
BOT ++++ yes .<br>
HUMAN ++++ you're dangerous!<br>
BOT ++++ i ' m not .<br>
HUMAN ++++ do you go hunting?<br>
BOT ++++ i can do .<br>
HUMAN ++++ what do you hunt?<br>
BOT ++++ virgin .<br>
HUMAN ++++ that's very bad<br>
BOT ++++ and you in love ?<br>
HUMAN ++++ no i'm not<br>
BOT ++++ please - -<br>
HUMAN ++++ okay, i am<br>
BOT ++++ and you are good .<br>
HUMAN ++++ yeah i am good too.<br>
BOT ++++ good good to you .<br>
HUMAN ++++ are you in love?<br>
BOT ++++ i ' m ready .<br>
HUMAN ++++ ready for what?<br>
BOT ++++ for what ?<br>
HUMAN ++++ for what?<br>
BOT ++++ i don ' t know what to say .<br>
HUMAN ++++ tell me<br>
BOT ++++ i can do it .<br>
HUMAN ++++ okay, tell me what you're ready for<br>
BOT ++++ i ' m doing you .<br>
HUMAN ++++ no you're not<br>
BOT ++++ please !<br>
HUMAN ++++ this is a strange conversation<br>
BOT ++++ i don ' t know what to say .<br>
HUMAN ++++ are you a guy?<br>
BOT ++++ of what ?<br>
HUMAN ++++ are you a boy?<br>
BOT ++++ i ' m fine .<br>
HUMAN ++++ are you a man?<br>
BOT ++++ i ' m a little .<br>
HUMAN ++++ what's your name?<br>
BOT ++++ chris -<br>
HUMAN ++++ chris who?<br>
BOT ++++ cornelius .<br>
HUMAN ++++ so i guess you're a guy.<br>
BOT ++++ i am .<br>
HUMAN ++++ okay<br>
BOT ++++ okay ?<br>
HUMAN ++++ do you have a girlfriend?<br>
BOT ++++ i ' m sure .<br>
HUMAN ++++ what's her name?<br>
BOT ++++ let ' s talk about something else .<br>

See output_convo.txt for more sample conversations.

<h2>Usage</h2>

Step 1: create a data folder in your project directory, download
the Cornell Movie-Dialogs Corpus from 
https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Unzip it

Step 2: python data.py
<br>This will do all the pre-processing for the Cornell dataset.

Step 3:
python chatbot.py --mode [train/chat] <br>
If the mode is train, you train the chatbot. By default, the model will
restore the previously trained weights (if there are any) and continue
training from them.

If you want to start training from scratch, please delete all the checkpoints
in the checkpoints folder.

If the mode is chat, you'll go into the interaction mode with the bot.
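
As a rough illustration of what the pre-processing in Step 2 does, the sketch below splits corpus records on the ` +++$+++ ` separator and pairs each utterance with the one that follows it. It is a simplified Python 3 rewrite, not the code in data.py; the helper names `parse_lines` and `parse_convos` and the sample records are made up for the example.

```python
SEP = " +++$+++ "  # field separator used by the corpus files


def parse_lines(raw_lines):
    """Map line id -> utterance text (records with 5 fields)."""
    id2line = {}
    for raw in raw_lines:
        parts = raw.rstrip("\n").split(SEP)
        if len(parts) == 5:
            id2line[parts[0]] = parts[4]
    return id2line


def parse_convos(raw_convos):
    """Return each conversation as a list of line ids (records with 4 fields)."""
    convos = []
    for raw in raw_convos:
        parts = raw.rstrip("\n").split(SEP)
        if len(parts) == 4:
            # the last field looks like "['L1', 'L2']": drop brackets, then quotes
            ids = [item.strip("'") for item in parts[3][1:-1].split(", ")]
            convos.append(ids)
    return convos


def question_answers(id2line, convos):
    """Pair every utterance with the utterance that follows it."""
    questions, answers = [], []
    for convo in convos:
        for first, second in zip(convo, convo[1:]):
            questions.append(id2line[first])
            answers.append(id2line[second])
    return questions, answers


# Made-up sample records in the corpus layout (not real corpus data)
sample_lines = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ hi there\n",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ hello\n",
    "L3 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ how are you?\n",
]
sample_convos = ["u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2', 'L3']\n"]

id2line = parse_lines(sample_lines)
convos = parse_convos(sample_convos)
questions, answers = question_answers(id2line, convos)
print(list(zip(questions, answers)))
```

Each conversation of n lines yields n - 1 question/answer pairs, which is why the middle utterance appears once as an answer and once as a question.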

By default, all the conversations you have with the chatbot will be written
into the file output_convo.txt in the processed folder. If you run this chatbot,
I kindly ask you to send me the output_convo.txt so that I can improve
the chatbot. My email is huyenn@stanford.edu
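
If you want to inspect the log before sending it, a minimal sketch like the one below can read it back. It assumes the layout shown in the sample conversation (`HUMAN ++++ ...` and `BOT ++++ ...` lines) and a line of `=` characters between chat sessions; `parse_convo_log` is a hypothetical helper, not part of this repository.

```python
def parse_convo_log(text):
    """Return a list of sessions; each session is a list of (speaker, utterance)."""
    sessions, current = [], []
    for line in text.splitlines():
        if line.startswith("====="):  # separator written when a chat session ends
            if current:
                sessions.append(current)
            current = []
            continue
        speaker, sep, utterance = line.partition(" ++++ ")
        if sep and speaker in ("HUMAN", "BOT"):
            current.append((speaker, utterance.strip()))
    if current:  # keep a trailing session that has no separator after it
        sessions.append(current)
    return sessions


# Lines taken from the sample conversation above
sample_log = (
    "HUMAN ++++ hi\n"
    "BOT ++++ hi . what ' s your name ?\n"
    "=============================================\n"
    "HUMAN ++++ where are you from?\n"
    "BOT ++++ california .\n"
)
sessions = parse_convo_log(sample_log)
for number, session in enumerate(sessions, start=1):
    print("session", number, session)
```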

If you find the tutorial helpful, please head over to <a href="http://web.stanford.edu/class/cs20si/anonymous_chatlog.pdf">Anonymous Chatlog Donation</a>
to see how you can help us create the first realistic dialogue dataset.

Thank you very much!
>>>>>>> origin/master


================================================
FILE: 2017/assignments/chatbot/chatbot.py
================================================
""" A neural chatbot using sequence to sequence model with
attentional decoder. 

This is based on Google Translate Tensorflow model 
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen as the starter code for assignment 3,
class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

This file contains the code to run the model.

See readme.md for instructions on how to run the starter code.
"""
from __future__ import division
from __future__ import print_function

import argparse
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import random
import sys
import time

import numpy as np
import tensorflow as tf

from model import ChatBotModel
import config
import data

def _get_random_bucket(train_buckets_scale):
    """ Get a random bucket from which to choose a training sample """
    rand = random.random()
    return min([i for i in range(len(train_buckets_scale))
                if train_buckets_scale[i] > rand])

def _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks):
    """ Assert that the encoder inputs, decoder inputs, and decoder masks are
    of the expected lengths """
    if len(encoder_inputs) != encoder_size:
        raise ValueError("Encoder length must be equal to the one in bucket,"
                        " %d != %d." % (len(encoder_inputs), encoder_size))
    if len(decoder_inputs) != decoder_size:
        raise ValueError("Decoder length must be equal to the one in bucket,"
                       " %d != %d." % (len(decoder_inputs), decoder_size))
    if len(decoder_masks) != decoder_size:
        raise ValueError("Weights length must be equal to the one in bucket,"
                       " %d != %d." % (len(decoder_masks), decoder_size))

def run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, forward_only):
    """ Run one step in training.
    @forward_only: boolean value to decide whether a backward path should be created
    forward_only is set to True when you just want to evaluate on the test set,
    or when you want the bot to be in chat mode. """
    encoder_size, decoder_size = config.BUCKETS[bucket_id]
    _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks)

    # input feed: encoder inputs, decoder inputs, target_weights, as provided.
    input_feed = {}
    for step in range(encoder_size):
        input_feed[model.encoder_inputs[step].name] = encoder_inputs[step]
    for step in range(decoder_size):
        input_feed[model.decoder_inputs[step].name] = decoder_inputs[step]
        input_feed[model.decoder_masks[step].name] = decoder_masks[step]

    last_target = model.decoder_inputs[decoder_size].name
    input_feed[last_target] = np.zeros([model.batch_size], dtype=np.int32)

    # output feed: depends on whether we do a backward step or not.
    if not forward_only:
        output_feed = [model.train_ops[bucket_id],  # update op that does SGD.
                       model.gradient_norms[bucket_id],  # gradient norm.
                       model.losses[bucket_id]]  # loss for this batch.
    else:
        output_feed = [model.losses[bucket_id]]  # loss for this batch.
        for step in range(decoder_size):  # output logits.
            output_feed.append(model.outputs[bucket_id][step])

    outputs = sess.run(output_feed, input_feed)
    if not forward_only:
        return outputs[1], outputs[2], None  # Gradient norm, loss, no outputs.
    else:
        return None, outputs[0], outputs[1:]  # No gradient norm, loss, outputs.

def _get_buckets():
    """ Load the dataset into buckets based on their lengths.
    train_buckets_scale is the interval that'll help us
    choose a random bucket later on.
    """
    test_buckets = data.load_data('test_ids.enc', 'test_ids.dec')
    data_buckets = data.load_data('train_ids.enc', 'train_ids.dec')
    train_bucket_sizes = [len(data_buckets[b]) for b in range(len(config.BUCKETS))]
    print("Number of samples in each bucket:\n", train_bucket_sizes)
    train_total_size = sum(train_bucket_sizes)
    # list of increasing numbers from 0 to 1 that we'll use to select a bucket.
    train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                           for i in range(len(train_bucket_sizes))]
    print("Bucket scale:\n", train_buckets_scale)
    return test_buckets, data_buckets, train_buckets_scale

def _get_skip_step(iteration):
    """ How many steps should the model train before it saves all the weights. """
    if iteration < 100:
        return 30
    return 100

def _check_restore_parameters(sess, saver):
    """ Restore the previously trained parameters if there are any. """
    ckpt = tf.train.get_checkpoint_state(os.path.dirname(config.CPT_PATH + '/checkpoint'))
    if ckpt and ckpt.model_checkpoint_path:
        print("Loading parameters for the Chatbot")
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        print("Initializing fresh parameters for the Chatbot")

def _eval_test_set(sess, model, test_buckets):
    """ Evaluate on the test set. """
    for bucket_id in range(len(config.BUCKETS)):
        if len(test_buckets[bucket_id]) == 0:
            print("  Test: empty bucket %d" % (bucket_id))
            continue
        start = time.time()
        encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(test_buckets[bucket_id], 
                                                                        bucket_id,
                                                                        batch_size=config.BATCH_SIZE)
        _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, 
                                   decoder_masks, bucket_id, True)
        print('Test bucket {}: loss {}, time {}'.format(bucket_id, step_loss, time.time() - start))

def train():
    """ Train the bot """
    test_buckets, data_buckets, train_buckets_scale = _get_buckets()
    # in train mode, we need to create the backward path, so forward_only is False
    model = ChatBotModel(False, config.BATCH_SIZE)
    model.build_graph()

    saver = tf.train.Saver()

    with tf.Session() as sess:
        print('Running session')
        sess.run(tf.global_variables_initializer())
        _check_restore_parameters(sess, saver)

        iteration = model.global_step.eval()
        total_loss = 0
        while True:
            skip_step = _get_skip_step(iteration)
            bucket_id = _get_random_bucket(train_buckets_scale)
            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(data_buckets[bucket_id], 
                                                                           bucket_id,
                                                                           batch_size=config.BATCH_SIZE)
            start = time.time()
            _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, False)
            total_loss += step_loss
            iteration += 1

            if iteration % skip_step == 0:
                print('Iter {}: loss {}, time {}'.format(iteration, total_loss/skip_step, time.time() - start))
                start = time.time()
                total_loss = 0
                saver.save(sess, os.path.join(config.CPT_PATH, 'chatbot'), global_step=model.global_step)
                if iteration % (10 * skip_step) == 0:
                    # Run evals on development set and print their loss
                    _eval_test_set(sess, model, test_buckets)
                    start = time.time()
                sys.stdout.flush()

def _get_user_input():
    """ Get user's input, which will be transformed into encoder input later """
    print("> ", end="")
    sys.stdout.flush()
    return sys.stdin.readline()

def _find_right_bucket(length):
    """ Find the proper bucket for an encoder input based on its length """
    return min([b for b in range(len(config.BUCKETS))
                if config.BUCKETS[b][0] >= length])

def _construct_response(output_logits, inv_dec_vocab):
    """ Construct a response to the user's encoder input.
    @output_logits: the outputs from sequence to sequence wrapper.
    output_logits is decoder_size np array, each of dim 1 x DEC_VOCAB
    
    This is a greedy decoder - outputs are just argmaxes of output_logits.
    """
    print(output_logits[0])
    outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
    # If there is an EOS symbol in outputs, cut them at that point.
    if config.EOS_ID in outputs:
        outputs = outputs[:outputs.index(config.EOS_ID)]
    # Print out sentence corresponding to outputs.
    return " ".join([tf.compat.as_str(inv_dec_vocab[output]) for output in outputs])

def chat():
    """ in test mode, we don't need to create the backward path
    """
    _, enc_vocab = data.load_vocab(os.path.join(config.PROCESSED_PATH, 'vocab.enc'))
    inv_dec_vocab, _ = data.load_vocab(os.path.join(config.PROCESSED_PATH, 'vocab.dec'))

    model = ChatBotModel(True, batch_size=1)
    model.build_graph()

    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        _check_restore_parameters(sess, saver)
        output_file = open(os.path.join(config.PROCESSED_PATH, config.OUTPUT_FILE), 'a+')
        # Decode from standard input.
        max_length = config.BUCKETS[-1][0]
        print('Welcome to TensorBro. Say something. Enter to exit. Max length is', max_length)
        while True:
            line = _get_user_input()
            if len(line) > 0 and line[-1] == '\n':
                line = line[:-1]
            if line == '':
                break
            output_file.write('HUMAN ++++ ' + line + '\n')
            # Get token-ids for the input sentence.
            token_ids = data.sentence2id(enc_vocab, str(line))
            if (len(token_ids) > max_length):
                print('Max length I can handle is:', max_length)
                # the top of the loop prompts again; reading here would discard a line
                continue
            # Which bucket does it belong to?
            bucket_id = _find_right_bucket(len(token_ids))
            # Get a 1-element batch to feed the sentence to the model.
            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch([(token_ids, [])], 
                                                                            bucket_id,
                                                                            batch_size=1)
            # Get output logits for the sentence.
            _, _, output_logits = run_step(sess, model, encoder_inputs, decoder_inputs,
                                           decoder_masks, bucket_id, True)
            response = _construct_response(output_logits, inv_dec_vocab)
            print(response)
            output_file.write('BOT ++++ ' + response + '\n')
        output_file.write('=============================================\n')
        output_file.close()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--mode', choices={'train', 'chat'},
                        default='train', help="mode. if not specified, it's in the train mode")
    args = parser.parse_args()

    if not os.path.isdir(config.PROCESSED_PATH):
        data.prepare_raw_data()
        data.process_data()
    print('Data ready!')
    # create checkpoints folder if there isn't one already
    data.make_dir(config.CPT_PATH)

    if args.mode == 'train':
        train()
    elif args.mode == 'chat':
        chat()

if __name__ == '__main__':
    main()


================================================
FILE: 2017/assignments/chatbot/config.py
================================================
""" A neural chatbot using sequence to sequence model with
attentional decoder. 

This is based on Google Translate Tensorflow model 
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen as the starter code for assignment 3,
class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

This file contains the hyperparameters for the model.

See readme.md for instructions on how to run the starter code.
"""

# parameters for processing the dataset
DATA_PATH = '/Users/Chip/data/cornell movie-dialogs corpus'
CONVO_FILE = 'movie_conversations.txt'
LINE_FILE = 'movie_lines.txt'
OUTPUT_FILE = 'output_convo.txt'
PROCESSED_PATH = 'processed'
CPT_PATH = 'checkpoints'

THRESHOLD = 2

PAD_ID = 0
UNK_ID = 1
START_ID = 2
EOS_ID = 3

TESTSET_SIZE = 25000

# model parameters
""" Train encoder length distribution:
[175, 92, 11883, 8387, 10656, 13613, 13480, 12850, 11802, 10165, 
8973, 7731, 7005, 6073, 5521, 5020, 4530, 4421, 3746, 3474, 3192, 
2724, 2587, 2413, 2252, 2015, 1816, 1728, 1555, 1392, 1327, 1248, 
1128, 1084, 1010, 884, 843, 755, 705, 660, 649, 594, 558, 517, 475, 
426, 444, 388, 349, 337]
These bucket sizes seem to work the best
"""
# [19530, 17449, 17585, 23444, 22884, 16435, 17085, 18291, 18931]
# BUCKETS = [(6, 8), (8, 10), (10, 12), (13, 15), (16, 19), (19, 22), (23, 26), (29, 32), (39, 44)]

# [37049, 33519, 30223, 33513, 37371]
# BUCKETS = [(8, 10), (12, 14), (16, 19), (23, 26), (39, 43)]

# BUCKETS = [(8, 10), (12, 14), (16, 19)]
BUCKETS = [(16, 19)]

NUM_LAYERS = 3
HIDDEN_SIZE = 256
BATCH_SIZE = 64

LR = 0.5
MAX_GRAD_NORM = 5.0

NUM_SAMPLES = 512


================================================
FILE: 2017/assignments/chatbot/data.py
================================================
""" A neural chatbot using sequence to sequence model with
attentional decoder. 

This is based on Google Translate Tensorflow model 
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen as the starter code for assignment 3,
class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

This file contains the code to do the pre-processing for the
Cornell Movie-Dialogs Corpus.

See readme.md for instructions on how to run the starter code.
"""
from __future__ import print_function

import os
import random
import re

import numpy as np

import config

def get_lines():
    id2line = {}
    file_path = os.path.join(config.DATA_PATH, config.LINE_FILE)
    with open(file_path, 'rb') as f:
        lines = f.readlines()
        for line in lines:
            parts = line.split(' +++$+++ ')
            if len(parts) == 5:
                if parts[4][-1] == '\n':
                    parts[4] = parts[4][:-1]
                id2line[parts[0]] = parts[4]
    return id2line

def get_convos():
    """ Get conversations from the raw data """
    file_path = os.path.join(config.DATA_PATH, config.CONVO_FILE)
    convos = []
    with open(file_path, 'rb') as f:
        for line in f.readlines():
            parts = line.split(' +++$+++ ')
            if len(parts) == 4:
                convo = []
                for line in parts[3][1:-2].split(', '):
                    convo.append(line[1:-1])
                convos.append(convo)

    return convos

def question_answers(id2line, convos):
    """ Divide the dataset into two sets: questions and answers. """
    questions, answers = [], []
    for convo in convos:
        for index, line in enumerate(convo[:-1]):
            questions.append(id2line[convo[index]])
            answers.append(id2line[convo[index + 1]])
    assert len(questions) == len(answers)
    return questions, answers

def prepare_dataset(questions, answers):
    # create path to store all the train & test encoder & decoder
    make_dir(config.PROCESSED_PATH)
    
    # random convos to create the test set; a set keeps the membership test below fast
    test_ids = set(random.sample(range(len(questions)), config.TESTSET_SIZE))
    
    filenames = ['train.enc', 'train.dec', 'test.enc', 'test.dec']
    files = []
    for filename in filenames:
        files.append(open(os.path.join(config.PROCESSED_PATH, filename),'wb'))

    for i in range(len(questions)):
        if i in test_ids:
            files[2].write(questions[i] + '\n')
            files[3].write(answers[i] + '\n')
        else:
            files[0].write(questions[i] + '\n')
            files[1].write(answers[i] + '\n')

    for file in files:
        file.close()

def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

def basic_tokenizer(line, normalize_digits=True):
    """ A basic tokenizer to tokenize text into tokens.
    Feel free to change this to suit your need. """
    line = re.sub('<u>', '', line)
    line = re.sub('</u>', '', line)
    line = re.sub('\[', '', line)
    line = re.sub('\]', '', line)
    words = []
    # keep '-' last in the class so it is a literal hyphen, not the range '-<
    _WORD_SPLIT = re.compile(b"([.,!?\"'<>:;)(-])")
    _DIGIT_RE = re.compile(r"\d")
    for fragment in line.strip().lower().split():
        for token in re.split(_WORD_SPLIT, fragment):
            if not token:
                continue
            if normalize_digits:
                token = re.sub(_DIGIT_RE, b'#', token)
            words.append(token)
    return words

def build_vocab(filename, normalize_digits=True):
    in_path = os.path.join(config.PROCESSED_PATH, filename)
    out_path = os.path.join(config.PROCESSED_PATH, 'vocab.{}'.format(filename[-3:]))

    vocab = {}
    with open(in_path, 'rb') as f:
        for line in f.readlines():
            for token in basic_tokenizer(line):
                if not token in vocab:
                    vocab[token] = 0
                vocab[token] += 1

    sorted_vocab = sorted(vocab, key=vocab.get, reverse=True)
    with open(out_path, 'wb') as f:
        f.write('<pad>' + '\n')
        f.write('<unk>' + '\n')
        f.write('<s>' + '\n')
        f.write('<\s>' + '\n') 
        index = 4
        for word in sorted_vocab:
            if vocab[word] < config.THRESHOLD:
                with open('config.py', 'ab') as cf:
                    if filename[-3:] == 'enc':
                        cf.write('ENC_VOCAB = ' + str(index) + '\n')
                    else:
                        cf.write('DEC_VOCAB = ' + str(index) + '\n')
                break
            f.write(word + '\n')
            index += 1

def load_vocab(vocab_path):
    with open(vocab_path, 'rb') as f:
        words = f.read().splitlines()
    return words, {words[i]: i for i in range(len(words))}

def sentence2id(vocab, line):
    return [vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)]

def token2id(data, mode):
    """ Convert all the tokens in the data into their corresponding
    index in the vocabulary. """
    vocab_path = 'vocab.' + mode
    in_path = data + '.' + mode
    out_path = data + '_ids.' + mode

    _, vocab = load_vocab(os.path.join(config.PROCESSED_PATH, vocab_path))
    in_file = open(os.path.join(config.PROCESSED_PATH, in_path), 'rb')
    out_file = open(os.path.join(config.PROCESSED_PATH, out_path), 'wb')
    
    lines = in_file.read().splitlines()
    for line in lines:
        if mode == 'dec': # '<s>' and '<\s>' are only added to decoder sequences
            ids = [vocab['<s>']]
        else:
            ids = []
        ids.extend(sentence2id(vocab, line))
        # ids.extend([vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)])
        if mode == 'dec':
            ids.append(vocab['<\s>'])
        out_file.write(' '.join(str(id_) for id_ in ids) + '\n')

def prepare_raw_data():
    print('Preparing raw data into train set and test set ...')
    id2line = get_lines()
    convos = get_convos()
    questions, answers = question_answers(id2line, convos)
    prepare_dataset(questions, answers)

def process_data():
    print('Preparing data to be model-ready ...')
    build_vocab('train.enc')
    build_vocab('train.dec')
    token2id('train', 'enc')
    token2id('train', 'dec')
    token2id('test', 'enc')
    token2id('test', 'dec')

def load_data(enc_filename, dec_filename, max_training_size=None):
    encode_file = open(os.path.join(config.PROCESSED_PATH, enc_filename), 'rb')
    decode_file = open(os.path.join(config.PROCESSED_PATH, dec_filename), 'rb')
    encode, decode = encode_file.readline(), decode_file.readline()
    data_buckets = [[] for _ in config.BUCKETS]
    i = 0
    while encode and decode:
        if (i + 1) % 10000 == 0:
            print("Bucketing conversation number", i)
        encode_ids = [int(id_) for id_ in encode.split()]
        decode_ids = [int(id_) for id_ in decode.split()]
        for bucket_id, (encode_max_size, decode_max_size) in enumerate(config.BUCKETS):
            if len(encode_ids) <= encode_max_size and len(decode_ids) <= decode_max_size:
                data_buckets[bucket_id].append([encode_ids, decode_ids])
                break
        encode, decode = encode_file.readline(), decode_file.readline()
        i += 1
    return data_buckets

def _pad_input(input_, size):
    return input_ + [config.PAD_ID] * (size - len(input_))

def _reshape_batch(inputs, size, batch_size):
    """ Create batch-major inputs. Batch inputs are just re-indexed inputs
    """
    batch_inputs = []
    for length_id in range(size):
        batch_inputs.append(np.array([inputs[batch_id][length_id]
                                    for batch_id in range(batch_size)], dtype=np.int32))
    return batch_inputs


def get_batch(data_bucket, bucket_id, batch_size=1):
    """ Return one batch to feed into the model """
    # only pad to the max length of the bucket
    encoder_size, decoder_size = config.BUCKETS[bucket_id]
    encoder_inputs, decoder_inputs = [], []

    for _ in range(batch_size):
        encoder_input, decoder_input = random.choice(data_bucket)
        # pad both encoder and decoder, reverse the encoder
        encoder_inputs.append(list(reversed(_pad_input(encoder_input, encoder_size))))
        decoder_inputs.append(_pad_input(decoder_input, decoder_size))

    # now we create batch-major vectors from the data selected above.
    batch_encoder_inputs = _reshape_batch(encoder_inputs, encoder_size, batch_size)
    batch_decoder_inputs = _reshape_batch(decoder_inputs, decoder_size, batch_size)

    # create decoder masks that are 0 wherever the target is padding.
    batch_masks = []
    for length_id in range(decoder_size):
        batch_mask = np.ones(batch_size, dtype=np.float32)
        for batch_id in range(batch_size):
            # we set mask to 0 if the corresponding target is a PAD symbol.
            # the corresponding target is decoder_input shifted forward by 1.
            if length_id < decoder_size - 1:
                target = decoder_inputs[batch_id][length_id + 1]
            if length_id == decoder_size - 1 or target == config.PAD_ID:
                batch_mask[batch_id] = 0.0
        batch_masks.append(batch_mask)
    return batch_encoder_inputs, batch_decoder_inputs, batch_masks
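
To see what get_batch() produces for a single example, the padding, encoder reversal, re-indexing, and mask rule can be sketched without NumPy. This is only an illustration under assumptions: PAD_ID = 0 and the helper names below are invented for the example, and plain lists stand in for the NumPy arrays.

```python
PAD_ID = 0  # assumed value for this sketch; the real one is config.PAD_ID


def pad(seq, size):
    """Right-pad seq with PAD_ID up to length size."""
    return seq + [PAD_ID] * (size - len(seq))


def reindex_by_step(rows, size):
    """Turn batch_size rows of length size into size lists of length batch_size."""
    return [[row[t] for row in rows] for t in range(size)]


def decoder_masks(decoder_rows, size):
    """Mask is 0.0 where the target (decoder input shifted by one) is padding."""
    masks = []
    for t in range(size):
        step = []
        for row in decoder_rows:
            is_last = t == size - 1  # the last step has no target at all
            step.append(0.0 if is_last or row[t + 1] == PAD_ID else 1.0)
        masks.append(step)
    return masks


# One example in a hypothetical (4, 6) bucket.
encoder_rows = [list(reversed(pad([5, 6, 7], 4)))]
decoder_rows = [pad([1, 8, 9, 2], 6)]

if __name__ == '__main__':
    print(reindex_by_step(encoder_rows, 4))
    print(reindex_by_step(decoder_rows, 6))
    print(decoder_masks(decoder_rows, 6))
```

Reversing after padding puts the padding at the front of the encoder input, so the real tokens sit closest to the decoder.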

if __name__ == '__main__':
    prepare_raw_data()
    process_data()

================================================
FILE: 2017/assignments/chatbot/model.py
================================================
""" A neural chatbot using a sequence-to-sequence model with
an attentional decoder.

This is based on the Google Translate TensorFlow model
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence-to-sequence model by Cho et al. (2014)

Created by Chip Huyen as the starter code for assignment 3,
class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

This file contains the code to build the model

See readme.md for instructions on how to run the starter code.
"""
from __future__ import print_function

import time

import numpy as np
import tensorflow as tf

import config

class ChatBotModel(object):
    def __init__(self, forward_only, batch_size):
        """forward_only: if set, we do not construct the backward pass in the model.
        """
        print('Initialize new model')
        self.fw_only = forward_only
        self.batch_size = batch_size
    
    def _create_placeholders(self):
        # Feeds for inputs. It's a list of placeholders
        print('Create placeholders')
        self.encoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='encoder{}'.format(i))
                               for i in range(config.BUCKETS[-1][0])]
        self.decoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='decoder{}'.format(i))
                               for i in range(config.BUCKETS[-1][1] + 1)]
        self.decoder_masks = [tf.placeholder(tf.float32, shape=[None], name='mask{}'.format(i))
                              for i in range(config.BUCKETS[-1][1] + 1)]

        # Our targets are decoder inputs shifted by one (to ignore <s> symbol)
        self.targets = self.decoder_inputs[1:]
        
    def _inference(self):
        print('Create inference')
        # If we use sampled softmax, we need an output projection.
        # Sampled softmax only makes sense if we sample fewer classes than the vocabulary size.
        # Default to no projection and the standard softmax loss so that both
        # attributes exist even when sampled softmax is disabled.
        self.output_projection = None
        self.softmax_loss_function = None
        if config.NUM_SAMPLES > 0 and config.NUM_SAMPLES < config.DEC_VOCAB:
            w = tf.get_variable('proj_w', [config.HIDDEN_SIZE, config.DEC_VOCAB])
            b = tf.get_variable('proj_b', [config.DEC_VOCAB])
            self.output_projection = (w, b)

            def sampled_loss(inputs, labels):
                labels = tf.reshape(labels, [-1, 1])
                return tf.nn.sampled_softmax_loss(tf.transpose(w), b, inputs, labels,
                                                  config.NUM_SAMPLES, config.DEC_VOCAB)
            self.softmax_loss_function = sampled_loss

        single_cell = tf.nn.rnn_cell.GRUCell(config.HIDDEN_SIZE)
        self.cell = tf.nn.rnn_cell.MultiRNNCell([single_cell] * config.NUM_LAYERS)

    def _create_loss(self):
        print('Creating loss... \nIt might take a couple of minutes depending on how many buckets you have.')
        start = time.time()
        def _seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
            return tf.nn.seq2seq.embedding_attention_seq2seq(
                    encoder_inputs, decoder_inputs, self.cell,
                    num_encoder_symbols=config.ENC_VOCAB,
                    num_decoder_symbols=config.DEC_VOCAB,
                    embedding_size=config.HIDDEN_SIZE,
                    output_projection=self.output_projection,
                    feed_previous=do_decode)

        if self.fw_only:
            self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
                                        self.encoder_inputs, 
                                        self.decoder_inputs, 
                                        self.targets,
                                        self.decoder_masks, 
                                        config.BUCKETS, 
                                        lambda x, y: _seq2seq_f(x, y, True),
                                        softmax_loss_function=self.softmax_loss_function)
            # If we use output projection, we need to project outputs for decoding.
            if self.output_projection:
                for bucket in range(len(config.BUCKETS)):
                    self.outputs[bucket] = [tf.matmul(output, 
                                            self.output_projection[0]) + self.output_projection[1]
                                            for output in self.outputs[bucket]]
        else:
            self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
                                        self.encoder_inputs, 
                                        self.decoder_inputs, 
                                        self.targets,
                                        self.decoder_masks,
                                        config.BUCKETS,
                                        lambda x, y: _seq2seq_f(x, y, False),
                                        softmax_loss_function=self.softmax_loss_function)
        print('Time:', time.time() - start)

    def _creat_optimizer(self):
        print('Create optimizer... \nIt might take a couple of minutes depending on how many buckets you have.')
        with tf.variable_scope('training') as scope:
            self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

            if not self.fw_only:
                self.optimizer = tf.train.GradientDescentOptimizer(config.LR)
                trainables = tf.trainable_variables()
                self.gradient_norms = []
                self.train_ops = []
                start = time.time()
                for bucket in range(len(config.BUCKETS)):
                    
                    clipped_grads, norm = tf.clip_by_global_norm(tf.gradients(self.losses[bucket], 
                                                                 trainables),
                                                                 config.MAX_GRAD_NORM)
                    self.gradient_norms.append(norm)
                    self.train_ops.append(self.optimizer.apply_gradients(zip(clipped_grads, trainables), 
                                                            global_step=self.global_step))
                    print('Creating opt for bucket {} took {} seconds'.format(bucket, time.time() - start))
                    start = time.time()


    def _create_summary(self):
        pass

    def build_graph(self):
        self._create_placeholders()
        self._inference()
        self._create_loss()
        self._creat_optimizer()
        self._create_summary()
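
The optimizer above clips gradients with tf.clip_by_global_norm before applying them. As a rough, dependency-free sketch of that rule (plain lists stand in for tensors, and the function below is an illustration, not TensorFlow's implementation): every gradient is scaled by clip_norm / max(global_norm, clip_norm), where global_norm is the L2 norm over all gradients taken together.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients jointly so their combined L2 norm is at most clip_norm.

    grads is a list of flat lists of floats (one list per variable).
    Returns (clipped_grads, global_norm), mirroring the TensorFlow call's outputs.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # When global_norm <= clip_norm the scale is exactly 1 and nothing changes.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


if __name__ == '__main__':
    clipped, norm = clip_by_global_norm([[3.0], [4.0]], clip_norm=1.0)
    print(norm, clipped)
```

Scaling every gradient by the same factor keeps the update direction unchanged, which is why the model records the norm per bucket but shares one list of trainables.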


================================================
FILE: 2017/assignments/chatbot/output_convo.txt
================================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how are you?
BOT ++++ i ' m okay .
HUMAN ++++ really?
BOT ++++ i don ' t know what to say .
HUMAN ++++ how're you?
BOT ++++ i ' m fine .
HUMAN ++++ where're you from?
BOT ++++ california .
HUMAN ++++ which city?
BOT ++++ let ' s talk about something else .
HUMAN ++++ you don't like california?
BOT ++++ oh . . .
HUMAN ++++ why not?
BOT ++++ i can ' t .
HUMAN ++++ how old are you?
BOT ++++ eighteen .
=============================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how are you?
BOT ++++ i ' m okay .
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
=============================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how are you?
BOT ++++ i ' m okay .
HUMAN ++++ what's your name?
BOT ++++ what ?
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what's your name?
BOT ++++ prisoner .
HUMAN ++++ ha ha you really don't want to tell your name
BOT ++++ i ' m not going to do it .
HUMAN ++++ how old are you?
BOT ++++ can you guess ?
HUMAN ++++ eighteen?
BOT ++++ in the bathroom .
HUMAN ++++ in the bathroom?
BOT ++++ in the next time .
HUMAN ++++ why are you talking so funny?
BOT ++++ i ' m looking .
HUMAN ++++ where are you from?
BOT ++++ general .
HUMAN ++++ where are you from?
BOT ++++ how ?
HUMAN ++++ do you like the us?
BOT ++++ i ' m sure .
HUMAN ++++ do you vote?
BOT ++++ i ' m afraid .
HUMAN ++++ what are you afraid of?
BOT ++++ i ' m looking .
HUMAN ++++ what are you looking for?
BOT ++++ can you guess ?
HUMAN ++++ no
BOT ++++ i don ' t know what to say .
=============================================
HUMAN ++++ hi
BOT ++++ hi . what ' s your name ?
HUMAN ++++ i'm chip
BOT ++++ please .
HUMAN ++++ please what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ where are you from?
BOT ++++ california .
HUMAN ++++ do you like california?
BOT ++++ i ' m okay .
HUMAN ++++ i like california a lot
BOT ++++ of what ?
HUMAN ++++ i live here
BOT ++++ and what is it ?
HUMAN ++++ california
BOT ++++ i don ' t know what to say .
HUMAN ++++ i'm a student here
BOT ++++ and get it out of here .
HUMAN ++++ what do you like to do in your free time?
BOT ++++ and what about the gun ?
HUMAN ++++ do you have a gun?
BOT ++++ yes .
HUMAN ++++ you're dangerous!
BOT ++++ i ' m not .
HUMAN ++++ do you go hunting?
BOT ++++ i can do .
HUMAN ++++ what do you hunt?
BOT ++++ virgin .
HUMAN ++++ that's very bad
BOT ++++ and you in love ?
HUMAN ++++ no i'm not
BOT ++++ please - -
HUMAN ++++ okay, i am
BOT ++++ and you are good .
HUMAN ++++ yeah i am good too.
BOT ++++ good good to you .
HUMAN ++++ are you in love?
BOT ++++ i ' m ready .
HUMAN ++++ ready for what?
BOT ++++ for what ?
HUMAN ++++ for what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ tell me
BOT ++++ i can do it .
HUMAN ++++ okay, tell me what you're ready for
BOT ++++ i ' m doing you .
HUMAN ++++ no you're not
BOT ++++ please !
HUMAN ++++ this is a strange conversation
BOT ++++ i don ' t know what to say .
HUMAN ++++ are you a guy?
BOT ++++ of what ?
HUMAN ++++ are you a boy?
BOT ++++ i ' m fine .
HUMAN ++++ are you a man?
BOT ++++ i ' m a little .
HUMAN ++++ what's your name?
BOT ++++ chris -
HUMAN ++++ chris who?
BOT ++++ cornelius .
HUMAN ++++ so i guess you're a guy.
BOT ++++ i am .
HUMAN ++++ okay
BOT ++++ okay ?
HUMAN ++++ do you have a girlfriend?
BOT ++++ i ' m sure .
HUMAN ++++ what's her name?
BOT ++++ let ' s talk about something else .
=============================================
HUMAN ++++ hi chris
BOT ++++ hi
HUMAN ++++ how're you today?
BOT ++++ okay .
HUMAN ++++ the weather is nice today
BOT ++++ i don ' t know what to say .
HUMAN ++++ just say the weather is nice
BOT ++++ i don ' t know what to say .
=============================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how're you today?
BOT ++++ okay .
HUMAN ++++ just okay?
BOT ++++ i ' m okay .
HUMAN ++++ what's on your mind?
BOT ++++ i ' m in my room .
HUMAN ++++ doing what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ you're boring.
BOT ++++ i ' m not .
HUMAN ++++ prove it
BOT ++++ and what ?
HUMAN ++++ what what?
BOT ++++ sorry , i ' m still very stupid .
HUMAN ++++ it's okay. we're all stupid sometimes.
BOT ++++ ' s good to see .
HUMAN ++++ what's good to see?
BOT ++++ i ' m in .
HUMAN ++++ you're in your room?
BOT ++++ i ' m here .
HUMAN ++++ where's your room?
BOT ++++ i ' m in my room .
HUMAN ++++ are you alone?
BOT ++++ i ' m fine .
HUMAN ++++ are you by yourself?
BOT ++++ in the morning .
HUMAN ++++ what time is it?
BOT ++++ # # # # # # # .
=============================================
HUMAN ++++ hi
BOT ++++ hi . what ' s your name ?
HUMAN ++++ i'm marie
BOT ++++ you are .
HUMAN ++++ yes i am
BOT ++++ and you are my friend .
HUMAN ++++ yes you are 
BOT ++++ and you are my best .
HUMAN ++++ that's so sweet
BOT ++++ please .
HUMAN ++++ do you know what time it is?
BOT ++++ i don ' t know what to say .
=============================================
HUMAN ++++ what time is it?
BOT ++++ 5 9 6 9 1 4 5 .
HUMAN ++++ it doesn't make any sense
BOT ++++ are you ?
HUMAN ++++ i do 
BOT ++++ and you are ?
HUMAN ++++ i'm a human.
BOT ++++ you in .
HUMAN ++++ are you in trouble?
BOT ++++ i ' m in .
HUMAN ++++ what trouble?
BOT ++++ i ' m in love .
HUMAN ++++ wow that's nice.
BOT ++++ like that ?
HUMAN ++++ yeah
BOT ++++ i don ' t know what to say .
HUMAN ++++ who are you?
BOT ++++ i don ' t understand .
HUMAN ++++ who are you?
BOT ++++ no idea .
HUMAN ++++ what's your name?
BOT ++++ chris -
HUMAN ++++ chris who?
BOT ++++ cornelius .
HUMAN ++++ where are you from?
BOT ++++ california .
HUMAN ++++ are you american?
BOT ++++ i ' m fine .
HUMAN ++++ of course you're fine.
BOT ++++ and you ?
HUMAN ++++ i'm okay.
BOT ++++ you are .
HUMAN ++++ yes, i am.
BOT ++++ and you are my friend .
HUMAN ++++ yes, i am.
BOT ++++ and you are my friend .
HUMAN ++++ that's correct.
BOT ++++ and you ' re in trouble ?
HUMAN ++++ no, i'm not. why did you think so?
BOT ++++ because it ' s a good idea .
HUMAN ++++ what is your good idea?
BOT ++++ i ' ll be fine .
=============================================
HUMAN ++++ hello
BOT ++++ and you are in trouble ?
HUMAN ++++ what did you say that?
BOT ++++ ' s life .
HUMAN ++++ i'm not in trouble.
BOT ++++ what the hell is that ?
HUMAN ++++ okay, what trouble?
BOT ++++ i ' m looking .
HUMAN ++++ why did you say i'm in trouble?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what do you want to talk about?
BOT ++++ i can ' t .
HUMAN ++++ what can't?
BOT ++++ i ' ll have a drink .
HUMAN ++++ okay, me too.
BOT ++++ you were here ?
HUMAN ++++ i'm still here.
BOT ++++ what is it ?
HUMAN ++++ i thought you wanted a drink.
BOT ++++ i ' m not .
HUMAN ++++ you just said so!
BOT ++++ i was wrong .
HUMAN ++++ it's fine.
BOT ++++ okay ?
HUMAN ++++ yeah
BOT ++++ i don ' t know what to say .
HUMAN ++++ do you drink a lot?
BOT ++++ i can do it .
HUMAN ++++ do you drink often?
BOT ++++ in the time .
HUMAN ++++ that doesn't make any sense.
BOT ++++ are you ?
HUMAN ++++ what do you drink?
BOT ++++ ' s in a drink .
HUMAN ++++ do you drink alcohol?
BOT ++++ in time .
HUMAN ++++ i like beer.
BOT ++++ oh . . .
HUMAN ++++ don't you like beer?
BOT ++++ i don ' t know what to say .
HUMAN ++++ where do you live?
BOT ++++ up .
HUMAN ++++ up where?
BOT ++++ in the hospital .
HUMAN ++++ i'm sorry. are you okay?
BOT ++++ i ' m fine .
HUMAN ++++ why are you in the hospital?
BOT ++++ and what ?
HUMAN ++++ how many words do you know?
BOT ++++ i ' d do it .
HUMAN ++++ how many?
BOT ++++ i ' m not .
HUMAN ++++ how many?
BOT ++++ i ' m not .
HUMAN ++++ how many?
BOT ++++ i ' m not .
HUMAN ++++ whatever
BOT ++++ i don ' t know what to say .
HUMAN ++++ you're boring.
BOT ++++ i ' m not .
HUMAN ++++ yes you are.
BOT ++++ and you are my friend .
HUMAN ++++ no i'm not.
BOT ++++ please .
HUMAN ++++ okay fine. i'm your friend.
BOT ++++ you ' re my mother ?
HUMAN ++++ i have no son like you.
BOT ++++ i don ' t know what to say .
HUMAN ++++ i have no son.
BOT ++++ i see you .
HUMAN ++++ i have no children.
BOT ++++ you are .
HUMAN ++++ omg you're so dumb.
BOT ++++ i ' m not .
HUMAN ++++ you're stupid!
BOT ++++ i ' m not .
HUMAN ++++ do you always talk like this?
BOT ++++ of course .
HUMAN ++++ how old are you?
BOT ++++ eighteen .
HUMAN ++++ you're big enough.
BOT ++++ i ' m not .
HUMAN ++++ are you an adult?
BOT ++++ i ' m going .
HUMAN ++++ you're going to be an adult?
BOT ++++ i ' m looking for you .
HUMAN ++++ i'm in my room.
BOT ++++ and your mother ?
HUMAN ++++ she's in vietnam.
BOT ++++ you ' re here .
HUMAN ++++ yes i'm here.
BOT ++++ and what are you talking about ?
HUMAN ++++ you asked me where i am.
BOT ++++ and you are your friend .
HUMAN ++++ i need to go now.
BOT ++++ i will not .
HUMAN ++++ that's fine. you can stay. i'm leaving.
BOT ++++ oh . . .
HUMAN ++++ bye
BOT ++++ i don ' t know what to say .
HUMAN ++++ goodbye
BOT ++++ i don ' t know what to say .
HUMAN ++++ bye bye
BOT ++++ okay ?
HUMAN ++++ see you later
BOT ++++ i don ' t know what to say .
=============================================
HUMAN ++++ hi
BOT ++++ hi . what ' s your name ?
HUMAN ++++ i'm maria
BOT ++++ please .
HUMAN ++++ i'm jackson
BOT ++++ you are .
HUMAN ++++ i'm jack
BOT ++++ let ' s talk about something else .
HUMAN ++++ i have no go
BOT ++++ you are ?
HUMAN ++++ i have to go
BOT ++++ and what do you mean ?
HUMAN ++++ i'm leaving.
BOT ++++ what ?
HUMAN ++++ yeah, bye
BOT ++++ bye . have a nice day !
=============================================
HUMAN ++++ that's very bad.
BOT ++++ i ' ve got to work .
HUMAN ++++ what do you do?
BOT ++++ i ' ll be there .
HUMAN ++++ where do you work?
BOT ++++ in the trunk .
HUMAN ++++ are you a machenics?
BOT ++++ i ' m not .
HUMAN ++++ what are you?
BOT ++++ no idea .
=============================================


================================================
FILE: 2017/assignments/exercises/e01.py
================================================
"""
Simple exercises to get used to TensorFlow API
You should thoroughly test your code
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

sess = tf.InteractiveSession()
###############################################################################
# 1a: Create two random 0-d tensors x and y of any distribution.
# Create a TensorFlow object that returns x + y if x > y, and x - y otherwise.
# Hint: look up tf.cond()
# I do the first problem for you
###############################################################################

x = tf.random_uniform([])  # Empty array as shape creates a scalar.
y = tf.random_uniform([])
out = tf.cond(tf.greater(x, y), lambda: tf.add(x, y), lambda: tf.subtract(x, y))
print(sess.run(out))

###############################################################################
# 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1).
# Return x + y if x < y, x - y if x > y, 0 otherwise.
# Hint: Look up tf.case().
###############################################################################

# YOUR CODE

###############################################################################
# 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]] 
# and y as a tensor of zeros with the same shape as x.
# Return a boolean tensor that is True wherever x equals y element-wise.
# Hint: Look up tf.equal().
###############################################################################

# YOUR CODE

###############################################################################
# 1d: Create the tensor x of value 
# [29.05088806,  27.61298943,  31.19073486,  29.35532951,
#  30.97266006,  26.67541885,  38.08450317,  20.74983215,
#  34.94445419,  34.45999146,  29.06485367,  36.01657104,
#  27.88236427,  20.56035233,  30.20379066,  29.51215172,
#  33.71149445,  28.59134293,  36.05556488,  28.66994858].
# Get the indices of elements in x whose values are greater than 30.
# Hint: Use tf.where().
# Then extract elements whose values are greater than 30.
# Hint: Use tf.gather().
###############################################################################

# YOUR CODE

###############################################################################
# 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1,
# 2, ..., 6
# Hint: Use tf.range() and tf.diag().
###############################################################################

# YOUR CODE

###############################################################################
# 1f: Create a random 2-d tensor of size 10 x 10 from any distribution.
# Calculate its determinant.
# Hint: Look at tf.matrix_determinant().
###############################################################################

# YOUR CODE

###############################################################################
# 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9].
# Return the unique elements in x
# Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple.
###############################################################################

# YOUR CODE

###############################################################################
# 1h: Create two tensors x and y of shape 300 from any normal distribution,
# as long as they are from the same distribution.
# Use tf.cond() to return:
# - The mean squared error of (x - y) if the average of all elements in (x - y)
#   is negative, or
# - The sum of absolute value of all elements in the tensor (x - y) otherwise.
# Hint: see the Huber loss function in the lecture slides 3.
###############################################################################

# YOUR CODE

================================================
FILE: 2017/assignments/exercises/e01_sol.py
================================================
"""
Solutions to the simple TensorFlow exercises in e01.py
"""
import tensorflow as tf

###############################################################################
# 1a: Create two random 0-d tensors x and y of any distribution.
# Create a TensorFlow object that returns x + y if x > y, and x - y otherwise.
# Hint: look up tf.cond()
# I do the first problem for you
###############################################################################

x = tf.random_uniform([])  # Empty array as shape creates a scalar.
y = tf.random_uniform([])
out = tf.cond(tf.greater(x, y), lambda: tf.add(x, y), lambda: tf.subtract(x, y))

###############################################################################
# 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1).
# Return x + y if x < y, x - y if x > y, 0 otherwise.
# Hint: Look up tf.case().
###############################################################################

x = tf.random_uniform([], -1, 1, dtype=tf.float32)
y = tf.random_uniform([], -1, 1, dtype=tf.float32)
out = tf.case({tf.less(x, y): lambda: tf.add(x, y),
               tf.greater(x, y): lambda: tf.subtract(x, y)},
              default=lambda: tf.constant(0.0), exclusive=True)
sess = tf.InteractiveSession()
print(sess.run(out))  # evaluate the result of the case, not just x

###############################################################################
# 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]] 
# and y as a tensor of zeros with the same shape as x.
# Return a boolean tensor that is True wherever x equals y element-wise.
# Hint: Look up tf.equal().
###############################################################################

x = tf.constant([[0, -2, -1], [0, 1, 2]])
y = tf.zeros_like(x)
out = tf.equal(x, y)

###############################################################################
# 1d: Create the tensor x of value 
# [29.05088806,  27.61298943,  31.19073486,  29.35532951,
#  30.97266006,  26.67541885,  38.08450317,  20.74983215,
#  34.94445419,  34.45999146,  29.06485367,  36.01657104,
#  27.88236427,  20.56035233,  30.20379066,  29.51215172,
#  33.71149445,  28.59134293,  36.05556488,  28.66994858].
# Get the indices of elements in x whose values are greater than 30.
# Hint: Use tf.where().
# Then extract elements whose values are greater than 30.
# Hint: Use tf.gather().
###############################################################################

x = tf.constant([29.05088806,  27.61298943,  31.19073486,  29.35532951,
                 30.97266006,  26.67541885,  38.08450317,  20.74983215,
                 34.94445419,  34.45999146,  29.06485367,  36.01657104,
                 27.88236427,  20.56035233,  30.20379066,  29.51215172,
                 33.71149445,  28.59134293,  36.05556488,  28.66994858])
# tf.where on a 1-d condition returns indices of shape [n, 1],
# so the gathered values also have shape [n, 1].
indices = tf.where(x > 30)
out = tf.gather(x, indices)

###############################################################################
# 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1,
# 2, ..., 6
# Hint: Use tf.range() and tf.diag().
###############################################################################

values = tf.range(1, 7)
out = tf.diag(values)

###############################################################################
# 1f: Create a random 2-d tensor of size 10 x 10 from any distribution.
# Calculate its determinant.
# Hint: Look at tf.matrix_determinant().
###############################################################################

m = tf.random_normal([10, 10], mean=10, stddev=1)
out = tf.matrix_determinant(m)

###############################################################################
# 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9].
# Return the unique elements in x
# Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple.
###############################################################################

x = tf.constant([5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9])
unique_values, indices = tf.unique(x)

###############################################################################
# 1h: Create two tensors x and y of shape 300 from any normal distribution,
# as long as they are from the same distribution.
# Use tf.cond() to return:
# - The mean squared error of (x - y) if the average of all elements in (x - y)
#   is negative, or
# - The sum of absolute value of all elements in the tensor (x - y) otherwise.
# Hint: see the Huber loss function in the lecture slides 3.
###############################################################################

x = tf.random_normal([300], mean=5, stddev=1)
y = tf.random_normal([300], mean=5, stddev=1)
average = tf.reduce_mean(x - y)
def f1(): return tf.reduce_mean(tf.square(x - y))
def f2(): return tf.reduce_sum(tf.abs(x - y))
out = tf.cond(average < 0, f1, f2)
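
One way to sanity-check the 1h solution is to compare it with a plain-Python reference on small inputs. The helper below is a hypothetical cross-check written for this purpose, not part of the exercise; it applies the same branch rule to ordinary lists.

```python
def loss_1h(x, y):
    """Reference for exercise 1h on plain lists of floats.

    Returns the mean squared error of (x - y) if the mean of (x - y) is
    negative, otherwise the sum of absolute values of (x - y).
    """
    diffs = [a - b for a, b in zip(x, y)]
    average = sum(diffs) / len(diffs)
    if average < 0:
        return sum(d * d for d in diffs) / len(diffs)
    return sum(abs(d) for d in diffs)


if __name__ == '__main__':
    print(loss_1h([1.0, 2.0], [2.0, 4.0]))  # mean of differences is negative
    print(loss_1h([3.0, 1.0], [1.0, 2.0]))  # mean of differences is positive
```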

================================================
FILE: 2017/assignments/style_transfer/readme.md
================================================
For detailed instructions, read the assignment handout on the course website: http://web.stanford.edu/class/cs20si/assignments/a2.pdf


================================================
FILE: 2017/assignments/style_transfer/style_transfer.py
================================================
""" An implementation of the paper "A Neural Algorithm of Artistic Style"
by Gatys et al. in TensorFlow.

Author: Chip Huyen (huyenn@stanford.edu)
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
For more details, please read the assignment handout:
http://web.stanford.edu/class/cs20si/assignments/a2.pdf
"""
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import tensorflow as tf

import vgg_model
import utils

# parameters to manage experiments
STYLE = 'guernica'
CONTENT = 'deadpool'
STYLE_IMAGE = 'styles/' + STYLE + '.jpg'
CONTENT_IMAGE = 'content/' + CONTENT + '.jpg'
IMAGE_HEIGHT = 250
IMAGE_WIDTH = 333
NOISE_RATIO = 0.6 # percentage of weight of the noise for intermixing with the content image

CONTENT_WEIGHT = 0.01
STYLE_WEIGHT = 1

# Layers used for style features. You can change this.
STYLE_LAYERS = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']
W = [0.5, 1.0, 1.5, 3.0, 4.0] # give more weight to deeper layers.

# Layer used for content features. You can change this.
CONTENT_LAYER = 'conv4_2'

ITERS = 300
LR = 2.0

MEAN_PIXELS = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))
""" MEAN_PIXELS follows the description in the VGG authors' gist:
https://gist.github.com/ksimonyan/211839e770f7b538e2d8
'In the paper, the model is denoted as the configuration D trained with scale jittering.
The input images should be zero-centered by mean pixel (rather than mean image) subtraction.
Namely, the following BGR values should be subtracted: [103.939, 116.779, 123.68].'
MEAN_PIXELS lists the same three values in RGB order.
"""

# VGG-19 parameters file
VGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat'
VGG_MODEL = 'imagenet-vgg-verydeep-19.mat'
EXPECTED_BYTES = 534904783

def _create_content_loss(p, f):
    """ Calculate the loss between the feature representation of the
    content image and the generated image.
    
    Inputs: 
        p, f are just P, F in the paper 
        (read the assignment handout if you're confused)
        Note: we won't use the coefficient 0.5 as defined in the paper
        but the coefficient as defined in the assignment handout.
    Output:
        the content loss

    """
    return tf.reduce_sum((f - p) ** 2) / (4.0 * p.size)
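
The content loss above is the sum of squared differences divided by four times the number of elements in p. A minimal sketch on flat Python lists (an illustration only; the function name and the flattened inputs are assumptions of this example) makes the scaling easy to check by hand:

```python
def content_loss(p, f):
    """Sum of squared differences between f and p, divided by 4 * len(p).

    p and f are flat lists of floats standing in for the flattened feature maps.
    """
    squared_error = sum((fi - pi) ** 2 for pi, fi in zip(p, f))
    return squared_error / (4.0 * len(p))


if __name__ == '__main__':
    # Only the last element differs (by 2), so the loss is 4 / (4 * 4).
    print(content_loss([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]))
```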

def _gram_matrix(F, N, M):
    """ Create and return the gram matrix for tensor F
        Hint: you'll first have to reshape F
    """
    F = tf.reshape(F, (M, N))
    return tf.matmul(tf.transpose(F), F)

def _single_style_loss(a, g):
    """ Calculate the style loss at a certain layer
    Inputs:
        a is the feature representation of the style image
        g is the feature representation of the generated image
    Output:
        the style loss at a certain layer (which is E_l in the paper)

    Hint: 1. you'll have to use the function _gram_matrix()
        2. we'll use the same coefficient for style loss as in the paper
        3. a and g are feature representations, not gram matrices
    """
    N = a.shape[3] # number of filters
    M = a.shape[1] * a.shape[2] # height times width of the feature map
    A = _gram_matrix(a, N, M)
    G = _gram_matrix(g, N, M)
    return tf.reduce_sum((G - A) ** 2 / ((2 * N * M) ** 2))
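
The gram matrix and the per-layer style loss can be worked through on tiny inputs. The sketch below uses nested lists in place of tensors and assumes the features are already reshaped to M rows (spatial positions) by N columns (filters); the function names are made up for this illustration.

```python
def gram_matrix(F):
    """Return the N x N matrix F^T F for an M x N list of rows."""
    n = len(F[0])
    return [[sum(row[i] * row[j] for row in F) for j in range(n)]
            for i in range(n)]


def single_style_loss(a, g):
    """E_l = sum((G - A)^2) / (2 * N * M)^2, with A and G the gram matrices."""
    M, N = len(a), len(a[0])
    A, G = gram_matrix(a), gram_matrix(g)
    squared_error = sum((G[i][j] - A[i][j]) ** 2
                        for i in range(N) for j in range(N))
    return squared_error / float((2 * N * M) ** 2)


if __name__ == '__main__':
    a = [[1.0, 0.0], [0.0, 1.0]]   # style features, M = 2 positions, N = 2 filters
    g = [[1.0, 1.0], [1.0, 1.0]]   # generated-image features
    print(gram_matrix(a))
    print(gram_matrix(g))
    print(single_style_loss(a, g))
```

The gram matrix discards spatial layout and keeps only filter co-activations, which is why it captures style rather than content.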

def _create_style_loss(A, model):
    """ Return the total style loss
    """
    n_layers = len(STYLE_LAYERS)
    E = [_single_style_loss(A[i], model[STYLE_LAYERS[i]]) for i in range(n_layers)]
    
    ###############################
    ## TO DO: return total style loss
    return sum([W[i] * E[i] for i in range(n_layers)])
    ###############################

def _create_losses(model, input_image, content_image, style_image):
    with tf.variable_scope('loss') as scope:
        with tf.Session() as sess:
            sess.run(input_image.assign(content_image)) # assign content image to the input variable
            p = sess.run(model[CONTENT_LAYER])
        content_loss = _create_content_loss(p, model[CONTENT_LAYER])

        with tf.Session() as sess:
            sess.run(input_image.assign(style_image))
            A = sess.run([model[layer_name] for layer_name in STYLE_LAYERS])                              
        style_loss = _create_style_loss(A, model)

        ##########################################
        ## TO DO: create total loss. 
        ## Hint: don't forget the content loss and style loss weights
        total_loss = CONTENT_WEIGHT * content_loss + STYLE_WEIGHT * style_loss
        ##########################################

    return content_loss, style_loss, total_loss

def _create_summary(model):
    """ Create the necessary summary ops
        Hint: don't forget to merge them
    """
    with tf.name_scope('summaries'):
        tf.summary.scalar('content loss', model['content_loss'])
        tf.summary.scalar('style loss', model['style_loss'])
        tf.summary.scalar('total loss', model['total_loss'])
        tf.summary.histogram('histogram content loss', model['content_loss'])
        tf.summary.histogram('histogram style loss', model['style_loss'])
        tf.summary.histogram('histogram total loss', model['total_loss'])
        return tf.summary.merge_all()

def train(model, generated_image, initial_image):
    """ Train your model.
    Don't forget to create folders for checkpoints and outputs.
    """
    skip_step = 1
    with tf.Session() as sess:
        saver = tf.train.Saver()
        ###############################
        ## TO DO: 
        ## 1. initialize your variables
        ## 2. create writer to write your graph
        sess.run(tf.global_variables_initializer())
        writer = tf.summary.FileWriter('graphs', sess.graph)
        ###############################
        sess.run(generated_image.assign(initial_image))
        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)
        initial_step = model['global_step'].eval()
        
        start_time = time.time()
        for index in range(initial_step, ITERS):
            if index >= 5 and index < 20:
                skip_step = 10
            elif index >= 20:
                skip_step = 20
            
            sess.run(model['optimizer'])
            if (index + 1) % skip_step == 0:
                ###############################
                ## TO DO: obtain generated image and loss
                gen_image, total_loss, summary = sess.run([generated_image, model['total_loss'], 
                                                             model['summary_op']])

                ###############################
                gen_image = gen_image + MEAN_PIXELS
                writer.add_summary(summary, global_step=index)
                print('Step {}\n   Sum: {:5.1f}'.format(index + 1, np.sum(gen_image)))
                print('   Loss: {:5.1f}'.format(total_loss))
                print('   Time: {}'.format(time.time() - start_time))
                start_time = time.time()

                filename = 'outputs/%d.png' % (index)
                utils.save_image(filename, gen_image)

                if (index + 1) % 20 == 0:
                    saver.save(sess, 'checkpoints/style_transfer', index)

def main():
    with tf.variable_scope('input') as scope:
        # use variable instead of placeholder because we're training the initial image to make it
        # look like both the content image and the style image
        input_image = tf.Variable(np.zeros([1, IMAGE_HEIGHT, IMAGE_WIDTH, 3]), dtype=tf.float32)
    
    utils.download(VGG_DOWNLOAD_LINK, VGG_MODEL, EXPECTED_BYTES)
    utils.make_dir('checkpoints')
    utils.make_dir('outputs')
    model = vgg_model.load_vgg(VGG_MODEL, input_image)
    model['global_step'] = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

    content_image = utils.get_resized_image(CONTENT_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH)
    content_image = content_image - MEAN_PIXELS
    style_image = utils.get_resized_image(STYLE_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH)
    style_image = style_image - MEAN_PIXELS

    model['content_loss'], model['style_loss'], model['total_loss'] = _create_losses(model, 
                                                    input_image, content_image, style_image)
    ###############################
    ## TO DO: create optimizer
    model['optimizer'] = tf.train.AdamOptimizer(LR).minimize(model['total_loss'], 
                                                            global_step=model['global_step'])
    ###############################
    model['summary_op'] = _create_summary(model)

    initial_image = utils.generate_noise_image(content_image, IMAGE_HEIGHT, IMAGE_WIDTH, NOISE_RATIO)
    train(model, input_image, initial_image)

if __name__ == '__main__':
    main()


================================================
FILE: 2017/assignments/style_transfer/utils.py
================================================
""" Utils needed for the implementation of the paper "A Neural Algorithm of Artistic Style"
by Gatys et al. in TensorFlow.

Author: Chip Huyen (huyenn@stanford.edu)
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
For more details, please read the assignment handout:
http://web.stanford.edu/class/cs20si/assignments/a2.pdf
"""
from __future__ import print_function

import os

from PIL import Image, ImageOps
import numpy as np
import scipy.misc
from six.moves import urllib

def download(download_link, file_name, expected_bytes):
    """ Download the pretrained VGG-19 model if it's not already downloaded """
    if os.path.exists(file_name):
        print("VGG-19 pre-trained model ready")
        return
    print("Downloading the VGG pre-trained model. This might take a while ...")
    file_name, _ = urllib.request.urlretrieve(download_link, file_name)
    file_stat = os.stat(file_name)
    if file_stat.st_size == expected_bytes:
        print('Successfully downloaded VGG-19 pre-trained model', file_name)
    else:
        raise Exception('File ' + file_name +
                        ' might be corrupted. You should try downloading it with a browser.')

def get_resized_image(img_path, height, width, save=True):
    image = Image.open(img_path)
    # PIL sizes are given as (width, height), the reverse of NumPy's (height, width),
    # so the two arguments are swapped here.
    # Note: Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter.
    image = ImageOps.fit(image, (width, height), Image.ANTIALIAS)
    if save:
        image_dirs = img_path.split('/')
        image_dirs[-1] = 'resized_' + image_dirs[-1]
        out_path = '/'.join(image_dirs)
        if not os.path.exists(out_path):
            image.save(out_path)
    image = np.asarray(image, np.float32)
    return np.expand_dims(image, 0)

def generate_noise_image(content_image, height, width, noise_ratio=0.6):
    noise_image = np.random.uniform(-20, 20, 
                                    (1, height, width, 3)).astype(np.float32)
    return noise_image * noise_ratio + content_image * (1 - noise_ratio)

def save_image(path, image):
    # The caller should already have added back the mean pixels subtracted during preprocessing
    image = image[0] # the image
    image = np.clip(image, 0, 255).astype('uint8')
    scipy.misc.imsave(path, image)

def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

================================================
FILE: 2017/assignments/style_transfer/vgg_model.py
================================================
""" Load VGGNet weights needed for the implementation of the paper 
"A Neural Algorithm of Artistic Style" by Gatys et al. in TensorFlow.

Author: Chip Huyen (huyenn@stanford.edu)
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
For more details, please read the assignment handout:
http://web.stanford.edu/class/cs20si/assignments/a2.pdf
"""

import numpy as np
import tensorflow as tf
import scipy.io

def _weights(vgg_layers, layer, expected_layer_name):
    """ Return the weights and biases already trained by VGG
    """
    W = vgg_layers[0][layer][0][0][2][0][0]
    b = vgg_layers[0][layer][0][0][2][0][1]
    layer_name = vgg_layers[0][layer][0][0][0][0]
    assert layer_name == expected_layer_name
    return W, b.reshape(b.size)

def _conv2d_relu(vgg_layers, prev_layer, layer, layer_name):
    """ Return the Conv2D layer with RELU using the weights, biases from the VGG
    model at 'layer'.
    Inputs:
        vgg_layers: holding all the layers of VGGNet
        prev_layer: the output tensor from the previous layer
        layer: the index to current layer in vgg_layers
        layer_name: the string that is the name of the current layer.
                    It's used to specify variable_scope.

    Output:
        relu applied on the convolution.

    Note that you first need to obtain W and b from vgg-layers using the function
    _weights() defined above.
    W and b returned from _weights() are numpy arrays, so you have
    to convert them to TF tensors using tf.constant.
    Note that you'll have to apply relu on the convolution.
    Hint for choosing strides size: 
        for small images, you probably don't want to skip any pixel
    """
    with tf.variable_scope(layer_name) as scope:
        W, b = _weights(vgg_layers, layer, layer_name)
        W = tf.constant(W, name='weights')
        b = tf.constant(b, name='bias')
        conv2d = tf.nn.conv2d(prev_layer, filter=W, strides=[1, 1, 1, 1], padding='SAME')
    return tf.nn.relu(conv2d + b)

def _avgpool(prev_layer):
    """ Return the average pooling layer. The paper suggests that average pooling
    actually works better than max pooling.
    Input:
        prev_layer: the output tensor from the previous layer

    Output:
        the output of the tf.nn.avg_pool() function.
    Hint for choosing strides and ksize: choose what you feel appropriate
    """
    return tf.nn.avg_pool(prev_layer, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], 
                          padding='SAME', name='avg_pool_')

def load_vgg(path, input_image):
    """ Load VGG into a TensorFlow model.
    Use a dictionary to hold the model instead of using a Python class
    """
    vgg = scipy.io.loadmat(path)
    vgg_layers = vgg['layers']

    graph = {} 
    graph['conv1_1']  = _conv2d_relu(vgg_layers, input_image, 0, 'conv1_1')
    graph['conv1_2']  = _conv2d_relu(vgg_layers, graph['conv1_1'], 2, 'conv1_2')
    graph['avgpool1'] = _avgpool(graph['conv1_2'])
    graph['conv2_1']  = _conv2d_relu(vgg_layers, graph['avgpool1'], 5, 'conv2_1')
    graph['conv2_2']  = _conv2d_relu(vgg_layers, graph['conv2_1'], 7, 'conv2_2')
    graph['avgpool2'] = _avgpool(graph['conv2_2'])
    graph['conv3_1']  = _conv2d_relu(vgg_layers, graph['avgpool2'], 10, 'conv3_1')
    graph['conv3_2']  = _conv2d_relu(vgg_layers, graph['conv3_1'], 12, 'conv3_2')
    graph['conv3_3']  = _conv2d_relu(vgg_layers, graph['conv3_2'], 14, 'conv3_3')
    graph['conv3_4']  = _conv2d_relu(vgg_layers, graph['conv3_3'], 16, 'conv3_4')
    graph['avgpool3'] = _avgpool(graph['conv3_4'])
    graph['conv4_1']  = _conv2d_relu(vgg_layers, graph['avgpool3'], 19, 'conv4_1')
    graph['conv4_2']  = _conv2d_relu(vgg_layers, graph['conv4_1'], 21, 'conv4_2')
    graph['conv4_3']  = _conv2d_relu(vgg_layers, graph['conv4_2'], 23, 'conv4_3')
    graph['conv4_4']  = _conv2d_relu(vgg_layers, graph['conv4_3'], 25, 'conv4_4')
    graph['avgpool4'] = _avgpool(graph['conv4_4'])
    graph['conv5_1']  = _conv2d_relu(vgg_layers, graph['avgpool4'], 28, 'conv5_1')
    graph['conv5_2']  = _conv2d_relu(vgg_layers, graph['conv5_1'], 30, 'conv5_2')
    graph['conv5_3']  = _conv2d_relu(vgg_layers, graph['conv5_2'], 32, 'conv5_3')
    graph['conv5_4']  = _conv2d_relu(vgg_layers, graph['conv5_3'], 34, 'conv5_4')
    graph['avgpool5'] = _avgpool(graph['conv5_4'])
    
    return graph

================================================
FILE: 2017/assignments/style_transfer_starter/readme.md
================================================
For detailed instructions, read the assignment handout on the course website: http://web.stanford.edu/class/cs20si/assignments/a2.pdf
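
As a rough orientation before filling in `_gram_matrix()` and `_single_style_loss()` in the starter, the sketch below works through the same arithmetic in plain NumPy instead of TensorFlow. It assumes feature maps shaped `(1, height, width, n_filters)` and the `1 / (2 * N * M) ** 2` coefficient used in the solution file. The function names here are illustrative and are not part of the starter code.

```python
import numpy as np

def gram_matrix(features, n_filters, n_positions):
    """Gram matrix of a feature map: correlations between filter responses.

    Assumes `features` has shape (1, height, width, n_filters) and that
    n_positions == height * width.
    """
    flat = features.reshape(n_positions, n_filters)
    return np.dot(flat.T, flat)  # shape (n_filters, n_filters)

def single_style_loss(a, g):
    """Style loss at one layer between style features `a` and generated features `g`."""
    n_filters = a.shape[3]
    n_positions = a.shape[1] * a.shape[2]
    gram_a = gram_matrix(a, n_filters, n_positions)
    gram_g = gram_matrix(g, n_filters, n_positions)
    return np.sum((gram_g - gram_a) ** 2) / ((2 * n_filters * n_positions) ** 2)

def total_style_loss(style_feats, generated_feats, layer_weights):
    """Weighted sum of the per-layer style losses."""
    return sum(w * single_style_loss(a, g)
               for w, a, g in zip(layer_weights, style_feats, generated_feats))

if __name__ == '__main__':
    rng = np.random.RandomState(0)
    style = rng.rand(1, 4, 5, 3)      # hypothetical layer output for the style image
    generated = rng.rand(1, 4, 5, 3)  # hypothetical layer output for the generated image
    print('style loss:', single_style_loss(style, generated))
    print('identical inputs:', single_style_loss(style, style))
```

In the assignment itself the generated-image features are TensorFlow tensors, so the reshape, transpose and matrix product would be written with the corresponding `tf` ops; only the arithmetic carries over from this sketch.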


================================================
FILE: 2017/assignments/style_transfer_starter/style_transfer.py
================================================
""" An implementation of the paper "A Neural Algorithm of Artistic Style"
by Gatys et al. in TensorFlow.

Author: Chip Huyen (huyenn@stanford.edu)
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
For more details, please read the assignment handout:
http://web.stanford.edu/class/cs20si/assignments/a2.pdf
"""
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import tensorflow as tf

import vgg_model
import utils

# parameters to manage experiments
STYLE = 'guernica'
CONTENT = 'deadpool'
STYLE_IMAGE = 'styles/' + STYLE + '.jpg'
CONTENT_IMAGE = 'content/' + CONTENT + '.jpg'
IMAGE_HEIGHT = 250
IMAGE_WIDTH = 333
NOISE_RATIO = 0.6 # percentage of weight of the noise for intermixing with the content image

# Layers used for style features. You can change this.
STYLE_LAYERS = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']
W = [0.5, 1.0, 1.5, 3.0, 4.0] # give more weights to deeper layers.

# Layer used for content features. You can change this.
CONTENT_LAYER = 'conv4_2'

ITERS = 300
LR = 2.0

SAVE_EVERY = 20

MEAN_PIXELS = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))
""" MEAN_PIXELS is defined according to description on their github:
https://gist.github.com/ksimonyan/211839e770f7b538e2d8
'In the paper, the model is denoted as the configuration D trained with scale jittering. 
The input images should be zero-centered by mean pixel (rather than mean image) subtraction. 
Namely, the following BGR values should be subtracted: [103.939, 116.779, 123.68].'
"""

# VGG-19 parameters file
VGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat'
VGG_MODEL = 'imagenet-vgg-verydeep-19.mat'
EXPECTED_BYTES = 534904783

def _create_content_loss(p, f):
    """ Calculate the loss between the feature representation of the
    content image and the generated image.
    
    Inputs: 
        p, f are just P, F in the paper 
        (read the assignment handout if you're confused)
        Note: we won't use the coefficient 0.5 as defined in the paper
        but the coefficient as defined in the assignment handout.
    Output:
        the content loss

    """
    pass

def _gram_matrix(F, N, M):
    """ Create and return the gram matrix for tensor F
        Hint: you'll first have to reshape F
    """
    pass

def _single_style_loss(a, g):
    """ Calculate the style loss at a certain layer
    Inputs:
        a is the feature representation of the real image
        g is the feature representation of the generated image
    Output:
        the style loss at a certain layer (which is E_l in the paper)

    Hint: 1. you'll have to use the function _gram_matrix()
        2. we'll use the same coefficient for style loss as in the paper
        3. a and g are feature representation, not gram matrices
    """
    pass

def _create_style_loss(A, model):
    """ Return the total style loss
    """
    n_layers = len(STYLE_LAYERS)
    E = [_single_style_loss(A[i], model[STYLE_LAYERS[i]]) for i in range(n_layers)]
    
    ###############################
    ## TO DO: return total style loss
    pass
    ###############################

def _create_losses(model, input_image, content_image, style_image):
    with tf.variable_scope('loss') as scope:
        with tf.Session() as sess:
            sess.run(input_image.assign(content_image)) # assign content image to the input variable
            p = sess.run(model[CONTENT_LAYER])
        content_loss = _create_content_loss(p, model[CONTENT_LAYER])

        with tf.Session() as sess:
            sess.run(input_image.assign(style_image))
            A = sess.run([model[layer_name] for layer_name in STYLE_LAYERS])                              
        style_loss = _create_style_loss(A, model)

        ##########################################
        ## TO DO: create total loss. 
        ## Hint: don't forget the content loss and style loss weights
        
        ##########################################

    return content_loss, style_loss, total_loss

def _create_summary(model):
    """ Create summary ops necessary
        Hint: don't forget to merge them
    """
    pass

def train(model, generated_image, initial_image):
    """ Train your model.
    Don't forget to create folders for checkpoints and outputs.
    """
    skip_step = 1
    with tf.Session() as sess:
        saver = tf.train.Saver()
        ###############################
        ## TO DO: 
        ## 1. initialize your variables
        ## 2. create writer to write your graph
        ###############################
        sess.run(generated_image.assign(initial_image))
        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)
        initial_step = model['global_step'].eval()
        
        start_time = time.time()
        for index in range(initial_step, ITERS):
            if index >= 5 and index < 20:
                skip_step = 10
            elif index >= 20:
                skip_step = 20
            
            sess.run(model['optimizer'])
            if (index + 1) % skip_step == 0:
                ###############################
                ## TO DO: obtain generated image and loss

                ###############################
                gen_image = gen_image + MEAN_PIXELS
                writer.add_summary(summary, global_step=index)
                print('Step {}\n   Sum: {:5.1f}'.format(index + 1, np.sum(gen_image)))
                print('   Loss: {:5.1f}'.format(total_loss))
                print('   Time: {}'.format(time.time() - start_time))
                start_time = time.time()

                filename = 'outputs/%d.png' % (index)
                utils.save_image(filename, gen_image)

                if (index + 1) % SAVE_EVERY == 0:
                    saver.save(sess, 'checkpoints/style_transfer', index)

def main():
    with tf.variable_scope('input') as scope:
        # use variable instead of placeholder because we're training the initial image to make it
        # look like both the content image and the style image
        input_image = tf.Variable(np.zeros([1, IMAGE_HEIGHT, IMAGE_WIDTH, 3]), dtype=tf.float32)
    
    utils.download(VGG_DOWNLOAD_LINK, VGG_MODEL, EXPECTED_BYTES)
    utils.make_dir('checkpoints')
    utils.make_dir('outputs')
    model = vgg_model.load_vgg(VGG_MODEL, input_image)
    model['global_step'] = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
    
    content_image = utils.get_resized_image(CONTENT_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH)
    content_image = content_image - MEAN_PIXELS
    style_image = utils.get_resized_image(STYLE_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH)
    style_image = style_image - MEAN_PIXELS

    model['content_loss'], model['style_loss'], model['total_loss'] = _create_losses(model, 
                                                    input_image, content_image, style_image)
    ###############################
    ## TO DO: create optimizer
    ## model['optimizer'] = ...
    ###############################
    model['summary_op'] = _create_summary(model)

    initial_image = utils.generate_noise_image(content_image, IMAGE_HEIGHT, IMAGE_WIDTH, NOISE_RATIO)
    train(model, input_image, initial_image)

if __name__ == '__main__':
    main()


================================================
FILE: 2017/assignments/style_transfer_starter/utils.py
================================================
""" Utils needed for the implementation of the paper "A Neural Algorithm of Artistic Style"
by Gatys et al. in TensorFlow.

Author: Chip Huyen (huyenn@stanford.edu)
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
For more details, please read the assignment handout:
http://web.stanford.edu/class/cs20si/assignments/a2.pdf
"""
from __future__ import print_function

import os

from PIL import Image, ImageOps
import numpy as np
import scipy.misc
from six.moves import urllib

def download(download_link, file_name, expected_bytes):
    """ Download the pretrained VGG-19 model if it's not already downloaded """
    if os.path.exists(file_name):
        print("VGG-19 pre-trained model ready")
        return
    print("Downloading the VGG pre-trained model. This might take a while ...")
    file_name, _ = urllib.request.urlretrieve(download_link, file_name)
    file_stat = os.stat(file_name)
    if file_stat.st_size == expected_bytes:
        print('Successfully downloaded VGG-19 pre-trained model', file_name)
    else:
        raise Exception('File ' + file_name +
                        ' might be corrupted. You should try downloading it with a browser.')

def get_resized_image(img_path, height, width, save=True):
    image = Image.open(img_path)
    # PIL sizes are given as (width, height), the reverse of NumPy's (height, width),
    # so the two arguments are swapped here.
    # Note: Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter.
    image = ImageOps.fit(image, (width, height), Image.ANTIALIAS)
    if save:
        image_dirs = img_path.split('/')
        image_dirs[-1] = 'resized_' + image_dirs[-1]
        out_path = '/'.join(image_dirs)
        if not os.path.exists(out_path):
            image.save(out_path)
    image = np.asarray(image, np.float32)
    return np.expand_dims(image, 0)

def generate_noise_image(content_image, height, width, noise_ratio=0.6):
    noise_image = np.random.uniform(-20, 20, 
                                    (1, height, width, 3)).astype(np.float32)
    return noise_image * noise_ratio + content_image * (1 - noise_ratio)

def save_image(path, image):
    # The caller should already have added back the mean pixels subtracted during preprocessing
    image = image[0] # the image
    image = np.clip(image, 0, 255).astype('uint8')
    scipy.misc.imsave(path, image)

def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

================================================
FILE: 2017/assignments/style_transfer_starter/vgg_model.py
================================================
""" Load VGGNet weights needed for the implementation of the paper 
"A Neural Algorithm of Artistic Style" by Gatys et al. in TensorFlow.

Author: Chip Huyen (huyenn@stanford.edu)
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
For more details, please read the assignment handout:
http://web.stanford.edu/class/cs20si/assignments/a2.pdf
"""

import numpy as np
import tensorflow as tf
import scipy.io

def _weights(vgg_layers, layer, expected_layer_name):
    """ Return the weights and biases already trained by VGG
    """
    W = vgg_layers[0][layer][0][0][2][0][0]
    b = vgg_layers[0][layer][0][0][2][0][1]
    layer_name = vgg_layers[0][layer][0][0][0][0]
    assert layer_name == expected_layer_name
    return W, b.reshape(b.size)

def _conv2d_relu(vgg_layers, prev_layer, layer, layer_name):
    """ Return the Conv2D layer with RELU using the weights, biases from the VGG
    model at 'layer'.
    Inputs:
        vgg_layers: holding all the layers of VGGNet
        prev_layer: the output tensor from the previous layer
        layer: the index to current layer in vgg_layers
        layer_name: the string that is the name of the current layer.
                    It's used to specify variable_scope.

    Output:
        relu applied on the convolution.

    Note that you first need to obtain W and b from vgg-layers using the function
    _weights() defined above.
    W and b returned from _weights() are numpy arrays, so you have
    to convert them to TF tensors using tf.constant.
    Note that you'll have to apply relu on the convolution.
    Hint for choosing strides size: 
        for small images, you probably don't want to skip any pixel
    """
    pass

def _avgpool(prev_layer):
    """ Return the average pooling layer. The paper suggests that average pooling
    actually works better than max pooling.
    Input:
        prev_layer: the output tensor from the previous layer

    Output:
        the output of the tf.nn.avg_pool() function.
    Hint for choosing strides and ksize: choose what you feel appropriate
    """
    pass

def load_vgg(path, input_image):
    """ Load VGG into a TensorFlow model.
    Use a dictionary to hold the model instead of using a Python class
    """
    vgg = scipy.io.loadmat(path)
    vgg_layers = vgg['layers']

    graph = {} 
    graph['conv1_1']  = _conv2d_relu(vgg_layers, input_image, 0, 'conv1_1')
    graph['conv1_2']  = _conv2d_relu(vgg_layers, graph['conv1_1'], 2, 'conv1_2')
    graph['avgpool1'] = _avgpool(graph['conv1_2'])
    graph['conv2_1']  = _conv2d_relu(vgg_layers, graph['avgpool1'], 5, 'conv2_1')
    graph['conv2_2']  = _conv2d_relu(vgg_layers, graph['conv2_1'], 7, 'conv2_2')
    graph['avgpool2'] = _avgpool(graph['conv2_2'])
    graph['conv3_1']  = _conv2d_relu(vgg_layers, graph['avgpool2'], 10, 'conv3_1')
    graph['conv3_2']  = _conv2d_relu(vgg_layers, graph['conv3_1'], 12, 'conv3_2')
    graph['conv3_3']  = _conv2d_relu(vgg_layers, graph['conv3_2'], 14, 'conv3_3')
    graph['conv3_4']  = _conv2d_relu(vgg_layers, graph['conv3_3'], 16, 'conv3_4')
    graph['avgpool3'] = _avgpool(graph['conv3_4'])
    graph['conv4_1']  = _conv2d_relu(vgg_layers, graph['avgpool3'], 19, 'conv4_1')
    graph['conv4_2']  = _conv2d_relu(vgg_layers, graph['conv4_1'], 21, 'conv4_2')
    graph['conv4_3']  = _conv2d_relu(vgg_layers, graph['conv4_2'], 23, 'conv4_3')
    graph['conv4_4']  = _conv2d_relu(vgg_layers, graph['conv4_3'], 25, 'conv4_4')
    graph['avgpool4'] = _avgpool(graph['conv4_4'])
    graph['conv5_1']  = _conv2d_relu(vgg_layers, graph['avgpool4'], 28, 'conv5_1')
    graph['conv5_2']  = _conv2d_relu(vgg_layers, graph['conv5_1'], 30, 'conv5_2')
    graph['conv5_3']  = _conv2d_relu(vgg_layers, graph['conv5_2'], 32, 'conv5_3')
    graph['conv5_4']  = _conv2d_relu(vgg_layers, graph['conv5_3'], 34, 'conv5_4')
    graph['avgpool5'] = _avgpool(graph['conv5_4'])
    
    return graph

================================================
FILE: 2017/data/arvix_abstracts.txt
================================================
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in an deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
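As an illustration of the stochastic rounding idea mentioned in the abstract above, here is a minimal sketch in plain Python. The function name `stochastic_round` and the `frac_bits` parameter are assumptions made for this example, not names from the paper, and saturation to a fixed 16-bit word length is deliberately left out. A value is rounded up with probability equal to its fractional remainder on the fixed-point grid, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Hypothetical helper: rounds up with probability equal to the
    fractional remainder, so the expected result equals x.
    Word-length saturation (clipping) is not modelled here.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # The remainder lies in [0, 1); use it as the round-up probability.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; their average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, whereas the stochastic scheme preserves the value on average.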
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
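The relationships described in the abstract above can be sketched with two small helpers. This is a minimal illustration, assuming square inputs and kernels and the standard one-dimensional formulas: a convolution with input size i, kernel size k, stride s and zero padding p produces floor((i + 2p - k) / s) + 1 outputs, and the matching transposed convolution produces s(i - 1) + k - 2p outputs when the stride divides i + 2p - k evenly. The function names are hypothetical.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Assumes i + 2 * p >= k.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of the transposed convolution along one axis.

    Assumes the associated forward convolution had a stride that
    divides (output + 2 * p - k) evenly, so no extra padding is needed.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    out = conv_output_size(5, k=3, s=2, p=1)
    print(out)                                             # 3
    print(transposed_conv_output_size(out, k=3, s=2, p=1)) # 5
```

In this example the transposed convolution recovers the original input size, which is the shape relationship between the two layer types that the guide illustrates.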
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
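The stochastic pooling procedure in the abstract above can be sketched as follows. This is a minimal illustration for a single pooling region, assuming non-negative activations (for example after a ReLU) and assuming that the sampling probabilities are the activations normalized by their sum; the function name `stochastic_pool` and the handling of an all-zero region are choices made for this example, not details taken from the paper.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    region: flat list of non-negative activations in the region.
    Each activation is picked with probability proportional to its value.
    An all-zero region returns 0.0 (an assumption for this sketch).
    """
    total = sum(region)
    if total == 0:
        return 0.0
    # random.choices draws with the given (unnormalized) weights.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [stochastic_pool([1.0, 3.0], rng=rng) for _ in range(10000)]
    # The larger activation should be selected roughly 75% of the time.
    print(draws.count(3.0) / len(draws))
```

Unlike max pooling, smaller activations in the region still have a chance of being passed on, which is the source of the regularizing noise.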
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
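The variance-reduced update that SVRG methods rely on can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `svrg`, its parameters, and the toy data are all assumptions, and the toy objective is a convex least-squares problem chosen only to show the update rule (it does not illustrate the nonconvex analysis of the abstract above).

```python
import random

def svrg(grad_i, n, w0, step=0.5, epochs=30, inner_steps=None, seed=0):
    """Minimize (1/n) * sum_i f_i(w) for scalar w, given per-example gradients."""
    rng = random.Random(seed)
    m = inner_steps or n          # inner-loop length (assumed default: one pass)
    w = w0
    for _ in range(epochs):
        snapshot = w
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(m):
            i = rng.randrange(n)
            # Stochastic gradient corrected by the snapshot gradient:
            # unbiased, with variance shrinking as w approaches the snapshot.
            g = grad_i(w, i) - grad_i(snapshot, i) + full_grad
            w -= step * g
    return w

# Toy problem (hypothetical data): f_i(w) = 0.5 * (a_i * w - b_i)**2, minimized at w = 2.
a = [1.0, 1.2, 0.8, 1.0]
b = [2.0, 2.4, 1.6, 2.0]
grad = lambda w, i: a[i] * (a[i] * w - b[i])
w_hat = svrg(grad, len(a), w0=0.0)
```

Plain SGD would use `grad_i(w, i)` alone as its estimate. The two extra terms are what distinguish the variance-reduced step.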
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
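Stochastic rounding to a fixed-point grid, as mentioned in the abstract above, can be sketched in a few lines. This is an illustrative sketch only: the function name `stochastic_round` and the `frac_bits` parameter are assumptions, and the sketch omits the saturation to a fixed word length (such as 16 bits) that a real fixed-point format would need.

```python
import math
import random

def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, choosing the neighbor at random.

    The upper neighbor is chosen with probability equal to the fractional
    remainder, so the rounded value equals x in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale

# Values already on the grid are returned unchanged.
on_grid = stochastic_round(0.5, frac_bits=8)

# Averaging many roundings of an off-grid value recovers it approximately.
rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would map every 0.3 to 0.25 on this grid and lose the remainder each time. The stochastic variant preserves it on average, which is one plausible reading of why the rounding scheme matters for small gradient updates.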
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
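The shape relationships that the abstract above refers to can be illustrated with the standard output-size formulas for one spatial dimension. This is a small sketch, not taken from the guide itself: the function names are hypothetical, and the transposed-convolution formula assumes the simple case in which the forward convolution's stride divides `i + 2p - k` exactly.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution: input i, kernel k, stride s, zero padding p."""
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution with the same k, s, p.

    Assumes the matching forward convolution has no leftover border,
    i.e. (output + 2p - k) is a multiple of s.
    """
    return s * (i - 1) + k - 2 * p

# A 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives 3 outputs,
# and the transposed layer maps those 3 values back to a 5-wide output.
down = conv_output_size(5, k=3, s=2, p=1)
up = transposed_conv_output_size(down, k=3, s=2, p=1)
```

The same functions apply independently to each spatial axis of an image, so a 28x28 input with a 5x5 kernel and no padding yields a 24x24 feature map.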
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. Standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
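The sampling step described in the abstract above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names stochastic_pool and stochastic_pool_2d are hypothetical, activations are assumed to be non-negative (for example, after a rectifier), regions are assumed to be non-overlapping, and an all-zero region is assumed to pool to zero.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. from a multinomial distribution given by the activities in the region.
    Assumption: activations are non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption: an all-zero region pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_2d(fmap, size=2, rng=random):
    """Apply stochastic pooling over non-overlapping size x size regions
    of a 2D feature map given as a list of equal-length rows."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            region = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(stochastic_pool(region, rng))
        pooled.append(pooled_row)
    return pooled
```

Passing a seeded random.Random instance as rng makes the sampling reproducible. The sketch covers only the training-time sampling mentioned in the abstract.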
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, and pattern extraction. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (the teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
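The soft-target objective mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names and the `temperature` parameter are assumptions borrowed from the general knowledge-distillation setting, and the loss shown is a plain cross-entropy between the teacher's and the student's output distributions.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax; `temperature` > 1 flattens the distribution
    # (an assumed knob, not something the abstract specifies).
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    # Cross-entropy of the student's distribution against the teacher's
    # soft targets (instead of one-hot labels).
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0]
    print(soft_target_loss([2.0, 1.0, 0.0], teacher))  # student agrees with teacher
    print(soft_target_loss([0.0, 1.0, 2.0], teacher))  # student disagrees: larger loss
```

The loss is smallest when the student reproduces the teacher's distribution, which is the sense in which soft targets give the student an "easier" objective than hard labels.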
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
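One simple way to combine max and average pooling, as the abstract above describes in general terms, is a convex mixture of the two. The sketch below is only an illustration under that assumption: the function name `mixed_pool` and the mixing proportion `a` are hypothetical, and `a` is passed in as a fixed number here, whereas the abstract speaks of learning the pooling function.

```python
def mixed_pool(region, a):
    """Blend max pooling and average pooling over one pooling region.

    `a` is an assumed mixing proportion in [0, 1]:
    a = 1 gives pure max pooling, a = 0 gives pure average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return a * max_val + (1.0 - a) * avg_val

if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(region, a))
```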
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
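The stochastic pooling procedure in the abstract above can be sketched for a single pooling region as follows. This is a minimal illustration, not the authors' code: it assumes non-negative activations (for example, after a rectifier), and the function name `stochastic_pool` and the handling of an all-zero region are choices made here for the example.

```python
import random

def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region, with probability
    proportional to its value (assumes non-negative activations)."""
    total = sum(region)
    if total == 0:
        return 0.0  # all-zero region: nothing to sample (assumed convention)
    probs = [a / total for a in region]
    # random.choices draws from the multinomial distribution defined by `probs`
    return rng.choices(region, weights=probs, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    region = [0.0, 1.0, 3.0]
    samples = [stochastic_pool(region, rng) for _ in range(1000)]
    # The larger activation should be picked roughly three times as often.
    print(samples.count(3.0), samples.count(1.0), samples.count(0.0))
```

Because the sampling probabilities come directly from the activations, the procedure introduces no extra hyper-parameter, consistent with the abstract's description.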
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the multilayer bootstrap network into a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
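The straight-path analysis described above can be illustrated with a minimal sketch: evaluate the loss at points theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final for alpha in [0, 1]. The function names and the quadratic toy loss below are hypothetical stand-ins chosen for illustration, not the authors' code or experimental setup.

```python
def interpolate(theta_init, theta_final, alpha):
    """Return the point a fraction alpha of the way from theta_init to theta_final."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the straight line between two parameter vectors."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    """Hypothetical convex toy loss with its minimum at theta = (1, 1)."""
    return sum((t - 1.0) ** 2 for t in theta)


# Stand-ins for the initial parameters and the parameters found by training.
curve = loss_along_path(toy_loss, [3.0, -2.0], [1.0, 1.0])
```

Plotting the second element of each pair in `curve` against alpha gives the one-dimensional cross-section of the loss; a bump along that curve would indicate an obstacle on the straight path.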
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
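The rank-based strategy described above can be sketched as follows: rank datapoints by their latest known loss (largest first) and give each a selection probability that decays exponentially with rank. The exact decay formula and the `selection_pressure` parameter here are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def rank_selection_probs(losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the largest latest-known loss.
    selection_pressure is a hypothetical knob: larger values favour
    high-loss datapoints more strongly.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-math.log(selection_pressure) * rank / n)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, rng, selection_pressure=100.0):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = rank_selection_probs(losses, selection_pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


probs = rank_selection_probs([0.1, 2.0, 0.5])
batch = sample_batch([0.1, 2.0, 0.5], batch_size=4, rng=random.Random(0))
```

In a training loop, the stored loss of each selected datapoint would be refreshed after the forward pass, so the ranking tracks the current state of the model.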
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
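Stochastic rounding, as referred to above, rounds a value to one of its two neighbouring fixed-point grid values with probability proportional to its proximity, so the rounded value is unbiased in expectation. The following is a minimal sketch under simplifying assumptions: the function name is hypothetical, and saturation to a finite word length (which a real 16-bit format would need) is omitted.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x onto a fixed-point grid with frac_bits fractional bits.

    Rounds up with probability equal to the fractional distance from the
    lower grid point, so the expected result equals x. Range clipping is
    deliberately left out of this sketch.
    """
    step = 2.0 ** -frac_bits          # smallest representable increment
    scaled = x / step
    lower = math.floor(scaled)
    prob_up = scaled - lower          # in [0, 1)
    if rng.random() < prob_up:
        lower += 1
    return lower * step


rng = random.Random(0)
on_grid = stochastic_round(0.5, 2, rng)                        # already representable
samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

With 2 fractional bits the grid step is 0.25, so 0.3 is rounded to either 0.25 or 0.5, and the sample mean stays close to 0.3. Round-to-nearest would always return 0.25 and would systematically lose small updates.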
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
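The Gramian Angular Field encoding described above can be sketched in a few lines. This is an illustration only, not the authors' implementation: it assumes the commonly cited definitions (rescale the series to [-1, 1], map each value to an angle with arccos, then take cos of pairwise angle sums for GASF and sin of pairwise angle differences for GADF). The function names `rescale`, `gasf`, and `gadf` are hypothetical.

```python
import math


def rescale(series, low=-1.0, high=1.0):
    """Min-max rescale a series into [low, high]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must not be constant")
    return [low + (high - low) * (x - lo) / (hi - lo) for x in series]


def _angles(series):
    # Clamp before acos to guard against tiny floating-point overshoot.
    return [math.acos(max(-1.0, min(1.0, v))) for v in rescale(series)]


def gasf(series):
    """Assumed GASF definition: entry (i, j) is cos(phi_i + phi_j)."""
    phi = _angles(series)
    return [[math.cos(a + b) for b in phi] for a in phi]


def gadf(series):
    """Assumed GADF definition: entry (i, j) is sin(phi_i - phi_j)."""
    phi = _angles(series)
    return [[math.sin(a - b) for b in phi] for a in phi]
```

Under these assumptions the GASF diagonal equals 2x^2 - 1 for each rescaled value x, so the rescaled series can be read back from the image up to sign.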
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct a model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
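The gradient reversal layer mentioned above can be pictured as a layer that is the identity in the forward pass and negates gradients in the backward pass. The following is a minimal framework-free sketch, not the authors' code. The class name `GradientReversal` and the scaling factor `lam` are assumptions introduced for illustration.

```python
class GradientReversal:
    """Identity on the forward pass; negated, scaled gradient on the backward pass."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical knob controlling how strongly the
        # domain-classifier gradient is pushed back into the features.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, upstream_grad):
        # The feature extractor receives the opposite of the gradient that
        # would help the domain classifier, encouraging domain-invariant features.
        return [-self.lam * g for g in upstream_grad]
```

Placed between a feature extractor and a domain classifier, such a layer lets ordinary backpropagation train the classifier to separate domains while pushing the features toward being indistinguishable across them.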
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further compress the multilayer bootstrap network with a neural network trained in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
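The straight-path analysis described above amounts to evaluating the loss at points interpolated linearly between the initial and final parameters. The sketch below illustrates this on a toy problem. The helper names (`interpolate`, `loss_along_path`, `toy_loss`) and the quadratic toy loss are assumptions made for illustration, not part of the paper.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, steps=11):
    """Evaluate loss_fn at evenly spaced points from initialization to solution."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    alphas = [i / (steps - 1) for i in range(steps)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta, target=(1.0, -2.0)):
    """Hypothetical stand-in for a network's training loss."""
    return sum((t - g) ** 2 for t, g in zip(theta, target))
```

Plotting the returned loss values against alpha gives the one-dimensional cross-section of the objective. A bump in that curve would indicate an obstacle on the straight path.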
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
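The rank-based selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The `pressure` parameter (roughly the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) and sampling with replacement are choices made here for simplicity.

```python
import math
import random


def selection_probabilities(latest_losses, pressure=100.0):
    """Probability per datapoint, decaying exponentially with loss rank."""
    n = len(latest_losses)
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, pressure=100.0, rng=None):
    """Draw datapoint indices for one batch (with replacement, for simplicity)."""
    rng = rng or random.Random()
    probs = selection_probabilities(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)
```

Setting `pressure` to 1 recovers uniform sampling, so this single knob moves between ordinary batches and batches dominated by the hardest examples.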
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
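For readers unfamiliar with SVRG, the basic update can be sketched as below. This is a generic textbook-style sketch, not the variant analyzed in the paper. The function name `svrg`, its parameters, the choice of the last inner iterate as the next snapshot, and the small convex toy problem used for checking are all assumptions made for illustration.

```python
import random


def svrg(grad_i, n, theta0, step=0.1, epochs=20, inner_steps=None, seed=0):
    """Minimize (1/n) * sum_i f_i(theta), given per-example gradients grad_i(theta, i)."""
    rng = random.Random(seed)
    m = inner_steps or n
    snapshot = list(theta0)
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full = [0.0] * len(snapshot)
        for i in range(n):
            g = grad_i(snapshot, i)
            full = [f + gj / n for f, gj in zip(full, g)]
        theta = list(snapshot)
        for _ in range(m):
            i = rng.randrange(n)
            g_new = grad_i(theta, i)
            g_old = grad_i(snapshot, i)
            # Variance-reduced direction: stochastic gradient, corrected by the
            # same example's gradient at the snapshot plus the full gradient.
            theta = [t - step * (a - b + f)
                     for t, a, b, f in zip(theta, g_new, g_old, full)]
        snapshot = theta  # assumption: the last iterate becomes the next snapshot
    return snapshot


# Toy finite sum for checking only (convex): f_i(theta) = 0.5 * (theta - a_i)^2.
TOY_DATA = [1.0, 2.0, 3.0, 4.0]


def toy_grad(theta, i):
    return [theta[0] - TOY_DATA[i]]
```

On the toy problem, `svrg(toy_grad, len(TOY_DATA), [0.0])` approaches the mean of `TOY_DATA`. The nonconvex setting studied in the abstract uses the same update form but targets stationary points instead.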
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
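The stochastic rounding idea in the abstract above can be illustrated with a minimal sketch. The function name `stochastic_round_fixed`, the split into 8 fractional bits, and the saturation behaviour are assumptions made for illustration; the abstract only states that 16-bit fixed-point numbers with stochastic rounding are used.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to a signed fixed-point grid using stochastic rounding.

    frac_bits and word_bits are illustrative choices, not values from the paper.
    """
    scale = 1 << frac_bits              # grid step is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder, so the
    # result equals x in expectation (away from the saturation limits).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable range of a signed word_bits integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    clipped = max(min_int, min(max_int, lower))
    return clipped / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed(0.3, rng=rng) for _ in range(20000)]
    print(sorted(set(samples)))          # the two grid neighbours of 0.3
    print(sum(samples) / len(samples))   # close to 0.3 on average
```

Unlike round-to-nearest, which would always map 0.3 to the same grid point, this scheme preserves small values on average, which is one plausible reading of why the rounding scheme matters during training.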
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
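The relationships between input shape, kernel shape, zero padding, strides and output shape mentioned above can be sketched for one spatial dimension. The helper names below are hypothetical; the formulas are the standard ones for a convolution with input size `i`, kernel size `k`, stride `s` and zero padding `p`, and for the transposed convolution that reverses its shape.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution (or pooling) along one dimension:
    floor((i + 2p - k) / s) + 1."""
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of the transposed convolution associated with a
    convolution of kernel k, stride s and padding p: s * (i - 1) + k - 2p.

    This recovers the original input size exactly only when
    (original_input + 2p - k) is divisible by s.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    out = conv_output_size(5, k=3, s=2, p=1)
    print(out)                                           # 3
    print(transposed_conv_output_size(out, k=3, s=2, p=1))  # 5
```

When the stride does not divide evenly, several input sizes map to the same output size, so the transposed formula returns the smallest of them.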
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
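One simple way to combine max and average pooling, as described in direction (1) above, is a convex mixture of the two. The sketch below is an assumption about what such a combination could look like, not the paper's exact formulation: the name `mixed_pool` and the mixing weight `a` are hypothetical, and in a real network `a` would be a parameter learned by gradient descent rather than passed in by hand.

```python
def mixed_pool(region, a):
    """Pool one region as a * max + (1 - a) * average.

    region: non-empty sequence of activations in the pooling window.
    a: mixing weight in [0, 1]; a = 1 gives max pooling, a = 0 gives
       average pooling (hypothetical parameter, assumed learnable).
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing weight a must lie in [0, 1]")
    if len(region) == 0:
        raise ValueError("pooling region must not be empty")
    average = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * average


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(window, a))
```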
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
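The sampling step described above can be sketched for a single pooling region. This is a minimal illustration that assumes non-negative activations (for example, outputs of rectified units); the function name `stochastic_pool` and the handling of an all-zero region are assumptions, not details given in the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region, with probability
    proportional to its value (a multinomial over the region).

    Assumes every activation in region is non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # Assumed convention: an all-zero region pools to zero.
        return 0.0
    # random.choices draws with replacement according to the given weights.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [stochastic_pool([1.0, 3.0], rng=rng) for _ in range(10000)]
    # The larger activation should be chosen roughly 75% of the time.
    print(draws.count(3.0) / len(draws))
```

No extra hyper-parameter appears in the sampling step, which matches the abstract's description of the approach as hyper-parameter free.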
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
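A minimal sketch of what a gradient reversal layer could look like, assuming the common reading that it passes features through unchanged in the forward pass and flips the sign of the gradient in the backward pass. The class name `GradientReversal`, the `scale` parameter, and the manual `forward`/`backward` interface are illustrative assumptions, not taken from the abstract or from any particular deep learning package.

```python
class GradientReversal:
    """Hypothetical layer: identity forward, sign-flipped gradient backward."""

    def __init__(self, scale=1.0):
        # `scale` is an assumed knob controlling how strongly the reversed
        # gradient is weighted; the abstract does not specify one.
        self.scale = scale

    def forward(self, features):
        # Forward pass: features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # Backward pass: the gradient from the domain classifier is negated,
        # so the feature extractor is pushed to make domains harder to tell apart.
        return [-self.scale * g for g in grad_output]


layer = GradientReversal(scale=1.0)
features_out = layer.forward([1.0, -2.0])
grad_in = layer.backward([0.5, -1.0])
```

In a real framework the same behaviour would normally be expressed through that framework's custom-gradient mechanism rather than hand-written `backward` calls.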
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
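The straight-path analysis mentioned above can be sketched as evaluating the loss at evenly spaced points on the line segment between the initial and final parameters. This is a minimal illustration under stated assumptions: parameters are flat lists of floats, and `interpolate`, `loss_along_line`, and `toy_loss` are hypothetical names, with a toy quadratic standing in for a real network's training loss.

```python
def interpolate(theta_init, theta_final, alpha):
    # Point on the straight line: (1 - alpha) * theta_init + alpha * theta_final.
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    # Evaluate the loss at evenly spaced alphas in [0, 1].
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    # Assumed stand-in for a training loss: a quadratic bowl centred at 1.
    return sum((t - 1.0) ** 2 for t in theta)


curve = loss_along_line(toy_loss, [0.0, 0.0], [1.0, 1.0], num_points=5)
losses = [loss for _, loss in curve]
```

A bump in `losses` between the endpoints would indicate an obstacle on the path; a smoothly decreasing sequence, as in this toy case, would not.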
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
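The rank-based selection strategy described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `pressure` parameter (assumed here to be the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints), and sampling with replacement are all assumptions made for the example.

```python
import math
import random


def rank_selection_probs(latest_losses, pressure=100.0):
    # Selection probability decays exponentially with loss rank (rank 0 = greatest loss).
    n = len(latest_losses)
    if n == 0:
        raise ValueError("need at least one datapoint")
    # Indices sorted by latest known loss, largest first.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    # Assumed parameterization: weight at rank r is exp(-r * log(pressure) / n).
    decay = math.log(pressure) / n
    weights = [math.exp(-decay * rank) for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(latest_losses, batch_size, pressure=100.0, rng=random):
    # Draw a batch of indices (with replacement) using the rank-based probabilities.
    probs = rank_selection_probs(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


probs = rank_selection_probs([0.1, 2.0, 0.5], pressure=10.0)
batch = sample_batch([0.1, 2.0, 0.5], batch_size=4, rng=random.Random(0))
```

After each training step, the losses of the datapoints in the batch would be written back into `latest_losses` so that the ranking reflects the latest known values.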
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speed-up, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence of acoustic signals, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
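The role of the rounding scheme mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a fixed-point grid with step 2**-frac_bits and ignores saturation and word-length limits; the function names and the choice of 8 fractional bits are hypothetical and not taken from the paper. The point it shows is that stochastic rounding is unbiased in expectation, so updates smaller than the grid step are not systematically lost, whereas round-to-nearest maps them to zero.

```python
import math
import random


def round_to_nearest(x, frac_bits=8):
    """Deterministic rounding of x to a grid with step 2**-frac_bits."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x down or up to the grid, rounding up with probability
    equal to the fractional remainder, so that E[result] == x."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    small_update = 0.001  # smaller than the grid step 2**-8
    n = 100000
    mean = sum(stochastic_round(small_update, 8, rng) for _ in range(n)) / n
    print("round to nearest:", round_to_nearest(small_update))
    print("mean of stochastic rounding:", mean)
```

Averaged over many draws, the stochastically rounded value stays close to the true small update, which is one intuition for why gradient information can survive at low precision.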
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
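The shape relationships referred to in the abstract above can be sketched for one spatial dimension. This is a minimal illustration using the standard formulas for a square kernel with symmetric zero padding; the function names are hypothetical, and the transposed case assumes no extra output padding, so it recovers the original size only when the forward convolution divides evenly by the stride.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution (or pooling) layer:
    input length i, kernel size k, stride s, zero padding p."""
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution associated with a
    convolution of kernel size k, stride s and padding p."""
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # 5 -> 3 with k=3, s=2, p=1, and the transposed layer maps 3 back to 5.
    out = conv_output_size(5, k=3, s=2, p=1)
    print(out, transposed_conv_output_size(out, k=3, s=2, p=1))
```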
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
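One way to read "combining of max and average pooling" in the abstract above is as a weighted blend of the two results over a pooling region. The sketch below is an assumption about that reading, not the authors' implementation: `mixed_pool` uses a single mixing weight `alpha` (which would be learned in practice, but is passed in here), and `gated_pool` derives the weight from the region itself through a sigmoid of a dot product with a hypothetical gating vector `w`.

```python
import math


def mixed_pool(region, alpha):
    """Blend max and average pooling over one region.
    alpha=1 gives max pooling, alpha=0 gives average pooling."""
    mx = max(region)
    avg = sum(region) / len(region)
    return alpha * mx + (1.0 - alpha) * avg


def gated_pool(region, w):
    """Same blend, but the mixing weight depends on the region's contents:
    alpha = sigmoid(w . region), with w a (hypothetical) gating vector."""
    z = sum(wi * xi for wi, xi in zip(w, region))
    alpha = 1.0 / (1.0 + math.exp(-z))
    return mixed_pool(region, alpha)


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    print(mixed_pool(region, 0.5))
    print(gated_pool(region, [0.0, 0.0, 0.0, 0.0]))
```

With a gating vector of zeros the sigmoid evaluates to 0.5, so both calls in the usage example return the midpoint between the max and the mean of the region.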
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
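The sampling step described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `stochastic_pool` is hypothetical, the pooling region is passed as a flat list, and activations are assumed to be non-negative (for example, after a rectifier) so that they can serve directly as multinomial weights.

```python
import random

def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region.

    Each activation is chosen with probability proportional to its value.
    Assumption: all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # Hypothetical convention: an all-zero region pools to zero.
        return 0.0
    # random.choices draws from a weighted (multinomial) distribution.
    return rng.choices(region, weights=region, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    print(stochastic_pool([0.5, 1.5, 0.0, 2.0], rng))
```

Larger activations are picked more often, but smaller non-zero ones still get selected occasionally, which is where the regularizing noise comes from.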
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
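The claim above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is an illustration under simplifying assumptions: the function name `lp_unit` is hypothetical, the learned projections mentioned in the abstract are omitted, and the unit is taken to be the $p$-th root of the mean of $|x_i|^p$ for a fixed order $p$.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm: (mean of |x_i|^p) ** (1/p).

    Assumption: inputs are already projected; p is a fixed order >= 1.
    """
    n = len(inputs)
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    print(lp_unit(xs, 1))    # mean of absolute values
    print(lp_unit(xs, 2))    # root-mean-square
    print(lp_unit(xs, 200))  # approaches max(|x_i|) as p grows
```

With $p = 1$ the unit averages absolute values, with $p = 2$ it computes the root-mean-square, and as $p$ grows it approaches the maximum absolute input.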
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
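The masking idea in the abstract above can be illustrated for a single hidden layer. This is a sketch under stated assumptions rather than the authors' code: the helper names `made_masks` and `depends_on` are hypothetical, inputs are kept in their natural ordering, and each hidden unit is assigned a random integer degree between 1 and D-1, which is one way to build masks that satisfy the autoregressive constraint.

```python
import random

def made_masks(num_inputs, num_hidden, rng=random):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumption: num_inputs >= 2 and inputs use the ordering 1..num_inputs.
    Hidden unit k gets a degree m_k in [1, num_inputs - 1].
    """
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit k may see input d only if m_k >= d.
    in_mask = [[1 if degrees[k] >= d else 0
                for d in range(1, num_inputs + 1)]
               for k in range(num_hidden)]
    # Output d may see hidden unit k only if d > m_k.
    out_mask = [[1 if d > degrees[k] else 0
                 for k in range(num_hidden)]
                for d in range(1, num_inputs + 1)]
    return in_mask, out_mask

def depends_on(in_mask, out_mask, out_idx, in_idx):
    """True if some unmasked path links input in_idx to output out_idx (0-based)."""
    return any(out_mask[out_idx][k] and in_mask[k][in_idx]
               for k in range(len(in_mask)))

if __name__ == "__main__":
    in_mask, out_mask = made_masks(4, 8, random.Random(0))
    for row in out_mask:
        print(row)
```

Multiplying the weight matrices elementwise by these masks leaves each output connected only to strictly earlier inputs, so the outputs can be read as the conditional probabilities the abstract describes.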
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
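The Gramian Angular Field encoding named in the abstract above can be sketched in a few lines. The abstract does not spell out the formulas, so this sketch assumes the commonly used definition: rescale the series to [-1, 1], map each value to an angle phi = arccos(x), then set GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j). The function names are hypothetical, and the Markov Transition Field is not covered.

```python
import math


def rescale_to_unit_interval(series):
    """Linearly rescale a series to [-1, 1] (assumed preprocessing step)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    return [((x - hi) + (x - lo)) / (hi - lo) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D time series."""
    scaled = rescale_to_unit_interval(series)
    # Clamp to guard against tiny floating-point overshoot before acos.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf
```

Under this definition the GASF is symmetric, the GADF is antisymmetric, and the GASF diagonal equals 2x^2 - 1 for each rescaled value x, so the rescaled series can be recovered from the diagonal up to sign.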
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
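The rank-based selection strategy described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the `decay` parameter, the function names, and sampling with replacement are all choices made here for simplicity.

```python
import math
import random


def rank_selection_probs(losses, decay=0.1):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest known loss.
    `decay` is a hypothetical knob controlling the selection pressure.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, decay=0.1, rng=random):
    """Draw a batch of indices (with replacement, an assumption of this sketch)."""
    probs = rank_selection_probs(losses, decay)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

For example, `sample_batch([0.1, 2.0, 0.5, 1.0], batch_size=8, decay=0.5)` draws index 1 (the largest loss) most often on average. Setting `decay=0` recovers uniform sampling, which gives one way to control the selection pressure over time.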
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
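The SVRG update analyzed in the abstract above can be sketched as follows. This is a minimal illustration of the generic SVRG scheme (periodic full-gradient snapshot plus variance-reduced stochastic steps), not the authors' code; the function name `svrg`, its parameters, and the step size are assumptions chosen for the example, and the demo problem is a small convex least-squares fit used only because its solution is easy to check.

```python
import numpy as np

def svrg(grad_i, w0, n, step, epochs, inner_steps, rng):
    """Generic SVRG loop (illustrative sketch, hypothetical signature).

    grad_i(w, i) returns the gradient of the i-th component function at w.
    """
    w_snap = np.array(w0, dtype=float)
    for _ in range(epochs):
        # Full gradient at the snapshot point, recomputed once per epoch.
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap.copy()
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced gradient estimate (unbiased for the full gradient).
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w = w - step * g
        w_snap = w  # use the last inner iterate as the next snapshot
    return w_snap

# Demo on a noiseless least-squares problem (an assumption for illustration).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
b = A @ w_true

def ls_grad(w, i):
    # Gradient of 0.5 * (a_i . w - b_i)^2 with respect to w.
    return (A[i] @ w - b[i]) * A[i]

w_hat = svrg(ls_grad, np.zeros(3), n=50, step=0.01, epochs=50,
             inner_steps=200, rng=rng)
```

The same loop applies unchanged to nonconvex component functions; only `grad_i` differs.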
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) of 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speedup, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speedup is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
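Stochastic rounding, as referred to in the abstract above, rounds a value up or down to a neighbouring grid point with probability proportional to its proximity, so the result is unbiased in expectation. The following is a minimal NumPy sketch under stated assumptions: the function name `stochastic_round`, the signed fixed-point layout (`word_bits` total bits, `frac_bits` fractional bits), and the saturation behaviour are illustrative choices, not details taken from the paper.

```python
import numpy as np

def stochastic_round(x, frac_bits, rng, word_bits=16):
    """Round x to a signed fixed-point grid with stochastic rounding.

    The grid spacing is 2**-frac_bits; values outside the representable
    range of a word_bits-wide signed integer are saturated (an assumption).
    """
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    prob_up = scaled - lower  # distance to the lower grid point, in [0, 1)
    # Round up with probability prob_up, otherwise round down.
    rounded = lower + (rng.random(size=scaled.shape) < prob_up)
    # Saturate to the signed integer range of the chosen word length.
    rounded = np.clip(rounded, -2 ** (word_bits - 1), 2 ** (word_bits - 1) - 1)
    return rounded / scale

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.3), frac_bits=2, rng=rng)
exact = stochastic_round(np.array([0.5, -1.25]), frac_bits=2, rng=rng)
```

With `frac_bits=2` the grid spacing is 0.25, so 0.3 is rounded to either 0.25 or 0.5, and the sample mean stays close to 0.3, whereas round-to-nearest would always return 0.25.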
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
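The relationship between input shape, kernel shape, zero padding, strides and output shape described in the abstract above can be illustrated with the standard size formulas along a single axis. This is a small sketch under stated assumptions: the function names are hypothetical, dilation is not modelled, and the transposed-convolution formula covers only the case where the forward convolution's stride divides `i + 2p - k` exactly.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a convolution.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of the transposed convolution associated
    with a forward convolution of kernel k, stride s and padding p.

    Assumes the forward convolution had (input + 2p - k) divisible by s.
    """
    return s * (i - 1) + k - 2 * p

# A 5-wide input, 3-wide kernel, stride 2, padding 1 gives a 3-wide output,
# and the associated transposed convolution maps 3 back to 5.
out = conv_output_size(5, k=3, s=2, p=1)
back = transposed_conv_output_size(out, k=3, s=2, p=1)
```

Pooling layers follow the same forward formula with `p = 0`, since only the window size and stride matter for the output shape.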
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
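The sampling step described in the preceding abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes non-negative activations (for example after a rectifier), treats the normalized activations of one pooling region as the multinomial probabilities, and returns zero for an all-zero region. The function name `stochastic_pool` and the zero-region rule are assumptions made for this sketch.

```python
import random


def stochastic_pool(region, rng):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    so the region's normalized activations act as a multinomial distribution.
    Assumes every activation is non-negative.
    """
    total = sum(region)
    if total <= 0:
        # Assumption for this sketch: an all-zero region pools to zero.
        return 0.0
    # random.choices draws with replacement using the given relative weights.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
    region = [0.0, 1.5, 0.5, 2.0]  # one flattened 2x2 pooling region
    print([stochastic_pool(region, rng) for _ in range(5)])
```

Larger activations are drawn more often, but smaller non-zero ones can still be selected, which is the source of the regularizing noise.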
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
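The normalized $L_p$ norm mentioned in the preceding abstract, and its link to average, root-mean-square and max pooling, can be illustrated with a small Python sketch. It assumes the unit has already received its projection outputs as a list of real numbers and that the order $p$ is at least 1. The function name `lp_unit` is hypothetical, and the learned projections and learned orders described in the abstract are not modelled here.

```python
def lp_unit(projections, p):
    """Normalized L_p norm of a group of projection outputs.

    Computes ((1/N) * sum(|z_i| ** p)) ** (1/p) for a non-empty list of
    real values z_i and an order p >= 1 (both assumed, not checked).
    """
    n = len(projections)
    return (sum(abs(z) ** p for z in projections) / n) ** (1.0 / p)


if __name__ == "__main__":
    z = [1.0, -2.0, 3.0]
    print(lp_unit(z, 1))    # p = 1: mean of absolute values
    print(lp_unit(z, 2))    # p = 2: root-mean-square
    print(lp_unit(z, 200))  # large p: approaches the largest absolute value
```

Varying $p$ moves the unit smoothly between these pooling behaviours, which is why a separate order per unit is a meaningful parameter.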
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
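The masking idea in the abstract above can be sketched as follows. This is a minimal illustration for a single hidden layer, not the authors' implementation: the degree-based mask rule, the function names `made_masks` and `input_to_output_paths`, and the random assignment of hidden-unit degrees are all assumptions introduced here.

```python
import random


def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumed rule: input d gets degree d + 1, each hidden unit gets a random
    degree in [1, n_inputs - 1], and a connection is kept only if it cannot
    leak information from an input to an output at the same or an earlier
    position in the ordering.
    """
    if n_inputs < 2:
        raise ValueError("need at least two inputs")
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Hidden unit k may see input d only if its degree is at least d's degree.
    mask_in = [[int(hid_deg[k] >= in_deg[d]) for d in range(n_inputs)]
               for k in range(n_hidden)]
    # Output d may see hidden unit k only if d's degree is strictly larger.
    mask_out = [[int(in_deg[d] > hid_deg[k]) for k in range(n_hidden)]
                for d in range(n_inputs)]
    return mask_in, mask_out


def input_to_output_paths(mask_in, mask_out):
    """Count the unmasked paths from every input i to every output o."""
    n_hidden = len(mask_in)
    n_inputs = len(mask_out)
    return [[sum(mask_out[o][k] * mask_in[k][i] for k in range(n_hidden))
             for i in range(n_inputs)]
            for o in range(n_inputs)]
```

In an actual network the masks would be multiplied element-wise with the weight matrices. Counting paths, as in `input_to_output_paths`, is one way to check that output o depends only on inputs that precede it in the ordering.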
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
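The Kullback-Leibler objective mentioned in the abstract above can be sketched for a single frame as follows. This is only an illustration under stated assumptions: the helper names `softmax` and `kl_divergence`, the clipping constant `eps`, and the toy logits are introduced here and do not come from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one frame of class posteriors.

    Terms where the teacher probability is zero contribute nothing;
    student probabilities are clipped at eps to avoid division by zero.
    """
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(teacher_probs, student_probs)
               if p > 0.0)


# Hypothetical example: soft targets from a teacher, flat student output.
teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.0, 1.0, 1.0])
frame_loss = kl_divergence(teacher, student)
```

Averaging this per-frame quantity over a training set gives a loss that is zero only when the student reproduces the teacher's soft targets exactly.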
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
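The Gramian angular fields named in the abstract above can be sketched as follows. The abstract does not spell out the formulas, so the definitions used here are assumptions: the series is min-max rescaled to [0, 1], each value is mapped to an angle with arccos, and the fields are taken as cos(phi_i + phi_j) for GASF and sin(phi_i - phi_j) for GADF. The function names are hypothetical.

```python
import math


def rescale01(series):
    """Min-max rescale a series to the interval [0, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    return [(x - lo) / (hi - lo) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D time series."""
    scaled = rescale01(series)
    # Clamp before arccos to guard against floating-point overshoot.
    phi = [math.acos(min(1.0, max(0.0, x))) for x in scaled]
    n = len(phi)
    gasf = [[math.cos(phi[i] + phi[j]) for j in range(n)] for i in range(n)]
    gadf = [[math.sin(phi[i] - phi[j]) for j in range(n)] for i in range(n)]
    return gasf, gadf
```

Under these assumed definitions the GASF diagonal equals cos(2 * phi_i) = 2 * x_i^2 - 1, so the rescaled value x_i in [0, 1] can be recovered from the diagonal, which is one way to read the bijection property the abstract refers to.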
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
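The gradient reversal layer described in the abstract above can be sketched without any deep learning package. This is a minimal illustration, assuming the layer is the identity in the forward pass and negates the incoming gradient in the backward pass. The class name `GradientReversal` and the `scale` factor are hypothetical and not taken from the abstract.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # Assumed knob: how strongly the reversed gradient is weighted.
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, upstream_grad):
        # The domain classifier's gradient is negated before it reaches
        # the feature extractor, pushing features to be domain-invariant.
        return [-self.scale * g for g in upstream_grad]


layer = GradientReversal(scale=0.5)
passed = layer.forward([1.0, -2.0])
reversed_grad = layer.backward([2.0, -4.0])
```

In a framework with automatic differentiation the same behaviour would typically be expressed as a custom gradient on an identity operation.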
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
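The straight-path analysis described above can be sketched as follows: evaluate the loss at evenly spaced points on the line segment between the initial and final parameters. This is a minimal illustration only; the function name `interpolate_losses`, the flat-list parameter representation, and the toy quadratic loss in the usage note are assumptions, not details taken from the abstract.

```python
def interpolate_losses(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn along the straight line from theta_init to theta_final.

    theta_init and theta_final are flat lists of floats of equal length
    (a hypothetical representation chosen for this sketch).
    Returns a list of (alpha, loss) pairs with alpha running from 0 to 1.
    """
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    if len(theta_init) != len(theta_final):
        raise ValueError("parameter vectors must have the same length")
    results = []
    for j in range(num_points):
        alpha = j / (num_points - 1)
        # Convex combination of the two parameter vectors.
        theta = [(1 - alpha) * a + alpha * b
                 for a, b in zip(theta_init, theta_final)]
        results.append((alpha, loss_fn(theta)))
    return results


def toy_quadratic_loss(theta):
    """Toy stand-in for a training loss: sum of squared parameters."""
    return sum(t * t for t in theta)


curve = interpolate_losses(toy_quadratic_loss, [2.0, 2.0], [0.0, 0.0], num_points=5)
```

With a real network, `loss_fn` would compute the training loss for the given parameters; a curve that decreases smoothly from `alpha = 0` to `alpha = 1` is the kind of evidence the abstract refers to.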
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
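The rank-based selection rule described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the parameter `selection_pressure` (taken here to be the probability ratio between the highest-loss and lowest-loss datapoint) and both function names are assumptions made for this example, and sampling is done with replacement for simplicity.

```python
import math
import random


def rank_selection_probs(losses, selection_pressure=100.0):
    """Return one selection probability per datapoint.

    Datapoints are ranked by their latest known loss (largest loss gets
    rank 0). The probability decays exponentially with rank so that the
    rank-0 point is `selection_pressure` times more likely to be picked
    than the last-ranked point (an assumed parametrization).
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(selection_pressure) / max(n - 1, 1)
    weights = [math.exp(-decay * rank) for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, selection_pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, selection_pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


latest_losses = [0.1, 2.0, 0.5]
probs = rank_selection_probs(latest_losses, selection_pressure=100.0)
batch = sample_batch(latest_losses, batch_size=5, rng=random.Random(0))
```

Setting `selection_pressure` to 1 recovers uniform sampling, which gives one simple way to control the selection pressure over time.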
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
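Stochastic rounding to a fixed-point grid, as referred to above, can be sketched as follows: a value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounded value is unbiased in expectation. The split into `frac_bits` fractional bits within a signed `word_bits`-wide word and the saturation behaviour are assumptions made for this example, not details given in the abstract.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value using stochastic rounding.

    The grid spacing is 2**-frac_bits. The value is rounded up with
    probability equal to the fractional remainder and down otherwise,
    then saturated to the range of a signed word_bits-wide integer
    (the format split is an assumption of this sketch).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up equals the distance above the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    clipped = max(min_int, min(max_int, lower))
    return clipped / scale


rng = random.Random(0)
samples = [stochastic_round_fixed(0.3, rng=rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to the same grid point every time, whereas here the average of many rounded values stays close to 0.3, which is the property that lets small updates survive at low precision.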
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
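The relationship between input shape, kernel shape, zero padding, strides and output shape mentioned above can be sketched along one spatial axis. The formulas used are the standard ones, $o = \lfloor (i + 2p - k)/s \rfloor + 1$ for a convolution or pooling layer and $o' = s(i - 1) + k - 2p$ for the matching transposed convolution without extra output padding; the function names are hypothetical and chosen for this example.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size along one axis of a convolution or pooling layer.

    i: input size, k: kernel size, p: zero padding added on each side,
    s: stride. Implements o = floor((i + 2p - k) / s) + 1.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output size along one axis of a transposed convolution.

    Implements o' = s * (i - 1) + k - 2p, i.e. the input size that the
    corresponding convolution would need (no extra output padding assumed).
    """
    return s * (i - 1) + k - 2 * p


# A 5x5 kernel without padding shrinks a 28-wide input to 24.
no_padding = conv_output_size(28, 5)
# A 3x3 kernel with padding 1 and stride 1 preserves the input size.
same_padding = conv_output_size(32, 3, p=1, s=1)
# Stride 2 roughly halves the size; the transposed layer maps it back.
strided = conv_output_size(5, 3, p=1, s=2)
restored = transposed_conv_output_size(strided, 3, p=1, s=2)
```

When $(i + 2p - k)$ is not a multiple of $s$, several input sizes share one output size, so the transposed formula recovers only the smallest of them.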
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
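A minimal sketch of the masking idea described above, assuming a single hidden layer and a fixed natural ordering of the inputs. The function names and the cyclic assignment of hidden-unit degrees are illustrative choices, not taken from the abstract. Each hidden unit gets a degree between 1 and D-1; it may read inputs up to its degree, and output d may read only hidden units with a smaller degree, so output d depends only on inputs before d.

```python
def made_masks(num_inputs, num_hidden):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumes num_inputs >= 2 and the natural input ordering 1..D.
    """
    # Hypothetical degree assignment: hidden units cycle through 1..D-1.
    hidden_degrees = [1 + (k % (num_inputs - 1)) for k in range(num_hidden)]
    # Input-to-hidden: hidden unit k may see input d only if degree(k) >= d.
    mask_in = [[1 if hidden_degrees[k] >= d else 0
                for d in range(1, num_inputs + 1)]
               for k in range(num_hidden)]
    # Hidden-to-output: output d may see hidden unit k only if d > degree(k).
    mask_out = [[1 if d > hidden_degrees[k] else 0
                 for k in range(num_hidden)]
                for d in range(1, num_inputs + 1)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the paths from each input (column) to each output (row)."""
    num_hidden, num_inputs = len(mask_in), len(mask_in[0])
    return [[sum(row_out[k] * mask_in[k][i] for k in range(num_hidden))
             for i in range(num_inputs)]
            for row_out in mask_out]


mask_in, mask_out = made_masks(num_inputs=4, num_hidden=6)
paths = connectivity(mask_in, mask_out)
for row in paths:
    print(row)  # nonzero entries appear only strictly below the diagonal
```

In a real model the masks would be multiplied elementwise into the weight matrices before each forward pass; the path-count matrix here only checks that the autoregressive constraint holds.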
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
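The relative improvement quoted above can be checked with a short calculation. This is only a worked arithmetic sketch; the helper name `relative_improvement` is hypothetical, and the two WER figures are the ones given in the abstract.

```python
def relative_improvement(baseline, improved):
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline - improved) / baseline


# WER values quoted in the abstract: baseline 4.54, distilled small DNN 3.93.
gain = relative_improvement(4.54, 3.93)
print(f"{gain:.1%}")  # prints 13.4%
```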
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
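The distillation idea above can be sketched in a few lines, assuming the common soft-target formulation in which teacher and student logits are both softened by a temperature before computing a cross-entropy. The abstract itself does not spell out the loss, so the temperature parameter and all function names here are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give flatter outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy H(target, predicted) for two discrete distributions."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    soft_targets = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return cross_entropy(soft_targets, student_probs)


teacher = [5.0, 2.0, -1.0]   # e.g. averaged logits of an ensemble (made-up values)
student = [1.0, 0.5, 0.0]    # made-up student logits
print(distillation_loss(teacher, student))
```

The loss is smallest when the student reproduces the teacher's softened distribution, which is what lets a single small model absorb the ensemble's behaviour.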
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
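A minimal sketch of the Gramian Angular Field encoding mentioned above, assuming the usual definition: rescale the series to [-1, 1], map each value x to an angle phi = arccos(x), then set GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j). The abstract does not state these formulas, and the function names are illustrative. The sketch assumes a non-constant input series.

```python
import math


def rescale(series):
    """Linearly rescale a non-constant series to the interval [-1, 1]."""
    lo, hi = min(series), max(series)
    return [(2 * x - hi - lo) / (hi - lo) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) matrices as nested lists."""
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    scaled = [max(-1.0, min(1.0, x)) for x in rescale(series)]
    phi = [math.acos(x) for x in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


gasf, gadf = gramian_angular_fields([1.0, 2.0, 4.0, 3.0])
for row in gasf:
    print([round(v, 3) for v in row])
```

The resulting matrices can be treated as single-channel images, which is what allows convolutional models to be applied to the time series.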
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution the TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation into which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist in training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks to the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
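The Gramian Angular Fields mentioned above can be sketched as follows. This is a minimal illustration under the usual definition (rescale the series, map each value to an angle with arccos, then take cos of angle sums for GASF and sin of angle differences for GADF). It is not the authors' code: the function names and the choice to rescale to [-1, 1] are assumptions made here, and the Markov Transition Field is not covered.

```python
import math

def rescale(series, lo=-1.0, hi=1.0):
    """Linearly rescale a series into [lo, hi] (assumed preprocessing step)."""
    mn, mx = min(series), max(series)
    if mx == mn:
        raise ValueError("series must contain at least two distinct values")
    return [lo + (hi - lo) * (v - mn) / (mx - mn) for v in series]

def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D series."""
    x = rescale(series)
    # Clamp before acos to guard against tiny floating-point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf

gasf, gadf = gramian_angular_fields([0.0, 1.0, 2.0])
```

The GASF diagonal holds cos(2*phi_i) = 2*x_i^2 - 1, so for data rescaled to [0, 1] the original values can be read back from the diagonal.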
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further reduce the size of the multilayer bootstrap network by training a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
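The rank-based selection rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names and the `ratio` parameter (how many times more likely the highest-loss datapoint is to be drawn than the lowest-loss one) are assumptions introduced here.

```python
import random

def selection_probabilities(losses, ratio=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    losses: latest known loss value per datapoint.
    ratio:  assumed knob; P(rank 0) / P(last rank), where rank 0 is the largest loss.
    """
    n = len(losses)
    # Indices sorted so that the datapoint with the greatest loss comes first.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank decay factor chosen so the first/last probability ratio equals `ratio`.
    decay = ratio ** (1.0 / (n - 1)) if n > 1 else 1.0
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** (-rank)
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(losses, batch_size, ratio=100.0, rng=random):
    """Draw a batch of datapoint indices (with replacement) using the rank-based rule."""
    probs = selection_probabilities(losses, ratio)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)

batch = sample_batch([0.1, 2.0, 0.5, 0.9], batch_size=2, rng=random.Random(0))
```

In a training loop, the stored losses would be refreshed for the datapoints in each batch after the forward pass, so the ranking tracks the latest known values.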
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
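The stochastic rounding mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stochastic_round`, the split into `frac_bits` fractional bits within a `word_bits`-wide word, and the saturation behaviour are all assumptions made for the example. A value is rounded up with probability equal to its distance from the lower grid point, so the rounded value is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point grid with step 2**-frac_bits.

    Hypothetical helper for illustration. The value is rounded up with
    probability equal to its fractional distance from the lower grid
    point, and saturates at the limits of a word_bits-wide signed word.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up is the leftover fraction in [0, 1).
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    lower = max(lo, min(hi, lower))
    return lower / scale


# With 2 fractional bits the grid step is 0.25, so 0.3 becomes either
# 0.25 or 0.5, and the average over many draws stays close to 0.3.
rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

By contrast, round-to-nearest would map 0.3 to 0.25 every time, so small updates below half a grid step would be lost systematically.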
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
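As an illustration of the kind of relationship the guide above describes, the sketch below computes output sizes along one spatial axis. The formulas are the commonly used ones for a convolution and for the transposed convolution that undoes its shape; the function names and the symbols `i` (input size), `k` (kernel size), `s` (stride) and `p` (zero padding) are assumptions chosen for this example, and the transposed case assumes no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    if k > i + 2 * p:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of a transposed convolution along one axis.

    Assumes no additional output padding, so it inverts the shape of a
    convolution whose padded input size minus k is a multiple of s.
    """
    return s * (i - 1) + k - 2 * p


# A 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives 3
# outputs; the matching transposed convolution maps 3 back to 5.
down = conv_output_size(5, 3, s=2, p=1)             # 3
up = transposed_conv_output_size(down, 3, s=2, p=1)  # 5
```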
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
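The sampling step described above can be sketched for a single pooling region as follows. This is a minimal illustration under stated assumptions: activations are taken to be non-negative (for example after a rectifier), the probabilities are the activations normalised to sum to one, and the name `stochastic_pool` and the handling of an all-zero region are choices made for this example rather than details taken from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Hypothetical helper for illustration. Each activation is chosen
    with probability proportional to its value, which amounts to a
    multinomial draw over the activities in the region.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    if sum(region) == 0:
        # No activity in the region: nothing to sample from.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


# A 2x2 pooling region flattened into a list. Larger activations are
# picked more often, but smaller non-zero ones are still picked sometimes.
rng = random.Random(0)
region = [1.0, 0.0, 3.0, 0.5]
picked = stochastic_pool(region, rng=rng)
```

Compared with max pooling, which would always return 3.0 here, the random choice injects noise during training, which is where the regularising effect comes from.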
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
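The stochastic rounding mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the split of the 16 bits into integer and fractional parts (`frac_bits=8`) and the function names are assumptions made for the example.

```python
import math
import random

def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with step 2**-frac_bits.

    x is rounded up with probability equal to its fractional distance
    from the lower grid point, so the result is unbiased in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # The remainder in [0, 1) is the probability of rounding up.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale

def to_fixed_16(x, frac_bits=8, rng=random):
    """Stochastically round, then saturate to the signed 16-bit range."""
    scale = 1 << frac_bits
    lo, hi = -(1 << 15) / scale, ((1 << 15) - 1) / scale
    return min(max(stochastic_round(x, frac_bits, rng), lo), hi)
```

Unlike round-to-nearest, updates smaller than half a grid step are not always lost here: they survive with a probability proportional to their size.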
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
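The relationships that the abstract above refers to can be illustrated along a single axis. This sketch uses the standard size formulas with input size `i`, kernel size `k`, stride `s` and zero padding `p`; the function names are chosen here for the example.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution: floor((i + 2p - k) / s) + 1."""
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the matching transposed convolution: s * (i - 1) + k - 2p.

    This recovers the input size of conv_output_size whenever
    (input + 2p - k) is divisible by s.
    """
    return s * (i - 1) + k - 2 * p
```

For instance, `conv_output_size(7, 3, s=2)` gives 3 and `transposed_conv_output_size(3, 3, s=2)` gives back 7. Pooling layers follow the same output-size formula as convolutions with `p=0`.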
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
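Training against soft targets from a teacher model, as the abstract above describes, can be sketched as a cross-entropy between the teacher's and the student's output distributions. The `temperature` parameter and the function names are assumptions added for illustration; the abstract itself only states that soft targets are used.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; higher temperature gives a softer distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between teacher soft targets and student predictions."""
    targets = softmax(teacher_logits, temperature)
    preds = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, preds))
```

For a fixed teacher, this loss is smallest when the student reproduces the teacher's distribution, so minimizing it pulls the student towards the teacher's outputs rather than towards one-hot labels.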
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
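The first direction in the abstract above, combining max and average pooling, can be sketched on a single pooling region. Two variants are shown as assumptions about what such a combination might look like: a fixed mixing proportion `a`, and a proportion computed from the region itself through a sigmoid gate. In a real network `a` or `gate_weights` would be learned; the names are hypothetical.

```python
import math

def mixed_pool(region, a):
    """Blend max and average pooling with a mixing proportion a in [0, 1]."""
    return a * max(region) + (1.0 - a) * sum(region) / len(region)

def gated_pool(region, gate_weights):
    """Let the mixing proportion depend on the region via a sigmoid gate."""
    z = sum(w * x for w, x in zip(gate_weights, region))
    a = 1.0 / (1.0 + math.exp(-z))
    return mixed_pool(region, a)
```

Setting `a=1.0` recovers max pooling and `a=0.0` recovers average pooling, so both conventional operations are special cases of the blend.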
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
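The knowledge-transfer objective described above can be sketched in a few lines. This is only an illustration, assuming the soft alignments are per-frame class posteriors from the teacher RNN and that the loss is KL(teacher || student); the function names, the direction of the divergence, and the `eps` smoothing term are assumptions, not details taken from the abstract.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions of equal length.

    eps is a hypothetical smoothing constant that guards against log(0).
    """
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher, student)
        if p > 0.0
    )


# Hypothetical per-frame scores over three output classes.
teacher_posterior = softmax([2.0, 0.5, -1.0])  # soft alignment from the large RNN
student_posterior = softmax([1.0, 1.0, 0.0])   # output of the small DNN
frame_loss = kl_divergence(teacher_posterior, student_posterior)
```

In training, this per-frame loss would be averaged over frames and minimized with respect to the small DNN's parameters only.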
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding a compressive MBN that is both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
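The Gramian Angular Field encoding named above can be sketched as follows. This is a minimal illustration, assuming the commonly stated definitions GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j) with phi = arccos of the rescaled series; the rescaling to [-1, 1] and all function names are assumptions, and the Markov Transition Field is not covered.

```python
import math


def rescale(series):
    """Linearly rescale a series to the interval [-1, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    return [((x - hi) + (x - lo)) / (hi - lo) for x in series]


def angles(series):
    """Polar-coordinate angles phi_i = arccos(x_i) of the rescaled series."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return [math.acos(max(-1.0, min(1.0, v))) for v in rescale(series)]


def gasf(series):
    """Gramian Angular Summation Field: entry (i, j) is cos(phi_i + phi_j)."""
    phi = angles(series)
    return [[math.cos(a + b) for b in phi] for a in phi]


def gadf(series):
    """Gramian Angular Difference Field: entry (i, j) is sin(phi_i - phi_j)."""
    phi = angles(series)
    return [[math.sin(a - b) for b in phi] for a in phi]


# A length-3 series becomes a pair of 3x3 "images".
summation_image = gasf([0.0, 1.0, 2.0])
difference_image = gadf([0.0, 1.0, 2.0])
```

The resulting square matrices can be stacked as image channels and passed to a convolutional model.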
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
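The exact factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ stated above can be checked numerically on a toy discrete distribution. The sketch below is illustrative only: the distribution, the encoder `x // 2`, and the helper name `factorize` are hypothetical choices, and the conditional is computed in closed form rather than learned.

```python
def factorize(p_x, f):
    """Split P(X) into P(H) and P(X | H = f(X)) for a deterministic encoder f.

    p_x maps each discrete value x to its probability.
    """
    # P(H = h) collects the mass of every x that the encoder maps to h.
    p_h = {}
    for x, p in p_x.items():
        p_h[f(x)] = p_h.get(f(x), 0.0) + p
    # P(X = x | H = f(x)) puts no mass on any x' with f(x') != f(x).
    p_x_given_h = {x: p / p_h[f(x)] for x, p in p_x.items()}
    return p_h, p_x_given_h


def encoder(x):
    """Hypothetical deterministic discrete encoder f: pairs of values share a code."""
    return x // 2


toy_p_x = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
p_h, p_x_given_h = factorize(toy_p_x, encoder)

# Reconstruct P(X = x) as the product of the two factors.
reconstructed = {x: p_x_given_h[x] * p_h[encoder(x)] for x in toy_p_x}
```

Taking logs of the two factors gives the reconstruction term and the regularizer on the code $h = f(x)$ that the abstract describes.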
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
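A gradient reversal layer of the kind mentioned above can be sketched without any framework. The class below is a hedged illustration, assuming the usual formulation (identity in the forward pass, gradient multiplied by a negative constant in the backward pass); the class name, the list-based interface, and the scaling parameter `lam` are assumptions made for this example.

```python
class GradientReversal:
    """Identity on the forward pass; flips and scales gradients on the backward pass."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical trade-off weight between task loss and domain loss.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is negated, so the
        # extractor is pushed to make the domains harder to tell apart.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=2.0)
activations = layer.forward([0.3, -1.2])
feature_grads = layer.backward([0.5, -2.0])
```

In a real framework the same behaviour would be attached to the automatic differentiation machinery rather than called by hand.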
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
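The rank-based selection strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the parameter `selection_pressure` (roughly the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints), the function names, and sampling with replacement are all choices made for this example.

```python
import math
import random


def selection_probabilities(losses, selection_pressure=100.0):
    """Probabilities that decay exponentially with the rank of the latest known loss."""
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Each step down in rank divides the weight by a constant factor.
    decay = math.exp(math.log(selection_pressure) / n)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = 1.0 / (decay ** rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, rng, selection_pressure=100.0):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probabilities(losses, selection_pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


# Hypothetical latest-known losses for four datapoints.
latest_losses = [0.1, 2.0, 0.5, 1.0]
batch_indices = sample_batch(latest_losses, batch_size=8, rng=random.Random(0))
```

After each training step, the losses of the datapoints in the batch would be refreshed so that the ranking tracks the current model.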
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that compressive MBN combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding an unsupervised learner that is both effective and efficient.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
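The idea of training a student against a teacher's soft targets can be sketched roughly as follows. This is only an illustrative sketch, not the method of the abstract above: the temperature parameter, the function names `softmax` and `soft_target_loss`, and the use of cross-entropy as the objective are all assumptions introduced here.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 flattens the distribution (assumed knob, not from the abstract).
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    # Cross-entropy between the teacher's soft targets and the student's predictions.
    targets = softmax(teacher_logits, temperature)
    preds = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, preds))
```

A student whose outputs match the teacher's attains the lowest value of this loss for a given teacher distribution, which is what makes the soft targets usable as a training signal.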
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
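One simple way to combine max and average pooling, as the abstract above describes in general terms, is a convex mixture of the two. The sketch below is an assumption made for illustration: the function name `mixed_pool`, the scalar mixing weight `a`, and the choice to pass `a` in as a fixed value (rather than learn it) are not taken from the abstract.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    a = 1.0 gives pure max pooling, a = 0.0 gives pure average pooling.
    In a trained network `a` would be a learned parameter (assumed here).
    """
    avg = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * avg
```

For example, `mixed_pool([1.0, 2.0, 3.0, 6.0], 0.5)` averages the max (6.0) and the mean (3.0) of the region.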
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
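The stochastic pooling step described above can be sketched as follows. This is a minimal sketch under stated assumptions: activations are taken to be non-negative (for example, after a rectifier), the sampling probabilities are taken to be the activations normalized by their sum, and the function name `stochastic_pool` and the all-zero fallback are hypothetical choices made here.

```python
import random

def stochastic_pool(region, rng):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value
    (assumes all activations are non-negative).
    """
    total = sum(region)
    if total == 0:
        return 0.0  # assumed fallback: nothing is active in this region
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, e.g. `stochastic_pool([0.0, 1.0, 3.0], random.Random(0))`.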
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBNs) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they have led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. We then evaluate the performance of the algorithm experimentally. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
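The gradient reversal layer mentioned above can be pictured as an identity map in the forward pass whose backward pass multiplies the incoming gradient by a negative constant. The following is a minimal plain-Python sketch under that reading; the class name `GradientReversal` and the scale `lam` are hypothetical, and no deep learning package is used.

```python
class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical reversal strength (an assumption)

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient sent back to the feature extractor is sign-flipped,
        # pushing the features to become less informative about the domain.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
forward_out = layer.forward([1.0, -2.0])
backward_out = layer.backward([0.2, -0.4])
```

In a real framework the same effect would be obtained with a custom backward rule; the sketch only shows the sign flip.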
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
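The analysis described above amounts to evaluating the loss at points on the straight line between the initial and final parameters. A minimal sketch of that idea follows; the function names and the toy quadratic loss are assumptions made for illustration and stand in for a real network's training objective.

```python
def interpolate(theta_init, theta_final, alpha):
    # Point on the straight line: (1 - alpha) * theta_init + alpha * theta_final.
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, steps=5):
    # Evaluate the loss at steps + 1 evenly spaced points from alpha=0 to alpha=1.
    alphas = [i / steps for i in range(steps + 1)]
    return [loss_fn(interpolate(theta_init, theta_final, a)) for a in alphas]


def toy_loss(theta):
    # Hypothetical stand-in objective with its minimum at theta = (1, 1).
    return sum((t - 1.0) ** 2 for t in theta)


losses = loss_along_path(toy_loss, [0.0, 0.0], [1.0, 1.0], steps=4)
```

A bump in `losses` between the endpoints would indicate an obstacle on the path; for this convex toy loss the values decrease monotonically.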
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
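The rank-based strategy above can be sketched as follows: datapoints are sorted by their latest known loss, and the selection probability decays exponentially with rank. The `decay` parameter, the function names, and sampling with replacement are assumptions made for this illustration; the paper's exact parametrization of the selection pressure may differ.

```python
import math
import random


def selection_probabilities(latest_losses, decay=0.5):
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(len(latest_losses)), key=lambda i: -latest_losses[i])
    weights = [0.0] * len(latest_losses)
    for rank, idx in enumerate(order):
        # Weight decays exponentially as a function of rank.
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, decay=0.5, rng=random):
    # Draw indices (with replacement) according to the rank-based probabilities.
    probs = selection_probabilities(latest_losses, decay)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


probs = selection_probabilities([0.1, 2.0, 0.7, 0.3])
batch = sample_batch([0.1, 2.0, 0.7, 0.3], batch_size=3)
```

Setting `decay` to 0 recovers uniform sampling, which gives one simple way to control the selection pressure over time.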
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
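Stochastic rounding, as referred to above, rounds a value to one of its two nearest representable fixed-point neighbours, choosing the upper one with probability equal to the fractional remainder, so the result is unbiased in expectation. The sketch below assumes a hypothetical helper `stochastic_round` with `frac_bits` fractional bits and ignores word length and saturation, which a real 16-bit format would also need to handle.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    # Express x in units of the smallest representable step, 2 ** -frac_bits.
    scale = 2 ** frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so that the expected rounded value equals x.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


random.seed(0)
samples = [stochastic_round(0.3, frac_bits=2) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

With `frac_bits=2` the value 0.3 lands on either 0.25 or 0.5, and the sample mean stays close to 0.3, whereas round-to-nearest would always return 0.25.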
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
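The pooling interpretation in the abstract above can be illustrated with a minimal sketch. It assumes the normalized L_p norm has the form ((1/N) * sum(|x_i|^p))^(1/p) and omits the learned projections that feed the unit; the function name lp_unit and the sample values are hypothetical, chosen only for illustration.

```python
import math

def lp_unit(inputs, p):
    """Normalized L_p norm: ((1/N) * sum(|x_i|^p)) ** (1/p), for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    magnitudes = [abs(x) for x in inputs]
    largest = max(magnitudes)
    if largest == 0.0:
        return 0.0
    # Factor out the largest magnitude so that large p does not overflow.
    mean_power = sum((m / largest) ** p for m in magnitudes) / len(magnitudes)
    return largest * mean_power ** (1.0 / p)

signals = [3.0, -4.0]              # hypothetical pre-activations
mean_abs = lp_unit(signals, 1)     # p = 1: average of absolute values
rms = lp_unit(signals, 2)          # p = 2: root-mean-square
near_max = lp_unit(signals, 200)   # large p: approaches max(|x_i|)
```

Under this assumed form, p = 1 and p = 2 recover average (of magnitudes) and root-mean-square pooling, and increasing p moves the output toward max pooling, which is the sense in which a single order parameter interpolates between the conventional operators.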
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
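The masking idea in the abstract above can be sketched for a single hidden layer. The abstract does not say how the masks are built, so the degree-based construction below is an assumption, and the names build_masks and dependency_matrix are hypothetical. Each hidden unit is given a random degree; it may read inputs up to that degree and may feed only outputs with a strictly larger index.

```python
import random

def build_masks(num_inputs, num_hidden, seed=0):
    """Return (input_mask, output_mask) for a one-hidden-layer masked autoencoder."""
    rng = random.Random(seed)
    # Assumed scheme: each hidden unit gets a degree in 1..num_inputs-1.
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit k may read input d only if degree(k) >= d.
    input_mask = [[1 if degrees[k] >= d else 0 for d in range(1, num_inputs + 1)]
                  for k in range(num_hidden)]
    # Output d may read hidden unit k only if d > degree(k).
    output_mask = [[1 if d > degrees[k] else 0 for k in range(num_hidden)]
                   for d in range(1, num_inputs + 1)]
    return input_mask, output_mask

def dependency_matrix(input_mask, output_mask):
    """deps[i][j] is True when output i can depend on input j through some hidden unit."""
    num_inputs = len(output_mask)
    num_hidden = len(input_mask)
    return [[any(output_mask[i][k] and input_mask[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for i in range(num_inputs)]

input_mask, output_mask = build_masks(num_inputs=5, num_hidden=8)
deps = dependency_matrix(input_mask, output_mask)
```

In a real network these masks would be multiplied elementwise into the weight matrices. Here deps is strictly lower triangular, so output i can only see inputs that come before it in the ordering, which is what lets the outputs be read as conditional probabilities.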
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
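The abstract above does not spell out its compression technique. One common formulation of distillation, assumed here, softens the teacher's output distribution with a temperature and trains the small model against those soft targets. The sketch below shows only the soft-target computation; the function names, logits and temperature value are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; a higher temperature gives a softer distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy H(target, predicted), usable as a soft-target training loss."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)

teacher_logits = [5.0, 2.0, 0.5]                               # hypothetical ensemble output
sharp_targets = softmax_with_temperature(teacher_logits, 1.0)
soft_targets = softmax_with_temperature(teacher_logits, 4.0)   # assumed temperature
student_probs = softmax_with_temperature([3.0, 2.5, 1.0], 4.0)
loss = cross_entropy(soft_targets, student_probs)
```

Raising the temperature spreads probability mass onto the non-maximal classes, so the soft targets carry information about which wrong answers the teacher considers plausible. The loss is smallest when the student's softened distribution matches the teacher's.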
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
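The abstract above names the Gramian Angular Summation/Difference Fields without defining them. The sketch below assumes the usual construction: rescale the series to [-1, 1], take the angle phi = arccos(x) of each value, and fill the images with cos(phi_i + phi_j) and sin(phi_i - phi_j). The function names are illustrative, and the Markov Transition Field is not covered.

```python
import math

def rescale(series, low=-1.0, high=1.0):
    # Min-max rescaling so that arccos is defined for every value.
    smallest, largest = min(series), max(series)
    if largest == smallest:
        raise ValueError("a constant series cannot be rescaled")
    span = largest - smallest
    return [low + (high - low) * (x - smallest) / span for x in series]

def gramian_angular_fields(series):
    scaled = rescale(series)
    # Clamp to guard against tiny floating-point overshoot before arccos.
    angles = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    gasf = [[math.cos(a + b) for b in angles] for a in angles]  # summation field
    gadf = [[math.sin(a - b) for b in angles] for a in angles]  # difference field
    return gasf, gadf
```

A series of length n yields two n-by-n images, which can then be fed to an image classifier in the way the abstract describes.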
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
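The gradient reversal layer mentioned above can be sketched without any deep learning package. The version below assumes the usual behaviour: the layer is the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass. The class name and the scale parameter are illustrative assumptions.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # scale controls how strongly the domain loss pushes features
        # toward being domain-indiscriminate (assumed hyperparameter).
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, upstream_grad):
        # The feature extractor receives the negated domain-classifier gradient,
        # so a descent step on it increases the domain classification loss.
        return [-self.scale * g for g in upstream_grad]
```

Placed between the feature extractor and a domain classifier, such a layer lets ordinary backpropagation train the classifier to separate the domains while training the features to confuse it.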
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the multilayer bootstrap network into a neural network trained in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
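The straight-path analysis described above can be sketched as evaluating the loss at evenly spaced points on the line segment between the initial and the final parameters. The helper names and the flat-list parameter representation are assumptions made for illustration.

```python
def interpolate(theta_start, theta_end, alpha):
    # Point on the segment: (1 - alpha) * theta_start + alpha * theta_end.
    return [(1.0 - alpha) * s + alpha * e for s, e in zip(theta_start, theta_end)]

def loss_along_path(loss_fn, theta_start, theta_end, num_points=11):
    # Returns (alpha, loss) pairs for alpha evenly spaced in [0, 1].
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(a, loss_fn(interpolate(theta_start, theta_end, a))) for a in alphas]
```

Plotting the returned pairs shows whether the loss decreases smoothly along the path or passes over a barrier. Here `loss_fn` stands for any function mapping a parameter list to a scalar training loss.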
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
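The rank-based strategy described above can be sketched as follows: sort datapoints by their latest known loss, give rank r a weight proportional to exp(-decay_rate * r), and sample the batch from the normalised weights. The `decay_rate` parameter and the function names are assumptions. The abstract does not specify how the selection pressure is parametrised or scheduled.

```python
import math
import random

def rank_selection_probabilities(latest_losses, decay_rate=0.1):
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(len(latest_losses)),
                   key=lambda i: latest_losses[i], reverse=True)
    weights = [0.0] * len(latest_losses)
    for rank, index in enumerate(order):
        weights[index] = math.exp(-decay_rate * rank)  # exponential decay in rank
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(latest_losses, batch_size, decay_rate=0.1, rng=random):
    # Sampling with replacement; high-loss datapoints are drawn more often.
    probabilities = rank_selection_probabilities(latest_losses, decay_rate)
    return rng.choices(range(len(latest_losses)),
                       weights=probabilities, k=batch_size)
```

Setting `decay_rate` to 0 recovers uniform sampling, while larger values concentrate the batches on the datapoints with the highest loss.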
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
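A minimal sketch of the SVRG update analysed above, for a scalar parameter: each outer epoch stores a snapshot and its full gradient, and each inner step uses the variance-reduced gradient g_i(w) - g_i(snapshot) + full_gradient. The function signature, the step size, and the use of the last inner iterate as the next snapshot are assumptions. The usage test runs it on a small convex least-squares problem only because its minimiser is known in closed form.

```python
import random

def svrg(component_grads, w_init, step_size, num_epochs, inner_steps, rng=random):
    # component_grads: list of functions, each returning the gradient of one f_i at w.
    n = len(component_grads)
    w = w_init
    for _ in range(num_epochs):
        snapshot = w
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(g(snapshot) for g in component_grads) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient: unbiased estimate of the full
            # gradient at w whose variance shrinks as w and snapshot get close.
            direction = component_grads[i](w) - component_grads[i](snapshot) + full_grad
            w = w - step_size * direction
    return w
```

For vector parameters the same update applies elementwise. The abstract's mini-batch variant would average `direction` over several sampled indices per inner step.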
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, a relative improvement of more than 8% over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
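The stochastic rounding named in the abstract above can be illustrated with a minimal sketch. It assumes a fixed-point grid with step 2^-frac_bits and rounds up with probability equal to the fractional distance from the lower grid point; the function name and the `frac_bits` parameter are hypothetical, and saturation to a 16-bit word is not modelled.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Illustrative only: x is rounded up with probability equal to its
    fractional distance from the lower grid point, so the rounded value
    equals x in expectation. Overflow/saturation is not handled.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    if rng.random() < prob_up:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; their average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, whereas the stochastic variant preserves the value on average, which is the property the abstract credits for making low-precision training work.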
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
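The relationships between input shape, kernel shape, zero padding, strides and output shape mentioned in the abstract above can be sketched along a single spatial axis. This is a minimal illustration using the standard formulas, not code from the guide; the function names are hypothetical, and the transposed case assumes no extra output padding.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output length of the matching transposed convolution.

    Assumes no additional output padding, so it inverts conv_output_size
    exactly only when (input + 2p - k) is divisible by s.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    print(conv_output_size(28, 5))                      # 5x5 kernel, no padding
    print(conv_output_size(5, 3, p=1, s=2))             # strided, padded
    print(transposed_conv_output_size(3, 3, p=1, s=2))  # maps back to the input size
```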
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
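The stochastic pooling step described in the abstract above can be sketched for training time as follows. This is an illustration under stated assumptions, not the authors' implementation: activations are assumed non-negative (for example, after a ReLU), pooling regions are non-overlapping squares, an all-zero region is assumed to output 0, and both function names are hypothetical.

```python
import random


def stochastic_pool_region(activations, rng=random):
    """Sample one activation from a pooling region.

    Each activation a_i is chosen with probability a_i / sum(a), i.e. a
    multinomial given by the activities in the region. Assumes all
    activations are non-negative.
    """
    total = sum(activations)
    if total == 0:
        return 0.0  # assumption: an all-zero region pools to zero
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_2d(feature_map, size=2, rng=random):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            region = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            out_row.append(stochastic_pool_region(region, rng))
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    fmap = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 3.0, 0.0, 2.0],
            [0.0, 0.0, 5.0, 5.0],
            [0.0, 0.0, 0.0, 0.0]]
    print(stochastic_pool_2d(fmap, size=2, rng=random.Random(0)))
```

Larger activations are selected more often, but smaller non-zero ones still have a chance, which is the source of the regularizing noise; no pooling-specific hyper-parameter is introduced.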
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
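A minimal sketch of stochastic rounding to a fixed-point grid, offered as an illustration of the idea in the abstract above rather than as the authors' implementation. The function names, the 8 fractional bits, and the saturation behaviour are assumptions chosen for the example; only the 16-bit word width comes from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its distance from the
    lower grid point, so the rounded value equals x in expectation.
    frac_bits=8 is an assumed setting for this sketch.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), which lies in [0, 1).
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed word_bits range."""
    scale = 1 << frac_bits
    max_val = ((1 << (word_bits - 1)) - 1) / scale
    min_val = -(1 << (word_bits - 1)) / scale
    return min(max(stochastic_round(x, frac_bits, rng), min_val), max_val)
```

Averaged over many draws, `to_fixed_point(0.3)` stays close to 0.3, whereas round-to-nearest would always return 77/256. That unbiasedness is the property that lets small updates survive in expectation at low precision.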
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
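The shape relationships mentioned in the abstract above can be sketched along one spatial axis as follows. This uses the standard output-size formulas for a convolution and for a transposed convolution without extra output padding; the function and argument names are hypothetical, chosen for this example.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit in the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution with the same k, s, p.

    Assumes no additional output padding, so it recovers the smallest input
    size that the forward convolution maps to i.
    """
    return s * (i - 1) + k - 2 * p
```

For instance, `conv_output_size(5, 3, s=2, p=1)` is 3 and `transposed_conv_output_size(3, 3, s=2, p=1)` is 5. With stride 2 an input of size 6 also maps to 3, which is why a transposed convolution alone cannot always recover the original input size.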
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) the input-to-hidden function, (2) the hidden-to-hidden transition and (3) the hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
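One simple way to combine max and average pooling, as the abstract above describes at a high level, is a convex combination of the two over each pooling region. The sketch below is an assumption-laden illustration rather than the authors' method: the abstract does not give the exact form, and the name `mixed_pool_region` and the scalar `mix` are hypothetical. In a trainable layer `mix` would be a learned parameter; here it is passed in directly.

```python
def mixed_pool_region(activations, mix):
    """Blend max and average pooling over one pooling region.

    mix=1.0 gives plain max pooling and mix=0.0 gives plain average pooling.
    The convex-combination form is an assumption made for this sketch.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    max_value = max(activations)
    mean_value = sum(activations) / len(activations)
    return mix * max_value + (1.0 - mix) * mean_value
```

For the region `[1.0, 2.0, 3.0, 6.0]` the maximum is 6 and the mean is 3, so `mix=0.5` yields 4.5.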
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of "shadow" groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
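The SVRG update analyzed in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `svrg` and `make_least_squares_grad`, the choice of the last inner iterate as the next snapshot, and the toy least-squares objective (which is convex, unlike the problems the abstract targets) are all hypothetical and chosen only to keep the example small and checkable.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, rng):
    """Stochastic variance reduced gradient for f(w) = (1/n) * sum_i f_i(w).

    grad_i(w, i) must return the gradient of the i-th component at w as a list.
    """
    snapshot = list(w0)
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = [0.0] * len(snapshot)
        for i in range(n):
            g = grad_i(snapshot, i)
            full_grad = [f + gj / n for f, gj in zip(full_grad, g)]

        w = list(snapshot)
        for _ in range(inner_steps):
            i = rng.randrange(n)
            g_w = grad_i(w, i)
            g_snap = grad_i(snapshot, i)
            # Variance-reduced direction: unbiased, and its variance shrinks
            # as w and the snapshot approach a stationary point.
            w = [wj - step * (a - b + c)
                 for wj, a, b, c in zip(w, g_w, g_snap, full_grad)]

        # Assumption: use the last inner iterate as the next snapshot.
        snapshot = w
    return snapshot


def make_least_squares_grad(xs, ys):
    """Toy components f_i(w) = 0.5 * (w[0] + w[1] * x_i - y_i) ** 2."""
    def grad_i(w, i):
        residual = w[0] + w[1] * xs[i] - ys[i]
        return [residual, residual * xs[i]]
    return grad_i
```

On a small consistent linear fit, calling `svrg(make_least_squares_grad(xs, ys), n=len(xs), w0=[0.0, 0.0], step=0.1, epochs=40, inner_steps=50, rng=random.Random(0))` should recover the generating intercept and slope; the step size and loop lengths here are illustrative values, not tuned recommendations.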
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
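The stochastic rounding scheme mentioned in the abstract above can be illustrated with a short sketch. It assumes a fixed-point grid with `frac_bits` fractional bits and omits saturation to a fixed word length; the function name and parameters are hypothetical and not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, rounding up with probability
    equal to the fractional remainder, so the result equals x in expectation.

    Assumption: no saturation or overflow handling, which a real 16-bit
    fixed-point format would also need.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # The remainder lies in [0, 1); it is the probability of rounding up.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale
```

Unlike round-to-nearest, this keeps small updates from vanishing systematically: a value lying 20% of the way between two grid points is rounded up about 20% of the time, so the average over many roundings stays close to the original value.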
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
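The shape relationships the abstract above refers to can be written down compactly for one spatial axis. The sketch below uses the standard formulas for a convolution with input size `i`, kernel size `k`, zero padding `p` and stride `s`, and for the corresponding transposed convolution without extra output padding; the function names are hypothetical.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length along one axis: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output length of the transposed convolution associated with a
    convolution of kernel k, padding p and stride s: s * (i - 1) + k - 2p.

    Assumption: no additional output padding, so when the forward stride
    does not divide (input + 2p - k) evenly, the original input size is
    recovered only up to that remainder.
    """
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide input with `k=3`, `p=1`, `s=2` gives a 3-wide output, and the transposed convolution with the same settings maps a 3-wide input back to 5.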
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
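The masking idea in the abstract above can be illustrated with a minimal sketch. It assumes a single hidden layer and a degree-based mask construction: each unit is given an integer degree, and a connection is kept only if it cannot pass information from an input to its own or an earlier output. The names `made_masks` and `connectivity`, and the random assignment of hidden degrees, are illustrative assumptions rather than details stated in the abstract.

```python
import random


def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumes n_inputs >= 2 and the natural input ordering 1..n_inputs.
    """
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))
    # Hidden degrees are drawn from 1..n_inputs-1 (an illustrative choice).
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Input -> hidden: hidden unit k may see input d only if deg(k) >= deg(d).
    mask_in = [[1 if hid_deg[k] >= in_deg[d] else 0 for d in range(n_inputs)]
               for k in range(n_hidden)]
    # Hidden -> output: output d may see hidden unit k only if deg(d) > deg(k).
    mask_out = [[1 if in_deg[d] > hid_deg[k] else 0 for k in range(n_hidden)]
                for d in range(n_inputs)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the masked paths from each input j to each output d."""
    n_hidden = len(mask_in)
    n_inputs = len(mask_out)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(n_hidden))
             for j in range(n_inputs)]
            for d in range(n_inputs)]


mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16, seed=1)
paths = connectivity(mask_in, mask_out)
```

A zero at `paths[d][j]` for every `j >= d` means that output `d` can depend only on inputs that precede it in the ordering, which is what lets the outputs be read as conditional probabilities.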
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
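The Kullback-Leibler objective mentioned in the abstract above can be sketched for a single frame as follows. This is only an illustration under stated assumptions: the logits are made-up numbers, the divergence is taken in the direction KL(teacher || student), and the helper names `softmax` and `kl_divergence` are hypothetical rather than taken from the paper.

```python
import math


def softmax(logits):
    """Convert logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


# Illustrative per-frame outputs: a teacher's soft alignment and a student's prediction.
teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.5, 1.2, 0.3])
loss = kl_divergence(teacher, student)
```

The loss is zero only when the student reproduces the teacher's distribution exactly, so minimizing it over many frames pushes the small model toward the soft targets.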
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural network (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
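The Gramian Angular Summation/Difference Fields named in the abstract above can be sketched as follows. The sketch assumes min-max rescaling to [0, 1], an angular encoding phi = arccos(x), and the definitions GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j). The function names are illustrative, and the Markov Transition Field is not covered.

```python
import math


def rescale01(series):
    """Min-max rescale a series into [0, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must not be constant")
    # Clamp to guard against floating-point overshoot before taking acos.
    return [min(1.0, max(0.0, (x - lo) / (hi - lo))) for x in series]


def gramian_angular_fields(series):
    """Return the rescaled series and its GASF and GADF images."""
    x = rescale01(series)
    phi = [math.acos(v) for v in x]  # polar-angle encoding of each value
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return x, gasf, gadf


x, gasf, gadf = gramian_angular_fields([3.0, 1.0, 4.0, 1.5, 5.0])
```

Under these assumptions the GASF diagonal equals cos(2 * phi_i) = 2 * x_i^2 - 1, so a series rescaled to [0, 1] can be recovered from the diagonal as sqrt((g + 1) / 2), which is one way to read the bijection property the abstract refers to.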
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout-based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout-based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
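The gradient reversal layer mentioned in the abstract above can be sketched without any framework. The sketch assumes the layer acts as the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass. The class name `GradientReversal` and the scaling parameter `lam` are illustrative assumptions, not names given in the abstract.

```python
class GradientReversal:
    """Identity in the forward pass; sign-flipped, scaled gradient in the backward pass."""

    def __init__(self, lam=1.0):
        # lam controls how strongly the reversed gradient is weighted (assumed parameter).
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is reversed.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
features = layer.forward([0.2, -1.0, 3.0])
grad_to_extractor = layer.backward([1.0, -2.0, 4.0])
```

Because the gradient is reversed, a step that makes the domain classifier better at telling the domains apart makes the upstream features worse at it, which is how the features are pushed toward being indiscriminate with respect to the domain shift.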
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
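The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the `pressure` parameter (the approximate ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) are hypothetical, and the exact decay schedule used in the paper is not given in the abstract.

```python
import math
import random


def rank_selection_probs(losses, pressure=100.0):
    """Return selection probabilities that decay exponentially with loss rank.

    `pressure` is a hypothetical knob: roughly the ratio between the
    probability of the highest-loss point and the lowest-loss point.
    """
    n = len(losses)
    # Rank 0 goes to the datapoint with the greatest latest known loss.
    order = sorted(range(n), key=lambda i: -losses[i])
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-math.log(pressure) * rank / n)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)
```

In a training loop one would presumably refresh the stored loss of each selected datapoint after every step, so the ranking tracks the latest known values.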
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
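Stochastic rounding, which the abstract above credits for making 16-bit fixed-point training work, can be illustrated with a short sketch. This is only an assumed, simplified form: the function name and the `frac_bits` parameter are hypothetical, and saturation to a fixed word length (which a real 16-bit format would need) is deliberately left out.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with spacing 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased:
    the expected rounded value equals x. Saturation is not modelled.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale
```

Unlike round-to-nearest, small updates that fall below half a grid step are not always discarded; they survive with some probability, which is one plausible reading of why the rounding scheme matters during training.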
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
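The shape relationships mentioned in the abstract above can be made concrete with the standard output-size formulas for a single spatial dimension. This is a small sketch, not code from the guide; the function names are hypothetical, and the transposed formula assumes no extra output padding, so it inverts the forward formula exactly only when the stride divides i + 2p - k.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution or pooling layer along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution that undoes the
    shape change of a convolution with the same k, s and p."""
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives an output of width 3, and the matching transposed convolution maps width 3 back to width 5.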
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. Standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher-order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
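The "straight path from initialization to solution" in the abstract above can be read as evaluating the loss at points linearly interpolated between two parameter vectors. The following is a minimal sketch under that reading; the function names and the toy two-parameter loss are illustrative assumptions, not taken from the paper.

```python
import math


def interpolate(theta_init, theta_final, alpha):
    """Return (1 - alpha) * theta_init + alpha * theta_final, elementwise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the segment between two parameter vectors."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha))) for alpha in alphas]


def toy_loss(theta):
    """Hypothetical non-convex loss, used only to exercise the helper."""
    x, y = theta
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + 0.1 * math.sin(5.0 * x) ** 2


path = loss_along_path(toy_loss, [-3.0, 3.0], [1.0, -2.0])
for alpha, value in path:
    print(f"alpha={alpha:.1f} loss={value:.4f}")
```

For a real network, `theta_init` and `theta_final` would be the flattened weights before and after training, and `loss_fn` would compute the training loss for a given weight vector.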
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
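The selection rule in the abstract above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) can be sketched as follows. The `pressure` parameter, defined here as the probability ratio between the highest-loss and lowest-loss datapoint, is an assumed knob for illustration and is not claimed to match the paper's exact parametrization. Sampling is with replacement.

```python
import random


def rank_selection_probs(latest_losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest known loss.
    `pressure` (assumed knob) is the ratio between the probabilities
    of the rank-0 datapoint and the last-ranked datapoint.
    """
    n = len(latest_losses)
    if n == 1:
        return [1.0]
    # Indices sorted from greatest loss to smallest loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    # Per-rank multiplicative decay so that decay ** (n - 1) == 1 / pressure.
    decay = pressure ** (-1.0 / (n - 1))
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, pressure=100.0, rng=None):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    rng = rng or random.Random()
    probs = rank_selection_probs(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


losses = [0.2, 1.5, 0.7, 0.05]
probs = rank_selection_probs(losses, pressure=100.0)
batch = sample_batch(losses, batch_size=8, rng=random.Random(0))
print(probs)
print(batch)
```

In a training loop, the entries of `latest_losses` would be refreshed for the datapoints in each batch after the forward pass, so the ranking tracks the most recent information available.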
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
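For readers unfamiliar with SVRG, the update analyzed in the abstract above can be sketched on a scalar finite-sum problem. This is a minimal illustration of the standard variance-reduced step (stochastic gradient, minus the same component's gradient at a snapshot, plus the full gradient at the snapshot). The toy objective is convex and chosen only so the result is easy to check; it is an assumption for illustration and does not reproduce the nonconvex setting or step-size choices of the paper.

```python
import random


def svrg_scalar(grad_i, n, w0, step, epochs, inner_steps, seed=0):
    """Minimize (1/n) * sum_i f_i(w) for a scalar w using SVRG-style updates.

    grad_i(w, i) must return the gradient of the i-th component f_i at w.
    """
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        snapshot = w
        # Full gradient, computed once per epoch at the snapshot point.
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced gradient estimate (unbiased for the full gradient at w).
            estimate = grad_i(w, i) - grad_i(snapshot, i) + full_grad
            w -= step * estimate
    return w


# Hypothetical toy problem: f_i(w) = 0.5 * (w - targets[i]) ** 2, minimized at the mean.
targets = [1.0, 2.0, 6.0]


def toy_grad(w, i):
    return w - targets[i]


w_star = svrg_scalar(toy_grad, n=len(targets), w0=0.0, step=0.5, epochs=5, inner_steps=4)
print(w_star)
```

On this particular toy problem the correction term cancels the sampling noise exactly, so every inner step equals a full-gradient step; on general problems the estimate is only lower-variance, not noise-free.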
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain those for full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
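Stochastic rounding, which the abstract above identifies as the key ingredient for 16-bit fixed-point training, rounds a value up or down at random so that the result is correct in expectation. The sketch below assumes a signed fixed-point format with `word_bits` total bits and `frac_bits` fractional bits and saturates on overflow; these names, defaults, and the saturation behaviour are illustrative assumptions rather than the paper's exact specification.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=None):
    """Quantize x to a signed fixed-point grid using stochastic rounding.

    The grid spacing is 2 ** -frac_bits. The value is rounded up with
    probability equal to its fractional distance above the lower grid
    point, so the expected result equals x (when no saturation occurs).
    """
    rng = rng or random.Random()
    scale = 1 << frac_bits
    scaled = x * scale
    q = math.floor(scaled)
    if rng.random() < scaled - q:
        q += 1
    # Saturate to the representable signed integer range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    q = max(min_int, min(max_int, q))
    return q / scale


rng = random.Random(0)
samples = [stochastic_round_fixed(0.3, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(sorted(set(samples)), mean)
```

Each sample is one of the two grid points surrounding 0.3 (76/256 or 77/256), while the sample mean stays close to 0.3. Round-to-nearest would instead always return 77/256 and introduce a systematic bias.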
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art results on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
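As a rough illustration of the Gramian Angular Fields mentioned above, the sketch below rescales a series, maps each value to an angle with arccos, and builds the summation and difference fields as cos(phi_i + phi_j) and sin(phi_i - phi_j). It assumes rescaling to [-1, 1] (the abstract also mentions 0/1 rescaling), and the function names are made up for this sketch.

```python
import math


def rescale(series):
    """Linearly rescale a series to [-1, 1] (a constant series maps to zeros)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(2.0 * x - hi - lo) / (hi - lo) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) matrices as nested lists."""
    x = rescale(series)
    # Clamp before arccos to guard against tiny floating-point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


gasf, gadf = gramian_angular_fields([0.0, 1.0, 2.0])
```

Each matrix has one row and column per time step, so a series of length n becomes an n-by-n image that a convolutional network can consume.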
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the multilayer bootstrap network into a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
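The rank-based selection rule described above can be sketched as follows. Datapoints are ranked by their latest known loss, and the selection probability decays exponentially with rank. The `selection_pressure` parameter (the ratio between the probabilities of the top-ranked and a hypothetical rank-n point) and the function names are assumptions for this sketch; the abstract does not give the exact decay formula.

```python
import math
import random


def selection_probabilities(latest_losses, selection_pressure=100.0):
    """Probability of picking each datapoint, decaying exponentially with loss rank."""
    n = len(latest_losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    decay = math.log(selection_pressure) / n
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, selection_pressure=100.0, rng=random):
    """Draw datapoint indices for one batch (with replacement, for simplicity)."""
    probs = selection_probabilities(latest_losses, selection_pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


batch = sample_batch([0.1, 2.0, 0.5, 1.0], batch_size=8, rng=random.Random(0))
```

Setting `selection_pressure` to 1 recovers uniform sampling, so the parameter gives one way to control how strongly high-loss points are favoured over time.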
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
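For readers unfamiliar with SVRG, the sketch below shows the basic update for a scalar parameter: each epoch computes a full gradient at a snapshot, and inner steps correct a single-sample gradient with it. The function signature and the toy quadratic used in the usage lines are assumptions for illustration only. The toy problem is convex and is there just to make the behaviour easy to check; it says nothing about the nonconvex rates analyzed in the abstract.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, rng=random):
    """Minimize (1/n) * sum_i f_i(w) for a scalar w, given grad_i(w, i) = f_i'(w)."""
    w = w0
    for _ in range(epochs):
        snapshot = w
        # Full gradient at the snapshot, recomputed once per epoch.
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient estimate.
            v = grad_i(w, i) - grad_i(snapshot, i) + full_grad
            w = w - step * v
    return w


# Toy usage: f_i(w) = 0.5 * (w - a[i])**2, whose average is minimized at mean(a).
a = [1.0, 2.0, 6.0]
w_star = svrg(lambda w, i: w - a[i], n=len(a), w0=0.0, step=0.5,
              epochs=5, inner_steps=10, rng=random.Random(0))
```

The estimate `v` is unbiased for the full gradient at `w`, and its variance shrinks as `w` and the snapshot both approach a stationary point, which is the property the analysis of SVRG relies on.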
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
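The stochastic rounding scheme mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a signed fixed-point format with a 16-bit word and a chosen number of fractional bits; the function name `stochastic_round`, its parameters, and the saturation behaviour are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point grid using stochastic rounding.

    Assumed format (hypothetical): word_bits total bits, frac_bits of them
    fractional, so the grid step is 2**-frac_bits.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance from the lower grid
    # point, so the rounded value equals x in expectation.
    p_up = scaled - lower
    q = lower + (1 if rng.random() < p_up else 0)
    # Saturate to the representable signed integer range.
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale
```

Because the probability of rounding up is proportional to the remainder, small updates that round-to-nearest would always discard still contribute on average.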
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
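The relationships between input shape, kernel shape, zero padding, strides and output shape that the abstract above refers to can be sketched along one spatial axis. This is a minimal illustration using the standard output-size relations for a convolution and its transposed counterpart; the function names and the extra-size parameter `a` are naming assumptions made here.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size along one axis for input size i, kernel size k,
    zero padding p (per side) and stride s."""
    if i + 2 * p < k:
        raise ValueError("kernel does not fit in the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1, a=0):
    """Output size of the transposed convolution associated with a
    convolution of kernel size k, padding p and stride s.

    a is the amount lost to the floor in the forward direction,
    i.e. (forward_input + 2 * p - k) % s.
    """
    return s * (i - 1) + a + k - 2 * p
```

For example, a forward convolution with `i=6, k=3, p=1, s=2` gives an output of size 3, and the transposed layer with `a=1` maps size 3 back to size 6.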
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) the input-to-hidden function, (2) the hidden-to-hidden transition and (3) the hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not map well onto parallel computation. We introduce structured sparsity at various scales for convolutional neural networks: channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
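One simple way to combine max and average pooling, as the abstract above describes in general terms, is a convex mixture of the two. The sketch below is an assumption for illustration only: the function name `mixed_pool` and the mixing proportion `a` are hypothetical, and in a learned setting `a` would be a trainable parameter rather than a fixed argument.

```python
def mixed_pool(region, a):
    """Pool one region as a * max + (1 - a) * average.

    region: non-empty sequence of activations in the pooling window.
    a: mixing proportion in [0, 1] (hypothetical parameter; a=1 gives
       max pooling, a=0 gives average pooling).
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return a * max_value + (1.0 - a) * avg_value
```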
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
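The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region. The sketch assumes non-negative activations (for example, after a rectifier) so that normalizing them yields a valid multinomial distribution; the function name `stochastic_pool` and the handling of an all-zero region are illustrative assumptions.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. p_i = a_i / sum_j a_j, which requires non-negative activations.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption: an all-zero region pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]
```

Larger activations are selected more often, but smaller non-zero ones are occasionally chosen as well, which is the source of the regularizing noise.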
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat points from different classes differently. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks presented here exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would faithfully represent the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
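The normalized $L_p$ norm mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the simplest reading, ((1/N) * sum |x_i|^p)^(1/p), applied directly to a list of inputs; the learned projections and any per-unit learning of p described in the abstract are left out, and the function name lp_unit is hypothetical.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm of a list of numbers: ((1/N) * sum |x_i|^p) ** (1/p).

    Assumption: inputs are used as-is; the learned projections from the
    abstract are omitted to keep the sketch short.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if not inputs:
        raise ValueError("inputs must be non-empty")
    largest = max(abs(x) for x in inputs)
    if largest == 0:
        return 0.0
    # Factor out the largest magnitude so large p does not overflow.
    scaled_sum = sum((abs(x) / largest) ** p for x in inputs)
    return largest * (scaled_sum / len(inputs)) ** (1.0 / p)


if __name__ == "__main__":
    values = [3.0, -4.0]
    for order in (1, 2, 200):
        print(order, lp_unit(values, order))
```

With p = 1 this is the mean absolute value, with p = 2 the root-mean-square, and as p grows the result approaches the maximum magnitude, which matches the pooling operators the abstract lists as special cases.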
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
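The masking idea in the abstract above, where each output may depend only on earlier inputs in a chosen ordering, can be sketched for a single hidden layer. The degree-assignment scheme below (a random integer "degree" per hidden unit, compared against input positions) is one possible way to satisfy the constraint and is an assumption of this sketch, as are the helper names build_masks and visible_inputs.

```python
import random


def build_masks(num_inputs, num_hidden, seed=0):
    """Build 0/1 masks for a one-hidden-layer autoencoder (natural ordering).

    Assumed scheme: hidden unit k gets a degree m_k in [1, num_inputs - 1].
    It may read inputs with position <= m_k, and output d (1-based) may read
    hidden units with m_k < d. Requires num_inputs >= 2.
    """
    rng = random.Random(seed)
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    input_to_hidden = [
        [1 if degrees[k] >= d else 0 for d in range(1, num_inputs + 1)]
        for k in range(num_hidden)
    ]
    hidden_to_output = [
        [1 if d > degrees[k] else 0 for k in range(num_hidden)]
        for d in range(1, num_inputs + 1)
    ]
    return input_to_hidden, hidden_to_output


def visible_inputs(input_to_hidden, hidden_to_output):
    """For each output (0-based), return the set of input indices it can reach."""
    reachable = []
    for output_row in hidden_to_output:
        seen = set()
        for k, connected in enumerate(output_row):
            if connected:
                seen.update(j for j, bit in enumerate(input_to_hidden[k]) if bit)
        reachable.append(seen)
    return reachable


if __name__ == "__main__":
    m1, m2 = build_masks(num_inputs=5, num_hidden=16)
    for d, seen in enumerate(visible_inputs(m1, m2)):
        print("output", d, "sees inputs", sorted(seen))
```

In a real model these masks would be multiplied elementwise into the weight matrices; the sketch only checks connectivity, which is enough to see that the first output sees no inputs and every other output sees only strictly earlier ones.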
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
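The claim above that the variance of the log-norm random walk shrinks as layers get wider can be explored with a small simulation. This is only a sketch under stated assumptions: purely linear layers, Gaussian weights with variance 1/width (a common scaling, not necessarily the exact unbiased scaling derived in the paper), and hypothetical function names.

```python
import math
import random


def log_norm_after_depth(width, depth, rng):
    """Push a unit vector through `depth` random linear layers; return log of its norm."""
    x = [1.0 / math.sqrt(width)] * width
    std = 1.0 / math.sqrt(width)  # assumed weight scaling: variance 1/width
    for _ in range(depth):
        # Each output coordinate is a fresh Gaussian row dotted with x.
        x = [sum(rng.gauss(0.0, std) * xj for xj in x) for _ in range(width)]
    return math.log(math.sqrt(sum(v * v for v in x)))


def log_norm_variance(width, depth, trials=40, seed=0):
    """Sample variance of the final log-norm over independent random networks."""
    rng = random.Random(seed)
    samples = [log_norm_after_depth(width, depth, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / (trials - 1)


if __name__ == "__main__":
    for width in (4, 32):
        print("width", width, "variance", log_norm_variance(width, depth=10))
```

Under these assumptions the narrow network should show a noticeably larger spread in log-norm than the wide one at the same depth, in line with the inverse dependence on layer size described in the abstract.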
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm requires only 5% of the computations (multiplications) of traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
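Two quantities from the abstract above are easy to make concrete: the Kullback-Leibler divergence between a teacher distribution and a student distribution, and the relative improvement implied by the two WER figures. The sketch below assumes both distributions are plain probability lists over the same classes; the function names and the example distributions are hypothetical, and only the WER values 4.54 and 3.93 come from the abstract.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions of equal length.

    Assumption: both lists are already normalized probabilities; eps guards
    against log(0) in the student distribution.
    """
    if len(teacher) != len(student):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p, q in zip(teacher, student):
        if p > 0:
            total += p * math.log(p / max(q, eps))
    return total


def relative_improvement(baseline, improved):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    soft_targets = [0.5, 0.5]        # hypothetical teacher (RNN) posterior
    student_output = [0.9, 0.1]      # hypothetical student (small DNN) posterior
    print("KL:", kl_divergence(soft_targets, student_output))
    print("relative WER improvement:", relative_improvement(4.54, 3.93))
```

Training the student would mean minimizing this divergence over all frames with respect to the student's parameters; the relative improvement computed from the reported WERs comes out a little above 13%, consistent with the abstract.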
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures, including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs, on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
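The Gramian Angular Summation Field named in the abstract above can be illustrated with a short sketch. It assumes the common construction in which a series is rescaled to [0, 1], each value is mapped to an angle by arccos, and entry (i, j) of the field is the cosine of the sum of the two angles; the abstract itself does not spell out the formula, and the function names are hypothetical. The diagonal-based recovery at the end is one way to see why a 0/1 rescaling makes the encoding invertible.

```python
import math

def rescale01(series):
    """Min-max rescale a series to the interval [0, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    return [(x - lo) / (hi - lo) for x in series]

def gasf(series):
    """Gramian Angular Summation Field: G[i][j] = cos(phi_i + phi_j)."""
    x = rescale01(series)
    phi = [math.acos(v) for v in x]  # angles lie in [0, pi/2]
    return [[math.cos(a + b) for b in phi] for a in phi]

def recover_from_diagonal(field):
    """Invert the encoding on [0, 1] data.

    The diagonal holds cos(2 * phi_i) = 2 * x_i**2 - 1, and x_i >= 0,
    so the rescaled value is the non-negative square root.
    """
    n = len(field)
    return [math.sqrt(max(0.0, (field[i][i] + 1.0) / 2.0)) for i in range(n)]

series = [1.0, 3.0, 2.0, 5.0]
field = gasf(series)
recovered = recover_from_diagonal(field)
```

Here `recovered` matches `rescale01(series)` up to floating-point error, so only the original offset and scale are lost.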
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
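The analysis in the abstract above looks at the loss along a straight path from initialization to solution. A minimal sketch of that kind of probe follows, assuming the path is the linear interpolation (1 - alpha) * theta_init + alpha * theta_final between two flat parameter vectors. The function names and the quadratic `toy_loss` are hypothetical stand-ins, not the networks or losses used in the paper.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]

def loss_along_path(loss_fn, theta_init, theta_final, steps=11):
    """Evaluate loss_fn at evenly spaced alphas in [0, 1] (steps >= 2)."""
    alphas = [k / (steps - 1) for k in range(steps)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]

def toy_loss(theta):
    """Hypothetical stand-in for a training loss, minimized at all ones."""
    return sum((t - 1.0) ** 2 for t in theta)

path = loss_along_path(toy_loss, [0.0, 0.0], [1.0, 1.0], steps=5)
for alpha, value in path:
    print(f"alpha={alpha:.2f} loss={value:.4f}")
```

For a real model, `loss_fn` would evaluate the training objective at the interpolated weights; a curve that decreases smoothly in alpha is the kind of evidence the abstract describes.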
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
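The rank-based selection strategy in the abstract above can be sketched as follows. The abstract only states that datapoints are ranked by their latest known loss and that selection probability decays exponentially with rank; the specific parameterization here, in which a hypothetical `pressure` parameter sets the approximate probability ratio between the highest-ranked and lowest-ranked datapoint, is an assumption chosen for illustration.

```python
import math
import random

def rank_selection_probs(losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest-known loss.
    `pressure` (assumed name) controls how sharply probability decays.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(-math.log(pressure) / n)  # per-rank multiplicative decay
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs

def sample_batch(losses, batch_size, pressure=100.0, rng=None):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    rng = rng or random.Random(0)
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)

latest_losses = [0.1, 2.0, 0.5, 1.0]
probs = rank_selection_probs(latest_losses)
batch = sample_batch(latest_losses, batch_size=8)
```

Setting `pressure` to 1 recovers uniform sampling, which gives a simple way to vary the selection pressure over the course of training.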
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
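For readers unfamiliar with the SVRG update analyzed in the abstract above, a minimal sketch follows. It uses the standard form of the method: each epoch stores a snapshot of the parameters and its full gradient, and each inner step corrects a single-component gradient by the difference from the snapshot. The toy problem is an assumption chosen so the result is easy to check: it is a convex scalar least-squares sum, not one of the nonconvex problems the paper studies, and the step size and loop lengths are arbitrary choices.

```python
import random

def svrg(grad_i, n, w0, step, epochs, inner_steps, seed=0):
    """Stochastic variance reduced gradient for a scalar parameter.

    grad_i(w, i) returns the gradient of the i-th component function at w.
    """
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        snapshot = w
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced gradient estimate (unbiased for the full gradient).
            w -= step * (grad_i(w, i) - grad_i(snapshot, i) + full_grad)
    return w

# Toy finite sum (assumed for illustration): f_i(w) = 0.5 * (A[i] * w - B[i]) ** 2
A = [1.0, 2.0, 3.0]
B = [1.0, 3.0, 2.0]

def grad_i(w, i):
    return A[i] * (A[i] * w - B[i])

w_star = sum(a * b for a, b in zip(A, B)) / sum(a * a for a in A)  # closed-form minimizer
w_hat = svrg(grad_i, n=len(A), w0=0.0, step=0.02, epochs=100, inner_steps=50)
```

Because the correction term vanishes as the iterate and snapshot both approach a stationary point, the estimate's variance shrinks over time, which is what permits a constant step size.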
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speedup, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speedup is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
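The stochastic rounding idea in the abstract above can be illustrated with a small sketch. It assumes a signed fixed-point format with `word_bits` total bits and `frac_bits` fractional bits, and rounds up with probability equal to the fractional remainder so that the rounded value is unbiased in expectation. The function name, the default bit split, and the saturation behaviour are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    Hypothetical helper: the grid spacing is 2**-frac_bits and the stored
    integer is clamped to the range of a word_bits-wide signed integer.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance above the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate instead of wrapping around (an assumption of this sketch).
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    lower = max(lo, min(hi, lower))
    return lower / scale
```

Values that already lie on the grid are returned unchanged, and averaging many stochastic roundings of the same input approaches the input itself, which is the property that round-to-nearest lacks.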
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
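The shape relationships the abstract above refers to can be sketched per axis as follows. This is a minimal illustration of the commonly used formulas, assuming no dilation and a single spatial axis; the function names are hypothetical and not taken from the guide.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a convolution or pooling layer.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of the transposed convolution associated with a
    convolution of kernel k, stride s and padding p (no extra output padding)."""
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide kernel on a 28-wide input with unit stride and no padding gives 24 outputs. With stride 2, inputs of size 5 and 6 both map to size 3 (for `k=3, p=1`), so the transposed layer can recover only one of them without additional output padding.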
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
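Training against soft targets from a teacher model, as described in the abstract above, is often written as a cross-entropy between the teacher's and the student's output distributions. The sketch below is one such formulation under stated assumptions: the function names and the `temperature` parameter are illustrative and are not specified in the abstract.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def softmax(logits, temperature=1.0):
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    targets = softmax(teacher_logits, temperature)
    log_preds = log_softmax(student_logits, temperature)
    return -sum(t * lp for t, lp in zip(targets, log_preds))
```

The loss is smallest when the student reproduces the teacher's distribution, in which case it equals the entropy of the soft targets.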
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
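The first direction in the abstract above, combining max and average pooling, can be sketched for a single pooling region as below. The two variants shown (a fixed or learned scalar mixing proportion, and a gate computed from the region's own values) are an assumed reading of the "two strategies"; the function names and the sigmoid gate are illustrative choices rather than details given in the abstract.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling over one region with proportion a in [0, 1]."""
    return a * max(region) + (1.0 - a) * (sum(region) / len(region))


def gated_pool(region, weights):
    """Compute the mixing proportion from the region itself via a sigmoid gate.

    `weights` is a hypothetical learned gating vector with one entry per
    position in the pooling region.
    """
    gate = 1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(weights, region))))
    return mixed_pool(region, gate)
```

Setting `a = 1.0` recovers max pooling and `a = 0.0` recovers average pooling, so both conventional operations are special cases of the blend.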
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
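The sampling step described above can be sketched for a single pooling region as follows. This is a minimal sketch, not the authors' implementation: it assumes non-negative activations (for example, after a rectifier), the name `stochastic_pool` is hypothetical, and returning 0.0 for an all-zero region is an assumed convention.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its
    value, i.e. from a multinomial over the normalized activities.
    Assumes all activations are non-negative.
    """
    if any(value < 0 for value in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total <= 0:
        # Assumed convention: an all-zero (or empty) region pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.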
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and makes it difficult to find appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
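To make the pooling interpretation concrete, here is a small sketch of a normalized $L_p$ norm over a group of values. It is illustrative only: the function name `lp_pool` is hypothetical, the order `p` is passed in rather than learned, and the projections from the layer below that the abstract mentions are omitted.

```python
def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|x_i| ** p)) ** (1/p).

    p = 1 gives the mean absolute value, p = 2 the root mean square,
    and large p approaches the maximum absolute value.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if p < 1:
        raise ValueError("order p must be at least 1")
    mean_power = sum(abs(float(v)) ** p for v in values) / len(values)
    return mean_power ** (1.0 / p)
```

Evaluating `lp_pool` on the same values with `p` set to 1, 2, and a large number shows the output moving from average-like toward max-like pooling.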
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
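The masking idea can be sketched for a single hidden layer as below. This is a sketch under stated assumptions, not the paper's code: the names `made_masks` and `dependencies` are hypothetical, inputs are taken in their natural ordering, and the rule of giving each hidden unit an integer "degree" between 1 and the number of inputs minus 1 is one common way to satisfy the autoregressive constraint; the abstract itself does not give these details.

```python
def made_masks(n_inputs, hidden_degrees):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Input d (1-based) may feed hidden unit k only if degree m_k >= d.
    Hidden unit k may feed output d only if d > m_k.
    Together, output d can only see inputs strictly before d.
    """
    dims = range(1, n_inputs + 1)
    in_mask = [[1 if m >= d else 0 for d in dims] for m in hidden_degrees]
    out_mask = [[1 if d > m else 0 for m in hidden_degrees] for d in dims]
    return in_mask, out_mask


def dependencies(n_inputs, in_mask, out_mask):
    """Return deps[d][j] = True if output d can depend on input j (0-based)."""
    n_hidden = len(in_mask)
    return [
        [any(out_mask[d][k] and in_mask[k][j] for k in range(n_hidden))
         for j in range(n_inputs)]
        for d in range(n_inputs)
    ]
```

In a real model these masks would be multiplied elementwise into the weight matrices; `dependencies` is only a check that no output can see its own input or any later one.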
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.
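The relative improvement quoted above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `relative_improvement` is hypothetical, and the two WER values are the ones stated in the abstract.

```python
def relative_improvement(baseline_wer, new_wer):
    """Fraction by which the new WER is lower than the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# WER values taken from the abstract: baseline 4.54, distilled small DNN 3.93.
gain = relative_improvement(4.54, 3.93)
print(f"{gain:.1%}")  # prints 13.4%
```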
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects of convolutional neural network (CNN) style architectures are examined: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
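As an illustration of the gradient reversal layer named above, here is a minimal framework-free sketch. It assumes the usual behaviour of such a layer: identity in the forward pass, and a sign flip on the gradient in the backward pass. The class name `GradientReversal` and the `scale` factor are hypothetical choices for this example, not details given in the abstract.

```python
class GradientReversal:
    """Identity on the forward pass; negated (and scaled) gradient on the backward pass."""

    def __init__(self, scale=1.0):
        # `scale` is an assumed knob controlling how strongly the reversed
        # gradient pushes the features toward domain invariance.
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, upstream_grad):
        # The gradient coming back from the domain classifier is flipped, so the
        # feature extractor is updated to make the two domains harder to tell apart.
        return [-self.scale * g for g in upstream_grad]


layer = GradientReversal(scale=2.0)
outputs = layer.forward([1.0, -2.0])
grads = layer.backward([1.0, -3.0])
```

In a real deep learning package this would be written as a custom operation with an overridden gradient; the plain-Python version only shows the two passes side by side.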
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
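The rank-based selection strategy described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `pressure` parameter, and the exact decay formula are choices made for this example, not details given in the abstract. Sampling is done with replacement for simplicity.

```python
import math
import random


def rank_selection_probabilities(latest_losses, pressure=100.0):
    """Return one selection probability per datapoint.

    Datapoints are ranked by their latest known loss (largest first) and the
    probability decays exponentially with rank. `pressure` is an assumed knob:
    larger values concentrate sampling on high-loss datapoints, 1.0 is uniform.
    """
    n = len(latest_losses)
    # Indices sorted so that rank 0 is the datapoint with the greatest loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    # Multiplicative decay applied once per rank step.
    decay = math.exp(-math.log(pressure) / n)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices for one batch according to the rank-based probabilities."""
    probs = rank_selection_probabilities(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


probs = rank_selection_probabilities([0.1, 2.0, 0.5])
batch = sample_batch([0.1, 2.0, 0.5], batch_size=4, rng=random.Random(0))
```

In a training loop, the stored loss of each selected datapoint would be refreshed after the forward pass so that the ranking tracks the latest known values.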
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
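The SVRG update analyzed in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `svrg`, the per-example gradient callback `grad_i`, and the default step size, epoch count, and inner-loop length are all assumptions chosen for readability. Each outer epoch computes a full gradient at a snapshot point, and each inner step corrects a single-example gradient with that snapshot.

```python
import numpy as np

def svrg(grad_i, w0, n, step=0.02, epochs=50, inner_steps=None, rng=None):
    """Minimal SVRG sketch for minimizing (1/n) * sum_i f_i(w).

    grad_i(w, i) must return the gradient of the i-th component f_i at w.
    All defaults here are illustrative assumptions, not tuned values.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    m = n if inner_steps is None else inner_steps
    w_snap = np.array(w0, dtype=float)
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per outer epoch.
        full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
        w = w_snap.copy()
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced estimate: unbiased for the full gradient at w.
            v = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= step * v
        # Use the last inner iterate as the next snapshot (one common choice).
        w_snap = w
    return w_snap
```

The loop only needs per-example gradients, so the same sketch applies to nonconvex components; the convergence guarantees stated in the abstract depend on step-size and epoch-length choices that this sketch does not reproduce.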
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
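The stochastic rounding idea mentioned in the abstract above can be illustrated with a short NumPy sketch. It is an assumption-laden illustration, not the paper's implementation: the function name `stochastic_round`, the split into 8 fractional bits within a 16-bit word, and the saturation behaviour are choices made here for concreteness. A value is rounded up with probability equal to its fractional remainder, so the rounded value is unbiased in expectation.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, word_bits=16, rng=None):
    """Round x onto a signed fixed-point grid with stochastic rounding.

    frac_bits and word_bits are illustrative assumptions. Values outside the
    representable range are saturated.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    rounded = lower + round_up
    # Saturate to the signed word_bits-wide integer range.
    lo = -(2 ** (word_bits - 1))
    hi = 2 ** (word_bits - 1) - 1
    return np.clip(rounded, lo, hi) / scale
```

With round-to-nearest, an update smaller than half the grid spacing is always lost. With this scheme it survives with a small probability, so repeated small updates still accumulate on average.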
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
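The relationships named in the abstract above (input shape, kernel shape, zero padding, strides and output shape) can be checked numerically with a small helper. The sketch below uses the widely used one-axis formulas; the function names and argument names are assumptions made here, and the transposed case assumes no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a convolution or pooling layer.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Uses floor((i + 2p - k) / s) + 1.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a transposed convolution.

    Assumes no additional output padding: s * (i - 1) + k - 2p.
    """
    return s * (i - 1) + k - 2 * p
```

For example, `conv_output_size(28, 5, s=1, p=2)` keeps a 28-wide input at 28, and `transposed_conv_output_size(3, 3, s=2, p=1)` maps a 3-wide input back to 5, the input size that `conv_output_size(5, 3, s=2, p=1)` reduces to 3.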
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
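Training against soft targets from a teacher model, as described in the abstract above, is often written as a cross-entropy between the teacher's output distribution and the student's. The sketch below is one such formulation and not necessarily the objective used in the paper: the function names and the `temperature` parameter are assumptions introduced here for illustration.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean cross-entropy of student predictions against teacher soft targets.

    temperature is a hypothetical smoothing knob; 1.0 leaves both
    distributions unsmoothed.
    """
    targets = np.exp(log_softmax(teacher_logits, temperature))
    log_probs = log_softmax(student_logits, temperature)
    return float(-(targets * log_probs).sum(axis=-1).mean())
```

The loss is smallest when the student reproduces the teacher's distribution, which is what makes it an easier objective than matching hard labels: the teacher's probabilities carry information about how classes relate to each other.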
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via two strategies for combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
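The sampling step described in the abstract above can be illustrated with a minimal sketch. It assumes non-negative activations (for example after a rectifier) and treats one pooling region as a flat list; the function name `stochastic_pool` and the handling of an all-zero region are illustrative choices, not taken from the paper.

```python
import random


def stochastic_pool(region, rng):
    """Pick one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial over the region. Assumes non-negative activations.
    """
    total = sum(region)
    if total == 0:
        # Assumption: an all-zero region simply pools to zero.
        return 0.0
    probs = [a / total for a in region]
    idx = rng.choices(range(len(region)), weights=probs, k=1)[0]
    return region[idx]


rng = random.Random(0)
region = [0.0, 1.0, 3.0, 0.0]  # probabilities 0, 0.25, 0.75, 0
samples = [stochastic_pool(region, rng) for _ in range(1000)]
```

Zero activations are never selected, and larger activations are selected more often, so the procedure behaves like a randomized version of max pooling.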
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined (researchers "know them when they see them"), and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and makes it difficult to find appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
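The claim above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically with a minimal sketch. It applies the norm directly to a list of raw values; the learned projections and learned orders described in the abstract are omitted, and the name `lp_pool` is illustrative.

```python
def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p)."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


values = [1.0, 2.0, 3.0, 4.0]
avg_like = lp_pool(values, 1)    # mean of absolute values
rms_like = lp_pool(values, 2)    # root-mean-square
max_like = lp_pool(values, 200)  # approaches max(|v|) as p grows
```

For p = 1 the result is the mean absolute value, for p = 2 it is the root-mean-square, and as p grows the largest entry dominates the sum, so the result approaches the maximum from below.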
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models, with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
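The masking idea in the abstract above can be illustrated with a minimal sketch for a single hidden layer. The function names, the random degree assignment, and the single-hidden-layer setup are illustrative assumptions, not the authors' implementation. Each hidden unit gets a degree m(k); it may read input d only if m(k) >= d, and output d may read hidden unit k only if d > m(k), so output d ends up depending only on inputs that precede it in the ordering.

```python
import random


def made_masks(num_inputs, num_hidden, seed=0):
    """Build binary masks enforcing the autoregressive constraint (assumed 1-hidden-layer setup)."""
    rng = random.Random(seed)
    # Hypothetical choice: draw each hidden unit's degree uniformly from 1..num_inputs-1.
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # mask_in[k][d-1] == 1 when hidden unit k may read input d (1-indexed).
    mask_in = [[1 if degrees[k] >= d else 0 for d in range(1, num_inputs + 1)]
               for k in range(num_hidden)]
    # mask_out[d-1][k] == 1 when output d may read hidden unit k.
    mask_out = [[1 if d > degrees[k] else 0 for k in range(num_hidden)]
                for d in range(1, num_inputs + 1)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the paths from input j to output i through the masked hidden layer."""
    num_outputs, num_hidden = len(mask_out), len(mask_in)
    return [[sum(mask_out[i][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_outputs)]
            for i in range(num_outputs)]
```

With these masks the connectivity matrix is strictly lower triangular, which is the property that lets the outputs be read as conditional probabilities in the chosen ordering.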
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
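The Kullback-Leibler objective mentioned in the abstract above can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions: the function names, the temperature parameter, and the example logits are hypothetical, and the abstract does not specify how the authors compute or batch the loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution (temperature is an assumed knob)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) for one frame; assumes student_probs has no zero entries."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs)
               if p > 0.0)


# Hypothetical per-frame outputs of the large teacher and the small student.
teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.0, 1.0, 1.0])
frame_loss = kl_divergence(teacher, student)
```

Minimizing this quantity over all frames pushes the student's output distribution toward the teacher's soft targets instead of toward hard one-hot labels.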
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets such as MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
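As a rough illustration of the encoding named in the abstract above, the sketch below computes a Gramian Angular Summation Field in plain Python. It assumes the common definition: rescale the series, map each value to an angle with arccos, and fill the image with cos(phi_i + phi_j). The function names and the [-1, 1] rescaling are illustrative assumptions (the abstract also mentions 0/1 rescaling), and this is not the authors' code.

```python
import math


def rescale(series):
    """Min-max rescale a non-constant series into [-1, 1] (assumed rescaling range)."""
    low, high = min(series), max(series)
    if high == low:
        raise ValueError("series must contain at least two distinct values")
    # Clamp to guard against tiny floating-point overshoot before acos.
    return [max(-1.0, min(1.0, (2.0 * v - high - low) / (high - low)))
            for v in series]


def gasf(series):
    """Return the GASF image as a list of rows: entry (i, j) is cos(phi_i + phi_j)."""
    angles = [math.acos(v) for v in rescale(series)]
    return [[math.cos(a + b) for b in angles] for a in angles]
```

Each diagonal entry equals cos(2 * phi_i) = 2 * x_i**2 - 1, so the rescaled values can be read back from the diagonal up to sign, which hints at why the representation preserves so much of the original series.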
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
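The soft-target idea in the abstract above can be illustrated with a minimal sketch. It assumes that the teacher and the student both emit a vector of class logits and that the student is trained with cross-entropy against the teacher's softmax output; the function names `softmax` and `soft_target_loss` are hypothetical, and the abstract does not specify the exact objective used in the paper.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits):
    """Cross-entropy of the student's distribution against the teacher's soft targets.

    Assumption: the loss form is illustrative, not taken from the paper.
    """
    targets = softmax(teacher_logits)   # soft targets from the teacher model
    probs = softmax(student_logits)     # student predictions
    return -sum(t * math.log(q) for t, q in zip(targets, probs))
```

For fixed teacher logits, the loss is smallest when the student reproduces the teacher's distribution, which is what makes the teacher's output usable as an easier training objective.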
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
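The sampling step described in the abstract above can be sketched as follows. This is a minimal illustration for a single pooling region, assuming non-negative activations (for example after a rectifier) so that they can be used directly as multinomial weights; the function name `stochastic_pool` is hypothetical and the sketch covers only the training-time sampling mentioned in the abstract.

```python
import random

def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region with probability proportional to its value.

    Assumption: all activations are non-negative. If the whole region is zero,
    there is nothing to sample from, so 0.0 is returned.
    """
    if sum(region) == 0:
        return 0.0
    # random.choices draws according to the given relative weights.
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when comparing against deterministic max or average pooling.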
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
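As a rough illustration of the pooling interpretation in the abstract above, the sketch below computes a normalized L_p norm over a list of already-projected inputs. The function name lp_unit and the restriction to p >= 1 are assumptions made for this example; the learned projections and per-unit orders described in the abstract are left out.

```python
def lp_unit(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p).

    Hypothetical helper for illustration; `values` stands in for the
    projected inputs that a unit would receive from the layer below.
    """
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    if not values:
        raise ValueError("values must be non-empty")
    n = len(values)
    # Factor out the largest magnitude so that large p does not overflow.
    scale = max(abs(v) for v in values)
    if scale == 0:
        return 0.0
    mean_pow = sum((abs(v) / scale) ** p for v in values) / n
    return scale * mean_pow ** (1.0 / p)


inputs = [3.0, -4.0]
mean_abs = lp_unit(inputs, 1)      # average of absolute values
rms = lp_unit(inputs, 2)           # root-mean-square pooling
near_max = lp_unit(inputs, 200)    # approaches max |v| as p grows
```

With p = 1 the unit behaves like average pooling of magnitudes, with p = 2 like root-mean-square pooling, and for large p it approaches max pooling, which is the sense in which one unit covers several conventional pooling operators.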
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
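The masking idea in the abstract above can be sketched for a single hidden layer. Everything named here (made_masks, connectivity, the cyclic assignment of degrees to hidden units) is an assumption chosen for illustration rather than the authors' implementation; the point is only that the two masks together let output d depend on inputs strictly earlier than d in the ordering.

```python
def made_masks(num_inputs, num_hidden):
    """Build 0/1 masks for input->hidden and hidden->output weights.

    Each hidden unit k gets a degree m[k] in 1..num_inputs-1 (assigned
    cyclically here; a random assignment would also work). Hidden unit k
    may see input d only if m[k] >= d, and output d may see hidden unit
    k only if d > m[k].
    """
    if num_inputs < 2:
        raise ValueError("need at least two inputs for an ordering")
    degrees = [1 + k % (num_inputs - 1) for k in range(num_hidden)]
    input_mask = [
        [1 if degrees[k] >= d else 0 for d in range(1, num_inputs + 1)]
        for k in range(num_hidden)
    ]
    output_mask = [
        [1 if d > degrees[k] else 0 for k in range(num_hidden)]
        for d in range(1, num_inputs + 1)
    ]
    return input_mask, output_mask


def connectivity(input_mask, output_mask):
    """Count masked paths from each input (column) to each output (row)."""
    num_outputs = len(output_mask)
    num_hidden = len(input_mask)
    return [
        [
            sum(output_mask[d][k] * input_mask[k][j] for k in range(num_hidden))
            for j in range(num_outputs)
        ]
        for d in range(num_outputs)
    ]


in_mask, out_mask = made_masks(num_inputs=4, num_hidden=6)
paths = connectivity(in_mask, out_mask)  # strictly lower-triangular pattern
```

In a real network the masks would be multiplied elementwise into the weight matrices before each forward pass; the path-count matrix is only a convenient way to check the autoregressive constraint.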
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
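The abstract above does not spell out the compression technique. The sketch below assumes the commonly described form of distillation, in which a student is trained to match teacher probabilities produced by a softmax with a raised temperature. The function names and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; higher temperature gives softer outputs."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_cross_entropy(teacher_logits, student_logits, temperature):
    """Cross-entropy of the student's softened outputs against the teacher's."""
    teacher = softmax_with_temperature(teacher_logits, temperature)
    student = softmax_with_temperature(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


teacher_logits = [5.0, 1.0, 0.0]   # made-up values for illustration
student_logits = [2.0, 1.5, 0.5]
loss = soft_target_cross_entropy(teacher_logits, student_logits, temperature=4.0)
```

Raising the temperature spreads probability mass onto the non-argmax classes, so the student also receives a signal about which wrong answers the teacher considers plausible.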
Deep Belief Networks (DBNs) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
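To make the image encoding in the abstract above concrete, here is a minimal sketch of a Gramian Angular Summation Field, assuming the usual construction: rescale the series to [0, 1], map each value to an angle with arccos, and fill entry (i, j) with the cosine of the sum of the two angles. The helper names are hypothetical, and the GADF and MTF variants are not shown.

```python
import math


def rescale_unit(series):
    """Min-max rescale a series to the interval [0, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    return [(x - lo) / (hi - lo) for x in series]


def gasf(series):
    """Return the GASF matrix with entries cos(phi_i + phi_j)."""
    scaled = rescale_unit(series)
    # Clamp before acos to guard against tiny floating-point overshoot.
    angles = [math.acos(min(1.0, max(0.0, x))) for x in scaled]
    return [[math.cos(a + b) for b in angles] for a in angles]


image = gasf([1.0, 2.0, 4.0, 3.0])  # 4 x 4 symmetric matrix
```

The diagonal entry for a rescaled value x equals 2x^2 - 1, so on 0/1 rescaled data the original values can be read back from the diagonal, which is the bijection property the abstract refers to.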
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
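The straight-path analysis described in the abstract above can be sketched in a few lines. This is only a minimal illustration, assuming the model parameters are flattened into a single NumPy vector; `loss_fn`, `theta_init` and `theta_final` are hypothetical names that do not come from the abstract.

```python
import numpy as np

def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the straight line
    from theta_init (alpha = 0) to theta_final (alpha = 1)."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([
        loss_fn((1.0 - a) * theta_init + a * theta_final)
        for a in alphas
    ])
    return alphas, losses

# Toy usage with a quadratic objective (an assumption for illustration only).
toy_loss = lambda theta: float(np.sum((theta - 1.0) ** 2))
alphas, losses = loss_along_line(toy_loss, np.zeros(3), np.ones(3))
```

Plotting `losses` against `alphas` gives a one-dimensional cross-section of the objective; a bump along the curve would indicate an obstacle on the straight path.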
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
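The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal sketch under stated assumptions: the parameter `s_e` (the approximate ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) and both function names are hypothetical choices made for illustration, not details taken from the abstract.

```python
import numpy as np

def rank_selection_probs(latest_losses, s_e=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest known loss.
    s_e is an assumed 'selection pressure' hyperparameter.
    """
    losses = np.asarray(latest_losses, dtype=float)
    n = len(losses)
    order = np.argsort(-losses)          # indices sorted by decreasing loss
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)          # ranks[i] = rank of datapoint i
    weights = np.exp(-np.log(s_e) * ranks / n)
    return weights / weights.sum()

def sample_batch(latest_losses, batch_size, rng, s_e=100.0):
    """Draw a batch of distinct datapoint indices using the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, s_e)
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)

example_losses = [0.1, 2.0, 0.5, 1.0]
example_probs = rank_selection_probs(example_losses)
example_batch = sample_batch(example_losses, 2, np.random.default_rng(0))
```

In a training loop, the stored loss of each selected datapoint would be refreshed after the forward pass, so the ranking follows the current state of the model.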
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
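Stochastic rounding, as named in the abstract above, rounds a value up or down to a neighbouring representable number with probability proportional to its proximity, so the result is unbiased in expectation. The following is a minimal NumPy sketch, not the hardware scheme from the paper; the function name, the split of the 16-bit word into `frac_bits` fractional bits, and the saturation behaviour are all assumptions made for illustration.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Quantize x to a signed fixed-point grid with stochastic rounding.

    frac_bits and word_bits are assumed format parameters; values outside
    the representable range are saturated.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    q = lower + round_up
    # Saturate to the signed integer range of the word.
    q = np.clip(q, -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1)
    return q / scale

exact = stochastic_round_fixed_point(np.array([0.5, -1.25]))
samples = stochastic_round_fixed_point(
    np.full(100_000, 0.3), frac_bits=2, rng=np.random.default_rng(0)
)
```

With `frac_bits=2` the grid spacing is 0.25, so each sample of 0.3 becomes either 0.25 or 0.5, and the average over many samples stays close to 0.3, which round-to-nearest cannot achieve.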
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
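The factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ described above can be checked numerically on a toy example. The sketch below is only an illustration: the distribution `p_x`, the encoder `f`, and the helper `p_x_given_h` are hypothetical choices made for this example and do not come from the paper.

```python
from fractions import Fraction

# Hypothetical discrete distribution over x in {0, 1, 2, 3} (illustrative only).
p_x = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}


def f(x):
    """Hypothetical deterministic discrete encoder: maps x to a code h."""
    return x // 2


# P(H = h) is the total mass of all x that encode to h.
p_h = {}
for x, p in p_x.items():
    p_h[f(x)] = p_h.get(f(x), 0) + p


def p_x_given_h(x, h):
    """Decoder P(X = x | H = h); puts no mass on any x with f(x) != h."""
    return p_x[x] / p_h[h] if f(x) == h else Fraction(0)


# Rebuild P(X = x) from the two factors.
reconstructed = {x: p_x_given_h(x, f(x)) * p_h[f(x)] for x in p_x}
```

Exact rational arithmetic is used so that the two sides of the identity can be compared with `==` rather than with a floating-point tolerance.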
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
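The rank-based selection strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `pressure` parameter (taken here to be the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints), and sampling with replacement are all hypothetical choices made for this example.

```python
import math
import random


def rank_selection_probs(losses, pressure):
    """Selection probability per datapoint, decaying exponentially with loss rank.

    `pressure` (hypothetical name) is the ratio between the probability of the
    datapoint with the greatest loss and that of the datapoint with the smallest.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / max(n - 1, 1)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop, `losses` would hold the most recent loss recorded for each datapoint and would be refreshed for the selected indices after every step; setting `pressure` to 1 recovers uniform sampling.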
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
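The SVRG update analyzed above can be sketched in a few lines. The example below is a hedged illustration only: to keep the result easy to verify it uses a convex one-dimensional least-squares toy problem rather than a nonconvex one, and the data `xs`, `ys`, the step size, and the loop lengths are hypothetical values chosen for this sketch.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, rng=random):
    """Minimal SVRG loop for a finite sum (1/n) * sum_i f_i(w) with scalar w."""
    w_snap = w0
    for _ in range(epochs):
        # Full gradient at the snapshot point, computed once per epoch.
        mu = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient.
            g = grad_i(w, i) - grad_i(w_snap, i) + mu
            w -= step * g
        # Use the last inner iterate as the next snapshot.
        w_snap = w
    return w_snap


# Toy data (illustrative): f_i(w) = 0.5 * (xs[i] * w - ys[i]) ** 2.
xs = [1.0, 2.0, 3.0, 0.5]
ys = [2.0, 4.1, 5.9, 1.2]


def grad_i(w, i):
    """Gradient of the i-th component function f_i at w."""
    return xs[i] * (xs[i] * w - ys[i])


w_hat = svrg(grad_i, n=len(xs), w0=0.0, step=0.05, epochs=60,
             inner_steps=4, rng=random.Random(0))
# w_hat ends up close to 2.0, the minimizer sum(x * y) / sum(x * x) of the toy problem.
```

The correction term `grad_i(w_snap, i) - mu` has zero mean over `i`, so `g` stays an unbiased gradient estimate while its variance shrinks as `w` and `w_snap` approach a stationary point.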
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
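The rounding scheme mentioned in the abstract above can be illustrated with a minimal sketch. The function names, the `frac_bits` parameter and the choice of 8 fractional bits are assumptions made for illustration; the sketch does not model the 16-bit word length or saturation, and it is not the authors' implementation.

```python
import math
import random


def round_to_nearest(x, frac_bits=8):
    """Deterministic rounding to a fixed-point grid with step 2**-frac_bits."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to one of its two neighbouring grid points at random.

    x is rounded up with probability equal to its fractional distance
    above the lower grid point, so the result equals x in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower          # in [0, 1)
    if rng.random() < prob_up:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    x = 0.001                         # smaller than half a grid step (1/512)
    samples = [stochastic_round(x, rng=rng) for _ in range(20000)]
    print("round to nearest:", round_to_nearest(x))
    print("mean of stochastic rounding:", sum(samples) / len(samples))
```

Round-to-nearest maps every value below half a grid step to zero, so small updates are always lost, whereas stochastic rounding preserves them on average.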
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
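The relationship between input shape, kernel shape, zero padding, strides and output shape described above can be sketched along a single spatial axis. The function names are hypothetical, and the formulas are the standard ones for a convolution with symmetric zero padding and for a transposed convolution without extra output padding; other cases (dilation, asymmetric padding) are not covered.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution associated with a
    convolution that uses kernel size k, stride s and padding p."""
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # 4x4 input, 3x3 kernel, unit stride, no padding gives a 2x2 output,
    # and the transposed convolution maps 2x2 back to 4x4.
    print(conv_output_size(4, 3))              # 2
    print(transposed_conv_output_size(2, 3))   # 4
    # With stride 2, inputs of size 5 and 6 both map to 3, so the
    # transposed convolution can only recover one of them (here 5).
    print(conv_output_size(5, 3, s=2, p=1), conv_output_size(6, 3, s=2, p=1))
    print(transposed_conv_output_size(3, 3, s=2, p=1))
```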
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
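Training against soft targets from a teacher model, as described above, can be sketched as a cross-entropy between the teacher's output distribution and the student's. The function names are hypothetical, and the use of a plain softmax over logits with no temperature is an assumption; the abstract does not specify how the soft targets are formed.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]


def soft_target_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's soft targets.

    Assumption: soft targets are the teacher's softmax probabilities.
    """
    targets = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    student_log_probs = log_softmax(student_logits)
    return -sum(t * lp for t, lp in zip(targets, student_log_probs))


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    print(soft_target_loss(teacher, teacher))          # teacher's entropy
    print(soft_target_loss([0.0, 0.0, 3.0], teacher))  # larger loss
```

The loss is smallest when the student reproduces the teacher's distribution, in which case it equals the entropy of the soft targets.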
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
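One simple way to combine max and average pooling is a convex combination of the two, sketched below for a single pooling region. This is an illustrative assumption rather than the authors' exact formulation: the function name `mixed_pool` is hypothetical, and the mixing weight `alpha` is passed in as a fixed number here, whereas the abstract describes the pooling function as learned.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty sequence of activations in the pooling window.
    alpha:  mixing weight in [0, 1]; 1 gives max pooling, 0 gives
            average pooling (assumed fixed here, not learned).
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    print(mixed_pool(window, 1.0))   # 6.0, pure max pooling
    print(mixed_pool(window, 0.0))   # 3.0, pure average pooling
    print(mixed_pool(window, 0.5))   # 4.5, halfway between the two
```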
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
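The stochastic pooling procedure described above can be sketched for a single pooling region: each activation is selected with probability proportional to its value. The function names are hypothetical, and the sketch assumes non-negative activations (for example after a rectifier) and returns 0.0 for an all-zero region, neither of which is stated in the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation with probability proportional to its value.

    Assumes all activations in the region are non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    if sum(region) == 0:
        return 0.0                    # nothing active in this region
    return rng.choices(region, weights=region, k=1)[0]


def expected_pool(region):
    """Expected value of stochastic_pool: sum_i p_i * a_i, p_i = a_i / sum(a)."""
    total = sum(region)
    if total == 0:
        return 0.0
    return sum(a * a for a in region) / total


if __name__ == "__main__":
    rng = random.Random(0)
    window = [1.0, 3.0]
    draws = [stochastic_pool(window, rng) for _ in range(20000)]
    print("fraction of draws equal to 3.0:", draws.count(3.0) / len(draws))
    print("expected pooled value:", expected_pool(window))   # 2.5
```

Larger activations are chosen more often, but smaller ones still propagate occasionally, which is the source of the regularizing noise.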
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
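The abstract above does not spell out how the Gramian angular fields are built. A minimal sketch follows, assuming the commonly cited definitions: the series is rescaled, each value is mapped to an angle with arccos, and the fields are cos(phi_i + phi_j) for GASF and sin(phi_i - phi_j) for GADF. The function name and the choice of rescaling to [-1, 1] are assumptions made for illustration; the abstract itself also mentions 0/1 rescaling.

```python
import numpy as np

def gramian_angular_fields(series):
    """Return (GASF, GADF) matrices for a 1-D series.

    Assumed definitions (not given in the abstract):
      x_tilde = min-max rescaling of the series to [-1, 1]
      phi     = arccos(x_tilde)
      GASF    = cos(phi_i + phi_j)
      GADF    = sin(phi_i - phi_j)
    """
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("series must not be constant")
    # Rescale to [-1, 1] so that arccos is defined everywhere.
    x = 2.0 * (x - lo) / (hi - lo) - 1.0
    x = np.clip(x, -1.0, 1.0)  # guard against rounding just outside [-1, 1]
    phi = np.arccos(x)
    gasf = np.cos(phi[:, None] + phi[None, :])
    gadf = np.sin(phi[:, None] - phi[None, :])
    return gasf, gadf

# Toy usage: a three-point ramp yields two 3x3 images.
gasf, gadf = gramian_angular_fields([0.0, 1.0, 2.0])
```

The resulting matrices can be treated as single-channel images and passed to any image classifier, which is the point of the encoding.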
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
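The abstract names a gradient reversal layer without describing it. A minimal sketch of the usual reading is given below: the layer is the identity in the forward pass and flips the sign of the gradient in the backward pass. The class name, the manual forward/backward interface, and the scaling factor `lam` are assumptions for illustration and are not taken from the abstract.

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass.

    `lam` is a hypothetical scaling factor for the reversed gradient.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Features pass through unchanged to the domain classifier.
        return x

    def backward(self, grad_output):
        # The gradient flowing back into the feature extractor is negated,
        # so the features are pushed to confuse the domain classifier.
        return -self.lam * grad_output

# Toy usage with a hand-written backward pass.
layer = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0, 3.0])
out = layer.forward(features)
grad_in = layer.backward(np.ones_like(features))
```

In an autodiff framework the same effect would be obtained with a custom gradient rule rather than an explicit backward method.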
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
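The "straight path from initialization to solution" analysis can be sketched as evaluating the loss at points theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final for alpha in [0, 1]. The sketch below assumes parameters are flattened into a single vector; the function names and the quadratic toy loss are hypothetical stand-ins for a real network and its training objective.

```python
import numpy as np

def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn on the line segment between two parameter vectors."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas]
    )
    return alphas, losses

def toy_loss(theta):
    # Hypothetical stand-in for a training objective, minimized at theta = 1.
    return float(np.sum((theta - 1.0) ** 2))

alphas, losses = loss_along_line(toy_loss, np.zeros(3), np.ones(3))
```

Plotting `losses` against `alphas` shows whether the path crosses any barrier; for the toy quadratic it decreases monotonically.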
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
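The selection rule described above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) can be sketched as follows. The exact parameterization is an assumption: here `selection_pressure` is a hypothetical parameter equal to the ratio between the selection probabilities of the highest-loss and lowest-loss ranks extrapolated over N ranks, and sampling is done without replacement.

```python
import numpy as np

def rank_based_probabilities(latest_losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the largest latest known loss.
    Assumed form: p_i proportional to exp(-log(s) * rank_i / N).
    """
    losses = np.asarray(latest_losses, dtype=float)
    n = losses.size
    order = np.argsort(-losses)          # indices sorted by decreasing loss
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)          # rank of each datapoint
    weights = np.exp(-np.log(selection_pressure) * ranks / n)
    return weights / weights.sum()

def sample_batch(latest_losses, batch_size, rng, selection_pressure=100.0):
    """Draw a batch of distinct indices according to the rank-based rule."""
    p = rank_based_probabilities(latest_losses, selection_pressure)
    return rng.choice(len(p), size=batch_size, replace=False, p=p)

# Toy usage: ten datapoints whose losses increase with their index.
rng = np.random.default_rng(0)
batch = sample_batch(np.arange(10.0), batch_size=4, rng=rng)
```

After each training step the stored losses of the selected datapoints would be refreshed, so the ranking changes over time.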
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
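The stochastic rounding mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the split of the 16-bit word into 8 fractional bits, and the saturation behaviour are all assumptions made for the example. A value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounded value is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a grid with step 2**-frac_bits (frac_bits is an assumed choice)."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional part, so E[result] == x.
    p_up = scaled - lower
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale


def to_fixed16(x, frac_bits=8, rng=random):
    """Stochastically round, then saturate to the signed 16-bit integer range."""
    scale = 1 << frac_bits
    q = int(round(stochastic_round(x, frac_bits, rng) * scale))
    q = max(-(1 << 15), min((1 << 15) - 1, q))
    return q / scale
```

Values that already lie on the grid are returned unchanged, while the average of many roundings of an off-grid value approaches the value itself, which is the property that round-to-nearest lacks.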
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
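The relationships between input shape, kernel shape, zero padding, strides and output shape can be sketched per spatial axis as below. The formulas are the commonly used ones for convolution and transposed convolution and are not quoted from the guide itself; the function names and the `output_padding` parameter are assumptions made for this example.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length of a convolution (or pooling with p=0) along one axis.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1, output_padding=0):
    """Output length of the transposed convolution paired with (k, p, s).

    output_padding (assumed name) recovers sizes lost to the floor division
    in conv_output_size when (i + 2p - k) is not a multiple of s.
    """
    return s * (i - 1) + k - 2 * p + output_padding
```

For instance, a 3-wide kernel with padding 1 and stride 2 maps both a 5-wide and a 6-wide input to width 3, so the transposed layer needs the extra `output_padding` term to return to width 6 rather than 5.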
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via two strategies for combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
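One simple way to combine max and average pooling is a convex mix of the two, sketched below for a single pooling region. This is an illustrative assumption, not necessarily either of the strategies the abstract refers to: the function name and the mixing weight `a` are hypothetical, and `a` is passed in as a fixed number here, whereas a learned variant would train it by gradient descent.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one region.

    region: non-empty sequence of activations in the pooling window.
    a: mixing weight in [0, 1]; a=1 gives max pooling, a=0 average pooling.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing weight must lie in [0, 1]")
    return a * max(region) + (1.0 - a) * sum(region) / len(region)
```

Because the output is differentiable in `a`, the weight could in principle be treated as a trainable parameter alongside the network weights.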
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
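The sampling step described above can be sketched as follows, assuming non-negative activations (for example after a rectifier) so that they can be used directly as unnormalized multinomial weights. The helper name `stochastic_pool` is hypothetical, and the sketch covers a single pooling region only.

```python
import random


def stochastic_pool(region, rng):
    """Pick one activation with probability proportional to its value.

    Assumes every activation in `region` is non-negative.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # every activation is zero, so there is nothing to sample
    # choices() accepts unnormalized weights, so no explicit division is needed
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)
region = [0.0, 1.0, 3.0, 0.0]
samples = [stochastic_pool(region, rng) for _ in range(1000)]
```

Zero activations are never selected, and the activation 3.0 is drawn about three times as often as 1.0.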
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of "shadow" groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
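The pooling interpretation can be checked numerically with a small sketch of a normalized $L_p$ norm over a fixed list of inputs. This covers only the pooling core: the learned projections and the learned order $p$ mentioned in the abstract are omitted, and the function name `lp_pool` is hypothetical.

```python
def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p)."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


x = [1.0, 2.0, 3.0, 4.0]
average = lp_pool(x, 1)         # p = 1 gives the mean of absolute values
root_mean_sq = lp_pool(x, 2)    # p = 2 gives root-mean-square pooling
near_max = lp_pool(x, 100)      # a large p approaches max pooling from below
```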
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
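The masking idea can be sketched for a single hidden layer as follows. The degree-based mask construction and the names `made_masks`, `connectivity` and `hidden_degrees` are assumptions chosen for illustration; the abstract does not specify how the masks are built. Each hidden unit is given a degree, it may read inputs up to that degree, and output d may read only hidden units whose degree is below d.

```python
def made_masks(n_inputs, hidden_degrees):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Inputs and outputs are numbered 1..n_inputs in the chosen ordering.
    """
    input_degrees = range(1, n_inputs + 1)
    # Hidden unit with degree m may see inputs 1..m
    mask_in = [[1 if m >= d else 0 for d in input_degrees] for m in hidden_degrees]
    # Output d may see hidden units with degree strictly less than d
    mask_out = [[1 if d > m else 0 for m in hidden_degrees] for d in input_degrees]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the masked paths from each input (column) to each output (row)."""
    n_hidden = len(mask_in)
    n_in = len(mask_in[0])
    return [
        [sum(row[k] * mask_in[k][i] for k in range(n_hidden)) for i in range(n_in)]
        for row in mask_out
    ]


mask_in, mask_out = made_masks(4, [1, 2, 3, 1, 2])
paths = connectivity(mask_in, mask_out)
```

In `paths`, every entry on or above the diagonal is zero, so each output is computed only from inputs that precede it in the ordering.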
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
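The variance-reduced update that SVRG methods are built on can be sketched in a few lines. The example below is only an illustration on a toy convex least-squares sum, not the nonconvex setting analyzed in the abstract above; the function name `svrg_least_squares`, the step size `eta`, the epoch count and the inner-loop length are all assumptions chosen for the sketch.

```python
import random

def svrg_least_squares(a, b, eta=0.01, epochs=100, inner_steps=None, seed=0):
    """Minimize (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 over a scalar w with SVRG.

    Hypothetical toy setup: scalar parameter, hand-picked step size.
    """
    rng = random.Random(seed)
    n = len(a)
    m = inner_steps if inner_steps is not None else 2 * n

    def grad_i(w, i):
        # Gradient of the i-th term 0.5 * (a_i * w - b_i)^2.
        return a[i] * (a[i] * w - b[i])

    w = 0.0
    for _ in range(epochs):
        snapshot = w
        # Full gradient at the snapshot, computed once per epoch.
        mu = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(m):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient: unbiased for the full
            # gradient at w, with variance shrinking as w nears the snapshot.
            g = grad_i(w, i) - grad_i(snapshot, i) + mu
            w -= eta * g
    return w

# Data generated from w* = 2, so the minimizer is known exactly.
w_hat = svrg_least_squares([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Here the last inner iterate is reused as the next snapshot; other variants pick a random inner iterate or an average instead.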
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
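As a rough illustration of the stochastic rounding idea mentioned above, the sketch below rounds a real number to a signed fixed-point grid, rounding up with probability equal to the fractional remainder so that the result is unbiased in expectation. The function name `stochastic_round`, the 8-bit fractional split of the 16-bit word, and the saturation behaviour are assumptions made for the example, not details taken from the abstract.

```python
import math
import random

def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with `frac_bits` fractional bits.

    Hypothetical helper: the word layout and saturation are illustrative choices.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    q = math.floor(scaled)
    # Round up with probability equal to the distance from the lower grid
    # point, so that E[result] equals x for values inside the range.
    if rng.random() < scaled - q:
        q += 1
    # Saturate to the representable signed integer range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    q = max(min_int, min(max_int, q))
    return q / scale

# Averaging many stochastically rounded copies of 0.3 recovers roughly 0.3,
# whereas round-to-nearest would always return 77/256.
rng = random.Random(0)
samples = [stochastic_round(0.3, rng=rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

Values that already lie on the grid, such as 0.5 with 8 fractional bits, are returned unchanged because the round-up probability is zero.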
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
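The shape relationships described above can be written as two small helpers. This is a sketch of the standard one-dimensional formulas for a convolution and its transposed counterpart, assuming square kernels, symmetric zero padding and no dilation; the function names and the `a` (extra output size) parameter are illustrative, not taken from the guide itself.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    """
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0, a=0):
    """Output length of the transposed convolution along one axis.

    k, s and p describe the associated forward convolution. `a` is the extra
    size needed when the forward stride did not divide (input + 2p - k)
    evenly; it must satisfy 0 <= a < s.
    """
    return s * (i - 1) + a + k - 2 * p

# A 3x3 kernel with stride 2 and padding 1 maps 5 -> 3, and the transposed
# layer with the same settings maps 3 -> 5.
down = conv_output_size(5, k=3, s=2, p=1)
up = transposed_conv_output_size(down, k=3, s=2, p=1)
```

An input of size 6 with the same settings also maps to 3, which is why the transposed layer needs `a=1` to recover 6 rather than 5.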
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
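As a rough illustration of the first direction in the abstract above (combining max and average pooling), the sketch below blends the two with a single mixing weight. The convex-combination form, the function name `mixed_pool`, and the scalar weight `a` are assumptions made for illustration; the abstract does not state the exact form, and in a real network `a` would be a learned parameter rather than a fixed argument.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    Assumed form (not taken from the abstract): a * max + (1 - a) * mean,
    with the mixing weight a restricted to [0, 1].
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing weight a must lie in [0, 1]")
    mean = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * mean


if __name__ == "__main__":
    region = [1.0, 3.0, 2.0, 6.0]
    print(mixed_pool(region, 1.0))  # a = 1 reduces to max pooling
    print(mixed_pool(region, 0.0))  # a = 0 reduces to average pooling
    print(mixed_pool(region, 0.5))  # an even blend of the two
```

Setting `a` to 1 or 0 recovers plain max or average pooling, which is why this kind of blend can be read as a generalization of both.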
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
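The stochastic pooling procedure described in the abstract above can be sketched as follows: each activation in a pooling region is selected with probability proportional to its value. This is a minimal sketch, not the authors' implementation; the function name `stochastic_pool`, the restriction to non-negative activations (for example, after a rectifier), and the choice to return 0.0 for an all-zero region are assumptions made for illustration.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation a_i is picked with probability a_i / sum(region),
    i.e. a multinomial distribution given by the activities themselves.
    Assumes non-negative activations; an all-zero region returns 0.0.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total == 0:
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    region = [0.5, 0.0, 1.5, 2.0]
    samples = [stochastic_pool(region, rng) for _ in range(10)]
    print(samples)
```

Because zero activations have zero weight, they are never selected, while larger activations are picked more often without always winning as they would under max pooling.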
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
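The claim in the abstract above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below assumes the plain form $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$ over the inputs; the function name `lp_unit` is hypothetical, and the projections and learned orders mentioned in the abstract are left out.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm: ((1/N) * sum(|x_i| ** p)) ** (1/p).

    Assumed simplified form; the learned projections and per-unit
    orders described in the abstract are not modeled here.
    """
    if not inputs:
        raise ValueError("inputs must not be empty")
    if p < 1:
        raise ValueError("order p is assumed to be at least 1")
    mean_power = sum(abs(x) ** p for x in inputs) / len(inputs)
    return mean_power ** (1.0 / p)


if __name__ == "__main__":
    x = [3.0, -4.0]
    print(lp_unit(x, 1))    # mean of absolute values
    print(lp_unit(x, 2))    # root-mean-square
    print(lp_unit(x, 200))  # approaches max(|x_i|) as p grows
```

With p = 1 the unit averages the absolute values, with p = 2 it computes the root-mean-square, and as p grows the result approaches the largest absolute input.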
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
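The straight-path analysis described in the abstract above can be sketched as follows. This is a minimal illustration, assuming the parameters are flattened into a single NumPy vector; `loss_along_path` and `toy_loss` are hypothetical names, and the toy quadratic loss stands in for a real network's training loss.

```python
import numpy as np

def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the line segment
    theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas]
    )
    return alphas, losses

def toy_loss(theta):
    # Stand-in objective (an assumption): squared distance to the all-ones vector.
    return float(np.sum((theta - 1.0) ** 2))

theta0 = np.zeros(3)  # plays the role of the initial parameters
theta1 = np.ones(3)   # plays the role of the parameters after training
alphas, losses = loss_along_path(toy_loss, theta0, theta1)
```

Plotting `losses` against `alphas` shows whether the one-dimensional slice contains any barrier between initialization and solution; for the toy loss it decreases monotonically.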
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
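The rank-based selection rule in the abstract above (datapoints ranked by latest known loss, with selection probability decaying exponentially in rank) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names and the `pressure` parameter, taken here to be the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints, are hypothetical.

```python
import numpy as np

def rank_selection_probs(latest_losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the largest latest-known loss. `pressure`
    (an assumed parameterization) is the probability ratio between the
    rank-0 datapoint and the last-ranked datapoint.
    """
    losses = np.asarray(latest_losses, dtype=float)
    n = losses.size
    order = np.argsort(-losses, kind="stable")  # indices sorted by descending loss
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)                 # rank of each datapoint
    decay = np.log(pressure) / max(n - 1, 1)
    weights = np.exp(-decay * ranks)
    return weights / weights.sum()

def sample_batch(latest_losses, batch_size, rng, pressure=100.0):
    """Draw a batch of distinct indices according to the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, pressure)
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)

rng = np.random.default_rng(0)
losses = np.array([0.1, 2.0, 0.5, 1.2])
probs = rank_selection_probs(losses, pressure=8.0)
batch = sample_batch(losses, batch_size=2, rng=rng, pressure=8.0)
```

In a training loop, the stored losses would be refreshed for the datapoints in each sampled batch, so the ranking tracks the latest known values.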
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our domain adaptation neural network algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most of existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such method to pre-train models without a clear multi-layer structure,e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
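As a rough illustration of direction (1) above, combining max and average pooling can be sketched as a weighted blend of the two over a single pooling region. This is a minimal sketch, not the authors' implementation: the function name `mixed_pool` and the mixing weight `alpha` are hypothetical, and `alpha` is passed in as a constant here, whereas in a trained network it would presumably be a learned parameter.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    `alpha` is a hypothetical mixing weight in [0, 1]:
    alpha = 1 gives pure max pooling, alpha = 0 gives pure average pooling.
    """
    if not region:
        raise ValueError("region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return alpha * max_part + (1.0 - alpha) * avg_part


# One flattened 2x2 pooling region of activations.
region = [1.0, 2.0, 3.0, 6.0]
blended = mixed_pool(region, 0.5)  # halfway between max (6.0) and mean (3.0)
```

Setting `alpha` to 0 or 1 recovers the two conventional pooling operations, which is what makes the blend a generalization of both.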
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
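The stochastic pooling procedure described above can be sketched for a single pooling region as follows. This is a minimal sketch under stated assumptions: activations are taken to be non-negative (for example, after a rectifier), the function name `stochastic_pool` is hypothetical, and the all-zero case returning 0.0 is a choice made here for illustration, not something the abstract specifies.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial distribution given by the activities in the region.
    Assumes non-negative activations.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # No activity in the region: nothing to sample from.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)          # seeded generator for reproducibility
region = [0.0, 1.0, 3.0, 0.0]   # value 3.0 is drawn with probability 3/4
sample = stochastic_pool(region, rng)
```

Because zero activations have zero probability, the sampled value is always one of the active units in the region.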
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
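Distillation generally trains a second network on the softened class probabilities of a first one. A common way to soften probabilities is a softmax with a temperature parameter, sketched below. The abstract above does not spell out this mechanism, so the temperature formulation, the function name `softmax_with_temperature`, and the example values are assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature.

    Higher temperatures flatten the distribution ("softer" targets);
    temperature = 1 is the ordinary softmax.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical teacher outputs
hard = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=20.0)
```

At the higher temperature the largest probability shrinks and the others grow, so the soft targets carry more information about relative class similarity than a near one-hot vector would.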
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network trained in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
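The relationship between input shape, kernel shape, zero padding, strides and output shape described above can be sketched along one spatial axis as follows. This is a minimal illustration using the commonly cited formulas, not code from the guide itself; the function names and the assumptions of symmetric zero padding, no dilation, and no output padding are ours.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution or pooling layer along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Uses o = floor((i + 2p - k) / s) + 1 (assumes no dilation).
    """
    if s < 1:
        raise ValueError("stride must be at least 1")
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of a transposed convolution along one axis.

    Uses o = s * (i - 1) + k - 2p (assumes no extra output padding).
    """
    if s < 1:
        raise ValueError("stride must be at least 1")
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5-wide kernel with no padding shrinks a 28-wide input.
    print(conv_output_size(28, 5))
    # A 3-wide kernel with padding 1 and stride 1 preserves the size.
    print(conv_output_size(28, 3, s=1, p=1))
    # The transposed layer maps the strided output size back to the input size.
    o = conv_output_size(7, 3, s=2)
    print(o, transposed_conv_output_size(o, 3, s=2))
```

When the stride does not evenly divide `i + 2p - k`, several input sizes map to the same output size, so the transposed formula recovers only the smallest of them.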
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition, and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
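The stochastic pooling procedure described above can be sketched for a single 2D feature map as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: activations are taken to be non-negative (for example after a rectifier), pooling regions do not overlap, the sampling probabilities are the activations normalized to sum to one within each region, and an all-zero region simply outputs zero. The function name `stochastic_pool_2d` is hypothetical.

```python
import numpy as np


def stochastic_pool_2d(x, size=2, rng=None):
    """Training-time stochastic pooling over non-overlapping size x size regions.

    Within each region, one activation is sampled with probability
    proportional to its value, and that activation is the region's output.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    h, w = x.shape
    if h % size or w % size:
        raise ValueError("feature map must tile evenly into pooling regions")
    if (x < 0).any():
        raise ValueError("activations must be non-negative")

    out = np.zeros((h // size, w // size))
    for r in range(h // size):
        for c in range(w // size):
            region = x[r * size:(r + 1) * size, c * size:(c + 1) * size].ravel()
            total = region.sum()
            if total > 0:
                # Multinomial draw: probabilities are the normalized activities.
                out[r, c] = rng.choice(region, p=region / total)
            # An all-zero region keeps its output at zero (our assumption).
    return out


if __name__ == "__main__":
    fmap = np.array([[0.0, 3.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 0.0],
                     [1.0, 1.0, 2.0, 2.0],
                     [1.0, 1.0, 2.0, 2.0]])
    print(stochastic_pool_2d(fmap, size=2, rng=np.random.default_rng(0)))
```

Unlike max pooling, a region with several strong activations can output any of them on a given pass, which is the source of the regularizing noise.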
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
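The pooling interpretation in the abstract above can be checked numerically. The following is only a sketch, assuming the normalized $L_p$ norm has the form (mean of |z_i|^p)^(1/p) over the projected inputs z_i; the function name `lp_unit` and the sample values are hypothetical and not taken from the paper.

```python
import numpy as np

def lp_unit(z, p):
    """Assumed form of a normalized L_p norm: (mean(|z_i|^p))^(1/p)."""
    z = np.abs(np.asarray(z, dtype=float))
    return float(np.mean(z ** p) ** (1.0 / p))

# Hypothetical projections feeding one unit.
z = [1.0, -2.0, 3.0, -4.0]

avg = lp_unit(z, 1)         # p = 1: average of absolute values
rms = lp_unit(z, 2)         # p = 2: root-mean-square
near_max = lp_unit(z, 200)  # large p: approaches max |z_i| from below

print(avg, rms, near_max)
```

Under this assumed form, a single order parameter p moves the unit between average, root-mean-square and (in the limit) max pooling, which is the generalization the abstract describes.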
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
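The masking idea in the abstract above can be illustrated for a single hidden layer. This is a minimal sketch, not the authors' implementation: it assumes the commonly described construction in which each hidden unit is given a random "degree" and connections are kept only when they respect the input ordering. The names `made_masks`, `in_deg` and `hid_deg` are hypothetical.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumption: input d has degree d (1..D); each hidden unit gets a random
    degree in 1..D-1. Requires n_inputs >= 2.
    """
    rng = np.random.default_rng(seed)
    in_deg = np.arange(1, n_inputs + 1)
    hid_deg = rng.integers(1, n_inputs, size=n_hidden)  # upper bound is exclusive
    # Hidden unit k may see input d only if deg(k) >= d.
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(float)   # shape (H, D)
    # Output d may see hidden unit k only if d > deg(k).
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(float)   # shape (D, H)
    return mask_in, mask_out

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)

# paths[i, j] counts hidden-unit paths from input j to output i.
paths = mask_out @ mask_in
print(paths)
```

In use, the masks would be multiplied elementwise into the weight matrices before each forward pass. The path-count matrix is strictly lower triangular, so output i can depend only on inputs that come earlier in the ordering, which is the autoregressive constraint the abstract refers to.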
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
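The training signal described in the abstract above, a Kullback-Leibler divergence between the teacher's soft outputs and the student's predictions, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the helper names `softmax` and `kl_to_teacher` and the toy logits are hypothetical, and the divergence is averaged over frames. The last lines also recompute the relative WER improvement from the two figures quoted in the abstract.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def kl_to_teacher(teacher_probs, student_logits, eps=1e-12):
    """Mean KL(teacher || student) over frames; eps guards against log(0)."""
    student_probs = softmax(student_logits)
    kl = np.sum(
        teacher_probs * (np.log(teacher_probs + eps) - np.log(student_probs + eps)),
        axis=-1,
    )
    return float(kl.mean())

# Toy example: 2 frames, 3 output classes (values are made up).
teacher_logits = np.array([[2.0, 0.5, -1.0], [0.0, 1.0, 3.0]])
student_logits = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
teacher_probs = softmax(teacher_logits)  # soft targets produced by the teacher

loss_same = kl_to_teacher(teacher_probs, teacher_logits)  # student matches teacher
loss_diff = kl_to_teacher(teacher_probs, student_logits)  # student is uniform

# Relative WER improvement from the abstract's figures (4.54 -> 3.93).
rel_improvement = (4.54 - 3.93) / 4.54
print(loss_same, loss_diff, rel_improvement)
```

The divergence is zero when the student reproduces the teacher's distribution and positive otherwise, so minimizing it pulls the small model toward the teacher's soft alignments.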
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that compressive MBN combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding unsupervised learning that is both effective and efficient.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy from layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
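As an illustration of the encoding named above, the following is a minimal NumPy sketch of the Gramian Angular Summation and Difference Fields. It assumes min-max rescaling of the series to [-1, 1] and the usual definitions GASF[i, j] = cos(phi_i + phi_j) and GADF[i, j] = sin(phi_i - phi_j) with phi = arccos(x); the function names are hypothetical, and the abstract itself does not spell out these details.

```python
import numpy as np

def _polar_angles(series):
    """Rescale a 1-D series to [-1, 1] and return phi = arccos(x).

    Assumes the series is not constant (otherwise max == min and the
    rescaling divides by zero).
    """
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    x = (2.0 * x - hi - lo) / (hi - lo)
    x = np.clip(x, -1.0, 1.0)  # guard against floating-point overshoot
    return np.arccos(x)

def gasf(series):
    """Gramian Angular Summation Field: cos(phi_i + phi_j)."""
    phi = _polar_angles(series)
    return np.cos(phi[:, None] + phi[None, :])

def gadf(series):
    """Gramian Angular Difference Field: sin(phi_i - phi_j)."""
    phi = _polar_angles(series)
    return np.sin(phi[:, None] - phi[None, :])

image = gasf([1.0, 2.0, 3.0, 4.0])  # a 4x4 symmetric "image" of the series
```

The diagonal of the GASF equals cos(2 phi_i) = 2 x_i^2 - 1, so the rescaled series can be read back from the diagonal when its sign is known (for example on [0, 1] rescaled data), which is consistent with the bijection property the abstract mentions.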
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
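The factorization stated above can be written out step by step. This is a sketch that uses only the symbols of the abstract: marginalizing over $H$ leaves a single term, because the decoder is assumed to put no mass on $x$ under any code $h \neq f(x)$.

```latex
\begin{align*}
P(X = x) &= \sum_{h} P(X = x \mid H = h)\, P(H = h) \\
         &= P(X = x \mid H = f(x))\, P(H = f(x)), \\
\log P(X = x) &= \underbrace{\log P(X = x \mid H = f(x))}_{\text{reconstruction log-likelihood}}
               + \underbrace{\log P(H = f(x))}_{\text{regularizer on } h = f(x)} .
\end{align*}
```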
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
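A gradient reversal layer can be pictured as a layer that is the identity in the forward pass and flips the sign of the gradient in the backward pass. The sketch below is framework-free and illustrative only: the class name, the explicit `forward`/`backward` methods, and the scaling factor `lam` are assumptions for this example and are not taken from the abstract.

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass; multiplies gradients by -lam on the way back."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical trade-off factor for the domain loss

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return features

    def backward(self, grad_output):
        # The feature extractor receives the negated (scaled) gradient, so it
        # is pushed to make the domains harder to tell apart.
        return -self.lam * grad_output

layer = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0])
out = layer.forward(features)
grad_to_features = layer.backward(np.array([2.0, 4.0]))
```

In an automatic-differentiation framework the same behaviour would normally be expressed as a custom gradient for an identity operation rather than as a hand-written class.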
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
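The analysis described above amounts to evaluating the objective along the straight line between the initial and final parameter vectors. A minimal sketch follows, assuming the parameters are flattened into one NumPy vector; `loss_along_line` and the toy quadratic loss are hypothetical stand-ins for a real network and its training loss.

```python
import numpy as np

def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at theta(a) = (1 - a) * theta_init + a * theta_final."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas]
    )
    return alphas, losses

# Toy stand-in for a training loss: squared distance to a known solution.
target = np.array([1.0, -2.0])

def toy_loss(theta):
    return float(np.sum((theta - target) ** 2))

alphas, losses = loss_along_line(toy_loss, np.zeros(2), target)
```

Plotting `losses` against `alphas` gives the one-dimensional cross-section of the objective; a curve without bumps corresponds to a path that meets no significant obstacle.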
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
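The rank-based strategy in the last sentences can be sketched in a few lines of NumPy. The parametrization below is an assumption made for illustration: the hypothetical `pressure` argument sets how much more likely the highest-loss datapoint is than a datapoint at the far end of the ranking, and the function names are not from the abstract.

```python
import numpy as np

def rank_selection_probs(latest_losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the largest latest known loss.
    """
    losses = np.asarray(latest_losses, dtype=float)
    n = losses.size
    order = np.argsort(-losses)          # indices sorted by decreasing loss
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)          # ranks[i] = rank of datapoint i
    weights = np.exp(-np.log(pressure) * ranks / n)
    return weights / weights.sum()

def sample_batch(latest_losses, batch_size, rng, pressure=100.0):
    """Draw a batch of distinct indices according to the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, pressure)
    return rng.choice(probs.size, size=batch_size, replace=False, p=probs)

rng = np.random.default_rng(0)
batch = sample_batch([0.1, 2.0, 0.5, 1.0], batch_size=3, rng=rng)
```

After each training step the stored losses of the datapoints in the batch would be refreshed, so the ranking tracks the latest known values.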
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
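For orientation, the basic SVRG update analyzed above can be sketched as follows. This is a generic version under stated assumptions, not the specific variant or step-size schedule of the paper: the function `svrg`, its defaults, and the toy least-squares problem (which is convex and chosen only because its solution is known) are all illustrative.

```python
import numpy as np

def svrg(grad_i, w0, n, lr=0.1, epochs=20, inner_steps=None, rng=None):
    """Stochastic variance reduced gradient for (1/n) * sum_i f_i(w).

    grad_i(w, i) must return the gradient of the i-th component f_i at w.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    inner_steps = 2 * n if inner_steps is None else inner_steps
    w = np.array(w0, dtype=float)
    for _ in range(epochs):
        snapshot = w.copy()
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced estimate: unbiased for the full gradient at w.
            w -= lr * (grad_i(w, i) - grad_i(snapshot, i) + full_grad)
    return w

# Toy finite-sum problem: f_i(w) = 0.5 * (a_i . w - b_i)^2 with unit-norm rows a_i.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
A /= np.linalg.norm(A, axis=1, keepdims=True)
w_true = np.array([1.0, -2.0, 0.5])
b = A @ w_true

def grad_i(w, i):
    return (A[i] @ w - b[i]) * A[i]

w_hat = svrg(grad_i, np.zeros(3), n=len(b), rng=rng)
```

The correction term `grad_i(snapshot, i) - full_grad` has zero mean over `i`, which is what reduces the variance of the stochastic step relative to plain SGD.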
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
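A minimal sketch of the stochastic rounding idea mentioned in the abstract above: a value is rounded to one of its two neighbouring fixed-point grid values, with the probability of rounding up equal to its fractional distance from the lower neighbour. The function name `stochastic_round`, the `frac_bits` parameter, and the omission of word-length saturation are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with step 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the rounding is unbiased
    in expectation. Saturation to a finite word length is not modelled.
    """
    scale = 1 << frac_bits           # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)       # nearest grid index at or below x
    prob_up = scaled - lower         # in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale
```

Averaging many stochastically rounded copies of the same value should approach the original value, which is the property that deterministic round-to-nearest lacks for small updates.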
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
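The rescaling the abstract above refers to can be illustrated on a tiny ReLU network: multiplying the incoming weights of a hidden unit by c > 0 and dividing its outgoing weight by c leaves the output unchanged. This sketch only demonstrates that invariance; it is not an implementation of Path-SGD. All names are hypothetical, and the form of `path_norm_sq` (sum over input-to-output paths of the product of squared weights) is an assumption about the path-wise regularizer rather than something stated in the abstract.

```python
import math


def relu(z):
    return max(0.0, z)


def forward(x, w_in, w_out):
    """One hidden ReLU layer with a scalar output.

    w_in[j] holds the incoming weights of hidden unit j,
    w_out[j] its outgoing weight.
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return sum(v * h for v, h in zip(w_out, hidden))


def rescale_unit(w_in, w_out, unit, c):
    """Scale one unit's incoming weights by c > 0 and its outgoing weight by 1/c."""
    new_in = [list(row) for row in w_in]
    new_out = list(w_out)
    new_in[unit] = [c * w for w in new_in[unit]]
    new_out[unit] = new_out[unit] / c
    return new_in, new_out


def path_norm_sq(w_in, w_out):
    """Sum over paths of the product of squared weights (assumed form)."""
    return sum((w * v) ** 2 for row, v in zip(w_in, w_out) for w in row)


x = [1.0, -2.0]
w_in = [[0.5, -1.0], [2.0, 0.25]]
w_out = [1.5, -0.5]
w_in2, w_out2 = rescale_unit(w_in, w_out, unit=0, c=4.0)

# Both the network output and the path-wise quantity are unchanged.
same_output = math.isclose(forward(x, w_in, w_out), forward(x, w_in2, w_out2))
same_path_norm = math.isclose(path_norm_sq(w_in, w_out), path_norm_sq(w_in2, w_out2))
```

The plain Euclidean norm of the weights does change under this rescaling, which is one way to see why a gradient step measured in that norm is not invariant to it.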
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
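As a small illustration of the kind of relationship the guide above covers, the sketch below computes the output size along one axis for a convolution and for the corresponding transposed convolution. The function names are hypothetical; the formulas are the standard ones for input size i, kernel size k, zero padding p and stride s, without dilation or output padding, which is an assumption of this sketch.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size of a convolution along one axis: floor((i + 2p - k) / s) + 1."""
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output size of the matching transposed convolution: s * (i - 1) + k - 2p."""
    return s * (i - 1) + k - 2 * p


# A 3-wide kernel with padding 1 and stride 2 maps 5 -> 3,
# and the transposed layer maps 3 -> 5 again.
down = conv_output_size(5, k=3, p=1, s=2)
up = transposed_conv_output_size(down, k=3, p=1, s=2)
```

When `i + 2p - k` is not a multiple of `s`, several input sizes share the same output size, so the transposed formula recovers only the smallest of them.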
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
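One way to picture a combination of max and average pooling is a convex mix of the two over a single pooling region. The sketch below assumes that form, with a mixing proportion `a` that would be learned in practice but is passed in as a fixed number here. The function name and this exact parametrization are illustrative assumptions; the abstract does not spell out either of its two strategies.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one region (hypothetical helper).

    a = 1.0 gives max pooling, a = 0.0 gives average pooling.
    In a trainable layer, `a` would be a learned parameter.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return a * max_val + (1.0 - a) * avg_val


region = [1.0, 2.0, 3.0, 6.0]
blended = mixed_pool(region, a=0.5)   # halfway between average (3.0) and max (6.0)
```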
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
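The sampling step described in the abstract above can be sketched for a single pooling region: each activation is picked with probability proportional to its value. The function name, the requirement that activations be non-negative (for example, outputs of a ReLU), and the choice to return 0.0 for an all-zero region are assumptions of this sketch.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region (hypothetical helper).

    Each activation a_i is chosen with probability a_i / sum(region),
    i.e. from a multinomial distribution given by the activities.
    Assumes all activations are non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        return 0.0                    # nothing active in this region
    return rng.choices(region, weights=region, k=1)[0]
```

With a region such as `[1.0, 0.0, 3.0]`, the value 3.0 should be returned about three times in four, and the inactive entry is never selected.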
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
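Frame stacking and a reduced frame rate, as mentioned in the abstract above, can be sketched as concatenating a few consecutive feature frames and then keeping only every few stacked frames. The function name, the default `stack` and `stride` values, and the choice to stack each frame with the frames that follow it are illustrative assumptions, not settings taken from the paper.

```python
def stack_and_subsample(frames, stack=3, stride=3):
    """Concatenate `stack` consecutive frames, advancing `stride` frames each step.

    frames: list of per-frame feature vectors (lists of floats).
    Returns a shorter list of longer vectors, so the model processes
    fewer time steps than there are input frames.
    """
    if stack < 1 or stride < 1:
        raise ValueError("stack and stride must be positive")
    stacked = []
    for start in range(0, len(frames) - stack + 1, stride):
        window = frames[start:start + stack]
        # Flatten the window into one feature vector.
        stacked.append([value for frame in window for value in frame])
    return stacked


frames = [[float(t)] for t in range(7)]          # 7 one-dimensional frames
reduced = stack_and_subsample(frames, stack=3, stride=2)
```

Here seven input frames become three stacked frames of three values each, so a recurrent model would take three steps instead of seven.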
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
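The pooling over a group of linear transformations described in the abstract above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the toy weights, and in particular the softmax sampling rule with its `lam` parameter are assumptions made here, since the abstract does not state how probout chooses among the linear pieces.

```python
import math
import random


def maxout(x, weights, biases):
    """Return (max activation, all activations) for a group of affine maps of x."""
    pre = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
           for w, b in zip(weights, biases)]
    return max(pre), pre


def sample_piece(pre, lam=1.0, rng=random):
    """Hypothetical probabilistic variant: pick one activation with
    probability proportional to exp(lam * activation)."""
    top = max(pre)  # subtract the max for numerical stability
    exps = [math.exp(lam * (z - top)) for z in pre]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = rng.choices(range(len(pre)), weights=probs, k=1)[0]
    return pre[idx], probs


x = [1.0, -2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # three linear pieces
biases = [0.0, 0.0, 0.5]

best, activations = maxout(x, weights, biases)
sampled, probs = sample_piece(activations, lam=2.0, rng=random.Random(0))
```

Under this assumed rule, a large `lam` concentrates the probability on the largest activation, so the sampled output approaches the ordinary maxout output.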
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does deep learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
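The claim in the abstract above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below assumes the projections feeding the unit have already been computed and are passed in as a plain list; the function name `lp_unit` is made up for this illustration.

```python
def lp_unit(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p)."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


projections = [3.0, -4.0]

mean_abs = lp_unit(projections, 1)      # average of absolute values
rms = lp_unit(projections, 2)           # root-mean-square
near_max = lp_unit(projections, 200)    # approaches max(|v|) as p grows
```

With p = 1 the unit returns the mean absolute value, with p = 2 the root-mean-square, and for large p the result approaches the largest absolute input from below, which is why a learned order p can interpolate between these pooling operators.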
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
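The masking idea in the abstract above, where each output may depend only on earlier inputs in the ordering, can be illustrated with binary masks for a single hidden layer. The abstract does not spell out how the masks are built; the sketch below assumes one common construction in which every hidden unit is given an integer "degree" and connections are allowed only when the degrees are compatible. The function names and the example degrees are hypothetical.

```python
def build_masks(num_inputs, hidden_degrees):
    """Build input->hidden and hidden->output masks for inputs numbered 1..D.

    Assumed rule: a hidden unit with degree m may see inputs 1..m, and
    output d may see hidden units whose degree is strictly less than d.
    """
    dims = range(1, num_inputs + 1)
    mask_in = [[1 if m >= d else 0 for d in dims] for m in hidden_degrees]
    mask_out = [[1 if d > m else 0 for m in hidden_degrees] for d in dims]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the paths from each input (column) to each output (row)."""
    num_out, num_hidden = len(mask_out), len(mask_in)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_out)]
            for d in range(num_out)]


mask_in, mask_out = build_masks(4, hidden_degrees=[1, 2, 3, 1, 2])
paths = connectivity(mask_in, mask_out)
```

The path-count matrix is strictly lower triangular: output d has no path from input d or any later input, so each output can be read as a conditional probability given the preceding inputs only. The first output has no incoming paths at all, which in practice means it is modeled by its bias alone.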
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
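The abstract above relies on the fact that conjugate gradient needs only curvature-vector products and never the curvature matrix itself. The sketch below shows plain conjugate gradient written against a `matvec` callback. The small fixed matrix stands in for a curvature-vector product routine and is an assumption made for illustration; a real HF implementation would compute that product from the network, and none of the damping or mini-batching discussed in the abstract is shown.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, max_iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, using only products A @ v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # current search direction
    rs = dot(r, r)
    for _ in range(max_iters):
        if rs ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Stand-in for a curvature-vector product: a fixed 2x2 positive definite matrix.
A = [[4.0, 1.0], [1.0, 3.0]]


def curvature_vector_product(v):
    return [dot(row, v) for row in A]


step = conjugate_gradient(curvature_vector_product, [1.0, 2.0])
```

For this toy system the exact solution is (1/11, 7/11), and conjugate gradient reaches it in two iterations up to floating-point error.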
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a WER of 3.93 on the Wall Street Journal (WSJ) eval92 task, compared to a baseline WER of 4.54, a relative improvement of more than 13%.
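Two quantities from the abstract above are easy to make concrete: the Kullback-Leibler divergence between a teacher's soft targets and a student's outputs, and the relative WER improvement. In the sketch below the direction KL(teacher || student), the toy distributions, and the function names are assumptions chosen for illustration; only the two WER figures come from the abstract.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions over the same labels."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher, student) if t > 0)


def relative_improvement(baseline, improved):
    """Fraction by which an error rate dropped relative to the baseline."""
    return (baseline - improved) / baseline


# Hypothetical per-frame posteriors over three labels.
teacher_posterior = [0.7, 0.2, 0.1]
student_posterior = [0.5, 0.3, 0.2]

loss = kl_divergence(teacher_posterior, student_posterior)
wer_gain = relative_improvement(4.54, 3.93)
```

The divergence is zero only when the student reproduces the teacher's distribution, and the WER figures give (4.54 - 3.93) / 4.54, roughly 0.134, consistent with the "more than 13%" stated in the abstract.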
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
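The Gramian Angular Field encoding mentioned above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation. It assumes the commonly cited definitions, which the abstract itself does not spell out: rescale the series to [0, 1], map each value x_i to an angle phi_i = arccos(x_i), then set GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j). The function names are hypothetical.

```python
import math


def rescale(series, low=0.0, high=1.0):
    """Min-max rescale a sequence of numbers into [low, high]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    return [low + (high - low) * (x - lo) / (hi - lo) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) matrices as nested lists (assumed definitions)."""
    x = rescale(series)
    # Clamp to [0, 1] to guard against floating-point rounding before acos.
    phi = [math.acos(min(1.0, max(0.0, v))) for v in x]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


def recover_from_gasf_diagonal(gasf):
    """Invert the encoding on [0, 1] data: GASF[i][i] = 2 * x_i**2 - 1."""
    return [math.sqrt(max(0.0, (gasf[i][i] + 1.0) / 2.0)) for i in range(len(gasf))]
```

Under these assumptions the diagonal of the GASF determines the rescaled series, which is one way to read the bijection property the abstract refers to.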
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
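The "straight path from initialization to solution" in the abstract above can be read as a linear interpolation between two parameter vectors, with the loss evaluated at points along the line. The following is a rough sketch under that reading, using flat lists of floats for parameters and a toy quadratic loss. The function names and the toy loss are hypothetical and not taken from the paper.

```python
def interpolate_parameters(theta_init, theta_final, alpha):
    """Point on the straight line: (1 - alpha) * theta_init + alpha * theta_final."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced alphas in [0, 1]; returns (alpha, loss) pairs."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [
        (alpha, loss_fn(interpolate_parameters(theta_init, theta_final, alpha)))
        for alpha in alphas
    ]


def toy_quadratic_loss(theta):
    """Stand-in for a network's training loss (an assumption for illustration)."""
    return sum((t - 1.0) ** 2 for t in theta)


# Example: loss profile between a hypothetical start point and solution.
profile = loss_along_path(toy_quadratic_loss, [3.0, -1.0], [1.0, 1.0], num_points=5)
```

With a real model, `loss_fn` would load the interpolated weights into the network and compute the training loss. A curve that falls steadily from alpha = 0 to alpha = 1 is the kind of obstacle-free profile the abstract describes.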
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
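The rank-based selection strategy described above can be sketched as follows. This is an illustrative sketch and not the authors' code. It assumes a per-rank decay factor `decay_rate` in (0, 1), so that the datapoint at rank r (rank 0 being the greatest latest-known loss) has selection weight `decay_rate ** r`. The paper's exact parameterization of the selection pressure may differ, and all names here are hypothetical.

```python
import random


def rank_selection_probabilities(latest_losses, decay_rate=0.5):
    """Selection probability per datapoint, decaying exponentially with loss rank."""
    if not 0.0 < decay_rate < 1.0:
        raise ValueError("decay_rate must be in (0, 1)")
    n = len(latest_losses)
    # Indices ordered from greatest to smallest latest known loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    weights = [decay_rate ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(latest_losses, batch_size, decay_rate=0.5, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = rank_selection_probabilities(latest_losses, decay_rate)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)
```

In a training loop, `latest_losses` would be refreshed for the datapoints in each processed batch, so the ranking tracks the most recent loss values. A `decay_rate` close to 1 approaches uniform sampling, while a small value concentrates selection on the hardest examples.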
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we first develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speed-up, with no loss in WER. These results suggest that even further speed-up is expected as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
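As a rough illustration of the stochastic rounding idea mentioned in the abstract above, the following is a minimal sketch in plain Python. The function name `stochastic_round_fixed`, the split into `frac_bits` fractional bits within a `word_bits`-wide signed word, and the saturation behaviour are all assumptions made for this example; they are not taken from the paper.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to a signed fixed-point grid using stochastic rounding.

    The grid spacing is 2**-frac_bits. x is rounded up to the next grid
    point with probability equal to its fractional distance from the grid
    point below, so the result equals x in expectation (before saturation).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # in [0, 1): probability of rounding up
    q = lower + (1 if rng.random() < p_up else 0)

    # Saturate to the range of a signed word_bits-wide integer (assumed policy).
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale
```

Unlike round-to-nearest, this scheme does not systematically discard updates smaller than half a grid step, since small values still round away from zero some of the time.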
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
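The relationships between input shape, kernel shape, zero padding, strides and output shape referred to in the abstract above can be sketched along a single spatial axis as follows. This is a minimal example using the commonly stated formulas; the function names and the `extra` parameter (the amount added to disambiguate the transposed output size when the stride does not divide evenly) are naming choices made here, not the guide's notation.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0, extra=0):
    """Output length of the transposed convolution along one axis.

    Inverts the shape mapping of conv_output_size with the same k, s, p.
    `extra` (assumed name) should be (orig_input + 2*p - k) % s to recover
    the original input size exactly.
    """
    return s * (i - 1) + k - 2 * p + extra


# A 5-wide input with a 3-wide kernel, stride 2, padding 1 gives 3 outputs,
# and the transposed convolution maps 3 back to 5.
out = conv_output_size(5, k=3, s=2, p=1)
back = transposed_conv_output_size(out, k=3, s=2, p=1)
```

Note that input sizes 5 and 6 both map to an output of 3 under these settings, which is why the transposed case needs the extra term to be unambiguous.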
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
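One simple way to combine max and average pooling, as the abstract above describes in general terms, is a convex combination with a mixing proportion. The sketch below assumes that form; the name `mixed_pool` and the scalar parameter `a` are hypothetical, and in a trained network `a` would be a learned parameter rather than a fixed argument.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    region: non-empty sequence of activations in the pooling window.
    a: mixing proportion in [0, 1]; a=1 gives max pooling, a=0 gives
       average pooling (assumed parametrization for this sketch).
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return a * max_val + (1.0 - a) * avg_val
```

For the window `[1, 2, 3, 6]`, the max is 6 and the average is 3, so `mixed_pool([1, 2, 3, 6], 0.5)` evaluates to 4.5.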
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
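The first direction above can be illustrated with a minimal sketch, assuming the simplest possible combination: a convex blend of max and average pooling controlled by one mixing coefficient. The function name `mixed_pool` and the coefficient `a` are illustrative choices, not taken from the abstract, and the coefficient is fixed here, whereas the abstract describes learning the pooling function.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is a hypothetical mixing coefficient in [0, 1]:
    a = 1.0 gives pure max pooling, a = 0.0 gives pure average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return a * max_val + (1.0 - a) * avg_val
```

In a trainable version, `a` would be a parameter updated by gradient descent alongside the network weights.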
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
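The sampling step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: activations are assumed non-negative (for example after a rectifier), the name `stochastic_pool` is hypothetical, and returning 0.0 for an all-zero region is an assumed convention.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. a multinomial distribution given by the normalized activities.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # All-zero region: nothing to sample (assumed convention).
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.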
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
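A small sketch of how dropout spawns child subnetworks from a parent, assuming independent Bernoulli masking of units. The helper names `spawn_child_mask` and `apply_mask` and the parameter `keep_prob` are hypothetical; the regularizer proposed in the abstract is not shown.

```python
import random


def spawn_child_mask(n_units, keep_prob, rng):
    """Draw a 0/1 mask over units; each mask defines one child subnetwork."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_units)]


def apply_mask(activations, mask):
    """Zero out the activations of the units dropped by the mask."""
    return [a * m for a, m in zip(activations, mask)]


# Two draws from the same noise process give two (usually different) children.
rng = random.Random(0)
parent_activations = [0.3, 1.2, 0.7, 2.0]
child_a = apply_mask(parent_activations, spawn_child_mask(4, 0.5, rng))
child_b = apply_mask(parent_activations, spawn_child_mask(4, 0.5, rng))
```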
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
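A minimal sketch of the normalized $L_p$ norm at the heart of the unit, assuming the normalization is a division by the number of inputs inside the root. The learned projections and any learned centering described in the abstract are omitted, and the function name `lp_pool` is hypothetical. It shows how the order $p$ recovers the conventional pooling operators mentioned above.

```python
import math


def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p).

    p = 1 gives the mean of absolute values, p = 2 gives the
    root-mean-square, and p = infinity gives max pooling over |v|.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if math.isinf(p):
        return max(abs(v) for v in values)
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)
```

As $p$ grows, the output moves from the average toward the maximum, which is why a model that learns $p$ per unit can interpolate between pooling types.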
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
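The masking idea described in the abstract above can be illustrated with a minimal sketch. It assumes a single hidden layer, the natural input ordering, and a random "degree" assigned to each hidden unit; the names `made_masks` and `connectivity` and the degree scheme are illustrative assumptions made here, not details stated in the abstract.

```python
import random


def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks so that output d depends only on inputs before d.

    Assumption of this sketch: each hidden unit gets a random degree in
    1..n_inputs-1 and may only see inputs whose (1-based) index is <= its degree.
    """
    if n_inputs < 2:
        raise ValueError("need at least two inputs")
    rng = random.Random(seed)
    degrees = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # mask_in[k][j] == 1 if hidden unit k may read input j+1
    mask_in = [[1 if degrees[k] >= d else 0 for d in range(1, n_inputs + 1)]
               for k in range(n_hidden)]
    # mask_out[i][k] == 1 if output i+1 may read hidden unit k (strict inequality)
    mask_out = [[1 if d > degrees[k] else 0 for k in range(n_hidden)]
                for d in range(1, n_inputs + 1)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the input-to-output paths allowed by the two masks."""
    n_inputs, n_hidden = len(mask_out), len(mask_in)
    return [[sum(mask_out[i][k] * mask_in[k][j] for k in range(n_hidden))
             for j in range(n_inputs)]
            for i in range(n_inputs)]


mask_in, mask_out = made_masks(n_inputs=5, n_hidden=8, seed=1)
paths = connectivity(mask_in, mask_out)
# Every entry on or above the diagonal is zero: no output sees itself or later inputs.
```

In a real network the masks would be multiplied element-wise into the weight matrices; the path count above only checks that the autoregressive constraint holds.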
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
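As a rough illustration of the two quantities in the abstract above, the sketch below computes a Kullback-Leibler divergence between a teacher and a student distribution and checks the reported relative WER improvement. The direction KL(teacher || student), the example posteriors, and the function names are assumptions made here for illustration; only the WER figures 4.54 and 3.93 come from the abstract.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions over the same classes."""
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(teacher, student) if p > 0.0)


def relative_improvement(baseline, new):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - new) / baseline


# Hypothetical per-frame posteriors (not taken from the paper).
teacher_posterior = [0.7, 0.2, 0.1]
student_posterior = [0.5, 0.3, 0.2]
loss = kl_divergence(teacher_posterior, student_posterior)

# WER figures quoted in the abstract: baseline 4.54, distilled model 3.93.
wer_gain = relative_improvement(4.54, 3.93)
print(f"KL loss: {loss:.4f}, relative WER improvement: {wer_gain:.1%}")
```

The relative improvement works out to (4.54 - 3.93) / 4.54, which is a little over 13%, consistent with the figure quoted in the abstract.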
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This not only helps us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggests that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
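The gradient reversal layer mentioned in the abstract above can be sketched framework-free as a layer that passes features through unchanged and flips the sign of the gradient on the way back. The class name, the list-based interface, and the `scale` factor are assumptions made for this illustration; the abstract does not specify them.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # `scale` is a hypothetical knob for the strength of the reversal.
        self.scale = scale

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, upstream_grad):
        # The feature extractor receives the negated domain-classifier gradient,
        # so a descent step makes the features less domain-discriminative.
        return [-self.scale * g for g in upstream_grad]


layer = GradientReversal(scale=1.0)
features_out = layer.forward([1.0, -2.0])
grad_to_features = layer.backward([0.5, -1.0])
```

In an autodiff framework the same behaviour would be expressed as a custom operation with an overridden gradient; the plain-Python version only shows the forward and backward rules.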
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction, in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further compress the multilayer bootstrap network into a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
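The analysis described in the abstract above evaluates the loss along the straight line between the initial and the final parameters. A minimal sketch of that idea follows, assuming the parameters are flattened into one NumPy vector; the function names and the toy quadratic loss are illustrative assumptions, not the paper's code.

```python
import numpy as np


def interpolate_params(theta_init, theta_final, alpha):
    """Point on the straight line from theta_init (alpha=0) to theta_final (alpha=1)."""
    return (1.0 - alpha) * theta_init + alpha * theta_final


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points between the two parameter vectors."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn(interpolate_params(theta_init, theta_final, a)) for a in alphas]
    )
    return alphas, losses


def toy_loss(theta):
    # Stand-in for a network's training loss (assumption for illustration only).
    return float(np.sum((theta - 1.0) ** 2))


alphas, losses = loss_along_path(toy_loss, np.zeros(3), np.ones(3))
```

Plotting `losses` against `alphas` for a real network would show whether the path crosses any barrier; for the toy loss used here the curve simply decreases.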
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
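The rank-based selection rule in the abstract above can be sketched as follows. The abstract only states that the selection probability decays exponentially with rank; the exact decay form used here, the parameter name `selection_pressure`, and the function names are assumptions made for illustration.

```python
import numpy as np


def selection_probabilities(latest_losses, selection_pressure=100.0):
    """Selection probability per datapoint, decaying exponentially with loss rank.

    Rank 0 is the datapoint with the largest latest known loss. With this
    (assumed) decay form, the most likely point is roughly
    `selection_pressure` times more likely than the least likely one.
    """
    latest_losses = np.asarray(latest_losses, dtype=np.float64)
    n = len(latest_losses)
    order = np.argsort(-latest_losses)      # indices sorted by descending loss
    ranks = np.empty(n, dtype=np.int64)
    ranks[order] = np.arange(n)             # rank of each datapoint
    weights = np.exp(-np.log(selection_pressure) * ranks / n)
    return weights / weights.sum()


def sample_batch(latest_losses, batch_size, rng, selection_pressure=100.0):
    """Draw a batch of distinct datapoint indices according to the rank-based rule."""
    probs = selection_probabilities(latest_losses, selection_pressure)
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)


rng = np.random.default_rng(0)
example_losses = np.array([0.1, 2.5, 0.7, 1.3, 0.05, 0.9])
example_batch = sample_batch(example_losses, batch_size=3, rng=rng)
```

After each training step the entries of `latest_losses` for the sampled indices would be refreshed with their new loss values, so the ranking tracks the most recent information available.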
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model. The model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
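Stochastic rounding, which the abstract above identifies as the key to 16-bit fixed-point training, can be sketched as follows. A value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounding is unbiased in expectation. The function name, the `frac_bits` and `word_bits` parameters, and the saturation behaviour are assumptions made for illustration.

```python
import numpy as np


def stochastic_round_fixed_point(x, frac_bits, rng, word_bits=16):
    """Quantize x to a signed fixed-point grid using stochastic rounding.

    The grid spacing is 2**-frac_bits. Each value is rounded down or up to a
    neighbouring grid point, rounding up with probability equal to how far it
    lies above the lower neighbour. Results saturate at the range of a signed
    `word_bits`-bit integer (an assumed overflow policy).
    """
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    prob_up = scaled - lower                      # in [0, 1)
    rounded = lower + (rng.random(scaled.shape) < prob_up)
    lo = -(2 ** (word_bits - 1))
    hi = 2 ** (word_bits - 1) - 1
    return np.clip(rounded, lo, hi) / scale


rng = np.random.default_rng(0)
samples = stochastic_round_fixed_point(np.full(100_000, 0.3), frac_bits=2, rng=rng)
```

With `frac_bits=2` the value 0.3 lands on either 0.25 or 0.5, and the sample mean stays close to 0.3, whereas round-to-nearest would always return 0.25.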
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
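The relationship between input shape, kernel shape, zero padding, strides and output shape that the abstract above refers to can be written down for one spatial dimension. The sketch below uses the standard formulas `o = floor((i + 2p - k) / s) + 1` for a convolution and `o = s * (i - 1) + k - 2p` for the corresponding transposed convolution without extra output padding. The function names are illustrative, and dilation is not covered.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, p: zero padding on each side, s: stride.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output length of a transposed convolution along one axis.

    Parameters describe the convolution being transposed; no extra output
    padding is added, so sizes lost to the floor in the forward formula are
    not recovered.
    """
    return s * (i - 1) + k - 2 * p


# "Same" padding with a 3-wide kernel keeps a length-5 input at length 5.
same = conv_output_size(5, k=3, p=1, s=1)
# A stride-2 convolution followed by its transpose restores the length here.
down = conv_output_size(7, k=3, p=1, s=2)
up = transposed_conv_output_size(down, k=3, p=1, s=2)
```

For two-dimensional layers the same function is applied independently to height and width.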
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
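The pooling interpretation in the abstract above can be illustrated with a minimal sketch of a normalized $L_p$ norm. It assumes the inputs are already the projected signals (the learned projections are omitted), and the function name `lp_unit` is hypothetical rather than taken from the paper.

```python
def lp_unit(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p).

    Assumption: `values` are the already-projected inputs to the unit;
    the projection weights described in the abstract are left out.
    """
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


inputs = [1.0, -2.0, 3.0, 4.0]

mean_abs = lp_unit(inputs, 1)    # p = 1: average of the magnitudes
rms = lp_unit(inputs, 2)         # p = 2: root-mean-square pooling
near_max = lp_unit(inputs, 200)  # large p: approaches max(|v|)

print(mean_abs, rms, near_max)
```

As the order `p` grows, the result moves from the average magnitude towards the maximum magnitude, which is the sense in which a single unit can cover average, root-mean-square and max pooling.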
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that has yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine the performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
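The masking idea in the abstract above can be sketched for a single hidden layer. This is a minimal illustration, not the authors' implementation: the degree-assignment rule (cycling hidden-unit degrees through 1..D-1) and the names `build_masks` and `connectivity` are assumptions made for the example.

```python
def build_masks(num_inputs, num_hidden):
    """Build 0/1 masks for a one-hidden-layer autoregressive autoencoder.

    Assumption: inputs are ordered 1..D and each hidden unit gets a degree
    in 1..D-1 by simple cycling (random assignment is another option).
    Requires num_inputs >= 2.
    """
    in_deg = list(range(1, num_inputs + 1))
    hid_deg = [1 + (k % (num_inputs - 1)) for k in range(num_hidden)]
    # Hidden unit k may see input d only if degree(k) >= degree(d).
    mask_in = [[1 if hid_deg[k] >= in_deg[d] else 0 for d in range(num_inputs)]
               for k in range(num_hidden)]
    # Output d may see hidden unit k only if degree(d) > degree(k).
    mask_out = [[1 if in_deg[d] > hid_deg[k] else 0 for k in range(num_hidden)]
                for d in range(num_inputs)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the paths from input j to output i through the masked layer."""
    num_hidden = len(mask_in)
    num_inputs = len(mask_out)
    return [[sum(mask_out[i][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for i in range(num_inputs)]


mask_in, mask_out = build_masks(num_inputs=4, num_hidden=6)
paths = connectivity(mask_in, mask_out)
for row in paths:
    print(row)
```

The printed matrix is strictly lower triangular: output `i` has a path only from inputs that come earlier in the ordering, so each output can be read as a conditional given the previous inputs.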
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
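The training signal described in the abstract above, a Kullback-Leibler divergence between the teacher's soft targets and the student's outputs, can be sketched for a single frame as follows. The probability values and the function name `kl_divergence` are illustrative assumptions, not figures or code from the paper.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions.

    Terms with zero teacher probability contribute nothing; `eps` is an
    assumed floor that guards against division by a zero student probability.
    """
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(teacher, student) if p > 0.0)


# Hypothetical per-frame posteriors over three output classes.
teacher_probs = [0.7, 0.2, 0.1]   # soft alignment from the large model
student_probs = [0.5, 0.3, 0.2]   # output of the small model

loss = kl_divergence(teacher_probs, student_probs)
self_loss = kl_divergence(teacher_probs, teacher_probs)
print(loss, self_loss)
```

The divergence is zero only when the student reproduces the teacher's distribution, so minimizing it over many frames pulls the small model towards the soft alignments.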
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBNs) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness of training in the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy from layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
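The exact factorization stated in the abstract above can be checked numerically on a toy discrete distribution. This is a minimal sketch under stated assumptions: the probability table `p_x`, the encoder `x // 2`, and the helper names are all hypothetical choices for illustration, and the decoder is taken to be the exact conditional rather than a learned network.

```python
from collections import defaultdict


def encoder(x):
    """Hypothetical deterministic discrete encoder f(x)."""
    return x // 2


# Assumed toy distribution over four discrete values.
p_x = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

# P(H = h): total mass of all x that the encoder maps to h.
p_h = defaultdict(float)
for x, prob in p_x.items():
    p_h[encoder(x)] += prob


def decoder_prob(x, h):
    """P(X = x | H = h): zero mass on any x whose code is not h."""
    return p_x[x] / p_h[h] if encoder(x) == h else 0.0


# P(X = x) recovered as P(X = x | H = f(x)) * P(H = f(x)).
reconstructed = {x: decoder_prob(x, encoder(x)) * p_h[encoder(x)] for x in p_x}
print(reconstructed)
```

The recovered values match `p_x` up to floating-point error, and the decoder assigns zero probability to any `x` paired with the wrong code, which is the capacity condition the abstract places on $P(X | H)$.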
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
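The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `selection_pressure` parameter (taken here to be roughly the ratio between the selection probability of the highest-loss and the lowest-loss datapoint) are assumptions made for the example.

```python
import math
import random


def rank_selection_probs(losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    losses: latest known loss value of each datapoint.
    selection_pressure: hypothetical knob; larger values favour
    high-loss datapoints more strongly.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(math.log(selection_pressure) / n)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = 1.0 / decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, rng, selection_pressure=100.0):
    """Draw datapoint indices (with replacement) according to rank-based probabilities."""
    probs = rank_selection_probs(losses, selection_pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_batch([0.1, 2.0, 0.5, 1.2], batch_size=8, rng=rng))
```

In a training loop, the stored losses would be refreshed whenever a datapoint is evaluated, so that the ranking follows the latest known values.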
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
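For reference, the basic SVRG update that the abstract above analyzes can be sketched as below. This is a generic textbook form for a scalar parameter, not the specific variant or step-size schedule from the paper; the names `svrg`, `grad_i`, `step`, `epochs`, and `inner_steps` are illustrative assumptions, and the toy objective in the usage part is a convex quadratic chosen only because its minimizer is known.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, rng):
    """Minimize (1/n) * sum_i f_i(w) for a scalar w with SVRG.

    grad_i(w, i) must return the gradient of the i-th component f_i at w.
    """
    w_snap = w0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient estimate.
            v = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= step * v
        w_snap = w
    return w_snap


if __name__ == "__main__":
    # Toy problem: f_i(w) = 0.5 * (w - a_i)^2, minimized at the mean of a.
    a = [1.0, 2.0, 3.0, 6.0]
    w = svrg(lambda w, i: w - a[i], n=len(a), w0=0.0, step=0.1,
             epochs=5, inner_steps=50, rng=random.Random(0))
    print(w)
```

The correction term `grad_i(w_snap, i) - full_grad` has zero mean over `i`, so the estimate `v` stays unbiased while its variance shrinks as `w` and `w_snap` approach each other.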
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain those for full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
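To make the rounding scheme mentioned in the abstract above concrete, here is a minimal sketch of stochastic rounding to a signed fixed-point grid. The function name, the default split into 8 fractional bits of a 16-bit word, and the saturation behaviour are assumptions chosen for illustration, not details taken from the paper.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with stochastic rounding.

    The value is rounded up with probability equal to its fractional
    remainder on the fixed-point grid, so the result is unbiased in
    expectation (as long as no saturation occurs).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    q = lower + (1 if rng.random() < (scaled - lower) else 0)
    # Saturate to the representable range of a signed word_bits-wide integer.
    max_q = (1 << (word_bits - 1)) - 1
    min_q = -(1 << (word_bits - 1))
    q = max(min_q, min(max_q, q))
    return q / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed(0.3, frac_bits=2, rng=rng) for _ in range(10)]
    print(samples)  # each entry is either 0.25 or 0.5
```

Round-to-nearest would always map 0.3 to 0.25 on this grid, whereas stochastic rounding returns 0.5 a fraction of the time so that the average stays close to 0.3.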
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
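The normalized L_p norm described in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming the simplified form (mean of |x_i|^p)^(1/p) over the inputs to one unit; the function name lp_unit and the sample values are hypothetical, and the learned projections and per-unit orders mentioned in the abstract are left out.

```python
import math


def lp_unit(values, p):
    """Normalized L_p norm of a list of inputs: (mean(|x_i|^p))^(1/p).

    Simplified sketch: the projections that feed the unit and the learning
    of the order p are not modeled here.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if not values:
        raise ValueError("values must be non-empty")
    mean_power = sum(abs(x) ** p for x in values) / len(values)
    return mean_power ** (1.0 / p)


# Hypothetical inputs to a single unit.
inputs = [1.0, -2.0, 3.0]

average_abs = lp_unit(inputs, 1)   # p = 1: average of absolute values
root_mean_sq = lp_unit(inputs, 2)  # p = 2: root-mean-square pooling
near_max = lp_unit(inputs, 100)    # large p: approaches max(|x_i|)
```

Varying p moves the unit between average-like and max-like pooling, which is the sense in which the abstract calls it a generalization of conventional pooling operators.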
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that has yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
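The masking idea in the abstract above (each input is reconstructed only from previous inputs in a given ordering) can be sketched for a single hidden layer. This is a minimal illustration, not the authors' code: the function names, the fixed natural ordering, and the rule for assigning hidden-unit degrees are assumptions made for the example.

```python
import random


def build_autoregressive_masks(num_inputs, num_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoencoder.

    Assumed scheme: input d has degree d (1-based); each hidden unit gets a
    random degree m in [1, num_inputs - 1]. A hidden unit may see inputs with
    degree <= m, and output d may see hidden units with degree < d.
    """
    if num_inputs < 2:
        raise ValueError("need at least two inputs")
    rng = random.Random(seed)
    input_degrees = list(range(1, num_inputs + 1))
    hidden_degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # mask_in[k][j] == 1 if hidden unit k may connect to input j.
    mask_in = [[1 if m >= d else 0 for d in input_degrees] for m in hidden_degrees]
    # mask_out[i][k] == 1 if output i may connect to hidden unit k.
    mask_out = [[1 if d > m else 0 for m in hidden_degrees] for d in input_degrees]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the paths from input j to output i through the hidden layer."""
    num_outputs = len(mask_out)
    num_hidden = len(mask_in)
    return [
        [sum(mask_out[i][k] * mask_in[k][j] for k in range(num_hidden))
         for j in range(num_outputs)]
        for i in range(num_outputs)
    ]


mask_in, mask_out = build_autoregressive_masks(num_inputs=5, num_hidden=16)
paths = connectivity(mask_in, mask_out)
# paths[i][j] is zero whenever j >= i, so output i depends only on earlier inputs.
```

In a real network the masks would be multiplied elementwise with the weight matrices before each forward pass; the path count is only a way to check that the autoregressive constraint holds.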
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
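One common way to realize the distillation described in the abstract above is to train the small model on the softened output distribution of the large model. The sketch below assumes a temperature-scaled softmax and a cross-entropy loss against the soft targets; the abstract itself does not spell out these details, and the logits, the temperature value, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


# Hypothetical logits for one example from a large teacher and a small student.
teacher_logits = [8.0, 2.0, 0.5]
student_logits = [5.0, 1.0, 1.0]
temperature = 4.0  # assumed value, chosen only for illustration

hard_like = softmax_with_temperature(teacher_logits, 1.0)
soft_targets = softmax_with_temperature(teacher_logits, temperature)
student_probs = softmax_with_temperature(student_logits, temperature)

# Loss the student would minimize with respect to its own parameters.
distillation_loss = cross_entropy(soft_targets, student_probs)
```

Raising the temperature spreads probability mass onto the non-maximal classes, so the soft targets carry information about which wrong answers the teacher considers plausible.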
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
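The encoding named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the usual definitions: the series is rescaled to [0, 1], each value is mapped to an angle by arccos, and the fields are GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j). Function names and the example series are hypothetical.

```python
import math

def rescale01(series):
    """Min-max rescale to [0, 1]; assumes the series is not constant."""
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]

def gramian_angular_fields(series):
    """Return the rescaled series with its GASF and GADF matrices."""
    x = rescale01(series)
    phi = [math.acos(v) for v in x]  # polar-coordinate angles
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return x, gasf, gadf

series = [1.0, 3.0, 2.0, 5.0]
x, gasf, gadf = gramian_angular_fields(series)

# On 0/1 rescaled data the GASF diagonal is cos(2*phi) = 2*x^2 - 1,
# so the rescaled series can be read back from the diagonal.
recovered = [math.sqrt(max(0.0, (gasf[i][i] + 1.0) / 2.0)) for i in range(len(x))]
```

The last line illustrates the bijection property the abstract mentions: on data rescaled to [0, 1], the main diagonal of the GASF is enough to reconstruct the rescaled series.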
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout-based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout-based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction, in which each layer of the network is a group of mutually independent k-centers clusterings and the centers of each clustering are randomly sampled data points. We further reduce the size of the multilayer bootstrap network by compressing it with a neural network trained in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
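The ranking strategy described above can be sketched as below. The abstract only says that the selection probability decays exponentially with rank. The exact parametrization here is an assumption: a hypothetical `pressure` argument sets the approximate ratio between the probabilities of the highest-loss and lowest-loss datapoints. Sampling with replacement is also an assumption made for brevity.

```python
import math
import random

def selection_probabilities(losses, pressure=100.0):
    """Probability per datapoint, decaying exponentially with loss rank."""
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)

# Hypothetical latest known loss for each of five datapoints.
latest_losses = [0.1, 2.3, 0.7, 1.5, 0.05]
probs = selection_probabilities(latest_losses)
batch = sample_batch(latest_losses, batch_size=3, rng=random.Random(0))
```

In a training loop the stored losses would be refreshed for the datapoints in each batch after the forward pass, so the ranking follows the model as it learns.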
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
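For orientation, the basic SVRG update that the abstract above analyzes can be sketched as follows. Each epoch stores a snapshot of the parameters and the full gradient at that snapshot, and each inner step corrects one stochastic gradient with them. The toy objective is a scalar least-squares problem, which is convex and was chosen only so the result is easy to check. It is not one of the nonconvex problems studied in the paper. All names, the step size, and the loop lengths are hypothetical.

```python
import random

def svrg(grad_i, n, w0, step, epochs, inner_steps, rng):
    """Minimize (1/n) * sum_i f_i(w) for scalar w using SVRG updates."""
    w = w0
    for _ in range(epochs):
        snapshot = w
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced estimate: unbiased for the full gradient at w.
            g = grad_i(w, i) - grad_i(snapshot, i) + full_grad
            w = w - step * g
    return w

# Toy finite sum: f_i(w) = 0.5 * (a_i * w - b_i) ** 2
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.1, 5.9, 8.2]

def grad_i(w, i):
    """Gradient of the i-th component function at w."""
    return a[i] * (a[i] * w - b[i])

# Closed-form minimizer of the toy problem, used only for comparison.
w_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
w_hat = svrg(grad_i, len(a), w0=0.0, step=0.02, epochs=30, inner_steps=8,
             rng=random.Random(0))
```

The correction term `grad_i(snapshot, i) - full_grad` has zero mean, so the estimate stays unbiased, and its variance shrinks as `w` and the snapshot approach a stationary point.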
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
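The stochastic rounding scheme mentioned in the abstract above can be illustrated with a minimal NumPy sketch. The function name `stochastic_round` and the `frac_bits` parameter are illustrative assumptions, not taken from the paper, and saturation to a fixed word length is deliberately omitted. The idea shown is that a value is rounded up to the next fixed-point grid point with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, rng=None):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Hypothetical helper: rounds up with probability equal to the
    fractional remainder, so E[stochastic_round(x)] == x.
    Word-length saturation is not modelled here.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Probability of rounding up is the distance from the lower grid point.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    return (lower + round_up) / scale
```

With `frac_bits=2` the grid step is 0.25, so an input of 0.3 is mapped to either 0.25 or 0.5, and the average over many draws approaches 0.3.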
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
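The relationship between input shape, kernel shape, zero padding, strides and output shape described in the abstract above can be sketched along one spatial axis as follows. This is a minimal illustration using the standard output-size formulas; the function and parameter names (`conv_output_size`, `transposed_conv_output_size`, `output_padding`) are assumptions chosen for this example, not names from the guide.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution (or pooling) along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0, output_padding=0):
    """Output length of the transposed convolution along one axis.

    output_padding resolves the ambiguity that arises when several
    input sizes map to the same convolution output size (s > 1).
    """
    return s * (i - 1) + k - 2 * p + output_padding
```

For example, a size-5 input with a size-3 kernel, stride 2 and padding 1 gives an output of size 3, and the transposed convolution with the same settings maps size 3 back to size 5. A size-6 input also maps to size 3 under those settings, which is why `output_padding=1` is needed to recover it.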
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
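The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region as follows. This is a minimal illustration, not the authors' implementation: the function name `stochastic_pool_region` is hypothetical, activations are assumed to be non-negative (for example, after a ReLU), and the all-zero region is handled by returning 0 as an assumption of this sketch.

```python
import numpy as np

def stochastic_pool_region(activations, rng=None):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its
    value (a multinomial over the region). Assumes non-negative inputs.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(activations, dtype=np.float64).ravel()
    total = a.sum()
    if total == 0:
        # Assumption of this sketch: an all-zero region pools to zero.
        return 0.0
    probs = a / total
    return float(rng.choice(a, p=probs))
```

For a region holding the values 1 and 3, the value 3 is returned about three quarters of the time, whereas max pooling would always return 3 and average pooling would always return 2.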
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. This explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
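The pooling interpretation in the abstract above can be illustrated with a minimal sketch of a normalized $L_p$ norm. This is a simplification under stated assumptions: the function name `lp_pool` is hypothetical, the learned projections described in the abstract are omitted, and $p$ is passed in rather than learned.

```python
import math


def lp_pool(values, p):
    """Normalized L_p norm of a group of inputs: (mean(|v|^p))^(1/p).

    Hypothetical helper for illustration only; it omits the learned
    projections that feed the unit in the abstract's description.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


inputs = [1.0, -2.0, 3.0, -4.0]

mean_abs = lp_pool(inputs, 1)    # p = 1: average of the absolute values
rms = lp_pool(inputs, 2)         # p = 2: root-mean-square pooling
near_max = lp_pool(inputs, 200)  # large p: approaches max(|v|) from below
```

With a small order the unit behaves like average pooling, and as the order grows it moves towards max pooling, which is the sense in which a single unit with a tunable $p$ covers several conventional pooling operators.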
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
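The masking idea in the abstract above can be sketched for a single hidden layer. The abstract does not spell out how the masks are built, so the degree-based rule below is an assumption: each input gets a degree equal to its position in the ordering, each hidden unit gets a degree, and a connection is kept only if it cannot leak information from the input being predicted or any later one. The names `made_masks` and `reachable_inputs` are hypothetical.

```python
def made_masks(n_inputs, hidden_degrees):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumed rule (not stated in the abstract): input d has degree d
    (1-based); hidden unit k may read inputs with degree <= its own
    degree; output d may read hidden units with degree < d.
    """
    in_degrees = list(range(1, n_inputs + 1))
    # mask_in[k][i] == 1 if hidden unit k may read input i
    mask_in = [[1 if hk >= d else 0 for d in in_degrees]
               for hk in hidden_degrees]
    # mask_out[o][k] == 1 if output o may read hidden unit k
    mask_out = [[1 if d > hk else 0 for hk in hidden_degrees]
                for d in in_degrees]
    return mask_in, mask_out


def reachable_inputs(mask_in, mask_out):
    """reach[o][i] is True if some unmasked path links input i to output o."""
    n_hidden = len(mask_in)
    n_inputs = len(mask_in[0])
    return [[any(mask_out[o][k] and mask_in[k][i] for k in range(n_hidden))
             for i in range(n_inputs)]
            for o in range(len(mask_out))]


# Four inputs, six hidden units with degrees cycling through 1..3.
mask_in, mask_out = made_masks(4, [1, 2, 3, 1, 2, 3])
reach = reachable_inputs(mask_in, mask_out)
```

In this sketch output 0 depends on no input at all and output `o` depends only on inputs `0..o-1`, so each output can be read as a conditional given the preceding inputs in the ordering. In a real network the masks would be multiplied elementwise into the weight matrices.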
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
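The relative improvement quoted in the abstract above follows directly from the two reported word error rates. A short check, with `relative_improvement` as a hypothetical helper name:

```python
def relative_improvement(baseline, new):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - new) / baseline


# WER figures as reported in the abstract (WSJ eval92).
rel = relative_improvement(baseline=4.54, new=3.93)
print(f"{rel:.1%}")  # prints 13.4%
```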
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing system. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
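The stochastic rounding scheme named above can be illustrated with a minimal sketch. It assumes a fixed-point grid with step 2**-frac_bits; the function name `stochastic_round` and the default of 8 fractional bits are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    x is rounded up with probability equal to its fractional distance
    from the lower grid point, so the rounding is unbiased in expectation.
    (Hypothetical helper; overflow/saturation handling is omitted.)
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # nearest grid point at or below x
    prob_up = scaled - lower        # in [0, 1)
    if rng.random() < prob_up:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; the average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, whereas the stochastic variant preserves the value on average, which is one plausible reading of why the rounding scheme matters for small gradient updates.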
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
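The relationship between input shape, kernel shape, zero padding, strides and output shape can be sketched for one spatial dimension as follows. This uses the commonly stated formulas for a convolution and for the transposed convolution that undoes its shape; the function names are hypothetical, and dilation and output padding are assumed absent.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length of a convolution (or pooling) along one axis.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output length of the transposed convolution associated with a
    convolution of kernel k, padding p and stride s (no output padding)."""
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    o = conv_output_size(5, 3, p=1, s=2)
    print(o)                                            # size after the convolution
    print(transposed_conv_output_size(o, 3, p=1, s=2))  # size after mapping back
```

The round trip recovers the original input size only when `i + 2 * p - k` is a multiple of `s`; otherwise several input sizes share the same output size and the transposed layer returns the smallest of them.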
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
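One simple way to combine max and average pooling, as in direction (1) above, is a convex mixture of the two. The sketch below is an illustration only: the function name `mixed_pool` is hypothetical, and the mixing proportion `alpha` is passed in as a fixed number, whereas the abstract describes the pooling function as learned.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    alpha:  mixing proportion in [0, 1]; 1.0 gives max pooling and
            0.0 gives average pooling. (Assumed fixed here; in a
            trained network it would be a learned parameter.)
    """
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1 - alpha) * avg_value


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(window, alpha))
```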
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
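A minimal sketch of the sampling step described above, for a single pooling region. It assumes non-negative activations (for example, after a rectifier), so that the normalized activities form a valid multinomial distribution; the function name `stochastic_pool` and the zero-region fallback are illustrative choices.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. according to a multinomial distribution given by the activities.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total == 0:
        # No activity in the region: nothing to sample from (assumed fallback).
        return 0.0
    probabilities = [a / total for a in region]
    return rng.choices(region, weights=probabilities, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [0.5, 1.5, 0.0, 2.0]
    print([stochastic_pool(region, rng) for _ in range(5)])
```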
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
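A minimal sketch of the normalized $L_p$ norm at the core of the unit, showing how the order $p$ interpolates between familiar pooling operators. It omits the learned projections that feed the unit and treats $p$ as a fixed argument rather than a learned parameter; the function name `lp_pool` is illustrative.

```python
import math


def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p), for p >= 1.

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


if __name__ == "__main__":
    values = [1.0, -2.0, 3.0]
    for p in (1, 2, 8, 64):
        print(p, lp_pool(values, p))
    print("max |v| =", max(abs(v) for v in values))
    print("rms     =", math.sqrt(sum(v * v for v in values) / len(values)))
```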
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
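A minimal sketch of how binary masks can enforce the autoregressive constraint for a single hidden layer and the natural input ordering. The degree-based construction (each hidden unit gets a random integer degree and may only see inputs up to that degree) and the names `build_masks` and `connectivity` are assumptions made for illustration; the abstract itself does not specify how the masks are built.

```python
import random


def build_masks(n_inputs, n_hidden, seed=0):
    """Build input-to-hidden and hidden-to-output masks (n_inputs >= 2).

    Hidden unit k receives a degree m_k in 1..n_inputs-1 (assumed scheme).
    It may read input d only if m_k >= d, and output d may read hidden
    unit k only if d > m_k, so output d depends only on inputs before d.
    """
    rng = random.Random(seed)
    degrees = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    in_mask = [[1 if degrees[k] >= d else 0 for d in range(1, n_inputs + 1)]
               for k in range(n_hidden)]
    out_mask = [[1 if d > degrees[k] else 0 for k in range(n_hidden)]
                for d in range(1, n_inputs + 1)]
    return in_mask, out_mask


def connectivity(in_mask, out_mask):
    """Count the masked paths from each input (column) to each output (row)."""
    n_hidden = len(in_mask)
    n_inputs = len(out_mask)
    return [[sum(out_mask[d][k] * in_mask[k][j] for k in range(n_hidden))
             for j in range(n_inputs)]
            for d in range(n_inputs)]


if __name__ == "__main__":
    in_mask, out_mask = build_masks(n_inputs=4, n_hidden=8)
    for row in connectivity(in_mask, out_mask):
        print(row)  # entries on and above the diagonal are zero
```

Multiplying the weight matrices elementwise by these masks before each forward pass would leave output `d` with no path from input `d` or any later input.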
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
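As a rough illustration of the variance-reduced update that SVRG-style methods use, here is a minimal Python sketch. It is an assumption-laden toy, not the analyzed algorithm from the abstract above: the function name `svrg`, its parameters, and the choice of taking the last inner iterate as the next snapshot are all illustrative, and `grad_i(w, i)` is a hypothetical callback returning the gradient of the i-th summand at `w`.

```python
import numpy as np

def svrg(grad_i, w0, n, step, epochs, inner_steps, rng):
    """Toy SVRG loop for minimizing (1/n) * sum_i f_i(w).

    grad_i(w, i) must return the gradient of f_i at w (hypothetical callback).
    """
    w_snap = np.array(w0, dtype=float)
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap.copy()
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient: unbiased for the full
            # gradient at w, with variance shrinking as w nears w_snap.
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= step * g
        # Assumption: use the last inner iterate as the new snapshot.
        w_snap = w
    return w_snap
```

For a quick check, one could run it on a small least-squares problem with `grad_i = lambda w, i: A[i] * (A[i] @ w - b[i])`; that objective is convex and serves only to exercise the update rule, not to reproduce the nonconvex setting studied in the abstract.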
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speedup, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speedup is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
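The stochastic rounding idea mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `stochastic_round` and the `frac_bits` parameter are hypothetical, values are rounded onto a grid with step `2**-frac_bits`, and saturation to a 16-bit range is omitted.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, rng=None):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Each value is rounded up with probability equal to its fractional
    remainder on the grid, so the result is unbiased in expectation.
    Range saturation is deliberately left out of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    prob_up = scaled - floor  # in [0, 1)
    rounded = floor + (rng.random(scaled.shape) < prob_up)
    return rounded / scale
```

Unlike round-to-nearest, which would map every small update below half a grid step to zero, this scheme preserves such updates on average, which is one plausible reading of why the rounding scheme matters during training.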
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
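To make the kind of relationship described above concrete, here is a small Python sketch of the commonly used output-size formulas for a single spatial dimension. The function names and the extra-padding parameter `a` are assumptions for illustration, and the formulas are the standard ones rather than quotations from the guide: a convolution with input size `i`, kernel size `k`, stride `s` and zero padding `p` yields `floor((i + 2p - k) / s) + 1` outputs, and the corresponding transposed convolution yields `s * (i - 1) + k - 2p + a`.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0, a=0):
    """Output size of a transposed convolution along one axis.

    a (assumed name) is the extra size added to one side, used to recover
    input sizes for which (input + 2p - k) is not divisible by s.
    """
    return s * (i - 1) + k - 2 * p + a
```

For example, a 5-wide input with `k=3, s=2, p=1` gives a 3-wide output, and the transposed convolution with the same settings maps 3 back to 5; a 6-wide input also maps to 3, which is why `a=1` is needed to map 3 back to 6.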
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also are poorly suited to parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. This explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction, in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further reduce the size of the multilayer bootstrap network by compressing it into a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
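The rank-based selection described in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names and the `pressure` parameter (taken here to be the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) are hypothetical, and sampling is done with replacement for simplicity.

```python
import math
import random


def rank_selection_probs(losses, pressure=100.0):
    """Return one selection probability per datapoint.

    Datapoints are ranked by their latest known loss (largest first) and
    the probability decays exponentially with rank. `pressure` is an
    assumed knob: roughly the probability ratio between the first and
    last rank.
    """
    n = len(losses)
    # Indices sorted so that the datapoint with the largest loss has rank 0.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [math.exp(-decay * rank) for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, pressure=100.0, rng=None):
    """Draw a batch of datapoint indices (with replacement)."""
    rng = rng or random.Random()
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

With `pressure=1.0` the weights are all equal and the sketch reduces to uniform sampling, which makes the effect of the selection pressure easy to compare.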
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
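Stochastic rounding, as mentioned in the abstract above, can be illustrated with a minimal sketch. It assumes a fixed-point grid with `frac_bits` fractional bits and ignores saturation to a finite word width (such as the 16 bits in the abstract); the function name and parameters are hypothetical, not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, unbiased in expectation.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, and down otherwise. Overflow or
    saturation to a fixed word width is deliberately not modelled here.
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # nearest grid point at or below x
    frac = scaled - lower           # in [0, 1): distance to that point
    if rng.random() < frac:
        lower += 1                  # round up with probability `frac`
    return lower / scale
```

Unlike round-to-nearest, small updates that fall below half a grid step are not always lost: they survive with a probability proportional to their size, so the expected rounded value equals the original value.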
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
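The shape relationships mentioned in the abstract above can be illustrated along a single spatial axis. The sketch below uses the commonly stated formulas for a convolution with symmetric zero padding and for a transposed convolution without extra output padding; the function and parameter names are illustrative and not taken from the guide itself.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Uses o = floor((i + 2p - k) / s) + 1.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of a transposed convolution along one axis.

    Assumes no additional output padding: o = s * (i - 1) + k - 2p.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5-wide input, 3-wide kernel, stride 2, padding 1 gives 3 outputs,
    # and the matching transposed convolution maps 3 back to 5.
    print(conv_output_size(5, 3, s=2, p=1))
    print(transposed_conv_output_size(3, 3, s=2, p=1))
```

When the stride does not evenly divide `i + 2p - k`, several input sizes map to the same output size, so the transposed convolution recovers only one of them unless extra output padding is added.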
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
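The soft-target idea in the abstract above can be illustrated with a minimal sketch. It assumes the objective is a cross-entropy between the teacher's softmax output and the student's softmax output; the abstract does not state the exact loss, and the function names `softmax` and `soft_target_loss` are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits):
    """Cross-entropy of the student's prediction against the teacher's soft targets.

    Assumption: the teacher's softmax output is used directly as the target
    distribution; this is an illustration, not the paper's exact objective.
    """
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


teacher_logits = [2.0, 1.0, 0.0]
matching = soft_target_loss([2.0, 1.0, 0.0], teacher_logits)
mismatched = soft_target_loss([0.0, 1.0, 2.0], teacher_logits)
print(matching < mismatched)
```

The loss is smallest when the student reproduces the teacher's distribution, which is why the soft targets can act as an easier training signal than hard labels.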
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. Standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
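One way to read "combining of max and average pooling" in the abstract above is a convex mix of the two operations. The sketch below assumes a single scalar mixing weight `mix` in [0, 1], which would be learned in practice; the abstract does not give the exact parametrization, and the function name `mixed_pool` is hypothetical.

```python
def mixed_pool(region, mix):
    """Blend max pooling and average pooling over one pooling region.

    Assumption: mix is a scalar in [0, 1]; mix=1 recovers max pooling and
    mix=0 recovers average pooling.
    """
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return mix * max_value + (1.0 - mix) * avg_value


region = [1.0, 2.0, 3.0, 6.0]
print(mixed_pool(region, 1.0))  # max pooling
print(mixed_pool(region, 0.0))  # average pooling
print(mixed_pool(region, 0.5))  # halfway between the two
```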
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
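The stochastic pooling procedure described above can be sketched as follows. The sketch assumes non-negative activations (for example, after a rectifier) and takes the multinomial probabilities to be each activation divided by the sum over the region; the function name `stochastic_pool` and the all-zero fallback are illustrative choices, not taken from the abstract.

```python
import random


def stochastic_pool(region, rng):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value.
    Assumption: activations are non-negative; an all-zero region returns 0.0.
    """
    total = sum(region)
    if total == 0:
        return 0.0
    probabilities = [a / total for a in region]
    return rng.choices(region, weights=probabilities, k=1)[0]


rng = random.Random(0)  # seeded generator so runs are repeatable
region = [0.0, 1.0, 3.0, 0.0]
samples = [stochastic_pool(region, rng) for _ in range(10)]
print(all(s in (1.0, 3.0) for s in samples))  # zero activations are never picked
```

Larger activations are picked more often, so the procedure behaves like a noisy version of max pooling during training.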
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
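The pooling interpretation above can be checked numerically. The following is a minimal sketch, assuming the normalized $L_p$ norm takes the form $(\frac{1}{N}\sum_i |z_i|^p)^{1/p}$; the function name `lp_unit` is hypothetical, and the learned projections and learned order $p$ described in the abstract are omitted.

```python
import numpy as np

def lp_unit(z, p):
    """Normalized L_p norm of already-projected inputs z: (mean(|z_i|^p))^(1/p).

    Assumption: the projections and the learned order p from the abstract
    are left out; p is passed in as a fixed number.
    """
    z = np.abs(np.asarray(z, dtype=float))
    return float(np.mean(z ** p) ** (1.0 / p))

z = [1.0, -2.0, 3.0, -4.0]
mean_abs = lp_unit(z, 1)    # p = 1: average pooling of absolute values
rms = lp_unit(z, 2)         # p = 2: root-mean-square pooling
near_max = lp_unit(z, 200)  # large p: approaches max pooling of |z_i|
```

With these inputs, `mean_abs` is the plain average of the absolute values, `rms` is slightly larger, and `near_max` sits just below the largest magnitude, 4.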
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
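The masking idea above can be illustrated for one hidden layer. This is a minimal sketch of one possible mask construction, assuming each unit is assigned an integer "degree" and a connection is kept only when the degrees respect the input ordering; the function name `made_masks` and the random degree assignment are illustrative assumptions, not details given in the abstract.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumption: inputs get degrees 1..D in their natural order and each
    hidden unit gets a random degree in 1..D-1 (requires n_inputs >= 2).
    """
    rng = np.random.default_rng(seed)
    m_in = np.arange(1, n_inputs + 1)
    m_hid = rng.integers(1, n_inputs, size=n_hidden)  # upper bound is exclusive
    # Hidden unit k may read input d only if degree(k) >= degree(d).
    mask_in = (m_hid[:, None] >= m_in[None, :]).astype(float)   # shape (H, D)
    # Output d may read hidden unit k only if degree(d) > degree(k).
    mask_out = (m_in[:, None] > m_hid[None, :]).astype(float)   # shape (D, H)
    return mask_in, mask_out

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
# connectivity[i, j] > 0 means output i can depend on input j.
connectivity = mask_out @ mask_in
```

Multiplying the weight matrices elementwise by these masks leaves `connectivity` strictly lower triangular, so output `i` depends only on inputs that come before it in the ordering.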
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
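The figures quoted above can be reproduced with a short calculation. This is a minimal sketch: the two WER values come from the abstract, while the helper `kl_divergence` and the example distributions are hypothetical illustrations of the Kullback-Leibler objective, not the authors' training code.

```python
import numpy as np

# Word error rates quoted in the abstract (WSJ eval92).
baseline_wer = 4.54
distilled_wer = 3.93
relative_improvement = (baseline_wer - distilled_wer) / baseline_wer

def kl_divergence(teacher, student):
    """KL(teacher || student) for two discrete distributions with no zero entries."""
    teacher = np.asarray(teacher, dtype=float)
    student = np.asarray(student, dtype=float)
    return float(np.sum(teacher * np.log(teacher / student)))

# Hypothetical soft alignment from the RNN teacher and a DNN student output.
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.5, 0.3, 0.2]
kl_mismatch = kl_divergence(teacher_probs, student_probs)
kl_match = kl_divergence(teacher_probs, teacher_probs)
```

The relative improvement works out to roughly 13.4%, consistent with the "more than 13%" stated above, and the divergence drops to zero when the student reproduces the teacher's distribution.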
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
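The exact factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ can be verified on a toy distribution. This is a minimal sketch: the four-valued distribution `p_x` and the encoder `f` are hypothetical examples chosen only to make the identity easy to check.

```python
from collections import defaultdict

# Hypothetical distribution over four discrete values of X.
p_x = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

def f(x):
    """Toy deterministic discrete encoder: groups {0, 1} and {2, 3}."""
    return x // 2

# P(H = h) is the total mass of all x that the encoder maps to h.
p_h = defaultdict(float)
for x, p in p_x.items():
    p_h[f(x)] += p

def p_x_given_h(x, h):
    """Decoder that puts no probability mass on any x with f(x) != h."""
    return p_x[x] / p_h[h] if f(x) == h else 0.0

# Recombine the two factors for every x.
reconstructed = {x: p_x_given_h(x, f(x)) * p_h[f(x)] for x in p_x}
```

Each entry of `reconstructed` matches `p_x` up to floating-point rounding, and the decoder assigns zero mass to any `x` whose code differs from the conditioning value.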
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
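The gradient reversal layer mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the usual reading of such a layer: identity in the forward pass, and a sign flip (optionally scaled) of the incoming gradient in the backward pass. The class name `GradientReversal` and the scaling parameter `lam` are hypothetical names chosen for illustration, not taken from the abstract, and a real implementation would hook into a framework's autodiff rather than expose explicit `forward`/`backward` methods.

```python
class GradientReversal:
    """Toy gradient reversal layer (illustrative sketch, not the authors' code)."""

    def __init__(self, lam=1.0):
        # lam is an assumed scaling factor for the reversed gradient.
        self.lam = lam

    def forward(self, x):
        # Forward pass: features pass through unchanged.
        return list(x)

    def backward(self, grad_output):
        # Backward pass: negate (and scale) the gradient, so layers before
        # this one are pushed to *increase* the downstream domain loss.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
features = grl.forward([1.0, -2.0])
reversed_grad = grl.backward([2.0, -4.0])
```

Placed between a feature extractor and a domain classifier, a layer of this kind lets ordinary backpropagation train the classifier to separate the domains while training the features to make that separation harder.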
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
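The "straight path from initialization to solution" analysis in the abstract above can be sketched as evaluating the loss at evenly spaced points on the line segment between two parameter vectors. The function names, the flat-list parameter representation, and the toy quadratic loss below are all assumptions made for illustration.

```python
def interpolate(theta_start, theta_end, alpha):
    """Return (1 - alpha) * theta_start + alpha * theta_end, element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_start, theta_end)]


def loss_along_path(loss_fn, theta_start, theta_end, num_points=11):
    """Evaluate loss_fn at num_points evenly spaced points on the segment."""
    alphas = [k / (num_points - 1) for k in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_start, theta_end, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    # Stand-in for a network's training loss: a quadratic bowl centred at 1.
    return sum((t - 1.0) ** 2 for t in theta)


# Hypothetical "initial" and "trained" parameter vectors.
path = loss_along_path(toy_loss, [0.0, 0.0], [1.0, 1.0])
```

On this toy loss the curve decreases monotonically; for a real network one would plot the resulting `(alpha, loss)` pairs and look for bumps or barriers along the path.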
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
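The rank-based selection rule described in the abstract above can be sketched as follows: rank datapoints by their latest known loss (largest first) and assign each a selection probability that decays exponentially with rank. The `decay` parameter and both function names are assumptions for illustration; the abstract does not specify how the decay rate is parametrized or scheduled over time.

```python
import math
import random


def rank_selection_probs(losses, decay=0.1):
    """Selection probability per datapoint, proportional to exp(-decay * rank)."""
    # Rank 0 goes to the datapoint with the largest latest-known loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * len(losses)
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, decay=0.1, rng=None):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    rng = rng or random.Random(0)
    probs = rank_selection_probs(losses, decay)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


probs = rank_selection_probs([0.5, 2.0, 1.0], decay=1.0)
batch = sample_batch([0.5, 2.0, 1.0], batch_size=4)
```

Sampling with replacement keeps the sketch short; a larger `decay` concentrates selection on the highest-loss datapoints, while `decay=0` recovers uniform sampling.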
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
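Stochastic rounding, which the abstract above identifies as crucial for low-precision training, can be sketched for a single value as follows. The sketch assumes a fixed-point format with `frac_bits` fractional bits and ignores word length and saturation; the function name and parameters are hypothetical, not taken from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, rounding up with probability
    equal to the fractional remainder, so the result is unbiased in expectation."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # distance to the lower grid point, in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=0, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

With `frac_bits=0` every sample is either 0.0 or 1.0, yet the sample mean stays close to 0.3. Round-to-nearest would always return 0.0 here, which illustrates how small updates can be lost entirely under deterministic rounding.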
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does deep learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
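To make the encoding in the abstract above more concrete, here is a minimal sketch of the two Gramian angular fields. It assumes the commonly cited definitions, with the series rescaled to [0, 1], mapped to angles via arccos, and then combined as cos(phi_i + phi_j) for GASF and sin(phi_i - phi_j) for GADF; the function names are illustrative, and the Markov Transition Field is not covered.

```python
import math


def rescale01(series):
    """Min-max rescale a non-constant series to the interval [0, 1]."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D time series."""
    x = rescale01(series)
    # Clamp guards against tiny floating-point excursions outside [0, 1].
    phi = [math.acos(min(1.0, max(0.0, v))) for v in x]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


def recover_from_gasf_diagonal(gasf):
    """Invert the diagonal: cos(2*phi) = 2*x**2 - 1, so x = sqrt((d + 1) / 2)."""
    return [math.sqrt((gasf[i][i] + 1.0) / 2.0) for i in range(len(gasf))]


if __name__ == "__main__":
    gasf, gadf = gramian_angular_fields([1.0, 2.0, 3.0, 5.0])
    print(recover_from_gasf_diagonal(gasf))
```

The last helper illustrates the bijection mentioned in the abstract: on data rescaled to [0, 1], the GASF diagonal is enough to recover the rescaled series.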
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for the descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the multilayer bootstrap network with a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
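The analysis described in the abstract above evaluates the objective along the straight line between the initial and the final parameters. The following sketch shows that procedure on a flat list of parameters; the helper names and the toy quadratic loss in the usage section are assumptions for illustration, not the networks or objectives studied in the paper.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the segment between two parameter vectors (alpha in [0, 1])."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points from theta_init to theta_final."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


if __name__ == "__main__":
    # Toy stand-in for a training loss: squared distance to the point (1, 1).
    def toy_loss(theta):
        return sum((t - 1.0) ** 2 for t in theta)

    for alpha, value in loss_along_path(toy_loss, [0.0, 0.0], [1.0, 1.0]):
        print(f"alpha={alpha:.1f} loss={value:.3f}")
```

A curve that decreases smoothly from alpha = 0 to alpha = 1 would indicate no significant obstacle on the straight path, while a bump would indicate a barrier between initialization and solution.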
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
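The rank-based strategy in the abstract above can be sketched as follows: datapoints are sorted by their latest known loss, and the selection probability falls off exponentially with rank. The `decay` parameter, the function names, and the use of sampling with replacement are assumptions made for this example; the paper's exact parameterization of the selection pressure may differ.

```python
import math
import random


def selection_probabilities(latest_losses, decay=0.1):
    """Probability per datapoint, proportional to exp(-decay * rank).

    Rank 0 is the datapoint with the largest latest known loss.
    """
    n = len(latest_losses)
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, decay=0.1, rng=random):
    """Draw datapoint indices (with replacement) using the rank-based weights."""
    probs = selection_probabilities(latest_losses, decay)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


if __name__ == "__main__":
    losses = [0.1, 2.0, 0.5, 1.2]  # made-up latest losses
    print(selection_probabilities(losses, decay=0.5))
    print(sample_batch(losses, batch_size=3, decay=0.5))
```

Setting `decay` to 0 recovers uniform sampling, so the parameter acts as a simple dial for how strongly high-loss datapoints are favored.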
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
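For readers unfamiliar with SVRG, the sketch below shows the basic variance-reduced update that the abstract above analyzes: each epoch computes a full gradient at a snapshot, and each inner step corrects a single-example gradient with it. The function signature, the step size, and the choice of the last inner iterate as the next snapshot are assumptions for this example, and the usage section runs on a toy convex scalar least-squares problem purely to show the mechanics, not the nonconvex setting of the paper.

```python
import random


def svrg(grad_i, n, w0, lr=0.05, epochs=60, inner_steps=10, rng=random):
    """Minimize (1/n) * sum_i f_i(w) for a scalar w using SVRG-style updates.

    grad_i(w, i) must return the gradient of the i-th component f_i at w.
    """
    w_snapshot = w0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snapshot, i) for i in range(n)) / n
        w = w_snapshot
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient estimate.
            g = grad_i(w, i) - grad_i(w_snapshot, i) + full_grad
            w -= lr * g
        w_snapshot = w  # assumed choice: last iterate becomes the snapshot
    return w_snapshot


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [2.0, 4.0, 6.0]  # every component f_i = 0.5 * (a_i * w - b_i)**2 is minimized at w = 2

    def grad(w, i):
        return a[i] * (a[i] * w - b[i])

    print(svrg(grad, n=3, w0=0.0, rng=random.Random(0)))
```

The correction term has zero mean over the choice of `i`, so the estimate stays unbiased while its variance shrinks as the iterate and the snapshot approach each other.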
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
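The stochastic rounding idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the `frac_bits` and `total_bits` parameters, and the saturation behaviour are assumptions chosen for illustration. A value is rounded to one of its two neighbouring fixed-point grid values, rounding up with probability equal to its fractional distance from the lower one, so the rounded value is unbiased in expectation.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, total_bits=16, rng=None):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    frac_bits and total_bits are illustrative assumptions, not values
    taken from the abstract beyond the 16-bit word length it mentions.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Probability of rounding up equals the distance to the lower grid point.
    prob_up = scaled - lower
    rounded = lower + (rng.random(scaled.shape) < prob_up)
    # Saturate to the representable signed integer range.
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    rounded = np.clip(rounded, lo, hi)
    return rounded / scale

# Usage: the mean of many stochastic roundings approaches the input value,
# whereas round-to-nearest would always return 0.25 here.
samples = stochastic_round(np.full(100000, 0.3), frac_bits=2,
                           rng=np.random.default_rng(0))
print(samples.mean())
```

With round-to-nearest, small updates below half a grid step are always lost; with stochastic rounding they survive on average, which is one plausible reading of why the rounding scheme matters during training.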
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and denoising auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
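The relationships between input shape, kernel shape, zero padding, strides and output shape described in the abstract above can be sketched along one spatial axis as follows. This is an illustrative sketch using the standard output-size formulas, not code from the guide; the function names and the extra-size parameter `a` are assumptions.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution (or pooling with p=0) along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0, a=0):
    """Output size of the transposed convolution of a convolution with
    kernel k, stride s and padding p.

    a is the remainder (o + 2p - k) mod s of the original convolution's
    input size o, needed when the stride does not divide evenly.
    """
    return s * (i - 1) + k - 2 * p + a

# Usage: a 3x3 kernel with stride 2 and padding 1 maps 5 -> 3,
# and the matching transposed convolution maps 3 -> 5.
print(conv_output_size(5, 3, s=2, p=1))             # 3
print(transposed_conv_output_size(3, 3, s=2, p=1))  # 5
```

Because several input sizes can map to the same output size when the stride is greater than 1 (for example 5 and 6 both map to 3 above), the transposed direction needs the extra term `a` to recover the intended size.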
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
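The stochastic pooling procedure described in the abstract above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name, the single-channel 2D input, the non-overlapping `k` x `k` regions, and the handling of all-zero regions are assumptions. Activations are assumed non-negative (for example, after a ReLU) so that they can be normalised into a multinomial distribution over each pooling region.

```python
import numpy as np

def stochastic_pool2d(x, k=2, rng=None):
    """Stochastic pooling over non-overlapping k x k regions of a 2D map.

    Within each region, one activation is sampled with probability
    proportional to its value. Assumes x >= 0 everywhere.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = x.shape
    out = np.zeros((h // k, w // k))
    for r in range(h // k):
        for c in range(w // k):
            region = x[r * k:(r + 1) * k, c * k:(c + 1) * k].ravel()
            total = region.sum()
            # An all-zero region has no valid distribution; output 0 (assumption).
            if total > 0:
                out[r, c] = rng.choice(region, p=region / total)
    return out

# Usage: each output is one of the activations in its region, with larger
# activations more likely to be picked.
feature_map = np.array([[1.0, 0.0, 2.0, 2.0],
                        [0.0, 3.0, 0.0, 0.0],
                        [0.0, 0.0, 5.0, 0.0],
                        [0.0, 0.0, 0.0, 0.0]])
print(stochastic_pool2d(feature_map, k=2, rng=np.random.default_rng(0)))
```

Unlike max pooling, non-maximal activations still get selected some of the time, which is the source of the regularizing noise; the procedure itself introduces no hyper-parameters beyond the pooling region size.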
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
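The pooling interpretation in the abstract above can be illustrated with a small sketch. It assumes the simplest reading of a "normalized $L_p$ norm", namely (mean of |x_i|^p)^(1/p) over a fixed group of inputs, and omits the learned projections and per-unit learned order described in the abstract. The function name `lp_pool` is hypothetical.

```python
import numpy as np

def lp_pool(x, p):
    """Normalized L_p norm of a 1-D group of inputs: (mean |x_i|^p)^(1/p).

    Simplified illustration only; projections and learned orders are omitted.
    """
    x = np.abs(np.asarray(x, dtype=np.float64))
    return float(np.mean(x ** p) ** (1.0 / p))

group = [1.0, -2.0, 3.0, -4.0]

avg_like = lp_pool(group, p=1)    # p = 1 gives average pooling of magnitudes
rms_like = lp_pool(group, p=2)    # p = 2 gives root-mean-square pooling
max_like = lp_pool(group, p=200)  # large p approaches max pooling of magnitudes
```

Under this simplification, varying p moves the unit smoothly between average-like and max-like behavior, which is the sense in which it generalizes the conventional pooling operators.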
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
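The masking idea in the abstract above can be sketched for a single hidden layer. This is a minimal illustration and not the authors' implementation: the degree-assignment scheme, the function name `autoregressive_masks`, and the natural input ordering are assumptions made for the example.

```python
import numpy as np

def autoregressive_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks so that output d depends only on inputs 1..d-1.

    Assumed scheme: each hidden unit gets a degree in [1, n_inputs - 1];
    it may read inputs with index <= degree and feed outputs with index > degree.
    Requires n_inputs >= 2.
    """
    rng = np.random.default_rng(seed)
    in_deg = np.arange(1, n_inputs + 1)                  # ordering 1..D
    hid_deg = rng.integers(1, n_inputs, size=n_hidden)   # upper bound is exclusive
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(np.float64)   # (H, D)
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(np.float64)   # (D, H)
    return mask_in, mask_out

mask_in, mask_out = autoregressive_masks(n_inputs=5, n_hidden=16)

# connectivity[i, j] counts the paths from input j to output i.
connectivity = mask_out @ mask_in
```

In use, the masks would be multiplied elementwise with the encoder and decoder weight matrices; `connectivity` is strictly lower triangular, so no output has a path from its own input or any later one.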
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
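The claim that the variance of the log-norm walk shrinks with layer width can be checked numerically. The sketch below assumes i.i.d. Gaussian matrices with entry variance 1/width, which is a common choice and not necessarily the optimal scaling derived in the paper; the function name and the trial counts are likewise illustrative.

```python
import numpy as np

def log_norm_after_depth(width, depth, n_trials, seed=0):
    """Return log ||W_depth ... W_1 x|| for n_trials independent random chains.

    Assumption: each W has i.i.d. N(0, 1/width) entries, a fresh one per layer.
    """
    rng = np.random.default_rng(seed)
    results = np.empty(n_trials)
    for t in range(n_trials):
        v = np.ones(width) / np.sqrt(width)   # unit-norm starting vector
        for _ in range(depth):
            w = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
            v = w @ v
        results[t] = np.log(np.linalg.norm(v))
    return results

narrow = log_norm_after_depth(width=8, depth=20, n_trials=100)
wide = log_norm_after_depth(width=64, depth=20, n_trials=100)

var_narrow = narrow.var()
var_wide = wide.var()
```

With these settings the spread of the final log-norm is markedly smaller for the wider chain, consistent with the abstract's statement that the variance is inversely proportional to layer size.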
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
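The "more than 13% relative improvement" figure follows directly from the two word error rates quoted in the abstract above. The short check below uses only those two numbers; the helper name is illustrative.

```python
def relative_improvement(baseline, improved):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - improved) / baseline

baseline_wer = 4.54   # baseline small DNN, WSJ eval92 (from the abstract)
student_wer = 3.93    # small DNN trained on soft RNN alignments (from the abstract)

gain = relative_improvement(baseline_wer, student_wer)
print(f"relative WER improvement: {gain:.1%}")  # prints "relative WER improvement: 13.4%"
```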
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution the TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
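A minimal sketch of the stochastic rounding idea mentioned above, in plain Python. The abstract does not spell out the rounding rule, so this uses the usual definition (round up with probability equal to the fractional remainder, which makes the result unbiased in expectation). The function name and the `frac_bits` parameter are illustrative assumptions, not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its distance from
    the lower grid point (measured in grid steps), otherwise rounded down.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # scaled - lower lies in [0, 1) and is the probability of rounding up.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # Individual results sit on the grid {0.25, 0.5}; their mean stays close to 0.3.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, so small updates would be lost systematically; the randomized rule preserves them on average.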
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
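The relationships between input shape, kernel shape, zero padding, strides and output shape can be sketched for one spatial axis as below. The formulas are the standard ones for convolution and for transposed convolution without extra output padding; the function names are hypothetical and the abstract itself does not state the formulas.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a convolution or pooling layer.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Computes floor((i + 2p - k) / s) + 1.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of the matching transposed convolution.

    Computes s * (i - 1) + k - 2p, assuming no additional output padding.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives 3 outputs,
    # and the transposed layer with the same settings maps 3 back to 5.
    print(conv_output_size(5, 3, s=2, p=1))
    print(transposed_conv_output_size(3, 3, s=2, p=1))
```

When `(i + 2p - k)` is not a multiple of `s`, several input sizes map to the same output size, so the transposed formula recovers only the smallest of them.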
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via two strategies for combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
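One simple way to combine max and average pooling is a convex mix of the two, sketched below for a single pooling region. This is an illustrative assumption about what such a combination might look like: the abstract does not specify its two strategies, the mixing weight `alpha` is fixed here rather than learned, and the function name is hypothetical.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations.
    alpha: mixing weight in [0, 1]; 1 gives max pooling, 0 gives average pooling.
    """
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1 - alpha) * avg_value


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```

In a trainable layer, `alpha` would be a parameter updated by gradient descent along with the network weights.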
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
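The sampling step described above can be sketched as follows. This is a minimal illustration, assuming non-negative activations (for example after a ReLU) and non-overlapping square windows; the function names stochastic_pool_region and stochastic_pool_2d are made up for this example and are not taken from the paper.

```python
import numpy as np

def stochastic_pool_region(region, rng):
    """Sample one activation from a pooling region.

    The multinomial probabilities are the activations normalized to
    sum to one, which assumes the inputs are non-negative.
    """
    a = np.asarray(region, dtype=float).ravel()
    total = a.sum()
    if total == 0.0:
        return 0.0  # all-zero region: nothing to sample from
    idx = rng.choice(a.size, p=a / total)
    return float(a[idx])

def stochastic_pool_2d(x, size, rng):
    """Apply stochastic pooling over non-overlapping size x size windows."""
    h, w = x.shape
    out = np.empty((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = stochastic_pool_region(window, rng)
    return out

rng = np.random.default_rng(0)
x = np.array([[0., 0., 1., 2.],
              [0., 3., 0., 0.],
              [0., 0., 0., 0.],
              [0., 0., 4., 4.]])
print(stochastic_pool_2d(x, 2, rng))
```

Larger activations are picked more often, but smaller non-zero ones still have a chance, which is where the regularizing randomness comes from.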
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
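The pooling interpretation above can be checked numerically with a small sketch of a normalized L_p norm. The function name lp_unit is an assumption for this example, and the learned projections that feed the unit are left out; only the norm over already-projected values is shown.

```python
import numpy as np

def lp_unit(z, p):
    """Normalized L_p norm of the values z: (mean(|z_i|^p))^(1/p).

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    z = np.abs(np.asarray(z, dtype=float))
    return float(np.mean(z ** p) ** (1.0 / p))

z = [1.0, 2.0, 3.0, 4.0]
for p in (1, 2, 50):
    print(p, lp_unit(z, p))
```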
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
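One way to build masks that enforce the autoregressive constraint described above is sketched below for a single hidden layer. The function name build_masks, the random assignment of hidden-unit degrees, and the layer sizes are assumptions for this illustration, not details stated in the abstract.

```python
import numpy as np

def build_masks(n_inputs, n_hidden, rng):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Each input d gets degree d (1..D). Each hidden unit gets a random
    degree in 1..D-1 (n_inputs must be at least 2). A hidden unit may
    see inputs with degree <= its own, and an output d may see hidden
    units with degree < d, so output d depends only on inputs 1..d-1.
    """
    in_deg = np.arange(1, n_inputs + 1)
    hid_deg = rng.integers(1, n_inputs, size=n_hidden)  # high is exclusive
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(float)   # (H, D)
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(float)   # (D, H)
    return mask_in, mask_out

rng = np.random.default_rng(0)
mask_in, mask_out = build_masks(n_inputs=5, n_hidden=8, rng=rng)
# Entry [i, j] counts the paths from input j to output i.
connectivity = mask_out @ mask_in
print(connectivity)
```

In use, the masks would be multiplied elementwise with the weight matrices of the two layers; the printed connectivity matrix is strictly lower triangular, so no output is connected to its own input or to later inputs in the ordering.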
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
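The Kullback-Leibler objective mentioned above can be illustrated with a minimal sketch. It assumes the teacher (RNN) and student (small DNN) each produce a vector of logits over the same classes; the function names `softmax` and `kl_divergence`, the `eps` smoothing constant, and the example logits are illustrative choices, not details taken from the paper.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); eps (an assumption) guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical logits for a single frame: teacher soft alignment vs. student output.
teacher = softmax([2.0, 1.0, 0.1])
student = softmax([0.0, 0.0, 0.0])
loss = kl_divergence(teacher, student)
```

Minimizing `loss` over the student's parameters pulls the student's output distribution toward the teacher's soft targets; the divergence is zero only when the two distributions match.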
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
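The gradient reversal layer named above can be sketched without any deep learning package. This is a minimal illustration under stated assumptions: the layer is taken to act as the identity in the forward pass and to multiply the incoming gradient by a negative constant in the backward pass; the class name `GradientReversal`, the `lam` scaling parameter, and the list-based interface are hypothetical and chosen only for this example.

```python
class GradientReversal:
    """Identity in the forward pass; flips and scales the gradient in the backward pass."""

    def __init__(self, lam=1.0):
        # lam (an assumed hyperparameter) controls how strongly the gradient is reversed.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the downstream domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient from the domain classifier is negated before it reaches
        # the feature extractor, pushing features toward domain invariance.
        return [-self.lam * g for g in grad_output]

layer = GradientReversal(lam=0.5)
out = layer.forward([1.0, -2.0])
grad_in = layer.backward([0.2, -0.4])
```

In a real framework the same behaviour would be registered as a custom differentiable operation, so that standard backpropagation applies the sign flip automatically.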
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
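The straight-path analysis described above can be sketched as a linear interpolation between two parameter vectors. This is a minimal illustration, assuming parameters are flat lists of floats; `interpolate`, `loss_along_path`, and the quadratic `toy_loss` are hypothetical names and stand-ins, not the networks or objectives used in the paper.

```python
def interpolate(theta_init, theta_final, alpha):
    # Point on the straight line between the two parameter vectors.
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]

def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    # Evaluate the loss at evenly spaced alphas in [0, 1]; num_points must be >= 2.
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]

def toy_loss(theta):
    # Stand-in objective (an assumption): squared distance to the all-ones vector.
    return sum((t - 1.0) ** 2 for t in theta)

path = loss_along_path(toy_loss, [0.0, 3.0], [1.0, 1.0], num_points=5)
```

Plotting the second element of each pair against `alpha` gives a one-dimensional cross-section of the loss surface; a bump along that curve would indicate an obstacle on the straight path.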
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
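The rank-based selection strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions: datapoints are ranked by their latest known loss (largest first), and the selection weight is multiplied by a constant factor per rank so that the probability decays exponentially with rank. The `pressure` parameter (roughly the ratio between the most and least likely datapoint) and both function names are hypothetical choices for this example, and sampling here is done with replacement.

```python
import math
import random

def rank_selection_probs(losses, pressure=100.0):
    n = len(losses)
    # Rank 0 is the datapoint with the largest latest known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank decay factor chosen so that weights shrink by about `pressure` over n ranks.
    decay = math.exp(-math.log(pressure) / n)
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs

def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    # Draw datapoint indices (with replacement) according to the rank-based probabilities.
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)

latest_losses = [0.1, 2.0, 0.5, 1.0]
probs = rank_selection_probs(latest_losses, pressure=16.0)
batch = sample_batch(latest_losses, batch_size=3, pressure=16.0)
```

With four datapoints and `pressure=16.0`, each step down in rank halves the selection probability, so the datapoint with loss 2.0 is twice as likely to be drawn as the one with loss 1.0.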
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
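As a rough illustration of the variance-reduced update that SVRG-style methods share, the sketch below runs SVRG on a toy scalar least-squares problem. The problem is convex and chosen only because its optimum has a closed form; it is not the nonconvex setting analyzed in the abstract above. The function name `svrg_least_squares`, the step size, and the loop lengths are all assumptions made for this example.

```python
import random


def svrg_least_squares(a, b, step=0.01, epochs=30, inner_steps=50, seed=0):
    """Minimize (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 over a scalar w.

    Toy problem (an assumption of this sketch) used to show the SVRG update:
    each inner step uses grad_i(w) - grad_i(w_snapshot) + full_grad(w_snapshot).
    """
    rng = random.Random(seed)
    n = len(a)

    def grad_i(w, i):
        # Gradient of the i-th summand 0.5 * (a[i] * w - b[i])**2.
        return a[i] * (a[i] * w - b[i])

    w_snapshot = 0.0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snapshot, i) for i in range(n)) / n
        w = w_snapshot
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient estimate.
            g = grad_i(w, i) - grad_i(w_snapshot, i) + full_grad
            w -= step * g
        # Use the last inner iterate as the next snapshot (one common variant).
        w_snapshot = w
    return w_snapshot


a = [1.0, 2.0, 3.0, 4.0]
b = [2.1, 3.9, 6.2, 7.8]
w_hat = svrg_least_squares(a, b)
w_star = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)
print(w_hat, w_star)
```

The correction term `grad_i(w_snapshot, i) - full_grad` has zero mean over `i`, so the estimate stays unbiased while its variance shrinks as `w` and the snapshot approach a stationary point.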
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
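To make the rounding scheme concrete, here is a minimal sketch of stochastic rounding onto a signed fixed-point grid. It assumes a 16-bit word with a configurable number of fractional bits and saturation at the range limits; the function name `stochastic_round` and its parameters are illustrative choices, not taken from the abstract above.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with `frac_bits` fractional bits.

    The value is rounded up with probability equal to its fractional distance
    from the lower grid point, so the result is unbiased in expectation.
    Saturation to the word range is an assumption of this sketch.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # in [0, 1)
    q = lower + (1 if rng.random() < p_up else 0)
    # Saturate to the representable signed integer range.
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale


rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
print(sum(samples) / len(samples))  # close to 0.3 on average
```

With only two fractional bits, 0.3 lands on either 0.25 or 0.5, yet the average over many draws stays near 0.3. Round-to-nearest would always return 0.25, which is the kind of systematic bias that stochastic rounding avoids.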
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
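The relationship between input shape, kernel shape, zero padding, strides and output shape can be sketched per spatial axis with the standard formulas o = floor((i + 2p - k) / s) + 1 for a convolution and o' = s(i - 1) + k - 2p + a for a transposed convolution. The helper names and the extra output-padding parameter `a` below are assumptions of this example, not notation confirmed by the abstract above.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length along one axis for input i, kernel k, stride s, padding p."""
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0, a=0):
    """Output length of the transposed convolution for the same k, s, p.

    `a` is extra output padding; it resolves the ambiguity when several input
    sizes of the forward convolution map to the same output size.
    """
    return s * (i - 1) + k - 2 * p + a


print(conv_output_size(28, 5))                       # 24
print(conv_output_size(5, 3, s=2, p=1))              # 3
print(transposed_conv_output_size(3, 3, s=2, p=1))   # 5
```

For example, inputs of length 5 and 6 both give an output of length 3 with k = 3, s = 2, p = 1, so recovering length 6 through the transposed layer needs `a=1`.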
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
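The first direction mentioned above, combining max and average pooling, can be illustrated with a minimal sketch. It assumes the simplest form of combination, a convex mix controlled by a single coefficient; the names `mixed_pool` and `alpha` are hypothetical, and in the paper's setting the mixing proportion would be learned rather than passed in by hand.

```python
def mixed_pool(region, alpha):
    """Blend max pooling and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    alpha:  mixing weight in [0, 1]; 1.0 gives pure max pooling,
            0.0 gives pure average pooling. (Here it is a plain
            argument; a learned parameter is assumed in practice.)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return alpha * max_part + (1.0 - alpha) * avg_part


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(window, a))
```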
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
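The sampling step described above can be sketched in a few lines. This is an illustration only, assuming non-negative activations (for example, after a rectifier) so that they can be normalized into a multinomial distribution; the function name `stochastic_pool` is hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. according to a multinomial distribution given by the activities.
    Assumes every value in `region` is non-negative.
    """
    total = sum(region)
    if total == 0:
        # All activations are zero, so every choice yields the same result.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    window = [0.0, 1.0, 3.0, 0.5]
    print([stochastic_pool(window, rng) for _ in range(5)])
```

Larger activations are selected more often, but smaller non-zero ones still get picked occasionally, which is the source of the regularizing noise.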
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher-order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
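The pooling interpretation above can be checked numerically with a minimal sketch. It assumes the simplest reading of "normalized $L_p$ norm", namely $(\frac{1}{N}\sum_i |z_i|^p)^{1/p}$ over the $N$ incoming projections, and omits any learned projections or offsets; the function name `lp_unit` is hypothetical.

```python
def lp_unit(z, p):
    """Normalized L_p norm of the inputs z: (mean(|z_i|^p))^(1/p).

    z: non-empty list of incoming projection values.
    p: order of the norm, assumed >= 1.
    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(z)
    return (sum(abs(v) ** p for v in z) / n) ** (1.0 / p)


if __name__ == "__main__":
    z = [3.0, -4.0]
    for p in (1, 2, 50):
        print(p, lp_unit(z, p))
```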
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
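The masking idea in the abstract above can be illustrated with a minimal sketch for one hidden layer. The degree-assignment rule and the names `build_masks` and `connectivity` are assumptions made for illustration, not the authors' implementation; the only property the sketch aims to show is that each output is connected solely to inputs that come earlier in the ordering.

```python
import random

def build_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoencoder (hypothetical helper).

    Requires n_inputs >= 2. Inputs get degrees 1..n_inputs in their natural order;
    each hidden unit gets a random degree in 1..n_inputs-1.
    """
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Hidden unit k may read input i only if hid_deg[k] >= in_deg[i].
    mask_in = [[1 if hid_deg[k] >= in_deg[i] else 0 for i in range(n_inputs)]
               for k in range(n_hidden)]
    # Output o may read hidden unit k only if in_deg[o] > hid_deg[k].
    mask_out = [[1 if in_deg[o] > hid_deg[k] else 0 for k in range(n_hidden)]
                for o in range(n_inputs)]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """Count input-to-output paths; entry [o][i] > 0 means output o can depend on input i."""
    n_hidden = len(mask_in)
    n_inputs = len(mask_out)
    return [[sum(mask_out[o][k] * mask_in[k][i] for k in range(n_hidden))
             for i in range(n_inputs)]
            for o in range(n_inputs)]

mask_in, mask_out = build_masks(n_inputs=5, n_hidden=8)
paths = connectivity(mask_in, mask_out)
```

In this sketch the masks would be multiplied element-wise with the weight matrices, so `paths[o][i]` is zero whenever input `i` does not precede output `o`.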
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, and pattern extraction. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
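The training signal described in the abstract above, a Kullback-Leibler divergence between a teacher's soft outputs and a student's outputs, can be sketched as follows. The logits are made-up numbers and the helper names `softmax` and `kl_divergence` are illustrative assumptions; this shows only the loss for a single frame, not the paper's training pipeline.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical per-frame logits from a large teacher and a small student.
teacher_probs = softmax([2.0, 1.0, 0.1])   # soft targets from the teacher
student_probs = softmax([1.5, 1.2, 0.3])   # current student prediction
loss = kl_divergence(teacher_probs, student_probs)
```

Minimizing `loss` with respect to the student's parameters pulls the student's distribution toward the teacher's; the loss is zero only when the two distributions match.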
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g., MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
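The abstract above names a gradient reversal layer without spelling out its behaviour. A minimal framework-free sketch, assuming the usual formulation (identity in the forward pass, gradient multiplied by a negative constant in the backward pass), might look like the following. The class name and the scaling parameter `lam` are illustrative assumptions, not an API from any deep learning package.

```python
class GradientReversal:
    """Toy gradient reversal layer operating on plain Python lists.

    Forward pass: passes features through unchanged.
    Backward pass: flips the sign of the incoming gradient and scales it by lam,
    so the layers below are pushed to make the domain classifier's job harder.
    """

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical trade-off weight, assumed non-negative

    def forward(self, features):
        return list(features)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]

layer = GradientReversal(lam=2.0)
features_out = layer.forward([1.0, -2.0])
grad_in = layer.backward([0.5, -1.0])
```

In a full model this layer would sit between the feature extractor and the domain classifier, leaving the label predictor's gradient path untouched.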
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speed-up, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
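The pooling interpretation in the abstract above can be checked numerically. The following is a minimal sketch, assuming the normalized $L_p$ norm takes the form $((1/N) \sum_i |x_i|^p)^{1/p}$; the function name `lp_unit` is hypothetical, and the learned projections and learned order $p$ of the actual unit are left out.

```python
def lp_unit(xs, p):
    """Normalized L_p norm of a list of floats: ((1/N) * sum(|x_i|**p)) ** (1/p).

    Assumed form for illustration only; the input projections described in
    the abstract are not modeled here.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(xs)
    largest = max(abs(x) for x in xs)
    if largest == 0:
        return 0.0
    # Factor out the largest magnitude so that large p does not overflow.
    scaled_sum = sum((abs(x) / largest) ** p for x in xs)
    return largest * (scaled_sum / n) ** (1.0 / p)


xs = [3.0, -4.0]
mean_abs = lp_unit(xs, 1)    # p = 1: average of absolute values
rms = lp_unit(xs, 2)         # p = 2: root-mean-square
near_max = lp_unit(xs, 200)  # large p: approaches max(|x_i|)
print(mean_abs, rms, near_max)
```

With $p = 1$ the sketch averages the absolute values, with $p = 2$ it gives the root-mean-square, and as $p$ grows the result approaches the largest magnitude, which is the sense in which one unit covers several pooling operators.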
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
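The masking idea in the abstract above can be illustrated on a single hidden layer. This is a minimal sketch, assuming the natural input ordering and a degree-based mask construction; the names `made_masks` and `connectivity` are hypothetical, and the abstract itself does not spell out how the masks are built.

```python
import random


def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumed scheme: input d has degree d (1..D); each hidden unit draws a
    degree in 1..D-1. A hidden unit may see inputs with degree <= its own,
    and output d may see hidden units with degree strictly below d.
    """
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # mask_in[k][i] == 1 when hidden unit k may read input i.
    mask_in = [[1 if hk >= d else 0 for d in in_deg] for hk in hid_deg]
    # mask_out[o][k] == 1 when output o may read hidden unit k.
    mask_out = [[1 if d > hk else 0 for hk in hid_deg] for d in in_deg]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the paths from each input i to each output o through the hidden layer."""
    n_inputs = len(mask_in[0])
    n_hidden = len(mask_in)
    return [
        [sum(mask_out[o][k] * mask_in[k][i] for k in range(n_hidden))
         for i in range(n_inputs)]
        for o in range(len(mask_out))
    ]


mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
paths = connectivity(mask_in, mask_out)
# Output o has no path from input i whenever i >= o, so each output depends
# only on inputs that come earlier in the ordering.
for row in paths:
    print(row)
```

The printed path counts form a strictly lower-triangular pattern, which is the autoregressive constraint: multiplying the weight matrices elementwise by such masks would leave each reconstruction a function of previous inputs only.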
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
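The "more than 13% relative improvement" quoted above follows directly from the two word error rates. A small sketch of the arithmetic, using only the figures stated in the abstract (the helper name `relative_improvement` is hypothetical):

```python
def relative_improvement(baseline, new):
    """Fraction by which `new` improves on `baseline` (lower is better)."""
    return (baseline - new) / baseline


baseline_wer = 4.54   # baseline small DNN, WSJ eval92
distilled_wer = 3.93  # small DNN trained on soft RNN alignments
rel = relative_improvement(baseline_wer, distilled_wer)
print(f"{100 * rel:.2f}%")  # prints 13.44%
```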
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that compressive MBN combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding an unsupervised method that is both effective and efficient.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy from layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
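The factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ stated above can be checked on a toy discrete example. This is a minimal sketch with made-up numbers: the encoder `f`, the prior `p_h`, and the decoder table `p_x_given_h` are all hypothetical, chosen so that the decoder puts no mass on any $x'$ with $f(x') \neq h$.

```python
def f(x):
    """Hypothetical deterministic encoder: maps x in {0, 1, 2, 3} to h in {0, 1}."""
    return x // 2


# Hypothetical prior over codes, P(H = h).
p_h = {0: 0.7, 1: 0.3}

# Hypothetical decoder P(X = x | H = h); each row only covers x with f(x) == h.
p_x_given_h = {
    0: {0: 0.5, 1: 0.5},
    1: {2: 0.9, 3: 0.1},
}


def p_x(x):
    """Likelihood from the factorization P(x) = P(x | h = f(x)) * P(h = f(x))."""
    h = f(x)
    return p_x_given_h[h].get(x, 0.0) * p_h[h]


def p_x_marginal(x):
    """Full marginalization over h, for comparison with the one-term form."""
    return sum(p_x_given_h[h].get(x, 0.0) * p_h[h] for h in p_h)


for x in range(4):
    print(x, p_x(x), p_x_marginal(x))
```

Because the decoder assigns zero probability to every $x$ outside the preimage of its code, the sum over $h$ collapses to the single term $h = f(x)$, so the two functions agree and the values of `p_x` sum to one.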
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
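The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function names and the `pressure` parameter (taken here to be the ratio between the selection probabilities of the highest-ranked and lowest-ranked datapoints) are assumptions introduced for the example.

```python
import math
import random


def rank_selection_probs(losses, pressure=100.0):
    """Return one selection probability per datapoint.

    Datapoints are ranked by their latest known loss (largest loss first),
    and the probability decays exponentially with rank. `pressure` is a
    hypothetical knob: roughly the ratio between the probability of the
    top-ranked and the bottom-ranked datapoint.
    """
    n = len(losses)
    # Indices sorted so that the datapoint with the greatest loss has rank 0.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Sample datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

Setting `pressure` to 1 makes every weight equal, which recovers uniform sampling; larger values concentrate batches on datapoints with high recent loss.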
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
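Stochastic rounding, as mentioned in the abstract above, rounds a value up or down at random with probabilities chosen so that the result is unbiased in expectation. The sketch below is a minimal illustration under stated assumptions: the function name and the `frac_bits` parameter (number of fractional bits in the fixed-point format) are hypothetical, and saturation to a fixed word length such as 16 bits is omitted for brevity.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with `frac_bits` fractional bits.

    The value is rounded up with probability equal to its distance from the
    lower grid point (measured in grid steps), and down otherwise, so the
    expected value of the result equals x. Overflow handling is not modelled.
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # nearest grid point at or below x
    frac = scaled - lower           # in [0, 1): distance to the lower grid point
    rounded = lower + (1 if rng.random() < frac else 0)
    return rounded / scale
```

By contrast, round-to-nearest would map every small update below half a grid step to zero, whereas the stochastic scheme preserves such updates on average.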
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
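The pooling interpretation in the preceding abstract can be illustrated with a small sketch. It assumes that the normalized $L_p$ norm means the $p$-th root of the mean of $|x_i|^p$; the abstract does not spell out the formula, and the function name `lp_unit` and the sample values are hypothetical.

```python
import math


def lp_unit(values, p):
    """Normalized L_p norm of `values`: (mean(|v|^p))^(1/p).

    This normalization is an assumption made for illustration; it is not
    taken from the abstract. p = inf is handled as the limiting case (max).
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if math.isinf(p):
        return max(abs(v) for v in values)
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


# Hypothetical projections feeding one unit.
projections = [1.0, -2.0, 3.0, -4.0]

avg_abs = lp_unit(projections, 1)         # average pooling of absolute values
rms = lp_unit(projections, 2)             # root-mean-square pooling
near_max = lp_unit(projections, 64)       # large p approaches max pooling
max_val = lp_unit(projections, math.inf)  # limiting case: max pooling
```

Under this assumption, raising the order $p$ moves the unit smoothly from average pooling towards max pooling, which is one way to read the claim that the unit generalizes these operators.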
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
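The masking idea in the preceding abstract can be sketched for a single hidden layer. The degree-based mask rule used here is one commonly described construction and is an assumption for illustration, since the abstract does not state the rule; the names `made_masks` and `connectivity` and the chosen hidden degrees are hypothetical.

```python
def made_masks(num_inputs, hidden_degrees):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Input d (1-based) has degree d. Hidden unit k carries a degree m_k in
    1..num_inputs-1. Assumed rule (not given in the abstract):
      input -> hidden:  connection allowed iff m_k >= d
      hidden -> output: connection allowed iff d > m_k
    """
    degrees = list(range(1, num_inputs + 1))
    mask_in = [[1 if m >= d else 0 for d in degrees] for m in hidden_degrees]
    mask_out = [[1 if d > m else 0 for m in hidden_degrees] for d in degrees]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count masked paths from each input j to each output i."""
    num_hidden = len(mask_in)
    num_inputs = len(mask_in[0])
    return [
        [
            sum(mask_out[i][k] * mask_in[k][j] for k in range(num_hidden))
            for j in range(num_inputs)
        ]
        for i in range(len(mask_out))
    ]


# Four inputs, six hidden units with hypothetical degrees.
mask_in, mask_out = made_masks(4, [1, 2, 3, 1, 2, 3])
paths = connectivity(mask_in, mask_out)
```

In this sketch the path-count matrix is strictly lower triangular: output `i` is reachable only from inputs that precede it in the ordering, so each output can be read as a conditional given the previous inputs.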
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
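The relative improvement quoted in the preceding abstract can be checked from the two word error rates it reports. This is only an arithmetic sketch; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline, new):
    """Fraction by which `new` improves on `baseline` (lower is better)."""
    return (baseline - new) / baseline


# WER figures as quoted in the abstract: baseline 4.54, distilled DNN 3.93.
rel = relative_improvement(4.54, 3.93)
rel_percent = round(100 * rel, 1)
```

The result falls between 13% and 14%, consistent with the "more than 13%" stated in the abstract.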
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
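A minimal sketch of how knowledge from a large model can be passed to a smaller one through softened output distributions. The temperature-scaled softmax and the cross-entropy against soft targets used here are the commonly described form of distillation and are assumptions for illustration, since the abstract does not give the formulation; the function names and logit values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy of the student's softened outputs against the teacher's."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


# Hypothetical teacher logits for a three-class problem.
teacher_logits = [8.0, 2.0, 0.5]

hard = softmax(teacher_logits, 1.0)  # nearly one-hot
soft = softmax(teacher_logits, 4.0)  # exposes relative class similarities

# A student that reproduces the teacher's logits attains the lowest loss;
# a student with reversed preferences is penalized.
loss_match = soft_target_loss(teacher_logits, teacher_logits, 4.0)
loss_off = soft_target_loss([0.5, 2.0, 8.0], teacher_logits, 4.0)
```

Raising the temperature spreads probability mass onto the non-maximal classes, so the student receives more information per example than a hard label would carry.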
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the multilayer bootstrap network into a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
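The straight-path analysis described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`interpolate`, `loss_along_path`) and the quadratic `toy_loss` are hypothetical stand-ins, and a real experiment would evaluate a network's training loss at each interpolated parameter vector.

```python
def interpolate(theta_init, theta_final, alpha):
    """Return the point (1 - alpha) * theta_init + alpha * theta_final."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the straight line
    from theta_init (alpha = 0) to theta_final (alpha = 1)."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    """Hypothetical stand-in for a training loss: squared distance to (1, ..., 1)."""
    return sum((t - 1.0) ** 2 for t in theta)


if __name__ == "__main__":
    for alpha, loss in loss_along_path(toy_loss, [0.0, 0.0], [1.0, 1.0], num_points=5):
        print(f"alpha={alpha:.2f} loss={loss:.4f}")
```

A bump in the resulting curve would indicate an obstacle on the straight path; a smoothly decreasing curve would be consistent with the abstract's finding.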
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
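The rank-based selection rule in the abstract above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) might look like the sketch below. It is an illustration under assumptions, not the authors' implementation: the `pressure` parameter, the exact decay formula and the function names are hypothetical, and sampling is done with replacement for simplicity.

```python
import math
import random


def selection_probabilities(losses, pressure=100.0):
    """Map each datapoint's latest known loss to a selection probability.

    Datapoints are ranked by loss (rank 0 = greatest loss) and the weight
    decays exponentially with rank. `pressure` (an assumed knob, >= 1)
    is roughly the ratio between the highest and lowest selection
    probability when the dataset is large.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


if __name__ == "__main__":
    latest_losses = [0.1, 2.0, 0.5, 1.2]
    print(selection_probabilities(latest_losses))
    print(sample_batch(latest_losses, batch_size=8, rng=random.Random(0)))
```

Setting `pressure=1.0` recovers uniform sampling, which gives one simple way to control the selection pressure over time.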
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain those for full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model. The model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
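The stochastic rounding idea mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names `stochastic_round` and `to_fixed`, and the split into 8 integer and 8 fractional bits, are assumptions chosen only to give a 16-bit word. A value is rounded up to the next grid point with probability equal to its fractional distance from the lower grid point, so the rounded value equals the input in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its distance from
    the lower grid point (measured in grid steps), so E[result] == x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # lies in [0, 1)
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale


def to_fixed(x, int_bits=8, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed fixed-point range.

    int_bits counts the sign bit; the 8/8 split is an illustrative
    assumption, not a value taken from the abstract.
    """
    scale = 1 << frac_bits
    hi = (1 << (int_bits - 1)) - 1.0 / scale
    lo = -float(1 << (int_bits - 1))
    return max(lo, min(hi, stochastic_round(x, frac_bits, rng)))
```

Averaging many calls such as `to_fixed(0.3)` recovers a value close to 0.3, whereas round-to-nearest would always return the same grid point.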
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
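The relationship between input size, kernel size, zero padding, stride and output size described above can be made concrete with the standard formulas for a single spatial axis. The sketch below is illustrative only: the function names are hypothetical, and the transposed case assumes no extra output padding, so it inverts the convolution shape only when `i + 2*p - k` is a multiple of `s`.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution or pooling layer along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution associated with a
    convolution that uses kernel k, stride s and padding p.

    Assumes no additional output padding.
    """
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives `conv_output_size(5, 3, 2, 1) == 3`, and the associated transposed layer maps 3 back to `transposed_conv_output_size(3, 3, 2, 1) == 5`.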
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
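One simple way to combine max and average pooling, as direction (1) above suggests, is a convex combination of the two over each pooling region. The sketch below is an assumption made for illustration, not necessarily either of the two strategies the authors use: the function name `mixed_pool` and the scalar mixing weight `alpha` are hypothetical, and in a trained network `alpha` would be a learned parameter rather than a constant.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty sequence of activations in the pooling window.
    alpha:  mixing weight in [0, 1]; 1.0 gives pure max pooling and
            0.0 gives pure average pooling.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value
```

With the region `[1, 2, 3, 6]`, `alpha=1.0` returns the maximum 6, `alpha=0.0` returns the mean 3.0, and `alpha=0.5` returns 4.5.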
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
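The stochastic pooling step described above can be sketched for a single pooling region as follows. This is a minimal illustration, assuming the activations are non-negative (for example, after a rectifier) so that they can be normalised into a multinomial distribution; the function name `stochastic_pool` and the fallback of returning 0.0 for an all-zero region are assumptions, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation a_i is picked with probability a_i / sum(region).
    Assumes every activation is non-negative.
    """
    total = sum(region)
    if total <= 0:
        return 0.0  # nothing active in this region
    threshold = rng.random() * total
    cumulative = 0.0
    for activation in region:
        cumulative += activation
        if threshold < cumulative:
            return activation
    return region[-1]  # guard against floating-point round-off
```

Larger activations are chosen more often, but smaller non-zero activations still get selected occasionally, which is the source of the regularising noise.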
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
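A minimal sketch of the normalized $L_p$ norm described in the abstract above, showing how it recovers average pooling of magnitudes (p = 1), root-mean-square pooling (p = 2) and, for large p, approaches max pooling. The function name `lp_unit` is hypothetical, and the learned projections and per-unit learned order that the abstract mentions are omitted; only the pooling formula is illustrated.

```python
def lp_unit(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p).

    Hypothetical helper: the inputs stand in for the projected signals
    a unit would receive; no projection weights are learned here.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


if __name__ == "__main__":
    signals = [1.0, -2.0, 3.0]
    for order in (1, 2, 50):
        # p=1 is the mean magnitude, p=2 is RMS, large p tends to the max magnitude
        print(order, lp_unit(signals, order))
```

Because the sum is divided by N before the root is taken, the result for any finite p stays slightly below the maximum magnitude and only reaches it in the limit.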
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
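The masking idea in the abstract above can be sketched for a single hidden layer. The degree-based construction below is one common way to build such masks and is assumed here; the abstract itself does not spell it out. The names `made_masks` and `path_counts` are hypothetical. Each input gets a degree equal to its position in the ordering, each hidden unit gets a random degree, and a connection is kept only if it cannot leak information from an input to an output at the same or an earlier position.

```python
import random


def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumed scheme: input j has degree j+1, hidden unit k has a random
    degree in [1, n_inputs-1].
    """
    if n_inputs < 2:
        raise ValueError("need at least two inputs")
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Hidden unit k may read input j only if its degree is >= the input's degree.
    mask_in = [[int(hid_deg[k] >= in_deg[j]) for j in range(n_inputs)]
               for k in range(n_hidden)]
    # Output d may read hidden unit k only if its degree is strictly larger.
    mask_out = [[int(in_deg[d] > hid_deg[k]) for k in range(n_hidden)]
                for d in range(n_inputs)]
    return mask_in, mask_out


def path_counts(mask_in, mask_out):
    """Entry [d][j] counts the unmasked paths from input j to output d."""
    n_hidden = len(mask_in)
    n_inputs = len(mask_out)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(n_hidden))
             for j in range(n_inputs)]
            for d in range(n_inputs)]


if __name__ == "__main__":
    m_in, m_out = made_masks(n_inputs=5, n_hidden=16)
    for row in path_counts(m_in, m_out):
        print(row)
```

In the printed matrix, every entry on or above the diagonal is zero, so output d depends only on inputs that precede it in the ordering; in a full model the masks would be multiplied elementwise with the weight matrices.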
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
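The random-walk picture in the abstract above can be illustrated with a small simulation: repeatedly multiply a vector by fresh random matrices and track the log of its norm. This is only a sketch under stated assumptions. Entries are drawn as Gaussians with standard deviation 1/sqrt(width), which preserves the expected squared norm but is not the paper's exact unbiased scaling, and the function names are hypothetical.

```python
import math
import random


def log_norm_walk(width, depth, rng):
    """Log-norm of a vector (relative to its start) after each random layer."""
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = 0.5 * math.log(sum(v * v for v in x))
    std = 1.0 / math.sqrt(width)  # assumed scaling, see lead-in
    walk = []
    for _ in range(depth):
        # Multiply by a fresh width x width Gaussian matrix, one row at a time.
        x = [sum(rng.gauss(0.0, std) * v for v in x) for _ in range(width)]
        walk.append(0.5 * math.log(sum(v * v for v in x)) - start)
    return walk


def final_log_norm_variance(width, depth, trials, seed=0):
    """Sample variance of the walk's end point over independent trials."""
    rng = random.Random(seed)
    finals = [log_norm_walk(width, depth, rng)[-1] for _ in range(trials)]
    mean = sum(finals) / trials
    return sum((f - mean) ** 2 for f in finals) / (trials - 1)


if __name__ == "__main__":
    for width in (4, 32):
        print(width, final_log_norm_variance(width, depth=20, trials=40))
```

With these settings the wider network should show a markedly smaller spread of the final log-norm, in line with the abstract's statement that the variance is inversely proportional to layer size.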
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
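A short sketch of the two quantities in the abstract above: the Kullback-Leibler divergence between a teacher's soft output distribution and a student's distribution, and the relative WER improvement computed from the two reported numbers. The direction KL(teacher || student), the example logits, and all function names are assumptions made for illustration; only the WER figures 4.54 and 3.93 come from the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def relative_improvement(baseline, new):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    teacher = softmax([2.0, 1.0, 0.1])   # hypothetical soft alignment for one frame
    student = softmax([0.5, 0.5, 0.5])   # hypothetical untrained student output
    print("KL(teacher || student):", kl_divergence(teacher, student))
    print("relative WER improvement:", relative_improvement(4.54, 3.93))
```

The relative improvement works out to (4.54 - 3.93) / 4.54, which lies between 13% and 14% and matches the "more than 13%" stated in the abstract.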
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN in unsupervised learning with the effectiveness and efficiency of DNNs in supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training in the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
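The Gramian angular field encoding mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the series is min-max rescaled to [-1, 1], each value is mapped to an angle phi = arccos(x), and the fields are taken as GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j). The function names `rescale` and `angular_fields` are hypothetical.

```python
import math


def rescale(series):
    """Min-max rescale a series to [-1, 1] (an assumed preprocessing choice)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    return [((x - hi) + (x - lo)) / (hi - lo) for x in series]


def angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D series."""
    # Clamp before acos so rounding error cannot push a value outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in rescale(series)]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]  # summation field
    gadf = [[math.sin(a - b) for b in phi] for a in phi]  # difference field
    return gasf, gadf


if __name__ == "__main__":
    gasf, gadf = angular_fields([0.0, 1.0, 2.0])
    for row in gasf:
        print([round(v, 3) for v in row])
```

Each field is an n-by-n image for a series of length n, which is what lets image models such as CNNs be applied to it.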
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
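The straight-path analysis described above amounts to evaluating the loss at parameters interpolated linearly between the initial and final values. The sketch below illustrates that idea on a toy quadratic loss; the helper names (`interpolate`, `loss_along_path`) and the toy loss are assumptions for illustration, and parameters are represented as flat Python lists rather than network weights.

```python
def interpolate(theta_init, theta_final, alpha):
    """Return (1 - alpha) * theta_init + alpha * theta_final, elementwise."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the straight path.

    Returns a list of (alpha, loss) pairs with alpha running from 0 to 1.
    """
    alphas = [k / (num_points - 1) for k in range(num_points)]
    return [(a, loss_fn(interpolate(theta_init, theta_final, a))) for a in alphas]


if __name__ == "__main__":
    # Toy stand-in for a training loss: squared distance to the point (1, 1).
    def toy_loss(theta):
        return sum((t - 1.0) ** 2 for t in theta)

    for alpha, loss in loss_along_path(toy_loss, [0.0, 0.0], [1.0, 1.0]):
        print(f"alpha={alpha:.1f} loss={loss:.4f}")
```

A bump in the resulting curve would indicate an obstacle on the path; a smoothly decreasing curve would not.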
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
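The rank-based selection rule described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: datapoints are ranked by their latest known loss (rank 0 is the largest loss) and given a weight exp(-decay * rank), which is then normalized. The `decay` parameter, the function names, and sampling with replacement are all assumptions.

```python
import math
import random


def rank_selection_probs(losses, decay=0.1):
    """Selection probabilities that decay exponentially with loss rank."""
    n = len(losses)
    # Indices sorted so that the datapoint with the largest loss gets rank 0.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, decay=0.1, rng=random):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, decay)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


if __name__ == "__main__":
    latest_losses = [0.1, 2.0, 0.5, 1.2]
    print(rank_selection_probs(latest_losses, decay=1.0))
    print(sample_batch(latest_losses, batch_size=8, decay=1.0, rng=random.Random(0)))
```

A larger `decay` increases the selection pressure toward high-loss datapoints, while `decay=0` recovers uniform sampling.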
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
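For orientation, the basic SVRG update analyzed above can be sketched as follows: keep a snapshot of the parameters, compute the full gradient at the snapshot once per epoch, and correct each stochastic gradient with it. This is a generic scalar-parameter sketch of the standard algorithm, not the specific variant or step-size schedule from the paper. The function name `svrg`, its arguments, and the toy least-squares problem (which is convex, chosen only so the answer is easy to check) are assumptions.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, rng):
    """Minimize (1/n) * sum_i f_i(w) for a scalar w using SVRG.

    grad_i(w, i) must return the gradient of the i-th component f_i at w.
    """
    w_snap = w0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        mu = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient estimate.
            g = grad_i(w, i) - grad_i(w_snap, i) + mu
            w -= step * g
        w_snap = w  # use the last inner iterate as the next snapshot
    return w_snap


if __name__ == "__main__":
    # Toy problem: f_i(w) = 0.5 * (a_i * w - b_i)^2, minimized at w = 2.
    a = [1.0, 2.0, 3.0]
    b = [2.0, 4.0, 6.0]

    def grad(w, i):
        return a[i] * (a[i] * w - b[i])

    print(svrg(grad, n=3, w0=0.0, step=0.05, epochs=20, inner_steps=50,
               rng=random.Random(0)))
```

The correction term has zero mean over the choice of i, so the estimate stays unbiased while its variance shrinks as the iterate approaches the snapshot.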
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS-based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speedup, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speedup is expected as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model. The model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence of acoustic signals, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
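The stochastic rounding idea mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `stochastic_round`, the `frac_bits` parameter, and the omission of overflow clipping to a fixed word length are all assumptions made for illustration. A value is rounded up to the next fixed-point grid point with probability equal to its fractional distance from the lower grid point, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with step 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its distance from the
    lower grid point (measured in grid steps), and rounded down otherwise.
    Saturation to a fixed word length is deliberately left out of this sketch.
    """
    scale = 1 << frac_bits            # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)        # nearest grid point at or below x
    prob_up = scaled - lower          # fractional part, in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    # With frac_bits=2 the grid step is 0.25, so 0.3 becomes 0.25 or 0.5.
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    print(sum(samples) / len(samples))  # close to 0.3 on average
```

Round-to-nearest would always map 0.3 to 0.25 on this grid, which loses small updates systematically; the stochastic variant preserves them on average.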
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
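The relationship between input shape, kernel shape, zero padding, strides and output shape can be sketched along a single spatial axis as follows. This is a minimal Python illustration using the commonly stated formulas, not text taken from the guide itself; the function names are hypothetical, and the transposed case assumes no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution associated with a
    convolution of kernel size k, stride s and padding p
    (assumes no additional output padding)."""
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5-wide input, 3-wide kernel, stride 2, padding 1 gives a 3-wide output,
    # and the matching transposed convolution maps 3 back to 5.
    print(conv_output_size(5, 3, s=2, p=1))             # 3
    print(transposed_conv_output_size(3, 3, s=2, p=1))  # 5
```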
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
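One simple way to combine max and average pooling, as described above, is a convex mix of the two. The Python sketch below is an illustration under assumptions rather than the authors' method: the name `mixed_pool` is hypothetical, and the mixing weight `alpha` is passed in as a fixed number, whereas the text describes it as something to be learned.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    alpha:  mixing weight in [0, 1]; 1.0 gives pure max pooling and
            0.0 gives pure average pooling (a fixed value in this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    print(mixed_pool(window, 1.0))  # 6.0, pure max pooling
    print(mixed_pool(window, 0.0))  # 3.0, pure average pooling
    print(mixed_pool(window, 0.5))  # 4.5, an even blend
```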
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pretraining: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
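The knowledge-transfer objective described above can be sketched for a single frame as follows. This is only an illustration, assuming the teacher's soft alignment and the student's output are both discrete distributions over the same states; the function name and the example probabilities are hypothetical and do not come from the abstract. The last two lines check the quoted relative WER improvement.

```python
import math

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions over the same states."""
    # Terms with zero teacher probability contribute nothing to the sum.
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(teacher, student)
        if p > 0.0
    )

# Hypothetical soft alignment from the RNN (teacher) and the small DNN's output.
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.5, 0.3, 0.2]
frame_loss = kl_divergence(teacher_probs, student_probs)

# Relative improvement implied by the WER figures in the abstract.
baseline_wer, distilled_wer = 4.54, 3.93
relative_improvement = (baseline_wer - distilled_wer) / baseline_wer
```

Minimizing `frame_loss` with respect to the student's parameters pulls the small model's distribution toward the teacher's; the loss is zero only when the two distributions agree.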
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
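The Gramian Angular Field encoding mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the commonly cited definitions GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j), where phi = arccos(x) is computed on a series rescaled to [0, 1]; the function names and the example series are hypothetical, and the series is assumed to be non-constant.

```python
import math

def rescale_unit(series):
    """Min-max rescale a non-constant series to the interval [0, 1]."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D time series."""
    scaled = rescale_unit(series)
    # Clip before arccos to guard against floating-point overshoot.
    phi = [math.acos(max(0.0, min(1.0, x))) for x in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf

example_series = [1.0, 3.0, 2.0, 5.0]
gasf, gadf = gramian_angular_fields(example_series)
```

On [0, 1] rescaled data the GASF diagonal equals 2x^2 - 1, so the rescaled series can be recovered from the image, which is the bijection property the abstract relies on for imputation.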
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
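The rank-based selection strategy described above can be sketched as follows. This is only an illustration: the exact parametrization of the exponential decay is not given in the abstract, so the `decay` rate, the function names, and the example losses are assumptions, and sampling here is done with replacement for simplicity.

```python
import math
import random

def selection_probabilities(losses, decay=0.1):
    """Probability of picking each datapoint, decaying exponentially with loss rank."""
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(losses, batch_size, decay=0.1, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probabilities(losses, decay)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)

latest_losses = [0.2, 1.5, 0.7, 0.05]  # hypothetical per-datapoint losses
probs = selection_probabilities(latest_losses, decay=0.5)
batch = sample_batch(latest_losses, batch_size=3, decay=0.5, rng=random.Random(0))
```

A larger `decay` concentrates selection on the highest-loss datapoints (stronger selection pressure), while `decay=0` recovers uniform sampling.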
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we first develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speedup, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speedup is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher-order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBNs) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such methods to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
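The soft-target idea in the abstract above can be illustrated with a minimal sketch. It assumes the student is trained with a cross-entropy loss against the teacher's output distribution; the `temperature` parameter and the function names `log_softmax` and `soft_target_loss` are hypothetical illustration choices, not taken from the abstract.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits scaled by 1/temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft targets and the student's output.

    Assumption: both distributions use the same temperature.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(t * lp for t, lp in zip(teacher_probs, student_log_probs))

teacher = [2.0, 1.0, -1.0]
matching_loss = soft_target_loss(teacher, teacher)
mismatched_loss = soft_target_loss([-1.0, 1.0, 2.0], teacher)
```

For fixed teacher logits, this loss is smallest when the student reproduces the teacher's distribution, so `matching_loss` is lower than `mismatched_loss`.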
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a slight increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
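The sampling step described in the abstract above can be sketched as follows. This is a minimal illustration, assuming non-negative activations (for example, after a rectifier) and a flat list standing in for one pooling region; the function name `stochastic_pool` is hypothetical.

```python
import random

def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. a multinomial distribution given by the normalized activities.
    Assumption: all activations in `region` are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # No active unit in the region: nothing to sample from.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
region = [0.0, 1.5, 0.5, 3.0]
samples = [stochastic_pool(region, rng) for _ in range(1000)]
```

Larger activations are selected more often, but smaller non-zero ones are still picked occasionally, which is the source of the regularizing noise.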
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
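The pooling interpretation in the abstract above can be checked numerically. The following is only a minimal sketch: it assumes the normalized $L_p$ norm has the form $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$ and leaves out the learned projections the abstract mentions; the function name `lp_pool` is hypothetical.

```python
def lp_pool(values, p):
    """Normalized L_p norm: (mean(|x_i|^p))^(1/p). Assumed form, no projections."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)

x = [1.0, 2.0, 3.0, 4.0]
mean_abs = lp_pool(x, 1)     # p = 1 reduces to average pooling of |x_i|
rms = lp_pool(x, 2)          # p = 2 reduces to root-mean-square pooling
near_max = lp_pool(x, 200)   # a large p approaches max pooling over |x_i|

print(mean_abs, rms, near_max)
```

Under this assumed form, varying the single order $p$ moves the unit smoothly between the three pooling operators, which is the generalization the abstract describes.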
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most current deep learning methods suffer from scalability problems in large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
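The masking idea in the abstract above can be sketched for a single hidden layer. This is a hedged illustration rather than the authors' code: it assumes each hidden unit is given a random "degree" between 1 and D-1 and that connections are kept only when they respect those degrees; `made_masks` and `depends_on` are hypothetical names.

```python
import random

def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoencoder (assumed scheme)."""
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))  # input d gets degree d
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # A hidden unit may see inputs whose degree does not exceed its own.
    mask_in = [[1 if hk >= d else 0 for d in in_deg] for hk in hid_deg]
    # An output may see hidden units with strictly smaller degree.
    mask_out = [[1 if d > hk else 0 for hk in hid_deg] for d in in_deg]
    return mask_in, mask_out

def depends_on(mask_in, mask_out, out_d, in_j):
    """True if some unmasked path links input in_j to output out_d."""
    return any(mask_out[out_d][k] and mask_in[k][in_j]
               for k in range(len(mask_in)))

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
for d in range(5):
    print(d, [j for j in range(5) if depends_on(mask_in, mask_out, d, j)])
```

With these masks, output d can only be reached from inputs that come earlier in the ordering, so each output can be read as a conditional given the previous inputs.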
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.
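Two pieces of the abstract above lend themselves to a small worked example: the Kullback-Leibler objective between the teacher's soft alignments and the student's outputs, and the relative improvement implied by the two reported WER figures. The sketch below is illustrative only; the `teacher` and `student` distributions are made-up values for a single frame, not data from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.7, 0.2, 0.1]  # hypothetical soft alignment produced by the RNN
student = [0.5, 0.3, 0.2]  # hypothetical posterior from the small DNN
loss = kl_divergence(teacher, student)

# Relative improvement from the WER figures quoted in the abstract.
baseline_wer, distilled_wer = 4.54, 3.93
relative_gain = (baseline_wer - distilled_wer) / baseline_wer

print(f"KL loss: {loss:.4f}, relative WER improvement: {relative_gain:.1%}")
```

The relative improvement works out to roughly 13.4%, consistent with the claim of more than 13%.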
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
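One way to read the "low-rank approximation to an ideal pairwise affinity matrix" in the abstract above is as a squared Frobenius distance between the embedding affinities $VV^T$ and the label affinities $YY^T$. The abstract does not state the formula, so that choice is an assumption here, and the names `gram`, `sq_frob`, `affinity_loss` and `affinity_loss_naive` are hypothetical. The sketch also shows why the low-rank structure helps: the loss can be expanded so that no N x N matrix is ever formed.

```python
def gram(a, b):
    """Return a^T b for row-major lists a (N x K) and b (N x C)."""
    return [[sum(ra[i] * rb[j] for ra, rb in zip(a, b))
             for j in range(len(b[0]))] for i in range(len(a[0]))]

def sq_frob(m):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(x * x for row in m for x in row)

def affinity_loss(v, y):
    # ||V V^T - Y Y^T||_F^2 = ||V^T V||^2 - 2 ||V^T Y||^2 + ||Y^T Y||^2
    return sq_frob(gram(v, v)) - 2.0 * sq_frob(gram(v, y)) + sq_frob(gram(y, y))

def affinity_loss_naive(v, y):
    # Reference version that visits every pair (i, j) of the N x N affinities.
    total = 0.0
    for vi, yi in zip(v, y):
        for vj, yj in zip(v, y):
            vv = sum(a * b for a, b in zip(vi, vj))
            yy = sum(a * b for a, b in zip(yi, yj))
            total += (vv - yy) ** 2
    return total

labels = [[1, 0], [1, 0], [0, 1]]                  # one-hot partition labels
embed = [[0.6, 0.8], [0.8, 0.6], [0.0, 1.0]]       # made-up unit-norm embeddings
print(affinity_loss(embed, labels), affinity_loss_naive(embed, labels))
```

The expanded form only touches K x K, K x C and C x C matrices, which is what makes training on many time-frequency bins affordable under this reading.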
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
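The image encoding named in the abstract above can be illustrated briefly. The abstract does not spell out the construction, so this sketch assumes the usual definition: rescale the series, map each value to an angle with arccos, then take cos of pairwise angle sums (GASF) or sin of pairwise angle differences (GADF). Rescaling to [-1, 1] is a choice made here; the abstract also mentions 0/1 rescaling. The function names are hypothetical.

```python
import math

def rescale(series):
    """Linearly map a series onto [-1, 1] so that arccos is defined."""
    lo, hi = min(series), max(series)
    return [((x - hi) + (x - lo)) / (hi - lo) for x in series]

def angles(series):
    # Clamp guards against values drifting slightly outside [-1, 1].
    return [math.acos(max(-1.0, min(1.0, x))) for x in rescale(series)]

def gasf(series):
    """Summation field: entry (i, j) is cos(phi_i + phi_j)."""
    phi = angles(series)
    return [[math.cos(a + b) for b in phi] for a in phi]

def gadf(series):
    """Difference field: entry (i, j) is sin(phi_i - phi_j)."""
    phi = angles(series)
    return [[math.sin(a - b) for b in phi] for a in phi]

field = gasf([1.0, 2.0, 3.0, 4.0, 5.0])
diff_field = gadf([1.0, 2.0, 3.0, 4.0, 5.0])
for row in field:
    print(" ".join(f"{v:6.2f}" for v in row))
```

A series of length n becomes an n x n image, which is what allows image classifiers such as tiled CNNs to be applied to time series.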
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
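The exact factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ stated in the abstract above can be verified on a toy example. The distribution `p_x` and the encoder `f` below are made up for illustration; the decoder is taken to be the exact conditional, which puts no mass on any $x'$ with $f(x') \neq f(x)$.

```python
import math

p_x = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}  # toy distribution over four symbols

def f(x):
    """Hypothetical deterministic encoder: merge symbols into two codes."""
    return x // 2

# P(H = h) is the total mass of all x that encode to h.
p_h = {}
for x, p in p_x.items():
    p_h[f(x)] = p_h.get(f(x), 0.0) + p

def p_x_given_h(x, h):
    """Ideal decoder: zero mass on any x that does not encode to h."""
    return p_x[x] / p_h[h] if f(x) == h else 0.0

for x in p_x:
    h = f(x)
    recon = math.log(p_x_given_h(x, h))  # reconstruction term
    prior = math.log(p_h[h])             # regularizer on the code
    print(x, h, round(recon + prior, 6), round(math.log(p_x[x]), 6))
```

For every x the two log terms add up to log P(x), matching the split into a reconstruction error and a regularizer on the encoded activations.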
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach on a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
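The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `selection_probabilities`, `sample_batch` and `pressure` are hypothetical, and the exact decay formula is an assumption chosen so that the highest-loss point is roughly `pressure` times more likely to be picked than the lowest-loss point.

```python
import math
import random


def selection_probabilities(losses, pressure=100.0):
    """Return selection probabilities that decay exponentially with loss rank.

    `losses` holds the latest known loss of each datapoint.
    `pressure` (hypothetical parameter) is roughly the probability ratio
    between the highest-loss and the lowest-loss datapoint.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the largest latest known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank decay factor (assumed form): pressure ** (1 / n).
    decay = math.exp(math.log(pressure) / n)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = 1.0 / decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, rng=random, pressure=100.0):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

Setting `pressure=1.0` recovers uniform sampling, which is one way to think about controlling the selection pressure over time.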
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
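Stochastic rounding, as mentioned in the abstract above, can be illustrated with a short sketch. This is a generic version of the technique and not the paper's code: the function name `stochastic_round` and the parameter `frac_bits` are hypothetical, and saturation to a fixed 16-bit range is omitted for brevity.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits using stochastic rounding.

    The value is rounded up with probability equal to its fractional
    remainder on the fixed-point grid, so the result equals x in expectation.
    Overflow handling (saturation to the word length) is not modelled here.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up is the distance from the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale
```

Unlike round-to-nearest, small updates that fall below half a grid step are not always lost: they are kept with a probability proportional to their size.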
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation into which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations on standard data sets such as MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that compressive MBN combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNNs on supervised learning, yielding a method that is both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
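The Gramian Angular Summation/Difference Field encoding mentioned above can be sketched as follows. This is a minimal illustration that assumes the commonly cited definition (rescale the series to [-1, 1], take phi = arccos(x), then GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j)); the abstract itself does not state these formulas, and the function names are hypothetical.

```python
import math


def rescale(series, lo=-1.0, hi=1.0):
    """Min-max rescale a series into the interval [lo, hi]."""
    s_min, s_max = min(series), max(series)
    if s_max == s_min:
        raise ValueError("series must contain at least two distinct values")
    scale = (hi - lo) / (s_max - s_min)
    return [lo + (v - s_min) * scale for v in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) matrices for a 1-D series (assumed definition)."""
    scaled = rescale(series)
    # Clamp before arccos to guard against tiny floating-point overshoot.
    phis = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    gasf = [[math.cos(a + b) for b in phis] for a in phis]
    gadf = [[math.sin(a - b) for b in phis] for a in phis]
    return gasf, gadf


gasf, gadf = gramian_angular_fields([0.0, 1.0, 2.0, 3.0])
```

Each matrix can then be treated as a single-channel image; the tiled CNN and denoising auto-encoder stages described in the abstract are not reproduced here.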
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
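A gradient reversal layer of the kind named above can be sketched in a framework-free way. The sketch assumes the usual description of such a layer (identity on the forward pass, gradient multiplied by a negative factor on the backward pass); the class name and the scaling parameter `lam` are illustrative assumptions, not details taken from the abstract.

```python
class GradientReversal:
    """Identity in the forward pass; flips and scales gradients going back."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical trade-off factor for the reversed gradient.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The upstream feature extractor receives the negated gradient, so
        # it is pushed to make the domains harder to tell apart.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.1)
out = layer.forward([1.0, 2.0])
grad_in = layer.backward([0.5, -2.0])
```

In an automatic-differentiation framework the same behaviour would normally be written as a custom gradient for an identity operation rather than as an explicit `backward` method.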
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
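The straight-path analysis described above can be sketched as evaluating the loss at points theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final for alpha in [0, 1]. The toy quadratic loss and all function names below are assumptions chosen for illustration; a real experiment would use the network's training loss and its actual initial and final parameters.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b
            for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points from init to solution."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    # Stand-in objective (a convex bowl centred at 1), not a neural network.
    return sum((t - 1.0) ** 2 for t in theta)


curve = loss_along_path(toy_loss, [3.0, -2.0], [1.0, 1.0])
for alpha, value in curve:
    print(f"alpha={alpha:.1f} loss={value:.3f}")
```

A bump in the resulting curve would indicate an obstacle on the line between the two parameter vectors; a smoothly decreasing curve would not.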
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
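The rank-based selection rule described above can be sketched as follows. The abstract only says that the selection probability decays exponentially with rank; the specific parameterisation here (a `pressure` value equal to the approximate ratio between the most and least likely datapoints) and the function names are assumptions made for illustration.

```python
import math
import random


def selection_probabilities(latest_losses, pressure=100.0):
    """Probability of picking each datapoint, decaying with its loss rank."""
    n = len(latest_losses)
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    decay = math.log(pressure) / n  # assumed parameterisation
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, pressure=100.0, rng=random):
    """Draw a batch of indices (with replacement) using the rank-based rule."""
    probs = selection_probabilities(latest_losses, pressure)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


losses = [0.1, 2.5, 0.7, 1.3]
probs = selection_probabilities(losses)
batch = sample_batch(losses, batch_size=8, rng=random.Random(0))
```

Setting `pressure=1.0` recovers uniform sampling, which gives a simple baseline to compare against.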
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM) and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
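Stochastic rounding, as referred to above, rounds a value up or down to a neighbouring representable number with probability proportional to its proximity, so the result is unbiased in expectation. The following is a minimal sketch under assumed conventions: the function name, the signed fixed-point layout (`word_bits` total, `frac_bits` fractional) and the saturation behaviour are illustrative choices, not details taken from the abstract.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Quantize x to signed fixed point using stochastic rounding (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so that the expected value of the result equals the input.
    round_up = rng.random(scaled.shape) < (scaled - floor)
    q = floor + round_up
    # Saturate to the range representable in word_bits (assumed behaviour).
    q_min = -(2 ** (word_bits - 1))
    q_max = 2 ** (word_bits - 1) - 1
    q = np.clip(q, q_min, q_max)
    return q / scale
```

With `frac_bits=2` the grid spacing is 0.25, so an input of 0.3 is mapped to either 0.25 or 0.5, and the average over many draws stays close to 0.3. Round-to-nearest would always return 0.25 and lose that information.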
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
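The relationship between input shape, kernel shape, zero padding, strides and output shape can be made concrete along one spatial axis. The sketch below uses the commonly stated relations for a convolution and for a transposed convolution without extra output padding; the function names and argument order are assumptions for illustration.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the matching transposed convolution (no output padding)."""
    return s * (i - 1) + k - 2 * p
```

For example, `conv_output_size(5, 3, s=2, p=1)` gives 3 and `transposed_conv_output_size(3, 3, s=2, p=1)` gives 5. Note that `conv_output_size(6, 3, s=2, p=1)` is also 3: because of the floor division, several input sizes can map to the same output size, so the transposed layer recovers only one of them unless extra output padding is specified.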
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
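One simple way to combine max and average pooling, in the spirit of direction (1) above, is a convex combination of the two with a mixing weight. This is a sketch of that idea for a single pooling region only; the function name and the scalar `mix` parameter are assumptions, and in a trained network `mix` would be a learned parameter rather than a fixed argument.

```python
import numpy as np

def mixed_max_avg_pool(region, mix):
    """Blend max and average pooling over one region (illustrative sketch).

    mix = 1.0 reduces to max pooling, mix = 0.0 to average pooling.
    """
    region = np.asarray(region, dtype=np.float64)
    return mix * region.max() + (1.0 - mix) * region.mean()
```

For the region `[1, 2, 3, 6]`, the max is 6 and the mean is 3, so `mix=0.5` yields 4.5.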
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
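The stochastic pooling step described above, picking one activation per region with probability given by the normalised activities, can be sketched for a single region as follows. The function name is hypothetical, and the sketch assumes non-negative activations (for example after a rectifier) and returns 0 for an all-zero region, which is an illustrative choice.

```python
import numpy as np

def stochastic_pool_region(activations, rng=None):
    """Sample one activation from a pooling region (illustrative sketch).

    Each activation is chosen with probability proportional to its value,
    so larger activations are picked more often but not always.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(activations, dtype=np.float64).ravel()
    total = a.sum()
    if total <= 0:
        # No positive activity in the region: nothing to sample from.
        return 0.0
    probs = a / total
    idx = rng.choice(a.size, p=probs)
    return float(a[idx])
```

For the region `[0, 0, 3, 1]` the sampled output is 3 with probability 0.75 and 1 with probability 0.25, whereas max pooling would always return 3.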
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does deep learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6 dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBNs) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize its descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects of convolutional neural network (CNN) style architectures are examined: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training in the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they have led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first-order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
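The Gramian angular field encoding mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the commonly cited construction: rescale the series to [-1, 1], map each value to an angle phi = arccos(x), then form GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j). The function names `rescale` and `gramian_angular_fields` are hypothetical, and this is not the authors' implementation.

```python
import math


def rescale(series):
    """Min-max rescale a series to the interval [-1, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # Constant series: map every point to 0 (an arbitrary choice for this sketch).
        return [0.0 for _ in series]
    return [((x - hi) + (x - lo)) / (hi - lo) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists, under the assumptions stated above."""
    x = rescale(series)
    # Clamp before arccos to guard against tiny floating-point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    n = len(phi)
    gasf = [[math.cos(phi[i] + phi[j]) for j in range(n)] for i in range(n)]
    gadf = [[math.sin(phi[i] - phi[j]) for j in range(n)] for i in range(n)]
    return gasf, gadf


gasf, gadf = gramian_angular_fields([1.0, 2.0, 4.0, 3.0])
```

Each n-point series becomes two n-by-n matrices that can be treated as image channels; GASF is symmetric and GADF is antisymmetric by construction.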
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
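The "straight path from initialization to solution" in the abstract above can be sketched as a linear interpolation between two parameter vectors, evaluating the loss at evenly spaced points. This is a minimal illustration under stated assumptions: `interpolate`, `loss_along_path`, and the quadratic `toy_loss` are hypothetical stand-ins, not the authors' code or models.

```python
def interpolate(theta_init, theta_final, alpha):
    """Return (1 - alpha) * theta_init + alpha * theta_final, element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, steps=11):
    """Evaluate loss_fn at `steps` evenly spaced points on the straight line
    from theta_init (alpha = 0) to theta_final (alpha = 1)."""
    alphas = [k / (steps - 1) for k in range(steps)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    """Toy convex loss with its minimum at theta = (1, ..., 1); an assumption
    used only to make the sketch runnable."""
    return sum((t - 1.0) ** 2 for t in theta)


path = loss_along_path(toy_loss, [3.0, -1.0], [1.0, 1.0], steps=5)
for alpha, value in path:
    print(f"alpha={alpha:.2f} loss={value:.3f}")
```

With a real network, `theta_init` and `theta_final` would be the flattened weights before and after training, and a bump in the resulting curve would indicate an obstacle along the path.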
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
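The rank-based selection strategy described in the abstract above can be sketched as follows. Datapoints are ranked by their latest known loss (largest first) and the selection probability decays exponentially with rank. The `pressure` parameter, which sets the approximate ratio between the most and least likely datapoints, and both function names are assumptions made for this sketch, not details confirmed by the abstract.

```python
import math
import random


def selection_probabilities(losses, pressure=100.0):
    """Return per-datapoint probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the largest latest-known loss.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw a batch of datapoint indices (with replacement) using the
    rank-based probabilities."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


latest_losses = [0.1, 2.0, 0.5, 1.2]
probs = selection_probabilities(latest_losses)
batch = sample_batch(latest_losses, batch_size=3, rng=random.Random(0))
```

Setting `pressure` to 1 recovers uniform sampling, which gives one simple handle on how the selection pressure could be controlled over time.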
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing system. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
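The stochastic rounding mentioned above can be illustrated with a minimal sketch. The abstract does not spell out the scheme, so this assumes the usual definition: a value is rounded up to the next representable fixed-point number with probability equal to its fractional remainder, and down otherwise, which makes the rounding unbiased in expectation. The function name `stochastic_round` and the parameters `frac_bits` and `word_bits` are hypothetical names chosen for this illustration.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to signed fixed point with `frac_bits` fractional bits.

    Assumed scheme (not taken from the abstract): round up with probability
    equal to the fractional remainder, otherwise round down.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), so that the expected
    # value of the result equals x before saturation.
    q = lower + (1 if rng.random() < scaled - lower else 0)
    # Saturate to the range of a signed integer with `word_bits` bits.
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale
```

Averaging many calls on the same input should approach the input itself, whereas round-to-nearest would always return the same grid point and lose small updates entirely.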
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
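The rescaling invariance that motivates this geometry can be checked on a tiny example. The sketch below is not Path-SGD itself; it only demonstrates, for a hypothetical one-hidden-layer ReLU network, that multiplying the incoming weights of a hidden unit by a constant c > 0 and dividing its outgoing weight by c leaves the network output unchanged. All function names here are illustrative.

```python
def relu(v):
    return max(0.0, v)


def forward(x, w_in, w_out):
    """One hidden ReLU layer followed by a linear output unit."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return sum(v * h for v, h in zip(w_out, hidden))


def rescale_unit(w_in, w_out, unit, c):
    """Scale incoming weights of `unit` by c and its outgoing weight by 1/c.

    Requires c > 0, because relu(c * z) == c * relu(z) only for positive c.
    """
    new_in = [list(row) for row in w_in]
    new_in[unit] = [c * w for w in w_in[unit]]
    new_out = list(w_out)
    new_out[unit] = w_out[unit] / c
    return new_in, new_out


x = [1.0, -2.0]
w_in = [[0.5, -1.0], [1.5, 0.25]]
w_out = [2.0, -1.0]
w_in2, w_out2 = rescale_unit(w_in, w_out, unit=0, c=4.0)
original_output = forward(x, w_in, w_out)
rescaled_output = forward(x, w_in2, w_out2)
```

The two weight settings compute the same function, yet a plain gradient step treats them differently, which is the mismatch the abstract argues against.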
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
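As a small worked illustration of the kind of relationship described above, the sketch below uses the standard shape formulas for one spatial dimension: a convolution with input size i, kernel size k, stride s and zero padding p yields floor((i + 2p - k) / s) + 1 outputs, and the corresponding transposed convolution (without extra output padding) yields s(i - 1) + k - 2p. The function names are hypothetical, and the abstract itself does not state these formulas.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one spatial dimension."""
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the matching transposed convolution.

    Assumes no additional output padding, so when (i + 2p - k) is not a
    multiple of s this gives the smallest input size that maps to i.
    """
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide input with k=3, s=2, p=1 gives 3 outputs, and the transposed layer with the same settings maps those 3 positions back to 5.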
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
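Training against soft targets can be sketched as a cross-entropy between the teacher's output distribution and the student's. The snippet below is a minimal illustration under assumptions not stated in the abstract: both models expose raw logits, and an optional `temperature` parameter (borrowed from the knowledge distillation literature) softens the distributions. The names `softmax` and `soft_target_loss` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's."""
    targets = softmax(teacher_logits, temperature)   # soft targets
    probs = softmax(student_logits, temperature)     # student predictions
    return -sum(t * math.log(p) for t, p in zip(targets, probs))
```

The loss is smallest when the student reproduces the teacher's distribution, so the student is guided by the full set of class posteriors rather than a single hard label.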
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
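One way to read "combining of max and average pooling" is as a convex combination of the two results over a pooling region. The sketch below illustrates that reading with two hypothetical variants: a fixed or learned mixing proportion `alpha`, and a gated version where the proportion is a sigmoid of a weighted sum of the region's values. The exact parametrization is an assumption, not something the abstract specifies.

```python
import math


def mixed_pool(region, alpha):
    """alpha * max + (1 - alpha) * average over one pooling region."""
    mx = max(region)
    avg = sum(region) / len(region)
    return alpha * mx + (1.0 - alpha) * avg


def gated_pool(region, gate_weights):
    """Mixing proportion computed from the region itself (assumed form)."""
    score = sum(w * v for w, v in zip(gate_weights, region))
    alpha = 1.0 / (1.0 + math.exp(-score))  # sigmoid gate in (0, 1)
    return mixed_pool(region, alpha)
```

With `alpha = 1` this reduces to max pooling and with `alpha = 0` to average pooling, so both conventional operations are special cases; in training, `alpha` or `gate_weights` would be updated by gradient descent along with the other parameters.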
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. This explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing system. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
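The first direction in the abstract above, combining max and average pooling, can be illustrated with a minimal sketch. It assumes the simplest possible combination, a convex blend with a single mixing proportion `a`. The function name `mixed_pool` and the parameter `a` are hypothetical; the abstract mentions two combination strategies without specifying them here, and in a learned setting `a` would be trained rather than fixed.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` in [0, 1] is an assumed mixing proportion: a=1 gives max pooling,
    a=0 gives average pooling. Here it is a fixed number, not a learned one.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return a * max_part + (1.0 - a) * avg_part


region = [1.0, 3.0, 2.0, 6.0]
print(mixed_pool(region, 1.0))  # behaves as max pooling
print(mixed_pool(region, 0.0))  # behaves as average pooling
print(mixed_pool(region, 0.5))  # equal blend of the two
```

Because the output is differentiable with respect to `a`, a training procedure could in principle adjust it by gradient descent alongside the network weights.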
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
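The sampling step described above can be sketched for a single pooling region. This is a minimal illustration, assuming non-negative activations (for example, after a rectifier) and probabilities obtained by normalizing the activations within the region. The function name `stochastic_pool` and the all-zero fallback are assumptions, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Assumes non-negative activations. Each activation is picked with
    probability proportional to its value within the region.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # assumed fallback: every activation is zero
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded generator so runs are repeatable
region = [0.0, 1.0, 3.0]
samples = [stochastic_pool(region, rng) for _ in range(1000)]
# The activation 3.0 should be drawn roughly three times as often as 1.0.
print(samples.count(3.0), samples.count(1.0))
```

Larger activations are selected more often, but smaller non-zero ones still get a chance, which is the source of the regularizing noise.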
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
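The claim above that the $L_p$ unit generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below assumes one common reading of "normalized $L_p$ norm", namely $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$, and omits the learned projections mentioned in the abstract. The function name `lp_unit` is hypothetical.

```python
def lp_unit(xs, p):
    """Normalized L_p norm of the inputs: (mean(|x_i|^p))^(1/p).

    Assumed form for illustration; p must be at least 1.
    """
    if p < 1:
        raise ValueError("order p must be >= 1")
    n = len(xs)
    return (sum(abs(x) ** p for x in xs) / n) ** (1.0 / p)


xs = [3.0, -4.0]
print(lp_unit(xs, 1))    # mean of absolute values
print(lp_unit(xs, 2))    # root-mean-square
print(lp_unit(xs, 200))  # approaches max(|x_i|) as p grows
```

Under this reading, the order `p` interpolates smoothly between averaging and taking the maximum, which is why learning a separate order per unit changes the shape of the boundary each unit defines.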
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that compressive MBN combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making it both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g., MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
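The straight-path analysis mentioned in the abstract above can be illustrated with a minimal sketch: evaluate the loss at evenly spaced points on the line between the initial and the final parameters. The function name `interpolate_loss`, the flat-list parameter representation, and the toy quadratic loss are assumptions made for illustration; they are not taken from the paper.

```python
def interpolate_loss(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn along the straight line from theta_init to theta_final.

    Parameters are plain lists of floats (an assumption for this sketch).
    Returns a list of (alpha, loss) pairs with alpha going from 0 to 1.
    """
    results = []
    for j in range(num_points):
        alpha = j / (num_points - 1)
        # Convex combination of the two parameter vectors.
        theta = [(1 - alpha) * a + alpha * b
                 for a, b in zip(theta_init, theta_final)]
        results.append((alpha, loss_fn(theta)))
    return results


# Toy usage: a quadratic bowl with its minimum at (1, 1).
def toy_loss(theta):
    return sum((t - 1.0) ** 2 for t in theta)


curve = interpolate_loss(toy_loss, [0.0, 0.0], [1.0, 1.0])
```

Plotting the second element of each pair against `alpha` gives the one-dimensional loss profile; a bump in that curve would indicate an obstacle on the straight path.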
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
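The rank-based selection rule described in the abstract above can be sketched as follows: datapoints are sorted by their latest known loss, and the selection probability decays exponentially with rank. The names `rank_selection_probs`, `sample_batch`, and the `pressure` parameter (roughly the ratio between the probabilities of the highest-loss and lowest-loss points) are hypothetical choices for this sketch, not the paper's exact formulation.

```python
import math
import random


def rank_selection_probs(losses, pressure=100.0):
    """Selection probability per datapoint, decaying exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest-known loss.
    `pressure` is an assumed knob: the top-ranked point is roughly
    `pressure` times more likely to be picked than the bottom-ranked one.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(-math.log(pressure) / n)  # per-rank multiplicative decay
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop, the stored losses would be refreshed for the sampled indices after each step, so that the ranking tracks the "latest known" values.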
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
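As a rough illustration of the SVRG update analyzed in the abstract above, the sketch below implements the basic loop: take a snapshot of the parameters, compute the full gradient there, then run stochastic steps whose gradient estimate is corrected by the snapshot gradient. The function name `svrg`, its arguments, and the toy least-squares problem are assumptions for illustration. Note that the toy problem is convex and only shows the mechanics of the update; it does not reproduce the nonconvex setting or the rates of the paper.

```python
import random


def svrg(grad_i, n, w0, step=0.1, epochs=20, inner_steps=None, rng=random):
    """Basic SVRG loop (illustrative sketch).

    grad_i(w, i) returns the gradient (list of floats) of the i-th component
    function at w; n is the number of component functions.
    """
    w = list(w0)
    m = inner_steps or n
    for _ in range(epochs):
        snapshot = list(w)
        # Full gradient at the snapshot point.
        full = [0.0] * len(w)
        for i in range(n):
            g = grad_i(snapshot, i)
            full = [f + gj / n for f, gj in zip(full, g)]
        for _ in range(m):
            i = rng.randrange(n)
            g_w = grad_i(w, i)
            g_s = grad_i(snapshot, i)
            # Variance-reduced estimate: g_i(w) - g_i(snapshot) + full gradient.
            w = [wj - step * (a - b + f)
                 for wj, a, b, f in zip(w, g_w, g_s, full)]
    return w


# Toy usage: fit y = 2x with f_i(w) = 0.5 * (w * x_i - y_i)^2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]


def toy_grad(w, i):
    return [(w[0] * xs[i] - ys[i]) * xs[i]]


w_fit = svrg(toy_grad, len(xs), [0.0], step=0.02, epochs=30,
             rng=random.Random(0))
```

The correction term `a - b + f` keeps the estimate unbiased while its variance shrinks as `w` and the snapshot approach a stationary point, which is the property that SVRG analyses rely on.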
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model. The model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
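The stochastic rounding scheme that the abstract above identifies as crucial can be sketched in a few lines: a value is rounded up to the next fixed-point level with probability equal to its fractional remainder, so the rounding is unbiased in expectation. The function name `stochastic_round_fixed`, the 8 fractional bits, and the saturation behaviour are assumptions for this sketch; the paper's exact number format may differ.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value using stochastic rounding.

    frac_bits and word_bits are illustrative defaults (assumed, not from the paper).
    Returns the quantized value as a float.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    q = lower + (1 if rng.random() < (scaled - lower) else 0)
    # Saturate to the representable signed range.
    q_max = (1 << (word_bits - 1)) - 1
    q_min = -(1 << (word_bits - 1))
    q = max(q_min, min(q_max, q))
    return q / scale
```

Unlike round-to-nearest, this keeps small gradient updates from being systematically rounded away: an update smaller than half a quantization step still has a nonzero chance of changing the stored weight.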
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
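The relationship between input shape, kernel shape, zero padding, strides and output shape described in the abstract above can be illustrated with a small sketch. The function names and the single-axis simplification are assumptions made for illustration; the formulas are the standard ones for a convolution and for a transposed convolution without extra output padding, and are not quoted from the guide itself.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length along one axis of a convolution.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    Uses the standard relation o = floor((i + 2p - k) / s) + 1.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output length along one axis of a transposed convolution.

    Assumes no additional output padding: o = s * (i - 1) + k - 2p.
    """
    return s * (i - 1) + k - 2 * p


# A 5-wide input with a 3-wide kernel, padding 1 and stride 2 gives 3 outputs;
# the matching transposed convolution maps 3 back to 5.
small = conv_output_size(5, 3, p=1, s=2)
restored = transposed_conv_output_size(small, 3, p=1, s=2)
```

When the stride does not evenly divide `i + 2p - k`, several input sizes map to the same output size, so the transposed form recovers only the smallest of them.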
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
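The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the region is a flat list, and activations are assumed non-negative (for example, after a rectifier) so that they can be normalized into a multinomial distribution.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. according to a multinomial distribution given by the activities
    within the region. Assumes all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # Assumption: an all-zero region pools to zero.
        return 0.0
    probabilities = [a / total for a in region]
    return rng.choices(region, weights=probabilities, k=1)[0]


# Seeded generator so repeated runs draw the same sample.
rng = random.Random(0)
pooled = stochastic_pool([0.5, 1.5, 0.0, 2.0], rng)
```

Large activations are selected most often, but smaller non-zero ones are occasionally picked, which is the source of the regularizing noise during training.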
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
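The pooling interpretation in the abstract above can be checked numerically. The sketch below is an illustration only: it assumes the "normalized $L_p$ norm" means the p-th root of the mean of |v|^p, it omits the learned projections the abstract mentions, and the name `lp_pool` is hypothetical rather than taken from the paper.

```python
def lp_pool(values, p):
    """Normalized L_p norm of a list of numbers: (mean(|v|^p))^(1/p)."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


x = [1.0, -2.0, 3.0, -4.0]

mean_abs = lp_pool(x, 1)    # p = 1: average of absolute values
rms = lp_pool(x, 2)         # p = 2: root-mean-square
near_max = lp_pool(x, 200)  # large p: approaches max(|v|)

print(mean_abs, rms, near_max)
```

Under these assumptions, varying the single parameter p moves the unit between average-like and max-like pooling, which is the sense in which the order p controls the unit's behavior.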
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
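The masking idea in the abstract above can be illustrated for a single hidden layer. This is a minimal sketch under stated assumptions: each hidden unit is given an integer "degree", an input-to-hidden weight is kept when the hidden degree is at least the input index, and a hidden-to-output weight is kept when the output index is strictly greater than the hidden degree. The function names and the particular degrees are hypothetical choices for illustration, not the authors' code.

```python
def build_masks(num_inputs, hidden_degrees):
    """Return (in_mask, out_mask) as nested lists of 0/1 entries."""
    input_degrees = range(1, num_inputs + 1)
    # in_mask[k][d]: hidden unit k may read input d+1
    in_mask = [[1 if m >= d else 0 for d in input_degrees]
               for m in hidden_degrees]
    # out_mask[d][k]: output d+1 may read hidden unit k
    out_mask = [[1 if d > m else 0 for m in hidden_degrees]
                for d in input_degrees]
    return in_mask, out_mask


def connectivity(in_mask, out_mask):
    """conn[i][j] is 1 if output i+1 can depend on input j+1 via some hidden unit."""
    num_inputs = len(in_mask[0])
    return [[1 if any(o and h[j] for o, h in zip(row, in_mask)) else 0
             for j in range(num_inputs)]
            for row in out_mask]


in_mask, out_mask = build_masks(4, hidden_degrees=[1, 2, 3, 1, 2])
conn = connectivity(in_mask, out_mask)
for row in conn:
    print(row)
```

With these masks the connectivity matrix is strictly lower triangular, so each output sees only earlier inputs in the chosen ordering, which is what lets the outputs be read as conditional probabilities.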
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
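The "more than 13% relative improvement" quoted above follows directly from the two WER figures. The short check below uses only the numbers in the abstract; the variable names are illustrative.

```python
baseline_wer = 4.54  # baseline small DNN, WSJ eval92 (from the abstract)
student_wer = 3.93   # small DNN trained on soft RNN alignments (from the abstract)

# Relative improvement = absolute WER reduction divided by the baseline WER.
relative_improvement = (baseline_wer - student_wer) / baseline_wer

print(f"{relative_improvement:.1%}")
```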
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
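The abstract above does not spell out the compression technique. One ingredient commonly associated with knowledge distillation is a softmax with a temperature that softens the teacher's output distribution, with the student trained to match those soft targets. The sketch below is an assumption-laden illustration of that idea, not the authors' implementation; the logits, the temperature of 4.0, and the function names are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher_logits = [5.0, 2.0, 0.5]  # hypothetical teacher (ensemble) outputs
student_logits = [3.0, 2.5, 1.0]  # hypothetical student outputs

hard_targets = softmax(teacher_logits, temperature=1.0)
soft_targets = softmax(teacher_logits, temperature=4.0)

# A distillation-style loss: match the student to the softened teacher.
loss = kl_divergence(soft_targets, softmax(student_logits, temperature=4.0))
print(hard_targets, soft_targets, loss)
```

Raising the temperature spreads probability mass onto the non-maximal classes, so the student sees how the teacher ranks the wrong answers as well as which answer it prefers.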
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
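The rank-based strategy in the abstract above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the `selection_pressure` parameter (the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) and sampling with replacement are assumptions made here for concreteness.

```python
import math
import random

def rank_selection_probs(latest_losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    selection_pressure is a hypothetical knob: the ratio between the
    probability of the rank-0 (greatest loss) and last-ranked datapoint.
    """
    n = len(latest_losses)
    # Indices sorted by latest known loss, greatest loss first (rank 0).
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    decay = math.log(selection_pressure) / max(n - 1, 1)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(latest_losses, batch_size, rng=random):
    """Draw datapoint indices for one batch (with replacement, by assumption)."""
    probs = rank_selection_probs(latest_losses)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)
```

In a training loop, the stored loss of each selected datapoint would be refreshed after the forward pass, so the ranking tracks the latest known values.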
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, a relative improvement of more than 8% over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
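A minimal sketch of stochastic rounding onto a fixed-point grid, as described above. The function name, the `frac_bits` parameter and the absence of saturation to a 16-bit word are assumptions made for illustration; this is not the authors' implementation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result equals x in
    expectation. Overflow/saturation handling is deliberately omitted.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise round down.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale
```

Unlike round-to-nearest, small updates below half a grid step are not always lost: they survive with a probability proportional to their size.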
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
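The relationships between input shape, kernel shape, zero padding, strides and output shape can be sketched for one spatial dimension as follows. The formulas are the standard ones for convolution and for transposed convolution without extra output padding; the function and parameter names are hypothetical.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution (or pooling, with p=0).

    i: input length, k: kernel length, s: stride, p: zero padding per side.
    Uses o = floor((i + 2p - k) / s) + 1.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of a transposed convolution with no output padding.

    Uses o = s * (i - 1) + k - 2p, where k, s and p describe the
    convolution whose shape mapping is being reversed.
    """
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide kernel with unit stride and no padding maps a length-28 input to length 24, and the transposed layer with the same settings maps 24 back to 28. When the stride does not divide `i + 2p - k`, several input sizes share one output size, so the transposed layer recovers only the smallest of them.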
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
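One simple way of combining max and average pooling is a convex mixture of the two, sketched below for a single pooling region. The mixing weight `alpha` and the function name are assumptions; in a learned variant `alpha` would be a trainable parameter, and the abstract does not specify the exact form used.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty sequence of activations in the pooling window.
    alpha:  mixing weight in [0, 1]; 1.0 gives pure max pooling,
            0.0 gives pure average pooling.
    """
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value
```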
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
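The stochastic procedure described above can be sketched for a single pooling region. The sketch assumes non-negative activations (for example, after a rectifier) so that normalised activities form valid multinomial probabilities; the function name and the all-zero fallback are illustrative choices.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation a_i is picked with probability a_i / sum(region).
    Assumes all activations are non-negative. If every activation is
    zero, the pooled value is zero.
    """
    total = sum(region)
    if total == 0:
        return 0.0
    # random.choices normalises the weights internally.
    return rng.choices(region, weights=region, k=1)[0]
```

Large activations are still favoured, as in max pooling, but smaller ones are occasionally passed through, which is what provides the regularizing noise.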
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding an effective and efficient method for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence of acoustic signals, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
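The stochastic rounding mentioned above can be illustrated with a minimal sketch. The function name, the default split of 8 fractional bits within a 16-bit word, and the saturation behaviour are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional distance
    from the lower grid point, so the expected result equals x (before
    saturation). The result is saturated to the word_bits-wide signed range.
    """
    scale = 2 ** frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # probability of rounding up, in [0, 1)
    q = lower + (1 if rng.random() < p_up else 0)
    # Saturate to the representable signed integer range (an assumption here).
    q_max = 2 ** (word_bits - 1) - 1
    q_min = -(2 ** (word_bits - 1))
    q = max(q_min, min(q_max, q))
    return q / scale
```

Unlike round-to-nearest, this scheme does not systematically discard updates smaller than half a grid step: averaged over many roundings, the small update is preserved in expectation.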
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
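The relationship between input shape, kernel shape, zero padding, strides and output shape can be sketched for a single spatial axis as follows. The function names are hypothetical, and the formulas are the standard ones for square, undilated kernels; the transposed-convolution formula assumes the stride evenly divides the padded input extent of the corresponding forward convolution.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution or pooling layer along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution that inverts the shape
    change of a convolution with the same k, s and p (assuming that
    convolution's i + 2p - k is a multiple of s).
    """
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives `conv_output_size(5, 3, 2, 1) == 3`, and `transposed_conv_output_size(3, 3, 2, 1) == 5` recovers the original width.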
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
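One way to read "combining of max and average pooling" is as a convex mixture of the two. The sketch below is only an illustration under that assumption: the function names are hypothetical, the mixing weight `a` is a fixed argument here whereas the abstract describes learning the pooling function, and pooling is applied over non-overlapping 1-D windows for brevity.

```python
def mixed_pool(region, a):
    """Return a * max(region) + (1 - a) * mean(region) for 0 <= a <= 1."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing weight a must lie in [0, 1]")
    return a * max(region) + (1.0 - a) * (sum(region) / len(region))


def mixed_pool_1d(signal, size, a):
    """Apply mixed pooling over non-overlapping windows of length `size`.

    Trailing elements that do not fill a whole window are dropped.
    """
    return [
        mixed_pool(signal[start:start + size], a)
        for start in range(0, len(signal) - size + 1, size)
    ]
```

Setting `a = 1` recovers max pooling and `a = 0` recovers average pooling, so both conventional operations are special cases of the mixture.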
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
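The sampling step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes non-negative activations (for example after a ReLU), non-overlapping square pooling regions, probabilities obtained by dividing each activation by the region sum, and an output of 0 for an all-zero region. The function names are hypothetical.

```python
import numpy as np

def stochastic_pool_region(activations, rng):
    """Sample one activation from a region, with probability proportional to its value."""
    a = np.asarray(activations, dtype=float).ravel()
    if np.any(a < 0):
        raise ValueError("this sketch assumes non-negative activations")
    total = a.sum()
    if total == 0.0:
        # Assumption: an all-zero region pools to zero.
        return 0.0
    probs = a / total                      # multinomial probabilities for the region
    idx = rng.choice(a.size, p=probs)      # draw one location
    return float(a[idx])

def stochastic_pool_2d(feature_map, size, rng):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    h, w = feature_map.shape
    out = np.empty((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            region = feature_map[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = stochastic_pool_region(region, rng)
    return out

rng = np.random.default_rng(0)
fm = np.array([[0.0, 0.0, 1.0, 2.0],
               [0.0, 3.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 5.0, 0.0]])
pooled = stochastic_pool_2d(fm, 2, rng)
```

Regions with a single non-zero activation always return that activation; only the top-right region here is actually random.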
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
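The pooling interpretation in the abstract above can be illustrated with a small sketch. It assumes the normalized $L_p$ norm takes the form (mean of |x_i|^p)^(1/p) and that the inputs are already the projected signals; the learned projections and any learned per-unit order $p$ mentioned in the abstract are left out. The name `lp_pool` is hypothetical.

```python
import numpy as np

def lp_pool(x, p):
    """Normalized L_p norm of a group of inputs: (mean(|x|^p))^(1/p)."""
    x = np.abs(np.asarray(x, dtype=float))
    if np.isinf(p):
        # Limit p -> infinity: the largest absolute value (max pooling).
        return float(x.max())
    return float(np.mean(x ** p) ** (1.0 / p))

group = [3.0, 4.0]
avg_like = lp_pool(group, 1)        # p = 1: average of absolute values
rms_like = lp_pool(group, 2)        # p = 2: root-mean-square
near_max = lp_pool(group, 200)      # large p: approaches the maximum
exact_max = lp_pool(group, np.inf)  # limiting case handled explicitly
```

Under this assumed form, varying $p$ moves the unit smoothly between average-like and max-like behavior.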
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
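The masking idea in the abstract above can be sketched for a single hidden layer. This is an illustrative construction under stated assumptions, not necessarily the authors' exact scheme: inputs are numbered 1..D in their natural order, each hidden unit is given a random integer degree between 1 and D-1, and a connection is kept only if it cannot carry information from an input to an output at the same or an earlier position. The name `made_masks` is hypothetical.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, rng):
    """Build binary masks for input->hidden and hidden->output weights."""
    in_deg = np.arange(1, n_inputs + 1)                  # degree of input/output d is d
    hid_deg = rng.integers(1, n_inputs, size=n_hidden)   # degrees in [1, n_inputs - 1]
    # Hidden unit k may see input d only if deg(k) >= d.
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(float)   # shape (H, D)
    # Output d may see hidden unit k only if d > deg(k).
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(float)   # shape (D, H)
    return mask_in, mask_out

mask_in, mask_out = made_masks(5, 16, np.random.default_rng(0))
# Entry [d, d'] counts the masked paths from input d' to output d.
connectivity = mask_out @ mask_in
```

If the masks respect the autoregressive constraint, `connectivity` is strictly lower triangular: output d depends only on inputs that come before it in the ordering.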
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g., MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
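The Gramian angular encoding named in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming the commonly cited definitions (rescale the series, map each value to an angle with arccos, then take cos of angle sums for GASF and sin of angle differences for GADF); the function names are hypothetical and the MTF encoding is not covered.

```python
import math


def rescale(series, lo=0.0, hi=1.0):
    """Min-max rescale a series into [lo, hi]."""
    mn, mx = min(series), max(series)
    if mx == mn:
        raise ValueError("series must contain at least two distinct values")
    return [lo + (hi - lo) * (x - mn) / (mx - mn) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 0/1 rescaled series.

    Assumed definitions: phi_i = arccos(x_i),
    GASF[i][j] = cos(phi_i + phi_j), GADF[i][j] = sin(phi_i - phi_j).
    """
    scaled = rescale(series)
    # Clamp to [0, 1] so floating-point rounding cannot push acos out of domain.
    phi = [math.acos(min(1.0, max(0.0, x))) for x in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


def series_from_gasf_diagonal(gasf):
    """Recover the 0/1 rescaled series from the GASF diagonal.

    The diagonal equals cos(2 * phi_i) = 2 * x_i**2 - 1, which is invertible
    for x_i in [0, 1].
    """
    return [math.sqrt((row[i] + 1.0) / 2.0) for i, row in enumerate(gasf)]
```

The last helper is one way to see why a 0/1 rescaled series can be read back from its GASF image, which is the property the imputation experiments rely on.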
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the multilayer bootstrap network into a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
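The rank-based selection strategy described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `pressure` parameter (roughly the ratio between the selection probabilities of the highest- and lowest-ranked datapoints), the function names, and sampling with replacement are all assumptions made for the example.

```python
import math
import random


def selection_probabilities(losses, pressure=100.0):
    """Return one selection probability per datapoint.

    Datapoints are ranked by latest known loss (rank 0 = largest loss) and
    the probability decays exponentially with rank. `pressure` is an assumed
    knob controlling how steep that decay is.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices for one batch (with replacement, by assumption)."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop, the `losses` list would be refreshed with each datapoint's most recent loss after every batch, so that the ranking tracks the current state of the model.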
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing system. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
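The SVRG update analyzed in the abstract above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `svrg`, its parameters, and the default step size are hypothetical, and the caller supplies the per-example gradient `grad_i(w, i)`.

```python
import numpy as np

def svrg(grad_i, w0, n, step=0.01, epochs=30, inner_steps=None, seed=0):
    """Minimal SVRG sketch for minimizing (1/n) * sum_i f_i(w).

    grad_i(w, i) must return the gradient of the i-th component f_i at w.
    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    m = n if inner_steps is None else inner_steps
    w_snap = np.array(w0, dtype=float)
    for _ in range(epochs):
        # Full gradient at the snapshot point, computed once per epoch.
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap.copy()
        for _ in range(m):
            i = int(rng.integers(n))
            # Variance-reduced gradient: unbiased, and its variance shrinks
            # as w and w_snap approach a stationary point.
            v = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= step * v
        w_snap = w  # the last inner iterate becomes the next snapshot
    return w_snap
```

The sketch makes no convexity assumption in the update itself. An easy sanity check is a small least-squares problem, which is convex and so only exercises the mechanics, not the nonconvex guarantees the abstract refers to.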
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
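Stochastic rounding, as referred to in the abstract above, rounds a value up or down with probability proportional to its distance from each neighbor, so the result is unbiased in expectation. The sketch below is a hedged illustration only: the function name `stochastic_round`, the split into `word_bits` and `frac_bits`, and the saturation behavior are assumptions, not details taken from the paper.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, word_bits=16, rng=None):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    frac_bits: number of fractional bits (grid spacing is 2**-frac_bits).
    word_bits: total word length; values outside the range saturate.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so that the expected value of the result equals the input.
    prob_up = scaled - floor
    rounded = floor + (rng.random(scaled.shape) < prob_up)
    # Saturate to the representable signed integer range.
    lo, hi = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    return np.clip(rounded, lo, hi) / scale
```

Round-to-nearest would map every copy of 0.3 to the same grid point, whereas this scheme preserves small values on average, which is the property the abstract identifies as important for training.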
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
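The relationship between input size, kernel size, zero padding, stride and output size mentioned in the abstract above can be sketched for a single spatial axis. The formulas are the commonly used ones for convolution arithmetic rather than text quoted from the abstract, and the function names are hypothetical.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    Uses o = floor((i + 2p - k) / s) + 1.
    """
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, p=0, s=1):
    """Output length of the transposed convolution along one axis.

    Uses o = s * (i - 1) + k - 2p, i.e. the input size of a convolution
    with the same k, p, s whose output size is i, assuming (o + 2p - k)
    is divisible by s so that no input positions are dropped.
    """
    return s * (i - 1) + k - 2 * p

print(conv_output_size(5, k=3, p=1, s=2))             # 3
print(transposed_conv_output_size(3, k=3, p=1, s=2))  # 5
```

The second call undoes the shape change of the first, which illustrates the sense in which a transposed convolution reverses the shape mapping of the corresponding convolution.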
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
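As a minimal sketch of the masking idea described above, the following pure-Python snippet builds binary masks for a single-hidden-layer autoencoder and checks which inputs each output can see. The degree-based construction, the natural input ordering, and the names `build_masks` and `connectivity` are assumptions made for illustration; the abstract does not spell out how the masks are built.

```python
import random


def build_masks(num_inputs, num_hidden, seed=0):
    """Build masks so that output d depends only on inputs 1..d-1.

    Assumed construction: each hidden unit gets a random "degree" in
    [1, num_inputs - 1]; a hidden unit may read inputs up to its degree,
    and output d may read hidden units whose degree is below d.
    Requires num_inputs >= 2.
    """
    rng = random.Random(seed)
    hidden_degree = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # input_to_hidden[k][i] == 1 if hidden unit k may read input i + 1
    input_to_hidden = [
        [1 if hidden_degree[k] >= d else 0 for d in range(1, num_inputs + 1)]
        for k in range(num_hidden)
    ]
    # hidden_to_output[o][k] == 1 if output o + 1 may read hidden unit k
    hidden_to_output = [
        [1 if d > hidden_degree[k] else 0 for k in range(num_hidden)]
        for d in range(1, num_inputs + 1)
    ]
    return input_to_hidden, hidden_to_output


def connectivity(input_to_hidden, hidden_to_output):
    """Count the masked paths from every input (column) to every output (row)."""
    num_inputs = len(hidden_to_output)
    num_hidden = len(input_to_hidden)
    return [
        [
            sum(hidden_to_output[o][k] * input_to_hidden[k][i]
                for k in range(num_hidden))
            for i in range(num_inputs)
        ]
        for o in range(num_inputs)
    ]


masks = build_masks(num_inputs=4, num_hidden=6)
paths = connectivity(*masks)
```

Under this construction every entry of `paths` on or above the diagonal is zero, so output d has no path from input d or from any later input. In a real network the masks would be multiplied elementwise into the weight matrices before each forward pass.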
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
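The training signal mentioned above can be illustrated with a short sketch of the Kullback-Leibler divergence between a teacher distribution and a student distribution over the same output states. The direction KL(teacher || student), the function name `kl_divergence`, and the example probabilities are assumptions for illustration; the abstract does not give these details.

```python
import math


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """Return KL(teacher || student) for two discrete distributions.

    Terms where the teacher assigns zero probability contribute nothing;
    student probabilities are floored at `eps` to avoid division by zero.
    """
    total = 0.0
    for p, q in zip(teacher_probs, student_probs):
        if p > 0.0:
            total += p * math.log(p / max(q, eps))
    return total


# Hypothetical per-frame posteriors over three states.
teacher = [0.7, 0.2, 0.1]   # soft alignment from the large model
student = [0.5, 0.3, 0.2]   # current output of the small model
frame_loss = kl_divergence(teacher, student)
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge, so minimizing it over all frames pulls the small model toward the teacher's soft targets.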
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art results on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
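To make the encoding step concrete, here is a minimal pure-Python sketch of the two Gramian Angular Fields. It assumes the usual formulation: rescale the series, map each value to an angle with arccos, then take cos(phi_i + phi_j) for the summation field and sin(phi_i - phi_j) for the difference field. The rescaling to [-1, 1], the function names, and the sample series are assumptions for illustration, not details given in the abstract.

```python
import math


def rescale(series):
    """Linearly rescale a series to [-1, 1] so that arccos is defined."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    # Clamp to guard against tiny floating-point overshoot.
    return [max(-1.0, min(1.0, (2.0 * x - hi - lo) / (hi - lo))) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists of shape len(series) x len(series)."""
    angles = [math.acos(x) for x in rescale(series)]
    gasf = [[math.cos(a + b) for b in angles] for a in angles]
    gadf = [[math.sin(a - b) for b in angles] for a in angles]
    return gasf, gadf


# Hypothetical toy series; each field is a 4 x 4 "image".
gasf, gadf = gramian_angular_fields([0.0, 1.0, 3.0, 2.0])
```

The summation field is symmetric and its diagonal carries the rescaled values (via cos(2 * phi_i)), while the difference field is antisymmetric with a zero diagonal. Either matrix can then be fed to an image model as a single-channel input.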
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
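The gradient reversal layer mentioned above can be sketched in a few lines without any deep learning framework: it acts as the identity in the forward pass and flips the sign of the gradient in the backward pass. The class name `GradientReversal`, the optional `scale` factor, and the list-based interface are assumptions for illustration; a real implementation would hook into a framework's autodiff machinery.

```python
class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass."""

    def __init__(self, scale=1.0):
        # `scale` is a hypothetical knob controlling how strongly the
        # reversed gradient pushes on the feature extractor.
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, upstream_grad):
        # The domain classifier's gradient is negated (and scaled) before
        # it reaches the feature extractor.
        return [-self.scale * g for g in upstream_grad]


layer = GradientReversal(scale=2.0)
features_out = layer.forward([1.0, -2.0])      # same values as the input
grad_to_features = layer.backward([0.5, -1.0])  # each gradient negated and doubled
```

Because the feature extractor receives the negated gradient of the domain classifier's loss, ordinary gradient descent makes the features less useful for telling the domains apart while the domain classifier itself is still trained normally.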
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model. The model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
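The sampling step described in the abstract above can be illustrated with a minimal sketch. It assumes non-negative activations (for example, after a rectifier) and a flat list standing in for one pooling region; the function name `stochastic_pool` is hypothetical and is not taken from the paper.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region.

    Each activation a_i is selected with probability a_i / sum(a),
    i.e. a multinomial distribution given by the activities in the region.
    Assumption: activations are non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total == 0:
        return 0.0  # all-zero region: there is nothing to sample from
    # random.choices draws one element using the activations as weights.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    # Larger activations are picked more often, smaller ones occasionally.
    print([stochastic_pool([1.0, 3.0, 0.0, 6.0]) for _ in range(5)])
```

Only the training-time sampling is sketched here; how the region is handled at test time is not stated in the abstract and is left out.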
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
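The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `decay_ratio` parameter (roughly the ratio between the selection probability of the highest-loss and lowest-loss datapoints), and the choice to sample with replacement are all assumptions made for the example.

```python
import math
import random


def rank_selection_probs(latest_losses, decay_ratio=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the largest latest known loss.
    decay_ratio is a hypothetical knob, not a name from the paper.
    """
    n = len(latest_losses)
    # Indices sorted from largest to smallest latest known loss.
    order = sorted(range(n), key=lambda i: -latest_losses[i])
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        # Unnormalized weight proportional to exp(-log(decay_ratio) * rank / n).
        weights[idx] = math.exp(-math.log(decay_ratio) * rank / n)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, rng, decay_ratio=100.0):
    """Draw a batch of indices (with replacement, an assumption of this sketch)."""
    probs = rank_selection_probs(latest_losses, decay_ratio)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


if __name__ == "__main__":
    losses = [0.1, 2.0, 0.5, 1.0]
    print(rank_selection_probs(losses))
    print(sample_batch(losses, batch_size=2, rng=random.Random(0)))
```

Because the probabilities depend only on rank and not on the loss values themselves, a single outlier with a very large loss cannot dominate the batches; how strongly high-loss points are favoured is controlled entirely by the decay parameter.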
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
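Stochastic rounding, which the abstract above identifies as crucial for 16-bit fixed-point training, can be illustrated with a short sketch. The split into 8 fractional bits within a 16-bit signed word, the saturation behaviour, and the function name are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point grid using stochastic rounding.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the result is unbiased in expectation
    (as long as no saturation occurs). frac_bits and word_bits are
    illustrative defaults, not values from the paper.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise round down.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed range of the word.
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    q = max(lo, min(hi, lower))
    return q / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed(0.3, rng=rng) for _ in range(20000)]
    # Each sample is 76/256 or 77/256; their average approaches 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map every small update below half a grid step to zero, whereas stochastic rounding preserves such updates on average, which is one intuition for why the rounding scheme matters at low precision.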
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further reduce the size of the multilayer bootstrap network by using a neural network trained in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
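The straight-path analysis described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: it evaluates a loss at evenly spaced points on the line segment between an initial and a final parameter vector. The names `loss_along_line` and `toy_loss` are hypothetical, and `toy_loss` is a simple quadratic standing in for a real network's training loss.

```python
def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points between two parameter vectors.

    Returns a list of (alpha, loss) pairs, where alpha = 0 corresponds to
    theta_init and alpha = 1 corresponds to theta_final.
    """
    curve = []
    for k in range(num_points):
        alpha = k / (num_points - 1)
        # Linear interpolation: theta(alpha) = (1 - alpha) * init + alpha * final.
        theta = [(1 - alpha) * a + alpha * b
                 for a, b in zip(theta_init, theta_final)]
        curve.append((alpha, loss_fn(theta)))
    return curve


def toy_loss(theta):
    # Hypothetical stand-in for a training loss: quadratic with minimum at 1.
    return sum((t - 1.0) ** 2 for t in theta)


if __name__ == "__main__":
    for alpha, loss in loss_along_line(toy_loss, [0.0, 0.0], [1.0, 1.0], 5):
        print(f"alpha={alpha:.2f} loss={loss:.3f}")
```

A curve that decreases smoothly from alpha = 0 to alpha = 1 would be consistent with the absence of obstacles on the straight path; a bump in the curve would indicate a barrier between the two points.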
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
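The rank-based selection rule described in the abstract above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the `pressure` parameter (the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) are hypothetical, and batches are drawn with replacement for simplicity.

```python
import random


def rank_selection_probs(losses, pressure=100.0):
    """Return one selection probability per datapoint.

    Datapoints are ranked by their latest known loss (largest loss gets
    rank 0), and the probability decays exponentially with rank.
    `pressure` is an assumed knob: prob(rank 0) / prob(last rank).
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank decay factor so that weight(0) / weight(n - 1) == pressure.
    decay = pressure ** (-1.0 / max(n - 1, 1))
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) using the rank-based rule."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


if __name__ == "__main__":
    latest_losses = [0.1, 2.0, 0.5, 1.2]
    print(rank_selection_probs(latest_losses, pressure=8.0))
    print(sample_batch(latest_losses, batch_size=3, rng=random.Random(0)))
```

In a training loop, the stored loss of each selected datapoint would be refreshed after its forward pass, so the ranking tracks the latest known values.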
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing system. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
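For readers unfamiliar with SVRG, the basic update analyzed in the abstract above can be sketched as follows. This is a minimal scalar sketch of the standard SVRG loop (periodic full-gradient snapshot plus variance-reduced stochastic steps), not the paper's nonconvex variant or its step-size schedule. The function name `svrg`, its parameters, and the toy objective are assumptions for illustration; the toy problem is convex purely to keep the example short.

```python
import random


def svrg(grad_i, n, w0, step=0.1, epochs=20, inner_steps=None, rng=random):
    """Minimize (1/n) * sum_i f_i(w) for a scalar w using SVRG.

    grad_i(w, i) must return the gradient of the i-th component f_i at w.
    """
    w = w0
    m = inner_steps or n
    for _ in range(epochs):
        # Take a snapshot and compute the full gradient there once per epoch.
        snapshot = w
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(m):
            i = rng.randrange(n)
            # Variance-reduced estimate: unbiased for the full gradient at w.
            g = grad_i(w, i) - grad_i(snapshot, i) + full_grad
            w = w - step * g
    return w


if __name__ == "__main__":
    # Toy finite sum (an assumption): f_i(w) = 0.5 * (w - a_i) ** 2,
    # whose minimizer is the mean of the a_i.
    targets = [1.0, 2.0, 3.0, 4.0]
    w_star = svrg(lambda w, i: w - targets[i], len(targets), w0=0.0,
                  rng=random.Random(0))
    print(w_star)
```

On this toy objective the correction term cancels the sampling noise exactly, which makes the variance-reduction idea easy to see; on a real nonconvex problem the cancellation is only partial.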
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation effort but also do not map well onto parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
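As an illustration of the encoding named in the abstract above, here is a minimal sketch of Gramian Angular Summation/Difference Fields, assuming the commonly used definition: rescale the series to [-1, 1], map each value to an angle with arccos, then take cos of pairwise angle sums (GASF) and sin of pairwise angle differences (GADF). The function names `rescale` and `gramian_angular_fields` are hypothetical and not taken from the paper; the Markov Transition Field and the tiled CNN are not shown.

```python
import math


def rescale(series, low=-1.0, high=1.0):
    """Linearly rescale a series into [low, high] (assumed preprocessing step)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    return [low + (high - low) * (x - lo) / (hi - lo) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D series."""
    scaled = rescale(series)
    # Clip to [-1, 1] so rounding error cannot push acos out of its domain.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


if __name__ == "__main__":
    gasf, gadf = gramian_angular_fields([0.0, 1.0, 2.0])
    for row in gasf:
        print([round(v, 3) for v in row])
```

The GASF diagonal equals 2x^2 - 1 for each rescaled value x, which is why the original values can be recovered from the diagonal when the data are rescaled to [0, 1].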
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
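The gradient reversal layer mentioned in the abstract above can be sketched in a framework-free way: it behaves as the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass, so the feature extractor is pushed to make the domains harder to tell apart. The class name `GradientReversal` and the scaling parameter `lam` are assumptions for illustration; a real implementation would hook into a framework's autodiff rather than use explicit `forward`/`backward` methods.

```python
class GradientReversal:
    """Identity on the forward pass, sign-flipped (and scaled) gradient on the backward pass."""

    def __init__(self, lam=1.0):
        # lam (hypothetical name) trades off the domain loss against the task loss.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is reversed.
        return [-self.lam * g for g in grad_output]


if __name__ == "__main__":
    layer = GradientReversal(lam=0.5)
    print(layer.forward([1.0, -2.0]))
    print(layer.backward([0.2, -0.4]))
```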
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
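The selection rule described in the abstract above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) might look roughly like the sketch below. The function names and the `pressure` parameter, taken here to be the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints, are assumptions for illustration and not necessarily the paper's exact parametrization.

```python
import math
import random


def rank_selection_probs(losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank."""
    n = len(losses)
    # Rank 0 is the datapoint with the largest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    rate = math.log(pressure) / max(n - 1, 1)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-rate * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw a batch of indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


if __name__ == "__main__":
    latest_losses = [0.1, 2.0, 0.5]
    print(rank_selection_probs(latest_losses, pressure=4.0))
    print(sample_batch(latest_losses, batch_size=5, pressure=4.0))
```

Setting `pressure=1.0` recovers uniform sampling, which makes it easy to compare against ordinary mini-batch selection.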
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
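For readers unfamiliar with the update analyzed in the abstract above, the following is a minimal sketch of the standard SVRG loop for a scalar parameter: each epoch computes a full gradient at a snapshot, then takes inner steps using the variance-reduced direction grad_i(x) - grad_i(snapshot) + full_gradient. The function signature, step size, and the toy quadratic components are assumptions chosen only to exercise the update; the toy problem is convex and does not reproduce the nonconvex setting or the rates studied in the paper.

```python
import random


def svrg(grad_i, n, x0, step, epochs, inner_steps, rng):
    """Minimize (1/n) * sum_i f_i(x) for scalar x, given per-component gradients grad_i(i, x)."""
    x_snap = x0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(i, x_snap) for i in range(n)) / n
        x = x_snap
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient estimate.
            v = grad_i(i, x) - grad_i(i, x_snap) + full_grad
            x -= step * v
        x_snap = x
    return x_snap


if __name__ == "__main__":
    targets = [1.0, 2.0, 3.0, 6.0]  # f_i(x) = 0.5 * (x - targets[i]) ** 2

    def grad(i, x):
        return x - targets[i]

    x_hat = svrg(grad, n=len(targets), x0=0.0, step=0.1,
                 epochs=5, inner_steps=20, rng=random.Random(0))
    print(x_hat)  # close to the mean of targets, 3.0
```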
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
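The stochastic rounding scheme mentioned above can be illustrated with a minimal sketch. It assumes a fixed-point grid with a hypothetical number of fractional bits (`frac_bits`) and omits saturation to a finite word length; the function name and parameters are illustrative and are not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the rounding is unbiased
    in expectation. Saturation to a finite word length is omitted.
    """
    scale = 1 << frac_bits           # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)       # nearest grid point below x
    prob_up = scaled - lower         # fractional part, in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale
```

Values that already lie on the grid are returned unchanged, whereas round-to-nearest would map every small update below half a grid step to zero; stochastic rounding preserves such updates on average.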
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
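The relationship between input shape, kernel shape, zero padding, strides and output shape can be illustrated for a single spatial axis. The sketch below uses the standard output-size formulas for a convolution and its transposed counterpart; the function names are illustrative, and the transposed formula assumes no extra output padding, that is, that `i + 2p - k` is divisible by `s` for the forward convolution being inverted.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution along one axis.

    The arguments describe the associated forward convolution. This
    assumes no extra output padding (stride divides i_fwd + 2p - k).
    """
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives an output of width 3, and the transposed convolution with the same kernel, stride and padding maps width 3 back to width 5.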
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
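One way of combining max and average pooling, as described above, is a convex mixture of the two. The sketch below applies such a mixture to a single pooling region; it assumes a fixed mixing proportion `a` in [0, 1], whereas in a learned variant this proportion would be a trainable parameter. The function name and signature are illustrative.

```python
def mixed_pool(region, a):
    """Convex combination of max and average pooling over one region.

    region: non-empty sequence of activations in the pooling window.
    a: mixing proportion in [0, 1]; a=1 gives max pooling and
       a=0 gives average pooling. Treated here as a fixed value.
    """
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return a * max_value + (1 - a) * avg_value
```

For the region `[1, 2, 3, 6]` the maximum is 6 and the average is 3, so `a = 0.5` gives 4.5.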
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
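The gradient reversal layer named in the abstract above can be illustrated with a framework-free sketch: the layer passes features through unchanged on the forward pass and negates the gradient on the backward pass. The class name, the method names, and the strength parameter `lam` are hypothetical choices for illustration and are not taken from the abstract.

```python
class GradientReversal:
    """Identity on the forward pass; flips the gradient sign on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical reversal strength, not from the abstract

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # The feature extractor receives the negated (and scaled) gradient,
        # so it is pushed toward features the domain classifier cannot use.
        return [-self.lam * g for g in grad_from_domain_classifier]


layer = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
print(layer.forward(features))
print(layer.backward([1.0, -2.0]))
```

In a real deep learning package the same behaviour would normally be expressed as a custom operation with an overridden gradient; the plain-Python version only shows the two passes side by side.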
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
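The rank-based selection strategy described in the abstract above can be sketched in a few lines. This is an illustrative reading rather than the authors' exact formulation: the function names and the `pressure` parameter (roughly the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) are assumptions, and sampling here is done with replacement for simplicity.

```python
import math
import random


def selection_probabilities(latest_losses, pressure=100.0):
    """Probability per datapoint, decaying exponentially with loss rank."""
    n = len(latest_losses)
    # Rank 0 is the datapoint with the largest latest known loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    decay = math.exp(-math.log(pressure) / n)  # assumed decay rule
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, rng, pressure=100.0):
    """Draw datapoint indices (with replacement) according to rank-based probabilities."""
    probs = selection_probabilities(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


losses = [0.1, 2.0, 0.5, 1.2]
probs = selection_probabilities(losses)
batch = sample_batch(losses, batch_size=3, rng=random.Random(0))
print(probs)
print(batch)
```

Setting `pressure` close to 1 makes the selection nearly uniform, while larger values concentrate batches on the datapoints with the greatest latest known loss.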
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
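To make the SVRG update in the abstract above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' algorithm as analysed: the function name `svrg`, its parameters, and the toy data `b` are all hypothetical, the parameter is a single scalar, and the last inner iterate is used as the next snapshot (a common practical variant). The toy objective is convex and was chosen only because its minimiser is easy to check by hand; the abstract itself concerns nonconvex problems.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, rng):
    """Minimise (1/n) * sum_i f_i(w) for scalar w, given per-component gradients.

    grad_i(w, i) must return the derivative of f_i at w.
    """
    w_snap = w0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced estimate: unbiased for the full gradient at w,
            # and exactly equal to full_grad when w == w_snap.
            v = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= step * v
        # Assumption: take the last inner iterate as the new snapshot.
        w_snap = w
    return w_snap


# Hypothetical toy finite sum: f_i(w) = 0.5 * (w - b[i])**2, minimised at mean(b).
b = [0.0, 2.0]


def grad_i(w, i):
    return w - b[i]


w_hat = svrg(grad_i, len(b), w0=5.0, step=0.5, epochs=5, inner_steps=10,
             rng=random.Random(0))
```

In this toy problem `w_hat` ends up very close to 1.0, the mean of `b`. Note also that when the snapshot already sits at the minimiser the full gradient vanishes and the correction term cancels the sampled gradient, so the iterate does not move, which is the sense in which the stochastic noise is removed.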
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speedup, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speedup is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
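The stochastic rounding scheme mentioned in the abstract above can be sketched in a few lines of plain Python. This is a hedged illustration only: the function name `stochastic_round` and the parameter `frac_bits` are hypothetical, and saturation to a fixed word length (which a real 16-bit fixed-point format would need) is deliberately left out.

```python
import math
import random


def stochastic_round(x, frac_bits, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder on the fixed-point grid, so the result is unbiased:
    the expected rounded value equals x. Saturation is not modelled.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # scaled - lower lies in [0, 1); round up with that probability.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


rng = random.Random(0)
samples = [stochastic_round(0.3, 0, rng) for _ in range(10000)]
mean = sum(samples) / len(samples)
```

With zero fractional bits every sample is either 0.0 or 1.0, yet the sample mean stays near 0.3. Round-to-nearest would instead map 0.3 to 0.0 every time, which suggests why small updates can survive low-precision arithmetic under stochastic rounding.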
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
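The relationships named in the abstract above (input shape, kernel shape, zero padding, strides and output shape) reduce, along a single axis, to two standard formulas. The sketch below states them in plain Python; the function names and the symbols `i`, `k`, `s`, `p` are chosen here for illustration and are not taken from the abstract. The transposed case assumes that the stride divides `i + 2p - k` in the corresponding forward convolution, so that the output size is unambiguous.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length along one axis of a convolution.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution associated with a
    forward convolution using kernel k, stride s and padding p.

    Assumes the forward convolution's (input + 2p - k) is divisible by s.
    """
    return s * (i - 1) + k - 2 * p


# A 5-wide input, 3-wide kernel, stride 2, padding 1 gives 3 outputs,
# and the associated transposed convolution maps 3 back to 5.
forward = conv_output_size(5, 3, s=2, p=1)
backward = transposed_conv_output_size(forward, 3, s=2, p=1)
```

The same helper covers pooling layers, since pooling shares the sliding-window arithmetic: a 2-wide window with stride 2 and no padding halves an even-sized input.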
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
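The first direction in the abstract above, combining max and average pooling, can be illustrated with a small sketch. It assumes the simplest possible combination, a convex mix controlled by one scalar `mix` in [0, 1] that would be learned in a real network. The function name and the fixed value of `mix` are hypothetical; the abstract does not specify the exact form of either of the two strategies.

```python
def mixed_pool(region, mix):
    """Blend max pooling and average pooling over one pooling region.

    mix = 1.0 gives pure max pooling, mix = 0.0 gives pure average pooling.
    In a trained network `mix` would be a learned parameter (assumption).
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return mix * max_value + (1.0 - mix) * avg_value


# Example: one 2x2 pooling region flattened to a list.
region = [1.0, 2.0, 3.0, 6.0]
half_and_half = mixed_pool(region, 0.5)
```

A gated variant would compute `mix` from the region itself rather than sharing one value, which is presumably what distinguishes the second of the two strategies.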
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
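The sampling step described above can be sketched in a few lines. This is a minimal illustration, assuming non-negative activations (for example after a rectifier) and a uniform fallback when every activation in the region is zero; both assumptions, and the function names, are not taken from the abstract.

```python
import random


def pooling_probabilities(region):
    """Normalize the activations in a pooling region into a multinomial."""
    total = sum(region)
    if total == 0:
        # Assumption: fall back to a uniform choice when all activations are zero.
        return [1.0 / len(region)] * len(region)
    return [activation / total for activation in region]


def stochastic_pool(region, rng):
    """Pick one activation from the region, with probability proportional to its value."""
    probabilities = pooling_probabilities(region)
    return rng.choices(region, weights=probabilities, k=1)[0]


# Example: larger activations are selected more often.
rng = random.Random(0)
samples = [stochastic_pool([1.0, 2.0, 1.0], rng) for _ in range(1000)]
```

Because the probabilities come directly from the activations, the procedure introduces no hyper-parameters, which matches the claim in the abstract.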
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
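Frame stacking and reduced frame rate, as mentioned above, can be sketched as a preprocessing step on a sequence of acoustic feature vectors. The sketch assumes that each stacked frame concatenates `stack` consecutive input frames and that only every `stride`-th stacked frame is kept; the function name, the parameter names, and the values used in the example are illustrative rather than the settings used in the paper.

```python
def stack_and_subsample(frames, stack, stride):
    """Concatenate `stack` consecutive frames, then keep every `stride`-th result.

    frames: list of feature vectors (lists of floats), one per time step.
    Returns a shorter sequence of wider feature vectors.
    """
    if stack < 1 or stride < 1:
        raise ValueError("stack and stride must be positive")
    stacked = []
    for start in range(0, len(frames) - stack + 1):
        window = frames[start:start + stack]
        # Flatten the window into one long feature vector.
        stacked.append([value for frame in window for value in frame])
    return stacked[::stride]


# Example: six one-dimensional frames, stacked in pairs, every third kept.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
reduced = stack_and_subsample(frames, stack=2, stride=3)
```

The network then runs on fewer time steps, which is where the faster decoding comes from, while each step still sees the context of several original frames.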
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
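The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below assumes the simplest form, $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$, applied directly to a list of inputs; it omits the learned projections and any other parameters of the unit in the paper, and the function name is hypothetical.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm of the inputs: (mean of |x|^p) ** (1/p)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(inputs)
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)


values = [1.0, 2.0, 3.0, 4.0]
average_like = lp_unit(values, 1)    # mean of absolute values
rms_like = lp_unit(values, 2)        # root-mean-square
max_like = lp_unit(values, 100)      # approaches max(|x|) as p grows
```

For non-negative inputs, `p = 1` recovers average pooling and `p = 2` recovers root-mean-square pooling, while large `p` approaches max pooling from below.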
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
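As an illustration of direction (1) above, a minimal sketch of one way to combine max and average pooling is a convex mix of the two over a single pooling region. The function name `mixed_pool` and the mixing weight `a` are hypothetical names chosen for this sketch; here `a` is a fixed number, whereas the abstract describes learning the combination, which is not shown.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is an assumed mixing weight in [0, 1]: a = 1 gives plain max
    pooling, a = 0 gives plain average pooling. In a trained network
    this weight would be learned; here it is passed in by hand.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing weight must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return a * max_val + (1.0 - a) * avg_val


region = [1.0, 2.0, 3.0, 6.0]
print(mixed_pool(region, 1.0))  # behaves like max pooling
print(mixed_pool(region, 0.0))  # behaves like average pooling
print(mixed_pool(region, 0.5))  # halfway between the two
```

Keeping `a` inside [0, 1] means the pooled value always lies between the average and the maximum of the region.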
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
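The stochastic pooling step described above can be sketched for a single pooling region as follows. This is a minimal illustration, not the authors' implementation: the name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for example, outputs of a rectifier), and returning zero for an all-zero region is an assumption made here so the sampling probabilities stay well defined.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its
    value, i.e. from a multinomial distribution given by the
    activities within the region. Activations are assumed to be
    non-negative.
    """
    total = sum(region)
    if total == 0:
        # Assumption for this sketch: an all-zero region pools to zero.
        return 0.0
    probs = [a / total for a in region]
    idx = rng.choices(range(len(region)), weights=probs, k=1)[0]
    return region[idx]


rng = random.Random(0)  # seeded generator so repeated runs match
region = [0.5, 1.5, 0.0, 2.0]
samples = [stochastic_pool(region, rng) for _ in range(5)]
print(samples)  # each entry is one of the non-zero activations
```

Larger activations are chosen more often, while zero activations are never chosen as long as some other activation in the region is positive.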
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
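The pooling interpretation in the abstract above can be checked numerically. The following is a minimal sketch, not the authors' unit: it assumes "normalized" means dividing by the number of inputs before taking the p-th root, and it omits the learned projections. The function name `lp_pool` is hypothetical.

```python
import numpy as np

def lp_pool(x, p):
    """Normalized L_p norm of a 1-D array: (mean(|x_i|^p))^(1/p).

    Assumption: normalization is by the number of inputs.
    """
    x = np.abs(np.asarray(x, dtype=float))
    if np.isinf(p):
        return float(x.max())  # limiting case p -> infinity is max pooling
    return float(np.mean(x ** p) ** (1.0 / p))

values = [1.0, 2.0, 3.0, 4.0]
avg = lp_pool(values, 1)       # p = 1: average of absolute values
rms = lp_pool(values, 2)       # p = 2: root-mean-square
mx = lp_pool(values, np.inf)   # p = inf: max
large_p = lp_pool(values, 50)  # large finite p moves toward the max
```

Under these assumptions, p = 1, p = 2, and p = infinity recover average, root-mean-square, and max pooling over non-negative inputs, and intermediate orders interpolate between them.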
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
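The masking idea in the abstract above can be illustrated for a single hidden layer. This is a sketch under stated assumptions, not the paper's implementation: the degree-based construction, the natural input ordering, and the name `made_masks` are choices made here for illustration, since the abstract does not say how the masks are built.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumed scheme: input i gets degree i (natural ordering) and each
    hidden unit gets a random degree in [1, n_inputs - 1].
    """
    rng = np.random.default_rng(seed)
    in_deg = np.arange(1, n_inputs + 1)
    hid_deg = rng.integers(1, n_inputs, size=n_hidden)  # upper bound is exclusive
    # Hidden unit k may read input j only if hid_deg[k] >= in_deg[j].
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(float)
    # Output d may read hidden unit k only if in_deg[d] > hid_deg[k].
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(float)
    return mask_in, mask_out

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
# connectivity[d, j] counts the paths from input j to output d.
connectivity = mask_out @ mask_in
```

Multiplying the weight matrices elementwise by these masks leaves `connectivity` strictly lower triangular, so output d depends only on inputs that precede it in the ordering, and the first output depends on no input at all.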
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.
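The "more than 13%" figure in the abstract above follows from the two reported word error rates. A short arithmetic sketch, assuming the usual definition of relative improvement as the reduction divided by the baseline (the variable names are illustrative):

```python
baseline_wer = 4.54  # baseline WER reported in the abstract
student_wer = 3.93   # WER of the small DNN trained on soft RNN alignments

# Relative improvement: reduction in WER as a fraction of the baseline.
relative_improvement = (baseline_wer - student_wer) / baseline_wer
print(f"{relative_improvement:.1%}")  # prints 13.4%
```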
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBNs) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects of convolutional neural network (CNN) style architectures are examined: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
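As an illustration of the pooling operation described above, the following is a minimal sketch of a plain (non-probabilistic) maxout unit: it takes the maximum over a group of linear transformations of its input. The function name `maxout` and the example weights are hypothetical and chosen only for illustration; the probabilistic variant proposed in the abstract is not reproduced here.

```python
def maxout(x, weights, biases):
    """Return the max over a group of affine maps w_j . x + b_j."""
    return max(
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )

# Hypothetical group of two linear pieces; together they compute |x[0]|.
weights = [[1.0, 0.0], [-1.0, 0.0]]
biases = [0.0, 0.0]

pos = maxout([2.5, 7.0], weights, biases)
neg = maxout([-2.5, 7.0], weights, biases)
```

With this particular group the unit returns the same value for an input and its sign-flipped counterpart, a small example of the partial invariance that pooling over linear pieces can provide.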
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. This explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
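The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below assumes the simplest form, $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$, applied directly to a list of inputs; the learned projections and per-unit orders described in the abstract are omitted, and the name `lp_unit` is hypothetical.

```python
def lp_unit(xs, p):
    """Normalized L_p norm: (mean(|x_i|^p))^(1/p). Assumes p >= 1."""
    n = len(xs)
    return (sum(abs(x) ** p for x in xs) / n) ** (1.0 / p)

inputs = [3.0, -4.0, 0.0, 1.0]

mean_abs = lp_unit(inputs, 1)    # p = 1: average of absolute values
rms = lp_unit(inputs, 2)         # p = 2: root-mean-square
near_max = lp_unit(inputs, 200)  # large p: approaches max(|x_i|)
```

As $p$ grows the largest-magnitude input dominates the sum, so the output moves from the average towards the maximum; this is the sense in which the order $p$ interpolates between the familiar pooling operators.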
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
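The masking idea can be sketched for a single hidden layer. The construction below is one way to satisfy the autoregressive constraint, under the assumption that each hidden unit is assigned an integer "degree" between 1 and D-1 (the specific degrees used here are arbitrary and hypothetical): an input may feed a hidden unit only if its index does not exceed the unit's degree, and an output may read a hidden unit only if its index strictly exceeds that degree.

```python
def build_masks(num_inputs, hidden_degrees):
    """Build binary masks for a one-hidden-layer autoencoder (1-based ordering)."""
    # Input d may feed hidden unit k only if degree[k] >= d.
    mask_in = [[1 if m >= d else 0 for d in range(1, num_inputs + 1)]
               for m in hidden_degrees]
    # Output d may read hidden unit k only if d > degree[k].
    mask_out = [[1 if d > m else 0 for m in hidden_degrees]
                for d in range(1, num_inputs + 1)]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """paths[d][j] = number of hidden units linking input j to output d."""
    num_inputs = len(mask_out)
    num_hidden = len(mask_in)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for d in range(num_inputs)]

# Hypothetical example: 4 inputs, 5 hidden units with arbitrary degrees.
mask_in, mask_out = build_masks(4, [1, 2, 3, 1, 2])
paths = connectivity(mask_in, mask_out)

# Output d must have no path from input j whenever j >= d.
is_autoregressive = all(paths[d][j] == 0
                        for d in range(4) for j in range(4) if j >= d)
```

In a real network these masks would be multiplied elementwise with the weight matrices; the path count here only verifies that each output depends solely on inputs that precede it in the ordering.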
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
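Two quantities in this abstract are easy to make concrete: the Kullback-Leibler divergence between a teacher's soft output distribution and a student's, and the relative WER improvement. The sketch below assumes the divergence is taken in the direction KL(teacher || student), which is a common choice but is not stated in the abstract; the example distributions and function names are hypothetical.

```python
import math

def kl_divergence(teacher, student):
    """KL(teacher || student) for two discrete distributions (assumed direction)."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def relative_improvement(baseline, new):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - new) / baseline

# Hypothetical soft alignment from the teacher and two candidate students.
teacher_probs = [0.7, 0.2, 0.1]
uniform_student = [1.0 / 3, 1.0 / 3, 1.0 / 3]

kl_mismatch = kl_divergence(teacher_probs, uniform_student)
kl_match = kl_divergence(teacher_probs, teacher_probs)

# WER figures quoted in the abstract: baseline 4.54, distilled student 3.93.
wer_gain = relative_improvement(4.54, 3.93)
```

The divergence is zero only when the student reproduces the teacher's distribution, which is what makes it usable as a training signal, and the WER figures work out to a relative gain of roughly 13.4%, consistent with the "more than 13%" stated above.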
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
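One way to read "a low-rank approximation to an ideal pairwise affinity matrix" is as a squared Frobenius distance between the Gram matrix of the embeddings and the Gram matrix of one-hot partition labels. The sketch below assumes that form of the objective; the abstract itself does not spell it out, and the function names and toy data are hypothetical.

```python
def gram(rows):
    """Pairwise inner products between all rows (an N x N affinity matrix)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def affinity_loss(embeddings, labels):
    """Squared Frobenius distance between estimated and ideal affinities."""
    estimated = gram(embeddings)
    ideal = gram(labels)  # 1 where two items share a partition, else 0
    return sum((e - i) ** 2
               for est_row, ideal_row in zip(estimated, ideal)
               for e, i in zip(est_row, ideal_row))

# Hypothetical toy case: three time-frequency bins, two sources (one-hot labels).
labels = [[1, 0], [1, 0], [0, 1]]

perfect = affinity_loss([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], labels)
collapsed = affinity_loss([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]], labels)
```

Because the loss depends only on whether pairs of items belong together, permuting the source labels leaves it unchanged, which matches the class-independent behaviour described above. This direct version builds N x N matrices and is meant only to show the quantity being minimized.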
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy from layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
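The abstract above names the Gramian Angular Summation/Difference Fields without defining them. As a minimal sketch, assuming the commonly used formulation (rescale the series, map each value to an angle with arccos, then take cos of angle sums for GASF and sin of angle differences for GADF), the encoding could look like the following. The function names and the choice of the [-1, 1] rescaling range are illustrative assumptions, not taken from the abstract, and the series is assumed to be non-constant.

```python
import math

def rescale(series, low=-1.0, high=1.0):
    """Linearly rescale a non-constant series into [low, high]."""
    lo, hi = min(series), max(series)
    return [low + (high - low) * (x - lo) / (hi - lo) for x in series]

def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D time series."""
    x = rescale(series)
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]  # summation field
    gadf = [[math.sin(a - b) for b in phi] for a in phi]  # difference field
    return gasf, gadf

gasf, gadf = gramian_angular_fields([1.0, 2.0, 3.0])
```

Each field is an n-by-n matrix for a series of length n, which is what allows image models to be applied to it.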
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
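The straight-path analysis described above can be illustrated with a minimal sketch: evaluate the loss at evenly spaced points on the line segment between the initial and final parameter vectors. The helper names, the flat-list parameter representation, and the toy quadratic loss are all illustrative assumptions, not details from the abstract.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b
            for a, b in zip(theta_init, theta_final)]

def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Return [(alpha, loss)] for alpha evenly spaced in [0, 1]."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]

# Toy stand-in for a training loss: squared distance to the point (1, 1).
def toy_loss(theta):
    return sum((t - 1.0) ** 2 for t in theta)

curve = loss_along_path(toy_loss, [0.0, 0.0], [1.0, 1.0])
```

A bump in the resulting curve would indicate an obstacle on the path; for this toy loss the curve decreases monotonically.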
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
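The rank-based selection rule described above (rank datapoints by latest known loss, then let selection probability decay exponentially with rank) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `selection_pressure` parameter, defined here as the probability ratio between the highest-loss and lowest-loss datapoints, and the function names are assumptions.

```python
import random

def rank_selection_probs(latest_losses, selection_pressure=100.0):
    """Selection probability per datapoint, decaying exponentially with rank.

    Rank 0 is the datapoint with the highest latest known loss.
    selection_pressure (hypothetical knob) is the ratio between the
    probabilities of the first-ranked and last-ranked datapoints.
    """
    n = len(latest_losses)
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    decay = selection_pressure ** (1.0 / max(n - 1, 1))
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = 1.0 / decay ** rank
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(latest_losses, batch_size, rng, selection_pressure=100.0):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, selection_pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)

losses = [0.1, 2.0, 0.5, 1.0]
probs = rank_selection_probs(losses, selection_pressure=8.0)
batch = sample_batch(losses, batch_size=3, rng=random.Random(0))
```

Setting `selection_pressure` to 1 recovers uniform sampling, which gives one way to control the selection pressure over time.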
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
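For readers unfamiliar with SVRG, the following is a minimal sketch of the standard update: take a full gradient at a snapshot point, then run inner steps that correct each stochastic gradient with the snapshot's gradient for the same component. The function signature, the choice of the last inner iterate as the next snapshot, and the toy one-dimensional problem are illustrative assumptions. The toy problem is convex and only demonstrates the mechanics, not the nonconvex rates analyzed in the abstract.

```python
import random

def svrg(grad_i, n, w0, step_size, num_epochs, inner_steps, rng):
    """Minimize (1/n) * sum_i f_i(w), where grad_i(w, i) returns grad f_i(w)."""
    snapshot = list(w0)
    for _ in range(num_epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = [0.0] * len(w0)
        for i in range(n):
            g = grad_i(snapshot, i)
            full_grad = [fg + gj / n for fg, gj in zip(full_grad, g)]
        w = list(snapshot)
        for _ in range(inner_steps):
            i = rng.randrange(n)
            g_w = grad_i(w, i)
            g_s = grad_i(snapshot, i)
            # Variance-reduced gradient: g_w - g_s + full_grad.
            w = [wj - step_size * (a - b + m)
                 for wj, a, b, m in zip(w, g_w, g_s, full_grad)]
        snapshot = w  # assumption: the last inner iterate becomes the snapshot
    return snapshot

# Toy problem: f_i(w) = 0.5 * (w - t_i)^2, minimized at the mean of the targets.
targets = [1.0, 2.0, 3.0, 6.0]

def toy_grad(w, i):
    return [w[0] - targets[i]]

w_hat = svrg(toy_grad, len(targets), [0.0], 0.5, 3, 10, random.Random(0))
```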
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation effort but also do not map well onto parallel computation. We introduce structured sparsity at various scales for convolutional neural networks: channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word-length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory-based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding a compressive MBN that is both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they have led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
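The straight-path analysis described in the abstract above can be illustrated with a minimal sketch: evaluate the loss at evenly spaced points on the line segment between the initial and final parameter vectors. The function names and the toy quadratic loss below are hypothetical stand-ins chosen for illustration, not the authors' code; a real experiment would substitute a network's training loss and its flattened parameters.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors (alpha in [0, 1])."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at num_points evenly spaced points from theta_init to theta_final."""
    alphas = [k / (num_points - 1) for k in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    """Hypothetical stand-in for a training loss: a quadratic minimized at all-ones."""
    return sum((t - 1.0) ** 2 for t in theta)


# Loss curve from a (hypothetical) initialization to a (hypothetical) solution.
curve = loss_along_path(toy_loss, [3.0, -1.0], [1.0, 1.0], num_points=5)
for alpha, loss in curve:
    print(f"alpha={alpha:.2f}  loss={loss:.4f}")
```

A bump in such a curve would indicate an obstacle on the straight path; a curve that decreases steadily, as this convex toy loss necessarily does, shows none.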
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
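The rank-based selection rule in the abstract above can be sketched as follows: sort datapoints by their latest known loss and give each a selection probability that decays exponentially with its rank. The parametrization is an assumption made for illustration: `s_e` is a hypothetical "selection pressure" defined here as the ratio between the probabilities of the top-ranked and bottom-ranked datapoints, and sampling is done with replacement. The abstract does not specify either detail.

```python
import math
import random


def rank_selection_probs(losses, s_e=100.0):
    """Selection probability per datapoint, decaying exponentially with loss rank.

    s_e (assumed parametrization) is the ratio between the probability of the
    datapoint with the greatest loss and that of the datapoint with the smallest.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(s_e) / max(n - 1, 1)
    weights = [math.exp(-decay * rank) for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, rng, s_e=100.0):
    """Draw datapoint indices for one batch (with replacement, an assumption)."""
    probs = rank_selection_probs(losses, s_e)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


latest_losses = [0.1, 2.0, 0.5]  # hypothetical latest known loss values
print(rank_selection_probs(latest_losses, s_e=4.0))
print(sample_batch(latest_losses, batch_size=5, rng=random.Random(0), s_e=4.0))
```

Setting `s_e=1.0` recovers uniform sampling, so the same routine covers the baseline; how the pressure should change over the course of training is one of the open questions the abstract raises.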
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we first develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence of acoustic signals, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
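The stochastic rounding idea in the abstract above can be illustrated with a minimal sketch. It assumes a signed fixed-point format with a hypothetical split of 16 total bits and 8 fractional bits, and saturation on overflow; the function name and these parameters are illustrative choices, not taken from the paper. A value is rounded up with probability equal to its distance from the lower grid point, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(value, frac_bits=8, word_bits=16, rng=random):
    """Quantize `value` to signed fixed point using stochastic rounding.

    frac_bits and word_bits are illustrative assumptions, not values
    prescribed by the abstract.
    """
    scale = 1 << frac_bits
    scaled = value * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance from the lower grid point.
    quantized = lower + (1 if rng.random() < (scaled - lower) else 0)
    # Saturate to the representable range of a signed word_bits-wide integer.
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    quantized = max(q_min, min(q_max, quantized))
    return quantized / scale
```

Averaged over many draws, the quantized value of 0.3 stays close to 0.3, whereas round-to-nearest would always return 77/256.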
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
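The relationships between input shape, kernel shape, zero padding, strides and output shape mentioned in the abstract above can be sketched for one spatial dimension. The formulas below are the standard ones for convolution and transposed convolution; the function and parameter names (including `extra`, which accounts for input sizes that the stride does not divide evenly) are illustrative and not taken from the guide itself.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size of a convolution or pooling layer along one dimension.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit in the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1, extra=0):
    """Output size of the transposed convolution associated with (k, p, s).

    `extra` is (original_input + 2p - k) % s, the rows or columns the
    forward convolution never reached; it is 0 when the stride divides evenly.
    """
    return s * (i - 1) + k - 2 * p + extra
```

For example, a 3-wide kernel with padding 1 and stride 2 maps an input of size 5 to size 3, and the associated transposed convolution maps size 3 back to size 5.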
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
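One simple way to combine max and average pooling, as the abstract above describes in general terms, is a convex mixture of the two. The sketch below assumes a single mixing proportion `mix` that is passed in as a fixed number; in the setting the abstract describes it would be learned, and the learning step is not shown here. The function names and the non-overlapping window layout are illustrative assumptions.

```python
def mixed_pool(region, mix):
    """Blend max and average pooling over one pooling region.

    mix = 1.0 gives max pooling, mix = 0.0 gives average pooling.
    `mix` is a hypothetical fixed parameter here, not a learned one.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    return mix * max(region) + (1.0 - mix) * (sum(region) / len(region))


def mixed_pool_2d(fmap, mix, size=2):
    """Apply mixed_pool over non-overlapping size x size windows of a 2-D list."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            region = [fmap[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            pooled_row.append(mixed_pool(region, mix))
        pooled.append(pooled_row)
    return pooled
```

Setting `mix` to 0 or 1 recovers the two conventional pooling operations, so the mixture can only widen the set of functions the layer can express.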
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
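A minimal sketch of the sampling step described above, assuming non-negative activations (for example, after a rectifier) and a single one-dimensional pooling region. The name `stochastic_pool` and the all-zero fallback are assumptions of this example, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng):
    """Pick one activation from `region` with probability proportional to its value.

    Assumes every activation is non-negative. If all activations are zero,
    there is no distribution to sample from, so 0.0 is returned.
    """
    if any(x < 0 for x in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        return 0.0
    # The multinomial probabilities are region[i] / total; random.choices
    # normalizes the weights itself.
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)
print(stochastic_pool([0.0, 0.0, 5.0, 0.0], rng))  # 5.0 (the only non-zero activation)
print(stochastic_pool([1.0, 3.0], rng))
```

Larger activations are chosen more often, but smaller ones still get selected occasionally, which is where the regularizing noise comes from.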
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
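To make the pooling interpretation above concrete, the sketch below computes a normalized $L_p$ norm over a small set of inputs and shows how the order $p$ moves it between average-like, root-mean-square, and max-like behaviour. The function name `lp_pool` is an assumption of this example, and the learned projections and per-unit learned orders mentioned in the abstract are deliberately left out.

```python
def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v_i|^p)) ** (1/p), for p >= 1."""
    if not values:
        raise ValueError("values must be non-empty")
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


values = [1.0, 2.0, 3.0, 4.0]
print(lp_pool(values, 1))    # 2.5, the mean of the absolute values
print(lp_pool(values, 2))    # root-mean-square
print(lp_pool(values, 200))  # approaches max(|v_i|) = 4.0 as p grows
```

With p = 1 the unit averages absolute values, with p = 2 it gives the root-mean-square, and as p grows it approaches max pooling over absolute values.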
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
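The masking idea above can be sketched for a single hidden layer. The degree-based construction used here (each hidden unit gets a degree between 1 and D-1, and connections are kept only when they respect the input ordering) is one way to satisfy the autoregressive constraint and is an assumption of this example; the names `made_masks` and `path_counts` are hypothetical.

```python
import random


def made_masks(n_inputs, n_hidden, rng):
    """Build 0/1 masks for a one-hidden-layer autoencoder (requires n_inputs >= 2).

    Inputs and outputs are numbered 1..n_inputs in their natural order.
    Hidden unit k gets a degree m_k in 1..n_inputs-1 and may see inputs d <= m_k.
    Output d may see hidden unit k only if d > m_k.
    """
    if n_inputs < 2:
        raise ValueError("need at least two inputs")
    degrees = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # w_mask[k][d-1]: connection from input d to hidden unit k
    w_mask = [[1 if m >= d else 0 for d in range(1, n_inputs + 1)] for m in degrees]
    # v_mask[d-1][k]: connection from hidden unit k to output d
    v_mask = [[1 if d > m else 0 for m in degrees] for d in range(1, n_inputs + 1)]
    return w_mask, v_mask


def path_counts(w_mask, v_mask):
    """Count the paths from each input (column) to each output (row)."""
    n_inputs, n_hidden = len(v_mask), len(w_mask)
    return [[sum(v_mask[o][k] * w_mask[k][i] for k in range(n_hidden))
             for i in range(n_inputs)] for o in range(n_inputs)]


w_mask, v_mask = made_masks(n_inputs=4, n_hidden=8, rng=random.Random(0))
for row in path_counts(w_mask, v_mask):
    print(row)  # entries on and above the diagonal are 0
```

Because every surviving path runs from an input d_in to an output d_out with d_out > d_in, output d depends only on inputs 1..d-1, and the first output depends on no input at all.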
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm requires only 5% of the computations (multiplications) of traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
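The relative improvement quoted in the abstract above can be checked with a line of arithmetic. The sketch below is only an illustration; the function name `relative_improvement` is hypothetical, and the two WER values are the ones stated in the abstract.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Fraction by which `improved` reduces an error rate relative to `baseline`."""
    return (baseline - improved) / baseline


# WER figures as stated in the abstract (WSJ eval92).
baseline_wer = 4.54
distilled_wer = 3.93

gain = relative_improvement(baseline_wer, distilled_wer)
print(f"{gain:.1%}")  # prints "13.4%"
```

The result, about 13.4%, is consistent with the "more than 13%" claim.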
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
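The factorization stated in the abstract above follows from marginalizing over $H$. The short derivation below is a sketch added for illustration, using only the abstract's own symbols and its assumption that $P(X = x \mid H = h) = 0$ whenever $h \neq f(x)$.

```latex
\begin{align*}
P(X = x) &= \sum_{h} P(X = x \mid H = h)\, P(H = h) \\
         &= P(X = x \mid H = f(x))\, P(H = f(x))
            && \text{(all terms with } h \neq f(x) \text{ vanish)} \\
\log P(X = x) &= \underbrace{\log P(X = x \mid H = f(x))}_{\text{reconstruction term}}
               + \underbrace{\log P(H = f(x))}_{\text{regularizer on } h = f(x)}
\end{align*}
```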
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
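The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the `pressure` parameter (taken here as roughly the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) are hypothetical, and batches are drawn with replacement for simplicity.

```python
import math
import random


def rank_selection_probs(losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    The datapoint with the largest latest-known loss gets rank 0. Each
    subsequent rank is less likely by a constant factor, chosen so that
    going down n ranks would shrink the probability by `pressure`.
    """
    n = len(losses)
    # Indices sorted from highest loss to lowest loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(-math.log(pressure) / n)  # ratio between adjacent ranks
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


latest_losses = [0.1, 2.0, 0.5, 1.0]
probs = rank_selection_probs(latest_losses, pressure=100.0)
batch = sample_batch(latest_losses, batch_size=8, rng=random.Random(0))
```

With `pressure=1.0` every datapoint is equally likely, which recovers uniform sampling; larger values concentrate selection on high-loss datapoints.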
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
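The SVRG update analyzed in the abstract above can be illustrated with a minimal sketch. The one-dimensional nonconvex objective (a squared error through a sigmoid), the step size, the epoch count and every function name below are assumptions chosen for illustration; they are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_i(w, a, b):
    """Gradient of the nonconvex term f_i(w) = 0.5 * (sigmoid(a * w) - b) ** 2."""
    s = sigmoid(a * w)
    return (s - b) * s * (1.0 - s) * a


def full_grad(w, data):
    """Gradient of the finite-sum objective (1/n) * sum_i f_i(w)."""
    return sum(grad_i(w, a, b) for a, b in data) / len(data)


def svrg(data, w0, step=0.2, epochs=30, seed=0):
    """Run SVRG: one full gradient per epoch, then n variance-reduced steps."""
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        w_snap = w                       # snapshot point for this epoch
        mu = full_grad(w_snap, data)     # full gradient at the snapshot
        for _ in range(len(data)):
            a, b = data[rng.randrange(len(data))]
            # Stochastic gradient corrected by the snapshot gradient of the same term.
            v = grad_i(w, a, b) - grad_i(w_snap, a, b) + mu
            w -= step * v
    return w


def make_data(n=20, w_true=1.5, seed=1):
    """Hypothetical synthetic data whose targets are generated by w_true."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    return [(a, sigmoid(a * w_true)) for a in inputs]
```

Calling `svrg(make_data(), 0.0)` returns an iterate whose full gradient is smaller in magnitude than at the starting point. The correction term `grad_i(w, a, b) - grad_i(w_snap, a, b) + mu` is what distinguishes SVRG from plain SGD: it stays an unbiased gradient estimate while its variance shrinks as `w` approaches the snapshot.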
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speedup, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speedup is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain those for full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
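Stochastic rounding, the rounding scheme the abstract above credits for making 16-bit fixed-point training work, can be sketched as follows. This is a minimal illustration under assumptions: the function name, the `frac_bits` parameter and the omission of saturation to a fixed word length are choices made here, not details from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with spacing 2 ** -frac_bits.

    x is rounded up with probability equal to its fractional distance from
    the lower grid point, so the rounded value equals x in expectation.
    """
    eps = 2.0 ** -frac_bits
    scaled = x / eps
    lower = math.floor(scaled)
    p_up = scaled - lower                # in [0, 1): probability of rounding up
    return (lower + (1 if rng.random() < p_up else 0)) * eps
```

With `frac_bits=2` the grid spacing is 0.25, so `stochastic_round(0.3, frac_bits=2)` returns either 0.25 or 0.5, and the average over many calls approaches 0.3. Round-to-nearest would always return 0.25 here, which is how small updates can be lost systematically at low precision.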
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
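The relationships among input shape, kernel shape, zero padding, strides and output shape mentioned in the abstract above reduce, along a single spatial axis, to two standard formulas. The sketch below uses hypothetical function names and assumes square, undilated kernels with symmetric padding and no extra output padding for the transposed case.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution or pooling layer along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    Computes floor((i + 2p - k) / s) + 1.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution paired with (k, s, p).

    Computes s * (i - 1) + k - 2p, which inverts conv_output_size whenever
    (o + 2p - k) is divisible by s for the original input size o.
    """
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide kernel with unit stride and no padding maps a 28-wide input to 24, while a 3-wide kernel with stride 2 and padding 1 maps 5 to 3, and the matching transposed convolution maps 3 back to 5.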
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of "shadow" groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the "simplest", which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
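The masking idea in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how the masks are built, so the degree-based scheme below (each unit gets an integer "degree", and a connection is kept only when the degrees permit it) and the names made_masks and connectivity are assumptions chosen for the example, with a single hidden layer and the natural input ordering.

```python
import random


def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumed scheme (not stated in the abstract): input d has degree d + 1,
    each hidden unit draws a degree in [1, n_inputs - 1]. Requires n_inputs >= 2.
    """
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Hidden unit k may read input d only if hid_deg[k] >= in_deg[d].
    mask_in = [[1 if hid_deg[k] >= in_deg[d] else 0 for d in range(n_inputs)]
               for k in range(n_hidden)]
    # Output d may read hidden unit k only if in_deg[d] > hid_deg[k].
    mask_out = [[1 if in_deg[d] > hid_deg[k] else 0 for k in range(n_hidden)]
                for d in range(n_inputs)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """conn[o][i] counts the unmasked paths from input i to output o."""
    n_hidden = len(mask_in)
    n_inputs = len(mask_in[0])
    return [[sum(mask_out[o][k] * mask_in[k][i] for k in range(n_hidden))
             for i in range(n_inputs)]
            for o in range(len(mask_out))]


if __name__ == "__main__":
    m_in, m_out = made_masks(n_inputs=5, n_hidden=16)
    for row in connectivity(m_in, m_out):
        print(row)
```

With these masks, output o has no path from input i whenever i >= o, so each output depends only on earlier inputs in the ordering. The printed matrix is therefore strictly lower triangular in its nonzero pattern.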
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
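To make the encoding in the abstract above concrete, here is a minimal sketch of a Gramian Angular Summation Field. The abstract does not give the formula, so the definition used here is an assumption: rescale the series to [-1, 1], map each value to an angle phi = arccos(x), and set entry (i, j) to cos(phi_i + phi_j). The function names rescale and gasf are illustrative only.

```python
import math


def rescale(series, lo=-1.0, hi=1.0):
    """Linearly rescale a non-constant series to the interval [lo, hi]."""
    mn, mx = min(series), max(series)
    if mx == mn:
        raise ValueError("series must not be constant")
    return [lo + (hi - lo) * (v - mn) / (mx - mn) for v in series]


def gasf(series):
    """Return the n x n summation field as a list of rows.

    Assumed definition: entry (i, j) = cos(phi_i + phi_j),
    where phi = arccos(rescaled value).
    """
    x = rescale(series)
    # Clamp to [-1, 1] to guard against floating-point drift before arccos.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    return [[math.cos(a + b) for b in phi] for a in phi]


if __name__ == "__main__":
    for row in gasf([0.0, 1.0, 2.0, 3.0]):
        print(["%+.3f" % v for v in row])
```

The resulting matrix is symmetric, and its diagonal holds cos(2 * phi_i) = 2 * x_i**2 - 1, so the rescaled magnitudes can be read back from the diagonal. Passing lo=0.0 to rescale gives the 0/1 rescaling the abstract mentions for the bijection property.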
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
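The gradient reversal layer named in the abstract above can be illustrated without any deep learning package. The sketch below assumes the usual reading of the term: the layer is the identity in the forward pass and multiplies the incoming gradient by a negative constant in the backward pass. The class name GradientReversal and the scale parameter lam are hypothetical choices for this example, not taken from the abstract.

```python
class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass.

    `lam` is an assumed scaling factor for the reversed gradient.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is negated,
        # so a step that helps the domain classifier hurts domain separability.
        return [-self.lam * g for g in grad_output]


if __name__ == "__main__":
    layer = GradientReversal(lam=2.0)
    print(layer.forward([1.0, -2.0]))
    print(layer.backward([0.5, -1.0]))
```

Placed between a feature extractor and a domain classifier, such a layer lets ordinary backpropagation train the classifier to tell domains apart while pushing the features toward being domain-indiscriminate.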
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the multilayer bootstrap network with a neural network, in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding an effective and efficient method for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network, in a pseudo supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
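The rank-based selection rule described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `s_e` parameter (roughly the ratio between the selection probability of the highest-loss and lowest-loss datapoint), and the choice to sample with replacement are all assumptions made for illustration.

```python
import math
import random


def rank_selection_probs(losses, s_e=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    losses: non-empty list of the latest known loss per datapoint.
    s_e: hypothetical "selection pressure" parameter (assumption); values
         above 1 favour high-loss points, and 1.0 gives uniform sampling.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the greatest loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank decay factor, chosen so the weight shrinks by about s_e
    # between the first and the last rank.
    decay = math.exp(-math.log(s_e) / n)
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, s_e=100.0, rng=random):
    """Draw datapoint indices (with replacement) according to loss rank."""
    probs = rank_selection_probs(losses, s_e)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop one would recompute or update `losses` after each step, so that datapoints whose loss has dropped fall in rank and are selected less often.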
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
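The stochastic rounding scheme mentioned in the abstract above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function name, the split of a 16-bit word into `frac_bits` fractional bits, and the saturation behaviour at the ends of the range are all choices made here for the example.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased in
    expectation (before saturation). frac_bits and word_bits are
    hypothetical defaults, not values taken from the paper.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise round down.
    q = lower + (1 if rng.random() < (scaled - lower) else 0)
    # Saturate to the representable signed integer range (assumption).
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale
```

Unlike round-to-nearest, small updates that are below half a grid step are not always lost: they are applied occasionally, with a frequency proportional to their size.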
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
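The pooling behaviour described in the abstract above can be illustrated with a small sketch. It covers only the normalized $L_p$ norm over a group of already-computed projection values; the learned projections and per-unit learned orders mentioned in the abstract are left out, and the function name `lp_pool` is hypothetical.

```python
import numpy as np

def lp_pool(projections, p):
    """Normalized L_p norm over a group of projection values.

    Assumption: `projections` are the outputs of the linear projections
    feeding one unit; this sketch does not learn them or the order p.
    """
    z = np.abs(np.asarray(projections, dtype=float))
    return float(np.mean(z ** p) ** (1.0 / p))

z = [1.0, -2.0, 3.0, 0.5]
print(lp_pool(z, 1))    # mean of absolute values
print(lp_pool(z, 2))    # root-mean-square
print(lp_pool(z, 200))  # approaches max(|z|) as p grows
```

With p = 1 the unit averages absolute values, with p = 2 it computes the root-mean-square, and for large p it approaches max pooling over absolute values.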
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
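The masking idea in the abstract above can be sketched for a single hidden layer. This is a minimal illustration under stated assumptions, not the authors' implementation: the degree-based mask construction, the function name `made_masks`, and the restriction to one hidden layer and one natural ordering are choices made for this example, and it requires at least two inputs.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks so output d depends only on inputs before d.

    Assumption: each input gets a degree 1..D (its position in the
    ordering) and each hidden unit a random degree in 1..D-1.
    """
    rng = np.random.default_rng(seed)
    m_in = np.arange(1, n_inputs + 1)
    m_hid = rng.integers(1, n_inputs, size=n_hidden)  # values in [1, D-1]
    # Hidden unit k may see input j only if degree(k) >= degree(j).
    mask_in = (m_hid[:, None] >= m_in[None, :]).astype(float)
    # Output i may see hidden unit k only if degree(i) > degree(k).
    mask_out = (m_in[:, None] > m_hid[None, :]).astype(float)
    return mask_in, mask_out

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
connectivity = mask_out @ mask_in  # entry (i, j) > 0 if output i can see input j
print(np.triu(connectivity).sum())  # no path from input j to output i when j >= i
```

Multiplying the weight matrices elementwise by these masks leaves the product of masks strictly lower triangular, so each output can be read as a conditional given only the preceding inputs.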
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
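The random-walk picture in the abstract above can be checked with a small simulation. This is a rough sketch under assumptions: it uses linear layers with i.i.d. Gaussian entries of variance 1/width, which preserves the expected squared norm but is not necessarily the unbiased scaling derived in the paper, and the function names are hypothetical.

```python
import numpy as np

def log_norm_after_depth(width, depth, rng):
    """Push a unit vector through `depth` random linear layers."""
    x = rng.standard_normal(width)
    x /= np.linalg.norm(x)
    for _ in range(depth):
        # Assumed scaling: entries ~ N(0, 1/width).
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        x = w @ x
    return np.log(np.linalg.norm(x))

def log_norm_variance(width, depth, trials=200, seed=0):
    """Sample variance of the final log-norm over independent networks."""
    rng = np.random.default_rng(seed)
    samples = [log_norm_after_depth(width, depth, rng) for _ in range(trials)]
    return float(np.var(samples))

for width in (8, 64):
    for depth in (10, 40):
        print(width, depth, log_norm_variance(width, depth))
```

In this setup the variance of the log-norm should grow with depth and shrink as the layers get wider, in line with the qualitative claim of the abstract.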
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task, compared to a baseline WER of 4.54, a relative improvement of more than 13%.
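The training signal and the reported improvement in the abstract above can be made concrete with a short sketch. The KL direction (teacher distribution against student distribution), the averaging over frames, and the function names are assumptions made for illustration; only the two WER figures come from the abstract.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_to_student(teacher_probs, student_logits, eps=1e-12):
    """Mean KL(teacher || student) over frames (assumed direction)."""
    student_probs = softmax(student_logits)
    log_ratio = np.log(teacher_probs + eps) - np.log(student_probs + eps)
    return float(np.sum(teacher_probs * log_ratio, axis=-1).mean())

def relative_improvement(baseline_wer, new_wer):
    """Fractional reduction in word error rate."""
    return (baseline_wer - new_wer) / baseline_wer

rng = np.random.default_rng(0)
teacher = softmax(rng.standard_normal((4, 10)))  # stand-in soft alignments
student_logits = rng.standard_normal((4, 10))    # stand-in student outputs
print(kl_to_student(teacher, student_logits))
print(relative_improvement(4.54, 3.93))          # slightly above 0.13
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, which is what makes the soft alignments usable as a training target.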
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing full exploitation of the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
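One objective consistent with the description in the abstract above is the squared Frobenius distance between the embedding affinity matrix and the ideal affinity matrix built from partition labels. The sketch below is an illustration under that assumption, with hypothetical names: `V` holds one embedding per row and `Y` holds one-hot partition indicators. It also shows an algebraically equal form that never builds the large pairwise matrix.

```python
import numpy as np

def affinity_loss_direct(V, Y):
    """||V V^T - Y Y^T||_F^2, forming the N x N affinity matrices."""
    return float(np.sum((V @ V.T - Y @ Y.T) ** 2))

def affinity_loss_lowrank(V, Y):
    """Same value, using only D x D, D x C and C x C products."""
    return float(
        np.sum((V.T @ V) ** 2)
        - 2.0 * np.sum((V.T @ Y) ** 2)
        + np.sum((Y.T @ Y) ** 2)
    )

rng = np.random.default_rng(0)
n_points, embed_dim, n_sources = 50, 4, 2
V = rng.standard_normal((n_points, embed_dim))   # stand-in embeddings
labels = rng.integers(0, n_sources, size=n_points)
Y = np.eye(n_sources)[labels]                    # one-hot partition labels
print(affinity_loss_direct(V, Y), affinity_loss_lowrank(V, Y))
```

The second form follows from expanding the squared norm with traces, and its cost grows linearly rather than quadratically in the number of points.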
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
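The role of the rounding scheme described above can be illustrated with a minimal sketch. It assumes a fixed-point grid with a hypothetical `frac_bits` fractional bits and ignores saturation at the ends of the 16-bit range; the function names are illustrative and are not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits (assumed grid spacing).

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, so the result is unbiased.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up = how far x sits above the lower grid point.
    round_up = rng.random() < (scaled - lower)
    return (lower + (1 if round_up else 0)) / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic round-to-nearest on the same grid, for comparison."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    update = 0.001  # smaller than half a grid step (1/512)
    samples = [stochastic_round(update, 8, rng) for _ in range(20000)]
    print("round to nearest:", round_to_nearest(update, 8))
    print("mean of stochastic rounding:", sum(samples) / len(samples))
```

In this sketch an update smaller than half a grid step is always lost under round-to-nearest, while stochastic rounding preserves it on average, which is one plausible reading of why the rounding scheme matters for low-precision training.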
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
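The rescaling invariance mentioned above can be checked numerically with a small sketch. It assumes a two-layer ReLU network without biases, and it uses the sum over input-to-output paths of the product of squared weights as one common form of a path-wise regularizer; both the network shape and the helper names are illustrative assumptions, not the paper's code.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def forward(W1, W2, x):
    """Two-layer ReLU network without biases: y = W2 * relu(W1 * x)."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) for row in W1])
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def rescale_unit(W1, W2, unit, c):
    """Scale the incoming weights of one hidden unit by c > 0 and its
    outgoing weights by 1/c. The network function is unchanged."""
    W1r = [[w * c for w in row] if j == unit else list(row)
           for j, row in enumerate(W1)]
    W2r = [[w / c if j == unit else w for j, w in enumerate(row)]
           for row in W2]
    return W1r, W2r


def path_norm_sq(W1, W2):
    """Sum over all input-hidden-output paths of the product of squared
    weights along the path (assumed form of the path-wise regularizer)."""
    return sum(W1[j][i] ** 2 * W2[k][j] ** 2
               for k in range(len(W2))
               for j in range(len(W1))
               for i in range(len(W1[0])))


if __name__ == "__main__":
    W1 = [[0.5, -1.0], [2.0, 0.25]]
    W2 = [[1.0, -0.5]]
    x = [1.0, 2.0]
    W1r, W2r = rescale_unit(W1, W2, unit=1, c=4.0)
    print(forward(W1, W2, x), forward(W1r, W2r, x))
    print(path_norm_sq(W1, W2), path_norm_sq(W1r, W2r))
```

The plain sum of squared weights changes under this rescaling while the output and the path-wise quantity do not, which is the kind of invariance the abstract argues the optimization geometry should respect.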
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
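The relationships between input shape, kernel shape, zero padding, strides and output shape can be sketched for one spatial dimension as follows. The formulas are the standard ones for convolution and transposed convolution; the function and parameter names are illustrative assumptions, and dilation and output padding are not covered.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size of a convolution (or pooling) layer along one axis.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output size of the transposed convolution associated with a
    convolution that uses kernel k, padding p and stride s, assuming the
    stride divides (o + 2p - k) exactly for the forward convolution."""
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # 28-wide input, 5-wide kernel, no padding, unit stride.
    print(conv_output_size(28, 5))
    # Half padding with a 3-wide kernel keeps the size at unit stride.
    print(conv_output_size(28, 3, p=1))
    # A stride-2 convolution and its transpose undo each other's shapes.
    o = conv_output_size(5, 3, p=1, s=2)
    print(o, transposed_conv_output_size(o, 3, p=1, s=2))
```

When the stride does not divide `i + 2p - k`, several input sizes map to the same output size, so the transposed convolution recovers only one of them under this formula.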
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
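The stochastic procedure described above can be sketched for a single pooling region. This minimal version assumes non-negative activations (for example, after a rectifier) and returns 0.0 for an all-zero region; the function name and the `rng` parameter are illustrative assumptions rather than the authors' implementation.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region, with probability
    proportional to the activation values (a multinomial over the region).

    Assumes every activation in `region` is non-negative.
    """
    total = sum(region)
    if total == 0:
        # No active unit in the region: nothing to sample from.
        return 0.0
    # random.choices samples in proportion to the given weights.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 3.0]
    picks = [stochastic_pool(region, rng) for _ in range(10000)]
    # The larger activation should be chosen roughly 75% of the time.
    print(picks.count(3.0) / len(picks))
```

Unlike max pooling, smaller activations in the region still have a chance of being passed forward, which is where the regularizing noise comes from in this sketch.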
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher-order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
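The gradient reversal layer named in the abstract above can be illustrated with a minimal sketch. It assumes the usual description of such a layer (identity on the forward pass, gradient multiplied by a negative constant on the backward pass); the class name `GradientReversal` and the weight `lam` are hypothetical and not taken from the abstract.

```python
import numpy as np


class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical trade-off weight (an assumption for illustration).
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return np.asarray(features, dtype=float)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is sign-flipped,
        # pushing features toward being uninformative about the domain.
        return -self.lam * np.asarray(grad_output, dtype=float)


layer = GradientReversal(lam=0.5)
features = np.array([[1.0, -2.0], [0.5, 3.0]])
out = layer.forward(features)
grad_in = layer.backward(np.ones_like(features))
```

In an autodiff framework the same behaviour would normally be expressed as a custom gradient rule rather than a hand-written backward method.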
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
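The "straight path from initialization to solution" in the abstract above can be sketched as evaluating the loss at evenly spaced points on the line between two parameter vectors. This is only an illustration under assumptions: `loss_along_line` and `toy_loss` are hypothetical names, and the quadratic loss stands in for a real network's training objective.

```python
import numpy as np


def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at (1 - a) * theta_init + a * theta_final for a in [0, 1]."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas]
    )
    return alphas, losses


def toy_loss(theta):
    # Hypothetical stand-in objective with its minimum at theta = 1.
    return float(np.sum((theta - 1.0) ** 2))


alphas, losses = loss_along_line(toy_loss, np.zeros(3), np.ones(3), num_points=5)
```

Plotting `losses` against `alphas` gives the one-dimensional cross-section of the objective; a curve that decreases without bumps would indicate no obstacle on that path.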
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
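The ranking strategy described in the abstract above (datapoints ranked by their latest known loss, with selection probability decaying exponentially in rank) can be sketched as follows. The exact decay formula is an assumption: here `selection_pressure` is a hypothetical parameter giving the probability ratio between the highest-ranked and lowest-ranked datapoints, and the function names are illustrative.

```python
import numpy as np


def rank_selection_probs(latest_losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank."""
    losses = np.asarray(latest_losses, dtype=float)
    n = losses.size
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = np.argsort(-losses, kind="stable")
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)
    # Assumed decay: rank 0 is selection_pressure times likelier than rank n - 1.
    decay = np.exp(-np.log(selection_pressure) * ranks / max(n - 1, 1))
    return decay / decay.sum()


def sample_batch(latest_losses, batch_size, rng, selection_pressure=100.0):
    """Draw a batch of distinct datapoint indices according to the rank probabilities."""
    probs = rank_selection_probs(latest_losses, selection_pressure)
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)


probs = rank_selection_probs([0.1, 2.0, 0.5, 1.0], selection_pressure=100.0)
batch = sample_batch([0.1, 2.0, 0.5, 1.0], batch_size=2, rng=np.random.default_rng(0))
```

After each training step the stored losses of the selected datapoints would be refreshed, so the ranking changes over time.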
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
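The stochastic rounding mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stochastic_round`, the split into `frac_bits` fractional bits inside a `word_bits`-wide signed word, and the saturation behaviour are all assumptions made for the example. The idea it shows is that a value is rounded up with probability equal to its fractional remainder, so the rounded value matches the input in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid (hypothetical helper).

    frac_bits: number of fractional bits (grid step is 2**-frac_bits).
    word_bits: total width of the signed fixed-point word.
    """
    eps = 2.0 ** -frac_bits              # smallest representable step
    scaled = x / eps
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so that the expected rounded value equals x.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    return max(min_int, min(max_int, lower)) * eps
```

With `frac_bits=2` the grid step is 0.25, so 0.3 is rounded to 0.25 most of the time and to 0.5 occasionally, and the average over many draws stays close to 0.3. Round-to-nearest would always return 0.25 and lose that information.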
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
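The relationship between input shape, kernel shape, zero padding, strides and output shape described above can be written down for one spatial axis. The sketch below uses the standard formulas for a strided, zero-padded convolution and its transposed counterpart; the function and parameter names are illustrative choices, not taken from the guide.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution associated with a
    convolution that uses kernel k, stride s and padding p."""
    return s * (i - 1) + k - 2 * p


# A 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives 3 outputs,
# and the matching transposed convolution maps those 3 values back to width 5.
small = conv_output_size(5, k=3, s=2, p=1)
restored = transposed_conv_output_size(small, k=3, s=2, p=1)
```

When `i + 2p - k` is not a multiple of the stride, several input sizes share the same output size, so the transposed formula above recovers only the smallest of them (for example, inputs of width 5 and 6 both map to 3 with these settings).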
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
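One way to read "combining of max and average pooling" is as a weighted blend of the two results over a pooling region. The sketch below shows only that blend for a single region given as a flat list of activations. The name `mixed_pool` and the mixing weight `a` are illustrative assumptions; in the abstract the pooling function is learned, whereas here `a` is simply passed in as a fixed number.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    a: mixing weight in [0, 1]; a=1 is pure max pooling, a=0 pure average.
    """
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return a * max_part + (1 - a) * avg_part


# Halfway between the max (6) and the average (3) of the window.
blended = mixed_pool([1, 2, 3, 6], a=0.5)
```

Because the output is differentiable with respect to `a`, a training procedure could in principle update the mixing weight by gradient descent alongside the other parameters.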
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
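The knowledge-transfer objective in the abstract above can be illustrated with a minimal Python sketch. It assumes the loss for one frame is KL(teacher || student), where the teacher distribution is the RNN's soft alignment and the student distribution is the small DNN's softmax output; the direction of the divergence, the function names, and the example numbers are all hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical soft alignment from the RNN teacher for one frame.
teacher_posterior = [0.7, 0.2, 0.1]
# Hypothetical raw outputs of the small DNN student for the same frame.
student_logits = [2.0, 1.0, 0.1]
student_posterior = softmax(student_logits)

# Per-frame training loss that the student would minimize.
loss = kl_divergence(teacher_posterior, student_posterior)
```

In training, this per-frame value would be averaged over all frames and minimized with respect to the student's parameters only.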
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
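The rank-based selection strategy in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `pressure` parameter (the ratio between the selection probability of the highest-loss and the lowest-loss datapoint), the function names, and the example loss values are hypothetical and do not come from the abstract.

```python
import math
import random

def rank_selection_probs(losses, pressure=100.0):
    """Return one selection probability per datapoint.

    Datapoints are ranked by latest known loss (rank 0 = greatest loss) and
    the probability decays exponentially with rank. `pressure` is a
    hypothetical knob: probability of the top rank / probability of the
    bottom rank.
    """
    n = len(losses)
    # Indices ordered from greatest loss to smallest loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / max(n - 1, 1)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw batch indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)

# Hypothetical latest-known losses for four datapoints.
losses = [0.1, 2.5, 0.7, 1.3]
probs = rank_selection_probs(losses, pressure=8.0)
batch = sample_batch(losses, batch_size=3, pressure=8.0, rng=random.Random(0))
```

After each training step the stored losses of the selected datapoints would be refreshed, so the ranking changes over time.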
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
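The SVRG update analyzed in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's algorithm as specified: the function name `svrg`, the per-example gradient callable `grad_i`, and the choice of taking the last inner iterate as the next snapshot are all assumptions made here, and the usage example is a small noise-free least-squares problem chosen only because it is easy to check.

```python
import numpy as np

def svrg(grad_i, w0, n, step, epochs, inner_steps, rng=None):
    """Stochastic variance reduced gradient on a finite sum of n terms.

    grad_i(w, i) is a hypothetical callable returning the gradient of the
    i-th term at w. The last inner iterate becomes the next snapshot
    (one common option; an assumption of this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    w_snap = np.array(w0, dtype=np.float64)
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap.copy()
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced estimate: unbiased for the full gradient at w.
            v = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w = w - step * v
        w_snap = w
    return w_snap

# Usage on an illustrative least-squares problem (synthetic data).
data_rng = np.random.default_rng(0)
A = data_rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
b = A @ w_true
w_hat = svrg(lambda w, i: (A[i] @ w - b[i]) * A[i],
             w0=np.zeros(3), n=50, step=0.01, epochs=40,
             inner_steps=100, rng=np.random.default_rng(1))
```

The correction term `grad_i(w_snap, i) - full_grad` has zero mean, so the estimate stays unbiased while its variance shrinks as `w` and `w_snap` approach a stationary point.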
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) of 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
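Stochastic rounding, as referred to in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `stochastic_round`, the split of a 16-bit word into 8 fractional bits, and the use of NumPy are choices made here, not details taken from the paper. A value is rounded up to the next grid point with probability equal to its fractional distance from the lower grid point, which makes the rounding unbiased in expectation.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, word_bits=16, rng=None):
    """Round x onto a signed fixed-point grid with stochastic rounding.

    frac_bits and word_bits are illustrative defaults (assumptions).
    Values outside the representable range are saturated.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.atleast_1d(np.asarray(x, dtype=np.float64)) * scale
    lower = np.floor(scaled)
    prob_up = scaled - lower            # distance to the lower grid point
    rounded = lower + (rng.random(scaled.shape) < prob_up)
    # Saturate to the signed integer range of the word length.
    lo, hi = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    return np.clip(rounded, lo, hi) / scale

# Usage: 0.3 lies between the grid points 76/256 and 77/256.
samples = stochastic_round(np.full(100000, 0.3), rng=np.random.default_rng(0))
```

Round-to-nearest would map every copy of 0.3 to 77/256 and introduce a systematic bias. With stochastic rounding, the average of many rounded copies stays close to 0.3.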
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
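The relationship between input shape, kernel shape, zero padding, strides and output shape mentioned in the abstract above can be sketched along one spatial axis. The function names are hypothetical, and the transposed-convolution formula here assumes no extra output padding. The formulas are the standard ones: $o = \lfloor (i + 2p - k)/s \rfloor + 1$ for a convolution and $o = s(i - 1) + k - 2p$ for the matching transposed convolution.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution associated with a
    convolution of kernel k, stride s, padding p (no extra output padding,
    which is an assumption of this sketch)."""
    return s * (i - 1) + k - 2 * p

# Usage: a 3-wide kernel with stride 2 and padding 1 maps 5 -> 3,
# and the matching transposed convolution maps 3 -> 5.
down = conv_output_size(5, k=3, s=2, p=1)
up = transposed_conv_output_size(down, k=3, s=2, p=1)
```

Because of the floor, input sizes 5 and 6 both map to 3 with these settings, so the transposed convolution recovers only the smaller of the two unless additional output padding is specified.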
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
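The masking idea in the abstract above can be illustrated with a minimal sketch. It assumes one common construction that the abstract itself does not spell out: each hidden unit is assigned a random degree, and connections are kept only where they respect the input ordering. The names `made_masks` and `connectivity` are hypothetical helpers, and only a single hidden layer is shown.

```python
import random

def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer masked autoencoder (n_inputs >= 2)."""
    rng = random.Random(seed)
    # Assumption: each hidden unit k gets a degree m_k in [1, n_inputs - 1]
    # and may only look at inputs 1..m_k (inputs are numbered from 1).
    degrees = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Input-to-hidden mask: keep the weight from input d to unit k iff d <= m_k.
    mask_in = [[1 if d <= degrees[k] else 0 for d in range(1, n_inputs + 1)]
               for k in range(n_hidden)]
    # Hidden-to-output mask: keep the weight from unit k to output d iff m_k < d.
    mask_out = [[1 if degrees[k] < d else 0 for k in range(n_hidden)]
                for d in range(1, n_inputs + 1)]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """conn[d][j] > 0 iff output d can depend on input j through some hidden unit."""
    n_inputs, n_hidden = len(mask_out), len(mask_in)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(n_hidden))
             for j in range(n_inputs)]
            for d in range(n_inputs)]

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
conn = connectivity(mask_in, mask_out)
```

With these masks the connectivity matrix is strictly lower triangular, so output d is computed only from inputs that precede it in the ordering, which is what allows the outputs to be read as conditional probabilities.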
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
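As a rough illustration of the training criterion and the reported numbers in the abstract above, the sketch below computes a Kullback-Leibler divergence between a teacher and a student output distribution, and checks the relative WER improvement. The logits are made-up values, and `softmax` and `kl_divergence` are hypothetical helper names; the abstract does not say in which direction the divergence is taken, so KL(teacher || student) is an assumption here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same states."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical per-frame outputs: the teacher (RNN) gives a soft alignment,
# and the student (small DNN) is trained to match it.
teacher = softmax([2.0, 0.5, -1.0])
student = softmax([1.5, 0.7, -0.5])
loss = kl_divergence(teacher, student)

# Relative improvement implied by the WER figures quoted in the abstract.
baseline_wer, distilled_wer = 4.54, 3.93
relative_improvement = (baseline_wer - distilled_wer) / baseline_wer
```

The last line evaluates to roughly 0.134, which agrees with the "more than 13%" stated in the abstract.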
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
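To make the image encoding in the abstract above more concrete, here is a small sketch of the Gramian Angular Summation and Difference Fields. It assumes the usual definition, which the abstract does not state: the series is min-max rescaled (to [-1, 1] in this sketch), each value is mapped to an angle via arccos, and the fields are cos(phi_i + phi_j) and sin(phi_i - phi_j). The function names are hypothetical, and the Markov Transition Field is not covered.

```python
import math

def rescale(series):
    """Min-max rescale to [-1, 1] so that arccos is defined (needs max > min)."""
    lo, hi = min(series), max(series)
    return [(2 * x - hi - lo) / (hi - lo) for x in series]

def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D time series."""
    x = rescale(series)
    # Clamp before arccos to guard against floating-point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf

gasf, gadf = gramian_angular_fields([1.0, 3.0, 2.0, 5.0])
```

Under this definition the GASF is symmetric with diagonal entries 2x_i^2 - 1, so the rescaled series can be recovered from the diagonal when the rescaled values are non-negative, while the GADF is antisymmetric with a zero diagonal.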
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
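The factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ stated in the abstract above can be checked on a toy example. The distribution and the encoder `f` below are made-up values for illustration only; exact fractions are used so the identity holds without rounding.

```python
from fractions import Fraction

# Hypothetical distribution over four discrete values of X.
p_x = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}

def f(x):
    """Hypothetical deterministic discrete encoder: maps {0,1} -> 0 and {2,3} -> 1."""
    return x // 2

# P(H = h) is the total mass of all x that encode to h.
p_h = {}
for x, p in p_x.items():
    p_h[f(x)] = p_h.get(f(x), Fraction(0)) + p

def p_x_given_h(x, h):
    """Decoder P(X = x | H = h): puts no mass on any x with f(x) != h."""
    return p_x[x] / p_h[h] if f(x) == h else Fraction(0)

# Reassemble P(X = x) from the two factors.
reconstructed = {x: p_x_given_h(x, f(x)) * p_h[f(x)] for x in p_x}
```

Here `reconstructed` equals `p_x` exactly, and the decoder assigns zero probability to any value whose code differs from the conditioning code, which is the capacity condition the abstract requires.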
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
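The gradient reversal layer mentioned in the abstract above can be sketched in a framework-free way. This assumes the commonly described behaviour, which the abstract only names: identity in the forward pass, and multiplication of the incoming gradient by a negative constant in the backward pass. The class name and the scaling parameter `lam` are hypothetical.

```python
class GradientReversal:
    """Identity on the forward pass; flips and scales gradients on the backward pass."""

    def __init__(self, lam=1.0):
        # lam is an assumed trade-off weight between the task loss and the domain loss.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient from the domain classifier is negated, so the feature
        # extractor below is pushed to make domains harder to tell apart.
        return [-self.lam * g for g in grad_output]

layer = GradientReversal(lam=2.0)
out = layer.forward([1.0, -2.0])
grad_in = layer.backward([0.5, -1.0])
```

In a real deep learning package this would typically be written as a custom operation with a user-defined gradient; the plain class above only shows the two passes side by side.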
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).


================================================
FILE: 2017/data/heart.csv
================================================
sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
160,12,5.73,23.11,Present,49,25.3,97.2,52,1
144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1
134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1
132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0
142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0
114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1
114,0,3.83,19.4,Present,49,24.86,2.49,29,0
132,0,5.8,30.96,Present,69,30.11,0,53,1
206,6,2.95,32.27,Absent,72,26.81,56.06,60,1
134,14.1,4.44,22.39,Present,65,23.09,0,40,1
118,0,1.88,10.05,Absent,59,21.57,0,17,0
132,0,1.87,17.21,Absent,49,23.63,0.97,15,0
112,9.65,2.29,17.2,Present,54,23.53,0.68,53,0
117,1.53,2.44,28.95,Present,35,25.89,30.03,46,0
120,7.5,15.33,22,Absent,60,25.31,34.49,49,0
146,10.5,8.29,35.36,Present,78,32.73,13.89,53,1
158,2.6,7.46,34.07,Present,61,29.3,53.28,62,1
124,14,6.23,35.96,Present,45,30.09,0,59,1
106,1.61,1.74,12.32,Absent,74,20.92,13.37,20,1
132,7.9,2.85,26.5,Present,51,26.16,25.71,44,0
150,0.3,6.38,33.99,Present,62,24.64,0,50,0
138,0.6,3.81,28.66,Absent,54,28.7,1.46,58,0
142,18.2,4.34,24.38,Absent,61,26.19,0,50,0
124,4,12.42,31.29,Present,54,23.23,2.06,42,1
118,6,9.65,33.91,Absent,60,38.8,0,48,0
145,9.1,5.24,27.55,Absent,59,20.96,21.6,61,1
144,4.09,5.55,31.4,Present,60,29.43,5.55,56,0
146,0,6.62,25.69,Absent,60,28.07,8.23,63,1
136,2.52,3.95,25.63,Absent,51,21.86,0,45,1
158,1.02,6.33,23.88,Absent,66,22.13,24.99,46,1
122,6.6,5.58,35.95,Present,53,28.07,12.55,59,1
126,8.75,6.53,34.02,Absent,49,30.25,0,41,1
148,5.5,7.1,25.31,Absent,56,29.84,3.6,48,0
122,4.26,4.44,13.04,Absent,57,19.49,48.99,28,1
140,3.9,7.32,25.05,Absent,47,27.36,36.77,32,0
110,4.64,4.55,30.46,Absent,48,30.9,15.22,46,0
130,0,2.82,19.63,Present,70,24.86,0,29,0
136,11.2,5.81,31.85,Present,75,27.68,22.94,58,1
118,0.28,5.8,33.7,Present,60,30.98,0,41,1
144,0.04,3.38,23.61,Absent,30,23.75,4.66,30,0
120,0,1.07,16.02,Absent,47,22.15,0,15,0
130,2.61,2.72,22.99,Present,51,26.29,13.37,51,1
114,0,2.99,9.74,Absent,54,46.58,0,17,0
128,4.65,3.31,22.74,Absent,62,22.95,0.51,48,0
162,7.4,8.55,24.65,Present,64,25.71,5.86,58,1
116,1.91,7.56,26.45,Present,52,30.01,3.6,33,1
114,0,1.94,11.02,Absent,54,20.17,38.98,16,0
126,3.8,3.88,31.79,Absent,57,30.53,0,30,0
122,0,5.75,30.9,Present,46,29.01,4.11,42,0
134,2.5,3.66,30.9,Absent,52,27.19,23.66,49,0
152,0.9,9.12,30.23,Absent,56,28.64,0.37,42,1
134,8.08,1.55,17.5,Present,56,22.65,66.65,31,1
156,3,1.82,27.55,Absent,60,23.91,54,53,0
152,5.99,7.99,32.48,Absent,45,26.57,100.32,48,0
118,0,2.99,16.17,Absent,49,23.83,3.22,28,0
126,5.1,2.96,26.5,Absent,55,25.52,12.34,38,1
103,0.03,4.21,18.96,Absent,48,22.94,2.62,18,0
121,0.8,5.29,18.95,Present,47,22.51,0,61,0
142,0.28,1.8,21.03,Absent,57,23.65,2.93,33,0
138,1.15,5.09,27.87,Present,61,25.65,2.34,44,0
152,10.1,4.71,24.65,Present,65,26.21,24.53,57,0
140,0.45,4.3,24.33,Absent,41,27.23,10.08,38,0
130,0,1.82,10.45,Absent,57,22.07,2.06,17,0
136,7.36,2.19,28.11,Present,61,25,61.71,54,0
124,4.82,3.24,21.1,Present,48,28.49,8.42,30,0
112,0.41,1.88,10.29,Absent,39,22.08,20.98,27,0
118,4.46,7.27,29.13,Present,48,29.01,11.11,33,0
122,0,3.37,16.1,Absent,67,21.06,0,32,1
118,0,3.67,12.13,Absent,51,19.15,0.6,15,0
130,1.72,2.66,10.38,Absent,68,17.81,11.1,26,0
130,5.6,3.37,24.8,Absent,58,25.76,43.2,36,0
126,0.09,5.03,13.27,Present,50,17.75,4.63,20,0
128,0.4,6.17,26.35,Absent,64,27.86,11.11,34,0
136,0,4.12,17.42,Absent,52,21.66,12.86,40,0
134,0,5.9,30.84,Absent,49,29.16,0,55,0
140,0.6,5.56,33.39,Present,58,27.19,0,55,1
168,4.5,6.68,28.47,Absent,43,24.25,24.38,56,1
108,0.4,5.91,22.92,Present,57,25.72,72,39,0
114,3,7.04,22.64,Present,55,22.59,0,45,1
140,8.14,4.93,42.49,Absent,53,45.72,6.43,53,1
148,4.8,6.09,36.55,Present,63,25.44,0.88,55,1
148,12.2,3.79,34.15,Absent,57,26.38,14.4,57,1
128,0,2.43,13.15,Present,63,20.75,0,17,0
130,0.56,3.3,30.86,Absent,49,27.52,33.33,45,0
126,10.5,4.49,17.33,Absent,67,19.37,0,49,1
140,0,5.08,27.33,Present,41,27.83,1.25,38,0
126,0.9,5.64,17.78,Present,55,21.94,0,41,0
122,0.72,4.04,32.38,Absent,34,28.34,0,55,0
116,1.03,2.83,10.85,Absent,45,21.59,1.75,21,0
120,3.7,4.02,39.66,Absent,61,30.57,0,64,1
143,0.46,2.4,22.87,Absent,62,29.17,15.43,29,0
118,4,3.95,18.96,Absent,54,25.15,8.33,49,1
194,1.7,6.32,33.67,Absent,47,30.16,0.19,56,0
134,3,4.37,23.07,Absent,56,20.54,9.65,62,0
138,2.16,4.9,24.83,Present,39,26.06,28.29,29,0
136,0,5,27.58,Present,49,27.59,1.47,39,0
122,3.2,11.32,35.36,Present,55,27.07,0,51,1
164,12,3.91,19.59,Absent,51,23.44,19.75,39,0
136,8,7.85,23.81,Present,51,22.69,2.78,50,0
166,0.07,4.03,29.29,Absent,53,28.37,0,27,0
118,0,4.34,30.12,Present,52,32.18,3.91,46,0
128,0.42,4.6,26.68,Absent,41,30.97,10.33,31,0
118,1.5,5.38,25.84,Absent,64,28.63,3.89,29,0
158,3.6,2.97,30.11,Absent,63,26.64,108,64,0
108,1.5,4.33,24.99,Absent,66,22.29,21.6,61,1
170,7.6,5.5,37.83,Present,42,37.41,6.17,54,1
118,1,5.76,22.1,Absent,62,23.48,7.71,42,0
124,0,3.04,17.33,Absent,49,22.04,0,18,0
114,0,8.01,21.64,Absent,66,25.51,2.49,16,0
168,9,8.53,24.48,Present,69,26.18,4.63,54,1
134,2,3.66,14.69,Absent,52,21.03,2.06,37,0
174,0,8.46,35.1,Present,35,25.27,0,61,1
116,31.2,3.17,14.99,Absent,47,19.4,49.06,59,1
128,0,10.58,31.81,Present,46,28.41,14.66,48,0
140,4.5,4.59,18.01,Absent,63,21.91,22.09,32,1
154,0.7,5.91,25,Absent,13,20.6,0,42,0
150,3.5,6.99,25.39,Present,50,23.35,23.48,61,1
130,0,3.92,25.55,Absent,68,28.02,0.68,27,0
128,2,6.13,21.31,Absent,66,22.86,11.83,60,0
120,1.4,6.25,20.47,Absent,60,25.85,8.51,28,0
120,0,5.01,26.13,Absent,64,26.21,12.24,33,0
138,4.5,2.85,30.11,Absent,55,24.78,24.89,56,1
153,7.8,3.96,25.73,Absent,54,25.91,27.03,45,0
123,8.6,11.17,35.28,Present,70,33.14,0,59,1
148,4.04,3.99,20.69,Absent,60,27.78,1.75,28,0
136,3.96,2.76,30.28,Present,50,34.42,18.51,38,0
134,8.8,7.41,26.84,Absent,35,29.44,29.52,60,1
152,12.18,4.04,37.83,Present,63,34.57,4.17,64,0
158,13.5,5.04,30.79,Absent,54,24.79,21.5,62,0
132,2,3.08,35.39,Absent,45,31.44,79.82,58,1
134,1.5,3.73,21.53,Absent,41,24.7,11.11,30,1
142,7.44,5.52,33.97,Absent,47,29.29,24.27,54,0
134,6,3.3,28.45,Absent,65,26.09,58.11,40,0
122,4.18,9.05,29.27,Present,44,24.05,19.34,52,1
116,2.7,3.69,13.52,Absent,55,21.13,18.51,32,0
128,0.5,3.7,12.81,Present,66,21.25,22.73,28,0
120,0,3.68,12.24,Absent,51,20.52,0.51,20,0
124,0,3.95,36.35,Present,59,32.83,9.59,54,0
160,14,5.9,37.12,Absent,58,33.87,3.52,54,1
130,2.78,4.89,9.39,Present,63,19.3,17.47,25,1
128,2.8,5.53,14.29,Absent,64,24.97,0.51,38,0
130,4.5,5.86,37.43,Absent,61,31.21,32.3,58,0
109,1.2,6.14,29.26,Absent,47,24.72,10.46,40,0
144,0,3.84,18.72,Absent,56,22.1,4.8,40,0
118,1.05,3.16,12.98,Present,46,22.09,16.35,31,0
136,3.46,6.38,32.25,Present,43,28.73,3.13,43,1
136,1.5,6.06,26.54,Absent,54,29.38,14.5,33,1
124,15.5,5.05,24.06,Absent,46,23.22,0,61,1
148,6,6.49,26.47,Absent,48,24.7,0,55,0
128,6.6,3.58,20.71,Absent,55,24.15,0,52,0
122,0.28,4.19,19.97,Absent,61,25.63,0,24,0
108,0,2.74,11.17,Absent,53,22.61,0.95,20,0
124,3.04,4.8,19.52,Present,60,21.78,147.19,41,1
138,8.8,3.12,22.41,Present,63,23.33,120.03,55,1
127,0,2.81,15.7,Absent,42,22.03,1.03,17,0
174,9.45,5.13,35.54,Absent,55,30.71,59.79,53,0
122,0,3.05,23.51,Absent,46,25.81,0,38,0
144,6.75,5.45,29.81,Absent,53,25.62,26.23,43,1
126,1.8,6.22,19.71,Absent,65,24.81,0.69,31,0
208,27.4,3.12,26.63,Absent,66,27.45,33.07,62,1
138,0,2.68,17.04,Absent,42,22.16,0,16,0
148,0,3.84,17.26,Absent,70,20,0,21,0
122,0,3.08,16.3,Absent,43,22.13,0,16,0
132,7,3.2,23.26,Absent,77,23.64,23.14,49,0
110,12.16,4.99,28.56,Absent,44,27.14,21.6,55,1
160,1.52,8.12,29.3,Present,54,25.87,12.86,43,1
126,0.54,4.39,21.13,Present,45,25.99,0,25,0
162,5.3,7.95,33.58,Present,58,36.06,8.23,48,0
194,2.55,6.89,33.88,Present,69,29.33,0,41,0
118,0.75,2.58,20.25,Absent,59,24.46,0,32,0
124,0,4.79,34.71,Absent,49,26.09,9.26,47,0
160,0,2.42,34.46,Absent,48,29.83,1.03,61,0
128,0,2.51,29.35,Present,53,22.05,1.37,62,0
122,4,5.24,27.89,Present,45,26.52,0,61,1
132,2,2.7,21.57,Present,50,27.95,9.26,37,0
120,0,2.42,16.66,Absent,46,20.16,0,17,0
128,0.04,8.22,28.17,Absent,65,26.24,11.73,24,0
108,15,4.91,34.65,Absent,41,27.96,14.4,56,0
166,0,4.31,34.27,Absent,45,30.14,13.27,56,0
152,0,6.06,41.05,Present,51,40.34,0,51,0
170,4.2,4.67,35.45,Present,50,27.14,7.92,60,1
156,4,2.05,19.48,Present,50,21.48,27.77,39,1
116,8,6.73,28.81,Present,41,26.74,40.94,48,1
122,4.4,3.18,11.59,Present,59,21.94,0,33,1
150,20,6.4,35.04,Absent,53,28.88,8.33,63,0
129,2.15,5.17,27.57,Absent,52,25.42,2.06,39,0
134,4.8,6.58,29.89,Present,55,24.73,23.66,63,0
126,0,5.98,29.06,Present,56,25.39,11.52,64,1
142,0,3.72,25.68,Absent,48,24.37,5.25,40,1
128,0.7,4.9,37.42,Present,72,35.94,3.09,49,1
102,0.4,3.41,17.22,Present,56,23.59,2.06,39,1
130,0,4.89,25.98,Absent,72,30.42,14.71,23,0
138,0.05,2.79,10.35,Absent,46,21.62,0,18,0
138,0,1.96,11.82,Present,54,22.01,8.13,21,0
128,0,3.09,20.57,Absent,54,25.63,0.51,17,0
162,2.92,3.63,31.33,Absent,62,31.59,18.51,42,0
160,3,9.19,26.47,Present,39,28.25,14.4,54,1
148,0,4.66,24.39,Absent,50,25.26,4.03,27,0
124,0.16,2.44,16.67,Absent,65,24.58,74.91,23,0
136,3.15,4.37,20.22,Present,59,25.12,47.16,31,1
134,2.75,5.51,26.17,Absent,57,29.87,8.33,33,0
128,0.73,3.97,23.52,Absent,54,23.81,19.2,64,0
122,3.2,3.59,22.49,Present,45,24.96,36.17,58,0
152,3,4.64,31.29,Absent,41,29.34,4.53,40,0
162,0,5.09,24.6,Present,64,26.71,3.81,18,0
124,4,6.65,30.84,Present,54,28.4,33.51,60,0
136,5.8,5.9,27.55,Absent,65,25.71,14.4,59,0
136,8.8,4.26,32.03,Present,52,31.44,34.35,60,0
134,0.05,8.03,27.95,Absent,48,26.88,0,60,0
122,1,5.88,34.81,Present,69,31.27,15.94,40,1
116,3,3.05,30.31,Absent,41,23.63,0.86,44,0
132,0,0.98,21.39,Absent,62,26.75,0,53,0
134,0,2.4,21.11,Absent,57,22.45,1.37,18,0
160,7.77,8.07,34.8,Absent,64,31.15,0,62,1
180,0.52,4.23,16.38,Absent,55,22.56,14.77,45,1
124,0.81,6.16,11.61,Absent,35,21.47,10.49,26,0
114,0,4.97,9.69,Absent,26,22.6,0,25,0
208,7.4,7.41,32.03,Absent,50,27.62,7.85,57,0
138,0,3.14,12,Absent,54,20.28,0,16,0
164,0.5,6.95,39.64,Present,47,41.76,3.81,46,1
144,2.4,8.13,35.61,Absent,46,27.38,13.37,60,0
136,7.5,7.39,28.04,Present,50,25.01,0,45,1
132,7.28,3.52,12.33,Absent,60,19.48,2.06,56,0
143,5.04,4.86,23.59,Absent,58,24.69,18.72,42,0
112,4.46,7.18,26.25,Present,69,27.29,0,32,1
134,10,3.79,34.72,Absent,42,28.33,28.8,52,1
138,2,5.11,31.4,Present,49,27.25,2.06,64,1
188,0,5.47,32.44,Present,71,28.99,7.41,50,1
110,2.35,3.36,26.72,Present,54,26.08,109.8,58,1
136,13.2,7.18,35.95,Absent,48,29.19,0,62,0
130,1.75,5.46,34.34,Absent,53,29.42,0,58,1
122,0,3.76,24.59,Absent,56,24.36,0,30,0
138,0,3.24,27.68,Absent,60,25.7,88.66,29,0
130,18,4.13,27.43,Absent,54,27.44,0,51,1
126,5.5,3.78,34.15,Absent,55,28.85,3.18,61,0
176,5.76,4.89,26.1,Present,46,27.3,19.44,57,0
122,0,5.49,19.56,Absent,57,23.12,14.02,27,0
124,0,3.23,9.64,Absent,59,22.7,0,16,0
140,5.2,3.58,29.26,Absent,70,27.29,20.17,45,1
128,6,4.37,22.98,Present,50,26.01,0,47,0
190,4.18,5.05,24.83,Absent,45,26.09,82.85,41,0
144,0.76,10.53,35.66,Absent,63,34.35,0,55,1
126,4.6,7.4,31.99,Present,57,28.67,0.37,60,1
128,0,2.63,23.88,Absent,45,21.59,6.54,57,0
136,0.4,3.91,21.1,Present,63,22.3,0,56,1
158,4,4.18,28.61,Present,42,25.11,0,60,0
160,0.6,6.94,30.53,Absent,36,25.68,1.42,64,0
124,6,5.21,33.02,Present,64,29.37,7.61,58,1
158,6.17,8.12,30.75,Absent,46,27.84,92.62,48,0
128,0,6.34,11.87,Absent,57,23.14,0,17,0
166,3,3.82,26.75,Absent,45,20.86,0,63,1
146,7.5,7.21,25.93,Present,55,22.51,0.51,42,0
161,9,4.65,15.16,Present,58,23.76,43.2,46,0
164,13.02,6.26,29.38,Present,47,22.75,37.03,54,1
146,5.08,7.03,27.41,Present,63,36.46,24.48,37,1
142,4.48,3.57,19.75,Present,51,23.54,3.29,49,0
138,12,5.13,28.34,Absent,59,24.49,32.81,58,1
154,1.8,7.13,34.04,Present,52,35.51,39.36,44,0
118,0,2.39,12.13,Absent,49,18.46,0.26,17,1
124,0.61,2.69,17.15,Present,61,22.76,11.55,20,0
124,1.04,2.84,16.42,Present,46,20.17,0,61,0
136,5,4.19,23.99,Present,68,27.8,25.86,35,0
132,9.9,4.63,27.86,Present,46,23.39,0.51,52,1
118,0.12,1.96,20.31,Absent,37,20.01,2.42,18,0
118,0.12,4.16,9.37,Absent,57,19.61,0,17,0
134,12,4.96,29.79,Absent,53,24.86,8.23,57,0
114,0.1,3.95,15.89,Present,57,20.31,17.14,16,0
136,6.8,7.84,30.74,Present,58,26.2,23.66,45,1
130,0,4.16,39.43,Present,46,30.01,0,55,1
136,2.2,4.16,38.02,Absent,65,37.24,4.11,41,1
136,1.36,3.16,14.97,Present,56,24.98,7.3,24,0
154,4.2,5.59,25.02,Absent,58,25.02,1.54,43,0
108,0.8,2.47,17.53,Absent,47,22.18,0,55,1
136,8.8,4.69,36.07,Present,38,26.56,2.78,63,1
174,2.02,6.57,31.9,Present,50,28.75,11.83,64,1
124,4.25,8.22,30.77,Absent,56,25.8,0,43,0
114,0,2.63,9.69,Absent,45,17.89,0,16,0
118,0.12,3.26,12.26,Absent,55,22.65,0,16,0
106,1.08,4.37,26.08,Absent,67,24.07,17.74,28,1
146,3.6,3.51,22.67,Absent,51,22.29,43.71,42,0
206,0,4.17,33.23,Absent,69,27.36,6.17,50,1
134,3,3.17,17.91,Absent,35,26.37,15.12,27,0
148,15,4.98,36.94,Present,72,31.83,66.27,41,1
126,0.21,3.95,15.11,Absent,61,22.17,2.42,17,0
134,0,3.69,13.92,Absent,43,27.66,0,19,0
134,0.02,2.8,18.84,Absent,45,24.82,0,17,0
123,0.05,4.61,13.69,Absent,51,23.23,2.78,16,0
112,0.6,5.28,25.71,Absent,55,27.02,27.77,38,1
112,0,1.71,15.96,Absent,42,22.03,3.5,16,0
101,0.48,7.26,13,Absent,50,19.82,5.19,16,0
150,0.18,4.14,14.4,Absent,53,23.43,7.71,44,0
170,2.6,7.22,28.69,Present,71,27.87,37.65,56,1
134,0,5.63,29.12,Absent,68,32.33,2.02,34,0
142,0,4.19,18.04,Absent,56,23.65,20.78,42,1
132,0.1,3.28,10.73,Absent,73,20.42,0,17,0
136,0,2.28,18.14,Absent,55,22.59,0,17,0
132,12,4.51,21.93,Absent,61,26.07,64.8,46,1
166,4.1,4,34.3,Present,32,29.51,8.23,53,0
138,0,3.96,24.7,Present,53,23.8,0,45,0
138,2.27,6.41,29.07,Absent,58,30.22,2.93,32,1
170,0,3.12,37.15,Absent,47,35.42,0,53,0
128,0,8.41,28.82,Present,60,26.86,0,59,1
136,1.2,2.78,7.12,Absent,52,22.51,3.41,27,0
128,0,3.22,26.55,Present,39,26.59,16.71,49,0
150,14.4,5.04,26.52,Present,60,28.84,0,45,0
132,8.4,3.57,13.68,Absent,42,18.75,15.43,59,1
142,2.4,2.55,23.89,Absent,54,26.09,59.14,37,0
130,0.05,2.44,28.25,Present,67,30.86,40.32,34,0
174,3.5,5.26,21.97,Present,36,22.04,8.33,59,1
114,9.6,2.51,29.18,Absent,49,25.67,40.63,46,0
162,1.5,2.46,19.39,Present,49,24.32,0,59,1
174,0,3.27,35.4,Absent,58,37.71,24.95,44,0
190,5.15,6.03,36.59,Absent,42,30.31,72,50,0
154,1.4,1.72,18.86,Absent,58,22.67,43.2,59,0
124,0,2.28,24.86,Present,50,22.24,8.26,38,0
114,1.2,3.98,14.9,Absent,49,23.79,25.82,26,0
168,11.4,5.08,26.66,Present,56,27.04,2.61,59,1
142,3.72,4.24,32.57,Absent,52,24.98,7.61,51,0
154,0,4.81,28.11,Present,56,25.67,75.77,59,0
146,4.36,4.31,18.44,Present,47,24.72,10.8,38,0
166,6,3.02,29.3,Absent,35,24.38,38.06,61,0
140,8.6,3.9,32.16,Present,52,28.51,11.11,64,1
136,1.7,3.53,20.13,Absent,56,19.44,14.4,55,0
156,0,3.47,21.1,Absent,73,28.4,0,36,1
132,0,6.63,29.58,Present,37,29.41,2.57,62,0
128,0,2.98,12.59,Absent,65,20.74,2.06,19,0
106,5.6,3.2,12.3,Absent,49,20.29,0,39,0
144,0.4,4.64,30.09,Absent,30,27.39,0.74,55,0
154,0.31,2.33,16.48,Absent,33,24,11.83,17,0
126,3.1,2.01,32.97,Present,56,28.63,26.74,45,0
134,6.4,8.49,37.25,Present,56,28.94,10.49,51,1
152,19.45,4.22,29.81,Absent,28,23.95,0,59,1
146,1.35,6.39,34.21,Absent,51,26.43,0,59,1
162,6.94,4.55,33.36,Present,52,27.09,32.06,43,0
130,7.28,3.56,23.29,Present,20,26.8,51.87,58,1
138,6,7.24,37.05,Absent,38,28.69,0,59,0
148,0,5.32,26.71,Present,52,32.21,32.78,27,0
124,4.2,2.94,27.59,Absent,50,30.31,85.06,30,0
118,1.62,9.01,21.7,Absent,59,25.89,21.19,40,0
116,4.28,7.02,19.99,Present,68,23.31,0,52,1
162,6.3,5.73,22.61,Present,46,20.43,62.54,53,1
138,0.87,1.87,15.89,Absent,44,26.76,42.99,31,0
137,1.2,3.14,23.87,Absent,66,24.13,45,37,0
198,0.52,11.89,27.68,Present,48,28.4,78.99,26,1
154,4.5,4.75,23.52,Present,43,25.76,0,53,1
128,5.4,2.36,12.98,Absent,51,18.36,6.69,61,0
130,0.08,5.59,25.42,Present,50,24.98,6.27,43,1
162,5.6,4.24,22.53,Absent,29,22.91,5.66,60,0
120,10.5,2.7,29.87,Present,54,24.5,16.46,49,0
136,3.99,2.58,16.38,Present,53,22.41,27.67,36,0
176,1.2,8.28,36.16,Present,42,27.81,11.6,58,1
134,11.79,4.01,26.57,Present,38,21.79,38.88,61,1
122,1.7,5.28,32.23,Present,51,24.08,0,54,0
134,0.9,3.18,23.66,Present,52,23.26,27.36,58,1
134,0,2.43,22.24,Absent,52,26.49,41.66,24,0
136,6.6,6.08,32.74,Absent,64,33.28,2.72,49,0
132,4.05,5.15,26.51,Present,31,26.67,16.3,50,0
152,1.68,3.58,25.43,Absent,50,27.03,0,32,0
132,12.3,5.96,32.79,Present,57,30.12,21.5,62,1
124,0.4,3.67,25.76,Absent,43,28.08,20.57,34,0
140,4.2,2.91,28.83,Present,43,24.7,47.52,48,0
166,0.6,2.42,34.03,Present,53,26.96,54,60,0
156,3.02,5.35,25.72,Present,53,25.22,28.11,52,1
132,0.72,4.37,19.54,Absent,48,26.11,49.37,28,0
150,0,4.99,27.73,Absent,57,30.92,8.33,24,0
134,0.12,3.4,21.18,Present,33,26.27,14.21,30,0
126,3.4,4.87,15.16,Present,65,22.01,11.11,38,0
148,0.5,5.97,32.88,Absent,54,29.27,6.43,42,0
148,8.2,7.75,34.46,Present,46,26.53,6.04,64,1
132,6,5.97,25.73,Present,66,24.18,145.29,41,0
128,1.6,5.41,29.3,Absent,68,29.38,23.97,32,0
128,5.16,4.9,31.35,Present,57,26.42,0,64,0
140,0,2.4,27.89,Present,70,30.74,144,29,0
126,0,5.29,27.64,Absent,25,27.62,2.06,45,0
114,3.6,4.16,22.58,Absent,60,24.49,65.31,31,0
118,1.25,4.69,31.58,Present,52,27.16,4.11,53,0
126,0.96,4.99,29.74,Absent,66,33.35,58.32,38,0
154,4.5,4.68,39.97,Absent,61,33.17,1.54,64,1
112,1.44,2.71,22.92,Absent,59,24.81,0,52,0
140,8,4.42,33.15,Present,47,32.77,66.86,44,0
140,1.68,11.41,29.54,Present,74,30.75,2.06,38,1
128,2.6,4.94,21.36,Absent,61,21.3,0,31,0
126,19.6,6.03,34.99,Absent,49,26.99,55.89,44,0
160,4.2,6.76,37.99,Present,61,32.91,3.09,54,1
144,0,4.17,29.63,Present,52,21.83,0,59,0
148,4.5,10.49,33.27,Absent,50,25.92,2.06,53,1
146,0,4.92,18.53,Absent,57,24.2,34.97,26,0
164,5.6,3.17,30.98,Present,44,25.99,43.2,53,1
130,0.54,3.63,22.03,Present,69,24.34,12.86,39,1
154,2.4,5.63,42.17,Present,59,35.07,12.86,50,1
178,0.95,4.75,21.06,Absent,49,23.74,24.69,61,0
180,3.57,3.57,36.1,Absent,36,26.7,19.95,64,0
134,12.5,2.73,39.35,Absent,48,35.58,0,48,0
142,0,3.54,16.64,Absent,58,25.97,8.36,27,0
162,7,7.67,34.34,Present,33,30.77,0,62,0
218,11.2,2.77,30.79,Absent,38,24.86,90.93,48,1
126,8.75,6.06,32.72,Present,33,27,62.43,55,1
126,0,3.57,26.01,Absent,61,26.3,7.97,47,0
134,6.1,4.77,26.08,Absent,47,23.82,1.03,49,0
132,0,4.17,36.57,Absent,57,30.61,18,49,0
178,5.5,3.79,23.92,Present,45,21.26,6.17,62,1
208,5.04,5.19,20.71,Present,52,25.12,24.27,58,1
160,1.15,10.19,39.71,Absent,31,31.65,20.52,57,0
116,2.38,5.67,29.01,Present,54,27.26,15.77,51,0
180,25.01,3.7,38.11,Present,57,30.54,0,61,1
200,19.2,4.43,40.6,Present,55,32.04,36,60,1
112,4.2,3.58,27.14,Absent,52,26.83,2.06,40,0
120,0,3.1,26.97,Absent,41,24.8,0,16,0
178,20,9.78,33.55,Absent,37,27.29,2.88,62,1
166,0.8,5.63,36.21,Absent,50,34.72,28.8,60,0
164,8.2,14.16,36.85,Absent,52,28.5,17.02,55,1
216,0.92,2.66,19.85,Present,49,20.58,0.51,63,1
146,6.4,5.62,33.05,Present,57,31.03,0.74,46,0
134,1.1,3.54,20.41,Present,58,24.54,39.91,39,1
158,16,5.56,29.35,Absent,36,25.92,58.32,60,0
176,0,3.14,31.04,Present,45,30.18,4.63,45,0
132,2.8,4.79,20.47,Present,50,22.15,11.73,48,0
126,0,4.55,29.18,Absent,48,24.94,36,41,0
120,5.5,3.51,23.23,Absent,46,22.4,90.31,43,0
174,0,3.86,21.73,Absent,42,23.37,0,63,0
150,13.8,5.1,29.45,Present,52,27.92,77.76,55,1
176,6,3.98,17.2,Present,52,21.07,4.11,61,1
142,2.2,3.29,22.7,Absent,44,23.66,5.66,42,1
132,0,3.3,21.61,Absent,42,24.92,32.61,33,0
142,1.32,7.63,29.98,Present,57,31.16,72.93,33,0
146,1.16,2.28,34.53,Absent,50,28.71,45,49,0
132,7.2,3.65,17.16,Present,56,23.25,0,34,0
120,0,3.57,23.22,Absent,58,27.2,0,32,0
118,0,3.89,15.96,Absent,65,20.18,0,16,0
108,0,1.43,26.26,Absent,42,19.38,0,16,0
136,0,4,19.06,Absent,40,21.94,2.06,16,0
120,0,2.46,13.39,Absent,47,22.01,0.51,18,0
132,0,3.55,8.66,Present,61,18.5,3.87,16,0
136,0,1.77,20.37,Absent,45,21.51,2.06,16,0
138,0,1.86,18.35,Present,59,25.38,6.51,17,0
138,0.06,4.15,20.66,Absent,49,22.59,2.49,16,0
130,1.22,3.3,13.65,Absent,50,21.4,3.81,31,0
130,4,2.4,17.42,Absent,60,22.05,0,40,0
110,0,7.14,28.28,Absent,57,29,0,32,0
120,0,3.98,13.19,Present,47,21.89,0,16,0
166,6,8.8,37.89,Absent,39,28.7,43.2,52,0
134,0.57,4.75,23.07,Absent,67,26.33,0,37,0
142,3,3.69,25.1,Absent,60,30.08,38.88,27,0
136,2.8,2.53,9.28,Present,61,20.7,4.55,25,0
142,0,4.32,25.22,Absent,47,28.92,6.53,34,1
130,0,1.88,12.51,Present,52,20.28,0,17,0
124,1.8,3.74,16.64,Present,42,22.26,10.49,20,0
144,4,5.03,25.78,Present,57,27.55,90,48,1
136,1.81,3.31,6.74,Absent,63,19.57,24.94,24,0
120,0,2.77,13.35,Absent,67,23.37,1.03,18,0
154,5.53,3.2,28.81,Present,61,26.15,42.79,42,0
124,1.6,7.22,39.68,Present,36,31.5,0,51,1
146,0.64,4.82,28.02,Absent,60,28.11,8.23,39,1
128,2.24,2.83,26.48,Absent,48,23.96,47.42,27,1
170,0.4,4.11,42.06,Present,56,33.1,2.06,57,0
214,0.4,5.98,31.72,Absent,64,28.45,0,58,0
182,4.2,4.41,32.1,Absent,52,28.61,18.72,52,1
108,3,1.59,15.23,Absent,40,20.09,26.64,55,0
118,5.4,11.61,30.79,Absent,64,27.35,23.97,40,0
132,0,4.82,33.41,Present,62,14.7,0,46,1

================================================
FILE: 2017/data/heart.txt
================================================
"sbp"	"tobacco"	"ldl"	"adiposity"	"famhist"	"typea"	"obesity"	"alcohol"	"age"	"chd"
160	12	5.73	23.11	"Present"	49	25.3	97.2	52	1
144	0.01	4.41	28.61	"Absent"	55	28.87	2.06	63	1
118	0.08	3.48	32.28	"Present"	52	29.14	3.81	46	0
170	7.5	6.41	38.03	"Present"	51	31.99	24.26	58	1
134	13.6	3.5	27.78	"Present"	60	25.99	57.34	49	1
132	6.2	6.47	36.21	"Present"	62	30.77	14.14	45	0
142	4.05	3.38	16.2	"Absent"	59	20.81	2.62	38	0
114	4.08	4.59	14.6	"Present"	62	23.11	6.72	58	1
114	0	3.83	19.4	"Present"	49	24.86	2.49	29	0
132	0	5.8	30.96	"Present"	69	30.11	0	53	1
206	6	2.95	32.27	"Absent"	72	26.81	56.06	60	1
134	14.1	4.44	22.39	"Present"	65	23.09	0	40	1
118	0	1.88	10.05	"Absent"	59	21.57	0	17	0
132	0	1.87	17.21	"Absent"	49	23.63	0.97	15	0
112	9.65	2.29	17.2	"Present"	54	23.53	0.68	53	0
117	1.53	2.44	28.95	"Present"	35	25.89	30.03	46	0
120	7.5	15.33	22	"Absent"	60	25.31	34.49	49	0
146	10.5	8.29	35.36	"Present"	78	32.73	13.89	53	1
158	2.6	7.46	34.07	"Present"	61	29.3	53.28	62	1
124	14	6.23	35.96	"Present"	45	30.09	0	59	1
106	1.61	1.74	12.32	"Absent"	74	20.92	13.37	20	1
132	7.9	2.85	26.5	"Present"	51	26.16	25.71	44	0
150	0.3	6.38	33.99	"Present"	62	24.64	0	50	0
138	0.6	3.81	28.66	"Absent"	54	28.7	1.46	58	0
142	18.2	4.34	24.38	"Absent"	61	26.19	0	50	0
124	4	12.42	31.29	"Present"	54	23.23	2.06	42	1
118	6	9.65	33.91	"Absent"	60	38.8	0	48	0
145	9.1	5.24	27.55	"Absent"	59	20.96	21.6	61	1
144	4.09	5.55	31.4	"Present"	60	29.43	5.55	56	0
146	0	6.62	25.69	"Absent"	60	28.07	8.23	63	1
136	2.52	3.95	25.63	"Absent"	51	21.86	0	45	1
158	1.02	6.33	23.88	"Absent"	66	22.13	24.99	46	1
122	6.6	5.58	35.95	"Present"	53	28.07	12.55	59	1
126	8.75	6.53	34.02	"Absent"	49	30.25	0	41	1
148	5.5	7.1	25.31	"Absent"	56	29.84	3.6	48	0
122	4.26	4.44	13.04	"Absent"	57	19.49	48.99	28	1
140	3.9	7.32	25.05	"Absent"	47	27.36	36.77	32	0
110	4.64	4.55	30.46	"Absent"	48	30.9	15.22	46	0
130	0	2.82	19.63	"Present"	70	24.86	0	29	0
136	11.2	5.81	31.85	"Present"	75	27.68	22.94	58	1
118	0.28	5.8	33.7	"Present"	60	30.98	0	41	1
144	0.04	3.38	23.61	"Absent"	30	23.75	4.66	30	0
120	0	1.07	16.02	"Absent"	47	22.15	0	15	0
130	2.61	2.72	22.99	"Present"	51	26.29	13.37	51	1
114	0	2.99	9.74	"Absent"	54	46.58	0	17	0
128	4.65	3.31	22.74	"Absent"	62	22.95	0.51	48	0
162	7.4	8.55	24.65	"Present"	64	25.71	5.86	58	1
116	1.91	7.56	26.45	"Present"	52	30.01	3.6	33	1
114	0	1.94	11.02	"Absent"	54	20.17	38.98	16	0
126	3.8	3.88	31.79	"Absent"	57	30.53	0	30	0
122	0	5.75	30.9	"Present"	46	29.01	4.11	42	0
134	2.5	3.66	30.9	"Absent"	52	27.19	23.66	49	0
152	0.9	9.12	30.23	"Absent"	56	28.64	0.37	42	1
134	8.08	1.55	17.5	"Present"	56	22.65	66.65	31	1
156	3	1.82	27.55	"Absent"	60	23.91	54	53	0
152	5.99	7.99	32.48	"Absent"	45	26.57	100.32	48	0
118	0	2.99	16.17	"Absent"	49	23.83	3.22	28	0
126	5.1	2.96	26.5	"Absent"	55	25.52	12.34	38	1
103	0.03	4.21	18.96	"Absent"	48	22.94	2.62	18	0
121	0.8	5.29	18.95	"Present"	47	22.51	0	61	0
142	0.28	1.8	21.03	"Absent"	57	23.65	2.93	33	0
138	1.15	5.09	27.87	"Present"	61	25.65	2.34	44	0
152	10.1	4.71	24.65	"Present"	65	26.21	24.53	57	0
140	0.45	4.3	24.33	"Absent"	41	27.23	10.08	38	0
130	0	1.82	10.45	"Absent"	57	22.07	2.06	17	0
136	7.36	2.19	28.11	"Present"	61	25	61.71	54	0
124	4.82	3.24	21.1	"Present"	48	28.49	8.42	30	0
112	0.41	1.88	10.29	"Absent"	39	22.08	20.98	27	0
118	4.46	7.27	29.13	"Present"	48	29.01	11.11	33	0
122	0	3.37	16.1	"Absent"	67	21.06	0	32	1
118	0	3.67	12.13	"Absent"	51	19.15	0.6	15	0
130	1.72	2.66	10.38	"Absent"	68	17.81	11.1	26	0
130	5.6	3.37	24.8	"Absent"	58	25.76	43.2	36	0
126	0.09	5.03	13.27	"Present"	50	17.75	4.63	20	0
128	0.4	6.17	26.35	"Absent"	64	27.86	11.11	34	0
136	0	4.12	17.42	"Absent"	52	21.66	12.86	40	0
134	0	5.9	30.84	"Absent"	49	29.16	0	55	0
140	0.6	5.56	33.39	"Present"	58	27.19	0	55	1
168	4.5	6.68	28.47	"Absent"	43	24.25	24.38	56	1
108	0.4	5.91	22.92	"Present"	57	25.72	72	39	0
114	3	7.04	22.64	"Present"	55	22.59	0	45	1
140	8.14	4.93	42.49	"Absent"	53	45.72	6.43	53	1
148	4.8	6.09	36.55	"Present"	63	25.44	0.88	55	1
148	12.2	3.79	34.15	"Absent"	57	26.38	14.4	57	1
128	0	2.43	13.15	"Present"	63	20.75	0	17	0
130	0.56	3.3	30.86	"Absent"	49	27.52	33.33	45	0
126	10.5	4.49	17.33	"Absent"	67	19.37	0	49	1
140	0	5.08	27.33	"Present"	41	27.83	1.25	38	0
126	0.9	5.64	17.78	"Present"	55	21.94	0	41	0
122	0.72	4.04	32.38	"Absent"	34	28.34	0	55	0
116	1.03	2.83	10.85	"Absent"	45	21.59	1.75	21	0
120	3.7	4.02	39.66	"Absent"	61	30.57	0	64	1
143	0.46	2.4	22.87	"Absent"	62	29.17	15.43	29	0
118	4	3.95	18.96	"Absent"	54	25.15	8.33	49	1
194	1.7	6.32	33.67	"Absent"	47	30.16	0.19	56	0
134	3	4.37	23.07	"Absent"	56	20.54	9.65	62	0
138	2.16	4.9	24.83	"Present"	39	26.06	28.29	29	0
136	0	5	27.58	"Present"	49	27.59	1.47	39	0
122	3.2	11.32	35.36	"Present"	55	27.07	0	51	1
164	12	3.91	19.59	"Absent"	51	23.44	19.75	39	0
136	8	7.85	23.81	"Present"	51	22.69	2.78	50	0
166	0.07	4.03	29.29	"Absent"	53	28.37	0	27	0
118	0	4.34	30.12	"Present"	52	32.18	3.91	46	0
128	0.42	4.6	26.68	"Absent"	41	30.97	10.33	31	0
118	1.5	5.38	25.84	"Absent"	64	28.63	3.89	29	0
158	3.6	2.97	30.11	"Absent"	63	26.64	108	64	0
108	1.5	4.33	24.99	"Absent"	66	22.29	21.6	61	1
170	7.6	5.5	37.83	"Present"	42	37.41	6.17	54	1
118	1	5.76	22.1	"Absent"	62	23.48	7.71	42	0
124	0	3.04	17.33	"Absent"	49	22.04	0	18	0
114	0	8.01	21.64	"Absent"	66	25.51	2.49	16	0
168	9	8.53	24.48	"Present"	69	26.18	4.63	54	1
134	2	3.66	14.69	"Absent"	52	21.03	2.06	37	0
174	0	8.46	35.1	"Present"	35	25.27	0	61	1
116	31.2	3.17	14.99	"Absent"	47	19.4	49.06	59	1
128	0	10.58	31.81	"Present"	46	28.41	14.66	48	0
140	4.5	4.59	18.01	"Absent"	63	21.91	22.09	32	1
154	0.7	5.91	25	"Absent"	13	20.6	0	42	0
150	3.5	6.99	25.39	"Present"	50	23.35	23.48	61	1
130	0	3.92	25.55	"Absent"	68	28.02	0.68	27	0
128	2	6.13	21.31	"Absent"	66	22.86	11.83	60	0
120	1.4	6.25	20.47	"Absent"	60	25.85	8.51	28	0
120	0	5.01	26.13	"Absent"	64	26.21	12.24	33	0
138	4.5	2.85	30.11	"Absent"	55	24.78	24.89	56	1
153	7.8	3.96	25.73	"Absent"	54	25.91	27.03	45	0
123	8.6	11.17	35.28	"Present"	70	33.14	0	59	1
148	4.04	3.99	20.69	"Absent"	60	27.78	1.75	28	0
136	3.96	2.76	30.28	"Present"	50	34.42	18.51	38	0
134	8.8	7.41	26.84	"Absent"	35	29.44	29.52	60	1
152	12.18	4.04	37.83	"Present"	63	34.57	4.17	64	0
158	13.5	5.04	30.79	"Absent"	54	24.79	21.5	62	0
132	2	3.08	35.39	"Absent"	45	31.44	79.82	58	1
134	1.5	3.73	21.53	"Absent"	41	24.7	11.11	30	1
142	7.44	5.52	33.97	"Absent"	47	29.29	24.27	54	0
134	6	3.3	28.45	"Absent"	65	26.09	58.11	40	0
122	4.18	9.05	29.27	"Present"	44	24.05	19.34	52	1
116	2.7	3.69	13.52	"Absent"	55	21.13	18.51	32	0
128	0.5	3.7	12.81	"Present"	66	21.25	22.73	28	0
120	0	3.68	12.24	"Absent"	51	20.52	0.51	20	0
124	0	3.95	36.35	"Present"	59	32.83	9.59	54	0
160	14	5.9	37.12	"Absent"	58	33.87	3.52	54	1
130	2.78	4.89	9.39	"Present"	63	19.3	17.47	25	1
128	2.8	5.53	14.29	"Absent"	64	24.97	0.51	38	0
130	4.5	5.86	37.43	"Absent"	61	31.21	32.3	58	0
109	1.2	6.14	29.26	"Absent"	47	24.72	10.46	40	0
144	0	3.84	18.72	"Absent"	56	22.1	4.8	40	0
118	1.05	3.16	12.98	"Present"	46	22.09	16.35	31	0
136	3.46	6.38	32.25	"Present"	43	28.73	3.13	43	1
136	1.5	6.06	26.54	"Absent"	54	29.38	14.5	33	1
124	15.5	5.05	24.06	"Absent"	46	23.22	0	61	1
148	6	6.49	26.47	"Absent"	48	24.7	0	55	0
128	6.6	3.58	20.71	"Absent"	55	24.15	0	52	0
122	0.28	4.19	19.97	"Absent"	61	25.63	0	24	0
108	0	2.74	11.17	"Absent"	53	22.61	0.95	20	0
124	3.04	4.8	19.52	"Present"	60	21.78	147.19	41	1
138	8.8	3.12	22.41	"Present"	63	23.33	120.03	55	1
127	0	2.81	15.7	"Absent"	42	22.03	1.03	17	0
174	9.45	5.13	35.54	"Absent"	55	30.71	59.79	53	0
122	0	3.05	23.51	"Absent"	46	25.81	0	38	0
144	6.75	5.45	29.81	"Absent"	53	25.62	26.23	43	1
126	1.8	6.22	19.71	"Absent"	65	24.81	0.69	31	0
208	27.4	3.12	26.63	"Absent"	66	27.45	33.07	62	1
138	0	2.68	17.04	"Absent"	42	22.16	0	16	0
148	0	3.84	17.26	"Absent"	70	20	0	21	0
122	0	3.08	16.3	"Absent"	43	22.13	0	16	0
132	7	3.2	23.26	"Absent"	77	23.64	23.14	49	0
110	12.16	4.99	28.56	"Absent"	44	27.14	21.6	55	1
160	1.52	8.12	29.3	"Present"	54	25.87	12.86	43	1
126	0.54	4.39	21.13	"Present"	45	25.99	0	25	0
162	5.3	7.95	33.58	"Present"	58	36.06	8.23	48	0
194	2.55	6.89	33.88	"Present"	69	29.33	0	41	0
118	0.75	2.58	20.25	"Absent"	59	24.46	0	32	0
124	0	4.79	34.71	"Absent"	49	26.09	9.26	47	0
160	0	2.42	34.46	"Absent"	48	29.83	1.03	61	0
128	0	2.51	29.35	"Present"	53	22.05	1.37	62	0
122	4	5.24	27.89	"Present"	45	26.52	0	61	1
132	2	2.7	21.57	"Present"	50	27.95	9.26	37	0
120	0	2.42	16.66	"Absent"	46	20.16	0	17	0
128	0.04	8.22	28.17	"Absent"	65	26.24	11.73	24	0
108	15	4.91	34.65	"Absent"	41	27.96	14.4	56	0
166	0	4.31	34.27	"Absent"	45	30.14	13.27	56	0
152	0	6.06	41.05	"Present"	51	40.34	0	51	0
170	4.2	4.67	35.45	"Present"	50	27.14	7.92	60	1
156	4	2.05	19.48	"Present"	50	21.48	27.77	39	1
116	8	6.73	28.81	"Present"	41	26.74	40.94	48	1
122	4.4	3.18	11.59	"Present"	59	21.94	0	33	1
150	20	6.4	35.04	"Absent"	53	28.88	8.33	63	0
129	2.15	5.17	27.57	"Absent"	52	25.42	2.06	39	0
134	4.8	6.58	29.89	"Present"	55	24.73	23.66	63	0
126	0	5.98	29.06	"Present"	56	25.39	11.52	64	1
142	0	3.72	25.68	"Absent"	48	24.37	5.25	40	1
128	0.7	4.9	37.42	"Present"	72	35.94	3.09	49	1
102	0.4	3.41	17.22	"Present"	56	23.59	2.06	39	1
130	0	4.89	25.98	"Absent"	72	30.42	14.71	23	0
138	0.05	2.79	10.35	"Absent"	46	21.62	0	18	0
138	0	1.96	11.82	"Present"	54	22.01	8.13	21	0
128	0	3.09	20.57	"Absent"	54	25.63	0.51	17	0
162	2.92	3.63	31.33	"Absent"	62	31.59	18.51	42	0
160	3	9.19	26.47	"Present"	39	28.25	14.4	54	1
148	0	4.66	24.39	"Absent"	50	25.26	4.03	27	0
124	0.16	2.44	16.67	"Absent"	65	24.58	74.91	23	0
136	3.15	4.37	20.22	"Present"	59	25.12	47.16	31	1
134	2.75	5.51	26.17	"Absent"	57	29.87	8.33	33	0
128	0.73	3.97	23.52	"Absent"	54	23.81	19.2	64	0
122	3.2	3.59	22.49	"Present"	45	24.96	36.17	58	0
152	3	4.64	31.29	"Absent"	41	29.34	4.53	40	0
162	0	5.09	24.6	"Present"	64	26.71	3.81	18	0
124	4	6.65	30.84	"Present"	54	28.4	33.51	60	0
136	5.8	5.9	27.55	"Absent"	65	25.71	14.4	59	0
136	8.8	4.26	32.03	"Present"	52	31.44	34.35	60	0
134	0.05	8.03	27.95	"Absent"	48	26.88	0	60	0
122	1	5.88	34.81	"Present"	69	31.27	15.94	40	1
116	3	3.05	30.31	"Absent"	41	23.63	0.86	44	0
132	0	0.98	21.39	"Absent"	62	26.75	0	53	0
134	0	2.4	21.11	"Absent"	57	22.45	1.37	18	0
160	7.77	8.07	34.8	"Absent"	64	31.15	0	62	1
180	0.52	4.23	16.38	"Absent"	55	22.56	14.77	45	1
124	0.81	6.16	11.61	"Absent"	35	21.47	10.49	26	0
114	0	4.97	9.69	"Absent"	26	22.6	0	25	0
208	7.4	7.41	32.03	"Absent"	50	27.62	7.85	57	0
138	0	3.14	12	"Absent"	54	20.28	0	16	0
164	0.5	6.95	39.64	"Present"	47	41.76	3.81	46	1
144	2.4	8.13	35.61	"Absent"	46	27.38	13.37	60	0
136	7.5	7.39	28.04	"Present"	50	25.01	0	45	1
132	7.28	3.52	12.33	"Absent"	60	19.48	2.06	56	0
143	5.04	4.86	23.59	"Absent"	58	24.69	18.72	42	0
112	4.46	7.18	26.25	"Present"	69	27.29	0	32	1
134	10	3.79	34.72	"Absent"	42	28.33	28.8	52	1
138	2	5.11	31.4	"Present"	49	27.25	2.06	64	1
188	0	5.47	32.44	"Present"	71	28.99	7.41	50	1
110	2.35	3.36	26.72	"Present"	54	26.08	109.8	58	1
136	13.2	7.18	35.95	"Absent"	48	29.19	0	62	0
130	1.75	5.46	34.34	"Absent"	53	29.42	0	58	1
122	0	3.76	24.59	"Absent"	56	24.36	0	30	0
138	0	3.24	27.68	"Absent"	60	25.7	88.66	29	0
130	18	4.13	27.43	"Absent"	54	27.44	0	51	1
126	5.5	3.78	34.15	"Absent"	55	28.85	3.18	61	0
176	5.76	4.89	26.1	"Present"	46	27.3	19.44	57	0
122	0	5.49	19.56	"Absent"	57	23.12	14.02	27	0
124	0	3.23	9.64	"Absent"	59	22.7	0	16	0
140	5.2	3.58	29.26	"Absent"	70	27.29	20.17	45	1
128	6	4.37	22.98	"Present"	50	26.01	0	47	0
190	4.18	5.05	24.83	"Absent"	45	26.09	82.85	41	0
144	0.76	10.53	35.66	"Absent"	63	34.35	0	55	1
126	4.6	7.4	31.99	"Present"	57	28.67	0.37	60	1
128	0	2.63	23.88	"Absent"	45	21.59	6.54	57	0
136	0.4	3.91	21.1	"Present"	63	22.3	0	56	1
158	4	4.18	28.61	"Present"	42	25.11	0	60	0
160	0.6	6.94	30.53	"Absent"	36	25.68	1.42	64	0
124	6	5.21	33.02	"Present"	64	29.37	7.61	58	1
158	6.17	8.12	30.75	"Absent"	46	27.84	92.62	48	0
128	0	6.34	11.87	"Absent"	57	23.14	0	17	0
166	3	3.82	26.75	"Absent"	45	20.86	0	63	1
146	7.5	7.21	25.93	"Present"	55	22.51	0.51	42	0
161	9	4.65	15.16	"Present"	58	23.76	43.2	46	0
164	13.02	6.26	29.38	"Present"	47	22.75	37.03	54	1
146	5.08	7.03	27.41	"Present"	63	36.46	24.48	37	1
142	4.48	3.57	19.75	"Present"	51	23.54	3.29	49	0
138	12	5.13	28.34	"Absent"	59	24.49	32.81	58	1
154	1.8	7.13	34.04	"Present"	52	35.51	39.36	44	0
118	0	2.39	12.13	"Absent"	49	18.46	0.26	17	1
124	0.61	2.69	17.15	"Present"	61	22.76	11.55	20	0
124	1.04	2.84	16.42	"Present"	46	20.17	0	61	0
136	5	4.19	23.99	"Present"	68	27.8	25.86	35	0
132	9.9	4.63	27.86	"Present"	46	23.39	0.51	52	1
118	0.12	1.96	20.31	"Absent"	37	20.01	2.42	18	0
118	0.12	4.16	9.37	"Absent"	57	19.61	0	17	0
134	12	4.96	29.79	"Absent"	53	24.86	8.23	57	0
114	0.1	3.95	15.89	"Present"	57	20.31	17.14	16	0
136	6.8	7.84	30.74	"Present"	58	26.2	23.66	45	1
130	0	4.16	39.43	"Present"	46	30.01	0	55	1
136	2.2	4.16	38.02	"Absent"	65	37.24	4.11	41	1
136	1.36	3.16	14.97	"Present"	56	24.98	7.3	24	0
154	4.2	5.59	25.02	"Absent"	58	25.02	1.54	43	0
108	0.8	2.47	17.53	"Absent"	47	22.18	0	55	1
136	8.8	4.69	36.07	"Present"	38	26.56	2.78	63	1
174	2.02	6.57	31.9	"Present"	50	28.75	11.83	64	1
124	4.25	8.22	30.77	"Absent"	56	25.8	0	43	0
114	0	2.63	9.69	"Absent"	45	17.89	0	16	0
118	0.12	3.26	12.26	"Absent"	55	22.65	0	16	0
106	1.08	4.37	26.08	"Absent"	67	24.07	17.74	28	1
146	3.6	3.51	22.67	"Absent"	51	22.29	43.71	42	0
206	0	4.17	33.23	"Absent"	69	27.36	6.17	50	1
134	3	3.17	17.91	"Absent"	35	26.37	15.12	27	0
148	15	4.98	36.94	"Present"	72	31.83	66.27	41	1
126	0.21	3.95	15.11	"Absent"	61	22.17	2.42	17	0
134	0	3.69	13.92	"Absent"	43	27.66	0	19	0
134	0.02	2.8	18.84	"Absent"	45	24.82	0	17	0
123	0.05	4.61	13.69	"Absent"	51	23.23	2.78	16	0
112	0.6	5.28	25.71	"Absent"	55	27.02	27.77	38	1
112	0	1.71	15.96	"Absent"	42	22.03	3.5	16	0
101	0.48	7.26	13	"Absent"	50	19.82	5.19	16	0
150	0.18	4.14	14.4	"Absent"	53	23.43	7.71	44	0
170	2.6	7.22	28.69	"Present"	71	27.87	37.65	56	1
134	0	5.63	29.12	"Absent"	68	32.33	2.02	34	0
142	0	4.19	18.04	"Absent"	56	23.65	20.78	42	1
132	0.1	3.28	10.73	"Absent"	73	20.42	0	17	0
136	0	2.28	18.14	"Absent"	55	22.59	0	17	0
132	12	4.51	21.93	"Absent"	61	26.07	64.8	46	1
166	4.1	4	34.3	"Present"	32	29.51	8.23	53	0
138	0	3.96	24.7	"Present"	53	23.8	0	45	0
138	2.27	6.41	29.07	"Absent"	58	30.22	2.93	32	1
170	0	3.12	37.15	"Absent"	47	35.42	0	53	0
128	0	8.41	28.82	"Present"	60	26.86	0	59	1
136	1.2	2.78	7.12	"Absent"	52	22.51	3.41	27	0
128	0	3.22	26.55	"Present"	39	26.59	16.71	49	0
150	14.4	5.04	26.52	"Present"	60	28.84	0	45	0
132	8.4	3.57	13.68	"Absent"	42	18.75	15.43	59	1
142	2.4	2.55	23.89	"Absent"	54	26.09	59.14	37	0
130	0.05	2.44	28.25	"Present"	67	30.86	40.32	34	0
174	3.5	5.26	21.97	"Present"	36	22.04	8.33	59	1
114	9.6	2.51	29.18	"Absent"	49	25.67	40.63	46	0
162	1.5	2.46	19.39	"Present"	49	24.32	0	59	1
174	0	3.27	35.4	"Absent"	58	37.71	24.95	44	0
190	5.15	6.03	36.59	"Absent"	42	30.31	72	50	0
154	1.4	1.72	18.86	"Absent"	58	22.67	43.2	59	0
124	0	2.28	24.86	"Present"	50	22.24	8.26	38	0
114	1.2	3.98	14.9	"Absent"	49	23.79	25.82	26	0
168	11.4	5.08	26.66	"Present"	56	27.04	2.61	59	1
142	3.72	4.24	32.57	"Absent"	52	24.98	7.61	51	0
154	0	4.81	28.11	"Present"	56	25.67	75.77	59	0
146	4.36	4.31	18.44	"Present"	47	24.72	10.8	38	0
166	6	3.02	29.3	"Absent"	35	24.38	38.06	61	0
140	8.6	3.9	32.16	"Present"	52	28.51	11.11	64	1
136	1.7	3.53	20.13	"Absent"	56	19.44	14.4	55	0
156	0	3.47	21.1	"Absent"	73	28.4	0	36	1
132	0	6.63	29.58	"Present"	37	29.41	2.57	62	0
128	0	2.98	12.59	"Absent"	65	20.74	2.06	19	0
106	5.6	3.2	12.3	"Absent"	49	20.29	0	39	0
144	0.4	4.64	30.09	"Absent"	30	27.39	0.74	55	0
154	0.31	2.33	16.48	"Absent"	33	24	11.83	17	0
126	3.1	2.01	32.97	"Present"	56	28.63	26.74	45	0
134	6.4	8.49	37.25	"Present"	56	28.94	10.49	51	1
152	19.45	4.22	29.81	"Absent"	28	23.95	0	59	1
146	1.35	6.39	34.21	"Absent"	51	26.43	0	59	1
162	6.94	4.55	33.36	"Present"	52	27.09	32.06	43	0
130	7.28	3.56	23.29	"Present"	20	26.8	51.87	58	1
138	6	7.24	37.05	"Absent"	38	28.69	0	59	0
148	0	5.32	26.71	"Present"	52	32.21	32.78	27	0
124	4.2	2.94	27.59	"Absent"	50	30.31	85.06	30	0
118	1.62	9.01	21.7	"Absent"	59	25.89	21.19	40	0
116	4.28	7.02	19.99	"Present"	68	23.31	0	52	1
162	6.3	5.73	22.61	"Present"	46	20.43	62.54	53	1
138	0.87	1.87	15.89	"Absent"	44	26.76	42.99	31	0
137	1.2	3.14	23.87	"Absent"	66	24.13	45	37	0
198	0.52	11.89	27.68	"Present"	48	28.4	78.99	26	1
154	4.5	4.75	23.52	"Present"	43	25.76	0	53	1
128	5.4	2.36	12.98	"Absent"	51	18.36	6.69	61	0
130	0.08	5.59	25.42	"Present"	50	24.98	6.27	43	1
162	5.6	4.24	22.53	"Absent"	29	22.91	5.66	60	0
120	10.5	2.7	29.87	"Present"	54	24.5	16.46	49	0
136	3.99	2.58	16.38	"Present"	53	22.41	27.67	36	0
176	1.2	8.28	36.16	"Present"	42	27.81	11.6	58	1
134	11.79	4.01	26.57	"Present"	38	21.79	38.88	61	1
122	1.7	5.28	32.23	"Present"	51	24.08	0	54	0
134	0.9	3.18	23.66	"Present"	52	23.26	27.36	58	1
134	0	2.43	22.24	"Absent"	52	26.49	41.66	24	0
136	6.6	6.08	32.74	"Absent"	64	33.28	2.72	49	0
132	4.05	5.15	26.51	"Present"	31	26.67	16.3	50	0
152	1.68	3.58	25.43	"Absent"	50	27.03	0	32	0
132	12.3	5.96	32.79	"Present"	57	30.12	21.5	62	1
124	0.4	3.67	25.76	"Absent"	43	28.08	20.57	34	0
140	4.2	2.91	28.83	"Present"	43	24.7	47.52	48	0
166	0.6	2.42	34.03	"Present"	53	26.96	54	60	0
156	3.02	5.35	25.72	"Present"	53	25.22	28.11	52	1
132	0.72	4.37	19.54	"Absent"	48	26.11	49.37	28	0
150	0	4.99	27.73	"Absent"	57	30.92	8.33	24	0
134	0.12	3.4	21.18	"Present"	33	26.27	14.21	30	0
126	3.4	4.87	15.16	"Present"	65	22.01	11.11	38	0
148	0.5	5.97	32.88	"Absent"	54	29.27	6.43	42	0
148	8.2	7.75	34.46	"Present"	46	26.53	6.04	64	1
132	6	5.97	25.73	"Present"	66	24.18	145.29	41	0
128	1.6	5.41	29.3	"Absent"	68	29.38	23.97	32	0
128	5.16	4.9	31.35	"Present"	57	26.42	0	64	0
140	0	2.4	27.89	"Present"	70	30.74	144	29	0
126	0	5.29	27.64	"Absent"	25	27.62	2.06	45	0
114	3.6	4.16	22.58	"Absent"	60	24.49	65.31	31	0
118	1.25	4.69	31.58	"Present"	52	27.16	4.11	53	0
126	0.96	4.99	29.74	"Absent"	66	33.35	58.32	38	0
154	4.5	4.68	39.97	"Absent"	61	33.17	1.54	64	1
112	1.44	2.71	22.92	"Absent"	59	24.81	0	52	0
140	8	4.42	33.15	"Present"	47	32.77	66.86	44	0
140	1.68	11.41	29.54	"Present"	74	30.75	2.06	38	1
128	2.6	4.94	21.36	"Absent"	61	21.3	0	31	0
126	19.6	6.03	34.99	"Absent"	49	26.99	55.89	44	0
160	4.2	6.76	37.99	"Present"	61	32.91	3.09	54	1
144	0	4.17	29.63	"Present"	52	21.83	0	59	0
148	4.5	10.49	33.27	"Absent"	50	25.92	2.06	53	1
146	0	4.92	18.53	"Absent"	57	24.2	34.97	26	0
164	5.6	3.17	30.98	"Present"	44	25.99	43.2	53	1
130	0.54	3.63	22.03	"Present"	69	24.34	12.86	39	1
154	2.4	5.63	42.17	"Present"	59	35.07	12.86	50	1
178	0.95	4.75	21.06	"Absent"	49	23.74	24.69	61	0
180	3.57	3.57	36.1	"Absent"	36	26.7	19.95	64	0
134	12.5	2.73	39.35	"Absent"	48	35.58	0	48	0
142	0	3.54	16.64	"Absent"	58	25.97	8.36	27	0
162	7	7.67	34.34	"Present"	33	30.77	0	62	0
218	11.2	2.77	30.79	"Absent"	38	24.86	90.93	48	1
126	8.75	6.06	32.72	"Present"	33	27	62.43	55	1
126	0	3.57	26.01	"Absent"	61	26.3	7.97	47	0
134	6.1	4.77	26.08	"Absent"	47	23.82	1.03	49	0
132	0	4.17	36.57	"Absent"	57	30.61	18	49	0
178	5.5	3.79	23.92	"Present"	45	21.26	6.17	62	1
208	5.04	5.19	20.71	"Present"	52	25.12	24.27	58	1
160	1.15	10.19	39.71	"Absent"	31	31.65	20.52	57	0
116	2.38	5.67	29.01	"Present"	54	27.26	15.77	51	0
180	25.01	3.7	38.11	"Present"	57	30.54	0	61	1
200	19.2	4.43	40.6	"Present"	55	32.04	36	60	1
112	4.2	3.58	27.14	"Absent"	52	26.83	2.06	40	0
120	0	3.1	26.97	"Absent"	41	24.8	0	16	0
178	20	9.78	33.55	"Absent"	37	27.29	2.88	62	1
166	0.8	5.63	36.21	"Absent"	50	34.72	28.8	60	0
164	8.2	14.16	36.85	"Absent"	52	28.5	17.02	55	1
216	0.92	2.66	19.85	"Present"	49	20.58	0.51	63	1
146	6.4	5.62	33.05	"Present"	57	31.03	0.74	46	0
134	1.1	3.54	20.41	"Present"	58	24.54	39.91	39	1
158	16	5.56	29.35	"Absent"	36	25.92	58.32	60	0
176	0	3.14	31.04	"Present"	45	30.18	4.63	45	0
132	2.8	4.79	20.47	"Present"	50	22.15	11.73	48	0
126	0	4.55	29.18	"Absent"	48	24.94	36	41	0
120	5.5	3.51	23.23	"Absent"	46	22.4	90.31	43	0
174	0	3.86	21.73	"Absent"	42	23.37	0	63	0
150	13.8	5.1	29.45	"Present"	52	27.92	77.76	55	1
176	6	3.98	17.2	"Present"	52	21.07	4.11	61	1
142	2.2	3.29	22.7	"Absent"	44	23.66	5.66	42	1
132	0	3.3	21.61	"Absent"	42	24.92	32.61	33	0
142	1.32	7.63	29.98	"Present"	57	31.16	72.93	33	0
146	1.16	2.28	34.53	"Absent"	50	28.71	45	49	0
132	7.2	3.65	17.16	"Present"	56	23.25	0	34	0
120	0	3.57	23.22	"Absent"	58	27.2	0	32	0
118	0	3.89	15.96	"Absent"	65	20.18	0	16	0
108	0	1.43	26.26	"Absent"	42	19.38	0	16	0
136	0	4	19.06	"Absent"	40	21.94	2.06	16	0
120	0	2.46	13.39	"Absent"	47	22.01	0.51	18	0
132	0	3.55	8.66	"Present"	61	18.5	3.87	16	0
136	0	1.77	20.37	"Absent"	45	21.51	2.06	16	0
138	0	1.86	18.35	"Present"	59	25.38	6.51	17	0
138	0.06	4.15	20.66	"Absent"	49	22.59	2.49	16	0
130	1.22	3.3	13.65	"Absent"	50	21.4	3.81	31	0
130	4	2.4	17.42	"Absent"	60	22.05	0	40	0
110	0	7.14	28.28	"Absent"	57	29	0	32	0
120	0	3.98	13.19	"Present"	47	21.89	0	16	0
166	6	8.8	37.89	"Absent"	39	28.7	43.2	52	0
134	0.57	4.75	23.07	"Absent"	67	26.33	0	37	0
142	3	3.69	25.1	"Absent"	60	30.08	38.88	27	0
136	2.8	2.53	9.28	"Present"	61	20.7	4.55	25	0
142	0	4.32	25.22	"Absent"	47	28.92	6.53	34	1
130	0	1.88	12.51	"Present"	52	20.28	0	17	0
124	1.8	3.74	16.64	"Present"	42	22.26	10.49	20	0
144	4	5.03	25.78	"Present"	57	27.55	90	48	1
136	1.81	3.31	6.74	"Absent"	63	19.57	24.94	24	0
120	0	2.77	13.35	"Absent"	67	23.37	1.03	18	0
154	5.53	3.2	28.81	"Present"	61	26.15	42.79	42	0
124	1.6	7.22	39.68	"Present"	36	31.5	0	51	1
146	0.64	4.82	28.02	"Absent"	60	28.11	8.23	39	1
128	2.24	2.83	26.48	"Absent"	48	23.96	47.42	27	1
170	0.4	4.11	42.06	"Present"	56	33.1	2.06	57	0
214	0.4	5.98	31.72	"Absent"	64	28.45	0	58	0
182	4.2	4.41	32.1	"Absent"	52	28.61	18.72	52	1
108	3	1.59	15.23	"Absent"	40	20.09	26.64	55	0
118	5.4	11.61	30.79	"Absent"	64	27.35	23.97	40	0
132	0	4.82	33.41	"Present"	62	14.7	0	46	1

================================================
FILE: 2017/examples/02_feed_dict.py
================================================
""" Example to demonstrate the use of feed_dict
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

# Example 1: feed_dict with placeholder
# create a placeholder of type float 32-bit, value is a vector of 3 elements
a = tf.placeholder(tf.float32, shape=[3])

# create a constant of type float 32-bit, value is a vector of 3 elements
b = tf.constant([5, 5, 5], tf.float32)

# use the placeholder as you would a constant
c = a + b  # short for tf.add(a, b)

with tf.Session() as sess:
	# print(sess.run(c)) # InvalidArgumentError because placeholder a hasn't been fed a value

	# feed [1, 2, 3] to placeholder a via the dict {a: [1, 2, 3]}
	# fetch value of c
	print(sess.run(c, {a: [1, 2, 3]})) # >> [6. 7. 8.]


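# Example 2: feed_dict with ordinary tensors
# any feedable tensor can be fed, not only placeholders;
# a and b below are the outputs of ops, not tf.Variable objects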
a = tf.add(2, 5)
b = tf.multiply(a, 3)

with tf.Session() as sess:
	# define a dictionary that says to replace the value of 'a' with 15
	replace_dict = {a: 15}

	# Run the session, passing in 'replace_dict' as the value to 'feed_dict'
	print(sess.run(b, feed_dict=replace_dict)) # >> 45
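
# Sketch (not part of the original example): a toy, pure-Python model of what
# feeding a tensor means. ToyNode and its eval() method are hypothetical names
# made up for illustration; this is not how TensorFlow is implemented.
# If a node appears in the feed dictionary, its fed value is used and the
# function that would normally compute it is skipped.
class ToyNode(object):
	def __init__(self, fn, *inputs):
		self.fn = fn          # function that computes this node's value
		self.inputs = inputs  # upstream ToyNodes or plain Python constants

	def eval(self, feed=None):
		feed = feed or {}
		if self in feed:  # a fed value overrides the computation
			return feed[self]
		args = [i.eval(feed) if isinstance(i, ToyNode) else i for i in self.inputs]
		return self.fn(*args)

toy_a = ToyNode(lambda x, y: x + y, 2, 5)      # mirrors a = tf.add(2, 5)
toy_b = ToyNode(lambda x, y: x * y, toy_a, 3)  # mirrors b = tf.multiply(a, 3)
print(toy_b.eval())             # 21
print(toy_b.eval({toy_a: 15}))  # 45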

================================================
FILE: 2017/examples/02_lazy_loading.py
================================================
""" Example to demonstrate how the graph definition gets
bloated because of lazy loading
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf 

######################################## 
## NORMAL LOADING                     ##
## print out a graph with 1 Add node  ## 
########################################

x = tf.Variable(10, name='x')
y = tf.Variable(20, name='y')
z = tf.add(x, y)

with tf.Session() as sess:
	sess.run(tf.global_variables_initializer())
	writer = tf.summary.FileWriter('./graphs/l2', sess.graph)
	for _ in range(10):
		sess.run(z)
	print(tf.get_default_graph().as_graph_def())
	writer.close()

######################################## 
## LAZY LOADING                       ##
## print out a graph with 10 Add nodes## 
########################################

x = tf.Variable(10, name='x')
y = tf.Variable(20, name='y')

with tf.Session() as sess:
	sess.run(tf.global_variables_initializer())
	writer = tf.summary.FileWriter('./graphs/l2', sess.graph)
	for _ in range(10):
		sess.run(tf.add(x, y))
	print(tf.get_default_graph().as_graph_def()) 
	writer.close()

================================================
FILE: 2017/examples/02_simple_tf.py
================================================
""" Some simple TensorFlow's ops
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf


a = tf.constant(2)
b = tf.constant(3)
x = tf.add(a, b)
with tf.Session() as sess:
	writer = tf.summary.FileWriter('./graphs', sess.graph) 
	print(sess.run(x))
writer.close() # close the writer when you're done using it


a = tf.constant([2, 2], name='a')
b = tf.constant([[0, 1], [2, 3]], name='b')
# note: tf.multiply is element-wise (with broadcasting), not a dot product, despite the op name
x = tf.multiply(a, b, name='dot_product')
with tf.Session() as sess:
	print(sess.run(x))
# >> [[0 2]
#	 [4 6]]

# tf.zeros(shape, dtype=tf.float32, name=None)
# creates a tensor of the given shape with all elements set to zero (when run in a session)

x = tf.zeros([2, 3], tf.int32) 
y = tf.zeros_like(x, optimize=True)
print(y)
print(tf.get_default_graph().as_graph_def())
with tf.Session() as sess:
	y = sess.run(y)


with tf.Session() as sess:
	print(sess.run(tf.linspace(10.0, 13.0, 4)))
	print(sess.run(tf.range(5)))
	for i in np.arange(5):
		print(i)

samples = tf.multinomial(tf.constant([[1., 3., 1]]), 5)

with tf.Session() as sess:
	for _ in range(10):
		print(sess.run(samples))

t_0 = 19 
x = tf.zeros_like(t_0) # ==> 0
y = tf.ones_like(t_0) # ==> 1

with tf.Session() as sess:
	print(sess.run([x, y]))

t_1 = ['apple', 'peach', 'banana']
x = tf.zeros_like(t_1) # ==> ['' '' '']
# y = tf.ones_like(t_1) # ==> TypeError: Expected string, got 1 of type 'int' instead.

t_2 = [[True, False, False],
       [False, False, True],
       [False, True, False]] 
x = tf.zeros_like(t_2) # ==> 3x3 tensor, all elements are False
y = tf.ones_like(t_2) # ==> 3x3 tensor, all elements are True
with tf.Session() as sess:
	print(sess.run([x, y]))

with tf.variable_scope('meh') as scope:
	a = tf.get_variable('a', [10])
	b = tf.get_variable('b', [100])

writer = tf.summary.FileWriter('test', tf.get_default_graph())
writer.close() # close the writer when you're done using it


x = tf.Variable(2.0)
y = 2.0 * (x ** 3)
z = 3.0 + y ** 2
grad_z = tf.gradients(z, [x, y]) # [dz/dx, dz/dy]; at x = 2.0 these evaluate to [768.0, 32.0]
with tf.Session() as sess:
	sess.run(x.initializer)
	print(sess.run(grad_z))
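
# Sketch (not part of the original example): a pure-Python finite-difference
# check of the gradients that tf.gradients(z, [x, y]) returns above.
# The helper name central_diff and the step size h are assumptions chosen
# for illustration. Analytically, dz/dy = 2*y and dz/dx = 2*y * 6*x**2.
def central_diff(f, v, h=1e-5):
	# symmetric difference quotient approximating f'(v)
	return (f(v + h) - f(v - h)) / (2.0 * h)

x0 = 2.0
y0 = 2.0 * x0 ** 3  # y evaluated at x = 2.0
dz_dx = central_diff(lambda v: 3.0 + (2.0 * v ** 3) ** 2, x0)
dz_dy = central_diff(lambda v: 3.0 + v ** 2, y0)
print(dz_dx, dz_dy)  # approximately 768.0 and 32.0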


================================================
FILE: 2017/examples/02_variables.py
================================================
"""
Example to demonstrate the ops of tf.Variable
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

# Example 1: how to run assign op
W = tf.Variable(10)
assign_op = W.assign(100)

with tf.Session() as sess:
	sess.run(W.initializer)
	print(W.eval()) # >> 10
	print(sess.run(assign_op)) # >> 100

# Example 2: tricky example
# create a variable whose original value is 2
my_var = tf.Variable(2, name="my_var") 

# assign 2 * my_var to my_var and run the op my_var_times_two
my_var_times_two = my_var.assign(2 * my_var)

with tf.Session() as sess:
	sess.run(tf.global_variables_initializer())
	print(sess.run(my_var_times_two)) # >> 4
	print(sess.run(my_var_times_two)) # >> 8
	print(sess.run(my_var_times_two)) # >> 16

# Example 3: each session maintains its own copy of variables
W = tf.Variable(10)
sess1 = tf.Session()
sess2 = tf.Session()

# You have to initialize W at each session
sess1.run(W.initializer)
sess2.run(W.initializer)

print(sess1.run(W.assign_add(10))) # >> 20
print(sess2.run(W.assign_sub(2))) # >> 8

print(sess1.run(W.assign_add(100))) # >> 120
print(sess2.run(W.assign_sub(50))) # >> -42

sess1.close()
sess2.close()

================================================
FILE: 2017/examples/03_linear_regression_sol.py
================================================
""" Simple linear regression example in TensorFlow
This program tries to predict the number of thefts from 
the number of fires in the city of Chicago
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import xlrd

import utils

DATA_FILE = 'data/fire_theft.xls'

# Step 1: read in data from the .xls file
book = xlrd.open_workbook(DATA_FILE, encoding_override="utf-8")
sheet = book.sheet_by_index(0)
data = np.asarray([sheet.row_values(i) for i in range(1, sheet.nrows)])
n_samples = sheet.nrows - 1

# Step 2: create placeholders for input X (number of fires) and label Y (number of thefts)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')

# Step 3: create weight and bias, initialized to 0
w = tf.Variable(0.0, name='weights')
b = tf.Variable(0.0, name='bias')

# Step 4: build model to predict Y
Y_predicted = X * w + b 

# Step 5: use the square error as the loss function
loss = tf.square(Y - Y_predicted, name='loss')
# loss = utils.huber_loss(Y, Y_predicted)

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session() as sess:
	# Step 7: initialize the necessary variables, in this case, w and b
	sess.run(tf.global_variables_initializer()) 
	
	writer = tf.summary.FileWriter('./graphs/linear_reg', sess.graph)
	
	# Step 8: train the model
	for i in range(50): # train the model for 50 epochs
		total_loss = 0
		for x, y in data:
			# Session runs the training op (named optimizer here) and fetches the value of loss
			_, l = sess.run([optimizer, loss], feed_dict={X: x, Y:y}) 
			total_loss += l
		print('Epoch {0}: {1}'.format(i, total_loss/n_samples))

	# close the writer when you're done using it
	writer.close() 
	
	# Step 9: output the values of w and b
	w, b = sess.run([w, b]) 

# plot the results
X, Y = data.T[0], data.T[1]
plt.plot(X, Y, 'bo', label='Real data')
plt.plot(X, X * w + b, 'r', label='Predicted data')
plt.legend()
plt.show()
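
# Sketch (not part of the original solution): the same per-sample gradient
# descent written in plain NumPy, to make the update behind
# GradientDescentOptimizer.minimize(loss) explicit for this model.
# The function name sgd_linear_fit, the synthetic data and the
# hyperparameters below are assumptions for illustration; they are not
# taken from fire_theft.xls or from the course.
import numpy as np

def sgd_linear_fit(xs, ys, learning_rate, n_epochs):
	w_, b_ = 0.0, 0.0  # same zero initialization as w and b above
	for _ in range(n_epochs):
		for x_i, y_i in zip(xs, ys):
			err = y_i - (x_i * w_ + b_)  # residual; the loss is err ** 2
			# d(err**2)/dw = -2 * err * x_i and d(err**2)/db = -2 * err
			w_ += learning_rate * 2.0 * err * x_i
			b_ += learning_rate * 2.0 * err
	return w_, b_

toy_xs = np.linspace(0.0, 2.0, 21)
toy_ys = 3.0 * toy_xs + 1.0  # noise-free line with slope 3 and intercept 1
w_fit, b_fit = sgd_linear_fit(toy_xs, toy_ys, learning_rate=0.05, n_epochs=200)
print(w_fit, b_fit)  # close to 3.0 and 1.0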

================================================
FILE: 2017/examples/03_linear_regression_starter.py
================================================
""" Simple linear regression example in TensorFlow
This program tries to predict the number of thefts from 
the number of fires in the city of Chicago
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import xlrd

import utils

DATA_FILE = 'data/fire_theft.xls'

# Phase 1: Assemble the graph
# Step 1: read in data from the .xls file
book = xlrd.open_workbook(DATA_FILE, encoding_override='utf-8')
sheet = book.sheet_by_index(0)
data = np.asarray([sheet.row_values(i) for i in range(1, sheet.nrows)])
n_samples = sheet.nrows - 1

# Step 2: create placeholders for input X (number of fires) and label Y (number of thefts)
# Both have the type float32


# Step 3: create weight and bias, initialized to 0
# name your variables w and b


# Step 4: predict Y (number of thefts) from the number of fires
# name your variable Y_predicted


# Step 5: use the square error as the loss function
# name your variable loss


# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
 
# Phase 2: Train our model
with tf.Session() as sess:
	# Step 7: initialize the necessary variables, in this case, w and b
	# TO - DO	


	# Step 8: train the model
	for i in range(50): # run 50 epochs
		total_loss = 0
		for x, y in data:
			# Session runs optimizer to minimize loss and fetch the value of loss. Name the received value as l
			# TO DO: write sess.run()

			total_loss += l
		print("Epoch {0}: {1}".format(i, total_loss/n_samples))
	
# plot the results
# X, Y = data.T[0], data.T[1]
# plt.plot(X, Y, 'bo', label='Real data')
# plt.plot(X, X * w + b, 'r', label='Predicted data')
# plt.legend()
# plt.show()

================================================
FILE: 2017/examples/03_logistic_regression_mnist_sol.py
================================================
""" Simple logistic regression model to solve OCR task 
with MNIST in TensorFlow
MNIST dataset: yann.lecun.com/exdb/mnist/
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import time

# Define parameters for the model
learning_rate = 0.01
batch_size = 128
n_epochs = 30

# Step 1: Read in data
# using TF Learn's built-in function to load MNIST data to the folder /data/mnist
mnist = input_data.read_data_sets('/data/mnist', one_hot=True) 

# Step 2: create placeholders for features and labels
# each image in the MNIST data is of shape 28*28 = 784
# therefore, each image is represented with a 1x784 tensor
# there are 10 classes for each image, corresponding to digits 0 - 9. 
# each label is a one-hot vector.
X = tf.placeholder(tf.float32, [batch_size, 784], name='X_placeholder') 
Y = tf.placeholder(tf.int32, [batch_size, 10], name='Y_placeholder')

# Step 3: create weights and bias
# w is initialized to random variables with mean of 0, stddev of 0.01
# b is initialized to 0
# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)
# shape of b depends on Y
w = tf.Variable(tf.random_normal(shape=[784, 10], stddev=0.01), name='weights')
b = tf.Variable(tf.zeros([1, 10]), name="bias")

# Step 4: build model
# the model that returns the logits.
# this logits will be later passed through softmax layer
logits = tf.matmul(X, w) + b 

# Step 5: define loss function
# use cross entropy of softmax of logits as the loss function
entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y, name='loss')
loss = tf.reduce_mean(entropy) # computes the mean over all the examples in the batch

# Step 6: define training op
# using the Adam optimizer with learning rate of 0.01 to minimize loss
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

with tf.Session() as sess:
	# to visualize using TensorBoard
	writer = tf.summary.FileWriter('./graphs/logistic_reg', sess.graph)

	start_time = time.time()
	sess.run(tf.global_variables_initializer())	
	n_batches = int(mnist.train.num_examples/batch_size)
	for i in range(n_epochs): # train the model n_epochs times
		total_loss = 0

		for _ in range(n_batches):
			X_batch, Y_batch = mnist.train.next_batch(batch_size)
			_, loss_batch = sess.run([optimizer, loss], feed_dict={X: X_batch, Y:Y_batch}) 
			total_loss += loss_batch
		print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))

	print('Total time: {0} seconds'.format(time.time() - start_time))

	print('Optimization Finished!') # should be around 0.35 after 25 epochs

	# test the model
	
	preds = tf.nn.softmax(logits)
	correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y, 1))
	accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) # need numpy.count_nonzero(boolarr) :(
	
	n_batches = int(mnist.test.num_examples/batch_size)
	total_correct_preds = 0
	
	for i in range(n_batches):
		X_batch, Y_batch = mnist.test.next_batch(batch_size)
		# fetch the tensor itself (not a one-element list) so the result is a scalar that can be summed
		accuracy_batch = sess.run(accuracy, feed_dict={X: X_batch, Y:Y_batch}) 
		total_correct_preds += accuracy_batch	
	
	print('Accuracy {0}'.format(total_correct_preds/mnist.test.num_examples))

	writer.close()
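
# Sketch (not part of the original solution): what the loss and the accuracy
# count above compute, written in plain NumPy. The function name
# softmax_cross_entropy and the tiny hand-made batch are assumptions for
# illustration; TensorFlow's own kernel may differ in implementation details.
import numpy as np

def softmax_cross_entropy(logits_, labels_):
	# subtract the row-wise max before exponentiating for numerical stability
	shifted = logits_ - np.max(logits_, axis=1, keepdims=True)
	log_softmax = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
	return -np.sum(labels_ * log_softmax, axis=1)  # one loss value per example

toy_logits = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
toy_labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # one-hot labels
toy_entropy = softmax_cross_entropy(toy_logits, toy_labels)
toy_loss = np.mean(toy_entropy)  # mirrors tf.reduce_mean(entropy)
# mirrors tf.reduce_sum(tf.cast(correct_preds, tf.float32))
toy_correct = np.sum(np.argmax(toy_logits, axis=1) == np.argmax(toy_labels, axis=1))
print(toy_entropy, toy_loss, toy_correct)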


================================================
FILE: 2017/examples/03_logistic_regression_mnist_starter.py
================================================
""" Starter code for logistic regression model to solve OCR task 
with MNIST in TensorFlow
MNIST dataset: yann.lecun.com/exdb/mnist/
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import time

# Define parameters for the model
learning_rate = 0.01
batch_size = 128
n_epochs = 10

# Step 1: Read in data
# using TF Learn's built-in function to load MNIST data to the folder /data/mnist
mnist = input_data.read_data_sets('/data/mnist', one_hot=True) 

# Step 2: create placeholders for features and labels
# each image in the MNIST data is of shape 28*28 = 784
# therefore, each image is represented with a 1x784 tensor
# there are 10 classes for each image, corresponding to digits 0 - 9. 
# Features are of the type float, and labels are of the type int


# Step 3: create weights and bias
# weights and biases are initialized to 0
# shape of w depends on the dimension of X and Y so that Y = X * w + b
# shape of b depends on Y


# Step 4: build model
# the model that returns the logits.
# this logits will be later passed through softmax layer
# to get the probability distribution of possible label of the image
# DO NOT DO SOFTMAX HERE


# Step 5: define loss function
# use cross entropy loss of the real labels with the softmax of logits
# use the method:
# tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=logits)
# (TensorFlow 1.x requires these to be passed as named arguments)
# then use tf.reduce_mean to get the mean loss of the batch
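
# For intuition only (this is not the TensorFlow solution to Step 5): a NumPy
# sketch of the per-example quantity described above, assuming one-hot labels.
# The helper name softmax_cross_entropy_sketch is hypothetical.
import numpy as np

def softmax_cross_entropy_sketch(logits, labels):
    """ Cross entropy between one-hot labels and softmax(logits), one value per row """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    # subtract the row-wise max before exponentiating, for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)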


# Step 6: define training op
# using gradient descent to minimize loss


with tf.Session() as sess:
	start_time = time.time()
	sess.run(tf.global_variables_initializer())	
	n_batches = int(mnist.train.num_examples/batch_size)
	for i in range(n_epochs): # train the model n_epochs times
		total_loss = 0

		for _ in range(n_batches):
			X_batch, Y_batch = mnist.train.next_batch(batch_size)
			# TO-DO: run optimizer + fetch loss_batch
			# 
			# 
			total_loss += loss_batch
		print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))

	print('Total time: {0} seconds'.format(time.time() - start_time))

	print('Optimization Finished!') # should be around 0.35 after 25 epochs

	# test the model
	preds = tf.nn.softmax(logits)
	correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y, 1))
	accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) # need numpy.count_nonzero(boolarr) :(
	
	n_batches = int(mnist.test.num_examples/batch_size)
	total_correct_preds = 0
	
	for i in range(n_batches):
		X_batch, Y_batch = mnist.test.next_batch(batch_size)
		accuracy_batch = sess.run(accuracy, feed_dict={X: X_batch, Y: Y_batch}) # fetch the scalar, not a one-element list
		total_correct_preds += accuracy_batch	
	
	print('Accuracy {0}'.format(total_correct_preds/mnist.test.num_examples))


================================================
FILE: 2017/examples/04_word2vec_no_frills.py
================================================
""" The no frills implementation of word2vec skip-gram model using NCE loss.
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

from process_data import process_data

VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1 # the context window
NUM_SAMPLED = 64    # Number of negative examples to sample.
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 10000
SKIP_STEP = 2000 # how many steps to skip before reporting the loss

def word2vec(batch_gen):
    """ Build the graph for word2vec model and train it """
    # Step 1: define the placeholders for input and output
    with tf.name_scope('data'):
        center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')
        target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1], name='target_words')

    # Assemble this part of the graph on the CPU. You can change it to GPU if you have GPU
    # Step 2: define weights. In word2vec, it's actually the weights that we care about

    with tf.name_scope('embedding_matrix'):
        embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0), 
                            name='embed_matrix')

    # Step 3: define the inference
    with tf.name_scope('loss'):
        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')

        # Step 4: construct variables for NCE loss
        nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                                    stddev=1.0 / (EMBED_SIZE ** 0.5)), 
                                                    name='nce_weight')
        nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')

        # define loss function to be NCE loss function
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                            biases=nce_bias, 
                                            labels=target_words, 
                                            inputs=embed, 
                                            num_sampled=NUM_SAMPLED, 
                                            num_classes=VOCAB_SIZE), name='loss')

    # Step 5: define optimizer
    optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps
        writer = tf.summary.FileWriter('./graphs/no_frills/', sess.graph)
        for index in range(NUM_TRAIN_STEPS):
            centers, targets = next(batch_gen)
            loss_batch, _ = sess.run([loss, optimizer], 
                                    feed_dict={center_words: centers, target_words: targets})
            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
        writer.close()
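
# A minimal NumPy sketch of what tf.nn.embedding_lookup does in Step 3 above,
# assuming a single, unpartitioned embedding matrix: it gathers one row of the
# matrix per word id. The helper name embedding_lookup_sketch is hypothetical
# and is not part of TensorFlow.
import numpy as np

def embedding_lookup_sketch(embed_matrix, word_ids):
    """ Return the rows of embed_matrix selected by the integer word_ids """
    embed_matrix = np.asarray(embed_matrix)
    word_ids = np.asarray(word_ids, dtype=np.int64)
    # integer array indexing picks whole rows, giving shape [len(word_ids), embed_size]
    return embed_matrix[word_ids]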

def main():
    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)
    word2vec(batch_gen)

if __name__ == '__main__':
    main()

================================================
FILE: 2017/examples/04_word2vec_starter.py
================================================
""" The no frills implementation of word2vec skip-gram model using NCE loss. 
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

from process_data import process_data

VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1 # the context window
NUM_SAMPLED = 64    # Number of negative examples to sample.
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 20000
SKIP_STEP = 2000 # how many steps to skip before reporting the loss

def word2vec(batch_gen):
    """ Build the graph for word2vec model and train it """
    # Step 1: define the placeholders for input and output
    # center_words have to be int to work on embedding lookup

    # TO DO


    # Step 2: define weights. In word2vec, it's actually the weights that we care about
    # vocab size x embed size
    # initialized to random uniform -1 to 1

    # TO DO


    # Step 3: define the inference
    # get the embed of input words using tf.nn.embedding_lookup
    # embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')

    # TO DO


        # Step 4: construct variables for NCE loss
        # tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, ...)
        # nce_weight (vocab size x embed size), initialized to truncated_normal stddev=1.0 / (EMBED_SIZE ** 0.5)
        # bias: vocab size, initialized to 0

        # TO DO


        # define loss function to be NCE loss function
        # tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, ...)
        # need to get the mean across the batch
        # note: you should use embedding of center words for inputs, not center words themselves

        # TO DO

        
    # Step 5: define optimizer
    
    # TO DO


    with tf.Session() as sess:
        # TO DO: initialize variables


        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps
        writer = tf.summary.FileWriter('./graphs/no_frills/', sess.graph)
        for index in range(NUM_TRAIN_STEPS):
            centers, targets = next(batch_gen)
            # TO DO: create feed_dict, run optimizer, fetch loss_batch

            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
        writer.close()

def main():
    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)
    word2vec(batch_gen)

if __name__ == '__main__':
    main()


================================================
FILE: 2017/examples/04_word2vec_visualize.py
================================================
""" word2vec with NCE loss and code to visualize the embeddings on TensorBoard
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector
import tensorflow as tf

from process_data import process_data
import utils

VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1 # the context window
NUM_SAMPLED = 64    # Number of negative examples to sample.
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 100000
WEIGHTS_FLD = 'processed/'
SKIP_STEP = 2000

class SkipGramModel:
    """ Build the graph for word2vec model """
    def __init__(self, vocab_size, embed_size, batch_size, num_sampled, learning_rate):
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.batch_size = batch_size
        self.num_sampled = num_sampled
        self.lr = learning_rate
        self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

    def _create_placeholders(self):
        """ Step 1: define the placeholders for input and output """
        with tf.name_scope("data"):
            self.center_words = tf.placeholder(tf.int32, shape=[self.batch_size], name='center_words')
            self.target_words = tf.placeholder(tf.int32, shape=[self.batch_size, 1], name='target_words')

    def _create_embedding(self):
        """ Step 2: define weights. In word2vec, it's actually the weights that we care about """
        # Assemble this part of the graph on the CPU. You can change it to GPU if you have GPU
        with tf.device('/cpu:0'):
            with tf.name_scope("embed"):
                self.embed_matrix = tf.Variable(tf.random_uniform([self.vocab_size, 
                                                                    self.embed_size], -1.0, 1.0), 
                                                                    name='embed_matrix')

    def _create_loss(self):
        """ Step 3 + 4: define the model + the loss function """
        with tf.device('/cpu:0'):
            with tf.name_scope("loss"):
                # Step 3: define the inference
                embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words, name='embed')

                # Step 4: define loss function
                # construct variables for NCE loss
                nce_weight = tf.Variable(tf.truncated_normal([self.vocab_size, self.embed_size],
                                                            stddev=1.0 / (self.embed_size ** 0.5)), 
                                                            name='nce_weight')
                nce_bias = tf.Variable(tf.zeros([self.vocab_size]), name='nce_bias')

                # define loss function to be NCE loss function
                self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                                    biases=nce_bias, 
                                                    labels=self.target_words, 
                                                    inputs=embed, 
                                                    num_sampled=self.num_sampled, 
                                                    num_classes=self.vocab_size), name='loss')
    def _create_optimizer(self):
        """ Step 5: define optimizer """
        with tf.device('/cpu:0'):
            self.optimizer = tf.train.GradientDescentOptimizer(self.lr).minimize(self.loss, 
                                                              global_step=self.global_step)

    def _create_summaries(self):
        with tf.name_scope("summaries"):
            tf.summary.scalar("loss", self.loss)
            tf.summary.histogram("histogram_loss", self.loss) # no spaces: TensorFlow rewrites illegal summary names
            # because you have several summaries, we should merge them all
            # into one op to make it easier to manage
            self.summary_op = tf.summary.merge_all()

    def build_graph(self):
        """ Build the graph for our model """
        self._create_placeholders()
        self._create_embedding()
        self._create_loss()
        self._create_optimizer()
        self._create_summaries()

def train_model(model, batch_gen, num_train_steps, weights_fld):
    saver = tf.train.Saver() # defaults to saving all variables - in this case embed_matrix, nce_weight, nce_bias

    initial_step = 0
    utils.make_dir('checkpoints')
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))
        # if that checkpoint exists, restore from checkpoint
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)

        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps
        writer = tf.summary.FileWriter('improved_graph/lr' + str(LEARNING_RATE), sess.graph)
        initial_step = model.global_step.eval()
        for index in range(initial_step, initial_step + num_train_steps):
            centers, targets = next(batch_gen)
            feed_dict={model.center_words: centers, model.target_words: targets}
            loss_batch, _, summary = sess.run([model.loss, model.optimizer, model.summary_op], 
                                              feed_dict=feed_dict)
            writer.add_summary(summary, global_step=index)
            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
                saver.save(sess, 'checkpoints/skip-gram', index)
        
        ####################
        # code to visualize the embeddings. uncomment the below to visualize embeddings
        # run "tensorboard --logdir=processed" to see the embeddings
        # final_embed_matrix = sess.run(model.embed_matrix)
        
        # # it has to be a variable. constants don't work here. you can't reuse model.embed_matrix
        # embedding_var = tf.Variable(final_embed_matrix[:1000], name='embedding')
        # sess.run(embedding_var.initializer)

        # config = projector.ProjectorConfig()
        # summary_writer = tf.summary.FileWriter('processed')

        # # add embedding to the config file
        # embedding = config.embeddings.add()
        # embedding.tensor_name = embedding_var.name
        
        # # link this tensor to its metadata file, in this case the first 1000 words of vocab
        # embedding.metadata_path = 'processed/vocab_1000.tsv'

        # # saves a configuration file that TensorBoard will read during startup.
        # projector.visualize_embeddings(summary_writer, config)
        # saver_embed = tf.train.Saver([embedding_var])
        # saver_embed.save(sess, 'processed/model3.ckpt', 1)

def main():
    model = SkipGramModel(VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE, NUM_SAMPLED, LEARNING_RATE)
    model.build_graph()
    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)
    train_model(model, batch_gen, NUM_TRAIN_STEPS, WEIGHTS_FLD)

if __name__ == '__main__':
    main()

================================================
FILE: 2017/examples/05_csv_reader.py
================================================
""" Some people tried to use TextLineReader for assignment 1
but seemed to have problems getting it to work, so here is a short 
script demonstrating the use of CSV reader on the heart dataset.
Note that the heart dataset is originally in txt so I first
converted it to csv to take advantage of the already laid out columns.

You can download heart.csv in the data folder.
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import sys
sys.path.append('..')

import tensorflow as tf

DATA_PATH = 'data/heart.csv'
BATCH_SIZE = 2
N_FEATURES = 9

def batch_generator(filenames):
    """ filenames is the list of files you want to read from. 
    In this case, it contains only heart.csv
    """
    filename_queue = tf.train.string_input_producer(filenames)
    reader = tf.TextLineReader(skip_header_lines=1) # skip the first line in the file
    _, value = reader.read(filename_queue)

    # record_defaults are the default values in case some of our columns are empty
    # This is also to tell tensorflow the format of our data (the type of the decode result)
    # for this dataset, out of 9 feature columns, 
    # 8 of them are floats (some are integers, but to make our features homogeneous, 
    # we consider them floats), and 1 is a string (at position 5)
    # the last column, which corresponds to the label, is an integer

    record_defaults = [[1.0] for _ in range(N_FEATURES)]
    record_defaults[4] = ['']
    record_defaults.append([1])

    # parse one line of the file into its 10 columns
    content = tf.decode_csv(value, record_defaults=record_defaults) 

    # convert the 5th column (Present/Absent) to the binary values 1.0 and 0.0
    content[4] = tf.cond(tf.equal(content[4], tf.constant('Present')), lambda: tf.constant(1.0), lambda: tf.constant(0.0))

    # pack all 9 features into a tensor
    features = tf.stack(content[:N_FEATURES])

    # assign the last column to label
    label = content[-1]

    # minimum number elements in the queue after a dequeue, used to ensure 
    # that the samples are sufficiently mixed
    # I think 10 times the BATCH_SIZE is sufficient
    min_after_dequeue = 10 * BATCH_SIZE

    # the maximum number of elements in the queue
    capacity = 20 * BATCH_SIZE

    # shuffle the data to generate BATCH_SIZE sample pairs
    data_batch, label_batch = tf.train.shuffle_batch([features, label], batch_size=BATCH_SIZE, 
                                        capacity=capacity, min_after_dequeue=min_after_dequeue)

    return data_batch, label_batch
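
# A plain-Python sketch of what the decode step in batch_generator does to one
# line of the file, assuming the 10-column layout described in the comments above
# (9 features, the one at index 4 being the string 'Present' or 'Absent', followed
# by an integer label). The helper name parse_line_sketch is hypothetical. Empty
# fields fall back to the same defaults as record_defaults.
def parse_line_sketch(line, n_features=9, string_col=4):
    fields = line.strip().split(',')
    features = []
    for i in range(n_features):
        if i == string_col:
            # same mapping as the tf.cond above: 'Present' -> 1.0, anything else -> 0.0
            features.append(1.0 if fields[i] == 'Present' else 0.0)
        else:
            features.append(float(fields[i]) if fields[i] else 1.0)
    label = int(fields[n_features]) if fields[n_features] else 1
    return features, label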

def generate_batches(data_batch, label_batch):
    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        for _ in range(10): # generate 10 batches
            features, labels = sess.run([data_batch, label_batch])
            print(features)
        coord.request_stop()
        coord.join(threads)

def main():
    data_batch, label_batch = batch_generator([DATA_PATH])
    generate_batches(data_batch, label_batch)

if __name__ == '__main__':
    main()


================================================
FILE: 2017/examples/05_randomization.py
================================================
""" Examples to demonstrate ops level randomization
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

# Example 1: session is the thing that keeps track of random state
c = tf.random_uniform([], -10, 10, seed=2)

with tf.Session() as sess:
    print(sess.run(c)) # >> 3.57493
    print(sess.run(c)) # >> -5.97319

# Example 2: each new session will start the random state all over again.
c = tf.random_uniform([], -10, 10, seed=2)

with tf.Session() as sess:
    print(sess.run(c)) # >> 3.57493

with tf.Session() as sess:
    print(sess.run(c)) # >> 3.57493

# Example 3: with operation level random seed, each op keeps its own seed.
c = tf.random_uniform([], -10, 10, seed=2)
d = tf.random_uniform([], -10, 10, seed=2)

with tf.Session() as sess:
    print(sess.run(c)) # >> 3.57493
    print(sess.run(d)) # >> 3.57493

# Example 4: graph level random seed
tf.set_random_seed(2)
c = tf.random_uniform([], -10, 10)
d = tf.random_uniform([], -10, 10)

with tf.Session() as sess:
    print(sess.run(c)) # >> 9.12393
    print(sess.run(d)) # >> -4.53404
    

================================================
FILE: 2017/examples/07_basic_filters.py
================================================
"""
Simple examples of convolution to do some basic filters
Also demonstrates the use of TensorFlow data readers.

We will use some popular filters for our image.
It seems to be working with grayscale images, but not with rgb images.
It's probably because I didn't choose the right kernels for rgb images.

kernels for rgb images have dimensions 3 x 3 x 3 x 3
kernels for grayscale images have dimensions 3 x 3 x 1 x 1

Note:
When you call tf.train.string_input_producer,
a tf.train.QueueRunner is added to the graph, which must be run using
e.g. tf.train.start_queue_runners(), or else your session will block forever
waiting on a queue that is never filled.

And to run QueueRunner, you need a coordinator to close your queue for you.
Without a coordinator, your threads will keep on running outside the session and you will get the error:
ERROR:tensorflow:Exception in QueueRunner: Attempted to use a closed Session.

Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu

"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import sys
sys.path.append('..')

from matplotlib import gridspec as gridspec
from matplotlib import pyplot as plt
import tensorflow as tf

import kernels

FILENAME = 'data/friday.jpg'

def read_one_image(filename):
    """ This is just to demonstrate how to open an image in TensorFlow,
    but it's actually a lot easier to use Pillow 
    """
    filename_queue = tf.train.string_input_producer([filename])
    image_reader = tf.WholeFileReader()
    _, image_file = image_reader.read(filename_queue)
    image = tf.image.decode_jpeg(image_file, channels=3)
    image = tf.cast(image, tf.float32) / 256.0 # cast to float to make conv2d work
    return image

def convolve(image, kernels, rgb=True, strides=[1, 3, 3, 1], padding='SAME'):
    images = [image[0]]
    for i, kernel in enumerate(kernels):
        filtered_image = tf.nn.conv2d(image, kernel, strides=strides, padding=padding)[0]
        if i == 2:
            filtered_image = tf.minimum(tf.nn.relu(filtered_image), 255)
        images.append(filtered_image)
    return images

def get_real_images(images):
    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        images = sess.run(images)
        coord.request_stop()
        coord.join(threads)
    return images

def show_images(images, rgb=True):
    gs = gridspec.GridSpec(1, len(images))
    for i, image in enumerate(images):
        plt.subplot(gs[0, i])
        if rgb:
            plt.imshow(image)
        else: 
            image = image.reshape(image.shape[0], image.shape[1])
            plt.imshow(image, cmap='gray')
        plt.axis('off')
    plt.show()

def main():
    rgb = False
    if rgb:
        kernels_list = [kernels.BLUR_FILTER_RGB, kernels.SHARPEN_FILTER_RGB, kernels.EDGE_FILTER_RGB, 
                    kernels.TOP_SOBEL_RGB, kernels.EMBOSS_FILTER_RGB]
    else:
        kernels_list = [kernels.BLUR_FILTER, kernels.SHARPEN_FILTER, kernels.EDGE_FILTER, 
                    kernels.TOP_SOBEL, kernels.EMBOSS_FILTER]

    image = read_one_image(FILENAME)
    if not rgb:
        image = tf.image.rgb_to_grayscale(image)
    image = tf.expand_dims(image, 0) # to make it into a batch of 1 element
    images = convolve(image, kernels_list, rgb)
    images = get_real_images(images)
    show_images(images, rgb)

if __name__ == '__main__':
    main()

================================================
FILE: 2017/examples/07_convnet_mnist.py
================================================
""" Using convolutional net on MNIST dataset of handwritten digit
(http://yann.lecun.com/exdb/mnist/)
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import time 

import tensorflow as tf
import tensorflow.contrib.layers as layers # 'tf' is only an alias, so it cannot be used in an import statement
from tensorflow.examples.tutorials.mnist import input_data

import utils

N_CLASSES = 10

# Step 1: Read in data
# using TF Learn's built in function to load MNIST data to the folder data/mnist
mnist = input_data.read_data_sets("/data/mnist", one_hot=True)

# Step 2: Define parameters for the model
LEARNING_RATE = 0.001
BATCH_SIZE = 128
SKIP_STEP = 10
DROPOUT = 0.75
N_EPOCHS = 1

# Step 3: create placeholders for features and labels
# each image in the MNIST data is of shape 28*28 = 784
# therefore, each image is represented with a 1x784 tensor
# We'll be doing dropout for hidden layer so we'll need a placeholder
# for the dropout probability too
# Use None for shape so we can change the batch_size once we've built the graph
with tf.name_scope('data'):
    X = tf.placeholder(tf.float32, [None, 784], name="X_placeholder")
    Y = tf.placeholder(tf.float32, [None, 10], name="Y_placeholder")

dropout = tf.placeholder(tf.float32, name='dropout')

# Step 4 + 5: create weights + do inference
# the model is conv -> relu -> pool -> conv -> relu -> pool -> fully connected -> softmax

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

with tf.variable_scope('conv1') as scope:
    # first, reshape the image to [BATCH_SIZE, 28, 28, 1] to make it work with tf.nn.conv2d
    images = tf.reshape(X, shape=[-1, 28, 28, 1]) 
    kernel = tf.get_variable('kernel', [5, 5, 1, 32], 
                            initializer=tf.truncated_normal_initializer())
    biases = tf.get_variable('biases', [32],
                        initializer=tf.random_normal_initializer())
    conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding='SAME')
    conv1 = tf.nn.relu(conv + biases, name=scope.name)

    # output is of dimension BATCH_SIZE x 28 x 28 x 32
    # conv1 = layers.conv2d(images, 32, 5, 1, activation_fn=tf.nn.relu, padding='SAME')

with tf.variable_scope('pool1') as scope:
    pool1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                           padding='SAME')

    # output is of dimension BATCH_SIZE x 14 x 14 x 32

with tf.variable_scope('conv2') as scope:
    # similar to conv1, except kernel now is of the size 5 x 5 x 32 x 64
    kernel = tf.get_variable('kernels', [5, 5, 32, 64], 
                        initializer=tf.truncated_normal_initializer())
    biases = tf.get_variable('biases', [64],
                        initializer=tf.random_normal_initializer())
    conv = tf.nn.conv2d(pool1, kernel, strides=[1, 1, 1, 1], padding='SAME')
    conv2 = tf.nn.relu(conv + biases, name=scope.name)

    # output is of dimension BATCH_SIZE x 14 x 14 x 64
    # layers.conv2d(images, 64, 5, 1, activation_fn=tf.nn.relu, padding='SAME')

with tf.variable_scope('pool2') as scope:
    # similar to pool1
    pool2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                            padding='SAME')

    # output is of dimension BATCH_SIZE x 7 x 7 x 64
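
# A small sketch of where the "output is of dimension" comments above come from,
# assuming TensorFlow's documented rule for 'SAME' padding: each spatial dimension
# becomes ceil(input_size / stride), independent of the kernel size.
# The helper name same_padding_output_size is hypothetical.
import math

def same_padding_output_size(input_size, stride):
    return int(math.ceil(float(input_size) / stride))

# conv layers (stride 1) keep 28, pool1 (stride 2) gives 14, pool2 (stride 2) gives 7
assert same_padding_output_size(28, 1) == 28
assert same_padding_output_size(28, 2) == 14
assert same_padding_output_size(14, 2) == 7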

with tf.variable_scope('fc') as scope:
    # use weight of dimension 7 * 7 * 64 x 1024
    input_features = 7 * 7 * 64
    w = tf.get_variable('weights', [input_features, 1024],
                        initializer=tf.truncated_normal_initializer())
    b = tf.get_variable('biases', [1024],
                        initializer=tf.constant_initializer(0.0))

    # reshape pool2 to 2 dimensional
    pool2 = tf.reshape(pool2, [-1, input_features])
    fc = tf.nn.relu(tf.matmul(pool2, w) + b, name='relu')
    
    # pool2 = layers.flatten(pool2)
    # fc = layers.fully_connected(pool2, 1024, tf.nn.relu)

    fc = tf.nn.dropout(fc, dropout, name='relu_dropout')

with tf.variable_scope('softmax_linear') as scope:
    w = tf.get_variable('weights', [1024, N_CLASSES],
                        initializer=tf.truncated_normal_initializer())
    b = tf.get_variable('biases', [N_CLASSES],
                        initializer=tf.random_normal_initializer())
    logits = tf.matmul(fc, w) + b

    
# Step 6: define loss function
# use softmax cross entropy with logits as the loss function
# compute mean cross entropy, softmax is applied internally
with tf.name_scope('loss'):
    entropy = tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=logits)
    loss = tf.reduce_mean(entropy, name='loss')

with tf.name_scope('summaries'):
    tf.summary.scalar('loss', loss)
    tf.summary.histogram('histogram_loss', loss) # no spaces: TensorFlow rewrites illegal summary names
    summary_op = tf.summary.merge_all()

# Step 7: define training op
# using the Adam optimizer with learning rate of LEARNING_RATE to minimize cost
optimizer = tf.train.AdamOptimizer(LEARNING_RATE).minimize(loss, 
                                        global_step=global_step)

utils.make_dir('checkpoints')
utils.make_dir('checkpoints/convnet_mnist')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    # to visualize using TensorBoard
    writer = tf.summary.FileWriter('./graphs/convnet', sess.graph)
    ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_mnist/checkpoint'))
    # if that checkpoint exists, restore from checkpoint
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
    
    initial_step = global_step.eval()

    start_time = time.time()
    n_batches = int(mnist.train.num_examples / BATCH_SIZE)

    total_loss = 0.0
    for index in range(initial_step, n_batches * N_EPOCHS): # train the model n_epochs times
        X_batch, Y_batch = mnist.train.next_batch(BATCH_SIZE)
        _, loss_batch, summary = sess.run([optimizer, loss, summary_op], 
                                feed_dict={X: X_batch, Y:Y_batch, dropout: DROPOUT}) 
        writer.add_summary(summary, global_step=index)
        total_loss += loss_batch
        if (index + 1) % SKIP_STEP == 0:
            print('Average loss at step {}: {:5.1f}'.format(index + 1, total_loss / SKIP_STEP))
            total_loss = 0.0
            saver.save(sess, 'checkpoints/convnet_mnist/mnist-convnet', index)
    
    print("Optimization Finished!") # should be around 0.35 after 25 epochs
    print("Total time: {0} seconds".format(time.time() - start_time))
    
    # test the model
    n_batches = int(mnist.test.num_examples/BATCH_SIZE)
    total_correct_preds = 0
    for i in range(n_batches):
        X_batch, Y_batch = mnist.test.next_batch(BATCH_SIZE)
        # only evaluate here: running the optimizer would train on the test set
        loss_batch, logits_batch = sess.run([loss, logits], 
                                        feed_dict={X: X_batch, Y:Y_batch, dropout: 1.0}) 
        preds = tf.nn.softmax(logits_batch)
        correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y_batch, 1))
        accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))
        total_correct_preds += sess.run(accuracy)   
    
    print("Accuracy {0}".format(total_correct_preds/mnist.test.num_examples))

================================================
FILE: 2017/examples/07_convnet_mnist_starter.py
================================================
""" Using convolutional net on MNIST dataset of handwritten digit
(http://yann.lecun.com/exdb/mnist/)
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import time 

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

import utils

N_CLASSES = 10

# Step 1: Read in data
# using TF Learn's built in function to load MNIST data to the folder data/mnist
mnist = input_data.read_data_sets("/data/mnist", one_hot=True)

# Step 2: Define parameters for the model
LEARNING_RATE = 0.001
BATCH_SIZE = 128
SKIP_STEP = 10
DROPOUT = 0.75
N_EPOCHS = 1

# Step 3: create placeholders for features and labels
# each image in the MNIST data is of shape 28*28 = 784
# therefore, each image is represented with a 1x784 tensor
# We'll be doing dropout for hidden layer so we'll need a placeholder
# for the dropout probability too
# Use None for shape so we can change the batch_size once we've built the graph
with tf.name_scope('data'):
    X = tf.placeholder(tf.float32, [None, 784], name="X_placeholder")
    Y = tf.placeholder(tf.float32, [None, 10], name="Y_placeholder")

dropout = tf.placeholder(tf.float32, name='dropout')

# Step 4 + 5: create weights + do inference
# the model is conv -> relu -> pool -> conv -> relu -> pool -> fully connected -> softmax

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

utils.make_dir('checkpoints')
utils.make_dir('checkpoints/convnet_mnist')

with tf.variable_scope('conv1') as scope:
    # first, reshape the image to [BATCH_SIZE, 28, 28, 1] to make it work with tf.nn.conv2d
    # use the dynamic dimension -1
    images = tf.reshape(X, shape=[-1, 28, 28, 1])
    
    # TO DO

    # create kernel variable of dimension [5, 5, 1, 32]
    # use tf.truncated_normal_initializer()
    
    # TO DO

    # create biases variable of dimension [32]
    # use tf.constant_initializer(0.0)
    
    # TO DO 

    # apply tf.nn.conv2d. strides [1, 1, 1, 1], padding is 'SAME'
    
    # TO DO

    # apply relu on the sum of convolution output and biases
    
    # TO DO 

    # output is of dimension BATCH_SIZE x 28 x 28 x 32

with tf.variable_scope('pool1') as scope:
    # apply max pool with ksize [1, 2, 2, 1], and strides [1, 2, 2, 1], padding 'SAME'
    
    # TO DO

    # output is of dimension BATCH_SIZE x 14 x 14 x 32

with tf.variable_scope('conv2') as scope:
    # similar to conv1, except kernel now is of the size 5 x 5 x 32 x 64
    kernel = tf.get_variable('kernels', [5, 5, 32, 64], 
                        initializer=tf.truncated_normal_initializer())
    biases = tf.get_variable('biases', [64],
                        initializer=tf.random_normal_initializer())
    conv = tf.nn.conv2d(pool1, kernel, strides=[1, 1, 1, 1], padding='SAME')
    conv2 = tf.nn.relu(conv + biases, name=scope.name)

    # output is of dimension BATCH_SIZE x 14 x 14 x 64

with tf.variable_scope('pool2') as scope:
    # similar to pool1
    pool2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                            padding='SAME')

    # output is of dimension BATCH_SIZE x 7 x 7 x 64

with tf.variable_scope('fc') as scope:
    # use weight of dimension 7 * 7 * 64 x 1024
    input_features = 7 * 7 * 64
    
    # create weights and biases

    # TO DO

    # reshape pool2 to 2 dimensional
    pool2 = tf.reshape(pool2, [-1, input_features])

    # apply relu on matmul of pool2 and w + b
    fc = tf.nn.relu(tf.matmul(pool2, w) + b, name='relu')
    
    # TO DO

    # apply dropout
    fc = tf.nn.dropout(fc, dropout, name='relu_dropout')

with tf.variable_scope('softmax_linear') as scope:
    # this you should know. get logits without softmax
    # you need to create weights and biases

    # TO DO

# Step 6: define loss function
# use softmax cross entropy with logits as the loss function
# compute mean cross entropy, softmax is applied internally
with tf.name_scope('loss'):
    # you should know how to do this too
    
    # TO DO

# Step 7: define training op
# using gradient descent with learning rate of LEARNING_RATE to minimize cost
# don't forget to pass in global_step

# TO DO

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    # to visualize using TensorBoard
    writer = tf.summary.FileWriter('./my_graph/mnist', sess.graph)
    ##### You have to create folders to store checkpoints
    ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_mnist/checkpoint'))
    # if that checkpoint exists, restore from checkpoint
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
    
    initial_step = global_step.eval()

    start_time = time.time()
    n_batches = int(mnist.train.num_examples / BATCH_SIZE)

    total_loss = 0.0
    for index in range(initial_step, n_batches * N_EPOCHS): # train the model n_epochs times
        X_batch, Y_batch = mnist.train.next_batch(BATCH_SIZE)
        _, loss_batch = sess.run([optimizer, loss], 
                                feed_dict={X: X_batch, Y:Y_batch, dropout: DROPOUT}) 
        total_loss += loss_batch
        if (index + 1) % SKIP_STEP == 0:
            print('Average loss at step {}: {:5.1f}'.format(index + 1, total_loss / SKIP_STEP))
            total_loss = 0.0
            saver.save(sess, 'checkpoints/convnet_mnist/mnist-convnet', index)
    
    print("Optimization Finished!") # should be around 0.35 after 25 epochs
    print("Total time: {0} seconds".format(time.time() - start_time))
    
    # test the model
    n_batches = int(mnist.test.num_examples/BATCH_SIZE)
    total_correct_preds = 0
    for i in range(n_batches):
        X_batch, Y_batch = mnist.test.next_batch(BATCH_SIZE)
        # evaluation only: do not run the optimizer, and disable dropout (keep_prob = 1.0)
        loss_batch, logits_batch = sess.run([loss, logits], 
                                        feed_dict={X: X_batch, Y:Y_batch, dropout: 1.0}) 
        preds = tf.nn.softmax(logits_batch)
        correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y_batch, 1))
        accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))
        total_correct_preds += sess.run(accuracy)   
    
    print("Accuracy {0}".format(total_correct_preds/mnist.test.num_examples))

================================================
FILE: 2017/examples/09_queue_example.py
================================================
""" Example to demonstrate how to use queues
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf

N_SAMPLES = 1000
NUM_THREADS = 4
# Generating some simple data
# create 1000 random samples, each a 1D array of 4 values drawn from
# a normal distribution with mean 1 and standard deviation 10
data = 10 * np.random.randn(N_SAMPLES, 4) + 1 
# create 1000 random labels of 0 and 1
target = np.random.randint(0, 2, size=N_SAMPLES) 

queue = tf.FIFOQueue(capacity=50, dtypes=[tf.float32, tf.int32], shapes=[[4], []])

enqueue_op = queue.enqueue_many([data, target])
data_sample, label_sample = queue.dequeue()

# create ops that do something with data_sample and label_sample

# create NUM_THREADS threads to do enqueue
qr = tf.train.QueueRunner(queue, [enqueue_op] * NUM_THREADS)
with tf.Session() as sess:
	# create a coordinator, launch the queue runner threads.
	coord = tf.train.Coordinator()
	enqueue_threads = qr.create_threads(sess, coord=coord, start=True)
	try:
			for step in range(100): # do 100 iterations
			if coord.should_stop():
				break
			data_batch, label_batch = sess.run([data_sample, label_sample])
			print(data_batch)
			print(label_batch)
	except Exception as e:
		coord.request_stop(e)
	finally:
		coord.request_stop()
		coord.join(enqueue_threads)

================================================
FILE: 2017/examples/09_tfrecord_example.py
================================================
""" Examples to demonstrate how to write an image file to a TFRecord,
and how to read a TFRecord file using TFRecordReader.
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import sys
sys.path.append('..')

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# image supposed to have shape: 480 x 640 x 3 = 921600
IMAGE_PATH = 'data/'

def _int64_feature(value):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def get_image_binary(filename):
    """ You can read in the image using tensorflow too, but it's a drag
        since you have to create graphs. It's much easier using Pillow and NumPy
    """
    image = Image.open(filename)
    image = np.asarray(image, np.uint8)
    shape = np.array(image.shape, np.int32)
    return shape.tobytes(), image.tobytes() # convert image to raw data bytes in the array.

def write_to_tfrecord(label, shape, binary_image, tfrecord_file):
    """ This example is to write a sample to TFRecord file. If you want to write
    more samples, just use a loop.
    """
    writer = tf.python_io.TFRecordWriter(tfrecord_file)
    # write label, shape, and image content to the TFRecord file
    example = tf.train.Example(features=tf.train.Features(feature={
                'label': _int64_feature(label),
                'shape': _bytes_feature(shape),
                'image': _bytes_feature(binary_image)
                }))
    writer.write(example.SerializeToString())
    writer.close()

def write_tfrecord(label, image_file, tfrecord_file):
    shape, binary_image = get_image_binary(image_file)
    write_to_tfrecord(label, shape, binary_image, tfrecord_file)

def read_from_tfrecord(filenames):
    tfrecord_file_queue = tf.train.string_input_producer(filenames, name='queue')
    reader = tf.TFRecordReader()
    _, tfrecord_serialized = reader.read(tfrecord_file_queue)

    # shape and image are stored as bytes and label as int64; bytes features
    # could instead be stored as int64 or float values in a serialized
    # tf.Example protobuf.
    tfrecord_features = tf.parse_single_example(tfrecord_serialized,
                        features={
                            'label': tf.FixedLenFeature([], tf.int64),
                            'shape': tf.FixedLenFeature([], tf.string),
                            'image': tf.FixedLenFeature([], tf.string),
                        }, name='features')
    # image was saved as uint8, so we have to decode as uint8.
    image = tf.decode_raw(tfrecord_features['image'], tf.uint8)
    shape = tf.decode_raw(tfrecord_features['shape'], tf.int32)
    # the image tensor is flattened out, so we have to reconstruct the shape
    image = tf.reshape(image, shape)
    label = tfrecord_features['label']
    return label, shape, image

def read_tfrecord(tfrecord_file):
    label, shape, image = read_from_tfrecord([tfrecord_file])

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        label, image, shape = sess.run([label, image, shape])
        coord.request_stop()
        coord.join(threads)
    print(label)
    print(shape)
    plt.imshow(image)
    plt.show() 

def main():
    # assume the image has the label Chihuahua, which corresponds to class number 1
    label = 1 
    image_file = IMAGE_PATH + 'friday.jpg'
    tfrecord_file = IMAGE_PATH + 'friday.tfrecord'
    write_tfrecord(label, image_file, tfrecord_file)
    read_tfrecord(tfrecord_file)

if __name__ == '__main__':
    main()


================================================
FILE: 2017/examples/11_char_rnn_gist.py
================================================
""" A clean, no_frills character-level generative language model.
Created by Danijar Hafner (danijar.com), edited by Chip Huyen
for the class CS 20SI: "TensorFlow for Deep Learning Research"

Based on Andrej Karpathy's blog: 
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import sys
sys.path.append('..')

import time

import tensorflow as tf

import utils

DATA_PATH = 'data/arvix_abstracts.txt'
HIDDEN_SIZE = 200
BATCH_SIZE = 64
NUM_STEPS = 50
SKIP_STEP = 40
TEMPRATURE = 0.7
LR = 0.003
LEN_GENERATED = 300

def vocab_encode(text, vocab):
    return [vocab.index(x) + 1 for x in text if x in vocab]

def vocab_decode(array, vocab):
    return ''.join([vocab[x - 1] for x in array])

def read_data(filename, vocab, window=NUM_STEPS, overlap=NUM_STEPS//2):
    for text in open(filename):
        text = vocab_encode(text, vocab)
        for start in range(0, len(text) - window, overlap):
            chunk = text[start: start + window]
            chunk += [0] * (window - len(chunk))
            yield chunk

def read_batch(stream, batch_size=BATCH_SIZE):
    batch = []
    for element in stream:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final partial batch, but never an empty one
        yield batch

def create_rnn(seq, hidden_size=HIDDEN_SIZE):
    cell = tf.contrib.rnn.GRUCell(hidden_size)
    in_state = tf.placeholder_with_default(
            cell.zero_state(tf.shape(seq)[0], tf.float32), [None, hidden_size])
    # compute the real (unpadded) length of each sequence
    # all seq are padded to the same length, which is NUM_STEPS
    length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)
    output, out_state = tf.nn.dynamic_rnn(cell, seq, length, in_state)
    return output, in_state, out_state

def create_model(seq, temp, vocab, hidden=HIDDEN_SIZE):
    seq = tf.one_hot(seq, len(vocab))
    output, in_state, out_state = create_rnn(seq, hidden)
    # fully_connected is syntactic sugar for tf.matmul(output, w) + b
    # it will create w and b for us
    logits = tf.contrib.layers.fully_connected(output, len(vocab), None)
    loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=logits[:, :-1], labels=seq[:, 1:]))
    # sample the next character from Maxwell-Boltzmann Distribution with temperature temp
    # it works equally well without tf.exp
    sample = tf.multinomial(tf.exp(logits[:, -1] / temp), 1)[:, 0] 
    return loss, sample, in_state, out_state

def training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state):
    saver = tf.train.Saver()
    start = time.time()
    with tf.Session() as sess:
        writer = tf.summary.FileWriter('graphs/gist', sess.graph)
        sess.run(tf.global_variables_initializer())
        
        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/arvix/checkpoint'))
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)
        
        iteration = global_step.eval()
        for batch in read_batch(read_data(DATA_PATH, vocab)):
            batch_loss, _ = sess.run([loss, optimizer], {seq: batch})
            if (iteration + 1) % SKIP_STEP == 0:
                print('Iter {}. \n    Loss {}. Time {}'.format(iteration, batch_loss, time.time() - start))
                online_inference(sess, vocab, seq, sample, temp, in_state, out_state)
                start = time.time()
                saver.save(sess, 'checkpoints/arvix/char-rnn', iteration)
            iteration += 1

def online_inference(sess, vocab, seq, sample, temp, in_state, out_state, seed='T'):
    """ Generate sequence one character at a time, based on the previous character
    """
    sentence = seed
    state = None
    for _ in range(LEN_GENERATED):
        batch = [vocab_encode(sentence[-1], vocab)]
        feed = {seq: batch, temp: TEMPRATURE}
        # for the first decoder step, the state is None
        if state is not None:
            feed.update({in_state: state})
        index, state = sess.run([sample, out_state], feed)
        sentence += vocab_decode(index, vocab)
    print(sentence)

def main():
    vocab = (
            " $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "\\^_abcdefghijklmnopqrstuvwxyz{|}")
    seq = tf.placeholder(tf.int32, [None, None])
    temp = tf.placeholder(tf.float32)
    loss, sample, in_state, out_state = create_model(seq, temp, vocab)
    global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
    optimizer = tf.train.AdamOptimizer(LR).minimize(loss, global_step=global_step)
    utils.make_dir('checkpoints')
    utils.make_dir('checkpoints/arvix')
    training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state)
    
if __name__ == '__main__':
    main()

================================================
FILE: 2017/examples/autoencoder/autoencoder.py
================================================
import tensorflow as tf

from layers import *

def encoder(input):
    # Create a conv network with 3 conv layers and 1 FC layer
    # Conv 1: filter: [3, 3, 1], stride: [2, 2], relu
    
    # Conv 2: filter: [3, 3, 8], stride: [2, 2], relu
    
    # Conv 3: filter: [3, 3, 8], stride: [2, 2], relu
    
    # FC: output_dim: 100, no non-linearity
    raise NotImplementedError

def decoder(input):
    # Create a deconv network with 1 FC layer and 3 deconv layers
    # FC: output dim: 128, relu
    
    # Reshape to [batch_size, 4, 4, 8]
    
    # Deconv 1: filter: [3, 3, 8], stride: [2, 2], relu
    
    # Deconv 2: filter: [8, 8, 1], stride: [2, 2], padding: valid, relu
    
    # Deconv 3: filter: [7, 7, 1], stride: [1, 1], padding: valid, sigmoid
    raise NotImplementedError

def autoencoder(input_shape):
    # Define place holder with input shape

    # Define variable scope for autoencoder
    with tf.variable_scope('autoencoder') as scope:
        # Pass input to encoder to obtain encoding
        
        # Pass encoding into decoder to obtain reconstructed image
        
        # Return input image (placeholder) and reconstructed image
        pass


================================================
FILE: 2017/examples/autoencoder/layer_utils.py
================================================
import tensorflow as tf

def get_deconv2d_output_dims(input_dims, filter_dims, stride_dims, padding):
    # Returns the full output shape [batch_size, height, width, num_channels_out]
    # of a deconvolution layer.
    batch_size, input_h, input_w, num_channels_in = input_dims
    filter_h, filter_w, num_channels_out  = filter_dims
    stride_h, stride_w = stride_dims

    # Compute the height in the output, based on the padding.
    if padding == 'SAME':
      out_h = input_h * stride_h
    elif padding == 'VALID':
      out_h = (input_h - 1) * stride_h + filter_h

    # Compute the width in the output, based on the padding.
    if padding == 'SAME':
      out_w = input_w * stride_w
    elif padding == 'VALID':
      out_w = (input_w - 1) * stride_w + filter_w

    return [batch_size, out_h, out_w, num_channels_out]


================================================
FILE: 2017/examples/autoencoder/layers.py
================================================
import tensorflow as tf

from layer_utils import get_deconv2d_output_dims

def conv(input, name, filter_dims, stride_dims, padding='SAME',
         non_linear_fn=tf.nn.relu):
    input_dims = input.get_shape().as_list()
    assert(len(input_dims) == 4) # batch_size, height, width, num_channels_in
    assert(len(filter_dims) == 3) # height, width and num_channels out
    assert(len(stride_dims) == 2) # stride height and width

    num_channels_in = input_dims[-1]
    filter_h, filter_w, num_channels_out = filter_dims
    stride_h, stride_w = stride_dims

    # Define a variable scope for the conv layer
    with tf.variable_scope(name) as scope:
        # Create filter weight variable
        
        # Create bias variable
        
        # Define the convolution flow graph
        
        # Add bias to conv output
        
        # Apply non-linearity (if asked) and return output
        pass

def deconv(input, name, filter_dims, stride_dims, padding='SAME',
           non_linear_fn=tf.nn.relu):
    input_dims = input.get_shape().as_list()
    assert(len(input_dims) == 4) # batch_size, height, width, num_channels_in
    assert(len(filter_dims) == 3) # height, width and num_channels out
    assert(len(stride_dims) == 2) # stride height and width

    num_channels_in = input_dims[-1]
    filter_h, filter_w, num_channels_out = filter_dims
    stride_h, stride_w = stride_dims
    # Let's step into this function
    output_dims = get_deconv2d_output_dims(input_dims,
                                           filter_dims,
                                           stride_dims,
                                           padding)

    # Define a variable scope for the deconv layer
    with tf.variable_scope(name) as scope:
        # Create filter weight variable
        # Note that num_channels_out and in positions are flipped for deconv.
        
        # Create bias variable
        
        # Define the deconv flow graph
        
        # Add bias to deconv output
        
        # Apply non-linearity (if asked) and return output
        pass

def max_pool(input, name, filter_dims, stride_dims, padding='SAME'):
    assert(len(filter_dims) == 2) # filter height and width
    assert(len(stride_dims) == 2) # stride height and width

    filter_h, filter_w = filter_dims
    stride_h, stride_w = stride_dims
    
    # Define the max pool flow graph and return output
    pass

def fc(input, name, out_dim, non_linear_fn=tf.nn.relu):
    assert(type(out_dim) == int)

    # Define a variable scope for the FC layer
    with tf.variable_scope(name) as scope:
        input_dims = input.get_shape().as_list()
        # the input to the fc layer should be flattened
        if len(input_dims) == 4:
            # for eg. the output of a conv layer
            batch_size, input_h, input_w, num_channels = input_dims
            # ignore the batch dimension
            in_dim = input_h * input_w * num_channels
            flat_input = tf.reshape(input, [batch_size, in_dim])
        else:
            in_dim = input_dims[-1]
            flat_input = input

        # Create weight variable
        
        # Create bias variable
        
        # Define FC flow graph
        
        # Apply non-linearity (if asked) and return output
        pass


================================================
FILE: 2017/examples/autoencoder/train.py
================================================
import tensorflow as tf

from utils import *
from autoencoder import *

batch_size = 100
batch_shape = (batch_size, 28, 28, 1)
num_visualize = 10

lr = 0.01
num_epochs = 50

def calculate_loss(original, reconstructed):
    return tf.div(tf.reduce_sum(tf.square(tf.subtract(reconstructed,
                                                 original))), 
                  tf.constant(float(batch_size)))

def train(dataset):
    input_image, reconstructed_image = autoencoder(batch_shape)
    loss = calculate_loss(input_image, reconstructed_image)
    optimizer = tf.train.GradientDescentOptimizer(lr).minimize(loss)

    init = tf.global_variables_initializer()
    with tf.Session() as session:
        session.run(init)

        dataset_size = len(dataset.train.images)
        print("Dataset size: {}".format(dataset_size))
        num_iters = (num_epochs * dataset_size) // batch_size
        print("Num iters: {}".format(num_iters))
        for step in range(num_iters):
            input_batch  = get_next_batch(dataset.train, batch_size)
            loss_val,  _ = session.run([loss, optimizer], 
                                       feed_dict={input_image: input_batch})
            if step % 1000 == 0:
                print("Loss at step {}: {}".format(step, loss_val))

        test_batch = get_next_batch(dataset.test, batch_size)
        reconstruction = session.run(reconstructed_image,
                                     feed_dict={input_image: test_batch})
        visualize(test_batch, reconstruction, num_visualize)

if __name__ == '__main__':
    dataset = load_dataset()
    train(dataset)
    

================================================
FILE: 2017/examples/autoencoder/utils.py
================================================
import os
import sys
import tensorflow
import numpy as np

import matplotlib
matplotlib.use('TKAgg')
from matplotlib import pyplot as plt

from tensorflow.examples.tutorials.mnist import input_data

mnist_image_shape = [28, 28, 1]

def load_dataset():
    return input_data.read_data_sets('MNIST_data')

def get_next_batch(dataset, batch_size):
    # dataset should be mnist.(train/val/test)
    batch, _ = dataset.next_batch(batch_size)
    batch_shape = [batch_size] + mnist_image_shape
    return np.reshape(batch, batch_shape)

def visualize(_original, _reconstructions, num_visualize):
    vis_folder = './vis/'
    if not os.path.exists(vis_folder):
          os.makedirs(vis_folder)

    original = _original[:num_visualize]
    reconstructions = _reconstructions[:num_visualize]
    
    count = 1
    for (orig, rec) in zip(original, reconstructions):
        orig = np.reshape(orig, (mnist_image_shape[0],
                                 mnist_image_shape[1]))
        rec = np.reshape(rec, (mnist_image_shape[0],
                               mnist_image_shape[1]))
        f, ax = plt.subplots(1,2)
        ax[0].imshow(orig, cmap='gray')
        ax[1].imshow(rec, cmap='gray')
        plt.savefig(vis_folder + "test_%d.png" % count)
        count += 1


================================================
FILE: 2017/examples/cgru/README.md
================================================
These are the files used to explain convolutional GRU (CGRU) by Lukasz Kaiser at Google Brain. The accompanying slides can be found at http://web.stanford.edu/class/cs20si/lectures/slides_12.pdf
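
To get a feel for what the CGRU layer computes before reading `my_layers.py`, here is a minimal NumPy sketch of a single convolutional GRU update. It mirrors the structure of `conv_gru` and `saturating_sigmoid` from that file, but it is only an illustration under stated assumptions: the helper `conv1d_same`, the `params` dictionary, and the random weights are hypothetical and are not part of the original code, which uses `tf.layers.conv1d`.

```python
import numpy as np


def saturating_sigmoid(x):
    # 1.2 * sigmoid(x) - 0.1, clipped to [0, 1], as in my_layers.py
    y = 1.0 / (1.0 + np.exp(-x))
    return np.clip(1.2 * y - 0.1, 0.0, 1.0)


def conv1d_same(x, kernel, bias):
    """Hypothetical helper: 1D convolution with zero 'same' padding, stride 1.

    x: [length, channels], kernel: [width, channels, filters], bias: [filters].
    """
    width = kernel.shape[0]
    pad_left = (width - 1) // 2
    pad_right = width - 1 - pad_left
    padded = np.pad(x, ((pad_left, pad_right), (0, 0)), mode="constant")
    out = np.empty((x.shape[0], kernel.shape[2]))
    for t in range(x.shape[0]):
        window = padded[t:t + width]  # [width, channels]
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return out


def conv_gru_step(x, params):
    # The number of filters must equal the number of input channels,
    # because the gate mixes x and the candidate elementwise.
    reset = saturating_sigmoid(conv1d_same(x, *params["reset"]))
    gate = saturating_sigmoid(conv1d_same(x, *params["gate"]))
    candidate = np.tanh(conv1d_same(reset * x, *params["candidate"]))
    return gate * x + (1.0 - gate) * candidate


rng = np.random.RandomState(0)
length, channels, width = 6, 4, 3


def make_params(bias_start):
    # Assumed initialization for the sketch: small random kernel, constant bias.
    kernel = 0.1 * rng.randn(width, channels, channels)
    return kernel, np.full(channels, bias_start)


params = {"reset": make_params(1.0),
          "gate": make_params(1.0),
          "candidate": make_params(0.0)}
x = rng.randn(length, channels)
y = conv_gru_step(x, params)
print(y.shape)  # (6, 4): same shape as the input
```

The gate and reset biases start at 1.0, as in `conv_gru`, so the layer initially leans toward copying its input. When the gate saturates at 1 the output equals `x` exactly, which is what the saturating sigmoid makes possible.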


================================================
FILE: 2017/examples/cgru/custom_getter.py
================================================
# From [github]/tensorflow/python/kernel_tests/variable_scope_test.py
  def testGetterThatCreatesTwoVariablesAndSumsThem(self):

    def custom_getter(getter, name, *args, **kwargs):
      g_0 = getter("%s/0" % name, *args, **kwargs)
      g_1 = getter("%s/1" % name, *args, **kwargs)
      with tf.name_scope("custom_getter"):
        return g_0 + g_1  # or g_0 * const / ||g_0|| or anything you want

    with variable_scope.variable_scope("scope", custom_getter=custom_getter):
      v = variable_scope.get_variable("v", [1, 2, 3])
      # Or a full model if you wish. OO layers are ok.

    self.assertEqual([1, 2, 3], v.get_shape())
    true_vars = variables_lib.trainable_variables()
    self.assertEqual(2, len(true_vars))
    self.assertEqual("scope/v/0:0", true_vars[0].name)
    self.assertEqual("scope/v/1:0", true_vars[1].name)
    self.assertEqual("custom_getter/add:0", v.name)
    with self.test_session() as sess:
      variables_lib.global_variables_initializer().run()
      np_vars, np_v = sess.run([true_vars, v])
      self.assertAllClose(np_v, sum(np_vars))


================================================
FILE: 2017/examples/cgru/data_reader.py
================================================
import tensorflow as tf


def examples_queue(data_sources, data_fields_to_features, training,
                   data_items_to_decoders=None, data_items_to_decode=None):
  """Construct a queue of training or evaluation examples.

  This function will create a reader from files given by data_sources,
  then enqueue the tf.Examples from these files, shuffling if training
  is true, and finally parse these tf.Examples to tensors.

  The dictionary data_fields_to_features for an image dataset can be this:

  data_fields_to_features = {
    'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
    'image/format': tf.FixedLenFeature((), tf.string, default_value='raw'),
    'image/class/label': tf.FixedLenFeature(
        [1], tf.int64, default_value=tf.zeros([1], dtype=tf.int64)),
  }

  and for a simple algorithmic dataset with variable-length data it is this:

  data_fields_to_features = {
    'inputs': tf.VarLenFeature(tf.int64),
    'targets': tf.VarLenFeature(tf.int64),
  }

  The data_items_to_decoders dictionary argument can be left as None if there
  is no decoding to be performed. But, e.g. for images, it should be set so that
  the images are decoded from the features, e.g., like this for MNIST:

  data_items_to_decoders = {
    'image': tfexample_decoder.Image(
      image_key = 'image/encoded',
      format_key = 'image/format',
      shape=[28, 28],
      channels=1),
    'label': tfexample_decoder.Tensor('image/class/label'),
  }

  These arguments are compatible with the use of tf.contrib.slim.data module,
  see there for more documentation.

  Args:
    data_sources: a list or tuple of sources from which the data will be read,
      for example [/path/to/train@128, /path/to/train2*, /tmp/.../train3*]
    data_fields_to_features: a dictionary from data fields in the data sources
      to features, such as tf.VarLenFeature(tf.int64), see above for examples.
    training: a Boolean, whether to read for training or evaluation.
    data_items_to_decoders: a dictionary mapping data items (that will be
      in the returned result) to decoders that will decode them using features
      defined in data_fields_to_features; see above for examples. By default
      (if this is None), we grab the tensor from every feature.
    data_items_to_decode: a subset of data items that will be decoded;
      by default (if this is None), we decode all items.

  Returns:
    A dictionary mapping each data_field to a corresponding 1D int64 tensor
    read from the created queue.

  Raises:
    ValueError: if no files are found with the provided data_prefix or no data
      fields were provided.
  """
  with tf.name_scope("examples_queue"):
    # Read serialized examples using slim parallel_reader.
    _, example_serialized = tf.contrib.slim.parallel_reader.parallel_read(
        data_sources, tf.TFRecordReader, shuffle=training,
        num_readers=4 if training else 1)

    if data_items_to_decoders is None:
      data_items_to_decoders = {
          field: tf.contrib.slim.tfexample_decoder.Tensor(field)
          for field in data_fields_to_features
      }

    decoder = tf.contrib.slim.tfexample_decoder.TFExampleDecoder(
        data_fields_to_features, data_items_to_decoders)

    if data_items_to_decode is None:
      data_items_to_decode = data_items_to_decoders.keys()

    decoded = decoder.decode(example_serialized, items=data_items_to_decode)
    return {field: tensor
            for (field, tensor) in zip(data_items_to_decode, decoded)}


def batch_examples(examples, batch_size, bucket_boundaries=None):
  """Given a queue of examples, create batches of examples with similar lengths.

  We assume that examples is a dictionary with string keys and tensor values,
  possibly coming from a queue, e.g., constructed by examples_queue above.
  Each tensor in examples is assumed to be 1D. We will put tensors of similar
  length into batches together. We return a dictionary with the same keys as
  examples, and with values being batches of size batch_size. If elements have
  different lengths, they are padded with 0s. This function is based on
  tf.contrib.training.bucket_by_sequence_length so see there for details.

  For example, if examples is a queue containing [1, 2, 3] and [4], then
  this function with batch_size=2 will return a batch [[1, 2, 3], [4, 0, 0]].

  Args:
    examples: a dictionary with string keys and 1D tensor values.
    batch_size: a python integer or a scalar int32 tensor.
    bucket_boundaries: a list of integers for the boundaries that will be
      used for bucketing; see tf.contrib.training.bucket_by_sequence_length
      for more details; if None, we create a default set of buckets.

  Returns:
    A dictionary with the same keys as examples and with values being batches
    of examples padded with 0s, i.e., [batch_size x length] tensors.
  """
  # Create default buckets if none were provided.
  if bucket_boundaries is None:
    # Small buckets -- go in steps of 8 until 64.
    small_buckets = [8 * (i + 1) for i in range(8)]
    # Medium buckets -- go in steps of 32 until 256.
    medium_buckets = [32 * (i + 3) for i in range(6)]
    # Large buckets -- go in steps of 128 until maximum of 1024.
    large_buckets = [128 * (i + 3) for i in range(6)]
    # By default use the above 20 bucket boundaries (21 queues in total).
    bucket_boundaries = small_buckets + medium_buckets + large_buckets
  with tf.name_scope("batch_examples"):
    # The queue to bucket on will be chosen based on maximum length.
    max_length = 0
    for v in examples.values():  # We assume 0-th dimension is the length.
      max_length = tf.maximum(max_length, tf.shape(v)[0])
    (_, outputs) = tf.contrib.training.bucket_by_sequence_length(
        max_length, examples, batch_size, bucket_boundaries,
        capacity=2 * batch_size, dynamic_pad=True)
    return outputs
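
# Illustration (not part of the original file): a minimal pure-Python sketch of
# the default bucket boundaries built in batch_examples above, plus a
# hypothetical helper `bucket_index` showing which bucket a sequence length
# would land in. It assumes half-open buckets [lower, upper), which is how
# tf.contrib.training.bucket_by_sequence_length is understood to assign
# lengths; treat that as an assumption rather than a guarantee.

```python
import bisect


def default_bucket_boundaries():
    """Rebuild the 20 default boundaries used when bucket_boundaries is None."""
    small = [8 * (i + 1) for i in range(8)]      # 8, 16, ..., 64
    medium = [32 * (i + 3) for i in range(6)]    # 96, 128, ..., 256
    large = [128 * (i + 3) for i in range(6)]    # 384, 512, ..., 1024
    return small + medium + large


def bucket_index(length, boundaries):
    """Hypothetical helper: count the boundaries that are <= length.

    With half-open buckets this is the index of the bucket (queue) that a
    sequence of the given length falls into; index len(boundaries) is the
    overflow bucket for anything at or above the last boundary.
    """
    return bisect.bisect_right(boundaries, length)


boundaries = default_bucket_boundaries()
print(len(boundaries))                 # number of boundaries
print(bucket_index(3, boundaries))     # a very short sequence
print(bucket_index(100, boundaries))   # a medium-length sequence
print(bucket_index(5000, boundaries))  # longer than every boundary
```

# Twenty boundaries give twenty-one buckets, matching the "21 queues in total"
# comment in batch_examples.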


================================================
FILE: 2017/examples/cgru/my_layers.py
================================================
import tensorflow as tf


def saturating_sigmoid(x):
  """Saturating sigmoid: 1.2 * sigmoid(x) - 0.1 cut to [0, 1]."""
  with tf.name_scope("saturating_sigmoid", values=[x]):
    y = tf.sigmoid(x)
    return tf.minimum(1.0, tf.maximum(0.0, 1.2 * y - 0.1))
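
# Illustration (not part of the original file): a scalar, pure-Python sketch of
# the same formula, to show where the function actually saturates. The name
# `saturating_sigmoid_scalar` is hypothetical. Unlike a plain sigmoid, the
# output reaches exactly 0.0 and 1.0 for moderately large |x| (from about
# |x| >= ln(11), roughly 2.4, onward).

```python
import math


def saturating_sigmoid_scalar(x):
    """Scalar analogue: 1.2 * sigmoid(x) - 0.1, cut to [0, 1]."""
    y = 1.0 / (1.0 + math.exp(-x))
    return min(1.0, max(0.0, 1.2 * y - 0.1))


for x in (-3.0, -2.0, 0.0, 2.0, 3.0):
    print(x, saturating_sigmoid_scalar(x))
```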


def embedding(x, vocab_size, dense_size, name=None, reuse=None):
  """Embed integer ids x (e.g., int64) into dense vectors of size dense_size."""
  with tf.variable_scope(name, default_name="embedding",
                         values=[x], reuse=reuse):
    embedding_var = tf.get_variable("kernel", [vocab_size, dense_size])
    return tf.gather(embedding_var, x)


def conv_gru(x, kernel_size, filters, padding="same", dilation_rate=1,
             name=None, reuse=None):
  """Convolutional GRU in 1 dimension."""
  # Let's make a shorthand for conv call first.
  def do_conv(args, name, bias_start, padding):
    return tf.layers.conv1d(args, filters, kernel_size,
                padding=padding, dilation_rate=dilation_rate,
                bias_initializer=tf.constant_initializer(bias_start), name=name)
  # Here comes the GRU gate.
  with tf.variable_scope(name, default_name="conv_gru",
                         values=[x], reuse=reuse):
    reset = saturating_sigmoid(do_conv(x, "reset", 1.0, padding))
    gate = saturating_sigmoid(do_conv(x, "gate", 1.0, padding))
    candidate = tf.tanh(do_conv(reset * x, "candidate", 0.0, padding))
    return gate * x + (1 - gate) * candidate
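
# Illustration (not part of the original file): a toy, single-channel,
# pure-Python sketch of the gating structure in conv_gru above. The helpers
# `conv1d_same`, `sat_sigmoid` and `conv_gru_1ch` are hypothetical, the kernels
# are passed in by hand instead of being learned variables, and there is no
# batch or channel dimension. It only mirrors the data flow: reset and gate
# use bias 1.0, the candidate uses bias 0.0, and the output mixes the old
# state with the candidate.

```python
import math


def conv1d_same(xs, kernel, bias):
    """Single-channel 1D cross-correlation with zero 'same' padding (odd kernel)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(xs) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
            for i in range(len(xs))]


def sat_sigmoid(v):
    """1.2 * sigmoid(v) - 0.1, cut to [0, 1]."""
    return min(1.0, max(0.0, 1.2 / (1.0 + math.exp(-v)) - 0.1))


def conv_gru_1ch(xs, k_reset, k_gate, k_cand):
    """One toy conv-GRU step over a 1D state xs."""
    reset = [sat_sigmoid(v) for v in conv1d_same(xs, k_reset, 1.0)]
    gate = [sat_sigmoid(v) for v in conv1d_same(xs, k_gate, 1.0)]
    reset_x = [r * x for r, x in zip(reset, xs)]
    cand = [math.tanh(v) for v in conv1d_same(reset_x, k_cand, 0.0)]
    # Same mixing rule as the last line of conv_gru.
    return [g * x + (1.0 - g) * c for g, x, c in zip(gate, xs, cand)]


state = [0.5, -1.0, 2.0, 0.0]
zero_kernel = [0.0, 0.0, 0.0]
out = conv_gru_1ch(state, zero_kernel, zero_kernel, zero_kernel)
print(out)
```

# With all-zero kernels only the biases act: the gate is sat_sigmoid(1.0)
# everywhere and the candidate is tanh(0) = 0, so each output is just the old
# state scaled by the gate value.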


================================================
FILE: 2017/examples/cgru/neural_gpu_v3.py
================================================
def neural_gpu(features, hparams, name=None):
  """The core Neural GPU."""
  with tf.variable_scope(name, "neural_gpu"):
    inputs = features["inputs"]
    emb_inputs = common_layers.embedding(
        inputs, hparams.vocab_size, hparams.hidden_size)

    def step(state, inp):
      x = tf.nn.dropout(state, 1.0 - hparams.dropout)
      for layer in xrange(hparams.num_hidden_layers):
        x = common_layers.conv_gru(
            x, hparams.kernel_size, hparams.hidden_size, name="cgru_%d" % layer)
      return tf.where(tf.equal(inp, 0), state, x)  # No-op where inp is padding.

    final = tf.foldl(step, tf.transpose(inputs, [1, 0]),
                     initializer=emb_inputs,
                     parallel_iterations=1, swap_memory=True)
    return common_layers.conv(final, hparams.vocab_size, 3, padding="same")


def mixed_curriculum(inputs, hparams):
  """Mixed curriculum: skip too-long sequences, except with some probability."""
  with tf.name_scope("mixed_curriculum"):
    inputs_length = tf.to_float(tf.shape(inputs)[1])
    used_length = tf.cond(tf.less(tf.random_uniform([]),
                                  hparams.curriculum_mixing_probability),
                          lambda: tf.constant(0.0),
                          lambda: inputs_length)
    step = tf.to_float(tf.contrib.framework.get_global_step())
    relative_step = step / hparams.curriculum_lengths_per_step
    return used_length - hparams.curriculum_min_length > relative_step
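
# Illustration (not part of the original file): a pure-Python sketch of the
# skip decision made by mixed_curriculum for a single batch. The function name
# `should_skip` and the `draw` argument (standing in for the
# tf.random_uniform([]) sample) are hypothetical; the default values are
# copied from curriculum_params1 in this file.

```python
def should_skip(length, global_step, draw,
                mixing_probability=0.1, lengths_per_step=1000.0,
                min_length=10):
    """Return True if a batch of the given length would be skipped.

    With probability `mixing_probability` the length is treated as 0, so the
    batch is never skipped; otherwise it is skipped while it is longer than
    min_length + global_step / lengths_per_step.
    """
    used_length = 0.0 if draw < mixing_probability else float(length)
    relative_step = global_step / lengths_per_step
    return used_length - min_length > relative_step


print(should_skip(50, 0, draw=0.5))      # long batch early in training
print(should_skip(50, 50000, draw=0.5))  # same batch much later
print(should_skip(50, 0, draw=0.05))     # mixing branch taken
print(should_skip(8, 0, draw=0.5))       # already below the minimum length
```

# Under these defaults the admitted length grows by one for every 1000
# training steps.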


def neural_gpu_curriculum(features, hparams, mode):
  """The Neural GPU model with curriculum."""
  with tf.name_scope("neural_gpu_with_curriculum"):
    inputs = features["inputs"]
    is_training = mode == tf.contrib.learn.ModeKeys.TRAIN
    should_skip = tf.logical_and(is_training, mixed_curriculum(inputs, hparams))
    final_shape = tf.concat([tf.shape(inputs),
                             tf.constant([hparams.vocab_size])], axis=0)
    outputs = tf.cond(should_skip,
                      lambda: tf.zeros(final_shape),
                      lambda: neural_gpu(features, hparams))
    return outputs, should_skip


def basic_params1():
  """A set of basic hyperparameters."""
  return tf.contrib.training.HParams(
      batch_size=32,
      num_hidden_layers=4,
      kernel_size=3,
      hidden_size=64,
      vocab_size=256,
      dropout=0.2,
      clip_grad_norm=2.0,
      initializer="orthogonal",
      initializer_gain=1.5,
      label_smoothing=0.1,
      optimizer="Adam",
      optimizer_adam_epsilon=1e-4,
      optimizer_momentum_momentum=0.9,
      max_train_length=512,
      learning_rate_decay_scheme="none",
      learning_rate_warmup_steps=100,
      learning_rate=0.1)


def curriculum_params1():
  """Set of hyperparameters with curriculum settings."""
  hparams = basic_params1()
  hparams.add_hparam("curriculum_mixing_probability", 0.1)
  hparams.add_hparam("curriculum_lengths_per_step", 1000.0)
  hparams.add_hparam("curriculum_min_length", 10)
  return hparams


================================================
FILE: 2017/examples/data/arvix_abstracts.txt
================================================
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. Standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase the ratio of the number of parameters to computation up to exponentially. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that compressive MBN combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding an effective and efficient method for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
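The rank-based selection strategy described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not give the decay parameterisation, so the `selection_pressure` parameter (the approximate ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) and both function names are assumptions.

```python
import numpy as np

def rank_selection_probs(latest_losses, selection_pressure=100.0):
    """Selection probability that decays exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest known loss.
    """
    losses = np.asarray(latest_losses, dtype=float)
    n = losses.size
    order = np.argsort(-losses)          # indices sorted from highest to lowest loss
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)          # ranks[i] is the rank of datapoint i
    # Assumed parameterisation: weight proportional to exp(-log(s) * rank / n).
    weights = np.exp(-np.log(selection_pressure) * ranks / n)
    return weights / weights.sum()

def sample_batch(latest_losses, batch_size, rng, selection_pressure=100.0):
    """Draw a batch of datapoint indices, favouring datapoints with high loss."""
    p = rank_selection_probs(latest_losses, selection_pressure)
    return rng.choice(len(p), size=batch_size, replace=True, p=p)
```

After each training step, the caller would overwrite the stored losses of the sampled datapoints with their newly computed values, so the ranking tracks the latest known loss. Setting `selection_pressure=1.0` recovers uniform sampling.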
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
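The stochastic rounding described in the abstract above can be illustrated with a minimal sketch. It assumes a signed fixed-point format; the function name `quantize`, the split of the 16-bit word into 8 fractional bits, and the saturation behaviour are hypothetical choices made for illustration, not details taken from the paper.

```python
import math
import random


def quantize(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x to a signed fixed-point value.

    The grid step is 2**-frac_bits. x is rounded up with probability equal
    to its fractional distance from the lower grid point, so the rounding
    is unbiased in expectation. Values outside the representable range are
    saturated (an assumption made for this sketch).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up is the distance to the lower grid point.
    code = lower + (1 if rng.random() < scaled - lower else 0)
    # Saturate to the signed integer range of the chosen word length.
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    code = max(lo, min(hi, code))
    return code / scale


if __name__ == "__main__":
    random.seed(0)
    samples = [quantize(0.3, frac_bits=2) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; their average should be close to 0.3.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, so small updates could be lost systematically; the stochastic variant preserves them on average.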
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong-data, no-voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
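The relationships between input shape, kernel shape, zero padding, strides and output shape mentioned in the abstract above can be written down per spatial axis. The following sketch uses the standard formulas for a single axis; the function names are hypothetical, and the transposed case assumes no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a convolution or pooling layer.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Uses o = floor((i + 2p - k) / s) + 1.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a transposed convolution.

    Uses o = s * (i - 1) + k - 2p, assuming no additional output padding.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    print(conv_output_size(28, 5))            # 5x5 kernel, no padding
    print(conv_output_size(28, 5, p=2))       # padding that preserves size
    print(conv_output_size(5, 3, s=2, p=1))   # strided convolution
    print(transposed_conv_output_size(3, 3, s=2, p=1))
```

With a stride greater than 1, several input sizes can map to the same output size, which is why the transposed formula recovers only one of them unless extra output padding is specified.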
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
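The stochastic pooling procedure in the abstract above, which picks one activation per pooling region according to a multinomial distribution given by the activities in that region, can be sketched as follows. This is an illustrative sketch only: the function names, the use of non-overlapping square regions, the requirement that activations be non-negative (for example, after a rectifier), and the handling of all-zero regions are assumptions, not details stated in the abstract.

```python
import random


def stochastic_pool_region(activations, rng=random):
    """Sample one activation with probability proportional to its value.

    Assumes non-negative activations. An all-zero region yields 0.0
    (an assumption made for this sketch).
    """
    if any(a < 0 for a in activations):
        raise ValueError("activations must be non-negative")
    if sum(activations) == 0:
        return 0.0
    # Multinomial draw: p_i = a_i / sum(a) within the pooling region.
    return rng.choices(activations, weights=activations, k=1)[0]


def stochastic_pool_2d(feature_map, size=2, rng=random):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            region = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            pooled_row.append(stochastic_pool_region(region, rng))
        pooled.append(pooled_row)
    return pooled


if __name__ == "__main__":
    random.seed(0)
    fm = [[1, 0, 0, 0],
          [0, 3, 0, 0],
          [0, 0, 2, 2],
          [0, 0, 2, 2]]
    print(stochastic_pool_2d(fm))
```

Unlike max pooling, the top-left region here returns 3 only about three quarters of the time and 1 otherwise, which is the source of the regularizing noise.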
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
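The pooling interpretation in the abstract above can be illustrated with a minimal sketch. It assumes the normalized $L_p$ norm takes the form ((1/N) * sum |v|^p)^(1/p); the function name `lp_pool` and the sample inputs are hypothetical, and the learned projections and per-unit orders described in the abstract are omitted.

```python
def lp_pool(values, p):
    """Normalized L_p norm of a list of numbers (assumed form, see lead-in)."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


inputs = [1.0, -2.0, 3.0, -4.0]   # made-up activations

mean_abs = lp_pool(inputs, 1)     # p = 1: average of the magnitudes
rms = lp_pool(inputs, 2)          # p = 2: root-mean-square
near_max = lp_pool(inputs, 200)   # large p: approaches max(|v|)

print(mean_abs, rms, near_max)
```

As p grows, the result moves from the mean magnitude toward the largest magnitude, which is the sense in which a single order parameter interpolates between average, root-mean-square and max pooling.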
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
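The masking idea in the abstract above can be sketched for a single hidden layer. The abstract does not spell out how the masks are built, so the degree-based rule below is an assumption: each hidden unit is given a random degree m in [1, D-1], an input d may feed a hidden unit only if m >= d, and a hidden unit may feed output d only if d > m. The helper names `made_masks` and `connectivity` are hypothetical.

```python
import random


def made_masks(num_inputs, num_hidden, seed=0):
    """Build 0/1 masks for input->hidden and hidden->output weights."""
    rng = random.Random(seed)
    # Assumed rule: every hidden unit gets a degree between 1 and D-1.
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # mask_in[k][d-1] = 1 if input d may feed hidden unit k.
    mask_in = [[1 if degrees[k] >= d else 0
                for d in range(1, num_inputs + 1)]
               for k in range(num_hidden)]
    # mask_out[d-1][k] = 1 if hidden unit k may feed output d.
    mask_out = [[1 if d > degrees[k] else 0
                 for k in range(num_hidden)]
                for d in range(1, num_inputs + 1)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """conn[d][j] counts the unmasked paths from input j to output d."""
    num_outputs, num_hidden = len(mask_out), len(mask_in)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_outputs)]
            for d in range(num_outputs)]


mask_in, mask_out = made_masks(num_inputs=5, num_hidden=8)
conn = connectivity(mask_in, mask_out)
for row in conn:
    print(row)
```

Every entry on or above the diagonal of `conn` is zero, so output d can depend only on inputs that come before d in the ordering, which is the autoregressive constraint the abstract refers to.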
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
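Two quantities in the abstract above can be checked with a short sketch: the Kullback-Leibler divergence used as the transfer objective, and the relative WER improvement implied by 4.54 and 3.93. The direction KL(teacher || student) and the example probability vectors are assumptions made for illustration; the abstract gives neither, and the function names are hypothetical.

```python
import math


def kl_divergence(teacher, student):
    """KL(teacher || student) for two discrete distributions."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher, student) if t > 0)


def relative_improvement(baseline, new):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - new) / baseline


# Made-up soft targets from a teacher and outputs from a student.
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.6, 0.3, 0.1]
kl = kl_divergence(teacher_probs, student_probs)

# WER figures quoted in the abstract: baseline 4.54, distilled DNN 3.93.
wer_gain = relative_improvement(4.54, 3.93)

print(f"KL = {kl:.4f}, relative WER improvement = {wer_gain:.1%}")
```

The relative improvement comes out at roughly 13.4%, consistent with the "more than 13%" figure in the abstract.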
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
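The Gramian Angular Summation/Difference Field encoding named in the abstract above can be sketched as follows. This is a minimal illustration that assumes the commonly used definition (rescale the series to [-1, 1], map each value to an angle with arccos, then take cos of angle sums for GASF and sin of angle differences for GADF); the function names `rescale` and `gramian_angular_fields` are hypothetical and are not taken from the paper.

```python
import math


def rescale(series, lo=-1.0, hi=1.0):
    """Linearly rescale a series into [lo, hi] (assumed preprocessing step)."""
    mn, mx = min(series), max(series)
    if mx == mn:
        raise ValueError("series must contain at least two distinct values")
    return [lo + (hi - lo) * (x - mn) / (mx - mn) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D series."""
    scaled = rescale(series)
    # Clamp before arccos to guard against tiny floating point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


gasf, gadf = gramian_angular_fields([0.0, 1.0, 2.0])
```

Each resulting matrix is an n-by-n "image" of a length-n series, which is what allows image models to be applied to time series. The GASF diagonal equals cos(2*phi_i) = 2*x_i^2 - 1 for the rescaled values x_i.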
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
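The "straight path from initialization to solution" analysis described above can be illustrated with a short sketch: evaluate the loss at evenly spaced points on the line segment between the initial and final parameter vectors. The helper names and the toy quadratic loss below are hypothetical stand-ins chosen so the example runs without a trained network; they are not the paper's code.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, steps=11):
    """Return [(alpha, loss)] for evenly spaced alpha in [0, 1]."""
    alphas = [i / (steps - 1) for i in range(steps)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    """Toy convex loss (an assumption for illustration) with minimum at all ones."""
    return sum((t - 1.0) ** 2 for t in theta)


path = loss_along_path(toy_loss, theta_init=[0.0, 3.0], theta_final=[1.0, 1.0], steps=5)
for alpha, loss in path:
    print(f"alpha={alpha:.2f} loss={loss:.4f}")
```

For a real network, `loss_fn` would evaluate the training objective at the interpolated weights; a bump in the resulting curve would indicate an obstacle on the straight path.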
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
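The rank-based selection strategy in the abstract above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) can be sketched as below. The `decay` constant, the function names, and sampling with replacement are assumptions made for illustration; the paper's own parametrization of the selection pressure may differ.

```python
import math
import random


def selection_probabilities(latest_losses, decay=0.5):
    """Probability per datapoint, decaying exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest known loss.
    `decay` is a hypothetical constant controlling selection pressure.
    """
    n = len(latest_losses)
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, decay=0.5, rng=None):
    """Draw datapoint indices for one batch (with replacement, as an assumption)."""
    rng = rng or random.Random()
    probs = selection_probabilities(latest_losses, decay)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


losses = [0.1, 2.0, 0.7, 1.3]
probs = selection_probabilities(losses, decay=0.5)
batch = sample_batch(losses, batch_size=8, rng=random.Random(0))
```

In a training loop, the stored loss of each selected datapoint would be refreshed after its batch is processed, so the ranking changes over time.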
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
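The stochastic rounding idea mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact scheme, so the version below assumes the common definition: a value is rounded up to the next fixed-point grid point with probability equal to its fractional distance above the lower grid point, which makes the rounding unbiased in expectation. The function name `stochastic_round` and the parameter `frac_bits` are hypothetical, and saturation to a 16-bit word is omitted for brevity.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Assumed scheme (not taken from the abstract): round up with
    probability equal to the distance of x above the lower grid point,
    so that the expected value of the result equals x.
    """
    scale = 1 << frac_bits          # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)      # index of the grid point at or below x
    # scaled - lower lies in [0, 1): the chance of rounding up.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale
```

With `frac_bits=2` the grid step is 0.25, so 0.3 is rounded to 0.5 about 20% of the time and to 0.25 otherwise, and the average over many draws approaches 0.3. Round-to-nearest would always return 0.25 and discard the remainder.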
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
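As an illustration of the kind of relationship the abstract above refers to, the sketch below encodes the standard one-dimensional output-size formulas for a convolution and for the matching transposed convolution. The abstract itself does not state these formulas. They are the widely used ones for input size `i`, kernel size `k`, stride `s` and zero padding `p`, and the function names are hypothetical.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Uses the standard relation o = floor((i + 2p - k) / s) + 1.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution along one axis.

    Inverts conv_output_size when (input + 2p - k) is divisible by s:
    o = s * (i - 1) + k - 2p.
    """
    return s * (i - 1) + k - 2 * p
```

For example, a size-5 input with a size-3 kernel, stride 2 and padding 1 gives `conv_output_size(5, 3, 2, 1) == 3`, and `transposed_conv_output_size(3, 3, 2, 1) == 5` recovers the original size.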
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
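One simple way to combine max and average pooling, in the spirit of direction (1) above, is a convex combination of the two over each pooling region. The sketch below is an assumption for illustration only: the abstract does not specify the combination rule, and in a trained network the mixing weight `alpha` would be learned rather than passed in. The function name `mixed_pool` is hypothetical.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty sequence of activations in the pooling window.
    alpha:  mixing weight in [0, 1]; 1.0 gives pure max pooling and
            0.0 gives pure average pooling (assumed parametrization).
    """
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value
```

For the region `[1, 2, 3, 6]`, `alpha=1.0` returns 6 (the maximum), `alpha=0.0` returns 3.0 (the mean), and `alpha=0.5` returns 4.5.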
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
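As a rough illustration of the variance-reduced update that SVRG-style methods use on a finite sum, the sketch below runs it on a toy one-dimensional nonconvex objective. The component loss, the `centers` data, the step size and the loop lengths are all hypothetical choices made for this example; they are not taken from the abstract, and the sketch makes no claim about the rates proved in the paper.

```python
import random

# Hypothetical nonconvex component loss: f_i(x) = u^2 / (1 + u^2), u = x - a_i.
def grad_i(x, a):
    u = x - a
    return 2.0 * u / (1.0 + u * u) ** 2

def full_grad(x, centers):
    # Gradient of the finite-sum objective (1/n) * sum_i f_i(x).
    return sum(grad_i(x, a) for a in centers) / len(centers)

def svrg(x0, centers, step=0.05, epochs=100, inner_steps=50, seed=0):
    rng = random.Random(seed)
    snapshot = x0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        mu = full_grad(snapshot, centers)
        x = snapshot
        for _ in range(inner_steps):
            a = centers[rng.randrange(len(centers))]
            # Variance-reduced estimate: unbiased for the full gradient at x.
            v = grad_i(x, a) - grad_i(snapshot, a) + mu
            x -= step * v
        snapshot = x
    return snapshot

centers = [-0.5, -0.3, -0.1, 0.1, 0.3, 0.5]  # assumed toy data
x_hat = svrg(2.0, centers)
print(x_hat, full_grad(x_hat, centers))
```

The correction term `grad_i(x, a) - grad_i(snapshot, a)` shrinks as the iterate and the snapshot approach the same stationary point, which is why a constant step size can be used here.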
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) of 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we first develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speedup, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speedup is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding of the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network, is less mature. An interesting question in this context is whether the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
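To make the rounding scheme concrete, here is a minimal sketch of stochastic rounding onto a signed fixed-point grid. The function name, the `frac_bits`/`word_bits` parameters and the saturation behaviour are assumptions made for this illustration; the abstract does not specify the exact number format or how overflow is handled.

```python
import math
import random

def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a multiple of 2**-frac_bits, rounding up with probability
    equal to the fractional remainder, so the result is unbiased for values
    inside the representable range. (Illustrative helper, not from the paper.)"""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), down otherwise.
    q = lower + (1 if rng.random() < scaled - lower else 0)
    # Assumed overflow policy: saturate to the signed word_bits integer range.
    q_max = (1 << (word_bits - 1)) - 1
    q_min = -(1 << (word_bits - 1))
    q = max(q_min, min(q_max, q))
    return q / scale

rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
print(sum(samples) / len(samples))  # close to 0.3 on average
```

With `frac_bits=2` the grid spacing is 0.25, so round-to-nearest would always map 0.3 to 0.25, whereas the stochastic scheme returns 0.25 or 0.5 with probabilities that preserve the value in expectation.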
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
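The shape relationships that such a guide covers can be summarized, per spatial axis, by two short formulas. The sketch below encodes the standard output-size relations for a convolution and for a transposed convolution; the function and parameter names (`i` input size, `k` kernel size, `s` stride, `p` zero padding, `a` extra output padding) are chosen for this example rather than quoted from the guide.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a convolution (or pooling) layer:
    floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0, a=0):
    """Output size along one axis of the transposed convolution associated
    with a convolution of kernel k, stride s and padding p:
    s * (i - 1) + a + k - 2p, where a = (i_conv + 2p - k) mod s recovers the
    input size of the original convolution when the stride does not divide
    evenly."""
    return s * (i - 1) + a + k - 2 * p

# A 6-wide input, 3-wide kernel, stride 2, padding 1 gives a 3-wide output;
# the matching transposed convolution with a = 1 maps 3 back to 6.
print(conv_output_size(6, 3, s=2, p=1))
print(transposed_conv_output_size(3, 3, s=2, p=1, a=1))
```

Applying the same function independently to each axis gives the full output shape for multi-dimensional inputs.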
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well with parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via two strategies for combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
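The idea of combining max and average pooling in the abstract above can be illustrated with a minimal sketch. It assumes the simplest possible combination, a convex mix controlled by a single weight; the name `alpha` and the function `mixed_pool` are hypothetical and not taken from the paper, where the mixing proportion would be learned rather than passed in by hand.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    `alpha` is an illustrative mixing weight in [0, 1]:
    alpha = 1 gives plain max pooling, alpha = 0 gives plain average pooling.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```

In a trainable layer, `alpha` would be a parameter updated by gradient descent together with the other weights.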
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout, and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
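The sampling step described in the abstract above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes non-negative activations (for example, after a rectifier) so that they can be normalized into a multinomial distribution, and the function name `stochastic_pool` is hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value.
    Assumes all activations are non-negative; an all-zero region returns 0.0.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total <= 0:
        return 0.0
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator for reproducible sampling
    region = [0.0, 1.0, 3.0]
    print([stochastic_pool(region, rng) for _ in range(10)])
```

Larger activations are picked more often, but smaller non-zero ones still get a chance, which is where the regularizing noise comes from.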
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
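Frame stacking with a reduced frame rate, as mentioned in the abstract above, can be sketched like this. The abstract does not give the window size or the subsampling step, so `stack` and `step` are illustrative parameters and `stack_and_subsample` is a hypothetical helper name.

```python
def stack_and_subsample(frames, stack, step):
    """Concatenate `stack` consecutive feature frames and keep every `step`-th window.

    `frames` is a list of equal-length feature vectors (lists of floats).
    Returns a shorter sequence of wider "super-frames".
    """
    if stack < 1 or step < 1:
        raise ValueError("stack and step must be positive")
    stacked = []
    for start in range(0, len(frames) - stack + 1, step):
        window = frames[start:start + stack]
        # Flatten the window into a single feature vector.
        stacked.append([value for frame in window for value in frame])
    return stacked


if __name__ == "__main__":
    frames = [[float(i)] for i in range(6)]
    print(stack_and_subsample(frames, stack=2, step=2))
```

With `step` greater than 1 the model sees fewer, wider inputs per utterance, which is what makes decoding cheaper.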
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and makes it difficult to find appropriate values for parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
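The claim in the abstract above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically with a small sketch. This is a simplified form that applies the norm directly to the inputs; any learned projections or offsets used in the paper are omitted, and the function name `lp_unit` is hypothetical.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm: ((1/N) * sum(|x_i| ** p)) ** (1/p).

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if not inputs:
        raise ValueError("inputs must not be empty")
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(inputs)
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)


if __name__ == "__main__":
    values = [3.0, -4.0]
    for p in (1, 2, 200):
        print(p, lp_unit(values, p))
```

Making `p` a per-unit learnable quantity, as the abstract suggests, lets each unit settle somewhere between average-like and max-like behavior.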
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
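The rank-based selection rule in the abstract above can be illustrated with a minimal sketch. It assumes that selection probability is proportional to exp(-decay * rank), where rank 0 is the datapoint with the greatest latest-known loss. The names `rank_selection_probs`, `sample_batch` and the `selection_pressure` parameter are hypothetical and are not taken from the paper; sampling here is with replacement for simplicity.

```python
import math
import random


def rank_selection_probs(losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    `selection_pressure` (hypothetical parameter) is roughly the ratio between
    the probabilities of the highest-loss and lowest-loss datapoints.
    """
    n = len(losses)
    # order[0] is the index of the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(selection_pressure) / n
    weights = [math.exp(-decay * rank) for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, selection_pressure=100.0, rng=random):
    """Draw a batch of datapoint indices (with replacement) using the rank-based rule."""
    probs = rank_selection_probs(losses, selection_pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


if __name__ == "__main__":
    latest_losses = [0.1, 2.0, 0.5, 1.0]
    print(rank_selection_probs(latest_losses))
    print(sample_batch(latest_losses, batch_size=8, rng=random.Random(0)))
```

In a training loop, the stored loss of each selected datapoint would be refreshed after every step so that the ranking tracks the current model.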
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
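Stochastic rounding, as mentioned in the abstract above, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the function name `stochastic_round_fixed` and the default split of the 16-bit word into 8 fractional bits are hypothetical. A value is rounded up with probability equal to its fractional remainder, so the result is unbiased in expectation for values inside the representable range.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to signed fixed point with stochastic rounding.

    frac_bits and word_bits are illustrative defaults (assumptions).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    q = math.floor(scaled)
    # Round up with probability equal to the distance from the lower grid point.
    if rng.random() < (scaled - q):
        q += 1
    # Saturate to the signed range of a word_bits-wide integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    q = max(min_int, min(max_int, q))
    return q / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed(0.3, rng=rng) for _ in range(20000)]
    # The sample mean should be close to 0.3, unlike round-to-nearest,
    # which would always return the same grid point.
    print(sum(samples) / len(samples))
```

Round-to-nearest would silently discard updates smaller than half a quantization step, whereas the stochastic rule preserves them on average.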
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
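The shape relationships referred to in the abstract above can be made concrete with a short sketch using the standard formulas for one spatial dimension: a convolution with input size i, kernel size k, stride s and zero padding p produces floor((i + 2p - k) / s) + 1 outputs, and the matching transposed convolution produces s(i - 1) + k - 2p outputs. The function names are hypothetical, and the transposed formula assumes no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of a transposed convolution along one axis.

    Assumes no additional output padding, so it inverts conv_output_size
    exactly only when (input + 2p - k) is divisible by s.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    out = conv_output_size(5, k=3, s=2, p=1)
    print(out)
    print(transposed_conv_output_size(out, k=3, s=2, p=1))
```

When the stride does not divide i + 2p - k evenly, several input sizes map to the same output size, so the transposed convolution recovers only the smallest of them.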
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
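The pooling interpretation above can be illustrated with a minimal sketch. It assumes the normalized $L_p$ norm has the form $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$, which the abstract does not spell out, and it omits the learned projections; the function name `lp_pool` is hypothetical.

```python
def lp_pool(values, p):
    """Normalized L_p norm of a list of numbers: (mean(|v|^p))^(1/p).

    Assumed form for illustration; the learned projections of the
    actual unit are left out.
    """
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


x = [1.0, -2.0, 3.0]
avg_abs = lp_pool(x, 1)     # p = 1: average of the absolute values
rms = lp_pool(x, 2)         # p = 2: root-mean-square
near_max = lp_pool(x, 200)  # large p: approaches max(|x_i|) from below
```

Under this assumed form, varying the single order parameter p moves the unit smoothly between average-like and max-like pooling.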
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
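The masking idea above can be sketched for a single hidden layer. This is an illustrative construction, not necessarily the exact one used by the authors: the degree-based rule, the natural input ordering, and the names `build_masks` and `connectivity` are assumptions introduced here.

```python
import random


def build_masks(n_inputs, n_hidden, seed=0):
    """Build 0/1 masks so output d depends only on inputs before d.

    Assumed rule: each hidden unit gets a random degree in
    [1, n_inputs - 1]; input d has degree d + 1 (natural ordering).
    """
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Hidden unit k may read input j only if hid_deg[k] >= in_deg[j].
    mask_in = [[1 if hid_deg[k] >= in_deg[j] else 0
                for j in range(n_inputs)] for k in range(n_hidden)]
    # Output d may read hidden unit k only if in_deg[d] > hid_deg[k].
    mask_out = [[1 if in_deg[d] > hid_deg[k] else 0
                 for k in range(n_hidden)] for d in range(n_inputs)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """conn[d][j] counts the masked paths from input j to output d."""
    n_hidden, n_inputs = len(mask_in), len(mask_out)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(n_hidden))
             for j in range(n_inputs)] for d in range(n_inputs)]


mask_in, mask_out = build_masks(n_inputs=5, n_hidden=8)
conn = connectivity(mask_in, mask_out)
```

In a real model these masks would be multiplied elementwise with the weight matrices; here `conn[d][j]` is zero whenever `j >= d`, so each output is a function of strictly earlier inputs only.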
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
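The Gramian angular encoding above can be sketched as follows. This minimal version assumes the usual definition: rescale the series to [0, 1], map each value to an angle with arccos, then take cos of pairwise angle sums (GASF) or sin of pairwise angle differences (GADF). The helper names are hypothetical, and the series is assumed to be non-constant.

```python
import math


def rescale01(series):
    """Min-max rescale to [0, 1]; assumes max(series) > min(series)."""
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]


def angles(series):
    """Polar-angle encoding of the rescaled series."""
    return [math.acos(v) for v in rescale01(series)]


def gasf(series):
    """Gramian Angular Summation Field: cos(phi_i + phi_j)."""
    phi = angles(series)
    return [[math.cos(a + b) for b in phi] for a in phi]


def gadf(series):
    """Gramian Angular Difference Field: sin(phi_i - phi_j)."""
    phi = angles(series)
    return [[math.sin(a - b) for b in phi] for a in phi]


def recover_from_gasf(field):
    """Invert the diagonal: cos(2*phi) = 2*x^2 - 1 for x in [0, 1]."""
    return [math.sqrt(max(0.0, (field[i][i] + 1.0) / 2.0))
            for i in range(len(field))]


series = [1.0, 3.0, 2.0, 5.0]
g = gasf(series)
recovered = recover_from_gasf(g)  # matches rescale01(series) up to rounding
```

The diagonal inversion illustrates the bijection property mentioned above for 0/1 rescaled data: the rescaled series can be read back from the GASF image alone.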
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
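The exact factorization above can be checked on a toy example. The sketch below uses a made-up four-state distribution and a made-up encoder f(x) = x // 2; both are illustrative assumptions, not taken from the paper.

```python
import math

# Toy distribution over x in {0, 1, 2, 3} (illustrative numbers only).
p_x = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}


def f(x):
    """Hypothetical deterministic discrete encoder."""
    return x // 2


# Prior over codes: P(H = h) sums P(x) over all x that encode to h.
p_h = {}
for x, px in p_x.items():
    p_h[f(x)] = p_h.get(f(x), 0.0) + px


def p_x_given_h(x, h):
    """Decoder that puts no mass on any x' with f(x') != h."""
    return p_x[x] / p_h[h] if f(x) == h else 0.0


# log P(x) = log P(x | H = f(x)) + log P(H = f(x))
#          = reconstruction term + regularizer on the code.
log_terms = {
    x: (math.log(p_x_given_h(x, f(x))), math.log(p_h[f(x)]))
    for x in p_x
}
```

For every x, the two entries of `log_terms[x]` sum to `log(p_x[x])`, which is the decomposition into a reconstruction term and a code prior term described above.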
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
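The straight-path analysis described in the abstract above can be illustrated with a minimal sketch: evaluate the loss at evenly spaced points on the line segment between an initial and a final parameter vector. The function name `loss_along_line`, the parameter `num_points`, and the toy quadratic loss are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn on the straight line from theta_init to theta_final.

    Returns the interpolation coefficients alpha in [0, 1] and the loss at
    (1 - alpha) * theta_init + alpha * theta_final for each alpha.
    """
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas]
    )
    return alphas, losses

# Toy stand-in for a trained model: a quadratic loss whose minimizer is known.
target = np.array([1.0, -2.0])

def toy_loss(theta):
    return float(np.sum((theta - target) ** 2))

alphas, losses = loss_along_line(toy_loss, np.zeros(2), target)
```

For a real network, `theta_init` and `theta_final` would be the flattened parameters before and after training, and a bump in `losses` along the path would indicate an obstacle on the straight line.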
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
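The rank-based selection strategy in the abstract above can be sketched as follows. Datapoints are ranked by their latest known loss, and the selection probability decays exponentially with rank. The name `selection_pressure` and the exact form of the decay are assumptions for illustration; the abstract does not specify them.

```python
import numpy as np

def rank_selection_probs(latest_losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    selection_pressure is a hypothetical knob: roughly the ratio between the
    probabilities of the highest-loss and lowest-loss datapoints.
    """
    losses = np.asarray(latest_losses, dtype=float)
    n = len(losses)
    # Rank 0 goes to the datapoint with the greatest latest known loss.
    order = np.argsort(-losses)
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)
    # Unnormalized weight exp(-log(s) * rank / n), then normalize to sum to 1.
    weights = np.exp(-np.log(selection_pressure) * ranks / n)
    return weights / weights.sum()

def sample_batch(latest_losses, batch_size, rng, selection_pressure=100.0):
    """Draw distinct datapoint indices according to the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, selection_pressure)
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)

rng = np.random.default_rng(0)
example_losses = [0.1, 2.0, 0.5, 1.0]
example_probs = rank_selection_probs(example_losses)
example_batch = sample_batch(example_losses, batch_size=2, rng=rng)
```

In a training loop, the entries of `latest_losses` for the sampled indices would be refreshed after each step, so the ranking follows the model as it learns.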
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
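Stochastic rounding, as discussed in the abstract above, can be sketched in a few lines: a value is rounded up to the next representable fixed-point number with probability equal to its fractional remainder, so the rounded value is unbiased in expectation. The function name, the split into `frac_bits` and `word_bits`, and the saturation behaviour are assumptions for illustration, not the paper's hardware design.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Quantize x to a signed fixed-point grid using stochastic rounding.

    frac_bits and word_bits are hypothetical parameter names: the grid spacing
    is 2**-frac_bits and values saturate to the signed word_bits-wide range.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    q = lower + round_up
    # Saturate to the representable signed integer range.
    q = np.clip(q, -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1)
    return q / scale

rng = np.random.default_rng(0)
samples = stochastic_round_fixed_point(np.full(100000, 0.3), rng=rng)
```

Here each entry of `samples` is one of the two grid points that bracket 0.3, and their average stays close to 0.3, which is the property that round-to-nearest lacks.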
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
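The straight-path analysis described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `interpolate` and `loss_along_path`, the flat-list parameter representation, and the toy quadratic loss in the usage note are all assumptions made for the example.

```python
def interpolate(theta_start, theta_end, alpha):
    """Point on the straight line between two parameter vectors (flat lists)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_start, theta_end)]


def loss_along_path(loss_fn, theta_start, theta_end, num_points=11):
    """Evaluate loss_fn at evenly spaced points from theta_start to theta_end.

    Returns a list of (alpha, loss) pairs with alpha running from 0 to 1.
    A bump in the loss values would indicate an obstacle on the straight path.
    """
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_start, theta_end, alpha)))
            for alpha in alphas]


if __name__ == "__main__":
    # Toy stand-in for a network loss: a convex quadratic (assumption).
    quadratic = lambda theta: sum(t * t for t in theta)
    for alpha, loss in loss_along_path(quadratic, [2.0, -2.0], [0.0, 0.0]):
        print(f"alpha={alpha:.1f} loss={loss:.3f}")
```

For a real network, `theta_start` would hold the initial weights, `theta_end` the trained weights, and `loss_fn` the training objective evaluated at a given weight vector.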
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
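The rank-based selection rule described in the abstract above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) can be sketched as below. This is an illustrative sketch only: the function names, the `decay` parameter, and its default value are assumptions, and the paper's exact parameterization of the selection pressure may differ.

```python
import random


def rank_selection_probs(losses, decay=0.9):
    """Selection probabilities that decay exponentially with loss rank.

    losses: latest known loss for each datapoint.
    decay:  hypothetical parameter in (0, 1); the probability of the point
            at rank r (rank 0 = largest loss) is proportional to decay**r.
    """
    n = len(losses)
    # Indices sorted so that the largest loss comes first.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, decay=0.9, rng=random):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, decay)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop, the stored loss of each sampled datapoint would be refreshed after every step so that the ranking tracks the current model.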
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks: channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
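As a rough illustration of the pooling interpretation in the abstract above, the following Python sketch computes a normalized $L_p$ norm over a list of input signals. It is a simplified stand-in, not the proposed unit itself: it assumes no learned projections and a fixed order $p$, and the function name `normalized_lp` is hypothetical.

```python
def normalized_lp(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p), for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


signals = [1.0, -2.0, 3.0]
mean_abs = normalized_lp(signals, 1)    # p = 1: average of absolute values
rms = normalized_lp(signals, 2)         # p = 2: root-mean-square
near_max = normalized_lp(signals, 200)  # large p: approaches max |v|
```

Varying `p` moves the result between average-like and max-like pooling, which is the sense in which a single parametric unit can cover several conventional pooling operators.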
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
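The masking idea in the abstract above can be sketched in a few lines of Python. This is a minimal illustration for one hidden layer and the natural input ordering, not the authors' implementation: the degree-assignment scheme (uniform sampling of hidden-unit degrees) and the names `made_masks` and `connectivity` are assumptions made here for illustration.

```python
import random


def made_masks(num_inputs, num_hidden, seed=0):
    """Build binary masks for a one-hidden-layer masked autoencoder."""
    rng = random.Random(seed)
    # Inputs are numbered 1..D in the chosen ordering.
    input_degree = list(range(1, num_inputs + 1))
    # Each hidden unit gets a degree in 1..D-1 (assumed: uniform sampling).
    hidden_degree = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit k may see input d only if degree(k) >= d.
    mask_in = [[int(hk >= d) for d in input_degree] for hk in hidden_degree]
    # Output d may see hidden unit k only if d > degree(k).
    mask_out = [[int(d > hk) for hk in hidden_degree] for d in input_degree]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count masked paths from each input (column) to each output (row)."""
    num_inputs = len(mask_out)
    num_hidden = len(mask_in)
    return [[sum(mask_out[i][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for i in range(num_inputs)]


mask_in, mask_out = made_masks(num_inputs=5, num_hidden=16)
paths = connectivity(mask_in, mask_out)
# Output i has no path from input j whenever j >= i (strictly lower triangular).
autoregressive = all(paths[i][j] == 0 for i in range(5) for j in range(i, 5))
```

Multiplying the weight matrices elementwise by these masks would leave each output connected only to earlier inputs in the ordering, which is what lets the outputs be read as conditional probabilities.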
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
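The width dependence described in the abstract above can be checked with a small simulation. The Python sketch below pushes a unit vector through a stack of independent random linear layers and measures the variance of the log-norm across trials. The weight scaling used here (entries drawn from N(0, 1/width)) is an assumed, commonly used choice and not necessarily the unbiased scaling derived in the paper; the function names are hypothetical.

```python
import math
import random


def log_norm_after(depth, width, rng):
    """Log of the vector norm after `depth` random linear layers."""
    vec = [1.0 / math.sqrt(width)] * width  # unit-norm starting vector
    scale = 1.0 / math.sqrt(width)          # assumed: entries ~ N(0, 1/width)
    for _ in range(depth):
        # Each output coordinate uses a fresh row of random weights.
        vec = [sum(rng.gauss(0.0, scale) * v for v in vec) for _ in range(width)]
    return math.log(math.sqrt(sum(v * v for v in vec)))


def log_norm_variance(depth, width, trials, seed=0):
    """Sample variance of the final log-norm over independent trials."""
    rng = random.Random(seed)
    samples = [log_norm_after(depth, width, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / (trials - 1)


narrow = log_norm_variance(depth=20, width=4, trials=30)
wide = log_norm_variance(depth=20, width=32, trials=30)
```

Under these assumptions, the spread of the log-norm for the wide network should come out noticeably smaller than for the narrow one, in line with the claim that the variance is inversely proportional to layer size.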
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
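The training signal described in the abstract above, a Kullback-Leibler divergence between teacher soft outputs and student outputs, can be sketched for a single frame as follows. This is an illustrative Python sketch only: the direction KL(teacher || student), the example logits, and the helper names `softmax` and `kl_divergence` are assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one frame; eps guards against log(0)."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(teacher_probs, student_probs)
               if p > 0.0)


# Hypothetical per-frame logits over three output classes.
teacher = softmax([2.0, 0.5, -1.0])  # soft alignment from the large model
student = softmax([1.0, 1.0, 0.0])   # prediction of the small model
loss = kl_divergence(teacher, student)
```

In a full system this quantity would be averaged over frames and minimized with respect to the student's parameters; the divergence is zero only when the two distributions agree.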
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing pipeline. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
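The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a fixed-point grid with `frac_bits` fractional bits and rounds a value up with probability equal to its fractional distance from the lower grid point, so the rounding is unbiased in expectation. The function name and parameters are hypothetical, and saturation to a fixed word length (such as 16 bits) is deliberately left out.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x to a multiple of 2**-frac_bits, choosing up or down at random.

    The probability of rounding up equals the distance from the lower grid
    point divided by the grid spacing, so the expected result equals x.
    Word-length saturation is omitted in this sketch.
    """
    scale = 1 << frac_bits            # number of grid steps per unit
    scaled = x * scale                # position of x measured in grid steps
    lower = math.floor(scaled)        # nearest grid point at or below x
    p_up = scaled - lower             # fractional part, in [0, 1)
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale
```

Averaging many calls on the same input recovers the input approximately, which is the property that distinguishes this scheme from round-to-nearest.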
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
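The relationships between input shape, kernel shape, zero padding, strides and output shape described in the abstract above can be sketched for one spatial dimension. The formulas used here are the commonly stated ones for a convolution and for a transposed convolution without extra output padding; they are not quoted from the abstract, and the function names are hypothetical.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of a transposed convolution along one axis.

    Assumes no additional output padding, so this inverts conv_output_size
    only when (i_original + 2 * p - k) is divisible by s.
    """
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives an output of width 3, and the transposed convolution with the same settings maps width 3 back to width 5.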
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
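One way to read the first direction in the abstract above, combining max and average pooling, is as a weighted blend of the two over each pooling region. The sketch below is an illustration under that assumption only: the mixing weight `alpha` is a fixed argument here, whereas the abstract describes learning the pooling function, and the function name is hypothetical.

```python
def mixed_pool(region, alpha):
    """Blend max pooling and average pooling over one pooling region.

    alpha = 1.0 gives pure max pooling, alpha = 0.0 pure average pooling.
    In a trained network alpha would be a learned parameter; it is a plain
    argument in this sketch.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    mean_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * mean_value
```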
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks: channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
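The first direction in the abstract above, combining max and average pooling, can be illustrated with a minimal sketch. This assumes the simplest possible combination, a convex blend controlled by a hypothetical mixing weight `a`; the abstract mentions two combination strategies without detailing them, so this is not necessarily either of the authors' formulations.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    `a` is a hypothetical mixing weight in [0, 1]:
    a = 1 gives pure max pooling, a = 0 gives pure average pooling.
    In a learned variant, `a` would be a trainable parameter.
    """
    average = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * average


# Example: one pooling region with three activations.
region = [1.0, 2.0, 3.0]
print(mixed_pool(region, 1.0))  # pure max
print(mixed_pool(region, 0.0))  # pure average
print(mixed_pool(region, 0.5))  # halfway blend
```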
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
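The stochastic pooling step described in the abstract above can be sketched as follows. This is a minimal illustration, assuming non-negative activations (for example, after a rectifier) and assuming that an all-zero region simply yields 0; the function names are hypothetical and not taken from the paper.

```python
import random


def pooling_probabilities(region):
    """Normalize the activations in one pooling region into a multinomial distribution."""
    total = sum(region)
    return [a / total for a in region]


def stochastic_pool(region, rng=random):
    """Sample one activation from the region with probability proportional to its value.

    Assumes all activations are non-negative. If every activation is zero,
    the region has no valid distribution, so 0.0 is returned (an assumption).
    """
    if sum(region) <= 0:
        return 0.0
    # random.choices draws k items using the given (unnormalized) weights.
    return rng.choices(region, weights=region, k=1)[0]


# Example: larger activations are picked more often, but not always.
rng = random.Random(0)
region = [1.0, 3.0]
print(pooling_probabilities(region))
print([stochastic_pool(region, rng) for _ in range(5)])
```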
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
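The normalized $L_p$ norm mentioned in the abstract above can be illustrated with a small sketch. This simplified version assumes the inputs `z` are already the projected signals (the learned projections and any offsets are omitted), and the function name `lp_unit` is hypothetical. It shows how the order $p$ interpolates between mean-absolute-value, root-mean-square, and max pooling.

```python
def lp_unit(z, p):
    """Normalized L_p norm of already-projected inputs z.

    Computes ((1/N) * sum_i |z_i|^p)^(1/p).
    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    n = len(z)
    return (sum(abs(v) ** p for v in z) / n) ** (1.0 / p)


# Example: the same inputs pooled with different orders p.
z = [1.0, 2.0, 4.0]
for p in (1, 2, 8, 200):
    print(p, lp_unit(z, p))
```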
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
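As a rough illustration of the masking idea in the abstract above, the following pure-Python sketch builds binary masks for a one-hidden-layer autoencoder so that each output depends only on inputs that precede it in the natural ordering. The degree-assignment scheme and the names `build_masks` and `connectivity` are assumptions made for illustration; they are not taken from the abstract.

```python
import random


def build_masks(n_inputs, n_hidden, seed=0):
    """Return (mask_in, mask_out) binary masks for one hidden layer.

    Assumption: hidden unit k is given a degree m[k] in [1, n_inputs - 1].
    It may read inputs 1..m[k], and output d may read only hidden units
    whose degree is strictly less than d. Requires n_inputs >= 2.
    """
    rng = random.Random(seed)
    degrees = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # mask_in[k][i] == 1 if hidden unit k may connect to input i + 1
    mask_in = [[1 if (i + 1) <= degrees[k] else 0 for i in range(n_inputs)]
               for k in range(n_hidden)]
    # mask_out[d][k] == 1 if output d + 1 may connect to hidden unit k
    mask_out = [[1 if degrees[k] < (d + 1) else 0 for k in range(n_hidden)]
                for d in range(n_inputs)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """conn[d][i] > 0 iff some path links input i + 1 to output d + 1."""
    n_inputs, n_hidden = len(mask_out), len(mask_in)
    return [[sum(mask_out[d][k] * mask_in[k][i] for k in range(n_hidden))
             for i in range(n_inputs)]
            for d in range(n_inputs)]


mask_in, mask_out = build_masks(n_inputs=4, n_hidden=8)
conn = connectivity(mask_in, mask_out)
```

In this sketch the masks would be multiplied elementwise into the weight matrices; output 1 has no incoming path at all, so it models an unconditional probability.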
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
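The Kullback-Leibler objective mentioned in the abstract above can be sketched for a single frame as follows. This is a minimal pure-Python illustration: the logit values are made up, and the direction KL(teacher || student) is an assumption, since the abstract does not specify it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical per-frame posteriors over three output classes.
teacher = softmax([2.0, 1.0, 0.1])   # soft alignment from the large model
student = softmax([1.5, 1.2, 0.3])   # prediction of the small model
loss = kl_divergence(teacher, student)
```

In training, a loss of this kind would be averaged over all frames and minimized with respect to the student's parameters only.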
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural network (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
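For readers unfamiliar with the encoding named in the abstract above, here is a minimal pure-Python sketch of a Gramian Angular Summation Field. It assumes min-max rescaling to [-1, 1] (the abstract also mentions 0/1 rescaling) and a non-constant input series; the function names are illustrative only.

```python
import math


def rescale(series):
    """Min-max rescale a non-constant series to the interval [-1, 1]."""
    lo, hi = min(series), max(series)
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]


def gasf(series):
    """Gramian Angular Summation Field: entry (i, j) is cos(phi_i + phi_j)."""
    scaled = rescale(series)
    # Clamp to guard against rounding slightly outside acos's domain.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]


field = gasf([1.0, 2.0, 4.0, 3.0])
```

The resulting matrix is symmetric, and its diagonal equals 2x^2 - 1 for each rescaled value x, which is what lets the rescaled series be recovered from the image.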
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout-based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout-based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
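The gradient reversal layer named in the abstract above can be sketched, framework-free, as a layer that is the identity in the forward pass and flips the sign of the gradient in the backward pass. The class below is an illustrative assumption: the `forward`/`backward` interface and the scaling factor `lam` are hypothetical names, not an API from any particular deep learning package.

```python
class GradientReversal:
    """Identity on the forward pass; negated, scaled gradient on the backward pass."""

    def __init__(self, lam=1.0):
        # lam (hypothetical name) scales how strongly the gradient is reversed.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is reversed,
        # pushing features toward being uninformative about the domain.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
out = layer.forward([0.2, -1.0, 3.0])
grad_in = layer.backward([1.0, -2.0, 4.0])
```

Placed between a feature extractor and a domain classifier, such a layer lets ordinary gradient descent train the classifier to tell domains apart while the extractor is updated in the opposite direction.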
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
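The first direction named in the abstract above, combining max and average pooling, can be illustrated with a small sketch. This is only one plausible reading: the convex combination and the name `alpha` are assumptions made here for illustration, and the mixing weight is passed in as a fixed number rather than learned as the abstract describes.

```python
def mixed_pool(region, alpha):
    """Blend max pooling and average pooling over one pooling region.

    `alpha` is a hypothetical mixing weight in [0, 1]:
    alpha = 1 gives plain max pooling, alpha = 0 gives plain average pooling.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(window, a))
```

In a trainable version, `alpha` would be a parameter updated by gradient descent along with the network weights; that part is omitted here.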
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
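The sampling step described in the abstract above can be sketched for a single pooling region. This is a minimal illustration, assuming non-negative activations (for example, outputs of a rectifier); the function name `stochastic_pool` and the handling of an all-zero region are choices made here, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. according to a multinomial distribution given by the normalized
    activities. Activations are assumed to be non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption: an all-zero region simply pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
    window = [0.5, 1.5, 0.0, 2.0]
    print([stochastic_pool(window, rng) for _ in range(8)])
```

Larger activations are selected more often, but smaller non-zero ones still get picked occasionally, which is the source of the regularizing noise.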
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called {\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pretraining step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher-order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
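The pooling interpretation of the $L_p$ unit described above can be illustrated with a minimal sketch. It assumes the inputs are already the projected signals (the learned projections are omitted), and the function name `lp_pool` is hypothetical, chosen for this illustration only. With $p = 1$ the normalized norm is the mean absolute value, with $p = 2$ it is the root-mean-square, and as $p$ grows it approaches the maximum absolute value.

```python
import math


def lp_pool(values, p):
    """Normalized L_p norm of a list of numbers: (mean(|v|**p)) ** (1/p).

    Assumption for this sketch: `values` are the already-projected inputs
    to the unit, and p >= 1 is a fixed order rather than a learned one.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if not values:
        raise ValueError("values must be non-empty")
    mean_power = sum(abs(v) ** p for v in values) / len(values)
    return mean_power ** (1.0 / p)


signals = [1.0, -2.0, 3.0]
average_like = lp_pool(signals, 1)    # mean of absolute values
rms_like = lp_pool(signals, 2)        # root-mean-square
max_like = lp_pool(signals, 200)      # close to max(|v|) for large p
```

Because the order only enters through the exponent, a single function covers the whole family of pooling operators between averaging and max pooling.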
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
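The masking idea in the abstract above can be sketched for a single hidden layer. This is an illustrative construction under stated assumptions, not necessarily the authors' exact procedure: each hidden unit is given a random degree between 1 and D-1, a hidden unit may see inputs whose index is at most its degree, and an output may see hidden units whose degree is strictly below its own index. The names `made_masks` and `connectivity` are hypothetical.

```python
import random


def made_masks(num_inputs, num_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumed construction: hidden unit k gets a degree in 1..num_inputs-1.
    mask_in[k][d-1]  = 1 when degree(k) >= d  (hidden k reads input d)
    mask_out[d-1][k] = 1 when d > degree(k)   (output d reads hidden k)
    """
    if num_inputs < 2:
        raise ValueError("need at least two inputs")
    rng = random.Random(seed)
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    mask_in = [[1 if degrees[k] >= d else 0 for d in range(1, num_inputs + 1)]
               for k in range(num_hidden)]
    mask_out = [[1 if d > degrees[k] else 0 for k in range(num_hidden)]
                for d in range(1, num_inputs + 1)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the masked paths from input j to output i (matrix product)."""
    num_outputs, num_hidden = len(mask_out), len(mask_in)
    num_inputs = len(mask_in[0])
    return [[sum(mask_out[i][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for i in range(num_outputs)]


mask_in, mask_out = made_masks(num_inputs=5, num_hidden=16)
paths = connectivity(mask_in, mask_out)
# Output i has no path from input j whenever j >= i, so output i depends
# only on earlier inputs in the chosen ordering.
is_autoregressive = all(paths[i][j] == 0 for i in range(5) for j in range(i, 5))
```

Multiplying the weight matrices elementwise by these masks leaves the path matrix strictly lower triangular, which is what lets the outputs be read as conditional probabilities whose product is the joint.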
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
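The training signal described above, a Kullback-Leibler divergence between the teacher's soft alignments and the student's outputs, can be sketched as follows. This is a minimal illustration, assuming each frame is a list of class probabilities that sums to one; the function names, the `eps` floor, and the toy numbers are hypothetical and not taken from the paper.

```python
import math


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one frame of class probabilities.

    Terms with zero teacher probability contribute nothing; student
    probabilities are floored at `eps` (an assumption of this sketch)
    to avoid dividing by zero.
    """
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(teacher_probs, student_probs) if p > 0)


def mean_frame_kl(teacher_frames, student_frames):
    """Average the per-frame divergence over an utterance."""
    per_frame = [kl_divergence(t, s)
                 for t, s in zip(teacher_frames, student_frames)]
    return sum(per_frame) / len(per_frame)


# Toy example: two frames, three classes.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]
loss = mean_frame_kl(teacher, student)
```

The divergence is zero exactly when the student reproduces the teacher's distribution on every frame, so minimizing it pulls the small model toward the soft alignments rather than toward hard labels.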
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNNs on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
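The Gramian Angular Summation Field mentioned above can be sketched in a few lines. The abstract does not spell out the formula, so this sketch assumes the commonly used definition: rescale the series to [-1, 1], map each value to an angle with arccos, and fill entry (i, j) with the cosine of the summed angles. The function names are hypothetical.

```python
import math


def rescale_to_unit_interval(series):
    """Linearly rescale a series to the range [-1, 1]."""
    low, high = min(series), max(series)
    if high == low:
        raise ValueError("series must not be constant")
    return [(2.0 * x - high - low) / (high - low) for x in series]


def gasf(series):
    """Gramian Angular Summation Field: entry (i, j) = cos(phi_i + phi_j).

    Assumed definition for this sketch: phi_i = arccos(x_i) on the
    rescaled series; values are clamped to guard against rounding.
    """
    scaled = rescale_to_unit_interval(series)
    angles = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    return [[math.cos(a + b) for b in angles] for a in angles]


image = gasf([1.0, 2.0, 3.0])
# The rescaled series is [-1, 0, 1], so the diagonal cos(2 * phi_i)
# equals 2 * x_i**2 - 1, i.e. 1, -1, 1 up to rounding.
```

The resulting matrix is symmetric and its diagonal determines the rescaled series up to sign, which is why the encoding can be fed to image models such as the tiled CNNs named in the abstract.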
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further compress the multilayer bootstrap network into a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
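The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the `selection_pressure` parameter, and the exact decay formula are assumptions chosen so that the probability ratio between the highest-loss and lowest-loss datapoints is roughly `selection_pressure`.

```python
import math
import random


def rank_selection_probabilities(losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    `losses` holds the latest known loss of each datapoint.
    `selection_pressure` (hypothetical parameter) is approximately the ratio
    between the probabilities of the top-ranked and bottom-ranked datapoints.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(math.log(selection_pressure) / n)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = 1.0 / decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, selection_pressure=100.0, rng=random):
    """Draw a batch of datapoint indices (with replacement) using the rule above."""
    probs = rank_selection_probabilities(losses, selection_pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop, the stored losses would be refreshed for the datapoints in each batch after the forward pass, so the ranking tracks the latest known values.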
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing pipeline. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
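For readers unfamiliar with SVRG, the update analyzed in the abstract above can be sketched as below. This is a minimal scalar sketch under stated assumptions: the function name `svrg`, its parameters, and the toy least-squares objective in the usage note are illustrative and not taken from the paper, and the toy objective is convex (chosen only because its minimizer is easy to check), whereas the paper's analysis concerns the nonconvex case.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, rng=random):
    """Stochastic variance reduced gradient for f(w) = (1/n) * sum_i f_i(w).

    grad_i(w, i) returns the gradient of the i-th component f_i at w.
    """
    w = w0
    for _ in range(epochs):
        # Take a snapshot and compute the full gradient there once per epoch.
        snapshot = w
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced gradient estimate: unbiased for the full gradient at w.
            g = grad_i(w, i) - grad_i(snapshot, i) + full_grad
            w -= step * g
    return w
```

As a usage example, with components f_i(w) = 0.5 * (a_i * w - b_i)^2 for a = [1, 2, 3] and b = [2, 4, 6], calling `svrg(lambda w, i: a[i] * (a[i] * w - b[i]), 3, 0.0, 0.02, 30, 50)` should approach the shared minimizer w = 2.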
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain those for full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
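Stochastic rounding, which the abstract above identifies as crucial for low-precision training, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the split between integer and fractional bits (`frac_bits`), and the saturation behavior are choices made here and are not taken from the paper.

```python
import math
import random


def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with `frac_bits` fractional bits.

    The value is rounded up with probability equal to its fractional
    remainder (in units of the smallest representable step), so the
    rounding is unbiased in expectation for values inside the range.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise round down.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed range of a word_bits-wide integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    q = max(min_int, min(max_int, lower))
    return q / scale
```

Unlike round-to-nearest, small updates that fall below half of the smallest step are not always lost: they are applied occasionally, with the right frequency on average.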
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
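The shape relationships mentioned in the abstract above can be sketched for a single spatial axis as follows. This is a minimal illustration using the standard textbook formulas, not code from the guide itself; the function names `conv_output_size` and `transposed_conv_output_size` are hypothetical, and dilation and output padding are assumed to be absent.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length along one axis for a convolution (or pooling) layer.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Assumes no dilation.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length along one axis for a transposed convolution.

    Uses the same k, s, p as the forward convolution it mirrors.
    Assumes no dilation and no extra output padding.
    """
    return s * (i - 1) + k - 2 * p


# A 5-wide input with a 3-wide kernel, stride 2, padding 1 gives 3 outputs,
# and the matching transposed convolution maps 3 back to 5.
forward = conv_output_size(5, k=3, s=2, p=1)
backward = transposed_conv_output_size(forward, k=3, s=2, p=1)
```

When `(i + 2p - k)` is not a multiple of the stride, several input sizes map to the same output size, so the transposed formula recovers only the smallest of them.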
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
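The sampling step described in the abstract above can be sketched for one pooling region as follows. This is a minimal illustration, not the authors' implementation: it assumes the activations are non-negative (for example, outputs of a rectifier), it returns 0 for an all-zero region as a convention chosen here, and the function name `stochastic_pool_region` is hypothetical.

```python
import numpy as np


def stochastic_pool_region(activations, rng):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its value,
    i.e. according to a multinomial distribution given by the activities.
    Assumes all activations are non-negative.
    """
    a = np.asarray(activations, dtype=float).ravel()
    total = a.sum()
    if total <= 0:
        # Convention for this sketch: nothing is active, so pool to zero.
        return 0.0
    probs = a / total
    idx = rng.choice(a.size, p=probs)
    return float(a[idx])


rng = np.random.default_rng(0)
pooled = stochastic_pool_region([1.0, 2.0, 3.0, 4.0], rng)
```

Larger activations are picked more often, but smaller non-zero ones can still be selected, which is the source of the regularizing noise.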
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
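The image encoding named in the abstract above can be illustrated with a small sketch. The abstract gives no formulas, so the definitions used here are assumptions based on the usual presentation of Gramian Angular Fields: the series is rescaled to [0, 1], each value is mapped to an angle with arccos, and the summation and difference fields are cos(phi_i + phi_j) and sin(phi_i - phi_j). Function names are hypothetical, and only the standard library is used.

```python
import math


def rescale_unit_interval(series):
    """Min-max rescale a series to [0, 1] (the '0/1 rescaled data' setting)."""
    lowest, highest = min(series), max(series)
    if highest == lowest:
        raise ValueError("series must contain at least two distinct values")
    return [(x - lowest) / (highest - lowest) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D series."""
    scaled = rescale_unit_interval(series)
    # Clamp before arccos to guard against tiny floating-point overshoot.
    phi = [math.acos(min(1.0, max(-1.0, x))) for x in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


def recover_from_diagonal(gasf):
    """Invert the GASF diagonal: cos(2*phi) = 2*x**2 - 1, so x = sqrt((g + 1) / 2)."""
    return [math.sqrt(max(0.0, (gasf[i][i] + 1.0) / 2.0)) for i in range(len(gasf))]


gasf, gadf = gramian_angular_fields([1.0, 2.0, 3.0])
rescaled_again = recover_from_diagonal(gasf)
```

`recover_from_diagonal` illustrates why the mapping is invertible on data rescaled to [0, 1]: the main diagonal of the summation field determines each rescaled value.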
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
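The analysis technique in the abstract above, evaluating the objective along a straight line between the initial and the trained parameters, can be sketched as follows. The loss function, parameter vectors, and function names are hypothetical placeholders; a real study would plug in a network's flattened weights and its training loss.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors (alpha in [0, 1])."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points from theta_init to theta_final."""
    alphas = [k / (num_points - 1) for k in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    """Stand-in objective: squared distance to the all-ones vector."""
    return sum((t - 1.0) ** 2 for t in theta)


path = loss_along_path(toy_loss, theta_init=[0.0, 0.0], theta_final=[1.0, 1.0])
```

Plotting the second element of each pair in `path` against `alpha` shows whether the loss rises anywhere between initialization and solution; a bump on that curve would be the kind of obstacle the abstract refers to.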
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
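The selection strategy in the abstract above (rank datapoints by latest known loss, then let the selection probability decay exponentially with rank) can be sketched as below. The parameter `selection_pressure`, the exact decay formula, and sampling with replacement are assumptions made for illustration; the abstract does not fix these details.

```python
import math
import random


def selection_probabilities(latest_losses, selection_pressure=100.0):
    """Probability per datapoint, decaying exponentially with its loss rank.

    Rank 0 is the datapoint with the greatest latest known loss. For large
    datasets the first-ranked point is roughly `selection_pressure` times more
    likely to be picked than the last-ranked one (assumed parameterization).
    """
    n = len(latest_losses)
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    decay = math.log(selection_pressure) / n
    weights = [0.0] * n
    for rank, index in enumerate(order):
        weights[index] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, selection_pressure=100.0, rng=None):
    """Draw datapoint indices for one batch (with replacement, for simplicity)."""
    rng = rng or random.Random(0)
    probs = selection_probabilities(latest_losses, selection_pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


losses = [0.1, 2.0, 0.5, 1.2]
batch_indices = sample_batch(losses, batch_size=2)
```

After each training step the entries of `losses` for the selected indices would be refreshed with their new loss values, so the ranking tracks the latest known losses.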
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
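For readers unfamiliar with the update that the abstract above analyzes, here is a minimal sketch of the standard SVRG loop on a finite-sum objective. To keep the result easy to check, the toy objective is a convex scalar least-squares problem rather than a nonconvex one; the update rule itself has the same form in both settings. The function names, step size, and loop lengths are illustrative assumptions, not values from the paper.

```python
import random


def component_gradient(w, a_i, b_i):
    """Gradient of the i-th component f_i(w) = 0.5 * (a_i * w - b_i) ** 2."""
    return a_i * (a_i * w - b_i)


def svrg_least_squares(a, b, step_size=0.02, epochs=40, inner_steps=20, seed=0):
    """Minimize (1/n) * sum_i f_i(w) with stochastic variance reduced gradients."""
    rng = random.Random(seed)
    n = len(a)
    w_snapshot = 0.0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(component_gradient(w_snapshot, a[i], b[i]) for i in range(n)) / n
        w = w_snapshot
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced estimate: unbiased for the full gradient at w.
            estimate = (component_gradient(w, a[i], b[i])
                        - component_gradient(w_snapshot, a[i], b[i])
                        + full_grad)
            w -= step_size * estimate
        w_snapshot = w  # use the last inner iterate as the next snapshot
    return w_snapshot


w_hat = svrg_least_squares([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The correction term `component_gradient(w_snapshot, ...) - full_grad` has zero mean over `i`, which is what reduces the variance of the stochastic step relative to plain SGD.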
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
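The stochastic rounding idea named in the abstract above can be illustrated with a minimal sketch. It assumes a signed fixed-point format with `int_bits` integer bits (sign included) and `frac_bits` fractional bits; the function names, the 8.8 split, and the saturation behaviour are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased:
    E[stochastic_round(x)] == x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # in [0, 1)
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale


def to_fixed_point(x, int_bits=8, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to the signed range.

    With the (assumed) defaults this models a 16-bit word: 8 integer
    bits including the sign and 8 fractional bits.
    """
    scale = 1 << frac_bits
    hi = (1 << (int_bits + frac_bits - 1)) - 1   # largest integer code
    lo = -(1 << (int_bits + frac_bits - 1))      # smallest integer code
    code = round(stochastic_round(x, frac_bits, rng) * scale)
    return max(lo, min(hi, code)) / scale
```

Unlike round-to-nearest, which would always map 0.3 to 0 on an integer grid, repeated calls to `stochastic_round(0.3, frac_bits=0)` return 1 about 30% of the time, so small updates are preserved on average.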
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
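The relationship between input shape, kernel shape, zero padding, strides and output shape mentioned above can be written down for the one-dimensional case. This is a minimal sketch using the standard formulas; the function names are illustrative, and it assumes no dilation and no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution (or pooling) layer.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution associated with a
    convolution that has kernel k, stride s and padding p.

    When (o + 2p - k) is not divisible by s, several input sizes of the
    forward convolution map to the same output size; this returns the
    smallest of them.
    """
    return s * (i - 1) + k - 2 * p


# A 5-wide input, 3-wide kernel, stride 2, padding 1 gives 3 outputs,
# and the associated transposed convolution maps 3 back to 5.
out = conv_output_size(5, 3, s=2, p=1)
back = transposed_conv_output_size(out, 3, s=2, p=1)
```

For multi-dimensional layers the same formula is applied independently along each axis.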
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) the input-to-hidden function, (2) the hidden-to-hidden transition and (3) the hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation effort but also do not fit parallel computation well. We introduce structured sparsity at various scales for convolutional neural networks: channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
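Combining max and average pooling, as in direction (1) above, might be sketched as follows. The abstract does not spell out its two strategies, so both functions here are assumptions about what such a combination could look like: one uses a single mixing coefficient `alpha`, the other computes the coefficient from the pooling region through a sigmoid gate. The names `mixed_pool`, `gated_pool` and `gate_weights` are hypothetical.

```python
import math


def mixed_pool(region, alpha):
    """Convex combination of max and average pooling over one region.

    alpha = 1 gives max pooling, alpha = 0 gives average pooling.
    In a trained network alpha would be a learned parameter.
    """
    avg = sum(region) / len(region)
    return alpha * max(region) + (1 - alpha) * avg


def gated_pool(region, gate_weights):
    """Same combination, but the mixing coefficient depends on the
    values being pooled: alpha = sigmoid(gate_weights . region).
    """
    score = sum(w * a for w, a in zip(gate_weights, region))
    alpha = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, alpha)


region = [1.0, 2.0, 3.0, 6.0]
halfway = mixed_pool(region, 0.5)                   # midway between 3.0 and 6.0
gated = gated_pool(region, [0.0, 0.0, 0.0, 0.0])    # zero gate, alpha = 0.5
```

With a zero gate the sigmoid returns 0.5, so both calls give the same value, halfway between the average and the maximum of the region.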
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
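The stochastic procedure described above, picking one activation per pooling region from a multinomial distribution given by the activities, can be sketched in a few lines. This is a minimal illustration assuming non-negative activations (for example after a ReLU); the function names and the handling of an all-zero region are assumptions, not details from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation a_i is chosen with probability a_i / sum(region),
    so larger activations are picked more often but not always.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # assumed convention: an all-zero region pools to zero
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]


def expected_pool(region):
    """Expected value of stochastic_pool: sum_i p_i * a_i."""
    total = sum(region)
    if total == 0:
        return 0.0
    return sum(a * a for a in region) / total
```

For the region `[1.0, 3.0]` the sampler returns 3.0 three times out of four on average, and `expected_pool` gives the corresponding mean of 2.5.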
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
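The abstract above does not spell out how distillation is carried out. As it is usually described, distillation trains on softened class probabilities produced by a softmax with a temperature parameter; the sketch below shows only that softening step, under that assumption. The function name, the logits, and the temperature values are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.0, 1.0]                     # hypothetical network outputs
hard = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=20.0)
# A higher temperature flattens the distribution while preserving the class ranking.
```

Flatter targets carry information about relative class similarity, which is the usual motivation given for training a second network on them.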
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
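The claim above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below covers only the pooling formula $\left(\frac{1}{N}\sum_i |x_i|^p\right)^{1/p}$ and leaves out the learned projections and learned orders described in the abstract; the function name `lp_pool` and the input values are assumptions for illustration.

```python
def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|x_i| ** p)) ** (1/p)."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)

x = [1.0, -2.0, 3.0, -4.0]      # hypothetical inputs to one pooling unit
mean_abs = lp_pool(x, 1)        # p = 1: average of absolute values
rms = lp_pool(x, 2)             # p = 2: root-mean-square
near_max = lp_pool(x, 100)      # large p: approaches max(|x_i|) from below
```

For finite p the result stays below the largest absolute value because of the 1/N normalization, and it converges to that value as p grows.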
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
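The masking idea in the abstract above can be illustrated for a single hidden layer. The sketch assigns each input a position in the ordering and each hidden unit a random "degree", then builds binary masks so that output d is connected only to inputs that come before d. The degree-based construction and all names here are assumptions made for the example, not a statement of the authors' exact implementation.

```python
import random

def made_masks(num_inputs, num_hidden, seed=0):
    """Build input-to-hidden and hidden-to-output masks for one hidden layer."""
    rng = random.Random(seed)
    in_deg = list(range(1, num_inputs + 1))            # position of each input
    hid_deg = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit k may see input d only if hid_deg[k] >= in_deg[d].
    mask_in = [[1 if hk >= d else 0 for d in in_deg] for hk in hid_deg]
    # Output d may see hidden unit k only if in_deg[d] > hid_deg[k].
    mask_out = [[1 if d > hk else 0 for hk in hid_deg] for d in in_deg]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """Count masked paths from each input (column) to each output (row)."""
    n_hidden, n_in = len(mask_in), len(mask_in[0])
    return [[sum(row[k] * mask_in[k][i] for k in range(n_hidden))
             for i in range(n_in)] for row in mask_out]

mask_in, mask_out = made_masks(num_inputs=4, num_hidden=8)
paths = connectivity(mask_in, mask_out)
# paths[o][i] is zero whenever i >= o, so output o never depends on
# input o itself or on any later input in the ordering.
```

In a full model these masks would be multiplied elementwise with the weight matrices; sampling a different ordering or different hidden degrees yields a different decomposition of the joint probability.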
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
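The training objective named above, the Kullback-Leibler divergence between the teacher's soft alignments and the student's outputs, can be sketched for a single frame as follows. The probability vectors are made-up values, and `kl_divergence` is a hypothetical helper, not code from the paper. The last lines recompute the relative WER improvement from the two figures quoted in the abstract.

```python
import math

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions over the same classes."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher, student) if t > 0)

teacher_probs = [0.7, 0.2, 0.1]   # hypothetical soft alignment from the RNN
student_probs = [0.5, 0.3, 0.2]   # hypothetical output of the small DNN
loss = kl_divergence(teacher_probs, student_probs)     # positive when they differ
zero = kl_divergence(teacher_probs, teacher_probs)     # zero when they match

# Relative improvement implied by the WER figures quoted in the abstract.
baseline_wer, distilled_wer = 4.54, 3.93
relative_improvement = (baseline_wer - distilled_wer) / baseline_wer
```

Here `relative_improvement` comes to roughly 0.134, consistent with the "more than 13%" stated in the abstract.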
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
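The abstract above does not spell out how a series becomes a GASF or GADF image. The sketch below assumes the commonly cited construction: rescale the series to [-1, 1], map each value to an angle phi = arccos(x), then set GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j). The function names are hypothetical, and the Markov Transition Field is not covered.

```python
import math


def rescale(series):
    """Min-max rescale a series to [-1, 1] so that arccos is defined."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series carries no shape information; map it to zeros.
        return [0.0 for _ in series]
    return [((x - hi) + (x - lo)) / (hi - lo) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists, under the assumed definitions."""
    x = rescale(series)
    # Clamp to guard against rounding pushing values slightly outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    n = len(phi)
    gasf = [[math.cos(phi[i] + phi[j]) for j in range(n)] for i in range(n)]
    gadf = [[math.sin(phi[i] - phi[j]) for j in range(n)] for i in range(n)]
    return gasf, gadf


gasf, gadf = gramian_angular_fields([1.0, 2.0, 3.0, 4.0])
```

Under these definitions the GASF diagonal equals 2x^2 - 1 for the rescaled values, which is what allows the series to be recovered from the image up to sign.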
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
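The gradient reversal layer named in the abstract above can be illustrated with a minimal, framework-free sketch. It assumes the usual reading of the term: the layer is the identity in the forward pass and negates the gradient in the backward pass. The scaling factor `lam` and the class name are hypothetical details not stated in the abstract.

```python
class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass."""

    def __init__(self, lam=1.0):
        # lam is an assumed scaling factor for the reversed gradient.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient from the domain classifier is negated, so the feature
        # extractor is pushed to make the two domains harder to tell apart.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
features = layer.forward([1.0, 2.0])
grad_to_features = layer.backward([0.5, -2.0])
```

In a real deep learning package the same behaviour would be expressed through that package's custom-gradient mechanism; the class above only shows the two passes side by side.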
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
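The "straight path from initialization to solution" in the abstract above can be read as evaluating the loss at parameters theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final for alpha in [0, 1]. The sketch below assumes that reading; the function names and the toy quadratic loss are hypothetical stand-ins for a real network and its training objective.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two flattened parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, steps=11):
    """Evaluate loss_fn at evenly spaced points from initialization to solution."""
    alphas = [k / (steps - 1) for k in range(steps)]
    return [(a, loss_fn(interpolate(theta_init, theta_final, a))) for a in alphas]


def toy_loss(theta):
    # Hypothetical stand-in for a training loss, minimized at theta = (1, 1).
    return sum((t - 1.0) ** 2 for t in theta)


curve = loss_along_path(toy_loss, [0.0, 0.0], [1.0, 1.0])
```

Plotting the second element of each pair against alpha gives the one-dimensional loss profile; a bump along the curve would indicate an obstacle on the straight path.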
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
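The ranking strategy in the abstract above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) can be sketched as follows. The `pressure` parameter, the exact decay formula, and sampling with replacement are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def selection_probabilities(latest_losses, pressure=100.0):
    """Probability per datapoint, decaying exponentially with its loss rank."""
    n = len(latest_losses)
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    # Assumed decay: consecutive ranks differ by a constant factor chosen so
    # the top-ranked point is roughly `pressure` times likelier than the last.
    decay = math.exp(-math.log(pressure) / n)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, pressure=100.0, rng=random):
    """Draw batch indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probabilities(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


probs = selection_probabilities([0.1, 2.0, 0.5])
batch = sample_batch([0.1, 2.0, 0.5], batch_size=4, rng=random.Random(0))
```

After each training step the losses of the sampled points would be refreshed and the ranking recomputed; setting `pressure` to 1 recovers uniform sampling.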
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing pipeline. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
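For readers unfamiliar with SVRG, the standard update can be sketched as below: each epoch takes a snapshot of the parameters, computes the full gradient there, and then runs inner steps using the variance-reduced direction grad_i(w) - grad_i(snapshot) + full_gradient. The step size, epoch count, and the toy least-squares problem are assumptions chosen for illustration. The toy problem is convex and only demonstrates the update rule, not the nonconvex analysis of the abstract above.

```python
import random


def svrg(grad_i, n, w0, lr=0.05, epochs=30, inner_steps=None, seed=0):
    """Minimize (1/n) * sum_i f_i(w), given grad_i(w, i) = gradient of f_i at w."""
    rng = random.Random(seed)
    m = inner_steps or n
    w = list(w0)
    for _ in range(epochs):
        snapshot = list(w)
        # Full gradient at the snapshot, computed once per epoch.
        full = [0.0] * len(w)
        for i in range(n):
            g = grad_i(snapshot, i)
            full = [f + gj / n for f, gj in zip(full, g)]
        for _ in range(m):
            i = rng.randrange(n)
            g_w = grad_i(w, i)
            g_s = grad_i(snapshot, i)
            # Variance-reduced stochastic gradient step.
            w = [wj - lr * (a - b + c) for wj, a, b, c in zip(w, g_w, g_s, full)]
    return w


# Toy problem (assumed): f_i(w) = 0.5 * (a_i * w - b_i)^2 with b_i = 2 * a_i,
# so every component is minimized at w = 2.
A = [1.0, 2.0, 3.0]
B = [2.0, 4.0, 6.0]


def toy_grad(w, i):
    return [A[i] * (A[i] * w[0] - B[i])]


w_hat = svrg(toy_grad, n=3, w0=[0.0])
```

The design point is that the correction term grad_i(snapshot) - full_gradient has zero mean over i, so the step stays an unbiased gradient estimate while its variance shrinks as w and the snapshot approach a stationary point.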
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
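The stochastic rounding idea mentioned above can be illustrated with a minimal sketch. It assumes a signed fixed-point format with a hypothetical split of 16 total bits and 8 fractional bits; the function names `stochastic_round` and `to_fixed_point` are illustrative and are not taken from the paper. A value is rounded up with probability equal to its fractional remainder, so the rounded value equals the input in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to the fractional
    remainder, so the result is unbiased in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    remainder = scaled - lower  # in [0, 1)
    if rng.random() < remainder:
        lower += 1
    return lower / scale


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed word_bits range.

    The 16/8 bit split is an assumption made for this illustration.
    """
    scale = 1 << frac_bits
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    q = round(stochastic_round(x, frac_bits, rng) * scale)
    q = max(min_int, min(max_int, q))
    return q / scale
```

Unlike round-to-nearest, small updates below the quantization step are not always lost: they survive with a probability proportional to their size.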
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoising Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
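The shape relationships referred to above can be sketched in one dimension. The helper names below are hypothetical; the formulas are the standard ones for a convolution with input size i, kernel size k, zero padding p and stride s, and for the transposed convolution that undoes the shape change (the optional `extra` term covers the case where the stride does not evenly divide i + 2p - k).

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length of a convolution or pooling layer along one axis.

    o = floor((i + 2p - k) / s) + 1
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1, extra=0):
    """Output length of a transposed convolution along one axis.

    o' = s * (i - 1) + k - 2p + extra, where extra is (i_orig + 2p - k) mod s
    for the original convolution whose shape is being inverted.
    """
    return s * (i - 1) + k - 2 * p + extra


# A 'same' convolution: 3-wide kernel, padding 1, stride 1 keeps the size.
same = conv_output_size(5, k=3, p=1, s=1)

# A strided convolution and the transposed convolution that restores the size.
down = conv_output_size(6, k=3, p=1, s=2)
up = transposed_conv_output_size(down, k=3, p=1, s=2, extra=(6 + 2 * 1 - 3) % 2)
```

For multi-dimensional layers the same formula is applied independently to each spatial axis.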
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
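The first direction above, combining max and average pooling, can be sketched for a single pooling region as a convex combination of the two. This is a simplified illustration, not the paper's implementation: the function name `mixed_pool` is hypothetical, and the mixing proportion `alpha` is passed in as a fixed number, whereas the abstract describes it as learned.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 gives plain average
    pooling. Here alpha is a fixed input; in a trained network it would be
    a parameter updated by gradient descent.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


# One 2x2 pooling region, flattened to a list of activations.
window = [1.0, 2.0, 3.0, 6.0]
blended = mixed_pool(window, alpha=0.5)
```

Because the output is differentiable in `alpha`, the proportion can in principle be trained with the rest of the network.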
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
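The rank-based selection rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `pressure` parameter (taken here to be roughly the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints), and the use of sampling with replacement are all assumptions.

```python
import math
import random

def rank_selection_probs(losses, pressure=100.0):
    """Map latest-known losses to selection probabilities that decay
    exponentially with rank (rank 0 = greatest loss).

    `pressure` is a hypothetical knob: larger values favour high-loss points more.
    """
    n = len(losses)
    # Indices of datapoints sorted from greatest to smallest loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-math.log(pressure) * rank / n)
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw a batch of datapoint indices (with replacement, by assumption)."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)

if __name__ == "__main__":
    latest_losses = [0.1, 2.0, 0.5, 1.2]
    print(rank_selection_probs(latest_losses, pressure=10.0))
    print(sample_batch(latest_losses, batch_size=3, rng=random.Random(0)))
```

In a training loop, the stored loss of each datapoint would be refreshed whenever that datapoint appears in a batch, so the ranking tracks the latest known values.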
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in an deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we first develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
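The stochastic rounding scheme mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_round_fixed16`, the split into `frac_bits` fractional bits, and the saturation to a signed 16-bit range are all assumptions made for illustration. The idea shown is that a value is rounded up with probability equal to its fractional distance from the lower grid point, which makes the rounding unbiased in expectation.

```python
import math
import random


def stochastic_round_fixed16(x, frac_bits=8, rng=random):
    """Round x onto a 16-bit fixed-point grid using stochastic rounding.

    Hypothetical helper for illustration only. The grid spacing is
    2**-frac_bits; x is rounded up with probability equal to its
    fractional distance above the lower grid point, so the expected
    value of the result equals x (before saturation).
    """
    scale = 1 << frac_bits                  # grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)              # nearest grid step below x
    p_up = scaled - lower                   # probability of rounding up, in [0, 1)
    steps = lower + (1 if rng.random() < p_up else 0)
    # Saturate to the signed 16-bit integer range (an assumed word format).
    steps = max(-(1 << 15), min((1 << 15) - 1, steps))
    return steps / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed16(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; their average stays close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this coarse grid, whereas the stochastic variant preserves the value on average, which is one intuition for why small gradient updates are not systematically lost.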
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
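The relationships between input shape, kernel shape, zero padding, strides and output shape described above can be sketched per spatial axis. The function names below are hypothetical, and the formulas are the commonly used ones for a single axis: $o = \lfloor (i + 2p - k)/s \rfloor + 1$ for convolution and pooling, and $o' = s(i' - 1) + k - 2p$ for the transposed convolution that inverts the shape of a convolution whose stride divides $i + 2p - k$ exactly.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length along one axis for a convolution (or pooling) layer.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length along one axis for a transposed convolution.

    This undoes the shape change of conv_output_size when (o + 2p - k)
    is a multiple of s for the original input size o.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5x5 input with a 3x3 kernel, stride 2 and padding 1 gives a 3x3 output.
    print(conv_output_size(5, 3, s=2, p=1))             # 3
    # The matching transposed convolution maps 3 back to 5.
    print(transposed_conv_output_size(3, 3, s=2, p=1))  # 5
```

When the stride does not divide $i + 2p - k$, several input sizes map to the same output size, so the transposed formula recovers only the smallest of them.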
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments, every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does deep learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher-order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
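The rank-based selection described in the abstract above can be sketched as follows, assuming selection probability decays exponentially with rank. The function names and the `pressure` parameter (roughly the ratio between the selection probabilities of the highest- and lowest-ranked datapoints) are assumptions for illustration; the exact parameterization used in the paper may differ.

```python
import math
import random


def selection_probabilities(num_points, pressure):
    """Probability of picking each rank; rank 0 is the datapoint with the largest loss."""
    decay = math.log(pressure) / num_points
    weights = [math.exp(-decay * rank) for rank in range(num_points)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, pressure, rng):
    """Sample datapoint indices (with replacement), favouring large latest-known losses."""
    # Rank datapoints from largest to smallest latest known loss.
    order = sorted(range(len(latest_losses)), key=lambda i: latest_losses[i], reverse=True)
    probs = selection_probabilities(len(order), pressure)
    return rng.choices(order, weights=probs, k=batch_size)


rng = random.Random(0)
print(selection_probabilities(4, 100.0))
print(sample_batch([0.1, 2.0, 0.5], batch_size=5, pressure=10.0, rng=rng))
```

Setting `pressure` to 1 gives uniform sampling, so the parameter interpolates between standard batch selection and strongly loss-driven selection.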
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
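A minimal sketch of stochastic rounding to a fixed-point grid, as referred to in the abstract above: a value is rounded up or down to the nearest representable neighbours with probability proportional to its proximity, so the rounding is unbiased in expectation. The function name and the `frac_bits` parameter are assumptions for illustration, and saturation to a fixed word length (such as 16 bits) is deliberately left out.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x to a multiple of 2**-frac_bits, rounding up with probability equal
    to the fractional distance from the lower grid point."""
    step = 2.0 ** -frac_bits
    scaled = x / step
    lower = math.floor(scaled)
    prob_up = scaled - lower  # in [0, 1); zero when x is already representable
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded * step


rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
# Each sample is 0.25 or 0.5, and their average stays close to 0.3.
print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time, systematically losing the small remainder; the randomized variant preserves it on average, which is the property the abstract identifies as crucial for training.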
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
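The relationships between input shape, kernel shape, zero padding, strides and output shape mentioned in the abstract above can be sketched along a single spatial axis using the standard formulas. The function names are assumptions for illustration; `i`, `k`, `p` and `s` stand for input size, kernel size, zero padding and stride.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size of a convolution: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output size of the transposed convolution associated with a convolution
    that uses kernel k, padding p and stride s: s * (i - 1) + k - 2p."""
    return s * (i - 1) + k - 2 * p


out = conv_output_size(5, k=3, p=1, s=2)
print(out)
print(transposed_conv_output_size(out, k=3, p=1, s=2))
```

The transposed formula recovers the original input size only when `i + 2p - k` is a multiple of `s`; otherwise several input sizes map to the same convolution output size and the inverse is ambiguous.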
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning and what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
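The straight-path analysis described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: it evaluates a loss at evenly spaced points on the line segment between an initial and a final parameter vector. The names interpolate, loss_along_path and toy_loss are hypothetical, and the quadratic toy loss merely stands in for a real network's training objective.

```python
def interpolate(theta_init, theta_final, alpha):
    """Return (1 - alpha) * theta_init + alpha * theta_final, element-wise."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at num_points evenly spaced points on the straight path."""
    alphas = [j / (num_points - 1) for j in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    """Hypothetical stand-in objective: squared distance to the all-ones vector."""
    return sum((t - 1.0) ** 2 for t in theta)


# alpha = 0 is the initialization, alpha = 1 is the solution found by training.
curve = loss_along_path(toy_loss, [0.0, 3.0], [1.0, 1.0], num_points=5)
```

A curve that decreases monotonically in alpha, as it does for this convex toy loss, is the kind of evidence the abstract refers to: no significant obstacle lies on the straight path.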
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
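The rank-based selection rule described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: datapoints are ranked by their latest known loss, and the selection probability decays exponentially with rank. The parameter selection_pressure (a hypothetical name) is assumed to be the ratio between the probabilities of the highest-loss and lowest-loss datapoints, and sampling is done with replacement for simplicity.

```python
import math
import random


def selection_probabilities(losses, selection_pressure=100.0):
    """Return one selection probability per datapoint, decaying exponentially with loss rank."""
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank decay chosen so that p[rank 0] / p[rank n-1] == selection_pressure.
    decay = math.exp(-math.log(selection_pressure) / max(n - 1, 1))
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, rng, selection_pressure=100.0):
    """Sample datapoint indices (with replacement) using the rank-based probabilities."""
    probs = selection_probabilities(losses, selection_pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


latest_losses = [0.1, 2.5, 0.7, 1.2]
probs = selection_probabilities(latest_losses, selection_pressure=8.0)
batch = sample_batch(latest_losses, batch_size=3, rng=random.Random(0),
                     selection_pressure=8.0)
```

Setting selection_pressure to 1 recovers uniform sampling, so the parameter gives a single knob for controlling how strongly high-loss datapoints are favoured.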
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
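The stochastic rounding scheme mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a real value is mapped to a signed fixed-point grid and rounded up with probability equal to its fractional distance from the lower grid point, which makes the rounding unbiased in expectation. The function name, the default split of 8 fractional bits in a 16-bit word, and the saturation behaviour are assumptions made for this example.

```python
import math
import random


def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with frac_bits fractional bits."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance from the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a word_bits-wide signed integer (assumed behaviour).
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    quantized = min(max(lower, min_int), max_int)
    return quantized / scale


# Averaging many stochastic roundings of 0.3 recovers a value close to 0.3,
# whereas round-to-nearest would always return 77/256.
rng = random.Random(0)
samples = [stochastic_round_fixed_point(0.3, rng=rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

Because small updates are preserved on average rather than always rounded away, the scheme fits the abstract's observation that the rounding mode plays a crucial role in low-precision training.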
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
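As a rough illustration of the first direction mentioned in the abstract above (combining max and average pooling), the sketch below blends the two with a single mixing weight. The function name `mixed_pool` and the parameter `a` are hypothetical, and the convex combination is only one plausible way to combine the two operations; the abstract does not spell out the exact strategies, and in a learned setting `a` would be a trained parameter rather than a fixed argument.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    a: mixing weight in [0, 1] (hypothetical; a=1 is pure max pooling,
       a=0 is pure average pooling).
    """
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return a * max_value + (1.0 - a) * avg_value


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    for weight in (0.0, 0.5, 1.0):
        print(weight, mixed_pool(window, weight))
```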
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
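The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region as follows. This is a minimal illustration, not the authors' implementation: it assumes the activations are non-negative (for example, after a rectifier), and the function name `stochastic_pool` and the all-zero fallback are assumptions made for this example.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is selected with probability proportional to its
    value, i.e. a multinomial distribution formed by normalising the
    activities in the region. Assumes all activations are >= 0.
    """
    total = sum(region)
    if total <= 0:
        # Assumed fallback: with no positive activity there is nothing
        # to sample from, so return zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator for repeatable sampling
    window = [0.5, 0.0, 2.0, 1.5]
    print([stochastic_pool(window, rng) for _ in range(5)])
```

Larger activations are chosen more often, but smaller non-zero ones still get selected occasionally, which is the source of the regularizing noise.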
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
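The abstract above does not say how distillation shrinks gradients. A minimal sketch, assuming the usual distillation setup in which the softmax is evaluated at a raised temperature (the function name, logits and temperature values are illustrative, not taken from the abstract):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-class problem.
logits = [8.0, 2.0, -1.0]
hard = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=20.0)
print("T=1 :", hard)
print("T=20:", soft)
```

A higher temperature spreads probability mass over more classes while keeping the same arg-max, which is the kind of softened target a distilled network is trained on.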
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does deep learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
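The pooling interpretation in the abstract above can be checked numerically. A minimal sketch, assuming the normalized $L_p$ norm takes the form $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$ and leaving out the learned projections and learned order $p$ (the function name and input values are illustrative):

```python
def normalized_lp_norm(values, p):
    """Normalized L_p norm: (mean of |v|^p) raised to the power 1/p."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)

# Hypothetical pre-activations feeding a single unit.
inputs = [1.0, -2.0, 3.0]

mean_abs = normalized_lp_norm(inputs, 1)    # p = 1: average of absolute values
rms = normalized_lp_norm(inputs, 2)         # p = 2: root-mean-square
near_max = normalized_lp_norm(inputs, 200)  # large p: approaches max |v|

print(mean_abs, rms, near_max)
```

Varying only $p$ moves the same unit between average-like, root-mean-square and max-like pooling.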
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
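The masking idea in the abstract above can be illustrated with a small connectivity check. A minimal sketch, assuming one common construction in which each hidden unit is assigned a degree and connections are kept only when they respect the input ordering (the degree assignment and function names are illustrative; the abstract does not specify them):

```python
def made_masks(n_inputs, hidden_degrees):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Inputs have degrees 1..n_inputs; each hidden unit has a degree in 1..n_inputs-1.
    """
    input_degrees = range(1, n_inputs + 1)
    # Hidden unit with degree m may see inputs with degree <= m.
    mask_in = [[1 if m >= d else 0 for d in input_degrees] for m in hidden_degrees]
    # Output d may see hidden units with degree strictly less than d.
    mask_out = [[1 if d > m else 0 for m in hidden_degrees] for d in input_degrees]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """conn[d][j] is True if output d has any unmasked path from input j."""
    n_inputs = len(mask_in[0])
    n_hidden = len(mask_in)
    return [
        [any(mask_out[d][k] and mask_in[k][j] for k in range(n_hidden))
         for j in range(n_inputs)]
        for d in range(n_inputs)
    ]

# Hypothetical degrees for four hidden units and four inputs.
mask_in, mask_out = made_masks(4, [1, 2, 1, 3])
conn = connectivity(mask_in, mask_out)
for row in conn:
    print([int(c) for c in row])
```

The printed matrix is strictly lower triangular: each output is connected only to inputs that come earlier in the ordering, so the outputs can be read as conditional probabilities.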
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
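The relative improvement quoted in the abstract above follows directly from the two WER figures. A short worked check, assuming the standard definition (baseline minus new, divided by baseline); the function name is illustrative:

```python
def relative_improvement(baseline, new):
    """Relative reduction of an error metric with respect to the baseline."""
    return (baseline - new) / baseline

# WER figures taken from the abstract: baseline 4.54, distilled small DNN 3.93.
rel = relative_improvement(4.54, 3.93)
print(f"{rel:.1%}")  # prints 13.4%
```

The result is consistent with the stated "more than 13%" relative improvement.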
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
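The gradient reversal layer named in the abstract above can be illustrated with a minimal sketch: the layer passes features through unchanged on the forward pass and flips the sign of the gradient on the backward pass. The class name `GradientReversal`, its `forward`/`backward` methods, and the `scale` factor are hypothetical names chosen for this illustration; the abstract does not specify an interface, and a real implementation would hook into a deep learning package's autodiff machinery.

```python
class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass.

    `scale` is an assumed multiplier on the reversed gradient (not from the abstract).
    """

    def __init__(self, scale=1.0):
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is negated,
        # pushing features towards being uninformative about the domain.
        return [-self.scale * g for g in grad_output]


layer = GradientReversal(scale=1.0)
forward_out = layer.forward([1.0, 2.0])
backward_out = layer.backward([0.5, -1.0])
```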
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further reduce the size of the multilayer bootstrap network by using a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
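The analysis described in the abstract above evaluates the loss on a straight path from initialization to solution. A minimal sketch of that idea follows, assuming parameters are flat lists of floats. The helper names `interpolate` and `loss_along_path` and the quadratic `toy_loss` are hypothetical stand-ins, not taken from the paper; a real experiment would plug in the network's training loss.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points from theta_init to theta_final."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    # Toy stand-in for a training loss: a quadratic bowl with minimum at (1, -2).
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2


path = loss_along_path(toy_loss, [0.0, 0.0], [1.0, -2.0], num_points=5)
losses_on_path = [loss for _, loss in path]
```

Plotting `losses_on_path` against `alpha` gives the one-dimensional cross-section of the loss surface; a bump along the curve would indicate an obstacle between the two endpoints.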
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
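The selection rule described in the abstract above (rank datapoints by their latest known loss, then let the selection probability decay exponentially with rank) can be sketched as follows. The function names and the `decay` parameter controlling the selection pressure are assumptions made for this illustration; the abstract does not give the exact parametrization.

```python
import math
import random


def selection_probabilities(latest_losses, decay=0.1):
    """Selection probability per datapoint, decaying exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest known loss.
    `decay` is a hypothetical knob for the selection pressure.
    """
    order = sorted(range(len(latest_losses)),
                   key=lambda i: latest_losses[i], reverse=True)
    weights = [0.0] * len(latest_losses)
    for rank, index in enumerate(order):
        weights[index] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, decay=0.1, rng=random):
    """Draw a batch of indices (with replacement) from the rank-based distribution."""
    probs = selection_probabilities(latest_losses, decay)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


losses = [0.2, 1.5, 0.7, 0.05]
probs = selection_probabilities(losses, decay=0.5)
batch = sample_batch(losses, batch_size=3, decay=0.5, rng=random.Random(0))
```

Setting `decay` to zero recovers uniform sampling, which is one way to reduce the selection pressure over time.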
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
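For readers unfamiliar with SVRG, the update analyzed in the abstract above can be sketched on a finite sum. The sketch below uses a one-dimensional least-squares problem purely for brevity; that problem is convex, so it illustrates only the mechanics of the variance-reduced update, not the nonconvex setting the abstract studies. The function names, step size, and epoch counts are assumptions chosen for this toy example.

```python
import random


def grad_i(w, a, b):
    """Gradient of the i-th term 0.5 * (a * w - b) ** 2 with respect to w."""
    return a * (a * w - b)


def svrg(data, w0, step=0.05, epochs=30, inner_steps=None, seed=0):
    """Minimize the average of the per-example terms with SVRG."""
    rng = random.Random(seed)
    n = len(data)
    inner_steps = inner_steps or n
    w_snapshot = w0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snapshot, a, b) for a, b in data) / n
        w = w_snapshot
        for _ in range(inner_steps):
            a, b = data[rng.randrange(n)]
            # Variance-reduced stochastic gradient: the correction term has
            # zero mean, so the estimate stays unbiased.
            g = grad_i(w, a, b) - grad_i(w_snapshot, a, b) + full_grad
            w -= step * g
        w_snapshot = w
    return w_snapshot


# Toy data whose least-squares solution is sum(a * b) / sum(a * a) = 2.0.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (0.5, 1.2)]
w_hat = svrg(data, w0=0.0)
```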
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well with parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a slight increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets such as MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they have led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
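As a minimal sketch of the GASF encoding named in the abstract above, assuming the commonly cited definition (rescale the series, map each value to an angle with arccos, and take the cosine of pairwise angle sums). The function names, the [0, 1] rescaling, and the non-constant-series requirement are assumptions made for illustration, not the authors' implementation.

```python
import math

def rescale01(series):
    """Min-max rescale a non-constant series to [0, 1] (assumed preprocessing)."""
    lo, hi = min(series), max(series)
    # Clamp to guard against tiny floating-point overshoot before acos.
    return [min(1.0, max(0.0, (x - lo) / (hi - lo))) for x in series]

def gasf(series):
    """Gramian Angular Summation Field: entry (i, j) is cos(phi_i + phi_j)."""
    phi = [math.acos(x) for x in rescale01(series)]
    return [[math.cos(a + b) for b in phi] for a in phi]

def recover_from_diagonal(field):
    """Invert the encoding on [0, 1] data: the diagonal is cos(2*phi_i) = 2*x_i**2 - 1."""
    return [math.sqrt((field[i][i] + 1.0) / 2.0) for i in range(len(field))]
```

The diagonal of the field alone is enough to recover the rescaled series, which is one way to read the bijection property the abstract mentions for 0/1 rescaled data.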
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
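The analysis described in the abstract above (evaluating the objective on a straight path from initialization to solution) can be sketched as follows. This is an illustrative outline only: the function names, the flat-list parameter representation, and the number of evaluation points are assumptions, and `loss_fn` stands in for whatever training objective is being examined.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors (alpha in [0, 1])."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]

def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points from theta_init to theta_final."""
    alphas = [k / (num_points - 1) for k in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]

def has_barrier(path_losses):
    """True if the loss ever increases along the path (a bump between the endpoints)."""
    values = [loss for _, loss in path_losses]
    return any(later > earlier for earlier, later in zip(values, values[1:]))
```

For a real network, `theta_init` and `theta_final` would be the flattened weights before and after training, and the resulting curve would be plotted against alpha.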
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
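The selection strategy in the abstract above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) might look roughly like the sketch below. The `pressure` parameter, the exact decay formula, and sampling with replacement are assumptions chosen for illustration; the abstract does not specify them.

```python
import math
import random

def selection_probabilities(losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    `pressure` (hypothetical name) is roughly the ratio between the selection
    probability of the highest-loss and the lowest-loss datapoint.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the largest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(-math.log(pressure) / n)  # ratio between consecutive ranks
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** rank
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices for one batch (with replacement, for simplicity)."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop, `losses` would be updated for the selected indices after each step, so the ranking tracks the latest known loss of every datapoint.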
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
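For readers unfamiliar with SVRG, a minimal sketch of the basic update on a finite sum follows. It is written under simplifying assumptions: parameters are a flat list of floats, `grad_i(x, i)` is a user-supplied per-example gradient, and the snapshot is taken to be the last inner iterate (other choices exist). All names are hypothetical, and this is not the specific variant analyzed in the abstract above.

```python
import random

def svrg(grad_i, n, x0, step, epochs, inner_steps, rng=random):
    """Basic SVRG loop for minimizing (1/n) * sum_i f_i(x).

    grad_i(x, i) returns the gradient of f_i at x as a list of floats.
    """
    snapshot = list(x0)
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = [0.0] * len(snapshot)
        for i in range(n):
            g = grad_i(snapshot, i)
            full_grad = [f + gj / n for f, gj in zip(full_grad, g)]
        x = list(snapshot)
        for _ in range(inner_steps):
            i = rng.randrange(n)
            g_x = grad_i(x, i)
            g_snap = grad_i(snapshot, i)
            # Variance-reduced direction: stochastic gradient corrected by
            # the same example's gradient at the snapshot plus the full gradient.
            x = [xj - step * (a - b + f)
                 for xj, a, b, f in zip(x, g_x, g_snap, full_grad)]
        snapshot = x  # assumed snapshot rule: keep the last inner iterate
    return snapshot
```

On a toy problem such as f_i(x) = 0.5 * (x - a_i)^2, the correction term cancels the per-example noise exactly, so every inner step moves straight toward the mean of the a_i.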
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speed-up, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
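The stochastic rounding mentioned above can be illustrated with a minimal sketch. It assumes a fixed-point grid with `frac_bits` fractional bits and rounds a value up with probability equal to its fractional distance from the lower grid point, so the result is unbiased in expectation. The name `stochastic_round`, its parameters, and the omission of saturation to a 16-bit range are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a grid of step 2**-frac_bits, rounding up with
    probability equal to the fractional remainder (hypothetical helper)."""
    scale = 1 << frac_bits            # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)        # nearest grid point at or below x
    frac = scaled - lower             # distance to that point, in [0, 1)
    step_up = 1 if rng.random() < frac else 0
    return (lower + step_up) / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; their mean should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest on the same grid would always map 0.3 to 0.25, which is the kind of systematic bias that stochastic rounding avoids.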
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
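The relationships between input shape, kernel shape, zero padding, strides and output shape described above can be sketched for a single spatial axis. The functions below use the standard formulas: o = floor((i + 2p - k) / s) + 1 for a convolution or pooling layer, and o' = s(i' - 1) + k - 2p for the matching transposed convolution. They assume no dilation and no extra output padding. The function names are hypothetical and chosen for illustration.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length along one axis of a convolution or pooling layer.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length along one axis of a transposed convolution,
    assuming no additional output padding."""
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5-wide input with a 3-wide kernel, stride 2, padding 1.
    o = conv_output_size(5, k=3, s=2, p=1)
    print(o)
    # The transposed convolution with the same k, s, p maps it back.
    print(transposed_conv_output_size(o, k=3, s=2, p=1))
```

The round trip recovers the original input size only when (i + 2p - k) is divisible by s; otherwise the floor in the forward formula discards information about the input size.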
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
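One simple way to combine max and average pooling, as the first direction above suggests, is a convex combination controlled by a mixing weight. The sketch below assumes a single scalar weight `a` in [0, 1] that would be learned in practice; here it is passed in by hand. The name `mixed_pool` and this particular parametrization are illustrative assumptions, not necessarily the exact formulation used in the paper.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    a: mixing weight in [0, 1]; 1.0 gives max pooling, 0.0 gives average.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing weight a must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return a * max_val + (1.0 - a) * avg_val


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    for a in (0.0, 0.5, 1.0):
        print(a, mixed_pool(window, a))
```

Because the output is differentiable with respect to `a`, the weight could be trained by gradient descent together with the rest of the network.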
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
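The sampling step described above can be sketched in a few lines. This is a minimal illustration for a single pooling region, assuming non-negative activations (for example after a rectifier) so that they can be normalized into multinomial probabilities; the function name `stochastic_pool` and the zero-region fallback are assumptions, not taken from the abstract.

```python
import random

def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its
    value. Activations are assumed to be non-negative.
    """
    total = sum(region)
    if total == 0:
        # No activity in the region: nothing to sample from (assumed fallback).
        return 0.0
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]
```

Larger activations are picked more often, but smaller non-zero ones are still occasionally selected, which is the source of the regularizing noise.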
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
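The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is a simplification: it applies the normalized norm directly to a list of values and leaves out the learned projections the abstract mentions; the function name `lp_pool` is hypothetical.

```python
def lp_pool(values, p):
    """Normalized L_p norm: (mean of |v|^p) raised to the power 1/p.

    p=1 gives the mean absolute value, p=2 the root mean square,
    and large p approaches the maximum absolute value.
    """
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)
```

For the values `[1.0, 2.0, 3.0, 4.0]`, `lp_pool(values, 1)` is the plain average 2.5, and increasing `p` moves the result toward the maximum 4.0, which is why learning a separate order `p` per unit lets each unit sit anywhere between average-like and max-like behavior.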
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
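The masking idea can be sketched for a single hidden layer. The construction below is one common way to build such masks and is an assumption here, since the abstract does not give the exact scheme: each hidden unit gets a random "degree", it may read only inputs up to that degree, and an output may read only hidden units of strictly lower degree. The names `made_masks` and `depends_on` are hypothetical, and inputs are taken in their natural ordering.

```python
import random

def made_masks(num_inputs, num_hidden, rng):
    """Build binary masks for a one-hidden-layer autoencoder.

    Requires num_inputs >= 2. Returns (in_mask, out_mask) where
    in_mask[k][d] gates input d -> hidden k and
    out_mask[d][k] gates hidden k -> output d (0-based indices).
    """
    # Degree of each hidden unit, drawn from 1 .. num_inputs - 1.
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    in_mask = [[1 if degrees[k] >= d else 0 for d in range(1, num_inputs + 1)]
               for k in range(num_hidden)]
    out_mask = [[1 if d > degrees[k] else 0 for k in range(num_hidden)]
                for d in range(1, num_inputs + 1)]
    return in_mask, out_mask

def depends_on(in_mask, out_mask, out_d, in_d):
    """True if some unmasked path connects input in_d to output out_d."""
    return any(out_mask[out_d][k] and in_mask[k][in_d]
               for k in range(len(in_mask)))
```

Multiplying the weight matrices elementwise by these masks leaves output `d` connected only to inputs that come strictly earlier in the ordering, so each output can be read as a conditional probability given the previous inputs.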
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
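The factorization in the abstract above can be checked numerically on a toy example. The following is only a minimal sketch under stated assumptions: the four-point distribution `p_x` and the encoder `f` (integer division by 2) are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict

# Hypothetical toy distribution over x in {0, 1, 2, 3} (illustration only).
p_x = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

def f(x):
    """Hypothetical deterministic discrete encoder: maps x to a coarser code h."""
    return x // 2

# P(H = h) is the total mass of all x that encode to h.
p_h = defaultdict(float)
for x, p in p_x.items():
    p_h[f(x)] += p

def p_x_given_h(x, h):
    """Decoder P(X = x | H = h); puts no mass on x whose code differs from h."""
    return p_x[x] / p_h[h] if f(x) == h else 0.0

# log P(x) = log P(x | h = f(x)) + log P(h = f(x)):
# a reconstruction term plus a term on the encoded activation.
for x, p in p_x.items():
    h = f(x)
    reconstruction_term = math.log(p_x_given_h(x, h))
    code_term = math.log(p_h[h])
    assert math.isclose(reconstruction_term + code_term, math.log(p))
```

Here the decoder is computed exactly from the joint distribution, so the identity holds by construction; in the setting the abstract describes, both factors would be learned models.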
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
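The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `pressure` parameter (taken here to be the probability ratio between the highest-loss and lowest-loss datapoint) are assumptions introduced for this example, and sampling is done with replacement for simplicity.

```python
import math
import random

def rank_selection_probs(losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    `pressure` is a hypothetical knob: the ratio between the probability of
    the top-ranked (greatest loss) and the bottom-ranked datapoint.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank decay factor chosen so that rank 0 / rank n-1 == pressure.
    decay = math.exp(math.log(pressure) / max(n - 1, 1))
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = 1.0 / decay ** rank
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(losses, batch_size, rng, pressure=100.0):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)

# Usage with made-up loss values.
losses = [0.1, 2.5, 0.7, 1.3]
probs = rank_selection_probs(losses, pressure=8.0)
batch = sample_batch(losses, batch_size=3, rng=random.Random(0), pressure=8.0)
```

In a training loop, the stored loss of each datapoint would be refreshed whenever that datapoint is evaluated, so the ranking tracks the latest known values.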
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization are for full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
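The stochastic rounding scheme mentioned in the preceding abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_round` and the parameter `frac_bits` are hypothetical, and saturation to a fixed word length (such as 16 bits) is deliberately left out. The idea shown is only that a value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Hypothetical helper for illustration: rounds up with probability equal
    to the distance from the lower grid point, so E[result] == x.
    Overflow/saturation to a fixed word length is not modelled.
    """
    scale = 1 << frac_bits                 # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)             # nearest grid point at or below x
    prob_up = scaled - lower               # fractional part, in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; their average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest with the same grid would always map 0.3 to 0.25, so small updates below half a grid step would be lost on every step; the randomized scheme keeps them on average.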
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
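The relationships between input shape, kernel shape, zero padding, strides and output shape described in the preceding abstract can be sketched for a single spatial axis. The function names and argument names below are hypothetical, and the sketch covers only the standard case (no dilation, no extra output padding): a convolution yields floor((i + 2p - k) / s) + 1 outputs, and the corresponding transposed convolution yields s(i - 1) + k - 2p.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Assumes no dilation.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of a transposed convolution along one axis.

    Uses the same k, s, p as the convolution it transposes and assumes
    no additional output padding.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5-wide input with a 3-wide kernel, stride 2 and padding 1 ...
    out = conv_output_size(5, k=3, s=2, p=1)
    print(out)
    # ... and the transposed convolution mapping that size back.
    print(transposed_conv_output_size(out, k=3, s=2, p=1))
```

When (i + 2p - k) is not a multiple of s, several input sizes map to the same output size, so the transposed formula recovers only the smallest of them unless extra output padding is added.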
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
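The sampling step described in the abstract above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the non-overlapping 1-D windows, and the convention that an all-zero region pools to zero are assumptions made for the example, and activations are assumed to be non-negative (for instance, after a rectifier).

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region, with probability
    proportional to its (non-negative) activity."""
    total = sum(region)
    if total <= 0:
        # Assumption for this sketch: an all-zero region pools to zero.
        return 0.0
    # Multinomial draw: weights are the activities themselves.
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool_1d(activations, size, rng=random):
    """Apply stochastic pooling over non-overlapping 1-D windows."""
    return [
        stochastic_pool(activations[i:i + size], rng)
        for i in range(0, len(activations) - size + 1, size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(stochastic_pool_1d([0.0, 2.0, 0.0, 0.0, 1.0, 3.0], size=2, rng=rng))
```

Larger activations are selected more often, but smaller non-zero ones still get picked occasionally, which is where the regularizing noise comes from.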
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
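The claim above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is a simplified illustration, not the paper's unit: it applies the normalized norm directly to a list of already-computed projection values and omits the learned projections, biases and learned order; the function name `lp_unit` is hypothetical.

```python
def lp_unit(z, p):
    """Normalized L_p norm of projection values z: (mean(|z_i|^p))^(1/p)."""
    n = len(z)
    return (sum(abs(v) ** p for v in z) / n) ** (1.0 / p)


if __name__ == "__main__":
    z = [1.0, -2.0, 4.0]
    print(lp_unit(z, 1))    # mean of absolute values
    print(lp_unit(z, 2))    # root-mean-square
    print(lp_unit(z, 100))  # approaches max(|z_i|) as p grows
```

With p = 1 the unit averages absolute values, with p = 2 it computes the root-mean-square, and for large p it approaches the largest absolute value, which is the sense in which the order p interpolates between pooling operators.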
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm requires only 5% of the computations (multiplications) of traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
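The rank-based selection rule described in the abstract above can be illustrated with a minimal sketch. The function names and the parameter `s_e` (the approximate ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) are assumptions made for illustration; the abstract only states that the probability decays exponentially with rank, so the exact parameterization here is hypothetical.

```python
import math
import random


def rank_selection_probs(losses, s_e=100.0):
    """Selection probability per datapoint, decaying exponentially with loss rank.

    `s_e` is a hypothetical selection-pressure parameter (must be > 0);
    s_e = 1 gives uniform sampling.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(math.log(s_e) / n)  # per-rank decay factor
    weights = [decay ** (-rank) for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, s_e=100.0, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = rank_selection_probs(losses, s_e)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop one would presumably refresh `losses` with the most recent loss of each datapoint after every batch, so that the ranking tracks the current state of the model.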
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
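As a rough illustration of the stochastic rounding idea mentioned in the abstract above, the sketch below quantizes a real number to a signed fixed-point grid, rounding up with probability equal to the fractional remainder so that the result is unbiased in expectation for values inside the representable range. The function name, the split into `frac_bits` fractional bits within a `word_bits`-wide word, and the saturation behaviour are assumptions for illustration, not details taken from the abstract.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to a signed fixed-point value using stochastic rounding.

    frac_bits and word_bits are hypothetical format parameters.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance from the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    quantized = max(min_int, min(max_int, lower))
    return quantized / scale
```

By contrast, round-to-nearest would map every small update below half a grid step to zero, whereas this scheme lets such updates survive on average.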
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows a significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
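As a rough illustration of the Gramian Angular Summation Field named above, the sketch below assumes the commonly described construction: rescale the series to [-1, 1], map each value to an angle phi = arccos(x), and set G[i][j] = cos(phi_i + phi_j). The abstract does not spell out these steps, so the function name `gasf` and the choice of rescaling interval are assumptions made for illustration only.

```python
import math


def gasf(series):
    """Return the Gramian Angular Summation Field of a 1-D series.

    Assumed construction (not stated in the abstract): min-max rescale
    to [-1, 1], take phi = arccos(x), then G[i][j] = cos(phi_i + phi_j).
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    # Min-max rescale to [-1, 1]; clamp to guard against rounding drift.
    scaled = [max(-1.0, min(1.0, 2.0 * (x - lo) / (hi - lo) - 1.0)) for x in series]
    # Polar encoding: each value becomes an angle in [0, pi].
    phi = [math.acos(s) for s in scaled]
    # Pairwise angular sums give a symmetric len(series) x len(series) image.
    return [[math.cos(a + b) for b in phi] for a in phi]


if __name__ == "__main__":
    for row in gasf([0.0, 0.5, 1.0]):
        print([round(v, 3) for v in row])
```

The resulting matrix is symmetric, and its diagonal equals 2x^2 - 1 for each rescaled value x, which is why the original series can be recovered from the diagonal up to sign on this rescaling.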
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
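The rank-based selection rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `pressure` parameter (roughly the ratio between the selection probability of the highest-loss and lowest-loss datapoints), and sampling with replacement are all assumptions introduced here.

```python
import math
import random


def selection_probabilities(losses, pressure=100.0):
    """Map latest known losses to selection probabilities.

    Datapoints are ranked by loss (largest first) and the probability
    decays exponentially with rank. `pressure` is a hypothetical knob:
    the top-ranked point is about `pressure` times more likely to be
    picked than the bottom-ranked one; pressure=1 gives uniform sampling.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw a batch of datapoint indices (with replacement)."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


if __name__ == "__main__":
    latest_losses = [0.1, 2.0, 0.5, 1.2]
    print(selection_probabilities(latest_losses, pressure=8.0))
    print(sample_batch(latest_losses, batch_size=3, rng=random.Random(0)))
```

In a training loop, the stored loss of each selected datapoint would be refreshed after every forward pass so that the ranking reflects the latest known values.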
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
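As a rough illustration of the stochastic rounding idea mentioned above, the sketch below quantizes values to a signed fixed-point grid, rounding up with probability equal to the fractional remainder so that the result is unbiased in expectation. The function name, the 8 fractional bits, and the saturation behaviour are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Quantize x to signed fixed point with stochastic rounding (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Round up with probability equal to the distance from the lower grid point.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    rounded = lower + round_up
    # Saturate to the representable range of a signed word (an assumption here).
    lo, hi = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    return np.clip(rounded, lo, hi) / scale
```

Averaging many stochastic roundings of the same value should recover that value approximately, which is the property that distinguishes this scheme from round-to-nearest.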
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
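The kind of relationship the guide describes can be sketched along one spatial axis as follows. The formulas are the standard ones for a convolution with input size `i`, kernel size `k`, stride `s` and zero padding `p`, and for the transposed convolution that maps back (assuming no extra output padding); the function names are hypothetical and chosen for this example only.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution along one axis: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of the matching transposed convolution, without output padding."""
    return s * (i - 1) + k - 2 * p

# A 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives 3 outputs,
# and the transposed convolution with the same settings maps 3 back to 5.
print(conv_output_size(5, k=3, s=2, p=1))
print(transposed_conv_output_size(3, k=3, s=2, p=1))
```

Because of the floor in the forward formula, several input sizes can share one output size when the stride is larger than 1, so the transposed formula recovers only the smallest of them unless output padding is added.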
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a slight increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
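The sampling step described above can be sketched for a single 2-D feature map as follows. This is a minimal illustration assuming non-negative activations (for example after a rectifier), non-overlapping square pooling regions, and an output of zero for regions whose activations are all zero; the function name and these choices are assumptions for the example rather than details from the paper.

```python
import numpy as np

def stochastic_pool_2d(a, size=2, rng=None):
    """Training-time stochastic pooling over non-overlapping size x size regions."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = a.shape
    assert h % size == 0 and w % size == 0, "feature map must tile evenly"
    # Rearrange into (rows, cols, size*size): one flat vector per pooling region.
    regions = (a.reshape(h // size, size, w // size, size)
                 .transpose(0, 2, 1, 3)
                 .reshape(h // size, w // size, size * size))
    out = np.zeros(regions.shape[:2], dtype=a.dtype)
    for r in range(regions.shape[0]):
        for c in range(regions.shape[1]):
            acts = regions[r, c]
            total = acts.sum()
            if total > 0:
                # Multinomial draw: each activation is picked with probability
                # proportional to its own value within the region.
                out[r, c] = rng.choice(acts, p=acts / total)
    return out
```

Regions with a single non-zero activation always return that activation, while regions with several active units return different ones from call to call, which is the source of the regularizing noise.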
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach improves on the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
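The exact factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ in the abstract above can be checked numerically on a toy distribution. The following is only a minimal sketch: the four-valued distribution `p_x`, the encoder `x // 2`, and the helper name `factorize` are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction


def factorize(p_x, f):
    """Split p_x into P(H) and P(X | H = f(X)) for a deterministic code h = f(x).

    Assumes every probability in p_x is strictly positive.
    """
    p_h = {}
    for x, p in p_x.items():
        h = f(x)
        p_h[h] = p_h.get(h, Fraction(0)) + p
    # The decoder puts all of its mass on values x that map to the given h.
    p_x_given_h = {x: p / p_h[f(x)] for x, p in p_x.items()}
    return p_h, p_x_given_h


def encoder(x):
    # Hypothetical deterministic discrete encoder: pairs of values share a code.
    return x // 2


# Toy discrete distribution over four values (an assumption for illustration).
p_x = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(1, 4), 3: Fraction(1, 4)}
p_h, p_x_given_h = factorize(p_x, encoder)

# Multiplying the two factors recovers the original likelihood exactly.
reconstructed = {x: p_x_given_h[x] * p_h[encoder(x)] for x in p_x}
```

Here the code distribution `p_h` is uniform over two values, which is simpler than `p_x` itself, in the spirit of the "simpler distribution" the abstract describes.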
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
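A gradient reversal layer, as it is commonly described, acts as the identity in the forward pass and flips the sign of the gradient (scaled by a constant) in the backward pass. The framework-free sketch below only illustrates that behaviour; the class name `GradientReversal`, the scaling parameter `lam`, and the list-based interface are assumptions made here and are not part of any deep learning package.

```python
class GradientReversal:
    """Toy gradient reversal layer operating on plain Python lists."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical scaling factor for the reversed gradient.
        self.lam = lam

    def forward(self, x):
        # Forward pass: features go through unchanged.
        return list(x)

    def backward(self, grad_output):
        # Backward pass: the gradient from the domain classifier is negated,
        # so the feature extractor is pushed to make domains indistinguishable.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=2.0)
features = layer.forward([1.0, -2.0])
reversed_grad = layer.backward([0.5, -1.0])
```

In a real model this layer would sit between the feature extractor and the domain classifier, with the framework's automatic differentiation calling the backward rule.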
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network with a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
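The "straight path from initialization to solution" analysis can be sketched as evaluating the loss at parameters linearly interpolated between the initial and the trained values. This is a minimal illustration under stated assumptions: the function name `interpolate_loss`, the flat-list parameter representation, and the quadratic toy loss are all hypothetical stand-ins for a real network and objective.

```python
def interpolate_loss(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn along the line segment from theta_init to theta_final.

    Returns a list of (alpha, loss) pairs with alpha running from 0 to 1.
    """
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    results = []
    for k in range(num_points):
        alpha = k / (num_points - 1)
        theta = [(1 - alpha) * a + alpha * b
                 for a, b in zip(theta_init, theta_final)]
        results.append((alpha, loss_fn(theta)))
    return results


def toy_loss(theta):
    # Hypothetical convex stand-in for a network's training objective.
    return sum(t * t for t in theta)


path = interpolate_loss(toy_loss, [2.0, -2.0], [0.0, 0.0], num_points=5)
losses = [loss for _, loss in path]
```

Plotting `losses` against `alpha` for a real model is one way to look for bumps or barriers along the path; on this toy loss the curve simply decreases.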
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
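The rank-based strategy described above (rank datapoints by latest known loss, then let the selection probability decay exponentially with rank) can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: the function names, the per-rank `decay` parameter, and sampling with replacement via `random.choices` are assumptions made for this example.

```python
import random


def rank_selection_probs(losses, decay=0.9):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the largest latest-known loss; a datapoint
    at rank r gets weight decay ** r (decay is a hypothetical knob in (0, 1)).
    """
    n = len(losses)
    if n == 0:
        return []
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, decay=0.9, rng=random):
    """Draw datapoint indices for one batch (with replacement)."""
    probs = rank_selection_probs(losses, decay)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


probs = rank_selection_probs([0.1, 2.0, 0.5], decay=0.5)
batch = sample_batch([0.1, 2.0, 0.5], batch_size=4, rng=random.Random(0))
```

A `decay` close to 1 gives nearly uniform sampling, while a smaller value concentrates batches on high-loss datapoints; how this selection pressure should change over training is one of the open questions the abstract mentions.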
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
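As an illustration of the rounding scheme discussed above, here is a minimal sketch of stochastic rounding to a fixed-point grid, written in plain Python. The function names (`stochastic_round`, `to_fixed`), the split of the 16-bit word into 8 fractional bits, and the saturation behaviour are assumptions made for this example and are not taken from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    x is rounded up with probability equal to its fractional distance
    above the lower grid point, so the expected result equals x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise round down.
    if rng.random() < (scaled - lower):
        lower += 1
    return lower / scale


def to_fixed(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed word_bits-wide range.

    The word/fraction split is an assumption chosen for illustration.
    """
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1)) / scale
    hi = ((1 << (word_bits - 1)) - 1) / scale
    return min(max(stochastic_round(x, frac_bits, rng), lo), hi)
```

Unlike round-to-nearest, which would always map a small update below half a grid step to zero, this scheme lets such an update survive with a probability proportional to its size.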
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
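The relationships between input shape, kernel shape, zero padding, strides and output shape can be sketched along one spatial axis with the standard formulas below. This is a minimal illustration, not code from the guide; the function names are hypothetical, and the transposed case assumes the stride divides the padded input evenly in the associated forward convolution.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Implements o = floor((i + 2p - k) / s) + 1.
    Pooling follows the same formula with p = 0.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of the transposed convolution associated with a
    convolution of kernel k, stride s and padding p.

    Implements o = s * (i - 1) + k - 2p, which inverts conv_output_size
    whenever (o + 2p - k) is divisible by s.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 3x3 kernel with stride 2 and padding 1 maps 5 -> 3 ...
    print(conv_output_size(5, 3, s=2, p=1))
    # ... and the associated transposed convolution maps 3 -> 5.
    print(transposed_conv_output_size(3, 3, s=2, p=1))
```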
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
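One way to read direction (1) is as a convex combination of the max and the average over a pooling region. The abstract does not spell out the two strategies, so the sketch below is an assumed form: a fixed mixing proportion `a`, and a gated variant in which the proportion is a sigmoid of a weighted sum of the region. The names `mixed_pool` and `gated_pool` and the weight vector `w` are hypothetical; in a real network `a` and `w` would be learned.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling over one region.

    region: list of activations in the pooling window.
    a: mixing proportion in [0, 1]; a=1 is max pooling, a=0 is average pooling.
    """
    avg = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * avg


def gated_pool(region, w):
    """Like mixed_pool, but the proportion depends on the region itself.

    w: gating weights, one per position in the window (assumed form).
    """
    gate = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, region))))
    return mixed_pool(region, gate)
```

With `a` fixed at 0 or 1 this reduces to conventional average or max pooling, which is why such an operation can be dropped in where either was used before.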
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and makes it difficult to find appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding an effective and efficient method for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
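The idea of training against a teacher's soft targets, as described in the abstract above, can be illustrated with a minimal sketch. The abstract does not give the objective, so everything here is an assumption for illustration: the function names `softmax` and `soft_target_loss`, the `temperature` parameter, and the use of cross-entropy between teacher and student distributions.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets.

    Hypothetical objective: the abstract only says soft targets from a teacher are used.
    """
    targets = softmax(teacher_logits, temperature)
    preds = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, preds))

if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0]
    print(soft_target_loss([2.0, 1.0, 0.0], teacher))  # student agrees with teacher
    print(soft_target_loss([0.0, 1.0, 2.0], teacher))  # student disagrees: larger loss
```

Under these assumptions the loss is smallest when the student reproduces the teacher's distribution, which is what makes the soft targets usable as a training signal regardless of the student's internal structure.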
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
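One plausible way to combine max and average pooling, as the abstract above describes in general terms, is a convex blend of the two. The sketch below is an assumption for illustration only: the abstract does not state the combination rule, and the function name `mixed_pool` and the weight `mix` (which would be learned in practice) are hypothetical.

```python
def mixed_pool(region, mix):
    """Blend max and average pooling over one pooling region.

    `mix` is a hypothetical mixing weight in [0, 1]:
    1.0 gives pure max pooling, 0.0 gives pure average pooling.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    average = sum(region) / len(region)
    return mix * max(region) + (1.0 - mix) * average

if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for mix in (0.0, 0.5, 1.0):
        print(mix, mixed_pool(region, mix))
```

Because the output is differentiable in `mix`, a weight of this kind could in principle be trained by gradient descent alongside the other network parameters.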
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
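The sampling step described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes non-negative activations (for example, after a rectifier) so that they can be normalized into multinomial probabilities, and the function name `stochastic_pool` and the all-zero fallback are hypothetical choices.

```python
import random

def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        # Assumption: with no positive activation, the pooled value is zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    region = [0.5, 1.5, 0.0, 2.0]
    samples = [stochastic_pool(region, rng) for _ in range(10)]
    print(samples)  # larger activations appear more often; 0.0 is never picked
```

Passing a seeded `random.Random` instance makes the sampling reproducible, which is convenient when comparing against deterministic max or average pooling.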
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
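As a rough illustration of the normalized $L_p$ norm described in the abstract above, the following Python sketch computes $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$ over a plain list of inputs. The function name `lp_pool` and the sample values are hypothetical, and the sketch leaves out the learned projections and the learned order $p$ of the actual unit. It only shows how $p=1$, $p=2$ and large $p$ recover the average of absolute values, the root-mean-square, and approximately the maximum.

```python
import math

def lp_pool(values, p):
    """Normalized L_p norm: (mean(|v|^p))^(1/p). Hypothetical helper name."""
    if p < 1:
        raise ValueError("p must be >= 1")
    if math.isinf(p):
        return max(abs(v) for v in values)
    # Factor out the largest magnitude so that large p does not overflow.
    m = max(abs(v) for v in values)
    if m == 0.0:
        return 0.0
    mean_pow = sum((abs(v) / m) ** p for v in values) / len(values)
    return m * mean_pow ** (1.0 / p)

x = [1.0, -2.0, 3.0, -4.0]    # illustrative inputs, not from the paper
avg_abs = lp_pool(x, 1)       # mean of absolute values
rms = lp_pool(x, 2)           # root-mean-square
near_max = lp_pool(x, 200)    # approaches max |x_i| as p grows
```

Rescaling by the largest magnitude is a numerical convenience in this sketch and does not change the value of the norm.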
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
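The exact factorization stated in the abstract above can be checked numerically on a toy example. The following Python sketch uses a made-up distribution over four symbols and a hypothetical deterministic encoder `f`; none of these values come from the paper. Because $f$ is deterministic, $P(H = h)$ is the sum of $P(X = x)$ over all $x$ with $f(x) = h$, and the decoder $P(X = x | H = h)$ equals $P(X = x) / P(H = h)$ when $f(x) = h$ and zero otherwise.

```python
from collections import defaultdict

# Toy distribution over discrete x (illustrative values, not from the paper).
p_x = {"a": 0.1, "b": 0.3, "c": 0.4, "d": 0.2}

def f(x):
    """Hypothetical deterministic encoder: maps four symbols to two codes."""
    return 0 if x in ("a", "b") else 1

# P(H = h): total probability mass of all x that are encoded as h.
p_h = defaultdict(float)
for x, px in p_x.items():
    p_h[f(x)] += px

def p_x_given_h(x, h):
    """Decoder that puts no mass on any x whose code differs from h."""
    return p_x[x] / p_h[h] if f(x) == h else 0.0

# P(X = x) = P(X = x | H = f(x)) * P(H = f(x)) for every x.
reconstructed = {x: p_x_given_h(x, f(x)) * p_h[f(x)] for x in p_x}
```

In this sketch the decoder is computed from the true distribution, so the identity holds exactly; a learned decoder with limited capacity would only approximate it.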
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction, in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
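The rank-based selection rule described in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names and the `selection_pressure` parameter (taken here as the ratio between the selection probability of the highest-loss and, approximately, the lowest-loss datapoint) are hypothetical choices made for this example.

```python
import math
import random


def selection_probabilities(latest_losses, selection_pressure=100.0):
    """Return one selection probability per datapoint.

    Datapoints are ranked by their latest known loss (largest loss gets rank 0),
    and the probability decays exponentially with rank. `selection_pressure`
    is a hypothetical knob controlling how steep that decay is.
    """
    n = len(latest_losses)
    # Indices sorted from greatest loss to smallest loss.
    order = sorted(range(n), key=lambda idx: latest_losses[idx], reverse=True)
    # Constant factor between the weights of consecutive ranks.
    decay = math.exp(-math.log(selection_pressure) / n)
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(latest_losses, batch_size, selection_pressure=100.0, rng=random):
    """Draw batch indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probabilities(latest_losses, selection_pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)
```

With `selection_pressure=1.0` the weights are all equal, so the sketch falls back to uniform sampling; larger values concentrate batches on the datapoints with the greatest latest known loss.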
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) of 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model. The model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
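Stochastic rounding, which the abstract above credits for making 16-bit fixed-point training work, can be illustrated with a short sketch. This is a minimal example under assumptions of our own, not the paper's implementation: the function name and the `frac_bits` parameter are hypothetical, and saturation to a fixed word length is deliberately left out.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with `frac_bits` fractional bits.

    The value is rounded up with probability equal to its fractional distance
    from the lower grid point, so the result equals x in expectation.
    Saturation to a finite word length is omitted in this sketch.
    """
    scale = 1 << frac_bits            # number of grid steps per unit
    scaled = x * scale
    lower = math.floor(scaled)        # nearest grid point at or below x
    frac = scaled - lower             # in [0, 1): distance to that grid point
    rounded = lower + (1 if rng.random() < frac else 0)
    return rounded / scale
```

Unlike round-to-nearest, which would always map a small update below half a grid step to zero, this rule lets such updates survive on average, which is the intuition usually given for why it helps low-precision training.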
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
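The pooling interpretation in the preceding abstract can be checked numerically. The following is only a minimal sketch, assuming the simplified form (mean(|x_i|^p))^(1/p); the learned projections described in the abstract are omitted, and the name `lp_pool` is hypothetical.

```python
import numpy as np

def lp_pool(x, p):
    """Normalized L_p norm of a 1-D array: (mean(|x_i|^p))^(1/p).

    Simplified illustration only: the learned projections mentioned in
    the abstract are left out.
    """
    x = np.abs(np.asarray(x, dtype=float))
    return float(np.mean(x ** p) ** (1.0 / p))

x = np.array([1.0, -2.0, 3.0, -4.0])
mean_abs = lp_pool(x, 1)     # p = 1: average of absolute values
rms = lp_pool(x, 2)          # p = 2: root-mean-square
near_max = lp_pool(x, 200)   # large p: approaches max(|x_i|)
```

Varying p moves the same unit between average-like and max-like pooling, which is the sense in which it generalizes the conventional operators.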
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
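The masking idea in the preceding abstract can be illustrated for a single hidden layer. This is a rough sketch under stated assumptions, not the authors' implementation: each hidden unit is given a random integer "degree", and the comparison rules below are one common way to build such masks. The function name `made_masks` is hypothetical.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumed scheme: input i has degree i (1..D); each hidden unit gets a
    random degree in 1..D-1. A hidden unit may see inputs with degree <= its
    own, and output i may see hidden units with degree < i.
    """
    rng = np.random.default_rng(seed)
    deg_in = np.arange(1, n_inputs + 1)
    deg_hid = rng.integers(1, n_inputs, size=n_hidden)  # upper bound is exclusive
    mask_in = (deg_hid[:, None] >= deg_in[None, :]).astype(float)   # shape (H, D)
    mask_out = (deg_in[:, None] > deg_hid[None, :]).astype(float)   # shape (D, H)
    return mask_in, mask_out

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
# connectivity[i, j] counts masked paths from input j to output i.
connectivity = mask_out @ mask_in
```

If `connectivity` is strictly lower triangular, output i can depend only on inputs that come before it in the ordering, which is the autoregressive constraint the abstract describes.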
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
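The claim in the preceding abstract that the log-norm variance shrinks with layer width can be probed with a small simulation. This sketch assumes Gaussian matrices with entry variance 1/width, which is a convenient choice and not necessarily the unbiased scaling derived in the paper; the function name `log_norm_variance` is hypothetical.

```python
import numpy as np

def log_norm_variance(width, depth, trials=200, seed=0):
    """Variance (over trials) of log ||W_depth ... W_1 v|| for random W_k.

    Assumption: entries of each W_k are i.i.d. N(0, 1/width); a fresh
    matrix is drawn at every layer, as in a feed-forward network.
    """
    rng = np.random.default_rng(seed)
    finals = np.empty(trials)
    for t in range(trials):
        v = rng.standard_normal(width)
        v /= np.linalg.norm(v)  # start from a unit vector
        for _ in range(depth):
            w = rng.standard_normal((width, width)) / np.sqrt(width)
            v = w @ v
        finals[t] = np.log(np.linalg.norm(v))
    return float(np.var(finals))

var_narrow = log_norm_variance(width=16, depth=20)
var_wide = log_norm_variance(width=64, depth=20)
```

Under these assumptions the wider network should show a noticeably smaller spread of log-norms at the same depth, consistent with the inverse dependence on layer size stated in the abstract.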
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
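The training signal described in the preceding abstract, a Kullback-Leibler divergence between teacher soft alignments and student outputs, can be sketched as follows. This is an illustration under assumptions: the teacher distribution is faked with a softmax over made-up logits, the loss is averaged over frames, and the names `softmax` and `kl_divergence` are hypothetical helpers.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(teacher_probs, student_logits, eps=1e-12):
    """Mean over frames of KL(teacher || student)."""
    student_probs = softmax(student_logits)
    per_frame = np.sum(
        teacher_probs * (np.log(teacher_probs + eps) - np.log(student_probs + eps)),
        axis=-1,
    )
    return float(per_frame.mean())

# Two frames, three output classes; the values are made up for illustration.
teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
teacher_probs = softmax(teacher_logits)

loss_matched = kl_divergence(teacher_probs, teacher_logits)                  # student equals teacher
loss_uniform = kl_divergence(teacher_probs, np.zeros_like(teacher_logits))  # uniform student
```

The loss is zero when the student reproduces the teacher distribution and positive otherwise, so minimizing it pulls the small model toward the teacher's soft outputs.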
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
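The gradient reversal layer mentioned above can be illustrated with a minimal sketch. It assumes the layer passes features through unchanged in the forward pass and multiplies incoming gradients by a negative factor in the backward pass; the class name `GradientReversal` and the scaling parameter `lam` are hypothetical and not taken from the abstract.

```python
class GradientReversal:
    """Toy gradient reversal layer (illustrative only, no autodiff framework).

    Assumption: forward is the identity, backward flips the sign of the
    gradient and scales it by `lam` (a hypothetical hyperparameter).
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features are passed on to the domain classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # The feature extractor receives the negated gradient, so it is
        # pushed to make the domains harder to tell apart.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
forward_out = layer.forward([1.0, -2.0, 3.0])
backward_out = layer.backward([0.2, -0.4, 1.0])
```

In a real framework the same behaviour would be registered as a custom gradient; the plain-Python version only shows the sign flip.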
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
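The analysis described above, evaluating the objective on a straight path from initialization to solution, can be sketched as follows. This is a minimal illustration on a toy quadratic loss, not the authors' experimental setup; `toy_loss`, `interpolate`, and `loss_along_path` are hypothetical names.

```python
def interpolate(theta_init, theta_final, alpha):
    """Return the point (1 - alpha) * theta_init + alpha * theta_final."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the straight line
    between the initial and final parameter vectors."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    # Stand-in objective (an assumption): squared distance to the point (1, 1).
    return sum((t - 1.0) ** 2 for t in theta)


path = loss_along_path(toy_loss, [3.0, -2.0], [1.0, 1.0], num_points=5)
for alpha, value in path:
    print(f"alpha={alpha:.2f} loss={value:.4f}")
```

For a real network, `loss_fn` would evaluate the training objective with the interpolated weights loaded into the model; a bump in the resulting curve would indicate an obstacle on the path.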
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
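The selection strategy described above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the parameter `pressure`, defined here as the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints, and both function names are hypothetical.

```python
import math
import random


def rank_selection_probs(latest_losses, pressure=100.0):
    """Return per-datapoint selection probabilities that decay
    exponentially with the rank of the latest known loss."""
    n = len(latest_losses)
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    # Chosen so that weight(rank 0) / weight(rank n - 1) == pressure.
    decay = math.log(pressure) / max(n - 1, 1)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, rng, pressure=100.0):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


losses = [0.1, 2.5, 0.7, 1.3]
probs = rank_selection_probs(losses, pressure=10.0)
batch = sample_batch(losses, batch_size=2, rng=random.Random(0), pressure=10.0)
```

After each training step the entries of `latest_losses` for the selected indices would be refreshed, so the ranking follows the current state of the model.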
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
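The stochastic rounding scheme mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_round`, the `frac_bits` parameter and the default of 8 fractional bits are assumptions chosen for illustration, and saturation to a 16-bit range is omitted. A value is rounded up to the next fixed-point grid value with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its distance from the
    lower grid point (in grid units), otherwise rounded down.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up equals the fractional remainder in [0, 1).
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


if __name__ == "__main__":
    rng = random.Random(0)
    # With 2 fractional bits the grid step is 0.25, so 0.3 maps to 0.25 or 0.5.
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Averaged over many draws, the rounded values stay close to the original value, which is the property that round-to-nearest lacks for small updates.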
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
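As a worked illustration of the shape relationships the guide above describes, the following sketch computes the output size along one spatial axis. It assumes square, undilated kernels and the standard relations $o = \lfloor (i + 2p - k)/s \rfloor + 1$ for a convolution and $o' = s(i' - 1) + k - 2p$ for the corresponding transposed convolution (without extra output padding); the function names are hypothetical.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size of a convolution along one axis.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    Assumes i + 2 * p >= k and no dilation.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output size of the transposed convolution associated with a
    convolution using kernel k, padding p and stride s.

    Assumes no extra output padding, i.e. the case where (o + 2p - k)
    is divisible by s for the forward convolution.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5x5 input with a 3x3 kernel, padding 1 and stride 2 gives a 3x3 output,
    # and the matching transposed convolution maps 3x3 back to 5x5.
    print(conv_output_size(5, 3, p=1, s=2))
    print(transposed_conv_output_size(3, 3, p=1, s=2))
```

The two functions are inverses of each other on shapes only in the divisible case noted in the docstring; otherwise several input sizes map to the same convolution output size.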
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
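The sampling step described in the abstract above can be sketched for a single pooling region as follows. This is a minimal illustration, not the authors' code: the function name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for example, after a rectifier), and an all-zero region is assumed to pool to zero.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region (hypothetical helper).

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial over the normalized activities in the region.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total == 0:
        # Assumption: with no active units the pooled response is zero.
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 3.0]  # flattened activations of one pooling window
    picks = [stochastic_pool(region, rng) for _ in range(10000)]
    # Roughly three quarters of the draws should select the activation 3.0.
    print(picks.count(3.0) / len(picks))
```

In a full network this sampling would be applied independently to every pooling window during training.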
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
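As a rough illustration of why distillation can shrink the gradients an attacker relies on, the sketch below computes a softmax at two temperatures. It assumes the usual distillation recipe in which logits are divided by a temperature before the softmax; the abstract itself does not state this detail, and the function name, logits, and temperature values are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max before exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [8.0, 2.0, 0.0]                    # hypothetical pre-softmax outputs
hard = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=20.0)
```

At the higher temperature the class probabilities are much flatter while their ranking is unchanged, which is the sense in which the distilled outputs become less sensitive to small input changes.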
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
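The pooling interpretation above can be checked numerically. The sketch below is only an illustration under stated assumptions: it applies the normalized $L_p$ norm, taken here as $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$, to values that are assumed to be already projected, and it omits the learned projections and learned orders of the full unit. The function name and inputs are hypothetical.

```python
import math


def normalized_lp_norm(values, p):
    """Return (1/N * sum |v|^p)^(1/p); p = math.inf gives max |v|."""
    if not values:
        raise ValueError("values must be non-empty")
    mags = [abs(v) for v in values]
    if math.isinf(p):
        return max(mags)
    if p < 1:
        raise ValueError("p must be >= 1")
    peak = max(mags)
    if peak == 0.0:
        return 0.0
    # Factor out the largest magnitude so large p does not overflow.
    mean_pow = sum((m / peak) ** p for m in mags) / len(mags)
    return peak * mean_pow ** (1.0 / p)


inputs = [3.0, -4.0, 0.0, 1.0]              # hypothetical projected signals
avg = normalized_lp_norm(inputs, 1)         # average of absolute values
rms = normalized_lp_norm(inputs, 2)         # root-mean-square
large_p = normalized_lp_norm(inputs, 200)   # close to max pooling
```

With p = 1 the result is the average magnitude, with p = 2 the root-mean-square, and as p grows it approaches the maximum magnitude, matching the three pooling operators named in the abstract.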
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
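One way to picture the masking idea is the sketch below, which builds binary masks for a single hidden layer and checks that each output can only be reached from earlier inputs. The degree-based construction, the function names, and the layer sizes are assumptions made for illustration; the abstract does not spell out how the masks are chosen.

```python
import random


def build_autoregressive_masks(num_inputs, num_hidden, seed=0):
    """Return (input_mask, output_mask) for a one-hidden-layer autoencoder.

    Each hidden unit gets a degree m in 1..D-1 and may see inputs 1..m.
    Output d may only use hidden units with degree strictly below d.
    """
    if num_inputs < 2:
        raise ValueError("need at least two inputs")
    rng = random.Random(seed)
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # input_mask[k][j] == 1 iff hidden unit k may read input j + 1.
    input_mask = [[1 if m >= d else 0 for d in range(1, num_inputs + 1)]
                  for m in degrees]
    # output_mask[d][k] == 1 iff output d + 1 may read hidden unit k.
    output_mask = [[1 if d > m else 0 for m in degrees]
                   for d in range(1, num_inputs + 1)]
    return input_mask, output_mask


def count_paths(input_mask, output_mask):
    """paths[d][j] = number of hidden units linking input j to output d."""
    num_hidden = len(input_mask)
    num_inputs = len(output_mask)
    return [[sum(output_mask[d][k] * input_mask[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for d in range(num_inputs)]


input_mask, output_mask = build_autoregressive_masks(num_inputs=4, num_hidden=8)
paths = count_paths(input_mask, output_mask)
# Output d must have no path from input d itself or from any later input.
autoregressive = all(paths[d][j] == 0 for d in range(4) for j in range(d, 4))
```

In a real network these masks would be multiplied elementwise into the weight matrices, so that output d is a function of inputs 1..d-1 only and can be read as a conditional probability.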
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.
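The two quantities in this abstract can be illustrated with a short sketch: the Kullback-Leibler divergence between a teacher's soft output and a student's output for one frame, and the relative WER improvement implied by the reported numbers. The function names and the example distributions are hypothetical; only the WER values 4.54 and 3.93 come from the abstract.

```python
import math


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for two discrete distributions over the same labels."""
    return sum(t * math.log(t / max(s, eps))
               for t, s in zip(teacher_probs, student_probs)
               if t > 0.0)


def relative_improvement(baseline, improved):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - improved) / baseline


teacher = [0.7, 0.2, 0.1]     # hypothetical soft alignment from the RNN
student = [0.5, 0.3, 0.2]     # hypothetical small-DNN output for the same frame
frame_loss = kl_divergence(teacher, student)

wer_gain = relative_improvement(baseline=4.54, improved=3.93)
```

The relative gain is (4.54 - 3.93) / 4.54, which is slightly above 0.13, consistent with the improvement stated in the abstract.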
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
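The stochastic rounding idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `stochastic_round`, the 16-bit word with 8 fractional bits, and the saturation behaviour are assumptions chosen for illustration. A value is rounded up to the next fixed-point grid point with probability equal to its fractional distance past the lower grid point, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x to a signed fixed-point grid.

    The grid step is 2**-frac_bits (assumed format, for illustration).
    x is rounded up with probability equal to its fractional distance
    past the lower grid point, so E[result] == x whenever x lies inside
    the representable range.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # value in [0, 1)
    q = lower + (1 if rng.random() < prob_up else 0)

    # Saturate to the signed integer range of the word (assumed behaviour).
    q_max = (1 << (word_bits - 1)) - 1
    q_min = -(1 << (word_bits - 1))
    q = max(q_min, min(q_max, q))
    return q / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
    # Each sample is 0.25 or 0.5; the average is close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this coarse grid, so small updates would be lost consistently; the stochastic variant preserves them on average.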
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
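The relationship between input shape, kernel shape, zero padding, strides and output shape mentioned in the abstract above can be sketched along a single axis with the standard formulas. The function names below are hypothetical, and the sketch assumes no dilation and symmetric zero padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution (or pooling) along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Uses o = floor((i + 2p - k) / s) + 1.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution along one axis.

    Uses o = s * (i - 1) + k - 2p, which is the smallest input size
    that the corresponding convolution maps to length i.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    print(conv_output_size(28, 5, s=1, p=2))            # 28 (same-size output)
    print(conv_output_size(5, 3, s=2, p=1))             # 3
    print(transposed_conv_output_size(3, 3, s=2, p=1))  # 5
```

When the stride does not divide `i + 2p - k` evenly, several input sizes map to the same output size, so the transposed formula recovers only the smallest of them. For example, input sizes 5 and 6 both give output size 3 with `k=3, s=2, p=1`.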
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
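One simple way to combine max and average pooling, as direction (1) in the abstract above suggests, is a convex mixture of the two. The sketch below is only an illustration of that idea, not the authors' formulation: the function name `mixed_pool` and the fixed mixing proportion `a` are assumptions, and in a trained network `a` would be a learned parameter instead of a constant.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    region: non-empty sequence of activations in the pooling window.
    a: mixing proportion in [0, 1] (assumed fixed here; learned in practice).
    Returns a * max(region) + (1 - a) * mean(region).
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    max_value = max(region)
    mean_value = sum(region) / len(region)
    return a * max_value + (1.0 - a) * mean_value


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    print(mixed_pool(window, 1.0))  # 6.0, pure max pooling
    print(mixed_pool(window, 0.0))  # 3.0, pure average pooling
    print(mixed_pool(window, 0.5))  # 4.5, equal blend
```

Setting `a` to 1 or 0 recovers conventional max or average pooling, so the mixture contains both as special cases.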
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher-order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
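The gradient reversal layer mentioned above can be sketched in a framework-free way. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `GradientReversal`, its `forward`/`backward` methods, and the scaling parameter `lam` are hypothetical names introduced here. The layer is assumed to act as the identity in the forward pass and to negate (and scale) the gradient in the backward pass.

```python
class GradientReversal:
    """Identity in the forward pass; flips the gradient sign in the backward pass.

    `lam` is a hypothetical scaling factor for the reversed gradient.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Features pass through unchanged to the domain classifier.
        return list(x)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is negated, so
        # the extractor is pushed to make the domains harder to tell apart.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=2.0)
features = [1.0, -2.0]
print(layer.forward(features))
print(layer.backward([0.5, -1.0]))
```

In a real deep learning package the same effect would typically be obtained by registering a custom backward rule rather than by calling `backward` by hand.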
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
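The rank-based selection strategy described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names, the `pressure` parameter (taken here to be the approximate ratio between the selection weights of the highest-ranked and lowest-ranked datapoints), and sampling with replacement via `random.choices` are all assumptions made for the example.

```python
import math
import random


def rank_selection_probs(losses, pressure=100.0):
    """Selection probability per datapoint, decaying exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest known loss.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-math.log(pressure) * rank / n)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices for one batch (with replacement, an assumption)."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


latest_losses = [0.1, 2.0, 0.5, 1.0]
print(rank_selection_probs(latest_losses, pressure=10.0))
print(sample_batch(latest_losses, batch_size=8, rng=random.Random(0)))
```

Setting `pressure` close to 1 recovers nearly uniform sampling, while larger values concentrate batches on high-loss datapoints.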
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
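The SVRG update described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `svrg`, its parameters, and the toy data are all assumptions. For checkability the toy objective is a convex one-dimensional least-squares sum, whereas the abstract analyzes the nonconvex case; the update rule itself has the same form.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, rng):
    """Minimise (1/n) * sum_i f_i(w) given per-component gradients grad_i(w, i)."""
    w_snap = w0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient: unbiased for the full
            # gradient at w, with variance shrinking as w nears w_snap.
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= step * g
        w_snap = w  # the last inner iterate becomes the next snapshot
    return w_snap


# Hypothetical toy problem: f_i(w) = 0.5 * (a[i] * w - b[i]) ** 2.
a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 3.0, 2.0, 5.0]


def grad_i(w, i):
    return a[i] * (a[i] * w - b[i])


w_hat = svrg(grad_i, n=len(a), w0=0.0, step=0.02,
             epochs=30, inner_steps=20, rng=random.Random(0))
# The exact minimiser is sum(a*b) / sum(a*a) = 33 / 30 = 1.1.
```

Each epoch costs one full pass over the data plus `inner_steps` pairs of component-gradient evaluations, which is the trade-off the variance-reduction analysis is about.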
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
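The role of the rounding scheme mentioned in the abstract above can be illustrated with a minimal sketch. The function names and the choice of 8 fractional bits are assumptions made for illustration, and saturation to a fixed word length is deliberately left out; the abstract itself only states that 16-bit fixed-point with stochastic rounding suffices.

```python
import math
import random


def round_to_nearest(x, frac_bits=8):
    """Round x to the nearest multiple of 2**-frac_bits (ties round up)."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a neighbouring multiple of 2**-frac_bits, choosing the
    upper neighbour with probability equal to the fractional remainder,
    so that the result equals x in expectation."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


# A small update (0.001) is below half the grid spacing (1/256 ~ 0.0039):
# round-to-nearest always discards it, stochastic rounding keeps it on average.
rng = random.Random(0)
samples = [stochastic_round(0.001, rng=rng) for _ in range(20000)]
mean_sr = sum(samples) / len(samples)
nearest = round_to_nearest(0.001)
```

This is one way to see why small gradient updates can survive low-precision training under stochastic rounding while being lost under round-to-nearest.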
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
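The relationships between input shape, kernel shape, zero padding, strides and output shape that the abstract above refers to can be sketched for a single axis. The function names are hypothetical; the formulas are the standard ones for a convolution (or pooling) layer and for the associated transposed convolution without extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length along one axis for input length i, kernel size k,
    stride s and zero padding p (applied on both sides)."""
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution associated with a
    convolution of kernel size k, stride s and padding p."""
    return s * (i - 1) + k - 2 * p


# A 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives 3 outputs,
# and the associated transposed convolution maps 3 back to 5.
out = conv_output_size(5, 3, s=2, p=1)
back = transposed_conv_output_size(out, 3, s=2, p=1)

# When (i + 2p - k) is not a multiple of s, several input sizes share one
# output size, so the transposed formula recovers only the smallest of them.
out_6 = conv_output_size(6, 3, s=2, p=1)
back_6 = transposed_conv_output_size(out_6, 3, s=2, p=1)
```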
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
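Training against soft targets from a teacher model, as the abstract above describes, is commonly expressed as a cross-entropy between the teacher's output distribution and the student's. The sketch below assumes that form; the function names and example logits are hypothetical, and the abstract does not specify the exact objective used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_probs):
    """Cross-entropy of the student's predictions against the teacher's
    full output distribution (the soft targets), not a one-hot label."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Hypothetical teacher output for one frame with three classes.
teacher = softmax([2.0, 1.0, 0.0])

# The loss is smallest when the student reproduces the teacher's distribution
# and grows as the student's ranking of the classes diverges from it.
loss_match = soft_target_loss([2.0, 1.0, 0.0], teacher)
loss_off = soft_target_loss([0.0, 1.0, 2.0], teacher)
```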
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
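The rank-based selection strategy described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `pressure` parameter (taken here as the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints), and sampling with replacement are all assumptions made for the example.

```python
import math
import random


def rank_selection_probs(latest_losses, pressure=100.0):
    """Return selection probabilities that decay exponentially with loss rank.

    `pressure` is a hypothetical knob: the probability ratio between the
    datapoint with the largest latest-known loss and the one with the smallest.
    """
    n = len(latest_losses)
    # Rank 0 goes to the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: -latest_losses[i])
    ranks = [0] * n
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    # Exponential decay chosen so that p(rank 0) / p(rank n-1) == pressure.
    decay = math.log(pressure) / max(n - 1, 1)
    weights = [math.exp(-decay * r) for r in ranks]
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, rng, pressure=100.0):
    """Draw datapoint indices for one batch (with replacement, for simplicity)."""
    probs = rank_selection_probs(latest_losses, pressure)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


if __name__ == "__main__":
    losses = [0.1, 2.0, 0.5]
    print(rank_selection_probs(losses, pressure=4.0))
    print(sample_batch(losses, batch_size=2, rng=random.Random(0), pressure=4.0))
```

In a training loop, the stored loss of each selected datapoint would be refreshed after every step so that the ranking tracks the latest known values.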
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
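The stochastic rounding scheme mentioned in the abstract above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the paper's implementation: the function name, the split of a 16-bit signed word into `frac_bits` fractional bits, and saturation on overflow are all choices made for illustration.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to a signed fixed-point grid using stochastic rounding.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the rounding is unbiased in expectation.
    `frac_bits` and `word_bits` are hypothetical parameter names.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # rng.random() is in [0, 1), so exactly representable values never move.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer.
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    clipped = min(max(lower, lo), hi)
    return clipped / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; their average stays close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, whereas stochastic rounding preserves the value on average, which is one intuition for why small gradient updates are not systematically lost.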
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
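The shape relationships referred to in the abstract above can be sketched along one spatial axis. This is a minimal illustration using the commonly stated formulas, with input size `i`, kernel size `k`, stride `s` and zero padding `p`; the function names are hypothetical, and the transposed case assumes no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution (or pooling) layer along one axis."""
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of the transposed convolution associated with a
    convolution that uses kernel k, stride s and padding p (assuming the
    forward convolution's stride divides its padded input evenly)."""
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5-wide input with a 3-wide kernel, stride 2, padding 1 gives 3 outputs.
    out = conv_output_size(5, k=3, s=2, p=1)
    print(out)
    # The matching transposed convolution maps those 3 values back to width 5.
    print(transposed_conv_output_size(out, k=3, s=2, p=1))
```

When the stride does not divide `i + 2p - k` evenly, several input sizes map to the same output size, so the transposed layer needs additional information (often an output padding term) to recover the original shape.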
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
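The masking idea described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the degree-assignment scheme, the single hidden layer, the layer sizes, and the helper name `build_masks` are all assumptions introduced here. The check at the end confirms that, with these masks, output d can only be reached from inputs that come before d in the ordering.

```python
import numpy as np

def build_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Hypothetical helper: each unit gets an integer "degree", and a
    connection is kept only if it cannot leak input d into output d.
    Returns (mask_in, mask_out) with shapes (n_hidden, n_inputs) and
    (n_inputs, n_hidden). Requires n_inputs >= 2.
    """
    rng = np.random.default_rng(seed)
    in_deg = np.arange(1, n_inputs + 1)                  # input d has degree d
    hid_deg = rng.integers(1, n_inputs, size=n_hidden)   # degrees in [1, n_inputs - 1]
    # A hidden unit of degree m may see inputs with degree <= m.
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(int)
    # Output d may see hidden units with degree strictly less than d.
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(int)
    return mask_in, mask_out

mask_in, mask_out = build_masks(n_inputs=5, n_hidden=8)

# Entry [d, j] counts the paths from input j to output d.
connectivity = mask_out @ mask_in

# Autoregressive property: nothing on or above the diagonal.
print(np.all(np.triu(connectivity) == 0))
```

In a real network these masks would be multiplied elementwise with the weight matrices before each forward pass; that step is omitted here.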
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
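The relative improvement quoted above follows directly from the two WER figures. The short check below reproduces the arithmetic; the function name `relative_improvement` is a label introduced here for illustration, not something from the paper.

```python
def relative_improvement(baseline, improved):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - improved) / baseline

# WER figures as reported in the abstract (WSJ eval92).
rel = relative_improvement(baseline=4.54, improved=3.93)
print(f"{rel:.1%}")  # prints 13.4%
```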
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
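The two steps named above, averaging an ensemble's predictions and then training a single model to match them, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the logits are made-up numbers, the temperature `T` is a softening parameter that the abstract itself does not mention, and the helper names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits; higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def average_predictions(prob_lists):
    """Average class probabilities across ensemble members."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]

def cross_entropy(target, predicted):
    """Cross-entropy of predicted probabilities against (soft) targets."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Made-up logits from three ensemble members for one example with 3 classes.
ensemble_logits = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3], [2.2, 0.8, 0.0]]
T = 2.0  # assumed temperature, not taken from the abstract

soft_targets = average_predictions([softmax(l, T) for l in ensemble_logits])
student_probs = softmax([1.8, 1.0, 0.2], T)   # made-up single-model logits

# Loss the single model would minimize to mimic the ensemble on this example.
loss = cross_entropy(soft_targets, student_probs)
print(soft_targets, loss)
```

The loss is smallest when the single model's probabilities equal the averaged ensemble probabilities, which is what makes the soft targets a usable training signal.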
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This not only helps us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggests that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
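The factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ stated above can be checked numerically on a toy example. The distribution `p_x` and the encoder `f` below are made-up illustrations, not anything from the paper; the decoder is taken to be the exact conditional, which puts no mass on any $x'$ with $f(x') \neq f(x)$.

```python
from collections import defaultdict

# Made-up distribution over four discrete values of X.
p_x = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

def f(x):
    """Hypothetical deterministic encoder: maps {0, 1} -> 0 and {2, 3} -> 1."""
    return x // 2

# P(H = h) is the total mass of all x that encode to h.
p_h = defaultdict(float)
for x, p in p_x.items():
    p_h[f(x)] += p

def p_x_given_h(x, h):
    """Exact decoder P(X = x | H = h); zero whenever f(x) != h."""
    return p_x[x] / p_h[h] if f(x) == h else 0.0

# Rebuild P(X = x) from the two factors.
reconstructed = {x: p_x_given_h(x, f(x)) * p_h[f(x)] for x in p_x}
print(reconstructed)
```

Taking logs of the two factors gives the reconstruction term and the regularizer on the code $h = f(x)$ described in the abstract.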
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
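The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names and the `selection_pressure` parameter (the assumed ratio that controls how quickly the probability decays with rank) are hypothetical, and batches are drawn with replacement for simplicity.

```python
import math
import random


def rank_selection_probs(latest_losses, selection_pressure=100.0):
    """Return selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest known loss.
    `selection_pressure` is a hypothetical knob: larger values favour
    high-loss datapoints more strongly.
    """
    n = len(latest_losses)
    # Indices ordered from greatest loss to smallest loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-math.log(selection_pressure) * rank / n)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, rng, selection_pressure=100.0):
    """Draw a batch of indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, selection_pressure)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


if __name__ == "__main__":
    losses = [0.1, 2.0, 0.5]
    print(rank_selection_probs(losses))
    print(sample_batch(losses, batch_size=4, rng=random.Random(0)))
```

In a training loop, one would presumably refresh `latest_losses` for the selected datapoints after each step, so that the ranking tracks the latest known loss values.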
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
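Stochastic rounding, as mentioned in the abstract above, can be illustrated with a small sketch. The code below is an assumption-laden example rather than the paper's procedure: the function name, the split of a 16-bit word into 8 fractional bits, and the saturation behaviour are all hypothetical choices. A value is rounded up with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to signed fixed-point with stochastic rounding.

    frac_bits and word_bits are illustrative defaults, not values
    taken from the paper.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the distance from the lower grid point.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed range of the word.
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    quantized = max(lo, min(hi, lower))
    return quantized / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed(0.3, rng=rng) for _ in range(20000)]
    # The sample mean should be close to 0.3 even though 0.3 is not representable.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to the same grid point every time, whereas the stochastic variant alternates between the two neighbouring grid points so that small updates are preserved on average.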
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pretraining: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
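The abstract above names the Gramian Angular Summation/Difference Fields without defining them. A minimal sketch follows, assuming the usual construction: rescale the series, take the arccosine of each value as an angle, and form the matrices of cos(phi_i + phi_j) and sin(phi_i - phi_j). The function names and the choice of the [-1, 1] rescaling (rather than the 0/1 rescaling the abstract mentions for the bijection property) are illustrative assumptions, not the authors' code.

```python
import math

def rescale(series):
    """Min-max rescale a non-constant series into [-1, 1] (assumed range)."""
    lo, hi = min(series), max(series)
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]

def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D time series."""
    x = rescale(series)
    # Clamp before acos to guard against tiny floating-point overshoots.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    n = len(phi)
    gasf = [[math.cos(phi[i] + phi[j]) for j in range(n)] for i in range(n)]
    gadf = [[math.sin(phi[i] - phi[j]) for j in range(n)] for i in range(n)]
    return gasf, gadf

gasf, gadf = gramian_angular_fields([0.0, 1.0, 2.0])
```

Each field is an n-by-n image for a series of length n, which is what allows image classifiers to be applied to it.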
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further reduce the size of the multilayer bootstrap network by using a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
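The analysis described above (evaluating the objective on a straight path from initialization to solution) can be sketched as a linear interpolation between two parameter vectors. This is a minimal illustration under stated assumptions: parameters are flat Python lists, and `toy_loss` is a hypothetical stand-in for a real network's training loss.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]

def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced alpha values in [0, 1]."""
    alphas = [k / (num_points - 1) for k in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]

def toy_loss(theta):
    """Hypothetical convex loss with its minimum at all-ones."""
    return sum((t - 1.0) ** 2 for t in theta)

path = loss_along_path(toy_loss, [3.0, -1.0], [1.0, 1.0])
for alpha, value in path:
    print(f"alpha={alpha:.1f} loss={value:.3f}")
```

For a real network, a bump in this curve would indicate an obstacle between the initial and final parameters; a smoothly decreasing curve is the behaviour the abstract reports.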
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
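The selection rule described above (rank datapoints by their latest known loss, then let the selection probability decay exponentially with rank) can be sketched as follows. The `pressure` parameter, which sets roughly how much more likely the top-ranked point is than the bottom-ranked one, and the function names are assumptions made for illustration; the abstract does not give the exact parameterization.

```python
import math
import random

def rank_selection_probs(losses, pressure=100.0):
    """Selection probability per datapoint, decaying exponentially with loss rank."""
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [math.exp(-decay * rank) for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs

def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)

latest_losses = [0.1, 2.0, 0.5, 1.0]
print(rank_selection_probs(latest_losses))
print(sample_batch(latest_losses, batch_size=3, rng=random.Random(0)))
```

Setting `pressure=1.0` recovers uniform sampling, which gives one way to control the selection pressure over time.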
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
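The stochastic rounding idea named in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `stochastic_round`, the `frac_bits` parameter and the omission of saturation to a fixed word length are assumptions made purely for illustration. A value is rounded up to the next grid point with probability equal to its fractional remainder, so the rounded value is unbiased in expectation.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, rng=None):
    """Round x onto a fixed-point grid with step 2**-frac_bits (hypothetical helper).

    Each value is rounded up with probability equal to its distance from the
    grid point below it (in grid units), so E[stochastic_round(x)] == x.
    Saturation to a finite word length is deliberately left out of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Fractional remainder in [0, 1) is the probability of rounding up.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    return (lower + round_up) / scale

# Usage: with 2 fractional bits the grid step is 0.25, so 0.3 becomes
# either 0.25 or 0.5, and the average over many draws stays close to 0.3.
rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.3), frac_bits=2, rng=rng)
print(np.unique(samples), samples.mean())
```

Round-to-nearest would map every 0.3 to 0.25 in this setting, which is the kind of systematic bias that stochastic rounding avoids.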
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
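As a small illustration of the shape relationships the guide above describes, the following sketch computes output sizes along one spatial axis using the standard formulas. The function names are hypothetical, and the transposed-convolution formula assumes the simple case with no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    Uses the standard relation o = floor((i + 2p - k) / s) + 1.
    """
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of the matching transposed convolution along one axis.

    Assumes no additional output padding: o = s * (i - 1) + k - 2p.
    """
    return s * (i - 1) + k - 2 * p

# Usage: a 5-wide input, 3-wide kernel, stride 2, padding 1 gives 3 outputs,
# and the transposed convolution with the same settings maps 3 back to 5.
print(conv_output_size(5, k=3, s=2, p=1))             # 3
print(transposed_conv_output_size(3, k=3, s=2, p=1))  # 5
```

When `i + 2p - k` is not a multiple of the stride, several input sizes map to the same output size, so the transposed formula recovers only the smallest of them.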
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
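The sampling step described in the abstract above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: it assumes non-negative activations (for example, after a rectifier), non-overlapping square pooling regions on a single 2-D feature map, and a hypothetical function name `stochastic_pool_2d`. Only the training-time sampling is shown.

```python
import numpy as np

def stochastic_pool_2d(a, size=2, rng=None):
    """Stochastic pooling over non-overlapping size x size regions (sketch).

    Within each region, one activation is picked at random from a multinomial
    distribution whose probabilities are the activations normalized by their
    sum. Assumes all activations are non-negative.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = a.shape
    out = np.zeros((h // size, w // size), dtype=np.float64)
    for r in range(h // size):
        for c in range(w // size):
            region = a[r * size:(r + 1) * size, c * size:(c + 1) * size].ravel()
            total = region.sum()
            if total > 0:
                out[r, c] = rng.choice(region, p=region / total)
            # An all-zero region has no valid distribution; its output stays 0.
    return out

# Usage: regions with a single non-zero activation are deterministic, while
# the top-right region returns 1.0 with probability 1/3 and 2.0 with 2/3.
a = np.array([[0.0, 0.0, 1.0, 2.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 5.0]])
pooled = stochastic_pool_2d(a, size=2, rng=np.random.default_rng(0))
print(pooled)
```

Larger activations are more likely to be selected, so the operation behaves like a randomized version of max pooling.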
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects of convolutional neural network (CNN) style architectures are examined: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training in the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the multilayer bootstrap network into a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
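The straight-path analysis described above can be sketched in a few lines: evaluate the loss at evenly spaced points on the line segment between the initial and the final parameter vectors. This is only an illustrative sketch under assumptions; the names `loss_along_path` and `toy_loss` are hypothetical, and the quadratic loss stands in for a real network's training objective.

```python
def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the segment theta_init -> theta_final.

    num_points must be at least 2 so that both endpoints are included.
    """
    results = []
    for k in range(num_points):
        alpha = k / (num_points - 1)
        # Linear interpolation between the two parameter vectors.
        theta = [(1.0 - alpha) * a + alpha * b
                 for a, b in zip(theta_init, theta_final)]
        results.append((alpha, loss_fn(theta)))
    return results


def toy_loss(theta):
    # Hypothetical stand-in objective: a convex bowl with its minimum at (1, 1).
    return sum((t - 1.0) ** 2 for t in theta)


curve = loss_along_path(toy_loss, [0.0, 3.0], [1.0, 1.0], num_points=5)
for alpha, value in curve:
    print(alpha, value)
```

A curve that decreases without bumps along this path would be the kind of evidence the abstract refers to; a real experiment would pass the flattened network weights and the training loss instead of the toy function.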
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
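The rank-based strategy described above can be illustrated with a small sketch: sort datapoints by their latest known loss and give each a selection probability that decays exponentially with its rank. The parameter name `pressure` (the ratio between the selection probabilities of the highest-loss and lowest-loss points) and the function names are assumptions made for this example, not details taken from the abstract.

```python
import math
import random


def selection_probabilities(losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    `pressure` (hypothetical parameter) is the ratio between the probability
    of the highest-loss datapoint and that of the lowest-loss datapoint.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the largest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / max(n - 1, 1)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


latest_losses = [0.1, 2.0, 0.5]
probs = selection_probabilities(latest_losses, pressure=4.0)
batch = sample_batch(latest_losses, batch_size=5, rng=random.Random(0))
```

In a training loop the stored losses would be refreshed whenever a datapoint is visited, so the ranking follows the current state of the model. Sampling with replacement is a simplification chosen here for brevity.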
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
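For readers unfamiliar with SVRG, the basic update can be sketched as follows: at the start of each epoch a full gradient is computed at a snapshot point, and each inner step corrects a single-sample gradient with the difference between that sample's gradient at the snapshot and the full gradient. This is a minimal sketch on a convex toy least-squares problem, not the nonconvex setting analyzed in the abstract; all function names, the step size, and the data are assumptions chosen for illustration.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, rng):
    """Minimise (1/n) * sum_i f_i(w) with a basic SVRG loop.

    grad_i(w, i) must return the gradient of the i-th component f_i at w
    as a list of floats with the same length as w.
    """
    w_snap = list(w0)
    for _ in range(epochs):
        # Full gradient at the snapshot point.
        full = [0.0] * len(w_snap)
        for i in range(n):
            g = grad_i(w_snap, i)
            full = [f + gj / n for f, gj in zip(full, g)]
        w = list(w_snap)
        for _ in range(inner_steps):
            i = rng.randrange(n)
            g_w = grad_i(w, i)
            g_s = grad_i(w_snap, i)
            # Variance-reduced gradient estimate: g_i(w) - g_i(snapshot) + full gradient.
            w = [wj - step * (a - b + m)
                 for wj, a, b, m in zip(w, g_w, g_s, full)]
        # Use the last inner iterate as the next snapshot.
        w_snap = w
    return w_snap


# Hypothetical toy data: y = 2 * x, with f_i(w) = 0.5 * (x_i * w - y_i)^2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]


def toy_grad(w, i):
    return [xs[i] * (xs[i] * w[0] - ys[i])]


w_hat = svrg(toy_grad, n=3, w0=[0.0], step=0.05, epochs=50,
             inner_steps=20, rng=random.Random(0))
```

Taking the last inner iterate as the next snapshot is one common variant; other variants pick a random inner iterate instead.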
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
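The variance reduction trick mentioned above can be illustrated with a small sketch. This is not the paper's minibatch method or its analysis; it only shows the standard variance-reduced gradient estimator (component gradient at the current point, minus the same component's gradient at a snapshot, plus the full gradient at the snapshot). The function names, the toy objective, and the step size are all assumptions made for illustration.

```python
import random

def vr_gradient(grads, x, x_snap, full_grad_snap, i):
    """Variance-reduced estimate of the full gradient at x from component i.

    Averaged over all i, this equals the full gradient at x (unbiased).
    """
    return grads[i](x) - grads[i](x_snap) + full_grad_snap

def vr_sgd(grads, x0, step, epochs, inner_steps, seed=0):
    """Outer loop takes a snapshot; inner loop uses the corrected estimator."""
    rng = random.Random(seed)
    n = len(grads)
    x = x0
    for _ in range(epochs):
        x_snap = x
        full_grad_snap = sum(g(x_snap) for g in grads) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            x -= step * vr_gradient(grads, x, x_snap, full_grad_snap, i)
    return x

# Hypothetical toy objective: f(x) = mean_i 0.5 * a_i * (x - c_i)^2.
curvatures = [0.5, 1.0, 1.5, 2.0]
centers = [0.0, 1.0, 2.0, 5.0]
grads = [lambda x, a=a, c=c: a * (x - c) for a, c in zip(curvatures, centers)]

x_final = vr_sgd(grads, x0=10.0, step=0.1, epochs=20, inner_steps=10)
```

The toy objective here is convex and one-dimensional purely to keep the example short; the estimator itself makes no convexity assumption.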
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
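As a rough illustration of stochastic rounding to a fixed-point grid, the sketch below rounds a value up or down with probability proportional to its distance from the two nearest representable values, so the result is unbiased in expectation. The function names, the split between integer and fractional bits, and the saturation behaviour are assumptions for this example, not details taken from the abstract.

```python
import math
import random

def stochastic_round(x, frac_bits, rng):
    """Round x to a multiple of 2**-frac_bits, unbiased in expectation."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale

def to_fixed_point(x, frac_bits, total_bits, rng):
    """Stochastically round, then saturate to a signed total_bits-wide range."""
    scale = 1 << frac_bits
    max_val = ((1 << (total_bits - 1)) - 1) / scale
    min_val = -(1 << (total_bits - 1)) / scale
    return min(max_val, max(min_val, stochastic_round(x, frac_bits, rng)))

rng = random.Random(0)
samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
mean_estimate = sum(samples) / len(samples)  # close to 0.3 on average
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, which is why small updates can vanish under deterministic rounding but survive on average under stochastic rounding.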
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
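The relationship between input shape, kernel shape, zero padding, strides and output shape along one axis can be sketched as two small helpers. The function names are made up for this example, and the transposed-convolution formula below assumes the simple case where `i + 2p - k` is a multiple of the stride, so no extra output padding is needed.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length of a convolution or pooling layer along one axis.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    """
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, p=0, s=1):
    """Output length of the transposed convolution that undoes the shape
    change of a convolution with the same k, p, s (assumes the forward
    convolution satisfied (input + 2p - k) % s == 0)."""
    return s * (i - 1) + k - 2 * p

# A 5-wide input, 3-wide kernel, padding 1, stride 2 gives a 3-wide output,
# and the matching transposed convolution maps 3 back to 5.
forward = conv_output_size(5, 3, p=1, s=2)
backward = transposed_conv_output_size(forward, 3, p=1, s=2)
```

For multi-dimensional layers the same arithmetic applies independently to each axis.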
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
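A minimal sketch of training against soft targets from a teacher model is given below: the student's predicted distribution is scored by cross-entropy against the teacher's output distribution instead of a one-hot label. The temperature parameter and the function names are assumptions for illustration; the abstract does not state whether or how the teacher's outputs are smoothed.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's."""
    targets = softmax(teacher_logits, temperature)
    preds = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, preds))

# Hypothetical logits for a three-class problem.
teacher = [2.0, 1.0, 0.1]
matching_student = [2.0, 1.0, 0.1]
poor_student = [0.1, 1.0, 2.0]

loss_match = soft_target_loss(matching_student, teacher, temperature=2.0)
loss_poor = soft_target_loss(poor_student, teacher, temperature=2.0)
```

The loss is smallest when the student reproduces the teacher's distribution, which is what makes the soft targets usable as a pre-training objective for any student architecture.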
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
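One way to picture a learned combination of max and average pooling is a convex blend of the two results over a pooling region. The sketch below shows a blend with a single mixing proportion and a variant where the proportion is computed from the region itself through a sigmoid gate. Both function names and the gate's form are assumptions for illustration; the abstract only states that two combination strategies exist, and in training the mixing parameters would be learned by gradient descent rather than fixed.

```python
import math

def mixed_pool(region, alpha):
    """Blend max and average pooling with a fixed proportion alpha in [0, 1]."""
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return alpha * max_val + (1.0 - alpha) * avg_val

def gated_pool(region, gate_weights):
    """Blend max and average pooling with a proportion that depends on the
    region: alpha = sigmoid(gate_weights . region)."""
    score = sum(w * a for w, a in zip(gate_weights, region))
    alpha = 1.0 / (1.0 + math.exp(-score))
    return mixed_pool(region, alpha)

region = [1.0, 2.0, 3.0, 6.0]
pure_max = mixed_pool(region, 1.0)      # max pooling
pure_avg = mixed_pool(region, 0.0)      # average pooling
halfway = gated_pool(region, [0.0] * 4) # zero gate weights give alpha = 0.5
```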
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
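The GASF/GADF encoding named in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, since the abstract itself gives no formulas: it uses the commonly cited definition (rescale the series to [-1, 1], take the angle phi = arccos(x), then GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j)). The function names `rescale` and `gramian_angular_fields` are hypothetical.

```python
import math


def rescale(series, lo=-1.0, hi=1.0):
    """Linearly rescale a series to [lo, hi]. Assumes the series is not constant."""
    mn, mx = min(series), max(series)
    return [lo + (hi - lo) * (v - mn) / (mx - mn) for v in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D series."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in rescale(series)]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]  # summation field
    gadf = [[math.sin(a - b) for b in phi] for a in phi]  # difference field
    return gasf, gadf
```

For a series of length n, both results are n-by-n matrices that can be treated as single-channel images. The GASF diagonal equals cos(2 phi_i) = 2 x_i^2 - 1, which ties each diagonal entry back to a rescaled sample.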
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
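The rank-based strategy described in the abstract above can be illustrated as follows. This is a minimal sketch under assumptions, not the authors' implementation: the `pressure` parameter (roughly the ratio between the selection probability of the highest-loss and lowest-loss datapoints) and the function names are hypothetical choices made for illustration.

```python
import math
import random


def selection_probabilities(losses, pressure=100.0):
    """Rank datapoints by latest known loss (largest first) and assign each
    a selection probability that decays exponentially with its rank."""
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n  # assumed decay rate per rank step
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, rng, pressure=100.0):
    """Draw a batch of datapoint indices (with replacement) using the
    rank-based probabilities."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop, the stored loss of each sampled datapoint would be refreshed after its forward pass, so the ranking tracks the latest known values. Setting `pressure=1.0` recovers uniform sampling.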
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
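As a rough illustration of the variance-reduced update that SVRG-style methods rely on, the following minimal sketch runs the standard SVRG loop on a toy convex least-squares sum with a scalar parameter. The function name `svrg_least_squares`, the data, the step size, and the inner-loop length are all assumptions chosen for illustration; the sketch does not reproduce the nonconvex analysis or the experiments described in the abstract above.

```python
import random


def svrg_least_squares(a, b, lr=0.1, epochs=50, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 over a scalar w.

    Hypothetical toy setup: each outer epoch takes a snapshot of w,
    computes the full gradient at the snapshot, and then runs stochastic
    steps whose gradient is corrected by the snapshot gradients.
    """
    rng = random.Random(seed)
    n = len(a)

    def grad_i(w, i):
        # Gradient of the i-th term 0.5 * (a[i] * w - b[i])**2.
        return a[i] * (a[i] * w - b[i])

    w = 0.0
    for _ in range(epochs):
        w_snap = w
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        for _ in range(5 * n):  # inner-loop length is an arbitrary choice
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient.
            v = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= lr * v
    return w


# Toy data where every term is minimised at w = 2.
a = [0.5, 1.0, 1.5, 2.0]
b = [2.0 * x for x in a]
w_hat = svrg_least_squares(a, b)
print(w_hat)
```

The correction term `grad_i(w, i) - grad_i(w_snap, i) + full_grad` is an unbiased estimate of the full gradient at `w`, and its variance shrinks as `w` and the snapshot approach a common stationary point.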
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
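The stochastic rounding idea mentioned above can be sketched in a few lines: a real value is rounded up to the next fixed-point level with probability equal to its fractional distance from the lower level, so the rounding is unbiased in expectation. The function name `stochastic_round`, the choice of 8 fractional bits, and the saturation behaviour are assumptions for illustration and are not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value with `frac_bits` fractional bits.

    Hypothetical helper: the value is rounded down or up at random, with
    the probability of rounding up equal to the fractional remainder, and
    then saturated to the range of a `word_bits`-wide signed integer.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # in [0, 1)
    q = lower + (1 if rng.random() < p_up else 0)
    # Saturate to the representable signed integer range.
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale


rng = random.Random(0)
samples = [stochastic_round(0.3, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(sorted(set(samples)), mean)
```

With 8 fractional bits, 0.3 lies between 76/256 and 77/256, so every sample is one of those two levels, while the sample mean stays close to 0.3. Round-to-nearest would instead always return 77/256.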
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
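The relationships among input size, kernel size, zero padding, stride and output size referred to above can be written as two small helpers. This is a minimal sketch for one spatial dimension, assuming the usual formulas o = floor((i + 2p - k) / s) + 1 for a convolution and o' = s(i' - 1) + k - 2p for the matching transposed convolution (the latter holds exactly when i + 2p - k is a multiple of s). The function names are hypothetical.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size of a convolution along one axis.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output size of the transposed convolution that undoes the shape
    change of a convolution with kernel k, padding p and stride s
    (assumes the forward input size satisfied (size + 2p - k) % s == 0).
    """
    return s * (i - 1) + k - 2 * p


# A 5-wide input, 3-wide kernel, padding 1, stride 2 gives a 3-wide output,
# and the transposed layer maps that 3-wide signal back to width 5.
o = conv_output_size(5, k=3, p=1, s=2)
back = transposed_conv_output_size(o, k=3, p=1, s=2)
print(o, back)
```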
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
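The relationship between input shape, kernel shape, zero padding, strides and output shape mentioned above can be sketched for a single spatial axis. This is a minimal illustration using the commonly stated relations, not code from the guide itself; the function names `conv_output_size` and `transposed_conv_output_size` are hypothetical, and the transposed formula assumes the stride divides the padded input evenly.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size along one axis of a convolution.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    Uses the common relation o = floor((i + 2p - k) / s) + 1.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output size along one axis of the matching transposed convolution.

    Assumes (o + 2p - k) is divisible by s for the forward convolution,
    so that the shape relation inverts exactly: o = s * (i - 1) + k - 2p.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    forward = conv_output_size(5, k=3, p=1, s=2)
    backward = transposed_conv_output_size(forward, k=3, p=1, s=2)
    print(forward, backward)
```

In this sketch the transposed layer recovers the original input size from the forward layer's output size, which is the shape relationship between the two layer types.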
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
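One way of combining max and average pooling, as described above, is a convex mix of the two over each pooling region. The sketch below is only an illustration under stated assumptions: the mixing weight `a` is passed in as a fixed number (in a learned variant it would be a trained parameter), and the function name `mixed_pool` is hypothetical rather than taken from the paper.

```python
def mixed_pool(region, a):
    """Blend max pooling and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    a: mixing weight in [0, 1]; a=1 gives max pooling, a=0 average pooling.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing weight must lie in [0, 1]")
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return a * max_part + (1.0 - a) * avg_part


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    for weight in (0.0, 0.5, 1.0):
        print(weight, mixed_pool(window, weight))
```

Keeping the weight in [0, 1] means the result always lies between the average and the maximum of the region, so the two conventional pooling operations are recovered as special cases.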
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
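The stochastic pooling step described above, picking one activation per region from a multinomial distribution given by the activities, can be sketched as follows. This is a minimal illustration, assuming non-negative activations (for example after a rectifier) and a single one-dimensional pooling region; the function name `stochastic_pool` and the all-zero fallback are assumptions, not details from the paper.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation a_i is chosen with probability a_i / sum(region).
    Assumes all activations are non-negative; an all-zero region returns 0.0.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        return 0.0  # no activity in the region, nothing to sample
    # random.choices normalizes the weights, so the activations serve directly
    # as unnormalized multinomial probabilities.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    window = [0.0, 1.0, 3.0, 0.0]
    samples = [stochastic_pool(window, rng) for _ in range(1000)]
    # The activation 3.0 should be picked roughly three times as often as 1.0.
    print(samples.count(3.0), samples.count(1.0))
```

Because larger activations are more likely to be picked, the behavior stays close to max pooling while still injecting noise during training, and no extra hyper-parameter is introduced.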
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. This explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
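As a rough illustration of the pooling interpretation described in the abstract above, the following sketch computes a normalized L_p norm over a list of values. It is only a minimal sketch under stated assumptions: the function name `lp_unit` is hypothetical, the learned projections that feed the unit are omitted, and the normalization is assumed to be a mean over the inputs.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm of the inputs: (mean(|x_i|^p))^(1/p).

    Assumption: the inputs stand in for the unit's projected signals;
    the projections themselves are not modeled here.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(inputs)
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)


values = [3.0, -4.0]
print(lp_unit(values, 1))    # mean of absolute values
print(lp_unit(values, 2))    # root-mean-square
print(lp_unit(values, 200))  # close to max(|x_i|) for large p
```

With p = 1 the sketch reduces to averaging absolute values, with p = 2 to root-mean-square pooling, and for large p it approaches max pooling over absolute values, which is the sense in which one unit can cover several conventional pooling operators.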
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
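The masking idea in the abstract above can be sketched for a single hidden layer. The abstract does not say how the masks are built, so the degree-based construction below is an assumption, and the names `build_masks` and `dependencies` are hypothetical. Each hidden unit gets a random degree, and a connection is kept only when it cannot leak information from an input to an output at the same or an earlier position.

```python
import random


def build_masks(num_inputs, num_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoencoder (num_inputs >= 2)."""
    rng = random.Random(seed)
    in_deg = list(range(1, num_inputs + 1))  # natural ordering of the inputs
    hid_deg = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit k may read input j only if its degree is >= the input's degree.
    mask_in = [[1 if hid_deg[k] >= in_deg[j] else 0 for j in range(num_inputs)]
               for k in range(num_hidden)]
    # Output d may read hidden unit k only if its degree is strictly larger.
    mask_out = [[1 if in_deg[d] > hid_deg[k] else 0 for k in range(num_hidden)]
                for d in range(num_inputs)]
    return mask_in, mask_out


def dependencies(mask_in, mask_out):
    """deps[d][j] is True if output d is connected to input j through any hidden unit."""
    num_hidden = len(mask_in)
    num_inputs = len(mask_out)
    return [[any(mask_out[d][k] and mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for d in range(num_inputs)]


mask_in, mask_out = build_masks(num_inputs=4, num_hidden=8)
for row in dependencies(mask_in, mask_out):
    print([int(v) for v in row])  # no entries on or above the diagonal are set
```

In a real network the masks would be multiplied elementwise with the weight matrices, so that output d can be read as a conditional probability given only inputs before d in the chosen ordering.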
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
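The training objective and the closing arithmetic in the abstract above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the divergence direction KL(teacher || student) is an assumption the abstract does not state, and the example distributions are made up. Only the two WER figures come from the abstract.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions over the same labels.

    Assumption: the teacher (RNN) posterior is the reference distribution.
    """
    return sum(t * math.log(t / max(s, eps))
               for t, s in zip(teacher, student) if t > 0)


def relative_improvement(baseline, new):
    """Relative reduction of an error rate with respect to the baseline."""
    return (baseline - new) / baseline


# Made-up per-frame posteriors over three states, for illustration only.
teacher_posterior = [0.7, 0.2, 0.1]
student_posterior = [0.5, 0.3, 0.2]
print(kl_divergence(teacher_posterior, student_posterior))

# WER figures quoted in the abstract: 4.54 baseline versus 3.93.
print(relative_improvement(4.54, 3.93))
```

The second call gives a relative reduction of a little over 0.13, consistent with the "more than 13%" quoted in the abstract.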
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
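As an illustration of the gradient reversal layer mentioned in the abstract above, here is a minimal sketch. It assumes the usual reading of the layer: identity in the forward pass, and multiplication of the incoming gradient by a negative constant in the backward pass. The class name `GradientReversal` and the parameter `lam` are hypothetical, and plain Python lists stand in for tensors; this is not the authors' implementation.

```python
class GradientReversal:
    """Sketch of a gradient reversal layer (names are illustrative assumptions)."""

    def __init__(self, lam=1.0):
        # lam scales the reversed gradient; its value and schedule are assumptions.
        self.lam = lam

    def forward(self, features):
        # Forward pass: features go through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # Backward pass: flip the sign (and scale) of the gradient flowing back
        # into the feature extractor, so features become less domain-discriminative.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=2.0)
forward_out = layer.forward([1.0, -2.0])
backward_out = layer.backward([0.5, -1.0])
```

In a deep learning package the same idea would normally be written as a custom operation with a user-defined gradient, placed between the feature extractor and the domain classifier.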
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
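The rank-based selection strategy described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the `selection_pressure` parameter (taken here as the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) are hypothetical, and the sketch samples with replacement for simplicity.

```python
import random


def rank_selection_probabilities(losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank."""
    n = len(losses)
    # Rank 0 is the datapoint with the largest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank decay chosen so that P(rank 0) / P(rank n-1) == selection_pressure.
    decay = selection_pressure ** (-1.0 / max(n - 1, 1))
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, selection_pressure=100.0, seed=0):
    """Draw datapoint indices for one batch (with replacement, an assumption)."""
    rng = random.Random(seed)
    probs = rank_selection_probabilities(losses, selection_pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


probs = rank_selection_probabilities([0.1, 2.0, 0.5], selection_pressure=4.0)
batch = sample_batch([0.1, 2.0, 0.5], batch_size=5, selection_pressure=4.0)
```

After each training step, the stored loss of every datapoint in the batch would be refreshed so that the ranking tracks the latest known values.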
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
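The stochastic rounding scheme mentioned in the abstract above can be illustrated with a minimal sketch. The function name `stochastic_round` and the parameter `frac_bits` are hypothetical and chosen only for illustration; the sketch assumes a fixed-point grid with step $2^{-\text{frac\_bits}}$ and does not model saturation to a 16-bit word or any hardware detail.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the result equals x in expectation.
    Saturation/overflow handling is deliberately omitted (assumption).
    """
    scale = 1 << frac_bits            # number of grid steps per unit
    scaled = x * scale                # position of x measured in grid steps
    lower = math.floor(scaled)        # nearest grid point at or below x
    prob_up = scaled - lower          # distance to that point, in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale
```

Unlike round-to-nearest, this rule is unbiased: small updates that lie below the grid resolution are not always discarded, they survive with a probability proportional to their size.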
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
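The relationship between input shape, kernel shape, zero padding, strides and output shape described in the abstract above can be sketched for a single spatial axis. This is a minimal illustration using the standard output-size relations; the function names are hypothetical, and the transposed case assumes no extra output padding, so it inverts the convolution exactly only when the stride divides `i + 2p - k`.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Uses o = floor((i + 2p - k) / s) + 1.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of a transposed convolution along one axis.

    Uses o = s * (i - 1) + k - 2p, assuming no additional output padding.
    """
    return s * (i - 1) + k - 2 * p


# A 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives 3 outputs;
# the matching transposed convolution maps 3 back to 5.
print(conv_output_size(5, 3, s=2, p=1))             # 3
print(transposed_conv_output_size(3, 3, s=2, p=1))  # 5
```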
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
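The training-time sampling step described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `stochastic_pool` is hypothetical, a pooling region is represented as a flat list, activations are assumed non-negative (for example after a rectifier), and an all-zero region is mapped to zero by convention.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is selected with probability proportional to its
    value, i.e. according to a multinomial distribution given by the
    normalised activities in the region. Assumes non-negative inputs.
    """
    total = sum(region)
    if total <= 0:
        # No positive activity: nothing to sample from (assumed convention).
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


# Larger activations are chosen more often, but smaller non-zero ones
# still have a chance, which is the source of the regularising noise.
print(stochastic_pool([0.0, 1.0, 3.0], rng=random.Random(0)))
```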
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
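The straight-path analysis described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `loss_along_line` and `is_monotone_decreasing`, the flat-list parameter representation, and the number of sample points are all hypothetical choices made here.

```python
def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the straight segment
    from theta_init (alpha = 0) to theta_final (alpha = 1).

    Parameters are flat lists of floats (an assumption of this sketch).
    Returns a list of (alpha, loss) pairs.
    """
    curve = []
    for k in range(num_points):
        alpha = k / (num_points - 1)
        # Linear interpolation between the two parameter vectors.
        theta = [(1 - alpha) * a + alpha * b
                 for a, b in zip(theta_init, theta_final)]
        curve.append((alpha, loss_fn(theta)))
    return curve


def is_monotone_decreasing(curve, tol=0.0):
    """True if the loss never rises by more than tol along the path,
    i.e. no barrier is met between initialization and solution."""
    losses = [loss for _, loss in curve]
    return all(b <= a + tol for a, b in zip(losses, losses[1:]))
```

For a real network, `loss_fn` would evaluate the training objective at the interpolated weights; a bump in the resulting curve would indicate an obstacle on the straight path.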
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
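The rank-based selection rule in the abstract above can be sketched as follows. This is a minimal sketch, assuming a single hypothetical parameter `pressure` that fixes the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints; the abstract itself does not specify how the decay is parameterized.

```python
import random


def rank_selection_probs(losses, pressure=100.0):
    """Selection probability per datapoint, decaying exponentially with
    the rank of its latest known loss (rank 0 = largest loss).

    `pressure` (hypothetical name) is the ratio between the probability
    of the top-ranked point and that of the bottom-ranked point.
    """
    n = len(losses)
    # Indices sorted from largest to smallest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank multiplicative decay so that weight(0) / weight(n-1) == pressure.
    decay = pressure ** (-1.0 / max(n - 1, 1))
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw batch indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop, `losses` would be refreshed with each datapoint's most recent loss after every batch; setting `pressure=1.0` recovers uniform sampling.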
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
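The partial reinitialization idea in the abstract above can be sketched as follows. This is a toy sketch under stated assumptions, not the authors' method in detail: the underlying optimiser is taken to be a greedy bit-flip local search over binary variables, and the names `local_search`, `partial_reinit_search`, `subset_size`, and `rounds` are hypothetical.

```python
import random


def local_search(x, cost):
    """Greedy single-bit-flip descent: keep flipping bits while any
    single flip strictly lowers the cost."""
    x = list(x)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            before = cost(x)
            x[i] ^= 1
            if cost(x) < before:
                improved = True
            else:
                x[i] ^= 1  # undo a flip that did not help
    return x


def partial_reinit_search(cost, n, subset_size, rounds, seed=0):
    """Restart the local search by re-randomising only `subset_size`
    variables of the best configuration found so far, instead of
    restarting from a fully random configuration."""
    rng = random.Random(seed)
    best = local_search([rng.randint(0, 1) for _ in range(n)], cost)
    for _ in range(rounds):
        candidate = list(best)
        # Re-initialise a random subset; the remaining variables keep
        # the information gained in previous runs.
        for i in rng.sample(range(n), subset_size):
            candidate[i] = rng.randint(0, 1)
        candidate = local_search(candidate, cost)
        if cost(candidate) <= cost(best):
            best = candidate
    return best
```

Because a candidate is accepted only when it is no worse than the current best, the returned configuration is never worse than the result of the first local search.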
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
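The stochastic rounding idea mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `stochastic_round`, the split into 8 fractional bits of a 16-bit word, and the saturation behaviour are all assumptions made for the example. A value is rounded up to the next representable fixed-point number with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid using stochastic rounding.

    frac_bits and word_bits are illustrative assumptions (hypothetical
    defaults), not values taken from the abstract.
    """
    eps = 2.0 ** -frac_bits          # smallest representable step
    scaled = x / eps
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # which makes the expected rounded value equal to x.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer.
    max_int = 2 ** (word_bits - 1) - 1
    min_int = -(2 ** (word_bits - 1))
    return max(min_int, min(max_int, lower)) * eps


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, rng=rng) for _ in range(20000)]
    print(sorted(set(samples)))            # the two neighbouring grid points
    print(sum(samples) / len(samples))     # close to 0.3 on average
```

Round-to-nearest would map 0.3 to the same grid point every time and lose the remainder; the stochastic variant preserves it on average, which is the property the abstract credits for successful low-precision training.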
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
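The relationships between input shape, kernel shape, zero padding, strides and output shape described in the abstract above can be sketched for one spatial dimension as follows. This is a minimal illustration using the standard output-size formulas; the function names `conv_output_size` and `transposed_conv_output_size` are hypothetical, and the transposed formula assumes the forward convolution's stride divides (input + 2 * padding - kernel) exactly.

```python
def conv_output_size(i, k, stride=1, padding=0):
    """Output length of a convolution (or pooling) along one axis.

    i: input size, k: kernel size. Uses o = floor((i + 2p - k) / s) + 1.
    """
    if i + 2 * padding < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * padding - k) // stride + 1


def transposed_conv_output_size(i, k, stride=1, padding=0):
    """Output length of the transposed convolution along one axis.

    Uses o = s * (i - 1) + k - 2p, which inverts conv_output_size when
    the forward stride divides (input + 2p - k) exactly (an assumption).
    """
    return stride * (i - 1) + k - 2 * padding


if __name__ == "__main__":
    o = conv_output_size(7, 3, stride=2)                 # 7 -> 3
    print(o, transposed_conv_output_size(o, 3, stride=2))  # 3 -> 7
```

The round trip in the usage lines shows the sense in which a transposed convolution recovers the shape, though not the values, of the input to the corresponding convolution.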
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory-based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
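One simple way of combining max and average pooling, as the abstract above describes in general terms, is a convex mix of the two over a pooling region. The sketch below is an assumption-laden illustration: the function name `mixed_pool` is hypothetical, and the mixing weight `alpha` is passed in as a fixed number, whereas the abstract describes learning the pooling function.

```python
def mixed_pool(region, alpha=0.5):
    """Blend max and average pooling over one pooling region.

    region: non-empty sequence of activations.
    alpha:  mixing weight in [0, 1] (hypothetical parameter, fixed here;
            in a trained network it would be learned).
    Returns alpha * max(region) + (1 - alpha) * mean(region).
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    average = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * average


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 6.0]
    print(mixed_pool(window, alpha=1.0))   # pure max pooling
    print(mixed_pool(window, alpha=0.0))   # pure average pooling
    print(mixed_pool(window, alpha=0.5))   # halfway blend
```

Setting `alpha` to 1 or 0 recovers conventional max or average pooling, so the mixed operation contains both as special cases.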
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
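The relative improvement quoted in the abstract above can be checked with a line of arithmetic. The sketch below uses only the two WER figures from the text; the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative reduction in word error rate, as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

if __name__ == "__main__":
    gain = relative_improvement(4.54, 3.93)
    print(f"{100 * gain:.1f}% relative improvement")
```

With the reported figures this gives (4.54 - 3.93) / 4.54, which is a little over 13%, consistent with the claim.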
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
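The Gramian Angular Field encoding named in the abstract above can be sketched in a few lines. The abstract does not give formulas, so the definitions below (rescale to [0, 1], take the angle `phi = arccos(x)`, then `cos(phi_i + phi_j)` for GASF and `sin(phi_i - phi_j)` for GADF) are the commonly stated ones and should be checked against the paper. The function names are hypothetical, and the input is assumed to be a non-constant 1-D series.

```python
import numpy as np

def gramian_angular_fields(series):
    """Return (rescaled series, GASF, GADF) for a non-constant 1-D series."""
    x = np.asarray(series, dtype=float)
    x = (x - x.min()) / (x.max() - x.min())   # rescale to [0, 1]
    x = np.clip(x, 0.0, 1.0)                  # guard against rounding outside [0, 1]
    phi = np.arccos(x)                        # polar-angle encoding of each value
    gasf = np.cos(phi[:, None] + phi[None, :])
    gadf = np.sin(phi[:, None] - phi[None, :])
    return x, gasf, gadf

def recover_from_gasf(gasf):
    """Invert the GASF diagonal: cos(2*phi) = 2*x**2 - 1, with x in [0, 1]."""
    diag = np.clip(np.diag(gasf), -1.0, 1.0)
    return np.sqrt((diag + 1.0) / 2.0)

if __name__ == "__main__":
    x, gasf, gadf = gramian_angular_fields([0.0, 1.0, 2.0])
    print(gasf.shape, gadf.shape)
    print(np.allclose(recover_from_gasf(gasf), x))
```

Recovering the rescaled series from the GASF diagonal is one way to see the bijection on 0/1 rescaled data that the abstract mentions.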
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
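A gradient reversal layer like the one named in the abstract above can be sketched as a layer with hand-written forward and backward passes. This is a minimal illustration in plain NumPy, assuming the usual reading (identity on the forward pass, sign-flipped gradient on the backward pass); the class name and the scale factor `lam` are hypothetical and not taken from the abstract.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; negates (and scales) gradients going back."""

    def __init__(self, lam=1.0):
        self.lam = lam  # assumed strength of the reversed gradient

    def forward(self, x):
        # Features pass through unchanged to the domain classifier.
        return x

    def backward(self, grad_output):
        # The feature extractor receives the opposite of the domain-loss gradient,
        # pushing it toward features that do not separate the two domains.
        return -self.lam * grad_output

if __name__ == "__main__":
    layer = GradientReversal(lam=0.5)
    features = np.array([1.0, -2.0, 3.0])
    print(layer.forward(features))
    print(layer.backward(np.ones(3)))
```

Because the layer has no parameters, it can sit between a feature extractor and a domain classifier without changing the rest of a standard backpropagation loop.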
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
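The straight-path analysis described in the abstract above amounts to evaluating the loss at evenly spaced points between the initial and final parameters. The sketch below assumes parameters are flattened into a single vector and uses a toy quadratic loss in place of a real network; `loss_along_line` is a hypothetical name.

```python
import numpy as np

def loss_along_line(loss_fn, theta_init, theta_final, num=11):
    """Evaluate loss_fn at theta(a) = (1 - a) * theta_init + a * theta_final."""
    theta_init = np.asarray(theta_init, dtype=float)
    theta_final = np.asarray(theta_final, dtype=float)
    alphas = np.linspace(0.0, 1.0, num)
    losses = np.array([loss_fn((1.0 - a) * theta_init + a * theta_final)
                       for a in alphas])
    return alphas, losses

if __name__ == "__main__":
    # Toy stand-in for a training objective (an assumption, not a neural network).
    quadratic = lambda theta: float(np.sum(theta ** 2))
    alphas, losses = loss_along_line(quadratic, [2.0, 2.0], [0.0, 0.0])
    for a, l in zip(alphas, losses):
        print(f"alpha={a:.1f} loss={l:.3f}")
```

For a real model, `loss_fn` would load the interpolated parameters into the network and return the training loss; a curve without bumps along the path is the kind of evidence the abstract refers to.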
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
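The rank-based selection strategy described in the abstract above can be sketched as follows. The abstract states only that datapoints are ranked by their latest known loss and that the selection probability decays exponentially with rank; the specific form `exp(-decay * rank)`, the `decay` parameter, and the function names are assumptions made for illustration.

```python
import numpy as np

def selection_probabilities(latest_losses, decay=0.1):
    """Probability of picking each datapoint, decaying exponentially with loss rank."""
    losses = np.asarray(latest_losses, dtype=float)
    order = np.argsort(-losses)                 # indices from greatest to smallest loss
    ranks = np.empty(len(losses), dtype=int)
    ranks[order] = np.arange(len(losses))       # rank 0 = greatest latest loss
    weights = np.exp(-decay * ranks)
    return weights / weights.sum()

def select_batch(latest_losses, batch_size, decay=0.1, seed=0):
    """Sample a batch of distinct datapoint indices according to those probabilities."""
    rng = np.random.default_rng(seed)
    p = selection_probabilities(latest_losses, decay)
    return rng.choice(len(p), size=batch_size, replace=False, p=p)

if __name__ == "__main__":
    losses = [0.2, 1.5, 0.7, 0.05, 0.9]
    print(selection_probabilities(losses, decay=0.5))
    print(select_batch(losses, batch_size=2, decay=0.5))
```

A larger `decay` concentrates selection on the highest-loss datapoints, which corresponds to the selection pressure the abstract says must be controlled over time.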
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model. The model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
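The stochastic rounding scheme named in the abstract above can be illustrated with a minimal sketch. The function name `stochastic_round` and the `frac_bits` parameter are hypothetical choices made for this illustration and are not taken from the paper; the sketch also ignores the saturation to a finite word length that a real 16-bit fixed-point format would need. It rounds a value to a fixed-point grid, rounding up with probability equal to the fractional distance from the lower grid point, which makes the rounded value unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with spacing 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so E[result] equals x.
    Saturation to a finite word length is deliberately not modelled.
    """
    scale = 1 << frac_bits           # number of grid steps per unit
    scaled = x * scale               # position measured in grid steps
    lower = math.floor(scaled)       # nearest grid point at or below x
    p_up = scaled - lower            # probability of rounding up
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # Every sample is one of the two neighbouring grid points, 0.25 or 0.5.
    print(sorted(set(samples)))
    # The sample mean should be close to the unrounded value 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, so small updates would be lost systematically; the randomized rule keeps them on average.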
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
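The relationship between input shape, kernel shape, zero padding, strides and output shape mentioned in the abstract above can be sketched for a single spatial axis. The function names below are hypothetical, and the formulas are the standard ones for convolution arithmetic, assuming no dilation and no extra output padding: a convolution gives `floor((i + 2p - k) / s) + 1` outputs, and the matching transposed convolution gives `s * (i - 1) + k - 2p`.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length along one axis for a convolution.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Assumes no dilation.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length along one axis for the transposed convolution that
    pairs with a convolution using the same k, s and p.

    Assumes no dilation and no additional output padding.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5-wide input, 3-wide kernel, stride 2, padding 1.
    o = conv_output_size(5, k=3, s=2, p=1)
    print(o)
    # The transposed convolution maps that output size back to the input size.
    print(transposed_conv_output_size(o, k=3, s=2, p=1))
```

When `i + 2p - k` is not a multiple of the stride, several input sizes share one output size, and the transposed formula returns the smallest of them. For example, inputs of size 5 and 6 both map to 3 with `k=3, s=2, p=1`, while the transposed call returns 5.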
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) the input-to-hidden function, (2) the hidden-to-hidden transition and (3) the hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies for) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
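The stochastic pooling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the activations in a pooling region are non-negative (for example, after a rectifier), and the function name stochastic_pool and the choice to return zero for an all-zero region are assumptions made for this sketch.

```python
import random

def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region, with probability
    proportional to its value (a multinomial over the region)."""
    total = sum(region)
    if total <= 0:
        # Assumption for this sketch: an all-zero region pools to zero.
        return 0.0
    # random.choices samples with probability weight / sum(weights).
    return rng.choices(region, weights=region, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    region = [0.5, 1.5, 0.0, 2.0]  # hypothetical 2x2 pooling region, flattened
    print([stochastic_pool(region, rng) for _ in range(5)])
```

Larger activations are selected more often, but smaller non-zero activations still have a chance of being chosen, which is the source of the regularizing noise.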
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
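The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is a simplification made for illustration: it applies the norm directly to a list of values and omits the learned projections of the full unit. The function name lp_pool is hypothetical.

```python
import math

def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|v|^p)) ** (1/p)."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)

if __name__ == "__main__":
    xs = [1, 2, 3]
    print(lp_pool(xs, 1))    # mean of absolute values
    print(lp_pool(xs, 2))    # root-mean-square
    print(lp_pool(xs, 200))  # approaches max(|v|) as p grows
```

With p = 1 the unit computes the mean of absolute values, with p = 2 the root-mean-square, and as p grows the result approaches the largest absolute value, which is why learning p per unit lets the model interpolate between these pooling operators.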
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach on a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines the denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a slight increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g., MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
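The pooling interpretation in the abstract above can be illustrated with a minimal sketch of a normalized $L_p$ norm. This is only the bare formula $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$; the function name `lp_pool` and the sample inputs are hypothetical, and the learned projections and learned orders described in the abstract are omitted.

```python
def lp_pool(values, p):
    """Normalized L_p norm of a group of inputs: (mean(|x_i|^p))^(1/p).

    Illustrative only: the learned projections and per-unit orders
    mentioned in the abstract are not modeled here.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    if not values:
        raise ValueError("values must be non-empty")
    mean_power = sum(abs(v) ** p for v in values) / len(values)
    return mean_power ** (1.0 / p)


inputs = [3.0, -4.0]
print(lp_pool(inputs, 1))    # p = 1: average of absolute values
print(lp_pool(inputs, 2))    # p = 2: root-mean-square
print(lp_pool(inputs, 200))  # large p: approaches max(|x_i|)
```

Varying `p` moves the same unit smoothly between average-like and max-like pooling, which is the generalization the abstract refers to.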
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
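The masking idea in the abstract above can be sketched for a single hidden layer. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random assignment of a "degree" to each hidden unit, and the natural input ordering are all illustrative choices.

```python
import random


def made_masks(num_inputs, num_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumption: each hidden unit k gets a random degree m(k) in
    1..num_inputs-1, and inputs are ordered 1..num_inputs.
    """
    if num_inputs < 2:
        raise ValueError("need at least two inputs")
    rng = random.Random(seed)
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit k may see input d only if m(k) >= d.
    in_mask = [[1 if degrees[k] >= d else 0
                for d in range(1, num_inputs + 1)]
               for k in range(num_hidden)]
    # Output d may see hidden unit k only if d > m(k).
    out_mask = [[1 if d > degrees[k] else 0
                 for k in range(num_hidden)]
                for d in range(1, num_inputs + 1)]
    return in_mask, out_mask


def connectivity(in_mask, out_mask):
    """Count masked paths from each input (column) to each output (row)."""
    num_outputs, num_hidden = len(out_mask), len(in_mask)
    num_inputs = len(in_mask[0])
    return [[sum(out_mask[d][k] * in_mask[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for d in range(num_outputs)]


paths = connectivity(*made_masks(num_inputs=5, num_hidden=16))
for row in paths:
    print(row)  # entries on and above the diagonal are zero
```

Because a path from input j to output d requires d > m(k) >= j, output d can depend only on inputs that precede it in the ordering, which is the autoregressive constraint.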
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
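The random-walk picture in the abstract above can be explored with a small simulation. This is a rough sketch under stated assumptions: the matrix entries are drawn with variance 1/width, a common variance-preserving choice that is not necessarily the scaling derived in the paper, and the function name `log_norm_walk` is hypothetical.

```python
import math
import random


def log_norm_walk(width, depth, seed=0):
    """Track log ||v|| as v is multiplied by a fresh random matrix per layer.

    Assumption: entries are Gaussian with variance 1/width; the paper
    computes its own scaling, which may differ from this choice.
    """
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(width)
    vec = [rng.gauss(0.0, 1.0) for _ in range(width)]
    trace = []
    for _ in range(depth):
        # Each output coordinate is a dot product with a new random row,
        # so a different random matrix is applied at every layer.
        vec = [sum(rng.gauss(0.0, std) * v for v in vec)
               for _ in range(width)]
        trace.append(math.log(math.sqrt(sum(v * v for v in vec))))
    return trace


for width in (8, 64):
    walk = log_norm_walk(width=width, depth=20, seed=1)
    print(width, [round(x, 2) for x in walk[::5]])
```

Running it for several seeds and widths gives a feel for how the spread of the log-norm depends on depth and layer width.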
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
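Two quantities in the abstract above are easy to make concrete: the Kullback-Leibler divergence between a teacher and a student distribution, and the relative WER improvement. The sketch below is illustrative only; the function names and the example distributions are made up, and the direction KL(teacher || student) is an assumption, since the abstract does not state it.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def relative_improvement(baseline, new):
    """Fractional reduction of an error rate relative to the baseline."""
    return (baseline - new) / baseline


# Hypothetical per-frame posteriors over three classes.
teacher = [0.7, 0.2, 0.1]  # soft alignment from the large model
student = [0.6, 0.3, 0.1]  # prediction of the small model
print(kl_divergence(teacher, student))

# WER figures quoted in the abstract: 4.54 baseline, 3.93 after transfer.
print(relative_improvement(4.54, 3.93))  # about 0.134, i.e. more than 13%
```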
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
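The idea of distilling an ensemble into a single model, described in the abstract above, is commonly implemented by training the small model on softened teacher outputs. The sketch below assumes a temperature-scaled softmax and a cross-entropy loss against the soft targets; the temperature parameter is not mentioned in the abstract itself, and the function names and logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(teacher_logits, student_logits, temperature):
    """Cross-entropy of the student against the teacher's soft targets."""
    targets = softmax(teacher_logits, temperature)
    preds = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, preds))


teacher_logits = [5.0, 2.0, 0.5]  # e.g. averaged ensemble logits
student_logits = [3.0, 2.5, 1.0]
print(softmax(teacher_logits, temperature=1.0))
print(softmax(teacher_logits, temperature=5.0))  # flatter distribution
print(soft_target_loss(teacher_logits, student_logits, temperature=5.0))
```

Raising the temperature spreads probability mass over the non-maximal classes, so the student also sees how the teacher ranks the wrong answers.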
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
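The straight-path analysis described above amounts to evaluating the loss at evenly spaced points on the line segment between the initial and the final parameters. The sketch below is a minimal illustration, assuming parameters are flattened into one vector; `loss_along_path` and the toy quadratic loss are hypothetical stand-ins, not the networks studied in the abstract.

```python
import numpy as np

def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at theta(a) = (1 - a) * theta_init + a * theta_final."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([
        loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas
    ])
    return alphas, losses

# Toy stand-in for a training objective: a convex quadratic bowl.
def toy_loss(theta):
    return float(np.sum(theta ** 2))

theta_init = np.array([3.0, -2.0])   # hypothetical initialization
theta_final = np.array([0.0, 0.0])   # hypothetical solution found by training
alphas, losses = loss_along_path(toy_loss, theta_init, theta_final)
```

Plotting `losses` against `alphas` gives the one-dimensional cross-section; a curve that decreases smoothly, as it does for this toy loss, is what "no significant obstacles" would look like.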
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
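The rank-based selection rule described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the `pressure` parameter (roughly the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) and both function names are hypothetical.

```python
import numpy as np

def rank_selection_probs(losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the largest latest-known loss.
    """
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    order = np.argsort(-losses, kind="stable")   # indices by descending loss
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)
    weights = np.exp(-np.log(pressure) * ranks / n)
    return weights / weights.sum()

def sample_batch(losses, batch_size, pressure=100.0, rng=None):
    """Draw distinct datapoint indices according to the rank-based probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    probs = rank_selection_probs(losses, pressure)
    return rng.choice(probs.size, size=batch_size, replace=False, p=probs)

latest_losses = [0.1, 2.0, 0.5, 1.0]
probs = rank_selection_probs(latest_losses, pressure=100.0)
batch = sample_batch(latest_losses, batch_size=2, rng=np.random.default_rng(0))
```

Setting `pressure=1.0` recovers uniform sampling, so the parameter interpolates between ordinary shuffled batches and strongly loss-biased ones.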
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM) and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
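As a rough illustration of the stochastic rounding mentioned above, the sketch below rounds a value onto a fixed-point grid and rounds up with probability equal to its fractional distance past the lower grid point, so the result is unbiased in expectation. The name `stochastic_round`, the `frac_bits` parameter and the use of Python's `random` module are assumptions made for illustration, not details taken from the paper; saturation to a 16-bit range is omitted.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, and down otherwise.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # in [0, 1): how far x sits past the lower grid point
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    x = 0.3
    # With 2 fractional bits the grid step is 0.25, so each sample is 0.25 or 0.5.
    samples = [stochastic_round(x, frac_bits=2, rng=rng) for _ in range(10000)]
    print(sum(samples) / len(samples))  # stays close to x on average
```

Round-to-nearest would map 0.3 to 0.25 every time on this grid, whereas the stochastic variant preserves the value on average across many updates.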
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
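The kind of weight rescaling that leaves a network's output unchanged can be seen in a small one-hidden-layer ReLU network: multiplying a hidden unit's incoming weights by c > 0 and dividing its outgoing weight by c gives the same function. The sketch below only demonstrates this invariance and is not an implementation of Path-SGD; the helper names `forward` and `rescale_unit` are hypothetical.

```python
def relu(v):
    return max(0.0, v)


def forward(w_in, w_out, x):
    """One-hidden-layer ReLU network with a scalar output.

    w_in[j] is the incoming weight vector of hidden unit j and
    w_out[j] is its outgoing weight.
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return sum(h * v for h, v in zip(hidden, w_out))


def rescale_unit(w_in, w_out, j, c):
    """Scale unit j's incoming weights by c > 0 and its outgoing weight by 1/c."""
    new_in = [list(row) for row in w_in]
    new_out = list(w_out)
    new_in[j] = [c * w for w in new_in[j]]
    new_out[j] = new_out[j] / c
    return new_in, new_out


if __name__ == "__main__":
    w_in = [[1.0, -2.0], [0.5, 0.25]]
    w_out = [3.0, -1.0]
    x = [2.0, 0.5]
    r_in, r_out = rescale_unit(w_in, w_out, j=0, c=4.0)
    # Both networks compute the same output although their weights differ.
    print(forward(w_in, w_out, x), forward(r_in, r_out, x))
```

The weights of the two networks differ, so a plain gradient step treats them differently even though they compute the same function; this is the mismatch a rescaling-invariant geometry is meant to remove.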
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
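The relationships the guide describes can be sketched for a single spatial dimension. The code assumes the standard output-size relations for convolution with zero padding and stride, ignores dilation and output padding, and uses hypothetical function names; `i`, `k`, `s` and `p` stand for input size, kernel size, stride and zero padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution along one axis: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of the transposed convolution associated with (k, s, p).

    This inverts conv_output_size exactly only when (o + 2p - k) is
    divisible by s for the original input size o.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    o = conv_output_size(5, k=3, s=2, p=1)
    print(o)                                               # 3
    print(transposed_conv_output_size(o, k=3, s=2, p=1))   # 5
```

Input sizes 5 and 6 both map to an output of 3 with `k=3, s=2, p=1`, which is why the transposed layer cannot recover the original size without extra information in the strided case.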
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) the input-to-hidden function, (2) the hidden-to-hidden transition and (3) the hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not map well onto parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word-length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory-based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
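One simple way to combine max and average pooling is a convex blend of the two over each pooling region. The sketch below uses a fixed mixing proportion `alpha`, which is an assumption made for illustration; the abstract describes learned combinations, and `mixed_pool` is a hypothetical helper rather than the authors' formulation.

```python
def mixed_pool(region, alpha=0.5):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 reduces to max pooling and alpha = 0.0 to average pooling.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    avg = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * avg


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    print(mixed_pool(region, alpha=1.0))  # 6.0 (max pooling)
    print(mixed_pool(region, alpha=0.0))  # 3.0 (average pooling)
    print(mixed_pool(region, alpha=0.5))  # 4.5
```

In a trainable layer `alpha` would be a parameter updated by gradient descent, for example one value per layer or per channel.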
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
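The sampling step described above can be sketched for a single pooling region: each activation is chosen with probability proportional to its value. The sketch assumes non-negative activations (for example after a rectifier) and returns 0 when the whole region is zero; the name `stochastic_pool` and the zero-region convention are assumptions made for illustration.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Activation a_i is selected with probability a_i / sum(region),
    i.e. a multinomial distribution given by the activities.
    Assumes every activation is non-negative.
    """
    total = sum(region)
    if total <= 0:
        return 0.0  # assumed convention for an all-zero region
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [1.0, 3.0]
    picks = [stochastic_pool(region, rng=rng) for _ in range(10000)]
    # The larger activation is chosen roughly three times as often.
    print(picks.count(3.0) / len(picks))
```

Unlike max pooling, smaller activations are still selected some of the time, which is the source of the regularizing noise.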
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
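As a rough illustration of the pooling interpretation described above, the sketch below computes a normalized $L_p$ norm over a list of inputs. It is a simplification under stated assumptions: the learned projections mentioned in the abstract are omitted (inputs are used directly), and the helper name `lp_unit` is hypothetical, not taken from the paper.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm: (mean of |x_i|^p) ** (1/p).

    Hypothetical helper for illustration; the learned projections
    described in the abstract are left out (assumption).
    """
    n = len(inputs)
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)


x = [1.0, -2.0, 3.0, -4.0]

mean_abs = lp_unit(x, 1)    # p = 1: average of absolute values
rms = lp_unit(x, 2)         # p = 2: root-mean-square
near_max = lp_unit(x, 200)  # large p: approaches max(|x_i|)

print(mean_abs, rms, near_max)
```

With `p = 1` the unit averages absolute values, with `p = 2` it gives the root-mean-square, and as `p` grows it approaches the largest absolute input, which is the sense in which a single order parameter interpolates between these pooling operators.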
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
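The masking idea above can be sketched for a single hidden layer. The abstract does not spell out how masks are built, so the degree-based rule used here is an assumption: each hidden unit gets a degree `m`, it may read inputs with index at most `m`, and output `d` may read hidden units with degree strictly below `d`. The helper names `build_masks` and `connectivity` are hypothetical.

```python
def build_masks(num_inputs, hidden_degrees):
    """Build binary masks for a one-hidden-layer autoencoder.

    Inputs and outputs are indexed 1..num_inputs. Hidden unit k has
    degree hidden_degrees[k] (an illustrative assignment, assumption).
    """
    dims = range(1, num_inputs + 1)
    # mask_in[k][d-1] == 1: hidden unit k may read input d (needs m_k >= d)
    mask_in = [[1 if m >= d else 0 for d in dims] for m in hidden_degrees]
    # mask_out[d-1][k] == 1: output d may read hidden unit k (needs d > m_k)
    mask_out = [[1 if d > m else 0 for m in hidden_degrees] for d in dims]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """conn[d][j] == 1 if output d can be influenced by input j."""
    num_dims = len(mask_out)
    num_hidden = len(mask_in)
    return [
        [
            int(any(mask_out[d][k] and mask_in[k][j] for k in range(num_hidden)))
            for j in range(num_dims)
        ]
        for d in range(num_dims)
    ]


mask_in, mask_out = build_masks(4, [1, 2, 3, 1, 2])
conn = connectivity(mask_in, mask_out)
for row in conn:
    print(row)
```

The resulting connectivity matrix is strictly lower triangular: each output depends only on inputs that come earlier in the ordering, which is what lets the outputs be read as conditional probabilities whose product is the joint.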
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
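The random-walk picture above can be probed with a small simulation. This is only a sketch under stated assumptions: layers are purely linear, weights are i.i.d. Gaussian with variance `1/width` (a common choice, not the optimal scaling derived in the paper), and the function names are hypothetical.

```python
import math
import random


def log_norm_after(depth, width, rng):
    """Push a unit vector through `depth` random linear layers and
    return the log of the final Euclidean norm."""
    v = [1.0 / math.sqrt(width)] * width  # unit-norm start vector
    std = 1.0 / math.sqrt(width)          # assumed weight scale
    for _ in range(depth):
        # Each output coordinate is a fresh random row dotted with v.
        v = [sum(rng.gauss(0.0, std) * x for x in v) for _ in range(width)]
    return math.log(math.sqrt(sum(x * x for x in v)))


def log_norm_variance(depth, width, trials=100, seed=0):
    """Sample variance of the log-norm over independent random networks."""
    rng = random.Random(seed)
    samples = [log_norm_after(depth, width, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / (trials - 1)


var_narrow = log_norm_variance(depth=10, width=4)
var_wide = log_norm_variance(depth=10, width=16)
print(var_narrow, var_wide)
```

In this toy setting the log-norm fluctuates much more for the narrow network than for the wide one, in line with the statement that the variance of the walk shrinks as layer width grows.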
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
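The two quantities in this abstract can be checked with a few lines of arithmetic. The WER figures are the ones quoted above; the logits, the `softmax` and `kl_divergence` helpers, and the direction of the divergence (teacher first) are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Made-up logits standing in for teacher (RNN) and student (DNN) outputs.
teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.5, 1.2, 0.3])
distill_loss = kl_divergence(teacher, student)

# Relative WER improvement from the figures quoted in the abstract.
baseline_wer = 4.54
distilled_wer = 3.93
relative_improvement = (baseline_wer - distilled_wer) / baseline_wer

print(distill_loss, relative_improvement)
```

The relative improvement works out to roughly 13.4%, consistent with the "more than 13%" stated in the abstract. The divergence is zero only when the student's distribution matches the teacher's exactly.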
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and the integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy from layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they have led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
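The gradient reversal layer named in the abstract above can be sketched as a layer that passes features through unchanged in the forward pass and negates the gradient in the backward pass. This is a minimal pure-Python illustration, not the authors' implementation; the class name, the list-based interface, and the `scale` factor are assumptions introduced here.

```python
class GradientReversal:
    """Identity in the forward pass; negated (and scaled) gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # `scale` is a hypothetical knob for the strength of the reversal.
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor has its sign flipped,
        # so the extractor is pushed to make the domains harder to tell apart.
        return [-self.scale * g for g in grad_output]


layer = GradientReversal(scale=2.0)
forward_out = layer.forward([1.0, -2.0])
backward_out = layer.backward([0.5, -1.0])
```

In a real framework the same behaviour would be expressed as a custom autograd operation; the sketch only shows the forward and backward rules.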
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
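The straight-path analysis described in the abstract above amounts to evaluating the loss at points interpolated linearly between the initial and the final parameters. A minimal sketch follows, assuming a toy quadratic loss and two-dimensional parameter vectors; the loss, the parameter values, and the function names are hypothetical and stand in for a real network and its training objective.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors (alpha in [0, 1])."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def toy_loss(theta):
    # Hypothetical stand-in for a network's training loss; minimized at (1, 1).
    return sum((t - 1.0) ** 2 for t in theta)


def loss_along_path(theta_init, theta_final, loss_fn, num_points=11):
    """Evaluate loss_fn at evenly spaced points from theta_init to theta_final."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(a, loss_fn(interpolate(theta_init, theta_final, a))) for a in alphas]


path = loss_along_path([0.0, 3.0], [1.0, 1.0], toy_loss)
for alpha, loss in path:
    print(f"alpha={alpha:.1f} loss={loss:.3f}")
```

A bump in the printed losses would indicate an obstacle on the path; for this convex toy loss the curve decreases smoothly, which says nothing about real networks.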
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
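The selection strategy in the abstract above ranks datapoints by their latest known loss and makes the selection probability decay exponentially with rank. A minimal sketch of that idea is given below; the `decay` parameter, the function names, and sampling with replacement via `random.choices` are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def selection_probabilities(losses, decay):
    """Selection probability per datapoint, decaying exponentially with loss rank."""
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * len(losses)
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, decay, rng):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probabilities(losses, decay)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


latest_losses = [0.2, 1.5, 0.7, 0.1]  # hypothetical per-datapoint losses
probs = selection_probabilities(latest_losses, decay=0.5)
batch = sample_batch(latest_losses, batch_size=2, decay=0.5, rng=random.Random(0))
```

A larger `decay` concentrates sampling on high-loss datapoints, while `decay=0` recovers uniform selection.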
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
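The stochastic rounding idea mentioned in the abstract above can be illustrated with a minimal sketch. The function names, the 16-bit word with 8 fractional bits, and the saturation behaviour are assumptions made for illustration; the abstract does not specify them. A value is rounded up to the next representable grid point with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its distance from the
    lower grid point (measured in grid units), otherwise rounded down.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # scaled - lower lies in [0, 1); exactly representable values never move.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed word_bits range.

    word_bits and frac_bits are hypothetical defaults chosen for this sketch.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (word_bits - 1)) / scale
    highest = ((1 << (word_bits - 1)) - 1) / scale
    return min(highest, max(lowest, stochastic_round(x, frac_bits, rng)))
```

Averaging many calls to `stochastic_round(0.3, frac_bits=2)` should give a value close to 0.3, whereas round-to-nearest would always return 0.25.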
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
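The relationships named in the abstract above (input shape, kernel shape, zero padding, strides and output shape) can be sketched for one spatial dimension using the standard size formulas. The function names and the `output_padding` argument are assumptions for illustration, not identifiers taken from the guide.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution (or pooling with p=0) along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0, output_padding=0):
    """Output length of the transposed convolution along one axis.

    output_padding resolves the ambiguity when several input sizes of the
    forward convolution map to the same output size (it must be < s).
    """
    return s * (i - 1) + k - 2 * p + output_padding
```

For example, `conv_output_size(5, 3, s=2, p=1)` gives 3, and `transposed_conv_output_size(3, 3, s=2, p=1)` gives 5, recovering the original input size.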
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies for) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
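One simple way to combine max and average pooling, as the abstract above describes in direction (1), is a convex mix of the two. The sketch below is an assumption for illustration: the abstract does not give the formula, and here the mixing proportion `a` is passed in as a plain number rather than learned by gradient descent.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    region: non-empty list of activations in the pooling window.
    a: mixing proportion in [0, 1]; a=1 is max pooling, a=0 is average pooling.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    average = sum(region) / len(region)
    return a * max(region) + (1.0 - a) * average


def mixed_pool_1d(values, window, a):
    """Apply mixed_pool over non-overlapping windows of a 1-D sequence."""
    return [
        mixed_pool(values[start:start + window], a)
        for start in range(0, len(values) - window + 1, window)
    ]
```

With `a` treated as a trainable parameter, the network can move between max-like and average-like behaviour per layer or per region.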
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does deep learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of "shadow" groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
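The Kullback-Leibler objective mentioned in the abstract above can be illustrated with a minimal sketch. The direction of the divergence (teacher relative to student), the function name `kl_divergence`, and the three-class posteriors are assumptions made for illustration; the abstract does not specify them.

```python
import math

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions over the same classes.

    Assumption: the soft targets from the large model play the role of `teacher`.
    """
    return sum(
        p * math.log(p / max(q, eps))  # eps guards against log(p / 0)
        for p, q in zip(teacher, student)
        if p > 0.0  # terms with p == 0 contribute nothing
    )

# Hypothetical per-frame posteriors over three acoustic states.
teacher_posterior = [0.7, 0.2, 0.1]
student_posterior = [0.5, 0.3, 0.2]
loss = kl_divergence(teacher_posterior, student_posterior)
```

The divergence is zero when the student reproduces the teacher's distribution and positive otherwise, which is what makes it usable as a training loss for the small model.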
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding an effective and efficient method for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
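The gradient reversal layer named in the abstract above can be sketched framework-free: it passes features through unchanged on the forward pass and flips the sign of the gradient on the backward pass. The class name `GradientReversal`, the list-based interface, and the `scale` factor are assumptions made for illustration, not the authors' implementation.

```python
class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass."""

    def __init__(self, scale=1.0):
        # Assumption: a scalar weight on the reversed gradient (hypothetical parameter).
        self.scale = scale

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, upstream_grad):
        # The feature extractor receives the negated domain-classifier gradient,
        # so it is pushed toward features that do not separate the domains.
        return [-self.scale * g for g in upstream_grad]

layer = GradientReversal(scale=0.5)
out = layer.forward([1.0, -2.0])
grad = layer.backward([0.2, -0.4])
```

In a deep learning package the same behaviour would normally be expressed as a custom gradient attached to an identity operation.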
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
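The rank-based selection rule described in the abstract above can be sketched as follows. The function name `rank_selection_probs` and the `pressure` parameter (roughly the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) are assumptions made for illustration; the abstract states only that the probability decays exponentially with rank.

```python
import math

def rank_selection_probs(losses, pressure=100.0):
    """Selection probability per datapoint, decaying exponentially with loss rank.

    `pressure` is a hypothetical knob: larger values favour high-loss points more.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(-math.log(pressure) / n)
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs

latest_losses = [0.3, 2.1, 0.9, 0.05]
probs = rank_selection_probs(latest_losses, pressure=10.0)
```

A batch could then be drawn with `random.choices(range(len(probs)), weights=probs, k=batch_size)`, which samples with replacement; whether the original method samples with or without replacement is not stated in the abstract.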
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, a relative improvement of more than 8% over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called the $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013), which achieved state-of-the-art object recognition results on a number of benchmark datasets. Second, we provide a geometrical interpretation of the activation function, based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
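The masking idea in the abstract above can be sketched for a single hidden layer. This is a minimal illustration, not the authors' implementation: the degree-based rule (each hidden unit is assigned a degree and may only read inputs up to that degree, while output d may only read hidden units of strictly lower degree) is one common way to build such masks, and the names `made_masks`, `connectivity`, and `hidden_degrees` are hypothetical.

```python
def made_masks(n_inputs, hidden_degrees):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Inputs and outputs are numbered 1..n_inputs in the chosen ordering.
    hidden_degrees[k] is an integer in 1..n_inputs-1 assigned to hidden unit k
    (an assumption of this sketch; any assignment in that range works).
    """
    in_deg = list(range(1, n_inputs + 1))
    # mask_in[k][d] is 1 when hidden unit k may read input d+1.
    mask_in = [[1 if m >= d else 0 for d in in_deg] for m in hidden_degrees]
    # mask_out[d][k] is 1 when output d+1 may read hidden unit k.
    mask_out = [[1 if d > m else 0 for m in hidden_degrees] for d in in_deg]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count masked paths from each input (column) to each output (row)."""
    n_hid, n_in = len(mask_in), len(mask_in[0])
    return [[sum(row[k] * mask_in[k][i] for k in range(n_hid))
             for i in range(n_in)] for row in mask_out]


mask_in, mask_out = made_masks(4, [1, 2, 3, 1, 2])
paths = connectivity(mask_in, mask_out)
```

With these masks, `paths[o][i]` is zero whenever `i >= o`, so output `o` depends only on inputs that come earlier in the ordering, which is the autoregressive constraint the abstract describes.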
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
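The two quantities in the abstract above, the Kullback-Leibler objective and the relative WER improvement, can be sketched in a few lines. This is only an illustration under stated assumptions: the direction KL(teacher || student), the example logits, and the function names are not taken from the source.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def relative_improvement(baseline, new):
    """Relative reduction of an error rate, as a fraction of the baseline."""
    return (baseline - new) / baseline


# Hypothetical per-frame posteriors: the RNN acts as teacher, the small DNN as student.
teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.5, 1.2, 0.3])
loss = kl_divergence(teacher, student)

# WER figures quoted in the abstract: 4.54 baseline, 3.93 after transfer.
gain = relative_improvement(4.54, 3.93)
```

Here `gain` evaluates to roughly 0.134, consistent with the "more than 13%" stated in the abstract; `loss` is positive and would be zero if the student matched the teacher exactly.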
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art results on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state of the art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
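The Gramian Angular Field encoding mentioned in the abstract above can be sketched as follows. This is a minimal illustration that assumes the usual definitions (rescale the series, map each value x to an angle arccos(x), then take cos of angle sums for GASF and sin of angle differences for GADF); the function names are hypothetical and the series is assumed to be non-constant.

```python
import math


def rescale01(series):
    """Min-max rescale a non-constant series to [0, 1]."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]


def angles(series):
    """Map each rescaled value to its polar angle arccos(x)."""
    return [math.acos(x) for x in rescale01(series)]


def gasf(series):
    """Gramian Angular Summation Field: entry (i, j) is cos(phi_i + phi_j)."""
    phi = angles(series)
    return [[math.cos(a + b) for b in phi] for a in phi]


def gadf(series):
    """Gramian Angular Difference Field: entry (i, j) is sin(phi_i - phi_j)."""
    phi = angles(series)
    return [[math.sin(a - b) for b in phi] for a in phi]


def recover_from_gasf(field):
    """Invert the GASF diagonal: cos(2*phi) = 2*x**2 - 1 with x in [0, 1]."""
    return [math.sqrt(max(0.0, (field[i][i] + 1.0) / 2.0))
            for i in range(len(field))]


series = [1.0, 2.0, 4.0, 3.0]
summation = gasf(series)
difference = gadf(series)
recovered = recover_from_gasf(summation)
```

On [0, 1] rescaled data the diagonal of the GASF determines each rescaled value uniquely, which is the bijection property the abstract refers to; `recovered` matches `rescale01(series)` up to floating-point error.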
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
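The factorization in the abstract above follows in one step from marginalizing over $H$; the derivation below is a sketch that uses only the symbols and the capacity condition stated there.

```latex
\begin{align}
P(X = x) &= \sum_{h} P(X = x \mid H = h)\, P(H = h) \\
         &= P(X = x \mid H = f(x))\, P(H = f(x)),
\end{align}
% because the capacity condition gives P(X = x | H = h) = 0 whenever h \neq f(x).
% Taking logs separates the two terms described in the abstract:
\begin{equation}
\log P(X = x) =
  \underbrace{\log P(X = x \mid H = f(x))}_{\text{reconstruction log-likelihood}}
  + \underbrace{\log P(H = f(x))}_{\text{regularizer on } h = f(x)} .
\end{equation}
```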
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
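The straight-path analysis described in the abstract above can be illustrated with a small sketch. It assumes the path is a plain linear interpolation between the initial and final parameter vectors, and it uses hypothetical names (loss_along_line, a toy quadratic loss) that do not come from the abstract.

```python
def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the segment
    theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final."""
    curve = []
    for j in range(num_points):
        alpha = j / (num_points - 1)
        theta = [(1 - alpha) * a + alpha * b
                 for a, b in zip(theta_init, theta_final)]
        curve.append((alpha, loss_fn(theta)))
    return curve


def toy_loss(theta):
    # Hypothetical stand-in for a network's training loss.
    return sum(t * t for t in theta)


curve = loss_along_line(toy_loss, [2.0, -1.0], [0.0, 0.0])
```

Plotting the second element of each pair against alpha gives a one-dimensional cross-section of the loss surface; a bump in that curve would indicate an obstacle on the straight path.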
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
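The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `pressure` parameter (roughly the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) and the choice to sample with replacement are all assumptions.

```python
import math
import random


def selection_probabilities(losses, pressure=100.0):
    """Rank datapoints by latest known loss (rank 0 = largest loss) and
    assign probabilities that decay exponentially with rank."""
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n  # assumed decay rate per rank
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, rng, pressure=100.0):
    """Draw datapoint indices (with replacement) using the rank-based
    probabilities."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop, `losses` would be refreshed with the loss of each datapoint whenever it is included in a batch, so the ranking reflects the latest known values.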
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
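For readers unfamiliar with SVRG, the basic update analyzed in the abstract above can be sketched as follows. This is a generic textbook form of the method under assumed names (svrg, grad_i, inner_steps); it makes no claim to match the step sizes or epoch lengths of the paper's analysis.

```python
import random


def svrg(grad_i, n, x0, step, epochs, inner_steps, rng):
    """Minimize (1/n) * sum_i f_i(x), where grad_i(x, i) returns the
    gradient of f_i at x as a list of floats."""
    x = list(x0)
    for _ in range(epochs):
        # Take a snapshot and compute the full gradient there.
        snapshot = list(x)
        full = [0.0] * len(x)
        for i in range(n):
            g = grad_i(snapshot, i)
            full = [f + gi / n for f, gi in zip(full, g)]
        # Inner loop: stochastic gradient corrected by the snapshot terms,
        # which keeps the estimate unbiased while reducing its variance.
        for _ in range(inner_steps):
            i = rng.randrange(n)
            g_x = grad_i(x, i)
            g_s = grad_i(snapshot, i)
            x = [xj - step * (a - b + f)
                 for xj, a, b, f in zip(x, g_x, g_s, full)]
    return x
```

As a toy check (convex, unlike the setting of the abstract), taking f_i(x) = 0.5 * (x - a_i)^2 with `grad_i = lambda x, i: [x[0] - a[i]]` drives x toward the mean of the a_i.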
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model. The model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
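Stochastic rounding, which the abstract above identifies as crucial for 16-bit fixed-point training, can be sketched as below. The function name, the split into `frac_bits` fractional bits within a `word_bits`-wide signed word, and the saturation behaviour are assumptions made for illustration, not details taken from the abstract.

```python
import math
import random


def stochastic_round_fixed(x, frac_bits, rng, word_bits=16):
    """Round x to a signed fixed-point value with frac_bits fractional bits.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so the result is unbiased in
    expectation. Out-of-range values saturate (an assumed convention).
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale
```

Unlike round-to-nearest, updates smaller than half a grid step are not systematically lost: they survive with probability proportional to their size, which is one intuition for why this scheme helps at low precision.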
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
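As a rough illustration of the pooling interpretation in the abstract above, the sketch below computes a normalized L_p norm over a group of activations in plain Python. The function name `lp_pool` and the sample values are hypothetical, and the sketch leaves out the learned projections and the learned order p that the abstract describes. It only shows how p = 1, p = 2 and large p recover average, root-mean-square and (approximately) max pooling of absolute values.

```python
def lp_pool(values, p):
    """Normalized L_p norm of a group of activations: (mean(|v|^p))^(1/p).

    Hypothetical helper for illustration only; not the paper's full unit.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(values)
    # Factor out the largest magnitude so large p does not overflow.
    m = max(abs(v) for v in values)
    if m == 0.0:
        return 0.0
    return m * (sum((abs(v) / m) ** p for v in values) / n) ** (1.0 / p)


acts = [1.0, 2.0, 3.0, 4.0]          # made-up activations
mean_abs = lp_pool(acts, 1)          # p = 1: average of absolute values
rms = lp_pool(acts, 2)               # p = 2: root-mean-square
near_max = lp_pool(acts, 200)        # large p: approaches max(|v|)
```

Raising p moves the result smoothly from the mean toward the maximum, which is the sense in which a single parameter interpolates between the familiar pooling operators.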
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
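The masking idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes one hidden layer and a degree-based mask construction (each hidden unit gets a random integer degree, and connections are kept only when degrees respect the input ordering). The names `build_masks` and `dependencies` are hypothetical. The check at the end confirms that output i can only be reached from inputs that come earlier in the ordering.

```python
import random


def build_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoencoder (assumed scheme).

    Inputs and outputs get degrees 1..n_inputs in their natural order;
    each hidden unit gets a random degree in 1..n_inputs-1.
    """
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Hidden unit k may read input j only if its degree is at least j's degree.
    mask_in = [[1 if hid_deg[k] >= in_deg[j] else 0 for j in range(n_inputs)]
               for k in range(n_hidden)]
    # Output i may read hidden unit k only if i's degree is strictly larger.
    mask_out = [[1 if in_deg[i] > hid_deg[k] else 0 for k in range(n_hidden)]
                for i in range(n_inputs)]
    return mask_in, mask_out


def dependencies(mask_in, mask_out):
    """For each output, return the set of input indices it can depend on."""
    n_inputs = len(mask_in[0])
    deps = []
    for row in mask_out:
        seen = set()
        for k, allowed in enumerate(row):
            if allowed:
                seen.update(j for j in range(n_inputs) if mask_in[k][j])
        deps.append(seen)
    return deps


mask_in, mask_out = build_masks(n_inputs=5, n_hidden=8)
deps = dependencies(mask_in, mask_out)
# Autoregressive property: output i (0-based) only sees inputs j < i.
is_autoregressive = all(j < i for i, d in enumerate(deps) for j in d)
```

In a real network the masks would be multiplied elementwise with the weight matrices before each forward pass, so that masked weights never contribute to an output.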
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
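The random-walk claim in the abstract above can be probed with a small simulation. The sketch below is an assumption-laden illustration rather than the paper's experiment: it uses purely linear layers with Gaussian entries of variance 1/width (a common scaling, not necessarily the unbiased one the authors derive, so the walk here may drift slightly), and the helper names are hypothetical. It estimates the variance of the log-norm of a vector after repeated multiplication by fresh random matrices, for different widths and depths.

```python
import math
import random


def final_log_norm(width, depth, rng):
    """Push a unit vector through `depth` random linear layers; return log ||x||."""
    x = [1.0 / math.sqrt(width)] * width   # unit-norm starting vector
    scale = 1.0 / math.sqrt(width)         # assumed: entries ~ N(0, 1/width)
    for _ in range(depth):
        # Each output coordinate uses a fresh random row of the layer matrix.
        x = [sum(rng.gauss(0.0, scale) * xj for xj in x) for _ in range(width)]
    return math.log(math.sqrt(sum(v * v for v in x)))


def log_norm_variance(width, depth, trials=100, seed=0):
    """Sample variance of the final log-norm over independent trials."""
    rng = random.Random(seed)
    samples = [final_log_norm(width, depth, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / (trials - 1)


var_narrow = log_norm_variance(width=4, depth=10)
var_wide = log_norm_variance(width=32, depth=10)
var_deep = log_norm_variance(width=4, depth=40)
```

Under these assumptions the wider network should show a much smaller spread of log-norms than the narrow one at equal depth, and the deeper narrow network a larger spread, in line with the stated dependence on depth and layer size.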
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing full exploitation of the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the multilayer bootstrap network into a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
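The "straight path from initialization to solution" in the abstract above can be read as evaluating the loss at parameters (1 - alpha) * theta_init + alpha * theta_final for alpha between 0 and 1. Below is a minimal Python sketch of that reading. The helper names and the quadratic `toy_loss` are illustrative assumptions standing in for a real network's training loss.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points from theta_init to theta_final."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    # Stand-in objective (assumption): a quadratic bowl with its minimum at (1, -2).
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2


# Hypothetical "initial" and "trained" parameter vectors.
path = loss_along_path(toy_loss, [4.0, 2.0], [1.0, -2.0])
for alpha, value in path:
    print(f"alpha={alpha:.1f}  loss={value:.3f}")
```

For a real model, `loss_fn` would compute the training loss for a flattened weight vector. A bump in the printed curve would indicate an obstacle along the path.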
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
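The selection rule described in the abstract above (rank datapoints by their latest known loss, with selection probability decaying exponentially in rank) can be sketched as follows. The exact parameterisation is an assumption: the `pressure` argument is a hypothetical knob that roughly sets the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints, and sampling is done with replacement for simplicity.

```python
import math
import random


def selection_probabilities(latest_losses, pressure=100.0):
    """Probability of picking each datapoint; rank 0 is the largest latest loss."""
    n = len(latest_losses)
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        # Exponential decay in rank (assumed form): weight = pressure ** (-rank / n).
        weights[idx] = math.exp(-math.log(pressure) * rank / n)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based rule."""
    probs = selection_probabilities(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


# Hypothetical latest loss values for four datapoints.
losses = [0.1, 2.5, 0.7, 1.2]
probs = selection_probabilities(losses)
batch = sample_batch(losses, batch_size=3, rng=random.Random(0))
```

Setting `pressure=1.0` recovers uniform sampling, which is one way to relax the selection pressure over time.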
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, a relative improvement of more than 8% over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
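The stochastic rounding idea in the abstract above can be illustrated with a minimal sketch. The function names, the default of 8 fractional bits, and the saturation helper are assumptions made for illustration; the abstract does not specify the authors' implementation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the result equals x in expectation.
    (Hypothetical helper; frac_bits=8 is an illustrative choice.)
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def saturate_16bit(x, frac_bits=8):
    """Clamp x to the range of a signed 16-bit word with frac_bits fractional bits."""
    scale = 1 << frac_bits
    lo = -(1 << 15) / scale
    hi = ((1 << 15) - 1) / scale
    return max(lo, min(hi, x))
```

Unlike round-to-nearest, this scheme does not systematically discard updates smaller than half a grid step, which is one plausible reason a rounding scheme can matter during low-precision training.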
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
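The relationships between input shape, kernel shape, zero padding, strides and output shape mentioned above can be sketched along one spatial axis. The formulas below are the standard ones for this arithmetic; the function and parameter names, and the `a` output-padding term, are chosen here for illustration and are not taken from the abstract.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0, a=0):
    """Output length of the transposed convolution along one axis.

    i, k, s, p are as above for the associated convolution; a is the
    extra output padding needed when several input sizes of that
    convolution map to the same output size (0 <= a < s).
    """
    return s * (i - 1) + a + k - 2 * p
```

For example, a 5-wide input with a 3-wide kernel, stride 2 and padding 1 yields a 3-wide output, and the transposed convolution with the same settings maps 3 back to 5.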
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
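One way to picture "combining max and average pooling" is as a convex combination of the two over a pooling region. The sketch below assumes that form; the abstract does not give the exact functional form, so the mixing coefficient `a`, the sigmoid gate, and both function names are illustrative assumptions rather than the authors' definitions.

```python
import math


def mixed_pool(region, a):
    """Blend max and average pooling over one region.

    a is a mixing proportion in [0, 1]: a=1 gives max pooling and
    a=0 gives average pooling. (Assumed form, for illustration only.)
    """
    return a * max(region) + (1 - a) * (sum(region) / len(region))


def gated_pool(region, w):
    """Like mixed_pool, but the mixing proportion depends on the input.

    w is a hypothetical weight vector with one entry per element of the
    region; the gate is a sigmoid of the dot product of w and the region.
    """
    z = sum(wi * xi for wi, xi in zip(w, region))
    gate = 1.0 / (1.0 + math.exp(-z))
    return mixed_pool(region, gate)
```

In a trained network, `a` or `w` would be learned by gradient descent along with the other parameters; here they are plain arguments so that the arithmetic is easy to check by hand.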
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
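The sampling step described above can be sketched as follows: within each pooling region, one activation is drawn with probability proportional to its value. The sketch assumes non-negative activations (for example, after a rectifier) and non-overlapping square regions; the function names and the choice to return 0 for an all-zero region are illustrative assumptions.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a region, with probability proportional to its value.

    Assumes all activations are non-negative. An all-zero region
    returns 0.0, since no multinomial distribution can be formed.
    """
    if sum(region) <= 0:
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


def stochastic_pool2d(fmap, size=2, rng=random):
    """Apply stochastic_pool over non-overlapping size x size windows of a 2D list."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        row = []
        for c in range(0, cols - size + 1, size):
            window = [fmap[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            row.append(stochastic_pool(window, rng))
        pooled.append(row)
    return pooled
```

Because the sampling probabilities come directly from the activations, there is no extra hyper-parameter to tune, which matches the "hyper-parameter free" description in the abstract.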
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
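The abstract above does not spell out how distillation is carried out. A common formulation, assumed here, trains the second network on "soft" labels produced by a softmax evaluated at a raised temperature. The sketch below only illustrates that softening step; the logits and the temperature value 20 are made-up illustrative numbers, and softmax_with_temperature is a hypothetical helper name.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)   # avoid overflow in exp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([8.0, 2.0, -1.0])                   # hypothetical network outputs
hard = softmax_with_temperature(logits, 1.0)          # nearly one-hot
soft = softmax_with_temperature(logits, 20.0)         # smoother class probabilities
```

A higher temperature flattens the distribution while keeping the class ranking unchanged, so the soft labels carry information about relative class similarity.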
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
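A minimal sketch of the normalized L_p norm that the abstract above describes, showing how the order p interpolates between familiar pooling operators. The function name lp_unit is hypothetical, and the learned projections (and any other learned parameters of the unit) are omitted; only the pooling step over already-projected inputs is shown.

```python
import numpy as np

def lp_unit(projections, p):
    """Normalized L_p norm: (mean(|x_i|^p))^(1/p); p = inf gives max pooling."""
    a = np.abs(np.asarray(projections, dtype=float))
    if np.isinf(p):
        return float(a.max())
    return float(np.mean(a ** p) ** (1.0 / p))

x = [1.0, -2.0, 3.0, -4.0]            # hypothetical projected inputs
mean_abs = lp_unit(x, 1)              # average of absolute values: 2.5
rms = lp_unit(x, 2)                   # root-mean-square: sqrt(7.5)
near_max = lp_unit(x, 200)            # large p approaches the maximum, 4.0
max_abs = lp_unit(x, np.inf)          # max pooling: 4.0
```

The values increase monotonically with p, which is why learning a separate order per unit lets the model choose between softer and sharper pooling.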
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
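A minimal sketch of the masking idea in the abstract above, for a single hidden layer and the natural input ordering. The abstract does not say how the masks are built; the degree-based construction used here (each hidden unit gets an integer degree and may only see inputs up to that degree) is one way to satisfy the autoregressive constraint and is stated as an assumption, as are the function name made_masks and the layer sizes.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, seed=0):
    """Binary masks so that output d depends only on inputs 1..d-1."""
    rng = np.random.default_rng(seed)
    in_deg = np.arange(1, n_inputs + 1)                  # input i has degree i
    hid_deg = rng.integers(1, n_inputs, size=n_hidden)   # degrees in [1, n_inputs - 1]
    # Hidden unit k may read input d only if deg(k) >= d.
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(float)    # (hidden, inputs)
    # Output d may read hidden unit k only if d > deg(k).
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(float)    # (outputs, hidden)
    return mask_in, mask_out

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
# Entry [d_out, d_in] counts the paths from input d_in to output d_out.
connectivity = mask_out @ mask_in
```

The masks are multiplied elementwise into the weight matrices of an ordinary autoencoder. A strictly lower-triangular connectivity matrix confirms that no output can see its own input or any later one.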
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
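A minimal sketch of how an attention mechanism can build the context mentioned in the abstract above: each encoder state gets a relevance score, the scores are normalized with a softmax, and the context is the weighted sum of encoder states. The dot-product scoring function is an assumption made to keep the example short, not the scoring used in the paper, and attention_context is a hypothetical name.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Return (context, weights) for one decoding step.

    decoder_state: shape (d,), encoder_states: shape (T, d).
    """
    scores = encoder_states @ decoder_state          # one score per input frame, (T,)
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over input positions
    context = weights @ encoder_states               # weighted sum, shape (d,)
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 4))   # hypothetical: 6 input frames, 4 features
decoder_state = rng.normal(size=4)
context, weights = attention_context(decoder_state, encoder_states)
```

The weights form a soft alignment between the current output symbol and the input frames, which is what replaces the explicit HMM alignment.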
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.
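The relative improvement quoted in the abstract above can be checked directly from the two word error rates it reports. The variable names are illustrative.

```python
baseline_wer = 4.54   # small DNN baseline, from the abstract
student_wer = 3.93    # small DNN trained on soft RNN alignments, from the abstract

absolute_gain = baseline_wer - student_wer
relative_improvement = absolute_gain / baseline_wer

print(f"absolute: {absolute_gain:.2f} WER points")
print(f"relative: {relative_improvement:.1%}")   # about 13.4%
```

The result, roughly 13.4%, is consistent with the "more than 13%" stated in the abstract.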
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy from layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
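The encoding step named in the abstract above can be illustrated with a minimal sketch. It assumes the usual definitions of the Gramian Angular Summation and Difference Fields (rescale the series, map each value to an angle with arccos, then take cos of pairwise angle sums and sin of pairwise angle differences); the function name `gramian_angular_fields` and the choice of rescaling to [0, 1] are assumptions made here for illustration, not details taken from the abstract.

```python
import numpy as np

def gramian_angular_fields(series):
    """Return (GASF, GADF) images for a 1-D, non-constant time series.

    Hypothetical helper: the name and the [0, 1] rescaling are assumptions.
    """
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    # Rescale to [0, 1] so that arccos is well defined (assumes hi > lo).
    x_scaled = np.clip((x - lo) / (hi - lo), 0.0, 1.0)
    # Polar encoding: each value becomes an angle.
    phi = np.arccos(x_scaled)
    # Pairwise angle sums / differences give the two image types.
    gasf = np.cos(phi[:, None] + phi[None, :])
    gadf = np.sin(phi[:, None] - phi[None, :])
    return gasf, gadf

gasf, gadf = gramian_angular_fields([1.0, 2.0, 3.0])
```

The diagonal of the GASF equals 2x^2 - 1 for each rescaled value x, which is why the rescaled series can be recovered from the image when the values are non-negative.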
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
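The gradient reversal layer mentioned in the abstract above can be sketched in a framework-free way. This is only an illustration under stated assumptions: the class name `GradientReversal`, the explicit `forward`/`backward` methods, and the scaling factor `lam` are hypothetical and are not taken from the abstract. The idea is that the layer passes features through unchanged on the forward pass and flips the sign of the gradient on the backward pass, so the feature extractor is pushed to make the domains indistinguishable while the domain classifier is still trained normally.

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass.

    Hypothetical sketch: `lam` (assumed here) scales the reversed gradient.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features flow to the domain classifier unchanged.
        return features

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is reversed.
        return -self.lam * grad_output

layer = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0])
out = layer.forward(features)
grad_in = layer.backward(np.array([2.0, 4.0]))
```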
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
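The "straight path from initialization to solution" analysis described in the abstract above amounts to evaluating the loss at points on the line segment between two parameter vectors. The following is a minimal sketch of that idea; the function name `loss_along_line`, the flat-vector parameterization, and the toy quadratic loss are assumptions made for illustration and stand in for a real network and its training objective.

```python
import numpy as np

def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points between two parameter vectors."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([
        loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas
    ])
    return alphas, losses

def toy_loss(theta):
    # Hypothetical stand-in for a training objective, minimized at theta = 3.
    return float(np.sum((theta - 3.0) ** 2))

alphas, losses = loss_along_line(toy_loss, np.zeros(2), np.full(2, 3.0))
```

Plotting `losses` against `alphas` shows whether the loss decreases smoothly along the path or passes over a barrier; for the toy quadratic it decreases monotonically.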
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
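The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the parameter `s` (assumed here to be the ratio between the selection probabilities of the highest- and lowest-ranked datapoints), and the exact form of the exponential decay are assumptions.

```python
import numpy as np

def rank_selection_probs(latest_losses, s=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Hypothetical helper: rank 0 is the datapoint with the greatest latest loss.
    """
    losses = np.asarray(latest_losses, dtype=float)
    n = losses.size
    order = np.argsort(-losses)          # indices sorted by decreasing loss
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)          # rank of each datapoint
    # Exponential decay in rank; the top/bottom probability ratio equals s.
    weights = np.exp(-np.log(s) * ranks / max(n - 1, 1))
    return weights / weights.sum()

def sample_batch(latest_losses, batch_size, s=100.0, rng=None):
    """Draw a batch of distinct indices according to the rank-based probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    probs = rank_selection_probs(latest_losses, s)
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)

probs = rank_selection_probs([0.1, 2.0, 0.5])
batch = sample_batch([0.1, 2.0, 0.5, 1.0], 2, rng=np.random.default_rng(0))
```

Setting `s` close to 1 approaches uniform sampling, while a large `s` concentrates selection on the datapoints with the greatest latest known loss.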
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
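The role of the rounding scheme described above can be illustrated with a minimal sketch. It assumes a fixed-point grid with a hypothetical `frac_bits` fractional bits and ignores saturation to a 16-bit word; the function names are illustrative and not taken from the paper. Stochastic rounding rounds up with probability equal to the distance to the lower grid point, so small updates survive in expectation, whereas round-to-nearest discards them.

```python
import math
import random


def round_to_nearest(x, frac_bits=8):
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, unbiased in expectation.

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, and rounded down otherwise.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


if __name__ == "__main__":
    rng = random.Random(0)
    update = 0.001  # smaller than half the grid spacing of 1/256
    samples = [stochastic_round(update, rng=rng) for _ in range(20000)]
    print("round to nearest:", round_to_nearest(update))
    print("mean of stochastic rounding:", sum(samples) / len(samples))
```

With round-to-nearest, an update of 0.001 on this grid is always mapped to zero, while the average of many stochastically rounded copies stays close to 0.001.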
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain the reason why to learn $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
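As an illustration of the kind of relationship such a guide covers, the sketch below computes output sizes along a single axis from input size `i`, kernel size `k`, stride `s` and zero padding `p`. It uses the commonly stated formulas for a convolution and for the transposed convolution that inverts its shape; the function names are hypothetical, and the transposed case assumes that `i + 2p - k` of the forward convolution is divisible by `s`.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis: floor((i + 2p - k) / s) + 1."""
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of the transposed convolution: s * (i - 1) + k - 2p.

    Here k, s and p are the parameters of the forward convolution whose
    shape is being inverted; the no-remainder case is assumed.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    out = conv_output_size(5, k=3, s=2, p=1)
    print("conv output:", out)
    print("transposed output:", transposed_conv_output_size(out, k=3, s=2, p=1))
```

Applying the transposed formula to the output of the forward formula recovers the original input size whenever the divisibility assumption holds.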
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
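One way to picture the first direction, combining max and average pooling, is a convex mix of the two. The sketch below is only an illustration under stated assumptions: the mixing proportion `alpha` is passed in as a fixed number, whereas the abstract describes learning the pooling function, and the helper names and non-overlapping window layout are hypothetical.

```python
def mixed_pool(region, alpha):
    """Return alpha * max(region) + (1 - alpha) * mean(region)."""
    values = list(region)
    if not values:
        raise ValueError("pooling region is empty")
    return alpha * max(values) + (1 - alpha) * sum(values) / len(values)


def mixed_pool_2d(image, size, alpha):
    """Apply mixed pooling over non-overlapping size x size windows.

    image is a list of equally long rows; trailing rows or columns that
    do not fill a complete window are ignored.
    """
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            region = [image[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            pooled_row.append(mixed_pool(region, alpha))
        pooled.append(pooled_row)
    return pooled


if __name__ == "__main__":
    image = [[1, 2, 0, 0],
             [3, 6, 0, 4]]
    print(mixed_pool_2d(image, size=2, alpha=0.5))
```

Setting `alpha` to 1 recovers max pooling and setting it to 0 recovers average pooling, so the mix interpolates between the two conventional operations.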
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
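The sampling step described in the abstract above can be sketched as follows: one activation is drawn from a pooling region with probability proportional to its value. The function name `stochastic_pool` is hypothetical, and the sketch assumes non-negative activations (for example, after a rectifier), which the abstract does not state explicitly.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is chosen with probability proportional to its value
    (a multinomial over the region). Assumes non-negative activations;
    an all-zero region simply returns 0.0.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
region = [0.0, 1.0, 3.0]
samples = [stochastic_pool(region, rng) for _ in range(1000)]
# The larger activation should be picked roughly three times as often.
print(samples.count(3.0), samples.count(1.0))
```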
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
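To make the pooling interpretation in the abstract above concrete, the sketch below computes only the normalized $L_p$ norm of a list of values, leaving out the learned projections that feed the unit. The name `lp_pool` is hypothetical. With p = 1 it gives the mean of absolute values, with p = 2 the root-mean-square, and as p grows it approaches the maximum absolute value.

```python
def lp_pool(values, p):
    """Normalized L_p norm: ((1/N) * sum(|x_i| ** p)) ** (1/p).

    Only the norm itself is sketched here; the projections of the
    layer below that would produce `values` are omitted.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if not values:
        raise ValueError("values must be non-empty")
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


values = [3.0, -4.0]
print(lp_pool(values, 1))    # mean of absolute values
print(lp_pool(values, 2))    # root-mean-square
print(lp_pool(values, 200))  # close to max(|x_i|) = 4.0
```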
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
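The masking idea in the abstract above can be sketched for a single hidden layer. The construction below is one way to build such masks and is an assumption of this sketch, not something the abstract specifies: each hidden unit gets a random degree between 1 and D-1, input d connects to a hidden unit only if the unit's degree is at least d, and output d connects to a hidden unit only if d is strictly greater than the unit's degree. The names `made_masks` and `connectivity` are hypothetical.

```python
import random


def made_masks(num_inputs, num_hidden, rng):
    """Build input->hidden and hidden->output masks (lists of 0/1 rows).

    Assumes the natural ordering 1..num_inputs and num_inputs >= 2.
    """
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # mask_in[k][d-1] == 1 iff hidden unit k may see input d.
    mask_in = [[1 if degrees[k] >= d else 0
                for d in range(1, num_inputs + 1)]
               for k in range(num_hidden)]
    # mask_out[d-1][k] == 1 iff output d may see hidden unit k.
    mask_out = [[1 if d > degrees[k] else 0
                 for k in range(num_hidden)]
                for d in range(1, num_inputs + 1)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count paths from each input (column) to each output (row)."""
    num_inputs, num_hidden = len(mask_out), len(mask_in)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for d in range(num_inputs)]


mask_in, mask_out = made_masks(5, 8, random.Random(0))
for row in connectivity(mask_in, mask_out):
    print(row)  # entries on and above the diagonal are zero
```

A strictly lower-triangular connectivity matrix means output d is reachable only from inputs that come earlier in the ordering, which is the autoregressive constraint the abstract describes.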
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
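The SVRG update analyzed in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a convex least-squares finite sum as a toy objective (the abstract itself concerns nonconvex problems), and the function name `svrg_least_squares`, the step size, and the epoch counts are hypothetical choices, not values from the paper.

```python
import numpy as np

def svrg_least_squares(A, b, step=0.02, epochs=30, inner_steps=None, seed=0):
    """SVRG on f(w) = (1/n) * sum_i 0.5 * (A[i] @ w - b[i])**2 (toy objective)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = inner_steps or 2 * n              # inner-loop length (assumed choice)
    w_snap = np.zeros(d)                  # snapshot point
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = A.T @ (A @ w_snap - b) / n
        w = w_snap.copy()
        for _ in range(m):
            i = rng.integers(n)
            g_w = A[i] * (A[i] @ w - b[i])          # component gradient at w
            g_snap = A[i] * (A[i] @ w_snap - b[i])  # same component at snapshot
            # Variance-reduced stochastic gradient step.
            w -= step * (g_w - g_snap + full_grad)
        w_snap = w                        # use the last iterate as new snapshot
    return w_snap
```

The correction term `g_w - g_snap + full_grad` is an unbiased estimate of the full gradient whose variance shrinks as `w` and the snapshot approach a stationary point, which is what separates SVRG from plain SGD.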
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speedup, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speedup is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
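The stochastic rounding scheme mentioned in the abstract above can be illustrated with a short NumPy sketch. It is a minimal sketch under assumptions: the function name `stochastic_round` and the `frac_bits` parameter are hypothetical, and saturation to a 16-bit word width is omitted for brevity.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, rng=None):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    Each value is rounded up with probability equal to its fractional
    distance above the lower grid point, so the result is unbiased:
    E[stochastic_round(x)] == x. Overflow handling is not modeled here.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 1 << frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    round_up = rng.random(scaled.shape) < (scaled - lower)
    return (lower + round_up) / scale
```

Round-to-nearest would map every small update below half a grid step to zero, whereas the scheme above preserves such updates in expectation.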
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
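The shape relationships described in the abstract above can be sketched for one spatial dimension as follows. The helper names `conv_output_size` and `transposed_conv_output_size` are hypothetical; the formulas are the standard ones for a square kernel with symmetric zero padding, and the transposed case assumes no extra output padding.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size of a convolution: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, p=0, s=1):
    """Output size of the matching transposed convolution: s(i - 1) + k - 2p."""
    return s * (i - 1) + k - 2 * p

# A 5-wide input, 3-wide kernel, padding 1, stride 2 gives a 3-wide output;
# the transposed convolution with the same parameters maps 3 back to 5.
out = conv_output_size(5, k=3, p=1, s=2)
back = transposed_conv_output_size(out, k=3, p=1, s=2)
```

When `(i + 2p - k)` is not divisible by the stride, several input sizes map to the same output size, so the transposed formula recovers only the smallest of them.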
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well with parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
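The sampling step described in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `stochastic_pool` and `stochastic_pool_2d` are hypothetical, activations are assumed to be non-negative (for example, after a rectifier), and an all-zero region is assumed to pool to 0.

```python
import numpy as np

def stochastic_pool(region, rng):
    """Sample one activation from a pooling region.

    Assumes non-negative activities; each activation is picked with
    probability proportional to its own value (a multinomial draw).
    """
    a = np.asarray(region, dtype=float).ravel()
    total = a.sum()
    if total <= 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    probs = a / total                      # normalize activities to probabilities
    idx = rng.choice(a.size, p=probs)      # multinomial sample of one location
    return float(a[idx])

def stochastic_pool_2d(x, size, rng):
    """Apply stochastic pooling over non-overlapping size x size regions."""
    h, w = x.shape
    out = np.empty((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            block = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = stochastic_pool(block, rng)
    return out

rng = np.random.default_rng(0)
x = np.array([[1., 0., 2., 2.],
              [0., 3., 0., 0.],
              [0., 0., 5., 0.],
              [0., 0., 0., 0.]])
pooled = stochastic_pool_2d(x, 2, rng)
print(pooled)
```

In this toy input, the top-left region returns 1 with probability 1/4 and 3 with probability 3/4, whereas max pooling would always return 3.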
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further compress the multilayer bootstrap network with a neural network trained in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speed-up, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method to train deep neural networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines the denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
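The first direction above, combining max and average pooling, can be illustrated with a minimal sketch. It assumes the combination is a convex blend controlled by a single weight; the name mixed_pool and the parameter alpha are hypothetical, and in the setting the abstract describes such a weight would be learned rather than fixed by hand.

```python
def mixed_pool(region, alpha):
    """Blend max pooling and average pooling over one pooling region.

    alpha = 1.0 gives pure max pooling and alpha = 0.0 gives pure
    average pooling. alpha is a hypothetical mixing weight.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    mean_value = sum(region) / len(region)
    # Convex combination of the two conventional pooling results.
    return alpha * max_value + (1.0 - alpha) * mean_value
```

For a region such as [1.0, 2.0, 3.0, 6.0], the result moves from the mean towards the maximum as alpha grows from 0 to 1.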
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
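The sampling step described above can be sketched for a single pooling region as follows. This is only an illustration under stated assumptions: activations are taken to be non-negative (for example, outputs of a rectifier), the multinomial probabilities are the activations normalized by their sum, and the name stochastic_pool is hypothetical.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is selected with probability proportional to its
    value. Assumes non-negative activations; an all-zero region yields 0.0.
    """
    if not region:
        raise ValueError("pooling region must be non-empty")
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    total = sum(region)
    if total == 0:
        # No activity in the region, so there is nothing to sample from.
        return 0.0
    # random.choices normalizes the weights internally, which gives the
    # multinomial distribution over the region's activations.
    return rng.choices(region, weights=region, k=1)[0]
```

Passing a seeded random.Random instance as rng makes the draws reproducible. Positions with zero activation are never selected as long as some activation in the region is positive.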
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning and what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
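The abstract above does not spell out how distillation produces its training signal. Distillation is commonly formulated with a softmax evaluated at a raised temperature, whose smoothed outputs serve as soft labels for a second network. Assuming that formulation, a minimal sketch of the temperature-scaled softmax follows; the function name and the example temperature are hypothetical and are not taken from the paper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax probabilities for logits divided by a temperature.

    A temperature above 1 flattens the distribution, which is how soft
    labels are usually produced in distillation.
    """
    if not logits:
        raise ValueError("logits must be non-empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Evaluating the same logits at temperature 1.0 and at a larger value such as 20.0 shows the effect: the class ranking is unchanged, but the probability mass is spread more evenly across classes.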
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
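The masking idea in the abstract above can be illustrated with a minimal sketch for a single hidden layer. The degree-based construction, the function name `made_masks`, and the random assignment of hidden degrees are assumptions made for illustration and are not taken from the abstract; the sketch only shows that binary masks can make output d depend solely on inputs that precede it in the natural ordering.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Hypothetical helper: inputs get degrees 1..D, each hidden unit gets a
    random degree in [1, D-1]. Requires n_inputs >= 2.
    """
    rng = np.random.default_rng(seed)
    in_deg = np.arange(1, n_inputs + 1)
    hid_deg = rng.integers(1, n_inputs, size=n_hidden)  # upper bound exclusive
    # Hidden unit k may see input j only if deg(k) >= deg(j).
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(float)   # (H, D)
    # Output d may see hidden unit k only if deg(d) > deg(k).
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(float)   # (D, H)
    return mask_in, mask_out

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
# Entry [d, j] counts the hidden paths from input j to output d.
connectivity = mask_out @ mask_in
print(connectivity)
```

In a full model these masks would be multiplied elementwise with the weight matrices; the connectivity matrix is strictly lower triangular, so no output is reconstructed from itself or from later inputs.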
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTMs) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art results on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state of the art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
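The factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ stated in the abstract above can be checked numerically on a toy example. The distribution `p_x` and the encoder `f` below are made-up values chosen only for illustration, and the decoder is taken to be the exact conditional, which puts no mass on any $x'$ with $f(x') \neq h$.

```python
from collections import defaultdict

# Toy distribution over four discrete values (illustrative numbers only).
p_x = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

def f(x):
    """Hypothetical deterministic discrete encoder: groups {0,1} and {2,3}."""
    return x // 2

# Marginal over codes: P(H = h) is the total mass of all x with f(x) = h.
p_h = defaultdict(float)
for x, p in p_x.items():
    p_h[f(x)] += p

def p_x_given_h(x, h):
    """Exact decoder P(X = x | H = h); zero whenever f(x) != h."""
    return p_x[x] / p_h[h] if f(x) == h else 0.0

def factorized(x):
    """P(X = x | H = f(x)) * P(H = f(x))."""
    h = f(x)
    return p_x_given_h(x, h) * p_h[h]

for x in p_x:
    print(x, p_x[x], factorized(x))
```

The two printed probabilities agree for every x up to floating-point rounding, which is the exactness claim of the abstract in the case where the decoder has enough capacity.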
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network trained in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
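The rank-based selection rule described in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names and the `pressure` parameter (the assumed ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) are hypothetical, and sampling is done with replacement for simplicity.

```python
import random


def rank_selection_probs(losses, pressure=100.0):
    """Selection probability per datapoint, decaying exponentially with loss rank.

    `pressure` is an assumed knob: the ratio between the probability of the
    highest-loss datapoint (rank 0) and the lowest-loss one (last rank).
    """
    n = len(losses)
    # Rank datapoints by their latest known loss, largest loss first.
    order = sorted(range(n), key=lambda j: losses[j], reverse=True)
    weights = [0.0] * n
    for rank, j in enumerate(order):
        # Exponential decay in rank; max() guards the single-datapoint case.
        weights[j] = pressure ** (-rank / max(n - 1, 1))
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


probs = rank_selection_probs([0.1, 2.0, 0.5], pressure=100.0)
batch = sample_batch([0.1, 2.0, 0.5], batch_size=4, rng=random.Random(0))
```

With three datapoints and `pressure=100`, the datapoint with the largest loss is selected about 100 times more often than the one with the smallest loss; lowering `pressure` toward 1 recovers uniform sampling.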
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
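The stochastic rounding scheme mentioned in the abstract above can be sketched in a few lines of Python. This is an illustrative assumption of how such rounding is commonly defined, not the paper's hardware implementation: a value is rounded to one of its two neighbouring multiples of the fixed-point resolution `eps`, with the probability of rounding up equal to the fractional distance from the lower neighbour, so the rounding is unbiased in expectation. The name `stochastic_round` and the choice `eps = 0.25` are hypothetical.

```python
import math
import random


def stochastic_round(x, eps, rng=random):
    """Round x to a multiple of eps, rounding up with probability equal to
    the fractional distance from the lower grid point (unbiased on average)."""
    scaled = x / eps
    lower = math.floor(scaled)
    frac = scaled - lower  # in [0, 1)
    round_up = rng.random() < frac
    return (lower + (1 if round_up else 0)) * eps


rng = random.Random(0)
# 0.3 lies between the grid points 0.25 and 0.5 for eps = 0.25.
samples = [stochastic_round(0.3, 0.25, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to 0.25 every time, which systematically loses small updates; with stochastic rounding the average of many rounded values stays close to 0.3.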
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
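The straight-path analysis described above can be sketched as follows: evaluate the objective at evenly spaced points on the line segment between the initial and the final parameters. This is a minimal illustration, not the authors' code; `loss_along_line`, `loss_fn`, and `num_points` are hypothetical names, and the parameters are assumed to be flattened into a single NumPy vector.

```python
import numpy as np

def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn on the segment theta(a) = (1 - a) * init + a * final.

    loss_fn is assumed to map a flat parameter vector to a scalar loss.
    Returns the interpolation coefficients and the loss at each of them.
    """
    theta_init = np.asarray(theta_init, dtype=float)
    theta_final = np.asarray(theta_final, dtype=float)
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas]
    )
    return alphas, losses

# Toy usage with a convex quadratic standing in for a network's training loss.
alphas, losses = loss_along_line(lambda t: float(np.sum(t ** 2)),
                                 [2.0, 2.0], [0.0, 0.0])
```

A curve of `losses` against `alphas` that decreases without bumps is the kind of evidence the abstract refers to; a real experiment would plug in the network's training loss in place of the toy quadratic.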
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
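The rank-based selection rule described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' implementation: the function names and the `selection_pressure` parameter (roughly the ratio between the selection probabilities of the highest-loss and lowest-loss datapoints) are assumptions introduced here.

```python
import numpy as np

def rank_selection_probs(latest_losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the largest latest known loss.
    selection_pressure is a hypothetical knob (must be > 1 for decay).
    """
    losses = np.asarray(latest_losses, dtype=float)
    n = losses.size
    order = np.argsort(-losses, kind="stable")  # indices sorted by descending loss
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)                 # rank of each datapoint
    weights = np.exp(-np.log(selection_pressure) * ranks / n)
    return weights / weights.sum()

def sample_batch(latest_losses, batch_size, rng, selection_pressure=100.0):
    """Draw datapoint indices for one batch according to the rank-based rule."""
    probs = rank_selection_probs(latest_losses, selection_pressure)
    return rng.choice(len(probs), size=batch_size, replace=True, p=probs)

# Toy usage: four datapoints with their latest known losses.
batch = sample_batch([0.1, 2.0, 0.5, 1.0], batch_size=8,
                     rng=np.random.default_rng(0))
```

In a training loop, the losses of the sampled datapoints would be refreshed after each step so that the ranking tracks the latest known values.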
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
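The stochastic rounding idea in the abstract above can be illustrated with a minimal sketch. It assumes a signed fixed-point format with `word_bits` total bits and `frac_bits` fractional bits; the function name, the default bit split, and the saturation behaviour are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def stochastic_round_fixed(x, word_bits=16, frac_bits=8, rng=random):
    """Quantize x to a signed fixed-point grid using stochastic rounding.

    The grid step is 2**-frac_bits. The value is rounded up with probability
    equal to its distance from the lower grid point, so the rounding is
    unbiased in expectation (before saturation).
    The 16/8 bit split is a hypothetical default chosen for illustration.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    q = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - q:
        q += 1
    # Saturate to the representable signed integer range.
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, q))
    return q / scale
```

Averaged over many calls, the quantized value of a number lying between two grid points approaches the number itself, which is the property that distinguishes this scheme from round-to-nearest.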
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
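The relationship between input shape, kernel shape, zero padding, strides and output shape mentioned in the abstract above can be sketched along one spatial axis. The formulas below are the commonly used relations for a single dimension; the function names are hypothetical, and the transposed case assumes no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


def pool_output_size(i, k, s):
    """Output length of an unpadded pooling layer along one axis."""
    return conv_output_size(i, k, s=s, p=0)


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution associated with a
    convolution that uses kernel k, stride s and padding p.

    Assumes no additional output padding, so it recovers the convolution's
    input size only when (input + 2p - k) is divisible by s.
    """
    return s * (i - 1) + k - 2 * p
```

For instance, `conv_output_size(5, 3, s=2)` gives 2 and `transposed_conv_output_size(2, 3, s=2)` gives 5 again, whereas an input of size 6 with the same kernel and stride also maps to 2, which is why the transposed direction is ambiguous without extra information.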
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
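One simple way to combine max and average pooling, as direction (1) in the abstract above suggests, is a convex combination of the two over each pooling region. The sketch below is only an illustration under that assumption: the mixing coefficient `a` is passed in as a fixed number here, whereas in a trained network it would be a learned parameter, and the function name is hypothetical.

```python
def mixed_max_average_pool(region, a):
    """Blend max pooling and average pooling over one pooling region.

    region: non-empty list of activations inside the pooling window.
    a: mixing proportion in [0, 1]; a=1 gives max pooling, a=0 gives
       average pooling. Treated as a constant in this sketch.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= a <= 1.0:
        raise ValueError("mixing proportion must lie in [0, 1]")
    max_value = max(region)
    mean_value = sum(region) / len(region)
    return a * max_value + (1.0 - a) * mean_value
```

Setting `a` to either endpoint recovers the two conventional pooling operations, so the blend can never be less expressive than the better of the two for a given layer.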
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
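The sampling step described in the abstract above can be sketched for a single pooling region. This assumes non-negative activations (for example, outputs of a rectifier) and takes each activation's probability to be its value divided by the region's sum; the function name and the handling of an all-zero region are assumptions made for illustration.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is selected with probability proportional to its value,
    i.e. according to a multinomial distribution given by the activities
    within the region. Activations are assumed to be non-negative.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations must be non-negative")
    if sum(region) == 0:
        # No positive activity to sample from; return zero (an assumption).
        return 0.0
    # random.choices draws with replacement using the given weights.
    return rng.choices(region, weights=region, k=1)[0]
```

Larger activations are chosen more often, but smaller non-zero ones still get selected occasionally, which is the source of the regularizing noise during training.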
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN in unsupervised learning with the effectiveness and efficiency of DNNs in supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved accuracy rates of approximately 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness of training in the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they have led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
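As a rough illustration of the encoding described in the abstract above, the sketch below builds Gramian Angular Summation/Difference Fields from a short series. It assumes the usual formulation (rescale to [0, 1], take the angle phi = arccos(x), then GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j)); the function names are illustrative and are not taken from the authors' code.

```python
import math


def rescale01(series):
    """Min-max rescale a series to [0, 1] (assumed preprocessing step)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    return [(v - lo) / (hi - lo) for v in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) matrices for a 1-D series as nested lists."""
    x = rescale01(series)
    phi = [math.acos(v) for v in x]  # polar-angle encoding of each value
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


def recover_from_gasf_diagonal(gasf):
    """Invert the encoding on [0, 1] data: diagonal is cos(2*phi) = 2*x**2 - 1."""
    return [math.sqrt((gasf[i][i] + 1.0) / 2.0) for i in range(len(gasf))]
```

The last helper reflects the bijection property mentioned in the abstract: on 0/1 rescaled data, the rescaled series can be read back from the GASF diagonal.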
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This not only helps us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggests that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct a model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
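The gradient reversal layer mentioned in the abstract above can be sketched without any deep learning framework. The toy class below assumes the common reading of the idea: the layer is the identity in the forward pass and multiplies incoming gradients by a negative factor in the backward pass. The class name and the `lam` scaling parameter are hypothetical, chosen only for illustration.

```python
class GradientReversal:
    """Toy gradient reversal layer operating on plain Python lists.

    forward: passes features through unchanged.
    backward: flips the sign of the upstream gradient and scales it by `lam`
    (an assumed trade-off weight, not a value from the abstract).
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Identity mapping; a copy avoids aliasing the caller's list.
        return list(features)

    def backward(self, grad_output):
        # The feature extractor receives the negated domain-classifier
        # gradient, pushing it toward domain-indiscriminate features.
        return [-self.lam * g for g in grad_output]
```

In a real framework the same behaviour would normally be expressed as a custom autograd operation placed between the feature extractor and the domain classifier.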
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
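The "straight path from initialization to solution" in the abstract above can be read as evaluating the loss at convex combinations of the initial and final parameters. The sketch below shows that idea on a toy quadratic loss; the helper names, the flat-list parameter representation, and the toy loss are all assumptions for illustration, not the authors' code.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the line segment: (1 - alpha) * theta_init + alpha * theta_final."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points between the two parameter vectors."""
    alphas = [k / (num_points - 1) for k in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta, target=(1.0, 2.0)):
    """Stand-in convex loss; a real study would use the network's training loss."""
    return sum((t - g) ** 2 for t, g in zip(theta, target))


# Usage: inspect the curve for bumps or plateaus between start and solution.
path = loss_along_path(toy_loss, [3.0, -1.0], [1.0, 2.0])
```

For this convex toy loss the curve is necessarily monotone; the interest of the technique lies in applying it to non-convex network losses.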
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
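The ranking strategy in the abstract above can be sketched as follows: sort datapoints by their latest known loss and let the selection probability decay exponentially with rank. The decay rate here is parameterised by a hypothetical `pressure` value (roughly the ratio between the most and least likely datapoints); that parameterisation and the function names are assumptions for illustration.

```python
import math
import random


def rank_selection_probs(losses, pressure=100.0):
    """Selection probability per datapoint, decaying exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest-known loss.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw a batch of indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

Setting `pressure` close to 1 recovers nearly uniform sampling, while larger values concentrate batches on high-loss datapoints.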
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
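For readers unfamiliar with SVRG, the sketch below shows the standard update on a tiny one-dimensional least-squares problem: each epoch stores a snapshot and its full gradient, and inner steps use the variance-reduced estimate grad_i(w) - grad_i(snapshot) + full_grad. The problem is convex and chosen only to keep the example short; the abstract's analysis concerns the nonconvex case. The step size, epoch counts, and function name are illustrative assumptions.

```python
import random


def svrg_least_squares(a, b, step=0.02, epochs=30, inner_steps=50, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 over a scalar w with SVRG."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(i, w):
        # Gradient of the i-th component function at w.
        return a[i] * (a[i] * w - b[i])

    w_snapshot = 0.0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(i, w_snapshot) for i in range(n)) / n
        w = w_snapshot
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient estimate.
            v = grad_i(i, w) - grad_i(i, w_snapshot) + full_grad
            w -= step * v
        w_snapshot = w  # use the last inner iterate as the next snapshot
    return w_snapshot
```

The correction term vanishes in expectation, so the estimate stays unbiased while its variance shrinks as the iterate and the snapshot approach a stationary point.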
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
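The variance reduction trick mentioned above can be sketched with a generic SVRG-style gradient estimator. This is a minimal illustration, not the method analysed in the abstract: the function name `svrg`, its parameters, and the toy quadratic objective (which is convex, chosen only so the result is easy to check) are all assumptions.

```python
import random


def svrg(grads, w0, lr=0.5, epochs=5, inner_steps=10, rng=None):
    """Minimise (1/n) * sum_i f_i(w) given per-component gradient functions.

    grads: list of callables, grads[i](w) returns the derivative of f_i at w.
    All names and defaults here are hypothetical, for illustration only.
    """
    rng = rng or random.Random(0)
    n = len(grads)
    w = w0
    for _ in range(epochs):
        snapshot = w
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(g(snapshot) for g in grads) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced estimate: unbiased for the full gradient at w,
            # with variance that shrinks as w approaches the snapshot.
            estimate = grads[i](w) - grads[i](snapshot) + full_grad
            w -= lr * estimate
    return w


# Toy objective: f_i(w) = 0.5 * (w - c_i)^2, whose average is minimised at mean(c).
centers = [1.0, 2.0, 6.0]
component_grads = [lambda w, c=c: w - c for c in centers]
w_star = svrg(component_grads, w0=0.0)
```

For this toy objective the correction term cancels the sampling noise entirely, so each inner step behaves like a full gradient step; on a non-convex loss the estimate would remain noisy, only less so than plain SGD.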
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
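Stochastic rounding, as described above, can be sketched in a few lines. This is an illustrative version only: the function name `stochastic_round`, the `frac_bits` parameter, and the omission of saturation to a fixed word length are assumptions, not details taken from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, rounding up with probability
    equal to the fractional distance from the lower grid point.

    Hypothetical helper: saturation to a 16-bit range is left out for brevity.
    """
    eps = 2.0 ** -frac_bits          # smallest representable step
    scaled = x / eps
    lower = math.floor(scaled)
    prob_up = scaled - lower         # in [0, 1); zero when x is on the grid
    if rng.random() < prob_up:
        lower += 1
    return lower * eps


# The rounding is unbiased: the average of many rounded copies approaches x.
_rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=_rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

Unlike round-to-nearest, which would map 0.3 to 0.25 every time at this precision, the stochastic scheme preserves the value in expectation, which is one plausible reason small gradient updates are not systematically lost.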
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
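The relationship between input shape, kernel shape, zero padding, strides and output shape can be sketched for a single spatial axis. The formulas below are the standard ones for a convolution and for its transposed counterpart without extra output padding; the function names are hypothetical and the snippet is an illustration rather than code from the guide.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length of a convolution along one axis.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    """
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output length of the transposed convolution associated with a
    convolution of kernel k, padding p and stride s (no output padding)."""
    return s * (i - 1) + k - 2 * p


# A 5x5 kernel on a 28-pixel axis without padding shrinks it to 24.
unpadded = conv_output_size(28, 5)
# Stride 2 with padding 1 and a 3-wide kernel maps 5 -> 3, and the
# transposed layer with the same settings maps 3 back to 5.
down = conv_output_size(5, 3, p=1, s=2)
up = transposed_conv_output_size(down, 3, p=1, s=2)
```

When the stride does not divide `i + 2p - k` evenly, several input sizes map to the same output size, so the transposed layer recovers only the smallest of them unless additional output padding is specified.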
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
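One simple way to combine max and average pooling, as mentioned above, is a convex mixture of the two. The sketch below assumes a fixed mixing proportion `alpha` on a single flattened pooling region; in a learned setting `alpha` would be a trainable parameter. The function name and signature are hypothetical and are not taken from the abstract.

```python
def mixed_pool(region, alpha=0.5):
    """Return alpha * max(region) + (1 - alpha) * mean(region).

    region: non-empty sequence of activations in one pooling window.
    alpha:  mixing proportion in [0, 1]; 1 gives max pooling, 0 average pooling.
    """
    if not region:
        raise ValueError("pooling region must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_val = max(region)
    avg_val = sum(region) / len(region)
    return alpha * max_val + (1.0 - alpha) * avg_val


window = [1.0, 2.0, 3.0, 6.0]
as_max = mixed_pool(window, alpha=1.0)    # pure max pooling
as_avg = mixed_pool(window, alpha=0.0)    # pure average pooling
blended = mixed_pool(window, alpha=0.5)   # halfway between the two
```

Because the output is differentiable in `alpha`, the proportion can in principle be updated by gradient descent along with the other weights.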
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
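The sampling step described above can be sketched for a single pooling region as follows. This is a minimal illustration, assuming non-negative activations (for example after a ReLU) so that they can be normalized into probabilities; the function name `stochastic_pool` and the handling of an all-zero region are assumptions, not details taken from the abstract.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation a_i is chosen with probability a_i / sum(region),
    i.e. a multinomial distribution given by the activities themselves.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption: an all-zero region simply pools to zero.
        return 0.0
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [0.5, 1.5, 3.0, 0.0]
    print([stochastic_pool(region, rng) for _ in range(5)])
```

Larger activations are picked more often, but smaller non-zero ones still get through occasionally, which is where the regularizing noise comes from.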
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
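The claim that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically with a small sketch. The form used here, $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$, is an assumption about what "normalized" means; the function name `lp_pool` is hypothetical, and the learned projections and per-unit learned order $p$ mentioned in the abstract are left out.

```python
def lp_pool(values, p):
    """Normalized L_p norm of a group of inputs: (mean(|x_i|^p))^(1/p).

    p = 1 gives the mean absolute value, p = 2 the root-mean-square,
    and large p approaches the maximum absolute value.
    """
    if not values:
        raise ValueError("input group must be non-empty")
    if p < 1:
        raise ValueError("order p is assumed to be at least 1")
    mean_power = sum(abs(v) ** p for v in values) / len(values)
    return mean_power ** (1.0 / p)


if __name__ == "__main__":
    group = [1.0, 2.0, 3.0]
    for p in (1, 2, 10, 200):
        print(p, lp_pool(group, p))
```

Sweeping `p` upward moves the output smoothly from average-like to max-like behaviour, which is the sense in which the order controls the shape of the unit's response.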
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding or vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
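The soft-alignment objective in the preceding abstract can be illustrated with a minimal sketch. It assumes the teacher (RNN) and student (small DNN) each produce a vector of logits over the same classes for one frame; the function names and the example logits are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

# Hypothetical per-frame logits: the teacher's output acts as the soft target.
teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.5, 1.2, 0.3])
loss = kl_divergence(teacher, student)
```

In training, this quantity would be averaged over frames and minimized with respect to the student's parameters only.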
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects of convolutional neural network (CNN) style architectures are examined: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training in the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
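As a rough illustration of the Gramian Angular Field encoding named in the preceding abstract, the sketch below assumes the usual construction: rescale the series to [-1, 1], map each value to an angle with arccos, then take cos of pairwise angle sums (GASF) or sin of pairwise angle differences (GADF). The function names are hypothetical, the [-1, 1] rescaling is one of several possible choices (the abstract also refers to 0/1 rescaling), and the series is assumed to be non-constant.

```python
import math

def rescale(series):
    """Min-max rescale a non-constant series into [-1, 1]."""
    lo, hi = min(series), max(series)
    return [(2.0 * x - hi - lo) / (hi - lo) for x in series]

def angles(series):
    """Polar-angle encoding: phi_i = arccos(x_i) on the rescaled series."""
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    return [math.acos(max(-1.0, min(1.0, x))) for x in rescale(series)]

def gasf(series):
    """Gramian Angular Summation Field: cos(phi_i + phi_j)."""
    phi = angles(series)
    return [[math.cos(a + b) for b in phi] for a in phi]

def gadf(series):
    """Gramian Angular Difference Field: sin(phi_i - phi_j)."""
    phi = angles(series)
    return [[math.sin(a - b) for b in phi] for a in phi]

summation_image = gasf([0.0, 1.0, 2.0])
difference_image = gadf([0.0, 1.0, 2.0])
```

Each series of length n becomes an n-by-n matrix that can be fed to an image model; the GASF is symmetric, while the GADF is antisymmetric with a zero diagonal.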
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
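The gradient reversal layer mentioned in the preceding abstract can be sketched in a framework-free way. The sketch assumes the layer acts as the identity on the forward pass and flips the sign of the gradient on the backward pass; the class name and the scaling factor `lam` are assumptions made for illustration, not details given in the abstract.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical scaling factor for the reversed gradient.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The domain classifier's gradient is negated before it reaches the
        # feature extractor, pushing features toward domain invariance.
        return [-self.lam * g for g in grad_output]

layer = GradientReversal(lam=2.0)
passed_through = layer.forward([1.0, -2.0])
reversed_grad = layer.backward([0.5, -1.0])
```

Placed between the feature extractor and a domain classifier, such a layer lets ordinary backpropagation train the classifier to tell domains apart while training the features to make that task harder.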
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
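The straight-path analysis described in the preceding abstract can be sketched as follows: interpolate linearly between the initial and final parameter vectors and evaluate the loss at each point. The toy quadratic loss and all names here are hypothetical stand-ins for a real network and its training objective.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]

def loss_along_path(loss_fn, theta_init, theta_final, steps=11):
    """Evaluate loss_fn at evenly spaced points from alpha = 0 to alpha = 1."""
    return [
        loss_fn(interpolate(theta_init, theta_final, i / (steps - 1)))
        for i in range(steps)
    ]

def toy_loss(theta):
    # Hypothetical convex loss with its minimum at theta = (1, 1).
    return sum((t - 1.0) ** 2 for t in theta)

losses = loss_along_path(toy_loss, [3.0, -1.0], [1.0, 1.0])
```

For a real network, a curve that decreases smoothly from start to end would be the kind of evidence the abstract refers to; bumps along the path would indicate obstacles.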
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
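The rank-based selection rule in the preceding abstract can be sketched as below. The sketch assumes datapoints are sorted by their latest known loss (highest first) and that the selection weight shrinks by a constant factor per rank; the `pressure` parameter, read here as roughly the ratio between the most and least likely datapoints, and the function names are assumptions for illustration.

```python
import math
import random

def rank_selection_probs(latest_losses, pressure=100.0):
    """Selection probability that decays exponentially with loss rank."""
    n = len(latest_losses)
    # Rank 0 is the datapoint with the greatest latest known loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    decay = math.exp(-math.log(pressure) / n)  # per-rank multiplicative decay
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** rank
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(latest_losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)

probs = rank_selection_probs([0.1, 2.0, 0.5, 1.2])
batch = sample_batch([0.1, 2.0, 0.5, 1.2], batch_size=3, rng=random.Random(0))
```

Setting `pressure` to 1 recovers uniform sampling, which gives one way to think about controlling the selection pressure over time.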
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, to deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
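As an illustration of the variance-reduced update that SVRG is built on, here is a minimal sketch in plain Python. It assumes the standard form of the update (a per-sample gradient corrected by the same sample's gradient at a snapshot point plus the full gradient at that snapshot) and takes the last inner iterate as the next snapshot, which is only one common choice. The function name `svrg`, its parameters, and the toy least-squares problem are hypothetical and are not taken from the abstract above; the toy problem is convex and serves only to show the mechanics.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, seed=0):
    """Sketch of SVRG for a scalar parameter w.

    grad_i(w, i) returns the gradient of the i-th summand at w.
    """
    rng = random.Random(seed)
    w_snap = w0
    for _ in range(epochs):
        # Full gradient at the snapshot point, computed once per epoch.
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced gradient estimate.
            v = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= step * v
        # Assumption: the last inner iterate becomes the next snapshot.
        w_snap = w
    return w_snap


# Toy problem (hypothetical): f_i(w) = 0.5 * (a_i * w - b_i)^2, minimized at w = 2.
A = [1.0, 2.0, 3.0]
B = [2.0, 4.0, 6.0]


def toy_grad(w, i):
    return A[i] * (A[i] * w - B[i])


w_hat = svrg(toy_grad, n=3, w0=0.0, step=0.05, epochs=200, inner_steps=10)
```

On this toy problem `w_hat` ends up very close to 2; the step size and loop lengths were chosen for the example only and carry no general recommendation.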
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
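To make the rounding scheme concrete, here is a minimal sketch of stochastic rounding onto a fixed-point grid, written in plain Python. It assumes the usual definition: a value is rounded up to the next representable number with probability proportional to its distance from the lower neighbour, so the rounding is unbiased in expectation. The function name `stochastic_round` and the parameter `frac_bits` (number of fractional bits) are hypothetical, and the sketch deliberately omits saturation to a 16-bit range.

```python
import math
import random


def stochastic_round(x, frac_bits, rng=random):
    """Round x to a multiple of 2**-frac_bits, rounding up with
    probability equal to the fractional remainder (sketch only)."""
    eps = 2.0 ** -frac_bits          # spacing of the fixed-point grid
    scaled = x / eps
    lower = math.floor(scaled)
    p_up = scaled - lower            # in [0, 1)
    if rng.random() < p_up:
        lower += 1
    return lower * eps


# Usage: 0.3 with 4 fractional bits lands on 0.25 or 0.3125;
# the average over many draws is close to 0.3.
_rng = random.Random(0)
samples = [stochastic_round(0.3, 4, _rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to 0.3125 every time; the stochastic variant preserves the value on average, which is the property the abstract credits for successful low-precision training.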
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
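The relationships mentioned above can be illustrated with the standard one-dimensional size formulas. The sketch below assumes the commonly used expressions: a convolution with input size i, kernel size k, zero padding p and stride s yields floor((i + 2p - k) / s) + 1 outputs, and the matching transposed convolution yields s(i - 1) + k - 2p outputs, plus an optional extra term when the forward stride did not divide evenly. The function names and the `output_padding` parameter are hypothetical helpers, not an API from the guide.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length of a convolution along one axis."""
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1, output_padding=0):
    """Output length of the transposed convolution along one axis.

    output_padding accounts for input positions the forward
    convolution dropped when (i + 2p - k) was not a multiple of s.
    """
    return s * (i - 1) + k - 2 * p + output_padding


# Usage: a 3-wide kernel with padding 1 and stride 2 maps 5 -> 3,
# and the transposed layer maps 3 -> 5.
down = conv_output_size(5, k=3, p=1, s=2)
up = transposed_conv_output_size(down, k=3, p=1, s=2)
```

For two-dimensional layers the same formulas apply independently to height and width.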
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments, every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
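The sampling step described in the abstract above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function name `stochastic_pool` is hypothetical, and the code assumes the activations in a pooling region are non-negative (for example, after a rectifier), so that they can be normalized into a multinomial distribution.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. according to a multinomial distribution given by the activities.
    Assumes all activations are non-negative (hypothetical simplification).
    """
    total = sum(region)
    if total == 0:
        # No unit is active, so there is nothing to sample; return zero.
        return 0.0
    # random.choices normalizes the weights internally.
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    region = [0.0, 1.0, 3.0]
    draws = [stochastic_pool(region, rng) for _ in range(10)]
    print(draws)  # larger activations should appear more often
```

Unlike max pooling, which would always return the largest value in the region, this procedure occasionally passes a smaller activation through, which is the source of the regularizing noise.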
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher-order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
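As a rough illustration of why a normalized $L_p$ norm generalizes common pooling operators, the sketch below computes it for a plain list of inputs. This is a simplified, assumption-laden sketch: the function name `lp_unit` is hypothetical, and it omits the learned projections and the learned order $p$ that the abstract describes, keeping only the norm itself.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm of a list of numbers: (mean(|x|^p))^(1/p).

    Hypothetical simplification: inputs are used directly rather than being
    learned projections of a lower layer, and p is fixed rather than learned.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    n = len(inputs)
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print(lp_unit(x, 1))   # mean of absolute values (average pooling)
    print(lp_unit(x, 2))   # root-mean-square pooling
    print(lp_unit(x, 50))  # approaches max(|x|) as p grows
```

With $p = 1$ the unit reduces to the average of absolute values, with $p = 2$ to the root-mean-square, and for large $p$ it approaches the maximum absolute input, matching the pooling operators named in the abstract.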
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
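The soft-target idea in the abstract above can be illustrated with a small NumPy sketch: the student is scored by its cross-entropy against the teacher's output distribution rather than against hard labels. The function names `softmax` and `soft_target_loss` and the toy logits are assumptions made for illustration only; the abstract does not specify the exact objective used in the paper.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(student_logits, teacher_probs):
    """Mean cross-entropy between teacher soft targets and student predictions."""
    log_p = np.log(softmax(student_logits) + 1e-12)  # small constant avoids log(0)
    return float(-(teacher_probs * log_p).sum(axis=-1).mean())

# Hypothetical teacher outputs for a single 3-class example.
teacher_logits = np.array([[2.0, 1.0, 0.0]])
teacher_probs = softmax(teacher_logits)

loss_matching = soft_target_loss(teacher_logits, teacher_probs)
loss_mismatched = soft_target_loss(np.array([[0.0, 1.0, 2.0]]), teacher_probs)
```

A student that reproduces the teacher's distribution attains the lowest possible value of this loss (the teacher's entropy), so `loss_matching` is smaller than `loss_mismatched`.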
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. Standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
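As a rough illustration of the first direction in the abstract above (combining max and average pooling), the following NumPy sketch blends the two with a scalar weight. The function name `mixed_pool`, the mixing weight `alpha`, and the non-overlapping k x k windows are assumptions made for illustration; in the setting the abstract describes, the mixing would be learned rather than fixed by hand.

```python
import numpy as np

def mixed_pool(x, alpha, k=2):
    """Blend max and average pooling over non-overlapping k x k windows.

    alpha = 1.0 gives plain max pooling, alpha = 0.0 plain average pooling.
    """
    h, w = x.shape
    assert h % k == 0 and w % k == 0, "input must tile evenly into k x k windows"
    # Rearrange so the last axis holds the k*k values of each window.
    windows = (
        x.reshape(h // k, k, w // k, k)
        .transpose(0, 2, 1, 3)
        .reshape(h // k, w // k, k * k)
    )
    return alpha * windows.max(axis=-1) + (1.0 - alpha) * windows.mean(axis=-1)

x = np.arange(16.0).reshape(4, 4)
pooled = mixed_pool(x, alpha=0.5)
```

With `alpha=0.5` each output is the midpoint between the window's maximum and its mean.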
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
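The sampling step described in the abstract above can be sketched for a single pooling region as follows. This is a minimal NumPy illustration, assuming non-negative activations (for example after a rectifier) so that they can be normalized into probabilities; the function name `stochastic_pool_region` and the all-zero fallback are assumptions, not details taken from the paper.

```python
import numpy as np

def stochastic_pool_region(activations, rng):
    """Pick one activation from a pooling region, with probability
    proportional to its value (assumes non-negative activations)."""
    a = np.asarray(activations, dtype=float).ravel()
    assert np.all(a >= 0), "sketch assumes non-negative activations"
    total = a.sum()
    if total <= 0:
        # Assumed fallback: an all-zero region pools to zero.
        return 0.0
    probs = a / total                      # multinomial probabilities
    idx = rng.choice(a.size, p=probs)      # sample one location
    return float(a[idx])

rng = np.random.default_rng(0)
sample = stochastic_pool_region([1.0, 2.0, 3.0, 4.0], rng)
```

Larger activations are chosen more often, but smaller non-zero activations still have a chance of being selected, which is the source of the regularizing noise.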
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist in training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
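The rank-based selection rule described in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the `ratio` parameter (the desired probability ratio between the highest-loss and lowest-loss datapoints) are hypothetical, and sampling is done with replacement for simplicity.

```python
import math
import random


def rank_selection_probs(losses, ratio=100.0):
    """Return selection probabilities that decay exponentially with loss rank.

    `ratio` is a hypothetical knob: the probability of the highest-loss point
    divided by the probability of the lowest-loss point.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the largest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.log(ratio) / max(n - 1, 1)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, ratio=100.0, rng=random):
    """Draw a batch of indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, ratio)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)
```

A larger `ratio` corresponds to stronger selection pressure; `ratio=1.0` recovers uniform sampling.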
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain those for full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
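Stochastic rounding, which the abstract above credits with making 16-bit fixed-point training viable, rounds a value up or down at random so that the result is unbiased in expectation. The following is a minimal sketch of that general idea, not the paper's implementation: `stochastic_round` and `frac_bits` are illustrative names, and saturation to a fixed word length is omitted.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with 2**-frac_bits resolution.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the expected result equals x.
    Saturation to a fixed word length is intentionally left out.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale
```

Unlike round-to-nearest, small updates below half the grid spacing are not always lost: they survive with a probability proportional to their size.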
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
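The soft-target objective mentioned in the abstract above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `softmax` and `soft_target_loss` are hypothetical, and the plain cross-entropy against the teacher's softmax output is an assumed form of the "easier objective function".

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(student_logits, teacher_logits):
    """Mean cross-entropy of the student's predictions against the
    teacher's soft targets (assumed form; hypothetical helper)."""
    targets = softmax(teacher_logits)            # soft targets from the teacher
    log_probs = np.log(softmax(student_logits) + 1e-12)
    return float(-(targets * log_probs).sum(axis=-1).mean())

teacher = np.array([[2.0, 0.0, -1.0]])
loss_match = soft_target_loss(teacher.copy(), teacher)
loss_mismatch = soft_target_loss(np.array([[-1.0, 0.0, 2.0]]), teacher)
```

The loss is smallest when the student reproduces the teacher's distribution, so `loss_match` is lower than `loss_mismatch`.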
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
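One plausible reading of "combining max and average pooling" in the abstract above is a convex mix of the two within each pooling region. The NumPy sketch below assumes that reading: the function name `mixed_pool_2d` and the fixed mixing weight `alpha` are hypothetical (the abstract describes the combination as learned, which is not modeled here), and regions are assumed to be non-overlapping.

```python
import numpy as np

def mixed_pool_2d(x, alpha, size=2):
    """Blend max and average pooling over non-overlapping size x size regions.

    alpha = 1.0 gives max pooling, alpha = 0.0 gives average pooling.
    Hypothetical helper; alpha is a fixed scalar here, not a learned parameter.
    """
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "input must tile evenly"
    # Rearrange into (rows of regions, cols of regions, elements per region).
    regions = (x.reshape(h // size, size, w // size, size)
                 .transpose(0, 2, 1, 3)
                 .reshape(h // size, w // size, size * size))
    return alpha * regions.max(axis=-1) + (1.0 - alpha) * regions.mean(axis=-1)

x = np.arange(16, dtype=float).reshape(4, 4)
pooled_max = mixed_pool_2d(x, alpha=1.0)   # pure max pooling
pooled_avg = mixed_pool_2d(x, alpha=0.0)   # pure average pooling
pooled_mix = mixed_pool_2d(x, alpha=0.5)   # halfway between the two
```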
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
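The stochastic pooling procedure described in the abstract above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: the function name `stochastic_pool_2d` is hypothetical, and the sketch assumes non-negative activations (for example after a rectifier), non-overlapping regions, and an output of zero for regions whose activations are all zero.

```python
import numpy as np

def stochastic_pool_2d(x, size=2, rng=None):
    """Sample one activation per non-overlapping size x size region, with
    probability proportional to the activation's value (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "input must tile evenly"
    assert (x >= 0).all(), "assumes non-negative activations"
    # Rearrange into (rows of regions, cols of regions, elements per region).
    regions = (x.reshape(h // size, size, w // size, size)
                 .transpose(0, 2, 1, 3)
                 .reshape(h // size, w // size, size * size))
    out = np.zeros(regions.shape[:2], dtype=float)
    for i in range(regions.shape[0]):
        for j in range(regions.shape[1]):
            a = regions[i, j]
            total = a.sum()
            if total > 0:
                # Multinomial probabilities are the normalized activations.
                out[i, j] = rng.choice(a, p=a / total)
    return out

x = np.array([[0.0, 0.0, 1.0, 2.0],
              [0.0, 0.0, 3.0, 4.0],
              [5.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 7.0]])
pooled = stochastic_pool_2d(x, size=2, rng=np.random.default_rng(0))
```

Regions with a single non-zero activation always return that activation, while the top-right region returns one of its four values at random.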
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
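As a rough illustration of the encoding named in the abstract above, the sketch below builds Gramian Angular Summation and Difference Fields from a short series. The abstract does not give formulas, so the definitions used here (rescale to $[-1, 1]$, take $\phi = \arccos(x)$, then $\cos(\phi_i + \phi_j)$ and $\sin(\phi_i - \phi_j)$) are assumed from the commonly cited formulation, and the function names are illustrative.

```python
import math

def rescale(series, low=-1.0, high=1.0):
    # Min-max rescale into [low, high]; assumes the series is not constant.
    lo, hi = min(series), max(series)
    return [low + (high - low) * (x - lo) / (hi - lo) for x in series]

def gramian_angular_fields(series):
    scaled = rescale(series)
    # Clamp before acos to guard against floating-point values just outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]  # summation field
    gadf = [[math.sin(a - b) for b in phi] for a in phi]  # difference field
    return gasf, gadf

gasf, gadf = gramian_angular_fields([1.0, 3.0, 2.0, 5.0, 4.0])
```

Each field is an n-by-n matrix for a series of length n, which is what lets image models such as CNNs be applied to it.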
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
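The selection strategy described in the abstract above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) can be sketched as follows. The exact parametrization is not given in the abstract, so the `pressure` parameter, which here roughly sets the ratio between the most and least likely datapoint, is an assumption, as are the function names and the choice to sample with replacement.

```python
import math
import random

def selection_probabilities(num_points, pressure=100.0):
    # Probability decays exponentially with rank (rank 0 = largest loss).
    decay = math.log(pressure) / num_points
    weights = [math.exp(-decay * rank) for rank in range(num_points)]
    total = sum(weights)
    return [w / total for w in weights]

def select_batch(latest_losses, batch_size, pressure=100.0, rng=random):
    # Sort datapoint indices by their latest known loss, largest first.
    order = sorted(range(len(latest_losses)),
                   key=lambda i: latest_losses[i], reverse=True)
    probs = selection_probabilities(len(order), pressure)
    # Sample with replacement according to the rank-based probabilities.
    return rng.choices(order, weights=probs, k=batch_size)

# Hypothetical latest losses for ten datapoints.
losses = [0.1, 2.5, 0.7, 1.9, 0.05, 3.2, 0.4, 1.1, 0.9, 0.3]
batch = select_batch(losses, batch_size=4, rng=random.Random(0))
```

After each training step the losses of the selected datapoints would be refreshed, so the ranking changes over time.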
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
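For readers unfamiliar with the update analyzed in the abstract above, here is a minimal sketch of the standard SVRG loop on a finite sum. To keep it checkable, it runs on a tiny convex least-squares toy problem rather than a nonconvex one; the problem data, step size, and loop lengths are illustrative assumptions and say nothing about the rates proved in the paper.

```python
import random

def svrg(grad_i, n, w0, step, epochs, inner_steps, rng):
    w_snapshot = w0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snapshot, i) for i in range(n)) / n
        w = w_snapshot
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient.
            g = grad_i(w, i) - grad_i(w_snapshot, i) + full_grad
            w -= step * g
        w_snapshot = w  # use the last inner iterate as the next snapshot
    return w_snapshot

# Toy finite sum: f_i(w) = 0.5 * (a[i] * w - b[i]) ** 2, minimized at w = 2.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]

def grad_i(w, i):
    return a[i] * (a[i] * w - b[i])

w_star = svrg(grad_i, n=4, w0=0.0, step=0.02, epochs=30,
              inner_steps=20, rng=random.Random(0))
```

The correction term `grad_i(w_snapshot, i) - full_grad` has zero mean over `i`, so the estimate stays unbiased while its variance shrinks as `w` approaches the snapshot.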
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
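The rounding scheme described in the abstract above can be illustrated with a minimal sketch. The function names, the `frac_bits` parameter and the use of Python's standard `random` module are illustrative assumptions, not taken from the paper, and saturation to a 16-bit word length is deliberately not modeled.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with step 2**-frac_bits.

    x is rounded up with probability equal to its fractional distance
    from the lower grid point, so the result equals x in expectation.
    (Hypothetical helper; overflow/saturation is not handled.)
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower          # in [0, 1)
    step = 1 if rng.random() < prob_up else 0
    return (lower + step) / scale


def round_to_nearest(x, frac_bits=8):
    """Deterministic baseline: round to the closest grid point."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale
```

With a coarse grid (for example `frac_bits=2`, step 0.25), round-to-nearest always maps 0.3 to 0.25, whereas the average of many stochastically rounded copies stays close to 0.3. This unbiasedness is one plausible reading of why small gradient updates are not systematically lost.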
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
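The shape relationships mentioned in the abstract above can be sketched with the standard formulas for a single spatial dimension. The function and parameter names are illustrative assumptions, and dilation and output padding are left out.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution (or pooling) along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution tied to a convolution
    with kernel k, stride s and padding p (no extra output padding).
    """
    return s * (i - 1) + k - 2 * p
```

For example, a 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives an output of width 3, and the matching transposed convolution maps width 3 back to width 5. When `i + 2 * p - k` is not a multiple of `s`, several input sizes share one output size, so this sketch recovers only the smallest of them.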
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
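The sampling step described in the abstract above can be sketched for a single pooling region as follows. It assumes non-negative activations (for example after a ReLU), and the function names are illustrative. The probability-weighted average in `expected_pool` is one common deterministic counterpart for test time and is an assumption here, since the abstract does not describe it.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region, with probability
    proportional to its (non-negative) value.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # all activations are zero: nothing to sample
    return rng.choices(region, weights=region, k=1)[0]


def expected_pool(region):
    """Probability-weighted average of the region: sum_i p_i * a_i,
    where p_i = a_i / sum(region).
    """
    total = sum(region)
    if total == 0:
        return 0.0
    return sum(a * a for a in region) / total
```

For a region `[0.0, 1.0, 3.0]` the largest activation is chosen about three times out of four and the zero activation is never chosen, so large responses dominate without every smaller one being discarded as in max pooling.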
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
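As a rough illustration of the normalized $L_p$ norm mentioned in the abstract above, the following NumPy sketch computes $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$ over one group of inputs. The helper name `lp_pool` is hypothetical, and the learned projections and per-unit learned order $p$ described in the abstract are omitted for brevity; this is a sketch of the pooling formula only, not the authors' implementation.

```python
import numpy as np

def lp_pool(x, p):
    """Normalized L_p norm of a 1-D group of inputs (hypothetical helper)."""
    x = np.abs(np.asarray(x, dtype=float))
    return float(np.mean(x ** p) ** (1.0 / p))

group = [1.0, -2.0, 3.0, -4.0]
mean_abs = lp_pool(group, 1)    # p = 1: average pooling of magnitudes
rms = lp_pool(group, 2)         # p = 2: root-mean-square pooling
near_max = lp_pool(group, 50)   # large p: approaches max pooling
```

Varying `p` moves the unit between average-like and max-like behavior, which is the sense in which the abstract calls it a generalization of conventional pooling operators.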
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
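The masking idea in the abstract above can be sketched for a single hidden layer as follows. The degree-assignment scheme, the helper name `made_masks`, and the layer sizes are assumptions for illustration that the abstract itself does not spell out: each hidden unit is given a random degree, and a connection is kept only if it cannot carry information from an input to an output at the same or an earlier position in the ordering.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, seed=0):
    """Binary masks for a one-hidden-layer autoregressive autoencoder (sketch)."""
    rng = np.random.default_rng(seed)
    m_in = np.arange(1, n_inputs + 1)                  # input degrees 1..D
    m_hid = rng.integers(1, n_inputs, size=n_hidden)   # hidden degrees in 1..D-1
    # Hidden unit k may see input d only if m_hid[k] >= m_in[d].
    mask_in = (m_hid[:, None] >= m_in[None, :]).astype(float)   # shape (H, D)
    # Output d may see hidden unit k only if m_in[d] > m_hid[k].
    mask_out = (m_in[:, None] > m_hid[None, :]).astype(float)   # shape (D, H)
    return mask_in, mask_out

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
# Entry (i, j) counts the hidden paths from input j to output i.
connectivity = mask_out @ mask_in
```

In this sketch `connectivity` is strictly lower triangular, so output `i` depends only on inputs that precede it in the ordering; in training, the masks would be multiplied elementwise into the corresponding weight matrices.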
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist in training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks to the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropout and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm requires only 5% of the computations (multiplications) of traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a WER of 3.93 on the Wall Street Journal (WSJ) eval92 task, compared to a baseline WER of 4.54, a relative improvement of more than 13%.
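The relative improvement quoted in the abstract above can be checked with a few lines of arithmetic. The variable names are illustrative; the two WER values are the ones reported in the abstract, and relative improvement is taken here to mean the reduction divided by the baseline.

```python
baseline_wer = 4.54   # baseline small DNN, WSJ eval92 (from the abstract)
student_wer = 3.93    # small DNN trained on soft RNN alignments (from the abstract)

# Relative improvement = reduction in WER divided by the baseline WER.
relative_improvement = (baseline_wer - student_wer) / baseline_wer
print(f"{relative_improvement:.1%}")  # prints 13.4%
```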
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
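The Gramian Angular Field encoding named in the abstract above can be illustrated with a minimal sketch. The abstract does not spell out the formulas, so the definitions used here are assumptions taken from the commonly cited formulation: rescale the series to [-1, 1], map each value to an angle phi = arccos(x), then set GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j). The function names `rescale`, `gasf`, and `gadf` are hypothetical.

```python
import math


def rescale(series):
    """Linearly rescale a series to [-1, 1] so that arccos is defined."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must not be constant")
    return [(2.0 * x - hi - lo) / (hi - lo) for x in series]


def angles(series):
    """Map each rescaled value to an angle phi = arccos(x)."""
    # Clamp to guard against values drifting slightly outside [-1, 1].
    return [math.acos(max(-1.0, min(1.0, x))) for x in rescale(series)]


def gasf(series):
    """Assumed definition: GASF[i][j] = cos(phi_i + phi_j)."""
    phi = angles(series)
    return [[math.cos(a + b) for b in phi] for a in phi]


def gadf(series):
    """Assumed definition: GADF[i][j] = sin(phi_i - phi_j)."""
    phi = angles(series)
    return [[math.sin(a - b) for b in phi] for a in phi]
```

Under these assumed definitions the GASF matrix is symmetric and the GADF matrix is antisymmetric, so each can be treated as a single-channel image of size n by n for a series of length n.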
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
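The gradient reversal layer mentioned in the abstract above can be sketched in a framework-free way. This is an illustration under assumptions, not the authors' implementation: the layer is taken to be the identity in the forward pass and to multiply the incoming gradient by a negative constant in the backward pass. The class name `GradientReversal` and the `scale` parameter are hypothetical.

```python
class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass."""

    def __init__(self, scale=1.0):
        # Assumed hyperparameter: how strongly the reversed gradient is weighted.
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, upstream_grad):
        # The gradient flowing back to the feature extractor is negated and
        # scaled, so the extractor is pushed to make domains harder to tell apart.
        return [-self.scale * g for g in upstream_grad]
```

In a real deep learning package the same behaviour would be expressed as a custom operation with a user-defined gradient; the plain-list version here only shows the forward and backward rules.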
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
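The straight-path analysis described in the abstract above can be sketched as follows: evaluate the loss at evenly spaced points on the line segment between the initial and final parameter vectors. This is a minimal sketch under assumptions; `toy_loss` is a hypothetical stand-in for a network's training loss, and the function names are not from the source.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced alphas in [0, 1]."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    # Hypothetical quadratic loss standing in for a real training objective.
    return sum((t - 1.0) ** 2 for t in theta)


path = loss_along_path(toy_loss, [3.0, -2.0], [1.0, 1.0])
```

Plotting the second element of each pair in `path` against `alpha` gives the one-dimensional loss profile; a bump in that curve would indicate an obstacle on the straight path.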
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
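The rank-based selection strategy described in the abstract above can be sketched as follows. The exact parametrization is an assumption: datapoints are sorted by their latest known loss, and the selection weight shrinks by a constant factor `exp(log(selection_pressure) / n)` per rank. The names `selection_probabilities`, `sample_batch`, and `selection_pressure` are hypothetical.

```python
import math
import random


def selection_probabilities(losses, selection_pressure=100.0):
    """Selection probability per datapoint, decaying exponentially with loss rank."""
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Assumed parametrization: consecutive ranks differ by this constant factor.
    decay = math.exp(math.log(selection_pressure) / n)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** (-rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, rng, selection_pressure=100.0):
    """Draw datapoint indices (with replacement) according to the rank-based probabilities."""
    probs = selection_probabilities(losses, selection_pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

A training loop would call `sample_batch` before each update and then overwrite the stored losses of the selected datapoints with their freshly computed values, so the ranking tracks the current model.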
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, to deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
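The SVRG update analyzed in the abstract above can be sketched for a scalar parameter. This is a minimal sketch under assumptions: the snapshot is refreshed with the last inner iterate (one common choice, which may differ from the variant analyzed), and the toy objective is a convex quadratic chosen only so the result is easy to check, whereas the abstract concerns nonconvex problems. The names `svrg`, `toy_grad`, and `targets` are hypothetical.

```python
import random


def svrg(grad_i, num_terms, w0, step_size, num_epochs, inner_steps, rng):
    """SVRG for minimizing (1/n) * sum_i f_i(w) over a scalar w."""
    w_snapshot = w0
    for _ in range(num_epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snapshot, i) for i in range(num_terms)) / num_terms
        w = w_snapshot
        for _ in range(inner_steps):
            i = rng.randrange(num_terms)
            # Variance-reduced gradient: stochastic gradient corrected by the
            # same term's gradient at the snapshot plus the full gradient.
            v = grad_i(w, i) - grad_i(w_snapshot, i) + full_grad
            w = w - step_size * v
        # Assumed choice: the next snapshot is the last inner iterate.
        w_snapshot = w
    return w_snapshot


# Toy finite sum: f_i(w) = 0.5 * (w - targets[i]) ** 2, minimized at the mean.
targets = [1.0, 2.0, 6.0]


def toy_grad(w, i):
    return w - targets[i]


w_final = svrg(toy_grad, len(targets), 0.0, 0.1, 10, 20, random.Random(0))
```

On this toy problem the correction term cancels the per-term noise exactly, so each inner step equals a full gradient step; on general objectives the correction only reduces the variance rather than removing it.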
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
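The stochastic rounding mentioned in the abstract above can be illustrated with a minimal sketch. The function names, the default split of 8 fractional bits, and the saturation step are assumptions made for illustration; they are not taken from the paper. The idea shown is only that a value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounding is unbiased in expectation.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a grid of step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, so E[result] == x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


def to_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate to a signed word_bits range.

    word_bits and frac_bits are illustrative defaults (hypothetical).
    """
    scale = 1 << frac_bits
    max_val = ((1 << (word_bits - 1)) - 1) / scale
    min_val = -(1 << (word_bits - 1)) / scale
    return min(max(stochastic_round(x, frac_bits, rng), min_val), max_val)
```

Averaging many calls such as `stochastic_round(0.3, frac_bits=2)` should approach 0.3, whereas round-to-nearest on the same grid would always return 0.25.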
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
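As a small illustration of the shape relationships the abstract above refers to, the sketch below computes output sizes along one spatial axis. The formulas are the commonly used ones for convolution and transposed convolution; the abstract itself does not state them, so treat them and the helper names as assumptions. Here `i` is the input size, `k` the kernel size, `s` the stride, `p` the zero padding, and `a` an extra output padding that resolves the ambiguity when `(i + 2p - k)` is not divisible by `s`.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution (or pooling) along one axis."""
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0, a=0):
    """Output size of the transposed convolution associated with a
    convolution that uses kernel k, stride s and padding p.

    `a` (hypothetical name) is the extra output padding, 0 <= a < s.
    """
    return s * (i - 1) + a + k - 2 * p


# Round trip: a 6-wide input, 3-wide kernel, stride 2, padding 1.
forward = conv_output_size(6, k=3, s=2, p=1)
remainder = (6 + 2 * 1 - 3) % 2
restored = transposed_conv_output_size(forward, k=3, s=2, p=1, a=remainder)
```

In this round trip the transposed layer recovers the original input size only because the remainder is passed back in as `a`; with `a=0` the result would be one element short.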
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
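To make the notion of training against a teacher's soft targets concrete, here is a minimal sketch of a soft-target cross-entropy. It is not the paper's exact objective: the temperature parameter and the function names are assumptions for illustration, and the logits are plain Python lists.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft targets and the
    student's predicted distribution (lower means closer agreement)."""
    targets = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(targets, student))
```

A student whose logits match the teacher's attains the minimum of this loss, which equals the entropy of the teacher's distribution.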
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
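One way to picture the combination of max and average pooling described above is a convex mix of the two. The sketch below uses a fixed mixing coefficient `a` on a 1-D signal purely for illustration; in the abstract's setting the pooling function is learned, and the function names and the non-overlapping 1-D windows here are assumptions.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one region.

    a = 1.0 gives pure max pooling, a = 0.0 pure average pooling.
    """
    region_max = max(region)
    region_avg = sum(region) / len(region)
    return a * region_max + (1.0 - a) * region_avg


def mixed_pool_1d(values, size, a):
    """Apply mixed pooling over non-overlapping windows of a 1-D list."""
    return [
        mixed_pool(values[start:start + size], a)
        for start in range(0, len(values) - size + 1, size)
    ]


pooled = mixed_pool_1d([1.0, 3.0, 2.0, 6.0], size=2, a=0.5)
```

In a trainable version, `a` would be a parameter updated by gradient descent (for example one per layer or per region) rather than a constant.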
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
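The sampling step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the activations in a pooling region are non-negative (for example, after a rectifier), and the function name `stochastic_pool` and the handling of an all-zero region are assumptions made for this example.

```python
import random

def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region, with probability
    proportional to its value (assumes non-negative activations)."""
    total = sum(region)
    if total <= 0:
        # Assumption for this sketch: an all-zero region pools to zero.
        return 0.0
    # Normalise the activations into a multinomial distribution.
    probs = [a / total for a in region]
    # Draw a single index from that distribution.
    idx = rng.choices(range(len(region)), weights=probs, k=1)[0]
    return region[idx]
```

Calling `stochastic_pool([0.0, 1.0, 3.0])` returns 3.0 about three times as often as 1.0 and never returns the zero activation, so larger activations dominate without being chosen deterministically as in max pooling.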
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
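The claim above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked with a small sketch. It assumes the plain form (1/N * sum_i |x_i|^p)^(1/p) applied directly to the inputs; the learned projections and per-unit orders described in the abstract are left out, and the name `lp_unit` is an assumption made for this example.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm of a list of numbers, for p >= 1."""
    n = len(inputs)
    largest = max(abs(x) for x in inputs)
    if largest == 0:
        return 0.0
    # Factor out the largest magnitude so large p does not overflow.
    scaled = sum((abs(x) / largest) ** p for x in inputs) / n
    return largest * scaled ** (1.0 / p)
```

With p = 1 this is the mean of the absolute values, with p = 2 it is the root mean square, and as p grows it approaches the largest absolute value, which is the sense in which the unit interpolates between average and max pooling.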
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
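The masking idea in the abstract above can be illustrated for a single hidden layer. This is a sketch of one possible mask construction, using the natural input ordering and randomly assigned hidden-unit degrees; the function names `made_masks` and `connectivity` are assumptions made for this example, and it requires at least two inputs.

```python
import random

def made_masks(num_inputs, num_hidden, rng=random):
    """Build binary masks for a one-hidden-layer autoencoder so that
    output d can only depend on inputs 1..d-1 (natural ordering)."""
    # Each hidden unit gets a degree m in [1, num_inputs - 1].
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit with degree m may see inputs d with d <= m.
    mask_in = [[1 if m >= d else 0 for d in range(1, num_inputs + 1)]
               for m in degrees]
    # Output d may see hidden units with degree m < d.
    mask_out = [[1 if d > m else 0 for m in degrees]
                for d in range(1, num_inputs + 1)]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """Count input-to-output paths: entry [o][i] is the number of hidden
    units through which output o can see input i."""
    num_inputs = len(mask_out)
    num_hidden = len(mask_in)
    return [[sum(mask_out[o][k] * mask_in[k][i] for k in range(num_hidden))
             for i in range(num_inputs)]
            for o in range(num_inputs)]
```

Multiplying the weight matrices elementwise by these masks leaves `connectivity` strictly lower triangular: no output has a path from its own input or from any later input, which is what lets the outputs be read as conditional probabilities.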
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
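The knowledge-transfer objective mentioned above, minimizing a Kullback-Leibler divergence between the teacher's soft outputs and the student's outputs, can be sketched for a single frame as follows. The direction KL(teacher || student), the function name `kl_divergence`, and the `eps` guard are assumptions made for illustration; the abstract does not specify them.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions over the same classes."""
    return sum(
        p * math.log(p / max(q, eps))  # eps guards against division by zero
        for p, q in zip(teacher, student)
        if p > 0.0  # terms with zero teacher mass contribute nothing
    )
```

In training, this quantity would be averaged over frames and minimized with respect to the student's parameters while the teacher's soft alignments stay fixed.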
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
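The gradient reversal layer named in the abstract above can be sketched without any framework: it acts as the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass. The class name `GradientReversal` and the scaling factor `lam` are hypothetical choices for this illustration, not an API from any deep learning package.

```python
class GradientReversal:
    """Identity on the forward pass; sign-flipped, scaled gradient on the backward pass."""

    def __init__(self, lam=1.0):
        # lam is an assumed trade-off factor between the task loss and the domain loss.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The domain classifier's gradient is reversed before reaching the
        # feature extractor, pushing features toward domain invariance.
        return [-self.lam * g for g in grad_output]
```

For example, with `lam=0.5` an incoming gradient of `[2.0, -4.0]` is passed upstream as `[-1.0, 2.0]`.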
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
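The selection rule described above, ranking datapoints by their latest known loss and letting the selection probability decay exponentially with rank, can be sketched as follows. The function name `selection_probabilities` and the `decay` rate are hypothetical; the abstract does not give the exact parametrization of the decay.

```python
import math


def selection_probabilities(losses, decay=0.5):
    """Map latest known losses to selection probabilities that decay with rank."""
    # Rank 0 is the datapoint with the greatest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    weights = [0.0] * len(losses)
    for rank, idx in enumerate(order):
        # Assumed form: unnormalized weight exp(-decay * rank).
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]
```

A batch could then be drawn with `random.choices(range(len(losses)), weights=probs, k=batch_size)`, with losses refreshed as datapoints are revisited.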
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggests that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
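The SVRG update discussed in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm or experiments: the objective is a convex one-dimensional least-squares toy problem (the paper analyzes the nonconvex case), and the function name, step size, epoch count and inner-loop length are hypothetical choices made for this example.

```python
import random

def svrg_least_squares(a, b, lr=0.02, epochs=50, inner_steps=20, seed=0):
    """Minimize (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 over a scalar w with SVRG."""
    rng = random.Random(seed)
    n = len(a)
    w = 0.0
    for _ in range(epochs):
        # Snapshot the current iterate and compute the full gradient there.
        w_snap = w
        mu = sum(ai * (ai * w_snap - bi) for ai, bi in zip(a, b)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            grad_i = a[i] * (a[i] * w - b[i])            # component gradient at w
            grad_i_snap = a[i] * (a[i] * w_snap - b[i])  # same component at the snapshot
            # Variance-reduced step: unbiased, and its variance shrinks as
            # w and w_snap both approach the minimizer.
            w -= lr * (grad_i - grad_i_snap + mu)
    return w

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.5]
w_hat = svrg_least_squares(a, b)
w_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
print(w_hat, w_star)
```

The last inner iterate is used as the next snapshot here; other snapshot rules (for example, a randomly chosen inner iterate) are also used in the literature.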
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
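The stochastic rounding scheme mentioned in the abstract above can be illustrated with a short sketch. It assumes a fixed-point grid with `frac_bits` fractional bits and rounds up with probability equal to the distance from the lower grid point, so that the result is unbiased in expectation. The function name and parameters are hypothetical, and saturation to a 16-bit word length is deliberately left out.

```python
import math
import random

def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits using stochastic rounding."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up equals the fractional part, so E[result] == x.
    prob_up = scaled - lower
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale

rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
print(sorted(set(samples)), sum(samples) / len(samples))
```

With two fractional bits, 0.3 lies between the grid points 0.25 and 0.5. Round-to-nearest would always return 0.25, whereas the stochastic scheme returns 0.5 about one time in five, so the sample mean stays close to 0.3.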
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
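The shape relationships that the guide above describes can be sketched for a single spatial dimension. This is a minimal illustration using the standard output-size formulas; the function and parameter names are hypothetical, and the transposed-convolution formula assumes no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution: input i, kernel k, stride s, zero padding p."""
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the transposed convolution with the same k, s and p."""
    return s * (i - 1) + k - 2 * p

print(conv_output_size(28, 5))                      # 28 input, 5 kernel, no padding
print(conv_output_size(5, 3, s=2, p=1))             # strided, padded convolution
print(transposed_conv_output_size(3, 3, s=2, p=1))  # maps the size 3 back to 5
```

Because of the floor division, several input sizes can map to the same output size when the stride is greater than one (for example, inputs of length 5 and 6 both give 3 with k=3, s=2, p=1), which is why the transposed formula recovers only one of them.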
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
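The sampling step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes non-negative activations (for example, after a rectifier), treats one pooling region as a flat list, and the function names `stochastic_pool` and `expected_pool` are hypothetical. The probability-weighted average in `expected_pool` is an assumed deterministic counterpart for use at test time; the abstract itself does not specify one.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation is picked with probability proportional to its own
    value, i.e. from a multinomial distribution given by the activities.
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # every activation is zero, so there is nothing to sample
    # random.choices draws one element according to the given weights
    return rng.choices(region, weights=region, k=1)[0]


def expected_pool(region):
    """Probability-weighted average of the region (assumed test-time rule)."""
    total = sum(region)
    if total == 0:
        return 0.0
    return sum(a * (a / total) for a in region)


if __name__ == "__main__":
    rng = random.Random(0)
    region = [0.0, 1.0, 3.0]
    print(stochastic_pool(region, rng))  # one of 1.0 or 3.0
    print(expected_pool(region))
```

Larger activations are selected more often, but smaller non-zero ones still have a chance, which is where the regularizing noise comes from.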
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
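The claim above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the plain normalized norm $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$ over a list of inputs, omits the learned projections and per-unit learned order $p$ described in the abstract, and the name `lp_pool` is hypothetical.

```python
def lp_pool(values, p):
    """Normalized L_p norm of a group of inputs: (mean(|x_i|^p))^(1/p).

    Assumes p >= 1 and a non-empty list of values.
    """
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


if __name__ == "__main__":
    xs = [3.0, -4.0]
    print(lp_pool(xs, 1))    # mean of absolute values
    print(lp_pool(xs, 2))    # root-mean-square
    print(lp_pool(xs, 200))  # approaches max(|x_i|) as p grows
```

With $p = 1$ the unit averages absolute values, with $p = 2$ it computes the root-mean-square, and as $p$ grows the result approaches the largest absolute input, which is the max-pooling limit.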
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
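The masking idea in the abstract above can be illustrated with a small sketch. The abstract does not spell out how the masks are built; the degree-based construction below (random degrees per hidden unit, one hidden layer, the names made_masks and connectivity) is an assumption chosen for illustration, not necessarily the authors' exact procedure. It shows that, with such masks, output d is connected only to inputs that come before d in the ordering.

```python
import random

def made_masks(num_inputs, num_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder.

    Assumed construction: each hidden unit gets a random 'degree', the
    highest input index (1-based) it is allowed to see.
    """
    rng = random.Random(seed)
    degrees = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Input d feeds hidden unit k only if degree(k) >= d.
    mask_in = [[1 if degrees[k] >= d else 0 for d in range(1, num_inputs + 1)]
               for k in range(num_hidden)]
    # Hidden unit k feeds output d only if d > degree(k).
    mask_out = [[1 if d > degrees[k] else 0 for k in range(num_hidden)]
                for d in range(1, num_inputs + 1)]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """Count masked paths from each input (column) to each output (row)."""
    num_inputs = len(mask_in[0])
    num_hidden = len(mask_in)
    return [[sum(row[k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for row in mask_out]

mask_in, mask_out = made_masks(num_inputs=5, num_hidden=8)
paths = connectivity(mask_in, mask_out)
for row in paths:
    print(row)  # entries on and above the diagonal are all zero
```

Because every path count on or above the diagonal is zero, each output can be read as a conditional that depends only on earlier inputs, which is the autoregressive constraint the abstract describes.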
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
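The random-walk picture in the abstract above can be checked numerically with a small simulation. This is only a sketch under stated assumptions: the matrices have i.i.d. Gaussian entries with variance 1/width (a common choice that preserves the expected squared norm, not necessarily the unbiased scaling the authors derive), and the function names are hypothetical. It compares how much the log-norm wanders for narrow versus wide layers.

```python
import math
import random

def log_norm_walk(width, depth, rng):
    """Track log ||v|| as v is multiplied by `depth` random Gaussian matrices."""
    v = [rng.gauss(0.0, 1.0) for _ in range(width)]
    sigma = 1.0 / math.sqrt(width)  # assumed scaling: entries ~ N(0, 1/width)
    walk = [math.log(math.sqrt(sum(x * x for x in v)))]
    for _ in range(depth):
        # One fresh random matrix per layer, applied row by row.
        v = [sum(rng.gauss(0.0, sigma) * x for x in v) for _ in range(width)]
        walk.append(math.log(math.sqrt(sum(x * x for x in v))))
    return walk

def log_norm_change_variance(width, depth, trials, seed=0):
    """Sample variance of the total change in log-norm over many walks."""
    rng = random.Random(seed)
    changes = []
    for _ in range(trials):
        walk = log_norm_walk(width, depth, rng)
        changes.append(walk[-1] - walk[0])
    mean = sum(changes) / trials
    return sum((c - mean) ** 2 for c in changes) / (trials - 1)

narrow = log_norm_change_variance(width=4, depth=10, trials=100)
wide = log_norm_change_variance(width=32, depth=10, trials=100)
print("variance with width 4: ", narrow)
print("variance with width 32:", wide)
```

The wider network shows a markedly smaller variance, in line with the abstract's statement that the variance of the walk is inversely proportional to layer size.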
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
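The training signal described in the abstract above, a Kullback-Leibler divergence between the teacher's soft alignments and the student's outputs, can be sketched for a single frame as follows. The logits, the three-class setup, and the function names are made-up illustrations, and the direction KL(teacher || student) is an assumption; the abstract does not specify these details.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher, student):
    """KL(teacher || student) for two discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0)

# Hypothetical per-frame outputs over three acoustic states.
teacher_probs = softmax([2.0, 1.0, 0.1])   # soft alignment from the large model
student_probs = softmax([1.5, 1.2, 0.3])   # current prediction of the small model
loss = kl_divergence(teacher_probs, student_probs)
print("per-frame KL loss:", loss)
```

The loss is zero only when the student reproduces the teacher's distribution exactly, so minimizing it pushes the small model toward the soft targets rather than toward hard labels.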
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, and pattern extraction. Training large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g., MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that compressive MBN combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making it both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
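The Gramian angular encoding named in the abstract above can be sketched as follows. The abstract gives no formulas, so the definitions used here (rescale to [0, 1], map each value to an angle via arccos, then take cos of angle sums for GASF and sin of angle differences for GADF) are an assumption about the usual form of these fields, and the function names are hypothetical. The series must not be constant, since the rescaling divides by its range.

```python
import math

def rescale_unit(series):
    """Min-max rescale a non-constant series to the interval [0, 1]."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

def angular_fields(series):
    """Return the rescaled series with its GASF and GADF matrices."""
    scaled = rescale_unit(series)
    # Clamp before arccos to guard against tiny floating-point overshoot.
    angles = [math.acos(min(1.0, max(0.0, x))) for x in scaled]
    gasf = [[math.cos(a + b) for b in angles] for a in angles]
    gadf = [[math.sin(a - b) for b in angles] for a in angles]
    return scaled, gasf, gadf

scaled, gasf, gadf = angular_fields([1.0, 3.0, 2.0, 5.0])
for row in gasf:
    print([round(v, 3) for v in row])
```

Under these definitions the GASF diagonal equals 2x^2 - 1 for each rescaled value x, so on [0, 1] data the original values can be read back from the image, which is presumably what the abstract's bijection remark relies on.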
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
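The exact factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ stated in the abstract above can be verified on a toy distribution. The four-point distribution and the encoder f(x) = x // 2 below are made-up examples, and the decoder is taken to be the ideal one that places no mass on any x' with f(x') different from h.

```python
from collections import defaultdict

# Hypothetical discrete distribution over four values of X.
p_x = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

def f(x):
    """Toy deterministic encoder: maps {0, 1} to code 0 and {2, 3} to code 1."""
    return x // 2

# P(H = h) is the total mass of all x that encode to h.
p_h = defaultdict(float)
for x, p in p_x.items():
    p_h[f(x)] += p

def p_x_given_h(x, h):
    """Ideal decoder: zero mass on any x whose code differs from h."""
    return p_x[x] / p_h[h] if f(x) == h else 0.0

for x, p in p_x.items():
    product = p_x_given_h(x, f(x)) * p_h[f(x)]
    print(x, p, product)  # the product reproduces P(X = x)
```

In this toy case the code distribution has two outcomes instead of four, a small instance of the encoder mapping X to something with a simpler distribution.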
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout-based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout-based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
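The gradient reversal layer mentioned in the abstract above can be sketched in a framework-free way. The class name, the list-based interface, and the scaling factor lam are assumptions for illustration; the abstract only states that such a layer exists and that it works with standard backpropagation.

```python
class GradientReversal:
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    def __init__(self, lam=1.0):
        # lam is an assumed trade-off weight applied to the reversed gradient.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, upstream_grad):
        # The feature extractor receives the negated, scaled gradient,
        # so it is pushed to make the domains harder to tell apart.
        return [-self.lam * g for g in upstream_grad]

layer = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
domain_grad = [1.0, -2.0, 0.5]
print(layer.forward(features))
print(layer.backward(domain_grad))
```

Placing such a layer between the feature extractor and a domain classifier lets one ordinary backward pass train the classifier to separate domains while training the features to confuse it.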
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network by training a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
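The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `pressure` parameter (the ratio between the selection probability of the highest-loss and lowest-loss datapoint), and sampling with replacement are all assumptions made for this example.

```python
import math
import random


def rank_selection_probs(losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    `pressure` is a hypothetical knob: the probability of the top-ranked
    (highest-loss) point divided by that of the bottom-ranked point.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the greatest latest-known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank decay chosen so that weight[0] / weight[n - 1] == pressure.
    decay = math.log(pressure) / max(n - 1, 1)
    weights = [math.exp(-decay * rank) for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop one would presumably refresh `losses` with the most recent loss of each datapoint after every step, so that the ranking tracks the current model.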
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
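The stochastic rounding scheme mentioned in the abstract above can be illustrated with a short sketch. It is not the authors' code: the function name and the `frac_bits` parameter are assumptions for illustration, and saturation to a fixed word length (for example 16 bits) is deliberately left out to keep the example small.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a fixed-point grid with 2**-frac_bits spacing.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the result is unbiased in expectation.
    Clipping to a finite word length is omitted in this sketch.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability (scaled - lower), otherwise round down.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale
```

Unlike round-to-nearest, which always maps a small update below half a grid step to zero, this rule lets such updates survive on average, which is one plausible reading of why the rounding scheme matters during training.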
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
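The relationship between input shape, kernel shape, zero padding, strides and output shape referred to in the abstract above can be made concrete with the standard one-dimensional size formulas. The sketch below uses the widely known relations for a single spatial axis; the function and argument names are chosen for this example, and the transposed case assumes no extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution (or pooling) along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the matching transposed convolution.

    Assumes no additional output padding, so it inverts conv_output_size
    exactly only when (i_original + 2 * p - k) is divisible by s.
    """
    return s * (i - 1) + k - 2 * p
```

For instance, a 5x5 kernel with unit stride and no padding maps a 28-pixel axis to `conv_output_size(28, 5)`, and applying `transposed_conv_output_size` with the same `k`, `s` and `p` to that result recovers 28.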
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
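The soft-target idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature parameter, and the cross-entropy form of the objective are assumptions made here for illustration, since the abstract only states that the student is trained on soft targets produced by a teacher model.

```python
import math

def soft_targets(logits, temperature=1.0):
    """Softmax over teacher logits; `temperature` is a hypothetical knob
    (not mentioned in the abstract) that flattens the distribution when > 1."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_probs, teacher_probs):
    """Cross-entropy of the student distribution against the teacher's soft targets."""
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0)

# Illustrative values only: a teacher's logits for three classes.
teacher = soft_targets([2.0, 1.0, 0.1], temperature=2.0)
student = soft_targets([1.5, 1.2, 0.3])
loss = soft_cross_entropy(student, teacher)
```

The student would be trained to reduce `loss`; the loss is smallest when the student distribution equals the teacher's.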
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method to training deep neural networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation effort but also do not map well to parallel computation. We introduce structured sparsity at various scales for convolutional neural networks: channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a slight increase in computational overhead during training and a very modest increase in the number of model parameters.
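One simple way to combine max and average pooling, as mentioned in the abstract above, is a convex blend of the two. The sketch below shows only the forward computation for a single pooling region; the name `mixed_pool` and the fixed mixing proportion `alpha` are assumptions for illustration (in the abstract the combination is learned, and the exact strategies are not specified there).

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    `alpha` is a hypothetical mixing proportion in [0, 1]:
    alpha = 1 gives max pooling, alpha = 0 gives average pooling.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    average = sum(region) / len(region)
    return alpha * max(region) + (1.0 - alpha) * average

# Illustrative region of four activations.
region = [1.0, 2.0, 3.0, 6.0]
blended = mixed_pool(region, 0.5)
```

In a trainable version, `alpha` would be a parameter updated by gradient descent rather than a constant.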
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
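The stochastic pooling procedure described in the abstract above can be sketched for a single pooling region as follows. This is a minimal illustration, not the authors' code: it assumes non-negative activations (for example, after a rectifier), and the handling of an all-zero region is an assumption made here.

```python
import random

def stochastic_pool(region, rng):
    """Sample one activation from a pooling region, with probability
    proportional to each activation's value (a multinomial over the region).

    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # assumption: an all-zero region pools to zero
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]

# Illustrative use with a seeded generator for repeatability.
rng = random.Random(0)
pooled = stochastic_pool([0.5, 1.5, 0.0, 2.0], rng)
```

Larger activations are picked more often, but smaller non-zero ones still have a chance, which is where the regularizing noise comes from.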
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has difficulty finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
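A minimal sketch of the normalized $L_p$ norm described above, assuming the unit receives its projection values as a plain list of floats. The function name `lp_unit` and its arguments are illustrative and not taken from the paper, and learning the order $p$ is not shown. The example only illustrates numerically how the order $p$ moves the unit between the mean of magnitudes, root-mean-square, and max pooling.

```python
import math

def lp_unit(projections, p):
    """Normalized L_p norm: (mean(|z_i|^p))^(1/p), for p >= 1 (illustrative)."""
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(projections)
    if n == 0:
        raise ValueError("need at least one projection")
    # Factor out the largest magnitude so large p does not overflow.
    m = max(abs(z) for z in projections)
    if m == 0.0:
        return 0.0
    total = sum((abs(z) / m) ** p for z in projections)
    return m * (total / n) ** (1.0 / p)

z = [1.0, -2.0, 3.0, -4.0]
avg = lp_unit(z, 1)         # mean of absolute values
rms = lp_unit(z, 2)         # root-mean-square
near_max = lp_unit(z, 200)  # approaches max |z_i| as p grows
```

With these inputs, `avg` is the mean magnitude, `rms` is the root-mean-square, and `near_max` lies just below the largest magnitude, consistent with the pooling interpretation in the abstract.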
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, and pattern extraction. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBNs) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
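A small numerical check of the factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ stated above, as a sketch under stated assumptions: the four-value distribution `p_x`, the encoder `f`, and the helper `p_x_given_h` are made-up illustrations, not the paper's model. The decoder here is the exact conditional, so it places no mass on any $x'$ with $f(x') \neq f(x)$, which is the capacity condition the abstract requires.

```python
from collections import defaultdict

# Toy distribution over four discrete values (made-up numbers).
p_x = {"a": 0.1, "b": 0.3, "c": 0.2, "d": 0.4}

def f(x):
    """Hypothetical deterministic encoder: maps each x to a discrete code."""
    return 0 if x in ("a", "b") else 1

# P(H = h) is the total probability mass of all x that encode to h.
p_h = defaultdict(float)
for x, px in p_x.items():
    p_h[f(x)] += px

def p_x_given_h(x, h):
    """Exact conditional decoder: zero mass on any x with f(x) != h."""
    return p_x[x] / p_h[h] if f(x) == h else 0.0

# Recompute P(X = x) as P(X = x | H = f(x)) * P(H = f(x)) for every x.
reconstructed = {x: p_x_given_h(x, f(x)) * p_h[f(x)] for x in p_x}
```

Each entry of `reconstructed` matches the corresponding entry of `p_x` up to floating-point rounding, which is the exact rewriting of the likelihood that the abstract describes.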
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. We then evaluate the performance of the algorithm experimentally. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
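The "straight path from initialization to solution" described above can be read as a linear interpolation between two parameter vectors, with the loss evaluated at evenly spaced points along it. The following is a minimal sketch under that assumption; the function names, the flat-list parameter representation, and the toy quadratic loss are hypothetical and are not taken from the paper.

```python
def interpolate_params(theta_init, theta_final, alpha):
    """Return (1 - alpha) * theta_init + alpha * theta_final, element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at num_points evenly spaced points on the straight path."""
    alphas = [k / (num_points - 1) for k in range(num_points)]
    return [(alpha, loss_fn(interpolate_params(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    # Hypothetical stand-in for a training loss: a simple convex bowl.
    return sum(t * t for t in theta)


path = loss_along_path(toy_loss, [2.0, -2.0], [0.0, 0.0])
```

On a real network, `loss_fn` would evaluate the training objective at the interpolated weights; a bump in the resulting curve would indicate an obstacle on the path.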
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
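The ranking strategy described above (sort datapoints by latest known loss, then let the selection probability decay exponentially with rank) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the `pressure` parameter (roughly the ratio between the selection probabilities of the highest-loss and lowest-loss points) and the function names are assumptions, and sampling here is done with replacement for simplicity.

```python
import math
import random


def rank_selection_probs(losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest-known loss.
    `pressure` is a hypothetical knob controlling how fast the decay is.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(-math.log(pressure) / n)  # per-rank multiplicative factor
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, rng, pressure=100.0):
    """Draw a batch of indices (with replacement) according to the rank-based probabilities."""
    probs = rank_selection_probs(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


latest_losses = [0.1, 2.0, 0.5, 1.0]
probs = rank_selection_probs(latest_losses)
batch = sample_batch(latest_losses, batch_size=3, rng=random.Random(0))
```

In a training loop, `latest_losses` would be updated for the selected datapoints after each step, so the ranking changes over time.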
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speed-up, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain those for full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
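Stochastic rounding, as referred to above, rounds a value to one of its two neighbouring representable numbers, choosing the upper one with probability equal to the fractional distance from the lower one, so that the result is correct in expectation. The sketch below illustrates this for a fixed-point grid with `frac_bits` fractional bits; the function name and parameters are hypothetical, and saturation to a fixed word length (for example 16 bits) is deliberately omitted.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x to a multiple of 2**-frac_bits using stochastic rounding.

    The value is rounded up with probability equal to its fractional
    remainder on the fixed-point grid, and down otherwise.
    Saturation/overflow handling is not modelled in this sketch.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # in [0, 1)
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale


rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

With `frac_bits=2` the grid spacing is 0.25, so 0.3 is rounded to either 0.25 or 0.5, and the average over many draws stays close to 0.3, which round-to-nearest would not achieve.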
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does deep learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher-order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights, this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
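The Gramian Angular Summation Field named in the abstract above can be illustrated with a short sketch. It assumes the usual definition, in which a series rescaled to [0, 1] is mapped to angles phi = arccos(x) and the image entry (i, j) is cos(phi_i + phi_j). The function name gasf is hypothetical, and the GADF and MTF encodings are not covered.

```python
import math


def gasf(series):
    """Gramian Angular Summation Field of a 1-D series (illustrative sketch).

    Assumes min-max rescaling to [0, 1] followed by the angular encoding
    phi = arccos(x); entry (i, j) of the result is cos(phi_i + phi_j).
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    # Min-max rescale, clamping to guard against floating-point drift.
    scaled = [min(1.0, max(0.0, (v - lo) / (hi - lo))) for v in series]
    phi = [math.acos(v) for v in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]
```

Under this definition the diagonal entry equals 2x^2 - 1, so a value rescaled to [0, 1] can be read back as sqrt((G[i][i] + 1) / 2), which is one way to see the bijection property the abstract refers to.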
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
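The rank-based selection rule described in the abstract above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the parameter pressure, the per-rank decay factor pressure ** (1 / n), and both function names are assumptions, and batches are drawn with replacement for simplicity.

```python
import math
import random


def selection_probabilities(losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    `pressure` is a hypothetical knob: roughly the ratio between the
    probability of the highest-loss point and that of the lowest-loss point.
    """
    n = len(losses)
    # Rank 0 is the datapoint with the largest latest known loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    decay = math.exp(math.log(pressure) / n)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = decay ** (-rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Draw datapoint indices (with replacement) using the rank-based probabilities."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

Setting pressure to 1.0 recovers uniform sampling, which gives a simple baseline to compare against.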
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
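The stochastic rounding idea mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stochastic_round` and the `frac_bits` parameter are assumptions made for the example, and saturation to a fixed word length (such as 16 bits) is omitted. A value is rounded up to the next point on the fixed-point grid with probability equal to its fractional remainder, so the rounding is unbiased in expectation.

```python
import numpy as np

def stochastic_round(x, frac_bits=8, rng=None):
    """Round x onto a fixed-point grid with step 2**-frac_bits (hypothetical helper).

    Each value is rounded up with probability equal to its distance from the
    lower grid point (in grid units), so E[stochastic_round(x)] == x.
    Overflow/saturation of the integer part is deliberately not modelled.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    # Fractional remainder in [0, 1) is the probability of rounding up.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    return (lower + round_up) / scale
```

With `frac_bits=2` the grid step is 0.25, so an input of 0.3 becomes either 0.25 or 0.5, and the average over many draws approaches 0.3, whereas round-to-nearest would always return 0.25.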
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
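The relationships between input shape, kernel shape, zero padding, strides and output shape can be illustrated with a short sketch. The abstract does not state the formulas; the ones below are the standard one-dimensional relations and are given here as an assumption, with hypothetical function names. The transposed case shown is the simplest one, without any extra output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length of a convolution: floor((i + 2p - k) / s) + 1.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output length of the matching transposed convolution: s * (i - 1) + k - 2p.

    Assumes no additional output padding.
    """
    return s * (i - 1) + k - 2 * p


# A 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives 3 outputs;
# the transposed convolution with the same settings maps 3 back to 5.
small = conv_output_size(5, k=3, s=2, p=1)                    # 3
restored = transposed_conv_output_size(small, k=3, s=2, p=1)  # 5
```

When `i + 2p - k` is not a multiple of the stride, several input sizes map to the same output size, so the transposed formula above recovers only the smallest of them.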
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adapted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
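One simple way to combine max and average pooling is a convex mixture of the two. The sketch below assumes that form; the abstract does not spell out either of its two strategies, so the function name `mixed_pool` and the mixing weight `a` are illustrative assumptions rather than the authors' definition. In a learned variant, `a` would be a trainable parameter.

```python
import numpy as np

def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region (hypothetical helper).

    a = 1.0 gives pure max pooling, a = 0.0 gives pure average pooling;
    values in between interpolate linearly between the two.
    """
    r = np.asarray(region, dtype=np.float64)
    return a * r.max() + (1.0 - a) * r.mean()
```

For the region `[1, 2, 3, 6]`, the maximum is 6 and the mean is 3, so `a = 0.5` gives 4.5.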
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
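The sampling step described above can be sketched for a single pooling region. This is a minimal illustration under stated assumptions, not the authors' code: the function name `stochastic_pool_region` is hypothetical, activations are assumed to be non-negative (for example, after a rectifier), and an all-zero region is assumed to output zero. Each activation is selected with probability proportional to its value.

```python
import numpy as np

def stochastic_pool_region(region, rng):
    """Sample one activation from a pooling region (hypothetical helper).

    Assumes non-negative activations. The sampling probabilities are the
    activations normalised to sum to one, i.e. a multinomial distribution
    given by the activities within the region.
    """
    a = np.asarray(region, dtype=np.float64).ravel()
    total = a.sum()
    if total <= 0:
        # Assumption: a region with no activity pools to zero.
        return 0.0
    probs = a / total
    return float(rng.choice(a, p=probs))


rng = np.random.default_rng(0)
pooled = stochastic_pool_region([[1.0, 2.0], [3.0, 4.0]], rng)
```

Here `pooled` is one of the four region values; the largest activation, 4.0, is chosen with probability 0.4, so strong activations dominate without always winning as they would under max pooling.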
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
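The gradient reversal layer named in the abstract above can be illustrated with a minimal sketch. It assumes the usual reading of such a layer: the forward pass is the identity, and the backward pass multiplies the incoming gradient by a negative constant. The class name `GradientReversal` and the scaling parameter `lam` are hypothetical names chosen for this illustration, not taken from the abstract.

```python
class GradientReversal:
    """Illustrative layer: identity going forward, sign-flipped gradient going back.

    `lam` is an assumed scaling factor for the reversed gradient.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient from the domain classifier is negated (and scaled),
        # so the feature extractor is pushed to make domains harder to tell apart.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
activations = layer.forward([1.0, -2.0, 3.0])
reversed_grad = layer.backward([0.2, -0.4, 1.0])
```

Placed between a feature extractor and a domain classifier, a layer of this kind lets ordinary backpropagation train the classifier to separate domains while training the features to do the opposite.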
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
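The analysis described in the abstract above, evaluating the objective along a straight path from initialization to solution, can be sketched as follows. This is an illustrative sketch on a toy quadratic loss, not the authors' code; `interpolate`, `loss_along_path`, and `toy_loss` are hypothetical names, and a real use would substitute a network's flattened parameters and its training loss.

```python
def interpolate(theta_init, theta_final, alpha):
    """Return the point (1 - alpha) * theta_init + alpha * theta_final."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the straight line between two parameter vectors."""
    alphas = [k / (num_points - 1) for k in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    """Stand-in objective (an assumption for this sketch): squared distance to the all-ones vector."""
    return sum((t - 1.0) ** 2 for t in theta)


path = loss_along_path(toy_loss, [0.0, 0.0], [1.0, 1.0], num_points=5)
for alpha, value in path:
    print(f"alpha={alpha:.2f} loss={value:.4f}")
```

A bump in the resulting curve would indicate an obstacle between the two endpoints; a smooth monotone decrease is the kind of evidence the abstract reports.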
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
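The selection strategy in the abstract above (rank datapoints by latest known loss, with selection probability decaying exponentially in rank) can be sketched as below. The exact parameterization is an assumption made for illustration: `pressure` is a hypothetical knob that approximately sets the ratio between the selection probabilities of the highest-ranked and lowest-ranked datapoints, and the function names are not from the abstract.

```python
import math
import random


def rank_selection_probs(latest_losses, pressure=100.0):
    """Map each datapoint's latest loss to a selection probability.

    Datapoints are ranked by loss (largest first); the weight of rank r is
    exp(-r * log(pressure) / n), then weights are normalized to sum to 1.
    """
    n = len(latest_losses)
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    decay = math.log(pressure) / n
    weights = [math.exp(-decay * rank) for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(latest_losses, batch_size, pressure=100.0, rng=random):
    """Draw a batch of datapoint indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


losses = [0.1, 2.0, 0.5, 1.0]
print(rank_selection_probs(losses))
print(sample_batch(losses, batch_size=8, rng=random.Random(0)))
```

Setting `pressure` to 1.0 recovers uniform sampling, which gives a simple baseline to compare against.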
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization are full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
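The stochastic rounding mentioned above can be illustrated with a minimal sketch. It assumes a signed fixed-point format with a configurable number of fractional bits and saturation at the range limits; the function name `stochastic_round` and its parameters are hypothetical and are not taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=random):
    """Round x onto a signed fixed-point grid (hypothetical helper).

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, so the rounding is unbiased
    as long as no saturation occurs.
    """
    scale = 1 << frac_bits            # grid spacing is 1 / scale
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable signed integer range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    quantized = max(min_int, min(max_int, lower))
    return quantized / scale
```

Unlike round-to-nearest, which would always map 0.3 to 0.25 on a grid with spacing 0.25, this scheme returns 0.25 or 0.5 with probabilities chosen so that the average stays at 0.3, which is why small gradient updates are not systematically lost.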
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
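The relationship between input shape, kernel shape, zero padding, strides and output shape can be sketched along one spatial axis as follows. This is a minimal illustration using the standard formulas $o = \lfloor (i + 2p - k)/s \rfloor + 1$ for a convolution (or a pooling layer with $p = 0$) and $o' = s(i' - 1) + a + k - 2p$ for the matching transposed convolution; the function names and the extra-size argument `a` are assumptions made for this example.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a convolution or pooling layer.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0, a=0):
    """Output size along one axis of a transposed convolution.

    k, s and p describe the forward convolution being transposed.
    a = (i_forward + 2p - k) % s recovers the rows or columns that the
    floor in the forward formula discards; it is 0 when the stride
    divides the padded input evenly.
    """
    return s * (i - 1) + a + k - 2 * p
```

For example, a 3-wide kernel with stride 2 and padding 1 maps an input of size 5 to an output of size 3, and the transposed layer with the same settings maps size 3 back to size 5.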
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not map well onto parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
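The SVRG update analyzed in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `svrg`, its parameters, the choice of the last inner iterate as the next snapshot, and the toy objective (a convex sum of quadratics, chosen only because its minimizer is known) are all assumptions made for the example.

```python
import random


def svrg(grad_i, n, w0, step, epochs, inner_steps, seed=0):
    """Minimal SVRG sketch for minimizing (1/n) * sum_i f_i(w) over a scalar w.

    grad_i(w, i) must return the gradient of the i-th component f_i at w.
    The last inner iterate is used as the next snapshot (one common variant).
    """
    rng = random.Random(seed)
    w_snapshot = w0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(w_snapshot, i) for i in range(n)) / n
        w = w_snapshot
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced estimate: stochastic gradient at w, corrected by
            # the same component's gradient at the snapshot plus the full gradient.
            v = grad_i(w, i) - grad_i(w_snapshot, i) + full_grad
            w -= step * v
        w_snapshot = w
    return w_snapshot


# Hypothetical toy problem: f_i(w) = 0.5 * (w - a_i)^2, minimized at mean(a).
targets = [1.0, 2.0, 3.0, 6.0]


def toy_grad(w, i):
    return w - targets[i]


w_star = svrg(toy_grad, len(targets), w0=0.0, step=0.1, epochs=30, inner_steps=20)
```

On this toy problem the correction term cancels the sampling noise exactly, so every inner step moves straight toward the mean of `targets`; on general nonconvex objectives the variance is only reduced, not removed.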
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
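Stochastic rounding, which the abstract above credits for making 16-bit fixed-point training workable, can be illustrated as follows. This is a minimal sketch under assumptions: the helper name `stochastic_round` and its `frac_bits` parameter are hypothetical, and saturation to a fixed word length is deliberately left out.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x to a fixed-point grid with `frac_bits` fractional bits.

    x is rounded up with probability equal to its distance from the lower grid
    point (in grid units), so the rounded value equals x in expectation.
    Saturation to a finite word length is omitted in this sketch.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # fractional part, in [0, 1)
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale


rng = random.Random(0)
# With 2 fractional bits the grid spacing is 0.25, so 0.3 maps to 0.25 or 0.5.
samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to 0.25 every time, a systematic bias of 0.05; with stochastic rounding the individual values are coarser but their average stays close to 0.3, which is the property that lets small gradient updates survive in expectation.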
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
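The shape relationships that the guide above describes can be written as two small helpers. This is a sketch using the standard per-axis formulas for a convolution and for a transposed convolution without output padding; the function and parameter names are chosen here for illustration and are not taken from the guide.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a convolution or pooling layer.

    i: input size, k: kernel size, s: stride, p: zero padding on each side.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of the transposed convolution associated
    with a convolution that has kernel k, stride s and padding p
    (no extra output padding)."""
    return s * (i - 1) + k - 2 * p


# A 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives 3 outputs,
# and the matching transposed convolution maps 3 back to 5.
down = conv_output_size(5, k=3, s=2, p=1)
up = transposed_conv_output_size(down, k=3, s=2, p=1)
```

Because of the floor in the convolution formula, several input sizes can map to the same output size when the stride is greater than 1 (inputs of 5 and 6 both give 3 here), so the transposed formula recovers only the smallest of them unless extra output padding is added.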
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such methods to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
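The sampling step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stochastic_pool`, the flat-list representation of a pooling region, the assumption that activations are non-negative (for example after a rectifier), and the all-zero fallback are all choices made here for the example.

```python
import random

def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region.

    Each activation is chosen with probability proportional to its value,
    i.e. a multinomial over the region. Assumes non-negative activations.
    """
    total = sum(region)
    if total == 0:
        # Assumed convention for this sketch: an all-zero region pools to 0.
        return 0.0
    probs = [a / total for a in region]
    return rng.choices(region, weights=probs, k=1)[0]

# Usage: a 2x2 pooling region flattened to a list (hypothetical values).
rng = random.Random(0)
region = [0.0, 1.0, 3.0, 0.0]
samples = [stochastic_pool(region, rng) for _ in range(1000)]
```

Zero activations are never selected, and larger activations are selected more often, so the procedure behaves like a randomized interpolation between max pooling and average pooling.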
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
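The claim in the abstract above that a normalized $L_p$ norm generalizes average, root-mean-square and max pooling can be checked numerically. The sketch below assumes the normalization is a division by the number of inputs inside the root, which is one natural reading of "normalized"; the function name `lp_pool` and the sample values are hypothetical, and the learned projections and per-unit orders of the full unit are omitted.

```python
def lp_pool(values, p):
    """Normalized L_p norm of a list: ((1/N) * sum(|v|^p)) ** (1/p)."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)

x = [1.0, 2.0, 3.0, 4.0]
mean_abs = lp_pool(x, 1)    # p = 1: average of absolute values
rms = lp_pool(x, 2)         # p = 2: root-mean-square
near_max = lp_pool(x, 200)  # large p: approaches max(|x|)
```

Increasing `p` moves the output from the mean toward the maximum, which is why a unit with a learnable order can interpolate between these pooling operators.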
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
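The masking idea in the abstract above can be illustrated with a minimal sketch. It assumes a single hidden layer and one common way of building the masks (assigning each hidden unit a random integer "degree"); the function names `made_masks` and `connectivity` are hypothetical and are not taken from the paper.

```python
import random

def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer masked autoencoder (illustrative)."""
    rng = random.Random(seed)
    # Assumption: each hidden unit gets a degree between 1 and n_inputs - 1.
    degrees = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Hidden unit k may read input d (1-indexed) only if d <= degrees[k].
    mask_in = [[1 if d <= degrees[k] else 0 for d in range(1, n_inputs + 1)]
               for k in range(n_hidden)]
    # Output d may read hidden unit k only if degrees[k] < d.
    mask_out = [[1 if degrees[k] < d else 0 for k in range(n_hidden)]
                for d in range(1, n_inputs + 1)]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """conn[d][j] is True if output d can be influenced by input j (0-indexed)."""
    n_hidden = len(mask_in)
    n_inputs = len(mask_in[0])
    return [[any(mask_out[d][k] and mask_in[k][j] for k in range(n_hidden))
             for j in range(n_inputs)]
            for d in range(len(mask_out))]

mask_in, mask_out = made_masks(n_inputs=4, n_hidden=6)
conn = connectivity(mask_in, mask_out)
# Autoregressive property: output d never depends on input d or any later input.
print(all(not conn[d][j] for d in range(4) for j in range(d, 4)))
```

Multiplying the weight matrices elementwise by these masks is what would keep each reconstructed input a function of earlier inputs only; the first output has no incoming connections, so it would model an unconditional probability.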
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
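The Kullback-Leibler objective mentioned in the abstract above can be sketched for a single frame as follows. This is only an illustration under stated assumptions: the logits are made-up numbers, and `softmax` and `kl_divergence` are hypothetical helper names, not the paper's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical per-frame outputs: the teacher gives soft targets, the student is trained.
teacher_probs = softmax([2.0, 1.0, 0.1])
student_probs = softmax([1.5, 1.2, 0.3])
loss = kl_divergence(teacher_probs, student_probs)
print(loss)
```

In training, this quantity would be averaged over frames and minimized with respect to the student's parameters; the teacher's distribution stays fixed.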
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
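A gradient reversal layer of the kind named in the abstract above can be sketched without any framework. The class below is a hypothetical, framework-free illustration: it assumes the layer acts as the identity in the forward pass and multiplies the incoming gradient by a negative factor (here called `lam`, an assumed name) in the backward pass.

```python
class GradientReversal:
    """Identity in the forward pass; flips and scales the gradient going backward."""

    def __init__(self, lam=1.0):
        # lam is an assumed hyperparameter controlling how strongly gradients are reversed.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The feature extractor receives the negated (and scaled) gradient,
        # so it is pushed to make the domains harder to tell apart.
        return [-self.lam * g for g in grad_output]

layer = GradientReversal(lam=2.0)
print(layer.forward([1.0, -2.0]))   # [1.0, -2.0]
print(layer.backward([0.5, -1.0]))  # [-1.0, 2.0]
```

In a deep learning package this would typically be written as a custom operation with a user-defined gradient, placed between the shared feature extractor and the domain classifier.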
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method to training deep neural networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation effort but also do not map well onto parallel computation. We introduce structured sparsity at various scales for convolutional neural networks: channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to the decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neuroscience, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in the computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective for learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
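The pooling interpretation above can be checked numerically with a small sketch. It is not the authors' implementation: the helper name `lp_pool` is hypothetical, and it assumes the normalized $L_p$ norm has the form $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$, leaving out the learned projections and per-unit orders that the abstract describes.

```python
import math

def lp_pool(values, p):
    """Normalized L_p norm (mean(|x|^p))^(1/p); an assumed form for illustration."""
    if not values:
        raise ValueError("values must be non-empty")
    largest = max(abs(v) for v in values)
    if math.isinf(p) or largest == 0:
        # p = infinity is the max-pooling limit; an all-zero input pools to zero.
        return float(largest)
    # Factor out the largest magnitude so large p does not overflow.
    mean_pow = sum((abs(v) / largest) ** p for v in values) / len(values)
    return largest * mean_pow ** (1.0 / p)

x = [1.0, -2.0, 3.0, -4.0]
avg = lp_pool(x, 1)         # p = 1: average of absolute values
rms = lp_pool(x, 2)         # p = 2: root-mean-square
near_max = lp_pool(x, 200)  # large p: approaches max |x_i|
```

Under this assumed form, small orders behave like average pooling and large orders like max pooling, which is one way to read the claim that a single unit with an adjustable order covers several conventional pooling operators.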
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
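The masking idea can be illustrated for a single hidden layer. This is a sketch under stated assumptions, not the paper's code: the function names `made_masks` and `reachable` are hypothetical, the degree-assignment scheme is one common way to satisfy the autoregressive constraint, and in a real model the masks would be multiplied elementwise into the weight matrices.

```python
import random

def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer autoencoder (illustrative)."""
    rng = random.Random(seed)
    # Input d (0-based) gets degree d + 1 under the natural ordering.
    in_deg = list(range(1, n_inputs + 1))
    # Each hidden unit gets a random degree in [1, n_inputs - 1].
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Hidden unit k may read input d only if hid_deg[k] >= in_deg[d].
    mask_in = [[int(hid_deg[k] >= in_deg[d]) for d in range(n_inputs)]
               for k in range(n_hidden)]
    # Output d may read hidden unit k only if in_deg[d] > hid_deg[k].
    mask_out = [[int(in_deg[d] > hid_deg[k]) for k in range(n_hidden)]
                for d in range(n_inputs)]
    return mask_in, mask_out

def reachable(mask_in, mask_out):
    """reach[d][j] is True when output d is connected to input j."""
    n_inputs, n_hidden = len(mask_out), len(mask_in)
    return [[any(mask_out[d][k] and mask_in[k][j] for k in range(n_hidden))
             for j in range(n_inputs)]
            for d in range(n_inputs)]

mask_in, mask_out = made_masks(n_inputs=5, n_hidden=8)
reach = reachable(mask_in, mask_out)
# Output d must never see input d or any later input in the ordering.
autoregressive = all(not reach[d][j] for d in range(5) for j in range(d, 5))
```

Because every path from input j to output d passes through a hidden unit whose degree lies between the two, output d can only depend on inputs that precede it, so each output can be read as a conditional probability given the earlier inputs.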
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.
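The relative improvement quoted above follows directly from the two WER figures. The short check below is only a worked calculation; the variable names are illustrative and the two numbers are the ones given in the abstract.

```python
baseline_wer = 4.54  # baseline WER (%) quoted in the abstract
student_wer = 3.93   # WER (%) of the small DNN trained on soft RNN alignments

# Relative improvement = absolute WER reduction divided by the baseline WER.
relative_improvement = (baseline_wer - student_wer) / baseline_wer
print(f"{relative_improvement:.1%}")  # prints 13.4%
```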
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied to popular machine learning tasks. Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
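Taking logs of the factorization above makes the two training terms explicit. The display below is only a restatement of the abstract's decomposition, writing $h = f(x)$ as the abstract does:

$$\log P(X = x) = \underbrace{\log P(X = x \mid H = f(x))}_{\text{reconstruction log-likelihood}} + \underbrace{\log P(H = f(x))}_{\text{regularizer on } h = f(x)}$$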
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
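The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and so is the `pressure` parameter, which here fixes the ratio between the selection probabilities of the highest- and lowest-ranked datapoints.

```python
import math
import random


def selection_probabilities(losses, pressure=100.0):
    """Return one selection probability per datapoint.

    Datapoints are ranked by their latest known loss (largest first), and
    the probability decays exponentially with rank. `pressure` is a
    hypothetical knob: the ratio between the probabilities of the
    highest- and lowest-ranked datapoints.
    """
    n = len(losses)
    # Indices sorted so that the datapoint with the greatest loss has rank 0.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Per-rank decay chosen so that weight(rank 0) / weight(rank n-1) == pressure.
    decay = math.exp(math.log(pressure) / max(n - 1, 1))
    weights = [decay ** (-rank) for rank in range(n)]
    total = sum(weights)
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs


def sample_batch(losses, batch_size, pressure=100.0, rng=random):
    """Sample datapoint indices (with replacement) for one mini-batch."""
    probs = selection_probabilities(losses, pressure)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)
```

In a training loop, one would presumably refresh the stored loss of each datapoint whenever it is evaluated and call `sample_batch` before every update; setting `pressure=1.0` recovers uniform sampling.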
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
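For readers unfamiliar with SVRG, the basic update that the abstract above analyzes can be sketched as follows. This is a generic textbook-style sketch for a scalar variable, not the exact variant or step-size schedule studied in the paper; the function signature is hypothetical, and taking the last inner iterate as the next snapshot is an assumption (other variants pick a random iterate).

```python
import random


def svrg(grad_i, n, x0, step, epochs, inner_steps, rng=random):
    """Minimize (1/n) * sum_i f_i(x) over a scalar x with SVRG.

    grad_i(i, x) must return the gradient of the i-th component f_i at x.
    """
    x = x0
    for _ in range(epochs):
        snapshot = x
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(i, snapshot) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient: unbiased, and its
            # variance shrinks as x approaches the snapshot.
            g = grad_i(i, x) - grad_i(i, snapshot) + full_grad
            x -= step * g
    return x
```

As a toy usage example (convex, chosen only because the answer is known), with components $f_i(x) = \frac{1}{2}(x - a_i)^2$ the call `svrg(lambda i, x: x - a[i], len(a), 0.0, 0.1, 20, 10)` approaches the mean of `a`.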
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
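The stochastic rounding scheme mentioned in the abstract above can be sketched as follows. This is an illustrative sketch only: the function name, the default split of the 16-bit word into 8 fractional bits, and the saturation behaviour are assumptions for the example, not details taken from the paper.

```python
import math
import random


def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=random):
    """Round x to a signed fixed-point value using stochastic rounding.

    x is rounded up to the next representable value with probability equal
    to its fractional distance from the value below, so the result is
    unbiased in expectation (away from the saturation limits). The split
    of the word into integer and fractional bits is an assumption.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Probability of rounding up equals the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    lower = max(min_int, min(max_int, lower))
    return lower / scale
```

Unlike round-to-nearest, this keeps small updates from being systematically lost: a value lying 20% of the way between two grid points is rounded up about 20% of the time, so the average over many roundings stays close to the original value.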
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections, making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (specifically, LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
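The "straight path from initialization to solution" in the abstract above can be read as evaluating the loss at parameters linearly interpolated between the initial and final values. The following is a minimal sketch under that reading; the function names and the toy quadratic loss are illustrative assumptions, not the authors' code.

```python
def interpolate(theta_init, theta_final, alpha):
    """Return (1 - alpha) * theta_init + alpha * theta_final, element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_line(loss_fn, theta_init, theta_final, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the segment between two parameter vectors."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_init, theta_final, alpha)))
            for alpha in alphas]


def toy_loss(theta):
    # Hypothetical stand-in for a network's training loss (minimum at all ones).
    return sum((t - 1.0) ** 2 for t in theta)


curve = loss_along_line(toy_loss, [0.0, 3.0], [1.0, 1.0], num_points=5)
```

A bump in the resulting curve would indicate an obstacle on the straight path. For a real network, `toy_loss` would be replaced by the training objective evaluated at the interpolated weights.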
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
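The rank-based selection rule described in the abstract above can be sketched as follows. This is only an illustration: the parameter name `pressure` (the ratio between the selection probabilities of the highest-loss and lowest-loss points), the exact decay formula, and sampling with replacement are all assumptions rather than details taken from the abstract.

```python
import math
import random


def rank_selection_probs(latest_losses, pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest known loss.
    """
    n = len(latest_losses)
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    # Chosen so that weight(rank 0) / weight(rank n-1) == pressure.
    decay = math.log(pressure) / max(n - 1, 1)
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = math.exp(-decay * rank)
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(latest_losses, batch_size, rng, pressure=100.0):
    """Draw a batch of indices (with replacement) using the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, pressure)
    return rng.choices(range(len(latest_losses)), weights=probs, k=batch_size)


losses = [0.1, 2.5, 0.7, 1.3]
probs = rank_selection_probs(losses, pressure=8.0)
batch = sample_batch(losses, batch_size=3, rng=random.Random(0), pressure=8.0)
```

Setting `pressure` to 1 recovers uniform sampling, so the value controls how strongly high-loss datapoints are favoured.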
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
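The stochastic rounding idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a fixed-point grid with step 2^-frac_bits and rounds up with probability equal to the fractional distance from the lower grid point, so the result is unbiased in expectation. The function name and the `frac_bits` parameter are hypothetical, and saturation to a 16-bit word length is omitted; this is not the authors' implementation.

```python
import math
import random


def stochastic_round(x, frac_bits=8):
    """Round x to a multiple of 2**-frac_bits (hypothetical helper).

    The value is rounded up with probability equal to its fractional
    distance from the lower grid point, so E[result] == x.
    """
    step = 2.0 ** -frac_bits
    scaled = x / step
    lower = math.floor(scaled)
    p_up = scaled - lower  # lies in [0, 1)
    round_up = random.random() < p_up
    return (lower + (1 if round_up else 0)) * step


if __name__ == "__main__":
    random.seed(0)
    samples = [stochastic_round(0.3, frac_bits=2) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; the average stays close to 0.3.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest would always map 0.3 to 0.25 on this grid, so small updates would be lost systematically; the stochastic variant preserves them on average.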
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
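As a small worked example of the shape relationships this abstract refers to, the sketch below encodes the standard output-size formulas for one spatial dimension: `o = floor((i + 2p - k) / s) + 1` for a convolution and `o = (i - 1) * s - 2p + k` for a transposed convolution without extra output padding. The function names are hypothetical, and the formulas are the commonly used ones rather than text quoted from the guide.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length of a convolution along one dimension.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output length of a transposed convolution (no extra output padding)."""
    return (i - 1) * s - 2 * p + k


if __name__ == "__main__":
    # A 5-wide input, 3-wide kernel, padding 1, stride 2 gives 3 outputs;
    # the matching transposed convolution maps 3 back to 5.
    print(conv_output_size(5, 3, p=1, s=2))
    print(transposed_conv_output_size(3, 3, p=1, s=2))
```

When `(i + 2p - k)` is not divisible by the stride, several input sizes map to the same output size, so the transposed formula recovers only the smallest of them.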
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well with parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
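The sampling step described in this abstract can be sketched for a single pooling region as follows. The sketch assumes non-negative activations (for example, after a ReLU) so that normalizing them yields a valid multinomial distribution, and it returns 0.0 for an all-zero region. The function name is hypothetical and the test-time behaviour of the method is not covered.

```python
import random


def stochastic_pool(region):
    """Sample one activation from a pooling region (hypothetical helper).

    Each activation a_i is picked with probability a_i / sum(region).
    Assumes all activations are non-negative.
    """
    total = sum(region)
    if total <= 0:
        return 0.0  # nothing active in this region
    # random.choices normalizes the weights internally.
    return random.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    random.seed(1)
    region = [0.0, 1.0, 3.0]
    draws = [stochastic_pool(region) for _ in range(20000)]
    # 3.0 is drawn about 75% of the time, 1.0 about 25%, 0.0 never.
    print(draws.count(3.0) / len(draws))
```

Compared with max pooling, which would always return 3.0 here, the smaller activation still gets selected occasionally, which is the source of the regularizing noise.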
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest. This explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
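As a rough illustration of the stochastic rounding mentioned in the abstract above, here is a minimal Python sketch. It assumes the common formulation in which a value is rounded up to the next fixed-point step with probability equal to its fractional remainder, so the rounded value is unbiased in expectation. The function names, the choice of 8 fractional bits, and the saturation behaviour are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits, rounding up with
    probability equal to the fractional remainder (assumed scheme)."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # scaled - lower lies in [0, 1); it is the chance of rounding up.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


def quantize_fixed_point(x, word_bits=16, frac_bits=8, rng=random):
    """Stochastically round x, then saturate it to the range of a
    signed word_bits-wide fixed-point number (hypothetical helper)."""
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1)) / scale
    hi = ((1 << (word_bits - 1)) - 1) / scale
    return min(hi, max(lo, stochastic_round(x, frac_bits, rng)))


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, 8, rng) for _ in range(10000)]
    # Each sample is 76/256 or 77/256; their mean stays close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 77/256 every time, so small updates below half a step would be lost systematically; the randomized version keeps them on average.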
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong-data, no-voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
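The shape relationships that the abstract above refers to can be sketched with a few lines of integer arithmetic. This is a minimal illustration using the standard formulas o = floor((i + 2p - k) / s) + 1 for a convolution and o' = s(i' - 1) + k - 2p for the matching transposed convolution. It assumes square inputs and kernels, symmetric zero padding, no dilation, and, in the transposed case, that (i + 2p - k) is divisible by s in the forward direction. The function names are hypothetical.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of a convolution.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size along one axis of the transposed convolution whose
    forward convolution uses kernel k, stride s and padding p.

    Assumes the forward convolution divides evenly by the stride.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5x5 input, 3x3 kernel, stride 2, padding 1 gives a 3x3 output.
    print(conv_output_size(5, 3, s=2, p=1))
    # The transposed layer maps the 3x3 output back to a 5x5 shape.
    print(transposed_conv_output_size(3, 3, s=2, p=1))
```

The two functions are inverses of each other on shapes only: a transposed convolution recovers the input size of the forward convolution, not its values.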
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance between the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
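One way to picture the first direction in the abstract above, combining max and average pooling, is a convex mix of the two over a single pooling region. The sketch below is only an assumed, simplest-case form: the mixing coefficient alpha is a plain argument here, whereas the abstract describes a pooling function that is learned. The function name and signature are hypothetical.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    region: non-empty sequence of activations in the region.
    alpha:  mixing weight in [0, 1]; 1.0 gives max pooling and
            0.0 gives average pooling (assumed parametrization).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    avg_value = sum(region) / len(region)
    return alpha * max_value + (1.0 - alpha) * avg_value


if __name__ == "__main__":
    region = [1.0, 2.0, 3.0, 6.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mixed_pool(region, alpha))
```

In a trainable layer, alpha would typically be a parameter updated by gradient descent, since the output is differentiable with respect to it.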
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does deep learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
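The SVRG update analyzed in the abstract above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it assumes a toy one-dimensional least-squares finite sum (convex, chosen only so the minimizer is known in closed form, whereas the abstract concerns the nonconvex case), and the names `svrg`, `grad_i`, `step`, `epochs` and `inner_steps` are hypothetical.

```python
import random

# Toy finite-sum objective: f(w) = (1/n) * sum_i 0.5 * (a_i * w - b_i)**2.
# The data values are made up for illustration.
A = [1.0, 2.0, 3.0, 4.0]
B = [2.5, 3.5, 6.5, 7.5]

def grad_i(w, i):
    """Gradient of the i-th component function at w."""
    return A[i] * (A[i] * w - B[i])

def full_grad(w):
    """Gradient of the full objective (average of component gradients)."""
    return sum(grad_i(w, i) for i in range(len(A))) / len(A)

def svrg(w0=0.0, step=0.02, epochs=30, inner_steps=8, seed=0):
    rng = random.Random(seed)
    w_snapshot = w0
    for _ in range(epochs):
        # One full-gradient pass at the snapshot point per epoch.
        mu = full_grad(w_snapshot)
        w = w_snapshot
        for _ in range(inner_steps):
            i = rng.randrange(len(A))
            # Variance-reduced stochastic gradient: unbiased for full_grad(w),
            # with variance that shrinks as w and w_snapshot approach a
            # stationary point.
            g = grad_i(w, i) - grad_i(w_snapshot, i) + mu
            w -= step * g
        # Use the last inner iterate as the next snapshot (one common choice).
        w_snapshot = w
    return w_snapshot

w_star = sum(a * b for a, b in zip(A, B)) / sum(a * a for a in A)
print(svrg(), w_star)
```

The correction term `grad_i(w_snapshot, i) - mu` is what distinguishes this from plain SGD: it has zero mean over `i`, so the update stays unbiased while its noise vanishes near convergence, which allows a constant step size.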
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speed-up, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
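The abstract above does not spell out the rounding rule, so the following is a sketch of stochastic rounding as it is commonly defined: a value is rounded up to the next grid point with probability equal to its fractional distance from the lower grid point, which makes the rounded value unbiased in expectation. The function name `stochastic_round` and the parameter `frac_bits` are hypothetical, and the sketch models only the fractional grid, not the saturation that a fixed 16-bit word length would also impose.

```python
import math
import random

def stochastic_round(x, frac_bits=8, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    Rounds up with probability equal to the distance from the lower
    grid point (measured in grid steps), so E[result] == x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower            # in [0, 1)
    rounded = lower + (1 if rng.random() < p_up else 0)
    return rounded / scale

# With 2 fractional bits the grid step is 0.25, so 0.3 lands on 0.25 or 0.5.
rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time, so many small updates of the same sign would be lost systematically; with stochastic rounding they survive on average.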
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
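The relationships the abstract above refers to can be checked with simple arithmetic. This is a minimal sketch using the standard formulas for a single spatial axis, with input size `i`, kernel size `k`, zero padding `p` and stride `s`; the function names are hypothetical, and the transposed-convolution formula assumes no extra output padding.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size of a convolution (or pooling) along one axis:
    floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, p=0, s=1):
    """Output size of the transposed convolution associated with a
    convolution that has kernel k, padding p and stride s:
    s * (i - 1) + k - 2p."""
    return s * (i - 1) + k - 2 * p

# A 5x5 kernel without padding shrinks a 28-pixel axis; a 3x3 kernel
# with padding 1 and stride 1 preserves the size.
print(conv_output_size(28, 5))
print(conv_output_size(32, 3, p=1))
# The transposed convolution maps the output size back to the input size
# whenever (i + 2p - k) is divisible by s.
print(transposed_conv_output_size(conv_output_size(7, 3, s=2), 3, s=2))
```

When `(i + 2p - k)` is not divisible by `s`, several input sizes map to the same output size, so the transposed convolution recovers only the smallest of them.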
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
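The idea of combining max and average pooling in the abstract above can be illustrated with a small sketch. This is only one plausible reading, not the authors' implementation: the function name `mixed_pool` and the mixing weight `alpha` are assumptions introduced here, and in a real network `alpha` would presumably be learned by gradient descent rather than passed in by hand.

```python
def mixed_pool(region, alpha):
    """Blend max and average pooling over one pooling region.

    alpha = 1.0 gives pure max pooling, alpha = 0.0 gives pure average
    pooling. (Hypothetical helper: alpha is fixed here, not learned.)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_value = max(region)
    mean_value = sum(region) / len(region)
    # Convex combination of the two conventional pooling results.
    return alpha * max_value + (1.0 - alpha) * mean_value


region = [1.0, 3.0, 2.0, 6.0]
print(mixed_pool(region, 1.0))  # behaves like max pooling
print(mixed_pool(region, 0.0))  # behaves like average pooling
print(mixed_pool(region, 0.5))  # halfway between the two
```

Keeping the blend a convex combination means the output always stays between the average and the maximum of the region, so the unit cannot produce values outside the range conventional pooling would give.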
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
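The stochastic pooling procedure described in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for example, after a rectifier), and an all-zero region is assumed to pool to zero.

```python
import random


def stochastic_pool(region, rng=random):
    """Pick one activation from a pooling region at random.

    Each activation is chosen with probability proportional to its value,
    i.e. according to a multinomial distribution over the region.
    Assumes non-negative activations (hypothetical simplification).
    """
    total = sum(region)
    if total <= 0:
        # No positive activity in the region: nothing to sample from.
        return 0.0
    probabilities = [a / total for a in region]
    return rng.choices(region, weights=probabilities, k=1)[0]


rng = random.Random(0)  # seeded generator for repeatable sampling
region = [1.0, 0.0, 3.0, 6.0]
samples = [stochastic_pool(region, rng) for _ in range(5)]
print(samples)  # larger activations are drawn more often on average
```

Note that no extra hyper-parameter appears in the sketch: the sampling distribution is determined entirely by the activations themselves, which matches the abstract's description of the approach as hyper-parameter free.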
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
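The normalized $L_p$ norm at the core of the unit described above can be written down directly. The following is a simplified sketch, not the paper's full unit: the function name `lp_unit` is hypothetical, the learned projections mentioned in the abstract are omitted, and the order `p` is passed in rather than learned.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm of a list of inputs: (mean(|x|^p))^(1/p).

    Simplified sketch: the learned projections and learned order p
    described in the abstract are left out.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(inputs)
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)


signals = [3.0, -4.0]
print(lp_unit(signals, 1))    # average of absolute values
print(lp_unit(signals, 2))    # root-mean-square
print(lp_unit(signals, 100))  # approaches the largest absolute value
```

Varying `p` moves the unit between the pooling operators named in the abstract: `p = 1` gives the average of absolute values, `p = 2` the root-mean-square, and large `p` approaches max pooling over absolute values.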
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), which have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further reduce the size of the multilayer bootstrap network by using a neural network trained in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or our speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
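The rank-based selection rule described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `selection_pressure` parameter, and the exact form of the decay are assumptions chosen so that the ratio between the most and least likely datapoints is roughly `selection_pressure`.

```python
import numpy as np

def rank_selection_probs(latest_losses, selection_pressure=100.0):
    """Selection probabilities that decay exponentially with loss rank.

    Rank 0 is the datapoint with the greatest latest known loss.
    `selection_pressure` (hypothetical name) is roughly the ratio between
    the probabilities of the top-ranked and bottom-ranked datapoints.
    """
    losses = np.asarray(latest_losses, dtype=np.float64)
    n = len(losses)
    order = np.argsort(-losses)          # indices sorted by decreasing loss
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)          # ranks[i] = rank of datapoint i
    decay = np.exp(np.log(selection_pressure) / n)
    weights = 1.0 / decay ** ranks       # exponential decay in rank
    return weights / weights.sum()

def sample_batch(latest_losses, batch_size, rng, selection_pressure=100.0):
    """Draw a batch of distinct indices according to the rank-based probabilities."""
    probs = rank_selection_probs(latest_losses, selection_pressure)
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    losses = [0.1, 2.0, 0.5, 1.0]
    print(rank_selection_probs(losses))
    print(sample_batch(losses, batch_size=2, rng=rng))
```

In a training loop, the stored loss of each datapoint would be refreshed whenever that datapoint appears in a batch, so the ranking always reflects the latest known values.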
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
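Stochastic rounding, as mentioned in the abstract above, rounds a value up or down to a neighbouring representable number with probability proportional to its proximity, so the rounding is unbiased in expectation. The sketch below illustrates the idea for a signed fixed-point format; the function name, the split into `frac_bits` and `word_bits`, and the saturation behaviour are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Quantize x to signed fixed point using stochastic rounding.

    frac_bits: number of fractional bits (step size is 2**-frac_bits).
    word_bits: total word length, including the sign bit.
    Both parameter names are hypothetical.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so that E[result] equals the unquantized value.
    round_up = rng.random(scaled.shape) < (scaled - floor)
    q = floor + round_up
    # Saturate to the range of a signed word_bits-wide integer.
    lo, hi = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    q = np.clip(q, lo, hi)
    return q / scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = stochastic_round_fixed_point(np.full(100_000, 0.3), rng=rng)
    # The mean of many stochastic roundings stays close to 0.3,
    # whereas round-to-nearest would always return 77/256.
    print(samples.mean(), np.unique(samples))
```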
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
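The relationship between input shape, kernel shape, zero padding, strides and output shape mentioned in the abstract above can be illustrated with the standard one-dimensional size formulas. The sketch below is a hedged illustration using the commonly stated relations; the function names are hypothetical, and the transposed formula assumes no additional output padding.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution along one axis.

    i: input size, k: kernel size, s: stride, p: zero padding per side.
    Uses o = floor((i + 2p - k) / s) + 1.
    """
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of the transposed convolution along one axis.

    Inverts conv_output_size when (i_conv + 2p - k) is divisible by s:
    o = s * (i - 1) + k - 2p.
    """
    return s * (i - 1) + k - 2 * p

if __name__ == "__main__":
    # A 5-wide input with a 3-wide kernel, stride 2 and padding 1 gives 3 outputs,
    # and the matching transposed convolution maps 3 back to 5.
    o = conv_output_size(5, k=3, s=2, p=1)
    print(o, transposed_conv_output_size(o, k=3, s=2, p=1))
```

For multi-dimensional layers the same formula is applied independently along each spatial axis.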
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) the input-to-hidden function, (2) the hidden-to-hidden transition and (3) the hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
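The soft-target idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cross-entropy as the "easier objective", and the `temperature` parameter are all assumptions made for illustration (the abstract does not mention a temperature).

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; `temperature` is an assumed knob, not from the abstract."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    targets = softmax(teacher_logits, temperature)
    preds = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, preds))

# Hypothetical logits for a 3-class problem.
teacher = [2.0, 0.5, -1.0]
student = [0.1, 0.2, 0.3]
loss = soft_target_loss(student, teacher, temperature=2.0)
```

The loss is smallest when the student reproduces the teacher's distribution, which is what makes the teacher's outputs usable as a training signal regardless of the student's internal structure.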
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
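One simple way to combine max and average pooling, in the spirit of direction (1) above, is a convex mixture of the two. The sketch below is only an illustration under that assumption: the name `mixed_pool` and the mixing weight `a` are hypothetical, and `a` is passed in as a fixed number here, whereas the abstract describes the pooling function as learned.

```python
def mixed_pool(region, a):
    """Blend max and average pooling over one pooling region.

    a = 1.0 gives plain max pooling, a = 0.0 gives plain average pooling.
    """
    max_part = max(region)
    avg_part = sum(region) / len(region)
    return a * max_part + (1.0 - a) * avg_part

# A hypothetical 2x2 pooling region, flattened to a list.
region = [1.0, 3.0, 2.0, 2.0]
halfway = mixed_pool(region, 0.5)  # midway between max (3.0) and average (2.0)
```

Because the output is differentiable in `a`, the mixing weight could in principle be trained by gradient descent alongside the network weights.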
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
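The stochastic pooling procedure described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes the activations in a region are non-negative (for example, outputs of rectified units) and that the multinomial probabilities are the activations normalized by their sum; the function names are hypothetical.

```python
import random

def pooling_probabilities(region):
    """Normalize non-negative activations into multinomial probabilities."""
    total = sum(region)
    if total == 0:
        return [0.0] * len(region)
    return [a / total for a in region]

def stochastic_pool(region, rng=random):
    """Sample one activation from the region, with probability proportional to its value."""
    if sum(region) == 0:
        return 0.0  # all-zero region: nothing to sample
    probs = pooling_probabilities(region)
    # random.choices draws one element according to the given weights.
    return rng.choices(region, weights=probs, k=1)[0]

# A hypothetical 2x2 pooling region, flattened to a list.
region = [0.0, 1.0, 3.0, 0.0]
sample = stochastic_pool(region)  # returns 1.0 or 3.0; 3.0 is three times as likely
```

Larger activations are picked more often, but smaller non-zero ones still get through occasionally, which is the source of the regularizing noise.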
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition and visual object recognition to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters. In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art in test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
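The masking idea in the abstract above can be sketched as follows. This is a minimal illustration, assuming one hidden layer and a degree-based mask construction (each hidden unit is assigned a random "degree" that limits which inputs it may see); the function names `made_masks` and `connectivity` are hypothetical and the abstract itself does not specify how the masks are built.

```python
import random


def made_masks(n_inputs, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer masked autoencoder (assumed scheme)."""
    rng = random.Random(seed)
    # Assumption: hidden unit k gets a degree m[k] in 1..n_inputs-1 and may
    # only see inputs 1..m[k] (inputs are numbered from 1 in this comment).
    degrees = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Input-to-hidden mask: hidden unit k connects to input d iff d <= m[k].
    mask_in = [[1 if d <= degrees[k] else 0 for d in range(1, n_inputs + 1)]
               for k in range(n_hidden)]
    # Hidden-to-output mask: output d connects to hidden unit k iff m[k] < d.
    mask_out = [[1 if degrees[k] < d else 0 for k in range(n_hidden)]
                for d in range(1, n_inputs + 1)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """conn[d][j] > 0 iff output d can depend on input j via some hidden unit."""
    n_out, n_hidden, n_in = len(mask_out), len(mask_in), len(mask_in[0])
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(n_hidden))
             for j in range(n_in)]
            for d in range(n_out)]


mask_in, mask_out = made_masks(n_inputs=5, n_hidden=8)
conn = connectivity(mask_in, mask_out)
# Autoregressive constraint (0-based indices): output d must not depend on
# input j whenever j >= d, so each output only sees strictly earlier inputs.
is_autoregressive = all(conn[d][j] == 0 for d in range(5) for j in range(d, 5))
```

In a real model the masks would be multiplied elementwise into the weight matrices before each forward pass; the sketch only checks that the resulting connectivity respects the ordering.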
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
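The Kullback-Leibler objective mentioned in the abstract above can be illustrated with a small sketch. It assumes the teacher (RNN) and student (small DNN) each produce a vector of logits over the same classes for one frame; the logits, the `softmax` and `kl_divergence` helpers, and the `eps` smoothing term are all hypothetical choices for illustration, not details from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical per-frame outputs: the teacher's soft alignment and two students.
teacher = softmax([2.0, 1.0, 0.1])
student_far = softmax([0.1, 1.0, 2.0])   # disagrees with the teacher
student_near = softmax([1.9, 1.1, 0.2])  # close to the teacher

loss_far = kl_divergence(teacher, student_far)
loss_near = kl_divergence(teacher, student_near)
```

Minimizing this quantity over frames pushes the student's output distribution toward the teacher's soft targets, so `loss_near` comes out smaller than `loss_far`.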
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
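The Gramian Angular Field encoding named in the abstract above can be sketched as follows. This is a minimal illustration, assuming the usual definition in which the series is rescaled to [-1, 1], each value is mapped to an angle by arccos, and the fields are cos(phi_i + phi_j) (GASF) and sin(phi_i - phi_j) (GADF). The function names are hypothetical, the Markov Transition Field is omitted, and the series is assumed not to be constant.

```python
import math


def rescale(series):
    """Min-max rescale a non-constant series to the interval [-1, 1]."""
    lo, hi = min(series), max(series)
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D series."""
    x = rescale(series)
    # Polar encoding: each rescaled value becomes an angle in [0, pi].
    # Clamping protects acos from tiny floating-point overshoots.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


series = [1.0, 2.0, 3.0, 4.0]
gasf, gadf = gramian_angular_fields(series)
```

The GASF diagonal equals 2x^2 - 1 for each rescaled value x, which is why the original values can be recovered from it when the data are rescaled to [0, 1], as the abstract's remark on the bijection property suggests.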
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1.
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).
In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
Unsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could "guide" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.
Solving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.
We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.
We consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.
We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
Deep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
The generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.
Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\lambda$, we explain why learning $\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.
We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
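The relationship between input shape, kernel shape, zero padding, strides and output shape mentioned above can be illustrated along a single spatial axis. The sketch below uses the standard convolution arithmetic, o = floor((i + 2p - k) / s) + 1, and its transposed counterpart without extra output padding, o = s(i - 1) + k - 2p. The function names are hypothetical and chosen only for this illustration; they are not taken from the guide.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length along one axis of a convolution or pooling layer.

    i: input size, k: kernel size, p: zero padding per side, s: stride.
    """
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, p=0, s=1):
    """Output length along one axis of a transposed convolution.

    Assumes no additional output padding, so this recovers the input size
    of the matching convolution only when (input + 2p - k) is divisible by s.
    """
    return s * (i - 1) + k - 2 * p


if __name__ == "__main__":
    # A 5-wide input, 3-wide kernel, padding 1 and stride 2 gives 3 outputs;
    # the transposed layer with the same settings maps 3 back to 5.
    print(conv_output_size(5, 3, p=1, s=2))
    print(transposed_conv_output_size(3, 3, p=1, s=2))
```

For multi-dimensional inputs the same arithmetic would be applied independently to each spatial axis.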
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
Pre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Deep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.
We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.
Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
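The sampling step described above, picking one activation per pooling region with probability proportional to its activity, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `stochastic_pool` is hypothetical, activations are assumed to be non-negative (for example after a rectifier), and an all-zero region is assumed to pool to zero.

```python
import random


def stochastic_pool(region, rng=random):
    """Sample one activation from a pooling region.

    Each activation a_i is chosen with probability a_i / sum(region),
    i.e. a multinomial distribution given by the activities themselves.
    `rng` may be the `random` module or a `random.Random` instance.
    """
    if any(a < 0 for a in region):
        raise ValueError("activations are assumed to be non-negative")
    total = sum(region)
    if total == 0:
        # Assumption for this sketch: an inactive region pools to zero.
        return 0.0
    return rng.choices(region, weights=region, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    region = [0.0, 1.5, 0.5, 3.0]  # a flattened 2x2 pooling region
    print([stochastic_pool(region, rng) for _ in range(5)])
```

Larger activations are selected more often, while zero activations are never selected; the procedure introduces no hyper-parameters, in line with the description above.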
Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
Regularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.
We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Top-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using "generators" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the "feature flows" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) "zero-shot learning" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) "data augmentation" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training "inside the CNN".
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Motivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed "maxout" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called "channel-out" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the "harder" image classification benchmarks.
In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
Deep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers "know them when they see them" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
The backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
In this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).
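A minimal sketch of the normalized $L_p$ norm mentioned in the abstract above, assuming the common form $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$. The function name `lp_unit` is hypothetical, and the learned projections and per-unit learned order $p$ are omitted; this only illustrates how the order interpolates between average, root-mean-square and max pooling.

```python
def lp_unit(inputs, p):
    """Normalized L_p norm of a list of pre-activations (assumed form)."""
    if p <= 0:
        raise ValueError("p must be positive")
    n = len(inputs)
    # Mean of |x_i|^p, then the p-th root.
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)


values = [3.0, -4.0]
mean_abs = lp_unit(values, 1)    # p = 1: average of absolute values
rms = lp_unit(values, 2)         # p = 2: root-mean-square
near_max = lp_unit(values, 200)  # large p: approaches max |x_i|
```

Note that with $p = 1$ the unit averages absolute values, so it matches plain average pooling only for non-negative inputs.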
Deep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.
Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.
We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
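The masking idea in the abstract above can be sketched for a single hidden layer. This is an illustrative construction under stated assumptions, not the authors' code: each unit is assigned an integer "degree", and a connection is kept only if it cannot leak information from an input to the output that reconstructs it. The names `made_masks` and `count_paths` are hypothetical, and the hidden degrees are assigned deterministically here for reproducibility.

```python
def made_masks(num_inputs, num_hidden):
    """Build binary masks for a one-hidden-layer autoregressive autoencoder."""
    if num_inputs < 2:
        raise ValueError("need at least two inputs")
    # Input d gets degree d + 1 (natural ordering of the inputs).
    in_deg = list(range(1, num_inputs + 1))
    # Hidden degrees cycle through 1 .. num_inputs - 1 (an assumption made
    # for determinism; they could also be sampled at random).
    hid_deg = [1 + (k % (num_inputs - 1)) for k in range(num_hidden)]
    # Hidden unit k may see input d only if its degree is >= the input's.
    mask_in = [[int(hid_deg[k] >= in_deg[d]) for d in range(num_inputs)]
               for k in range(num_hidden)]
    # Output d may see hidden unit k only if its degree is strictly greater.
    mask_out = [[int(in_deg[d] > hid_deg[k]) for k in range(num_hidden)]
                for d in range(num_inputs)]
    return mask_in, mask_out


def count_paths(mask_in, mask_out):
    """paths[d][j] = number of hidden units connecting input j to output d."""
    num_hidden = len(mask_in)
    num_inputs = len(mask_out)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for d in range(num_inputs)]


mask_in, mask_out = made_masks(num_inputs=4, num_hidden=6)
paths = count_paths(mask_in, mask_out)
```

In this sketch `paths[d][j]` is zero whenever `j >= d`, so output `d` depends only on inputs that precede it in the ordering; the masks would be multiplied elementwise with the weight matrices of the two layers.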
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.
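The "more than 13% relative improvement" quoted above follows from the two WER figures in the abstract. A short worked check, assuming the usual definition of relative improvement as the WER reduction divided by the baseline WER (the variable names are illustrative):

```python
baseline_wer = 4.54   # baseline small DNN, from the abstract
distilled_wer = 3.93  # small DNN trained on soft RNN alignments

# Relative improvement = (baseline - new) / baseline.
relative_improvement = (baseline_wer - distilled_wer) / baseline_wer
print(f"{relative_improvement:.1%}")  # prints 13.4%
```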
Recurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.
In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
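The abstract above does not spell out the compression technique. As a hedged illustration only, the sketch below assumes the temperature-scaled softmax commonly used to produce "soft targets" when distilling one model into another; the function name `softmax_with_temperature` and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (assumed technique)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [5.0, 2.0, 1.0]
hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=10.0)
```

A higher temperature flattens the distribution, so `soft` assigns more probability to the non-maximal classes than `hard` does; a student model trained on such targets sees more of the teacher's relative preferences among classes.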
Deep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
In this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\'e Paris 1
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).

================================================
FILE: 2017/examples/data/heart.csv
================================================
sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
160,12,5.73,23.11,Present,49,25.3,97.2,52,1
144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1
134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1
132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0
142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0
114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1
114,0,3.83,19.4,Present,49,24.86,2.49,29,0
132,0,5.8,30.96,Present,69,30.11,0,53,1
206,6,2.95,32.27,Absent,72,26.81,56.06,60,1
134,14.1,4.44,22.39,Present,65,23.09,0,40,1
118,0,1.88,10.05,Absent,59,21.57,0,17,0
132,0,1.87,17.21,Absent,49,23.63,0.97,15,0
112,9.65,2.29,17.2,Present,54,23.53,0.68,53,0
117,1.53,2.44,28.95,Present,35,25.89,30.03,46,0
120,7.5,15.33,22,Absent,60,25.31,34.49,49,0
146,10.5,8.29,35.36,Present,78,32.73,13.89,53,1
158,2.6,7.46,34.07,Present,61,29.3,53.28,62,1
124,14,6.23,35.96,Present,45,30.09,0,59,1
106,1.61,1.74,12.32,Absent,74,20.92,13.37,20,1
132,7.9,2.85,26.5,Present,51,26.16,25.71,44,0
150,0.3,6.38,33.99,Present,62,24.64,0,50,0
138,0.6,3.81,28.66,Absent,54,28.7,1.46,58,0
142,18.2,4.34,24.38,Absent,61,26.19,0,50,0
124,4,12.42,31.29,Present,54,23.23,2.06,42,1
118,6,9.65,33.91,Absent,60,38.8,0,48,0
145,9.1,5.24,27.55,Absent,59,20.96,21.6,61,1
144,4.09,5.55,31.4,Present,60,29.43,5.55,56,0
146,0,6.62,25.69,Absent,60,28.07,8.23,63,1
136,2.52,3.95,25.63,Absent,51,21.86,0,45,1
158,1.02,6.33,23.88,Absent,66,22.13,24.99,46,1
122,6.6,5.58,35.95,Present,53,28.07,12.55,59,1
126,8.75,6.53,34.02,Absent,49,30.25,0,41,1
148,5.5,7.1,25.31,Absent,56,29.84,3.6,48,0
122,4.26,4.44,13.04,Absent,57,19.49,48.99,28,1
140,3.9,7.32,25.05,Absent,47,27.36,36.77,32,0
110,4.64,4.55,30.46,Absent,48,30.9,15.22,46,0
130,0,2.82,19.63,Present,70,24.86,0,29,0
136,11.2,5.81,31.85,Present,75,27.68,22.94,58,1
118,0.28,5.8,33.7,Present,60,30.98,0,41,1
144,0.04,3.38,23.61,Absent,30,23.75,4.66,30,0
120,0,1.07,16.02,Absent,47,22.15,0,15,0
130,2.61,2.72,22.99,Present,51,26.29,13.37,51,1
114,0,2.99,9.74,Absent,54,46.58,0,17,0
128,4.65,3.31,22.74,Absent,62,22.95,0.51,48,0
162,7.4,8.55,24.65,Present,64,25.71,5.86,58,1
116,1.91,7.56,26.45,Present,52,30.01,3.6,33,1
114,0,1.94,11.02,Absent,54,20.17,38.98,16,0
126,3.8,3.88,31.79,Absent,57,30.53,0,30,0
122,0,5.75,30.9,Present,46,29.01,4.11,42,0
134,2.5,3.66,30.9,Absent,52,27.19,23.66,49,0
152,0.9,9.12,30.23,Absent,56,28.64,0.37,42,1
134,8.08,1.55,17.5,Present,56,22.65,66.65,31,1
156,3,1.82,27.55,Absent,60,23.91,54,53,0
152,5.99,7.99,32.48,Absent,45,26.57,100.32,48,0
118,0,2.99,16.17,Absent,49,23.83,3.22,28,0
126,5.1,2.96,26.5,Absent,55,25.52,12.34,38,1
103,0.03,4.21,18.96,Absent,48,22.94,2.62,18,0
121,0.8,5.29,18.95,Present,47,22.51,0,61,0
142,0.28,1.8,21.03,Absent,57,23.65,2.93,33,0
138,1.15,5.09,27.87,Present,61,25.65,2.34,44,0
152,10.1,4.71,24.65,Present,65,26.21,24.53,57,0
140,0.45,4.3,24.33,Absent,41,27.23,10.08,38,0
130,0,1.82,10.45,Absent,57,22.07,2.06,17,0
136,7.36,2.19,28.11,Present,61,25,61.71,54,0
124,4.82,3.24,21.1,Present,48,28.49,8.42,30,0
112,0.41,1.88,10.29,Absent,39,22.08,20.98,27,0
118,4.46,7.27,29.13,Present,48,29.01,11.11,33,0
122,0,3.37,16.1,Absent,67,21.06,0,32,1
118,0,3.67,12.13,Absent,51,19.15,0.6,15,0
130,1.72,2.66,10.38,Absent,68,17.81,11.1,26,0
130,5.6,3.37,24.8,Absent,58,25.76,43.2,36,0
126,0.09,5.03,13.27,Present,50,17.75,4.63,20,0
128,0.4,6.17,26.35,Absent,64,27.86,11.11,34,0
136,0,4.12,17.42,Absent,52,21.66,12.86,40,0
134,0,5.9,30.84,Absent,49,29.16,0,55,0
140,0.6,5.56,33.39,Present,58,27.19,0,55,1
168,4.5,6.68,28.47,Absent,43,24.25,24.38,56,1
108,0.4,5.91,22.92,Present,57,25.72,72,39,0
114,3,7.04,22.64,Present,55,22.59,0,45,1
140,8.14,4.93,42.49,Absent,53,45.72,6.43,53,1
148,4.8,6.09,36.55,Present,63,25.44,0.88,55,1
148,12.2,3.79,34.15,Absent,57,26.38,14.4,57,1
128,0,2.43,13.15,Present,63,20.75,0,17,0
130,0.56,3.3,30.86,Absent,49,27.52,33.33,45,0
126,10.5,4.49,17.33,Absent,67,19.37,0,49,1
140,0,5.08,27.33,Present,41,27.83,1.25,38,0
126,0.9,5.64,17.78,Present,55,21.94,0,41,0
122,0.72,4.04,32.38,Absent,34,28.34,0,55,0
116,1.03,2.83,10.85,Absent,45,21.59,1.75,21,0
120,3.7,4.02,39.66,Absent,61,30.57,0,64,1
143,0.46,2.4,22.87,Absent,62,29.17,15.43,29,0
118,4,3.95,18.96,Absent,54,25.15,8.33,49,1
194,1.7,6.32,33.67,Absent,47,30.16,0.19,56,0
134,3,4.37,23.07,Absent,56,20.54,9.65,62,0
138,2.16,4.9,24.83,Present,39,26.06,28.29,29,0
136,0,5,27.58,Present,49,27.59,1.47,39,0
122,3.2,11.32,35.36,Present,55,27.07,0,51,1
164,12,3.91,19.59,Absent,51,23.44,19.75,39,0
136,8,7.85,23.81,Present,51,22.69,2.78,50,0
166,0.07,4.03,29.29,Absent,53,28.37,0,27,0
118,0,4.34,30.12,Present,52,32.18,3.91,46,0
128,0.42,4.6,26.68,Absent,41,30.97,10.33,31,0
118,1.5,5.38,25.84,Absent,64,28.63,3.89,29,0
158,3.6,2.97,30.11,Absent,63,26.64,108,64,0
108,1.5,4.33,24.99,Absent,66,22.29,21.6,61,1
170,7.6,5.5,37.83,Present,42,37.41,6.17,54,1
118,1,5.76,22.1,Absent,62,23.48,7.71,42,0
124,0,3.04,17.33,Absent,49,22.04,0,18,0
114,0,8.01,21.64,Absent,66,25.51,2.49,16,0
168,9,8.53,24.48,Present,69,26.18,4.63,54,1
134,2,3.66,14.69,Absent,52,21.03,2.06,37,0
174,0,8.46,35.1,Present,35,25.27,0,61,1
116,31.2,3.17,14.99,Absent,47,19.4,49.06,59,1
128,0,10.58,31.81,Present,46,28.41,14.66,48,0
140,4.5,4.59,18.01,Absent,63,21.91,22.09,32,1
154,0.7,5.91,25,Absent,13,20.6,0,42,0
150,3.5,6.99,25.39,Present,50,23.35,23.48,61,1
130,0,3.92,25.55,Absent,68,28.02,0.68,27,0
128,2,6.13,21.31,Absent,66,22.86,11.83,60,0
120,1.4,6.25,20.47,Absent,60,25.85,8.51,28,0
120,0,5.01,26.13,Absent,64,26.21,12.24,33,0
138,4.5,2.85,30.11,Absent,55,24.78,24.89,56,1
153,7.8,3.96,25.73,Absent,54,25.91,27.03,45,0
123,8.6,11.17,35.28,Present,70,33.14,0,59,1
148,4.04,3.99,20.69,Absent,60,27.78,1.75,28,0
136,3.96,2.76,30.28,Present,50,34.42,18.51,38,0
134,8.8,7.41,26.84,Absent,35,29.44,29.52,60,1
152,12.18,4.04,37.83,Present,63,34.57,4.17,64,0
158,13.5,5.04,30.79,Absent,54,24.79,21.5,62,0
132,2,3.08,35.39,Absent,45,31.44,79.82,58,1
134,1.5,3.73,21.53,Absent,41,24.7,11.11,30,1
142,7.44,5.52,33.97,Absent,47,29.29,24.27,54,0
134,6,3.3,28.45,Absent,65,26.09,58.11,40,0
122,4.18,9.05,29.27,Present,44,24.05,19.34,52,1
116,2.7,3.69,13.52,Absent,55,21.13,18.51,32,0
128,0.5,3.7,12.81,Present,66,21.25,22.73,28,0
120,0,3.68,12.24,Absent,51,20.52,0.51,20,0
124,0,3.95,36.35,Present,59,32.83,9.59,54,0
160,14,5.9,37.12,Absent,58,33.87,3.52,54,1
130,2.78,4.89,9.39,Present,63,19.3,17.47,25,1
128,2.8,5.53,14.29,Absent,64,24.97,0.51,38,0
130,4.5,5.86,37.43,Absent,61,31.21,32.3,58,0
109,1.2,6.14,29.26,Absent,47,24.72,10.46,40,0
144,0,3.84,18.72,Absent,56,22.1,4.8,40,0
118,1.05,3.16,12.98,Present,46,22.09,16.35,31,0
136,3.46,6.38,32.25,Present,43,28.73,3.13,43,1
136,1.5,6.06,26.54,Absent,54,29.38,14.5,33,1
124,15.5,5.05,24.06,Absent,46,23.22,0,61,1
148,6,6.49,26.47,Absent,48,24.7,0,55,0
128,6.6,3.58,20.71,Absent,55,24.15,0,52,0
122,0.28,4.19,19.97,Absent,61,25.63,0,24,0
108,0,2.74,11.17,Absent,53,22.61,0.95,20,0
124,3.04,4.8,19.52,Present,60,21.78,147.19,41,1
138,8.8,3.12,22.41,Present,63,23.33,120.03,55,1
127,0,2.81,15.7,Absent,42,22.03,1.03,17,0
174,9.45,5.13,35.54,Absent,55,30.71,59.79,53,0
122,0,3.05,23.51,Absent,46,25.81,0,38,0
144,6.75,5.45,29.81,Absent,53,25.62,26.23,43,1
126,1.8,6.22,19.71,Absent,65,24.81,0.69,31,0
208,27.4,3.12,26.63,Absent,66,27.45,33.07,62,1
138,0,2.68,17.04,Absent,42,22.16,0,16,0
148,0,3.84,17.26,Absent,70,20,0,21,0
122,0,3.08,16.3,Absent,43,22.13,0,16,0
132,7,3.2,23.26,Absent,77,23.64,23.14,49,0
110,12.16,4.99,28.56,Absent,44,27.14,21.6,55,1
160,1.52,8.12,29.3,Present,54,25.87,12.86,43,1
126,0.54,4.39,21.13,Present,45,25.99,0,25,0
162,5.3,7.95,33.58,Present,58,36.06,8.23,48,0
194,2.55,6.89,33.88,Present,69,29.33,0,41,0
118,0.75,2.58,20.25,Absent,59,24.46,0,32,0
124,0,4.79,34.71,Absent,49,26.09,9.26,47,0
160,0,2.42,34.46,Absent,48,29.83,1.03,61,0
128,0,2.51,29.35,Present,53,22.05,1.37,62,0
122,4,5.24,27.89,Present,45,26.52,0,61,1
132,2,2.7,21.57,Present,50,27.95,9.26,37,0
120,0,2.42,16.66,Absent,46,20.16,0,17,0
128,0.04,8.22,28.17,Absent,65,26.24,11.73,24,0
108,15,4.91,34.65,Absent,41,27.96,14.4,56,0
166,0,4.31,34.27,Absent,45,30.14,13.27,56,0
152,0,6.06,41.05,Present,51,40.34,0,51,0
170,4.2,4.67,35.45,Present,50,27.14,7.92,60,1
156,4,2.05,19.48,Present,50,21.48,27.77,39,1
116,8,6.73,28.81,Present,41,26.74,40.94,48,1
122,4.4,3.18,11.59,Present,59,21.94,0,33,1
150,20,6.4,35.04,Absent,53,28.88,8.33,63,0
129,2.15,5.17,27.57,Absent,52,25.42,2.06,39,0
134,4.8,6.58,29.89,Present,55,24.73,23.66,63,0
126,0,5.98,29.06,Present,56,25.39,11.52,64,1
142,0,3.72,25.68,Absent,48,24.37,5.25,40,1
128,0.7,4.9,37.42,Present,72,35.94,3.09,49,1
102,0.4,3.41,17.22,Present,56,23.59,2.06,39,1
130,0,4.89,25.98,Absent,72,30.42,14.71,23,0
138,0.05,2.79,10.35,Absent,46,21.62,0,18,0
138,0,1.96,11.82,Present,54,22.01,8.13,21,0
128,0,3.09,20.57,Absent,54,25.63,0.51,17,0
162,2.92,3.63,31.33,Absent,62,31.59,18.51,42,0
160,3,9.19,26.47,Present,39,28.25,14.4,54,1
148,0,4.66,24.39,Absent,50,25.26,4.03,27,0
124,0.16,2.44,16.67,Absent,65,24.58,74.91,23,0
136,3.15,4.37,20.22,Present,59,25.12,47.16,31,1
134,2.75,5.51,26.17,Absent,57,29.87,8.33,33,0
128,0.73,3.97,23.52,Absent,54,23.81,19.2,64,0
122,3.2,3.59,22.49,Present,45,24.96,36.17,58,0
152,3,4.64,31.29,Absent,41,29.34,4.53,40,0
162,0,5.09,24.6,Present,64,26.71,3.81,18,0
124,4,6.65,30.84,Present,54,28.4,33.51,60,0
136,5.8,5.9,27.55,Absent,65,25.71,14.4,59,0
136,8.8,4.26,32.03,Present,52,31.44,34.35,60,0
134,0.05,8.03,27.95,Absent,48,26.88,0,60,0
122,1,5.88,34.81,Present,69,31.27,15.94,40,1
116,3,3.05,30.31,Absent,41,23.63,0.86,44,0
132,0,0.98,21.39,Absent,62,26.75,0,53,0
134,0,2.4,21.11,Absent,57,22.45,1.37,18,0
160,7.77,8.07,34.8,Absent,64,31.15,0,62,1
180,0.52,4.23,16.38,Absent,55,22.56,14.77,45,1
124,0.81,6.16,11.61,Absent,35,21.47,10.49,26,0
114,0,4.97,9.69,Absent,26,22.6,0,25,0
208,7.4,7.41,32.03,Absent,50,27.62,7.85,57,0
138,0,3.14,12,Absent,54,20.28,0,16,0
164,0.5,6.95,39.64,Present,47,41.76,3.81,46,1
144,2.4,8.13,35.61,Absent,46,27.38,13.37,60,0
136,7.5,7.39,28.04,Present,50,25.01,0,45,1
132,7.28,3.52,12.33,Absent,60,19.48,2.06,56,0
143,5.04,4.86,23.59,Absent,58,24.69,18.72,42,0
112,4.46,7.18,26.25,Present,69,27.29,0,32,1
134,10,3.79,34.72,Absent,42,28.33,28.8,52,1
138,2,5.11,31.4,Present,49,27.25,2.06,64,1
188,0,5.47,32.44,Present,71,28.99,7.41,50,1
110,2.35,3.36,26.72,Present,54,26.08,109.8,58,1
136,13.2,7.18,35.95,Absent,48,29.19,0,62,0
130,1.75,5.46,34.34,Absent,53,29.42,0,58,1
122,0,3.76,24.59,Absent,56,24.36,0,30,0
138,0,3.24,27.68,Absent,60,25.7,88.66,29,0
130,18,4.13,27.43,Absent,54,27.44,0,51,1
126,5.5,3.78,34.15,Absent,55,28.85,3.18,61,0
176,5.76,4.89,26.1,Present,46,27.3,19.44,57,0
122,0,5.49,19.56,Absent,57,23.12,14.02,27,0
124,0,3.23,9.64,Absent,59,22.7,0,16,0
140,5.2,3.58,29.26,Absent,70,27.29,20.17,45,1
128,6,4.37,22.98,Present,50,26.01,0,47,0
190,4.18,5.05,24.83,Absent,45,26.09,82.85,41,0
144,0.76,10.53,35.66,Absent,63,34.35,0,55,1
126,4.6,7.4,31.99,Present,57,28.67,0.37,60,1
128,0,2.63,23.88,Absent,45,21.59,6.54,57,0
136,0.4,3.91,21.1,Present,63,22.3,0,56,1
158,4,4.18,28.61,Present,42,25.11,0,60,0
160,0.6,6.94,30.53,Absent,36,25.68,1.42,64,0
124,6,5.21,33.02,Present,64,29.37,7.61,58,1
158,6.17,8.12,30.75,Absent,46,27.84,92.62,48,0
128,0,6.34,11.87,Absent,57,23.14,0,17,0
166,3,3.82,26.75,Absent,45,20.86,0,63,1
146,7.5,7.21,25.93,Present,55,22.51,0.51,42,0
161,9,4.65,15.16,Present,58,23.76,43.2,46,0
164,13.02,6.26,29.38,Present,47,22.75,37.03,54,1
146,5.08,7.03,27.41,Present,63,36.46,24.48,37,1
142,4.48,3.57,19.75,Present,51,23.54,3.29,49,0
138,12,5.13,28.34,Absent,59,24.49,32.81,58,1
154,1.8,7.13,34.04,Present,52,35.51,39.36,44,0
118,0,2.39,12.13,Absent,49,18.46,0.26,17,1
124,0.61,2.69,17.15,Present,61,22.76,11.55,20,0
124,1.04,2.84,16.42,Present,46,20.17,0,61,0
136,5,4.19,23.99,Present,68,27.8,25.86,35,0
132,9.9,4.63,27.86,Present,46,23.39,0.51,52,1
118,0.12,1.96,20.31,Absent,37,20.01,2.42,18,0
118,0.12,4.16,9.37,Absent,57,19.61,0,17,0
134,12,4.96,29.79,Absent,53,24.86,8.23,57,0
114,0.1,3.95,15.89,Present,57,20.31,17.14,16,0
136,6.8,7.84,30.74,Present,58,26.2,23.66,45,1
130,0,4.16,39.43,Present,46,30.01,0,55,1
136,2.2,4.16,38.02,Absent,65,37.24,4.11,41,1
136,1.36,3.16,14.97,Present,56,24.98,7.3,24,0
154,4.2,5.59,25.02,Absent,58,25.02,1.54,43,0
108,0.8,2.47,17.53,Absent,47,22.18,0,55,1
136,8.8,4.69,36.07,Present,38,26.56,2.78,63,1
174,2.02,6.57,31.9,Present,50,28.75,11.83,64,1
124,4.25,8.22,30.77,Absent,56,25.8,0,43,0
114,0,2.63,9.69,Absent,45,17.89,0,16,0
118,0.12,3.26,12.26,Absent,55,22.65,0,16,0
106,1.08,4.37,26.08,Absent,67,24.07,17.74,28,1
146,3.6,3.51,22.67,Absent,51,22.29,43.71,42,0
206,0,4.17,33.23,Absent,69,27.36,6.17,50,1
134,3,3.17,17.91,Absent,35,26.37,15.12,27,0
148,15,4.98,36.94,Present,72,31.83,66.27,41,1
126,0.21,3.95,15.11,Absent,61,22.17,2.42,17,0
134,0,3.69,13.92,Absent,43,27.66,0,19,0
134,0.02,2.8,18.84,Absent,45,24.82,0,17,0
123,0.05,4.61,13.69,Absent,51,23.23,2.78,16,0
112,0.6,5.28,25.71,Absent,55,27.02,27.77,38,1
112,0,1.71,15.96,Absent,42,22.03,3.5,16,0
101,0.48,7.26,13,Absent,50,19.82,5.19,16,0
150,0.18,4.14,14.4,Absent,53,23.43,7.71,44,0
170,2.6,7.22,28.69,Present,71,27.87,37.65,56,1
134,0,5.63,29.12,Absent,68,32.33,2.02,34,0
142,0,4.19,18.04,Absent,56,23.65,20.78,42,1
132,0.1,3.28,10.73,Absent,73,20.42,0,17,0
136,0,2.28,18.14,Absent,55,22.59,0,17,0
132,12,4.51,21.93,Absent,61,26.07,64.8,46,1
166,4.1,4,34.3,Present,32,29.51,8.23,53,0
138,0,3.96,24.7,Present,53,23.8,0,45,0
138,2.27,6.41,29.07,Absent,58,30.22,2.93,32,1
170,0,3.12,37.15,Absent,47,35.42,0,53,0
128,0,8.41,28.82,Present,60,26.86,0,59,1
136,1.2,2.78,7.12,Absent,52,22.51,3.41,27,0
128,0,3.22,26.55,Present,39,26.59,16.71,49,0
150,14.4,5.04,26.52,Present,60,28.84,0,45,0
132,8.4,3.57,13.68,Absent,42,18.75,15.43,59,1
142,2.4,2.55,23.89,Absent,54,26.09,59.14,37,0
130,0.05,2.44,28.25,Present,67,30.86,40.32,34,0
174,3.5,5.26,21.97,Present,36,22.04,8.33,59,1
114,9.6,2.51,29.18,Absent,49,25.67,40.63,46,0
162,1.5,2.46,19.39,Present,49,24.32,0,59,1
174,0,3.27,35.4,Absent,58,37.71,24.95,44,0
190,5.15,6.03,36.59,Absent,42,30.31,72,50,0
154,1.4,1.72,18.86,Absent,58,22.67,43.2,59,0
124,0,2.28,24.86,Present,50,22.24,8.26,38,0
114,1.2,3.98,14.9,Absent,49,23.79,25.82,26,0
168,11.4,5.08,26.66,Present,56,27.04,2.61,59,1
142,3.72,4.24,32.57,Absent,52,24.98,7.61,51,0
154,0,4.81,28.11,Present,56,25.67,75.77,59,0
146,4.36,4.31,18.44,Present,47,24.72,10.8,38,0
166,6,3.02,29.3,Absent,35,24.38,38.06,61,0
140,8.6,3.9,32.16,Present,52,28.51,11.11,64,1
136,1.7,3.53,20.13,Absent,56,19.44,14.4,55,0
156,0,3.47,21.1,Absent,73,28.4,0,36,1
132,0,6.63,29.58,Present,37,29.41,2.57,62,0
128,0,2.98,12.59,Absent,65,20.74,2.06,19,0
106,5.6,3.2,12.3,Absent,49,20.29,0,39,0
144,0.4,4.64,30.09,Absent,30,27.39,0.74,55,0
154,0.31,2.33,16.48,Absent,33,24,11.83,17,0
126,3.1,2.01,32.97,Present,56,28.63,26.74,45,0
134,6.4,8.49,37.25,Present,56,28.94,10.49,51,1
152,19.45,4.22,29.81,Absent,28,23.95,0,59,1
146,1.35,6.39,34.21,Absent,51,26.43,0,59,1
162,6.94,4.55,33.36,Present,52,27.09,32.06,43,0
130,7.28,3.56,23.29,Present,20,26.8,51.87,58,1
138,6,7.24,37.05,Absent,38,28.69,0,59,0
148,0,5.32,26.71,Present,52,32.21,32.78,27,0
124,4.2,2.94,27.59,Absent,50,30.31,85.06,30,0
118,1.62,9.01,21.7,Absent,59,25.89,21.19,40,0
116,4.28,7.02,19.99,Present,68,23.31,0,52,1
162,6.3,5.73,22.61,Present,46,20.43,62.54,53,1
138,0.87,1.87,15.89,Absent,44,26.76,42.99,31,0
137,1.2,3.14,23.87,Absent,66,24.13,45,37,0
198,0.52,11.89,27.68,Present,48,28.4,78.99,26,1
154,4.5,4.75,23.52,Present,43,25.76,0,53,1
128,5.4,2.36,12.98,Absent,51,18.36,6.69,61,0
130,0.08,5.59,25.42,Present,50,24.98,6.27,43,1
162,5.6,4.24,22.53,Absent,29,22.91,5.66,60,0
120,10.5,2.7,29.87,Present,54,24.5,16.46,49,0
136,3.99,2.58,16.38,Present,53,22.41,27.67,36,0
176,1.2,8.28,36.16,Present,42,27.81,11.6,58,1
134,11.79,4.01,26.57,Present,38,21.79,38.88,61,1
122,1.7,5.28,32.23,Present,51,24.08,0,54,0
134,0.9,3.18,23.66,Present,52,23.26,27.36,58,1
134,0,2.43,22.24,Absent,52,26.49,41.66,24,0
136,6.6,6.08,32.74,Absent,64,33.28,2.72,49,0
132,4.05,5.15,26.51,Present,31,26.67,16.3,50,0
152,1.68,3.58,25.43,Absent,50,27.03,0,32,0
132,12.3,5.96,32.79,Present,57,30.12,21.5,62,1
124,0.4,3.67,25.76,Absent,43,28.08,20.57,34,0
140,4.2,2.91,28.83,Present,43,24.7,47.52,48,0
166,0.6,2.42,34.03,Present,53,26.96,54,60,0
156,3.02,5.35,25.72,Present,53,25.22,28.11,52,1
132,0.72,4.37,19.54,Absent,48,26.11,49.37,28,0
150,0,4.99,27.73,Absent,57,30.92,8.33,24,0
134,0.12,3.4,21.18,Present,33,26.27,14.21,30,0
126,3.4,4.87,15.16,Present,65,22.01,11.11,38,0
148,0.5,5.97,32.88,Absent,54,29.27,6.43,42,0
148,8.2,7.75,34.46,Present,46,26.53,6.04,64,1
132,6,5.97,25.73,Present,66,24.18,145.29,41,0
128,1.6,5.41,29.3,Absent,68,29.38,23.97,32,0
128,5.16,4.9,31.35,Present,57,26.42,0,64,0
140,0,2.4,27.89,Present,70,30.74,144,29,0
126,0,5.29,27.64,Absent,25,27.62,2.06,45,0
114,3.6,4.16,22.58,Absent,60,24.49,65.31,31,0
118,1.25,4.69,31.58,Present,52,27.16,4.11,53,0
126,0.96,4.99,29.74,Absent,66,33.35,58.32,38,0
154,4.5,4.68,39.97,Absent,61,33.17,1.54,64,1
112,1.44,2.71,22.92,Absent,59,24.81,0,52,0
140,8,4.42,33.15,Present,47,32.77,66.86,44,0
140,1.68,11.41,29.54,Present,74,30.75,2.06,38,1
128,2.6,4.94,21.36,Absent,61,21.3,0,31,0
126,19.6,6.03,34.99,Absent,49,26.99,55.89,44,0
160,4.2,6.76,37.99,Present,61,32.91,3.09,54,1
144,0,4.17,29.63,Present,52,21.83,0,59,0
148,4.5,10.49,33.27,Absent,50,25.92,2.06,53,1
146,0,4.92,18.53,Absent,57,24.2,34.97,26,0
164,5.6,3.17,30.98,Present,44,25.99,43.2,53,1
130,0.54,3.63,22.03,Present,69,24.34,12.86,39,1
154,2.4,5.63,42.17,Present,59,35.07,12.86,50,1
178,0.95,4.75,21.06,Absent,49,23.74,24.69,61,0
180,3.57,3.57,36.1,Absent,36,26.7,19.95,64,0
134,12.5,2.73,39.35,Absent,48,35.58,0,48,0
142,0,3.54,16.64,Absent,58,25.97,8.36,27,0
162,7,7.67,34.34,Present,33,30.77,0,62,0
218,11.2,2.77,30.79,Absent,38,24.86,90.93,48,1
126,8.75,6.06,32.72,Present,33,27,62.43,55,1
126,0,3.57,26.01,Absent,61,26.3,7.97,47,0
134,6.1,4.77,26.08,Absent,47,23.82,1.03,49,0
132,0,4.17,36.57,Absent,57,30.61,18,49,0
178,5.5,3.79,23.92,Present,45,21.26,6.17,62,1
208,5.04,5.19,20.71,Present,52,25.12,24.27,58,1
160,1.15,10.19,39.71,Absent,31,31.65,20.52,57,0
116,2.38,5.67,29.01,Present,54,27.26,15.77,51,0
180,25.01,3.7,38.11,Present,57,30.54,0,61,1
200,19.2,4.43,40.6,Present,55,32.04,36,60,1
112,4.2,3.58,27.14,Absent,52,26.83,2.06,40,0
120,0,3.1,26.97,Absent,41,24.8,0,16,0
178,20,9.78,33.55,Absent,37,27.29,2.88,62,1
166,0.8,5.63,36.21,Absent,50,34.72,28.8,60,0
164,8.2,14.16,36.85,Absent,52,28.5,17.02,55,1
216,0.92,2.66,19.85,Present,49,20.58,0.51,63,1
146,6.4,5.62,33.05,Present,57,31.03,0.74,46,0
134,1.1,3.54,20.41,Present,58,24.54,39.91,39,1
158,16,5.56,29.35,Absent,36,25.92,58.32,60,0
176,0,3.14,31.04,Present,45,30.18,4.63,45,0
132,2.8,4.79,20.47,Present,50,22.15,11.73,48,0
126,0,4.55,29.18,Absent,48,24.94,36,41,0
120,5.5,3.51,23.23,Absent,46,22.4,90.31,43,0
174,0,3.86,21.73,Absent,42,23.37,0,63,0
150,13.8,5.1,29.45,Present,52,27.92,77.76,55,1
176,6,3.98,17.2,Present,52,21.07,4.11,61,1
142,2.2,3.29,22.7,Absent,44,23.66,5.66,42,1
132,0,3.3,21.61,Absent,42,24.92,32.61,33,0
142,1.32,7.63,29.98,Present,57,31.16,72.93,33,0
146,1.16,2.28,34.53,Absent,50,28.71,45,49,0
132,7.2,3.65,17.16,Present,56,23.25,0,34,0
120,0,3.57,23.22,Absent,58,27.2,0,32,0
118,0,3.89,15.96,Absent,65,20.18,0,16,0
108,0,1.43,26.26,Absent,42,19.38,0,16,0
136,0,4,19.06,Absent,40,21.94,2.06,16,0
120,0,2.46,13.39,Absent,47,22.01,0.51,18,0
132,0,3.55,8.66,Present,61,18.5,3.87,16,0
136,0,1.77,20.37,Absent,45,21.51,2.06,16,0
138,0,1.86,18.35,Present,59,25.38,6.51,17,0
138,0.06,4.15,20.66,Absent,49,22.59,2.49,16,0
130,1.22,3.3,13.65,Absent,50,21.4,3.81,31,0
130,4,2.4,17.42,Absent,60,22.05,0,40,0
110,0,7.14,28.28,Absent,57,29,0,32,0
120,0,3.98,13.19,Present,47,21.89,0,16,0
166,6,8.8,37.89,Absent,39,28.7,43.2,52,0
134,0.57,4.75,23.07,Absent,67,26.33,0,37,0
142,3,3.69,25.1,Absent,60,30.08,38.88,27,0
136,2.8,2.53,9.28,Present,61,20.7,4.55,25,0
142,0,4.32,25.22,Absent,47,28.92,6.53,34,1
130,0,1.88,12.51,Present,52,20.28,0,17,0
124,1.8,3.74,16.64,Present,42,22.26,10.49,20,0
144,4,5.03,25.78,Present,57,27.55,90,48,1
136,1.81,3.31,6.74,Absent,63,19.57,24.94,24,0
120,0,2.77,13.35,Absent,67,23.37,1.03,18,0
154,5.53,3.2,28.81,Present,61,26.15,42.79,42,0
124,1.6,7.22,39.68,Present,36,31.5,0,51,1
146,0.64,4.82,28.02,Absent,60,28.11,8.23,39,1
128,2.24,2.83,26.48,Absent,48,23.96,47.42,27,1
170,0.4,4.11,42.06,Present,56,33.1,2.06,57,0
214,0.4,5.98,31.72,Absent,64,28.45,0,58,0
182,4.2,4.41,32.1,Absent,52,28.61,18.72,52,1
108,3,1.59,15.23,Absent,40,20.09,26.64,55,0
118,5.4,11.61,30.79,Absent,64,27.35,23.97,40,0
132,0,4.82,33.41,Present,62,14.7,0,46,1

================================================
FILE: 2017/examples/data/heart.txt
================================================
"sbp"	"tobacco"	"ldl"	"adiposity"	"famhist"	"typea"	"obesity"	"alcohol"	"age"	"chd"
160	12	5.73	23.11	"Present"	49	25.3	97.2	52	1
144	0.01	4.41	28.61	"Absent"	55	28.87	2.06	63	1
118	0.08	3.48	32.28	"Present"	52	29.14	3.81	46	0
170	7.5	6.41	38.03	"Present"	51	31.99	24.26	58	1
134	13.6	3.5	27.78	"Present"	60	25.99	57.34	49	1
132	6.2	6.47	36.21	"Present"	62	30.77	14.14	45	0
142	4.05	3.38	16.2	"Absent"	59	20.81	2.62	38	0
114	4.08	4.59	14.6	"Present"	62	23.11	6.72	58	1
114	0	3.83	19.4	"Present"	49	24.86	2.49	29	0
132	0	5.8	30.96	"Present"	69	30.11	0	53	1
206	6	2.95	32.27	"Absent"	72	26.81	56.06	60	1
134	14.1	4.44	22.39	"Present"	65	23.09	0	40	1
118	0	1.88	10.05	"Absent"	59	21.57	0	17	0
132	0	1.87	17.21	"Absent"	49	23.63	0.97	15	0
112	9.65	2.29	17.2	"Present"	54	23.53	0.68	53	0
117	1.53	2.44	28.95	"Present"	35	25.89	30.03	46	0
120	7.5	15.33	22	"Absent"	60	25.31	34.49	49	0
146	10.5	8.29	35.36	"Present"	78	32.73	13.89	53	1
158	2.6	7.46	34.07	"Present"	61	29.3	53.28	62	1
124	14	6.23	35.96	"Present"	45	30.09	0	59	1
106	1.61	1.74	12.32	"Absent"	74	20.92	13.37	20	1
132	7.9	2.85	26.5	"Present"	51	26.16	25.71	44	0
150	0.3	6.38	33.99	"Present"	62	24.64	0	50	0
138	0.6	3.81	28.66	"Absent"	54	28.7	1.46	58	0
142	18.2	4.34	24.38	"Absent"	61	26.19	0	50	0
124	4	12.42	31.29	"Present"	54	23.23	2.06	42	1
118	6	9.65	33.91	"Absent"	60	38.8	0	48	0
145	9.1	5.24	27.55	"Absent"	59	20.96	21.6	61	1
144	4.09	5.55	31.4	"Present"	60	29.43	5.55	56	0
146	0	6.62	25.69	"Absent"	60	28.07	8.23	63	1
136	2.52	3.95	25.63	"Absent"	51	21.86	0	45	1
158	1.02	6.33	23.88	"Absent"	66	22.13	24.99	46	1
122	6.6	5.58	35.95	"Present"	53	28.07	12.55	59	1
126	8.75	6.53	34.02	"Absent"	49	30.25	0	41	1
148	5.5	7.1	25.31	"Absent"	56	29.84	3.6	48	0
122	4.26	4.44	13.04	"Absent"	57	19.49	48.99	28	1
140	3.9	7.32	25.05	"Absent"	47	27.36	36.77	32	0
110	4.64	4.55	30.46	"Absent"	48	30.9	15.22	46	0
130	0	2.82	19.63	"Present"	70	24.86	0	29	0
136	11.2	5.81	31.85	"Present"	75	27.68	22.94	58	1
118	0.28	5.8	33.7	"Present"	60	30.98	0	41	1
144	0.04	3.38	23.61	"Absent"	30	23.75	4.66	30	0
120	0	1.07	16.02	"Absent"	47	22.15	0	15	0
130	2.61	2.72	22.99	"Present"	51	26.29	13.37	51	1
114	0	2.99	9.74	"Absent"	54	46.58	0	17	0
128	4.65	3.31	22.74	"Absent"	62	22.95	0.51	48	0
162	7.4	8.55	24.65	"Present"	64	25.71	5.86	58	1
116	1.91	7.56	26.45	"Present"	52	30.01	3.6	33	1
114	0	1.94	11.02	"Absent"	54	20.17	38.98	16	0
126	3.8	3.88	31.79	"Absent"	57	30.53	0	30	0
122	0	5.75	30.9	"Present"	46	29.01	4.11	42	0
134	2.5	3.66	30.9	"Absent"	52	27.19	23.66	49	0
152	0.9	9.12	30.23	"Absent"	56	28.64	0.37	42	1
134	8.08	1.55	17.5	"Present"	56	22.65	66.65	31	1
156	3	1.82	27.55	"Absent"	60	23.91	54	53	0
152	5.99	7.99	32.48	"Absent"	45	26.57	100.32	48	0
118	0	2.99	16.17	"Absent"	49	23.83	3.22	28	0
126	5.1	2.96	26.5	"Absent"	55	25.52	12.34	38	1
103	0.03	4.21	18.96	"Absent"	48	22.94	2.62	18	0
121	0.8	5.29	18.95	"Present"	47	22.51	0	61	0
142	0.28	1.8	21.03	"Absent"	57	23.65	2.93	33	0
138	1.15	5.09	27.87	"Present"	61	25.65	2.34	44	0
152	10.1	4.71	24.65	"Present"	65	26.21	24.53	57	0
140	0.45	4.3	24.33	"Absent"	41	27.23	10.08	38	0
130	0	1.82	10.45	"Absent"	57	22.07	2.06	17	0
136	7.36	2.19	28.11	"Present"	61	25	61.71	54	0
124	4.82	3.24	21.1	"Present"	48	28.49	8.42	30	0
112	0.41	1.88	10.29	"Absent"	39	22.08	20.98	27	0
118	4.46	7.27	29.13	"Present"	48	29.01	11.11	33	0
122	0	3.37	16.1	"Absent"	67	21.06	0	32	1
118	0	3.67	12.13	"Absent"	51	19.15	0.6	15	0
130	1.72	2.66	10.38	"Absent"	68	17.81	11.1	26	0
130	5.6	3.37	24.8	"Absent"	58	25.76	43.2	36	0
126	0.09	5.03	13.27	"Present"	50	17.75	4.63	20	0
128	0.4	6.17	26.35	"Absent"	64	27.86	11.11	34	0
136	0	4.12	17.42	"Absent"	52	21.66	12.86	40	0
134	0	5.9	30.84	"Absent"	49	29.16	0	55	0
140	0.6	5.56	33.39	"Present"	58	27.19	0	55	1
168	4.5	6.68	28.47	"Absent"	43	24.25	24.38	56	1
108	0.4	5.91	22.92	"Present"	57	25.72	72	39	0
114	3	7.04	22.64	"Present"	55	22.59	0	45	1
140	8.14	4.93	42.49	"Absent"	53	45.72	6.43	53	1
148	4.8	6.09	36.55	"Present"	63	25.44	0.88	55	1
148	12.2	3.79	34.15	"Absent"	57	26.38	14.4	57	1
128	0	2.43	13.15	"Present"	63	20.75	0	17	0
130	0.56	3.3	30.86	"Absent"	49	27.52	33.33	45	0
126	10.5	4.49	17.33	"Absent"	67	19.37	0	49	1
140	0	5.08	27.33	"Present"	41	27.83	1.25	38	0
126	0.9	5.64	17.78	"Present"	55	21.94	0	41	0
122	0.72	4.04	32.38	"Absent"	34	28.34	0	55	0
116	1.03	2.83	10.85	"Absent"	45	21.59	1.75	21	0
120	3.7	4.02	39.66	"Absent"	61	30.57	0	64	1
143	0.46	2.4	22.87	"Absent"	62	29.17	15.43	29	0
118	4	3.95	18.96	"Absent"	54	25.15	8.33	49	1
194	1.7	6.32	33.67	"Absent"	47	30.16	0.19	56	0
134	3	4.37	23.07	"Absent"	56	20.54	9.65	62	0
138	2.16	4.9	24.83	"Present"	39	26.06	28.29	29	0
136	0	5	27.58	"Present"	49	27.59	1.47	39	0
122	3.2	11.32	35.36	"Present"	55	27.07	0	51	1
164	12	3.91	19.59	"Absent"	51	23.44	19.75	39	0
136	8	7.85	23.81	"Present"	51	22.69	2.78	50	0
166	0.07	4.03	29.29	"Absent"	53	28.37	0	27	0
118	0	4.34	30.12	"Present"	52	32.18	3.91	46	0
128	0.42	4.6	26.68	"Absent"	41	30.97	10.33	31	0
118	1.5	5.38	25.84	"Absent"	64	28.63	3.89	29	0
158	3.6	2.97	30.11	"Absent"	63	26.64	108	64	0
108	1.5	4.33	24.99	"Absent"	66	22.29	21.6	61	1
170	7.6	5.5	37.83	"Present"	42	37.41	6.17	54	1
118	1	5.76	22.1	"Absent"	62	23.48	7.71	42	0
124	0	3.04	17.33	"Absent"	49	22.04	0	18	0
114	0	8.01	21.64	"Absent"	66	25.51	2.49	16	0
168	9	8.53	24.48	"Present"	69	26.18	4.63	54	1
134	2	3.66	14.69	"Absent"	52	21.03	2.06	37	0
174	0	8.46	35.1	"Present"	35	25.27	0	61	1
116	31.2	3.17	14.99	"Absent"	47	19.4	49.06	59	1
128	0	10.58	31.81	"Present"	46	28.41	14.66	48	0
140	4.5	4.59	18.01	"Absent"	63	21.91	22.09	32	1
154	0.7	5.91	25	"Absent"	13	20.6	0	42	0
150	3.5	6.99	25.39	"Present"	50	23.35	23.48	61	1
130	0	3.92	25.55	"Absent"	68	28.02	0.68	27	0
128	2	6.13	21.31	"Absent"	66	22.86	11.83	60	0
120	1.4	6.25	20.47	"Absent"	60	25.85	8.51	28	0
120	0	5.01	26.13	"Absent"	64	26.21	12.24	33	0
138	4.5	2.85	30.11	"Absent"	55	24.78	24.89	56	1
153	7.8	3.96	25.73	"Absent"	54	25.91	27.03	45	0
123	8.6	11.17	35.28	"Present"	70	33.14	0	59	1
148	4.04	3.99	20.69	"Absent"	60	27.78	1.75	28	0
136	3.96	2.76	30.28	"Present"	50	34.42	18.51	38	0
134	8.8	7.41	26.84	"Absent"	35	29.44	29.52	60	1
152	12.18	4.04	37.83	"Present"	63	34.57	4.17	64	0
158	13.5	5.04	30.79	"Absent"	54	24.79	21.5	62	0
132	2	3.08	35.39	"Absent"	45	31.44	79.82	58	1
134	1.5	3.73	21.53	"Absent"	41	24.7	11.11	30	1
142	7.44	5.52	33.97	"Absent"	47	29.29	24.27	54	0
134	6	3.3	28.45	"Absent"	65	26.09	58.11	40	0
122	4.18	9.05	29.27	"Present"	44	24.05	19.34	52	1
116	2.7	3.69	13.52	"Absent"	55	21.13	18.51	32	0
128	0.5	3.7	12.81	"Present"	66	21.25	22.73	28	0
120	0	3.68	12.24	"Absent"	51	20.52	0.51	20	0
124	0	3.95	36.35	"Present"	59	32.83	9.59	54	0
160	14	5.9	37.12	"Absent"	58	33.87	3.52	54	1
130	2.78	4.89	9.39	"Present"	63	19.3	17.47	25	1
128	2.8	5.53	14.29	"Absent"	64	24.97	0.51	38	0
130	4.5	5.86	37.43	"Absent"	61	31.21	32.3	58	0
109	1.2	6.14	29.26	"Absent"	47	24.72	10.46	40	0
144	0	3.84	18.72	"Absent"	56	22.1	4.8	40	0
118	1.05	3.16	12.98	"Present"	46	22.09	16.35	31	0
136	3.46	6.38	32.25	"Present"	43	28.73	3.13	43	1
136	1.5	6.06	26.54	"Absent"	54	29.38	14.5	33	1
124	15.5	5.05	24.06	"Absent"	46	23.22	0	61	1
148	6	6.49	26.47	"Absent"	48	24.7	0	55	0
128	6.6	3.58	20.71	"Absent"	55	24.15	0	52	0
122	0.28	4.19	19.97	"Absent"	61	25.63	0	24	0
108	0	2.74	11.17	"Absent"	53	22.61	0.95	20	0
124	3.04	4.8	19.52	"Present"	60	21.78	147.19	41	1
138	8.8	3.12	22.41	"Present"	63	23.33	120.03	55	1
127	0	2.81	15.7	"Absent"	42	22.03	1.03	17	0
174	9.45	5.13	35.54	"Absent"	55	30.71	59.79	53	0
122	0	3.05	23.51	"Absent"	46	25.81	0	38	0
144	6.75	5.45	29.81	"Absent"	53	25.62	26.23	43	1
126	1.8	6.22	19.71	"Absent"	65	24.81	0.69	31	0
208	27.4	3.12	26.63	"Absent"	66	27.45	33.07	62	1
138	0	2.68	17.04	"Absent"	42	22.16	0	16	0
148	0	3.84	17.26	"Absent"	70	20	0	21	0
122	0	3.08	16.3	"Absent"	43	22.13	0	16	0
132	7	3.2	23.26	"Absent"	77	23.64	23.14	49	0
110	12.16	4.99	28.56	"Absent"	44	27.14	21.6	55	1
160	1.52	8.12	29.3	"Present"	54	25.87	12.86	43	1
126	0.54	4.39	21.13	"Present"	45	25.99	0	25	0
162	5.3	7.95	33.58	"Present"	58	36.06	8.23	48	0
194	2.55	6.89	33.88	"Present"	69	29.33	0	41	0
118	0.75	2.58	20.25	"Absent"	59	24.46	0	32	0
124	0	4.79	34.71	"Absent"	49	26.09	9.26	47	0
160	0	2.42	34.46	"Absent"	48	29.83	1.03	61	0
128	0	2.51	29.35	"Present"	53	22.05	1.37	62	0
122	4	5.24	27.89	"Present"	45	26.52	0	61	1
132	2	2.7	21.57	"Present"	50	27.95	9.26	37	0
120	0	2.42	16.66	"Absent"	46	20.16	0	17	0
128	0.04	8.22	28.17	"Absent"	65	26.24	11.73	24	0
108	15	4.91	34.65	"Absent"	41	27.96	14.4	56	0
166	0	4.31	34.27	"Absent"	45	30.14	13.27	56	0
152	0	6.06	41.05	"Present"	51	40.34	0	51	0
170	4.2	4.67	35.45	"Present"	50	27.14	7.92	60	1
156	4	2.05	19.48	"Present"	50	21.48	27.77	39	1
116	8	6.73	28.81	"Present"	41	26.74	40.94	48	1
122	4.4	3.18	11.59	"Present"	59	21.94	0	33	1
150	20	6.4	35.04	"Absent"	53	28.88	8.33	63	0
129	2.15	5.17	27.57	"Absent"	52	25.42	2.06	39	0
134	4.8	6.58	29.89	"Present"	55	24.73	23.66	63	0
126	0	5.98	29.06	"Present"	56	25.39	11.52	64	1
142	0	3.72	25.68	"Absent"	48	24.37	5.25	40	1
128	0.7	4.9	37.42	"Present"	72	35.94	3.09	49	1
102	0.4	3.41	17.22	"Present"	56	23.59	2.06	39	1
130	0	4.89	25.98	"Absent"	72	30.42	14.71	23	0
138	0.05	2.79	10.35	"Absent"	46	21.62	0	18	0
138	0	1.96	11.82	"Present"	54	22.01	8.13	21	0
128	0	3.09	20.57	"Absent"	54	25.63	0.51	17	0
162	2.92	3.63	31.33	"Absent"	62	31.59	18.51	42	0
160	3	9.19	26.47	"Present"	39	28.25	14.4	54	1
148	0	4.66	24.39	"Absent"	50	25.26	4.03	27	0
124	0.16	2.44	16.67	"Absent"	65	24.58	74.91	23	0
136	3.15	4.37	20.22	"Present"	59	25.12	47.16	31	1
134	2.75	5.51	26.17	"Absent"	57	29.87	8.33	33	0
128	0.73	3.97	23.52	"Absent"	54	23.81	19.2	64	0
122	3.2	3.59	22.49	"Present"	45	24.96	36.17	58	0
152	3	4.64	31.29	"Absent"	41	29.34	4.53	40	0
162	0	5.09	24.6	"Present"	64	26.71	3.81	18	0
124	4	6.65	30.84	"Present"	54	28.4	33.51	60	0
136	5.8	5.9	27.55	"Absent"	65	25.71	14.4	59	0
136	8.8	4.26	32.03	"Present"	52	31.44	34.35	60	0
134	0.05	8.03	27.95	"Absent"	48	26.88	0	60	0
122	1	5.88	34.81	"Present"	69	31.27	15.94	40	1
116	3	3.05	30.31	"Absent"	41	23.63	0.86	44	0
132	0	0.98	21.39	"Absent"	62	26.75	0	53	0
134	0	2.4	21.11	"Absent"	57	22.45	1.37	18	0
160	7.77	8.07	34.8	"Absent"	64	31.15	0	62	1
180	0.52	4.23	16.38	"Absent"	55	22.56	14.77	45	1
124	0.81	6.16	11.61	"Absent"	35	21.47	10.49	26	0
114	0	4.97	9.69	"Absent"	26	22.6	0	25	0
208	7.4	7.41	32.03	"Absent"	50	27.62	7.85	57	0
138	0	3.14	12	"Absent"	54	20.28	0	16	0
164	0.5	6.95	39.64	"Present"	47	41.76	3.81	46	1
144	2.4	8.13	35.61	"Absent"	46	27.38	13.37	60	0
136	7.5	7.39	28.04	"Present"	50	25.01	0	45	1
132	7.28	3.52	12.33	"Absent"	60	19.48	2.06	56	0
143	5.04	4.86	23.59	"Absent"	58	24.69	18.72	42	0
112	4.46	7.18	26.25	"Present"	69	27.29	0	32	1
134	10	3.79	34.72	"Absent"	42	28.33	28.8	52	1
138	2	5.11	31.4	"Present"	49	27.25	2.06	64	1
188	0	5.47	32.44	"Present"	71	28.99	7.41	50	1
110	2.35	3.36	26.72	"Present"	54	26.08	109.8	58	1
136	13.2	7.18	35.95	"Absent"	48	29.19	0	62	0
130	1.75	5.46	34.34	"Absent"	53	29.42	0	58	1
122	0	3.76	24.59	"Absent"	56	24.36	0	30	0
138	0	3.24	27.68	"Absent"	60	25.7	88.66	29	0
130	18	4.13	27.43	"Absent"	54	27.44	0	51	1
126	5.5	3.78	34.15	"Absent"	55	28.85	3.18	61	0
176	5.76	4.89	26.1	"Present"	46	27.3	19.44	57	0
122	0	5.49	19.56	"Absent"	57	23.12	14.02	27	0
124	0	3.23	9.64	"Absent"	59	22.7	0	16	0
140	5.2	3.58	29.26	"Absent"	70	27.29	20.17	45	1
128	6	4.37	22.98	"Present"	50	26.01	0	47	0
190	4.18	5.05	24.83	"Absent"	45	26.09	82.85	41	0
144	0.76	10.53	35.66	"Absent"	63	34.35	0	55	1
126	4.6	7.4	31.99	"Present"	57	28.67	0.37	60	1
128	0	2.63	23.88	"Absent"	45	21.59	6.54	57	0
136	0.4	3.91	21.1	"Present"	63	22.3	0	56	1
158	4	4.18	28.61	"Present"	42	25.11	0	60	0
160	0.6	6.94	30.53	"Absent"	36	25.68	1.42	64	0
124	6	5.21	33.02	"Present"	64	29.37	7.61	58	1
158	6.17	8.12	30.75	"Absent"	46	27.84	92.62	48	0
128	0	6.34	11.87	"Absent"	57	23.14	0	17	0
166	3	3.82	26.75	"Absent"	45	20.86	0	63	1
146	7.5	7.21	25.93	"Present"	55	22.51	0.51	42	0
161	9	4.65	15.16	"Present"	58	23.76	43.2	46	0
164	13.02	6.26	29.38	"Present"	47	22.75	37.03	54	1
146	5.08	7.03	27.41	"Present"	63	36.46	24.48	37	1
142	4.48	3.57	19.75	"Present"	51	23.54	3.29	49	0
138	12	5.13	28.34	"Absent"	59	24.49	32.81	58	1
154	1.8	7.13	34.04	"Present"	52	35.51	39.36	44	0
118	0	2.39	12.13	"Absent"	49	18.46	0.26	17	1
124	0.61	2.69	17.15	"Present"	61	22.76	11.55	20	0
124	1.04	2.84	16.42	"Present"	46	20.17	0	61	0
136	5	4.19	23.99	"Present"	68	27.8	25.86	35	0
132	9.9	4.63	27.86	"Present"	46	23.39	0.51	52	1
118	0.12	1.96	20.31	"Absent"	37	20.01	2.42	18	0
118	0.12	4.16	9.37	"Absent"	57	19.61	0	17	0
134	12	4.96	29.79	"Absent"	53	24.86	8.23	57	0
114	0.1	3.95	15.89	"Present"	57	20.31	17.14	16	0
136	6.8	7.84	30.74	"Present"	58	26.2	23.66	45	1
130	0	4.16	39.43	"Present"	46	30.01	0	55	1
136	2.2	4.16	38.02	"Absent"	65	37.24	4.11	41	1
136	1.36	3.16	14.97	"Present"	56	24.98	7.3	24	0
154	4.2	5.59	25.02	"Absent"	58	25.02	1.54	43	0
108	0.8	2.47	17.53	"Absent"	47	22.18	0	55	1
136	8.8	4.69	36.07	"Present"	38	26.56	2.78	63	1
174	2.02	6.57	31.9	"Present"	50	28.75	11.83	64	1
124	4.25	8.22	30.77	"Absent"	56	25.8	0	43	0
114	0	2.63	9.69	"Absent"	45	17.89	0	16	0
118	0.12	3.26	12.26	"Absent"	55	22.65	0	16	0
106	1.08	4.37	26.08	"Absent"	67	24.07	17.74	28	1
146	3.6	3.51	22.67	"Absent"	51	22.29	43.71	42	0
206	0	4.17	33.23	"Absent"	69	27.36	6.17	50	1
134	3	3.17	17.91	"Absent"	35	26.37	15.12	27	0
148	15	4.98	36.94	"Present"	72	31.83	66.27	41	1
126	0.21	3.95	15.11	"Absent"	61	22.17	2.42	17	0
134	0	3.69	13.92	"Absent"	43	27.66	0	19	0
134	0.02	2.8	18.84	"Absent"	45	24.82	0	17	0
123	0.05	4.61	13.69	"Absent"	51	23.23	2.78	16	0
112	0.6	5.28	25.71	"Absent"	55	27.02	27.77	38	1
112	0	1.71	15.96	"Absent"	42	22.03	3.5	16	0
101	0.48	7.26	13	"Absent"	50	19.82	5.19	16	0
150	0.18	4.14	14.4	"Absent"	53	23.43	7.71	44	0
170	2.6	7.22	28.69	"Present"	71	27.87	37.65	56	1
134	0	5.63	29.12	"Absent"	68	32.33	2.02	34	0
142	0	4.19	18.04	"Absent"	56	23.65	20.78	42	1
132	0.1	3.28	10.73	"Absent"	73	20.42	0	17	0
136	0	2.28	18.14	"Absent"	55	22.59	0	17	0
132	12	4.51	21.93	"Absent"	61	26.07	64.8	46	1
166	4.1	4	34.3	"Present"	32	29.51	8.23	53	0
138	0	3.96	24.7	"Present"	53	23.8	0	45	0
138	2.27	6.41	29.07	"Absent"	58	30.22	2.93	32	1
170	0	3.12	37.15	"Absent"	47	35.42	0	53	0
128	0	8.41	28.82	"Present"	60	26.86	0	59	1
136	1.2	2.78	7.12	"Absent"	52	22.51	3.41	27	0
128	0	3.22	26.55	"Present"	39	26.59	16.71	49	0
150	14.4	5.04	26.52	"Present"	60	28.84	0	45	0
132	8.4	3.57	13.68	"Absent"	42	18.75	15.43	59	1
142	2.4	2.55	23.89	"Absent"	54	26.09	59.14	37	0
130	0.05	2.44	28.25	"Present"	67	30.86	40.32	34	0
174	3.5	5.26	21.97	"Present"	36	22.04	8.33	59	1
114	9.6	2.51	29.18	"Absent"	49	25.67	40.63	46	0
162	1.5	2.46	19.39	"Present"	49	24.32	0	59	1
174	0	3.27	35.4	"Absent"	58	37.71	24.95	44	0
190	5.15	6.03	36.59	"Absent"	42	30.31	72	50	0
154	1.4	1.72	18.86	"Absent"	58	22.67	43.2	59	0
124	0	2.28	24.86	"Present"	50	22.24	8.26	38	0
114	1.2	3.98	14.9	"Absent"	49	23.79	25.82	26	0
168	11.4	5.08	26.66	"Present"	56	27.04	2.61	59	1
142	3.72	4.24	32.57	"Absent"	52	24.98	7.61	51	0
154	0	4.81	28.11	"Present"	56	25.67	75.77	59	0
146	4.36	4.31	18.44	"Present"	47	24.72	10.8	38	0
166	6	3.02	29.3	"Absent"	35	24.38	38.06	61	0
140	8.6	3.9	32.16	"Present"	52	28.51	11.11	64	1
136	1.7	3.53	20.13	"Absent"	56	19.44	14.4	55	0
156	0	3.47	21.1	"Absent"	73	28.4	0	36	1
132	0	6.63	29.58	"Present"	37	29.41	2.57	62	0
128	0	2.98	12.59	"Absent"	65	20.74	2.06	19	0
106	5.6	3.2	12.3	"Absent"	49	20.29	0	39	0
144	0.4	4.64	30.09	"Absent"	30	27.39	0.74	55	0
154	0.31	2.33	16.48	"Absent"	33	24	11.83	17	0
126	3.1	2.01	32.97	"Present"	56	28.63	26.74	45	0
134	6.4	8.49	37.25	"Present"	56	28.94	10.49	51	1
152	19.45	4.22	29.81	"Absent"	28	23.95	0	59	1
146	1.35	6.39	34.21	"Absent"	51	26.43	0	59	1
162	6.94	4.55	33.36	"Present"	52	27.09	32.06	43	0
130	7.28	3.56	23.29	"Present"	20	26.8	51.87	58	1
138	6	7.24	37.05	"Absent"	38	28.69	0	59	0
148	0	5.32	26.71	"Present"	52	32.21	32.78	27	0
124	4.2	2.94	27.59	"Absent"	50	30.31	85.06	30	0
118	1.62	9.01	21.7	"Absent"	59	25.89	21.19	40	0
116	4.28	7.02	19.99	"Present"	68	23.31	0	52	1
162	6.3	5.73	22.61	"Present"	46	20.43	62.54	53	1
138	0.87	1.87	15.89	"Absent"	44	26.76	42.99	31	0
137	1.2	3.14	23.87	"Absent"	66	24.13	45	37	0
198	0.52	11.89	27.68	"Present"	48	28.4	78.99	26	1
154	4.5	4.75	23.52	"Present"	43	25.76	0	53	1
128	5.4	2.36	12.98	"Absent"	51	18.36	6.69	61	0
130	0.08	5.59	25.42	"Present"	50	24.98	6.27	43	1
162	5.6	4.24	22.53	"Absent"	29	22.91	5.66	60	0
120	10.5	2.7	29.87	"Present"	54	24.5	16.46	49	0
136	3.99	2.58	16.38	"Present"	53	22.41	27.67	36	0
176	1.2	8.28	36.16	"Present"	42	27.81	11.6	58	1
134	11.79	4.01	26.57	"Present"	38	21.79	38.88	61	1
122	1.7	5.28	32.23	"Present"	51	24.08	0	54	0
134	0.9	3.18	23.66	"Present"	52	23.26	27.36	58	1
134	0	2.43	22.24	"Absent"	52	26.49	41.66	24	0
136	6.6	6.08	32.74	"Absent"	64	33.28	2.72	49	0
132	4.05	5.15	26.51	"Present"	31	26.67	16.3	50	0
152	1.68	3.58	25.43	"Absent"	50	27.03	0	32	0
132	12.3	5.96	32.79	"Present"	57	30.12	21.5	62	1
124	0.4	3.67	25.76	"Absent"	43	28.08	20.57	34	0
140	4.2	2.91	28.83	"Present"	43	24.7	47.52	48	0
166	0.6	2.42	34.03	"Present"	53	26.96	54	60	0
156	3.02	5.35	25.72	"Present"	53	25.22	28.11	52	1
132	0.72	4.37	19.54	"Absent"	48	26.11	49.37	28	0
150	0	4.99	27.73	"Absent"	57	30.92	8.33	24	0
134	0.12	3.4	21.18	"Present"	33	26.27	14.21	30	0
126	3.4	4.87	15.16	"Present"	65	22.01	11.11	38	0
148	0.5	5.97	32.88	"Absent"	54	29.27	6.43	42	0
148	8.2	7.75	34.46	"Present"	46	26.53	6.04	64	1
132	6	5.97	25.73	"Present"	66	24.18	145.29	41	0
128	1.6	5.41	29.3	"Absent"	68	29.38	23.97	32	0
128	5.16	4.9	31.35	"Present"	57	26.42	0	64	0
140	0	2.4	27.89	"Present"	70	30.74	144	29	0
126	0	5.29	27.64	"Absent"	25	27.62	2.06	45	0
114	3.6	4.16	22.58	"Absent"	60	24.49	65.31	31	0
118	1.25	4.69	31.58	"Present"	52	27.16	4.11	53	0
126	0.96	4.99	29.74	"Absent"	66	33.35	58.32	38	0
154	4.5	4.68	39.97	"Absent"	61	33.17	1.54	64	1
112	1.44	2.71	22.92	"Absent"	59	24.81	0	52	0
140	8	4.42	33.15	"Present"	47	32.77	66.86	44	0
140	1.68	11.41	29.54	"Present"	74	30.75	2.06	38	1
128	2.6	4.94	21.36	"Absent"	61	21.3	0	31	0
126	19.6	6.03	34.99	"Absent"	49	26.99	55.89	44	0
160	4.2	6.76	37.99	"Present"	61	32.91	3.09	54	1
144	0	4.17	29.63	"Present"	52	21.83	0	59	0
148	4.5	10.49	33.27	"Absent"	50	25.92	2.06	53	1
146	0	4.92	18.53	"Absent"	57	24.2	34.97	26	0
164	5.6	3.17	30.98	"Present"	44	25.99	43.2	53	1
130	0.54	3.63	22.03	"Present"	69	24.34	12.86	39	1
154	2.4	5.63	42.17	"Present"	59	35.07	12.86	50	1
178	0.95	4.75	21.06	"Absent"	49	23.74	24.69	61	0
180	3.57	3.57	36.1	"Absent"	36	26.7	19.95	64	0
134	12.5	2.73	39.35	"Absent"	48	35.58	0	48	0
142	0	3.54	16.64	"Absent"	58	25.97	8.36	27	0
162	7	7.67	34.34	"Present"	33	30.77	0	62	0
218	11.2	2.77	30.79	"Absent"	38	24.86	90.93	48	1
126	8.75	6.06	32.72	"Present"	33	27	62.43	55	1
126	0	3.57	26.01	"Absent"	61	26.3	7.97	47	0
134	6.1	4.77	26.08	"Absent"	47	23.82	1.03	49	0
132	0	4.17	36.57	"Absent"	57	30.61	18	49	0
178	5.5	3.79	23.92	"Present"	45	21.26	6.17	62	1
208	5.04	5.19	20.71	"Present"	52	25.12	24.27	58	1
160	1.15	10.19	39.71	"Absent"	31	31.65	20.52	57	0
116	2.38	5.67	29.01	"Present"	54	27.26	15.77	51	0
180	25.01	3.7	38.11	"Present"	57	30.54	0	61	1
200	19.2	4.43	40.6	"Present"	55	32.04	36	60	1
112	4.2	3.58	27.14	"Absent"	52	26.83	2.06	40	0
120	0	3.1	26.97	"Absent"	41	24.8	0	16	0
178	20	9.78	33.55	"Absent"	37	27.29	2.88	62	1
166	0.8	5.63	36.21	"Absent"	50	34.72	28.8	60	0
164	8.2	14.16	36.85	"Absent"	52	28.5	17.02	55	1
216	0.92	2.66	19.85	"Present"	49	20.58	0.51	63	1
146	6.4	5.62	33.05	"Present"	57	31.03	0.74	46	0
134	1.1	3.54	20.41	"Present"	58	24.54	39.91	39	1
158	16	5.56	29.35	"Absent"	36	25.92	58.32	60	0
176	0	3.14	31.04	"Present"	45	30.18	4.63	45	0
132	2.8	4.79	20.47	"Present"	50	22.15	11.73	48	0
126	0	4.55	29.18	"Absent"	48	24.94	36	41	0
120	5.5	3.51	23.23	"Absent"	46	22.4	90.31	43	0
174	0	3.86	21.73	"Absent"	42	23.37	0	63	0
150	13.8	5.1	29.45	"Present"	52	27.92	77.76	55	1
176	6	3.98	17.2	"Present"	52	21.07	4.11	61	1
142	2.2	3.29	22.7	"Absent"	44	23.66	5.66	42	1
132	0	3.3	21.61	"Absent"	42	24.92	32.61	33	0
142	1.32	7.63	29.98	"Present"	57	31.16	72.93	33	0
146	1.16	2.28	34.53	"Absent"	50	28.71	45	49	0
132	7.2	3.65	17.16	"Present"	56	23.25	0	34	0
120	0	3.57	23.22	"Absent"	58	27.2	0	32	0
118	0	3.89	15.96	"Absent"	65	20.18	0	16	0
108	0	1.43	26.26	"Absent"	42	19.38	0	16	0
136	0	4	19.06	"Absent"	40	21.94	2.06	16	0
120	0	2.46	13.39	"Absent"	47	22.01	0.51	18	0
132	0	3.55	8.66	"Present"	61	18.5	3.87	16	0
136	0	1.77	20.37	"Absent"	45	21.51	2.06	16	0
138	0	1.86	18.35	"Present"	59	25.38	6.51	17	0
138	0.06	4.15	20.66	"Absent"	49	22.59	2.49	16	0
130	1.22	3.3	13.65	"Absent"	50	21.4	3.81	31	0
130	4	2.4	17.42	"Absent"	60	22.05	0	40	0
110	0	7.14	28.28	"Absent"	57	29	0	32	0
120	0	3.98	13.19	"Present"	47	21.89	0	16	0
166	6	8.8	37.89	"Absent"	39	28.7	43.2	52	0
134	0.57	4.75	23.07	"Absent"	67	26.33	0	37	0
142	3	3.69	25.1	"Absent"	60	30.08	38.88	27	0
136	2.8	2.53	9.28	"Present"	61	20.7	4.55	25	0
142	0	4.32	25.22	"Absent"	47	28.92	6.53	34	1
130	0	1.88	12.51	"Present"	52	20.28	0	17	0
124	1.8	3.74	16.64	"Present"	42	22.26	10.49	20	0
144	4	5.03	25.78	"Present"	57	27.55	90	48	1
136	1.81	3.31	6.74	"Absent"	63	19.57	24.94	24	0
120	0	2.77	13.35	"Absent"	67	23.37	1.03	18	0
154	5.53	3.2	28.81	"Present"	61	26.15	42.79	42	0
124	1.6	7.22	39.68	"Present"	36	31.5	0	51	1
146	0.64	4.82	28.02	"Absent"	60	28.11	8.23	39	1
128	2.24	2.83	26.48	"Absent"	48	23.96	47.42	27	1
170	0.4	4.11	42.06	"Present"	56	33.1	2.06	57	0
214	0.4	5.98	31.72	"Absent"	64	28.45	0	58	0
182	4.2	4.41	32.1	"Absent"	52	28.61	18.72	52	1
108	3	1.59	15.23	"Absent"	40	20.09	26.64	55	0
118	5.4	11.61	30.79	"Absent"	64	27.35	23.97	40	0
132	0	4.82	33.41	"Present"	62	14.7	0	46	1

================================================
FILE: 2017/examples/deepdream/deepdream_exercise.py
================================================
"""DeepDream.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os.path
import zipfile

import numpy as np
import PIL.Image

import tensorflow as tf

FLAGS = tf.app.flags.FLAGS


tf.app.flags.DEFINE_string('data_dir',
                           '/tmp/inception/',
                           'Directory for storing Inception network.')

tf.app.flags.DEFINE_string('jpeg_file',
                           'output.jpg',
                           'Where to save the resulting JPEG.')


def get_layer(layer):
  """Helper for getting layer output Tensor in model Graph.

  Args:
    layer: string, layer name

  Returns:
    Tensor for that layer.
  """
  graph = tf.get_default_graph()
  return graph.get_tensor_by_name('import/%s:0' % layer)


def maybe_download(data_dir):
  """Maybe download pretrained Inception network.

  Args:
    data_dir: string, path to data
  """
  url = ('https://storage.googleapis.com/download.tensorflow.org/models/'
         'inception5h.zip')
  basename = 'inception5h.zip'
  local_file = tf.contrib.learn.python.learn.datasets.base.maybe_download(
      basename, data_dir, url)

  # Uncompress the pretrained Inception network.
  print('Extracting', local_file)
  with zipfile.ZipFile(local_file, 'r') as zip_ref:
    # Extract into the directory passed by the caller, not the global flag.
    zip_ref.extractall(data_dir)


def normalize_image(image):
  """Clip the image to [0, 1] and convert it to uint8 for saving as a JPEG.

  Args:
    image: numpy array

  Returns:
    numpy array of image in uint8
  """
  # Clip to [0, 1] and then convert to uint8.
  image = np.clip(image, 0, 1)
  image = np.uint8(image * 255)
  return image


def save_jpeg(jpeg_file, image):
  """Save a uint8 numpy image array to the path given by jpeg_file."""
  pil_image = PIL.Image.fromarray(image)
  pil_image.save(jpeg_file)
  print('Saved to file: ', jpeg_file)


def main(unused_argv):
  # Maybe download and uncompress pretrained Inception network.
  maybe_download(FLAGS.data_dir)

  model_fn = os.path.join(FLAGS.data_dir, 'tensorflow_inception_graph.pb')

  # Load the pretrained Inception model as a GraphDef.
  with tf.gfile.FastGFile(model_fn, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

  with tf.Graph().as_default():
    # Input for the network.
    input_image = tf.placeholder(np.float32, name='input')
    pixel_mean = 117.0
    input_preprocessed = tf.expand_dims(input_image - pixel_mean, 0)
    tf.import_graph_def(graph_def, {'input': input_preprocessed})

    # Grab a list of the names of the Tensors that are the outputs of convolutions.
    graph = tf.get_default_graph()
    layers = [op.name for op in graph.get_operations()
              if op.type == 'Conv2D' and 'import/' in op.name]
    feature_nums = [int(graph.get_tensor_by_name(name+':0').get_shape()[-1])
                    for name in layers]
    # print('Layers available: %s' % ','.join(layers))
    print('Number of layers', len(layers))
    print('Number of features:', sum(feature_nums))

    # Pick an internal layer and node to visualize.
    # Note that we use outputs before applying the ReLU nonlinearity to
    # have non-zero gradients for features with negative initial activations.
    layer = 'mixed4d_3x3_bottleneck_pre_relu'
    channel = 139
    layer_channel = get_layer(layer)[:, :, :, channel]
    print('layer %s, channel %d: %s' % (layer, channel, layer_channel))

    # Define the optimization as the average across all spatial locations.
    score = tf.reduce_mean(layer_channel)

    # Automatic differentiation with TensorFlow. Magic!
    input_gradient = tf.gradients(score, input_image)[0]

    # Start from a random-noise image.
    noise_image = np.random.uniform(size=(224, 224, 3)) + 100.0
    image = noise_image.copy()

    ################################################################
    # EXERCISE: Implement the Deep Dream algorithm here!
    ################################################################

  # Save the image.
  stddev = 0.1
  image = (image - image.mean()) / max(image.std(), 1e-4) * stddev + 0.5
  image = normalize_image(image)
  save_jpeg(FLAGS.jpeg_file, image)


if __name__ == '__main__':
  tf.app.run()

================================================
FILE: 2017/examples/deepdream/deepdream_solution.py
================================================
"""DeepDream.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os.path
import zipfile


import numpy as np
import PIL.Image

import tensorflow as tf

FLAGS = tf.app.flags.FLAGS


tf.app.flags.DEFINE_string('data_dir',
                           '/tmp/inception/',
                           'Directory for storing Inception network.')

tf.app.flags.DEFINE_string('jpeg_file',
                           'output.jpg',
                           'Where to save the resulting JPEG.')


def get_layer(layer):
  """Helper for getting layer output Tensor in model Graph.

  Args:
   layer: string, layer name

  Returns:
    Tensor for that layer.
  """
  graph = tf.get_default_graph()
  return graph.get_tensor_by_name('import/%s:0' % layer)


def maybe_download(data_dir):
  """Maybe download pretrained Inception network.

  Args:
    data_dir: string, path to data
  """
  url = ('https://storage.googleapis.com/download.tensorflow.org/models/'
         'inception5h.zip')
  basename = 'inception5h.zip'
  local_file = tf.contrib.learn.python.learn.datasets.base.maybe_download(
      basename, data_dir, url)

  # Uncompress the pretrained Inception network.
  print('Extracting', local_file)
  zip_ref = zipfile.ZipFile(local_file, 'r')
  zip_ref.extractall(FLAGS.data_dir)
  zip_ref.close()


def normalize_image(image):
  """Stretch the range and prepare the image for saving as a JPEG.

  Args:
    image: numpy array

  Returns:
    numpy array of image in uint8
  """
  # Clip to [0, 1] and then convert to uint8.
  image = np.clip(image, 0, 1)
  image = np.uint8(image * 255)
  return image


def save_jpeg(jpeg_file, image):
  pil_image = PIL.Image.fromarray(image)
  pil_image.save(jpeg_file)
  print('Saved to file: ', jpeg_file)


def main(unused_argv):
  # Maybe download and uncompress pretrained Inception network.
  maybe_download(FLAGS.data_dir)

  model_fn = os.path.join(FLAGS.data_dir, 'tensorflow_inception_graph.pb')

  # Load the pretrained Inception model as a GraphDef.
  with tf.gfile.FastGFile(model_fn, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

  with tf.Graph().as_default():
    # Input for the network.
    input_image = tf.placeholder(np.float32, name='input')
    pixel_mean = 117.0
    input_preprocessed = tf.expand_dims(input_image - pixel_mean, 0)
    tf.import_graph_def(graph_def, {'input': input_preprocessed})

    # Grab a list of the names of the Tensors that are the outputs of convolutions.
    graph = tf.get_default_graph()
    layers = [op.name for op in graph.get_operations()
              if op.type == 'Conv2D' and 'import/' in op.name]
    feature_nums = [int(graph.get_tensor_by_name(name+':0').get_shape()[-1])
                    for name in layers]
    # print('Layers available: %s' % ','.join(layers))
    print('Number of layers', len(layers))
    print('Number of features:', sum(feature_nums))

    # Pick an internal layer and node to visualize.
    # Note that we use outputs before applying the ReLU nonlinearity to
    # have non-zero gradients for features with negative initial activations.
    layer = 'mixed4d_3x3_bottleneck_pre_relu'
    channel = 139
    layer_channel = get_layer(layer)[:, :, :, channel]
    print('layer %s, channel %d: %s' % (layer, channel, layer_channel))

    # Define the optimization as the average across all spatial locations.
    score = tf.reduce_mean(layer_channel)

    # Automatic differentiation with TensorFlow. Magic!
    input_gradient = tf.gradients(score, input_image)[0]

    # Start from a random-noise image.
    noise_image = np.random.uniform(size=(224, 224, 3)) + 100.0
    image = noise_image.copy()
    
    ################################################################
    ### BEGIN SOLUTION #####
    ################################################################
    step_scale = 1.0
    num_iter = 20
    with tf.Session() as sess:
      # range() works on both Python 2 and Python 3 (xrange is Python 2 only).
      for i in range(num_iter):
        image_gradient, score_value = sess.run(
            [input_gradient, score], {input_image: image})
        # Normalize the gradient, so the same step size should work.
        image_gradient /= image_gradient.std() + 1e-8
        image += image_gradient * step_scale
        print('At step = %d, score = %.3f' % (i, score_value))

  # Save the image.
  stddev = 0.1
  image = (image - image.mean()) / max(image.std(), 1e-4) * stddev + 0.5
  image = normalize_image(image)
  save_jpeg(FLAGS.jpeg_file, image)
  ##################################################################
  ### END SOLUTION #####
  ##################################################################


if __name__ == '__main__':
  tf.app.run()

================================================
FILE: 2017/examples/kernels.py
================================================
import numpy as np
import tensorflow as tf

a = np.zeros([3, 3, 3, 3])
a[1, 1, :, :] = 0.25
a[0, 1, :, :] = 0.125
a[1, 0, :, :] = 0.125
a[2, 1, :, :] = 0.125
a[1, 2, :, :] = 0.125
a[0, 0, :, :] = 0.0625
a[0, 2, :, :] = 0.0625
a[2, 0, :, :] = 0.0625
a[2, 2, :, :] = 0.0625

BLUR_FILTER_RGB = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 1, 1])
# a[1, 1, :, :] = 0.25
# a[0, 1, :, :] = 0.125
# a[1, 0, :, :] = 0.125
# a[2, 1, :, :] = 0.125
# a[1, 2, :, :] = 0.125
# a[0, 0, :, :] = 0.0625
# a[0, 2, :, :] = 0.0625
# a[2, 0, :, :] = 0.0625
# a[2, 2, :, :] = 0.0625
a[1, 1, :, :] = 1.0
a[0, 1, :, :] = 1.0
a[1, 0, :, :] = 1.0
a[2, 1, :, :] = 1.0
a[1, 2, :, :] = 1.0
a[0, 0, :, :] = 1.0
a[0, 2, :, :] = 1.0
a[2, 0, :, :] = 1.0
a[2, 2, :, :] = 1.0
BLUR_FILTER = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 3, 3])
a[1, 1, :, :] = 5
a[0, 1, :, :] = -1
a[1, 0, :, :] = -1
a[2, 1, :, :] = -1
a[1, 2, :, :] = -1

SHARPEN_FILTER_RGB = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 1, 1])
a[1, 1, :, :] = 5
a[0, 1, :, :] = -1
a[1, 0, :, :] = -1
a[2, 1, :, :] = -1
a[1, 2, :, :] = -1

SHARPEN_FILTER = tf.constant(a, dtype=tf.float32)

# a = np.zeros([3, 3, 3, 3])
# a[:, :, :, :] = -1
# a[1, 1, :, :] = 8

# EDGE_FILTER_RGB = tf.constant(a, dtype=tf.float32)

EDGE_FILTER_RGB = tf.constant([
    [[[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]],
     [[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]],
     [[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]]],
    [[[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]],
     [[8., 0., 0.], [0., 8., 0.], [0., 0., 8.]],
     [[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]]],
    [[[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]],
     [[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]],
     [[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]]]
])

a = np.zeros([3, 3, 1, 1])
# a[:, :, :, :] = -1
# a[1, 1, :, :] = 8
a[0, 1, :, :] = -1
a[1, 0, :, :] = -1
a[1, 2, :, :] = -1
a[2, 1, :, :] = -1
a[1, 1, :, :] = 4

EDGE_FILTER = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 3, 3])
a[0, :, :, :] = 1
a[0, 1, :, :] = 2 # originally 2
a[2, :, :, :] = -1
a[2, 1, :, :] = -2

TOP_SOBEL_RGB = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 1, 1])
a[0, :, :, :] = 1
a[0, 1, :, :] = 2 # originally 2
a[2, :, :, :] = -1
a[2, 1, :, :] = -2

TOP_SOBEL = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 3, 3])
a[0, 0, :, :] = -2
a[0, 1, :, :] = -1 
a[1, 0, :, :] = -1
a[1, 1, :, :] = 1
a[1, 2, :, :] = 1
a[2, 1, :, :] = 1
a[2, 2, :, :] = 2

EMBOSS_FILTER_RGB = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 1, 1])
a[0, 0, :, :] = -2
a[0, 1, :, :] = -1 
a[1, 0, :, :] = -1
a[1, 1, :, :] = 1
a[1, 2, :, :] = 1
a[2, 1, :, :] = 1
a[2, 2, :, :] = 2
EMBOSS_FILTER = tf.constant(a, dtype=tf.float32)

================================================
FILE: 2017/examples/process_data.py
================================================
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from collections import Counter
import random
import os
import sys
sys.path.append('..')
import zipfile

import numpy as np
from six.moves import urllib
import tensorflow as tf

import utils

# Parameters for downloading data
DOWNLOAD_URL = 'http://mattmahoney.net/dc/'
EXPECTED_BYTES = 31344016
DATA_FOLDER = 'data/'
FILE_NAME = 'text8.zip'

def download(file_name, expected_bytes):
    """ Download the dataset text8 if it's not already downloaded """
    file_path = DATA_FOLDER + file_name
    if os.path.exists(file_path):
        print("Dataset ready")
        return file_path
    file_name, _ = urllib.request.urlretrieve(DOWNLOAD_URL + file_name, file_path)
    file_stat = os.stat(file_path)
    if file_stat.st_size == expected_bytes:
        print('Successfully downloaded the file', file_name)
    else:
        raise Exception('File ' + file_name +
                        ' might be corrupted. You should try downloading it with a browser.')
    return file_path

def read_data(file_path):
    """ Read data into a list of tokens 
    There should be 17,005,207 tokens
    """
    with zipfile.ZipFile(file_path) as f:
        words = tf.compat.as_str(f.read(f.namelist()[0])).split()
        # tf.compat.as_str() converts the bytes read from the zip file to str
    return words

def build_vocab(words, vocab_size):
    """ Build vocabulary of VOCAB_SIZE most frequent words """
    dictionary = dict()
    count = [('UNK', -1)]
    count.extend(Counter(words).most_common(vocab_size - 1))
    index = 0
    utils.make_dir('processed')
    with open('processed/vocab_1000.tsv', "w") as f:
        for word, _ in count:
            dictionary[word] = index
            if index < 1000:
                f.write(word + "\n")
            index += 1
    index_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, index_dictionary

def convert_words_to_index(words, dictionary):
    """ Replace each word in the dataset with its index in the dictionary """
    return [dictionary[word] if word in dictionary else 0 for word in words]

def generate_sample(index_words, context_window_size):
    """ Form training pairs according to the skip-gram model. """
    for index, center in enumerate(index_words):
        context = random.randint(1, context_window_size)
        # yield the targets in the random-size window before the center word
        for target in index_words[max(0, index - context): index]:
            yield center, target
        # yield the targets in the random-size window after the center word
        for target in index_words[index + 1: index + context + 1]:
            yield center, target

def get_batch(iterator, batch_size):
    """ Group a numerical stream into batches and yield them as Numpy arrays. """
    while True:
        center_batch = np.zeros(batch_size, dtype=np.int32)
        target_batch = np.zeros([batch_size, 1])
        for index in range(batch_size):
            center_batch[index], target_batch[index] = next(iterator)
        yield center_batch, target_batch

def process_data(vocab_size, batch_size, skip_window):
    file_path = download(FILE_NAME, EXPECTED_BYTES)
    words = read_data(file_path)
    dictionary, _ = build_vocab(words, vocab_size)
    index_words = convert_words_to_index(words, dictionary)
    del words # to save memory
    single_gen = generate_sample(index_words, skip_window)
    return get_batch(single_gen, batch_size)

def get_index_vocab(vocab_size):
    file_path = download(FILE_NAME, EXPECTED_BYTES)
    words = read_data(file_path)
    return build_vocab(words, vocab_size)


================================================
FILE: 2017/examples/utils.py
================================================
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf

def huber_loss(labels, predictions, delta=1.0):
    residual = tf.abs(predictions - labels)
    def f1(): return 0.5 * tf.square(residual)
    def f2(): return delta * residual - 0.5 * tf.square(delta)
    return tf.cond(residual < delta, f1, f2)

def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

================================================
FILE: 2017/setup/requirements.txt
================================================
tensorflow==1.2.1
scipy==0.19.1
scikit-learn==0.18.2
matplotlib==2.0.2
xlrd==1.0.0
ipdb==0.10.1
Pillow==4.2.1
lxml==3.8.0


================================================
FILE: 2017/setup/setup_instruction.md
================================================
TensorFlow supports both Python 2.7 and Python 3.3+. <b>Note that on Windows, TensorFlow supports only 64-bit Python 3.5.</b>
For this course, I will use Python 2.7, but you’re welcome to use either Python 2 or Python 3 for the assignments. The starter code, though, will be in Python 2.7.

Google has fairly detailed instructions on how to download and set up TensorFlow. You can follow them here: https://www.tensorflow.org/get_started/os_setup

Unless your computer has a GPU, you should install TensorFlow without GPU support. I recommend always setting up TensorFlow inside a virtualenv. For the list of dependencies, please consult the file requirements.txt. This list will be updated as the course progresses.

Below are simpler instructions on how to install TensorFlow for people using Mac OS. If you have any problems installing TensorFlow, feel free to post them on Piazza: piazza.com/stanford/winter2017/cs20si

## Install TensorFlow
### For Mac OS

If you get a “permission denied” error from any command, put “sudo” in front of that command.

You will need pip (or pip3 if you use Python 3), and virtualenv.

Step 1: set up pip and virtual environment
```bash
$ sudo easy_install pip 
$ sudo easy_install --upgrade six
$ pip install virtualenv
```

Step 2: set up a project directory. You will do all work for this class in this directory
```bash
$ mkdir [my project]
```

Step 3: set up virtual environment for the project directory. 
```bash
$ cd [my project]
$ virtualenv venv --distribute
```
These commands create a venv subdirectory in your project where everything is installed.

Step 4: activate the virtual environment
```bash
$ source venv/bin/activate
```
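
To confirm from inside Python that the virtual environment is really active, you can compare the interpreter's prefixes. This is only a small sketch: the helper name `in_virtualenv` is made up for illustration, and it relies on the fact that older virtualenv releases set `sys.real_prefix`, while newer ones (and Python 3's `venv`) make `sys.base_prefix` differ from `sys.prefix`.

```python
from __future__ import print_function

import sys


def in_virtualenv():
    """Return True if the interpreter appears to run inside a virtual environment.

    Hypothetical helper for illustration; not part of the course code.
    """
    # Older virtualenv releases record the original prefix in sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv releases and venv keep it in sys.base_prefix instead.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("virtual environment active:", in_virtualenv())
```

If this prints `False` after Step 4, the activate script was probably sourced in a different shell session.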

If you type:
```bash
$ pip freeze
```

You will see that nothing is shown, which means no packages are installed in your virtual environment, so you have to install every package you need. For the list of packages you need for this class, refer to requirements.txt.

Step 5: install TensorFlow and other dependencies
```bash
$ pip install tensorflow
$ pip freeze > requirements.txt
```
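
If you want to check an environment against the pinned versions in requirements.txt, one option is to compare its `name==version` lines with the output of `pip freeze`. The sketch below assumes both inputs use only that simple `name==version` form. The function names are hypothetical, and the "installed" text is a made-up sample rather than real output.

```python
from __future__ import print_function


def parse_pins(text):
    """Parse lines of the form name==version into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, version = line.partition("==")
        if sep:  # ignore lines that are not simple pins
            pins[name.strip().lower()] = version.strip()
    return pins


def missing_or_mismatched(required, installed):
    """Return the sorted names whose installed version differs from the pin."""
    return sorted(name for name, version in required.items()
                  if installed.get(name) != version)


# Three of the pins from this course's requirements.txt.
required = parse_pins("tensorflow==1.2.1\nscipy==0.19.1\nPillow==4.2.1\n")
# Made-up `pip freeze` output, for illustration only.
installed = parse_pins("tensorflow==1.2.1\nscipy==0.18.0\n")

print(missing_or_mismatched(required, installed))  # ['pillow', 'scipy']
```

Package names are lower-cased before comparison because pip treats them case-insensitively.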

Step n: 
To exit the virtual environment, use:
```bash
$ deactivate
```

If you want your virtual environment to inherit globally installed packages (not recommended), use:
```bash
$ virtualenv venv --distribute --system-site-packages
```
### For Ubuntu


### For Windows


### On the cloud
If you don't want to install TensorFlow, you can use TensorFlow over the web.

#### SageMath
You can use TensorFlow over the web at https://cloud.sagemath.com/
Simply click on the link, create an account (or log in with your GitHub), and create a TensorFlow project.

#### Jupyter
You can also use Jupyter notebook to write TensorFlow programs.

# Possible set up problems
## Matplotlib
If you have problems using Matplotlib in a virtual environment, here is a simple fix. <br>
If you installed matplotlib using pip, there is a directory in your home directory called ~/.matplotlib.
Create the file ~/.matplotlib/matplotlibrc and add the following line to it: ```backend: TkAgg```

Or you can simply add this right after importing matplotlib and before importing matplotlib.pyplot: ```matplotlib.use("TkAgg")```
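
If you prefer to create the matplotlibrc file from Python rather than by hand, something like the sketch below should work. The helper name `write_matplotlibrc` is hypothetical, and it overwrites any existing matplotlibrc in the target directory, so back up yours first if you already have one.

```python
import os


def write_matplotlibrc(config_dir, backend="TkAgg"):
    """Write a matplotlibrc containing one backend line; return its path.

    Hypothetical helper for illustration. It overwrites an existing file.
    """
    if not os.path.isdir(config_dir):
        os.makedirs(config_dir)
    rc_path = os.path.join(config_dir, "matplotlibrc")
    with open(rc_path, "w") as f:
        f.write("backend: %s\n" % backend)
    return rc_path


# Example call (commented out so that nothing is written by accident):
# write_matplotlibrc(os.path.expanduser("~/.matplotlib"))
```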


================================================
FILE: LICENSE
================================================
The MIT License (MIT)

Copyright (c) 2017 Huyen Nguyen (Chip Huyen)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

================================================
FILE: README.md
================================================
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![Join the https://gitter.im/stanford-tensorflow-tutorials](https://badges.gitter.im/tflearn/tflearn.svg)](https://gitter.im/stanford-tensorflow-tutorials)

# stanford-tensorflow-tutorials
This repository contains code examples for the course CS 20: TensorFlow for Deep Learning Research. <br>
It will be updated as the class progresses. <br>
Detailed syllabus and lecture notes can be found [here](http://cs20.stanford.edu).<br>
For this course, I use python3.6 and TensorFlow 1.4.1.

For the code and notes of the previous year's course, please see the folder 2017 and the website https://web.stanford.edu/class/cs20si/2017

For setup instruction and the list of dependencies, please see the setup folder of this repository.

================================================
FILE: assignments/01/q1.py
================================================
"""
Simple exercises to get used to TensorFlow API
You should thoroughly test your code.
TensorFlow's official documentation should be your best friend here
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

sess = tf.InteractiveSession()
###############################################################################
# 1a: Create two random 0-d tensors x and y of any distribution.
# Create a TensorFlow object that returns x + y if x > y, and x - y otherwise.
# Hint: look up tf.cond()
# I do the first problem for you
###############################################################################

x = tf.random_uniform([])  # Empty array as shape creates a scalar.
y = tf.random_uniform([])
out = tf.cond(tf.greater(x, y), lambda: x + y, lambda: x - y)
print(sess.run(out))

###############################################################################
# 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1).
# Return x + y if x < y, x - y if x > y, 0 otherwise.
# Hint: Look up tf.case().
###############################################################################

# YOUR CODE

###############################################################################
# 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]] 
# and y as a tensor of zeros with the same shape as x.
# Return a boolean tensor that yields Trues if x equals y element-wise.
# Hint: Look up tf.equal().
###############################################################################

# YOUR CODE

###############################################################################
# 1d: Create the tensor x of value 
# [29.05088806,  27.61298943,  31.19073486,  29.35532951,
#  30.97266006,  26.67541885,  38.08450317,  20.74983215,
#  34.94445419,  34.45999146,  29.06485367,  36.01657104,
#  27.88236427,  20.56035233,  30.20379066,  29.51215172,
#  33.71149445,  28.59134293,  36.05556488,  28.66994858].
# Get the indices of elements in x whose values are greater than 30.
# Hint: Use tf.where().
# Then extract elements whose values are greater than 30.
# Hint: Use tf.gather().
###############################################################################

# YOUR CODE

###############################################################################
# 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1,
# 2, ..., 6
# Hint: Use tf.range() and tf.diag().
###############################################################################

# YOUR CODE

###############################################################################
# 1f: Create a random 2-d tensor of size 10 x 10 from any distribution.
# Calculate its determinant.
# Hint: Look at tf.matrix_determinant().
###############################################################################

# YOUR CODE

###############################################################################
# 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9].
# Return the unique elements in x
# Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple.
###############################################################################

# YOUR CODE

###############################################################################
# 1h: Create two tensors x and y of shape 300 from any normal distribution,
# as long as they are from the same distribution.
# Use tf.cond() to return:
# - The mean squared error of (x - y) if the average of all elements in (x - y)
#   is negative, or
# - The sum of absolute value of all elements in the tensor (x - y) otherwise.
# Hint: see the Huber loss function in the lecture slides 3.
###############################################################################

# YOUR CODE

================================================
FILE: assignments/01/q1_sol.py
================================================
"""
Solution to simple exercises to get used to TensorFlow API
You should thoroughly test your code.
TensorFlow's official documentation should be your best friend here
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

sess = tf.InteractiveSession()
###############################################################################
# 1a: Create two random 0-d tensors x and y of any distribution.
# Create a TensorFlow object that returns x + y if x > y, and x - y otherwise.
# Hint: look up tf.cond()
# I do the first problem for you
###############################################################################

x = tf.random_uniform([])  # Empty array as shape creates a scalar.
y = tf.random_uniform([])
out = tf.cond(tf.greater(x, y), lambda: tf.add(x, y), lambda: tf.subtract(x, y))

###############################################################################
# 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1).
# Return x + y if x < y, x - y if x > y, 0 otherwise.
# Hint: Look up tf.case().
###############################################################################

x = tf.random_uniform([], -1, 1, dtype=tf.float32)
y = tf.random_uniform([], -1, 1, dtype=tf.float32)
out = tf.case({tf.less(x, y): lambda: tf.add(x, y),
               tf.greater(x, y): lambda: tf.subtract(x, y)},
              default=lambda: tf.constant(0.0), exclusive=True)


###############################################################################
# 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]] 
# and y as a tensor of zeros with the same shape as x.
# Return a boolean tensor that yields Trues if x equals y element-wise.
# Hint: Look up tf.equal().
###############################################################################

x = tf.constant([[0, -2, -1], [0, 1, 2]])
y = tf.zeros_like(x)
out = tf.equal(x, y)

###############################################################################
# 1d: Create the tensor x of value 
# [29.05088806,  27.61298943,  31.19073486,  29.35532951,
#  30.97266006,  26.67541885,  38.08450317,  20.74983215,
#  34.94445419,  34.45999146,  29.06485367,  36.01657104,
#  27.88236427,  20.56035233,  30.20379066,  29.51215172,
#  33.71149445,  28.59134293,  36.05556488,  28.66994858].
# Get the indices of elements in x whose values are greater than 30.
# Hint: Use tf.where().
# Then extract elements whose values are greater than 30.
# Hint: Use tf.gather().
###############################################################################

x = tf.constant([29.05088806,  27.61298943,  31.19073486,  29.35532951,
                 30.97266006,  26.67541885,  38.08450317,  20.74983215,
                 34.94445419,  34.45999146,  29.06485367,  36.01657104,
                 27.88236427,  20.56035233,  30.20379066,  29.51215172,
                 33.71149445,  28.59134293,  36.05556488,  28.66994858])
indices = tf.where(x > 30)
out = tf.gather(x, indices)

###############################################################################
# 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1,
# 2, ..., 6
# Hint: Use tf.range() and tf.diag().
###############################################################################

values = tf.range(1, 7)
out = tf.diag(values)

###############################################################################
# 1f: Create a random 2-d tensor of size 10 x 10 from any distribution.
# Calculate its determinant.
# Hint: Look at tf.matrix_determinant().
###############################################################################

m = tf.random_normal([10, 10], mean=10, stddev=1)
out = tf.matrix_determinant(m)

###############################################################################
# 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9].
# Return the unique elements in x
# Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple.
###############################################################################

x = tf.constant([5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9])
unique_values, indices = tf.unique(x)

###############################################################################
# 1h: Create two tensors x and y of shape 300 from any normal distribution,
# as long as they are from the same distribution.
# Use tf.cond() to return:
# - The mean squared error of (x - y) if the average of all elements in (x - y)
#   is negative, or
# - The sum of absolute value of all elements in the tensor (x - y) otherwise.
# Hint: see the Huber loss function in the lecture slides 3.
###############################################################################

x = tf.random_normal([300], mean=5, stddev=1)
y = tf.random_normal([300], mean=5, stddev=1)
average = tf.reduce_mean(x - y)
def f1(): return tf.reduce_mean(tf.square(x - y))
def f2(): return tf.reduce_sum(tf.abs(x - y))
out = tf.cond(average < 0, f1, f2)

================================================
FILE: assignments/02_style_transfer/load_vgg.py
================================================
""" Load VGGNet weights needed for the implementation in TensorFlow
of the paper A Neural Algorithm of Artistic Style (Gatys et al., 2016) 

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

For more details, please read the assignment handout:
https://docs.google.com/document/d/1FpueD-3mScnD0SJQDtwmOb1FrSwo1NGowkXzMwPoLH4/edit?usp=sharing

"""
import numpy as np
import scipy.io
import tensorflow as tf

import utils

# VGG-19 parameters file
VGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat'
VGG_FILENAME = 'imagenet-vgg-verydeep-19.mat'
EXPECTED_BYTES = 534904783

class VGG(object):
    def __init__(self, input_img):
        utils.download(VGG_DOWNLOAD_LINK, VGG_FILENAME, EXPECTED_BYTES)
        self.vgg_layers = scipy.io.loadmat(VGG_FILENAME)['layers']
        self.input_img = input_img
        self.mean_pixels = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))

    def _weights(self, layer_idx, expected_layer_name):
        """ Return the weights and biases at layer_idx already trained by VGG
        """
        W = self.vgg_layers[0][layer_idx][0][0][2][0][0]
        b = self.vgg_layers[0][layer_idx][0][0][2][0][1]
        layer_name = self.vgg_layers[0][layer_idx][0][0][0][0]
        assert layer_name == expected_layer_name
        return W, b.reshape(b.size)

    def conv2d_relu(self, prev_layer, layer_idx, layer_name):
        """ Create a convolution layer with RELU using the weights and
        biases extracted from the VGG model at 'layer_idx'. You should use
        the function _weights() defined above to extract weights and biases.

        _weights() returns numpy arrays, so you have to convert them to TF tensors.

        Don't forget to apply relu to the output from the convolution.
        Inputs:
            prev_layer: the output tensor from the previous layer
            layer_idx: the index to current layer in vgg_layers
            layer_name: the string that is the name of the current layer.
                        It's used to specify variable_scope.
        Hint for choosing strides size: 
            for small images, you probably don't want to skip any pixel
        """
        ###############################
        ## TO DO
        out = None
        ###############################
        setattr(self, layer_name, out)

    def avgpool(self, prev_layer, layer_name):
        """ Create the average pooling layer. The paper suggests that 
        average pooling works better than max pooling.
        
        Input:
            prev_layer: the output tensor from the previous layer
            layer_name: the string that you want to name the layer.
                        It's used to specify variable_scope.

        Hint for choosing strides and ksize: choose what you feel is appropriate
        """
        ###############################
        ## TO DO
        out = None
        ###############################
        setattr(self, layer_name, out)

    def load(self):
        self.conv2d_relu(self.input_img, 0, 'conv1_1')
        self.conv2d_relu(self.conv1_1, 2, 'conv1_2')
        self.avgpool(self.conv1_2, 'avgpool1')
        self.conv2d_relu(self.avgpool1, 5, 'conv2_1')
        self.conv2d_relu(self.conv2_1, 7, 'conv2_2')
        self.avgpool(self.conv2_2, 'avgpool2')
        self.conv2d_relu(self.avgpool2, 10, 'conv3_1')
        self.conv2d_relu(self.conv3_1, 12, 'conv3_2')
        self.conv2d_relu(self.conv3_2, 14, 'conv3_3')
        self.conv2d_relu(self.conv3_3, 16, 'conv3_4')
        self.avgpool(self.conv3_4, 'avgpool3')
        self.conv2d_relu(self.avgpool3, 19, 'conv4_1')
        self.conv2d_relu(self.conv4_1, 21, 'conv4_2')
        self.conv2d_relu(self.conv4_2, 23, 'conv4_3')
        self.conv2d_relu(self.conv4_3, 25, 'conv4_4')
        self.avgpool(self.conv4_4, 'avgpool4')
        self.conv2d_relu(self.avgpool4, 28, 'conv5_1')
        self.conv2d_relu(self.conv5_1, 30, 'conv5_2')
        self.conv2d_relu(self.conv5_2, 32, 'conv5_3')
        self.conv2d_relu(self.conv5_3, 34, 'conv5_4')
        self.avgpool(self.conv5_4, 'avgpool5')

================================================
FILE: assignments/02_style_transfer/load_vgg_sol.py
================================================
""" Load VGGNet weights needed for the implementation in TensorFlow
of the paper A Neural Algorithm of Artistic Style (Gatys et al., 2016) 

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

For more details, please read the assignment handout:
https://docs.google.com/document/d/1FpueD-3mScnD0SJQDtwmOb1FrSwo1NGowkXzMwPoLH4/edit?usp=sharing
"""
import numpy as np
import scipy.io
import tensorflow as tf

import utils

# VGG-19 parameters file
VGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat'
VGG_FILENAME = 'imagenet-vgg-verydeep-19.mat'
EXPECTED_BYTES = 534904783

class VGG(object):
    def __init__(self, input_img):
        utils.download(VGG_DOWNLOAD_LINK, VGG_FILENAME, EXPECTED_BYTES)
        self.vgg_layers = scipy.io.loadmat(VGG_FILENAME)['layers']
        self.input_img = input_img
        self.mean_pixels = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))

    def _weights(self, layer_idx, expected_layer_name):
        """ Return the weights and biases at layer_idx already trained by VGG
        """
        W = self.vgg_layers[0][layer_idx][0][0][2][0][0]
        b = self.vgg_layers[0][layer_idx][0][0][2][0][1]
        layer_name = self.vgg_layers[0][layer_idx][0][0][0][0]
        assert layer_name == expected_layer_name
        return W, b.reshape(b.size)

    def conv2d_relu(self, prev_layer, layer_idx, layer_name):
        """ Return the Conv2D layer with RELU using the weights, 
        biases from the VGG model at 'layer_idx'.
        Don't forget to apply relu to the output from the convolution.
        Inputs:
            prev_layer: the output tensor from the previous layer
            layer_idx: the index to current layer in vgg_layers
            layer_name: the string that is the name of the current layer.
                        It's used to specify variable_scope.


        Note that you first need to obtain W and b from the corresponding VGG layer
        using the function _weights() defined above.
        W and b returned from _weights() are numpy arrays, so you have
        to convert them to TF tensors. One way to do it is with tf.constant.

        Hint for choosing strides size: 
            for small images, you probably don't want to skip any pixel
        """
        ###############################
        ## TO DO
        with tf.variable_scope(layer_name) as scope:
            W, b = self._weights(layer_idx, layer_name)
            W = tf.constant(W, name='weights')
            b = tf.constant(b, name='bias')
            conv2d = tf.nn.conv2d(prev_layer, 
                                filter=W, 
                                strides=[1, 1, 1, 1], 
                                padding='SAME')
            out = tf.nn.relu(conv2d + b)
        ###############################
        setattr(self, layer_name, out)

    def avgpool(self, prev_layer, layer_name):
        """ Return the average pooling layer. The paper suggests that 
        average pooling works better than max pooling.
        Input:
            prev_layer: the output tensor from the previous layer
            layer_name: the string that you want to name the layer.
                        It's used to specify variable_scope.

        Hint for choosing strides and ksize: choose what you feel is appropriate
        """
        ###############################
        ## TO DO
        with tf.variable_scope(layer_name):
            out = tf.nn.avg_pool(prev_layer, 
                                ksize=[1, 2, 2, 1], 
                                strides=[1, 2, 2, 1],
                                padding='SAME')
        ###############################
        setattr(self, layer_name, out)

    def load(self):
        self.conv2d_relu(self.input_img, 0, 'conv1_1')
        self.conv2d_relu(self.conv1_1, 2, 'conv1_2')
        self.avgpool(self.conv1_2, 'avgpool1')
        self.conv2d_relu(self.avgpool1, 5, 'conv2_1')
        self.conv2d_relu(self.conv2_1, 7, 'conv2_2')
        self.avgpool(self.conv2_2, 'avgpool2')
        self.conv2d_relu(self.avgpool2, 10, 'conv3_1')
        self.conv2d_relu(self.conv3_1, 12, 'conv3_2')
        self.conv2d_relu(self.conv3_2, 14, 'conv3_3')
        self.conv2d_relu(self.conv3_3, 16, 'conv3_4')
        self.avgpool(self.conv3_4, 'avgpool3')
        self.conv2d_relu(self.avgpool3, 19, 'conv4_1')
        self.conv2d_relu(self.conv4_1, 21, 'conv4_2')
        self.conv2d_relu(self.conv4_2, 23, 'conv4_3')
        self.conv2d_relu(self.conv4_3, 25, 'conv4_4')
        self.avgpool(self.conv4_4, 'avgpool4')
        self.conv2d_relu(self.avgpool4, 28, 'conv5_1')
        self.conv2d_relu(self.conv5_1, 30, 'conv5_2')
        self.conv2d_relu(self.conv5_2, 32, 'conv5_3')
        self.conv2d_relu(self.conv5_3, 34, 'conv5_4')
        self.avgpool(self.conv5_4, 'avgpool5')

================================================
FILE: assignments/02_style_transfer/style_transfer.py
================================================
""" Implementation in TensorFlow of the paper 
A Neural Algorithm of Artistic Style (Gatys et al., 2016) 

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

For more details, please read the assignment handout:
https://docs.google.com/document/d/1FpueD-3mScnD0SJQDtwmOb1FrSwo1NGowkXzMwPoLH4/edit?usp=sharing
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import tensorflow as tf

import load_vgg
import utils

def setup():
    utils.safe_mkdir('checkpoints')
    utils.safe_mkdir('outputs')

class StyleTransfer(object):
    def __init__(self, content_img, style_img, img_width, img_height):
        '''
        img_width and img_height are the dimensions we expect from the generated image.
        We will resize input content image and input style image to match this dimension.
        Feel free to alter any hyperparameter here and see how it affects your training.
        '''
        self.img_width = img_width
        self.img_height = img_height
        self.content_img = utils.get_resized_image(content_img, img_width, img_height)
        self.style_img = utils.get_resized_image(style_img, img_width, img_height)
        self.initial_img = utils.generate_noise_image(self.content_img, img_width, img_height)

        ###############################
        ## TO DO
        ## create global step (gstep) and hyperparameters for the model
        self.content_layer = 'conv4_2'
        self.style_layers = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']
        # content_w, style_w: corresponding weights for content loss and style loss
        self.content_w = None
        self.style_w = None
        # style_layer_w: weights for the different style layers. Deeper layers are weighted more heavily.
        self.style_layer_w = [0.5, 1.0, 1.5, 3.0, 4.0] 
        self.gstep = None # global step
        self.lr = None
        ###############################

    def create_input(self):
        '''
        We use a single input_img to stand in for the content image,
        style image, and generated image, because:
            1. they have the same dimension
            2. we have to extract the same set of features from them
        We use a variable instead of a placeholder because we are also
        training the generated image itself to get the desired result.

        Note: image height corresponds to number of rows, not columns.
        '''
        with tf.variable_scope('input') as scope:
            self.input_img = tf.get_variable('in_img', 
                                        shape=([1, self.img_height, self.img_width, 3]),
                                        dtype=tf.float32,
                                        initializer=tf.zeros_initializer())
    def load_vgg(self):
        '''
        Load the saved model parameters of VGG-19, using the input_img
        as the input to compute the output at each layer of vgg.

        VGG-19 was trained on mean-centered images. The mean pixel values
        are [123.68, 116.779, 103.939] along the RGB channels, so we have to
        subtract this mean from our images.

        '''
        self.vgg = load_vgg.VGG(self.input_img)
        self.vgg.load()
        self.content_img -= self.vgg.mean_pixels
        self.style_img -= self.vgg.mean_pixels

    def _content_loss(self, P, F):
        ''' Calculate the loss between the feature representation of the
        content image and the generated image.
        
        Inputs: 
            P: content representation of the content image
            F: content representation of the generated image
            Read the assignment handout for more details

            Note: Don't use the coefficient 0.5 as defined in the paper.
            Use the coefficient defined in the assignment handout.
        '''
        ###############################
        ## TO DO
        self.content_loss = None
        ###############################
        
    def _gram_matrix(self, F, N, M):
        """ Create and return the gram matrix for tensor F
            Hint: you'll first have to reshape F
        """
        ###############################
        ## TO DO
        return None
        ###############################

    def _single_style_loss(self, a, g):
        """ Calculate the style loss at a certain layer
        Inputs:
            a is the feature representation of the style image at that layer
            g is the feature representation of the generated image at that layer
        Output:
            the style loss at a certain layer (which is E_l in the paper)

        Hint: 1. you'll have to use the function _gram_matrix()
            2. we'll use the same coefficient for style loss as in the paper
            3. a and g are feature representation, not gram matrices
        """
        ###############################
        ## TO DO
        return None
        ###############################

    def _style_loss(self, A):
        """ Calculate the total style loss as a weighted sum 
        of style losses at all style layers
        Hint: you'll have to use _single_style_loss()
        """
        ###############################
        ## TO DO
        self.style_loss = None
        ###############################

    def losses(self):
        with tf.variable_scope('losses') as scope:
            with tf.Session() as sess:
                # assign content image to the input variable
                sess.run(self.input_img.assign(self.content_img)) 
                gen_img_content = getattr(self.vgg, self.content_layer)
                content_img_content = sess.run(gen_img_content)
            self._content_loss(content_img_content, gen_img_content)

            with tf.Session() as sess:
                sess.run(self.input_img.assign(self.style_img))
                style_layers = sess.run([getattr(self.vgg, layer) for layer in self.style_layers])                              
            self._style_loss(style_layers)

            ##########################################
            ## TO DO: create total loss. 
            ## Hint: don't forget the weights for the content loss and style loss
            self.total_loss = None
            ##########################################

    def optimize(self):
        ###############################
        ## TO DO: create optimizer
        self.opt = None
        ###############################

    def create_summary(self):
        ###############################
        ## TO DO: create summaries for all the losses
        ## Hint: don't forget to merge them
        self.summary_op = None
        ###############################


    def build(self):
        self.create_input()
        self.load_vgg()
        self.losses()
        self.optimize()
        self.create_summary()

    def train(self, n_iters):
        skip_step = 1
        with tf.Session() as sess:
            
            ###############################
            ## TO DO: 
            ## 1. initialize your variables
            ## 2. create writer to write your graph
            ###############################
            
            sess.run(self.input_img.assign(self.initial_img))

            ###############################
            ## TO DO: 
            ## 1. create a saver object
            ## 2. check if a checkpoint exists, restore the variables
            ##############################

            initial_step = self.gstep.eval()
            
            start_time = time.time()
            for index in range(initial_step, n_iters):
                if index >= 5 and index < 20:
                    skip_step = 10
                elif index >= 20:
                    skip_step = 20
                
                sess.run(self.opt)
                if (index + 1) % skip_step == 0:
                    ###############################
                    ## TO DO: obtain generated image, loss, and summary
                    gen_image, total_loss, summary = None, None, None
                    ###############################
                    
                    # add back the mean pixels we subtracted before
                    gen_image = gen_image + self.vgg.mean_pixels 
                    writer.add_summary(summary, global_step=index)
                    print('Step {}\n   Sum: {:5.1f}'.format(index + 1, np.sum(gen_image)))
                    print('   Loss: {:5.1f}'.format(total_loss))
                    print('   Took: {} seconds'.format(time.time() - start_time))
                    start_time = time.time()

                    filename = 'outputs/%d.png' % (index)
                    utils.save_image(filename, gen_image)

                    if (index + 1) % 20 == 0:
                        ###############################
                        ## TO DO: save the variables into a checkpoint
                        ###############################
                        pass

if __name__ == '__main__':
    setup()
    machine = StyleTransfer('content/deadpool.jpg', 'styles/guernica.jpg', 333, 250)
    machine.build()
    machine.train(300)

================================================
FILE: assignments/02_style_transfer/style_transfer_sol.py
================================================
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import tensorflow as tf

import load_vgg_sol
import utils

def setup():
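    utils.safe_mkdir('checkpoints')
    # the checkpoint restore in train() looks inside this subfolder
    utils.safe_mkdir('checkpoints/style_transfer')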
    utils.safe_mkdir('outputs')

class StyleTransfer(object):
    def __init__(self, content_img, style_img, img_width, img_height):
        '''
        img_width and img_height are the dimensions we expect from the generated image.
        We will resize input content image and input style image to match this dimension.
        Feel free to alter any hyperparameter here and see how it affects your training.
        '''
        self.img_width = img_width
        self.img_height = img_height
        self.content_img = utils.get_resized_image(content_img, img_width, img_height)
        self.style_img = utils.get_resized_image(style_img, img_width, img_height)
        self.initial_img = utils.generate_noise_image(self.content_img, img_width, img_height)

        ###############################
        ## TO DO
        ## create global step (gstep) and hyperparameters for the model
        self.content_layer = 'conv4_2'
        self.style_layers = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']
        self.content_w = 0.01
        self.style_w = 1
        self.style_layer_w = [0.5, 1.0, 1.5, 3.0, 4.0] 
        self.gstep = tf.Variable(0, dtype=tf.int32, 
                                trainable=False, name='global_step')
        self.lr = 2.0
        ###############################

    def create_input(self):
        '''
        We use a single input_img to stand in for the content image,
        style image, and generated image, because:
            1. they have the same dimension
            2. we have to extract the same set of features from them
        We use a variable instead of a placeholder because we are also
        training the generated image itself to get the desired result.

        Note: image height corresponds to number of rows, not columns.
        '''
        with tf.variable_scope('input') as scope:
            self.input_img = tf.get_variable('in_img', 
                                        shape=([1, self.img_height, self.img_width, 3]),
                                        dtype=tf.float32,
                                        initializer=tf.zeros_initializer())
    def load_vgg(self):
        '''
        Load the saved model parameters of VGG-19, using the input_img
        as the input to compute the output at each layer of vgg.

        VGG-19 was trained on mean-centered images. The mean pixel values
        are [123.68, 116.779, 103.939] along the RGB channels, so we have to
        subtract this mean from our images.

        '''
        self.vgg = load_vgg_sol.VGG(self.input_img)
        self.vgg.load()
        self.content_img -= self.vgg.mean_pixels
        self.style_img -= self.vgg.mean_pixels

    def _content_loss(self, P, F):
        ''' Calculate the loss between the feature representation of the
        content image and the generated image.
        
        Inputs: 
            P: content representation of the content image
            F: content representation of the generated image
            Read the assignment handout for more details

            Note: Don't use the coefficient 0.5 as defined in the paper.
            Use the coefficient defined in the assignment handout.
        '''
        # self.content_loss = None
        ###############################
        ## TO DO
        self.content_loss = tf.reduce_sum((F - P) ** 2) / (4.0 * P.size)
        ###############################
    
    def _gram_matrix(self, F, N, M):
        """ Create and return the gram matrix for tensor F
            Hint: you'll first have to reshape F
        """
        ###############################
        ## TO DO
        F = tf.reshape(F, (M, N))
        return tf.matmul(tf.transpose(F), F)
        ###############################

    def _single_style_loss(self, a, g):
        """ Calculate the style loss at a certain layer
        Inputs:
            a is the feature representation of the style image at that layer
            g is the feature representation of the generated image at that layer
        Output:
            the style loss at a certain layer (which is E_l in the paper)

        Hint: 1. you'll have to use the function _gram_matrix()
            2. we'll use the same coefficient for style loss as in the paper
            3. a and g are feature representation, not gram matrices
        """
        ###############################
        ## TO DO
        N = a.shape[3] # number of filters
        M = a.shape[1] * a.shape[2] # height times width of the feature map
        A = self._gram_matrix(a, N, M)
        G = self._gram_matrix(g, N, M)
        return tf.reduce_sum((G - A) ** 2 / ((2 * N * M) ** 2))
        ###############################

    def _style_loss(self, A):
        """ Calculate the total style loss as a weighted sum 
        of style losses at all style layers
        Hint: you'll have to use _single_style_loss()
        """
        n_layers = len(A)
        E = [self._single_style_loss(A[i], getattr(self.vgg, self.style_layers[i])) for i in range(n_layers)]
        
        ###############################
        ## TO DO
        self.style_loss = sum([self.style_layer_w[i] * E[i] for i in range(n_layers)])
        ###############################

    def losses(self):
        with tf.variable_scope('losses') as scope:
            with tf.Session() as sess:
                # assign content image to the input variable
                sess.run(self.input_img.assign(self.content_img)) 
                gen_img_content = getattr(self.vgg, self.content_layer)
                content_img_content = sess.run(gen_img_content)
            self._content_loss(content_img_content, gen_img_content)

            with tf.Session() as sess:
                sess.run(self.input_img.assign(self.style_img))
                style_layers = sess.run([getattr(self.vgg, layer) for layer in self.style_layers])                              
            self._style_loss(style_layers)

            ##########################################
            ## TO DO: create total loss. 
            ## Hint: don't forget the weights for the content loss and style loss
            self.total_loss = self.content_w * self.content_loss + self.style_w * self.style_loss
            ##########################################

    def optimize(self):
        ###############################
        ## TO DO: create optimizer
        self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.total_loss,
                                                            global_step=self.gstep)
        ###############################

    def create_summary(self):
        ###############################
        ## TO DO: create summaries for all the losses
        ## Hint: don't forget to merge them
        with tf.name_scope('summaries'):
            # summary names may not contain spaces, so use underscores
            tf.summary.scalar('content_loss', self.content_loss)
            tf.summary.scalar('style_loss', self.style_loss)
            tf.summary.scalar('total_loss', self.total_loss)
            self.summary_op = tf.summary.merge_all()
        ###############################


    def build(self):
        self.create_input()
        self.load_vgg()
        self.losses()
        self.optimize()
        self.create_summary()

    def train(self, n_iters):
        skip_step = 1
        with tf.Session() as sess:
            
            ###############################
            ## TO DO: 
            ## 1. initialize your variables
            ## 2. create writer to write your graph
            sess.run(tf.global_variables_initializer())
            writer = tf.summary.FileWriter('graphs/style_transfer', sess.graph)
            ###############################
            sess.run(self.input_img.assign(self.initial_img))


            ###############################
            ## TO DO: 
            ## 1. create a saver object
            ## 2. check if a checkpoint exists, restore the variables
            saver = tf.train.Saver()
            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/style_transfer/checkpoint'))
            if ckpt and ckpt.model_checkpoint_path:
                saver.restore(sess, ckpt.model_checkpoint_path)
            ##############################

            initial_step = self.gstep.eval()
            
            start_time = time.time()
            for index in range(initial_step, n_iters):
                if index >= 5 and index < 20:
                    skip_step = 10
                elif index >= 20:
                    skip_step = 20
                
                sess.run(self.opt)
                if (index + 1) % skip_step == 0:
                    ###############################
                    ## TO DO: obtain generated image, loss, and summary
                    gen_image, total_loss, summary = sess.run([self.input_img,
                                                                self.total_loss,
                                                                self.summary_op])

                    ###############################
                    
                    # add back the mean pixels we subtracted before
                    gen_image = gen_image + self.vgg.mean_pixels 
                    writer.add_summary(summary, global_step=index)
                    print('Step {}\n   Sum: {:5.1f}'.format(index + 1, np.sum(gen_image)))
                    print('   Loss: {:5.1f}'.format(total_loss))
                    print('   Took: {} seconds'.format(time.time() - start_time))
                    start_time = time.time()

                    filename = 'outputs/%d.png' % (index)
                    utils.save_image(filename, gen_image)

                    if (index + 1) % 20 == 0:
                        ###############################
                        ## TO DO: save the variables into a checkpoint
                        # save under the same folder that the restore step above reads from
                        saver.save(sess, 'checkpoints/style_transfer/style_transfer', index)
                        ###############################

if __name__ == '__main__':
    setup()
    machine = StyleTransfer('content/deadpool.jpg', 'styles/guernica.jpg', 333, 250)
    machine.build()
    machine.train(300)

================================================
FILE: assignments/02_style_transfer/utils.py
================================================
""" Utils needed for the implementation in TensorFlow
of the paper A Neural Algorithm of Artistic Style (Gatys et al., 2016) 

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

For more details, please read the assignment handout:
https://docs.google.com/document/d/1FpueD-3mScnD0SJQDtwmOb1FrSwo1NGowkXzMwPoLH4/edit?usp=sharing

"""

import os

from PIL import Image, ImageOps
import numpy as np
import scipy.misc
from six.moves import urllib

def download(download_link, file_name, expected_bytes):
    """ Download the pretrained VGG-19 model if it's not already downloaded """
    if os.path.exists(file_name):
        print("VGG-19 pre-trained model is ready")
        return
    print("Downloading the VGG pre-trained model. This might take a while ...")
    file_name, _ = urllib.request.urlretrieve(download_link, file_name)
    file_stat = os.stat(file_name)
    if file_stat.st_size == expected_bytes:
        print('Successfully downloaded VGG-19 pre-trained model', file_name)
    else:
        raise Exception('File ' + file_name +
                        ' might be corrupted. You should try downloading it with a browser.')

def get_resized_image(img_path, width, height, save=True):
    image = Image.open(img_path)
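    # PIL sizes are (width, height), the reverse of NumPy's (rows, columns) order.
    # Image.LANCZOS is the same filter as the older Image.ANTIALIAS alias,
    # which newer Pillow releases removed.
    image = ImageOps.fit(image, (width, height), Image.LANCZOS)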
    if save:
        image_dirs = img_path.split('/')
        image_dirs[-1] = 'resized_' + image_dirs[-1]
        out_path = '/'.join(image_dirs)
        if not os.path.exists(out_path):
            image.save(out_path)
    image = np.asarray(image, np.float32)
    return np.expand_dims(image, 0)

def generate_noise_image(content_image, width, height, noise_ratio=0.6):
    noise_image = np.random.uniform(-20, 20, (1, height, width, 3)).astype(np.float32)
    return noise_image * noise_ratio + content_image * (1 - noise_ratio)

def save_image(path, image):
    image = image[0]
    image = np.clip(image, 0, 255).astype('uint8')
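    # scipy.misc.imsave is gone from newer SciPy releases; PIL (already imported) writes the file instead
    Image.fromarray(image).save(path)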

def safe_mkdir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

================================================
FILE: assignments/chatbot/README.md
================================================
A neural chatbot using a sequence-to-sequence model with an
attentional decoder. This is a fully functional chatbot.

This is based on the Google Translate TensorFlow model 
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence-to-sequence model by Cho et al. (2014)

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

The detailed assignment handout can be found at:
https://docs.google.com/document/d/1GJfn2B6EI8JueDiBwzTAdD34d6pC99BSt6vldOmUCPQ/edit?usp=sharing


<h2>Sample conversation</h2>
HUMAN ++++ hi <br>
BOT ++++ hi . what ' s your name ?<br>
HUMAN ++++ i'm chip<br>
BOT ++++ please .<br>
HUMAN ++++ please what?<br>
BOT ++++ i don ' t know what to say .<br>
HUMAN ++++ where are you from?<br>
BOT ++++ california .<br>
HUMAN ++++ do you like california?<br>
BOT ++++ i ' m okay .<br>
HUMAN ++++ i like california a lot<br>
BOT ++++ of what ?<br>
HUMAN ++++ i live here<br>
BOT ++++ and what is it ?<br>
HUMAN ++++ california<br>
BOT ++++ i don ' t know what to say .<br>
HUMAN ++++ i'm a student here<br>
BOT ++++ and get it out of here .<br>
HUMAN ++++ what do you like to do in your free time?<br>
BOT ++++ and what about the gun ?<br>
HUMAN ++++ do you have a gun?<br>
BOT ++++ yes .<br>
HUMAN ++++ you're dangerous!<br>
BOT ++++ i ' m not .<br>
HUMAN ++++ do you go hunting?<br>
BOT ++++ i can do .<br>
HUMAN ++++ what do you hunt?<br>
BOT ++++ virgin .<br>
HUMAN ++++ that's very bad<br>
BOT ++++ and you in love ?<br>
HUMAN ++++ no i'm not<br>
BOT ++++ please - -<br>
HUMAN ++++ okay, i am<br>
BOT ++++ and you are good .<br>
HUMAN ++++ yeah i am good too.<br>
BOT ++++ good good to you .<br>
HUMAN ++++ are you in love?<br>
BOT ++++ i ' m ready .<br>
HUMAN ++++ ready for what?<br>
BOT ++++ for what ?<br>
HUMAN ++++ for what?<br>
BOT ++++ i don ' t know what to say .<br>
HUMAN ++++ tell me<br>
BOT ++++ i can do it .<br>
HUMAN ++++ okay, tell me what you're ready for<br>
BOT ++++ i ' m doing you .<br>
HUMAN ++++ no you're not<br>
BOT ++++ please !<br>
HUMAN ++++ this is a strange conversation<br>
BOT ++++ i don ' t know what to say .<br>
HUMAN ++++ are you a guy?<br>
BOT ++++ of what ?<br>
HUMAN ++++ are you a boy?<br>
BOT ++++ i ' m fine .<br>
HUMAN ++++ are you a man?<br>
BOT ++++ i ' m a little .<br>
HUMAN ++++ what's your name?<br>
BOT ++++ chris -<br>
HUMAN ++++ chris who?<br>
BOT ++++ cornelius .<br>
HUMAN ++++ so i guess you're a guy.<br>
BOT ++++ i am .<br>
HUMAN ++++ okay<br>
BOT ++++ okay ?<br>
HUMAN ++++ do you have a girlfriend?<br>
BOT ++++ i ' m sure .<br>
HUMAN ++++ what's her name?<br>
BOT ++++ let ' s talk about something else .<br>

See output_convo.txt for more sample conversations.

<h2>Usage</h2>

Step 1: create a data folder in your project directory, download
the Cornell Movie-Dialogs Corpus from
https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
and unzip it into that folder.

Step 2: update the config.py file<br>
Change DATA_PATH to the folder where you stored the data.
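
Before moving on, it can help to confirm that DATA_PATH really contains the two corpus files the pre-processing reads. The following is a minimal sketch, assuming the file names set in config.py (LINE_FILE and CONVO_FILE); the helper name `missing_corpus_files` is hypothetical and not part of this repository.

```python
import os

# File names as set in config.py (LINE_FILE and CONVO_FILE).
REQUIRED_FILES = ('movie_lines.txt', 'movie_conversations.txt')

def missing_corpus_files(data_path, required=REQUIRED_FILES):
    """Return the required corpus files that are not present in data_path."""
    return [name for name in required
            if not os.path.isfile(os.path.join(data_path, name))]
```

An empty list from `missing_corpus_files(config.DATA_PATH)` suggests the folder is ready for Step 3.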

Step 3: python3 data.py<br>
This will do all the pre-processing for the Cornell dataset.
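
For orientation, each line of movie_lines.txt carries five fields separated by ` +++$+++ `, and data.py keeps the first (the line ID) and the last (the utterance). Below is a minimal sketch of that split; the sample line is only illustrative, and the helper name `parse_movie_line` is hypothetical.

```python
SEPARATOR = ' +++$+++ '

def parse_movie_line(line):
    """Return (line_id, text) for a well-formed line, or None otherwise."""
    parts = line.rstrip('\n').split(SEPARATOR)
    if len(parts) != 5:
        # data.py skips lines that do not have exactly five fields
        return None
    return parts[0], parts[4]

# illustrative line in the corpus layout: id, user, movie, character, text
sample = 'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n'
print(parse_movie_line(sample))  # prints ('L1045', 'They do not!')
```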

Step 4: python3 chatbot.py --mode [train/chat]<br>
If the mode is train, you train the chatbot. By default, the model
restores the previously trained weights (if there are any) and continues
training from there.

If you want to start training from scratch, please delete all the checkpoints
in the checkpoints folder.
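
Starting from scratch amounts to emptying the checkpoints folder. Here is a minimal sketch, assuming the default folder name `checkpoints` (CPT_PATH in config.py); the helper name `clear_checkpoints` is hypothetical.

```python
import os
import shutil

def clear_checkpoints(cpt_path='checkpoints'):
    """Delete everything inside cpt_path, keep the folder itself,
    and return the names that were removed."""
    if not os.path.isdir(cpt_path):
        return []
    removed = []
    for name in sorted(os.listdir(cpt_path)):
        full_path = os.path.join(cpt_path, name)
        if os.path.isdir(full_path) and not os.path.islink(full_path):
            shutil.rmtree(full_path)
        else:
            os.remove(full_path)
        removed.append(name)
    return removed
```

The folder itself is kept, so the next training run can write new checkpoints into it.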

If the mode is chat, you enter interactive mode with the bot.

By default, all the conversations you have with the chatbot will be written
into the file output_convo.txt in the processed folder. If you run this chatbot,
I kindly ask you to send me the output_convo.txt so that I can improve
the chatbot.
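
The log is plain text: every turn is a line of the form `HUMAN ++++ ...` or `BOT ++++ ...`, and each chat session ends with a line of `=` characters. The following is a minimal sketch for reading it back into sessions; the helper name `parse_convo_log` is hypothetical and not part of this repository.

```python
def parse_convo_log(lines):
    """Split log lines into sessions; each session is a list of
    (speaker, text) tuples in the order they were written."""
    sessions, current = [], []
    for raw in lines:
        line = raw.rstrip('\n')
        if line.startswith('====='):
            # a run of '=' marks the end of one chat session
            if current:
                sessions.append(current)
            current = []
            continue
        speaker, sep, text = line.partition(' ++++ ')
        if sep and speaker in ('HUMAN', 'BOT'):
            current.append((speaker, text))
    if current:
        sessions.append(current)
    return sessions
```

It can be fed an open file directly, for example `parse_convo_log(open('processed/output_convo.txt'))`, assuming the default PROCESSED_PATH.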


Thank you very much!


================================================
FILE: assignments/chatbot/chatbot.py
================================================
""" A neural chatbot using sequence to sequence model with
attentional decoder. 

This is based on Google Translate Tensorflow model 
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

This file contains the code to run the model.

See README.md for instruction on how to run the starter code.
"""
import argparse
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import random
import sys
import time

import numpy as np
import tensorflow as tf

from model import ChatBotModel
import config
import data

def _get_random_bucket(train_buckets_scale):
    """ Get a random bucket from which to choose a training sample """
    rand = random.random()
    return min([i for i in range(len(train_buckets_scale))
                if train_buckets_scale[i] > rand])

def _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks):
    """ Assert that the encoder inputs, decoder inputs, and decoder masks are
    of the expected lengths """
    if len(encoder_inputs) != encoder_size:
        raise ValueError("Encoder length must be equal to the one in bucket,"
                        " %d != %d." % (len(encoder_inputs), encoder_size))
    if len(decoder_inputs) != decoder_size:
        raise ValueError("Decoder length must be equal to the one in bucket,"
                       " %d != %d." % (len(decoder_inputs), decoder_size))
    if len(decoder_masks) != decoder_size:
        raise ValueError("Weights length must be equal to the one in bucket,"
                       " %d != %d." % (len(decoder_masks), decoder_size))

def run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, forward_only):
    """ Run one step of training or evaluation.
    @forward_only: boolean that decides whether the backward pass is run.
    forward_only is set to True when you just want to evaluate on the test set,
    or when you want the bot to be in chat mode. """
    encoder_size, decoder_size = config.BUCKETS[bucket_id]
    _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks)

    # input feed: encoder inputs, decoder inputs, target_weights, as provided.
    input_feed = {}
    for step in range(encoder_size):
        input_feed[model.encoder_inputs[step].name] = encoder_inputs[step]
    for step in range(decoder_size):
        input_feed[model.decoder_inputs[step].name] = decoder_inputs[step]
        input_feed[model.decoder_masks[step].name] = decoder_masks[step]

    last_target = model.decoder_inputs[decoder_size].name
    input_feed[last_target] = np.zeros([model.batch_size], dtype=np.int32)

    # output feed: depends on whether we do a backward step or not.
    if not forward_only:
        output_feed = [model.train_ops[bucket_id],  # update op that does SGD.
                       model.gradient_norms[bucket_id],  # gradient norm.
                       model.losses[bucket_id]]  # loss for this batch.
    else:
        output_feed = [model.losses[bucket_id]]  # loss for this batch.
        for step in range(decoder_size):  # output logits.
            output_feed.append(model.outputs[bucket_id][step])

    outputs = sess.run(output_feed, input_feed)
    if not forward_only:
        return outputs[1], outputs[2], None  # Gradient norm, loss, no outputs.
    else:
        return None, outputs[0], outputs[1:]  # No gradient norm, loss, outputs.

def _get_buckets():
    """ Load the dataset into buckets based on their lengths.
    train_buckets_scale is the interval that'll help us
    choose a random bucket later on.
    """
    test_buckets = data.load_data('test_ids.enc', 'test_ids.dec')
    data_buckets = data.load_data('train_ids.enc', 'train_ids.dec')
    train_bucket_sizes = [len(data_buckets[b]) for b in range(len(config.BUCKETS))]
    print("Number of samples in each bucket:\n", train_bucket_sizes)
    train_total_size = sum(train_bucket_sizes)
    # list of increasing numbers from 0 to 1 that we'll use to select a bucket.
    train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                           for i in range(len(train_bucket_sizes))]
    print("Bucket scale:\n", train_buckets_scale)
    return test_buckets, data_buckets, train_buckets_scale

def _get_skip_step(iteration):
    """ How many steps should the model train before it saves all the weights. """
    if iteration < 100:
        return 30
    return 100

def _check_restore_parameters(sess, saver):
    """ Restore the previously trained parameters if there are any. """
    ckpt = tf.train.get_checkpoint_state(os.path.dirname(config.CPT_PATH + '/checkpoint'))
    if ckpt and ckpt.model_checkpoint_path:
        print("Loading parameters for the Chatbot")
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        print("Initializing fresh parameters for the Chatbot")

def _eval_test_set(sess, model, test_buckets):
    """ Evaluate on the test set. """
    for bucket_id in range(len(config.BUCKETS)):
        if len(test_buckets[bucket_id]) == 0:
            print("  Test: empty bucket %d" % (bucket_id))
            continue
        start = time.time()
        encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(test_buckets[bucket_id], 
                                                                        bucket_id,
                                                                        batch_size=config.BATCH_SIZE)
        _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, 
                                   decoder_masks, bucket_id, True)
        print('Test bucket {}: loss {}, time {}'.format(bucket_id, step_loss, time.time() - start))

def train():
    """ Train the bot """
    test_buckets, data_buckets, train_buckets_scale = _get_buckets()
    # in train mode, we need to create the backward pass, so forward_only is False
    model = ChatBotModel(False, config.BATCH_SIZE)
    model.build_graph()

    saver = tf.train.Saver()

    with tf.Session() as sess:
        print('Running session')
        sess.run(tf.global_variables_initializer())
        _check_restore_parameters(sess, saver)

        iteration = model.global_step.eval()
        total_loss = 0
        while True:
            skip_step = _get_skip_step(iteration)
            bucket_id = _get_random_bucket(train_buckets_scale)
            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(data_buckets[bucket_id], 
                                                                           bucket_id,
                                                                           batch_size=config.BATCH_SIZE)
            start = time.time()
            _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, False)
            total_loss += step_loss
            iteration += 1

            if iteration % skip_step == 0:
                print('Iter {}: loss {}, time {}'.format(iteration, total_loss/skip_step, time.time() - start))
                start = time.time()
                total_loss = 0
                saver.save(sess, os.path.join(config.CPT_PATH, 'chatbot'), global_step=model.global_step)
                if iteration % (10 * skip_step) == 0:
                    # Run evals on development set and print their loss
                    _eval_test_set(sess, model, test_buckets)
                    start = time.time()
                sys.stdout.flush()

def _get_user_input():
    """ Get user's input, which will be transformed into encoder input later """
    print("> ", end="")
    sys.stdout.flush()
    return sys.stdin.readline()

def _find_right_bucket(length):
    """ Find the proper bucket for an encoder input based on its length """
    return min([b for b in range(len(config.BUCKETS))
                if config.BUCKETS[b][0] >= length])

def _construct_response(output_logits, inv_dec_vocab):
    """ Construct a response to the user's encoder input.
    @output_logits: the outputs from sequence to sequence wrapper.
    output_logits is decoder_size np array, each of dim 1 x DEC_VOCAB
    
    This is a greedy decoder - outputs are just argmaxes of output_logits.
    """
    print(output_logits[0])
    outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
    # If there is an EOS symbol in outputs, cut them at that point.
    if config.EOS_ID in outputs:
        outputs = outputs[:outputs.index(config.EOS_ID)]
    # Print out sentence corresponding to outputs.
    return " ".join([tf.compat.as_str(inv_dec_vocab[output]) for output in outputs])

def chat():
    """ In chat mode, we don't need to create the backward pass.
    """
    _, enc_vocab = data.load_vocab(os.path.join(config.PROCESSED_PATH, 'vocab.enc'))
    inv_dec_vocab, _ = data.load_vocab(os.path.join(config.PROCESSED_PATH, 'vocab.dec'))

    model = ChatBotModel(True, batch_size=1)
    model.build_graph()

    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        _check_restore_parameters(sess, saver)
        output_file = open(os.path.join(config.PROCESSED_PATH, config.OUTPUT_FILE), 'a+')
        # Decode from standard input.
        max_length = config.BUCKETS[-1][0]
        print('Welcome to TensorBro. Say something. Enter to exit. Max length is', max_length)
        while True:
            line = _get_user_input()
            if len(line) > 0 and line[-1] == '\n':
                line = line[:-1]
            if line == '':
                break
            output_file.write('HUMAN ++++ ' + line + '\n')
            # Get token-ids for the input sentence.
            token_ids = data.sentence2id(enc_vocab, str(line))
            if (len(token_ids) > max_length):
                print('Max length I can handle is:', max_length)
                # skip this input; the loop prompts for the next one
                continue
            # Which bucket does it belong to?
            bucket_id = _find_right_bucket(len(token_ids))
            # Get a 1-element batch to feed the sentence to the model.
            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch([(token_ids, [])], 
                                                                            bucket_id,
                                                                            batch_size=1)
            # Get output logits for the sentence.
            _, _, output_logits = run_step(sess, model, encoder_inputs, decoder_inputs,
                                           decoder_masks, bucket_id, True)
            response = _construct_response(output_logits, inv_dec_vocab)
            print(response)
            output_file.write('BOT ++++ ' + response + '\n')
        output_file.write('=============================================\n')
        output_file.close()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--mode', choices={'train', 'chat'},
                        default='train', help="mode. if not specified, it's in the train mode")
    args = parser.parse_args()

    if not os.path.isdir(config.PROCESSED_PATH):
        data.prepare_raw_data()
        data.process_data()
    print('Data ready!')
    # create checkpoints folder if there isn't one already
    data.make_dir(config.CPT_PATH)

    if args.mode == 'train':
        train()
    elif args.mode == 'chat':
        chat()

if __name__ == '__main__':
    main()


================================================
FILE: assignments/chatbot/config.py
================================================
""" A neural chatbot using sequence to sequence model with
attentional decoder. 

This is based on Google Translate Tensorflow model 
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

This file contains the hyperparameters for the model.

See README.md for instruction on how to run the starter code.
"""

# parameters for processing the dataset
DATA_PATH = 'data/cornell movie-dialogs corpus'
CONVO_FILE = 'movie_conversations.txt'
LINE_FILE = 'movie_lines.txt'
OUTPUT_FILE = 'output_convo.txt'
PROCESSED_PATH = 'processed'
CPT_PATH = 'checkpoints'

THRESHOLD = 2

PAD_ID = 0
UNK_ID = 1
START_ID = 2
EOS_ID = 3

TESTSET_SIZE = 25000

BUCKETS = [(19, 19), (28, 28), (33, 33), (40, 43), (50, 53), (60, 63)]


CONTRACTIONS = [("i ' m ", "i 'm "), ("' d ", "'d "), ("' s ", "'s "),
                ("don ' t ", "do n't "), ("didn ' t ", "did n't "), ("doesn ' t ", "does n't "),
                ("can ' t ", "ca n't "), ("shouldn ' t ", "should n't "), ("wouldn ' t ", "would n't "),
                ("' ve ", "'ve "), ("' re ", "'re "), ("in ' ", "in' ")]

NUM_LAYERS = 3
HIDDEN_SIZE = 256
BATCH_SIZE = 64

LR = 0.5
MAX_GRAD_NORM = 5.0

NUM_SAMPLES = 512


================================================
FILE: assignments/chatbot/data.py
================================================
""" A neural chatbot using sequence to sequence model with
attentional decoder. 

This is based on Google Translate Tensorflow model 
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/

Sequence to sequence model by Cho et al.(2014)

Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu

This file contains the code to do the pre-processing for the
Cornell Movie-Dialogs Corpus.

See readme.md for instruction on how to run the starter code.
"""
import os
import random
import re

import numpy as np

import config

def get_lines():
    """ Map each line ID in the raw data to its utterance text. """
    id2line = {}
    file_path = os.path.join(config.DATA_PATH, config.LINE_FILE)
    print(config.LINE_FILE)
    # errors='ignore' drops bytes that cannot be decoded instead of raising
    with open(file_path, 'r', errors='ignore') as f:
        for line in f:
            parts = line.split(' +++$+++ ')
            if len(parts) == 5:
                if parts[4].endswith('\n'):
                    parts[4] = parts[4][:-1]
                id2line[parts[0]] = parts[4]
    return id2line

def get_convos():
    """ Get conversations from the raw data """
    file_path = os.path.join(config.DATA_PATH, config.CONVO_FILE)
    convos = []
    with open(file_path, 'r') as f:
        for line in f.readlines():
            parts = line.split(' +++$+++ ')
            if len(parts) == 4:
                convo = []
                for line in parts[3][1:-2].split(', '):
                    convo.append(line[1:-1])
                convos.append(convo)

    return convos

def question_answers(id2line, convos):
    """ Divide the dataset into two sets: questions and answers. """
    questions, answers = [], []
    for convo in convos:
        for index, line in enumerate(convo[:-1]):
            questions.append(id2line[convo[index]])
            answers.append(id2line[convo[index + 1]])
    assert len(questions) == len(answers)
    return questions, answers

def prepare_dataset(questions, answers):
    # create path to store all the train & test encoder & decoder
    make_dir(config.PROCESSED_PATH)
    
    # random convos to create the test set
    # a set keeps the membership test below O(1) per sample
    test_ids = set(random.sample(range(len(questions)), config.TESTSET_SIZE))
    
    filenames = ['train.enc', 'train.dec', 'test.enc', 'test.dec']
    files = []
    for filename in filenames:
        files.append(open(os.path.join(config.PROCESSED_PATH, filename),'w'))

    for i in range(len(questions)):
        if i in test_ids:
            files[2].write(questions[i] + '\n')
            files[3].write(answers[i] + '\n')
        else:
            files[0].write(questions[i] + '\n')
            files[1].write(answers[i] + '\n')

    for file in files:
        file.close()

def make_dir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

def basic_tokenizer(line, normalize_digits=True):
    """ A basic tokenizer to tokenize text into tokens.
    Feel free to change this to suit your need. """
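    line = re.sub('<u>', '', line)
    line = re.sub('</u>', '', line)
    line = re.sub(r'\[', '', line)
    line = re.sub(r'\]', '', line)
    words = []
    # the hyphen is escaped so it matches literally; unescaped, '-< is a
    # character range that also splits on digits and other symbols
    _WORD_SPLIT = re.compile("([.,!?\"'\\-<>:;)(])")
    _DIGIT_RE = re.compile(r"\d")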
    for fragment in line.strip().lower().split():
        for token in re.split(_WORD_SPLIT, fragment):
            if not token:
                continue
            if normalize_digits:
                token = re.sub(_DIGIT_RE, '#', token)
            words.append(token)
    return words

def build_vocab(filename, normalize_digits=True):
    in_path = os.path.join(config.PROCESSED_PATH, filename)
    out_path = os.path.join(config.PROCESSED_PATH, 'vocab.{}'.format(filename[-3:]))

    vocab = {}
    with open(in_path, 'r') as f:
        for line in f.readlines():
            for token in basic_tokenizer(line):
                if not token in vocab:
                    vocab[token] = 0
                vocab[token] += 1

    sorted_vocab = sorted(vocab, key=vocab.get, reverse=True)
    with open(out_path, 'w') as f:
        f.write('<pad>' + '\n')
        f.write('<unk>' + '\n')
        f.write('<s>' + '\n')
        f.write('<\\s>' + '\n')  # end-of-sequence token, stored literally as <\s>
        index = 4
        for word in sorted_vocab:
            if vocab[word] < config.THRESHOLD:
                break
            f.write(word + '\n')
            index += 1
        with open('config.py', 'a') as cf:
            if filename[-3:] == 'enc':
                cf.write('ENC_VOCAB = ' + str(index) + '\n')
            else:
                cf.write('DEC_VOCAB = ' + str(index) + '\n')

def load_vocab(vocab_path):
    with open(vocab_path, 'r') as f:
        words = f.read().splitlines()
    return words, {words[i]: i for i in range(len(words))}

def sentence2id(vocab, line):
    return [vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)]

def token2id(data, mode):
    """ Convert all the tokens in the data into their corresponding
    index in the vocabulary. """
    vocab_path = 'vocab.' + mode
    in_path = data + '.' + mode
    out_path = data + '_ids.' + mode

    _, vocab = load_vocab(os.path.join(config.PROCESSED_PATH, vocab_path))
    with open(os.path.join(config.PROCESSED_PATH, in_path), 'r') as in_file, \
         open(os.path.join(config.PROCESSED_PATH, out_path), 'w') as out_file:
        for line in in_file.read().splitlines():
            if mode == 'dec': # only decoder sequences get the start and end tokens
                ids = [vocab['<s>']]
            else:
                ids = []
            ids.extend(sentence2id(vocab, line))
            if mode == 'dec':
                ids.append(vocab['<\\s>'])
            out_file.write(' '.join(str(id_) for id_ in ids) + '\n')

def prepare_raw_data():
    print('Preparing raw data into train set and test set ...')
    id2line = get_lines()
    convos = get_convos()
    questions, answers = question_answers(id2line, convos)
    prepare_dataset(questions, answers)

def process_data():
    print('Preparing data to be model-ready ...')
    build_vocab('train.enc')
    build_vocab('train.dec')
    token2id('train', 'enc')
    token2id('train', 'dec')
    token2id('test', 'enc')
    token2id('test', 'dec')

def load_data(enc_filename, dec_filename, max_training_size=None):
    """ Read the parallel id files and put each pair into the first bucket
    that fits both sequences. max_training_size is currently unused. """
    enc_path = os.path.join(config.PROCESSED_PATH, enc_filename)
    dec_path = os.path.join(config.PROCESSED_PATH, dec_filename)
    data_buckets = [[] for _ in config.BUCKETS]
    with open(enc_path, 'r') as encode_file, open(dec_path, 'r') as decode_file:
        for i, (encode, decode) in enumerate(zip(encode_file, decode_file)):
            if (i + 1) % 10000 == 0:
                print("Bucketing conversation number", i)
            encode_ids = [int(id_) for id_ in encode.split()]
            decode_ids = [int(id_) for id_ in decode.split()]
            for bucket_id, (encode_max_size, decode_max_size) in enumerate(config.BUCKETS):
                if len(encode_ids) <= encode_max_size and len(decode_ids) <= decode_max_size:
                    data_buckets[bucket_id].append([encode_ids, decode_ids])
                    break
    return data_buckets

def _pad_input(input_, size):
    return input_ + [config.PAD_ID] * (size - len(input_))

def _reshape_batch(inputs, size, batch_size):
    """ Re-index inputs from [batch][position] to a list of `size` arrays,
    one per position, each holding that position's ids for the whole batch.
    """
    batch_inputs = []
    for length_id in range(size):
        batch_inputs.append(np.array([inputs[batch_id][length_id]
                                    for batch_id in range(batch_size)], dtype=np.int32))
    return batch_inputs


def get_batch(data_bucket, bucket_id, batch_size=1):
    """ Return one batch to feed into the model """
    # only pad to the max length of the bucket
    encoder_size, decoder_size = config.BUCKETS[bucket_id]
    encoder_inputs, decoder_inputs = [], []

    for _ in range(batch_size):
        encoder_input, decoder_input = random.choice(data_bucket)
        # pad both encoder and decoder, reverse the encoder
        encoder_inputs.append(list(reversed(_pad_input(encoder_input, encoder_size))))
        decoder_inputs.append(_pad_input(decoder_input, decoder_size))

    # now we create batch-major vectors from the data selected above.
    batch_encoder_inputs = _reshape_batch(encoder_inputs, encoder_size, batch_size)
    batch_decoder_inputs = _reshape_batch(decoder_inputs, decoder_size, batch_size)

    # create decoder_masks to be 0 for decoders that are padding.
    batch_masks = []
    for length_id in range(decoder_size):
        batch_mask = np.ones(batch_size, dtype=np.float32)
        for batch_id in range(batch_size):
            # we set mask to 0 if the corresponding target is a PAD symbol.
            # the corresponding target is decoder_input shifted by 1 forward.
            if length_id < decoder_size - 1:
                target = decoder_inputs[batch_id][length_id + 1]
            if length_id == decoder_size - 1 or target == config.PAD_ID:
                batch_mask[batch_id] = 0.0
        batch_masks.append(batch_mask)
    return batch_encoder_inputs, batch_decoder_inputs, batch_masks

if __name__ == '__main__':
    prepare_raw_data()
    process_data()

================================================
FILE: assignments/chatbot/model.py
================================================
import time

import numpy as np
import tensorflow as tf

import config

class ChatBotModel:
    def __init__(self, forward_only, batch_size):
        """forward_only: if set, we do not construct the backward pass in the model.
        """
        print('Initialize new model')
        self.fw_only = forward_only
        self.batch_size = batch_size

    def _create_placeholders(self):
        # Feeds for inputs. It's a list of placeholders
        print('Create placeholders')
        self.encoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='encoder{}'.format(i))
                               for i in range(config.BUCKETS[-1][0])]
        self.decoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='decoder{}'.format(i))
                               for i in range(config.BUCKETS[-1][1] + 1)]
        self.decoder_masks = [tf.placeholder(tf.float32, shape=[None], name='mask{}'.format(i))
                              for i in range(config.BUCKETS[-1][1] + 1)]

        # Our targets are decoder inputs shifted by one (to ignore <GO> symbol)
        self.targets = self.decoder_inputs[1:]

    def _inference(self):
        print('Create inference')
        # If we use sampled softmax, we need an output projection.
        # Sampled softmax only makes sense if we sample less than vocabulary size.
        # Both attributes default to None so later code can test them safely.
        self.output_projection = None
        self.softmax_loss_function = None
        if config.NUM_SAMPLES > 0 and config.NUM_SAMPLES < config.DEC_VOCAB:
            w = tf.get_variable('proj_w', [config.HIDDEN_SIZE, config.DEC_VOCAB])
            b = tf.get_variable('proj_b', [config.DEC_VOCAB])
            self.output_projection = (w, b)

            def sampled_loss(logits, labels):
                labels = tf.reshape(labels, [-1, 1])
                return tf.nn.sampled_softmax_loss(weights=tf.transpose(w), 
                                                  biases=b, 
                                                  inputs=logits, 
                                                  labels=labels, 
                                                  num_sampled=config.NUM_SAMPLES, 
                                                  num_classes=config.DEC_VOCAB)
            self.softmax_loss_function = sampled_loss

        # build a separate GRUCell per layer so the layers do not share one cell object
        cells = [tf.contrib.rnn.GRUCell(config.HIDDEN_SIZE) for _ in range(config.NUM_LAYERS)]
        self.cell = tf.contrib.rnn.MultiRNNCell(cells)

    def _create_loss(self):
        print('Creating loss... \nIt might take a couple of minutes depending on how many buckets you have.')
        start = time.time()
        def _seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
            setattr(tf.contrib.rnn.GRUCell, '__deepcopy__', lambda self, _: self)
            setattr(tf.contrib.rnn.MultiRNNCell, '__deepcopy__', lambda self, _: self)
            return tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
                    encoder_inputs, decoder_inputs, self.cell,
                    num_encoder_symbols=config.ENC_VOCAB,
                    num_decoder_symbols=config.DEC_VOCAB,
                    embedding_size=config.HIDDEN_SIZE,
                    output_projection=self.output_projection,
                    feed_previous=do_decode)

        if self.fw_only:
            self.outputs, self.losses = tf.contrib.legacy_seq2seq.model_with_buckets(
                                        self.encoder_inputs, 
                                        self.decoder_inputs, 
                                        self.targets,
                                        self.decoder_masks, 
                                        config.BUCKETS, 
                                        lambda x, y: _seq2seq_f(x, y, True),
                                        softmax_loss_function=self.softmax_loss_function)
            # If we use output projection, we need to project outputs for decoding.
            if self.output_projection:
                for bucket in range(len(config.BUCKETS)):
                    self.outputs[bucket] = [tf.matmul(output, 
                                            self.output_projection[0]) + self.output_projection[1]
                                            for output in self.outputs[bucket]]
        else:
            self.outputs, self.losses = tf.contrib.legacy_seq2seq.model_with_buckets(
                                        self.encoder_inputs, 
                                        self.decoder_inputs, 
                                        self.targets,
                                        self.decoder_masks,
                                        config.BUCKETS,
                                        lambda x, y: _seq2seq_f(x, y, False),
                                        softmax_loss_function=self.softmax_loss_function)
        print('Time:', time.time() - start)

    def _creat_optimizer(self):
        print('Create optimizer... \nIt might take a couple of minutes depending on how many buckets you have.')
        with tf.variable_scope('training') as scope:
            self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

            if not self.fw_only:
                self.optimizer = tf.train.GradientDescentOptimizer(config.LR)
                trainables = tf.trainable_variables()
                self.gradient_norms = []
                self.train_ops = []
                start = time.time()
                for bucket in range(len(config.BUCKETS)):
                    
                    clipped_grads, norm = tf.clip_by_global_norm(tf.gradients(self.losses[bucket], 
                                                                 trainables),
                                                                 config.MAX_GRAD_NORM)
                    self.gradient_norms.append(norm)
                    self.train_ops.append(self.optimizer.apply_gradients(zip(clipped_grads, trainables), 
                                                            global_step=self.global_step))
                    print('Creating opt for bucket {} took {} seconds'.format(bucket, time.time() - start))
                    start = time.time()


    def _create_summary(self):
        pass

    def build_graph(self):
        self._create_placeholders()
        self._inference()
        self._create_loss()
        self._creat_optimizer()
        self._create_summary()

================================================
FILE: assignments/chatbot/output_convo.txt
================================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how are you?
BOT ++++ i ' m okay .
HUMAN ++++ really?
BOT ++++ i don ' t know what to say .
HUMAN ++++ how're you?
BOT ++++ i ' m fine .
HUMAN ++++ where're you from?
BOT ++++ california .
HUMAN ++++ which city?
BOT ++++ let ' s talk about something else .
HUMAN ++++ you don't like california?
BOT ++++ oh . . .
HUMAN ++++ why not?
BOT ++++ i can ' t .
HUMAN ++++ how old are you?
BOT ++++ eighteen .
=============================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how are you?
BOT ++++ i ' m okay .
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
=============================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how are you?
BOT ++++ i ' m okay .
HUMAN ++++ what's your name?
BOT ++++ what ?
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what's your name?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what's your name?
BOT ++++ prisoner .
HUMAN ++++ ha ha you really don't want to tell your name
BOT ++++ i ' m not going to do it .
HUMAN ++++ how old are you?
BOT ++++ can you guess ?
HUMAN ++++ eighteen?
BOT ++++ in the bathroom .
HUMAN ++++ in the bathroom?
BOT ++++ in the next time .
HUMAN ++++ why are you talking so funny?
BOT ++++ i ' m looking .
HUMAN ++++ where are you from?
BOT ++++ general .
HUMAN ++++ where are you from?
BOT ++++ how ?
HUMAN ++++ do you like the us?
BOT ++++ i ' m sure .
HUMAN ++++ do you vote?
BOT ++++ i ' m afraid .
HUMAN ++++ what are you afraid of?
BOT ++++ i ' m looking .
HUMAN ++++ what are you looking for?
BOT ++++ can you guess ?
HUMAN ++++ no
BOT ++++ i don ' t know what to say .
=============================================
HUMAN ++++ hi
BOT ++++ hi . what ' s your name ?
HUMAN ++++ i'm chip
BOT ++++ please .
HUMAN ++++ please what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ where are you from?
BOT ++++ california .
HUMAN ++++ do you like california?
BOT ++++ i ' m okay .
HUMAN ++++ i like california a lot
BOT ++++ of what ?
HUMAN ++++ i live here
BOT ++++ and what is it ?
HUMAN ++++ california
BOT ++++ i don ' t know what to say .
HUMAN ++++ i'm a student here
BOT ++++ and get it out of here .
HUMAN ++++ what do you like to do in your free time?
BOT ++++ and what about the gun ?
HUMAN ++++ do you have a gun?
BOT ++++ yes .
HUMAN ++++ you're dangerous!
BOT ++++ i ' m not .
HUMAN ++++ do you go hunting?
BOT ++++ i can do .
HUMAN ++++ what do you hunt?
BOT ++++ virgin .
HUMAN ++++ that's very bad
BOT ++++ and you in love ?
HUMAN ++++ no i'm not
BOT ++++ please - -
HUMAN ++++ okay, i am
BOT ++++ and you are good .
HUMAN ++++ yeah i am good too.
BOT ++++ good good to you .
HUMAN ++++ are you in love?
BOT ++++ i ' m ready .
HUMAN ++++ ready for what?
BOT ++++ for what ?
HUMAN ++++ for what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ tell me
BOT ++++ i can do it .
HUMAN ++++ okay, tell me what you're ready for
BOT ++++ i ' m doing you .
HUMAN ++++ no you're not
BOT ++++ please !
HUMAN ++++ this is a strange conversation
BOT ++++ i don ' t know what to say .
HUMAN ++++ are you a guy?
BOT ++++ of what ?
HUMAN ++++ are you a boy?
BOT ++++ i ' m fine .
HUMAN ++++ are you a man?
BOT ++++ i ' m a little .
HUMAN ++++ what's your name?
BOT ++++ chris -
HUMAN ++++ chris who?
BOT ++++ cornelius .
HUMAN ++++ so i guess you're a guy.
BOT ++++ i am .
HUMAN ++++ okay
BOT ++++ okay ?
HUMAN ++++ do you have a girlfriend?
BOT ++++ i ' m sure .
HUMAN ++++ what's her name?
BOT ++++ let ' s talk about something else .
=============================================
HUMAN ++++ hi chris
BOT ++++ hi
HUMAN ++++ how're you today?
BOT ++++ okay .
HUMAN ++++ the weather is nice today
BOT ++++ i don ' t know what to say .
HUMAN ++++ just say the weather is nice
BOT ++++ i don ' t know what to say .
=============================================
HUMAN ++++ hi
BOT ++++ hi
HUMAN ++++ how're you today?
BOT ++++ okay .
HUMAN ++++ just okay?
BOT ++++ i ' m okay .
HUMAN ++++ what's on your mind?
BOT ++++ i ' m in my room .
HUMAN ++++ doing what?
BOT ++++ i don ' t know what to say .
HUMAN ++++ you're boring.
BOT ++++ i ' m not .
HUMAN ++++ prove it
BOT ++++ and what ?
HUMAN ++++ what what?
BOT ++++ sorry , i ' m still very stupid .
HUMAN ++++ it's okay. we're all stupid sometimes.
BOT ++++ ' s good to see .
HUMAN ++++ what's good to see?
BOT ++++ i ' m in .
HUMAN ++++ you're in your room?
BOT ++++ i ' m here .
HUMAN ++++ where's your room?
BOT ++++ i ' m in my room .
HUMAN ++++ are you alone?
BOT ++++ i ' m fine .
HUMAN ++++ are you by yourself?
BOT ++++ in the morning .
HUMAN ++++ what time is it?
BOT ++++ # # # # # # # .
=============================================
HUMAN ++++ hi
BOT ++++ hi . what ' s your name ?
HUMAN ++++ i'm marie
BOT ++++ you are .
HUMAN ++++ yes i am
BOT ++++ and you are my friend .
HUMAN ++++ yes you are 
BOT ++++ and you are my best .
HUMAN ++++ that's so sweet
BOT ++++ please .
HUMAN ++++ do you know what time it is?
BOT ++++ i don ' t know what to say .
=============================================
HUMAN ++++ what time is it?
BOT ++++ 5 9 6 9 1 4 5 .
HUMAN ++++ it doesn't make any sense
BOT ++++ are you ?
HUMAN ++++ i do 
BOT ++++ and you are ?
HUMAN ++++ i'm a human.
BOT ++++ you in .
HUMAN ++++ are you in trouble?
BOT ++++ i ' m in .
HUMAN ++++ what trouble?
BOT ++++ i ' m in love .
HUMAN ++++ wow that's nice.
BOT ++++ like that ?
HUMAN ++++ yeah
BOT ++++ i don ' t know what to say .
HUMAN ++++ who are you?
BOT ++++ i don ' t understand .
HUMAN ++++ who are you?
BOT ++++ no idea .
HUMAN ++++ what's your name?
BOT ++++ chris -
HUMAN ++++ chris who?
BOT ++++ cornelius .
HUMAN ++++ where are you from?
BOT ++++ california .
HUMAN ++++ are you american?
BOT ++++ i ' m fine .
HUMAN ++++ of course you're fine.
BOT ++++ and you ?
HUMAN ++++ i'm okay.
BOT ++++ you are .
HUMAN ++++ yes, i am.
BOT ++++ and you are my friend .
HUMAN ++++ yes, i am.
BOT ++++ and you are my friend .
HUMAN ++++ that's correct.
BOT ++++ and you ' re in trouble ?
HUMAN ++++ no, i'm not. why did you think so?
BOT ++++ because it ' s a good idea .
HUMAN ++++ what is your good idea?
BOT ++++ i ' ll be fine .
=============================================
HUMAN ++++ hello
BOT ++++ and you are in trouble ?
HUMAN ++++ what did you say that?
BOT ++++ ' s life .
HUMAN ++++ i'm not in trouble.
BOT ++++ what the hell is that ?
HUMAN ++++ okay, what trouble?
BOT ++++ i ' m looking .
HUMAN ++++ why did you say i'm in trouble?
BOT ++++ let ' s talk about something else .
HUMAN ++++ what do you want to talk about?
BOT ++++ i can ' t .
HUMAN ++++ what can't?
BOT ++++ i ' ll have a drink .
HUMAN ++++ okay, me too.
BOT ++++ you were here ?
HUMAN ++++ i'm still here.
BOT ++++ what is it ?
HUMAN ++++ i thought you wanted a drink.
BOT ++++ i ' m not .
HUMAN ++++ you just said so!
BOT ++++ i was wrong .
HUMAN ++++ it's fine.
BOT ++++ okay ?
HUMAN ++++ yeah
BOT ++++ i don ' t know what to say .
HUMAN ++++ do you drink a lot?
BOT ++++ i can do it .
HUMAN ++++ do you drink often?
BOT ++++ in the time .
HUMAN ++++ that doesn't make any sense.
BOT ++++ are you ?
HUMAN ++++ what do you drink?
BOT ++++ ' s in a drink .
HUMAN ++++ do you drink alcohol?
BOT ++++ in time .
HUMAN ++++ i like beer.
BOT ++++ oh . . .
HUMAN ++++ don't you like beer?
BOT ++++ i don ' t know what to say .
HUMAN ++++ where do you live?
BOT ++++ up .
HUMAN ++++ up where?
BOT ++++ in the hospital .
HUMAN ++++ i'm sorry. are you okay?
BOT ++++ i ' m fine .
HUMAN ++++ why are you in the hospital?
BOT ++++ and what ?
HUMAN ++++ how many words do you know?
BOT ++++ i ' d do it .
HUMAN ++++ how many?
BOT ++++ i ' m not .
HUMAN ++++ how many?
BOT ++++ i ' m not .
HUMAN ++++ how many?
BOT ++++ i ' m not .
HUMAN ++++ whatever
BOT ++++ i don ' t know what to say .
HUMAN ++++ you're boring.
BOT ++++ i ' m not .
HUMAN ++++ yes you are.
BOT ++++ and you are my friend .
HUMAN ++++ no i'm not.
BOT ++++ please .
HUMAN ++++ okay fine. i'm your friend.
BOT ++++ you ' re my mother ?
HUMAN ++++ i have no son like you.
BOT ++++ i don ' t know what to say .
HUMAN ++++ i have no son.
BOT ++++ i see you .
HUMAN ++++ i have no children.
BOT ++++ you are .
HUMAN ++++ omg you're so dumb.
BOT ++++ i ' m not .
HUMAN ++++ you're stupid!
BOT ++++ i ' m not .
HUMAN ++++ do you always talk like this?
BOT ++++ of course .
HUMAN ++++ how old are you?
BOT ++++ eighteen .
HUMAN ++++ you're big enough.
BOT ++++ i ' m not .
HUMAN ++++ are you an adult?
BOT ++++ i ' m going .
HUMAN ++++ you're going to be an adult?
BOT ++++ i ' m looking for you .
HUMAN ++++ i'm in my room.
BOT ++++ and your mother ?
HUMAN ++++ she's in vietnam.
BOT ++++ you ' re here .
HUMAN ++++ yes i'm here.
BOT ++++ and what are you talking about ?
HUMAN ++++ you asked me where i am.
BOT ++++ and you are your friend .
HUMAN ++++ i need to go now.
BOT ++++ i will not .
HUMAN ++++ that's fine. you can stay. i'm leaving.
BOT ++++ oh . . .
HUMAN ++++ bye
BOT ++++ i don ' t know what to say .
HUMAN ++++ goodbye
BOT ++++ i don ' t know what to say .
HUMAN ++++ bye bye
BOT ++++ okay ?
HUMAN ++++ see you later
BOT ++++ i don ' t know what to say .
=============================================
HUMAN ++++ hi
BOT ++++ hi . what ' s your name ?
HUMAN ++++ i'm maria
BOT ++++ please .
HUMAN ++++ i'm jackson
BOT ++++ you are .
HUMAN ++++ i'm jack
BOT ++++ let ' s talk about something else .
HUMAN ++++ i have no go
BOT ++++ you are ?
HUMAN ++++ i have to go
BOT ++++ and what do you mean ?
HUMAN ++++ i'm leaving.
BOT ++++ what ?
HUMAN ++++ yeah, bye
BOT ++++ bye . have a nice day !
=============================================
HUMAN ++++ that's very bad.
BOT ++++ i ' ve got to work .
HUMAN ++++ what do you do?
BOT ++++ i ' ll be there .
HUMAN ++++ where do you work?
BOT ++++ in the trunk .
HUMAN ++++ are you a machenics?
BOT ++++ i ' m not .
HUMAN ++++ what are you?
BOT ++++ no idea .
=============================================


================================================
FILE: assignments/trump_bot/trump_tweets.txt
================================================
'State works hard and illegally for Clinton' #DrainTheSwamp __HTTP__ _E_
RT @IvankaTrump: Touched by the warm hospitality of Prime Minister Abe and the Japanese people. ありがとうございます [Thank you]! Until next time ... _E_
Since Congress can't get its act together on HealthCare I will be using the power of the pen to give great HealthCare to many people FAST _E_
I always said @BarackObama will attack Iran in some form prior to the election. _E_
Today I am working on my 'big surprise' for the @RNC convention. Everyone will love it. _E_
What a shock! The U.S. Capitol Christmas tree pays homage @BarackObama but failed to mention Jesus. _E_
Making America Safe is my number one priority. We will not admit those into our country we cannot safely vet. __HTTP__ _E_
Repubs must not allow Pres Obama to subvert the Constitution of the US for his own benefit & because he is unable to negotiate w/ Congress. _E_
Tell Iran to let our Christian Pastor go and I mean right now. If they don't there will be hell to pay. _E_
Man shot inside Paris police station. Just announced that terror threat is at highest level. Germany is a total mess big crime. GET SMART! _E_
Thank you! __HTTP__ _E_
I am now inspecting the Old Post Office on Pennsylvania Avenue will be a great hotel. Soon off to the Oklahoma State Fair! _E_
Just a few more days until the 13th season of All Star @CelebApprentice premieres. Be sure to tune in this Sunday at 9PM on @nbc. Big! _E_
Look forward to being in Phoenix tomorrow at 2:00 P.M. Hottest ticket in entire country. Was supposed to be 500 people now many thousands! _E_
The Al Frankenstien picture is really bad speaks a thousand words. Where do his hands go in pictures 2 3 4 5 & 6 while she sleeps? ..... _E_
Karl Rove's strategy and commercials were the worst I have ever seen. _E_
.@lindseygraham who had zero in his presidential run before dropping out in disgrace saying the most horrible things about me on @FoxNews. _E_
Great Concert at 4:00 P.M. today at Lincoln Memorial. Enjoy! _E_
"Donald Trump: Mitt Romney 'Blew It' Shouldn't Run Again" __HTTP__ via @Newsmax_Media by @OwenTew _E_
.@HillaryClinton talking about jobs? Remember what she promised upstate New York. #BigLeagueTruth#Debates __HTTP__ _E_
.@IsraeliPM @netanyahu is a resolute leader. When he sets a red line it stands! _E_
Thank you Idaho! I love your potatoes nobody grows them better. As President I will protect your market. __HTTP__ _E_
A bad thing finally happened to Derek Jeter he is a great champion. _E_
Obama told his donors this past week "public opinion" is on his side. Don't believe that one either. _E_
Join me this Friday in Pensacola Florida at the Pensacola Bay Center! Tickets: __HTTP__ __HTTP__ _E_
Wow record setting cold temperatures throughout large parts of the country. Must be global warming I mean climate change! _E_
...Colin Powell thought Iraq has weapons of mass destruction. _E_
Review your work habits & make sure they are taking you in the right direction. Don't tread water get out there and go for it. _E_
President said we would never leave a soldier behind. How about the 4 who died in Benghazi? _E_
The point is: the Chinese are smart they respond to economic pressure and they know they're not going to get (cont) __HTTP__ _E_
Look forward to introducing Governor Mike Pence (who has done a spectacular job in the great State of Indiana). My first choice from start! _E_
If Stuart Stevens' book is as bad as his horrible political advice to Mitt Romney don't waste your money. Arrogant guy but a zero! _E_
Snowboarder/Skateboarder @Shaun_White stopped by to visit this week.... __HTTP__ _E_
#MakeAmericaGreatAgain #NYPrimary __HTTP__ _E_
#TBT With @britneyspears __HTTP__ _E_
Why is @RandPaul allowed to take advantage of the people of Kentucky by running for Senator and Pres. Why should Kentucky be back up plan? _E_
We are about to have a record $500B trade deficit with the Chinese this year. That money should be back here financing jobs in America. _E_
Together we are MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_
Little @MacMiller I'm now going to teach you a big boy lesson about lawsuits and finance. You ungrateful dog! _E_
"The way to get started is to quit talking and begin doing." Walt Disney _E_
In the coming months and years ahead I look forward to building an even STRONGER relationship between the United States and China. __HTTP__ _E_
Who was it that secretly said to Russian President Tell Vladimir that after the election I'll have more flexibility? @foxandfriends _E_
I will be interviewed on @meetthepress this morning. Enjoy! _E_
Wacko pervert @AnthonyWeiner's idea of Hispanic outreach is using Carlos Danger as his sexting. He's an insensitive racist. _E_
Republicans should not negotiate against themselves again with @BarackObama in today's debt talks First and foremost CUTCAP and BALANCE. _E_
'Small business optimism soars after Trump election' __HTTP__ _E_
We need a great leader now! __HTTP__ _E_
I am going to Trump National Doral in Miami this week to check out the $250 million renovation. In construction always watch the money! _E_
A vote for Hillary Clinton is a vote for another generation of poverty high crime & lost opportunities. #ImWithYou __HTTP__ _E_
.@MattGinellaGC Matt the statement about Pinehurst looking like a local community golf course awful was not made by me but tweeted to me _E_
Our hearts are with all affected by the wildfires in California. God bless our brave First Responders and @FEMA team. We support you! __HTTP__ _E_
Hopefully others will follow suit. Our country needs & should demand security. It is time to get tough & be smart! _E_
Heading to North Carolina for two big rallies. Will be there soon. We will bring jobs back where they belong! _E_
.@ThrillistChi named @SixteenChicago @TrumpChicago one of the "best value Michelin starred restaurants in Chicago" __HTTP__ _E_
President Obama wants to change the name of Mt. McKinley to Denali after more than 100 years. Great insult to Ohio. I will change back! _E_
Obama is looking rhetorical and weak. @MittRomney is looking strong and sharp. _E_
Great to see the construction of the Old Post Office on Penn Ave. Going fast under budget ahead of schedule! _E_
Thank you New York and Pennsylvania! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Joe McQuaid (@deucecrew) is desperately trying to sell the @UnionLeader. It's a loser and my comments haven't helped him much. _E_
We know who did the hoax of James Gandolfini and ObamaCare. Be careful Mister. _E_
A message of condolences and support regarding the terrorist attacks in Tel Aviv: __HTTP__ _E_
...One point I made sure to stress at @LibertyU is to be sure to get even with anyone who crosses you... _E_
ObamaCare Story of the Day: "Florida Cancer Patient Loses Insurance During Treatment B/C of ObamaCare" __HTTP__ _E_
Now that Ken Frazier of Merck Pharma has resigned from President's Manufacturing Councilhe will have more time to LOWER RIPOFF DRUG PRICES! _E_
Wishing everyone a safe and Happy Halloween!#Halloween2017 __HTTP__ _E_
So many tweets & stories on Stewart/Pattinson Look it doesn't matter the relationship will never be the same. It is permanently broken. _E_
It is finally sinking through. 46% OF PEOPLE BELIEVE MAJOR NATIONAL NEWS ORGS FABRICATE STORIES ABOUT ME. FAKE NEWS even worse! Lost cred. _E_
Drew Peterson just got 36 years for killing his wife bring back the death penalty! _E_
Fake News CNN and NBC are going out of their way to disparage our great First Responders as a way to get Trump. Not fair to FR or effort! _E_
Don't forget to tune in tonight for the two hour premiere of The Apprentice. 9 pm EST on NBC. We're all in for a fantastic new season! _E_
#Trump2016 #IACaucus Finder: __HTTP__ __HTTP__ _E_
Join me this weekend! #NYPrimary4/16: SYRACUSE NOON __HTTP__ WATERTOWN 3pm __HTTP__ #Trump2016 _E_
I met some really great Air Force GENERALS and Navy ADMIRALS today talking about airplane capability and pricing. Very impressive people! _E_
Take your work seriously take yourself less seriously. It's a great recipe for some good times & great memories. _E_
Lightweight A.G. Eric Schneiderman who has been a total failure in office failed to report the 98% approval rating of students for courses _E_
Effective today my administration officially declared the #OpioidCrisis a NATIONAL PUBLIC HEALTH EMERGENCY under federal law. __HTTP__ _E_
Will be doing @foxandfriends live tomorrow at 7AM ET from Europe. _E_
RT @DonaldJTrumpJr: Happy new year everyone. #newyear #family #vacation #familytime __HTTP__ _E_
If our healthcare plan is approved you will see real healthcare and premiums will start tumbling down. ObamaCare is in a death spiral! _E_
Wishing a Happy Father's Day to all the Dad's out there YOU are a champion today and everyday! __HTTP__ _E_
Can you imagine with all of the talk about ObamaCare technical breakdowns made it a disastrous day. Our government is badly broken! _E_
Trump Nat'l Golf Club Philadelphia 360 beautiful acres as designed by Tom Fazio with views of the Philly skyline. __HTTP__ _E_
Via @IBDeditorials: "Most Americans Label Obama Presidency A Failure" __HTTP__ _E_
#WeeklyAddress __HTTP__ __HTTP__ _E_
After climbing a great hill one only finds that there are many more hills to climb. Nelson Mandela _E_
I'll be on @americanowradio with Andy Dean at 6:30 ET today talking about last night's @FoxNews debate. __HTTP__ _E_
Via @EveningExpress: Images of Donald Trump's 2nd North east golf course released: Public have say on images __HTTP__ _E_
Many people are saying it was wonderful that Mrs. Obama refused to wear a scarf in Saudi Arabia but they were insulted.We have enuf enemies _E_
Entrepreneurs: Identify your goals. Know precisely what you want to achieve. Have your own vision and stick with it! _E_
Experience is a hard teacher because she gives the test first the lesson afterwards. Vernon Sanders Law _E_
...big unnecessary regulation cuts made it all possible" (among many other things). "President Trump reversed the policies of President Obama and reversed our economic decline." Thank you Stuart Varney. @foxandfriends _E_
Will be interviewed on @oreillyfactor tonight at 8:00 P.M. _E_
#ICYMI OHIO RALLY!Watch here: __HTTP__ __HTTP__ _E_
Dopey Sugar @Lord_Sugar Bad ratings come on keep making me money remember I own your show. _E_
China hacked the U.S. Chamber of Commerce and now has the information of all 3 million members. China keeps (cont) __HTTP__ _E_
Check out the recent Editorial in the Wall Street Journal @WSJ about what a complete disaster the @CFPB has been under its leader from previous Administration who just quit! _E_
"You have to set higher and higher goals. You have to want more or you will start slipping backwards fast." – Think BIG _E_
The only @Forbes Five Star & @fivediamond hotel in NYC @TrumpNewYork is the definition of luxury __HTTP__ The Best! _E_
This is really unfair and a conflict for all the other candidates. I said it should not be allowed and ABC agreed. _E_
Eliot better have a great pre nup—I want to help Silda in her negotiation. _E_
Congratulations to Gretchen Carlson on her big move to hosting an afternoon solo show this fall on @FoxNews. _E_
The Clinton News Network sometimes referred to as @CNN is getting more and more biased.They act so indignant hear them behind closed doors _E_
.@DonaldJTrumpJr and @EricTrump with @HulkHogan Great shot! __HTTP__ _E_
I predicted Apple's stock fall based on their dumb refusal to give the option of a larger iPhone screen like Samsung. I sold my Apple stock _E_
.@PapaJohns CEO John Schnatte has told shareholders that ObamaCare will force him to raise pizza prices __HTTP__ REPEAL! _E_
Important day spent at Camp David with our very talented Generals and military leaders. Many decisions made including on Afghanistan. _E_
The unforgivable crime is soft hitting. Do not hit at all if it can be avoided but never hit softly. Theodore Roosevelt _E_
Thank you Las Vegas Nevada!#Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_
The entire country is FREEZING we desperately need a heavy dose of global warming and fast! Ice caps size reaches all time high. _E_
My appearances on @todayshow __HTTP__ and @gma __HTTP__ _E_
Record cold temperatures in July 20 to 30 degrees colder than normal. What the hell happened to GLOBAL WARMING? _E_
The politicians of the U.K. should watch Katie Hopkins of Daily __HTTP__ on @FoxNews. Many people in the U.K. agree with me! _E_
Congrats to Miss Universe 2011 @RealLeilaLopes & @Giant great @OsiUmenyiora on their engagement! I am very happy for you both. _E_
Obama deserves much less credit for the killing of Bin Laden. The praise goes to our brave military and intelligence officers. _E_
Flashback – Jeb Bush says illegal immigrants breaking our laws is an "act of love" __HTTP__ He will never secure the border. _E_
Join @AmerIcan32 founded by Hall of Fame legend @JimBrownNFL32 on 1/19/2017 in Washington D.C.... __HTTP__ _E_
RE: Michael Jackson: He was a great friend and a spectacular entertainer. It's a devastating loss! Donald J. Trump _E_
Via @AP's: ObamaCare is a tax __HTTP__ @BarackObama gave the largest tax increase in history on the middle class. Shameful! _E_
I'm honored to be presented the award of Doctor of Business Administration Honoris Causa from Robert Gordon University in Aberdeen Scotland _E_
Great job by the FBI Boston Police and all others involved start the trial tonight! _E_
Rising premium costs from Obamacare will cost businesses billions __HTTP__ Guess where these new costs get passed to – you. _E_
Our spa @TrumpSoHo gets a nice write up in @DETAILS: #gotmilk _E_
More hysterical DSRL videos featuring Donald Trump and Double Trump plus enter Golden Lick Race Sweepstakes: __HTTP__ _E_
So proud of @FEMA Military and First Responders! Thank you! __HTTP__ _E_
Wow! Thank you Louisville Kentucky! #VoteTrump on 3/5/2016! Lets #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_
By @kwrcrow: "NY Post caught 'LYING' Again!" __HTTP__ ... The Donald" should go far. Actually if I run I'll win. _E_
Happy and proud to help @MittRomney win Ohio with robo calls in pivotal Cuyahoga County _E_
...fired. This story is totally made up by the dishonest media.The Chief is doing a FANTASTIC job for me and more importantly for the USA! _E_
Now China is helping Iran smuggle nuclear parts __HTTP__ . China is not an ally but our country's greatest threat & rival. _E_
Via @TheTodaysGolfer "@TrumpScotland gets new clubhouse" __HTTP__ _E_
Just departing La Crosse Wisconsin. Thank you! #Trump2016 #WIPrimary __HTTP__ __HTTP__ _E_
Thousands of US warplanes ships and missiles contain fake electronic components from China leaving them open (cont) __HTTP__ _E_
I will be interviewed on @CNN at 7:00 A.M. _E_
The biggest business people have used the bankruptcy laws to their advantage Warren B Icahn Kravis and this week John Paulson for haters! _E_
My exclusive @WSOC_TV interview with @BlairMiller9 discussing Trump National North Carolina & future deals __HTTP__ _E_
Happy New Year to all my Jewish friends. _E_
51 Million American to travel this weekend highest number in twelve years (AAA). Traffic and airports are running very smoothly! @FoxNews _E_
Letterman @Late_Show was great last night. I had a lot of fun. You could see his audience really wanted Obama to take the $ for charity. _E_
Just interviewed by @LouDobbs. Will be aired tonight at 7pmE on @FoxBusiness. #Dobbs _E_
Looking forward to being hosted & interviewed next Monday by David Rubenstein at the @TheEconomicClub __HTTP__ _E_
Success comes with hard work focus and luck. The luck comes to those who seek it out. If you are not in the game you cannot get lucky. _E_
Will be interviewed on @Morning_Joe at 7:3O. Enjoy! _E_
We should look to China where big time pollution takes place as they manufacture inefficient and costly wind turbines for Scotland! _E_
Stock Market hits another all time high on Friday. 5.3 trillion dollars up since Election. Fake News doesn't spent much time on this! _E_
The Failing @nytimes in a story by Peter Baker should have mentioned the rapid terminations by me of TPP & The Paris Accord & the fast.... _E_
Via Business Insider: Donald Trump's Poll Dominance in 2 Key States is Mind Blowing __HTTP__ _E_
The American economy would grow if Washington didn't keep threatening higher taxes and more regulations. Government is not the solution. _E_
This year's Trump Miss Universe Pageant is comprised of truly beautiful women.Will be simulcast live December 19th on @nbc and @Univision. _E_
The Democrats don't want money from budget going to border wall despite the fact that it will stop drugs and very bad MS 13 gang members. _E_
God bless all the brave souls who perished 12 years ago today. You will never be forgotten! _E_
The people of Buffalo should be happy Terry Pegula got the team but I hope he does better w/the Bills than he has w/the Sabres. Good luck! _E_
Just left Liberty University. Chancellor Jerry Falwell Jr.& his father have done an amazing job...great school & the students were fantastic _E_
Can you believe that the corrupt and pathetic South Africa police force has yet to arrest the sign language guy. Such danger give 10 years! _E_
Today we honor the fallen at #PearlHarbor 74 years ago today. If you see a vet today thank them! #RememberOurVets __HTTP__ _E_
RT @foxandfriends: STILL AHEAD: @realDonaldTrump joins us at 7am/et! #RNCinCLE __HTTP__ _E_
The Emmys are all politics that's why despite nominations The Apprentice never won even though it should have many times over. _E_
RT @PChowka: Fox News With Hannity's Help Regains Its Ratings Dominance By Peter Barry Chowka at The Hagmann report __HTTP__ _E_
We must never bend too much. Yitzhak Shamir (1915 2012) __HTTP__ _E_
I will be doing @colbertlateshow at 11:30 on CBS. Enjoy! __HTTP__ _E_
Obama's new campaign ad defends Solyndra __HTTP__ I guess losing $500M is a cause for celebration for @BarackObama. _E_
How much longer will the failing nytimes with its big losses and massive unfunded liability (and non existent sources) remain in business? _E_
RT @VP: Went to the Senate today to say @POTUS & I fully support Graham Cassidy plan to repeal/replace Obamacare. Let's get this done. __HTTP__ _E_
...@Lord_Sugar You need the income from the show to keep going hope it doesn't hurt. _E_
God never takes away something from your life without replacing it with something better. Rev. @BillyGraham _E_
Under Obama Iran has taken over Iraq Al Qaeda has taken over Libya the Muslim Brotherhood now controls Egypt. Worst foreign policy ever. _E_
Glad everyone could see Mar a Lago last night on @datelinenbc. It is the crown jewel of Palm Beach. _E_
.@MissUniverse visited my office tall and beautiful! __HTTP__ _E_
Remember when @BarackObama promised you could keep your coverage? Study shows 1 in 10 employers will drop health care __HTTP__ _E_
Hillary said that guns don't keep you safe. If she really believes that she should demand that her heavily armed bodyguards quickly disarm! _E_
The 9/11 trials at Gitmo over the weekend were a disaster. Can you imagine how much worse it would be if @BarackObama tried them in NYC? _E_
JOBS JOBS JOBS! __HTTP__ _E_
NPR's @NealConan said schlonged to WaPo re: 1984 Mondale/Ferraro campaign: That ticket went on to get schlonged at the polls. #Hypocrisy _E_
Which campaign is possibly on the trajectory towards insolvency? __HTTP__ At least @BarackObama is consistent. _E_
The invisible hand of the market always moves faster and better than the heavy hand of government. @MittRomney _E_
The new reality. China's economy 'underpins' global demand __HTTP__ Our leaders just watched as China took full control. _E_
Did you know that one of seven Americans is now on food stamps? Think of it. In the United States the most pr... (cont) __HTTP__ _E_
I still can't get over how the Republicans—my friends—spent hundreds of millions of dollars on such terrible & ineffective ads. _E_
Thank you! #MakeAmericaGreatAgain __HTTP__ _E_
Wow 30000 e mails were deleted by Crooked Hillary Clinton. She said they had to do with a wedding reception. Liar! How can she run? _E_
Do you notice that nobody is talking about the many scandals of the Obama administration anymore The Teflon President! _E_
Crooked Hillary wants to get rid of all guns and yet she is surrounded by bodyguards who are fully armed. No more guns to protect Hillary! _E_
May the Festival of Lights bring our Jewish friends from around the world health & happiness! Happy Hanukkah! __HTTP__ _E_
Fox & Friends at 7.00 _E_
It is time to take care of OUR people to rebuild OUR NATION and to fight for OUR GREAT AMERICAN WORKERS! #TaxReform #USA __HTTP__ _E_
In his entire political career @BarackObama has never had a tough @GOP opponent before @MittRomney. He is a paper tiger. #GOMITT _E_
Bill Clinton wants to #MakeAmericaGreatAgain __HTTP__ _E_
Failure defeats losers failure inspires winners. Robert T. Kiyosaki@theRealKiyosaki _E_
I have never met a successful person that was a quitter. Successful people never ever give up! _E_
Our debt is about to top $17T. ObamaCare and China (& others) are killing American business. _E_
Will be interviewed on @NewDay on @CNN at 7:15 A.M. _E_
My @foxandfriends interview re: Muslim Brotherhood taking over Egypt our vast natural gas resources & US tax system __HTTP__ _E_
For all of those who have been asking a big cast announcement coming soon for @ApprenticeNBC! _E_
Sometimes there is justice. A Chinese military newspaper was hacked. __HTTP__ _E_
Via @UnionLeader BY Bill Smith: "GOP rally in Manchester fires up party faithful" __HTTP__ _E_
A tough week was had by@MittRomney but he's come back from adversity before. _E_
CUT CAP AND BALANCE. TAXED ENOUGH ALREADY! _E_
.@GOP has leverage. Must stay united & on message. _E_
Via @PPDNews: Donald Trump: 'I Am Not Doing This For Fun' We Can't Fix U.S. 'Unless We Put Right Person' In WH __HTTP__ _E_
Trump Golf Links at Ferry Point in the Bronx NY will open soon. A Jack Nicklaus Signature Design. Beautiful. __HTTP__ _E_
Under @BarackObama the Iranian nuclear program has rapidly grown. __HTTP__ _E_
Via Int'l Business Times: Jeb Bush Got $1.3M Job at Lehman After Florida Shifted Pension Cash To Bank. __HTTP__ _E_
China is complaining about 2500 marines being placed in Australia. Meanwhile they are building bases across Latin America. #TimeToGetTough _E_
Donald Trump trademarked Reagan slogan & would like to stop other Republicans from using it __HTTP__ via @businessinsider _E_
Is Roger Simon @politicoroger ever right about anything? Now he's attacking @BillClinton in defense of (cont) __HTTP__ _E_
GO VOTE FROM NOW TO 8:30 P.M. NEVADA. I WILL BE AT VARIOUS CAUCUS SITES. MAKE AMERICA GREAT AGAIN! _E_
RT @TheFive: @POTUS being unpredictable is a big asset North Korea knew exactly what President Obama was going to do. @jessebwatters _E_
A pessimist is one who makes difficulties of his opportunities... _E_
Glad to hear @seanhannity supports my offer to Obama. As Sean says "it is an easy $5 million to charity. What does Obama have to lose?" _E_
My interview on @WOR710 with Jon Gambling discussing #TimeToGetTough meeting @NewtGingrich and the 2012 election __HTTP__ _E_
Next year will be an interesting one. I look forward to running against Hillary Clinton a totally flawed candidate and beating her soundly _E_
.@AustinKaiser52 The 2 people I am most excited to hear speak on Thrursday at @CPACnews is @GovChristie & @realDonaldTrump #DCBound Thanks. _E_
Now is the time to buy housing before values have fully recovered. In 5 years remember I told you so. _E_
Thank you to the men and women of Fort Myer and every member of the U.S. Military at home and abroad. #USA __HTTP__ _E_
Our country should be worried about nuclear control far more than gun control & that one's not even close! _E_
From Donald Trump: "I'm so proud of my wife Melania and the launch of her new jewelry line to debut on QVC on April 30th at 9 p.m." _E_
Internal polling shows that I would swamp @RobAstorino in a NY Republican primary 77% to 23%. But won't run if party is not unified. _E_
Via @swan_investor by @Forbes: "The Trump Card: Make America Great Again" __HTTP__ _E_
Join me in Roanoke Virginia tomorrow at the Berglund Center Coliseum ~ 6pm! Tickets available at:... __HTTP__ _E_
Great to see Sec. Clinton leaving the hospital yesterday with @ChelseaClinton and Pres. Clinton. Glad she is recuperating. _E_
Bernie Sanders is continuing his quest because he believes that Crooked Hillary Clinton will be forced out of the race e mail scandal! _E_
I will be interviewed on @oreillyfactor tonight at 8:00. Will be talking about the poor treatment of our veterans illegal immigration etc. _E_
I very much appreciate all of the great reviews & comments on my speech in Michigan the people were great. _E_
See when I said NATO was obsolete because of no terrorism protection they made the change without giving me credit. __HTTP__ _E_
Many Red State Democrats sticking with Obama on deficit spending on the ObamaCare monstrosity will be defeated in 2014. _E_
Happy 4th of July to everyone including the haters and losers! _E_
CNN is the worst fortunately they have bad ratings because everyone knows they are biased. __HTTP__ _E_
Thank you @Todayshow for the wonderful and honest poll results on Chicago sign. People love it! __HTTP__ @TrumpChicago _E_
Remember I predicted that New York Magazine would fold and people scoffed? Just announced (N.Y.Post) it lost big $'s & is cutting way back! _E_
Happy birthday to @garyplayer a truly great Champion and Person! _E_
Any increase in ObamaCare premiums is the fault of the Democrats for giving us a product that never had a chance of working. _E_
"Mold yourself into the person who can do big things." – Think Big _E_
Ron DeSantis Iraq vet Navy hero bronze star Yale Harvard Law running for Congress in Fla. Very impressive. __HTTP__ _E_
70% of the Chinese say they are better off than they were 4 years ago __HTTP__ At least someone has done well under Obama. _E_
Do you know how many years @TheRealMarilu starred on Taxi? #CelebApprentice _E_
"Interested is interesting. If you remember that simple rule you will have no trouble making conversation." Think Like a Billionaire _E_
I stand ready to lead us down a new path where we are lifted up by our desire to succeed not by a resentment of success. @MittRomney _E_
ObamaCare is causing such grief and tragedy for so many. It is being dismantled but in the meantime premiums & deductibles are way up! _E_
I will be in Indiana on Sunday and Monday at four MAKE AMERICA GREAT AGAIN rallies. See you there! _E_
Congrats @TrumpWaikiki for winning @AmericanExpress Fine Hotels & Resorts 'Hotel Partner of The Year for 2014' award! _E_
Is business success a natural talent? I think it's a combination of aptitude work and luck. Think Like a Champion _E_
Great ruling on wind farm in Scotland—very smart judge! Front page article. __HTTP__ _E_
Thank you @BrentBozell As you know I have been saying this for a long time __HTTP__ _E_
Flashback @FoxNewsInsider July '14:"Trump: Bergdahl Swap Another Mistake By 'Gang That Couldn't Shoot Straight'" __HTTP__ _E_
Everyone should boycott Italy if Amanda Knox is not freed she is totally innocent. _E_
The ALS #IceBucketChallenge that Trumps them all __HTTP__ _E_
Great job @MariaTCardona on @ThisWeekABC. You made kooky Cokie Roberts and @BillKristol look even dumber than they are. You will be right! _E_
Great to be on @andersoncooper tonight with my wonderful family. Will be rebroadcast at 12:00 A.M. (EASTERN). _E_
Virtually all Presidents and candidates including John McCain Bill Clinton George H.W. Bush and George W. Bush... __HTTP__ _E_
Thank you @JakeTapper for giving me credit for my vision on bombing the oil fields. Should have been done long ago. #Trump2016 _E_
...intentional. This whole narrative is a way of saving face for Democrats losing an election that everyone thought they were supposed..... _E_
"Some people spend an entire lifetime wondering if they made a difference in the world.The Marines don't have that problem." Ronald Reagan _E_
Rev.@BillyGraham is doing tremendous work this election cycle educating the Christian community on @MittRomney. _E_
"Read the Bible. Work hard and honestly. And don't complain." – Rev. @BillyGraham _E_
Robert Bryce @NYPost Congrats on your great opinion piece on terrible wind turbines & how destructive they are. Windmills are a disaster. _E_
I will be making a speech at 12:00 in Fort Worth Texas. Really big crowd expected. Will be talking about the debate last night plus plus! _E_
After many years of failure countries are coming together to finally address the dangers posed by North Korea. We must be tough & decisive! _E_
Show me someone without an ego and I'll show you a loser. How To Get Rich _E_
I like John McCain but we have to start rebuilding the United States instead of countries who hate us and want us to fail be smart! _E_
Happy Birthday @TheLeeGreenwood!#FlashbackFriday __HTTP__ _E_
The media is spending more time doing a forensic analysis of Melania's speech than the FBI spent on Hillary's emails. _E_
Any senator who votes against starting debate is telling America that you are fine w/ the #OCareNightmare! Remarks: __HTTP__ __HTTP__ _E_
Thank you America! Get out & VOTE tomorrow! #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
I will be speaking at the #StopIranDeal rally shortly watch live here __HTTP__ _E_
Negotiation tip #2: I always go into the deal anticipating the worst... _E_
Can you imagine if I had the small crowds that Hillary is drawing today in Pennsylvania. It would be a major media event! @CNN @FoxNews _E_
My ties shirts and cufflinks have never been more beautiful THE BEST available at Macy's! _E_
JFK Files are released long ahead of schedule! _E_
Who is the moron who decided to release the Ferguson grand jury findings after 9:00 o'clock in the evening. What were they thinking? _E_
THANK YOU! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
My @jrg710 interview discussing building a cemetery next to Trump National the FL primary @ApprenticeNBC and OPEC __HTTP__ _E_
ISIS is making big threats today no respect for U.S.A. or our leader If I win it will be a very different story with very fast results _E_
Commissioner Adam Silver made a strong and very wise decision concerning Donald Sterling. _E_
Sissy Graydon Carter of failing Vanity Fair Magazine and owner of bad food restaurants has a problem his V.F. Oscar party is no longer hot _E_
Love that Patriots won Brady is best ever! Seahawks pass was DUMBEST play in the history of football! Great going COACH B! _E_
Oil has been over $33/gallon for 34 months. A new record. And now with Obama's war on coal American families will be hit even harder. _E_
I'm really glad that @MittRomney no longer says what a nice guy @BarackObama is. _E_
My daughter Ivanka did great tonight in New Hampshire. The sold out crowd loved her and she loved them. Thanks Ivanka! _E_
.@ChuckGrassley got your message loud and clear. We have fantastic people on the ground got there long before #Harvey. So far so good! _E_
.@GeraldoRivera Thank you Geraldo for your nice words on @oreillyfactor tonight. You are a true champion! Thank @ericbolling great guy! _E_
Airing live from Baton Rouge at 8PM ET on @nbc 2014 @MissUSA Competition will be a tremendous event __HTTP__ _E_
The secret of getting ahead is getting started. Mark Twain _E_
Why is President Obama allowed to use Air Force One on the campaign trail with Crooked Hillary? She is flying with him tomorrow. Who pays? _E_
Will be interviewed on @foxandfriends at 8:00 A.M. _E_
"Having an ego and acknowledging it is a healthy choice. Our ego gives us a sense of purpose." – Think Like a Champion _E_
OmikronDreamer @realDonaldTrump do you wear your own ties? Yes. _E_
Do your homework. Wasting other people's time due to poor planning and thoughtlessness will only leave a bad impression. _E_
I will be on @oreillyfactor tonight interview with Bill O'Reilly on @FoxNews at 8 p.m. repeated at 11 p.m. _E_
My @FoxNews interview on @TeamCavuto discussing why debt commission should be discussed in debates & @RNC convention __HTTP__ _E_
Via @HPCaTravel by @alau2: "Trump Hotel Reflects Youthful Luxurious Vancouver: Ivanka Trump" __HTTP__ _E_
RT @Scavino45: .@POTUS @realDonaldTrump and @FLOTUS Melania visit with @UMCSN patient Tiffany Huizarin Las Vegas earlier today. #VegasStron... _E_
The toughest thing about success is that you've got to keep on being a success. Irving Berlin _E_
C SPAN/Conversation with Donald Trump/Economic Club of Washington DC __HTTP__ _E_
President Obama played golf yesterday??? _E_
Now the UN is attacking @Redskins franchise __HTTP__ With all the world's problems is this really a top priority? _E_
Via @GravisMarketing: "New Hampshire Poll: Trump into top tier status" __HTTP__ _E_
I am growing the Republican Party tremendously just look at the numbers way up! Democrats numbers are significantly down from years past. _E_
At 96 stories above Michigan Avenue if you're not staying at the 5 star @TrumpChicago then you're in its shadow __HTTP__ _E_
Next time Marco Rubio should drink his water from a glass as opposed to a bottle—would have much less negative impact. _E_
Obama lied 100% about Libya and the killings emails are absolute. He must release his records on Wednesday and stop the lies. _E_
Can you believe that President Obama still hasn't stopped the flights and people pouring into the U.S. from West Africa. TERRIBLE PRESIDENT! _E_
RT @Scavino45: 20295 miles later #POTUSinAsia has successfully concluded as @POTUS @realDonaldTrump lands on the South Lawn of @WhiteHouse... _E_
'Immigration Ban Is One Of Trump's Most Popular Orders So Far' __HTTP__ _E_
RT @VP: All Americans in harms way need to be prepared and should continue visiting __HTTP__ for critical updates on #Hurric... _E_
Why doesn't phony @bobvanderplaats tell his followers all the times he asked for him and his family to stay at my hotels didn't like paying _E_
Robert Slater who just passed away was a terrific writer who wrote a very fair book about me. He will be missed. __HTTP__ _E_
I believe in spending what you have to. But I also believe in not spending more than you should. The Art of The Deal _E_
My @FoxNews interview with @megynkelly discussing the 2012 election and the Newsmax @iontv debate __HTTP__ _E_
.@HillaryClinton is NOT above the law!#Debates2016 __HTTP__ _E_
This chart from AEI's @JimPethokoukis shows how terrible @BarackObama's 'recovery' really is: __HTTP__ Disaster. _E_
In '08 @BarackObama called Jerusalem Israel's capital __HTTP__ Now he attacks @MittRomney on Jerusalem __HTTP__ _E_
I won't be doing Fox & Friends tomorrow morning in that I have a big breakfast meeting on a deal. I will be back next week at 7. Thank you! _E_
On June 22 I will be going to Scotland to celebrate the opening of the newly renovated @TrumpTurnberry Resort the worlds best. _E_
I know Rand Paul and I think he may find a way to get there for the good of the Party! _E_
DC has shrunk our military and exploded our country with debt. We can't send another politician to the White House __HTTP__ _E_
Just heard Foreign Minister of North Korea speak at U.N. If he echoes thoughts of Little Rocket Man they won't be around much longer! _E_
Will be traveling to the Great State of Ohio tonight. Big crowd expected. See you there! _E_
I will be interviewed by @oreillyfactor at 4:00 P.M. (prior to the #SuperBowl Pre game Show) on Fox Network. Enjoy! _E_
.@Jetsetterdotcom in Hong Kong featured 8 pages on my great hometown of New York City including @TrumpSoHo __HTTP__ _E_
.@claudiajordan's judgment wasn't the best in who she chose to come back to the boardroom—that was her demise. #CelebApprentice _E_
Next Tuesday remember how our president has not lifted a finger for USMC Tahmooressi. He only wants illegals to cross our border. _E_
It was really strange when Hillary was missing from the podium last night. Not very presidential! _E_
My @CNN interview with @PiersTonight discussing the Newsmax @iontv debate #TimeToGetTough the GOP and the economy __HTTP__ _E_
Nice article on Trump Links at Ferry Point in today's New York Post the construction is going really well! _E_
WE LOVE YOU LAS VEGAS! __HTTP__ _E_
RT @JaniceTaylor912: @DonaldJTrumpJr @Reryan08 @IvankaTrump @EricTrump obvious to all that he raised some GREAT responsible patriotic kid... _E_
Poll: Trump Leads GOP Field Among Hispanics Records 34% Favorability __HTTP__ _E_
With the World hating us and wanting to destroy the U.S. we have just cut the hell out of the military budget making it smallest since '39 _E_
I am pleased to announce that I have chosen Governor Mike Pence as my Vice Presidential running mate. News conference tomorrow at 11:00 A.M. _E_
#ObamacareFail __HTTP__ _E_
So Obama used to tell classmates that he was Kenyan royalty and an Indonesian prince __HTTP__ Sounds like his book bio! _E_
We are going to make this a government of the people once again!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_
Iraq told us to get out Iraq is now falling and Iraq now wants us to come back! Don't do it unless we get the OIL and I mean ALL OF IT! _E_
Interesting studies show that wind farms have a warming effect on the climate _E_
I am so proud of our great Country. God bless America! _E_
The Celebrity Apprentice Sunday night at 9 PM on NBC. Another great episode! __HTTP__ _E_
Sadly firing can be an essential and responsible business decision. It isn't pleasant but lopping off a branch can save a tree. _E_
All of this Russia talk right when the Republicans are making their big push for historic Tax Cuts & Reform. Is this coincidental? NOT! _E_
... It is all about incorporating a sense of optimism into everything you do while also acknowledging the negative." – Think Big _E_
.@MittRomney scored last night on both substance and style. _E_
Join me in congratulating @NASA's @AstroPeggy by using the hashtag #CongratsPeggy! Earlier today:... __HTTP__ _E_
An individual whose whole career is trying to take down successful celebrities with nonsense campaigns has turned his attention to me..... _E_
Thank you America! #Trump2016 __HTTP__ _E_
Entrepreneurs must have vision plus the power of focus... to see the future and turn their vision into a profitable reality. #MidasTouch _E_
I'll bet if I didn't harass Apple for the last 2 years about the large screen iPhone they wouldn't have done it—but it bends & breaks! _E_
President Obama's inaugural had record low ratings. What does that portend? _E_
Mike Leach's lessons his takeaways from Geronimo's life are fascinating & useful whether in boardroom or locker room __HTTP__ _E_
Rasmussen just announced that my approval rating jumped to 49% a far better number than I had in winning the Election and higher than certain "sacred cows." Other Trump polls are way up also. So why does the media refuse to write this? Oh well someday! _E_
"President Donald J. Trump Proclaims October 24 2017 as United Nations Day" Read more: __HTTP__ __HTTP__ _E_
Entrepreneurs: Be passionate you have to love what you're doing to be successful at it. _E_
I am committed to keeping our air and water clean but always remember that economic growth enhances environmental protection. Jobs matter! _E_
Washington will continue to run record deficits into the election. We are borrowing at a rate of $1.40 from China. Truly unsustainable. _E_
Who is rooting for Obama more tonight his campaign advisors or the press? _E_
If I win the Presidency we will swamp Justice Ginsburg with real judges and real legal opinions! _E_
.@KarlRove who spent $430 million in the last cycle and didn't win one race said I'm not a candidate until I file papers. Next week Karl! _E_
Via @CarrGaz: "Trump to jet in to unveil Trump @TurnberryBuzz clubhouse" __HTTP__ _E_
Tom Brady would have won if he was throwing a soccer ball. He is my friend and a total winner! @Patriots _E_
Iran will convince our incompetent President that they are trying to help us with Iraq take over the country & oil and O will say thanks _E_
Foreigners slashed the purchase of US debt late last year the first time in over 2years. We must control spending. __HTTP__ _E_
....the 2016 election with interviews speeches and social media. I had to beat #FakeNews and did. We will continue to WIN! _E_
.@genesimmons Keep up the great work and congrats we are proud of you! _E_
Despite the phony Witch Hunt going on in America the economic & jobs numbers are great. Regulations way down jobs and enthusiasm way up! _E_
"Talent wins games but teamwork and intelligence wins championships." Michael Jordan _E_
If the morons who killed all of those people at Charlie Hebdo would have just waited the magazine would have folded no money no success! _E_
Entrepreneurs: What is the standard for which you want to be known? Identify that standard and then establish it. _E_
China watched Obama's press conference yesterday salivating. We will be borrowing trillions more from them. _E_
Thank you Jacksonville Florida!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
Thank you to @piersmorgan for your nice statement about me in the @HollywoodReporter __HTTP__ _E_
The failing New York Daily News knowingly incorrectly reported that I wanted to speak at the Republican National Convention wrong! _E_
"Problems setbacks mistakes & losses are all part of life. We shouldn't be shocked if and when they happen." – Think Like a Champion _E_
Thank you for you support Virginia! In ONE DAY get out and #VoteTrumpPence16! #ICYMI: __HTTP__ __HTTP__ _E_
We're coming up on the NEW YEAR It is really important that despite so many stupid decisions being made in Washington we make it BEST EVER _E_
RT @foxandfriends: POTUS the predictor? President Trump foretold housing upswing in 2012 __HTTP__ _E_
The Chinese are mistreating Hillary Clinton on her trip __HTTP__ They have zero respect for us. Outrageous! _E_
New CBS National Poll just out massive lead for Trump. The Wall Street Journal/NBC Poll is a total joke. No wonder WSJ is doing so badly! _E_
The Roger Stone report on @CNN is false Fake News. Have not spoken to Roger in a long time had nothing to do with my decision. _E_
Haters and losers say I wear a wig (I don't) say I went bankrupt (I didn't) say I'm worth $3.9 billion (much more). They know the truth! _E_
Wonderful to be in North Dakota with the incredible hardworking men & women @ the Andeavor Refinery. Full remarks: __HTTP__ __HTTP__ _E_
Just left Trump National Doral in Miami under massive construction The Blue Monster will be one of the greatest courses ever built! _E_
Departing Pittsburgh now where it was my great honor to stand with our incredible workers and to show the world that AMERICA is back and we are coming back bigger and better and stronger than ever before! __HTTP__ _E_
If you are lucky enough to catch a knockout assaulter before getting slugged and you carry a gun shoot the bastard (teach them a lesson)! _E_
RT @Scavino45: "Utilities cutting rates cite benefits of Trump tax reform" __HTTP__ _E_
When you do your Christmas shopping remember how disloyal @Macys was to the subject of illegal immigration. #BoycottMacys #DumpMacys _E_
We are making tremendous progress with the V. A. There has never been so much done so quickly and we have just started. We love our VETS! _E_
JOBS JOBS JOBS! __HTTP__ _E_
Mexico's court system is a dishonest joke. I am owed a lot of money & nothing happens. _E_
In Massachusetts the place is packed! #MakeAmericaGreatAgain _E_
Entrepreneurs: Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_
RT @DRUDGE_REPORT: 10 SCANDALS ON DIRECTOR'S WATCH... __HTTP__ _E_
Senior United States District Judge Robert E. Payne today ruled in favor of Trump campaign delegates who had argued.. __HTTP__ _E_
The Zimmerman trial is over. It is time to move on. While Zimmerman is no angel he was acquitted and should be able to move on. _E_
Entrepreneurs: Realize that persistence can go a long way. Being stubborn is often an attribute. _E_
Audience chanting RUN TRUMP RUN! during my my @SRQRepublicans speech! They are going to be very happy... _E_
Firing @lisalampanelli may have come as a surprise. She's a strong player. But there are no losers at this late point. #sweepstweet _E_
Via @necn by @KatherineNECN: "Trump Waiting to See Who Runs in 2016" __HTTP__ _E_
Justice Ginsburg of the U.S. Supreme Court has embarrassed all by making very dumb political statements about me. Her mind is shot resign! _E_
Why does a failed magazine like @Forbes constantly seek out trivial nonsense? Their circulation way down. @Clare_OC _E_
Premiering on January 4th the 14th season of @ApprenticeNBC will have major fireworks every episode. The Board Room is electric! _E_
These crimes won't be happening if I'm elected POTUS. Killer should have never been here. #AmericaFirst __HTTP__ _E_
Another freezing day in the Spring what is going on with global warming ? Good move changing the name to climate change sad! _E_
Cyprus is seizing private bank accounts as collateral for €10bn bail out. We owe $17T. Think it can't happen here? _E_
The terrorist who killed so many people in Germany said just before crime by God's will we will slaughter you pigs I swear we will...... _E_
Scary America would have had to pay all its GDP to the government to cover @BarackObama's real 2011 budget deficit __HTTP__ _E_
Tomorrow will be a really big day for America. MAKE AMERICA GREAT AGAIN! _E_
.@BreitbartNews: DONALD TRUMP: CANTOR'S DEFEAT SHOWS 'EVERYBODY' IN CONGRESS VULNERABLE IF THEY SUPPORT AMNESTY __HTTP__ _E_
Thanks. __HTTP__ _E_
Congratulations to my son Eric for making the Forbes 30 under 30 list. He's done a great job! __HTTP__ _E_
Via @UnionLeader by @tuohy: Trump hires Lewandowski as presidential run eyed __HTTP__ #FITN #MakeAmericaGreatAgain _E_
Via @BreitbartNews by mboyle1: EXCLUSIVE DONALD TRUMP CONFIRMED TO SPEAK AT #CPAC2014 __HTTP__ @ACUConservative @CPACnews _E_
Amazingly @AnthonyWeiner is going to run. The cure rate for his problem is 0. Lots of other things will come out. _E_
I look forward to attending & speaking at the Iowa Land Investment Expo—total sellout crowd __HTTP__ @PeoplesCompany _E_
America should not be pressuring @Israel to show restraint against Iran. We should be working to stop Iran's nuclear drive. _E_
Amazing! AG Schneiderman sues a school w/ a 98% approval rating but doesn't go after billion $ fraudsters all over Wall St. _E_
The hatchet job in @NYMag about Roger Ailes is total bullshit. He is the ultimate winner who is surrounded by a great team. @FoxNews _E_
Thank you for your continued support!#MakeAmericaGreatAgain __HTTP__ _E_
Obama's planned tax hike will hit over 1 million small businesses __HTTP__ Expect more massive unemployment and stagnant growth _E_
Join me live now in Las Vegas Nevada! We will MAKE AMERICA SAFE & GREAT AGAIN! #VoteTrumpNV #NevadaCaucus __HTTP__ _E_
"Donald Trump to name golf course after mother" __HTTP__ via @scotsmandotcom _E_
If you can't see it you can't make it happen. Entrepreneurs chase your dreams with resolute focus & determination. Be positive! _E_
A Lion's List of Democrats are not attending @BarackObama's DNC Convention. The Democratic Party is in turmoil. __HTTP__ _E_
The Holiday Season in New York City is a very special time. I love seeing and meeting the many tourists who visit the #TRUMP Tower atrium. _E_
Just watched Jon Stewart(?) jumping up and down and screaming like a madman nothing funny or smart just loud and obnoxious a pushy dope! _E_
Please tweet me your questions to answer in my #trumpvlog. _E_
TPP does not stop Japan's currency manipulation & China has a backdoor to join. It must be stopped. We need to protect the American worker! _E_
I have nothing to do with the Plaza Casino in Atlantic City. I have not been involved with Atlantic City for many years. Used to love A.C.! _E_
It's boardroom time! Does anyone miss @OMAROSA? #CelebApprentice _E_
.@HillaryClinton Sneers At Millions Of Average Americans. __HTTP__ #VPDebate #BigLeagueTruth _E_
Blatant and rampant property destruction in Baltimore as the police stand by and watch. Should be a lesson on how NOT to handle riots. SAD! _E_
Thank you Piers they don't know what they're getting into. __HTTP__ _E_
I applaud @netanyahu for announcing that he will show up at the UN to defend @Israel. A true US friend and great leader. _E_
The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_
Amazing crowd last night in Dallas more spirit and passion than ever before. Today all over the great State of Texas! _E_
Was not mentioned that we built one of the great golf courses in the world bringing tremendous business to Scotland. __HTTP__ _E_
"Simply take a big goal and mold yourself to become the person who can accomplish that goal." – Think Big _E_
Tonight's #CelebrityApprentice will continue to impress. Be sure to tune in tonight at 9PM ET @NBC. It will be amazing. _E_
.@CelebApprentice Flashback: "What @bretmichaels Learned from the 'Rock Star of Real Estate'" __HTTP__ _E_
Just left Virginia where I unveiled my healthcare and other plans for our great Veterans! They will be very happy! __HTTP__ _E_
Donald Trump To Mitt Romney: 'You're Fired' __HTTP__ via @fitsnews _E_
Trump International Hotel & Tower New York winner of the Forbes Five Star Hotel Award in 2009 through 2012. __HTTP__ _E_
If I would have done the last debate a record would have been set (instead of the poor ratings recieved). Also VETS got $6000000. _E_
Scary. President Obama told Boehner that the government doesn't have a spending problem __HTTP__ _E_
Donald Trump: 'Monkey business' on jobs __HTTP__ via @politico _E_
I have an idea for @JebBush whose campaign is a disaster. Try using your last name and don't be ashamed of it! _E_
Credible Source on 9 11 Muslim Celebrations: FBI __HTTP__ _E_
Host of the 2017 U.S. Women's Open Trump Bedminster has been rated one of America's best golf courses. _E_
SOMETIMES YOUR BEST INVESTMENTS ARE THE ONES YOU DON'T MAKE! _E_
Snowden should come back to America and face justice. Instead he is begging for clemency from Moscow. Treat him as a spy. _E_
Iran has warned the US not to send an aircraft carrier back into the Strait of Hormuz. We should send three as a (cont) __HTTP__ _E_
It will be interesting to see what happens to Eliot Spitzer if he loses the election for Comptroller to very capable @scottmstringer. _E_
.@foxandfriends int. on gov. collecting data whistle blower hiding in China & no bikinis in Miss World pageant __HTTP__ _E_
Huge Townhall tomorrow at 5PM in the NH Barrington Middle School! Thanks to @straffordnhgop for hosting! Let's Make America Great Again! _E_
Watch me on @Hannityshow tonight at 9PM ET on @FoxNews. _E_
RT @DanScavino: Jesse Jackson on @realDonaldTrump when he donated space for the Rainbow/Push Coalition. #DebateNight __HTTP__ _E_
China will extract much from Secretary Kerry and the U:S. in order for them to help us with the North Korea problem don't let this happen! _E_
I don't like seeing the Pope standing at the checkout counter (front desk) of a hotel in order to pay his bill. It's not Pope like! _E_
Disappointed the @NewYorkObserver article on @AGSchneiderman did not bring up his dealings w/ Shirley Huntley. __HTTP__ _E_
My @foxandfriends interview discussing Pres. Obama's inauguration @GOP debt plan & @CelebApprentice #1 branding __HTTP__ _E_
I will be interviewed on @foxandfriends at 9:00 A.M. I will be talking about the rigged and boss controlled Republican primaries! _E_
It's time to let Pete Rose the all time hits leader into the Baseball Hall of Fame. Enough already!!!!! _E_
In trade military and EVERYTHING else it will be AMERICA FIRST! This will quickly lead to our ultimate goal: MAKE AMERICA GREAT AGAIN! _E_
.@SabrinaSiddiqui Re: Taylor and Conor great news for Taylor! _E_
Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time. Thomas A. Edison _E_
Make sure to follow me on @periscopeco. I will be streaming my announcement at 11AM. _E_
With Dr. Dror Paley & Dr. Ben Carson with two wonderful children at Mar a Lago. __HTTP__ _E_
"Here's the truth the gov't doesn't shutdown" __HTTP__ via @AP. All essential services continue. Don't believe lies. _E_
Look how small the pages have become @WSJ. Looks like a tabloid—saving money I assume! _E_
"To keep momentum keep challenging yourself." – Think Big _E_
Don't forget to watch Celebrity Apprentice this Sunday night at 9 pm on NBC. You're in for a great show. __HTTP__ _E_
The legendary Barbara Walters interviews Melania Trump and me on a special this Friday night at 10:00 on ABC. Don't miss it! _E_
.@JebBush like it or not our country needs more energy and spirit than you can provide! #MakeAmericaGreatAgain _E_
I told you the Oscars were terrible—bad look bad talent—and among the lowest ratings in show's history. __HTTP__ ... _E_
I will be on @foxandfriends at 7.30 A.M. _E_
The Emmys were horrendous...the absolute worst show! _E_
Watch me tonight on Late Night with Jimmy Fallon.Photo: Lloyd Bishop/NBC __HTTP__ _E_
Gov.Kasich of Ohio just stated on a morning show that he doesn't watch politics or anything on television he only watches the @GolfChannel _E_
Done by a real fan! #TRUMP __HTTP__ _E_
I will be tweeting live tonight during Celebrity Apprentice 9 o'clock on NBC! _E_
Very dangerous pattern developing across country by Obama supporters. Detroit poll watcher was threatened with gun __HTTP__ _E_
In my book @Joan_Rivers had a lousy doctor shoving a camera down her throat at her age. Something went really wrong that should not have! _E_
Can you imagine if Bush's administration drafted a memo legalizing the killing of Americans?! Democrats are such hypocrites. _E_
I use both iPhone & Samsung. If Apple doesn't give info to authorities on the terrorists I'll only be using Samsung until they give info. _E_
The Apprentice on the other hand has been a MAJOR television hit often times finishing #1. Even now after 13 seasons it wins its slot! _E_
.@TheBrodyFile Exclusive: @realDonaldTrump Says He Will Protect Evangelicals Better Than @tedcruz __HTTP__ #CBNNews #2016 _E_
"Diligence is the mother of good luck." Benjamin Franklin _E_
Riley Rone was a great young man. We will miss him dearly. __HTTP__ _E_
No surprise. @Rosie is failing on @TheView.Terrible ratings."Malcontent & another season is out of the question __HTTP__ _E_
The only people who don't like the Tax Cut Bill are the people that don't understand it or the Obstructionist Democrats that know how really good it is and do not want the credit and success to go to the Republicans! _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Entrepreneurs: Resolve to be bigger than your problems. Who's the boss? _E_
Our country is facing a major threat from radical Islamic terrorism. We better get very smart and very tough FAST before it is too late! _E_
A president either is constantly on top of events or if he hesitates events will soon be on top of him... _E_
RT @realDonaldTrump: So much Fake News is being reported. They don't even try to get it right or correct it when they are wrong. They prom... _E_
Once again Obama is going to lose on another prospective nomination. Chuck Hagel will not be named Sec. of Defense & probably shouldn't be. _E_
Bloomberg: Trump leads GOP field __HTTP__ _E_
Look forward to Governor Mike Pence V.P. introduction tomorrow in New York City. _E_
Sorry I will miss the CPAC gathering in Orlando there in spirit Obama must go. _E_
My #GOPDebate @facebook question for the other candidates __HTTP__ _E_
With 50 days until the election it is #TimeToGetTough for @MittRomney & @GOP _E_
Congratulations to @BretBaier on the immediate & tremendous success of his book 'Special Heart.' Already in its third printing! _E_
.@TrumpChicago's exceptional dining w/equally exceptional views of the city are exclusive world class experiences __HTTP__ _E_
Great memory @TheRealMarilu! #CelebApprentice _E_
On my way to Iowa. Will be landing in Des Moines in two hours. See ya! _E_
My contract with the American voter will restore honesty accountability & CHANGE to Washington! #DrainTheSwamp __HTTP__ _E_
We have many problems in our house (country!) and we need to fix them before we let visitors come over and stay. MAKE AMERICA GREAT AGAIN! _E_
Success tip: Don't tread water. Get out there and go for it. There's nothing wrong with bringing your talents to the surface. _E_
Seems hard to believe that @Facebook could be worth that much be careful if you invest. And Mark Zuckerberg get a pre nup. _E_
To be completed this yearTrump Int'l Golf Club Dubai will feature a 7205 yard par 71 & double sided driving range __HTTP__ _E_
The US government's foreign debt is at a record $5.29T __HTTP__ China is laughing all the way to the bank. _E_
RT @foxandfriends: Jared Kushner didn't suggest Russian communications channel in meeting source says __HTTP__ _E_
The Blue Monster Golf Course officially opens tomorrow at Trump National Doral with a ribbon cutting ceremony. GREAT COURSE GREAT REVIEWS! _E_
America's hearts & prayers are with the people of #PuertoRico & the #USVI. We will get through this and we will get through this TOGETHER! __HTTP__ _E_
Thank you Readers' Choice: Trump Int'l Hotel Las Vegas has been nominated by 10 Best for Best Pet Friendly Hotel __HTTP__ _E_
Will be heading over shortly to make remarks at The National Prayer Breakfast in Washington. Great religious and political leaders and many friends including T.V. producer Mark Burnett of our wonderful 14 season Apprentice triumph will be there. Looking forward to seeing all! _E_
Government dependency has surged over 23% since @BarackObama has taken office. __HTTP__ He is creating an entitlement culture. _E_
Democrats refusal to give even one vote for massive Tax Cuts is why we need Republican Roy Moore to win in Alabama. We need his vote on stopping crime illegal immigration Border Wall Military Pro Life V.A. Judges 2nd Amendment and more. No to Jones a Pelosi/Schumer Puppet! _E_
The Green Party scam to fill up their coffers by asking for impossible recounts is now being joined by the badly defeated & demoralized Dems _E_
Pictures of my beautiful mother amazing father and family hanging @MontesKitchen in upstate New York. __HTTP__ _E_
Why did @DanaPerino beg me for a tweet (endorsement) when her book was launched? _E_
I was disappointed that Ted Cruz would speak behind my back get caught and then deny it. Well welcome to the wonderful world of politics! _E_
Congrats to @mboyle1 of @BreitbartNews for exposing Jason Linkins of @HuffingtonPost as a lightweight dope who gives false information. _E_
Starting next week and by popular demand (plus good ratings) NBC will broadcast only two hour episodes of Celebrity Apprentice at 9 P.M. _E_
Barack Obama is hard at work today on his highest priority his reelection. @BarackObama has 5 fundraisers in 2 cities. __HTTP__ _E_
The lunatics in Congress banned the word 'lunatic' from Congress last week __HTTP__ Busy doing the peoples' work! _E_
When someone can discourage you you probably aren't determined enough. Be resolute. That's what it takes to get things done. _E_
Why does the liberal media think Bill O'Reilly (@oreillyfactor) is a complete and total vulgarian? I don't think so! _E_
.@alexsalmond @pressjournal RT @djkevritch im proud to be scottish but bonnie scotland will soon be a thing of the past w/ these windmills _E_
Join me live for the #SOTU __HTTP__ _E_
Big win today in the House for GOP Tax Cuts and Reform 227 205. Zero Dems they want to raise taxes much higher but not for our military! _E_
Fake News CNN made a vicious and purposeful mistake yesterday. They were caught red handed just like lonely Brian Ross at ABC News (who should be immediately fired for his "mistake"). Watch to see if @CNN fires those responsible or was it just gross incompetence? _E_
SSE slashes offshore wind investment—wants British government to pay for its losses on these monstrosities __HTTP__ _E_
More waste fraud and abuse over $460M in food stamps went to ineligible households __HTTP__ Where's the accountability? _E_
Under Trump gains against #ISIS have dramatically accelerated __HTTP__ _E_
Now Obama's campaign is guaranteeing 12 million new jobs during a 2nd term __HTTP__ More like $12T in new debt if he wins. _E_
In response to @Lawrence my net worth is substantially more than 7 billion dollars very low debt great as... (cont) __HTTP__ _E_
Young entrepreneurs – be patient and continue to work with determination. With hard work success will follow. Keep your focus! _E_
The reason I put up approximately $50 million for my successful primary campaign is very simple I want to MAKE AMERICA GREAT AGAIN! _E_
I cancelled today's meeting with the failing @nytimes when the terms and conditions of the meeting were changed at the last moment. Not nice _E_
If the ban were announced with a one week notice the bad would rush into our country during that week. A lot of bad dudes out there! _E_
I will be watching the great Governor @Mike_Pence and live tweeting the VP debate tonight starting at 8:30pm est! Enjoy! _E_
Will be on @OreillyFactor tonight at 8:30pm @FoxNews prior to Melania's speech at the #GOPConvention. Tune in she will do great! #RNCinCLE _E_
You can have the best product in the world but if people don't know about it it's not going to be worth much. The Art of The Deal _E_
I will be campaigning in Indiana all day. Things are looking great and the support of Bobby Knight has been so amazing. Today will be fun! _E_
I am on David Letterman tonight. _E_
Will be interviewed tonight on @seanhannity at 10:00. There is so much to talk about! _E_
Congratulations to Georgina Bloomberg on winning the inaugural Central Park Grand Prix CSI 3* @MikeBloomberg _E_
#HappyNewYearAmerica! __HTTP__ _E_
Crooked Hillary Clinton looks presidential? I don't think so! Four more years of Obama and our country will never come back. ISIS LAUGHS! _E_
...Whether you are a Republican or Democrat we should hope that Pres. @BarackObama does a great job for the country. _E_
When the American People speak ALL OF US should listen. Just over one year ago you spoke loud and clear. On November 8 2016 you voted to MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
As the world watches we are days away from passing HISTORIC TAX CUTS for American families and businesses. It will be the BIGGEST TAX CUT and TAX REFORM in the HISTORY of our country! __HTTP__ _E_
Great video of tonights crowd reacting to my latest proposal in SC. #Trump2016 __HTTP__ __HTTP__ _E_
Joe McQuaid (@deucecrew) of the dying Union Leader wanted ads lunches donations speeches from me and tweets very unethical. _E_
Nice to see Obama released a situation room photo from Sandy. How about releasing the photo taken during Benghazi? _E_
.@BarackObama economic gloom: jobless claims have surged __HTTP__ while factory activity is (cont) __HTTP__ _E_
Don't believe the biased and phony media quoting people who work for my campaign. The only quote that matters is a quote from me! _E_
Why did failing A.G. Eric Schneiderman after years of looking file his pathetic lawsuit on a SATURDAY afternoon (unheard of)? No case! _E_
We don't need another stimulus. The first one was a complete failure. Why repeat the same mistake? _E_
Thx to all the people who called to say they are cutting their @Macys credit card as a protest against illegal immigrants pouring into US _E_
#TrumpVine Weiner is a joke.... __HTTP__ _E_
The ObamaCare disaster is in full swing. Websites are down people can't sign up and elderly can't understand the lingo. _E_
Texas & Louisiana: We are w/ you today we are w/ you tomorrow & we will be w/ you EVERY SINGLE DAY AFTER to restore recover & REBUILD! __HTTP__ _E_
100% fabricated and made up charges pushed strongly by the media and the Clinton Campaign may poison the minds of the American Voter. FIX! _E_
Get ready for the Apprentice tonight TWO AMAZING EPISODES. I will be live tweeting! _E_
My @CNBCClosingBell interview discussing QE3 the housing market my stock picks and the 2012 election __HTTP__ _E_
Via @financialpost: "Climate changing for global warming journalists" by Lawrence Solomon (cont) __HTTP__ _E_
Watch the clip from @Late_Show where the crowd cheers after I explain that my offer is about transparency __HTTP__ _E_
I am greatly honored by the results of the CNN poll in Iowa. In the end I believe the final results will be even better than that! _E_
Watching @CNN and consider @secupp to be one of the least talented people on television. Boring and biased! _E_
Make our borders strong and stop illegal immigration. Even President Obama agrees __HTTP__ _E_
How does failed writer and pundit like @stephenfhayes with no success and little talent get away with criticizing candidates. _E_
While the next season of @CelebApprentice is packed w/ All Stars ours fans will be happy to see @Joan_Rivers in the board room.She is back! _E_
Can you believe the Republicans are studying the Democrats on how to win an election? _E_
Debate showed these guys really hate each other. At one point it looked like they would come to blows. _E_
If Obama was smart he would cancel the Muslim Brotherhood's WH visit later this month. He won't. _E_
Isn't it amazing that @Macys paid a massive fine for profiling African Americans & then criticized me for discussing illegal immigration! _E_
Jeb Bush is desperate strongly in favor of #CommonCore and very weak on illegal immigration. _E_
Negotiation is persuasion more than power. Negotiation includes a lot of fine lines and that's what makes it an art. _E_
Was @foxandfriends just named the most influential show in news? You deserve it three great people! The many Fake News Hate Shows should study your formula for success! _E_
With one Yes vote in hospital & very positive signs from Alaska and two others (McCain is out) we have the HCare Vote but not for Friday! _E_
Trump approval rebounds to 45% surges among Hispanics union homes men __HTTP__ _E_
Obama is about to destroy the mililtary through the sequester. The Middle East is a mess. Yet Colin Powell still endorses him. Wonder why? _E_
AMERICA FIRST! _E_
Must read piece by @DanielPipes: "Obama's Diplomatic Acrobatics" __HTTP__ _E_
Wow the Republican Convention went so smoothly compared to the Dems total mess. But fear not the dishonest media will find a good spinnnn! _E_
Despite the ever increasing Ebola disaster Obama refuses to stop flights from West Africa.It's almost like he's saying F you to U.S. public _E_
Ivanka caught up with Bret and Holly backstage. Both Bret and Holly were champions all the way. __HTTP__ _E_
WH claims it lied about Pres. Obama living with his uncle b/c "wasn't mentioned in his book." I guess Bill Ayers never knew about it! _E_
Welcome to the @WhiteHouse Amir Sabah al Ahmed al Jaber al Sabah of Kuwait! Joint press conference coming up soon: __HTTP__ __HTTP__ _E_
"WHAT HAPPENED""How Team Hillary played the press for fools on Russia" __HTTP__ WE KNOW! __HTTP__ _E_
Congratulations to @TrumpNewYork and @TrumpToronto for the @WSJ coverage on perks in luxury hotels: __HTTP__ _E_
Wow! This might be my highest # yet! Thank you to my opposition you are totally ineffective & have been for years! __HTTP__ _E_
.@Macys was one of the worst performing stocks on the S&P last year plunging 46%. Very disloyal company. Another win for Trump! Boycott. _E_
find the leakers within the FBI itself. Classified information is being given to media that could have a devastating effect on U.S. FIND NOW _E_
Army training slide lists Hillary Clinton as insider threat: __HTTP__ _E_
New Zogby poll— highly respected— but the media won't report it because it gives me an even bigger lead! __HTTP__ _E_
The American US Airways merger will create even worse service and much higher fares. _E_
Just like I have been able to spend far less money than others on the campaign and finish #1 so too should our country. We can be great! _E_
If the decision by the grand jury in Ferguson was the exact opposite you would still be having the riots right now! _E_
Whatever happened to Obama's 'independent investigation' into national security leaks from his administration? Where's the media? _E_
In '09 Obama released the ISIS chief. The terrorist gloated "I'll see you in New York" __HTTP__ Historic nat'l sec. error _E_
.@bwilliams wouldn't you love to have my ratings? _E_
A poll of the Miami Dade was conclusively in favor of gambling in Miami. @willweatherford @FLGovScott __HTTP__ _E_
An honor to be endorsed by the New England Police Benevolent Association. Thank you! __HTTP__ __HTTP__ _E_
My Scotland course is receiving accolades from all over the world a great honor for me. _E_
I was so happy when I heard that @Politico one of the most dishonest political outlets is losing a fortune. Pure scum! _E_
My wonderful son Eric will no longer be allowed to raise money for children with cancer because of a possible conflict of interest with... _E_
Do you think the 14 African nations that are banning West Africans from coming into their nations are racist? _E_
What a great evening we had. So interesting that Sanders beat Crooked Hillary. The dysfunctional system is totally rigged against him! _E_
New York Republican leader @EdwardFCox is pushing my friend @RobAstorino into political suicide. Results won't be pleasant! _E_
It's Thursday and only 26 days until the election. How many illegal donations from China and Saudi Arabia did Obama collect today? _E_
People like doing deals with me because they know it will be profitable that I work quickly and that they will be treated fairly. _E_
From my family to yours...I want to wish you all a very merry Christmas! _E_
The habitual vacationer @BarackObama has sacrificed so much. He is delaying his 17 day Hawaii vacation a couple of hours. _E_
Thank you for the warm welcome to Brussels Belgium this afternoon! __HTTP__ _E_
.#CelebrityApprentice Two hour live show on Monday night will determine who will become the winner of Celebrity Apprentice.Full cast returns _E_
Does President Obama ever discuss the sneak attack on Pearl Harbor while he's in Japan? Thousands of American lives lost. #MDW _E_
With 46 stories and 391 beautiful rooms @TrumpSoHo offers a wide array of AAA Five Diamond luxury options __HTTP__ _E_
Tax experts throughout the media agree that no sane person would give their tax returns during an audit. After the audit no problem! _E_
.@BetteMidler talks about my hair but I'm not allowed to talk about her ugly face or body so I won't. Is this a double standard? _E_
New @OANN national poll released. Thank you America! #Trump2016 __HTTP__ _E_
I'll be in London on Sunday at the ExCel Centre to talk about success. It will be a great time for everyone! __HTTP__ _E_
"One reason many people do not do well in business is because they do not do well with people." – Midas Touch _E_
Wow @CNN has nothing but my opponents on their shows. Really one sided and unfair reporting. Maybe I shouldn't do their town hall tonight! _E_
Marco Rubio couldn't even respond properly to President Obama's State of the Union Speech without pouring sweat & chugging water. He choked! _E_
Trump National Golf Club Bedminster New Jersey has courses designed by Tom Fazio & 16 acres of practice facilities. __HTTP__ _E_
The evening news broadcasts must stop talking about weather—boring and too many other topics. _E_
I will be on Piers Morgan Live tonight at 9 p.m. on CNN. Tune in! _E_
Amazing @VanityFair survived one more day without folding. The clock is ticking... _E_
Today's third stop Londonderry New Hampshire! Thank you!#FITN #VoteTrumpNH __HTTP__ _E_
Romance or Adventure what do you prefer? #CelebApprentice _E_
Looking forward to tonight's Ayrshire Chamber of Commerce Annual Dinner 2015 @AyrshireChamber _E_
Heading to Scotland to check out Turnberry & Trump Int'l Golf Links Scotland. Then heading to Dubai @DamacOfficial a great company. _E_
If you don't do your part don't blame God. Billy Sunday _E_
Glad to hear @BarackObama's attack ad featuring my plane is playing in North Carolina. Free ad time for Trump National in Charlotte! _E_
The opening of Trump Turnberry in Scotland was a big success. Good timing I was here for BREXIT. Very exciting news conference today! _E_
Speaking to great patriots @MCC_CT. My first visit to Granite State since declaring my candidacy! #FITN __HTTP__ _E_
The Ultimate Merger: __HTTP__ _E_
"The entrepreneur's ability to dream to win lose and win again and again is often called the entrepreneurial spirit." – Midas Touch _E_
No matter what Bill Clinton says and no matter how well he says it the phony media will exclaim it to be incredible. Highly overrated! _E_
Via @Newsmax_Media by @wandacarruthers: Trump: 'Inconceivable' Obama didn't know about ISIS threat __HTTP__ _E_
Get out and VOTE tomorrow! We will MAKE AMERICA GREAT AGAIN! #CTPrimary #DEPrimary #MDPrimary #PAPrimary #RIPrimary __HTTP__ _E_
Why isn't anyone using the @CNN Iowa Poll with me having a big lead. They only want to use the one negative poll (2nd place).Dishonest press _E_
.@bobvanderplaats is a total phony and dishonest guy. Asked me for expensive hotel rooms free (and more). I said pay and he endorsed Cruz! _E_
Thank you Council Bluffs Iowa! Will be back soon. Remember everything you need to know about Hillary just... __HTTP__ _E_
Via @WSJ: "The ObamaCare Awakening: Americans are losing their coverage by political design." __HTTP__ _E_
I was interviewed by Greta Van Susteren today here at Trump Tower. Tune in tonight on Fox News at 10 p.m.... (cont) __HTTP__ _E_
Interview w/Melanie Batley via Newsmax __HTTP__ _E_
We should have taken the oil in Iraq and now our mortal enemies have got it and with no opposition. Really dumb U.S. pols! I'm so angry! _E_
Can you believe they are blaming @MittRomney for Egypt. _E_
With all of the illegal acts that took place in the Clinton campaign & Obama Administration there was never a special counsel appointed! _E_
'BuzzFeed Runs Unverifiable Trump Russia Claims' #FakeNews __HTTP__ _E_
Looking forward to meeting with @SenBobCorker in a little while. We will be traveling to North Carolina together today. _E_
.@JudgeJeanine Tonight at 9 P.M. on @FoxNews ENJOY! _E_
For America to be strong again the ways of politicians must be put in the past. Let's Make America Great Again! __HTTP__ _E_
Filming of the record 13th season of @CelebApprentice has started. Be sure to be on the lookout for future updates. _E_
2012 is the most important election of my lifetime. @BarackObama must be defeated. _E_
We allow Japan to sell us millions of cars with zero import tax and we can't make a trade deal with them our country is in big trouble! _E_
My new club on the Atlantic Ocean in Ireland will soon be one of the best in the World and no one will be looking into ugly wind turbines! _E_
Movie producer Harvey Weinstein who lost his company to Colony Capital is against guns but makes movies w/ major gun violence really! _E_
BREAKING Border security rally in Phoenix AZ at 2PM MST has been moved to @PhoenixConvCtr! Build a wall! Let's Make America Great Again! _E_
The Democrat Governor.of Minnesota said The Affordable Care Act (ObamaCare) is no longer affordable! And it is lousy healthcare. _E_
Is the Boston killer eligible for Obama Care to bring him back to health? _E_
President Xi of China has stated that he is upping the sanctions against #NoKo. Said he wants them to denuclearize. Progress is being made. _E_
FBI Director Comey was the best thing that ever happened to Hillary Clinton in that he gave her a free pass for many bad deeds! The phony... _E_
.@ScottWalker despite your coming to my office to give me an award your very dumb fundraiser hit me very hard not smart! _E_
Ft. Hood Jihadi Nidal Hassan has been paid over $300g in Army salary while on trial. His victims are deprived of any benefits... _E_
Sexting Pervert @anthonyweiner has returned to twitter. Parents of all underage girls should BLOCK him immediately! _E_
Obama is not a leader he's just a campaigner! _E_
Great job by MichaelCaputo on @foxandfriends. _E_
Call it any way you like but Snowden is a traitor. When our country was great do you know what we did to traitors? _E_
Memorial Day is a time to honor our nation's finest who made the ultimate sacrifice for our freedom. God bless them all. _E_
President Obama missed the deadline! _E_
The 250 million dollar construction of Trump Nationsl Doral is coming along great. Just left Miami where I toured entire project.AMAZING! _E_
Just read that Trump has the largest (and I add most enthusiastic) crowds. Tonight I will be in New Hampshire the place will be packed! _E_
Refloating the Costa Concordia for many hundreds of millions of $'s is ridiculous. Should have taken it apart in small pieces save fortune _E_
Our $16T national debt is now bigger than our $15T GDP. If Obama is re elected watch for an economic meltdown in 2013. _E_
We have done a great job with the almost impossible situation in Puerto Rico. Outside of the Fake News or politically motivated ingrates... _E_
Hillary Clinton answered email questions differently last night than she has in the past. She is totally confused. Unfit to serve as #POTUS. _E_
Here are Hillary Clinton's accomplishments at the State Department.#Debates2016 #RattledHillary __HTTP__ _E_
I will be in Palm Beach Jupiter and Miami today checking on big construction projects. I love Florida and love on time and on budget const _E_
I'll be on @seanhannity tonight at 10 PM and look forward to it. Lots to discuss! Enjoy. _E_
Now America knows the Emperor has no clothes. Why would Obama do better in a 2nd debate? #Debate #Obama _E_
... I never felt that I could let up for a moment. Harry S. Truman _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
I just want to know how much is Saudi Arabia and others who we are helping willing to pay for our saving from total extinction. Pay up now! _E_
Lightweight Attorney General Eric Schneiderman will be next to lose. He goes after a school with a 98% approval ratingleaves biggies alone _E_
Great new campaign ad just released by @MittRomney __HTTP__ _E_
Great evening in San Jose other than the thugs. My supporters are far tougher if they want to be but fortunately they are not hostile. _E_
Dummies left Iraq without the oil not believable! _E_
Capitalism doesn't guarantee success only a chance to succeed. The community organizer @BarackObama doesn't (cont) __HTTP__ _E_
ISIS is operating a training camp 8 miles outside our Southern border __HTTP__ We need a wall. Deduct costs from Mexico! _E_
#NeverTrump is never more. They were crushed last night in Cleveland at Rules Committee by a vote of 87 12. MAKE AMERICA GREAT AGAIN! _E_
If Snowden was such a hero then he would be in America. He is escaping justice! _E_
Looks like plane may have been found in the Indian Ocean off the coast of Australia. _E_
I just had a great victory against lightweight A.G. Eric Schneiderman. Most of his case re Trump U. was thrown out or gutted. Little remains _E_
Remember when the failing @nytimes apologized to its subscribers right after the election because their coverage was so wrong. Now worse! _E_
The Republicans will get zero credit for passing immigration reform—and I said zero! _E_
I am not just running against Crooked Hillary Clinton I am running against the very dishonest and totally biased media but I will win! _E_
Watch commodity prices soar because of the freezing cold. Will be bad for the economy. We could use some global warming. _E_
Wow my campaign is hearing from more and more Bernie supporters that they will NEVER support Crooked Hillary. She sold them out V.P. pick! _E_
In Austin Texas with some of our amazing Border Patrol Agents. I will not let them down! __HTTP__ __HTTP__ _E_
LIVE FACT CHECK: Trump's RIGHT. The Clinton Foundation has taken MILLIONS from the Middle East. #DrainTheSwamp __HTTP__ _E_
I will be on @CNNSitRoom with @wolfblitzer from 5 7pm est. on @CNN. _E_
Thank you Piers. __HTTP__ _E_
Via @australian: Trump empire planning to build a presence in Sydney __HTTP__ _E_
Entrepreneurs: Practice positive thinking with a lot of reality checks. Know that goals come with obstacles. _E_
Many people are saying that my challenge to Obama is having a huge negative effect on his poll numbers I agree. _E_
National Pearl Harbor Remembrance Day "A day that will live in infamy!" December 7 1941 _E_
My heart & prayers go out to all of the victims of the terrible #Brussels tragedy. This madness must be stopped and I will stop it. _E_
Spoke at the Congressional @GOP Retreat in Philadelphia PA. this afternoon w/ @VP @SenMajLeader @SpeakerRyan. Th... __HTTP__ _E_
I think having Jeb's endorsement hurts Lyin' Ted. Jeb spent more than $150000000 and got nothing. I spent a fraction of that and am first! _E_
.@JebBush just took millions of $'s in special interest money to look like a tough guy. Will never work! _E_
Via @DMRegister by @JoelAschbrenner: Trump to speak at @LandExpo in West Des Moines __HTTP__ _E_
Wind farms are killing many thousands of birds. They make hunters look like nice people! _E_
#CelebrityApprentice contestant @LouFerrigno stopped by to visit today __HTTP__ _E_
Another company that the DOE has given money to just filed for bankruptcy. This is how the money we borrow at 40% from China is wasted. _E_
Who do you want negotiating for us? __HTTP__ _E_
The American gymnastic team was great our country should take their lead. _E_
Obama lied to the public about the Al Qaeda attack on our consulate in Libya. He should be held accountable. _E_
Why does @mcuban continue to embarrass the 31 35 & 11TH place @dallasmavs with childish behavior? Really unprofessional! _E_
.@RepTomMarino Great job on television this morning. Glad to have you on my side! _E_
John Kerry is openly celebrating the tenuous nuclear deal with Iran. Great dealmakers do not celebrate dealsthey just go on to the next one _E_
Will be interviewed on @FaceTheNation with @JDickerson tomorrow at 10:30am EST. Enjoy! _E_
ICYMI @IvankaTrump's int. on @TODAYshow discussing @Joan_Rivers & contestant rivalries on @ApprenticeNBC __HTTP__ _E_
Join my team over on my Facebook page live now! #Debates __HTTP__ __HTTP__ _E_
Happy Veterans Day to ALL in particular to the haters and losers who have no idea how lucky they are!!! _E_
Trump National Golf Club Washington D.C. is situated on 600 acres overlooking the Potomac River. Beautiful! __HTTP__ ... _E_
How much longer are we expected to put up with the world's most incompetent leader ObamaCare Iran Syria bads deals. JUST NEVER ENDS _E_
RT @DanScavino: .@realDonaldTrump stops by overflow room in Mechanicsburg Pennsylvania prior to main rally. #TrumpMovement #MAGA __HTTP__ _E_
Israel is being barraged by rockets from Gaza recently. They must respond accordingly in defense of their citizens. _E_
We are going to defend our industry & create a level playing field for the American worker. It is time to put... __HTTP__ _E_
Oil is rising back over $100 barrel. OPEC loves to rip us off. Why shouldn't they they always get away with it. _E_
.@AROD is back on the DL. The coming suspension will be announced soon by @MLB. _E_
I've realized that success requires 100% effort and 100% focus. Nothing less. _E_
There is nothing nice about searching for terrorists before they can enter our country. This was a big part of my campaign. Study the world! _E_
One of the best produced including the incredible stage & set in the history of conventions. Great unity! Big T.V. ratings! @KarlRove _E_
Noisy windfarm driving community crazy! __HTTP__ @AlexSalmond @AberdeenCC @AberdeenshireCC _E_
Discipline is a key ingredient for success. It will build character motivation and bring opportunity. _E_
The new NBC POLL has me in first place but said I was third in the debate I demand a recount (just kidding!). EVERY other poll had me #1. _E_
Dow dives more than 500 points down 9% from high. Be careful! _E_
The CBO has predicted that unemployment will rise to 8.8% this next year. __HTTP__ This is @BarackObama's economic recovery. _E_
What is our country coming to when a judge can halt a Homeland Security travel ban and anyone even with bad intentions can come into U.S.? _E_
30 million Americans are unemployed yet Obama has set up workshops across the country for illegals to get Amnesty __HTTP__ _E_
The @ForbesInspector & @AAAnews 5 star restaurant @TrumpNewYork's @Jean_GeorgesNYC is NYC's top destination __HTTP__ _E_
RT @EricTrump: Tune into @GMA right now to catch a great interview with my father & the entire family! #VoteTrumpPence16 __HTTP__ _E_
Hillary Clinton spokesperson admitted that their was no ISIS video of me. Therefore Hillary LIED at the debate last night. SAD! _E_
Our economy is struggling and OPEC continues to rip us off. Output is low and the price is too high. They ar... (cont) __HTTP__ _E_
Marco Rubio is being crucified by the media for drinking water during speech! _E_
"If it's worth doing it's worth fighting for. You'll have lots of people and obstacles in your way. Work & fight to get beyond them. _E_
Residential Capital a company in which Warren Buffett is involved went bankrupt but that doesn't mean that Warren Buffett went bankrupt! _E_
Champions aren't made in the gyms. Champions are made from something they have deep inside them a desire a dream a vision. Muhammad Ali _E_
Thank you Sparks Nevada!#VoteTrumpNV #NevadaCaucus Finder: __HTTP__ __HTTP__ _E_
Heading over to the @UN to meet with Ambassador @NikkiHaley and all of her great representatives! #USA _E_
Do not underestimate yourself and know you are able to handle what comes your way by increasing your leverage. _E_
#CelebrityApprentice @arsenioofficial "trying to be invisible"? No way that's going to happen. #sweepstweet _E_
Thank you Nashua New Hampshire! #MakeAmericaGreatAgain #Trump2016 #NHPolitics #FITN __HTTP__ __HTTP__ _E_
Via @Suntimes: Trump wins at trial calls woman suing him 'horrible human being' __HTTP__ _E_
When I said in an interview that Putin is not going into Ukraine you can mark it down I am saying if I am President. Already in Crimea! _E_
I gave a woman named Barbara Res a top N.Y. construction job when that was unheard of and now she is nasty. So much for a nice thank you! _E_
Obama's 2014 budget "eyes $1 trillion hike in tax revenue" __HTTP__ He loves taxes. T E A. Taxed Enough Already. _E_
We have to bring back and cherish the middle class once the backbone and true strength of the U.S.A. It can happen! _E_
A regular part of your day should be devoted to expanding your horizons. Learning is a new beginning. _E_
I had a fantastic time with @jacknicklaus at the grand opening of the great @TrumpFerryPoint. Watch the video __HTTP__ _E_
Will Smith did a great job by smacking the guy reporter who kissed him on the lips at a red carpet event. (cont) __HTTP__ _E_
Sorry losers and haters but my I.Q. is one of the highest and you all know it! Please don't feel so stupid or insecureit's not your fault _E_
BIG @MittRomney is preferred to handle the economy over @BarackObama by 63% 29% in a @gallupnews poll __HTTP__ _E_
RT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_
Great meeting with @HouseGOP and @SenateGOP leaders including @SpeakerRyan @SenateMajLdr @GOPLeader @JohnCornyn... __HTTP__ _E_
"Tomorrow hopes we have learned something from yesterday." John Wayne _E_
China is getting minerals from Afghanistan __HTTP__ We are getting our troops killed by the Afghani govt't. Time to get out. _E_
I will be doing #GDNY Good Day N.Y. with Rosanna &Greg live at 8.30 A.M. I will be giving money to a great guy who lost his son in the WTC. _E_
Captain Khan killed 12 years ago was a hero but this is about RADICAL ISLAMIC TERROR and the weakness of our leaders to eradicate it! _E_
Crooked Hillary Clinton got Brexit wrong. I said LEAVE will win. She has no sense of markets and such bad judgement. Only a question of time _E_
Ralph Northam will allow crime to be rampant in Virginia. He's weak on crime weak on our GREAT VETS Anti Second Amendment.... _E_
Peaceful protests are a hallmark of our democracy. Even if I don't always agree I recognize the rights of people to express their views. _E_
Definitely watch @Carl_C_Icahn 's 'Danger Ahead'. Very insightful particularly on how corp inversions hurt America: __HTTP__ _E_
See story in Fusion and Huff. Post about rape at the border. Beyond terrible! Isn't Fusion owned by Univision? _E_
Trump Tower Punta Del Este features the Trump Organization's signature superior quality detail & perfection __HTTP__ _E_
I will not be commenting on boardroom specifics would be unfair to the different time zones. #CelebApprentice _E_
Attitude is a little thing that makes a big difference. Winston Churchill _E_
Nasty for the middle class electricity prices surged to an all time high this past March __HTTP__ FRACK NOW _E_
Once John Kasich announced he was running for president and opened his mouth people realized he was a complete & total dud! _E_
It is now commonly agreed after many months of COSTLY looking that there was NO collusion between Russia and Trump. Was collusion with HC! _E_
.@NYMag is a piece of garbage but I think it is very nice & charitable that they employ the no talent illiterate hack @jonathanchait. _E_
Governor Cuomo is right about one thing Attorney General Eric Schneiderman does wear eyeliner! What the hell is up with him? _E_
Heading to Joint Base Andrews on #MarineOne with Prime Minister Shinzō earlier today. __HTTP__ _E_
RT @RealJamesWoods: Only now with a #RealPresident do we see the scope of destruction engineered by #Obama and the #Democrat cabal. @realD... _E_
Will be on Meet the Press with @ChuckTodd tomorrow morning. Enjoy! _E_
Why aren't people looking at this reporters earliest statement as to what happened that is before she found out the episode was on tape? _E_
Thank you to all of my Twitter followers for helping to defeat Weiner and Spitzer. Remember in the beginning they said it couldn't be done! _E_
Congratulations to Linda McMahon on her victory in the Connecticut Senate primary. She is an amazing woman smart as you get! @Linda_McMahon _E_
President Obama is the best thing that ever happened to Jimmy Carter! _E_
My Administration Governor @RicardoRossello and many others are working together to help the people of Puerto Rico in every way... _E_
RT @seanspicer: .@timkaine wants to tough on crime fails to talk about defending rapists and murders #VPDebate _E_
Via @chicagotribune by @bob_writes: "@TrumpChicago tower unit sets resale record at $3.99M" __HTTP__ _E_
Congratulations to Boston on the @RedSox World Series victory. Earned and deserved. _E_
Mar a Lago my club in Palm Beach and one of the greatest mansions ever built has been nominated as one of (cont) __HTTP__ _E_
One of the most expensive projects ever in Miami @TrumpDoral's $200M of renovations are right on schedule. When completed will be elite! _E_
I will be interviewed by Chris Wallace on Fox tomorrow morning. Tune in! _E_
.@MiamiHerald discusses our @TrumpCollection #TrumpPets program @TrumpDoral: __HTTP__ _E_
Columbia University stated there was a computer error in their system concerning @BarackObama's attendance. (cont) __HTTP__ _E_
Perhaps a new meeting will be set up with the @nytimes. In the meantime they continue to cover me inaccurately and with a nasty tone! _E_
Under President Obama do you think America will become a THIRD WORLD COUNTRY? _E_
In 2010 alone our trade deficit with China cost over 566000 jobs __HTTP__ This is unsustainable for the American worker. _E_
The government will spend over $3.8T this year. The sequester is a pittance of the outlays less than 2%. Where's the problem? _E_
Thank you @NFIB together we will #MakeAmericaGreatAgain! __HTTP__ _E_
Our debt is about to reach $17T. Iraq has $20T in oil reserves. Interesting. _E_
My @FoxNews interview @seanhannity discussing Obama's failed presidency Ebola DC Post Office midterms & 2016 __HTTP__ _E_
Everyone is laughing at the @nytimes for the lame hit piece they did on me and women.I gave them many names of women I helped refused to use _E_
Both candidates are looking sharp now it's up to the mouth and the mind. #VPDebate _E_
We had all the leverage in our nuclear negotiations with Iran and our leaders foolishly decided to let them out of the trap. WHY? _E_
Loser terrorists must be dealt with in a much tougher manner.The internet is their main recruitment tool which we must cut off & use better! _E_
Keep testing your limits. Never become complacent. Always think big! _E_
The $1B failed website is the tip of the iceberg on the ObamaCare. Over 90 million estimated will lose their plans next year. _E_
ICYMI @IvankaTrump's @waytooearly int. w/@ThomasARoberts on @ApprenticeNBC's firingsTrump Int'l DC & @MissUniverse __HTTP__ _E_
...confidence that President Al Sisi will handle situation properly. _E_
An honor to join the @FaithandFreedom Coalition yesterday. In America we don't worship government. We worship God.... __HTTP__ _E_
Good move by @MSNBC in downgrading @WeGotEd to a dead weekend spot. This is truly a guy who shouldn't be on tv. _E_
A lot changed when David Letterman said he was probably born in this country the word probably is a total disaster for Obama. _E_
.@mike_pence is doing a great job so far no contest! _E_
Seal the deal! Hold your business meeting at the luxurious @TrumpNewYork Executive Board Room __HTTP__ _E_
Canada's PM was in China last week brokering a deal to sell the oil @BarackObama rejected in Keystone. __HTTP__ Unbelievable! _E_
President Andrew Jackson who died 16 years before the Civil War started saw it coming and was angry. Would never have let it happen! _E_
Donald Trump tops Franklin Pierce/Herald poll at 28 percent in N.H. __HTTP__ _E_
What do you think so far? #CelebApprentice _E_
Today is National Prescription Drug Take Back Day. Everyone can help fight the #OpioidEpidemic by participating! __HTTP__ __HTTP__ _E_
Senator Landrieu If you are a Senator representing Louisiana then you SHOULD own a home in the state. Send @BillCassidy to the Senate! _E_
What a foolish move by @davidaxelrod to speak in Boston yesterday! Completely outmaneuvered by the @MittRomney campaign. _E_
Mexican leadership has been laughing at us for many years but now it's no longer laughter—it's disbelief... _E_
"Runaway Obamacare Spending Will Cost Democrats" __HTTP__ via @BloombergView by @lanheechen _E_
How much money is the extremely unattractive (both inside and out) Arianna Huffington paying her poor ex hubby for the use of his name? _E_
I LIVE IN NEW JERSEY & @realDonaldTrump IS RIGHT: MUSLIMS DID CELEBRATE ON 9/11 HERE! WE SAW IT! __HTTP__ _E_
It's Thursday. How much time did Washington waste today trying to find a solution on the so called fiscal cliff? _E_
WE WILL ONLY BE THE LAND OF THE FREE AS LONG AS WE ARE HOME OF THE BRAVE! _E_
I will be on Fox & Friends tomorrow morning at 7. Will be discussing basic stupidity and incompetence of which our leaders have plenty! _E_
Get on Trump's List email from the RNC was not authorized. I am self funding my campaign! Do not pay. Email: __HTTP__ _E_
Via @PeoplesCompany: Real Estate Magnate Donald J. Trump to Headline 2015 @LandExpo in West Des Moines Iowa __HTTP__ _E_
Nobody could have done what I've done for #PuertoRico with so little appreciation. So much work! __HTTP__ _E_
I wonder if @megynkelly and her flunkies have written their scripts yet about my debate performance tonight. No matter how well I do bad! _E_
My honor thank you. __HTTP__ _E_
Jeb Bush just got contact lenses and got rid of the glasses. He wants to look cool but it's far too late. 1% in Nevada! _E_
RT @ColumbiaBugle: @realDonaldTrump @FLOTUS President Trump greeting families affected by Hurricane Harvey. #TexasStrong __HTTP__ _E_
.@DarrellIssa is a very good man. Help him win his congressional seat in California. _E_
Don't believe the manipulated job numbers. Walmart has just cut orders with suppliers because of rising inventory. _E_
Product placement is a definite prerequisite. #sweepstweet _E_
I love the Mexican people but Mexico is not our friend. They're killing us at the border and they're killing us on jobs and trade. FIGHT! _E_
The animal who beheaded the woman in Oklahoma should be given a very fast trial and then the death penalty. The same fate beheading? _E_
...Americans do what we do best: we pull together. We join hands. We lock arms and through the tears and the sadness we stand strong... __HTTP__ _E_
.@MonicaCrowley you were GREAT on @seanhannity tonight. Thank you for the nice words! _E_
Join me in Florida on Wednesday! Daytona & Jacksonville:Daytona \ 3pm __HTTP__ | 7pm __HTTP__ _E_
Thank you Cadillac Michigan! #VoteTrumpMI on 3/8/2016. We will MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_
"The best luck of all is the luck you make for yourself." – General Douglas MacArthur _E_
Hillary Clinton's short speech is pandering to the worst instincts in our society. She should be ashamed of herself! _E_
Just completed purchase of magnificent Ritz Carlton in Jupiter Florida. Will be renamed Trump National Golf Club & be tremendous success. _E_
Gabriel Aubry should learn how to fight—he became a punching bag. Always drama with Halle B! _E_
Inside 'Bill Clinton Inc.': Hacked memo reveals intersection of charity and personal income. #DrainTheSwamp! __HTTP__ _E_
On immigration I'm consulting with our immigration officers& our wage earners. Hillary Clinton is consulting with Wall Street. _E_
While Jon Stewart is a joke not very bright and totally overrated some losers and haters will miss him & his dumb clown humor. Too bad! _E_
.@tomhanks was fabulous in @LuckyGuyPlay last night—as was the entire cast. _E_
The U.S. has enough problems without publicity seekers going out and openly mocking religion in order to provoke attacks and death. BE SMART _E_
The failing @nytimes has gone nuts that Crooked Hillary is doing so badly. They are willing to say anything has become a laughingstock rag! _E_
"Everyone's dream can come true if you just stick to it and work hard." @serenawilliams _E_
150 Clinton E mails still contain classified information. More sensitive when she was Sec.of State. This is a very big deal. _E_
THANK YOU Atlanta Georgia! Leaving for Nevada now. Lets MAKE AMERICA SAFE AND GREAT AGAIN! __HTTP__ __HTTP__ _E_
Join me in Pensacola Florida this Friday at 7pm! #VoteTrump __HTTP__ __HTTP__ _E_
Why is oil at a record high? OPEC & the oil speculators continue to rip us off. _E_
The train accident that just occurred in DuPont WA shows more than ever why our soon to be submitted infrastructure plan must be approved quickly. Seven trillion dollars spent in the Middle East while our roads bridges tunnels railways (and more) crumble! Not for long! _E_
Monday morning 7:30 AM I'll be on @foxandfriends. Tune in! _E_
Mike & Mike in one minute! _E_
Did the poor but smart to leave ex husband of @ariannahuff get any of the dollars she got for the use of his name in really stupid AOL deal? _E_
My @foxandfriends interview re: @IvankaTrump's pregnancy my grandchildren Obama's 18% tax rate & Obamacare __HTTP__ _E_
Good @marcorubio is trying to eliminate the tax on Olympic medals __HTTP__ Our athletes should not be taxed on their wins. _E_
RT @seanhannity: Graph: @RealDonaldTrump's Historic 13 Million Primary Votes Compared To Every GOP Nominee Since 1908 __HTTP__ _E_
Real estate taxes are far too high @BriarcliffManor Westchester. A total joke how they waste money! Replace Mayor Vescio. _E_
70 stories above Panama Bay @TrumpPanama the majestic sail design is Central America's architectural icon __HTTP__ _E_
The Rust Belt was created by politicians like the Clintons who allowed our jobs to be stolen from us by other countries like Mexico. END! _E_
Thank you @TIME readers a great honor! __HTTP__ _E_
Democrat Congresswoman totally fabricated what I said to the wife of a soldier who died in action (and I have proof). Sad! _E_
Saudi Arabia was vehemently against the Iran nuclear deal. Then today they embraced it. What happened? What did we give them to endorse? _E_
Congrats @SixteenChicago's @ChefLents on your Chef of the Year nom in @EaterChicago Annual Eater Awards Vote now! __HTTP__ _E_
Looking forward to speaking at Saturday's @Citizens_United @AFPhq Freedom Summit in Manchester. Second visit to New Hampshire this year. _E_
As I always said the Birthers were after the truth. Thanks to @RealSheriffJoe @BarackObama can't hide anymore. _E_
WOW! __HTTP__ _E_
I'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
Congress must repeal ObamaCare. Obama will veto while Americans continue to lose their doctors & pay rising premiums. _E_
JOIN ME TOMORROW!MINNESOTA 2pm __HTTP__ 6pm __HTTP__ 9:30p... __HTTP__ _E_
FAKE NEWS A TOTAL POLITICAL WITCH HUNT! _E_
My @foxandfriends interview from yesterday __HTTP__ _E_
Remember this @BarackObama told @GStephanopoulos in 09 that it is not true that the individual mandate is a tax __HTTP__ _E_
Thank you America! #Trump2016 __HTTP__ __HTTP__ _E_
The losing team is now back in boardroom. I can't discuss the team members or what's going on or what happens from here on out. _E_
Our deficit spending is China's gain. @BarackObama is bankrupting our country. _E_
Thank you @JCLayfield will get even better as my Administration continues to put #AmericaFirst __HTTP__ _E_
Benghazi. Obama lied. Our people died. _E_
A lovely letter from the daughter of the late great John Wayne. Our country could use a John Wayne right now. __HTTP__ _E_
Bernie Sanders was right when he said that Crooked Hillary Clinton was not qualified to be president because she suffers from BAD judgement! _E_
Congratulations to Jim Herman my ass't golf pro at Trump Nat'l Golf Club/Bedminster NJ for qualifying for the U.S. Open! @usopengolf _E_
Great shots of @TrumpTowerNY #CelebApprentice _E_
Had a record crowd in Boone Iowa. A fantastic day we will #MakeAmericaGreatAgain __HTTP__ _E_
I really like Nelson Mandela but South Africa is a crime ridden mess that is just waiting to explode not a good situation for the people! _E_
Always enjoy appearing on @extratv. @MarioLopezExtra & @mariamenounos were terrific yesterday. _E_
Hillary Clinton has been working on solving the terrorism problem for years. TIME FOR A CHANGE I WILL SOLVE AND FAST! _E_
If you want to be a success you have to get used to frequently hearing the word no and ignoring it. Think Big _E_
Just bought Doral Hotel & Country Club in Miami within two years it will be the best resort in the country. _E_
THe Art of the Deal The best thing you can do is deal from strength and leverage is the biggest strength you have. CUT CAP and BALANCE. _E_
I just got off the phone with the great people of Guam! Thank you for your support! #VoteTrump today! #Trump2016 _E_
Exclusive interview w/ my wife @MELANIATRUMP tomorrow morning @ 8amE on @Morning_Joe w/ @morningmika @MSNBC. Enjoy! __HTTP__ _E_
RT @USCGSoutheast: .@USCG crews worked together with the @RedCross @fema and members of local #police #fire and #government to distribut... _E_
Glad to hear Mariano Rivera is going to make a comeback in 2013. He is a true sportsman and a great competitor. __HTTP__ _E_
.@canoetravel Also the very obsolete ugly and expensive wind turbines will never be build in Aberdeen. No longer works. @GolfChannel _E_
ICYMI @MELANIATRUMP Reading newspapers and see... #BillyGraham95 #happybirthday @BillyGraham __HTTP__ _E_
My team of deplorables will be managing my Twitter account for this evenings debate. Tune in!#DebateNight #TrumpPence16 _E_
Summer's almost here update your business wardrobe with Trump Signature Collection exclusively available @Macys __HTTP__ _E_
Heading to Baton Rouge Louisiana for a speech. Expecting a very large crowd! See you soon. #Trump2016 #MakeAmericaGreatAgain _E_
Hillary Clinton is weak on illegal immigration among many other things. She is strong on corruption corruption is what she's best at! _E_
Thanks to all for the wonderful congratulation sent to me on the birth of Ivanka's little boy so nice! _E_
Lightweight @AGSchneiderman the worst attorney general in the US is in a tough election with John Cahill @CahillForAG _E_
I like Russell Brand but Katy Perry made a big mistake when she married him. Let's see if I'm right I hope not. _E_
You can take the smartest kid at Wharton the one who gets straight A's and has a 170 IQ and if he doesn't (cont) __HTTP__ _E_
Realize that being an entrepreneur is not a group effort. You're in charge everything starts with you. _E_
"Know from the inside out that you have the power to succeed and you will." – Think Like a Champion _E_
Happy Veterans Day. To those who have served thank you for your special work. _E_
Dopey @chicagotribune critic fails to mention the ugly Sun Times sign. _E_
Another attack in London by a loser terrorist.These are sick and demented people who were in the sights of Scotland Yard. Must be proactive! _E_
I guess they don't have freedom of the press in Scotland. We created this ad and the ASA would not allow us to (cont) __HTTP__ _E_
Great interview in @postedtoronto of @DonaldJTrumpJr: He makes me proud. __HTTP__ _E_
"90% of Trump 2017 news coverage was negative" and much of it contrived!@foxandfriends _E_
Doing interview today with Maria Bartiromo at 10:00 A.M. on @FoxNews ENJOY! _E_
Hillary & Obama's Broken Promises. #RepealObamacare __HTTP__ _E_
and yet another ...all of them are spectacular. __HTTP__ _E_
Welcome to the new reality! Moody's just downgraded the entire US health insurance industry because of ObamaCare. _E_
Will be speaking to President Recep Tayyip Erdogan of Turkey this morning about bringing peace to the mess that I inherited in the Middle East. I will get it all done but what a mistake in lives and dollars (6 trillion) to be there in the first place! _E_
Thanks @renee2i for hosting me tomorrow at the Two International Group! Looking forward to making new friends & discussing #FITN topics. _E_
My interview with @gretawire on Fox News for those who missed it 'Obama's Constantly on Vacation' __HTTP__ _E_
My @FoxNews interview w/@seanhannity __HTTP__ _E_
.@SenTedCruz Ted free legal advice on how to pre empt the Dems on citizen issue. Go to court now & seek Declaratory Judgment you will win! _E_
Sugar: @Lord_Sugar Keep working hard so I make plenty of $ with your show... _E_
Obama won't send troops to fight jihadists yet sends them to Liberia to contract Ebola. He is a delusional failure. _E_
.@BlairKamin Blair you may be the worst architectural critic in the business but thanks for your nice reviews about Trump Chicago & sign PR _E_
Trump @DoralResort is hosting the WGC @cadillacchamp from March 6th – 10th. Join me I will be there all four days. _E_
Great purchase in Ireland will be a top spot! __HTTP__ _E_
The three political disasters could lead to a major and complete political meltdown! _E_
Via @postandcourier: The Donald at @TheCitadelOEA __HTTP__ _E_
RT @paultdove: @FoxBusiness Republican Senators who are opposing the President look at the great economic news: Americans Are Noticing! _E_
Enjoyed my visit to Trump Doral in Miami yesterday. Looking forward to returning for the WGC @Cadillac Championship on March 6th 10th... _E_
By not doing the failed poorly rated debate I was able to make the point of not allowing unfairness while raising $6000000 for VETS. _E_
Loved being in Manassas VA last night. Such incredible spirit! Now in DC for a speech will then visit Old Post Office under construction. _E_
You can find your polling locations at: __HTTP__ #FITN #NHPrimary #VoteTrumpNH __HTTP__ _E_
Thanks Go Angelo people are now really aware of my ties shirts and cuff links at Macy's _E_
When your secretary of defense tells you that your proposed cuts will erode America's military capability you (cont) __HTTP__ _E_
My honor. __HTTP__ _E_
Why is @AlexSalmond pursuing environment destroying windmills when @VattenfallGroup quit because of no (cont) __HTTP__ _E_
America's debt crisis is our country's greatest challenge. Spending must be curbed for our long term fiscal future. _E_
Detroit is going through very hard times right now.. If they are smart brighter days are ahead. _E_
OPEC will use yesterday's attacks on our embassies to raise the price of gas. They are always ripping us off. _E_
China is so brazen that they now give us economic advice they tell us what to do much like a strong stockh... (cont) __HTTP__ _E_
Will be on @foxandfriends now. Enjoy! _E_
Jack Welch thinks Sam Palmisano retired CEO of IBM should be the next CEO of MICROSOFT. Interesting! _E_
What did you think of the boardroom? #CelebApprentice _E_
The world is noticing thanks! __HTTP__ _E_
.@NeneLeakes seeks my advice on prenups tonight at 9 PM on Bravo _E_
Tomorrow on the @MissUniverse Facebook page submit your final question for the contestants __HTTP__ _E_
Everybody tells me not to hit back at the lowlifes that go after me for PR sorry but I must. It's my nature. _E_
We stand in absolute solidarity with the people of the United Kingdom. __HTTP__ _E_
"Donald Trump offers political advice to Palm Beach Republicans" __HTTP__ via @SunSentinel _E_
Will be doing a live Thanksgiving Video Teleconference with Members of the Military at 9:00 A.M. Afghanistan Iraq USS Monterey Turkey & Bahrain. Then going to Coast Guard Quarters Florida. _E_
Watch Obama push major global warming legislation early in his second term... _E_
I am interviewed on the @oreillyfactor tonight at 8:00. Then at 10:00 I am interviewed by @donlemon on @CNN. Enjoy! _E_
A 40mph gust of wind wrecked a wind turbine in Scotland __HTTP__ Any turbine in close proximity to a school must go! _E_
What is a bit appealing about this idea of Trump hosting a debate is consider the diverse audience that perh... (cont) __HTTP__ _E_
What did you think of my decision? #CelebApprentice _E_
Convenient David Plouffe collected $100G fee from Iranian affiliate only a month before joining @whitehouse __HTTP__ _E_
Make sure you're registered to vote! Let's #MakeAmericaGreatAgain! We can't afford more years of FAILURE! All info:... __HTTP__ _E_
Get ready for tonight! _E_
Remember when comedian Bill Maher openly praised the disgusting terrorists who destroyed the World Trade Center then got canned by ABC? _E_
I cannot believe that Apple didn't come out with a larger screen IPhone. Samsung is stealing their business. STEVE JOBS IS SPINNING IN GRAVE _E_
Thank you Hilton Head South Carolina! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Is it legal for a sitting President to be wire tapping a race for president prior to an election? Turned down by court earlier. A NEW LOW! _E_
Jury was unanimous after hearing the made up case against my co. Filed many years ago she.and her pathetic lawyer should pay me big damages _E_
The @EricTrumpFdn event featured a performance by #CelebApprentice @JohnRich a great event for a great cause! Watch __HTTP__ _E_
Via @BreitbartNews by @mboyle1: Exclusive: Trump To Address South Carolina Tea Party Convention __HTTP__ _E_
Both Washington D.C. and DALLAS are turning out to be really big events. D.C. is protest of incompetent Iran deal and Dallas is big speech! _E_
A single Ebola carrier infects 2 others at a minimum. STOP THE FLIGHTS! NO VISAS FROM EBOLA STRICKEN COUNTRIES! _E_
I will be on Fox & Friends at 7.00 (20 minutes). Plenty of terrible and tragic news to talk about! Too bad. _E_
Thank you @foxandfriends great show! _E_
I agree @MMFlint To all Americans I see you & I hear you. I am your voice. Vote to #DrainTheSwamp w/ me on 11/8. __HTTP__ _E_
RT @TXMilitary: #PhotosFromTheField: Aerial photos from our rescue crews earlier today. #Harvey #TMDHarvey @USNationalGuard __HTTP__ _E_
Leaving for North Carolina. Big crowd will be fun! _E_
Great evening last night in New Hampshire. Got the endorsement from the New England Police Union big territory great people! Thank you. _E_
"I don't see the point of being politically correct if that means actually being incorrect." – Donald J. Trump 'Midas Touch' _E_
Snowden is a traitor and a disgrace. Make no mistake he is no hero. In fact he is a coward who should come back & face justice. _E_
"Recognize that the world needs more entrepreneurs. Everyone is counting on you." – Midas Touch _E_
.@bobschieffer did an excellent job as debate moderator last night. I only wish Mitt was more aggressive! _E_
Obama and Republicans are hollowing out our military. Now want to cut troop levels. Lowest level in over 20 years. _E_
RT @foxandfriends: .@carriesheffield: The mainstream media is neglecting their duty to represent the public. They've failed to represent ha... _E_
Congratulations to @drewbrees on setting the @NFL record with 48 consecutive games with a TD pass. He is a great guy and player. _E_
#CelebApprentice Selfies yes or no? _E_
Getting rdy to leave for tonight's Celebrate Freedom Concert honoring our GREAT VETERANS w/ so many of my evangelic... __HTTP__ _E_
Thank you. __HTTP__ _E_
Can u believe that Jeb Bush's campaign manager is in Berlin Germany looking for money? What's he giving to Germany? __HTTP__ _E_
How can NYS allow lightweight @AGSchneiderman to remain in office? What are JCOPE & Moreland Commissions waiting on? __HTTP__ _E_
Via @DrudgeReport: __HTTP__ _E_
Our new Miss USA Alyssa Campanella came up to my office today for a visit. We're proud to have her as our new title holder. _E_
Buy at the point of maximum pessimism sell at the point of maximum optimism. Sir John Templeton _E_
Happy to have just passed 1.5M followers on twitter. We picked up over 14000 yesterday alone. It's great to speak to everyone daily. _E_
Want jobs? Slash corporate tax rate. Tax incentives for companies that create jobs in US. America will boom. _E_
No matter the mission the brave men & women of our @USCG proudly answer the call to serve 24/7/365. THANK YOU and HAPPY BIRTHDAY! #CG227 __HTTP__ _E_
Failure for all of @BarackObama's talk of engaging the world U.S. favorability has dropped around the world __HTTP__ _E_
I will be doing a major sit down interview on State of the Union With Jake Tapper at 9:00 A.M. on @CNN. Enjoy! _E_
I will be in PR on Tues. to further ensure we continue doing everything possible to assist & support the people in their time of great need. _E_
RT @Reince: Happy New Year + God's blessings to you all. Looking forward to incredible things in 2017! @realDonaldTrump will Make America... _E_
George Will is a political moron. Last month he said Romney couldn't win. _E_
Great to talk jobs with #NABTU2017. Tremendous spirit & optimism we will deliver! __HTTP__ _E_
We are no longer silent. We are energized & ready to take our country back. Let's Make America Great Again! __HTTP__ _E_
.@NBCNews purposely left out this part of my nuclear qoute: until such time as the world comes to its senses regarding nukes. Dishonest! _E_
...2nd Amendment Strong Military ISIS historic VA improvement Supreme Court Justice Record Stock Market lowest unemployment in 17 yrs! _E_
Today's ceremony is a day for both remembrance and resolve.#NATOMeeting #NATO __HTTP__ _E_
Yet another terrorist attack today in Israel a father shot at by a Palestinian terrorist was killed while: __HTTP__ _E_
.@BarackObama reported over $269710 of foreign income out of his gross $894520 and paid $5841 in foreign taxes __HTTP__ _E_
What a waste of time being interviewed by @andersoncooper when he puts on really stupid talking heads likeTim O'Brien dumb guy with no clue! _E_
I appeared on David Letterman last night. And don't forget Sunday night the first episode of Celebrity Apprentice will be on NBC at 9 pm. _E_
.@BrandenRoderick I was pleased to see the wonderful statements you made about me to the media.I'm not surprised you're a special person _E_
My shirts ties and fragrance are doing great at @Macys try them! Make fantastic gifts. _E_
64 stories of golden glass over the strip @TrumpLasVegas' elite hotel rooms feature floor to ceiling windows __HTTP__ _E_
Libya is selling its oil to China I notice the Chinese Ambassador is very safe. _E_
How low has President Obama gone to tapp my phones during the very sacred election process. This is Nixon/Watergate. Bad (or sick) guy! _E_
The only way to spread economic growth is to lower taxes and end unfriendly regulatory practices. _E_
President Obama said that he thinks he would have won against me. He should say that but I say NO WAY! jobs leaving ISIS OCare etc. _E_
The Wilson family should thank me. Pegula overpaid for the @buffalobills because of me! _E_
A woman who got fired after two days of working with Scott Walker a wacko now trying to raise funds to fight me. _E_
RT @williebosshog: Make America Great Again! #Trump2016 __HTTP__ _E_
#ConfirmGorsuch #SCOTUS __HTTP__ _E_
American incomes have fallen $3040 per household in the last 38 months __HTTP__ _E_
He is working hard and for that he must be given credit! _E_
Petraeus is already negotiating a book deal. __HTTP__ Smart. Always negotiate when you are a hot commodity! _E_
Top 50 Facts About Crooked Hillary Clinton From Trump 'Stakes Of The Election' Address: __HTTP__ _E_
Entrepreneurs: Be cautiously optimistic. Call it positive thinking with a lot of reality checks. _E_
The Tonight Show begins in 5 minutes. Enjoy! _E_
Getting ready to pay final respect to GREAT LADY Joan Rivers. She could light up a room like no other! She will be greatly missed. _E_
I will be interviewed on @TODAY Show at 7:00 A.M. and on Morning Joe at 7:20. _E_
Lance Armstrong is now being sued by Fed Govt what was he thinkking? _E_
Bernie sanders has abandoned his supporters by endorsing pro war pro TPP pro Wall Street Crooked Hillary Clinton. _E_
I am watching the New York mayoral race very closely... _E_
RT @foxandfriends: .@JudgeJeanine: There will be an uproar in this country if they end up with an indictment against a Trump family member... _E_
This country cannot take four more years of Barack Obama! #Debate _E_
Sorry couldn't do @foxandfriends this morning big meeting. Will double up next week at 7. _E_
I am the BEST builder just look at what I've built. Hillary can't build. Republican candidates can't build. They don't have a clue! _E_
I knew last year that @TIME Magazine lost all credibility when they didn't include me in their Top 100... _E_
I actually enjoyed the piece re sign @TheDailyShow. Could it be that I'm starting to like Jon Stewart? _E_
Wow Hillary and Bill are in deep trouble but don't worry my fellow Republicans will let them off the hook. All talk no action. _E_
Hear me on @kiss925toronto now!#rozandmocha _E_
Via HuffPost Pollster #1 __HTTP__ _E_
RT @seanhannity: #Hannity Starts in 30 minutes with @newtgingrich and my monologue on the Deep State's allies in the media _E_
Thr coverage about me in the @nytimes and the @washingtonpost gas been so false and angry that the times actually apologized to its..... _E_
The pinnacle of the luxury public golf experience @TrumpGolfLA overlooking the Pacific Ocean in Palos Verdes __HTTP__ _E_
Sleepy eyes @chucktodd when looking at my financial filings should've said "Great job Mr. Trump Sir." _E_
Good advice from my father Fred C. Trump: Know everything you can about what you're doing. _E_
Well as predicted the 9th Circuit did it again Ruled against the TRAVEL BAN at such a dangerous time in the history of our country. S.C. _E_
RT @DanScavino: Great interview on @foxandfriends by @SteveDoocy w/ Carrier employee who has a message for #PEOTUS @realDonaldTrump & #VPE... _E_
Wow every poll said I won the debate last night. Great honor! _E_
It is also amazing how comments can be edited to provide statements that are used in a knowingly incorrect manner. _E_
Diet Coke tweet had a monster response dammit I wish the stuff worked. _E_
Via @BreitbartNews by Katie McHugh: POLL: DONALD TRUMP LEADS THE PACK AS GOP FRONTRUNNER __HTTP__ _E_
Is Gov. @BobbyJindal the stupid one for using the phrase "the stupid party" when referring to the Republicans? _E_
Former Navy SEAL Questions @BarackObama's birthplace __HTTP__ _E_
Crooked Hillary Clinton discussing the #SecondAmendment at a private event. #2A cc: @NRA __HTTP__ _E_
Goofy Senator Elizabeth Warren @elizabethforma has done less in the U.S. Senate than practically any other senator. All talk no action! _E_
THANK YOU AMERICA! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
.@SethMacFarlane will be a great Oscar host. He did an amazing job at my @ComedyCentral roast. _E_
America needs to rebuild our infrastructure. Why are we sending trillions overseas when our own roads bridge... (cont) __HTTP__ _E_
Why won't President Obama use the term Islamic Terrorism? Isn't it now after all of this time and so much death about time! _E_
James Clapper called me yesterday to denounce the false and fictitious report that was illegally circulated. Made up phony facts.Too bad! _E_
The failing @nytimes is truly one of the worst newspapers. They knowingly write lies and never even call to fact check. Really bad people! _E_
..not associated with Russia. Trump team spied on before he was nominated. If this is true does not get much bigger. Would be sad for U.S. _E_
Many of Bernie's supporters have left the arena. Did Bernie go home and go to sleep? _E_
Get the big picture but be prepared for the picture to change. Where there's a will there's a win. Think positively! _E_
Sorry folks but if I would have relied on the Fake News of CNN NBC ABC CBS washpost or nytimes I would have had ZERO chance winning WH _E_
Work hard play hard and live to the hilt. Think Like a Billionaire _E_
I will be on @foxandfriends in ten minutes enjoy! _E_
I will be live tweeting @megynkelly Show in 10 minutes. Should be interesting. Will be on Fox Network! ENJOY! _E_
Thank you for your support!#AmericaFirst #LeadRight2016 __HTTP__ _E_
Secy. Sebelius who was responsible for the horrendous ObamaCare rollout should resign or be fired.Refuses to go before Congress to explain _E_
RT @austinroneil: @realDonaldTrump Thanks for all the inspirational quotes. Helping encourage this young entrepreneur. :) _E_
I am going to give @Rosie a pass. @Rosie is desperate to get back on TV so she can be on yet another show that can be quickly canceled. _E_
I will make our Military so big powerful & strong that no one will mess with us. #Trump2016 __HTTP__ __HTTP__ _E_
#EndCommonCore #Trump2016Video: __HTTP__ __HTTP__ _E_
Great new poll Florida thank you! #MakeAmericaGreatAgain __HTTP__ _E_
Wow the economy is really bad! GROSS DOMESTIC PRODUCT down 0.7% in 1st. quarter and getting worse. I TOLD YOU SO! Only I can fix. _E_
Tell Congress to straighten out the many problems of our country before trying to be the policemen to the world. Make America great again! _E_
Via @njdotcom by Eugene R. Dunn Medford: Donald Trump towers over GOP field __HTTP__ They hate us because they ain't us. _E_
Congratulations to @BarackObama yesterday marked the 1 YR anniversary of our country's credit being downgraded __HTTP__ _E_
What do you think of water boarding the Boston killer sometime prior to allowing our doctors to make him well? I suspect he may talk! _E_
Thoughts and prayers to the great people of Indiana. You will prevail! _E_
The new course at Trump International Scotland will be a par 72 layout with five sets of tees ranging from 7540 yards to 5630. _E_
See @IvankaTrump on the cover of @HudsonMOD? View the digital edition: __HTTP__ _E_
Hillary Clinton is the only candidate on stage who voted for the Iraq War. #Debates2016 #MAGA __HTTP__ _E_
Dishonest @nytimes reporter Jonathan Martin refused to acknowledge massive crowd surge forward... __HTTP__ _E_
.@alexsalmond RT @islandbluenose You'll be doing us in scotland a great service if you win. Good Luck. _E_
.@HillaryClinton is weak on illegal immigration & totally incompetent as a manager and leader no strength or stamina to be #POTUS! _E_
Great job once again by law enforcement! We are proud of them and should embrace them without them we don't have a country! _E_
I will be on @OutFrontCNN with @ErinBurnett at 7PM. Tune in!#Trump2016 _E_
ginrnnr2 @realDonaldTrump ...is China economy in a bubble ? Only if we want it to be! _E_
I have decided to add a caveat to my offer. Obama can't decide to send my $5M to Rev. Wright if he releases his records. _E_
This is about the money I gave to charity and in response to your comments about Gadhafi... __HTTP__ _E_
A beautiful view from my office today __HTTP__ _E_
#TrumpTower is one of the country's top tourist destinations. _E_
Thank you @IngrahamAngle! #AmericaFirst __HTTP__ _E_
Being an entrepreneur is not a group effort. You have to trust yourself and your instincts. _E_
I can't resist hitting lightweight @DannyZuker verbally when he starts up because he is just.so pathetic and easy (stupid)! _E_
Thank you. __HTTP__ _E_
China is very much the economic lifeline to North Korea so while nothing is easy if they want to solve the North Korean problem they will _E_
We need another Bush in office about as much as we need Obama to have a 3rd term. No more Bushes! _E_
Thank you West Virginia. Let's keep it going. Go out and vote on Tuesday we will win big. #Trump2016 _E_
Sebelius didn't test $635M (probably $1B) ObamaCare website until "a couple of days leading up to the launch." __HTTP__ _E_
Via @Mediaite: Trump to @gretawire: Sequester Cuts Don't Go Far Enough' __HTTP__ _E_
The upcoming record 13th season of @CelebApprentice is going to be very special. Our production team's ingenuity is amazing. _E_
Thank you! __HTTP__ _E_
I don't like bullies. I am not going to stand around and watch @KarlRove target the Tea Party. Karl Rove gave us Barack Obama. Loser. _E_
Great Army Navy Game. Army wins 14 to 13 and brings home the COMMANDER IN CHIEF'S TROPHY! Congratulations! _E_
Bill O'Reilly doing a major special on @OreillyFactor tonight @FoxNews at 8pmE. Watch it should be good! #Trump2016 _E_
A great night in West Allis Wisconsin! Thank you! #VoteTrumpWI #WIPrimary __HTTP__ __HTTP__ _E_
Will soon be heading to Davos Switzerland to tell the world how great America is and is doing. Our economy is now booming and with all I am doing will only get better...Our country is finally WINNING again! _E_
Drew Brees is having a great game a fantastic quarterback and really good guy! _E_
To aspiring entrepreneurs: Be tenacious. Once you've decided on your goals remain fixed on them. Set the bar high! _E_
Entrepreneurs: Problems are a mind exercise. Enjoy the challenge. _E_
Via Newsmax. Nice article thank you so much. __HTTP__ _E_
The real scandal here is that classified information is illegally given out by intelligence like candy. Very un American! _E_
#MakeAmericaWorkAgain#TrumpPence16 #RNCinCLE __HTTP__ __HTTP__ _E_
Our great African American President hasn't exactly had a positive impact on the thugs who are so happily and openly destroying Baltimore! _E_
Via @MiamiHerald: "@IvankaTrump talks family and business" __HTTP__ _E_
Did you ever not do something that had you done it would have turned out to be a disaster. Never look back just learn from your experience! _E_
Katherine Webb gets a Donald Trump job offer says she's 'shocked' about the attention __HTTP__ via @Zap2it _E_
Don't miss me on @foxandfriends Monday at 7:30 AM _E_
Writing my inaugural address at the Winter White House Mar a Lago three weeks ago. Looking forward to Friday.... __HTTP__ _E_
Thank you to @foxandfriends for exposing the truth. Perhaps that's why your ratings are soooo much better than your untruthful competition! _E_
Made in America? @BarackObama called his 'birthplace' Hawaii here in Asia. __HTTP__ _E_
.@MittRomney is trying to hit back at me because I'm saying that he let the Repub Party down w/ his loss to Obama. Should've won—he choked! _E_
One of the simplest joys of life is golf. A great game to both play and watch. _E_
With the impending crisis in Korea is it a big confidence builder that Chuck Hagel is Sec. of Defense? Elections have consequences. _E_
.@melaniatrump on @QVC tonight at 7PM EST. Tune in! _E_
..But the people were Pro Trump! Virtually no President has accomplished what we have accomplished in the first 9 months and economy roaring _E_
The Chinese are now hacking White House computers. Why not? They already own the place. _E_
Must read @nypost editorial on $40M NYC taxpayer settlement to Central Park Thugs Wilding for Profit __HTTP__ _E_
With our weakened dollar gas will continue to rise. Fracking is an answer to lowering energy costs. _E_
It was a very wise move that Ted Cruz renounced his Canadian citizenship 18 months ago. Senator John McCain is certainly no friend of Ted! _E_
Actually Putin doesn't want Alaska because the Environmental Protection Agency will make it impossible for him to drill for oil! _E_
"Being an entrepreneur is a big task. So what can you do to prepare? First and foremost expand your focus." Midas Touch _E_
Great new poll Iowa thank you!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_
We had a great News Conference at Trump Tower today. A couple of FAKE NEWS organizations were there but the people truly get what's going on _E_
Join me tonight in Cedar Rapids Iowa at 7pm: __HTTP__ Arizona tomorrow night at 3pm: __HTTP__ _E_
Insurance companies are fleeing ObamaCare it is dead. Our healthcare plan will lower premiums & deductibles and be great healthcare! _E_
Can you believe that Ted Cruz who has been killing our country on trade for so long just put out a Wisconsin ad talking about trade? _E_
Entrepreneurs: Be totally focused. Know everything you can about what you're doing. Give your work 100% of your concentrated effort. _E_
Congratulations to @newtgingrich on a stunning win in South Carolina. All eyes are on Florida now. _E_
Paul Ryan should spend more time on balancing the budget jobs and illegal immigration and not waste his time on fighting Republican nominee _E_
My @FoxNews @gretawire int. on the border crisis #BringBackOurMarine & Obama's ineptitude & the economy __HTTP__ _E_
Will be at Fort Worth (Texas) Convention Center at 11:30 A.M. Big crowd get there early! Big announcement to be made! _E_
"'THE DONALD' GOT A MUSKET" __HTTP__ via @fitsnews _E_
Thank you Toledo Ohio! It is so important for you to get out and VOTE on November 8 2016! Lets MAKE AMERICA SAFE... __HTTP__ _E_
Via @InverurieHerald: Trump's new course plans on display __HTTP__ _E_
Watch as I humiliate a dais full of talent. #TrumpRoast airs tonight at 10:30/9:30c on Comedy Central __HTTP__ _E_
Accounting firm Ernst & Young and the celebrity judges are insulted by Miss Pennsylvania's made up PR. _E_
...approvals of The Keystone XL & Dakota Access pipelines. Also look at the recent EPA cancelations & our great new Supreme Court Justice! _E_
I will bring our jobs back to America fix our military and take care of our vets end Common Core and ObamaCare protect 2nd A build WALL _E_
My @SquawkCNBC interview from earlier in the week discussing the GOP primary and @newtgingrich's electability __HTTP__ _E_
The Mayor of San Jose did a terrible job of ordering the protection of innocent people. The thugs were lucky supporters remained peaceful! _E_
In the center of Ireland's rugged west coast @Trump_Ireland offers a beautiful golf course top dining and a Spa __HTTP__ _E_
By Obama mentioning Manhattan yesterday in his response he has singlehandedly made it target #1. How totally stupid is this guy? _E_
Disloyal R's are far more difficult than Crooked Hillary. They come at you from all sides. They don't know how to win I will teach them! _E_
Today I delivered remarks at the 36th Annual National Peace Officers' Memorial Service. #NationalPoliceWeekWatch... __HTTP__ _E_
.@pennjillette is an extraordinary entertainer & magician whose star on the Hollywood Walk of Fame is long overdue. Very proud of him. _E_
Via @ArabianBusiness: "@IvankaTrump eyes new projects in Abu Dhabi" for Trump Organization __HTTP__ _E_
As long as we open our eyes to God's grace and open our hearts to God's love then America will forever be the land of the free the home of the brave and a light unto all nations. #NationalPrayerBreakfast __HTTP__ _E_
My daughter Ivanka thinks I should run for President. Maybe I should listen. __HTTP__ _E_
Discover your true self and surround yourself with people who complement your gifts and modes of operation. Midas Touch _E_
.@AS_ScienceGuy @realDonaldTrump Thank you for all your support of @autismspeaks Great new breakthroughs. Fantastic! _E_
It is terrible that neither Obama Biden nor Kerry attended Lady Thatcher's funeral. They would all run to Muslim Brotherhood Morsi's. _E_
The only reason President Obama wants to attack Syria is to save face over his very dumb RED LINE statement. Do NOT attack Syriafix U.S.A. _E_
RT @foxandfriends: FOX NEWS ALERT: North Korea responds to U.S. with Guam attack plan as Secretary Mattis warns Kim Jung Un "he is grossly... _E_
.@tedcruz you were terrific on @seanhannity tonight. I am going to the border tomorrow. _E_
#CNNDebate __HTTP__ _E_
It's Thursday how much $ has @BarackObama wasted today? _E_
Dr. Ben Carson I concur. I believe in God who can change people he can make any of us better. @RealBenCarson _E_
From Bloomberg: "Chrysler's Jeep expects China production agreement soon." I told you so. _E_
No wonder Afghanistan is a mess! @BarackObama is releasing high level insurgents in exchange for pledges of peace. __HTTP__ _E_
.@TrumpSoHo New York has interiors by celebrated design house Fendi Casa and 360 degree views of the city skyline. __HTTP__ _E_
Wow three top MICROSOFT investors want Bill Gates out as Chairman. Do not like job he is doing! _E_
What an amazing comeback and win by the Patriots. Tom Brady Bob Kraft and Coach B are total winners. Wow! _E_
"Learning is a new beginning we can give ourselves every day." – Trump: How to Get Rich _E_
Johnny Miller—Great job this weekend. Most insightful and tough. See you at Doral. _E_
I will be making a major speech on ILLEGAL IMMIGRATION on Wednesday in the GREAT State of Arizona. Big crowds looking for a larger venue. _E_
.@myfoxny discussing NYPD Chief Kelly's great record & the launch of the crowdfunding site __HTTP__ __HTTP__ _E_
Let's take a closer look at that birth certificate. @BarackObama was described in 2003 as being born in Kenya. __HTTP__ _E_
Wish Obama would say ISIS like almost everyone else rather than ISIL. _E_
Afghan Leader Karzai has received tens of millions of dollars IN CASH from the U.S. Government how stupidly is our Country being run? _E_
Will be on @jimmykimmel tonight at 11:35pmE on @ABC. #Kimmel #Trump2016 #MakeAmericaGreatAgain _E_
I'm just so tired of listening to the same old rhetoric and words day after day from our President. It's time to stop talking WORK! _E_
Over 100M are now receiving some form of welfare __HTTP__ We must do better. @MittRomney has the vision to get America working. _E_
Just to show you how unfair Republican primary politics can be I won the State of Louisiana and get less delegates than Cruz Lawsuit coming _E_
Join us in Toledo Ohio tomorrow night at 8pm! #TrumpPence16 #MAGATickets: __HTTP__ __HTTP__ _E_
America became a powerhouse because of our deep belief in the virtue of self reliance. #TimeToGetTough (cont) __HTTP__ _E_
Help fund @Dratzenberger's new show 'American Made' on @fundanything __HTTP__ John is on @teamcavuto today re project. _E_
Will be on @seanhannity tonight at 10pm hosted by @GovMikeHuckabee. Enjoy! _E_
Our online store is officially open! Visit __HTTP__ to shop the latest #MakeAmericaGreatAgain merchandise. _E_
A true honor to receive the endorsement of John Wayne's daughter....read: __HTTP__ __HTTP__ _E_
Bayer AG has pledged to add U.S. jobs and investments after meeting with President elect Donald Trump the latest in a string... @WSJ _E_
#trumpvlog The song Donald Trump hits 54 million views. @MacMiller Where's my money? __HTTP__ _E_
"@TrumpFerryPoint A Brand New Championship Golf Course In NYC Developed By Donald Trump And Anyone Can Play It" __HTTP__ _E_
Via @TheOaklandPress Donald Trump speaks in Novi(Michigan) draws record breaking crowd __HTTP__ _E_
Our country under President Obama is on life support! Great leaders must bring people together. _E_
Via @theobserver: Donald Trump: Lake Norman golf course 'one of the hottest places around' __HTTP__ _E_
After being ripped off for years Obama finally figured out that China is taking advantage of us. He's finally listening to me. _E_
Great #Thanksgiving travel and parade watching tips by @NYTimesTravel including an option from @TrumpNewYork: __HTTP__ _E_
Crooked's camp incited violence at my rallies. These incidents weren't spontaneous like she claimed in Benghazi! __HTTP__ _E_
Go to Trump Doral in Miami and watch the World Golf Championship! On NOW! _E_
We commend SG @AntonioGuterres & his call for the UN to focus more on people & less on bureaucracy. #USAatUNGA #UNGA __HTTP__ __HTTP__ _E_
.@FoxNewsSunday _E_
Entrepreneurs: Look at the solution not the problem. Learn to focus on what will give results. _E_
The failing @HuffingtonPost and dopey @ariannahuff are writing so much false junk about me they just can't get enough! BE CAREFUL. _E_
Just a reminder that Ted Cruz supported liberal Justice John Roberts who gave us #Obamacare. __HTTP__ _E_
Lies and incompetence the two words that are most closely associated with ObamaCare! _E_
Spitzer and Weiner lost lightweight Eric Schneiderman will be next he will be challenged in the PRIMARY. He has done a really poor job! _E_
Each of the 176 magnificent luxury suites and guestrooms at @TrumpNewYork provide a sophisticated urban appeal __HTTP__ _E_
Chris McDaniel looks like he will win in Mississippi GREAT NEWS and big victory for Tea Party! _E_
"Push yourself again and again. Don't give an inch until the final buzzer sounds." Larry Bird _E_
Highly respected PUBLIC POLICY POLLING (PPP) just announced that I am number one in IOWA. Thank you! _E_
Bush was called unpatriotic by @BarackObama in '07 for adding $4T to debt __HTTP__ @BarackObama increased it $6T in 3 years. _E_
Donna Summer performed for me many times she was great and will be missed. @TheDonnaSummer _E_
.@BMP_Music_Event Read 'Midas Touch' great book for entrepreneurs. Good luck! _E_
#TheView Lots of fun on @TheViewTV with @JennyMcCarthy and @SherriEShepherd __HTTP__ _E_
Just completed call with President Moon of South Korea. Very happy and impressed with 15 0 United Nations vote on North Korea sanctions. _E_
.@mystikangel Bring @johnrich back? He is back! _E_
Yes the BP oil spill was bad but it was no reason to put tighter clamps on domestic drilling. That showed no (cont) __HTTP__ _E_
The habitual vacationer @BarackObama is now in Hawaii. This vacation is costing taxpayers $4 milion +++ while there is 20% unemployment. _E_
#VoteTrump video: __HTTP__ #ArizonaPrimary #UtahCaucus #UTCaucus #AmericanSamoa __HTTP__ _E_
So we can spy on our ally's leaders but can't water board terrorists? _E_
Just got back to New York from California. Will be on Fox & Friends tomorrow morning at 7.00. ObamaCare and other disasters to be discussed _E_
The Fake News Is going all out in order to demean and denigrate! Such hatred! _E_
Today's EO established a commission on combating drug addiction and the opioid crisis. Watch listening session... __HTTP__ _E_
Carly Fiorina I agree! Ted Cruz is just another politician. All talk no action! __HTTP__ _E_
It was @BarackObama who promised if you like your plan you can keep your plan. Now ObamaCare is causing (cont) __HTTP__ _E_
#UtahCaucus message from @IvankaTrump! #UTCaucus#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Vote for your favorite @MissUSA contestant the 2013 #MissUSA Fan Vote at __HTTP__ ! _E_
The @PGATOUR comes to Miami on March 6th when the @CadillacChamp returns to @TrumpDoral __HTTP__ See you there! _E_
Why aren't the lawyers looking at and using the Federal Court decision in Boston which is at conflict with ridiculous lift ban decision? _E_
Yesterday in front of Rockefeller Center __HTTP__ _E_
The judge in the Oscar Pistorious case is a total moron. She said he didn't act like a killer. This is another O.J. disaster! _E_
Wacko @glennbeck is a sad answer to the @SarahPalinUSA endorsement that Cruz so desperately wanted. Glenn is a failing crying lost soul! _E_
Thanks to our loyal viewers & fans last night's @ApprenticeNBC topped all the demos & grew 24% in our regular slot premiere. _E_
Congratulations @ElonMusk and @SpaceX on the successful #FalconHeavy launch. This achievement along with @NASA's commercial and international partners continues to show American ingenuity at its best! __HTTP__ _E_
Yesterday the White House claimed its ISIS strategy is a 'success.' Tell that to the Christians being beheaded. We need to hit ISIS hard! _E_
Imposing dunes on the Aberdeenshire coastline @TrumpScotland's Championship course is a classical Scottish links __HTTP__ _E_
With @BarackObama listing himself as Born in Kenya in 1999 __HTTP__ HI laws allowed him to produce a fake certificate. #SCAM _E_
"30000 MACY'S CUSTOMERS RETALIATE IN SUPPORT OF DONALD TRUMP" __HTTP__ via @BreitbartNews by @ASwoyer _E_
Just arrived in Youngstown Ohio with @FLOTUS Melania!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
I will be going to New Hampshire today home of my first primary victory to discuss terror and the horrible events of yesterday. 2:30 P.M. _E_
Thank you to all of our law enforcement officers across America! #LESM #MAGA __HTTP__ __HTTP__ _E_
Wow just saw the really bad @CNN ratings. People don't want to watch bad product that only builds up Crooked Hillary. _E_
Why should ObamaCare be delayed for businesses and not working families? With premiums rising at record levels it is not equitable. _E_
Next Saturday night I will be holding a BIG rally in Pennsylvania. Look forward to it! _E_
Big election tomorrow in the Great State of Alabama. Vote for Senator Luther Strange tough on crime & border will never let you down! _E_
Join me in Clive Iowa tomorrow at noon! #AmericaFirst #MAGATickets: __HTTP__ __HTTP__ _E_
...At the same time go through a worst case scenario but keep it short. Focus on your goal—look at the solution not the problem. _E_
It was my great honor to pay tribute to a VET who went above & beyond the call of duty to PROTECT our COMRADES our COUNTRY & OUR FREEDOM! __HTTP__ _E_
Lithium ion batteries should not be allowed to be used in aircraft. I won't fly on the Boeing 787 Dreamliner it uses those batteries. _E_
Not only does Obama spy on German leaders he criticizes their trade surplus __HTTP__ We should have a trade surplus! _E_
Hillary took money and did favors for regimes that enslave women and murder gays. _E_
.@PatrickBuchanan was great on @TeamCavuto @FoxNews. Thank you Pat! #Tump2016 _E_
.@GOP HOUSE LEADERSHIP – ESTABLISH SELECT COMMITTEE ON BENGHAZI. THERE IS A MASSIVE COVERUP. _E_
RT @namusca: #VoteTrump2016 a real leader that truly cares about America & our values. He wants to bring prosperity back 2 USA __HTTP__ _E_
I am very disappointed in China. Our foolish past leaders have allowed them to make hundreds of billions of dollars a year in trade yet... _E_
Remember anything you read about Atlantic City has nothing to do with me. I sold years ago and left. Good timing but very sad! _E_
Looking forward to honoring the great Dogan family & the success of the Trump Towers project in Istanbul @FollowTurkey Annual Gala Dinner _E_
Time to end the visa lottery. Congress must secure the immigration system and protect Americans. __HTTP__ _E_
.@Peggynoonannyc An election between Hillary and myself will be the biggest voter turnout in U.S. history. Just like the debates 24 M vs 2M. _E_
Just said at #NCGOPCon that I'm not beholden to lobbyists and donors! No special interest would control me if I were in office. _E_
Excited to be travelling to New Hampshire on Monday. The Granite State is a model for the country. Live Free or Die! _E_
Spanish version of ObamaCare website delayed __HTTP__ Hitting google translate apparently too complicated. #MakeDCListen _E_
USMC Sgt. Tahmooressi sacrificed for our country. While Obama is welcoming illegals our Marine is locked in a Mexican jail. #FreeOurMarine _E_
Taxpayers are paying a fortune for the use of Air Force One on the campaign trail by President Obama and Crooked Hillary. A total disgrace! _E_
Wow Record ratings for WGC Cadillac Championship at Trump National Doral's Blue Monster Most watched in seven years. CONGRATS to@Tiger Woods _E_
I have a proven track record supporting our Veterans. Veterans deserve universal access to care. VA scandal proves politicians are inept. _E_
Launching the Trump Home by Dorya Furniture Collection today. It looks amazing! @HPMARKETNEWS @DoryaInteriors __HTTP__ _E_
#2017Jambo Remember your duty. Honor your history. Take care of the people God puts into your life – and LOVE & CHERISH your country! __HTTP__ _E_
It was the childishly written & taunting PR statement by Fox that made me not do the debate more so than lightweight reporter @megynkelly. _E_
.@PiersMorgan is right he won the show because "I know how to play the game." #CelebApprentice _E_
CNN'S slogan is CNN THE MOST TRUSTED NAME IN NEWS. Everyone knows this is not true that this could in fact be a fraud on the American Public. There are many outlets that are far more trusted than Fake News CNN. Their slogan should be CNN THE LEAST TRUSTED NAME IN NEWS! _E_
#ISIS is making $400M/year on oil. I have been saying it for years. We need to bomb the oil! __HTTP__ __HTTP__ _E_
Great Live Signing last nite! Over 25k views. I am signing books for next two weeks. Order yours for holiday gifts: __HTTP__ _E_
Endorsements for Lyin' Ted Cruz __HTTP__ _E_
.@IvankaTrump's Favorite Miami Hot Spots @TrumpGolf @TrumpDoral __HTTP__ _E_
The United States must greatly strengthen and expand its nuclear capability until such time as the world comes to its senses regarding nukes _E_
An updated POLL tracker (with all polls thru the weekend) reveals I maintained a double digit lead at... __HTTP__ _E_
Look if we can make chopsticks in America and sell them to the Chinese we can compete on hundreds of other fronts as well. TimeToGetTough _E_
Remember Obama limped across the finish line he should have lost to Hillary. Be careful! _E_
If Stop & Frisk is struck down by the pandering NYC politicians increases in crime & eventual terrorist attacks will be on them. _E_
Bullshit Pop gave me knowledge and a relatively small amount of money (split between brothers and sisters) and I built it into over 9 bill. _E_
Thank you working hard! __HTTP__ _E_
Will be doing interview on @GolfChannel at 8.00 this morning. Will be talking about getting the great PGA Championship & Senior PGA etc. _E_
Via @BreitbartNews by @rwildewrites: "Donald Trump: I Can Make America Great Again" __HTTP__ (Hyperlinked on @DRUDGE_REPORT) _E_
Must see video Obama's criticism of @MittRomney is identical to Carter's on Reagan __HTTP__ _E_
Why did we spend billions of our money on Libya if we are not going to get any of the country's oil? What do we get out of this? _E_
China is buying gas fields in Texas __HTTP__ & stealing our corporate secrets... _E_
Will be interviewed by @SeanHannity on @foxnews at 10PM tonight. Enjoy! _E_
Looking forward to watching the legendary @BarbaraJWalters interview my family (and me) tonight on @ABC at 10:00. Many things to talk about! _E_
With two champion style courses @TrumpGolfDC graces 600 rolling acres along the peaceful and scenic Potomac River __HTTP__ _E_
Senator Chuck Schumer helping to import Europes problems said Col.Tony Shaffer. We will stop this craziness! @foxandfriends _E_
If President Obama was going to attack Syria he should've done it a long time ago as a surprise & not after (cont) __HTTP__ _E_
TO ALL AMERICANS __HTTP__ _E_
I'm at Trump Doral right now Tiger will tee off shortly. _E_
As Iran began the process of taking over Iraq many people wanted me to say that "I told you so!" – so I told you so. _E_
MUST READ! My @chicagotribune editorial: I love Chicago ... and my sign! __HTTP__ _E_
#TBT At the US Open Tennis Tournament with @EricTrump see same hairstyle! __HTTP__ _E_
Via @DMRegister :"@brentroske on Politics: Trump Talks Iowa" __HTTP__ _E_
End the Democrats Obstruction! __HTTP__ _E_
TRUMP APPROVAL HITS 50% __HTTP__ _E_
Many of Hillary's donors are the same donors as Jeb Bush's—all rich will have total control—know them well. _E_
My @HollywoodLife interview w/ @MELANIATRUMP discussing her debut on @ApprenticeNBC & her skin care line __HTTP__ _E_
#MakeAmericaSafeAgain __HTTP__ _E_
Crude is at $100/Barrel. With the current state of the world economy how is that possible? OPEC is ripping of... (cont) __HTTP__ _E_
No wonder boxing is close to dead! _E_
Thank you @billoreilly & @KarlRove. Ted Cruz should be immediately disqualified in Iowa with each candidate moving up one notch. _E_
Our biggest problems are solved by growth. We need a President who is a job creator. Let's Make America Great Again! __HTTP__ _E_
Jodi thought she outsmarted the system it didn't work! Congratulations to the jury on a job well done! Now will it be life or death? _E_
Well now they're saying that I not only won the NBC Presidential Forum but last night the big debate. Nice! _E_
Young Entrepreneurs – the Holiday season is here but that is no excuse not to stay on top of your business prospects. Focus! _E_
Consumer confidence soars to highest level since 2004 📈 __HTTP__ __HTTP__ _E_
By @kwrcrow: "NY Post caught 'LYING' Again!" __HTTP__ The Donald" should go far. Actually if I run I'll win. _E_
The resolution being considered at the United Nations Security Council regarding Israel should be vetoed....cont: __HTTP__ _E_
Obama attacks the CIA for waterboarding while routinely droning civilians caught in the Islamist crosshairs. _E_
If dopey Mark Cuban of failed Benefactor fame wants to sit in the front row perhaps I will put Gennifer Flowers right alongside of him! _E_
Obama can release 5 senior Taliban for a deserter but can't make Mexico release decorated Marine Sgt. Andrew Tahmooressi. Pathetic _E_
Romney's campaign is being put on the defensive. He cannot let this happen. Stop pandering. Must get tougher (cont) __HTTP__ _E_
Do people notice Hillary is copying my airplane rallies she puts the plane behind her like I have been doing from the beginning. _E_
We need much tougher much smarter leadership and we need it NOW! _E_
John Cahill is highly respected in all circles—really nice to see that he's running for New York State Attorney General. @CahillForAG _E_
I won every poll from last nights Presidential Debate except for the little watched @CNN poll. _E_
BORDER WALL prototypes underway! __HTTP__ _E_
#CelebrityApprentice It's good to have Jack back too with @marleematlin. He's become a star. #sweepstweet _E_
"Always remember that the future comes one day at a time." Dean Acheson _E_
Denzel Washington gave a wonderful commencement speech over the weekend. From the heart! _E_
Show me someone without an ego and I'll show you a loser having a healthy ego or high opinion of yourself is a real positive in life! _E_
I love America. And when you love something you protect it passionately fiercely even. #TimeToGetTough (cont) __HTTP__ _E_
My use of social media is not Presidential it's MODERN DAY PRESIDENTIAL. Make America Great Again! _E_
#MakeAmericaGreatAgain #Trump2016Video: __HTTP__ __HTTP__ _E_
.@JebBush is totally lost he spends too much time managing the bloated staff of his campaign & not enough talking about America's future. _E_
The Oscars were a great night for Mexico & why not—they are ripping off the US more than almost any other nation. _E_
"Every strike brings me closer to the next home run." Babe Ruth _E_
Learn from yesterday live for today hope for tomorrow. The important thing is not to stop questioning. Albert Einstein _E_
Wow Obama really put it to Israel by canceling flights there. This puts them at a tremendous disadvantage. Tourism and more will just stop. _E_
I want to help our miners while the Democrats are blocking their healthcare. _E_
.@oreillyfactor why don't you have some knowledgeable talking heads on your show for a change instead of the same old Trump haters. Boring! _E_
Tom Brady is a good friend of mine a great player a great guy and a total winner! Fantastic comeback win this is what our country needs! _E_
We have given Syria so much time and information there has never been such an instance in wartime history. Syria is now fully prepared! _E_
Very sad that Republican donors were targeted by Obama's IRS. _E_
Via @Reuters by @sumeet_chat: "Donald Trump plans investment in India betting on Modi government" __HTTP__ _E_
.@CelebApprentice having "top brand impact 2012" ahead of Idol Survivor X Factor & all others has caused quite a stir no surprise! _E_
In other words Russia was against Trump in the 2016 Election and why not I want strong military & low oil prices. Witch Hunt! __HTTP__ _E_
I have always had a good relationship with Chuck Schumer. He is far smarter than Harry R and has the ability to get things done. Good news! _E_
Thank you Arlene! We will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #DrainTheSwamp __HTTP__ _E_
With Luis Mexico and the United States would have made wonderful deals together where both Mexico and the US would have benefitted. _E_
Join me tomorrow in Dubuque Iowa! #IACaucus #Trump2016 __HTTP__ _E_
Everybody is laughing at Jeb Bush spent $100 million and is at bottom of pack. A pathetic figure! _E_
.@TrumpNationalNY is NY's best golf club. A 5 Star Diamond Award winner w/ an elite golf course & top facilities __HTTP__ _E_
Heading to Alabama now big crowd! _E_
Just out according to @CNN: Utah officials report voting machine problems across entire country _E_
Why did Clinton supporter @AlisonForKY declare Crooked Hillary winner in KY when AP hasn't even called the race? _E_
Thank you Michigan! #VoteTrumpMITrump 35%Kasich 17%Cruz 12%Rubio 12%Carson 9% Via: ARG _E_
Thank you. __HTTP__ _E_
Thank you Fort Wayne Indiana!#Trump2016 #INPrimary __HTTP__ _E_
Newsstand sales for @VanityFair run by sleepy Graydon Carter are down almost 20%. All he cares about are his bad food restaurants! _E_
Just returned from Asia after 12 very successful days. Great to be home! _E_
Wow even I didn't realize we did so much. Wish the Fake News would report! Thank you. __HTTP__ _E_
RT @WhiteHouse: FACT: when #Obamacare was signed CBO estimated that 23M would be covered in 2017. They were off by 100%. Only 10.3M people... _E_
People who have the ability to work should. But with the government happy to send checks too many of them don't. #TimeToGetTough _E_
Vattenfall CEO stated that the company needed to prepare itself for falling electricity demands in coming years a changing market. _E_
Congratulations to @MichaelPhelps on concluding the greatest Olympic career ever. You have made us all very proud. _E_
The new winner of the @MissTeenUSA pageant K. Lee Graham __HTTP__ _E_
US job cuts jumped 53% in May from April __HTTP__ This is the Obama recovery? _E_
With all of the words President Obama just dispensed at his press conference he didn't say what we all want to hear I'LL STOP THE FLIGHTS _E_
Among the lowest temperatures EVER in much of the United States. Ice caps at record size. Changed name from GLOBAL WARMING to CLIMATE CHANGE _E_
Amnesty is suicide for Republicans.Not one of those 12 million who broke our laws will vote Republican.Obama is laughing at @GOP. _E_
UNBELIEVABLE!Clinton campaign contractor caught in voter fraud video is a felon who visited White House 342 times: __HTTP__ _E_
Crime is out of control and rapidly getting worse. Look what is going on in Chicago and our inner cities. Not good! _E_
Procter and Gamble is relocating its beauty headquarters from Cincinnati to Asia what are we doing?! _E_
95% of Americans will pay less or at worst the same amount of taxes (mostly far less). The Dems only want to raise your taxes! _E_
Today is referendum on ObamaCare Amnesty slow growth having your healthcare dropped & all the other lies. _E_
Join me live in Waukesha Wisconsin for an 8pmE rally! #AmericaFirst #MAGA __HTTP__ _E_
President Obama should stay out of the Hong Kong protests we have enough problems in our own country!Can't even properly police White House _E_
All the haters and losers must admit that unlike others I never attacked dopey Jon Stewart for his phony last name. Would never do that! _E_
FLORIDA Just like TX WE are w/you today we are w/you tomorrow & we will be w/you EVERY SINGLE DAY AFTER to RESTORE RECOVER & REBUILD! __HTTP__ _E_
Whether @RepPaulRyan's plan is sound fiscal policy is not the relevant issue the issue is strategic timing. Why release it now? _E_
I applaud Columbia South Carolina for cleaning up biz center __HTTP__ Will cut crime & advance commerce. _E_
Whether you like it or not Bush also gave us Obama! _E_
It was 25 years ago today that Pan Am flight 103 was downed by a terrorist killing 270 innocent people. @AlexSalmond released the terrorist! _E_
THE HILL'S TWITTER ROOM: Trump: Spitzer Weiner turning New York into 'pervert central' __HTTP__ _E_
RT @DanScavino: Join @realDonaldTrump on his official social media platforms during tonight's debate ~ as @TeamTrump manages rapid response... _E_
Bring in 2014 @TrumpSoHo's NYE soireé NYC's most exclusive New Year's Eve Party w/SoHi & @VeuveClicquot __HTTP__ _E_
I have a surprise for a really special kid on Thursday's episode of @KatieShow with @KatieCouric: __HTTP__ _E_
The Wall is the Wall it has never changed or evolved from the first day I conceived of it. Parts will be of necessity see through and it was never intended to be built in areas where there is natural protection such as mountains wastelands or tough rivers or water..... _E_
Crude is about to pass $90/barrel. The OPEC monopoly must be broken. They are robbing our country blind. _E_
Ted Cruz along with Jeb Bush pushed Justice John Roberts onto the Supreme Court. Roberts could have killed ObamaCare twice but didn't! _E_
My thoughts and prayers go out to the @PhillyPolice & @Penn police officers in Philadelphia. __HTTP__ _E_
....because he doesn't even live there! He wants to raise taxes and kill healthcare. On Tuesday #VoteKarenHandel. _E_
Thank you! #AmericaFirst __HTTP__ _E_
Very little discussion of all the purposely false and defamatory stories put out this week by the Fake News Media. They are out of control correct reporting means nothing to them. Major lies written then forced to be withdrawn after they are exposed...a stain on America! _E_
Rosie O'Donnell went after me again on The View in order to stir up her failing ratings. Nothing will help her @Rosie always fails. _E_
Obama's VA Secretary just said we shouldn't measurewait times. Hillary says VA problems are not 'widespread.' I will take care ofour vets! _E_
Remember I'll see you in D.C. at the Capitol Building on Wednesday at 1:00 o'clock. Then Dallas on Sept.14 at 6:00 P.M. American Air Center _E_
Entrepreneurs: Another question to ask yourself—"What am I pretending not to see?" There may be great opportunities right around you. _E_
Thank you Clive Iowa! __HTTP__ _E_
Weekly jobless claims soared to 21.5% a 6 month high __HTTP__ ObamaCare the greatest job killer in US history. _E_
Jeb is spending millions of dollars on "hit" ads funded by lobbyists & special interests. Bad system. _E_
I find that @Reuters is a far more professional operation than @AP. _E_
Ford is MOVING jobs from Michigan to Mexico AGAIN! __HTTP__ As President this will stop on Day One! Jobs will stay here. _E_
Despite some very corrupt and dishonest media coverage there are many great reporters I respect and lots of GOOD NEWS for the American people to be proud of! _E_
Congratulations to @NewYorkObserver on celebrating its 25 year anniversary. Great paper under amazing management! _E_
Hey @realjeffreyross @whitney cummings @lisalampanelli: you call yourselves comedians? #TrumpRoast tonight 10:30/9:30c on @ComedyCentral. _E_
Alert US jobless claims up 46000 to 388000. Really bad news. 7.8% is now a fraud not possible! _E_
The new Congress must restore military spending & stop Obama budget cuts. Also hold Obama accountable on the VA. _E_
So why aren't the Committees and investigators and of course our beleaguered A.G. looking into Crooked Hillarys crimes & Russia relations? _E_
Congratulations to @CNN for having the wisdom to pick TRUMP! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
My review of #TheDarkKnightRises and more in today's #trumpvlog __HTTP__ _E_
Why is it that the Fake News rarely reports Ocare is on its last legs and that insurance companies are fleeing for their lives? It's dead! _E_
See you at 7:00 P.M. tonight Phoenix Arizona! #MAGATickets: __HTTP__ __HTTP__ _E_
I hope Newtown CT can now start to heal—but it won't be easy! _E_
I don't hate Obama at all I just think he is an absolutely terrible president maybe the worst in our history! _E_
The situations in Tulsa and Charlotte are tragic. We must come together to make America safe again. _E_
Without passion you don't have energy and without energy you have nothing! Just one more of my totally brilliant quotes use it well. _E_
.@BillKristol Bill your small and slightly failing magazine will be a giant success when you finally back Trump. Country will soar! _E_
Dopey @billmaher still owes me $5M for charity. I hope he pays up before @hbo fires him which will happen! _E_
But why shouldn't I speak out? Don't you speak out in this country? George Steinbrenner _E_
THANK YOU to all of the great volunteers helping out with #HurricaneHarvey relief in Texas! __HTTP__ _E_
Be careful of an Obama bomb to win election! Would be a horrible thing to do. _E_
We cannot continue to let Israel be treated with such total disdain and disrespect. They used to have a great friend in the U.S. but....... _E_
Highly respected economist @Larry_Kudlow is a big fan of my tax plan—thank you Larry. __HTTP__ _E_
THE LAST THING THIS COUNTRY NEEDS IS ANOTHER BUSH! _E_
Doesn't help Kasich to do negative ads on me because he still has to go through everyone else he's almost last. _E_
.@FLGovScott can create tens of thousands of jobs by approving casinos in Miami it's time. @willweatherford _E_
A story in the @washingtonpost that I was close to "rescinding" the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources don't exist! _E_
If Alison Grimes can't admit she voted for Obama even if she is embarrassed then you can't trust her! Vote @Team_Mitch! _E_
The Mullahs laughed when @BarackObama asked Iran to return our drone they will show it to China first. _E_
.@danawhite You have done an amazing job I am proud to have been there at the very beginning! _E_
To the people of Puerto Rico:Do not believe the #FakeNews!#PRStrong _E_
Israel Saudi Arabia and the Middle East were great. Trying hard for PEACE. Doing well. Heading to Vatican & Pope then #G7 and #NATO. _E_
.@MittRomney looks much calmer and Obama should stop nodding his head backwards and forward. _E_
Congressional Black Caucus Chairman Emanuel Cleaver is right. @BarackObama's budget is a nervous breakdown on paper. __HTTP__ _E_
As an honorary Buckeye I want to thank the OH GOP primary voters for putting @MittRomney over the top. It was a crucial win. _E_
We will bring back our jobs. We will bring back our borders. We will bring back our wealth and we will bring back our dreams! _E_
.@JordanSpieth Great playing at the Masters and don't get down Jordan you will win many tournaments and many MAJORS! Keep working hard. _E_
Bill Clinton is right: Obamacare is 'crazy' 'doesn't work' and 'doesn't make sense'. Thanks Bill for telling the truth. _E_
Bernie Sanders on HRC: Bad Judgement. John Podesta on HRC: Bad Instincts. #BigLeagueTruth #Debate _E_
Watched Gennady Golovkin @gggboxing at MSG on Saturday night. He was fantastic should fight @FloydMayweather! _E_
He @BarackObama invited his top campaign bundlers and donors to the British State Dinner __HTTP__ So corrupt! _E_
Thehas great strength & patience but if it is forced to defend itself or its allies we will have no choice but to totally destroy #NoKo. __HTTP__ _E_
The Democrats have no message not on economics not on taxes not on jobs not on failing #Obamacare. They are only OBSTRUCTIONISTS! _E_
Via @gulf_news by @JoeHeim: "@IvankaTrump: Giving back is a priority for me" __HTTP__ _E_
The Blue Monster is celebrated in June issue of Robb Report as the Best of the Best winner in Golf Course Category. __HTTP__ _E_
New Day on CNN treats me very badly. @AlisynCamerota is a disaster. Not going to watch anymore. _E_
I bet the terrorists in Libya used weapons we supplied them during their so called 'revolution' to attack our embassy in Benghazi. _E_
Glad the Trans Pacific Partnership failed in the Senate. Bad deal for American worker & economy! We need SMART TRADE! __HTTP__ _E_
In America we don't worship government we worship God. #ValuesVotersSummit __HTTP__ _E_
Hillary's wars in the Middle East have unleashed destruction terrorism and ISIS across the world. _E_
This was the reporters statement when she found out there was tape from my facility she changed her tune. __HTTP__ _E_
#ChrisWallace who interviewed me on Sunday had his highest ratings since Feb of '09. Congratulations! __HTTP__ _E_
New Poll Shows Donald Trump Blowing Everyone Else Out of the Water. __HTTP__ _E_
Thank you Tucson Arizona! A great afternoon with 6000 supporters! #VoteTrump on Tuesday!#MakeAmericaGreatAgain __HTTP__ _E_
The FBI is totally unable to stop the national security leakers that have permeated our government for a long time. They can't even...... _E_
To be in charge you have to take responsibility you have to instill confidence. It's like being a conductor set the tempo. _E_
"The harder you work the luckier you get." Gary Player _E_
Listen and learn from others but make your own decisions. Use your instincts you alone know where you want to go. _E_
My successful acquisition of the Kluge estate was a fantastic deal which is already being studied in business schools. _E_
Why doesn't Obama let our marines who are guarding the embassies in Egypt have live ammunition? They need it fast. _E_
RT @NRCC: Good to hear @realDonaldTrump is on board.GOP is the party of free enterprise.Join us as we innovate: __HTTP__ _E_
The Yuan hit another record high against the Dollar. China is laughing at our expense. _E_
I want all Americans to succeed together. President Obama's illegal executive amnesty undermines job prospects for... __HTTP__ _E_
Iowa Congressman @SteveKingIA has endorsed the Newsmax @iontv debate. He has been doing great work in the House. _E_
I see Marco Rubio just landed another billionaire to give big money to his Superpac which are total scams. Marco must address him as SIR ! _E_
All the governors are already backing off of the Ebola quarantines. Bad decision that will lead to more mayhem. _E_
92 year old registers to vote for first time says will vote for Trump __HTTP__ _E_
So wrong! @BarackObama is hosting China's VP Xi Jinping today at the Pentagon with a full honor ceremony with music and cannons... _E_
"Trump: 'Never Give Up' on Farmland Value Rally" __HTTP__ @TerryBranstad @KimReynoldsIA @ChuckGrassley @SenJoniErnst @BNorthey _E_
Watched protests yesterday but was under the impression that we just had an election! Why didn't these people vote? Celebs hurt cause badly. _E_
Let's not get too excited about Monday's U.S. Supreme Court oral argument on #ObamaCare before the decision. No (cont) __HTTP__ _E_
Via @RoyalOakPatch: Oakland County High Schoolers Have Chance to Win $1000 Scholarship & Meet Donald Trump __HTTP__ _E_
Wow 15 policemen hurt in Baltimore some badly! Where is the National Guard. Police must get tough and fast! Thugs must be stopped. _E_
Thank you Concord North Carolina! When WE win on November 8th we are going to Washington D.C. and we are going t... __HTTP__ _E_
...Mexico cannot believe what they are getting away with and have absolutely no respect for our leader. _E_
.@bigstack19 @realDonaldTrump Does anyone actually read Rolling Stone anymore? Guess they had to create (cont) __HTTP__ _E_
Lance Armstrong was given veryvery bad advice! _E_
President Obama should have gone to Louisiana days ago instead of golfing. Too little too late! _E_
While @BarackObama spends recklessly on domestic projects he is hollowing out our military with over $487B in cuts __HTTP__ _E_
NBC News just called it the great freeze coldest weather in years. Is our country still spending money on the GLOBAL WARMING HOAX? _E_
Unlike crooked Hillary Clinton who wants to destroy all miners I want wages to go up in America. We will do so by bringing back jobs! _E_
"Peace is not absence of conflict it is the ability to handle conflict by peaceful means." – Pres. Ronald Reagan _E_
Just arrived in Cleveland Ohio join Governor @Mike_Pence and I now LIVE via: __HTTP__ _E_
Trump Puerto Rico is 1st development in Puerto Rico to combine lavish residences world class golf & a beach __HTTP__ _E_
Resolve to be bigger than your problems. Who's the boss? Realize that fear is the exact opposite of faith. _E_
I am so proud of my daughter Ivanka. To be abused and treated so badly by the media and to still hold her head so high is truly wonderful! _E_
Governor @Mike_Pence and I will be in Cleveland Ohio tomorrow night at 7pm join us! #MAGATickets:... __HTTP__ _E_
I will be on @meetthepress this morning at various times across the U.S. @NBCNews Enjoy! _E_
Trump Doral's renovations are right on schedule __HTTP__ Once completed it will be the top resort in the U.S. _E_
Why does @BarackObama support the radical Islamists in Egypt protests yet has such a high disregard for the Tea Party? _E_
RT @DonaldJTrumpJr: Ironic since Hillary has gotten a lot more of that dark unaccountable money into her campaign. #debates _E_
Watch CNN tomorrow at 2 pm & 5 pm and on Friday at 7 pm & 11 pm for a Thanksgiving Special hosted by John King. I'll be a featured guest. _E_
Why are some more concerned with granting terrorist rights than protecting innocent Americans? _E_
It has been great to meet so many wonderful people in my #TimeToGetTough book signings. Anyone who wants to be Prez should read! _E_
How bad is the New York Times—the most inaccurate coverage constantly. Always trying to belittle. Paper has lost its way! _E_
.@billmaher: Bill you are really beginning to understand what is going on with Trump actually you always knew! _E_
Wow it's now official. ObamaCare website has topped $1B __HTTP__ Will soon be up to $1.5B _E_
ICYMI my speech from this past Saturday at the @NHGOP @FITNsummit via @cspan __HTTP__ _E_
I predicted Rosie O'Donnell would fail at the View and was right. Now I predict Rosie will take over for Brian Williams! _E_
I will be the greatest job producing president in American history. #Trump2016 #VoteTrump __HTTP__ __HTTP__ _E_
Heading to Manassas Virginia for a rally. Will have a moment of silence for the victims of the California shootings. So sad! _E_
Live Free or Die: A motto for the whole country to follow. #NewHampshire #FITN #VoteTrumpNH __HTTP__ _E_
Iraq has granted Iran full air rights to fly over and arm Syria. What did America accomplish with the Iraq war? And now Syria?! _E_
A testament to American ingenuity @TrumpTowerNY shines over Fifth Avenue as one of NYC's most iconic buildings __HTTP__ _E_
Youth unemployment is at a record high. ObamaCare is a job destroyer which is ruining aspiring careers. It must be repealed. _E_
New PPP poll just released in Iowa up 6 points from last poll. Leading w/ 28%! Don't worry media won't report it! __HTTP__ _E_
I wish good luck to all of the Republican candidates that traveled to California to beg for money etc. from the Koch Brothers. Puppets? _E_
Spitzer never made 10 cents on his own he worked for his very rich father (a friend of mine who never thought much of Eliot as a businessman _E_
Thanks many are saying I'm the best 140 character writer in the world. It's easy when it's fun. _E_
#trumpvlog My thoughts on @RickSantorum in today's video blog... __HTTP__ _E_
There's nothing "compassionate" about allowing welfare dependency to be passed from generation to generation. Time To Get Tough _E_
Honored to meet w/ Pres Abbas from the Palestinian Authority & his delegation who have been working hard w/everybody involved toward peace. __HTTP__ _E_
Looking forward to visiting Mason City Iowa tomorrow. Will be my 8th day in the Hawkeye State this year! __HTTP__ _E_
What Barbara Res does not say is that she would call my company endlessly and for years trying to come back. I said no. _E_
Nobody cares about the Iowa straw poll is what @JonHuntsman said yesterday. His problem is that nobody cares about his campaign (or him). _E_
Hope you like my nomination of Judge Neil Gorsuch for the United States Supreme Court. He is a good and brilliant man respected by all. _E_
.@WineEnthusiast just awarded Trump Vineyard's Sparkling Reserve 91 points the highest rated wine in Virginia... __HTTP__ _E_
Merry Christmas have an amazing day! _E_
.@DineshDSouza had to give $1000 to @BarackObama's brother for his child's hospital bill __HTTP__ Isn't that disgraceful? _E_
Russia talk is FAKE NEWS put out by the Dems and played up by the media in order to mask the big election defeat and the illegal leaks! _E_
Remember while @BarackObama is lauding himself tonight with self indulgent compliments we have our brave soldiers fighting in Afghanistan. _E_
.@CNN should listen. Ana Navarro has no talent no TV persona and works for Bush—a total conflict of interest. __HTTP__ _E_
Just as I have been saying for MANY years and while they phony negotiate with the U.S. over nuclear Iran is taking over Iraq. Really sad! _E_
At some point the Fake News will be forced to discuss our great jobs numbers strong economy success with ISIS the border & so much else! _E_
He @MittRomney is a successful entrepreneur. @BarackObama successfuly ruined America's credit. Easy choice in November. _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
.@Chrysler disputes my statement but watch Chrysler move @Jeep jobs to China after the election. _E_
Nothing is so permanent as a temporary government program. Milton Friedman _E_
Our Native American Senator goofy Elizabeth Warren couldn't care less about the American worker...does nothing to help! _E_
The American worker is being victimized by our trade policies. We need smart trade which can only be accomplished by smart dealmakers. _E_
Incredible handheld video of the Las Vegas Strip in 1969. The skyline looks better with @TrumpLasVegas! __HTTP__ _E_
Senator Ted Cruz has been MATHEMATICALLY ELIMINATED from race. He said Kasich should get out for same reason. I think both should get out! _E_
I have no greater privilege than to serve as your Commander in Chief. HAPPY BIRTHDAY to the incredible men and women @USNavy!#242NavyBday __HTTP__ _E_
Barney Frank admited that ObamaCare does have 'death panels' yesterday. Obamacare must be fully repealed or healthcare will be destroyed. _E_
RT @DRUDGE_REPORT: TRUMP APPROVAL HITS 50% __HTTP__ _E_
Remember if you don't pat yourself on the back nobody else will. Take credit for your successes and don't let others forget!!!!!! _E_
Why isn't the @GOP congress doing everything possible to defund and cut ObamaCare? _E_
These are something I just can't buy. Excited for the @usopen __HTTP__ _E_
We must stand firm against the UN's ploy to sabotage Israel if the UN grants the PA statehood then we must immediately defund it. _E_
The state of Virginia economy under Democrat rule has been terrible. If you vote Ed Gillespie tomorrow it will come roaring back! _E_
I've learned that mistakes can often be as good a teacher as success. Jack Welch _E_
Why does @oreillyfactor and @FoxNews always have Karl Rove on. He spent $430 million and lost ALL races. A dope who said Romney won election _E_
Key Obamacare premiums to jump 25% next year: __HTTP__ _E_
The Republican Party needs strong and committed leaders not weak people such as @JeffFlake if it is going to stop illegal immigration. _E_
The @washingtonpost which is the lobbyist (power) for not imposing taxes on #Amazon today did a nasty cartoon attacking @tedcruz kids. Bad _E_
Unfortunately with some men when the poison kicks in (not me of course) there are no rules or guidelines in the military that will stop them _E_
Thank you to our GREAT Military/Veterans and @PacificCommand.Remember #PearlHarbor. Remember the @USSArizona!A day I'll never forget. __HTTP__ _E_
I am now in Palm Beach Florida and will be going to church tonight. MAKE AMERICA GREAT AGAIN! _E_
Good morning. I will be on Fox and Friends at 7.00 (30 minutes). Enjoy! @foxandfriends _E_
This is a great time for @RickSantorum to bow out with dignity. _E_
. @OMAROSA is smart and strategic. People should cut her some slack and respect the way she works on @ApprenticeNBC. _E_
My @SquawkCNBC interview discussing last night's presidential debate my stock picks and tomorrow's big announcement __HTTP__ _E_
Great day for America's future Security and Safety courtesy of the U.S. Supreme Court. I will keep fighting for the American people & WIN! _E_
I am happy to have started #ObamasFavoriteCharity. Really enjoying reading everyone's tweets. _E_
Entrepreneurs: Realize that fear is the exact opposite of faith. Resolve to be bigger than your problems. Who's the boss? _E_
My interview with @RealMichaelKay discussing why A Rod should be fired from @yankees & how to terminate his contract __HTTP__ _E_
All seven on line polls including Drudge and Time with thousands of respondents said I won the debate. @krauthammer said I was so so. _E_
The ultra liberal and seriously failing Des Moines Register is BEGGING my team for press credentials to my event in Iowa today but they lie! _E_
How did you like Michelle Obama's bangs last night? _E_
IMPORTANT @RepMattSalmon & @RepEdRoyce will hold a hearing on Oct. 1w/USMC Sgt. Tahmooressi's mother & wife __HTTP__ _E_
We should be able to negotiate a deal with Iran because they know we could blow them away to the Stone Age.They just don't believe we would. _E_
A big POLL will be announced this morning on @CBSNews Face The Nation. I wonder if I do well if the press will report the results? Doubt it _E_
I really enjoyed the debate tonight even though the @FoxNews trio especially @megynkelly was not very good or professional! _E_
Ralph Northamwho is running for Governor of Virginiais fighting for the violent MS 13 killer gangs & sanctuary cities. Vote Ed Gillespie! _E_
The Trump base is far bigger & stronger than ever before (despite some phony Fake News polling). Look at rallies in Penn Iowa Ohio....... _E_
Via @scj by @rodboshart: "Trump: Next president has to be 'great one'" __HTTP__ _E_
Surprise @oreillyfactor used my name big league in pre ads to promote the show—then talked about everyone else but me! _E_
Meet the 'Trumpocrats': Lifelong Democrats Breaking w/ Party Over Hillary to Support Donald Trump for President: __HTTP__ _E_
Yesterday 15 @GOP senators sided with people who got into this country by breaking our laws. _E_
Very proud of our incredible First Lady (@FLOTUS.) She is a truly great representative for our country! __HTTP__ _E_
People in our country want borders and without them the old line pols like Crooked Hillary will not win. It is time for CHANGE and JOBS! _E_
#TrumpAdvice __HTTP__ _E_
We're all proud of @erictrump for being on @Forbes 30 Under 30 list. __HTTP__ _E_
.@SenJohnMcCain Thank you for coming to D.C. for such a vital vote. Congrats to all Rep. We can now deliver grt healthcare to all Americans! _E_
General Keith Kellogg who I have known for a long time is very much in play for NSA as are three others. _E_
during a general election. I for one am appalled that somebody that is the nominee of one of our two major parties would take that kind _E_
P.S. There is also something really good to say about humility. Being confident and humble is a great combination maybe the best of all! _E_
I just arrived in Barcelona. I make a big speech tomorrow and then off to Ireland and Scotland. _E_
Via The Hill: Trump Tops National Poll for Second Straight Week __HTTP__ _E_
Will be on Howard Stern at 6.45 A.M. and the Today Show at 8.00 A.M. _E_
How is it possible that the people of the great State of Colorado never got to vote in the Republican Primary? Great anger totally unfair! _E_
Wow Matt Lauer was just fired from NBC for "inappropriate sexual behavior in the workplace." But when will the top executives at NBC & Comcast be fired for putting out so much Fake News. Check out Andy Lack's past! _E_
Via @GolfDigest by @LukeKerrDineen: "@MichaelBreed to open golf academy at Donald Trump's @TrumpFerryPoint" __HTTP__ _E_
Crooked Hillary Clinton is totally unfit to be our president really bad judgement and a temperament according to new book which is a mess! _E_
How can the economy ever recover when @BarackObama keeps threatening the private sector with more taxes. This is no way to spur growth. _E_
Watch Face The Nation will be on now! _E_
For the truth about job creation in America go to __HTTP__ A great site for employers to get the tools & information they need! _E_
"Successful people keep moving. They make mistakes but they don't quit." – Conrad Hilton _E_
#TimeToGetTough The crowd at the book signing at Trump Tower in NYC right now... __HTTP__ _E_
Isn't it amazing that Obama "never knew" about the IRS scandals until he saw it in the news?! _E_
Has AG Schneiderman been extorting his targets and their lawyers for contributions? We will find out. _E_
Reminder: The Miss Universe competition will be LIVE from the Bahamas Tonight @ 9pm (EST) on NBC: __HTTP__ _E_
Congratulations to @GatewayPundit on being named the #ROL15 @BreitbartNews award. Well earned & well deserved! _E_
The bigger problem with Ebola is all of the people coming into the U.S. from West Africa who may be infected with the disease. STOP FLIGHTS! _E_
I want to negotiate my own and much better trade deals for our country. MUST INCLUDE CURRENCY MANIPULATION (and more). DO NOT LET PASS! _E_
I find it offensive that Goofy Elizabeth Warren sometimes referred to as Pocahontas pretended to be Native American to get in Harvard. _E_
With championship links @TrumpScotland's world class amenities also include dining & luxury accommodations __HTTP__ _E_
"If you know the enemy and know yourself you need not fear the results of a hundred battles." Sun Tzu _E_
Fear defeats more people than any other one thing in the world. Ralph Waldo Emerson _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
The best way to build a successful business is by results. In the end that is what counts. _E_
Product integration is very important. #CelebApprentice _E_
I only wish my wonderful daughter Tiffany could have been with us at Mar a Lago for our great election victory. She is a winner! _E_
On schedule for 2016 completion @trumpvancouver's 57 story twisting tower will be the icon of Vancouver's skyline __HTTP__ _E_
My interview with @seanhannity discussing this season's @ApprenticeNBC #TimeToGetTough the economy and GOP primary. __HTTP__ _E_
Icahn Kravis Zell Buffett have all used the bankrutcy law to their benefit. Many of the top business people do. _E_
The EU just dropped their self imposed carbon tax. I bet they wish they had all that money back! _E_
Wow @FoxNews just reporting big news. Source: Official behind unmasking is high up. Known Intel official is responsible. Some unmasked.... _E_
RT @IvankaTrump: I have long respected India's accomplished and charismatic Foreign Minister @SushmaSwaraj and it was an honor to meet her... _E_
Trump Golf Links at Ferry Point: Grand Opening next Tuesday May 26th at 11 AM. Jack Nicklaus will be joining me. __HTTP__ _E_
Good news for those that want to Make America Great Again I am winning every poll in every STATE and NATIONAL and by big numbers! Thanks _E_
Over the past 11 months I have travelled tens of thousands of miles to visit 13 countries. I have met with more than 100 world leaders and everywhere I traveled it was my highest privilege and greatest honor to represent the AMERICAN PEOPLE! __HTTP__ _E_
The perfect getaway @Trump_Ireland is Europe's most elite 5 star destination perfected with old world luxury __HTTP__ _E_
Control your own destiny or someone else will. @jack_welch _E_
Only three weeks until the new season of @CelebApprentice begins filming great all star cast. _E_
The Trump Organization is honored to have been awarded the redevelopment of The Old Post Office. Will be DC's finest hotel. _E_
Obama is a disaster at foreign policy. Never had the experience or knowledge. He is not capable of doing the job. _E_
RT @AdrianaCohen16: Carly Fiorina no lifeboat for a fast sinking @tedcruz campaign __HTTP__ via @bostonherald @realdonaldtru... _E_
Via @WSJ: "A New Direction For America" by @MittRomney _E_
"The risk of a wrong decision is preferable to the terror of indecision." – Maimonides _E_
Do you think John Kerry is aware of the fact that they are building nuclear weapons in Iran and North Korea and Pakistan already has them!! _E_
Fellow inductee @SammartinoBruno and me. #WWEHOF __HTTP__ _E_
While @BarackObama is slashing the military he is also negotiating with our sworn enemy the Taliban who facilitated 9/11. _E_
Congratulations to two great and hardworking guys Corey Lewandowski and David Bossie on the success of their just out book "Let Trump Be Trump." Finally people with real knowledge are writing about our wonderful and exciting campaign! _E_
Why would the people of Florida vote for Marco Rubio when he defrauded them by agreeing to represent them as their Senator and then quit! _E_
Looking forward to addressing the record setting crowd tonight at the New York County Lincoln Day Dinner. Lots to talk about! _E_
Thank you @davidaxelrod for your nice words this morning on @CNN. It was a good night! _E_
Beautiful @MissUSA in @NewYorkPost tomorrow as Audrey Hepburn in front of Tiffany's. _E_
...Trump/Russia story was an excuse used by the Democrats as justification for losing the election. Perhaps Trump just ran a great campaign? _E_
WOW @foxandfrlends "Dossier is bogus. Clinton Campaign DNC funded Dossier. FBI CANNOT (after all of this time) VERIFY CLAIMS IN DOSSIER OF RUSSIA/TRUMP COLLUSION. FBI TAINTED." And they used this Crooked Hillary pile of garbage as the basis for going after the Trump Campaign! _E_
Dopey Sugar @Lord_Sugar—you are the worst kind of loser—a total fool. _E_
So @BarackObama's campaign is calling @MittRomney a potential criminal __HTTP__ How about Obama's Tony Rezko land deal! _E_
Isn't it funny when a failed Senator like goofy Elizabeth Warren can spend a whole day tweeting about Trump & gets nothing done in Senate? _E_
Excited to be returning to the @NCGOP State Convention as the Keynote of Saturday's dinner! @NCGOP is a strong Conservative state party! _E_
Re Miss Universe Pageant we've spoken w/the LGBT community in Russia who asked "please don't leave it would send the wrong signal." _E_
.ccolvinj @AP is one of the truly bad reporters working for an organization that has totally lost its way. Stories are fictional garbage. _E_
Via @theFAMiLYLEADER: "Donald Trump to Speak at The Family Leadership Summit" __HTTP__ Get tix __HTTP__ _E_
Our very stupidly run Country better stop being so politically correct or we won't have a Country to run anymore! _E_
I told you so a long time ago: Iraq just lost second largest city as their soldiers drop their guns and run. Only the beginning! OIL. _E_
Did you ever see a situation so ridiculous as our President explaining what when and where to Congress about a Syrian attack. Far too late! _E_
"Donald Trump on VA woes: 'I'd fire everybody' 'you fix it by getting Trump elected'" __HTTP__ via @washtimes by @dsherfinski _E_
Via @HeraldWeekly by Lauren Odomirok: Trump Norman play renovated golf course __HTTP__ _E_
When nobody wanted the UFC I opened the way by letting them fight at the Trump Taj Mahal in Atlantic City. Dana White has done a great job! _E_
Thank you California! #Trump2016 __HTTP__ __HTTP__ _E_
Have you seen the new #TRUMP line of clothing apparel and fragrances @Macy's? Selling like hotcakes. Great for Christmas gifts etc. _E_
China is happy to learn that @BarackObama plans to borrow another $300 Billion. @BarackObama is their favorite client. _E_
An amazing article by Kevin Gabriel __HTTP__ A must read by friends and foes of President Obama. End date is tomorrow at noon. _E_
Young entrepreneurs: Your success is measured by results. Be productive in the face of challenges. Setbacks are not fatal. _E_
The @SenTedCruz endorsement was a wonderful surprise. I greatly appreciate his support! We will have a tremendous victory on November 8th. _E_
I was at @FoxNews and met Juan Williams in passing. He asked if he could have pictures taken with me. I said fine. He then trashes on air! _E_
I watched @BarackObama at the National Prayer Breakfast and he looked totally uncomfortable with his words. (cont) __HTTP__ _E_
Totally false reporting on my call with @Reince Priebus. He called me ten minutes said I hit a "nerve doing well end! _E_
I agree Mike thank you to all of our law enforcement officers! #VPDebate Police officers are the best of us... @Mike_Pence _E_
Happy #NationalFarmersDay!📸 __HTTP__ __HTTP__ _E_
I wonder who @ArsenioHall's first guest will be his show will be great! _E_
Thank you New Hampshire! #FITN #NHPrimary #VoteTrumpNH Voting questions? __HTTP__ __HTTP__ _E_
.@joycefinance #asktrump __HTTP__ _E_
Big interview tonight by Henry Kravis at The Business Council of Washington. Looking forward to it! _E_
I'll be signing copies of my new book Time To Get Tough tomorrow in Trump Tower 11 am to 2 pm. Hope to see you there. _E_
My wife @MELANIATRUMP will be #OnTheRecord w/ @greta tonight at 7pmE on @FoxNews. Enjoy! __HTTP__ __HTTP__ _E_
Governor @RicardoRossello We are with you and the people of Puerto Rico. Stay safe! #PRStrong _E_
#noratings @Lawrence will soon be off tv bad ratings he has a face made for radio. _E_
I will be interviewed on @foxandfriends at 7:30 A.M. Enjoy! _E_
When terrorists are beheading and executing American citizens in such a brutal way the report on torture should be the least of our concerns _E_
Sometimes by losing a battle you find a new way to win the war. _E_
With the number of tweets sad sack @Rosie has done she has totally lost control of herself hopefully not a breakdown. _E_
Thank you. __HTTP__ _E_
Thank you to Shawn Steel for the nice words on @FoxNews. _E_
Thanks. __HTTP__ _E_
More dead people voted in the last election than enrolled in ObamaCare. Congratulations America! _E_
HYPOCRITE! Long before @BarackObama called the Tea Party 'teabaggers' he dressed as a revolutionary in a Hyde Park rally __HTTP__ _E_
China is about to acquire a unit of AIG which we bailed out for $5.5B __HTTP__ China is making great deals on our backs. _E_
Rising 70 stories over Panama Bay @TrumpPanama offers our elite amenities in Latin Americas tallest building __HTTP__ _E_
Hope everyone enjoyed their Thanksgiving. But get ready our country is in big trouble! _E_
Spent time with Indiana Governor Mike Pence and family yesterday. Very impressed great people! _E_
Democrats try so hard to mock & belittle Republicans—& the Republicans just don't fight back—no energy! _E_
We must keep the pressure on @BarackObama's administration to make sure Chen comes to the US. It would be a tragedy to abandon him in China. _E_
For more information on tonight's two hour telethon 8 to 10 p.m.: __HTTP__ _E_
...The fact is that Puerto Rico has been destroyed by two hurricanes. Big decisions will have to be made as to the cost of its rebuilding! _E_
I wonder if @BarackObama ever applied to Occidental Columbia or Harvard as a foreign student. When can we see (cont) __HTTP__ _E_
... in order to occupy space in a truly ugly office building in a much worse location! _E_
For all of those that think life is easy & don't want to work remember: HOPE IS THE POOR MAN'S BREAD. _E_
.@KellyannePolls Kellyanne you were fantastic on @meetthepress today. Keep going I will win for the people. MAKE AMERICA GREAT AGAIN! _E_
A great day in both Spencer & Davenport Iowa! THANK YOU for the support! #Trump2016 #FITN #IAPolitics __HTTP__ _E_
Thank you Mississippi! #Trump2016 _E_
Head on over to my Facebook page to have your questions answered in the next #AskTheDonald __HTTP__ _E_
Thank you Anthony @Scaramucci @WSJ The Entrepreneur's Case for Trump __HTTP__ _E_
NO MERCY TO TERRORISTS you dumb bastards! _E_
Thank you to respected columnist Katie Hopkins of Daily __HTTP__ for her powerful writing on the U.K.'s Muslim problems. _E_
Two great people! __HTTP__ _E_
With a record deficit and $15 trillion in debt @BarackObama is spending $4 million of our money on his Hawaii vacation. Just plain wrong. _E_
I'm very proud of the work my son @EricTrump has been doing with the @EricTrumpFDN take a look... __HTTP__ _E_
He @RickSantorum has as much chance of being the GOP nominee as @Rosie does of ever having a successful (cont) __HTTP__ _E_
Via @DailyCaller by @alweaver22:"Trump: Obama One Of 'The Worst Things That's Ever Happened To Israel'" __HTTP__ _E_
Pleasure in the job puts perfection in the work. Aristotle _E_
Technology has shown we have tremendous energy resources right under our feet that we didn't know about 5 years ago. _E_
So they caught Fake News CNN cold but what about NBC CBS & ABC? What about the failing @nytimes & @washingtonpost? They are all Fake News! _E_
Success is not final failure is not fatal: it is the courage to continue that counts. Winston Churchill _E_
Entrepreneurs: Success is good. Success with significance is even better. Make your work count. _E_
...well into our 4th week of shooting the record 13th season of @CelebApprentice. The 'All Stars' are hard at work... _E_
Every poll Time Drudge Slate and others said I won both debates but heard Megyn Kelly had her two puppets say bad stuff. I don't watch _E_
RT @AnnCoulter: Anyone who plans to talk about Trump ever again has to see this speech. Your opinion is irrelevant unless you listened to... _E_
RT @NFIB: .@NFIB encouraged by @realDonaldTrump's #taxplan says #smallbiz would benefit from lower tax rate: __HTTP__ _E_
Departing Farmers Round Table in Boynton Beach Florida. Get out & VOTE lets #MAGA! EARLY VOTING BY FL. COUNTY:... __HTTP__ _E_
Via @BreitbartNews by @mboyle1: DONALD TRUMP: MSM INVESTIGATION INTO SCOTT WALKER'S COLLEGE A 'DOUBLE STANDARD' __HTTP__ _E_
Woody Johnson owner of the NYJets is @JebBush's finance chairman. If Woody would've been w/me he would've been in the playoffs at least! _E_
RT @realDonaldTrump: Happy Birthday @DonaldJTrumpJr! __HTTP__ _E_
The Fed is considering issuing even more US bond debt into the market. Not good! _E_
The United Nations has such great potential but right now it is just a club for people to get together talk and have a good time. So sad! _E_
My @marklevinshow interview discussing Obama's SOTU Rove's attack on the Tea Party & All Star @ApprenticeNBC __HTTP__ _E_
.@mcuban says he is a member of Dallas National but doesn't play golf. Who is a member of a golf club that doesn't play?? No talent! @TMZ _E_
Why doesn't President Obama simply apologize for telling a big fat lie announce that ObamaCare was a mistake and deal a really great plan! _E_
WHAT THEY ARE SAYING ABOUT THE CLINTON CAMPAIGN'S ANTI CATHOLIC BIGOTRY: __HTTP__ _E_
.@brithume I am in first place by a lot in all polls tied for first place with Ben Carson in one Iowa poll. I thought you knew this thanks _E_
O.K. Christmas is over now we can all go back to the wars of life. Focus focus focus never accept defeat push hard for total victory! _E_
I will be on Face the Nation with John Dickerson on CBS this morning. Enjoy! _E_
Crooked Hillary Clinton Tops Middle East Forum's 'Islamist Money List' __HTTP__ _E_
Trump Int'l Palm Beach offers a spectacular course with hill vistas bunkers and incredible water features. __HTTP__ _E_
I'm sick of always reading about outsourcing. Why aren't we talking about 'onshoring' or 'insourcing?' We need (cont) __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Wonder if Obama will ever say RADICAL ISLAMIC TERRORIST? _E_
Lots of response that Obama should give the $5M to the families of our great heroes who were murdered in Benghazi. _E_
Before Kids Can Go Places They Need a Place To Go the motto of The Police Athletic League an organization I'm very proud to support _E_
THE DONALD J. TRUMP PRESIDENTIAL EXPLORATORY COMMITTEE __HTTP__ _E_
What can you learn today that you didn't know before? Set the bar high do the best you possibly can. _E_
Via @golf_com by @joepassov: "@TrumpFerryPoint Will Be One of Nation's Best Public Courses" __HTTP__ _E_
'U.S. Industrial Production Surged in April' __HTTP__ _E_
My complaint against @AGSchneiderman is a "case study" for JCOPE & Moreland Commissions on everything that is wrong with NYS politics. _E_
Thank you Washington! Honored to say on behalf of our great movement we have broken the all time record for votes in GOP primary history. _E_
North Korea can't survive or even eat without the help of China. China could solve this problem with one phone call they love taunting us! _E_
Get Snowden back from Russia—he has done tremendous damage to the US & should pay a very heavy price. _E_
Thank you Newt! __HTTP__ _E_
As soon as John Kasich is hit with negative ads he will drop like a rock in the polls against Crooked Hillary Clinton. I will win! _E_
Great Town Hall tonight at 10:00 P.M. (Eastern) conducted by @seanhannity on @FoxNews _E_
Had a great time on @gretawire's inaugural 7PM show. Congrats to Greta on the new spot! _E_
Pervert alert. @RepWeiner is back on twitter. All girls under the age of 18 block him immediately. _E_
...lottery continues deadly catch and release and bars enforcement even for FUTURE illegal immigrants. Voting for this amendment would be a vote AGAINST law enforcement and a vote FOR open borders. If Dems are actually serious about DACA they should support the Grassley bill! _E_
Obama must now FOCUS get his mind off March.Madness and LEAD! Watch Russia closely work hard on the economy and get rid of ObamaCare! _E_
Once ObamaCare is fully enacted in NY conveniently after 2014 expect higher premiums bigger deductibles & worse care. Job killer! _E_
Well Obama refused to say (he just can't say it) that we are at WAR with RADICAL ISLAMIC TERRORISTS. _E_
I don't know Putin have no deals in Russia and the haters are going crazy yet Obama can make a deal with Iran #1 in terror no problem! _E_
For what is the best choice for each individual is the highest it is possible for him to achieve. Aristotle _E_
Doing a commercial for @NFLONFOX lots of fun! __HTTP__ _E_
Congress get ready to do your job DACA! _E_
.@latoyajackson is once again at the top of her game in the upcoming All Star season of @CelebApprentice. Amazing in the boardroom... _E_
MAKE AMERICA GREAT AGAIN! MAKE AMERICA SAFE AGAIN! _E_
I am honored that Texas supporters have filed papers in Texas to create Make America Great Party on my behalf. __HTTP__ _E_
"The Constitution is the guide which I never will abandon" George Washington _E_
.@MichelleMalkin would be nothing without being on the @seanhannity show. I don't see what Sean sees in her—loser! _E_
Entrepreneurs: Achievers move forward at all times. Achievement is not a plateau it's a beginning. _E_
Obama & Clinton should stop meeting with special interests & start meeting with the victims of illegal immigration. _E_
Golf Odyssey one of golf's most respected publications just named Trump International Golf Links Scotland golf course of the year _E_
Just landed in Iowa speaking soon! _E_
It all comes down to one simple question: How much money can you stand to lose? That's how much risk you should assume. _E_
What do you think of Gary's definition of f u n? _E_
Trump International Hotel & Tower Toronto continues to receive accolades. Great city great hotel. __HTTP__ #TrumpToronto _E_
RT @detroitnews: .@IvankaTrump in Michigan: 'This is your movement' __HTTP__ @realDonaldTrump __HTTP__ _E_
.@TPNNtweets Donald Trump Tells A Fascinating Inside Story About His Dealings w/ The Obama WH __HTTP__ @johnhawkinsrwn _E_
If Chelsea Clinton were asked to hold the seat for her mother as her mother gave our country away the Fake News would say CHELSEA FOR PRES! _E_
Sadly this kind of stuff even happened to Ronald Reagan. There is nothing nice about it! #MakeAmericaGreatAgain __HTTP__ _E_
On Sunday Jerome Bettis 'the bus' from the Pittsburgh Steelers will play at Trump Int'l Golf Club/Palm Beach against Julius Erving 'Dr J' _E_
I will be interviewed on The O'Reilly Factor this evening at 8 pm on the Fox News Channel. @oreillyfactor _E_
Our legal system is broken! 77% of refugees allowed into U.S. since travel reprieve hail from seven suspect countries. (WT) SO DANGEROUS! _E_
NY Jets center Nick Mangold interns for Trump. Watch Trump's Fabulous World of Golf tonight 9PM ET on Golf Channel __HTTP__ _E_
With 49 days until the election @MittRomney needs to stay on offense. He should not be apologizing. Deflect onto Obama's record. _E_
Obama's motto: If I don't go on tax payer funded vacations & constantly fundraise then the terrorists win. _E_
Does everyone remember @MittRomney and his famous remarks about self deportation and 47% . He was done. I don't need his angry advice! _E_
Obama told Medvedev after the '12 reelect he would "have more flexibility." It was music to Putin's ears. _E_
Amazing. @CelebApprentice has started filming our record 13th season this week thanks to our big and very loyal fan base. _E_
The results are in on the final debate and it is almost unanimous I WON! Thank you these are very exciting times. _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Via @Newsmax_Media: "Donald Trump 2016: 8 Facts About Personal Life of GOP Presidential Hopeful" __HTTP__ _E_
#CrookedHillary __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
The Costa Concordia shipwreck is a MONUMENT TO STUPIDITY but the uprighting of the ship is a MONUMENT TO GENIUS! _E_
Wealth comes from big goals and sustained action toward those goals every day. Think Big _E_
Wow you are all correct about @FoxNews totally biased and disgusting reporting. _E_
My @971FMTalk int. with @DLoesch on #HandsOffMyGun 2014 election results stopping Obamacare new Senate & 2016 __HTTP__ _E_
Via @CarolinaLive by @JoelAllenWPDE:"Big names wrap up largest ever SC Tea Party Coalition Convention" __HTTP__ _E_
See story in The Scotsman re: wind turbines __HTTP__ _E_
I answered your questions in today's video... watch at __HTTP__ _E_
Will be interviewed by @andersoncooper on @CNN tonight. Let's see if he treats me fairly—enjoy! _E_
A wonderful evening in South Carolina big crowd amazing energy! _E_
The truly great Phyllis Schlafly who honored me with her strong endorsement for president has passed away at 92. She was very special! _E_
President Obama spoke last night about a world that doesn't exist. 70% of the people think our country is going in the wrong direction. #DNC _E_
Via cnsnews by @SJonesCNS: "Trump Explains His Appeal: 'People Are Tired...Of These Incompetent Politicians'" __HTTP__ _E_
Glad to hear @SethMacFarlane will be hosting this year's Oscars. Something new that should be fun. _E_
My thoughts on @barackobama's campaign.... __HTTP__ _E_
See Sanders backed Hillary on E mails at the debate hurting himself and then she threw him under the bus (but failed). Disloyal person! _E_
"You measure your people and you take action on those that don't measure up." @jack_welch _E_
Why gas prices will rise Miss Canada/Miss Universe and #CelebApprentice in today's #trumpvlog... __HTTP__ _E_
We must leave stop and frisk for A Rod and Anthony Weiner! _E_
RT @Scavino45: .@POTUS & @FLOTUS w/ @LVMPD Officer Cook 2nd day on job received gunshot wound to the right chest & right arm saving live... _E_
I had a great time in D.C. yesterday at the Trump International Hotel OPO groundbreaking ceremony. Watch __HTTP__ _E_
Mike Pence won big. We should all be proud of Mike! _E_
Going to the White House is considered a great honor for a championship team. Stephen Curry is hesitating therefore invitation is withdrawn! _E_
Why would smart voters want to put Democrats in Congress in 2018 Election when their policies will totally kill the great wealth created during the months since the Election. People are much better off now not to mention ISIS VA Judges Strong Border 2nd A Tax Cuts & more? _E_
Don't let Obama buy the election by handing out unlimited free money to states. _E_
Thank you New Hampshire! Departing with my amazing family now! #FITN #NHPrimary __HTTP__ __HTTP__ _E_
"Trump signs lease for a NH office returns Monday" __HTTP__ via @UnionLeader by @tuohy _E_
HillaryClinton can illegally get the questions to the Debate & delete 33000 emails but my son Don is being scorned by the Fake News Media? _E_
#FlashbackFriday @kimkardashian on the set of @ApprenticeNBC __HTTP__ _E_
"Fortunately for a quarterback you can play for a long time because you don't get hit very often." – Tom Brady @SuperBowl @Patriots _E_
It should be mandatory that all haters and losers use their real name or identification when tweeting they will no longer be so brave! _E_
.@RudyGiuliani one of the finest people I know and a former GREAT Mayor of N.Y.C. just took himself out of consideration for State . _E_
Wow @megynkelly really bombed tonight. People are going wild on twitter! Funny to watch. _E_
Mitt Romney gave a masterful speech this weekend at Liberty University with a wonderful introduction by Mark DeMoss. Well done. @MittRomney _E_
If you don't believe in yourself no one else will. _E_
When will Washington stand up to China. China is manipulating its currency and stealing our jobs. Washington should move on legislation. _E_
I am in Colorado big day planned but nothing can be as big as yesterday! _E_
Ted Cruz has now apologized to Marco Rubio and Ben Carson for fraud and dirty tricks. No wonder he has lost Evangelical support! _E_
.@CNN Why is somebody (Beck) I beat so soundly all of a sudden an expert on Donald Trump (all over television). She knows nothing about me. _E_
We may get out of ObamaCare because the train wreck is impossible to implement __HTTP__ It is a disaster. _E_
I will be interviewed on @foxandfriends at 7:00 this morning. Plenty to talk about! _E_
Thank you for a great evening Laconia New Hampshire will be back soon! #AmericaFirst __HTTP__ __HTTP__ _E_
Hillary Clinton's open borders immigration policies will drive down wages for all Americans and make everyone less safe. _E_
I fought hard against Spitzer and Weiner and both lost. For a while when Spitzer was way up it seemed that I was a lone voice! Good power _E_
Thank you America!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
A great great honor to welcome & recognize the National Teacher of the Year as well as the Teacher of the Year fro... __HTTP__ _E_
Believe and act as if it were impossible to fail. Charles F. Kettering _E_
I will and I agree! RT @ZacharyQuinto@realdonaldtrump you can't possibly make any more money. so why don't you make a difference instead?! _E_
Thank you to President Moon of South Korea for the beautiful welcoming ceremony. It will always be remembered. __HTTP__ _E_
I can't believe that Prime Minister @David_Cameron is giving massive subsidy to Scotland to destroy itself with windfarms. _E_
TRAIN WRECK just the beginning. Our roads airports tunnels bridges electric grid all falling apart.I can fix for 20% of pols & better _E_
Wow such sacrfices for his re election. @BarackObama will not vacation in Martha's Vineyard this summer. __HTTP__ _E_
Lightweight @AGSchneiderman is driving business & jobs out of NY. Only wants self publicity—a total loser! _E_
I (we) broke the all time record for most votes gotten in a Republican Primary by a lot and with many states left to go! Thank you. _E_
A strong military will stop wars. Peace through Strength! Let's Make America Great Again! __HTTP__ _E_
If a conservative Republican made the mistake that Mrs. Obama just made by calling Braley by the wrong name it would be the biggest story! _E_
.@usgsa A momentous day. Great job on Old Post Office we will make you proud! _E_
Can you believe that the Afghan war is our "longest war" ever—bring our troops home rebuild the U.S. make America great again. _E_
Everyone here is talking about why John Podesta refused to give the DNC server to the FBI and the CIA. Disgraceful! _E_
#TBT With James Lipton on the set of @ApprenticeNBC __HTTP__ _E_
Thousands of e mails from folks urging me to seek the Americans Elect Presidential nomination. _E_
.@dubephnx If we didn't remove incredibly powerful fire retardant asbestos & replace it with junk that doesn't (cont) __HTTP__ _E_
Scotland does not have free press even when you are just stating the facts it's crazy! _E_
The original Apprentice is coming back do you have what it takes to be the next Apprentice? For casting details: __HTTP__ _E_
We did it! Thank you to all of my great supporters we just officially won the election (despite all of the distorted and inaccurate media). _E_
Nothing conservative about the Club for Growth coming into my office and demanding a $1M contribution which naturally they did not get. _E_
What other country tells the enemy when we are going to attack like Obama is doing with ISIS. Whatever happened to the element of surprise? _E_
...NFL attendance and ratings are WAY DOWN. Boring games yes but many stay away because they love our country. League should back U.S. _E_
I'm going to the @Yankees game tonight to root them on they always win when I am there. _E_
As China is building an air and naval force @BarackObama is cutting ours. __HTTP__ He is weakening our national security. _E_
Get out tomorrow and vote so that we can all finally say those magic words __HTTP__ _E_
It was a true honor to be at Yokota Air Base with our GREAT @USForcesJapan! __HTTP__ _E_
Iraq is more dangerous today than any time under Saddam. War was a mistake as I said from the very beginning. Bush & Obama should apologize _E_
I guess they have Lance Armstrong cold. Brutal report. A waste of taxpayer money to take down an American hero. _E_
Heading to Iowa join me today at noon! #MakeAmericaGreatAgain Tickets: __HTTP__ __HTTP__ _E_
Putin & I discussed forming an impenetrable Cyber Security unit so that election hacking & many other negative things will be guarded.. _E_
RT @foxandfriends: President Trump to sign an executive order on religious liberty today the National Day of Prayer | @kevincorke __HTTP__ _E_
.@FoxNews is changing their theme from fair and balanced to unfair and unbalanced. But dying @WSJ is worse.Their phony poll is a joke! _E_
.@JohnLegere @TMobile John focus on running your company I think the service is terrible! Try hiring some good managers. _E_
Be sure to get a copy of @williebosshog's new book American Hunter. _E_
The Arab Spring is not working out so well nice name bad results! _E_
Join me in Redding California tomorrow at 1:00pm. #Trump2016 Tickets: __HTTP__ _E_
The next generation of luxury @TrumpVancouver will be the icon of the Vancouver skyline __HTTP__ _E_
Just landed in Iowa. See everyone soon! #MAGA _E_
We are taking action to #RepealANDReplace #Obamacare! Contact your Rep & tell them you support #AHCA. #PassTheBill... __HTTP__ _E_
Bay Bridge in San Fransisco built in China keeps getting worse. Cost overruns are out of control China is having a field day with us! _E_
Many reports of peaceful protests by Iranian citizens fed up with regime's corruption & its squandering of the nation's wealth to fund terrorism abroad. Iranian govt should respect their people's rights including right to express themselves. The world is watching! #IranProtests _E_
.@KatyTurNBC 3rd rate reporter & @SopanDeb @ CBS lied. Finished in normal manner&signed autos for 20min. Dishonest! __HTTP__ _E_
We cannot take four more years of Barack Obama and that's what you'll get if you vote for Hillary. #BigLeagueTruth _E_
Remember when Obama promised "you can keep your health care plan?" Not in these 10 states. __HTTP__ Another lie. _E_
My plan will lower taxes for our country not raise them. Phony @club4growth says I will raise taxes—just another lie. _E_
Congratulations to new Congressman @leezeldin being named to House Foreign Affairs Comm. and co chair the House Republican Israel Caucus. _E_
Welcome to the new reality. Goldman Sachs just based their new Asia Pacific chairman not in Tokyo but Beijing. __HTTP__ _E_
Doing Fox and Friends at 7.00 A.M. Hope you loved Apprentice last night. _E_
We must stop being politically correct and get down to the business of security for our people. If we don't get smart it will only get worse _E_
Today I announced our strategy to confront the Iranian regime's hostile actions and to ensure that they never acquire a nuclear weapon. __HTTP__ _E_
The Tax Cut Bill is coming along very well great support. With just a few changes some mathematical the middle class and job producers can get even more in actual dollars and savings and the pass through provision becomes simpler and really works well! _E_
How crazy 7.5% of all births in U.S. are to illegal immigrants over 300000 babies per year. This must stop. Unaffordable and not right! _E_
Entrepreneurs: Whatever happens you're responsible. If it doesn't happen you're responsible. _E_
Can anyone imagine Chafee as president? No way. _E_
RT @foxandfriends: President Trump officially nominates former Assistant Attorney General Christopher Wray to head the FBI __HTTP__ _E_
....it is very possible that those sources don't exsist but are made up by fake news writers. #FakeNews is the enemy! _E_
THE U.S.G.A. Boy's Junior Champion at Trump National Golf Club Bedminster just won The Australian Open. We are proud of you @JordanSpieth _E_
Iraq is being ravaged by Al Qaeda. Country in utter chaos & all oil is going to Iran & China __HTTP__ Terrible mistake! _E_
Letterman @Late_Show had Brian Williams @NBCNightlyNews as guest last night I was on last Thursday _E_
.@lightjzup Industrial turbines are destroying our land. _E_
Thanks. __HTTP__ _E_
Biggest story in politics is now happening in the great State of Colorado where over one million people have been precluded from voting! _E_
A pessimist is one who makes difficulties of his opportunities... _E_
To all of those who asked I predicted two weeks ago and again last night that Dwight Howard would go to Houston.Do I get congrats insight? _E_
Incredibly proud of my son @EricTrump & his efforts on behalf of @StJude in Memphis TN. __HTTP__ __HTTP__ _E_
If Christian Bale turned down $50M to return as Batman he should have his head examined. What was he thinking?! _E_
They should have rebuilt the two buildings of the World Trade Center exactly as they were except taller and stronger. A better statement! _E_
Can you imagine trading five really bad enemies of the U.S. for the freedom of traitor Bergdahl. Just another bad deal! _E_
RT @PChowka: Sean Hannity's Big Week Top Ratings Probing Reporting and Let There Be Light at American Thinker __HTTP__ h... _E_
Whether you think you can or think you can't you're right. Henry Ford _E_
Excited and honored to be addressing @theFAMiLYLEADER summit in Iowa this August. __HTTP__ _E_
Going to New Hampshire in a little while. Big crowds! #MakeAmericaGreatAgain! _E_
In real estate all locations can be enhanced through good marketing. Be smart! _E_
Why does @CNN bore their audience with people like @secupp a totally biased loser who doesn't have a clue. I hear she will soon be gone! _E_
My @foxandfriends int. @FoxNewsInsider "'Once a Choker Always a Choker': DJT Takes Credit for Romney Dropping Out" __HTTP__ _E_
The media has been speculating that I fired Rex Tillerson or that he would be leaving soon FAKE NEWS! He's not leaving and while we disagree on certain subjects (I call the final shots) we work well together and America is highly respected again! __HTTP__ _E_
Let's not start celebrating over Libya until we see who takes over. _E_
"@NMoralesNBC @ThomasARoberts to Host 63rd Annual @MissUniverse" __HTTP__ via @TheWrap by @AnthonyMaglio _E_
My @FoxBusiness int. w/Don Imus on not drinking alcohol politicians being all talk and no action & the border __HTTP__ _E_
.@GretchenCarlson's memoir is a powerful example of perseverance & hope. "Getting Real" is as real as it gets. Get it & enjoy! #GettingReal _E_
Today we gathered in the Roosevelt Room for one single reason: to CUT THE RED TAPE! For many decades an ever growing maze of regs rules and restrictions has cost our country trillions of dollars millions of jobs countless American factories & devastated entire industries. __HTTP__ _E_
I will start reviewing various political reporters etc & websites as to their professionalism & fairness—many people asking for this. _E_
"If you are passionate about your endeavors it will be reflected back to you in your end result." – Trump Never Give Up _E_
There's a lot going on at the Eric Trump Foundation ... __HTTP__ _E_
#GOPDebate #GoogleTrends __HTTP__ _E_
Great day for Tax Cuts and the Republican Party. But the biggest Winner will be our great Country! _E_
Ask: Is there anyone else who can do this better than I can?That's just another way of saying know yourself & know your competition. _E_
Assad will never give up his chemical weapons. He has spent years and billions accumulating them. This is all a ruse. _E_
"When you expect things to happen strangely enough they do happen." J. P. Morgan _E_
'U.S. Murders Increased 10.8% in 2015' via @WSJ: __HTTP__ _E_
.@DennisRodman re @Omarosa is right she's becoming predictable. _E_
I know you will enjoy reading my tax plan __HTTP__ #MakeAmericaGreatAgain _E_
.@TrumpGolfLA is @theknot's pick for the Best of Weddings with our Vista Terrace looking over the Pacific Ocean __HTTP__ _E_
Hillary said such nasty things about me read directly off her teleprompter...but there was no emotion no truth. Just can't read speeches! _E_
RT @jessebwatters: Thanks for watching!! __HTTP__ _E_
Cruz said Kasich should leave because he couldn't get to 1237. Now he can't get to 1237. Drop out LYIN' Ted. _E_
Wise words from my mother: "Trust in God and be true to yourself." Mary MacLeod Trump _E_
I pick the best locations @Trump_Charlotte has incredible views of beautiful Lake Norman. __HTTP__ _E_
Jon Stewart is the most overrated joke on television. A wiseguy with no talent. Not smart but convinces dopes he is! Fading out fast. _E_
I am self funding my campaign and am therefore not controlled by the lobbyists and special interests like lightweight Rubio or Ted Cruz! _E_
The Justice Department's investigation into the national security leaks is not independent. This is a very grave situation. _E_
Thank you America! #Trump2016 __HTTP__ _E_
Very excited to be returning to Iowa tomorrow to campaign for my friend & strong Conservative leader @SteveKingIA! _E_
Clear winner of the #GOPDebate. Thank you for your support! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Thank you! #GOPDebate __HTTP__ _E_
"Donald Trump hosts first ever 'Trump Invitational' at Mar a Lago" __HTTP__ via @WPTV _E_
RT @charliekirk11: ISIS getting slaughtered: Square miles liberated from ISISTrump: 26000 Obama: 13200Total Square miles held by... _E_
When it comes to money finance and even life PROTECT THE DOWNSIDE AND THE UPSIDE WILL TAKE CARE OF ITSELF! _E_
Young entrepreneurs – always remember in negotiations that sometimes the best deal you make is the one you walk away from. _E_
#MakeAmericaSafeAgain __HTTP__ _E_
.@AlexSalmond is making a truly stupid mistake by forcing ugly industrial wind turbines down Scotland's throat –he's hated for it. _E_
Via @theblaze by @BillyHallowell:"DONALD TRUMP BLASTS OBAMA FOR FAILING TO SECURE CHRISTIAN PASTOR'S FREEDOM IN IRAN" __HTTP__ _E_
.@alexsalmond RT @RichWaugaman This time I agree 100% I never knew how useless a wind turbine was until I (cont) __HTTP__ _E_
The stage is set for the real debate it will be very interesting! _E_
Blackdog Scotland started a petition against @VattenfallGroup. __HTTP__ _E_
Thank you Arizona! See you soon!#MakeAmericaGreatAgain __HTTP__ _E_
Venezuela should allow Leopoldo Lopez a political prisoner & husband of @liliantintori (just met w/ @marcorubio) o... __HTTP__ _E_
Hillary said she was under sniper fire (while surrounded by USSS.) Turned out to be a total lie. She is not fit to... __HTTP__ _E_
The Republicans who want to cut SS & Medicaid are wrong. A robust economy will Make America Great Again! __HTTP__ _E_
In August 2012 Obama said the so called Arab Spring sprung from 'joyful longing for human freedom' __HTTP__ Good call! _E_
RT @DonaldJTrumpJr: Not surprising at all! Father Of Otto Warmbier: Obama Admin Told Us To Keep Quiet Trump Admin Brought Him Home __HTTP__ _E_
Over 2 million people have lost their jobs since @BarackObama became POTUS. How many of them still have healthcare? _E_
Little Mac Miller's next album may bomb. He can't use my name again for sales. _E_
I took some heat a long time ago when I said that George Zimmerman was a sicko and bad news. I know people and this guy is no good trouble! _E_
#ThankYouTour2016 12/6 North Carolina __HTTP__ Iowa __HTTP__ Michiga... __HTTP__ _E_
Be sure to download my new The Celebrity Apprentice app to begin interacting with this Sunday's episode __HTTP__ _E_
While Hillary and I both won South Carolina by big margins Repubs got far more votes with a massive increase from past cycles.GROWING PARTY _E_
I would love to be at the Cadillac World Golf Championship @TrumpDoral in Miami but even more so in Orlando with the #TrumpTrain! _E_
The Fed's actions these past 3 years could bring record high inflation in the near future. That would be (cont) __HTTP__ _E_
No surprise that China was caught cheating in the Olympics. That's the Chinese M.O. Lie Cheat & Steal in all international dealings. _E_
Proud of @IvankaTrump for her leadership on these important issues. Looking forward to hearing her speak at the W20! __HTTP__ _E_
Thank you Nashua New Hampshire! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Going over to @ABC to do LIVE at 9:00. _E_
Thank you to everyone for all of the nice comments by Twitter pundits and otherwise for my speech last night. _E_
Obamacare puts poor people on a form of government run single payer health insurance that many doctors don't take @Avik _E_
.@FLGovScott Gaming states are laughing at stupidity of not approving gaming in FL—they're afraid of Miami—can't believe their luck! _E_
Trump University has a 98% approval rating. I could have settled but won't out of principle! _E_
.@Toure I felt very sorry for you during your meltdown on @PiersMorgan. He drove you insane but of course Piers is a lot smarter than you _E_
Rumor has it that the grubby head of failing @VanityFair Magazine Sloppy Graydon Carter is going to be fired or replaced very soon? _E_
Via @wmbfnews: Donald Trump puts Tea Party on map for 2016 __HTTP__ _E_
Bad news for @BarackObama. @gallupnews reports that the economy (71%) and gas prices (65%) are Americans' top (cont) __HTTP__ _E_
#CaucusForTrump #Trump2016 __HTTP__ _E_
The five Taliban leaders released for a deserter must really be laughing and having a good time right now. They are saying how dumb U.S. is! _E_
My @FoxNews interview with @gretawire discussing the #CNNDebate and how to deal with Iran without using force __HTTP__ _E_
#trumpvlog China is laughing.... __HTTP__ _E_
I answered my @Facebook fans questions via video watch __HTTP__ _E_
Ted Nugent was obviously using a figure of speech unfortunate as it was. It just shows the anger people have towards @BarackObama. _E_
My thoughts on the Geico ad and more in today's video blog.... __HTTP__ _E_
Most of the world's great riders are at Mar a Lago today for the Trump Invitational one of the most important equestrian events of the year _E_
Do as I say not as I do. Obama just granted a special ObamaCare exemption for all Congress __HTTP__ All are hypocrites! _E_
"Developing your talent requires work and work creates luck." – Trump Never Give Up _E_
The US should not give a penny of foreign aid to Egypt if the Muslim Brotherhood takes over the country. We (cont) __HTTP__ _E_
Drudge Poll on who won the 3rd #GOPDebate. Thank you! __HTTP__ _E_
Thank you to @jdickerson and @FaceTheNation for a very fair and professional interview this morning. No wonder you are #1 in the ratings! _E_
Frack now and frack fast unless we want to continue to be dependent on countries that hate us. _E_
When will anyone be held accountable for the VA scandal? The politicians are experts in never facing any consequence. _E_
Via @AmSpec by Jeffrey Lord: "New Obama Scandal Erupts: Trump Targeted" __HTTP__ _E_
So I speak badly of China but I speak the truth and what do the consumers in China want? They want Trump. (cont) __HTTP__ _E_
Joe Scarborough initially endorsed Jeb Bush and Jeb crashed then John Kasich and that didn't work. Not much power or insight! _E_
I invite you to join my campaign to Make America Great Again! Sign up to Volunteer! __HTTP__ _E_
Barack Obama said absolutely not 3 times before he agreed to go after Bin Laden now he wants all of the credit! _E_
Thanks @JamersonHayes they are all total losers with nothing going for them! _E_
Check out this great story from the @WSJ... __HTTP__ _E_
In calling my tweets 'obnoxious' @AOL says "I sure know how to keep them wanting more." They are welcome. I just tell it like it is. _E_
.@PhilMickels0n_ is right—California taxes are far too high. It's ridiculous. _E_
The Audacity of @BarackObama the Federal Reserve purchased 61% of all debt issued by Treasury in 2011. Killing our children's future. _E_
RT @realDonaldTrump: At the request of the Governor of Texas I have signed the Disaster Proclamation which unleashes the full force of go... _E_
Obama should play golf with Republicans & opponents rather than his small group of friends. That way maybe the terrible gridlock would end. _E_
Between a terrible press conference mishandled prisoner swap & Taliban attacks Hagel's 1st trip as SOD was a disaster. No surprise. _E_
Our vets are treated like 3rd class citizens. Enough! Join me & @V4SA on @USSIOWA at LA Waterfront to hear my plan for vets & the military! _E_
I look forward to playing golf with President @BarackObama someday. _E_
Big response to my Tea Party statement remember they were never fully energized by Romney campaign and will have far more power with time. _E_
Entrepreneurs: Gain and use information to your advantage see every day as an opportunity to learn. _E_
I guess Rupert Murdoch and the @nypost don't like Donald Trump. Such false reporting about my big hit in Iowa. Even my enemies said bull. _E_
Don't believe Kay Hagan on Ebola travel ban. She also promised that you would keep your healthcare plan under ObamaCare. Vote @ThomTillis! _E_
Putin re Snowden issue "it is like shearing a pig: there's lots of squealing and little fleece." _E_
Check out today's video blog __HTTP__ I want to answer more of your questions tweet me..... _E_
The only reason irrelevant @GlennBeck doesn't like me is I refused to do his failing show asked many times. Very few listeners sad! _E_
Give your goals substance make them count on as many levels as you can. Remember that passion can be the catalyst for great achievement. _E_
.@ScotGolfPodcast Work has not yet begun. We're in the approval phase. It will be amazing. You will love the final result. _E_
Watch the WH spokesman try to spin @BarackObama's rationale for using exec. priv. on Fast & Furious __HTTP__ _E_
A very interesting read. Unfortunately so much is true. __HTTP__ _E_
President Obama and our negotiators are failed checker players playing against Grand Master Chess champions. Very sad to watch! _E_
Chinese oil trader just bought "record number" of Mideast crude __HTTP__ China gains while we fight ISIS. What are we doing? _E_
I have built so many great & complicated projects– creating tens of thousands of jobs video: __HTTP__ __HTTP__ _E_
Via @GolfweekMag by @GolfweekBRomine: "@TigerWoods to design Trump course in Dubai" __HTTP__ _E_
If JP Morgan took their case through the courts for 15 years nobody would be suing them—easy target. _E_
'Trump is right about violent crime: It's on the rise in major cities' __HTTP__ _E_
As #HurricaneHarvey intensifies remember to #PlanAhead. __HTTP__ __HTTP__ __HTTP__ __HTTP__ _E_
things they did and said (like giving the questions to the debate to H). A total double standard! Media as usual gave them a pass. _E_
Congratulations to @newtgingrich on being signed to co host @CNN Crossfire. Great move by Jeff Zucker. _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
New York should Frack. Thousands of jobs and millions in revenue. NY would be a truly rich state. _E_
The worst thing Hillary could do is have her husband campaign for her. Just watch. _E_
On our YouTube channel the opening of the incredible Trump Ocean Club in Panama.... __HTTP__ _E_
I just saw the movie Unbroken very good except I thought the ending was weak no retribution! And we complain about waterboarding. _E_
After 1 year of investigation with Zero evidence being found Chuck Schumer just stated that Democrats should blame ourselvesnot Russia. _E_
Watching the Ryder Cup on @GolfChannel. Very interesting and tough matches. Amazing sport my favorite! _E_
Kasich only looks O.K. in polls against Hillary because nobody views him as a threat and therefore have placed ZERO negative ads against him _E_
With Ben Carson wanting to hit his mother on head with a hammer stab a friend and Pyramids built for grain storage don't people get it? _E_
RT @mitchellvii: EXACTLY AS I SAID House Intel Chair: We Cannot Rule Out Sr. Obama Officials Were Involved in Trump Surveillance __HTTP__ _E_
The USC made a terrible decision today. How can a requirement to buy private health insurance logically be a Government tax?! _E_
Carly Fiorina did such a horrible job at Lucent and HP virtually destroying both companies that she never got another CEO job offer! Pres. _E_
The Football program at Penn State should be suspended. _E_
Will be speaking with Germany and France this morning. _E_
I just left @trumpwinery in CharlottesvilleVirginia it is the finest in the country really incredible! _E_
At least 3.5M fellow Americans are going to lose their healthcare plans because of ObamaCare. Defund then repeal! _E_
I liked The Kelly File much better without @megynkelly. Perhaps she could take another eleven day unscheduled vacation! _E_
As a candidate I promised we would pass a massive TAX CUT for the everyday working American families who are the backbone and the heartbeat of our country. Now we are just days away... __HTTP__ _E_
Thank you @TrumpWomensTour!#MakeAmericaGreatAgain __HTTP__ _E_
The historic $250M renovations at @TrumpDoral are moving on pace. Once complete @TrumpDoral will be South Florida's premiere resort. _E_
They laughed at me when I said to bomb the ISIS controlled oil fields. Now they are not laughing and doing what I said. #Trump2016 _E_
The NYPost reports @VanityFair Magazine dropped 18% to only 283938 newsstand copies sold. Very sad & their bloggers are doing even worse! _E_
Donald Trump will keynote Oakland County Republicans' Lincoln Day dinner __HTTP__ via @MLive Record crowd expected. _E_
My interview with @ASavageNation discussing #TimeToGetTough my 2012 plans and Iraq __HTTP__ __HTTP__ _E_
Lightweight Senator Marco Rubio features Trump Univ. students in FL. attack ads who submitted excellent reviews. __HTTP__ _E_
Thank you Ohio see you tonight! __HTTP__ _E_
.@BarackObama bowed to the Saudi King in public yet the Dems are questioning @MittRomney's diplomatic skills. _E_
.@DonaldJTrumpJr and I on the 18th hole at Trump International Golf Links Scotland __HTTP__ _E_
Did you agree with my decision? #CelebApprentice _E_
I never met former Defense Secretary Robert Gates. He knows nothing about me. But look at the results under his guidance a total disaster! _E_
Via @FootwearNews by @kristenmhenning: "@IvankaTrump Works to Beat Breast Cancer" __HTTP__ _E_
Steven Spielberg is a great filmmaker. Go see Lincoln. _E_
I am astonished that the media continues to lie. @BarackObama gutted welfare reform. It is a fact! _E_
As a big job creator I was greatly honored to have been mentioned twice tonight during the debate. _E_
Just announced that Iraq (U.S.) is preparing for battle to reclaim Mosul. Why do they have to announce this? Makes mission much harder! _E_
Failed candidate Mitt Romneywho ran one of the worst races in presidential historyis working with the establishment to bury a big R win! _E_
Our gov't should immediately stop sending $'s to Mexico no friend until they release Marine & stop allowing immigrant inflow into U.S. _E_
Donald Trump visits Doral resort says he's allaying neighbors' concerns __HTTP__ via @MiamiHerald _E_
I hope the boycott of @Macys continues forever. So many people are cutting up their cards. Macy's stores suck and they are bad for U.S.A. _E_
The blatant waste of taxpayers' dollars doesn't bother Obama because it's all part of his broader nanny state (cont) __HTTP__ _E_
A tough negotiator can make the Chinese back off. We've done it before. #TimeToGetTough __HTTP__ __HTTP__ _E_
.@Omarosa is not winning points being called "the wicked witch of the Mid West" and most certainly other things. #CelebApprentice _E_
RT @FoxNews: .@POTUS: Our infrastructure will again be the best in the world. We used to have the greatest infrastructure anywhere in the... _E_
I have a dream that our country will be great again! #DreamDay _E_
Benghazi is bigger than Watergate. Don't let Obama get away with allowing Americans to die. Kick him out of office tomorrow. _E_
Via @dcexaminer by @eScarry: "Donald Trump: @HuffingtonPost 'a very dishonest organization'" __HTTP__ _E_
Great coordination between agencies at all levels of government. Continuing rains and flash floods are being dealt with. Thousands rescued. _E_
Dow S&P 500 and Nasdaq all finished the day at new RECORD HIGHS! __HTTP__ _E_
At least @TheTinaBeast is consistent. She takes over a magazine and it ends up in the gutter. _E_
'Trump signs bill undoing Obama coal mining rule' __HTTP__ _E_
Thank you New Hampshire!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Just arrived in Mississippi for the rally. Word is that the crowd is overflowing and massive. Will be an amazing evening! _E_
"A man always has two reasons for doing anything: a good reason and the real reason." J. P. Morgan _E_
Happy Passover to everyone celebrating in the United States of America Israel and around the world. #ChagSameach _E_
Thank you Eric! __HTTP__ _E_
RT @foxnation: .@realDonaldTrump's First Full Month in Office Sees Biggest Jobs Gain 'In Years': Report: __HTTP__ _E_
Because of President Obama's failed leadership we have put Vladimir Putin & Russia back on the world stage! No reason for this. _E_
Jobless claims rose yet again last week __HTTP__ @BarackObama's economic record is abysmal we can do much better. _E_
Obama doesn't know what he's doing. His foreign policy is a disaster. Libya Egypt Iraq Afghanistan all (cont) __HTTP__ _E_
Thank you! __HTTP__ _E_
Curtis Sliwa doing tv commentary on 9/13/2001. Good job Curtis. Please send your apologies to @realDonaldTrump. __HTTP__ _E_
Republicans are always worried about their general approval. With proposing to 'ignore the debt ceiling' they are ignoring their base. _E_
MAKE AMERICA GREAT AGAIN!#INPrimary #VoteTrump __HTTP__ _E_
Thank you! Facebook: __HTTP__ __HTTP__ __HTTP__ _E_
Today we remember the men and women who made the ultimate sacrifice in serving. Thank you God bless your families & God bless the USA! _E_
Congratulations to @SteveKingIA and his team on running a great campaign. Steve is a strong leader in the House. _E_
Thank you for your endorsement @paulteutulsr! #BikersForTrump #VoteTrumpNV Video: __HTTP__ __HTTP__ _E_
If you have a hard time communicating one way to overcome it is to turn your focus onto your audience. Midas Touch _E_
Thanks for all the nice comments about the @Late_Show last night. I enjoyed it and David enjoyed the ratings. __HTTP__ _E_
A great @The Masters. The course looks so beautiful. Fantastic for golf and television ratings! _E_
Stop saying I went bankrupt. I never went bankrupt but like many great business people have used the laws to corporate advantage—smart! _E_
Amazing! Watch @NHLBruins fans take over National Anthem during pregame ceremonies __HTTP__ _E_
Should not raise taxes in Wisconsin but massive budget deficit. Education roads etc suffering. @DanHenninger lies. @WSJ _E_
Via @BreitbartNews: "EXCLUSIVE: TRUMP SMACKS BACK AGAINST MEDIA ATTACKS ON CPAC SPEECH" __HTTP__ by @mboyle1 _E_
Remember our six brave heroes who died searching for Bergdahl after he deserted __HTTP__ (h/t @Military_News) _E_
"Leverage: don't make deals without it." – The Art of the Deal _E_
My @FoxNews interview on @gretawire discussing the @RNC convention @BarackObama's sealed records & real estate advice __HTTP__ _E_
Congratulations @TrumpSoHo on being named a "Great East Coast Hotels for Teens and Families" by @ParadeMagazine __HTTP__ _E_
Derek Jeter's rehab assignment is progressing on schedule. He's a true @Yankees captain. Look forward to seeing him back on the field _E_
When I say I would end Obamacare I would also come up with a plan that would be far better much easier to understand and cost less! _E_
Happy #MothersDay to all the great mothers out there! __HTTP__ _E_
"The worst thing you can possibly do in a deal is seem desperate to make it." The Art of the Deal _E_
Entrepreneurs: See each day as an opportunity to show what you can do at the highest level. _E_
If the Prez wants to create jobs talk to some business people not liberal intellectuals. _E_
Watching Pyongyang terrorize Asia today is just amazing! _E_
Pay to play. Collusion. Cover ups. And now bribery? So CROOKED. I will #DrainTheSwamp. __HTTP__ _E_
RT @AnnCoulter: GREATEST FOREIGN POLICY SPEECH SINCE WASHINGTON'S FAREWELL ADDRESS. _E_
Today @BarackObama will borrow 40 cents on every dollar he spends from China. Just another day at the office. _E_
Always know you could be on the precipice of something great. Donald J. Trump __HTTP__ _E_
Phony Rubio commercial. I could have settled but won't out of principle! See student surveys. __HTTP__ _E_
The @TheView @ABC once great when headed by @BarbaraJWalters is now in total freefall. Whoopi Goldberg is terrible. Very sad! _E_
I will be meeting with the NRA who has endorsed me about not allowing people on the terrorist watch list or the no fly list to buy guns. _E_
I'll be on @greta ON THE RECORD tonight at 7 PM _E_
By the way New York State MUST LOWER TAXES (and fast) and must start going after all of the energy that lies just below our feet (now)! _E_
HALF of Americans don't pay income tax despite crippling govt debt... __HTTP__ _E_
'Majority in Leading EU Nations Support Trump Style Travel Ban' Poll of more than 10000 people in 10 countries... __HTTP__ _E_
RT @DiamondandSilk: When the President says You're Fired That means: Pack Yo Stuff and Go Not Say You Refuse to Go! #DrainTheSwam... _E_
Rush Limbaugh is great tells it as he sees it really honorable guy! Thanks Rush! #Trump2016 _E_
RT @FoxNews: .@KellyannePolls: Since @POTUS took office 863000 new jobs were filled by women. Over half a million American women have en... _E_
An important part of my (or anybody's) success is the ability to judge people. I believe that @MileyCyrus is a really good person. _E_
If FM @AlexSalmond needs to litter Scotland w/ ugly industrial wind turbines to gain independence he will lose! __HTTP__ _E_
I will be interviewed on @CNN @NewDay at 7:30 A.M. Enjoy! _E_
Let's Trump the Establishment! We are no longer silent. We will Make America Great Again! __HTTP__ _E_
Outrageous @BarackObama is trying to unilaterally gut welfare reform __HTTP__ He doesn't believe in a strong work ethic. _E_
Stamps are going up once again. Now the US postal service will lose even more money. _E_
My @extratv interview discussing @Rosie's new baby my acceptance of @billmaher's $5M offer & hiring @_KatherineWebb __HTTP__ _E_
.@bobvanderplaats asked me to do an event. The people holding the event called me to say he wanted $100000 for himself.Phony @foxandfriends _E_
Trust your instincts. They are there for a reason. Without instincts you'll have a hard time getting to and staying at the top. _E_
When I jokingly said bring back Steve Jobs to run Apple because Apple has not been doing well the haters & losers had a field day! Sad. _E_
Lyin' Ted Cruz just used a picture of Melania from a G.Q. shoot in his ad. Be careful Lyin' Ted or I will spill the beans on your wife! _E_
Stock market hits new high with longest winning streak in decades. Great level of confidence and optimism even before tax plan rollout! _E_
"Donald Trump to address SC Tea Party Coalition at Myrtle Beach event" __HTTP__ via @CarolinaLive by @timmcginniswpde _E_
With Terry McAuliffe Gov of Virginia at the Trump Winery in Charlottesville VA largest on East Coast. @GovernorVA __HTTP__ _E_
You have no idea what my strategy on ISIS is and neither does ISIS (a good thing). Please get your facts straight thanks. @megynkelly _E_
Join my team tonight at 8:30pmE! __HTTP__ __HTTP__ _E_
.@TimTebow has tremendous talent and a proven ability to lead. He deserves to be in the @nfl. _E_
America's Labor Market Continues to Boom JOBS JOBS JOBS! __HTTP__ _E_
How Trump Won And How The Media Missed It __HTTP__ _E_
China is driving the price of gold up in order to ease pressure against Iranian sanctions. __HTTP__ _E_
.@FoxNews is the only network that does not even mention my very successful event last night. $6000000 raised in one hour for our VETS. _E_
Practice positive thinking this will keep you focused while weeding out anything that is unnecessary negative or detrimental. _E_
The cheap 12 inch sq. marble tiles behind speaker at UN always bothered me. I will replace with beautiful large marble slabs if they ask me. _E_
.@Linda_McMahon is an elite businesswoman who will bring a great outlook to DC. Support her campaign here __HTTP__ _E_
The debate tonight will be a total disaster low ratings with advertisers and advertising rates dropping like a rock. I hate to see this. _E_
Thank you @rushlimbaugh for your wonderful words. We will #MakeAmericaGreatAgain _E_
.@Lord_Sugar If you didn't say the iPod would be gone in a year you might have been really rich instead of the peanut money you have. _E_
"Protect the downside and the upside will take care of itself. – The Art of the Deal _E_
Congrats to @TimTebow on making @Patriots' first cut. Stay strong and positive! We are all rooting for you. _E_
New on our YouTube channel today is a brand new #trumpdocumentary giving you a look inside the world of Trump Golf... __HTTP__ _E_
If any candidate believes that with what we know today we still should have invaded Iraq then they are unqualified to be Commander in Chief. _E_
The dying @VanityFair's circulation has "dropped" & its newsstand sales have "plummeted by 20.1 percent" __HTTP__ _E_
I am truly honored to have been chosen Statesman of the Year by the Republican Party of Sarasota County. The (cont) __HTTP__ _E_
Looked at plans for Trump Doral Country Club today. It will be amazing! Glad to be in Miami. _E_
If elected I will undo all of Obama's executive orders. I will deliver. Let's Make America Great Again! __HTTP__ _E_
Entrepreneurs: Resolve to be bigger than your problems. Who's the boss? Don't negate your own power. _E_
I've done the largest house sale in U.S. history by selling a Palm Beach mansion for $100M $60M more than I paid. I love real estate. _E_
Mar a Lago in Palm Beach is one of the great palazzos of the world with a fantastic history. __HTTP__ _E_
Watch the latest From The Desk Of Donald Trump at __HTTP__ and read this article __HTTP__ _E_
Trump urges GOP to be 'mean as hell' __HTTP__ Via @CNNPolitics _E_
Looks like two time failed candidate Mitt Romney is going to be telling Republicans how to get elected. Not a good messenger! _E_
I truly understood the appeal of Ron Paul but his son @RandPaul didn't get the right gene. _E_
Most people think small because most people are afraid of success afraid of making decisions afraid of winning. The Art of the Deal _E_
.@oreillyfactor The people of Iowa love the fact that I stuck up for my rights as I will do for the U.S. Also got $6000000 for our VETS! _E_
Resolve never to quit never to give up no matter what the situation. @jacknicklaus _E_
Donald Trump backs 'Apprentice' Randal Pinkett for N.J. Lieutenant Governor: __HTTP__ _E_
Apparently @MartinBashir said something about me on his show yesterday. I was surprised to find out he is on TV. Who knew?! _E_
They don't like Rubio in Florida he left them high & dry. Doesn't even show up for votes! _E_
Republicans sorry but I've been hearing about Repeal & Replace for 7 years didn't happen! Even worse the Senate Filibuster Rule will.... _E_
The Euro is going to collapse soon. Cross border lending is already down and banks are stopping their Euro investments. _E_
Horrible killing of a 13 year old American girl at her home in Israel by a Palestinian terrorist. We must get tough. __HTTP__ _E_
I am working hard even on Thanksgiving trying to get Carrier A.C. Company to stay in the U.S. (Indiana). MAKING PROGRESS Will know soon! _E_
#TrumpVine A message for @AnthonyWeiner __HTTP__ _E_
On the 13th tee box @TrumpScotland with my grand daughter Kai! @DonaldJTrumpJr __HTTP__ _E_
The Democrats will only vote for Tax Increases. Hopefully all Senate Republicans will vote for the largest Tax Cuts in U.S. history. _E_
RT @JohnStossel: I can skate here ONLY b/c @realdonaldtrump fixed this rink after NYC gov't spent $13M but FAILED! Good for Trump! __HTTP__ _E_
Via @BPolitics by @Griffin Aboard Donald Trump's 757 at the South Carolina Tea Party Convention __HTTP__ _E_
On behalf of an entire nation Happy 242nd Birthday to the men and women of the United States Marines!#USMC242 #SemperFi __HTTP__ _E_
Ann Romney is a fantastic lady. She was great in thanking people last night! __HTTP__ _E_
Our nation is a once great nation divided! _E_
Be sure to watch my wonderful wife Melania Trump tonight on @QVC at 1AM EST _E_
..... I wonder if Angelo has a job or is on assistance. In any event I'm sure he is a nice guy! _E_
Congratulations to @FoxNews for winning November in the cable news rating race with 9 of 10 top shows __HTTP__ _E_
My @greta int. on @FoxNews on how to defeat ISIS Obama losing ground to ISIS & Making America Great Again! __HTTP__ _E_
Miss USA pageant had a 4 to 1 vote in favor but it won't be in Miami Doral in 2014 Mayor Boria voted against it. I want total support! _E_
New study shows 80% of Congress have no business experience it shows! _E_
AmyMek Amen! @realDonaldTrump has drawn more attention to Veterans issues in 1 week than these politicians have in decades! _E_
Great poll numbers for @MittRomney just out he is leading substantially in swing states. _E_
NEW POLL: Trump Blue Collar Support highest since FDR in 1930s WOW! __HTTP__ _E_
We have to repeal & replace #Obamacare! Look at what is doing to people! #DrainTheSwamp __HTTP__ _E_
Nice story from @businessinsider __HTTP__ _E_
We have tremendous economic power over China if our leaders knew how to use it which they don't! China's economy would collapse without us. _E_
Everyone's favorite frontman Twisted Sister lead singer @deesnider returns to this year's All Star @ApprenticeNBC. Dee does great! _E_
Another good poll result in the great state of SC. Trump at 30%. Carson at 15% and Bush at 9%. __HTTP__ _E_
"We have a president who has a vendetta against businesspeople and considers them the enemy. He's also (cont) __HTTP__ _E_
"Donald Trump launches new men's fragrance Empire @Macys Because every man has his own empire to build'" __HTTP__ _E_
As bad as they were I don't remember our embassies being attacked when Mubarak and Gaddafi were in power. _E_
The Failing @nytimes the pipe organ for the Democrat Party has become a virtual lobbyist for them with regard to our massive Tax Cut Bill. They are wrong so often that now I know we have a winner! _E_
Thug Politics. Lightweight hack Schneiderman meets with Obama on Thursday then brings frivolous suit on Saturday. _E_
Washington should have brought in Strasburg to relieve they would have won. _E_
A wonderful story on Iowa voters by @arappeport of the @NYTimes. __HTTP__ _E_
The speakers slots at the Republican Convention are totally filled with a long waiting list of those that want to speak Wednesday release _E_
With gas prices rising and the economy failing @BarackObama seeks to have his EPA raise energy prices by $109B __HTTP__ _E_
RT @DRUDGE_REPORT: REUTERS ROLLING: TRUMP 39% CRUZ 14.5% BUSH 10.6% CARSON 9.6% RUBIO 6.7%... MORE... __HTTP__ _E_
.@EricTrump did an amazing job raising money for @StJude with his @EricTrumpFDN event featuring @LisaLampanelli. Watch __HTTP__ _E_
A big salute to Jerry Jones owner of the Dallas Cowboys who will BENCH players who disrespect our Flag. Stand for Anthem or sit for game! _E_
and stay at the fantastic Trump International Hotel Las Vegas ... __HTTP__ _E_
Thank you Fort Lauderdale Florida. #MakeAmericaGreatAgain __HTTP__ _E_
Stop and frisk works. Instead of criticizing @NY_POLICE Chief Ray Kelly New Yorkers should be thanking him for keeping NY safe. _E_
RT @foxandfriends: Insurers seeking huge premium hikes on ObamaCare plans __HTTP__ _E_
Paul Ryan a man who doesn't know how to win (including failed run four years ago) must start focusing on the budget military vets etc. _E_
Don't believe the media stories. OPEC and the Saudis have not been doing us any favors recently with oil outputs. Oil should be $30/barrel. _E_
I will defeat Crooked Hillary Clinton on 11/8/2016. #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_
"Palin's brand among evangelicals is as gold as the faucets in Trump tower" said Ralph Reed the chairman of the Faith & Freedom Coalition. _E_
"Fans like winners. They come to watch stars – great exciting players who do great exciting things." The Art of The Deal _E_
Great day in Colorado & Arizona. Will be in Nevada Colorado and New Mexico tomorrow join me!Tickets:... __HTTP__ _E_
Have passion drive and enthusiasm? You can check out the @TrumpCollection careers here: __HTTP__ _E_
"The most important thing in communication is hearing what isn't said." Peter Drucker _E_
"Sometimes life hits you in the head with a brick. Don't lose faith." Steve Jobs _E_
The 13th season of All Star @CelebApprentice is unique. We really pushed the envelope here. Our great and loyal fans will love it. _E_
In any business there will be ups and downs. If you can weather the rough times your success will be even greater during high times. _E_
CLINTON CORRUPTION AND HER SABOTAGE OF THE INNER CITIES. Full speech transcript: __HTTP__ _E_
I will be going to Atlanta Georgia tomorrow—here's the info: __HTTP__ Hope to see you there! #MakeAmericaGreatAgain! _E_
.@BillMaher didn't come through with his promised $5 million for charity so today I will sue him. _E_
OPEC is better off than they were 4 years ago. Gas has more than doubled during @BarackObama's term. Outrageous! _E_
The Republican Party has to be smart & strong if it wants to win in November. Can't allow lightweights to set up a spoiler Indie candidate! _E_
....This now allows for the passage of large scale Tax Cuts (and Reform) which will be the biggest in the history of our country! _E_
With @ivankatrump and the Chairman of DAMAC in Dubai. __HTTP__ _E_
.@hardball_chris says he's "glad" we had a hurricane! With many people dying and thousands hurting MSNBC (cont) __HTTP__ _E_
Great Gravis Poll on the great state of NH. Also watch @FaceTheNation on CBS & @HowardKurtz #mediabuzz both on Sunday. _E_
The so called bipartisan DACA deal presented yesterday to myself and a group of Republican Senators and Congressmen was a big step backwards. Wall was not properly funded Chain & Lottery were made worse and USA would be forced to take large numbers of people from high crime..... _E_
Our country needs strong borders and extreme vetting NOW. Look what is happening all over Europe and indeed the world a horrible mess! _E_
Congrats @NBCInvestigates on revealing that Obama knew millions of Americans would lose their healthcare plans __HTTP__ _E_
Congrats to @TrumpWaikiki celebrating 51 consecutive months as the #1 Honolulu Hotel on @TripAdvisor! _E_
The U.S. cannot negotiate with terrorists. It is a sad and terrible situation for the family involved but this can only lead to disaster. _E_
.... to help McConnell who spoke right after him."@BreitbartNews _E_
If Republican Senators are unable to pass what they are working on now they should immediately REPEAL and then REPLACE at a later date! _E_
"You may have to try a lot of things to get just one thing to work. That's tenacity and it's critical to success." – Trump Never Give Up _E_
If the people so violently shot down in Paris had guns at least they would have had a fighting chance. _E_
Wow! Senator Mark Warner got caught having extensive contact with a lobbyist for a Russian oligarch. Warner did not want a "paper trail" on a "private" meeting (in London) he requested with Steele of fraudulent Dossier fame. All tied into Crooked Hillary. _E_
RT @foxandfriends: .@GeraldoRivera: Chances of impeachment went from 3% to 0% with Comey's testimony __HTTP__ _E_
Just landed in Da Nang Vietnam to deliver a speech at #APEC2017 _E_
Thank you Richmond Virginia! #Trump2016 __HTTP__ _E_
Thank you Iowa! #Trump2016 __HTTP__ _E_
Whether I choose him or not for State Rex Tillerson the Chairman & CEO of ExxonMobil is a world class player and dealmaker. Stay tuned! _E_
I will be live tweeting during the Celebrity Apprentice at 9 P.M. Also will be hosting Dateline just prior to Apprentice at 8 P.M. _E_
My interview from last night with @piersmorgan discussing OWS __HTTP__ _E_
I just returned from Iowa what a beautiful state. The people are amazing and the event for Congressman Steve King was a great success! _E_
Will be on @foxandfriends at 7.00. (30 minutes). A great deal to talk about including Ebola quarantine. _E_
The Trans Pacific Partnership will increase our trade deficits & send even more jobs overseas. This is a bad deal. Time for smart trade! _E_
Face The Nation's interview of me was the highest rated show that they have had in 15 years. Congratulations and WOW! @CBSNews @jdickerson _E_
Join us live in the Oval Office for the swearing in of our new Attorney General @SenatorSessions!LIVE:... __HTTP__ _E_
Business is an art in itself & powerful negotiation skills are one of the techniques necessary to facilitate success. Think Like a Champion _E_
Elections have consequences. Obama just published "final regulations for ObamaCare's individual mandate" __HTTP__ Enjoy! _E_
Read a great interview with Donald Trump that appeared in The New York Times Magazine: __HTTP__ _E_
.@EWErickson got fired like a dog from RedStateand now he is the one leading opposition against me. _E_
Don't let the GLOBAL WARMING wiseguys get away with changing the name to CLIMATE CHANGE because the FACTS do not let GW tag to work anymore! _E_
.@Yankees should get rid of A Rod ASAP I can't watch this guy anymore! _E_
It's Tuesday how much will the media continue to cover up the embassy attacks for Obama? _E_
It is great to meet fellow patriots at the #TimeToGetTough book signings. Can't wait to meet more today at Trump Tower from 12PM to 2PM _E_
According to new employment numbers 296000 Americans have dropped out of the work force & gave up looking for work. _E_
Wow it's snowing in Isreal and on the pyramids in Egypt. Are we still wasting billions on the global warming con? MAKE U.S. COMPETITIVE! _E_
Tonight I will be on @FoxNews with @SeanHannity at 10pm and @CNN w/ @AndersonCooper at 10:10pm. Enjoy! #VoteTrumpSC #Trump2016 _E_
Donald Trump shocked by 'stupid decision' about @OMAROSA on '@ApprenticeNBC' __HTTP__ @TODAY_Clicker _E_
Obama's war on coal is killing American jobs making us more energy dependent on our enemies & creating a great business disadvantage. _E_
With @StephenBaldwin7 earlier today at @ApprenticeNBC press conference in @TrumpTowerNY. __HTTP__ _E_
By Scotland officials canceling my local ad about how damaging wind turbines are it became a much bigger story around the world. Great! _E_
Why does @BarackObama always have to rely on teleprompters? _E_
I will be on @foxandfriends at 8:30 A.M. Will be talking about lightweight Marco Rubio and lying Ted Cruz! _E_
Obama will be trying very hard at next debate he doesn't want to lose the Boeing. _E_
We need a real President! __HTTP__ _E_
When your life flashes before your eyes make sure you've got plenty to watch. Anonymous _E_
How stressed are @lisarinna and @pennjillette already? #CelebApprentice _E_
Looking forward to giving keynote speech tonight @ChesterfieldGOP Lincoln Reagan dinner in Virginia. _E_
Via @bostonherald by Eugene R. Dunn: "Iran a clear danger" __HTTP__ _E_
I hear this moron @billmaher said nasty things about me (hair etc—boring) on the terminated @jayleno show. Stupid guy/bad ratings! _E_
Unbelievable evening in Melbourne Florida w/ 15000 supporters and an additional 12000 who could not get in. Tha... __HTTP__ _E_
On behalf of @FLOTUS Melania and I THANK YOU for an unforgettable afternoon and evening at the Forbidden City in Beijing President Xi and Madame Peng Liyuan. We are looking forward to rejoining you tomorrow morning! __HTTP__ _E_
...allegations of unmasking Trump transition officials. Not good! _E_
Wow NY Observer story about @AGSchneiderman really exposes him as a sleazebag & crook. He's bad for New York. __HTTP__ _E_
Does Madonna know something we all don't about Barack? At a concert she said we have a black Muslim in the White House. _E_
How do you fight millions of dollars of fraudulent commercials pushing for crooked politicians? I will be using Facebook & Twitter. Watch! _E_
Thank you for a great night at the Verizon Wireless Arena New Hampshire! #VoteTrumpNH#MakeAmericaGreatAgain #FITN __HTTP__ _E_
There is no better place in the world to spend Christmas than Mar a Lago __HTTP__ in Palm Beach Florida. _E_
Just finished another week of filming @ApprenticeNBC. This season a record 14th is shaping up to be the best yet. _E_
Leaving soon after a great time in New Hampshire a truly special place! _E_
From ABC News: In Demand: Washington's Highest (and lowest) Speaking Fees by Scott Wilson __HTTP__ _E_
Don't forget to watch Celebrity Apprentice tonight at 9pm...you will love it! _E_
Landing in Pennsylvania now. Great new poll this morning thank you. Lets #DrainTheSwamp and #MakeAmericaGreatAgain... __HTTP__ _E_
Will be playing golf today with Rand Paul at Trump International in Palm Beach. Will be both interesting and fun! _E_
Can you believe that our very stupid politicians released the leader of ISIS and now we are spending billions trying to get him back! _E_
Via the Washington Post: Inside the World of Donald Trump's Super Fans: __HTTP__ _E_
RT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_
Hurricane looks like largest ever recorded in the Atlantic! _E_
Great news that @ehasselbeck will be joining @foxandfriends. Elisabeth is a tremendous person and will be missed on @theviewtv. _E_
Thank you to the @nydailynews for a very nice story __HTTP__ _E_
Thank you to @exxonmobil for your $20 billion investment that is creating more than 45000 manufacturing & construction jobs in the USA! _E_
Like her or not Hillary did what she had to do in the debate last night—get through it. Her opponents were very gentle and soft! _E_
Obama's nuclear deal with the Iranians will lead to a nuclear arms race in the Middle East. It has to be stopped. _E_
Miss Universe contestants are amazing—the most beautiful ever! _E_
Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_
Via @politicalwire: Tweet of the Day __HTTP__ _E_
Now that Obama's poll numbers are in tailspin – watch for him to launch a strike in Libya or Iran. He is desperate. _E_
Of course there is large scale voter fraud happening on and before election day. Why do Republican leaders deny what is going on? So naive! _E_
to make up their own minds as to the truth. The media lies to make it look like I am against Intelligence when in fact I am a big fan! _E_
Congratulations to @seanhannity on his tremendous increase in television ratings. Speaking of ratings I will be on his show tonight @ 10pE. _E_
My interview with @NYDNGatecrasher discussing @BarackObama's #WHCD and my endorsement of @MittRomney __HTTP__ _E_
Happy #SmallBusinessSaturday!A great day to support your community and America's JOB creators by shopping locally at a #SmallBiz. #ShopSmall __HTTP__ _E_
Via @TODAY_Clicker: Donald Trump promises 'tough and mean and nasty' 'Celebrity Apprentice' __HTTP__ _E_
.@GovernorPataki did a terrible job as Governor of New York. If he ran again he would have lost in a landslide. He and Graham ZERO in polls _E_
This week the Senate can join the House & take a strong stand for the Middle Class families who are the backbone of America. Together we will give the American people a big beautiful Christmas present a massive tax cut that lets Americans keep more of their HARD EARNED MONEY! __HTTP__ _E_
Obama spoke to the Mexican president last week & did not mention UMC Sgt. Tahmooressi. Sad! _E_
I find the photos of these children killed in Newtown in the New York Post heartbreaking.#Angels _E_
It's Monday. How much will premiums rise today because of ObamaCare? REPEAL! _E_
Thank you Michael Harrison @Talkersmagazine for your kind words greatly appreciated! _E_
A fact golfers don't get aches & pains like others who don't golf. It is amazingly remedial. _E_
...popular vote. ABC News/Washington Post Poll (wrong big on election) said almost all stand by their vote on me & 53% said strong leader. _E_
RT @DonnaWR8: @realDonaldTrump Thank you @POTUS for believing in Us like we believed in you! #MAGA __HTTP__ _E_
If the presidential election were held today according to this @surveyusa poll Donald Trump would defeat any Dem: __HTTP__ _E_
Just read @PiersMorgan's book "Shooting Straight" and whether you love him or hate him (I'm in the first category) it is terrific. _E_
"Miss Universe Ratings 6.1 Million Viewers Best Since 2008" __HTTP__ _E_
We will have the votes for Healthcare but not for the reconciliation deadline of Friday after which we need 60. Get rid of Filibuster Rule! _E_
#CrookedHillary gives Obama an "A" for an economic recovery that's the slowest since WWII... #BigLeagueTruth... __HTTP__ _E_
How foolish did @davidaxelrod look yesterday trying to rationalize why @BarackObama accepts donations from Bain? __HTTP__ _E_
The problem with agreeing to a policy on immigration is that the Democrats don't want secure bordersthey don't care about safety for U.S.A. _E_
ICYMI via @foxnewsinsider my @foxandfriends from yesterday on Obama's dangerous disconnect __HTTP__ _E_
Why is @BarackObama letting the Taliban know when our troops are leaving? __HTTP__ This is dangerous for our soldiers. _E_
I will be holding a major news conference in New York City with my children on December 15 to discuss the fact that I will be leaving my ... _E_
I will be interviewed on @jaketapper @CNN at 9:00 A.M. and Fox News Sunday with Chris Wallace at 10:O0 A.M. CNN Iowa Poll 13 point lead! _E_
My interview on @gretawire discussing the economy and @TheHermanCain Witch Hunt __HTTP__ _E_
Spend your last day of 2013 contemplating the moves you will make in 2014 to make it your best year ever! _E_
Considering Obama hasn't proposed anything concrete if he wins he won't have a mandate. Another 4 years of legislative stalemate. _E_
The @nfl ratings continue to fall every week and will keep dropping. Boring games too many flags too soft! _E_
Today's #trumpvlog answers your tweets about my thoughts on the Republican candidates... __HTTP__ _E_
Limited opportunity to get your OFFICIAL Trump gear! Shop now! __HTTP__ __HTTP__ _E_
I guess @edshow is a lot smarter than dopes like @JonahNRO & @stephenfhayes. Oh well both mags are dying anyway. __HTTP__ _E_
To the brave men and women past and present in our armed services best wishes on Veterans Day. _E_
President Obama you are a complete and total disaster but you have a chance to do something great and important: STOP THE FLIGHTS! _E_
I've been watching very little @CNBC lately—the good news is I'm switching over to @BloombergNews and @FoxNews. _E_
Mobile Alabama today at 3:00 P.M. Last rally of the year THANK YOU ALABAMA AND THE SOUTH Biggest of all crowds expected see you there! _E_
RT @TeamTrump: "Police officers are the BEST of us. Law enforcement in this country is a force for GOOD. @mike_pence #VPDebate #BigLeagu... _E_
Congratulations to Eric & Lara on the birth of their son Eric Luke Trump this morning! __HTTP__ _E_
Entrepreneurs: Don't be confined by expectations. There are no exact rules for negotiation try to remain flexible and open to new ideas. _E_
I ask again how much is very wealthy South Korea paying the United States for protecting it against North Korea? _E_
Via @CBNNews by @TheBrodyFile: "Donald Trump: 'We Must Make America Great Again'" __HTTP__ _E_
.@FoxNews FBI's Andrew McCabe "in addition to his wife getting all of this money from M (Clinton Puppet) he was using allegedly his FBI Official Email Account to promote her campaign. You obviously cannot do this. These were the people who were investigating Hillary Clinton." _E_
On the way to the great state of Rhode Island big rally. Then to Pennsylvania for rest of day and night! _E_
Received a standing applause at #NCGOPcon when I said to have free trade be fair for the US we need really intelligent negotiators. _E_
The entire world understands that the good people of Iran want change and other than the vast military power of the United States that Iran's people are what their leaders fear the most.... __HTTP__ _E_
My @CNN interview with @TVAshleigh discussing @MittRomney's electability and @RickSantorum's Senate loss. __HTTP__ _E_
Donald Trump Ed Koch and the Ice Skating Rink: A Tale of Bureaucracy __HTTP__ @ActonInstitute _E_
People ask why do you tweet and re tweet to millions about @JebBush when he is so low in the polls? Because of his big $ hit ads on me! _E_
Via @BreitbartNews by @TheTonyLee: @Citizens_United sues @AGSchneiderman for violating 1st Amendment __HTTP__ _E_
Danger Weiner is a free man at 12:01AM. He will be back sexting with a vengeance. All women remain on alert. _E_
Good luck to @joniernst. You will make a wonderful Senator. _E_
We have a president who has a vendetta against businesspeople and considers them the enemy. #TimeToGetTough (cont) __HTTP__ _E_
If we do not protect the rule of law then we can expect even more illegals to cross the border. Obama's executive amnesty is dangerous. _E_
....it is very possible that those sources don't exist but are made up by fake news writers. #FakeNews is the enemy! _E_
Almost every T.V. show is asking me to go on especially the @Late_Show. It's simple I get the ratings! _E_
Why does the federal government send foreign aid to China? Unbelievable! Washington is financing America's de... (cont) __HTTP__ _E_
Failed Presidential Candidate Mitt Romney was campaigning with John Kasich & Marco Rubio and now he is endorsing Ted Cruz. 1/2 _E_
Here's a great video of the official launch of my new fragrance #Success @Macys Herald Square __HTTP__ _E_
Denver Minnesota and others are bracing for some of the coldest weather on record. What are the global warming geniuses saying about this? _E_
Maniac Sergeant who went on a killing spree in Afghanistan must be punished big time and quickly. _E_
Limited opportunity to get your OFFICIAL Trump gear! Shop now! __HTTP__ __HTTP__ _E_
The pathetic new hit ad against me misrepresents the final line. You can tell them to go BLANK themselves was about China NOT WOMEN! _E_
I asked @VP Pence to leave stadium if any players kneeled disrespecting our country. I am proud of him and @SecondLady Karen. _E_
Bill Clinton's meeting was a total secret. Nobody was to know about it but he was caught by a local reporter. _E_
Great @nytimes story about our conversion of the Old Post Office building in D.C. to luxury hotel __HTTP__ _E_
Little Marco Rubio the lightweight no show Senator from Florida is just another Washington politician. __HTTP__ _E_
See Newsmax story re Republican National Convention __HTTP__ _E_
To be successful your focus has to be broad enough to think big at the same time. 'Midas Touch' with @theRealKiyosaki _E_
I remained strong for @TigerWoods during his difficult period. He rewarded me (and himself) by winning at Trump National Doral. _E_
Looking forward to @David_Bossie & @RepJeffDuncan's @Citizens_United Freed Summit in Greenville SC this Saturday! _E_
Soon to be the greatest hotel in U.S. don_trump_jr @ivankatrump @erictrump #OldPostOffice __HTTP__ _E_
The so called 'moderate' Syrian rebels pledged their allegiance to ISIS after Obama's address. We should not be arming them! _E_
Replay of Fox News Sunday With Chris Wallace at 2:00 P.M. on @FoxNews. Big statement made by Chris! _E_
I pay millions of $'s a year to Florida Power & Light & they can't give us what we want. Maybe a major class action suit against them? _E_
The police in London say I'm right. Major article in Daily Mail. "We can't wear uniform in our own cars." __HTTP__ _E_
I am now in Iowa getting ready to speak. People are always amazed to find out that I am Protestant (Presbyterian). GREAT. _E_
Roger Ailes just called. He is a great guy & assures me that "Trump" will be treated fairly on @FoxNews. His word is always good! _E_
.@AndrewKreig Thank you Andrew so correct! _E_
Mark Begich votes with Obama 97%. He opposes drilling & supports Amnesty for illegals. Next Tuesday vote @DanSullivan2014! _E_
Get ready for two amazing episodes of Celebrity Apprentice tomorrow night (Monday) at 8:00. Some incredible things happen! _E_
Obama better than last time but again @MittRomney wins. Good night. #debate _E_
I don't believe you have to be better than everybody else. I believe you have to better than you ever thought you could be. Ken Venturi _E_
Stuart Stevens the failed campaign manager of Mitt Romney's historic loss is now telling the Republican Party what to do with Trump. Sad! _E_
Remember all these 'freedom fighters' in Syria want to fly planes into our buildings. _E_
Great column by David Bossie at @BreitbartNews: "A Battle Won but the War Continues to Defund ObamaCare" __HTTP__ _E_
After reading the false reporting and even ferocious anger in some dying magazines it makes me wonder WHY? All I want to do is #MAGA! _E_
Met with President Putin of Russia who was at #APEC meetings. Good discussions on Syria. Hope for his help to solve along with China the dangerous North Korea crisis. Progress being made. _E_
RECORD HIGH FOR S & P 500! _E_
"Trumps Are Giving @TrumpDoral A Makeover" __HTTP__ via @CBSMiami _E_
The Trans Pacific Partnership is an attack on America's business. It does not stop Japan's currency manipulation. This is a bad deal. _E_
My @gretawire interview re: the dismal job report getting ripped off by South Korea 2016 election & #WWEHOF __HTTP__ _E_
.@FoxNews treats me so badly. Using old Quinnipiac Poll where I have a much smaller lead than the just out @CNN Poll. All negative! _E_
Address to the NationFull Video & Transcript: __HTTP__ __HTTP__ _E_
Why are we letting the three girls who left the U.S. to join ISIS back into the country? How stupid has our once respected country become! _E_
Via @ TheScotsman: "Donald Trump to lay out new golf course plan" __HTTP__ _E_
Via @townhallcom by @MattTowery: "Why Trump Should Run" __HTTP__ _E_
New poll thank you! #Trump2016 __HTTP__ __HTTP__ _E_
"If you don't have problems you're pretending or you don't run your own business." –Donald J. Trump __HTTP__ _E_
I will be interviewed by @TuckerCarlson tonight at 9:00 P.M. on @FoxNews. Enjoy! _E_
"Trump to campaign for @SteveKingIA" __HTTP__ via @kscj1360 _E_
With ZERO Democrats to help and a failed expensive and dangerous ObamaCare as the Dems legacy the Republican Senators are working hard! _E_
Entrepreneurs: Be totally focused. Being successful requires nothing less than 100% of your concentrated effort. _E_
Just took off for ceremony @ Pearl Harbor. Will then be heading to Japan SKorea China Vietnam & the Philippines. Will never let you down! _E_
The NFL has all sorts of rules and regulations. The only way out for them is to set a rule that you can't kneel during our National Anthem! _E_
.@EWErickson is a total low life read his past tweets. A dummy with no "it" factor. Will fade fast. _E_
The New York Giants are looking really bad so far tonight. Does not get much worse than this! _E_
Crooked Hillary said loudly and for the world to see that she SHORT CIRCUITED when answering a question on her e mails. Very dangerous! _E_
#SuccessByTrump Here's a photo from my appearance at @Macy's Herald Square with @ximenanr __HTTP__ _E_
Celebrity Apprentice starts in 15 minutes on NBC. ENJOY! _E_
Looking forward to being at the @RyderCupUSA announcement tonight. _E_
The Miss U.S.A. pageant will be amazing tonight. To be politically incorrect the girls (women) are REALLY BEAUTIFUL. NBC at 8 PM. _E_
Good move by Aubrey to be the red headed model they didn't have. #sweepstweet _E_
.@EdRendell's book A Nation of Wusses is an excellent read especially page 10. Go get it! _E_
Does any Republican have the ability to negotiate? _E_
We need jobs & we need them fast. I am a job creator. None of the pols can or will. Let's Make America Great Again! __HTTP__ _E_
If Mitch McConnell wants to win his election he'd better get rid of jinxed Karl Rove and fast... _E_
So nice of @Cher greatly appreciated! __HTTP__ _E_
#CelebApprentice Time for the first firing of the night. _E_
.@marcthiessen is a failed Bush speechwriter whose work was so bad that he has never been able to make a comeback. A third rate talent! _E_
The Fannie and Freddie execs should not get million dollar bonuses with our tax dollars. They were bailed out with $169B of our money. _E_
"The only source of knowledge is experience." – Albert Einstein _E_
Watch #MissUSA 2012 live tonight on @NBC at 9PM EST! _E_
Why is the @GOP being asked to do a debate that is so much longer than the just aired and very boring #DemDebate? _E_
It was a great honor to welcome Atlanta's heroic first responders to the White House this afternoon! __HTTP__ _E_
I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
When Strasburg leaves @Nationals for another team for more money will Washington still like the decision to shut him down for his good? _E_
Statement on International Holocaust Remembrance Day: __HTTP__ _E_
Good luck to the US Men's National Team in tomorrow's CONCACAF Cup vs. Mexico! It should be a great game! __HTTP__ _E_
No such meeting or conversation ever happened a made up story by low ratings @CNN. _E_
Congratulations to Brandy as our new Apprentice and to Clint for being a great player. It's been a terrific season! _E_
I am hearing that @NRCC Digital Director @lansing is doing great work expanding and modernizing @GOP social media. Good – we need it. _E_
RT @jayMAGA45: NFLplayer PatTillman joined U.S. Army in 2002. He was killed in action 2004. He fought 4our country/freedom. #StandForOurAnt... _E_
We blow up the famous Blue Monster at Trump National Doral on.Monday in order to build a spectacular new bigger and better Blue Monster! _E_
With the world's top amenities @TrumpTO's luxury residential condominiums provide the ultimate Toronto lifestyle __HTTP__ _E_
Find out who and what is the best in your field. Identify the trendsetters leaders and authorities. Learn the standards they follow. _E_
Remember to think big by expanding your horizons at the same time you're expanding your net worth. _E_
RT @EricTrump: Join my family in this incredible movement to #MakeAmericaGreatAgain!! Now it is up to you! Please #VOTE for America! __HTTP__ _E_
Yea NBC has increased all remaining Celebrity Apprentice episodes to two hours starting at 9 P.M. on Sunday! Amazing show. _E_
To all young college graduates – stick in there keep your head up and make sure you don't miss any opportunities. They are out there. _E_
To be called Trump Links at Ferry Point course will be GREAT and over the years hold many tournaments and major championships $'s to NYC. _E_
#TrumpAdvice __HTTP__ _E_
I have instructed Homeland Security to check people coming into our country VERY CAREFULLY. The courts are making the job very difficult! _E_
RT @Morning_Joe: VIDEO: @realDonaldTrump announces 'a very powerful endorsement' will be coming today. __HTTP__ _E_
Great Twitter poll and I wasn't even there. Thank you! #GOPDebate __HTTP__ _E_
Mitt did the right thing—not because he had to but because he never would have been given a second chance after his first fiasco _E_
Just did final purchase on fabulous @LodgeatDoonbeg in Ireland. Will become Trump International Hotel & Golf Links Ireland. Very exciting! _E_
Very exciting. I will be at Macy's Herald Square this Wednesday at 5:30pm to celebrate the launch of Trump Home crystal! _E_
Thank you @ShopFloorNAM. An honor to be with you today. Great news! Manufacturers report record high economic optimism in 2017. #TaxReform __HTTP__ _E_
Our clubhouse facility & suites in Ireland @LodgeatDoonbeg #TrumpIreland __HTTP__ __HTTP__ _E_
46% of Americans think the Media is inventing stories about Trump & his Administration. @FoxNews It is actually much worse than this! _E_
If Russia or any other country or person has Hillary Clinton's 33000 illegally deleted emails perhaps they should share them with the FBI! _E_
RT @Scavino45: Under POTUS' @realDonaldTrump S&P 500 38th📈Record High NASDAQ 44th📈Record High#MakeAmericaGreatAgain __HTTP__ _E_
Has anyone seen the financials of @Univision. They are doing really badly. Too much debt and not enough viewers. Need money fast. Funny! _E_
Mike Flynn should ask for immunity in that this is a witch hunt (excuse for big election loss) by media & Dems of historic proportion! _E_
The @Lakers should have an amazing team next year with Kobe Nash and Howard. Will be fun to watch. _E_
Via @GolfweekMag: Major makeover: Trump has big vision for Doral __HTTP__ by @BKleinGolfweek _E_
Thank you! #AmericaFirst __HTTP__ _E_
Join me on #FacebookLive as I conclude my final #debate preparations. __HTTP__ _E_
#sweepstweet Teresa seems to underestimate the power of observance—that of the client as well as her team but she's a wonderful person _E_
Understand that difficulties mistakes & setbacks are an inevitable part of business and life. Don't allow them to knock you off your feet. _E_
"Relax & clear your mind if someone is speaking so that you're receptive to what they're saying." – Roger Ailes You are the Message _E_
With 18 beautiful holes each boasting unique characteristics Trump Nat'l Philadelphia is a Golf treasure __HTTP__ _E_
Jeb Bush just announced he raised over $100M. Everyone of those people who contributed are getting something to the detriment of America! _E_
Can you believe thatwith all of the problems and difficulties facing the U.S. President Obama spent the day playing golf.Worse than Carter _E_
The CIA deserves our praise for taking the fight to the enemy in the dark corners of the world. The CIA perseveres the politicians whine! _E_
Wake up Jeb supporters! __HTTP__ _E_
THE APPRENTICE. 10 years 182 shows many at number one for week or night Amazing! @NBC _E_
I am honored to be receiving the American Spectator Foundation Award for excellence in entrepreneurialism in Washington DC this fall. _E_
Just landed in North Carolina heading to the J.S. Dorton Arena. See you all soon! Lets #MakeAmericaGreatAgain! __HTTP__ _E_
Heading into the 12 days with great negotiating strength because of our tremendous economy. __HTTP__ _E_
"I also protect myself by being flexible. I never get too attached to one deal or one approach." – THE ART OF THE DEAL _E_
...Maybe the best thing to do would be to cancel all future press briefings and hand out written responses for the sake of accuracy??? _E_
The @WSJ Wall Street Journal loves to write badly about me. They better be careful or I will unleash big time on them. Look forward to it! _E_
My @nbcdfw int. by @EricKingNBC5 w/@IvankaTrump discussing the Sunday @nbc premiere of @ApprenticeNBC's 14th season __HTTP__ _E_
Trump Int'l Hotel & Tower Chicago has received accolades for design service & our signature restaurant Sixteen __HTTP__ _E_
I was referring to a backstop for pre existing conditions. I will eliminate the law in its entirety & replace it w/ something much better. _E_
This Russian connection non sense is merely an attempt to cover up the many mistakes made in Hillary Clinton's losing campaign. _E_
On 59th & Park Avenue Trump Park Avenue transformed the legendary Hotel Delmonico into 120 luxury residences __HTTP__ _E_
Lets #MakeAmericaGreatAgain Maryland! #VoteTrump __HTTP__ _E_
. @Newsmax__Media is one of the top media outlets in the country. @ChrisRuddyNMX has revolutionized political commentary and reporting. _E_
The deplorables came back to haunt Hillary.They expressed their feelings loud and clear. She spent big money but in the end had no game! _E_
Axl Rose should take his #rockhall2012 honors and be happy. Stop the no induction nonsense. Do it for your fans @axlrose. _E_
Pocahontas is at it again! Goofy Elizabeth Warren one of the least productive U.S. Senators has a nasty mouth. Hope she is V.P. choice. _E_
'The Clinton Foundation's Most Questionable Foreign Donations'#PayToPlay #DrainTheSwamp __HTTP__ _E_
Big announcement coming soon regarding South Carolina... _E_
Hope & Change. Millions are losing their healthcare plans & ObamaCare is taking cancer patients' doctors away __HTTP__ _E_
The White House should stop publicly pressuring Israel on Iran. Iran's nuclear program is the threat not Israel's right to self defense. _E_
... the ratings of Shark Tank. Everyone was hitting on me until the numbers came in—and now—dead silence! _E_
Work has begun ahead of schedule to build the greatest golf course in history: Trump International – Scotland. _E_
My statement as to what's happening in Sweden was in reference to a story that was broadcast on @FoxNews concerning immigrants & Sweden. _E_
I am not trying to get top level security clearance for my children. This was a typically false news story. _E_
I will be on The Situation Room with @wolfblitzer from 5 7pm est on CNN _E_
.@CoachDanMullen Great to have you and your GREAT team at Trump National Doral. Go out and finish your fantastic season in style! _E_
Great meeting with Governor Mapp of the #USVI. He is very thankful for the great job done by @FEMA and First Responders. __HTTP__ _E_
#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_
Remember @JebBush wants COMMON CORE (education from D.C.) and is very weak on ILLEGAL IMMIGRATION ( come as act of love ). Not a leader! _E_
An honor to welcome the Taoiseach of Ireland @EndaKennyTD to the @WhiteHouse today with @VP Pence. __HTTP__ _E_
Looking forward to being interviewed on the @marklevinshow tonight at 6:30 PM EST. Be sure to listen! _E_
Crooked Hillary Clinton was not at all loyal to the person in her rigged system that pushed her over the top DWS. Too bad Bernie flamed out _E_
Iran is going to buy 114 jetliners with a small part of the $150 billion we are giving them...but they won't buy from U.S. rather Airbus! _E_
Sad just 16% of American parents think their children will be better off than them __HTTP__ We can do much better! _E_
Playing politics with the Keystone decision? @BarackObama vetos 20000 jobs and cheaper oil. _E_
Via @gatewaypundit: "Mother of Murdered Teen Thanks Donald Trump During Senate Hearing" __HTTP__ _E_
I will be interviewed on @seanhannity tonight at 10pmE on @FoxNews. Enjoy! _E_
America wasted billions and precious lives in Iraq and Iran will soon take control very very sad. _E_
The Dems want to stop tax cuts good healthcare and Border Security.Their ObamaCare is dead with 100% increases in P's. Vote now for Karen H _E_
If you really want to succeed you'll have to go for it every day. The big time isn't for slackers. Keep up your stamina and remain curious. _E_
RT @AnnCoulter: I hear Churchill had a nice turn of phrase but Trump's immigration speech is the most magnificent speech ever given. _E_
Windmills are the greatest threat in the US to both bald and golden eagles. Media claims fictional 'global warming' is worse. _E_
Thank you America! #Trump2016 __HTTP__ _E_
...never allow the Republicans to pass even great legislation. 8 Dems control will rarely get 60 (vs. 51) votes. It is a Repub Death Wish! _E_
Must read article by @boonepickens & @AmbJohnBolton: "America's Untapped Energy Weapon" __HTTP__ We don't need foreign oil! _E_
Crooked @HillaryClinton's foundation is a CRIMINAL ENTERPRISE. Time to #DrainTheSwamp! __HTTP__ #BigLeagueTruth #Debate _E_
"Government's first duty is to protect the people not run their lives." – Ronald Reagan _E_
The @MissUSA 2012 contestants standing outside of Trump Tower in New York City __HTTP__ @MissUSA 2012 tomorrow at 9PM ET NBC. _E_
.@TrumpChicago's Spa has an array of 5 star services12 treatment rooms & 53 spa guestrooms w/great views __HTTP__ _E_
The independent watchdog who exonerated @BarackObama for the failed green energy loans just donated $52500 to Obama's campaign. _E_
Imagination is more important than knowledge. Albert Einstein _E_
The Pope should not have resigned—he should have lived it out. It hurts him it hurts the church... _E_
You are right more like the opening of the Tonys. _E_
I can't believe we are not asking South Korea for anything. They make a fortune on us while we spend a fortune defending them how stupid! _E_
The U.S. Senate should switch to 51 votes immediately and get Healthcare and TAX CUTS approved fast and easy. Dems would do it no doubt! _E_
RT @JackPosobiec: Dick Durbin called Trump racist for wanting to end chain migration. Here's a video of Dick Durbin calling for an end to... _E_
It has been 1000 days since @BarackObama has passed a budget. He continues to spend this country into the ground without any control. _E_
Presidents and their administrations have been talking to North Korea for 25 years agreements made and massive amounts of money paid...... _E_
Via @EllonTimesKenny: Trump course sparks international interest __HTTP__ _E_
The Great State of Arizona where I just had a massive rally (amazing people) has a very weak and ineffective Senator Jeff Flake. Sad! _E_
.@JebBushAt the debate you said your brother kept us safe I wanted to be nice & did not mention the WTC came down during his watch 9/11. _E_
Thank you to our fantastic veterans. The reviews and polls from almost everyone of my Commander in Chief presentation were great. Nice! _E_
Obama has zero credibility on oil and coal. If we do not win energy as a country we just do not win period! _E_
From the Wall Street Journal: Google Steps Into Autism Research re @autismspeaks __HTTP__ _E_
Via @CBNNews' @TheBrodyFile: "Poll: Donald Trump in GOP Top Tier for President" __HTTP__ _E_
Great and we should boycott Fake News CNN. Dealing with them is a total waste of time! __HTTP__ _E_
If the gov't shuts down it is because Obama wants to make working Americans buy ObamaCare while businesses and gov't are exempt. _E_
.@MissUniverse ratings were great! A big win and a wonderful night! __HTTP__ _E_
Negotiation is a true talent. It is an art. And our politicians are killing our country b/c they don't have it. @SRQRepublicans speech _E_
RT @DanScavino: .@POTUS @realDonaldTrump signs executive orders on trade that will set the stage for revival in American manufacturing. #Am... _E_
Surprise – China has spies throughout NASA stealing our R&D __HTTP__ When will we ever make them pay for espionage? _E_
I am confident when American public gets to know @MittRomney the race will go his way. He's honorable & successful man polls looking good. _E_
CBO estimates over 2.3M jobs will be lost due to ObamaCare __HTTP__ Elections have consequences. _E_
Just landed from Paris France. It was an incredible visit with President @EmmanuelMacron. A lot discussed and accomplished in two days! _E_
Penn Jillette shows his dark side in new crowdfunded film Director's Cut __HTTP__ @pennjillette @bradwyman _E_
Excellent story on @MittRomney very good moment for Ryan. #VPDebate _E_
I'll be signing copies of my new book @TimeToGetTough tomorrow at Trump Tower (5th Avenue between 56 and 57) from noon to 2pm. _E_
Stuart Stevens is a dumb guy who fails @ virtually everything he touches. Romney campaignhis booketc. Why does @andersoncooper put him on? _E_
I agree! The headline says it all. #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Our economy is in trouble. The unemployed are more likely to drop out of the workforce than find a job. We need growth and now! _E_
Tom Brady just did it again. He is not only a great guy he is without question the BEST quarterback! _E_
It's Thursday. How much money did Barack Obama waste today on crony green energy projects? _E_
Jeb Bush just said about Marco Rubio he's my friend! Pure political speak. Why can't he be truthful and say disloyal guy no friend! _E_
Great job @IvankaTrump! #RNCinCLE __HTTP__ _E_
Don't attack Syria an attack that will bring nothing but trouble for the U.S. Focus on making our country strong and great again! _E_
500 of the most vicious prisoners escaped from an Iraq prison today. That country is a time bomb waiting to happen a total corrupt mess! _E_
I would absolutely consider investing in Atlantic City again great and hard working people but much would have to change taxes regs. etc _E_
Wow now leading in @ABC /@washingtonpost Poll 46 to 45. Gone up 12 points in two weeks mostly before the Crooked Hillary blow up! _E_
Thank you for your support. Together we will MAKE AMERICA SAFE AND GREAT AGAIN!#POTUSAbroad #USA __HTTP__ _E_
A Rod should do the Yankees a favor and never play again. _E_
#HappyIndependenceDay #July4 #USA __HTTP__ _E_
It was a great privilege to meet with President Moon of South Korea.Stay tuned! #UNGA __HTTP__ _E_
#1. Keep the big picture in mind. There are always opportunities and possibilities and thinking too small can negate a lot of them. _E_
We continue to lose our nation's finest in Afghanistan almost daily. The Rules of Engagement are costing lives. _E_
Concerns over the national debt are stopping businesses from hiring and expanding __HTTP__ Obama's policies are unsustainable _E_
How can @BarackObama invoke Richard Nixon against @MittRomney when Obama just used Executive Privilege on Fast & Furious?! _E_
THANK YOU! #Trump2016 __HTTP__ _E_
RT @JaydaBF: VIDEO: Muslim migrant beats up Dutch boy on crutches! __HTTP__ _E_
Thank you @TheTodaysGolfer for the wonderful statement that the new par 3 9th hole @Trump Turnberry could be the most dramatic in Britain. _E_
RT @kevcirilli: CEDAR RAPIDS TRUMP'S DAUGHT IVANKA: I can just say without equivocation my father will make America great again. _E_
Boy is this guy @ShepNewsTeam tough on me. So totally biased. As a reporter he should be ashamed of himself! #Trump2016 _E_
We are going to have a great time in Cleveland. Will lead to special results for our country. We will Make America Great Again! _E_
The Fed's reckless monetary policies will cause problems in the years to come. The Fed has to be reined in or we will soon be Greece. _E_
Cutting taxes and simplifying regulations makes America the place to invest! Great news as Toyota and Mazda announce they are bringing 4000 JOBS and investing $1.6 BILLION in Alabama helping to further grow our economy! __HTTP__ _E_
Anybody that believes in strong borders and stopping illegal immigration cannot vote for Marco Rubio READ THIS: __HTTP__ _E_
Americans already believe that @PaulRyanVP is better qualified to serve as President over @JoeBiden __HTTP__ No surprise. _E_
Oil is under $50/barrel. Now is the time to increase sanctions against Iran not lift them. No deal is better than a bad deal. #ArtOfTheDeal _E_
Doctors have already died treating Ebola __HTTP__ We should not be importing the disease to our homeland. _E_
If you can't run your own house you certainly can't run the White House A statement made by Mrs. Obama about Crooked Hillary Clinton _E_
I have been saying for weeks for President Obama to stop the flights from West Africa. So simple but he refused. A TOTAL incompetent! _E_
Wow little Mac Miller has almost 100 million views on his song Donald Trump. Keep pushing Mac and come up with another hit just do it! _E_
Why is the NFL getting massive tax breaks while at the same time disrespecting our Anthem Flag and Country? Change tax law! _E_
My @FoxNews with @gretawire discussing the Keystone pipeline Re election is more important than 20000 jobs and (cont) __HTTP__ _E_
#TrumpVlog @Rosie wasn't even a short term fix at The View. __HTTP__ _E_
The Wall Street Journal stated falsely that I said to them "I have a good relationship with Kim Jong Un" (of N. Korea). Obviously I didn't say that. I said "I'd have a good relationship with Kim Jong Un" a big difference. Fortunately we now record conversations with reporters... _E_
The first 90 days of my presidency has exposed the total failure of the last eight years of foreign policy! So true. @foxandfriends _E_
Health insurance premiums are rising by double digits __HTTP__ Another tax to the consumer by Obama Care. Enjoy! _E_
Thank you on my way! __HTTP__ _E_
Today in Florida I pledged to stand with the people of Cuba and Venezuela in their fight against oppression cont: __HTTP__ _E_
I have gotten to know many Spanish speaking people as the owner of Trump National Doral in Miami. They are smart hard working and great _E_
My interview with @Newsmax_Media where I explain that gas is headed to $5 $6 and why @RickSantorum can't win __HTTP__ _E_
My experience in Iowa was a great one. I started out with all of the experts saying I couldn't do well there and ended up in 2nd place. Nice _E_
With Sen. Elizabeth Dole & @DoleFoundation Caregiver Fellows. Tremendous people caring for our military & veterans! __HTTP__ _E_
Must read via @IowaGOP by @shanevanderhart: "Congress Should Vote No on Syria" __HTTP__ _E_
Wind Power Company Fined $1 Million for Killing Birds. Golden eagles among victims... __HTTP__ @alexsalmond @Aberdeenshire _E_
With almost 1.3 million followers and rising really fast everyone is asking me to critique things(and people). Finally I will be a critic. _E_
I want to thank evangelical Christians for the warm embrace I've received on the campaign trail. Video: __HTTP__ _E_
Liberty University speech by DJT was biggest by far in school's history. Standing ovations...great young people! _E_
How many more billions of dollars will @BarackObama continue to waste in these solar companies? _E_
Beginning today the United States of America gets back control of its borders. Full speech from today @DHSgov:... __HTTP__ _E_
"Keep a good attitude and do the right thing even when it's hard. When you do that you are passing the test." @JoelOsteen _E_
You know what is the worst part of @BarackObama's Tuesday speech playing class warfare we paid for it with our tax dollars. _E_
People have got to stop working to be so politically correct and focus all of their energy on finding solutions to very complex problems! _E_
It is so great to be back home! Looking forward to a great rally tonight in Bethpage Long Island! _E_
With my friends at the great @Adidas Boost event at the @cadillacchamp at @trumpdoral __HTTP__ _E_
Not only did the $1B ObamaCare website not work it can't even protect your personal information __HTTP__ A disaster. _E_
Smart move by the Democrats to have Pres. @billclinton play a key role in their convention. _E_
I think the @NewYorkObserver was far too nice to sleazebag @AGSchneiderman. He's got plenty more to worry about!. _E_
Find something for everyone on your list with this Holiday Gift Guide from @TrumpSoHo on @TrumpCollection's Tumblr: __HTTP__ _E_
The important thing is not to stop questioning. Curiosity has its own reason for existing. Albert Einstein _E_
Tickets are now available for the 2015 @CadillacChamp at @TrumpDoral March 4 8: __HTTP__ _E_
In his own words @BarackObama was born in Kenya and raised in Indonesia and Hawaii. This statement was made (cont) __HTTP__ _E_
New ad concerning lightweight Senator Marco Rubio: __HTTP__ _E_
Catch me on Fox News right now my interview with Neil Cavuto __HTTP__ _E_
France was just stripped of its AAA bond rating. With the PMs radical tax rates... _E_
So now that Matt Lauer is gone when will the Fake News practitioners at NBC be terminating the contract of Phil Griffin? And will they terminate low ratings Joe Scarborough based on the "unsolved mystery" that took place in Florida years ago? Investigate! _E_
Must read article by @EmilyMiller: "Anthony Weiner is a twit who treats women like dirt" __HTTP__ _E_
Via @CNNPolitics by @JDiamond1: "Trump: RNC call was 'congratulatory'" __HTTP__ _E_
Our soldiers can't even have any more joint exercises with Afghan soldiers because they are getting shot in the (cont) __HTTP__ _E_
With @VanityFair circulation and advertising revenue doing so badly rumor has it that dopey Graydon Carter is going to resign? He should. _E_
Very much enjoyed my tour of the Smithsonian's National Museum of African American History and Culture...A great job done by amazing people! _E_
The media is unrelenting. They will only go with and report a story in a negative light. I called Brexit (Hillary was wrong) watch November _E_
Young entrepreneurs should always remember that if you do not promote yourself no one else will! _E_
Obama vacationing in West Palm Beach starting tomorrow. He should play a round at Trump Int'l Golf Club #1 rated course in Florida. _E_
Via @PJMedia_com by @NicholasBallasy: "Trump Calls Election a 'Big Blow to Obama... I Think He's in Denial'" __HTTP__ _E_
Well maintained real estate is always going to be worth a lot more than poorly maintained real estate. The Art of the Deal _E_
It is a great honor for me to be inducted into the @WWE Hall of Fame. This will take place on April 6... _E_
TONIGHT! NORTH CAROLINA: __HTTP__ GEORGIA: __HTTP__ NEVADA: __HTTP__ _E_
.@BernardGoldberg was not good tonight on @oreillyfactor. He just doesn't know about winning! But he is a nice guy. _E_
Our President must be very careful with the 28 year old wack job in North Korea. At some point we may have to get very tough blatant threats _E_
Last October on @meetthepress @chucktodd attacked @jack_welch and I for saying Obama cooked the job number. Will he apologize? _E_
If you are a young entrepreneur just entering the business world I highly recommend that you read The Art of (cont) __HTTP__ _E_
It's Wednesday how many more of our embassies will be stormed by Islamists? _E_
Obama is tougher on WWII vets wanting to visit a DC memorial than Iran. He needs to show respect to our vets and not play games. _E_
While I am a critic of President Obama I hate it when someone (Robert Gates) writes a self serving negative book about his boss. _E_
You never know when the tide is going to turn in your favor. It's important to never give up on yourself. Think Like a Champion _E_
Thank you @scottienhughes for your powerful words on @FoxNews. I am with the Evangelicals and Tea Party big time. We will all WIN together! _E_
The Trump Tower restaurant Trump Grill just received the highest sanitary inspection grade possible "A" – the food is also great! _E_
Via @MiamiHerald: Donald Trump to be inducted into WWE Hall of Fame __HTTP__ _E_
Crooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person! _E_
Hank Greenberg formerly of AIG gave $10 million to the @JebBush campaign 3 months ago. He is not happy a total waste of money! _E_
Kate Middleton is great but she shouldn't be sunbathing in the nude only herself to blame. _E_
'President Donald J. Trump Approves Emergency Declarations' __HTTP__ __HTTP__ __HTTP__ __HTTP__ _E_
A bad manager such as @BarackObama will continually be plagued by scandals. __HTTP__ Leadership starts at the top. _E_
China attempted to sell embargoed computers to Iran __HTTP__ China loves these deals! _E_
Due to the holiday I will NOT be doing Fox & Friends this morning. Next Monday at 7. _E_
It is a shame Keystone wasn't powered by solar panels and wind because then @BarackObama would have wasted billions on it. _E_
Our trade deficit just jumped in May to "the second highest level on record" __HTTP__ FAIR trade not free trade. I TOLD YOU. _E_
There is ZERO margin for error on Ebola. Are we confident in Obama when he can't even make a website for $5 Billion? _E_
THANK YOU INDIANA! #Trump2016 __HTTP__ _E_
Record snowfall & freezing temps throughout the country. Where is Global Warming when you need it?! _E_
Re: Success Don't put blinders on and do not limit yourself reach out seek and explore. Think big at all times. _E_
Scary. Our military is a using a Chinese made satellite for North Africa command communications __HTTP__ _E_
Read Donald Trump's Top Ten Tips for Success: __HTTP__ _E_
I'd like to call JEB a liar but the truth is he has no clue & never revealed that he used Eminent Domain when criticizing me! (1/2) _E_
Thank you to all of the supporters who far out numbered the protesters yesterday at the Women's U.S. Open. Very cool! _E_
Congrats to Pres.Obama and Dems. CBO has TRIPLED its estimate of working hours lost due to ObamaCare __HTTP__ Job Killer _E_
It was just announced by sources that no charges will be brought against Crooked Hillary Clinton. Like I said the system is totally rigged! _E_
RT @gatewaypundit: Democrat Fire Marshal Turns THOUSANDS of Trump Supporters Away at Columbus Rally __HTTP__ via @gatewaypun... _E_
I really enjoyed being at the Iowa State Fair. The crowds love and enthusiasm is something I will never forget. _E_
Despite winning the second debate in a landslide (every poll) it is hard to do well when Paul Ryan and others give zero support! _E_
"Remember people's names and small details about them. Use both in conversation... _E_
Thank you for inviting me to the Western Conservative Summit in Colorado! #ImWithYou #WCS16 __HTTP__ __HTTP__ _E_
Adrian also gives autographs if you stop by the lobby of @TrumpTowerNY. #CelebApprentice _E_
I talk about Obamacare in today's #TrumpVlog __HTTP__ _E_
The new job figures don't include 315000 people who have given up looking for jobs. _E_
We are building China's wealth by buying all their products even though we make better products in America. _E_
'Presidential Executive Order on Strengthening the Cybersecurity of Federal Networks and Critical Infrastructure'... __HTTP__ _E_
Each day that Iran delays the deal if that is what you call it we must add another sanction and make them progressively tough. _E_
Via @HeraldBusiness by @hannahbsampson: "@TrumpDoral looking to hire hundreds" __HTTP__ _E_
The @nytimes is so poorly run and managed that other family members are looking to take over control. With unfunded liabilities big trouble! _E_
My thoughts on last night's meeting with @SarahPalinUSA in today's #trumpvlog... __HTTP__ _E_
Heading for Atlanta tomorrow morning for noon speech at North Atlanta Trade Center. Big crowds great people! _E_
Georgetown should not host @KathleenSebelius for the graduation ceremony. Her policies abuse Catholics. _E_
.@PolitiTrends @realdonaldtrump is dominating the discussion on Twitter with 79352 mentions today (via __HTTP__ ) _E_
RT @GovAbbott: To ensure your safety ahead of #Harvey heed warnings from local officials & review important safety information. __HTTP__ _E_
83% of the government is still running during the shutdown while 41% of nondefense federal workers are furloughed. Room for cuts. _E_
I am in Baton Rouge where the Miss USA Pageant will be shown live on NBC on Sunday night for 3 hours starting at 8 P.M. INCREDIBLE SHOW! _E_
My @Newsmax_Media interview discussing OPEC US gas resources @MittRomney and running a campaign against @BarackObama __HTTP__ _E_
Irrelevant clown @KarlRove sweats and shakes nervously on @FoxNews as he talks bull about me. Has zero cred. Made fool of himself in '12. _E_
Remember after this new episode starts 5 MINUTES! _E_
The Republicans have been played into a trap by the President they forgot the 14th amendment..... _E_
The convention in Cleveland will be amazing! __HTTP__ _E_
I just arrived in Miami where I will be checking out construction of the brand new Trump National Doral always closely watch construction! _E_
Last night in his SOTU @BarackObama claimed that he is a friend of Israel. Does anyone really believe that. _E_
We need a #POTUS with great strength & stamina. Hillary does not have that.#Trump2016 __HTTP__ __HTTP__ _E_
My Monday @foxandfriends interview discussing the fiscal cliff negotiations making the big deal and who has the cards __HTTP__ _E_
I will be interviewed from Cleveland Ohio on @seanhannity Tonight at 10:00 P.M. Enjoy! _E_
Clinton made a false ad about me where I was imitating a reporter GROVELING after he changed his story. I would NEVER mock disabled. Shame! _E_
When I made the Apprentice the #1 show in the US that was a good day for you... _E_
If Obama wins it is the end of the Republican party. @limbaugh _E_
Mainstream (FAKE) media refuses to state our long list of achievements including 28 legislative signings strong borders & great optimism! _E_
.@TrumpCollection's @DoralResort renovations are revitalizing Miami. The new course will be a great challenge __HTTP__ _E_
The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_
Celebrity Apprentice is rebroadcasting last weeks episode at 9 P.M. WITH A GREAT NEW EPISODE FEATURING @MELANIA TRUMP AT 10 P.M. AMAZING! _E_
My interview last week with Greta van Susteren is available here in slightly abridged form. __HTTP__ Good info to know about. _E_
Obama is looking like an incompetent fool in the handling of the war against.ISIS! Why isn't China and Russia helping they gain so much! _E_
The negative television commercials about me paid for by the politicians bosses are a total #Mediafraud. When you watch remember! _E_
If @MittRomney has a good debate tomorrow night Obama is finished! _E_
I am on @foxandfriends at 7:00 A.M. ENJOY! _E_
Heading to the great state of Mississippi at the invitation of their popular and respected Governor @PhilBryantMS. Look forward to seeing the new Civil Rights Museum! _E_
Will be landing in Knoxville Tennessee shortly tremendous crowd expected. It's all very simple we want to #MakeAmericaGreatAgain! _E_
Honored to receive an endorsement from @SJSOPIO thank you! Together we are going to MAKE AMERICA SAFE & GREAT AG... __HTTP__ _E_
Thank you Indiana! #Trump2016 __HTTP__ _E_
Today we are going to win the great state of MICHIGAN and we are going to WIN back the White House! Thank you MI!... __HTTP__ _E_
Mariano Rivera is greatest closer of all time. A leader in the club house & an exceptional man. One of the best @Yankees in history. _E_
Why would anyone think Obama would attack Syria the day of his speech in Washington. He doesn't want to detract from his press & glory. _E_
If the GOP will have any chance to beat @BarackObama in November the great people of Michigan need to support @MittRomney's candidacy. _E_
I always said the people we fought for in Libya were bad news. Once again I was right. _E_
Making my speech. #WWEHOF __HTTP__ _E_
Don't forget the three hour episode of Celebrity Apprentice this Sunday night 8pm 11pm on NBC. You're in for a (cont) __HTTP__ _E_
The windfarm approval in Scotland is subject to many conditions that can never be met will be tied up in courts for years! #EOWDC _E_
I never did give anybody hell. I just told the truth and they thought it was hell. Harry S. Truman _E_
#HappyMothersDay! __HTTP__ __HTTP__ _E_
HAPPY THANKSGIVING to everyone I love you all even my many enemies (sometimes!). _E_
We must protect our veterans. #MakeAmericaGreatAgain __HTTP__ _E_
How do you like the boardroom so far? _E_
Wow l just found out that A.G. Schneiderman met with President Obama in Syracuse on Thursday and sued me on Saturday! Same as IRS etc. _E_
Bureaucratic red tape and overregulation are discouraging the American dream. It's time for a bold new direction! __HTTP__ _E_
When somebody challenges you unfairly fight back be brutal be tough don't take it. It is always important to WIN! _E_
Get out to VOTE on 11/8/2016 and we will #DrainTheSwamp!RASMUSSEN NATIONAL Trump 43%Clinton 41% __HTTP__ _E_
Dopey Sugar @Lord_Sugar I hear your ratings last week were at an all time low you better get them up or you'll be fired. _E_
This show was taped just before the terrible Bill Cosby revelations came to light.She still should have asked him for money goes to charity. _E_
Look what happened to the autism rate from 1983 2008 since one time massive shots were given to children __HTTP__ _E_
Just got home watching the news and every story is bad about the U.S. Someday we will return to being great again but we need leadership! _E_
CHILD CARE REFORMS THAT WILL MAKE AMERICA GREAT AGAIN!Transcript: __HTTP__ __HTTP__ _E_
Our incredible U.S. Coast Guard saved more than 15000 lives last week with Harvey. Irma could be even tougher. We love our Coast Guard! _E_
Via @ABCPolitics by @ajdukakis & @rickklein: Mr. Trump Goes to Washington And Talks 2016 __HTTP__ _E_
Franklin such a great photo. HAPPY 99th BIRTHDAY to your father @BillyGraham! __HTTP__ _E_
Will Team Power be able to withstand Omarosa as PM? Smooth sailing is not expected. _E_
#MakeAmericaGreatAgain __HTTP__ _E_
Don't talk to me about Bush I was never a defender or a fan! _E_
Mexico's court system corrupt.I want nothing to do with Mexico other than to build an impenetrable WALL and stop them from ripping off U.S. _E_
What's incredible is that Obamacare hasn't even kicked in yet and already it's doing tremendous damage. (cont) __HTTP__ _E_
.@KirstenPowers New book is excellent and so true! Congrats! _E_
The ever dwindling @WSJ which is worth about 1/10 of what it was purchased for is always hitting me politically. Who cares! _E_
I hope @MittRomney now starts asking for any & all of @BarackObama's sealed records it's time. _E_
I am very proud of @StephenBaldwin7's performance in the record 13th season of All Star @CelebApprentice. Watch. _E_
Very nice @HuffingtonPost @pollsterpolls has me in first place at 18% and Bush second at 14% __HTTP__ _E_
My @greta int. on @FoxNews with @MELANIATRUMP at OPO discussing my potential candidacy & making America great again __HTTP__ _E_
China demanded that we raise our debt ceiling and then their rating agency downgraded us. Our leaders are hope... (cont) __HTTP__ _E_
The Amateur! First @BarackObama was caught bowing to the Saudi King but now the President of Mexico! __HTTP__ _E_
Always protect against the downside the upside will take care of itself. Donald J. Trump _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Whether you have someone managing your finances or you're doing it yourself money like anything takes maintenance & planning to grow. _E_
Elite Traveler & the 12 Best Hotel Room Views in the World __HTTP__ #TrumpChicago _E_
Team Power+@LilJon= Spielberg? Let's find out. #CelebApprentice _E_
Congratulations to David Wright on signing a long term extension with the @Mets. David is an exceptional player and person. _E_
The @Washingtonpost reported about the closing hotels in Atlantic City but knowingly failed to report that I am not involved left years ago _E_
Watch ET tonight to find out what my beautiful wife will be wearing at the Met Gala! __HTTP__ _E_
...Re: China I told you that a long time ago. __HTTP__ _E_
Thank you to Tom Brady Coach Ditka Coach Bobby Knight and all of the many champions that have been so supportive! _E_
Yesterday Obama campaigned with JayZ & Springsteen while Hurricane Sandy victims across NY & NJ are still decimated by Sandy. Wrong! _E_
Via @TIME by @ZekeJMiller: "Trump Talks Politics at His Virginia Winery" __HTTP__ _E_
Be sure to check @fundanything to see my picks __HTTP__ _E_
Just out: The same Russian Ambassador that met Jeff Sessions visited the Obama White House 22 times and 4 times last year alone. _E_
RT @ErinBurnett: Sat down w/ @EricTrump @DonaldJTrumpJr here in Iowa. Talked God @realDonaldTrump late night tweets __HTTP__ _E_
Seth Myers is so unnatural and uncomfortable doing his show that you have to feel sorry for him. Bad interviewer marbles in his mouth! _E_
.@davidaxelrod David Thank you my great honor for a very worthy cause! _E_
This will prove to be a great time in the lives of ALL Americans. We will unite and we will win win win! _E_
Will be on Fox & Friends tomorrow morning at 7.00. Will be discussing the disgusting and wasteful $635 million website rollout and more! _E_
Rise high in affordable luxury. Trump Parc Stamford offers gracious living with entertainment spaces __HTTP__ _E_
I have a gift for my loyal viewers of All Star @ApprenticeNBC Mrs. @MELANIATRUMP debut on this week's episode __HTTP__ _E_
We boarded the helicopter for Sarasota earlier & will be landing soon! See you there. #Trump2016 __HTTP__ _E_
Thank you Macomb County Michigan! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Received a #HurricaneHarvey briefing this morning from Acting @DHSgov Secretary Elaine Duke @FEMA_Brock @TomBossert45 and COS John Kelly. __HTTP__ _E_
CNN will soon be the least trusted name in news if they continue to be the press shop for Hillary Clinton. _E_
FLASHBACK: Donald Trump Answers Boy's Prayer for New Bike __HTTP__ via @FoxNewsInsider _E_
Mark it on your calendar: Comedy Central Roast March 15th at 10:30 pm for the Roast of Trump __HTTP__ _E_
Enough about my ties etc. @Macys but they are doing really big numbers people love them (and @Macys loves Trump)! _E_
When everything seems to be going against you remember that the airplane takes off against the wind not with it. Henry Ford _E_
#TrumpVlog China is laughing at U.S. __HTTP__ _E_
Entrepreneurs: Success is good. Success with significance is even better. _E_
Every poll shows high approval of the new sign on @TrumpChicago. I am honored by the great support. _E_
My @foxandfriends interview discussing Pres. Obama playing golf w/@TigerWoods US Airways American merger & oil __HTTP__ _E_
Negotiation tip: Be reasonable & flexible. Being open to change could lead you into a fortunate situation and open the door to innovation. _E_
Since @BarackObama is on such a transparency kick how about releasing Fast & Furious info to Brian Terry's family? __HTTP__ _E_
RT @foxandfriends: President Trump vows America will respond to North Korean threats with fire & fury in a warning to the rogue nation ht... _E_
Many Syrian 'rebels' are radical Jihadis. Not our friends & supporting them doesn't serve our national interest. Stay out of Syria! _E_
When true golfers see what I do at Doral it will be the hottest club in the country. #sayfie #newsmax _E_
I salute all Tea Party Patriots for marching on DC today. Stand strong! _E_
ICYMI The ALS #IceBucketChallenge that Trumps them all __HTTP__ @MissUSA @MissUniverse @DonaldJTrumpJr @EricTrump _E_
Polling shows nearly 7 in 10 Americans support an immigration reform package that includes DACA fully secures the border ends chain migration & cancels the visa lottery. If D's oppose this deal they aren't serious about DACA they just want open borders. __HTTP__ _E_
"Some people dream of great accomplishments while others stay awake and do them." Anonymous _E_
Good news @RasmussenPoll has @MittRomney beating @BarackObama 49% 44% __HTTP__ Obama was up by 5% at same point in '08. _E_
Looks like Plan B is stuck with the mechanical dog. @THEGaryBusey has latched on and won't let go. #CelebApprentice _E_
A resort in Arizona is using sewage to make snow. Environmentalists are going crazy I won't be skiing in that snow. _E_
Tomorrow the House votes on #KatesLaw & No Sanctuary For Criminals Act. Lawmakers must vote to put American safety... __HTTP__ _E_
"The best entrepreneurs believe the true measure of success has to do with the number of jobs their business creates." – Midas Touch _E_
We will continue to follow developments in Charlottesville and will provide whatever assistance is needed. We are ready willing and able. __HTTP__ _E_
Our country will soon be relegated to THIRD WORLD status if proper decisions are not made by our president. He was never qualified for job! _E_
It's sad to see once decent newspapers like @USAToday failing so badly. I just don't know if they can be saved. _E_
Aetna CEO: Obamacare in 'Death Spiral' #RepealAndReplace __HTTP__ _E_
"Donald Trump: 'I Will Take Full Credit' for Romney Dropping Out" __HTTP__ via @Newsmax_Media by @ssfitzgerald _E_
....agencies not just the FBI & DOJ now the State Department to dig up dirt on him in the days leading up to the Election. Comey had conversations with Donald Trump which I don't believe were accurate...he leaked information (corrupt)." Tom Fitton of Judicial Watch on @FoxNews _E_
Departing for #GOPDebate. Let's #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_
Today I signed the Global War on Terrorism War Memorial Act (#HR873.) The bill authorizes....cont __HTTP__ __HTTP__ _E_
Durst is a disaster at operating the new World Trade Center. It takes forever for workers or visitors to get in with impossible security. _E_
Others claim they can make America great again but only one knows The Art of The Deal. It's time for an outsider __HTTP__ _E_
A great new poll 33%! __HTTP__ _E_
When will @BarackObama release an actual budget? _E_
.@MittRomney should have been more aggressive last night. Yet some polls have him winning the debate. _E_
great business in total in order to fully focus on running the country in order to MAKE AMERICA GREAT AGAIN! While I am not mandated to .... _E_
Glad to see that @RondaRousey lost her championship fight last night. Was soundly beaten not a nice person! _E_
Happy birthday to my friend the great @jacknicklaus a totally special guy! _E_
Business is easy. Dealing with people is hard. If you are an entrepreneur your most important job is to choose who works with you. _E_
The media is so dishonest. If I make a statement they twist it and turn it to make it sound bad or foolish.They think the public is stupid! _E_
Today I signed the Veterans (OUR HEROES) Choice Program Extension & Improvement Act @ the @WhiteHouse. #S544 Watch... __HTTP__ _E_
If my many supporters acted and threatened people like those who lost the election are doing they would be scorned & called terrible names! _E_
Latin America's tallest building @TrumpPanama is the perfect getaway location to celebrate the New Year in luxury __HTTP__ _E_
General Petraeus should stop apologising and get on with his life. He is a good man and should have a great future. _E_
When Obama tried to tweak his previous statement on ObamaCare he made it an even greater lie even the Senate Democrats are angry with him! _E_
.@SenRandPaul's Tea Party rebuttal to Obama's SOTU explained why limited government promotes freedom. Well done! _E_
If you want to kill any idea in the world get a committee working on it. Charles Kettering _E_
If you're sitting in an office working in a job you hate then it's time to THINK BIG and plan your next step... _E_
I've been saying it for a long long time. #NoKo __HTTP__ _E_
Both are looking good! Now we begin! _E_
.@ronsirak Thank you for being so fair this morning on @GolfChannel—greatly appreciated. _E_
.@hardball_chris is a really dumb guy(and I know him well)—that's why he works swimmingly with our leaders in Washington. _E_
We are building our future with American hands American labor American iron aluminum and steel. Happy #LaborDay! __HTTP__ _E_
..under a magnifying glass they have zero tapes of T people colluding. There is no collusion & no obstruction. I should be given apology! _E_
Crimea was TAKEN by Russia during the Obama Administration. Was Obama too soft on Russia? _E_
Congratulation to Jane Timken on her major upset victory in becoming the Ohio Republican Party Chair. Jane is a loyal Trump supporter & star _E_
Oh the wonders of the Arab Spring. Our new allies in Egypt the Muslim Brotherhood just called the Holocaust a myth __HTTP__ _E_
Jeb Bush spent more than $40000000 in New Hampshire to come in 4 or 5 I spent $3000000 to come in 1st. Big difference in capability! _E_
New York we will make America great again! __HTTP__ _E_
Via @clarechampion by @DanDanaherNews: "Wind Farm Proposal Near @Trump_Ireland Rejected" __HTTP__ _E_
South Carolina and the audience were GREAT THANKS! _E_
The World as we know it is falling apart. Much of the blame can be attributed to the fact that the United States is no longer respected! _E_
The @Broncos had a truly bad day my advice is to go home forget about it and come back tough next year. _E_
"You want to compete and you want to compete at the highest level." @boonepickens _E_
Republicans must be careful with immigration—don't give our country away. _E_
Unemployment rate only dropped because more people are out of labor force & have stopped looking for work.Not a real recovery phony numbers _E_
I'm going to do what @MittRomney was totally unable to do WIN! _E_
Can you believe that Sony chief Amy Pascal wants to meet with Al Sharpton to seek forgiveness for her racial slurs. Al is laughing at her! _E_
.@serenawilliams had a flawless @usopen quarterfinal win last night. She's a great player and a wonderful person. _E_
"But if someone has a gun and is trying to kill you... it would be reasonable to shoot back with your own gun." @DalaiLama _E_
.@BrandenRoderick did a great job on All Star Celebrity @ApprenticeNBC. Raised a lot of money for charity while looking great. _E_
Will be covering President Obama's speech at 9.00 on Twitter you are all so lucky! _E_
Great shot by @KingJames yesterday. Lebron is a tough competitor who delivers under pressure. _E_
We will confront ANY challenge no matter how strong the winds or high the water. I'm proud to stand with Presidents for #OneAmericaAppeal. _E_
I had 15000 people in Phoenix but @politico said the rooms capacity is just over 2000. But said Bernie Sanders had 11000 in same room. _E_
My prayers and condolences to the victims and families of the terrible tragedy in Nice France. We are with you in every way! _E_
Entrepreneurs: Having an ego and acknowledging it is a healthy choice. There's nothing wrong with bringing your talents to the surface. _E_
REPEAL AND REPLACE!!! #ObamaCareInThreeWords _E_
HAPPY BIRTHDAY to the United States Air Force!! __HTTP__ _E_
I sure hope the sexting pervert Anthony Weiner runs for mayor. Will be great fun watching him both lose and be humiliated. _E_
"If it's worth doing it's worth fighting for. You'll have lots of people & obstacles in your way. Fight to get beyond them."–Midas Touch _E_
Today I was honored to be joined by Republicans and Democrats from both the House and Senate as well as members of my Cabinet to discuss the urgent need to rebuild and restore America's depleted infrastructure. __HTTP__ __HTTP__ _E_
I am in Miami at Trump National Doral. Just gave out contract to build a new ballroom and luxury suites. Blue Monster complete opens Dec 14. _E_
A drug free A Rod is just an average baseball player.@Yankees will soon move him down in the batting order & should renegotiate his contract _E_
Massive record setting snowstorm and freezing temperatures in U.S. Smart that GLOBAL WARMING hoaxsters changed name to CLIMATE CHANGE! $$$$ _E_
House Democrats want a SHUTDOWN for the holidays in order to distract from the very popular just passed Tax Cuts. House Republicans don't let this happen. Pass the C.R. TODAY and keep our Government OPEN! _E_
Obama just appointed an Ebola Czar with zero experience in the medical area and zero experience in infectious disease control. A TOTAL JOKE! _E_
RT @JoeNBC: Remarkable how cost effective Post says Trump campaign was per vote and stunning how much Jeb spent per vote. __HTTP__ _E_
No matter how good the replacement refs do they will be soundly criticized they can't win! _E_
"I pride myself on being obstinate stubborn & tough. I think those are important qualities found in successful people." – Think Big _E_
RT @WhiteHouse: Dr. King's dream is our dream. It is the American Dream. It's the promise stitched into the fabric of our Nation etched i... _E_
Just spoke to @JohnKasich to express condolences and prayers to all for the horrible shooting of two great police officers from @WestervillePD. This is a true tragedy! _E_
With proper thinking and leadership we can have a much better plan than Obamacare something that works for the people and costs much less _E_
Publicity seeking Lindsey Graham falsely stated that I said there is moral equivalency between the KKK neo Nazis & white supremacists...... _E_
.@JoseCanseco who I got to know very well during #CelebApprentice can't carry @SHAQ's jock. _E_
As a former host of Saturday Night Live I look forward to attending tonight! _E_
"Go for the jugular so that people watching will not want to mess with you." – Think Big _E_
RT @TuckerCarlson: .@RichardGrenell : @realDonaldTrump told Tillerson he had the full support of the U.S. Gov't to bring #OttoWarmbier home... _E_
A strong America creates opportunity and growth. We just need to change Washington. Let's Make America Great Again! __HTTP__ _E_
Do you ever notice that @CNN gives me very little proper representation on my policies. Just watched nobody knew anything about my foreign P _E_
The dishonest media does not report that any money spent on building the Great Wall (for sake of speed) will be paid back by Mexico later! _E_
RT @TeamTrump: RT if you agree @realDonaldTrump WON the #Debate BIG LEAGUE! #MAGA __HTTP__ _E_
Now we have a once in a lifetime opportunity to RESTORE AMERICAN PROSPERITY – and RECLAIM AMERICA'S DESTINY.But in order to achieve this bright and glowing future the SENATE MUST PASS TAX CUTS – and bring Main Street roaring back to life! __HTTP__ __HTTP__ _E_
A feature on the progress of the course @ #Trump Int'l #Golf Club will feature on @CNNLivingGolf Thurs 8 May 2014 @ 0930 & 1630 GMT #DAMAC _E_
.@hardball_chris Did you forget about Bill Ayers & so many others? You should apologize to all the people you offended yesterday. _E_
So much interest in my visit to Scotland! I greatly look forward to attending the opening event @TrumpTurnberry taking place on June 24th. _E_
Great job on @Greta @DonaldJTrumpJr. Nobody could have done it better! _E_
RT @GOP: .@IvankaTrump: This administration is committed to keeping working families at the forefront of our agenda. __HTTP__ _E_
Via @CNNMoney by @AaronSmithCNN: The Donald wins. Trump name coming off casino __HTTP__ _E_
If the GOP Establishment really wants to defeat @BarackObama then they should read #TimeToGetTough. _E_
Sen.Richard Blumenthal who never fought in Vietnam when he said for years he had (major lie)now misrepresents what Judge Gorsuch told him? _E_
They must be kidding can this be happening #Oscars _E_
Will be interviewed on @Morning_Joe at 7:00 A.M. So much to talk about! _E_
Give the public a break The FAKE NEWS media is trying to say that large scale immigration in Sweden is working out just beautifully. NOT! _E_
The only global warming that people should be concerned with is the global warming caused by nuclear weapons because of our weak U.S. leader _E_
I will be making a big surprise announcement to the massive crowd assembled in Huntsville/Madison Alabama! Landing now! #Trump2016 _E_
Take a look at __HTTP__ and __HTTP__ to see these beautiful hotels. _E_
You're just not getting there @DanaPerino. Sometimes things just don't work out but don't worry no problem! _E_
I will be interviewed on @foxandfriends at 7:30. Things are looking good had a great Easter look forward to spending the week in Wisconsin! _E_
Then on June 25th back to the USA to MAKE AMERICA GREAT AGAIN! _E_
In the very least Congress must defund Obama's unconstitutional amnesty order. _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Looking forward to promoting a pro growth & positive message at this Saturday's @Citizens_United @AFPhq Freedom Summit in Manchester. _E_
Small crowds at @RedState today in Atlanta. People were very angry at EWErickson a major sleaze and buffoon who has saved me time and money _E_
I will be interviewed on Face The Nation @CBSNews at 10:30 A.M. Should be interesting ENJOY! _E_
Focus on your goals not your problems. And remember problems are a mind exercise so enjoy the challenge. _E_
Rexnord of Indiana is moving to Mexico and rather viciously firing all of its 300 workers. This is happening all over our country. No more! _E_
.@TrumpChicago is Chicago's sole destination showcasing a Five Star @Forbes rating for both hotel & restaurant __HTTP__ _E_
Fitch has downgraded our credit outlook to negative. Why? @BarackObama's failure to lead with the Super Committee. __HTTP__ _E_
Thank you Colorado! An honor to win @NBC @9News #GOPDebate Poll. __HTTP__ _E_
.@DineshDSouza's '2016: Obama's America' is expanding to over 1000 theaters this weekend. Will be highest grossing documentary in 2012. !! _E_
Busy day—working on buying a major property—and creating lots of jobs. _E_
I agree with Marco Rubio that Ted Cruz is a liar! _E_
Great meetings will take place today at Trump Tower concerning the formation of the people who will run our government for the next 8 years. _E_
Glad to see that Jamie Dimon passed yesterday's shareholder vote. The JP Morgan stock holders understand that a good CEO is worth keeping. _E_
Hillary Clinton is taking the day off again she needs the rest. Sleep well Hillary see you at the debate! _E_
.@CNBC is pushing the @GOP around by asking for extra time (and no criteria) in order to sell more commercials. _E_
Thank you Pennsylvania! #Trump2016 __HTTP__ _E_
.@Macy's is a big contributor to @PPFA . Anybody against Planned Parenthood should boycott racial profiling Macy's. _E_
Sleepy Chuck Todd of NBC falls far short of the late great Tim Russert. _E_
Former President Jimmy Carter is so happy that he is no longer considered the worst President in the history of the United States! _E_
THe Westminster Dog Show asked if I'd be interested in meeting Hickory a Scottish Deerhound who won Best in Show. She came to visit today! _E_
President Obama should ask the DNC about how they rigged the election against Bernie. _E_
Richard Mourdock a very good man running for the Senate in Indiana. Hopefully he will win! @richardmourdock _E_
Ted Cruz is a nervous wreck. He is making reckless charges not caring for the truth! His poll #'s are way down! _E_
Take the time to be thorough in whatever you undertake. Remain open to new ideas. Remain fluid not fixed in your expectations. _E_
I'll be on @foxandfriends on Monday at 7:30 AM tune in! _E_
We pay for Obama's travel so he can fundraise millions so Democrats can run on lies. Then we pay for his golf. _E_
When will @BarackObama release his transcripts? What is he hiding? _E_
I will win the election against Crooked Hillary despite the people in the Republican Party that are currently and selfishly opposed to me! _E_
Congratulations to #TeamUSA🏆on your great @PresidentsCup victory! __HTTP__ _E_
For what is the best choice for each individual is the highest it is possible for him to achieve. Aristotle _E_
#trumpvlog My thoughts on Afghanistan @RickSantorum and why I fired two people on this week's #CelebApprentice... __HTTP__ _E_
Pocahontas just stated that the Democrats lead by the legendary Crooked Hillary Clinton rigged the Primaries! Lets go FBI & Justice Dept. _E_
'16 Fake News Stories Reporters Have Run Since Trump Won' __HTTP__ _E_
Republicans have a last chance to do the right thing on Repeal & Replace after years of talking & campaigning on it. _E_
A photo delivered yesterday that will be displayed in the upper/lower press hall. Thank you Abbas! __HTTP__ _E_
Reverend Wright must have great hatred for Obama and the manner in which he was shunted aside. _E_
Right now 4000 U.S. troops are stupidly heading to West Africa to help fight Ebola.No help from China Russia or wealthy African oil nations _E_
First segment of my @seanhannity @FoxNews interview discussing @GOP are terrible negotiators & lost all their cards __HTTP__ _E_
No deal was made last night on DACA. Massive border security would have to be agreed to in exchange for consent. Would be subject to vote. _E_
....likewise billions of dollars gets brought into Mexico through the border. We get the killers drugs & crime they get the money! _E_
I am enjoying my travels across Europe but home is where the heart is. Looking forward to coming back to the family in New York very soon. _E_
Obama is a terrible negotiator. He bails out Chrysler and now Chrysler wants to send all Jeep manufacturing to China and will! _E_
GDP was revised upward to 3.1 for last quarter. Many people thought it would be years before that happened. We have just begun! _E_
My condolences and prayers to the victims of the terrorist attack in Paris. _E_
Great article a must read by Peter Ferrara at @Forbes about The Biggest Government Spender in World History __HTTP__ _E_
Everyone should go see @HatingBreitbart. Great documentary showcasing @AndrewBreitbart's legacy. _E_
Everybody is asking about my announcement this Wednesday concerning Barack Obama just wait and see! _E_
If @TedCruz doesn't clean up his act stop cheating & doing negative ads I have standing to sue him for not being a natural born citizen. _E_
.@JustinRose99 Great playing today in the Scottish Open. I see our practice facility is helping—use it a lot! _E_
"Our side needs Donald Trump." @AnnCoulter on @seanhannity's show last night. Thanks Ann. _E_
Just leaving Akron Ohio after a packed rally. Amazing people! Going now to Texas. _E_
Wow just came out on secret tape that Crooked Hillary wants to take in as many Syrians as possible. We cannot let this happen ISIS! _E_
Lyin' Ted Cruz who can never beat Hillary Clinton and has NO path to victory has chosen a V.P.candidate who failed badly in her own effort _E_
.@PennJillette and @StephenBaldwin7's arguments are making the edit room look like the boardroom. #CelebApprentice _E_
"Is business success a natural talent? I think it's a combination of aptitude work and luck." – Think Like a Champion _E_
If I were President I would push for proper vaccinations but would not allow one time massive shots that a small child cannot take AUTISM. _E_
So true Ivanka! __HTTP__ _E_
Pretty audacious for Obama to call @MittRomney a BSer when he has lied about so much we don't have room to write. _E_
A great day in Wisconsin many stops many great people! Melania is joining me on Monday. Big crowds. MAKE AMERICA GREAT AGAIN! _E_
Thank you for your continued support! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
#trumpvlog Same last name same bad ratings @lawrence and @rosie..... __HTTP__ _E_
As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_
My official #MakeAmericaGreatAgain hat is now available online. To shop please visit __HTTP__ it is selling fast! _E_
Looks like we will have a pervert running for mayor after all just what New York City needs and he will revert back to form always do! _E_
In '08 @BarackObama hit Bush for secrecy __HTTP__ When will Obama release all his sealed college records?! _E_
.@AlexSalmond suffered a huge defeat by the people of Blackdog. Communities all over Scotland are fighting this loser. _E_
Goofy Elizabeth Warren sometimes referred to as Pocahontas pretended to be a Native American in order to advance her career. Very racist! _E_
I am deeply disturbed by what I have read in the case of @TrayvonMartin. I support a full investigation and justice. _E_
More great news as a result of historical Tax Cuts and Reform: Fiat Chrysler announces plan to invest more than $1 BILLION in Michigan plant relocating their heavy truck production from Mexico to Michigan adding 2500 new jobs and paying $2000 bonus to U.S. employees! __HTTP__ _E_
Will be interviewed by @SeanHannity on @FoxNews tonight at 10pm from Pennsylvania. Enjoy! #Trump2016 __HTTP__ _E_
Our new allies in Egypt the Muslim Brotherhood have close relations with Iran __HTTP__ We never should have abandoned Mubarak. _E_
The U.S. has a 60 billion dollar trade deficit with Mexico. It has been a one sided deal from the beginning of NAFTA with massive numbers... _E_
The acclaimed @TrumpChicago soars 92 stories high. You're either staying in @TrumpChicago or in its shadow. __HTTP__ _E_
Increasing America's debt weakens us domestically and internationally. US Senator @BarackObama 2007 _E_
Prediction: Rand Paul has been driven out of the race by my statements about him he will announce soon. 1%! _E_
The Democrats are delaying my cabinet picks for purely political reasons. They have nothing going but to obstruct. Now have an Obama A.G. _E_
.@HuffingtonPost actually gave me a positive story yesterday! _E_
Remember our brave men & women who have fallen protecting our country this Memorial Day! _E_
Law enforcement & military did a spectacular job in Hamburg. Everybody felt totally safe despite the anarchists. @PolizeiHamburg #G20Summit _E_
'What I Like About Trump ... and Why You Need to Vote for Him' __HTTP__ _E_
RT @EricTrump: __HTTP__ _E_
If ObamaCare is such a wonderful law then why does Obama summarily change the law before an election? _E_
Social media has changed the news & communication landscape for good. Everything must be up to date by the second instead of the hour or day _E_
Mechanical dog is going to be trending tonight. #MechanicalDog #CelebApprentice _E_
Crooked Hillary's V.P. pick said this morning that I was not aware that Russia took over Crimea. A total lie and taken over during O term! _E_
Entrepreneurs: Seek opportunity and see opportunity as a perk. You never know what will evolve. Keep an open mind! _E_
Thank you Roanoke Virginia this a MOVEMENT join us today! Sign up: __HTTP__ __HTTP__ _E_
Wow I have had so many calls from high ranking people laughing at the stupidity of the failing @nytimes piece. Massive front page for that! _E_
Great listening session with CEO's of the Retail Industry Leaders Association this morning! __HTTP__ _E_
Trump National Doral will have big crowds this weekend for the WGC. THE BLUE MONSTER IS READY FOR THE WORLD'S TOP FIFTY PLAYERS! _E_
Dems have been complaining for months & months about Dir. Comey. Now that he has been fired they PRETEND to be aggrieved. Phony hypocrites! _E_
When a complex website is broken the best thing to do is blow it up and start all over again then sue the culprits and use the proper team! _E_
The Letterman show really turned things around people finally understand my $5 million dollar offer to charity.... __HTTP__ _E_
Hillary says this election is about judgment. She's right. Her judgement has killed thousands unleashed ISIS and wrecked the economy. _E_
Bernie caved! __HTTP__ _E_
I have proven to be far more correct about terrorism than anybody and it's not even close. Hopefully AZ and UT will be voting for me today! _E_
Melania and I look forward to being with President Xi & Madame Peng Liyuan in China in two weeks for what will hopefully be a historic trip! __HTTP__ _E_
Tracking 149 polls from 29 pollsters nationwide/HuffPost Pollster #GOP __HTTP__ _E_
.@MattGinellaGC It's true Matt the NEW Blue Monster is better than Pinehurst so is Bedminster. Turnberry & Trump Aberdeen blow it away! _E_
There can never be a sharp economic recovery until @BarackObama is out of the White House. _E_
Today proves what I have always known that @Reince Priebus is the tough one and the smart one not Debbie Wasserman Shultz (@DWStweets.) _E_
I LOVE NEW YORK! #NewYorkValues __HTTP__ _E_
People do not assume this but more than anything else I like helping people. Be at Trump Tower at 11 AM today. _E_
It's sad that the WH is punishing children from across the country by closing all tours. Doesn't have to be. WH should take my offer. _E_
My @6abc int. with @Jim_Gardner on Atlantic City Philadelphia's real estate market & 2014 2016 elections __HTTP__ _E_
Cruz lies are almost as bad as Jeb's. These politicians will do anything to stay at the trough! _E_
The ties shirts and suits at Macy's are doing fantastically well check out the new designs and low prices nothing better! _E_
Tonight on @ExtraTV I'm talking #CelebApprentice. Tune in! _E_
Receiving the Algemeiner Liberty Award a great honor. __HTTP__ _E_
How could Michael Forbes get Scot of the Year when he lost—badly—to me & Andy Murray a true Scot who won the U.S. Open & Olympic gold? _E_
Just like Jonathan Gruber viciously lied & called Americans "stupid" on ObamaCare many consultants are doing the same on Global Warming. _E_
Wow even lowly Rand Paul has just past @JebBush in the new @CNN Poll. Jeb is at 3% I'm at 39%. Stop throwing your money down the drain! _E_
I will be on On The Record @gretawire tonight at 7 PM _E_
Unemployment is now 7.9%. Four years and $6.5T later that is really bad! _E_
Make sure you take some time to enjoy the weekend. Important for your mind and will help you be productive next week. _E_
New Reuters poll! Thank you! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Record of Health: __HTTP__ #Trump2016 _E_
Will be at the Women's U.S. Open today! _E_
Will be doing Fox & Friends in a few minutes hope you enjoy! _E_
Come on MLB do the right thing! Let @PeteRose_14 into the Hall. No drugs—just hard work and talent! _E_
"Good communicators control space." – Roger Ailes 'You Are The Message' @FoxNews _E_
An honor to welcome PM of Australia @TurnbullMalcolm to America & join him in marking the 75th Anniversary of the... __HTTP__ _E_
Innovation distinguishes between a leader and a follower. Steve Jobs _E_
Where is the progress in the state of New York over the last three years? There is none only backwards. _E_
The government is borrowing 46 cents on every dollar it spends __HTTP__ Dangerous for us but great news for China. _E_
What a time we all had in Iowa yesterday massive overflow crowd. Love them! _E_
Heading now to Pella Iowa. Big crowd! Remember Trump is a big buyer of Pella windows. See you soon! _E_
Anderson Silva just got knocked out by new champion Chris Weidman! Congrats to Chris. _E_
.@MarkSteynOnline Thank you and great job on @seanhannity tonight! _E_
The big loss yesterday for Israel in the United Nations will make it much harder to negotiate peace.Too bad but we will get it done anyway! _E_
#GOPDebate #Trump2016 __HTTP__ _E_
So much for creating American jobs @BarackObama gave $529 Million to a Green car company so they can be manufactured in Finland. _E_
President Obama if it is important to you I will substantially increase the $5M offer! _E_
She's baaack! @Rosie needs me to salvage her dying career. But it won't help she's got no talent & no persona. Too many tv cancellations! _E_
Wow what a nice honor! __HTTP__ _E_
U.S. Stock Market up almost 20% since Election! _E_
Looking forward to being hosted by @NickLangworthy's Erie County Lincoln Leadership Reception tonight. Record crowd! Can't wait. _E_
An aerial shot of Jacksonville crowd yesterday! I may as well show you because the media won't. #Trump2016 __HTTP__ _E_
The very outdated filibuster rule must go. Budget reconciliation is killing R's in Senate. Mitch M go to 51 Votes NOW and WIN. IT'S TIME! _E_
Amazing that Derek Jeter played with an injury throughout most of last night's @yankees game and did so well. _E_
How do you spend over $635M on websites and they don't work? _E_
Why did @MittRomney give his tax returns without demanding that Obama release his college records & applications in return? _E_
Does anyone believe that @BarackObama did not fully write or review the 1991 publisher booklet? _E_
Attention all hackers: You are hacking everything else so please hack Obama's college records (destroyed?) and check place of birth _E_
The spotlight has finally been put on the low life leakers! They will be caught! _E_
.@TrumpToronto was just voted the #1 hotel in Canada in Conde Nast Traveler's prestigious Reader's Choice Awards __HTTP__ _E_
Money was never a big motivation for me except as a way to keep score. The real excitement is playing the game. The Art of the Deal _E_
RT @DanScavino: Join @realDonaldTrump LIVE in Denver Colorado via his #Facebook page we are here!!#MakeAmericaGreatAgain __HTTP__ _E_
Campaigning to win the Electoral College is much more difficult & sophisticated than the popular vote. Hillary focused on the wrong states! _E_
25 days to go until fiscal cliff (bad name)—it is only a fiscal curb! Debt ceiling is real fiscal cliff...and that will be interesting! _E_
The United States is experiencing the coldest weather in decades with vast amounts of snow blanketing many states.Pendulum has swung to cool _E_
..North Korea is a rogue nation which has become a great threat and embarrassment to China which is trying to help but with little success. _E_
Thank you Michigan! #Trump2016 _E_
Crooked Hillary has never created a job in her life. We will create 25 million jobs. Think she can do that? Not a c... __HTTP__ _E_
We're missing a lot of information on autism. Support @AutismSpeaks' project by visiting mss.ng #MSSNG _E_
It's @BarackObama who wants to raise all our taxes who applauds China for cutting their taxes! (cont) __HTTP__ _E_
When will all the haters and fools out there realize that having a good relationship with Russia is a good thing not a bad thing. There always playing politics bad for our country. I want to solve North Korea Syria Ukraine terrorism and Russia can greatly help! _E_
Debate was somewhat hard to watch last night. Viewership will be way down. _E_
Obamacare is a disaster! Time to repeal & replace! #ObamacareFail __HTTP__ _E_
US tourists threaten to boycott Scotland over windfarms' __HTTP__ Write to Alex Salmond: firstminister@scotland.gsi.gov.uk _E_
RT @DonnaWR8: @realDonaldTrump You can boycott our anthem WE CAN BOYCOTT YOU! #NFL #MAGA __HTTP__ _E_
Who do you want negotiating for us? #MakeAmericaGreatAgain __HTTP__ _E_
I will be interviewed by @BretBaier @SpecialReport at 6pm ET tonight @FoxNews _E_
Congratulations to @JamesOKeefeIII on exposing more Democrat voter fraud. @DNC was caught red handed telling people to vote twice. _E_
Little Adam Schiff who is desperate to run for higher office is one of the biggest liars and leakers in Washington right up there with Comey Warner Brennan and Clapper! Adam leaves closed committee hearings to illegally leak confidential information. Must be stopped! _E_
My speech at @AmSpec Bartlet Gala Dinner where I received @boonepickens Entrepreneur Award __HTTP__ _E_
Watch this video for a look at our great course in Los Angeles Rancho Palos Verdes __HTTP__ @TrumpGolfLA _E_
.@MarkHalperin's and John Heilemann's book Double Down is an excellent read on the just passed election. Great book congrats! @jheil _E_
Congratulations Secretary Mattis! __HTTP__ _E_
Re my hair Should I change it? What do you think? _E_
.@VattenfallGroup doesn't have the finances or financial statement to build the hated windfarm in Aberdeen. _E_
It was a great honor to have President Xi Jinping and Madame Peng Liyuan of China as our guests in the United States. Tremendous... _E_
"Leverage: don't make deals without it." – The Art of The Deal _E_
The North Korean regime has pursued its nuclear & ballistic missile programs in defiance of every assurance agreement & commmitment it has made to the U.S. and its allies. It's broken all of those commitments... __HTTP__ _E_
It is time to send someone from the outside to fix DC from the inside. Let's Make America Great Again! __HTTP__ _E_
'Everything in Dubai': Learn from Emirate's rebound says @DonaldTrumpJr __HTTP__ via @Emirates247 by @Parag1301 _E_
Money was never a big motivation for me except as a way to keep score. The real excitement is playing the game. #TheArtofTheDeal _E_
LIVE on #Periscope: Watch major press conference live from @TrumpTowerNY now! #MakeAmericaGreatAgain __HTTP__ _E_
New PPP Poll just out Trump up big Cruz Rubio and Bush down. The debate results even with a stacked RNC audience were wonderful! _E_
Alabama people are saying their team has real football & real girlfriends—not good for Notre Dame—but they'll be back! _E_
My new book Midas Touch in stores now.... __HTTP__ #trumpvlog _E_
The president of the pathetic Club For Growth came to my office in N.Y.C. and asked for a ridiculous $1000000 contribution. I said no way! _E_
Via @thehill by @martinmatishak: "Trump: 'We look like we're beggars' in Iran nuclear talks" __HTTP__ _E_
Back in NY from Scotland and fighting for our country to get better. Trump International Golf Links Scotland opened to rave reviews. _E_
Leaving for Jacksonville now. See you there! Miami was great. _E_
There are so many blatant lies coming out of the ADMINISTRATION healthcare spying NSA IRS brutally killed Americans WILL IT EVER END? _E_
Jailed USMC Sgt Andrew Tahmooressi should be released immediately. Since when does Mexico care about border security?#BringBackOurMarine _E_
RT @FoxBusiness: #BreakingNews: U.S. employers added 209000 jobs in July unemployment rate down to 4.3% #JobsReport __HTTP__ _E_
Crooked Hillary Clinton mentioned me 22 times in her very long and very boring speech. Many of her statements were lies and fabrications! _E_
I am very surprised that @lancearmstrong gave up. I never thought he was a quitter... _E_
I really enjoyed being in New Hampshire & speaking for Joe McQuaid @deucecrew & the Nackey Loeb School @LoebSchool honoring James Foley. _E_
She is so sad and pathetic that I almost feel sorry for Sec.Sebelius. She has done great harm to many people and must be fired. Incompetent! _E_
Join me at 7:00 P.M. on Tuesday August 22nd in Phoenix Arizona at the Phoenix Convention Center! Tickets at: __HTTP__ __HTTP__ _E_
"Donald Trump Congratulated on @foxandfriends for Receiving the @Algemeiner's 'Liberty Award'" __HTTP__ via @Algemeiner _E_
Watching these politicians trying to get a deal done is truly painful Republicans are in a much stronger position than they think. _E_
In 1960 there were approximately 20000 pages in the Code of Federal Regulations. Today there are over 185000 pages as seen in the Roosevelt Room.Today we CUT THE RED TAPE! It is time to SET FREE OUR DREAMS and MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Univision wants to back out of signed @MissUniverse contract because I exposed the terrible trade deals that the U.S. makes with Mexico. _E_
Thanks @AndreaTantaros for all of your kind words and thoughts. Big progress is being made. Keep up the great work! _E_
It was an honor to have the amazing Root family join me in Iowa. I have been so inspired by their courage & bravery. __HTTP__ _E_
Here is a letter I received yesterday from someone who has had personal experience with our health care situation. __HTTP__ _E_
My interview with @gretawire last night on @FoxNews: @BarackObama 'Missed His Opportunity' __HTTP__ _E_
Hillary Clinton only knows how to make a speech when it is a hit on me. No policy and always very short (stamina). Media gives her a pass! _E_
#badratings @Lawrence's show failed at 8pm and is failing(even worse) at 10pm not long for tv..... _E_
Oh really check out innocent @megynkelly discussion on @HowardStern show 5 years ago I am the innocent (pure) one! __HTTP__ _E_
"Winners never quit and quitters never win." Vince Lombardi _E_
RT @seanhannity: Tonight the truth about how despicable the media and the left are in America today. We will name names. 9 est Hannity Fox... _E_
Is this really America? Terrible! __HTTP__ _E_
RT @accesshollywood: @realDonald Trump: 'Celebrity Apprentice' Season 5 is 'Tough Nasty & Smart.' Watch: __HTTP__ _E_
I would absolutely kill Jon Stewart(?) in a debate it would be no contest he's not fast enough or smart enough (only obnoxious enough!). _E_
Hillary Clinton surged the trade deficit with China 40% asSecretary of State costing Americans millions of jobs. _E_
With a world renowned open air lobby w/ ocean views & top restaurants @TrumpWaikiki is Honolulu's premier hotel __HTTP__ _E_
Thanks to everyone for your support on @CNBC's "Top Leaders Icons and Rebels" vv __HTTP__ Thanks for voting Trump! _E_
. @WWE's @WrestleMania XXIX less than 3 weeks away. Looking forward to being inducted into the Hall of Fame! _E_
John Kasich was never asked by me to be V.P. Just arrived in Cleveland will be a great two days! _E_
Via @NYDailyNews by Rich Schapiro: Donald Trump slams Mitt Romney Jeb Bush __HTTP__ _E_
The real story turns out to be SURVEILLANCE and LEAKING! Find the leakers. _E_
#FlashbackFriday Many big movies have filmed in my buildings. Here is @TrumpChicago in #Transformers 3. __HTTP__ _E_
groveling when he totally changed a 16 year old story that he had written in order to make me look bad. Just more very dishonest media! _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
USA has the greatest business people in the world but we let political hacks negotiate our deals. We need change! #BigLeagueTruth #Debate _E_
It must have been President Obama that called in what will go down as the DUMBEST PLAY IN THE HISTORY OF FOOTBALL. Same thought process! _E_
New business start ups at the lowest level in 30 years and the EPA is now the Employment Prevention Agency. @bobmcdonnell _E_
I am on @foxandfriends Enjoy! _E_
Thank you Indiana! Was great seeing everyone on Wednesday! I will be back soon! #Trump2016 __HTTP__ __HTTP__ _E_
ALL SAFE IN ORANGE COUNTY NORTH CAROLINA. With you all the way will never forget. Now we have to win. Proud of you all! @NCGOP _E_
Via @DailyCaller by @AlexPappas: "Donald Trump Headed To Iowa Says Ebola Is Further Proof Of Obama's Incompetence" __HTTP__ _E_
Via @washtimes: Donald Trump warns of 'dangerous precedent' in Cyprus bank skimming __HTTP__ _E_
What a great couple... Katherine Webb and AJ McCarron. They are both winners! _E_
Polls close but can you believe I lost large numbers of women voters based on made up events THAT NEVER HAPPENED. Media rigging election! _E_
Thank you for having me this morning @AmericanLegion. I enjoyed my time with everyone! #ALConvention2016 __HTTP__ _E_
I have never been a fan of John Edwards but it is time for the gov't to focus on more important things. @johnedwards _E_
While in politics it is often smart to send out false messages one thing is clear: That Hillary does not want to run against TRUMP. _E_
Joan Rivers @Joan_Rivers was an amazing woman and a great friend. Her energy and talent were boundless. She will be greatly missed. _E_
Not believable that Manti Te'o was in love for one year with a girl he never met she then died. He is either very stupid.... _E_
Thank you Indiana! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_
ObamaCare is a disaster. Americans will see record increases in their premiums and inferior care services. _E_
A fantastic day in D.C. Met with President Obama for first time. Really good meeting great chemistry. Melania liked Mrs. O a lot! _E_
Success is not final failure is not fatal: it is the courage to continue that counts. Winston Churchill _E_
RT @foxandfriends: Sec. Mattis: If North Korea fires missile at US it's 'game on' __HTTP__ _E_
"Tone it down? No way! Donald Trump needs to crank up the volume" __HTTP__ via @FoxNews by @toddstarnes _E_
A robust growing economy is how to fix Social Security and Medicare not cuts on Seniors. _E_
So true! __HTTP__ _E_
SC has kept us safe from exec amnesty for now. But Hillary has pledged to expand it taking jobs from Hispanic & African American workers. _E_
Eric Cantor's concession speech was ridiculous acted like nothing had just happened. WE NEED REAL LEADERS! _E_
We should cut off all aid to every country that does not respect our border. Why are we giving them money in the first place? _E_
The ill conceived windfarm that @AlexSalmond is pushing for Aberdeen will lose $50 million a year. Only a fool would build it or want it! _E_
"Keep focusing on doing what you love even if times are tough." – Think Big _E_
Rubio is totally owned by the lobbyists and special interests. A lightweight senator with the worst voting record in Senate. Lazy! _E_
Happy New Year from #MarALago! Thank you to my great family for all of their support. __HTTP__ _E_
Packed with holiday celebrations members & staff are enjoying the first Christmas season at @Trump_Charlotte __HTTP__ _E_
RT @TeamTrump: She put the office of Sec of State up for sale. If she ever got the chance she'd put the Oval Office up for sale too. #Fo... _E_
I have asked the reigning Miss Universe and Miss USA to do the honors. At least I will not have to wash my hair this morning! Enjoy. _E_
The NYPD has been doing a fantastic job protecting NYC. I hope Chief Ray Kelly is strongly considering running for mayor. _E_
In today's all new #TrumpVlog I discuss what a great honor it was to be inducted into the WWE Hall of Fame. __HTTP__ _E_
Via @Suntimes' @CSTearlyoften by @FSPIELMAN: "Council sign rules mean Trump name will loom large on river" __HTTP__ _E_
Everybody should boycott the @megynkelly show. Never worth watching. Always a hit on Trump! She is sick & the most overrated person on tv. _E_
.@CNN Will be interviewed by Jake Tapper at 9:00 A.M. Enjoy. _E_
Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_
Thanks for all the great comments on all my recent interviews. Much appreciated. _E_
I will be on@gretawire tonight at 10 P.M. Now I know she will get great ratings! _E_
On my way to Des Moines Iowa will see you soon with @mike_pence. Join us! Tickets: __HTTP__ #ThankYouTour2016 _E_
Sen. @MaxBaucus has announced his retirement. A major proponent of ObamaCare Baucus now says it's a 'huge train wreck.' _E_
Fact – Obama still has not fixed the backend of the ObamaCare website. This could be the greatest internet boondoggle in history. _E_
Will be in Bangor Maine today at 3pm join me! #MAGATickets: __HTTP__ __HTTP__ _E_
George Ross and I have done some great real estate deals together. He's a tough negotiator. #CelebApprentice _E_
Vote for Mar a Lago __HTTP__ _E_
Congratulations to @serenawilliams on her superb @usopen win. She is terrific! _E_
I've never seen anything like it everything he touches turns to gold! So nice a quote by Fred C.Trump about his son Donald (me!). _E_
Michelle Nunn supports Amnesty a weak border & ObamaCare. She is an Obama liberal. Send DC an independent voice. Vote @Perduesenate! _E_
...to stop drugs they want to take money away from our military which we cannot do." My standard is very simple AMERICA FIRST & MAKE AMERICA GREAT AGAIN! _E_
Thank you to Linda Bean of L.L.Bean for your great support and courage. People will support you even more now. Buy L.L.Bean. @LBPerfectMaine _E_
Thank you! An honor to be the first candidate ever endorsed by the @NRA prior to @GOPconvention! #Trump2016 #2A __HTTP__ _E_
George Will was critical of @MittRomney throughout the primary. Maybe it is because his wife was turned down for (cont) __HTTP__ _E_
Brooklyn Nets have the worst uniform ever Boring won't matter if they win( Winning solves all problems (cont) __HTTP__ _E_
Manufacturing is now less than 9% of US GDP. The Rust Belt heart of our country's factory sector has been destroyed by our leaders. _E_
#DrainTheSwamp #PhoenixRally __HTTP__ _E_
Why would Republican candidates want the support of Mitt Romney. He lost an election against Obama that should NEVER have been lost! _E_
Who should star in a reboot of Liar Liar Hillary Clinton or Ted Cruz? Let me know. __HTTP__ _E_
The true question for the @UN... __HTTP__ _E_
.@WSJ is bad at math. The good news is nobody cares what they say in their editorials anymore especially me! _E_
Great to see @Yankees Captain Derek Jeter back on the field. He will have another great season and make NYC proud again. _E_
"The Conference Board said that consumer sentiment was at its highest level in nearly 17 years in November. The Consumer Confidence Index rose from 126.2 in October to 129.5 notching its best reading since December 2000..." __HTTP__ _E_
Via @Reuters_Biz Trump flies into ex Soviet Georgia for tower project project __HTTP__ _E_
I will be interviewed today on Fox News Sunday with Chris Wallace at 10:00 (Eastern) Network. ENJOY! _E_
Our country is not going to have a comeback with any politician. my @SRQRepublicans speech _E_
Thank you to the GREAT NYPD First Responders and all govt officials for having handled the terrible West Side attack so professionally! __HTTP__ _E_
On my way! #Inauguration2017 __HTTP__ _E_
"Playing golf with business associates creates a relaxing atmosphere where everyone has fun... _E_
Fact Obama does not read his intelligence briefings nor does he get briefed in person by the CIA or DOD. Too busy I guess! _E_
President Obama said over and over again if you like your plan you can keep your plan PERIOD! This turned out to be a total lie 90 mill. _E_
If history teaches us anything it's that strong nations require strong leaders with clearly defined national (cont) __HTTP__ _E_
With our record debt & trillion $ deficits our $ is now at an all time low against the Chinese Yuan. Time for our gov't to work together. _E_
Can you believe the head of Iran refused to meet with our great President?—Zero respect! _E_
via American Spectator @AmSpec Trump Card by Jeffrey Lord __HTTP__ _E_
Thank you Missouri! #Trump2016 __HTTP__ _E_
Opening for the 2014 season soon the National Register landmark Mar a Lago Club is the crown jewel of Palm Beach __HTTP__ _E_
Use adverse events and monumental challenges to make you stronger. Think Big _E_
Thoughts and prayers to the families of the four great Marines killed today. _E_
Some of you were asking about the All Star line up for Celebrity Apprentice __HTTP__ _E_
Crooked Hillary Clinton says that she got more primary votes than Donald Trump. But I had 17 people to beat—she had one! _E_
Keep stimulating your mind with big ideas. Fill your mind with new information & use this information to spawn new ideas. Think Big _E_
Major Mexican cartel boss El Diego was just arrested with weapons provided to him through Fast and Furious __HTTP__ Media?? _E_
This is a tragedy. The real unemployment rate is 14.8% with over 23.2 million unemployed Americans. We can do much better. _E_
Democrats would do much better as a party if they got together with Republicans on HealthcareTax CutsSecurity. Obstruction doesn't work! _E_
Thank you OHIO! #TrumpPence16 __HTTP__ __HTTP__ _E_
Looking forward to meeting everyone at the North Carolina GOP this Friday where I will be the keynote speaker at the dinner. #GOP _E_
This Sunday's LIVE FINALE of @ApprenticeNBC will be tough & nasty. Be sure to watch @pennjillette & @TraceAdkins fight to the finish! _E_
Now @BarackObama is planning to have we the taxpayers pay off mortgages he will spend this country into the ground. __HTTP__ _E_
Trump Offers To Donate $5 Million To Charity If Obama Releases College Transcripts __HTTP__ via @rcpvideo _E_
#badratings @Rosie you will never make it. You are not funny or talented. _E_
Due to popular demand @lisarinna returns to the 13th season of All Star @CelebApprentice. Lisa's fans won't be disappointed! _E_
Obama is planning on attacking Romney on Bain in tomorrow's debate __HTTP__ Mitt should bring up college applications & records _E_
Today at 1:30PM CT I will be addressing @RepLeadConf in New Orleans __HTTP__ Will focus on how to fix our great country. _E_
China just put a tariff on US cars and trucks 22% China is laughing at our inept leaders. @BarackObama _E_
Thank you Mr. & Mrs. @TomBarrackJr for the wonderful and magical evening last night. It will not be forgotten. #Trump2016 _E_
The stock market and US dollar are both plunging today. Welcome to @BarackObama's second term. _E_
46 stories in the center of downtown New York @TrumpSoHo's 391 spacious rooms each have floor to ceiling windows __HTTP__ _E_
The #FakeNews MSM doesn't report the great economic news since Election Day. #DOW up 16%. #NASDAQ up 19.5%. Drilling & energy sector... _E_
Via @Newsmax_Media by @wandacarruthers: "Trump: GOP on Edge of Winning 'Big' and Forcing Obama to Act" __HTTP__ _E_
We should be focused on clean and beautiful air not expensive and business closing GLOBAL WARMING a total hoax! _E_
Via @WashTimes by @harperbulletin: "Donald Trump Goes to Washington" __HTTP__ _E_
"Christmas waves a magic wand over this world and behold everything is softer and more beautiful." Norman Vincent Peale _E_
"The difference between stupidity and genius is that genius has its limits." Albert Einstein _E_
My @foxandfriends interview discussing my friend Whitney Houston @SarahPalinUSA's CPAC speechthe economy and primary __HTTP__ _E_
Happy #FirstRespondersDay to all of our HEROES out there. We are forever grateful to you for your service sacrifice and courage 24/7/365! __HTTP__ _E_
Nancy Reagan the wife of a truly great President was an amazing woman. She will be missed! _E_
Doing my best to disregard the many inflammatory President O statements and roadblocks.Thought it was going to be a smooth transition NOT! _E_
Legendary basketball coach Bobby Knight who has 900+ wins many championships and a gold medal will be introducing... __HTTP__ _E_
While our wonderful president was out playing golf all day the TSA is falling apart just like our government! Airports a total disaster! _E_
A country that does not control or respect its own borders is a country destined for failure. Secure our borders! _E_
If I win the presidency my judicial appointments will do the right thing unlike Bush's appointee John Roberts on ObamaCare. _E_
"After every setback start thinking big as soon as possible." Think Big _E_
My @FoxNews interview with @TeamCavuto discussing my endorsement of @MittRomney and how I came to my decision __HTTP__ _E_
'Moderate' Repubs plotting against @GOP strategy have short term memories. Tea Party gave them majority in House & primaries aren't fun. _E_
"Each excellent thing once learned serves for a measure of all other knowledge." Philip Sidney _E_
Flashback: Donald Trump would fire A Rod __HTTP__ via @espn 10.17.12 _E_
He @BarackObama claims he does not want higher gas prices. That's not what he said in 2008: __HTTP__ _E_
Alison Grimes will protect the 'sanctity' of her Obama ballot yet admits she voted for Hillary in primary. Hypocrite. Vote @Team_Mitch! _E_
Honestly whether you're for or against ObamaCare the 635 million dollar website fiasco is bad for the U.S. It makes us look totally inept! _E_
I said this was happening long ago I will stop this immediately! __HTTP__ _E_
Building a personal brand? Then focus on being great. Focus on being the best at what you do. Excellent article: __HTTP__ _E_
A message to the great people of New Hampshire on this important day! #VoteTrumpNH Video: __HTTP__ __HTTP__ _E_
Most people do not know what Presient Obama is going to do to save his legacy. I do! He's got to get back to basics.Forget Syria FIX THE USA _E_
Global warming has been proven to be a canard repeatedly over and over again. __HTTP__ The left needs a dose of reality. _E_
Entrepreneurs: What is the standard for which you want to be known? Identify that standard & then establish it. Simple but not easy. Focus! _E_
My performance from last week on David Letterman @Late_Show will be re aired tonight at 11:35 PM on CBS. _E_
Many generals and military leaders are now saying I told you so! They say this will have big impact on military strength & national sec. _E_
Focus on your goals not on fixed patterns. Do what's necessary and what's unnecessary will be made clear. _E_
Two of the best ever episodes of Celebrity Apprentice tonight at 8. Totally vicious and crazy! I will live tweet. _E_
Thank you! __HTTP__ _E_
It wasn't Matt Lauer that hurt Hillary last night. It was her very dumb answer about emails & the veteran who said she should be in jail. _E_
Many people would like to see @Nigel_Farage represent Great Britain as their Ambassador to the United States. He would do a great job! _E_
RT @DiamondandSilk: The Media Says: The President Should Stop Tweeting about Russia. Well Why Don't the Media Take Their Own Advice & S... _E_
RT @statedeptspox: #GES2017 highlights the important role of women #entrepreneurs & demonstrates the importance of #innovation & partnershi... _E_
Take a sneak peek into one of Trump Park Avenue's most exclusive residences on the market __HTTP__ _E_
Join me live in Wilmington Ohio! __HTTP__ _E_
Why the hell did we help the Libyan rebels in the first place. That is the real scandal. _E_
Ralph Norman who is running for Congress in SC's 5th District will be a fantastic help to me in cutting taxes and.... _E_
A gallon of gas has more than doubled while @BarackObama has been POTUS and he still won't approve Keystone. _E_
Great! Last night @CelebApprentice winner @johnrich & alumni @RealMeatLoaf packed OH stadium rallying w/ @MittRomney __HTTP__ _E_
Interesting article by @MattTowery @townhallcom:"It Is Time to Use 'The Trump Card'" __HTTP__ Thanks Matt for the nice mention _E_
Oh no they are worried that they didn't read the Boston killer his rights and he may have a good legal argument. 12 year case to finish? t _E_
I was just given a great tour of Moscow fantastic hard working people. CITY IS REALLY ENERGIZED! The World will be watching tonight! _E_
Hopefully the Republican Party can come together and have a big WIN in November paving the way for many great Supreme Court Justices! _E_
.@Matt_Berry87 Piers did a great job the interview was very important. _E_
Together we are going to MAKE AMERICA SAFE AND GREAT AGAIN! __HTTP__ _E_
Opening in 2016 Trump Tower Punta del Este will bring our signature luxury living to the sands of Playa Brava __HTTP__ _E_
Some really dumb blogger for failing @VanityFair a magazine whose ads are down almost 18% this year said I wear a hairpiece I DON'T! _E_
I never did give anybody hell. I just told the truth and they thought it was hell. Harry S. Truman _E_
The Democrats dropped all references to God from their platform. Not good! _E_
James Gandolfini was a remarkable talent. He was also a decent man. We will all miss him. _E_
Re Negotiation: Know exactly what you want & focus on that. View conflict as an opportunity this will expand your mind and your horizons. _E_
.@AGSchneiderman Why is Douglas Durst allowed to use the Freedom Tower to get out of a lease with Conde Nast? _E_
Bad performance by Crooked Hillary Clinton! Reading poorly from the telepromter! She doesn't even look presidential! _E_
.@thehill discussing my @foxandfriends interview: Trump: 'Clamor for @MittRomney's tax returns has died down' __HTTP__ _E_
Great night in WI. I'm going to fight for every person in this country who believes government should serve the PEO... __HTTP__ _E_
Chris @hardball_chris Matthews ratings are at new historic lows. He is single handedly destroying the entire @msnbc channel. _E_
RT @JoeNBC: Pope Francis tear down that wall! #vaticanwalls __HTTP__ _E_
.@HeyTammyBruce Thank you for your nice words on Fox today. They never use my full statements on nuclear which you would agree with! _E_
Of course I don't think Jimmy Carter is dead saw him today on T.V. Just being sarcastic but never thought he was alive as President stiff! _E_
For those that don't think a wall (fence) works why don't they suggest taking down the fence around the White House? Foolish people! _E_
.@gretawire Greta—you're wrong Kirsten Powers is a dummy—wasn't she Anthony Weiner's girlfriend? _E_
Vladimir Putin said today about Hillary and Dems: In my opinion it is humiliating. One must be able to lose with dignity. So true! _E_
My visit to Japan and friendship with PM Abe will yield many benefits for our great Country. Massive military & energy orders happening+++! _E_
Thank you Michigan! #Trump2016 __HTTP__ _E_
However beautiful the strategy you should occasionally look at the results. Winston Churchill _E_
THANK YOU St. Augustine Florida! Get out and VOTE! Join the MOVEMENT and lets #DrainTheSwamp! Off to Tampa now!... __HTTP__ _E_
Congratulations to @TrumpWaikiki for being selected as Best of +VIP Access 2014 by @Expedia! _E_
Watch my @oreillyfactor appearance from this week discussing nuclear negotiations with Iran __HTTP__ _E_
While Obama is denying it he did receive intelligence about the attacks 3 days before __HTTP__ Too busy campaigning? _E_
Weekly AddressJoin me here: __HTTP__ __HTTP__ _E_
What a coincidence?! @BarackObama's campaign logo uses the same font as Cuban communist propaganda posters. __HTTP__ _E_
The world was gloomy before I won there was no hope. Now the market is up nearly 10% and Christmas spending is over a trillion dollars! _E_
Under @BarackObama 1 out of every 7 Americans is on food stamps. _E_
36 hrs Central Park as seen in @nytimes including a stop @TrumpNewYork for a bite in @Nougatine_NYC. Full article __HTTP__ _E_
With our amazing All Star cast @Joan_Rivers @johnrich @ArsenioOFFICIAL & @piersmorgan are also returning as boardroom advisors. _E_
It is simply immoral for the government to encourage able bodied Americans to think that a life on welfare of (cont) __HTTP__ _E_
"We would accomplish many more things if we did not think of them as impossible." Vince Lombardi _E_
Republicans have once again capitulated to Obama. This time on the Iran nuclear treaty. When will it end? _E_
Thank you Novi Michigan! Get out and VOTE #TrumpPence16 on 11/8. Together WE WILL MAKE AMERICA GREAT AGAIN!... __HTTP__ _E_
I was on the TODAY Show this morning and then visited Regis & Kelly. The Celebrity Apprentice starts this Sunday night—don't miss it! _E_
Can you imagine if the election results were the opposite and WE tried to play the Russia/CIA card. It would be called conspiracy theory! _E_
Goofy Elizabeth Warren is now using the woman's card like her friend crooked Hillary. See her dumb tweet "when a woman stands up to you..." _E_
A Rod is a less than average baseball player now that he is unable to use drugs. A Rod misrepresented to th... (cont) __HTTP__ _E_
Departing @JBA_NAFW for St. Charles Missouri to help push our plan for HISTORIC TAX CUTS across the finish line.A successful vote in the Senate this week will bring us one giant step closer to delivering an incredible victory for the American people! __HTTP__ __HTTP__ _E_
Today is #TrumpTuesday on @SquawkCNBC 7:30 AM. Tune in! _E_
.@HBO should fire @BillMaher and bring back @DennisDMZ someone that is actually funny. _E_
As election looms some bad news for Clinton Democrats: __HTTP__ _E_
These last 4 years have not had a single quarter over 4% GDP. Obama has overseen the weakest economic recovery in American history. _E_
Who is the dumbest man on TV? @Lawrence of MSNBC... __HTTP__ _E_
The Dollar is at an all time WWII low against the Yen. The Fed's recklessness is going to lead to record inflation. _E_
Via @HamptonsMag: @IvankaTrump Talks Hamptons Lifestyle with Emmy Rossum __HTTP__ _E_
It's true... Dennis is really into this very animated. I have never seen him this way before. _E_
VOTE TODAY! Go to __HTTP__ to find your polling location. We are going to Make America Great Again!... __HTTP__ _E_
Many people booed the players who kneeled yesterday (which was a small percentage of total). These are fans who demand respect for our Flag! _E_
The liberal media is focusing on @MittRomney's bank records. How about reviewing @BarackObama's illegal land deal contracts with Tony Rezko? _E_
A great honor to receive polling numbers like these. Record setting African American (25%) & Hispanic numbers (31%). __HTTP__ _E_
The problem with the U.S. is that our leadership has no knowledge or ability to negotiate or see into the future. Every nation beats us! _E_
I discuss yesterday's tragedy at the Boston Marathon in today's video blog. __HTTP__ _E_
The new reality – China's demand for oil now controls the market __HTTP__ And OPEC gets away with ripping us off at $105! _E_
If Goofy Elizabeth Warren a very weak Senator didn't lie about her heritage (being Native American) she would be nothing today. Pick her H _E_
The fact is you're not going to see real growth or create real jobs until we get these exorbitant energy costs (cont) __HTTP__ _E_
RT @NRA: But there IS something we will do on #ElectionDay: Show up and vote for the #2A! #DefendtheSecond #NeverHillary _E_
Good luck and best wishes to my dear friend the wonderful and very talented Joan Rivers! Winner of Celebrity Apprentice amazing woman. _E_
Kellyanne Conway went to @MeetThePress this morning for an interview with @chucktodd. Dishonest media cut out 9 of her 10 minutes. Terrible! _E_
Preliminary talks have begun for next season's #CelebrityApprentice. As usual we will have another great season. _E_
Why does @BarackObama continue to defend radical Islam? He is calling the Ft. Hood massacre workplace violence. _E_
Also appearing on the Miss USA Pageant will be Country Superstar Trace Adkins and Pop Rock Sensation Boys Like Girls... _E_
I will be in Washington D.C. tomorrow to receive the 2014 Joseph Wharton Award at the Wharton Club of D.C.—a great honor! @Wharton _E_
....John McCain has failed miserably to fix the situation and to make it possible for Veterans to successfully manage their lives. _E_
The gorgeous contestants of Trump Miss Universe are so excited to be simulcast on both @nbc and @Telemundo. Will be a beautiful show! _E_
Call me old school but I believe in the old warrior's credo that to the victor go the spoils. In other word... (cont) __HTTP__ _E_
Via @CBS19: Trump Winery President Nominated for Award by Wine Enthusiast Magazine __HTTP__ Congrats @EricTrump! _E_
Watch this amazing ad from @autismspeaks and learn the signs... __HTTP__ _E_
N.Y. City is paying FORTY MILLION DOLLARS to five men that many think are guilty as hell. So many facts should have been trial. Politics! _E_
My team of deplorables will be taking over my Twitter account for tonight's #debate#MakeAmericaGreatAgain _E_
It's finally happening Fiat Chrysler just announced plans to invest $1BILLION in Michigan and Ohio plants adding 2000 jobs. This after... _E_
... People love to hear their names and their stories said out loud." – Think Like a Billionaire _E_
My interview with @PaulWTalk on @wjrradio on behalf of @MittRomney discussing why Michigan needs to go for Romney. __HTTP__ _E_
Firing Bret was a tough one for me but Omarosa doesn't seem to mind. _E_
Spoke to Roy Moore of Alabama last night for the first time. Sounds like a really great guy who ran a fantastic race. He will help to #MAGA! _E_
Russia has never tried to use leverage over me. I HAVE NOTHING TO DO WITH RUSSIA NO DEALS NO LOANS NO NOTHING! _E_
The final part of restoring fiscal sanity to America is the most obvious and that's to control Obama style (cont) __HTTP__ _E_
Again don't forget to watch @hannityshow tonight on Fox at 9 o'clock EST. _E_
We crushed the original goal! I will write a $2 MILLION check to our campaign if we hit our end of month goal! __HTTP__ _E_
.@IvankaTrump @EricTrump & @DonaldJTrumpJr take no prisoners in boardroom of 'All Star' @CelebApprentice. Where do they get it from? _E_
You don't necessarily need the best location. What you need is the best deal. The Art of the Deal _E_
Trump Nears 100 days on Top via The Hill __HTTP__ _E_
The NYPD Surveillance Program kept NYC safe since 9/11. There will be tragic consequences for ending it. _E_
A GREAT HONOR to spend time with our BRAVE HEROES at the @USMC Air Station Yuma. THANK YOU for your service to the United States of America! __HTTP__ _E_
.@tedcruz must be doing something right if @cher sadly rated "the 4th ugliest celebrity" according to @listverse is attacking him. _E_
.@WashTimes states Democrats have willfully used Moscow disinformation to influence the presidential election against Donald Trump. _E_
.#Celebrityapprentice will be live tomorrow night. Entire cast will be there. Who do you like to win? _E_
.@MittRomney much better on Libya and Middle East problems. Obama has no answer. _E_
The tax scam Washington Post does among the most inaccurate stories of all. Really dishonest reporting. _E_
Various media outlets and pundits say that I thought I was going to lose the election. Wrong it all came together in the last week and..... _E_
My @gretawire interview discussing @billlmaher's comments attacks on @MittRomney and @CNN & @msnbctv's low ratings __HTTP__ _E_
I will be interviewed by @DavidMuir tonight at 10 o'clock on @ABC. Will be my first interview from the White House.... __HTTP__ _E_
If we keep on this path if we reelect @BarackObama the America we leave our kids and grandkids won't look (cont) __HTTP__ _E_
The Democrats just aren't calling about DACA. Nancy Pelosi and Chuck Schumer have to get moving fast or they'll disappoint you again. We have a great chance to make a deal or blame the Dems! March 5th is coming up fast. _E_
On 800 beautiful acres in Miami @TrumpDoral boasts 100000 sq. ft. in meeting space with event planning services __HTTP__ _E_
....8 Dems totally control the U.S. Senate. Many great Republican bills will never pass like Kate's Law and complete Healthcare. Get smart! _E_
.@BoonePickens Thank you for the T. Boone Pickens Entrepreneur Award—a great honor for me from a fantastic man. _E_
Today's assignment: read chapter three of Think Big "Basic Instincts." Focus on my acquisition of 40 Wall Street. _E_
Iraq is no longer our problem. We never should have been there in the first place! _E_
I couldn't make the Faith and Freedom confab in Orlando so I sent a video... __HTTP__ _E_
Thank you! Four new #DebateNight polls with the MOVEMENT winning. Together we will MAKE AMERICA SAFE & GREAT AGAIN... __HTTP__ _E_
Had dinner last night at Megu 845 United Nations Plaza fabulous food beautiful restaurant. _E_
RT @EricTrump: Friends: Remember to VOTE tomorrow if you live in Louisiana Maine Kentucky or Kansas! #MakeAmericaGreatAgain __HTTP__ _E_
Based on new oil prices the ugly windfarms being built in Scotland will quickly die! What a mess! _E_
ObamaCare not only has brought higher premiums decreased care & loss of jobs but now .1% Q1 growth. REPEAL BEFORE IT IS TOO LATE! _E_
The 5 star @Trump_Ireland graces over 500 acres fronting 2.5 miles on the Atlantic Ocean in County Clare Ireland __HTTP__ _E_
"Ice Skaters Invade Mar a Lago as Snow Falls on Palm Beach Salvation Army Ball!" __HTTP__ via @GossipExtra _E_
.@WendyWilliams Thanks for the nice statement especially about my wife and kids very much appreciated. _E_
Read my full statement here on the Supreme Court's executive amnesty decision #imwithyou __HTTP__ _E_
Our NOBEL PRIZE FOR PEACE president said I'm really good at killing people according to just out book Double Down. Can Oslo retract prize? _E_
I look forward to watching @megynkelly tonight 8 PM ET. It will be interesting to see how she treats me—I think she will be very fair. _E_
Join me in Delaware Ohio tomorrow at 12:30pm! #DrainTheSwamp Tickets: __HTTP__ __HTTP__ _E_
RT @DanScavino: Join #PEOTUS Trump & #VPEOTUS Pence live in West Allis Wisconsin! #ThankYouTour2016 #MAGA __HTTP__ __HTTP__ _E_
Today @MittRomney addressed the NAACP. @BarackObama takes their vote for granted which is why there is such high Black unemployment. _E_
Thank you @GOPLeader Kevin McCarthy! Couldn't agree w/you more. TOGETHER we are #MAGA __HTTP__ _E_
No matter how much I accomplish during the ridiculous standard of the first 100 days & it has been a lot (including S.C.) media will kill! _E_
Trump Int'l Hotel & Tower Vancouver will be a new landmark in a fantastic city __HTTP__ _E_
Lyin' Ted Cruz consistently said that he will and must win Indiana. If he doesn't he should drop out of the race stop wasting time & money _E_
RT @theRealKiyosaki: Donald Trump coined the phrase 'multilevel focusing' I love it. It is when two ideas intersect & form a new innovation _E_
Crooked Hillary said that I want guns brought into the school classroom. Wrong! _E_
One by one we are keeping our promises on the border on energy on jobs on regulations. Big changes are happening! _E_
Thank you! #GOPDebate Polls #MakeAmericaGreatAgain __HTTP__ _E_
"Leadership is perhaps the key to getting any job done." – The Art of The Deal _E_
Bay Bridge in California made in China for $1.8 billion. $300 million in cost overruns. Are we stupid? _E_
Oscar Pistorious the blade runner is as guilty as O.J. I wonder if the result will be the same? _E_
Thank you! Together we will #MakeAmericaGreatAgain! __HTTP__ _E_
This guy @sethmeyers can't do a simple interview—saw him the other night stumbling & mumbling while trying to interview a guest. _E_
With autism being way up what do we have to lose by having doctors give small dose vaccines vs. big pump doses into those tiny bodies? _E_
RT @Scavino45: Time lapse video of the border wall prototypes when they were being built in San Diego. Next phase underway: testing and ev... _E_
Vattenfall the company behind a proposed asinine windfarm off the coast of Aberdeen Scotland is having serious financial difficulty. _E_
Keystone pipeline would create 20000 direct jobs another 50000 jobs servicing the pipeline. 700000 barrels a (cont) __HTTP__ _E_
The last person that Hillary or Bernie want to run against is Donald Trump and that is fact! _E_
Putin has no respect for our President really bad body language. _E_
Why is Obama's auto bailout now creating jobs in China? He is ruining American industry. _E_
General Flynn was given the highest security clearance by the Obama Administration but the Fake News seldom likes talking about that. _E_
Our ally Canada is 'frustrated' by @BarackObama's radical anti gas policies __HTTP__ BHO is forcing Canada to send gas to China. _E_
Entrepreneurs: Stay focused and be tenacious. Remain fixed on your goals. _E_
I am asking all citizens to believe in yourselves believe in your future and believe once more in America. #AmericaFirst __HTTP__ _E_
Want access to Crooked Hillary? Don't forget it's going to cost you!#DrainTheSwamp #PayToPlay __HTTP__ _E_
Crooked H is nasty to Sanders supporters behind closed doors. Owned by Wall St and Politicians HRC is not with you. __HTTP__ _E_
Cruz came to Mississippi there was nobody there he left the state. I had a rally in Madison MS with 10000! Thank you! _E_
Just left $259 million rebuilding of Doral in Miami. Amazing Trump National Doral will be a masterpiece (if I do say so myself)! _E_
The big problem for little @MacMiller is that he's going to have to have another hit song not just his Donald Trump bonanza. _E_
On Bill O'Reilly in 5 minutes! _E_
Just left the best golf course in the State of California @trumpgolfla. When in the LA area check it out even (cont) __HTTP__ _E_
FBI director said Crooked Hillary compromised our national security. No charges. Wow! #RiggedSystem _E_
America needs strong leadership. Politicians can talk but they don't get things done. Video: __HTTP__ __HTTP__ _E_
Great American heroes who averted an attack in France. THANK YOU! Spencer Stone Anthony Sadler & Alex Skarlatos. __HTTP__ _E_
RT @EricTrump: Aloha Hawaii: We would be honored to have your vote! Find your caucus __HTTP__ #TrumpWaikiki #Mahalo __HTTP__ _E_
Any American who fights w/ ISIS in Iraq or Syria should have their passport revoked. If they try to come back in send them to Gitmo. _E_
Have a GREAT EASTER I love you all! _E_
Just got back from Asheville North Carolina where we had a massive rally. The spirit of the crowd was unbelievable. Thank you! #MAGA _E_
True America is rapidly losing it's SPIRIT and when that's gone we will only be going in one direction and that direction is down! _E_
TERRORISM IMMIGRATION AND NATIONAL SECURITY SPEECH TRANSCRIPT: __HTTP__ __HTTP__ _E_
Leaders at Trump National Doral are only one under par. The great Ben Hogan said I've never seen a great course that was easy! _E_
"I don't measure a man's success by how high he climbs but how high he bounces when he hits bottom." George S. Patton _E_
A friend is one who has the same enemies as you have. Abraham Lincoln _E_
Lightweight @AGSchneiderman's phony lawsuit against Trump U was decimated by the court—he's a loser! _E_
People rarely say that many conservatives didn't vote for Mitt Romney. If I can get them to vote for me we win in a landslide. _E_
Very important that NFL players STAND tomorrow and always for the playing of our National Anthem. Respect our Flag and our Country! _E_
The country of Georgia is a small wonder. Performing well economically under the leadership of @SaakashviliM. A great American ally. _E_
Can't believe we are less than three weeks away from the election. Time certainly flies! _E_
Obama said in his SOTU that "global warming is a fact." Sure about as factual as "if you like your healthcare you can keep it." _E_
Super PACs should be disavowed by anyone running for President. They are a total scam on our system and country! I am self funding. _E_
For every CEO that drops out of the Manufacturing Council I have many to take their place. Grandstanders should not have gone on. JOBS! _E_
A GREAT DAY IN WISCONSIN!Thank you #Racine & #Wausau! Just arrived in #EauClaire! #Trump2016#WIPrimary #TrumpTrain __HTTP__ _E_
Leaving the great people of North Carolina. Amazing event. Heading to Tampa now! #VoteTrump _E_
United States looks more and more like a paper tiger. Won't be that way if I win! _E_
THANK YOU ASIA! #USA __HTTP__ _E_
.@PennyPritzker Really important to cover currency manipulation in trade agreements that's where China and others are beating us. Best! _E_
Thank you. __HTTP__ _E_
Good luck to the people of Scotland whatever their decision may be on Thursday. The whole world is watching—really exciting! _E_
The @GOP should not agree to the ridiculous debate terms that @CNBC is asking unless there is a major benefit to the party. _E_
#ICYMI: @KarlRove & @oreillyfactor discuss what Ted Cruz did to the great people of Iowa as they went to vote. __HTTP__ _E_
As promised my @SuperBowl pick is the San Francisco @49ers. _E_
My Twitter has been seriously hacked and we are looking for the perpetrators. _E_
I look forward to my press conference on Weds of next week @TrumpTurnberry to discuss changes & big investment I'll make. Very exciting! _E_
Because Obama was so pathetic in the first debate tonight's audience will be humongous people want to see if he is for real. _E_
Tomorrow is #TrumpTuesday on @SquawkCNBC 7:30 AM _E_
It was just announced that @MacMiller's song "DonaldTrump" went platinum—tell Mac Miller to kiss my ass! _E_
Today we are thrilled to welcome @Broadcom CEO Hock Tan to the WH to announce he is moving their HQ's from Singapore back to the U.S.A..... __HTTP__ _E_
Looking at the figures and plans behind @Disney's acquisition of Lucas Film makes you realize how stupid @AOL (cont) __HTTP__ _E_
Obama's $1T+ deficit budget expanded welfare & green cronyism & it cut domestic bomb prevention in half __HTTP__ _E_
Restoring American wealth will require that we get tough. The next president must understand that America's (cont) __HTTP__ _E_
Thank you America! #MAGARasmussen National PollDonald Trump 43%Hillary Clinton 40% __HTTP__ _E_
The so called A list celebrities are all wanting tixs to the inauguration but look what they did for Hillary NOTHING. I want the PEOPLE! _E_
W/a newly expanded 27 holes of golfing Trump Intl.Palm Beach is ranked by Florida Golf Magazine as FL's #1 course __HTTP__ _E_
Check @billmaher's background & you will find he is not a smart guy—he just wants people to think he is just call him dummy. _E_
Government's first duty is to protect the people not run their lives." – President Ronald Reagan _E_
An honor having the National Sheriffs' Assoc. join me at the @WhiteHouse. Incredible men & women who protect & serv... __HTTP__ _E_
My @SquawkCNBC interview discussing the @GOP convention @BarackObama's sealed records & @SenatorReid's tax claim __HTTP__ _E_
Interview w/ @AndreaTantaros discussing my WH tour offer @KarlRove's terrible ads & Ashley Judd's candidacy __HTTP__ _E_
Americans understand that the US has a spending problem not a revenue problem. #TimeToGetTough __HTTP__ __HTTP__ _E_
The trade deal is a disaster she was always for it! #DemDebate _E_
The seriously failing @nytimes despite so much winning and poll numbers that will soon put me in first place only writes dishonest hits! _E_
A bite from last night's @piersmorgan interview discussing Rev. Wright's Ed Klein interview and the 2012 campaign __HTTP__ _E_
Our not very bright Vice President Joe Biden just stated that I wanted to carpet bomb the enemy. Sorry Joe that was Ted Cruz! _E_
Just shows that you can have all the cards and lose if you don't know what you're doing. _E_
#TrumpVine from D.C. __HTTP__ _E_
With a @SharkGregNorman designed course directly along the water @Trump_Charlotte is North Carolina's elite club __HTTP__ _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Had a great time hosting the Palm Beach County Republican at Mar a Lago. @IngrahamAngle gave a strong speech. She's great! _E_
"You have to be positive every single day. Positive stamina is a necessary ingredient for success." – Think Like a Champion _E_
Please @21Club go back to your original menu and preparation. Believe me it was much better. Let me know when the change is made! _E_
.@pennjillette has received a star on the Hollywood Walk of Fame— about time! #CelebApprentice _E_
Be aware of things that seem inexplicable because they can be a big step towards innovation. Donald J. Trump __HTTP__ _E_
I am not only fighting Crooked Hillary I am fighting the dishonest and corrupt media and her government protection process. People get it! _E_
Did the Boston terrorists register their guns? No. Another example of why gun control legislation is not the answer! _E_
A Clinton economy = more taxes and more spending! #DebateNight __HTTP__ _E_
I can't wait to donate @billmaher's $5 million to charity. Just waiting on @billmaher to send me the money. _E_
Peoples lives are being shattered and destroyed by a mere allegation. Some are true and some are false. Some are old and some are new. There is no recovery for someone falsely accused life and career are gone. Is there no such thing any longer as Due Process? _E_
So sad that @CNN and many others refused to show the massive crowd at the arena yesterday in Oklahoma. Dishonest reporting! _E_
This story is not about Mr. Khan who is all over the place doing interviews but rather RADICAL ISLAMIC TERRORISM and the U.S. Get smart! _E_
Getting back to the nicer and more normal parts of life Celebrity Apprentice is great tonight on NBC at 9. It will be a full two hour show! _E_
My twitter account is now reaching more people than the New York Times not bad. And we're only going to get better! _E_
Senator (Doctor) Bill Cassidy is a class act who really cares about people and their Health(care) he doesn't lie just wants to help people! _E_
The story with Hillary will never change. __HTTP__ _E_
The big story is the unmasking and surveillance of people that took place during the Obama Administration. _E_
.@TMobile gives terrible service and has many complaints just check. _E_
It's 10 AM: Two hours to go for Obama to easily pick up millions for charity! _E_
Gotta hand it to @IvankaTrump she loved Doral from the time we looked at it. The Trump Doral will be an Icon. #sayfie #newsmax _E_
Rubio is weak on illegal immigration with the worst voting record in the U.S. Senate in many years. He will never MAKE AMERICA GREAT AGAIN! _E_
Congratulations to @SpeakerBoehner on standing strong and tying government shutdown to defunding ObamaCare. _E_
Via @USATODAYsports: "Last year it was Tiger Woods with the walk off" __HTTP__ @CadillacChamp @DoralResort #TrumpDoral _E_
.@Lord_Sugar....but you wouldn't notice because you have no vision and you are a total loser. _E_
Thank you to all Americans who participated in Nat'l Rx Drug Take Back Day. A record amount of drugs collected & disposed. We can do this! _E_
...and an optimist is one who makes opportunities of his difficulties. Harry S. Truman _E_
The #AmazonWashingtonPost sometimes referred to as the guardian of Amazon not paying internet taxes (which they should) is FAKE NEWS! _E_
Some of the Fake News Media likes to say that I am not totally engaged in healthcare. Wrong I know the subject well & want victory for U.S. _E_
Thank you! "Trump's Defining Speech" WSJ Editorial: __HTTP__ __HTTP__ _E_
Very strange why do database records contradict @BarackObama and show he was only at Columbia 1 year? __HTTP__ _E_
Crooked Hillary can't even close the deal with Bernie and the Dems have it rigged in favor of Hillary. Four more years of this? No way! _E_
Big announcement tomorrow morning concerning the great Turnberry Resort in Scotland! _E_
RT @TeamTrump: .@HillaryClinton had her chance and she BLEW IT. #BigLeagueTruth #Debates __HTTP__ _E_
Obama's war on women has lead to the biggest decline in female employment in 40 years. 4 more years?? _E_
After Turkey call I will be heading over to Trump National Golf Club Jupiter to play golf (quickly) with Tiger Woods and Dustin Johnson. Then back to Mar a Lago for talks on bringing even more jobs and companies back to the USA! _E_
Looking forward to my Iowa visit at @bobvanderplaats' @theFAMiLYLEADER Summit __HTTP__ Big crowd! _E_
President Obama's Arab Spring is not looking so good right now! _E_
Guests are raving about our exclusive hotel mattress and so we've made it available for purchase! __HTTP__ _E_
Keystone must be approved. Oil is at a record high. We need to use our resources and support allies like Canada. _E_
.@ximenaNR Great job we are all proud of you one of our all time BEST! _E_
RT @foxandfriends: Getting the job done! Sen. Mitch McConnell delays August recess to work on health care bill __HTTP__ _E_
Marshawn Lynch of the NFL's Oakland Raiders stands for the Mexican Anthem and sits down to boos for our National Anthem. Great disrespect! Next time NFL should suspend him for remainder of season. Attendance and ratings way down. _E_
I just left the Trump Tower atrium it is packed with great people. #1 tourist attraction in NYC Fun! #TrumpTower _E_
Via @LuxuryDaily by Joe McCarthy: "Trump Collection leverages 2016 election frenzy for Washington debut" __HTTP__ _E_
"TRUMP: IMMIGRATION BILL A REPUBLICAN 'DEATH WISH'" __HTTP__ via @BreitbartNews by @mboyle1 _E_
Congressman Ron DeSantis is a brilliant young leader Yale and then Harvard Law who would make a GREAT Governor of Florida. He loves our Country and is a true FIGHTER! _E_
The new reality. 'China Daily' is sold in street newspaper vending machines across DC. Why not? They own the place. _E_
The @TrumpChicago Spa offers 5 star services12 treatment rooms & 53 spa guestrooms overlooking Chicago skyline __HTTP__ _E_
Republicans and Democrats have both created our economic problems. _E_
I am self funding my campaign so I do not owe anything to lobbyists & special interests. __HTTP__ __HTTP__ _E_
Jerry Finkelstein passed away last night a great New York mover & shaker & a really great guy! _E_
Speaking at the Red White and Blue Dinner in Maryland __HTTP__ _E_
Obama is laughing at Karl Rove & all the losers who spent hundreds of millions of dollars and didn't win one race including the big one! _E_
"Donald Trump to Build Trump Towers Complex in Rio de Janeiro" __HTTP__ via Hispanically Speaking News _E_
Secretary of Defense Chuck Hagel seems so lost and frankly dumb. He can't even speak properly. Poor leader in these very dangerous times! _E_
This is my last election. After my election I have more flexibility. Obama to @MedvedevRussiaE discussing our nuclear arsenal. _E_
Why I would not have approved the deal... __HTTP__ #trumpvlog _E_
RT @foxandfriends: VIDEO: Rep. Scalise — GOP agrees on over 85 percent of health care bill __HTTP__ _E_
You have to believe in what you want. Keep your focus keep your momentum and remain patient and persistent. _E_
Obama keeps namedropping Bill Clinton he is no Bill Clinton. _E_
Be tenacious. Being tenacious means you're tough and patient at once a formidable combination. _E_
China is the biggest environmental polluter in the World by far. They do nothing to clean up their factories and laugh at our stupidity! _E_
The NFL image is really tarnished! Now if the sponsors start leaving and the ratings go down the NFL will be in big trouble. Boring games! _E_
Emmy Awards show was terrible last night. Same shows winning over and over again (politics). Amazing race a joke. Host Seth Meyers bombed! _E_
My @foxandfriends interview on risk for @GOP on immigration wasting money in Middle East & firing @OMAROSA __HTTP__ _E_
I'll be on @foxandfriends on Monday at 7:30 AM...be sure to tune in. _E_
My prayers and best wishes are with the family of Edwin Jackson a wonderful young man whose life was so senselessly taken. @Colts _E_
Looking forward to being awarded the '2015 Statesman of the Year' by @SRQRepublicans this Thursday. A record 2000+ attendees Can't wait! _E_
Hillary has called for 550% more Syrian immigrants but won't even mention "radical Islamic terrorists." #Debate... __HTTP__ _E_
....victory and cannot be burdened with the tremendous medical costs and disruption that transgender in the military would entail. Thank you _E_
#MakeAmericaGreatAgain#TrumpPence16 __HTTP__ _E_
There is no excuse for riots in Ferguson regardless of the grand jury outcome. _E_
I am the king of debt. That has been great for me as a businessman but is bad for the country. I made a fortune off of debt will fix U.S. _E_
"There can be no liberty unless there is economic liberty." Margaret Thatcher _E_
Thank you Brian Krzanich CEO of @Intel. A great investment ($7 BILLION) in American INNOVATION and JOBS!... __HTTP__ _E_
You haven't seen fireworks until you see @OMAROSA & @piersmorgan go at it again! Let's just say it's no happy reunion... _E_
Steve Jobs is spinning in his grave Apple has lost both vision and momentum must move fast to get magic back! _E_
Great job @EricTrump! Proud of you!#AmericaFirst #RNCinCLE __HTTP__ _E_
RT @FLOTUS: Thank you to all who participated in today's discussion on opioid abuse. By talking about it we can start to make a real diffe... _E_
Via @BreitbartSports by @warnerthuston: "Donald Trump Buys Four Time British @The_Open Golf Course" __HTTP__ _E_
Thank you so much to __HTTP__ for naming me the 2015 Man of the Year. This is indeed a great honor for me! _E_
What apology didn't they go around beating the crap out of people and robbing them? Why did they all confess? Aren't police convinced? _E_
Today it was my great honor to welcome Prime Minister Erna Solberg of Norway to the @WhiteHouse a great friend and ally of the United States! Joint press conference: __HTTP__ __HTTP__ _E_
It is time to take care of OUR COUNTRY to rebuild OUR COMMUNITIES and to protect our GREAT AMERICAN WORKERS! #TaxReform __HTTP__ _E_
Crooked Hillary Clinton put out an ad where I am misquoted on women. Can't believe she would misrepresent the facts! My hit was on China _E_
My @foxandfriends interview discussing the @nyjets acquisition of @TimTebow and the timing of @RepPaulRyan's plan __HTTP__ _E_
We have to make the U.S.A. RICH again so that we can afford to pay Social Security Medicareand Medicaid and STRONG to keep our enemies out _E_
Our Marines are sent to kill the Taliban not coddle them. USMC should be praised not investigated. Semper Fi ! _E_
Why has nobody asked Kaine about the horrible views emanated on WikiLeaks about Catholics? Media in the tank for Clinton but Trump will win! _E_
.@TraceAdkins isn't excited about their ideas. Are you? #CelebApprentice _E_
"Give me a smart idiot over a stupid genius any day." Samuel Goldwyn _E_
Baseball player Ryan Braun turned out to be a total con man after so vociferously proclaiming his innocence only to be guilty as.hell! _E_
Can't believe Major League Baseball just rejected @PeteRose_14 for the Hall of Fame. He's paid the price. So ridiculous let him in! _E_
#CrookedHillary __HTTP__ _E_
RT @FoxNews: .@jessebwatters on @DonaldJTrumpJr meeting with Russian attorney: I believe Don Jr. is the victim here. #TheFive __HTTP__ _E_
DESPERATE @BarackObama is already asking supporters to 'find dirt' on @MittRomney's VP picks __HTTP__ Dirty tactics. _E_
Randy Moss said he was the greatest receiver of all time—no way—it was @JerryRice! _E_
A review of @MikeTyson's show great press on Trump International Golf Links Scotland and more in today's #trumpvlog __HTTP__ _E_
Elizabeth Warren often referred to as Pocahontas just misrepresented me and spoke glowingly about Crooked Hillary who she always hated! _E_
Will be on Fox & Friends at 7 (10 minutes). ENJOY! _E_
Wow the final ratings for the Miss Universe Pageant show that it won in all key demos number one on Sunday. I have a winner! _E_
RT @EricTrump: #ThrowbackThursdays @realDonaldTrump __HTTP__ _E_
We are inspired by the stories of everyday heroes who pull their communities from the depths of despair through leadership and love. __HTTP__ _E_
Megyn Kelly has two really dumb puppets Chris Stirewalt & Marc Threaten (a Bushy) who do exactly what she says. All polls say I won debates _E_
Congratulations to my son Eric on the fantastic job he has done in rebuilding Turnberry and its great Ailsa Course. Always support kids! _E_
.@Modern_Do_Good #asktrump __HTTP__ _E_
.@rushlimbaugh Rush I am in LA inspecting property (big job creator) & listening to you. You are truly fantastic thanks! _E_
Iowa was fantastic last night amazing crowd and people. I'm now in Florida getting ready to go to South Carolina. Big crowd very exciting _E_
.@BarackObama was caught telling Russian PM @MedvedevRussiaE that he can be more 'flexible' in his second term. Russia thinks he's weak. _E_
Via @Newsmax_Media by @OwenTew: "Donald Trump: 'Last Thing We Need Is Another Bush'" __HTTP__ _E_
RT @dmartosko: 'Duck Dynasty' star Phil Robertson says he'll back Trump for president __HTTP__ via @MailOnline _E_
I love taking lawsuits all the way when I'm right. @AGSchneiderman is finding that out the hard way! _E_
ISIS has infiltrated countries all over Europe by posing as refugees and @HillaryClinton will allow it to happen h... __HTTP__ _E_
Remember that I am self funding my campaign. Hillary Jeb and the rest are spending special interest and lobbyist money.100% CONTROLLED _E_
Join Governor Mike Pence in Reno Nevada tonight at 7pm! Tickets available at: __HTTP__ _E_
Fact: without Texas and states reaping the fracking boom Obama's job record would go from bad to worse! _E_
A great gift idea is my new book #TimeToGetTough easy to order on Amazon __HTTP__ _E_
Join me live from the @WhiteHouse. __HTTP__ _E_
I will be on Morning's with Maria on the Fox Business Network tomorrow during the 7am and 8am ET hours. _E_
Major League Baseball was really smart when they wouldn't let Mark Cuban buy a team. Was it his financials or the fact that he's an asshole? _E_
Brian if I'm well past the last exit to relevance how come you spent so much time reading my tweets last night? @NBCNightlyNews _E_
My two wonderful sons Don and Eric will be on @foxandfriends at 7:02 now! Enjoy. _E_
Via @Newsmax_Media:  Trump: I'd Be Better 'Meet the Press' Host Than 'Moron' Chuck Todd __HTTP__ _E_
I'm on @ETonlineAlert tonight to talk about what the Yankees should have done about A Rod long ago __HTTP__ _E_
Attorney General Jeff Sessions has taken a VERY weak position on Hillary Clinton crimes (where are E mails & DNC server) & Intel leakers! _E_
"You have to be patient as well as enthusiastic when it comes to your goals. Think big but be realistic." – Think Big _E_
I was putting together my early deals in New York & I was advised by many that I was too young. Believe in yourself & you can do anything. _E_
Great honor to have @GOP General Counsel #JohnRyder as a Trump delegate in TN. RNC meeting well worth it! Unifying the party! _E_
Arena was packed totally electric! _E_
Melania will be interviewed by @morningmika on @Morning_Joe now (8:30 A.M.). ENJOY! _E_
Yesterday @BarackObama actually spent a full day in Washington. He didn't campaign fund raise or play golf. Shocking. _E_
Jon Huntsman called to see me. I said no he gave away our country to China! @JonHuntsman _E_
RT @foxandfriends: FOX NEWS ALERT: 2 US drone strikes in Somalia target Al Qaeda and Al Shabaab __HTTP__ _E_
Florida Ethics Commission Advocate comes down hard on Rubio. So do two people who worked with him. Said he used the wrong credit card! Sure. _E_
It will now start to cool down concerning Sterling and the Clippers. This mess will start to fade after litigation into the murky past! _E_
. @deesnider is a great guy & a total winner! He understood he did not leave me any other choice. Look forward to keeping in touch. _E_
Nobody understands politicians like I do all talk and no action. They will never get our country where it needs to be truly great again! _E_
Mrs. Goldberg who filed the Chicago case many years ago is a vicious and conniving woman loved beating her. _E_
MUST READ It's time people listened to Trump' says mother of gunned down teenage football star __HTTP__ SECURE THE BORDER! _E_
Wow China's growth accelerated 7.8% in third quarter. If the U.S. had half that number we would be the talk of the World need leadership _E_
.@JTimberlake It was great having you play The Blue Monster. Thanks for your nice statements many agree that it is best they've seen! _E_
My rallies are not covered properly by the media. They never discuss the real message and never show crowd size or enthusiasm. _E_
I will be doing Greta Van Susteren @gretawire tonight at 10 PM on Fox News talking about China & Mitt's failed campaign team. _E_
Glad to hear Clint Eastwood endorsed @MittRomney. He understands that America needs a big boost to be strong again. _E_
Just got back to the White House from the Great States of Texas and Louisiana where things are going well. Such cooperation & coordination! _E_
North Korea is looking for trouble. If China decides to help that would be great. If not we will solve the problem without them! U.S.A. _E_
"Definiteness of purpose is the starting point of all achievement." W. Clement Stone _E_
Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_
If you can't see it it will never happen. Bring your vision to fruition through perseverance and hard work. That will build momentum. _E_
I know Shia LaBeouf @thecampaignbook and when sober a really nice guy. Must get act together fast before too late. _E_
The economy is in terrible shape. @BarackObama is manipulating the job numbers to hide the truth. __HTTP__ _E_
When it comes to China @BarackObama practices pretty please diplomacy. He begs and pleads and bows and it'... (cont) __HTTP__ _E_
Best speech in #GoldenGlobes history __HTTP__ _E_
Watching the madness in Cyprus? If our government keeps spending trillion dollar deficits that could happen here. _E_
#CelebApprentice #TeamVortex or #TeamInfinity? _E_
Big day on Thursday for Indiana and the great workers of that wonderful state. We will keep our companies and jobs in the U.S. Thanks Carrier _E_
It's still exciting after all these years and this cast is special! _E_
One of the greatest tributes to a father I have ever witnessed given to the great @jacknicklaus by his wonderful son __HTTP__ _E_
How will raising taxes create jobs? Washington is all out of answers. New leadership is needed. _E_
According to a @gallupnews poll over 60% think ObamaCare will make things worse for taxpayers __HTTP__ ObamaCare is a T A X. _E_
I will be holding a major briefing on the Opioid crisis a major problem for our country today at 3:00 P.M. in Bedminster N.J. _E_
.@robertjeffress I greatly appreciate your kind words last night on @FoxNews. Have great love for the evangelicals great respect for you. _E_
Make sure to grab your copy of this month's @Newsmax_Media detailing The Trump Effect __HTTP__ _E_
Music cues audience participation sounds like a very active Team Power. #CelebApprentice _E_
NATIONAL DEBT January 2009 = $10.6 TRILLION August 2016 = $19.4 TRILLION __HTTP__ _E_
Dem Gov. of MN. just announced that the Affordable Care Act (Obamacare) is no longer affordable. I've been saying this for years disaster! _E_
"When you can't make them see the light make them feel the heat." – President Ronald Reagan _E_
Over 35 CIA operatives were on the ground in Benghazi the night of the 9.11 attack __HTTP__ Still a phony scandal ? _E_
The people of Ireland are very smart—they just killed an ugly windfarm which would've hurt tourism @AlexSalmond __HTTP__ _E_
... to build a wind farm and destroy this view! _E_
I will be on @seanhannity tonight from Las Vegas Nevada at 10pmE. Enjoy! #Hannity #Trump2016 __HTTP__ _E_
The great State of Nebraska can do much better than @BenSasse as your Senator. Saw him on @greta totally ineffective. Wants paid for pols. _E_
Iran's continued public threats of annihilating @Israel are unacceptable. Iran's nuclear drive must be stopped. #TimeToGetTough _E_
Very little pick up by the dishonest media of incredible information provided by WikiLeaks. So dishonest! Rigged system! _E_
Entrepreneurs: Achievers move forward at all times. Achievement is not a plateau it's a beginning. _E_
It was an honor to host our American heroes from the @WWP #SoldierRideDC at the @WhiteHouse today with @FLOTUS @VP... __HTTP__ _E_
This was sent out from Ted Cruz as Iowans arrived at their caucus sites to vote. #CruzFraud __HTTP__ _E_
RT @GOP: Reminder: last year Clinton pledged she had turned over all work related email under penalty of perjury __HTTP__ _E_
With the labor participation rate at a 36 yr. low over 92M Americans are out of the work force. _E_
Christians in the Middle East have been executed in large numbers. We cannot allow this horror to continue! _E_
Just arrived at Camp David where I am monitoring the path and doings of Hurricane Harvey (as it strengthens to a Class 3). 125 MPH winds! _E_
'Democratic operative caught on camera: Hillary PERSONALLY ordered 'Donald Duck' troll campaign that broke the law' __HTTP__ _E_
When Warren Buffett & others play w/ bankruptcy nobody cares—when Trump plays the game it becomes a big deal! __HTTP__ _E_
Heading back from a very exciting two days in Davos Switzerland. Speech on America's economic revival was well received. Many of the people I met will be investing in the U.S.A.! #MAGA _E_
Ask yourself is this a blip or is it a catastrophe? and your equilibrium will be kept in check if hard times hit. _E_
I always said that @lancearmstrong had to keep fighting the charges. By stopping he gave his enemies an opening. _E_
Landing in New Hampshire soon to talk about the massive drug problem there and all over the country. _E_
Donald J. Trump Ethics Reform Plan For Washington D.C. __HTTP__ _E_
"Destiny has a part to play in your life and in your business – so give it a chance to work." – Think Like a Champion _E_
My wife Melania will be on @QVC today @ 5 PM selling really beautiful jewelry at a very low price. Perfect for Mother's Day—call in! _E_
Entrepreneurs: Keep your momentum. See yourself as victorious and leading a winning team. Keep everyone moving forward. _E_
RT @foxandfriends: Mark Levin: The collusion is among the Democrats __HTTP__ _E_
.@marklevinshow has written a great book Plunder and Deceit. He powerfully analyzes issues that are crucial to us today. Read it! _E_
Why is the Pentagon wasting precious dollars on going 'green.' Complete waste. We need the best & easiest fuel for our military. _E_
It was great being with Luther Strange last night in Alabama. What great people what a crowd! Vote Luther on Tuesday. _E_
TV's darling @TheRealMarilu is back in this year's "All Star" @CelebApprentice. Marilu is a fierce competitor. _E_
.@CNN & @CNNPolitics did not say that lawyer Beck lost the case and I got legal fees. Also she wanted to breast pump in front of me at dep. _E_
"Keep your brand standard in mind and your expansion will seem possible as well as gratifying." – Midas Touch _E_
I will be handing over my Twitter account to my team of deplorables for tonight's #debate #MakeAmericaGreatAgain _E_
Anti Morsi protests are 10 times larger than 2011 anti Mubarek protests. Interesting. _E_
"Trump: Illegal Immigrants Are Getting Treated Better than Vets" __HTTP__ via @nro by @AndrewE_Johnson _E_
I will be live tweeting during the @ApprenticeNBC tonight at 9PM ET. _E_
If you fail once twice three times it doesn't matter. Learn from your mistakes and push forward to VICTORY the sweetest feeling there is! _E_
Thank you Laura! __HTTP__ _E_
.@oreillyfactor was very negative to me in refusing to post the great polls that came out today including NBC. @FoxNews not good for me! _E_
HAPPY NEW YEAR & THANK YOU! __HTTP__ __HTTP__ _E_
Millions of dollars being spent on false TV ads by special interest groups who own Rubio & Cruz. When you see them think of your puppet POLS _E_
It's Friday. How many bald eagles did wind turbines kill today? They are an environmental & aesthetic disaster. _E_
Just won the highest rated sanitary award in NY—an A & the food is great also. Trump Grill/ 57th & 5th. _E_
Every dollar @BarackObama spends costs $1.40 with interest borrowed from China on our children and grandchildren's backs. CUT CAP BALANCE! _E_
Looking forward to being honored with the prestigious 'Friend of Israel' award at the @Algemeiner Gala Dinner __HTTP__ _E_
Obama's plan to have Russia stand up to Iran was a horrible failure that turned America into a laughingstock. #TimeToGetTough _E_
"Success is dependent on effort." Sophocles _E_
Use adverse events and monumental challenges to make you strong Think Big _E_
This Sunday's LIVE FINALE of @ApprenticeNBC puts @pennjillette against @TraceAdkins. Watch two great competitors battle to win! _E_
What you dream about is what you do. If you cannot even dream of doing big things you will never do anything big. Think Big _E_
We have just begun! __HTTP__ _E_
Great poll Florida! Thank you! __HTTP__ _E_
.@WhiteHouse Briefing with Director Marc Short and Director Mick Mulvaney... __HTTP__ _E_
Celebrate 2013 @TrumpSoHo with downtown's nicest #NYE party. Get your tickets now: __HTTP__ _E_
Inspiration exists but it has to find us working. Pablo Picasso _E_
20 Most Anticipated Hotel Openings of 2016: Trump International Hotel Washington D.C. __HTTP__ _E_
I said gas prices would sky rocket after election Opec payback! _E_
Oil prices just went over $100 per barrel for first time in nine months! _E_
Obama is on yet another two day West Coast fundraising swing. Has to fit it in before his 15 day tax payer funded vacation. _E_
Snowden is a spy who has caused great damage to the U.S. A spy in the old days when our country was respected and strong would be executed _E_
I just answered my Facebook fan's questions in the latest #AskTheDonald watch the video __HTTP__ _E_
Who says Obama will do better in the next debate has he gotten smarter in 2 weeks! _E_
"The belief that security can be obtained by throwing a small state to the wolves is a fatal delusion." Winston Churchill _E_
Thank you @Morning_Joe & @morningmika a great show! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
I wonder how officials @TexasTech feel now after treating Coach Mike Leach with so little respect after their loss to @TCUFootball 82 27? _E_
Interesting article by @newtgingrich @HumanEvents: "WHY ROVE AND STEVENS ARE PLAIN WRONG" __HTTP__ _E_
Via @AP on @washingtonpost: Trumps look at building 18 hole golf course on former Kluge estate in rural Virginia __HTTP__ _E_
Will be in South Bend Indiana in a short while big rally! See you soon! _E_
I have just ordered Homeland Security to step up our already Extreme Vetting Program. Being politically correct is fine but not for this! _E_
.@KAThomas212 Congratulations on joining the finest and fastest growing group of very talented people in the City. You will be GREAT! _E_
Obama must now start focusing on OUR COUNTRY jobs healthcare and all of our many problems. Forget Syria and make America great again! _E_
Heading down to D.C. __HTTP__ _E_
DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER MAKE AMERICA GREAT AGAIN! _E_
10000 people in South Carolina unbelievable evening! Will be in New Hampshire tomorrow love it. __HTTP__ _E_
Glenfiddich is a joke—should have chosen Andy Murray—U.S. Open & Olympic gold winner—as Top Scot instead of a total loser! _E_
The Ebola patient who came into our country knew exactly what he was doing. Came into contact with over 100 people. Here we go I told you so! _E_
Thank you! __HTTP__ _E_
Democrats refused to vote down their ObamaCare subsidy. While Americans will be hit w/ rising premiums Washington won't feel any pain _E_
47M on food stamps. Over 23M Americans unemployed. 50% of college grads unemployed. And Obama wants us talking about Big Bird. _E_
I will be speaking Monday September 24 (10 A.M.) at Liberty University to a record setting student body. I look forward to it! _E_
Breaking news negotiations with Iranians broke down because Obama insisted that they use ObamaCare. _E_
As stated here is the press release. __HTTP__ _E_
ObamaCare is a complete disaster. Many of my friends have to scale down their businesses because they can't afford it. Terrible. _E_
Good luck to @Joy_Villa on her decision to enter the wonderful world of politics. She has many fans! _E_
Departing now thank you Cedar Rapids Iowa. This is a MOVEMENT! __HTTP__ _E_
If I would have challenged the man the media would have accused me of interfering with that man's right of free speech. A no win situation! _E_
What a shame that Kobe Bryant was so badly injured last night a truly great champion who brought the Lakers back from oblivion this year! _E_
Will be interviewed tonight by @seanhannity on @FoxNews at 10 PM. Enjoy! _E_
New episode starting now! _E_
Just arrived in New Hampshire. Another packed venue! Will be fun. _E_
In debate @MittRomney should ask Obama why autobiography states born in Kenya raised in Indonesia. _E_
After Poland had a great meeting with Chancellor Merkel and then with PM Shinzō Abe of Japan & President Moon of South Korea. _E_
Congratulations Chuck. Must be wonderful to have Donald Trump as your guest #BeCool! #Trump2016 __HTTP__ _E_
RT @Scavino45: Manufacturer Optimism Hits Record High After #TaxReform Plan Revealed __HTTP__ _E_
I promise to do a new #trumpvlog when I get back next week lots of requests. Thanks! _E_
.@MikeTyson and @SpikeLee I gave a great review of your show in my #trumpvlog __HTTP__ _E_
I will be making some very big campaign stops next week big crowds and tremendous energy! MAKE AMERICA GREAT AGAIN _E_
I dictate my tweets to my executive assistant and she posts them. Time is money The Art of the Deal. _E_
As usual the ObamaCare premiums will be up (the Dems own it) but we will Repeal & Replace and have great Healthcare soon after Tax Cuts! _E_
Black Lives Matter protesters totally disrupt Hillary Clinton event. She looked lost. This is not what we need with ISIS CHINA RUSSIA etc. _E_
The failing @UnionLeader newspaper in N.H. just sent The Trump Organization a letter asking that we take ads. How stupid how desperate! _E_
.@AC360 Has the absolutely worst anti Trump talking heads on his show. Dopey writer O'brian knows nothing about me or my wealth. A waste! _E_
.@BradSteinle Thank you for yr wonderful tweet of July 4. I wanted a little time to go by before calling. Your sister & family are amazing. _E_
Egypt is turning into a hot bed of radical Islam. The current protest is another coup attempt. We should never have abandoned Mubarak. _E_
The jury in the Jodi Arias trial is believe it or not still out. You never know but such a long deliberation could be good for the defense _E_
#TextTrump88022 for exclusive @realDonaldTrump updates! We will Make America Great Again! _E_
Great read: "Hollywood can kiss Adam Corolla's ass he's going Trump funding" __HTTP__ via @upstartbusiness _E_
I hope people will start to focus on our Massive Tax Cuts for Business (jobs) and the Middle Class (in addition to Democrat corruption)! _E_
China is neither an ally or a friend they want to beat us and own our country. _E_
72% of refugees admitted into U.S. (2/3 2/11) during COURT BREAKDOWN are from 7 countries: SYRIA IRAQ SOMALIA IRAN SUDAN LIBYA & YEMEN _E_
"Obama doesn't respect the fact that the money he wastes belongs to us. He thinks that the wealth you create (cont) __HTTP__ _E_
Open for the 2014 season Mar a Lago Club is an architectural masterpiece offering the finest amenities in the world __HTTP__ _E_
LIVE on #Periscope: Good morning Iowa! Let's #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Dennis Rodman is a project manager tonight on Celebrity Apprentice watch Dennis in full action! _E_
#TrumpVlog South African justice __HTTP__ _E_
Do you believe it? The Obama Administration agreed to take thousands of illegal immigrants from Australia. Why? I will study this dumb deal! _E_
Just finished a press conference in Trump Tower wherein I gave information on which VETERANS groups got the $5600000 that I raised/gave! _E_
Starting to develop a much better relationship with Pakistan and its leaders. I want to thank them for their cooperation on many fronts. _E_
Another Dishonest Politician #LightweightSenatorMarcoRubio __HTTP__ _E_
FL KS ME MD MN NJ OR & WV! It's the LAST DAY to mail in voter reg forms. Get the forms at... __HTTP__ _E_
Who would you rather have negotiating for the U.S. against Putin Iran China etc. Donald Trump or Hillary? Is there even a little doubt? _E_
Will be doing Fox & Friends at 7 A.M. It never ends (hopefully)! _E_
Great victory for people of Blackdog Scotland. They defeated substation stopping inefficient & ugly wind turbines. @AlexSalmond _E_
I will be interviewed on @GMA Good Morning America tomorrow at 7:00 A.M. Big new ABC poll coming out I hope I do well! _E_
Without more Republicans in Congress we were forced to increase spending on things we do not like or want in order to finally after many years of depletion take care of our Military. Sadly we needed some Dem votes for passage. Must elect more Republicans in 2018 Election! _E_
The polls have been really amazing we are all tired of incompetent politicians and bad deals! __HTTP__ _E_
.@Omarosa admitting she's a threat in the boardroom that's not revelation knowledge. #CelebApprentice _E_
Give your goals substance. Imbue them with a value that exceeds the monetary. Make them count on as many levels as you can. _E_
Congrats @TrumpChicago for being named #3 Best Business Hotel in Chicago in @TravlandLeisure's 2014 World's Best __HTTP__ _E_
Be prepared there is a small chance that our horrendous leadership could unknowingly lead us into World War III. _E_
RT @TheFive: Trump just won on law & order and now he's delivering the goods. @jessebwatters #thefive _E_
A former classmate Roy Eaton has published a great book "Makers Shakers & Takers" – check it out __HTTP__ _E_
.@TraceAdkins great job on FOX this morning. Keep up the good work! _E_
The two dumbest interviews in history may go down as Lance Armstrong who is being sued by everyone in the world & Michael Douglas. _E_
I will be going to Trump Links at Ferry Point for the official opening of this long delayed (but future NYC treasure) course. Great job D _E_
My @foxandfriends int. destroying Schneiderman's frivolous suit which he brought after meeting Obama on Thurs. __HTTP__ _E_
Having a vision for something can be a very powerful force for accomplishment. Midas Touch _E_
The Republican Party of New York has been conditioned to lose and there is no excuse for this. Leadership must move fast and decisively! _E_
If their highly unethical behavior including begging me for ads isn't questionable enough they have endorsed a candidate who can't win. _E_
... than his destruction of Scotland's magnificent lands. @AlexSalmond _E_
MY POSITION ON VISAS #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Congrats to @FLGovScott on today's inauguration and having done a great job! _E_
Washington is in total gridlock—no trust no leadership—very interesting! _E_
North Korea is behaving very badly. They have been playing the United States for years. China has done little to help! _E_
Shock! Obamacare's high risk pool spending DOUBLED government estimates __HTTP__ @BarackObama is bankrupting this country! _E_
"Once you know you love your job never stop and never give up." – Think Like a Billionaire _E_
Think big. Stay focused. Be passionate. Don't ever give up _E_
If you can count the amount of time you put into a project on your fingers then you haven't spent enough time on it. _E_
Did A Rod really try to buy the papers that would implicate him re. drugs wow that would be the end a disaster! _E_
P.S. 42 in Queens is getting a truckload of food and much needed supplies for Rockaway residents #HurricaneSandyRelief _E_
Ralph Norman ran a fantastic race to win in the Great State of South Carolina's 5th District. We are all honored by your success tonight! _E_
Join me this Thursday in Wilmington Ohio at noon! #ImWithYou Tickets: __HTTP__ __HTTP__ _E_
It was an honor to be with @MittRomney the night he clinched the nomination. He will defeat @BarackObama and be a tremendous POTUS. _E_
Just out: Neera Tanden Hillary Clinton adviser said "Israel is depressing." I think Israel is inspiring! _E_
If NFL fans refuse to go to games until players stop disrespecting our Flag & Country you will see change take place fast. Fire or suspend! _E_
UPCOMING RALLIES JOIN ME! TOMORROW Fletcher NC @ 12pm. __HTTP__ OH @ 7pm. __HTTP__ _E_
His spending is reckless: @BarackObama will set a record fourth year of a $1 trillion budget deficit. __HTTP__ _E_
Join me LIVE from the Rose Garden at 1:30pmE with Prime Minister Alexis Tsipras of Greece. __HTTP__ __HTTP__ _E_
The exclusive home of @PGATOUR's @CadillacChamp @TrumpDoral sits on 800 beautiful acres in the center of Miami __HTTP__ _E_
Order my book CRIPPLED AMERICA for your holiday gifts. I will be signing books for the next two weeks! __HTTP__ _E_
"Donald Trump to crown @FIU as Miss Universe venue" __HTTP__ via @MiamiHerald _E_
A really nice article about the Blue Monster from "The Street." __HTTP__ _E_
Donald Trump promises 'world class' Crandon Park golf course __HTTP__ via @WPLGLocal10 by @GlennaOn10 _E_
RT @DRUDGE_REPORT: 'Win lose deal that benefits Iran and hurts United States'... __HTTP__ _E_
Failed Presidential Candidate Mitt Romney is having a news conference tomorrow to criticize me. (1/2) _E_
We've gone from $10 trillion that the president inherited from all prior presidents to $16 trillion @MittRomney _E_
Will be on Fox & Friends in five minutes enjoy and good morning! _E_
We have made more progress in the last nine months against ISIS than the Obama Administration has made in 8 years. Must be proactive & nasty! _E_
Sorry I never went bankrupt and don't wear a wig (it's all mine)! _E_
It's Thursday. How many people have lost their healthcare today? _E_
"One man with courage is a majority." Thomas Jefferson _E_
Remember when I recently said that Brussels is a hell hole and a mess and the failing @nytimes wrote a critical article. I was so right! _E_
Oil is double the price now compared to last year OPEC is laughing at @BarackObama. _E_
Congratulations to the Miss USA Pageant it was the #1 telecast of the night among ABC CBS NBC and Fox. A great show and a huge success. _E_
Thank you. __HTTP__ _E_
Gas prices are way too high. With an economy contracting and lower demand how do OPEC & the speculators get away with this?! _E_
On the shores of the Lake Norman @Trump_Charlotte features a world class course designed by @SharkGregNorman __HTTP__ _E_
It was recently reported that 3rd rate $ losing @Politico is a foil for the Clintons. Questions given to Clinton in advance. No credibility. _E_
Hillary was involved in the e mail scandal because she is the only one with judgement so bad that such a thing could have happened! _E_
Wow! New National Zogby Poll just out:.TRUMP 45. CRUZ 13. RUBIO 8. Big numbers. _E_
Before I or anyone saw the classified and/or highly confidential hacking intelligence report it was leaked out to @NBCNews. So serious! _E_
Retail sales are at record numbers. We've got the economy going better than anyone ever dreamt and you haven't seen anything yet! _E_
I met a Trump Twitter hater last night (well known). As he came near me he nervously said Mr. Trump it is an honor to meet you sir! Nice _E_
The Iraqi Army is useless. President Obama stay the hell out of Iraq (we should never have been there in the first place). _E_
.@tedcruz should not make statements behind closed doors to his bosses he should bring them out into the open more fun that way! _E_
Obama's goal of 1 million electric car sales is a little off by over 910000 __HTTP__ $100B of our money wasted! _E_
When I was 18 people called me Donald Trump. When he was 18 @BarackObama was Barry Soweto. Weird. _E_
Thank you Miami! In 6 days we are going to WIN the GREAT STATE of FLORIDA and we are going to win back the White... __HTTP__ _E_
Tweet me your questions for the next #trumpvlog.... _E_
I just bought stock in Tiffany & Company and McDonald's. Two ends of the spectrum but I like both companies. _E_
Miss Alabama @_KatherineWebb stopped by to say hello today. __HTTP__ _E_
Am in Bedminster for meetings & press conference on V.A. & all that we have done and are doing to make it better but Charlottesville sad! _E_
The new hot term that they have recently invented is POLAR VORTEX give me a break! _E_
Thank you for your support on my way now! See you soon. #TrumpTrain __HTTP__ _E_
If we can help little #CharlieGard as per our friends in the U.K. and the Pope we would be delighted to do so. _E_
Via @fitsnews:"Donald Trump Surges In New Hampshire Poll: MOGUL REALITY STAR EMERGES AS GRANITE STATE'S 'ANTI BUSH' __HTTP__ _E_
Castro Chavez and Ahmadinejad are all anxiously awaiting our election results. They are praying Obama wins. _E_
Border agent: We might as well abolish our immigration laws altogether __HTTP__ _E_
In the East it could be the COLDEST New Year's Eve on record. Perhaps we could use a little bit of that good old Global Warming that our Country but not other countries was going to pay TRILLIONS OF DOLLARS to protect against. Bundle up! _E_
Go as far as you can see when you get there you'll be able to see farther. J.P. Morgan _E_
Congratulations to @WWERaw on passing 1000 episodes. @WWE is still going strong after all these years @VinceMcMahon is great! _E_
Fake News CNN is looking at big management changes now that they got caught falsely pushing their phony Russian stories. Ratings way down! _E_
The media is going crazy. They totally distort so many things on purpose. Crimea nuclear the baby and so much more. Very dishonest! _E_
Entrepreneurs: Keep the big picture in mind. There are always opportunities & possibilities and thinking too small can negate a lot of them _E_
Governor Rick Scott of Florida did really poorly on television this morning. I hope he is O.K. _E_
Trump promises special session to repeal Obamacare: __HTTP__ _E_
Tweet me your New Year's resolution to make America great again! #TrumpNewYearsRes __HTTP__ _E_
The U.S. has appealed to Russia not to intervene in Ukraine Russia tells U.S. they will not become involved and then laughs loudly! _E_
I will once again write a $1 MILLION check to our campaign if we hit our million dollar end of month goal! __HTTP__ _E_
Will be doing Fox & Friends at 7 A.M. (1 hour). ENJOY! _E_
On my way! __HTTP__ _E_
"A brand is not a logo. A brand is the promise you put out there and the experience you deliver." – Midas Touch _E_
One of the best moves I made early in my career was buying the air rights from Tiffany's flagship. Trump Tower gleams over Fifth Avenue. _E_
WHAT THEY ARE SAYING ABOUT MIKE PENCE "DOMINATING" THE DEBATE: __HTTP__ #VPDebate _E_
Somebody with aptitude and conviction should buy the FAKE NEWS and failing @nytimes and either run it correctly or let it fold with dignity! _E_
Fiscal mismanagement of cash costing US Taxpayer billions cut fraud and waste before cutting funding for Seniors. _E_
Stock market hit yet another all time record high yesterday. There is great confidence in the moves that my Administration.... _E_
I want to end the day by saying there is no check I would rather write than that to a good charity designated by our President. _E_
Destiny has a part to play in your life and in your business so give it a chance to work. Think Like a Champion _E_
Why doesn't @JebBush in his ads show my answer to his statement in the debate? _E_
Raleigh North Carolina was fantastic last night. Such incredible spirit. We all want to and will MAKE AMERICA GREAT AGAIN! _E_
RT @EricTrump: Very proud of what my father has accomplished in the past 7 months Wishing him amazing luck and success tonight! #NVcaucus ... _E_
Our country has a big heart. And it's a point of national pride that we take care of our own. #TimeToGetTough (cont) __HTTP__ _E_
.@AndreaTantaros's radio show is a great addition to talk radio. She is sharp talented & great sense of humor. Congratulations. _E_
Attorney General Bill Shuette will be a fantastic Governor for the great State of Michigan. I am bringing back your jobs and Bill will help! _E_
Once again #MSM is dishonest. Schlonged is not vulgar. When I said Hillary got schlonged that meant beaten badly. _E_
Failed presidential candidate Mitt Romney the man who choked and let us all down is now endorsing Lyin' Ted Cruz. This is good for me! _E_
Spoiler @dennisrodman has really got his act together so far on the upcoming season of @CelebApprentice... _E_
Instead of driving jobs and wealth away AMERICA will become the world's great magnet for innovation and job creati... __HTTP__ _E_
Poor @JebBush spent $50 million on his campaign I spent almost nothing. He's bottom (and gone) I'm top (by a lot). That's what U.S. needs! _E_
RT @realDonaldTrump: "President Trump is not getting the credit he deserves for the economy. Tax Cut bonuses to more than 2000000 workers... _E_
RT @MarkHalperin: Utah Speaker of the House announces endorsement of @realDonaldTrump. Says @DonaldJTrumpJr played a big role _E_
Now that the ObamaCare website contractor has been terminated for obvious incompetence is the person who hired them going to be fired? _E_
Great article by @WayneRoot @theblaze Obama's College Classmate: 'The Obama Scandal Is at Columbia' __HTTP__ _E_
Awarded 5 stars from @ForbesInspector @TrumpTO offers 261 rooms & 115 suites in the center of downtown Toronto __HTTP__ _E_
Obama has now become the weakest POTUS against China yuan just hit record high against dollar __HTTP__ Very sad! _E_
NYC terrorist was happy as he asked to hang ISIS flag in his hospital room. He killed 8 people badly injured 12. SHOULD GET DEATH PENALTY! _E_
Lyin' Ted Cruz is now trying to convince prople that his problems with The National Enq.were caused by me. I had NOTHING to do with story! _E_
Every day Pastor Saeed is imprisoned by Iran is an indictment on Obama's 'diplomacy.' #SaveSaeed _E_
While Derek Jeter is training every day in the off season reports come out that A Rod is partying all over the country. Go Derek. @Yankees _E_
Homeland Security and law enforcement are on alert & closely watching for any sign of trouble. Our borders are far tougher than ever before! _E_
The new amnesty bill is over 1000 pages. It is another monstrosity a la ObamaCare. _E_
Just did theToday Show to announce that Baton Rouge Louisiana will host the Miss USA Pageant on Sunday June 8th. @Miss USA. _E_
Looking forward to addressing @TheEconomicClub on December 15th at the Marriot Marquis Washington DC. _E_
Look I have always liked Lance Armstrong I just hated what he did to himself including recently. His life will now be hell. _E_
I am lowering taxes far more than any other candidate. Any negotiated increase by Congress to my proposal would still be lower than current! _E_
"Get in. Get it done. Get it done right. Get out." – My father Fred C. Trump _E_
"Yesterday's home runs don't win today's games." Babe Ruth _E_
On at 9:00A.M. or 10:00 A.M. (depending on your location) on Fox is a tough but really good interview with Chris Wallace. Enjoy! _E_
Ted is the ultimate hypocrite. Says one thing for money does another for votes. __HTTP__ _E_
Sports fans should never condone players that do not stand proud for their National Anthem or their Country. NFL should change policy! _E_
I'll be on @gretawire On the Record tonight to talk about the ObamaCare fiasco 7 pm on Fox News _E_
Donald Trump: GOP Has 'Nuclear Weapon' In Fiscal Cliff Negotiation But They Don't Know It __HTTP__ via @mediaite _E_
l still think @Boeing should just bite the bullet & get rid of the new batteries in the 787. Those batteries will always be a problem! _E_
Great meeting with @THEHermanCain yesterday in Trump Tower. Great guy! _E_
Instead of creating new jobs Obamacare is destroying jobs. And the worst part is yet to come since the truly (cont) __HTTP__ _E_
.@lisarinna is at the top of her game in the upcoming season of @CelebApprentice All Stars. Our fans love her. _E_
Come on Republican Senators you can do it on Healthcare. After 7 years this is your chance to shine! Don't let the American people down! _E_
Thank you Foxconn for investing $10 BILLION DOLLARS with the potential for up to 13K new jobs in Wisconsin! MadeInTheUSA __HTTP__ _E_
Miami's top destination @TrumpDoral's remodeled Royal Palm Pool offers 18 luxurious cabanas __HTTP__ _E_
Be sure to watch Oprah today (4 pm on Channel 7) I'll be on with my entire family and it will be an entertaining hour.. __HTTP__ _E_
The Navy Yard shooting is a horrible disaster. If we don't clean up OUR COUNTRY of the garbage soon we are just going to do a death spiral! _E_
I don't know what will happen with the lawsuit against dummy @billmaher but have an obligation to charity to bring it. _E_
Great Strategic & Policy CEO Forum today with my Cabinet Secretaries and top CEO's from around the United States.... __HTTP__ _E_
It doesn't matter who you vote for it matters who is counting the votes. Be careful of voter fraud! _E_
Join me on Monday April 4th in Milwaukee! #WIPrimary #Trump2016Tickets: __HTTP__ __HTTP__ _E_
People love gossip. It's the biggest thing that keeps the entertainment industry going. @TheEllenShow _E_
If @OMAROSA is not in the Board Room I can't fire her. @latoyajackson made a strategic mistake. _E_
Sneak peek of Trump's trio of spectacular new seaside holes on the famed Ailsa course/@TrumpTurnberry __HTTP__ _E_
Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut and it will only get better. MUCH MORE TO COME! _E_
Another cover up. Obama won't disclose how many illegal immigrants he has released into our country __HTTP__ No surprise. _E_
It was my great honor to welcome Mayor's from across America to the WH. My Administration will always support local government and listen to the leaders who know their communities best. Together we will usher in a bold new era of Peace and Prosperity! __HTTP__ __HTTP__ _E_
...and job losses. American companies must be prepared to look at other alternatives. _E_
CNN which is totally biased in favor of Clinton should apologize. They knew they were wrong. __HTTP__ _E_
My interview with @jheil & @MarkHalperin at @WollmanRink airing at 5PM on @bpolitics. __HTTP__ _E_
Look forward to seeing final results of VoteStand. Gregg Phillips and crew say at least 3000000 votes were illegal. We must do better! _E_
General James Mad Dog Mattis who is being considered for Secretary of Defense was very impressive yesterday. A true General's General! _E_
....The Wall will be paid for directly or indirectly or through longer term reimbursement by Mexico which has a ridiculous $71 billion dollar trade surplus with the U.S. The $20 billion dollar Wall is "peanuts" compared to what Mexico makes from the U.S. NAFTA is a bad joke! _E_
What a dumb mistake AOL made buying the @huffingtonpost. How much longer will Arianna last I predict not much. _E_
I was just told by one of the top @PGATOUR players that my golf courses are the most elite in the country. Very nice compliment I agree. _E_
Last Friday's gaffe by @BarackObama claiming that the private sector is doing fine is illustrative.Everything to him revolves around gov't _E_
'President Elect Donald J. Trump Nominates Elaine Chao as Secretary of the Department of Transportation' __HTTP__ _E_
Via @TheScotsman: "Donald Trump's @TrumpTurnberry plan gets go ahead" __HTTP__ _E_
Why don't we ask the Navy SEALs who killed Bin Laden? They don't seem to be happy with Obama claiming credit. All he did is say O.K. _E_
.@ColinCowherd said such nice things about me during the debate that I thought I'd do his show @TheHerd on Monday (2:30pm EST). _E_
Heading to Boston to see another huge crowd! My friend Tom Brady is a great competitor and golf partner. __HTTP__ _E_
The World Economic Forum now ranks the US the fifth most competitive economy in the world. We have fallen from first under @BarackObama. _E_
The Country is being run just like the stadium. _E_
When will people and the media start to apologize to me for my statement Mexico is sending.... which turned out to be true? El Chapo _E_
How can a dummy dope like Harry Hurt who wrote a failed book about me but doesn't know me or anything about me be on TV discussing Trump? _E_
Entrepreneurs: Cover your bases. Know everything you can about what you're doing. _E_
The House Republicans and Democrats are finally unanimous! Yesterday they voted down @BarackObama's $3.6T budget (cont) __HTTP__ _E_
Hillary Clinton needs to address the racist undertones of her 2008 campaign. #FlashbackFriday __HTTP__ _E_
Our GREAT VETERANS can now connect w/ their VA healthcare team from anywhere using #VAVideoConnect available at: __HTTP__ __HTTP__ _E_
Tom marbles in his mouth Brokaw once thanked me for the great success of the Apprentice for NBC. Now he calls (cont) __HTTP__ _E_
RT @DanScavino: Last nights winner was clear & it will be proven time & time again lets #MAGA!! Lets WIN!! #TrumpTrain __HTTP__ _E_
Sugar @Lord_Sugar—you should say thank you Donald like a good little boy... ... _E_
RT @AlanDersh: We should stop talking about obstruction of justice. No plausible case. We must distinguish crimes from pol sins __HTTP__ _E_
Obama's speech indicates he wants to change this country as we know it wow he really feels emboldened. _E_
General says that the Armed Forces will be severely weakened if the large scale rape and sexual abuse problem is not brought under control. _E_
Polls show that the hurricane had a huge positive effect for Obama on his win isn't that ridiculous? _E_
.@IvankaTrump will lead the U.S. delegation to India this fall supporting women's entrepreneurship globally.#GES2017 @narendramodi _E_
The Russia Trump collusion story is a total hoax when will this taxpayer funded charade end? _E_
With all the talk of fiscal responsibility at the @DNC convention yesterday it was ironic that the debt passed $16T. _E_
Great honor to be inducted into the NJ Boxing Hall of Fame last night. Thank you! Timing could not have been better! __HTTP__ _E_
Congratulations to my daughter Ivanka and her husband Jared on the birth of their daughter Arabella Rose yesterday. _E_
Always bear in mind that your own resolution to succeed is more important than any other. Abraham Lincoln _E_
Looking forward to hosting our heroes from the Wounded Warrior Project (@WWP) Soldier Ride to the @WhiteHouse on Th... __HTTP__ _E_
Stop The China Curse Pass the Chinese Currency Bill! _E_
...... Circulation is way down and all he thinks about are his bad food restaurants. @CondeNastCorp _E_
This is the One Year Anniversary of my Presidency and the Democrats wanted to give me a nice present. #DemocratShutdown _E_
Congratulations to Sung Hyun Park on winning the 2017 @USGA #USWomensOpen _E_
Very exciting—tomorrow night at Madison Square Garden I get inducted into the @WWE Hall of Fame. _E_
Looking forward to hosting the @FloridaGOP "House Majority 2014 Golf Tournament" at Trump Int'l West Palm Beach on Jan. 27th. _E_
New Yorkers will get a chance to see a film for free this summer from @attnyc and @tribecafilmfest. My choice? Citizen Kane #FilmForAll _E_
#ICYMI: Announcement of Air Traffic Control Initiative Watch __HTTP__ _E_
Just watched Facebook COO Sheryl Sandberg on 60 Minutes. She should spend more time trying to get the F stock price up & less on her ego! _E_
With all my Administration has done on Legislative Approvals (broke Harry Truman's Record) Regulation Cutting Judicial Appointments Building Military VA TAX CUTS & REFORM Record Economy/Stock Market and so much more I am sure great credit will be given by mainstream news? _E_
Fans shouldn't worry. We have adjusted the filming schedule of the upcoming 13th season of @CelebApprentice appropriately due to the storm. _E_
Just watched Brian Williams on @TODAYshow very sad! Brian should get on with a new life and not start all over at @msnbc. Stop apologizing _E_
The many losers and haters never have the brains or stamina to become truly successful! _E_
...when they have no environmental restrictions! America' s workers need us. __HTTP__ _E_
Thank you Houston Texas! #AmericaFirst #Trump2016 __HTTP__ _E_
Why Isn't the Senate Intel Committee looking into the Fake News Networks in OUR country to see why so much of our news is just made up FAKE! _E_
... Doesn't seem like they have a coherent strategy right now. _E_
The FAKE & FRAUDULENT NEWS MEDIA is working hard to convince Republicans and others I should not use social media but remember I won.... _E_
Yes I will be live tweeting during the final debate this coming Monday. _E_
Good luck #TeamUSA#OpeningCeremony #Rio2016 __HTTP__ _E_
This was a great evening I would like to thank everyone for their wonderful support. _E_
Our leaders are terrible. The government spends over $50B a day. It can't find cuts for less than 2 days of spending?! Sad! _E_
Career Advice from Donald Trump __HTTP__ via @BNDarticles by @brittneyplz _E_
This may be the worst football game ever played by one team Denver! Hard to watch. _E_
Our enemy China is illegally buying oil from our enemy Iran __HTTP__ China loves it! _E_
Shock Obama WH given three pinocchios for lying about Benghazi emails __HTTP__ _E_
The pessimist sees the difficulty in every opportunity and the optimist sees the opportunity in every difficulty. Pres. Lincoln _E_
.@Lawrence is the poor man's left wing @oreillyfactor(with no ratings)! _E_
RT @SheriffClarke: Happy Father's Day to all dads. My dad. Like father like son @realDonaldTrump supporters to the end. He an Airborne Ra... _E_
Paula Deen made a big mistake in using a forbidden word but must be given some credit fot admitting her mistake. She will be back! _E_
The NRA in Nashville today was amazing. Packed house and standing ovation for Trump. THANKS! _E_
Because I was told I could not do well in Iowa I spent very little there a fraction of Cruz & Rubio. Came in a strong second. Great honor _E_
I will be on Face The Nation (CBS) today at 10:30 A.M. and Media Buzz (Fox News) at 11:00 A.M. Enjoy! _E_
Ivanka is now on Twitter You can follow her @IvankaTrump Have a terrific weekend! _E_
Today we gathered in the East Room to pay tribute to the HEROES whose courageous actions under fire saved so many lives in Alexandria VA. __HTTP__ _E_
See you tomorrow Michigan!Grand Rapids MI tomorrow at noon: __HTTP__ MI tomorrow at 3pm:... __HTTP__ _E_
.@KatrinaCampins Thank you so much for the wonderful statements you made about me on TV. Also keep up the great work! _E_
Congratulations to @TrumpChicago and @SixteenChicago for receiving the @AAANews Five Diamond Award again this year! _E_
China is filling the vacuum left by Obama at the UN on the world stage. _E_
Washington is wasting over $2 billion this year on Solyndra type loans. Yet they want to cut military spending. _E_
Via @DailyCaller: Donald Trump: Obama should golf w/ Republicans not his 'local friends' __HTTP__ by @NicholasBallasy _E_
Look at the way Crooked Hillary is handling the e mail case and the total mess she is in. She is unfit to be president. Bad judgement! _E_
Texas @GovAbbott & Lt. Gov. @DanPatrickThank you for todays briefing on hurricane recovery efforts here in TX. Keep up the great work! __HTTP__ _E_
RT @charliekirk11: 3 big wins in 2017 you won't hear:Trump confirmed the most circuit court judges ever in a President's 1st year (all co... _E_
China OPEC and Russia laugh at us. But now thanks to Obama so does Syria. Very sad! _E_
Who is more believable on the state of employment the great @jack_welch or some government bureaucrat who is voting for Obama? _E_
I turned down a meeting with Charles and David Koch. Much better for them to meet with the puppets of politics they will do much better! _E_
The NRA strongly endorses Luther Strange for Senator of Alabama.That means all gun owners should vote for Big Luther. He won't let you down! _E_
Our country needs leadership now. There is total dysfunction in Washington. _E_
Remember this Sunday I am also featured on @datelinenbc at 8PM right before the premiere of All Star @CelebApprentice @nbc likes me! _E_
I have created tens of thousands of jobs and will bring back great American prosperity. Hillary has only created jobs at the FBI and DOJ! _E_
Can you believe that the builder of the failed ObamaCare website was just given a new government contract how stupid is that CLUELESS!!! _E_
#VoteTrumpKS #Trump2016March 5 2016 | Wichita Kansas: __HTTP__ __HTTP__ _E_
I wonder what the rest of the world is thinking about the United States as they watch the disgusting and out of control Baltimore riots? _E_
If Crooked Hillary Clinton can't close the deal on Crazy Bernie how is she going to take on China Russia ISIS and all of the others? _E_
How could Obama leave those American heroes out to die in Benghazi? And he continues to lie to the public! _E_
Via @PressClubDC by @snlyngaas: "Trump Says U.S. Brand Has Lost Its Luster" __HTTP__ _E_
No more Clintons or Bushes! __HTTP__ _E_
I will be interviewed by @MariaBartiromo at 6:00 A.M. @FoxBusiness. Enjoy! _E_
.@Team_Mitch Fantastic win we are all proud of you! Your victory speech last night was very gracious to an opponent whose speech was not. _E_
Donald Trump Defends His Big Obama Bombshell: 'It's Not a Publicity Stunt' __HTTP__ via @eonline _E_
Terrible! Just found out that Obama had my wires tapped in Trump Tower just before the victory. Nothing found. This is McCarthyism! _E_
Dummy @KarlRove continues to make and write false statements. He still thinks Romney won he should get a life! _E_
RT @TeamTrump: .@HillaryClinton & @timkaine think you're #Deplorables & #BasementDwellers. @realDonaldTrump & @mike_pence think you're PATR... _E_
Trump Int'l Hotel & Tower Vancouver's original twisting design gives every unit a distinct view __HTTP__ A landmark! _E_
Congratulations to our great Women's Olympic Soccer team @ussoccer on their gold medal. They made us all proud! _E_
I appreciate the GOP candidates who remain strong on border security. They know I am right. A nation without borders cannot survive. _E_
RT @The_Trump_Train: @realDonaldTrump Make no mistake we are going to put the interest of AMERICAN CITIZENS FIRST! The forgotten men & w... _E_
Thank you to Eli Lake of The Bloomberg View The NSA & FBI...should not interfere in our politics...and is Very serious situation for USA _E_
Obama sadly has no business or private sector background and it shows. _E_
Watch You've Got Donald Trump at __HTTP__ _E_
The DJT Foundation unlike most foundations never paid fees rent salaries or any expenses. 100% of money goes to wonderful charities! _E_
Iraq is falling apart fast two trillion dollars and so many deaths Bush got us in and Obama took far too long to get us out! _E_
Such a serious problem for Ted & the GOP. Great doubt Dems will sue! Let's all work together to solve this problem. __HTTP__ _E_
Republican Senators will not let the American people down! ObamaCare premiums and deductibles are way up it was a lie and it is dead! _E_
Tune in tonight at 10 pm on NBC for another exciting episode of The Apprentice and see the Dog Whisperer make an appearance. _E_
RT @mike_pence: There's one clear choice in this election to create jobs and grow the American economy. #VPDebate __HTTP__ _E_
Now Chinese state run companies are taking over our coal market __HTTP__ China wants to deplete our resources here at home. _E_
One of the reasons I assume I was inducted into the @WWE Hall of Fame is that Vince McMahon and I have the all time highest ratings... _E_
Clinton Foundation's Fundraisers Pressed Donors to Steer Business to Former President __HTTP__ _E_
.@katyperry Katy what the hell were you thinking when you married loser Russell Brand. There is a guy who has got nothing going a waste! _E_
I wouldn't use @Richard_Meier to design a doghouse let alone a house or building! _E_
The real outsourcer @BarackObama is funding German automakers with the GM bailout money __HTTP__ How does that help us? _E_
Thank you for your support! Being #PoliticallyCorrect will NOT #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_
Via @Newsmax_Media: "@RepMattSalmon: Obama 'Didn't Lift a Finger' to Help Free Marine in Mexican Prison" __HTTP__ _E_
Sad thing is Rolling Stone was (is) a dead magazine with big downward circulation and now for them at last people are talking about it! _E_
I never thought I'd be saying this but I've really enjoyed @RichLowry on television lately and he was terrific hosting @seanhannity _E_
Just tried watching Saturday Night Live unwatchable! Totally biased not funny and the Baldwin impersonation just can't get any worse. Sad _E_
We don't need a Secretary of Business to understand business we need a president who understands business and I do @MittRomney _E_
Via @CarrGaz: "Trump's grand plans for @TrumpTurnberry resort get the green light" __HTTP__ _E_
"You can always become better." @TigerWoods _E_
I hope the NY tax payer appreciates the millions Schneiderman is about to waste on a small case. I will litigate to victory. _E_
Violent crime is rising across the United States yet the DNC convention ignored it. Crime reduction will be one of my top priorities. _E_
James Comey better hope that there are no tapes of our conversations before he starts leaking to the press! _E_
The thing I like best about Rex Tillerson is that he has vast experience at dealing successfully with all types of foreign governments. _E_
Speech transcript at Arab Islamic American Summit __HTTP__ __HTTP__ #POTUSAbroad _E_
My @msnbc int w/ @krystalball at #WHCD on my 2016 timetable saving Social Security & Making America Great Again! __HTTP__ _E_
"God's word is the same yesterday and today and a million years from now." @Franklin_Graham _E_
There is no challenge too great no dream outside of our reach! Thank you Selma North Carolina!#ICYMI watch here... __HTTP__ _E_
Great news I'm now leading in most polls w/ new CNN poll also having me #1. NBC I am #1 in NH by a lot #2 in Iowa close & gaining. _E_
Read this about @lawrence...... __HTTP__ _E_
The United States needs great deals and fast. We have to make our country rich again in order to MAKE OUR COUNTRY GREAT AGAIN! _E_
President Obama seems so fawning and desperate to make a deal with Iran that lots of bad results can occur. Be cool and be careful! _E_
People buy deals & immediately put them into bankruptcy in order to make better deals. It's a very effective & commonly used business tool. _E_
Iran humiliated the United States with the capture of our 10 sailors. Horrible pictures & images. We are weak. I will NOT forget! _E_
The Wall Street Journal has reported that Obama's food stamp policies are ushering in a massive 'food stamp crime wave.' #TimeToGet Tough _E_
I will be traveling to Florida tomorrow to meet with our great Coast Guard FEMA and many of the brave first responders & others. _E_
On @foxandfriends in two minutes! _E_
Thank you! I miss my father. __HTTP__ _E_
.@MichaelPhelps you are the greatest Olympic champion of them all. Fantastic job! _E_
It is a miracle how fast the Las Vegas Metropolitan Police were able to find the demented shooter and stop him from even more killing! _E_
I have decided to postpone my trip to Israel and to schedule my meeting with @Netanyahu at a later date after I become President of the U.S. _E_
It's crunch time. This Sunday's All Star Celebrity @ApprenticeNBC's task will separate the winners from the losers. _E_
Getting ready to leave for Cincinnati in the GREAT STATE of OHIO to meet with ObamaCare victims and talk Healthcare & also Infrastructure! _E_
On my way to Charleston/Mount Pleasant South Carolina. Big crowd. Look forward to it! #USSYorktown __HTTP__ _E_
For the first time in the history of military operations a country has broadcast what when and where they will be doing in a future attack! _E_
One of the keys to thinking big is total focus." – THE ART OF THE DEAL _E_
When will our country stop wasting money on global warming and so many other truly STUPID things and begin to focus on lower taxes? _E_
Vets mistreated NO border security? I'm with @V4SA this Tuesday 9/15 to #MakeAmericasMilitaryGreatAgain! Join us! __HTTP__ _E_
After 7 months of investigations & committee hearings about my collusion with the Russians nobody has been able to show any proof. Sad! _E_
.@MittRomney's poll numbers are looking really good. One more great debate performance and it will be a total knockout. _E_
Obama just endorsed Crooked Hillary. He wants four more years of Obama—but nobody else does! _E_
To all journalists look into the financial dealings of Scottish Parliament members with Vattenfall...Follow the money. _E_
Obama just had another trillion dollar budget deficit for the fourth year in a row. At least he is consistent. _E_
Just landed in Ohio. Thank you America I am honored to win the final debate for our MOVEMENT. It is time to... __HTTP__ _E_
.@jack_welch is correct these reporters would not have been so brave while Jack was running GE. _E_
Congratulations to @CharlieCrist who has now lost a statewide election in Florida as a Republican Independent & Democrat. _E_
"To be a visionary and to be a billionaire you have to chase impossibilities. Few ever get rich easily." – Think Like a Billionaire _E_
.@AnnCoulter's new book Adios America! The Left's Plan to Turn Our Country into a Third World Hellhole is a great read. Good job! _E_
New Iowa poll. Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
"Let your passion for your work carry you through all the setbacks they can throw at you." – Trump Never Give Up _E_
Higher Taxes kill job creation cut wild government spending and waste. _E_
.@FoxNews you should be ashamed of yourself. I got you the highest debate ratings in your history & you say nothing but bad... _E_
The only candidate who can get 1145 delegates is @MittRomney. The primary is over. _E_
Set your sights and aim high. You never know what you can achieve until you focus on achieving it. Midas Touch _E_
Celebrity Apprentice in 15 minutes don't miss it! _E_
.@nyrangers did a great job of winning tonight played like champions! _E_
Hillary Clinton didn't go to Louisiana and now she didn't go to Mexico. She doesn't have the drive or stamina to MAKE AMERICA GREAT AGAIN! _E_
Obama claims that he needs an extra $4B to secure the border. Well then he should not have wasted $5B on the ObamaCare website. _E_
Another broken promise by @BarackObama: @ObamaCare actually increases income inequality __HTTP__ It must be fully repealed! _E_
If you're going to think think big The Art of the Deal _E_
Nobody is watching @Morning_Joe anymore. Gone off the deep end bad ratings. You won't believe what I am watching now! _E_
Congress must protect our borders first. Amnesty should be done only if the border is secure and illegal immigration has stopped. _E_
Congrats @TrumpDoral for being named one of the Most Notable Openings of 2014 from @BizBash: __HTTP__ _E_
ObamaCare must be completely repealed. A recent report from UBS shows that it is the number one reason employers are not hiring. _E_
Thank you @greta. #ImWithYou __HTTP__ _E_
Doing David Letterman @Late_Show tonight at 11:30. 1st nite of Sweeps.Going into the lion's den but I've been there many times before. Enjoy _E_
Thank you Appleton Wisconsin!#WIPrimary #Trump2016 __HTTP__ __HTTP__ _E_
A great victory in Scotland ... __HTTP__ __HTTP__ _E_
Here's @Joan_Rivers. She & @IvankaTrump make a terrific team as my advisors. #CelebApprentice _E_
There are no buyers for the worthless @NYDailyNews but little Mort Zuckerman is frantically looking. It is bleeding red ink a total loser! _E_
Looking forward to being hosted by @saintanselm for Politics & Eggs next Tuesday. See you in Manchester! #NHPolitics _E_
I'll be making a major announcement on President Obama next week stay tuned! _E_
The reason that President Obama did NOTHING about Russia after being notified by the CIA of meddling is that he expected Clinton would win.. _E_
Congratulations to @TrumpNewYork @TrumpChicago @TrumpWaikiki @TrumpToronto on your Forbes Five Star ratings @ForbesInspector _E_
Nick Adams new book Green Card Warrior is a must read. The merit based system is the way to go. Canada Australia! @foxandfriends _E_
Thank you @Morning_Joe for throwing the pathetic reporter from the failing and money losing Daily Beast off the air. Really cool! _E_
Everyone is now saying how right I was with illegal immigration & the wall. After Paris they're all on the bandwagon. _E_
Please only respond by tweet @lawrence because like everyone else I don't watch your show. _E_
Ready to lead. Ready to Make America Great Again. #Debate #MAGA _E_
I just retained Sir Nick Faldo to be the architect of the Red Course at Doral he will do a tremendous job! @NickFaldo006 _E_
Thank you South Bend Indiana! Everyone get out & #VoteTrump tomorrow! #INPrimary __HTTP__ __HTTP__ _E_
He that is good for making excuses is seldom good for anything else. Benjamin Franklin _E_
The Baldwin family is well represented in the 13th season of All Star @CelebApprentice with @StephenBaldwin. Stephen does great. _E_
Wishing everyone a very Happy Holiday season! _E_
RT @LindseyGrahamSC: I support President Trump's desire to re enter the Paris Accord after the agreement becomes a better deal for America... _E_
Now they say obese women may cause Autism in children nonsense they use any excuse. The FDA should immediately (cont) __HTTP__ _E_
Bus crash in Tennessee so sad & so terrible. Condolences to all family members and loved ones. These beautiful children will be remembered! _E_
Today I welcomed the Victory Christian Center School. Good luck @ the Team America Rocketry Challenge! #TARC Watch... __HTTP__ _E_
RT @WhiteHouse: Happy Father's Day! __HTTP__ _E_
I've dealt w/politicians throughout the world. My deals are multi faceted transactions which involve many issues. I know the process & win! _E_
Passing what was once a vibrant manufacturing area in Pennsylvania. So sad! #MakeAmericaGreatAgain __HTTP__ _E_
Sorry @Rosie is a mentally sick woman a bully a dummy and above all a loser. Other than that she is just wonderful! _E_
The USC should be ruling any day now on @ObamaCare. Hopefully we will get the right result. _E_
A good example of how our country wastes money... __HTTP__ #trumpvlog _E_
Hope & Change! China now controls a record number of our debt __HTTP__ _E_
China is an international pariah. They are now harassing Japan over its purchase of 3 uninhabited islands __HTTP__ _E_
Let's properly check goofy Elizabeth Warren's records to see if she is Native American. I say she's a fraud! _E_
Success tip: Be ready for problems and be patient there are very few cases of instant gratification. _E_
China is openly sailing warships in our waters & arming countries in our hemisphere including Mexico __HTTP__ Ally? _E_
I will be in Wisconsin until the election. Jobs trade and immigration will be big factors. I will bring jobs back home make great deals! _E_
Heading to beautiful West Virginia to be with great members of the Republican Party. Will be planning Infrastructure and discussing Immigration and DACA not easy when we have no support from the Democrats. NOT ONE DEM VOTED FOR OUR TAX CUT BILL! Need more Republicans in '18. _E_
Trump Tuesday @SquawkCNBC tomorrow at 7:38 AM. _E_
Network news has become so partisan distorted and fake that licenses must be challenged and if appropriate revoked. Not fair to public! _E_
Miss Alabama Katherine Webb has been a truly great representative of the Ms. USA Organization ..We are proud of her! _E_
When foreigners attend our great colleges & want to stay in the U.S. they should not be thrown out of our country. _E_
The @AmSpec interview by Jeffrey Lord: A TRUMP CARD The Donald talks politics and parenting. __HTTP__ _E_
The Freedom Caucus will hurt the entire Republican agenda if they don't get on the team & fast. We must fight them & Dems in 2018! _E_
Re: Ashley Judd: Keep @KarlRove away. He already made her a viable candidate. _E_
Our gov't is so pathetic that some of the billions being wasted in Afghanistan are ending up with terrorists __HTTP__ _E_
Great going to Bob Kraft & Bill Belichick of the @Patriots on @TimTebow. Tim is a winner just like them! _E_
The 2013 Trump @MissUniverse Pageant comes to Moscow on November 9th. Airing from Crocus City Hall on @nbc! _E_
I'll be on @gretawire tonight on @foxnews at 10 pm. _E_
The so called angry crowds in home districts of some Republicans are actually in numerous cases planned out by liberal activists. Sad! _E_
RT @EricTrump: We should all take a moment to say a prayer for those who paid the ultimate price — Their bravery and sacrifice allows us t... _E_
On Friday @VPBiden said that China has better cities and airports than the US. Well what has @BarackObama done about it the last 3 years?! _E_
Please tune in January 15th at 6:00AM EST and 6:00PM EST to the QVC network to watch my wife @MELANIATRUMP... _E_
Plan a perfect weekend for the holidays in NYC's hottest neighborhood using @TrumpSoHo's 20% offer __HTTP__ _E_
The Village @Trump_Charlotte offers a variety of 5 Star dining experiences for everyday dining & catered affairs __HTTP__ _E_
Wow! What a great night. Thank you to all of the viewers and congratulations to @StephenAtHome __HTTP__ @colbertlateshow _E_
RT @FoxNews: New Poll Shows @POTUS Approval at 50 Percent __HTTP__ _E_
We must do everything possible to keep this horrible terrorism outside the United States. _E_
Interesting article from highly respected Wayne Allyn Root __HTTP__ _E_
With an award winning course designed by Tom Fazio Trump National Philadelphia is a 360 acre exclusive jewel __HTTP__ _E_
The President Changed. So Has Small Businesses' Confidence __HTTP__ _E_
Wow so many Fake News stories today. No matter what I do or say they will not write or speak truth. The Fake News Media is out of control! _E_
A few of the many clips of John McCain talking about Repealing & Replacing O'Care. My oh my has he changed complete turn from years of talk! __HTTP__ _E_
There will be no amnesty!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_
Do you believe that Hillary Clinton now wants Obamacare for illegal immigrants? She should spend more time taking care of our great Vets! _E_
We need a tax system that is fair and smart one that encourages growth savings and investment. It's time to (cont) __HTTP__ _E_
I believe this book will rock a lot of people. Don't just read #TImeToGetTough but share it with your friends and family! RushLimbaugh _E_
RT @IvankaTrump: We must reform our tax code so that all Americans can succeed in our modern economy & achieve the American Dream! #TaxRefo... _E_
My @SquawkCNBC #TrumpTuesday interview discussing how @MittRomney has to get tough real unemployment & bias press __HTTP__ _E_
Get ready for the fireworks between @OMAROSA & @latoyajackson in 13th season of All Star @CelebApprentice! Neither one will back down. _E_
.@MelaniaTrump looks amazing in 2000 @SInow! __HTTP__ _E_
Huma Abedin the top aide to Hillary Clinton and the wife of perv sleazebag Anthony Wiener was a major security risk as a collector of info _E_
Newly released emails prove that scientists have manipulated data on global warming. The data is unreliable. __HTTP__ _E_
Tremendous cold wave hits large part of U.S. Lucky they changed the name from global warming to climate change G.W. just doesn't work! _E_
Welcome to the new reality be careful. Retirement ages will be pushed to 80 due to the incompetence of our leaders. __HTTP__ _E_
Just released @CNN Poll gives me a big 13 point lead in Iowa. Change your false story failing @nytimes. Thank you Iowa! _E_
It's easy to see why Americans are sick of career politicians and both parties. _E_
Why has Obama let China and others take our jobs? _E_
Mexico's biggest drug lord escapes from jail. Unbelievable corruption and USA is paying the price. I told you so! _E_
Isn't it funny that I am now #1 in the money losing @HuffingtonPost (poll) and by a big margin. Dummy @ariannahuff must be thrilled! _E_
My son Don and his wife Vanessa just had a beautiful baby boy named Spencer Frederick very thrilling. _E_
Excited to be heading home to see the House pass a GREAT Tax Bill with the middle class getting big TAX CUTS!#MakeAmericaGreatAgain _E_
3. You should tweet your pick for MVP using the celebrity's name followed by the hashtag #CelebApprenticeMVP. _E_
.@VanityFair Magazine is doing really poorly. It has gotten worse and worse over the years and has lost almost all of it's former allure! _E_
Ted Cruz only talks tough on immigration now because he did so badly in S.C. He is in favor of amnesty and weak on illegal immigration. _E_
I hope you all are looking at the Donald J. Trump Signature Collection of ties shirts & cufflinks @Macys—great for Christmas & holidays. _E_
Great day yesterday at @TrumpDoral unveiling the new Gary Player Villa __HTTP__ Gary is a champion and a great guy. _E_
Watch this video to see how bad wind turbines are for the environment __HTTP__ _E_
The @TuckerCarlson opening statement about our once cherished and great FBI was so sad to watch. James Comey's leadership was a disaster! _E_
I just learned that @politico has no credibility total phonies that don't report the truth. A puppet of Obama? _E_
Via @TVbytheNumbers: 'Celebrity Apprentice' is Number 1 among ABC CBS & NBC for its Second Hour from 10 11 p.m. __HTTP__ _E_
Karl Rove is now making excuses for his total wasting of $400M—not one win—(the Republicans better get smart next time)... _E_
.@HillaryClinton ITS CALLED EXTREME VETTING! #Debates2016 __HTTP__ _E_
This is the first time in my life that I have caused controversy by NOT saying something. _E_
The people are really smart in cancelling subscriptions to the Dallas & Arizona papers & now USA Today will lose readers! The people get it! _E_
Join me live for the commissioning ceremony of the USS Gerald R. Ford! __HTTP__ #USA __HTTP__ _E_
Dopey @billmaher is in for a lot of trouble—I hope he has $5 million (for charity). _E_
Getting ready for @nbcsnl commercial. __HTTP__ _E_
I will be On The Record with Greta Van Susteren @gretawire tonight at 10 PM on Fox News. _E_
The people of Ireland have been so great about my purchase of Doonbeg I'll be there soon. @LodgeatDoonbeg _E_
Assad hit the jackpot! _E_
Here we go with the Oscars! _E_
By the way folks @billmaher is not a smart guy (just look at his past)—he just pretends he is! _E_
Does anybody notice that Atlantic City lost its magic after I left years ago. I had the big boxing introduced UFC (ask Dana)the best shows _E_
Where's the electability? Jeb is losing to HRC by 13 points. A Bush will never beat a Clinton. Wake up @GOP! _E_
Obama should stop running down the stairs when getting off Air Force One. Doesn't look presidential and at some point he will take a fall. _E_
It's Tuesday. How many more non stories will the liberal media try to manufacture so everyone ignores Obama's record? _E_
Thank you to all law enforcement agencies for a fabulous job!#LEO #LESM #Trump2016 __HTTP__ _E_
Celebrity Apprentice will be LIVE on Sunday at 9 PM (from New York City).Casting has already begun for next season. _E_
For first time the failing @nytimes will take an ad (a bad one) to help save its failing reputation. Try reporting accurately & fairly! _E_
Will be interviewed on @Morning_Joe at 7:40. ENJOY! _E_
How is Chris Christie running the state of NJ which is deeply troubled when he is spending all of his time in NH? New Jerseyans not happy! _E_
Happy Father's Day to all even the haters and losers! _E_
Thank you for the kind words tonight @OMAROSA. You were great! See you soon! _E_
Last night Melania and I attended the Skating with the Stars Gala at Wollman Rink in Central Park it was fantastic. Stay tuned for Part 2.. _E_
We create success or failure on the course primarily by our thoughts. Gary Player _E_
Via @ConcordNHPatch by @politizine: "Trump: 'We'll Make America Great Again'" __HTTP__ _E_
Hillary there is nothing to laugh about __HTTP__ _E_
.@sethmeyers Seth can't help it he is really trying hard but just doesn't have what it takes. Very awkward and insecure! _E_
See you tomorrow w/ Gov. @Mike_Pence Iowa & Wisconsin! 3pm __HTTP__ __HTTP__ __HTTP__ _E_
Why does US doping agency destroy an American icon @lancearmstrong for events that took place years ago in France? _E_
Departing Golden CO. for Arizona now after an unbelievable rally. Watch here: __HTTP__ __HTTP__ _E_
A wonderful article by a writer who truly gets it. I am for the people and the people are for me. #Trump2016 __HTTP__ _E_
Associated Press knowingly and inaccurately wrote about Liberty University speech. Shameful reporting...no credibility. _E_
Looks like @tedcruz is getting ready to attack. I am leading by so much he must. I hope so he will fall like all others. Will be easy! _E_
Despite having a black president the racial divide seems greater than it has in decades.If Obama were a leader this would not be the case _E_
Via @BreitbartNews by @NolteNC: DONALD TRUMP SURGES TO COMMANDING LEAD IN POST MCCAIN BACKLASH POLL __HTTP__ _E_
He is destroying our country:@BarackObama has requested to raise our debt limit to over $16.4Trillion by the end (cont) __HTTP__ _E_
Just looked at new selection of Donald J. Trump Signature Collection ties & shirts @Macys fantastic! Would make great gifts! _E_
Jeb used Eminent Domain & took advantage of a disabled vet in the process. (2/2) __HTTP__ _E_
The fans are going to love the tasks in the upcoming 13th season of All Star @CelebApprentice. The biggest yet! _E_
I have accepted the invitation of President Enrique Pena Nieto of Mexico and look very much forward to meeting him tomorrow. _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Where are @RepMarkMeadows @Jim_Jordan and @Raul_Labrador?#RepealANDReplace #Obamacare _E_
In NYC looks like another attack by a very sick and deranged person. Law enforcement is following this closely. NOT IN THE U.S.A.! _E_
Michael Forbes lives in a pigsty and bad liquor company Glenfiddich gave him Scot of the Year award... _E_
Listening to @rushlimbaugh on way back to Jury Duty. Fantastic show terrific guy! _E_
.@MagicJohnson Good luck with the Dodgers this season if they were like you they would never lose a game! _E_
One of the worst and most boring political pundits on television is @krauthammer. A totally overrated clown who speaks without knowing facts _E_
Univision apologized to me but I will not accept their apology. I will be suing them for a lot of money. Miss U.S.A. contestants are hurt! _E_
Lightweight @AGSchneiderman will probably win only because he is a Dem in NY but what a loser! _E_
I have watched sloppy Graydon Carter fail and close Spy Magazine and now am watching him fail at @VanityFair Magazine. He is a total loser! _E_
I'll be on @foxandfriends Monday at 7:30 AM don't miss it. _E_
Ratings way down show irrelevant. Why haven't they learned? @Rosie always fails. _E_
Democrats are far more concerned with Illegal Immigrants than they are with our great Military or Safety at our dangerous Southern Border. They could have easily made a deal but decided to play Shutdown politics instead. #WeNeedMoreRepublicansIn18 in order to power through mess! _E_
#HappyIndependenceDay #USA __HTTP__ _E_
'Manufacturing openings hires rise to highest levels of the recovery' __HTTP__ _E_
I bought Tim Tebow's jersey and helmet at auction for a good cause fighting breast cancer __HTTP__ _E_
For those of you in trouble—(in these troubled times)—never ever give up! _E_
If my people said the things about me that Podesta & Hillary's people said about her I would fire them out of self respect. Bad instincts _E_
Congress should be worried about American workers not people who came into our country by breaking our laws. _E_
RT @IvankaTrump: 2016 has been one of the most eventful and exciting years of my life. I wish you peace joy love and laughter. Happy New... _E_
.@CNN is the worst.They go to their dumb one sided panels when a podium speaker is for Trump! VAST MAJORITY want: Make America Great Again! _E_
MAKE AMERICA GREAT AGAIN! #IACaucus #CaucusForTrump __HTTP__ __HTTP__ _E_
My Twitter account was taken down for 11 minutes by a rogue employee. I guess the word must finally be getting out and having an impact. _E_
.@FrankLuntz is a low class slob who came to my office looking for consulting work and I had zero interest. Now he picks anti Trump panels! _E_
The CDC chief just said Ebola is spreading faster than Aids. Marines are preparing for a pandemic drill. Stop all flights from West Africa! _E_
Why the Rust Belt just gave Donald Trump a hero's welcome __HTTP__ _E_
From an amazing day on the border in Laredo. __HTTP__ _E_
Condolences to the family of the young woman killed today and best regards to all of those injured in Charlottesville Virginia. So sad! _E_
I keep getting great feedback on new #TRUMP cologne 'Success.' Exclusively available at @Macy's __HTTP__ And best shirts & ties _E_
Just read @marklevinshow's bestseller book—really great! _E_
The failing @nytimes does not mention the new @CNN Poll that has me leading Iowa by a massive 13 points I am at 33%. Maggie Haberman sad! _E_
Give me clean beautiful and healthy air not the same old climate change (global warming) bullshit! I am tired of hearing this nonsense. _E_
...Save your energy Rex we'll do what has to be done! _E_
Join me in Colorado at 12pm tomorrow or Arizona at 3pm!TICKETS:Golden: __HTTP__ __HTTP__ _E_
The best luck of all is the luck you make for yourself. Douglas MacArthur _E_
Getting ready to do the David Letterman @Late_Show tonight—I hope you all will watch—I think! _E_
The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_
He was quick to issue an apology on behalf of America to Karzai. Why won't he release the letter? @BarackObama __HTTP__ _E_
We should immediately close all tax loopholes that favor foreign investments and taking our jobs overseas t... (cont) __HTTP__ _E_
I told you @TIME Magazine would never pick me as person of the year despite being the big favorite They picked person who is ruining Germany _E_
There are just so many penalties and such long commercials in these NFL games that they are no longer worth watching. Soft hitting & boring! _E_
Thank you America! #Trump2016 __HTTP__ _E_
Governor Cuomo only cut the Verrazano Bridge tolls because I made it a major point in speeches. I love the people of Staten Island! _E_
Via @Newsmax_Media by Alana Marie Burke: "Donald Trump 2016: 7 Key Political Positions" __HTTP__ _E_
Politicians are all talk and no action. Washington can only be fixed by an outsider. Let's make America great again! __HTTP__ _E_
...and West Virginia. The fact is the Fake News Russian collusion story record Stock Market border security military strength jobs..... _E_
I have spoken w/ @GovAbbott of Texas and @LouisianaGov Edwards. Closely monitoring #HurricaneHarvey developments & here to assist as needed. _E_
The Ebola doctor who just flew to N.Y. from West Africa and went on the subway bowling and dining is a very SELFISH man should have known! _E_
Together we will prevail in the GREAT state of Texas. We love you!GOD BLESS TEXAS & GOD BLESS THE USA __HTTP__ _E_
Even liberals & Democrats think Eric Schneiderman's use of the Atty General's office is unfair & unethical. __HTTP__ _E_
Speaker John Boehner who I like should never have agreed to raise taxes because the Republicans got absolutely nothing for it! _E_
...extremism and all reference was pointing to Qatar. Perhaps this will be the beginning of the end to the horror of terrorism! _E_
Via @EW by @DaltonRoss: "recap: 'Nobody Out Thinks Donald Trump'" __HTTP__ _E_
The S&P downgrade is a direct result of @BarackObama's increased reckless budget spending and Obama Care. He owns this. _E_
I'm on @CNN's @AC360 tonight @8pm & @FoxNews' @seanhannity @ 10PM discussing immigration and lots of other things.#LetsMakeAmericaGreatAgain _E_
Wonderful coordination between Federal State and Local Governments in the Great State of Texas TEAMWORK! Record setting rainfall. _E_
What a sad thing that the memory of Nelson Mandela will be stained by the phoney sign language moron who is in every picture at funeral! _E_
Landing in Phoenix now. Tomorrow's events will be amazing! #Trump2016 _E_
RT @foxandfriends: FOX NEWS EXCLUSIVE: President Trump 'seriously considering' a pardon for ex Sheriff Joe Arpaio __HTTP__ _E_
Leaving now for New Hampshire. Big crowd looking forward to it! #FITN _E_
Newsmax is a great news organization and its pres debate in IA on Dec 27 will be fair balanced and informative. @ralphreed _E_
A horrible day for Newtown CT and our country yesterday. My condolences to all of the families so tragically affected. _E_
Isn't it ironic that China is going all in nuclear for energy while at the same time making wind turbines for others. @alexsalmond _E_
.@ScottWalker is a nice guy but not presidential material. Wisconsin is in turmoil borrowing to the hilt and doing poorly in jobs etc. _E_
Via @MailOnline by @dmartosko: "President Trump? Says 'there's a very substantial chance' he'll run in 2016" __HTTP__ _E_
...Even though parts of healthcare could pass at 51 some really good things need 60. So many great future bills & budgets need 60 votes.... _E_
.@alexsalmond @pressjournal RT @JohnDuthie1 just sitting here looking out over Aberdeen bay. These clowns cannot be allowed... _E_
Word is that crying @GlennBeck left the GOP and doesn't have the right to vote in the Republican primary. Dumb as a rock. _E_
Health Insurance stocks which have gone through the roof during the ObamaCare years plunged yesterday after I ended their Dems windfall! _E_
Remember it was the Republican Party with the help of Conservatives that made so many promises to their base BUT DIDN'T KEEP THEM! Hi DT _E_
If the working proud and productive people of our country don't start exerting their authority and views the U.S. as we know it is doomed! _E_
One of the saddest things in journalism is what happened to the formerly great @AP. They have lost their way and are no longer credible. _E_
Change is not a destination just as hope is not a strategy. Rudy Giuliani _E_
Off to Indiana! #Trump2016 __HTTP__ _E_
Rosie O'Donnell should leave Lindsay Lohan alone @Rosie has bigger problems than Lindsay. Lindsay's mother called my office for help _E_
One of the many reasons that @VattenfallGroup dropped out of windfarm project—they couldn't solve military radar defense problems _E_
"Don't let the fear of striking out hold you back." – Babe Ruth _E_
When will the Fake Media ask about the Dems dealings with Russia & why the DNC wouldn't allow the FBI to check their server or investigate? _E_
ObamaCare gives free insurance to illegal immigrants. Yet @BarackObama is cutting our troops healthcare. (cont) __HTTP__ _E_
It is wonderful to be in beautiful Doonbeg touring @Trump_Ireland. I'm truly honored by the wonderful welcome to my family & organization _E_
If you're going to be thinking you may as well think big. _E_
Watch the game really good. _E_
"Success isn't permanent and failure isn't fatal." – Mike Ditka _E_
I will be in Iowa all day and until Tuesday morning. Finally after all these years of watching stupidity we will MAKE AMERICA GREAT AGAIN! _E_
True @THEGaryBusey is a scene stealer without trying. He's got a gift. #CelebApprentice _E_
If I become the next POTUS they will not be ignoring! #AmericaFirst __HTTP__ _E_
Is that all there is? We need a new President FAST! _E_
Invincibility lies in the defence the possibility of victory in the attack. Sun Tzu _E_
Innovation distinguishes between a leader and a follower. Steve Jobs _E_
Loved doing #NCGOPConvention keynote speech last night! Unbelievable reception. Had the biggest crowds by far of any of the GOP candidates. _E_
The talks between the U.S. and Iran are going on forever WORLD'S LONGEST NEGOTIATION. Obama has no idea what he is doing incompetent! _E_
You're all wrong—check the facts! UK is massively subsidizing Scotland's wind turbines & the people don't want them. _E_
... Rove's ad campaign has made Ashley Judd a totally credible candidate. Be careful Mitch! _E_
Tonight I trade places with Larry King @kingsthings and interview him on the 25th anniversary of his show. 9PM on CNN featuring best clips. _E_
My @foxandfriends interview discussing the Benghazi cover up Hostess' closing & celebrating Thanksgiving with family __HTTP__ _E_
Crooked took MILLIONS from oppressive ME countries. Will she give the $$$ back? Probably not. Don't forget her slog... __HTTP__ _E_
I am having 600 Thanksgiving dinners sent to the Rockaways prepared by my wonderful Trump Grill/Trump Tower staff. #SandyRelief _E_
Spolier alert...the record setting 13th season of All Star @CelebApprentice also features the return of previous winners in the boardroom. _E_
Venezuelan leader Hugo Chavez said in a television interview that aired on Sunday If I were American I'd vote for Obama. _E_
JOBS JOBS JOBS! #MAGA __HTTP__ _E_
On my way to Pensacola Florida. See everyone soon! #MAGA __HTTP__ _E_
Crooked Hillary Clinton has destroyed jobs and manufacturing in Pennsylvania. Against steelworkers and miners. Husband signed NAFTA. _E_
...@BarackObama is hiding plenty of bad things. _E_
With the coming forward today of the woman central to the failing @nytimes hit piece on me we have exposed the article as a fraud! _E_
This is the right TAX CUT @ the RIGHT TIME. We will ALL succeed & grow TOGETHER – as one team one people & one American family. #TaxReform __HTTP__ _E_
If you want to conquer fear don't sit home and think about it. Go out and get busy. Dale Carnegie _E_
Should have gone after the oil years ago (like I have been saying). _E_
An 'extremely credible source' has called my office & told me that @BarackObama applied to Occidental as a foreign student think about it! _E_
Crooked Hillary Clinton blames everybody (and every thing) but herself for her election loss. She lost the debates and lost her direction! _E_
Looking forward to addressing @ralphreed's @FaithandFreedom 'Road to Majority Conference' on June 13th __HTTP__ _E_
President Obama spends so much time speaking of the so called Carbon footprint and yet he flies all the way to Hawaii on a massive old 747. _E_
As a tribute to the late great Phyllis Schlafly I hope everybody can go out and get her latest book THE CONSERVATIVE CASE FOR TRUMP. _E_
The economy is bad and getting worse almost ZERO growth this quarter. Nobody can beat me on the economy (and jobs). MAKE AMERICA GREAT AGAIN _E_
We are excited to announce Trump Estates at Akoya by DAMAC luxury villas situated byTrump Int'l Golf Links Dubai __HTTP__ _E_
Breaking news The Washington Redskins have just announced that they will be removing the name Washington from their name! _E_
"@DamacOfficial Announces @TigerWoods to Create Golf Course for Trump World Golf Club Dubai" __HTTP__ via @BusinessWire _E_
RT @DanScavino: LOUISIANA GENERAL ELECTIONDonald Trump vs. Hillary Clinton#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Glad to hear @GovChristie will be delivering the Keynote for the @RNC convention. He will deliver a strong message. _E_
It's Tuesday. How much money will Karl Rove waste today trying to push amnesty through the House? _E_
No @JebBush you're pathetic for saying nothing happened during your brother's term when the World Trade Center was attacked and came down. _E_
Kasich voted for NAFTA a disaster for Ohio and now wants the even worse TPP approved. Vote Trump and end this madness! _E_
We must all be united in offering assistance to everyone suffering in Puerto Rico and elsewhere in the wake of this terrible disaster. _E_
A clip of my upcoming interview with @DavidBrody discussing #TimeToGetTough @Israel and the Islamist winter __HTTP__ _E_
RT @DanScavino: Join @realDonaldTrump LIVE in Wisconsin with Gov. @ScottWalker @MayorRGiuliani @Reince & Coach Bobby Knight! LIVE: __HTTP__ _E_
It is amazing how often I am right only to be criticized by the media. Illegal immigration take the oil build the wall Muslims NATO! _E_
Congratulations to Aberdeen and Scotland for just having our great golf course named Best New Course In World by The Robb Report. _E_
I love that thousands of people are boycotting @Macys and cutting up credit cards. No guts no glory. This really backfired love it! _E_
Facebook was always anti Trump.The Networks were always anti Trump henceFake News @nytimes(apologized) & @WaPo were anti Trump. Collusion? _E_
I hope Bill Clinton and NEWSMAX's Chris Ruddy are enjoying their mission to Africa. Two great people. _E_
Celebrity Apprentice continues to be a top ten trend on twitter this morning __HTTP__ _E_
No I'm saying that the World is paying the price for China's pollution while they make a fortune with their dirty factories! Very sad. _E_
The Club For Growthwhich asked me for $1000000 in an extortion attempt just put up a Wisconsin ad with incorrect math.What a dumb group! _E_
Via "TRUMP: HILLARY PRESIDENCY WILL CAUSE 'CRIME WAVE LIKE YOU'VE NEVER SEEN'" __HTTP__ via @BreitbartNews _E_
.@MattGinellaGC Don't forget to watch Matt tomorrow on Morning Drive talking about The Blue Monster and Trump Doral. @GolfChannel _E_
A quote was read from a parody account last night on MSNBC re: Jeb. __HTTP__ _E_
RT @Corrynmb: @realDonaldTrump Liberals have an agenda and it's not in America's best interest. Keep fighting the good fight! We stand with... _E_
Today it was my great honor to sign a new Executive Order to ensure Veterans have the resources they need as they transition back to civilian life. We must ensure that our HEROES are given the care and support they so richly deserve! __HTTP__ __HTTP__ _E_
Leaving now I'm spending the entire day in Iowa great people great state! _E_
#MakeAmericaSafeAgain #ImWithYou __HTTP__ _E_
Rambling and stumbling @hardball_chris is as dumb as a rock! _E_
My interview yesterday with @TeamCavuto discussing Europe's debt deal and the GOP primary __HTTP__ _E_
Great New Poll __HTTP__ _E_
The Justice Dept. should have stayed with the original Travel Ban not the watered down politically correct version they submitted to S.C. _E_
The best vision is insight. Malcolm Forbes _E_
Business is looking better than ever with business enthusiasm at record levels. Stock Market at an all time high. That doesn't just happen! _E_
The Senate must NOT pass TPA! Any Senator who votes for it is disqualified for being POTUS. Protect the American worker and manufacturer! _E_
Yet another terrorist attack this time in Turkey. Willthe world ever realize what is going on? So sad. _E_
A MUST WATCH TRULY BEAUTIFUL! @PrivateCaddie: Amazing Turnberry Ailsa course changes from @realDonaldTrump #Golf __HTTP__ _E_
I am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_
Coming together is a beginning keeping together is progress working together is success. Henry Ford _E_
It's important to remain open to new ideas and new information. Keep your door open every day to something innovative and energizing. _E_
Located in Palm Beach FL historic Mar a Lago features 20 exquisite acres filled w/ world class amenities __HTTP__ _E_
Via @businessinsider by @hunterw: "TRUMP: 'I'm going to surprise a lot of people' in 2016" __HTTP__ _E_
Results are what matter. The bottom line is clearly the bottom line. Think Like a Champion _E_
I will be live tweeting my interview with @megynkelly on the Fox Network tonight at 8! Enjoy! __HTTP__ _E_
I will be interviewed by @MariaBartiromo on @MorningsMaria @FoxBusiness at 7:30 A.M. Enjoy. _E_
Bush and Rubio are finally attacking each other as I knew they would in order to be the last establishment man standing against me.Great _E_
This is what REAL PRIDE in our COUNTRY is all about! #USA __HTTP__ _E_
Bernie Sanders is being treated very badly by the Democrats the system is rigged against him. Many of his disenfranchised fans are for me! _E_
RT @axios: The DOJ is opening a civil rights investigation on the car attack in Charlottesville __HTTP__ _E_
Thank you New Hampshire! #FITN __HTTP__ _E_
Meeting with Iowa State Senate Leaders __HTTP__ _E_
Must read f/@ weeklystandard by @JayCostTWS: "Obamacare Myth Making Five phony success stories." __HTTP__ _E_
"Each life is made up of mistakes and learning waiting and growing practicing patience and being persistent." – Rev. @BillyGraham _E_
Being good in business is the most fascinating kind of art.Making money is art & working is art & good business is the best art. A. Warhol _E_
Republicans must be careful in that the Dems own the failed ObamaCare disaster with its poor coverage and massive premium increases...... _E_
Hillary's vision is a borderless world where working people have no power no jobs no safety. _E_
Always bear in mind that your own resolution to succeed is more important than any other. Abraham Lincoln _E_
Ivanka Trump will be interviewed on @foxandfriends. _E_
Crude has skyrocketed since @BarackObama delayed the Keystone Pipeline. Not only are 20000 jobs gone but family budgets are tightening. _E_
Our founders invoked our Creator four times in the Declaration of Independence. Our currency declares "IN GOD WE TRUST." And we place our hands on our hearts as we recite the Pledge of Allegiance and proclaim that we are "One Nation Under God." #NationalPrayerBreakfast __HTTP__ _E_
Sugar @Lord_Sugar Why don't you tell the public what you're really worth they would be very disappointed. _E_
Closely monitoring #HurricaneHarvey from Camp David. We are leaving nothing to chance. City State and Federal Govs. working great together! _E_
The #Hyperlapse app in @TrumpTowerNY __HTTP__ _E_
I will be interviewed by @GStephanopoulos on @GMA at 7:00 A.M. There is much to talk about! _E_
Thank you for your support in Biloxi MS! Let's ALL get out & VOTE in 2016 so we can #MakeAmericaGreatAgain! __HTTP__ _E_
Crooked Hillary promised 200k jobs in NY and FAILED. We'll create 25M jobs when I'm president and I will DELIVER! __HTTP__ _E_
They finally let our Marine out of a Mexican prison no thanks to Obama. Way too long. Such an event should never be allowed to happen again _E_
It was just announced that @ErinBurnett won't be going to mornings on CNN. @OutFrontCNN just made a wise decision. _E_
No wonder the Today Show on biased @NBC is doing so badly compared to its glorious past. Little credibility! _E_
Join me in Mobile Alabama on Sat. at 3pm! #ThankYouTour2016 Tickets: __HTTP__ __HTTP__ _E_
I will be interviewed by @MarthaMaccallum on @FoxNews tonight at 7pm. Enjoy! _E_
It is about time that Roger Goodell of the NFL is finally demanding that all players STAND for our great National Anthem RESPECT OUR COUNTRY _E_
I'm sick of always reading about outsourcing. Why aren't we talking about 'onshoring'? (cont) __HTTP__ _E_
Thank you to @foxandfriends for the nice reviews of last night. _E_
Nasty Ted Cruz is at it again same dirty tricks he used w/ @RealBenCarson saying I may not be on ballot & I hold liberal positions. LIES! _E_
"He who knows when he can fight and when he cannot will be victorious." Sun Tzu _E_
My Trump Home Mattress Collection by Serta is setting records they are really phenomenal. You can order them at __HTTP__ _E_
Pigs get slaughtered ... again. Ft Lauderdale plaintiffs must pay me close to $400k in legal fees after Trump trial victory. _E_
Do you believe this singing? #Oscars _E_
.@antbaxter Anthony—did you illegally take clips from the Letterman @Late_Show show and @GolfChannel without their approval? _E_
Remember this the worst doctors (by far) are celebrity doctors. If you see their names or read about them in the newspapers stay away! _E_
Via @espn: @dallasmavs "most likely scenario remains finishing a frustrating ninth in the West" __HTTP__ _E_
Busy week planned with a heavy focus on jobs and national security. Top executives coming in at 9:00 A.M. to talk manufacturing in America. _E_
.@KellyandMichael are both wonderful people. Their show is terrific. #CelebApprentice _E_
Don't take vacations. What's the point? If you're not enjoying your work you're in the wrong job. Think Like A Billionaire _E_
Big ratings getter @seanhannity and Apprentice Champion John Rich are right now going on stage in Las Vegas for #VegasStrong. Great Show! _E_
American must now get very tough very smart and very vigilant. We cannot admit people into our country without extraordinary screening. _E_
#RiyadhSummit #POTUSAbroad __HTTP__ _E_
Really bad article about me in the dying (or dead) Esquire Magazine. Totally false lots of hatred. When will this boring magazine close? _E_
People that have read it tell me that @KarlRove book is terrible (and boring). Save your money! @FoxNews should can him no credibility! _E_
See the attack very possibly could have been stopped. We need real leadership and vision. __HTTP__ _E_
What's more important for the American public to have? @MittRomney's tax returns or @BarackObama's sealed records? _E_
Iran with all of the money and all else given to them by Obama has wanted a way to take over Saudi Arabia & their oil. THEY JUST FOUND IT! _E_
You have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_
To aspiring entrepreneurs: Trust your instincts. They are there for a reason. _E_
COMING UP @GenFlynn @newtgingrich on @foxandfriends _E_
Who would you like to see on next season of #CelebrityApprentice? Let us know everyone wants to be on it. _E_
Trump defends campaign manager charged for bruising a reporter: __HTTP__ _E_
Stock Market up 5 months in a row! _E_
At 9:00 P.M. @CNN of all places is doing a Special Report on my daughter Ivanka. Considering it is CNN can't imagine it will be great! _E_
HILLARY'S BAD TAX HABIT! __HTTP__ _E_
Such long rhetorical and boring answers from Obama. No wonder nothing gets done. _E_
.@DannyZuker I hear your filmography is stacked with failures. _E_
I hope you buy my shirts and ties at @Macys _E_
A huge honor for @TrumpToronto for being named #1 Luxury Hotel in Canada by @TripAdvisor's #TravelersChoice Awards __HTTP__ _E_
RT @DonaldJTrumpJr: Great group at our Victory Office in Columbus Ohio. I'm incredibly grateful to have so many... __HTTP__ _E_
Wow new @ABCnews/@WashingtonPost @GOP preference poll has DonaldTrump 11 points up! Thank you. _E_
The delegates at the @DNC convention keep shouting Four More Years. Four more years of 18% real unemployment and another $6T in debt? _E_
The #CelebApprentice post @OMAROSA. Will it ever be the same? _E_
Referees are destroying the enjoyment of NFL games. Slowing down the fun. Big shots. Jets game is ridiculous! _E_
I hope Derek Jeter's recovery is going well. He is a very special player and a great guy. New York loves him. @yankees _E_
Thank you Pennsylvania I am forever grateful for your amazing support. Lets MAKE AMERICA GREAT AGAIN! #MAGA... __HTTP__ _E_
My @SquawkCNBC interview. __HTTP__ _E_
.@TheRealMarilu is impressing the All Star Celebrity @ApprenticeNBC viewers with her continued success on Team Power. _E_
An unbelievable night in Iowa with our great Veterans! We raised $6000000.00 while the politicians talked! #GOPDebate _E_
I missed the PGA Championship because it was not broadcast by TimeWarner @TWC. Why aren't they giving subscribers major discounts? _E_
If I only had 1 person running against me in the primaries like Hillary Clinton I would have gotten 10 million more votes than she did! _E_
Via @Newsmax_Media: "Trump to Speak at CPAC" __HTTP__ @CPACnews #CPAC13 _E_
The hatred that clown @krauthammer has for me is unbelievable – causes him to lie when many others say Trump easily won debate. _E_
Thank you Mahoning County Ohio! See you soon! #MakeAmericaSafeAgain __HTTP__ __HTTP__ _E_
.@GovernorPataki was a terrible governor of NY one of the worst would've been swamped if he ran again! _E_
The new Red Tiger course at @TrumpDoral __HTTP__ Follow @TrumpGolf for more great photos. _E_
More and more people are suggesting that Republicans (and me) should be given Equal Time on T.V. when you look at the one sided coverage? _E_
Canadian PM Harper immediately called the Ottawa attack terrorism. At least North America has a strong leader who lives in reality. _E_
My first order as President was to renovate and modernize our nuclear arsenal. It is now far stronger and more powerful than ever before.... _E_
The failing @nytimes does major FAKE NEWS China story saying Mr.Xi has not spoken to Mr. Trump since Nov.14. We spoke at length yesterday! _E_
Medicare payments have become so unpredictable that record amount of doctors are now leaving __HTTP__ Bad for long term. _E_
Arnold Schwarzenegger isn't voluntarily leaving the Apprentice he was fired by his bad (pathetic) ratings not by me. Sad end to great show _E_
Thank you Virginia! #Trump2016#SuperTuesday _E_
Great Live Signing last nite! Over 25k views. I am signing books for next two weeks. Order yours for holiday gifts. __HTTP__ _E_
Excited to see @SixteenChicago's "elevated fine dining" explored by @USAToday @10Best! __HTTP__ _E_
It's Friday. How many people have been forced off their plans and lost their doctors today because of ObamaCare? _E_
Miss Israel and Miss Lebanon no more fighting! #TrumpVlog #MissUniverse __HTTP__ _E_
Robert I'm getting a lot of heat for saying you should dump Kristen but I'm right. If you saw the Miss Universe girls you would reconsider. _E_
Two people fired very early on Celebrity Apprentice tonight at 9 leading up to next weeks live Finale. Don't get angry at me tonight! _E_
It's important to promote an image of yourself each and every day. It's part of having a sense of self and a sense of purpose. _E_
Now Chinese agents are smuggling our military weapons through rogue US soldiers __HTTP__ China loves to cheat! _E_
Re: Negotiation: View any conflict as an opportunity. Be a diplomat as much as possible. _E_
I will be landing in Las Vegas shortly to pay my respects with @FLOTUS Melania. Everyone remains in our thoughts and prayers. _E_
Our billion dollar website __HTTP__ _E_
Via @BleacherReport: "Donald Trump to Be Inducted into WWE Hall of Fame" __HTTP__ _E_
RT @DonaldJTrumpJr: An Honor to be in #Indiana w @realDonaldTrump @greta & the legend Bobby Knight! I like our secret weapon better!!! __HTTP__ _E_
Going over to @TodayShow now to introduce @ApprenticeNBC cast etc. watch. _E_
Alaska Arizona Maine and Kentucky are big winners in the Healthcare proposal. 7 years of Repeal & Replace and some Senators not there. _E_
If I run I will be in all the primary debates and you will see why I am the only one who can Make America Great Again! _E_
Happy Thanksgiving to all even the haters and losers! _E_
Apprentice ratings doing great easily won the 10 o'clock hour over other networks! _E_
Can you imagine the anger and disgust when the heads of other countries found out that their cell phones were being tapped by NSA.Obama mess _E_
It's been great making so many new friends at Trump @DoralResort for the @CadillacChamp. Good luck to everyone! _E_
.@EricShawnonFox Highest rated Saturday Night Live in four years. 47% higher than their opening night with Hillary & Miley Cyrus. Nice words _E_
.@GStephanopoulos just announced that I am leading BIG in the new @ABC Poll which will be shown on This Week at 9:00 A.M. I will be on show _E_
THANK YOU California Maryland New York and Pennsylvania! See you soon!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Out of our very big country with many choices does everyone notice that both the ban case and now the sanctuary case is brought in ... _E_
Remember if you don't sell yourself no one else will. Make sure the public friends & the business community hears about your success. _E_
My interview with @gretawire last night Everything Obama Does is a 'Campaign Speech' __HTTP__ _E_
Bernie Sanders totally sold out to Crooked Hillary Clinton. All of that work energy and money and nothing to show for it! Waste of time. _E_
Just received applause at #NCGOPcon when I said People ask me why I may run for President I might so we can Make America Great Again! _E_
Carson now admits his friend named Bob who he tried to stab (Bob was saved by his belt buckle!) no longer exists as Bob. Wrong name! _E_
The United States mourns for the victims of Nice France. We pledge our solidarity with France against terror. __HTTP__ _E_
Republican Tax Cuts are looking very good. All are working hard. In the meantime the Stock Market hit another record high! _E_
I saw from my window just before accident that the crane was not properly anchored for the storm. _E_
If we did all the things we are capable of we would literally astound ourselves. Thomas A. Edison _E_
.@MajorCBS Major Garrett of @CBSNews covers me very inaccurately. Total agenda bad reporter! _E_
Piers truly hates Omarosa! _E_
Certainly has been an interesting 24 hours! _E_
When will @BarackObama present an actual budget? Enough with the games. _E_
Failing @nytimes which has been calling me wrong for two years just got caught in a big lie concerning New England Patriots visit to W.H. _E_
What recovery? JP Morgan has readjusted Q2 growth down from 1.7% to 1.4% and Q3 to 1.5% with 2012 on a whole at 1.7% __HTTP__ _E_
China wouldn't provide a red carpet stairway from Air Force One and then Philippines President calls Obama the son of a whore. Terrible! _E_
"Presidential Proclamation Commemorating the 50th Anniversary of the Vietnam War" __HTTP__ __HTTP__ __HTTP__ _E_
RT @PressSec: .@POTUS and @FLOTUS meet w/ some of America's finest on the USS Kearsarge off the coast of PR. __HTTP__ _E_
Almost all reporters falsely report that I had a bad time at last year's White House Correspondents' Dinner. (cont) __HTTP__ _E_
Bernie Sanders is lying when he says his disruptors aren't told to go to my events. Be careful Bernie or my supporters will go to yours! _E_
We are going to ask Katherine Webb to be a judge at the Miss USA Pageant coming up in Las Vegas. _E_
On this wonderful Veterans Day I want to express the incredible gratitude of the entire American Nation to our GREAT VETERANS. Thank you! __HTTP__ _E_
A great night in Iowa! __HTTP__ _E_
I still don't know who I'm going to choose. @GeraldoRivera or @LeezaGibbons? Who do you like? @ApprenticeNBC _E_
He @BarackObama is incapable of admitting that he is a complete and utter failure. He is 100% responsible for Solyndra. __HTTP__ _E_
Why do people listen to clown @KarlRove on @FoxNews? Spent $430M & lost all races—a Bushy! _E_
Thanks you for all of the Trump Rallies today. Amazing support. We will all MAKE AMERICA GREAT AGAIN! _E_
Thoughts & prayers with everyone in Lafayette Louisiana this evening. _E_
.@TheEconomist Poll one of the most highly respected was just released. Wow wait until the media digests these numbers won't be happy! _E_
Thank you America! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_
My great honor to host the 2017 back to back #StanleyCup Champion Pittsburgh Penguins at the WH with @FLOTUS today! __HTTP__ __HTTP__ _E_
I have a tip that can take 5 strokes off anyone's golf game. It's called an eraser. Arnold Palmer _E_
With unempoyment over 10% in 2009 @BarackObama held an extravagant Alice in Wonderland party. He is a man of the people! _E_
If you are steadfast in your efforts and self respect critics will be harmless. Keep your focus! _E_
It is being reported by virtually everyone and is a fact that the media pile on against me is the worst in American political history! _E_
Via @MoscowTimes: Donald Trump in New @eminofficial Video __HTTP__ Emin & family are wonderful people. _E_
#Imwithyou __HTTP__ _E_
Price of corn has jumped over 50%. This will cause a jump in food prices perhaps beyond what we've ever seen. Nasty for the economy. _E_
When confronted @RickSantorum can't defend his ridiculous attacks on @MittRomney __HTTP__ _E_
... So if you want to aim high you have to have the guts to handle the inevitable bumps in the road. Think BIG _E_
Millions protesting in Egypt for Morsi's ouster __HTTP__ When will Obama demand Morsi's resignation as he did to Mubarak _E_
#TBT As a young man when I proposed the Convention Center in New York City. __HTTP__ _E_
RT @DRUDGE_REPORT: WSJ: Grifters in Chief... __HTTP__ _E_
Putin has become a big hero in Russia with an all time high popularity. Obama on the other hand has fallen to his lowest ever numbers. SAD _E_
This story is no longer about John McCain it's about our horribly treated vets. Illegals are treated better than our wonderful veterans. _E_
...they are costly inefficient bird killing community destroying machines. They are obsolete! @maddow _E_
The last person corrupt Hillary Clinton wants to run against is Donald J. Trump. I'll end up beating her in every state. New Fox Poll Trump! _E_
#trumpvlog NY Area Two book signings Tonight and Thursday.... __HTTP__ _E_
Courageous Patriots have fought and died for our great American Flag we MUST honor and respect it! MAKE AMERICA GREAT AGAIN! _E_
Knowledge requires patience action requires courage. Put patience and courage together and you'll be a winner. _E_
My childcare plan makes a difference for working families more money more freedom. #AmericaFirst means... __HTTP__ _E_
Would seem that plane landed short of runway in San Francisco! _E_
My son Donald openly gave his e mails to the media & authorities whereas Crooked Hillary Clinton deleted (& acid washed) her 33000 e mails! _E_
Hillary Clinton is using race baiting to try to get African American voters but they know she is all talk and NO ACTION! _E_
Today's open call drew thousands of eager applicants. It was an impressive group I enjoyed meeting them. We've got some great candidates! _E_
Because I will be busy doing anything other than being in the movie #RoadHard. __HTTP__ _E_
I hear @pennjillette show on Broadway is terrible. Not surprised boring guy (Penn). Without The Apprentice show would have died long ago. _E_
Isn't it sad that on a day of national tragedy Hillary Clinton is answering softball questions about her email lies on @CNN? _E_
Just for your info tax returns have 0 to do w/ someone's net worth. I have already filed my financial statements w/ FEC. They are great! _E_
ISIS is taking credit for the terrible stabbing attack at Ohio State University by a Somali refugee who should not have been in our country. _E_
Now even @BarackObama's old professors are coming out in opposition to his re election. __HTTP__ He has embarrassed them. _E_
Only a fool would buy the @NYDailyNews. Loses fortune & has zero gravitas. Let it die! _E_
Entrepreneurs: Review your work habits regularly and make sure they are taking you in the right direction. Keep your focus intact. _E_
Advice from my mother Mary MacLeod Trump: Trust in God and be true to yourself. _E_
Ted Cruz was born in Canada and was a Canadian citizen until 15 months ago. Lawsuits have just been filed with more to follow. I told you so _E_
..my endorsement). He also wanted to be Secretary of State I said NO THANKS. He is also largely responsible for the horrendous Iran Deal! _E_
.@kilmeade It was great being with you on @foxandfriends this morning. So many people saw and loved the piece. Great work! _E_
Join us Saturday night for the South Carolina Primary Watch Party!#SCPrimary #Trump2016 __HTTP__ _E_
Pictures of @melaniatrump and me from the Men In Black III premiere in New York City __HTTP__ We loved the movie! _E_
The United States under President Obama has truly become the gang that couldn't shoot straight. Everything he touches turns to garbage! _E_
Thank you @DennisRodman. It's time to #MakeAmericaGreatAgain! I hope you are doing well! __HTTP__ _E_
RT @JackPosobiec: Meanwhile: 39 shootings in Chicago this weekend 9 deaths. No national media outrage. Why is that? __HTTP__ _E_
The Ebola nurse should NEVER have been allowed to fly to Cleveland and (amazing) back again. Nothing works in our once great country anymore _E_
Just signed 702 Bill to reauthorize foreign intelligence collection. This is NOT the same FISA law that was so wrongly abused during the election. I will always do the right thing for our country and put the safety of the American people first! _E_
I look forward to Saturday night and being inducted into the @WWE Hall of Fame. _E_
Clips from tax speech and @seanhannity on @foxandfriends now. Have a great day! _E_
Via @TPInsidr __HTTP__ _E_
The biggest doers often suffer the biggest setbacks in life... _E_
The legendary @BarbaraJWalters interviews my family and me tonight at 10:00 on @ABC2020 . Don't miss it! __HTTP__ _E_
CNN Poll just out on South Carolina – great #'s __HTTP__ _E_
'Obama Warned Of Rigged Elections In 2008.' Time to #DrainTheSwamp __HTTP__ __HTTP__ _E_
South Carolina was so great last night. Will be back soon! _E_
I will be interviewed on @FacetheNation Sunday 10AM on CBS. @johndickerson is a true pro! _E_
The #CNBCGOPDebate poll closed with #Trump2016 declared the official winner. Thank you! __HTTP__ __HTTP__ _E_
For those of you defending Bret and saying Omarosa should go remember Bret chose O which could also be considered a big mistake! _E_
Via @realitytvworld: La Toya Jackson fired from 'All Star Celebrity Apprentice' by Donald Trump __HTTP__ _E_
I am so glad @Rosie got fired by @Oprah. Rosie is a bully and it's always nice to see bullies go down! _E_
Tina Brown could finally be over. @thedailybeast is a total failure. She just got fired great! _E_
It is a great honor to have helped the community so much. __HTTP__ _E_
RT @foxandfriends: OPIOID CRISIS: Worse than we thought with a new study showing overdose deaths were under reported __HTTP__ _E_
Via @AP by @kronayne & @colvinj: Disavowed by GOP leaders Trump has supporters cheering __HTTP__ _E_
Trump @DoralResort's renovations are on schedule. With such a massive project underway I am watching closely. _E_
Crooked Hillary Makes History! #ImWithYou #AmericaFirst __HTTP__ _E_
Thank you the very dishonest Fake News Media is out of control! __HTTP__ _E_
Thank you for your support! TOGETHER we will MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_
Wrong used to be called global warming and when that name didn't work they deftly changed it to climate change because it's freezing! _E_
.@ewanshearer Happy Birthday _E_
Beijing had a bigger celebration than Chicago last night. The Chinese are happier with the election than we are. _E_
Negotiation is an art. Treat it like one. _E_
People don't understand that I left The Apprentice to run for Pres—the Apprentice DID NOT leave me. Bob Greenblatt & folks @NBC were GREAT! _E_
At the request of many I will be doing live tweets during the next presidential debate. _E_
I loved being at Liberty University today! Record setting crowd unbelievable people! Thank you Jerry and Becki! __HTTP__ _E_
.@HillaryClinton #ICYMI WE ARE NOT IN A NARRATIVE FIGHT. @Mike_Pence #MAGA __HTTP__ _E_
Going to New Hampshire all sold out crowds. People want real change POLS WILL NEVER MAKE OUR COUNTRY GREAT AGAIN! _E_
Rep. Stephen Lynch (D Ma) said There's all of these taxes and fees that are the tough medicine..it's going to hit the fan 're ObamaCare. _E_
We are experiencing the coldest weather in more than two decades most people never remember anything like this. GLOBAL WARMING anyone? _E_
Is Cruz honest? He is in bed w/ Wall St. & is funded by Goldman Sachs/Citi low interest loans. No legal disclosure & never sold off assets. _E_
A lot of the @Yankees should be ashamed of their play in the post season. They are lucky they don't have to deal with George Steinbrenner. _E_
I have millions more votes/hundreds more dels than Cruz or Kasich and yet am not being treated properly by the Republican Party or the RNC. _E_
Live on the edge no complacency is allowed and keep an open mind. Business is a creative endeavor. _E_
Via @BreitbartNews' @biggovt: "WAR! TRUMP LEVIN PUMMEL ROVE AS CONSERVATIVE BATTLE ESCALATES" __HTTP__ _E_
.@genesimmons is terrific congratulations on Hall of Fame. _E_
...They have been in our country for many years through no fault of their own brought in by parents at young age. Plus BIG border security _E_
Trump Int'l Hotel Washington D.C.: The iconic Old Post Office Building will be one of the world's great hotels. __HTTP__ _E_
For those who missed my chat with @hannityshow on radio here it is on TV. Sean is terrific. __HTTP__ _E_
The media is really on a witch hunt against me. False reporting and plenty of it but we will prevail! _E_
Readout of my meeting with Israeli Prime Minister Benjamin Netanyahu: __HTTP__ __HTTP__ _E_
On the whole the teams seem to be working well together. No wars...yet. _E_
In all of television the only one who said anything bad about last nights landslide victory was dopey @KarlRove. He should be fired! _E_
It is time Republicans stop attacking each other and focus on @BarackObama. America cannot survive a second term. _E_
The Trump Signature Collection exclusively available at @Macys tops all menswear styles. Dress to impress! __HTTP__ _E_
Via @DMRegister by @JenniferJJacobs: Trump to hand out Trump memorabilia at Iowa summit __HTTP__ _E_
.@Yankees are in trouble without Derek. Try A Rod at short get him some confidence. _E_
.@TrumpWaikiki is Hawaii's top luxury hotel & destination. Each room features stunning views & superb amenities __HTTP__ _E_
Why did lightweight A.G. Eric Schneiderman come to my office on numerous occasions begging for campaign contributions? Also recent asks? _E_
I hate @USAToday's redesign the logo is terrible. Lightweight Al Neuharth must've had something to do with this No wonder paper is failing. _E_
THANK YOU NEVADA!#Trump2016 #MakeAmericaGreatAgain@Snapchat! Username: realdonaldtrump __HTTP__ __HTTP__ _E_
Watching the #GOPConvention#AmericaFirst #RNCinCLE _E_
Join me in California or Montana!5/25/16: Anaheim California __HTTP__ Billings Montana __HTTP__ _E_
If speeches and memoirs created jobs then @BarackObama would be Ronald Reagan. _E_
Good news is that my campaign has perhaps more cash than any campaign in the history of politics b/c I stand 100% behind everything we do. _E_
Thank you to @Franklin_Graham. I have always appreciated your courage but now more so than ever! _E_
Same CDC which is bringing Ebola to US misplaced samples of anthrax earlier this year __HTTP__ Be careful. _E_
.@mcuban is so short off the tee he can't have much of a punch. He's just a weak man with a big mouth! _E_
#trumpvlog @BarackObama is very inconsiderate... __HTTP__ _E_
RT @foxandfriends: Millions of gallons of Mexican waste threaten Border Patrol agents __HTTP__ _E_
It was great spending time with @joniernst yesterday. She has done a fantastic job for the people of Iowa and U.S. Will see her again! _E_
The arrogant young woman who questioned me in such a nasty fashion at No Labels yesterday was a Jeb staffer! HOW CAN HE BEAT RUSSIA & CHINA? _E_
Don't forget to watch Larry King tonight CNN at 9 pm. He's a television legend and a great friend. It's going to be a fantastic farewell. _E_
Via @BreitbartNews by @THESHARKTANK1: DONALD TRUMP FIRES ENTIRE 2016 GOP FIELD __HTTP__ _E_
. @foxandfriends interview discussing a budget deal my #CPAC2013 speech @RealBenCarson & firing @latoyajackson __HTTP__ _E_
#TBT On the stage during the Emmys performing Green Acres with Megan Mullally __HTTP__ _E_
Thank you to the Governor of Florida Rick Scott for your endorsement. I greatly appreciate your support! _E_
#CrookedHillary has FAILED all over the world! #BigLeagueTruth #Debates2016 __HTTP__ _E_
The movie may be garbage but we can't let a foreign country dictate to us what to watch. @SonyPictures _E_
It's Wednesday. I wonder how much money @BarackObama borrowed from China today? _E_
Just returned home from the great state of New Hampshire. Have made so many friends there special place! _E_
"@TrumpFerryPoint was something we've been working on for years and Donald Trump got it to the finish line." @rubendiazjr _E_
.@peachespulliam at @TrumpTowerNY this afternoon a wonderful woman. It was an honor to donate $25K to her charity. __HTTP__ _E_
America deserves a commander in chief who respects the challenges and realities our Armed Forces face in our (cont) __HTTP__ _E_
This is no surprise. Constant phony reporting from failing @CNN turns everyone off. The American people get it! __HTTP__ _E_
New job numbers once again show no growth or recovery. Unemployment has been over 8% for 41 straight months now up to 8.3% _E_
Don't wait for dire circumstances to test your quick thinking ability. Be on alert at all times. _E_
Real unemployment is at over 21%. Businesses won't hire until @BarackObama is defeated in 2012. #TimeToGetTough _E_
Can you imagine what the outcry would be if @SnoopDogg failing career and all had aimed and fired the gun at President Obama? Jail time! _E_
The truth continues to come out after 14 years. A truth that many in the media did not want to tell. #Trump2016 __HTTP__ _E_
Tomorrow's election will have historic repercussions for our country. Make America strong again. Vote for @MittRomney. _E_
We are going to WIN and MAKE AMERICA GREAT AGAIN maybe better than ever before! _E_
Our prayers are with Rev. @BillyGraham for a speedy recovery. His faith continues to inspire us all. _E_
Just leaving Mechanicsburg PA. Incredible crowd so enthusiastic! Will be back soon. #MAGA __HTTP__ _E_
RT @paulsperry_: BREAKING: top FBI investigator for Mueller PETER STRZOK busted sending political text messages bashing Trump & praising... _E_
Response to the Des Moines Register __HTTP__ _E_
.@Zagat named Christmas Day Brunch @TrumpChicago @SixteenChicago one of the best in the city! #TrumpHolidays __HTTP__ _E_
Thank you Michigan! This is a MOVEMENT that will never be seen again it's our last chance to #DrainTheSwamp! Watch... __HTTP__ _E_
I am on @FoxNewsSunday with Chris Wallace his 20th year anniversary with #FNS throughout the day. Enjoy! __HTTP__ _E_
Press Conference Following National Security Briefing in Bedminster New Jersey. __HTTP__ __HTTP__ _E_
$6 gas is coming sooner than later. America must become energy independent with our own resources and fast.Also (cont) __HTTP__ _E_
It's Thursday and again I ask how much money is China stealing from us? _E_
Believe you can and you're halfway there. Pres. Theodore Roosevelt _E_
I believe in #AmericaFirst and that means FAMILY FIRST! My childcare plan reflects the needs of modern working clas... __HTTP__ _E_
In case you missed it my @gretawire interview on Obama's IRA rate cut hurting savings & economic growth __HTTP__ _E_
Trump Making GOP Speech — Is 2016 in the Cards? __HTTP__ via @Newsmax_Media _E_
All signs are that business is looking really good for next year only to be helped further by our Tax Cut Bill. Will be a great year for Companies and JOBS! Stock Market is poised for another year of SUCCESS! _E_
I wonder if the Rutgers coach who had the audacity to yell at the player is a proponent of global warming? _E_
Thank you @BillyJoel many friends just told me you gave a very kind shoutout at MSG. Appreciate it love your music! _E_
#TeamTrump. Police and law enforcement seem to have killed one of the California shooters and are in a shootout with the others. Go police _E_
I loved firing goofball atheist Penn @pennjillette on The Apprentice. He never had a chance. Wrote letter to me begging for forgiveness. _E_
Thank you Wilmington North Carolina!#MakeAmericaGreatAgain __HTTP__ _E_
We spent TWO TRILLION DOLLARS in Iraq and got NOTHING. Now we are going back and will again get NOTHING because our leaders are clueless! _E_
Will be on Hannity tonight. Rebroadcast of town hall from Pittsburgh PA. 8:00pm on FOX. Enjoy! #Trump2016 __HTTP__ _E_
I have self funded my winning primary campaign with an approx. $50 million loan. I have totally terminated the loan! _E_
Interesting read from Peggy Noonan. __HTTP__ _E_
Since stop & frisk was struck down gun shootings & victims have spiked while gun seizures have decreased. __HTTP__ _E_
RT @DoralResort: Thanks! RT @gem3wood: @DonaldJTrumpJr You guys @DoralResort have one hell of a leaderboard. Love this Tournament. _E_
Ted Cruz is incensed that I want to refocus NATO on terrorism as well as current mission but also want others to PAY FAIR SHARE a must! _E_
Moderator: Hillary plan calls for more regulation and more government spending. #Debate #BigLeagueTruth _E_
Joseph Kennedy is really being used by Venezuela and Hugo C. in oil commercial! _E_
So many lives and two trillion dollars wasted and our worst enemies will get the 2nd largest oil reserves in the World. Such stupid leaders _E_
Leaving for New Hampshire now. Making a speech—packed house. Love it! _E_
The Fake News is now complaining about my different types of back to back speeches. Well there was Afghanistan (somber) the big Rally..... _E_
A budget that puts #AmericaFirst must make safety its no. 1 priority—without safety there can be no prosperity: __HTTP__ _E_
Now that George Bush is campaigning for Jeb(!) is he fair game for questions about World Trade Center Iraq War and eco collapse? Careful! _E_
RT @EricTrump: Nevada remember you can Vote and Go walk in vote and walk out! Caucus locator: __HTTP__ #TrumpLV __HTTP__ _E_
Big news to share in New Hampshire tonight! Polls looking great! See you soon. _E_
Dopey Arianna @huffingtonpost is really after me boring story after boring story...but I hear she is in big trouble! _E_
China is a threat to America. They are not our friend. _E_
The Budget passed late last night 51 to 49. We got ZERO Democrat votes with only Rand Paul (he will vote for Tax Cuts) voting against..... _E_
The rally in Lowell Massachusetts was amazing. 10000 people going wild. MAKE AMERICA GREAT AGAIN! _E_
China keeps manipulating its currency at our financial expense. Why do our leaders continually let China run all over us? _E_
Puerto Rico survived the Hurricanes now a financial crisis looms largely of their own making. says Sharyl Attkisson. A total lack of..... _E_
Dying @GQMagazine just named me to a list. Too bad GQ is no longer relevant—won't be around long! _E_
Great meeting with CEOs of leading U.S. health insurance companies who provide great healthcare to the American peo... __HTTP__ _E_
This boardroom gets CRAZY! These people are wild _E_
Unemployment is plaguing both Black and Hispanic youths. Very troubling. _E_
I am happy to announce that the @PGAGrandSlam will be held at @TrumpGolfLA this year! __HTTP__ Follow @TrumpGolf for more! _E_
Awarded the renowned 5 Star @ForbesInspector rating the 65 story @TrumpTO brings style luxury & impeccable service __HTTP__ _E_
I'm protesting the @UnionLeader from having anything to do w/ ABC debate. Their unethical record doesn't give them the right to be involved! _E_
So nice thank you Laura. __HTTP__ _E_
.@antbaxter—Your documentary works better than any sleeping pill—in fact that may be your only way to make money with this recycled garbage! _E_
Derek get well soon the @Yankees need youl. _E_
Now @BarackObama is issuing regulatory demands to states ordering no firings in November __HTTP__ _E_
Thank you Delaware County Ohio! Remember either we WIN this election or we are going to LOSE this country!... __HTTP__ _E_
Thank you South Carolina! Everyone has to get out and VOTE on 11/8/16. #MakeAmericaGreatAgain... __HTTP__ _E_
Great day in D.C. with @SpeakerRyan and Republican leadership. Things working out really well! #Trump2016 __HTTP__ _E_
Obama has now had two record & historic midterm losses. There is Hope & Change for America. _E_
Poll numbers have nosedived for pervert NYC mayoral candidate Anthony Weiner good news for New York! _E_
Lightweight Schneiderman's suit was filed on a Saturday (unheard of) against a school with a 98% approval rating right after Obama meeting. _E_
Check out today's #trumpvlog about the upcoming episode of @ApprenticeNBC.... __HTTP__ #celebrityapprenticefinale _E_
Ted Cruz said he didn't know that he was a Canadian Citizen. He also FORGOT to file his Goldman Sachs Million $ loan papers.Not believable _E_
Democrats have shut down our government in the interests of their far left base. They don't want to do it but are powerless! _E_
.@Jimmyv3 @WWE Greatly appreciate your nice words re WrestleMania. That's why you are such a respected writer. _E_
In standing by @dennisrodman I was also representing many people who have addiction problems & are working hard to come back. _E_
Iran was on its last legs and ready to collapse until the U.S. came along and gave it a life line in the form of the Iran Deal: $150 billion _E_
The successful man will profit from his mistakes and try again in a different way. Dale Carnegie _E_
Thank you to the LGBT community! I will fight for you while Hillary brings in more people that will threaten your freedoms and beliefs. _E_
Follow me on Instagram __HTTP__ _E_
With that being said I have personally directed the fix to the unmasking process since taking office and today's vote is about foreign surveillance of foreign bad guys on foreign land. We need it! Get smart! _E_
#CelebApprentice Another exciting episode tune in next Monday at 8pm for 2 more new episodes! _E_
DONALD TRUMP BLASTS THE OSCARS __HTTP__ via @theblaze _E_
Early on Ted Cruz said that if he didn't win South Carolina it's over. He didn't win and lost to me in a landslide! _E_
It is a joke the amount of time that network news spends talking about the weather. No wonder their ratings are way down! Enough already. _E_
The Golden Rule of Negotiating: He who has the gold makes the rules. _E_
The dishonest media is fawning over the Democratic Convention. I wonder why then my speech had millions of more viewers than Crooked H? _E_
Like your current health care plan? Too bad you're going to lose it under ObamaCare. Hope Change & a 300% Increase in Your Premium. _E_
#AskTrump Getting ready to answer your questions. __HTTP__ _E_
Honored to host a luncheon for African leaders this afternoon. Great discussions on the challenges & opportunities facing our nations today. __HTTP__ _E_
.@THEGaryBusey and one of his Busey isms: "Art is only the search it is not the final form." #CelebApprentice _E_
Get it straight: Pakistan is not our friend. When our tremendous Navy SEALS took out Osama bin Laden they did... (cont) __HTTP__ _E_
Who did the House Task Force onUrgent Fiscal Issues call when America needed HELP? __HTTP__ _E_
I hope @billmaher pays quickly so that this money can immediately be given to the charities. _E_
Congratulations Kevin Gabriel on your amazing article. If I were a journalist this would be the next Watergate and I would be a star. _E_
He ruins the brand: @Robertgbeckel doesn't belong on @FoxNews . As CM for Mondale in '84 you lost 49 states. Sad! _E_
President @EmmanuelMacronThank you for the beautiful welcome ceremony at Les Invalides today! __HTTP__ _E_
Many many people are disappointed I didn't run third party but I won't risk @BarackObama benefiting from a split in the anti Obama vote. _E_
"America is the experiment that works." – President Ronald Reagan _E_
Stay on message is the chant. I always do trade jobs military vets 2nd A repeal Ocare borders etc but media misrepresents! _E_
It pays to have friends in high places like the Justice Department. Clearly the Clintons do. #DrainTheSwamp! __HTTP__ _E_
Entrepreneurs: There are no guarantees. But being ready sure beats being taken by surprise. Do your due diligence! _E_
Wow the two highest apartment rentals in all of 2013 were at Trump Park Avenue—each one = $100000 per month __HTTP__ _E_
I just sent @THEGaryBusey a check of $20000 for his charity Children's Kawasaki Disease . He worked hard and deserves it. _E_
I will be interviewed by @JudgeJeanine tonight on @FoxNews Enjoy! _E_
"Trump: Rove Gave Us Obama" __HTTP__ via @cnsnews _E_
"Integrity is the essence of everything successful." – Richard Buckminster Fuller _E_
Thank you @foxandfriends. Really great job and show! _E_
My @foxandfriends interview discussing how @BarackObama is running a hateful campaign & the @RNC convention 'Surprise' __HTTP__ _E_
Congratulations to @MariaBartiromo on her big move to @FoxBusiness. She is a total winner! _E_
Put this on your calendar: The Celebrity Apprentice live finale is this Sunday at 9 p.m. on NBC. Who will be the next Celebrity Apprentice? _E_
Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!! __HTTP__ _E_
RT @hughhewitt: @realDonaldTrump I spoke to a group of influential CA GOPers tonight long time activists bundlers influencers. Support f... _E_
Great success in Iowa today. Fantastic sold out crowd. Will be back soon! _E_
Happy Easter to everyone! _E_
Dummy writer @tonyschwartz who wanted to do a second book with me for years (I said no) is now a hostile basket case who feels jilted! _E_
My @gretawire interview from last Friday discussing the unemployment numbers gas prices and acquiring the Doral __HTTP__ _E_
ObamaCare is a total disaster. Hillary Clinton wants to save it by making it even more expensive. Doesn't work I will REPEAL AND REPLACE! _E_
Surprise? 1970's global cooling alarmists were pushing same no growth liberal agenda as today's global warming __HTTP__ _E_
Coming up in March: The Comedy Central Roast of Donald Trump. March 15 mark your calendars. __HTTP__ _E_
Both Ted Cruz and John Kasich have no path to victory. They should both drop out of the race so that the Republican Party can unify! _E_
We will always take care of our GREAT VETERANS. You have shed your blood poured your love and bared your soul in... __HTTP__ _E_
Looking forward to speaking at #sparknb next week in Atlantic Canada my first time ever. _E_
.....but that's what I've been saying. Very unfair treatment by the media! _E_
.@JebBush has embarrassed himself & his family with his incompetent campaign for President. He should remain true to himself. _E_
It's disgraceful that the Obama Administration's first response was not to condemn attacks on our diplomatic (cont) __HTTP__ _E_
I will be signing copies of my new book TIME TO GET TOUGH tomorrow Dec 9th in Trump Tower from 11 a.m. to ... (cont) __HTTP__ _E_
... It is time to get out and rebuild our own nation. _E_
We must repeal Obamacare and replace it with a much more competitive comprehensive affordable system. #debate #MAGA _E_
The Huffington Post is such a loser it will die just as AOL is dying What a stupid deal AOL made to buy it! _E_
A Rod is now being investigated for continued doping __HTTP__ @yankees have a great opportunity to dump him now. Go for it! _E_
.@DonaldJTrumpJr & his wife @MrsVanessaTrump attended the #SnowflakeGardenBrunch here w/ Governor @TerryBranstad. __HTTP__ _E_
Let @PeteRose in the HOF it's time! _E_
The opening of the @TigerWoods Villa at trumpdoral __HTTP__ _E_
Wow China exports rise 15% in September. They are laughing at USA! _E_
The mark of a great player is in his ability to come back. The great champions have all come back from defeat. Sam Snead _E_
#MakeAmericaGreatAgain! __HTTP__ _E_
Rand Paul or whoever votes against Hcare Bill will forever (future political campaigns) be known as the Republican who saved ObamaCare. _E_
Consumer prices rose in June due to OPEC __HTTP__ OPEC continues to rip off hard working American families daily. _E_
Great jobs report today It is all beginning to work! _E_
.@AndreaTantaros You are a true journalistic professional. I so agree with what you say. Keep up the great work! #MakeAmericaGreatAgain _E_
How quickly people forget that Crooked Hillary called African American youth SUPER PREDATORS Has she apologized? _E_
Don't believe the @FoxNews Polls they are just another phony hit job on me. I will beat Hillary Clinton easily in the General Election. _E_
Models! Remember to register for the Trump Model Search. Check out the info here: __HTTP__ @CadillacChamp _E_
It won't stay a buyer's market forever. If you can take advantage and buy property asap. You'll thank me! _E_
Small bright spot in lackluster economy travel industry added 81000 jobs in 2012 __HTTP__ Trump Org had a record year. _E_
Sadly when it comes to using the energy industry to create American jobs @BarackObama has been a total (cont) __HTTP__ _E_
.@MittRomney should continue to stay on offense on the embassy issue. Obama who put these radicals in power deserves blame. _E_
Thank you to a #Trump2016 supporter for this video of my campaign over the past 6 months. Video: __HTTP__ _E_
Can you believe Crooked Hillary said We are going to put a whole lot of coal miners&coal companies out of business. She then apologized. _E_
Great new poll thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
.@whitehouse continues to defend the billions it pissed away on 'green energy' failures __HTTP__ Your money was wasted. _E_
The debate was pretty even but I thought Mitt should have been much more aggressive on Obama's failed foreign policy and I mean much more. _E_
Standing on what will be the greatest golf course in the world. Opens July 10th. __HTTP__ _E_
So much for 'global warming.' Earth is cooling at a record pace __HTTP__ _E_
Only you can #SavetheQueen during the LIVE telecast of #MissUSA on June 8 at 8/7c on NBC. Click for more info: __HTTP__ _E_
An 'extremely credible source' has called my office and told me that @BarackObama bought his house with the help of Tony Rezko. _E_
#TBT My confirmation picture at First Presbyterian Church in Jamaica NY. __HTTP__ _E_
Join me today in Wilmington Ohio at 4pm: __HTTP__ Tampa Florida at 10am: __HTTP__ _E_
The #CNNDebate was amazing so much fun! __HTTP__ _E_
Buy American & hire American are the principals at the core of my agenda which is: JOBS JOBS JOBS! Thank you @exxonmobil. _E_
Obama has called @GOP terrorists during this showdown. It's a shame he really doesn't think it because then he would meet all @GOP demands. _E_
I loved being a surrogate on behalf of @MittRomney. I am glad I was able to help him win. _E_
.@JuliInkster Congratulations on your great win what a captain what a champion! _E_
Entrepreneurs: Don't put blinders on or limit yourself. Reach out seek and explore. The opportunities are always there. _E_
Everybody knows why Obama would not show his college applications they are just not willing to say! _E_
RT @EWErickson: Personally I think it is awesome that @realDonaldTrump listens to @DLoesch on the radio. She's awesome. _E_
Watch my appearance on @foxandfriends... __HTTP__ _E_
Make sure you realize that this 'deal' is only a stop gap measure.Obama will be looking to raise even more taxes in the coming negotiation.. _E_
LIVE on #Periscope: Major announcement! #MakeAmericaGreatAgain __HTTP__ _E_
Why would the people of Kentucky want a rookie Senator– they have Sen. Mitch @McConnellPress who may be next Leader & bring $'s to KY _E_
I've been visiting Trump Int'l Golf Links Scotland and the course will be unmatched anywhere in the world. Spectacular! __HTTP__ _E_
Watch my endorsement of @MittRomney. __HTTP__ _E_
Have a great Memorial Day and remember that we will soon MAKE AMERICA GREAT AGAIN! _E_
Chuck Hagel showed gross incompetence before yesterdays Senate panel...our new Secretary of Defense. _E_
Hillary Clinton is unfit to be president. She has bad judgement poor leadership skills and a very bad and destructive track record. Change! _E_
See the new sizzle reel for The Apprentice __HTTP__ _E_
.....Has worst attendance record in Senate rarely there to vote on a bill! @marcorubio _E_
The great GENERALS MacArthur and Patton real leaders and fighters are spinning in their graves as we give Syria info & time to prepare. _E_
A great honor to host the @SuperBowl Champion New England @Patriots at the White House today. Congratulations!... __HTTP__ _E_
Getting China to stop playing its currency charades can begin whenever we elect a president ready to take decisive action. #TimeToGetTough _E_
Sad sack @JebBush has just done another ad on me with special interest money saying I won't beat Hillary I WILL. But he can't beat me. _E_
I will be interviewed on @oreillyfactor at 8:00 P.M. Enjoy! _E_
Do you think the 14 African nations that are banning West Africans from coming into their nations are being called racists? Perhaps not! _E_
Media desperate to distract from Clinton's anti 2A stance. I said pro 2A citizens must organize and get out vote to save our Constitution! _E_
.@McIlroyRory Way to go Rory fantastic victory! _E_
The amazing Trump National Golf Club Los Angeles. __HTTP__ _E_
That was really exciting. Made all of my points. MAKE AMERICA GREAT AGAIN! _E_
Derek Jeter @yankees wants to rent an apartment. Derek only in a Trump building Trump is lucky for you. _E_
Tweet me your questions to answer. #trumpvlog _E_
I will be on the Mike & Mike Show on radio and ESPN at about 6 to 7 A.M. We will be talking Super Bowl and sports no Obama Care! _E_
Joined the @HouseGOP Conference this morning at the U.S. Capitol. __HTTP__ #PassTheBill #MAGA... __HTTP__ _E_
Volunteer to be a Trump Election Observer. Sign up today!#MakeAmericaGreatAgain __HTTP__ _E_
.@AnnCoulter's new book 'In Trump We Trust comes out tomorrow. People are saying it's terrific knowing Ann I am sure it is! _E_
I like Rob Astorino. He's a friend and really good guy. Sadly he has ZERO chance of beating Cuomo and the 2 to 1 Dems for governor! _E_
I will miss @Letterman & doing his show. He was always intriguing & smart. You never knew what would happen but he was fair! _E_
Was going to do a phoner this morning with @jaketapper on @CNN but they could not get their phone equipment to hook in. Will do next week. _E_
Help those affected by #Sandy. @TrumpSoHo is giving $10 per booking made by 11/23 to @RedCross for #sandyrelief. __HTTP__ _E_
Use your intelligence and your education to execute what your imagination presents to you. This is one step to becoming an entrepreneur. _E_
.@washingtonpost thinks @IvankaTrump is What Washington's Social Scene Needs __HTTP__ Truth is she's amazing. _E_
Every penny of the $7 billion going to Africa as per Obama will be stolen corruption is rampant! _E_
Obama's Def. Sec. just said US Asia focus 'not aimed to contain China' __HTTP__ China is hoping that Obama is re elected. _E_
Via @politico: Donald Trump to get more CPAC time than Marco Rubio __HTTP__ @CPACnews knows how to prioritize! _E_
How can an Attorney General ask for campaign contributions during his evaluation of a case a total sleazebag! _E_
Via @freep: Trump to speak to GOP __HTTP__ _E_
Rep. Steve Scalise of Louisiana a true friend and patriot was badly injured but will fully recover. Our thoughts and prayers are with him. _E_
Great write up on @thedailymeal about our new Executive Sous Chef Sydney Jones @TrumpLasVegas: __HTTP__ _E_
Look how bad it is getting! How much more crime how many more shootings will it take for African Americans and Latinos to vote Trump=SAFE! _E_
The same people who did the phony election polls and were so wrong are now doing approval rating polls. They are rigged just like before. _E_
.@willweatherford @FLGovScott Gaming in Miami will be incredible—best in world and create lots of jobs and revenue. _E_
"Be flexibly focused. Focus does not mean being narrow minded or rigid." – Think Big _E_
It's going to take an outsider to clean up after Clinton Bush and Obama. Let's Make America Great Again! __HTTP__ _E_
Can you imagine if Obama had to give today's press conference before the election? He would have lost. @GOP really blew it. _E_
Everyone is saying the bad news is that Donald Trump is going to take credit & they are right—Mitt wouldn't have won anyway. _E_
#ICYMI: #Trump2016 closing speech inBuffalo New York!#VoteTrumpNY  __HTTP__ _E_
I want to #MakeAmericaGreatAgain __HTTP__ _E_
IPSOS/REUTERS POLLThank you! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
It is so imperative that we have the right justices. #DrainTheSwamp #Debates #BigLeagueTruth __HTTP__ _E_
When are we going to wake up and realize that we are funding our enemies? #TimeToGetTough _E_
.@kevinolearytv Great job on @foxandfriends this morning. You tell it like it is! Also thx for the nice mention. Your book sounds great! _E_
Just watched the totally biased and fake news reports of the so called Russia story on NBC and ABC. Such dishonesty! _E_
Entrepreneurs: It's often to your advantage to be underestimated. _E_
Great news on the 2018 budget @SenateMajLdr McConnell first step toward delivering MASSIVE tax cuts for the American people! #TaxReform __HTTP__ _E_
.@FrankLuntz knows nothing about me or my religion. Came to my office looking for work. I had NO interest. I will save the vets! _E_
Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet __HTTP__ _E_
CPAC 2013: Donald Trump: Immigration reform is a 'suicide mission' for GOP __HTTP__ by @SethMcLaughlin1 _E_
Just as I predicted immigration reform will increase the cost of ObamaCare over $300B __HTTP__ More money borrowed from China. _E_
If anybody else but Coore and Crenshaw designed Pinehurst they would be run out of town—(and the turtleback greens are totally unfair)! _E_
John Kasich was managing director of Lehman Brothers when it crashed bringing down the world and ruining people's lives. A total failure! _E_
The military and Navy Seals should be given more credit for Bin Laden's death not Obama who works hard to take (cont) __HTTP__ _E_
#MakeAmericaGreatAgain! __HTTP__ _E_
The 18th hole at the Blue Monster @Doral in Miami is considered the toughest finishing hole in golf... __HTTP__ _E_
Via @washingtonpost: The Donald's video should have trumped Eastwood by @CapehartJ __HTTP__ _E_
and fair elections. We've accepted the outcomes when we may not have liked them and that is what must be expected of anyone standing on a _E_
So since the people at the @nytimes have made all bad decisions over the last decade why do people care what they write. Incompetent! _E_
.@youngmman @realDonaldTrump Conrad Hilton was a great man but Barron Hilton is a dope. Wrong on Barron! _E_
Ted Cruz talks about the Constitution but doesn't say that if the Dems win the Presidency the new JUSTICES appointed will destroy us all! _E_
I believe America can be great only with proper leadership. _E_
"Chalk failure up to experience don't take it personally and go find your next challenge." – Trump: Never Give Up _E_
Aubrey has a lot of self confidence—but will it be warranted? #sweepstweet _E_
For entrepreneurs ignorance is not bliss. It's fatal. It's costly. And it's for losers. You either get organized or get crushed. _E_
The Arab Spring has turned into the Islamist Winter. Our ally @Israel is in a perilous position. We must stand behind @Israel. _E_
The Mullahs are laughing at what they think is a very stupid president@BarackObama has asked for Iran to return the drone #TimeToGetTough _E_
Negotiation is an art. Treat it like one. Be open to change it's another word for innovation. _E_
The new selection of ties shirts and suits at Macy's is amazing also available in Trump Tower lobby. _E_
Melania and I extend our warmest greetings to those observing Rosh Hashanah here in the United States in Israel and around the world. _E_
In order to save Medicare and stop record premium increases we must repeal ObamaCare. _E_
Getting ready to visit Walter Reed Medical Center with Melania. Looking forward to seeing our bravest and greatest Americans! _E_
The Budget Agreement today is so important for our great Military. It ends the dangerous sequester and gives Secretary Mattis what he needs to keep America Great. Republicans and Democrats must support our troops and support this Bill! _E_
Any deal on DACA that does not include STRONG border security and the desperately needed WALL is a total waste of time. March 5th is rapidly approaching and the Dems seem not to care about DACA. Make a deal! _E_
While under no obligation to do so I have raised between 5 & 6 million dollars including 1million dollars from me for our VETERANS. Nice! _E_
Senator Luther Strange who is doing a great job for the people of Alabama will be on @foxandfriends at 7:15. Tough on crime borders etc. _E_
Great new polls! Thank you Nevada North Carolina & Ohio. Join the MOVEMENT today & lets #MAGA!... __HTTP__ _E_
Watching @trishstratuscom get inducted from the sold out crowd. #WWEHOF. __HTTP__ _E_
I'll be on @foxandfriends on Monday at 7:30 AM. Always interesting. Tune in! _E_
RT @WhiteHouse: Today @POTUS will welcome the Prime Minister of India @narendramodi to the White House. __HTTP__ _E_
Very honored: Trump Is Tops As Clinton Drops In Connecticut Primaries Quinnipiac University Poll Finds __HTTP__ _E_
"We are fully supportive of @Israel's right to defend itself." @BarackObama Very good I like it. _E_
Congratulations to @TheSlyStallone and Arnold @Schwarzenegger on 'Expendables 2' #1 box office opening. Still going strong! _E_
Thank you Kentucky! #Trump2016#SuperSaturday _E_
What a statesman! @BarackObama made sure to quickly call the Muslim Brotherhood victor to congratulate him on (cont) __HTTP__ _E_
I wonder if traitor Edward Snowden will be attending the Miss Universe Pageant in Moscow on November 9th. _E_
So @ReutersPolitics claims that @MittRomney's birth certificate evokes 'controversy' __HTTP__ Where (cont) __HTTP__ _E_
The failing @nytimes writes false story after false story about me. They don't even call to verify the facts of a story. A Fake News Joke! _E_
I hope everyone or rather almost everyone had a GREAT EASTER! We need our leaders to make great and wise decisions in these troubled times _E_
While @BarackObama seeks to further destroy our credit our economy continues to hemorrhage jobs. Such a total failure as a President. _E_
Big day planned on NATIONAL SECURITY tomorrow. Among many other things we will build the wall! _E_
Success seems to be connected with action. Successful people keep moving. They make mistakes but they don't quit. Conrad Hilton _E_
Crooked Hillary will NEVER be able to solve the problems of poverty education and safety within the African American & Hispanic communities _E_
"@HoganSeaisle129: @realDonaldTrump who who who ... Say it just say it #CelebApprentice" Watch and see what happens! _E_
Thank you to Carmen Yulin Cruz the Mayor of San Juan for your kind words on FEMA etc.We are working hard. Much food and water there/on way _E_
#FoxNews Poll THANK YOU!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Terrible tragedy at the Empire State Building today. Must have fast trials and death penalty for the animals. _E_
Happy 30th Birthday #Ghostbusters! It was great to have @TrumpTowerNY be a part of the series. __HTTP__ _E_
If you're passionate about your work you will never give up. _E_
Our online campaign store is open! Visit __HTTP__ for #MakeAmericaGreatAgain merchandise including my signature hat! _E_
With today's struggling job numbers it is clear that there is one choice this November. @MittRomney can turn the economy around. _E_
Seth Meyers was terrible co hosting with Kelly. Marbles in his mouth & he must stop picking at his hands insult to the great Regis Philbin! _E_
Drop A Rod in the order and cut his salary based on unreported drug use. Also not a pressure player. _E_
.@limbaugh is right. Watergate is much different than Benghazi. No one died in Watergate. _E_
"If the freedom of speech is taken away then dumb and silent we may be led like sheep to the slaughter." George Washington _E_
The Ted Cruz wiseguy apology to the people of New York is a disgrace. Remember his wife's employer and his lender is located there! _E_
We are using the absolute wrong negotiating technique with respect to the Iran nuclear talks. Strengthen sanctions until GREAT deal is made! _E_
"Talent wins games but teamwork and intelligence wins championships." – Michael Jordan @Jumpman23 _E_
At least 7 dead and 48 wounded in terror attack and Mayor of London says there is no reason to be alarmed! _E_
Congratulations to @joniernst on delivering a strong conservative message in her #SOTU response. Joni will be a great senator. _E_
Amazing... @VattenfallGroup tried to destroy Aberdeen. _E_
Entrepreneurs: Winners see problems as just another way to prove themselves. Remember to focus on the solution not the problem. _E_
Great meeting with the @RepublicanStudy Committee this morning at the @WhiteHouse! __HTTP__ _E_
Lyin' Ted Cruz can't get votes (I am millions ahead of him) so he has to get his delegates from the Republican bosses. It won't work! _E_
The United States needs the security of the Wall on the Southern Border which must be part of any DACA approval. The safety and security of our country is #1! __HTTP__ _E_
John Lewis said about my inauguration It will be the first one that I've missed. WRONG (or lie)! He boycotted Bush 43 also because he... _E_
For those that constantly say that "global warming" is now "climate change"—they changed the name. The name global warming wasn't working _E_
When I broadly proclaimed Mitt "choked" – and would do it again—everybody said yeah he's right. _E_
'True blue collar billionaire Donald Trump shows Hillary Clinton is out of touch' __HTTP__ _E_
Remember what I previously said Obama will someday attack Iran in order to show how tough he is. _E_
The failing @nytimes talks about anonymous sources and meetings that never happened. Their reporting is fiction. The media protects Hillary! _E_
I will be interviewed by @TheBrodyFile on @CBNNews tonight at 11pm. Enjoy! _E_
I have been a guest on The View many times when it was successful show. Now the show is dying for lack of ratings. Too bad! _E_
I'm not going to be watching much NFL football anymore. Too time consuming too boring too many flags and too soft. Focus on other things! _E_
I just had to fire someone he didn't have a clue he reminded me of Obama on Wednesday night. _E_
Put the glamour beauty & mystery back in the Oscars and the ratings will zoom. Also & most importantly the Oscars need credibility. _E_
New York Yankees President Randy Levine: 'End of the Republican Party' If Donald Trump Not Nominated. __HTTP__ _E_
Thank you for your support last night Iowa! #VoteTrump #Trump2016 #IACaucus #FITN #IAPolitics __HTTP__ _E_
#TrumpVine Where is the money @MacMiller? __HTTP__ _E_
Prior to the end of the year I will be traveling to Israel. I am very much looking forward to it. _E_
Kirsten Powers: Anti Trump Operative was Aggressively Shopping Cruz Story via the Gateway Pundit: __HTTP__ _E_
Wow just in John Beale the top person in government on climate change (EPA) is a total fraud and just admitted it! What can they say now _E_
Jeb Bush just talked about my border proposal to build a fence. It's not a fence Jeb it's a WALL and there's a BIG difference! _E_
It appears that @THEGaryBusey is entranced with @MELANIATRUMP and rightly so! #CelebApprentice _E_
On June 1st. near Washington D.C. I will be opening the greatest championship golf course in the U.S. All holes front on the Potomac River _E_
#ICYMI: John Podesta's Brother Pocketed $180000 from Putin's Uranium Company: __HTTP__ __HTTP__ _E_
The dying NY Daily News put out a false report about my kids not wanting me to criticize Obama...totally false! _E_
.@NikWallenda #Skywire As much credit as he's been given he wasn't given enough credit for his incredible feat over Grand Canyon. _E_
Our next President must stop China's Rip off of America. _E_
.@NBCNews is bad but Saturday Night Live is the worst of NBC. Not funny cast is terrible always a complete hit job. Really bad television! _E_
New The Next Generation videos @donaldjtrumpjr __HTTP__ @ivankatrump __HTTP__ @erictrump __HTTP__ _E_
Via @DMRegister by @newsmanone: "'Moon cycle' can't defeat @ShawnJohnson on @ApprenticeNBC" __HTTP__ _E_
Happy Anniversary to my wonderful wife @MELANIATRUMP a truly great decision by me! __HTTP__ _E_
Via @globegazette by @GGMaryP: "Trump: We'll bring American dream back" __HTTP__ _E_
RT @realDonaldTrump: #USA #Japan __HTTP__ _E_
Just returned to Bedminster NJ from Camp David. GREAT meeting on National Security the Border and the Military! #MAGA __HTTP__ _E_
Get rich quick! Crooked Hillary Clinton's pay to play guide: __HTTP__ _E_
Only a fool would believe that the meeting between Bill Clinton and the U.S.A.G. was not arranged or that Crooked Hillary did not know. _E_
A political commentator for @cnn which I no longer watch said Trump showed some weakness in the Repub Primaries. I set all time record! _E_
Loved Dallas and the tremendous crowd last night. Will be back! _E_
Via @CNNMoney: Donald Trump gets into crowdfunding __HTTP__ #FundAnything _E_
.@megynkelly Will be on Fox now. Watch and enjoy! _E_
Do you think @THEGaryBusey will be able to "step up" as PM? I know @lisarinna is hoping so. #CelebApprentice _E_
No surprise Assad is not destroying his chemical weapons. He never intended to in the first place. _E_
....And it will get even better with Tax Cuts! __HTTP__ _E_
TRUMP'S BIG ANNOUNCEMENT: HE'LL GIVE $5 MILLION TO CHARITY OF OBAMA'S CHOICE IF... __HTTP__ By @billyhallowell @theblaze _E_
I will be interviewed by @jessebwatters on @oreillyfactor tonight at 8pm. Enjoy! _E_
Putin is not feeling too nervous or scared. #DemDebate _E_
The media has not reported that the National Debt in my first month went down by $12 billion vs a $200 billion increase in Obama first mo. _E_
When I left Conference Room for short meetings with Japan and other countries I asked Ivanka to hold seat. Very standard. Angela M agrees! _E_
Will be joining @GovMikeHuckabee tonight at 8pmE on @TBN. Enjoy! __HTTP__ _E_
Today's groundbreaking at the Old Post Office Building in D.C. was amazing. Great people great dedication. @usgsa __HTTP__ _E_
As I have said the Tea Party is alive and well and fighting hard for the USA. BIG WIN TODAY! _E_
I will be doing the @TODAYshow with my wife Melania and the rest of my family in a major Town Hall. Hopefully it will be fun! Enjoy.7A.M. _E_
Thank you. __HTTP__ _E_
Will be on Bill O'Reilly @oreillyfactor tonight at 8 PM. Enjoy! _E_
I hope that Derek Jeter has such a fantastic year with @Yankees that he changes his mind about retiring. Great guy! _E_
With fellow inductees in front of the sold out crowd at MSG. #WWEHOF __HTTP__ _E_
RT @RealEagleBites: @realDonaldTrump It is the height of hypocrisy. Obama and Clinton in effect gave nuclear weapons to North Korea by thei... _E_
Entrepreneurs: Be prepared and be tough. Cover your bases! There are a lot of ups and downs but you can ride them out if you're prepared. _E_
... and many others. Drop to your knees Sugar and say thank you Mr. Trump. _E_
.@BretBaier I will be interviewed by Bret (on Fox) tonight at 6:00. Watch it will be good! _E_
.@BreitbartNews is much smarter than sleepy eyes @chucktodd @nbc __HTTP__ Thanks to Steve Bannon & real reporters. _E_
I wonder what the great generals like Patton the big M or Robert E. LEE would have thought about our stupid broadcasting of an attack? _E_
Jodi Arias jury is having a hard time with the death penalty judge just sent them back for further deliberatuon. _E_
The unbiased reporters and attendees said mine was the best and most well received speech at CPAC THANK YOU! _E_
Congratulations to Mitt Romney. He was not only good he was absolutely fantastic tonight! _E_
Government needs to stop pick pocketing your wallet. Every time it does it slows growth and kills jobs. It's (cont) __HTTP__ _E_
Our Southern border is unsecure. I am the only one that can fix it nobody else has the guts to even talk about it. __HTTP__ _E_
He @FLGovScott handled the Zimmerman matter very well. I am glad to see there will be a trial. Justice. Now let's wait for a fair trial. _E_
I do not understand how so many of my Jewish friends backed Obama in the last election. He is a TOTAL DISASTER FOR ISRAEL AND ALWAYS WILL BE _E_
True! __HTTP__ _E_
I had a great day campaigning in Connecticut. Looking for a big vote on Tuesday! _E_
Wrong! Under @BarackObama's watch @Israel is not being invited to NATO summit in Chicago this month __HTTP__ _E_
My @FoxNews interview with @gretawire discussing the 2012 GOP primary and ObamaCare's attack on the Catholic Church. __HTTP__ _E_
Minorities Line Up Behind Donald Trump #Trump2016 __HTTP__ _E_
Does everyone see that the Democrats and President Obama are now because of me starting to deport people who are here illegally. Politics! _E_
Illegal immigrant children non Mexicans surge across border at record rate __HTTP__ _E_
The stimulus is a net negative effect on the growth of GDP over 10 years as admitted by @BarackObama's own CBO __HTTP__ _E_
Via @si_golf: "Donald Trump Rory McIlroy and Matt Kuchar are guys to watch at @DoralResort" __HTTP__ @CadillacChamp _E_
Don't find fault. Find a remedy. Henry Ford _E_
What a great night. Thank you South Carolina a special place with truly amazing people! LOVE _E_
Will be back soon Virginia. We are going to MAKE AMERICA GREAT AGAIN! #TrumpPence16 __HTTP__ _E_
#IACaucus #CaucusForTrump#iCaucused #iVoted __HTTP__ _E_
There is nothing @TrumpSoHo did not think about for the holidays @RobbReport dives in: __HTTP__ _E_
Why does@ Bill O'Reilly keep putting Karl Rove on his show a total waste of time. Rove spent $400 000 000 and didn't win a race pathetic! _E_
RT @CPACnews: ACU Announces @realDonaldTrump will be a featured speaker at #CPAC2013! Get tickets today at __HTTP__ _E_
I will be interviewed by @LouDobbs tonight on @FoxBusiness 7pm ET _E_
.@JebBush is a sad case. A total embarrassment to both himself and his family he just announced he will continue to spend on Trump hit ads! _E_
Can @pennjillette @lisarinna and @THEGaryBusey continue to co exist? Find out on this Sunday's Celebrity All Star @ApprenticeNBC. _E_
ObamaCare is an absolute disaster which will destroy 16% of the economy and ultimately more! _E_
It is actually hard to believe how naive (or dumb) the Failing @nytimes is when it comes to foreign policy...weak and ineffective! _E_
Wow @CNN Town Hall questions were given to Crooked Hillary Clinton in advance of big debates against Bernie Sanders. Hillary & CNN FRAUD! _E_
#NeverForget __HTTP__ _E_
I look forward to my press conference @TrumpTurnberry Scotland this Wednesday lots of great people attending. _E_
Happy Easter to all have a great day! _E_
People ask me every day to pose for pictures but the camera never works the first time they are never prepared or maybe just very nervous! _E_
Don't worry THE UNITED STATES WILL BE GREAT AGAIN! _E_
Fines and penalties against Wells Fargo Bank for their bad acts against their customers and others will not be dropped as has incorrectly been reported but will be pursued and if anything substantially increased. I will cut Regs but make penalties severe when caught cheating! _E_
Congratulations to the 7 @TrumpCollection properties who made @USNewsTravel's Best Hotels List: __HTTP__ _E_
Life always presents new opportunities you would never expect. I hosted @WrestleMania & then I starred in one which sold most PPVs. _E_
Watch @BarackObama admit Obamacare is a TAX __HTTP__ The GOP must continue to Disrupt Dismantle & Repeal! _E_
The U.S. under my administration is completely rebuilding its military and they're spending hundreds of billions of dollars to the newest and finest military equipment anywhere in the world being built right now. I want peace through strength! __HTTP__ _E_
Romney Ryan Slam Obama Administration on China Currency Manipulation __HTTP__ via @ABC _E_
Just learned that Jon @Ossoff who is running for Congress in Georgia doesn't even live in the district. Republicans get out and vote! _E_
I settled the Trump University lawsuit for a small fraction of the potential award because as President I have to focus on our country. _E_
Are Republicans suicidal? Now they want to push amnesty through Congress. Allowing Democrats into the country. _E_
President Obama campaigned hard (and personally) in the very important swing states and lost.The voters wanted to MAKE AMERICA GREAT AGAIN! _E_
Republicans should just REPEAL failing ObamaCare now & work on a new Healthcare Plan that will start from a clean slate. Dems will join in! _E_
If Jon Stewart is so above it all & legit why did he change his name from Jonathan Leibowitz? He should be proud of his heritage! _E_
I'll be on The Willis Report @GerriWillisFBN today at 5 pm EST _E_
Via @fitsnews by @TaylahhKane: Donald Trump's Refreshing Lack Of A Filter __HTTP__ _E_
Remember Celebrity Apprentice tonight on CNBC at 9. Amazing episode watch Omarosa get the boot! _E_
Entrepreneurs: Set your mind on winning and losing won't have a chance. See yourself as victorious! _E_
It's about time Italy recognized the innocence of @AmandaKnox great news! _E_
Journalists shower Hillary Clinton with campaign cash __HTTP__ __HTTP__ _E_
Under @BarackObama 5 major banks now control 56% of economy from 43% in 2007 __HTTP__ Another catastrophe is brewing. _E_
Just watched recap of #CrookedHillary's speech. Very short and lies. She is the only one fear mongering! _E_
WOW! I just heard that the previously unknown singer Mac Miller has received over 67 million hits on his song Donald Trump. _E_
Via @washingtonpost by @OConnellPostbiz:"Bidding to stay at Trump's hotel for '17 inauguration?Pick the next POTUS. __HTTP__ _E_
Thank you @RepLouBarletta! __HTTP__ __HTTP__ _E_
This is the definition of ransom ⬇ __HTTP__ _E_
My representatives had a great meeting w/ the Hispanic Chamber of Commerce at the WH today. Look forward to tremendous growth & future mtgs! _E_
Video of my day at The Old Post Office soon to be the most fabulous hotel! __HTTP__ _E_
"When you're at a meeting monitor your behavior and work at being an observer – of yourself and of others." – Think Like a Billionaire _E_
Some great quotes from the legendary and courageous Winston Churchill: Never never never give up. ... _E_
#SuperBowl Vote for me and @CENTURY21 __HTTP__ _E_
Dangerous The USC ObamaCare ruling means the government can now tax you for inactivity. _E_
Sometimes by losing a battle you find a new way to win the war. Don't ever get down on yourself just keep fighting in the end you WIN! _E_
Yesterday was @BarackObama's favorite day of the year he collects our taxes to redistribute. _E_
Going to North Carolina to make keynote speech sold out crowd! _E_
Wow I hear @Morning_Joe has gone really hostile ever since I said I won't do or watch the show anymore.They misrepresent my positions! _E_
While all agree the U. S. President has the complete power to pardon why think of that when only crime so far is LEAKS against us.FAKE NEWS _E_
Phylis Schlafly: 'Marco Rubio Betrayed Us All' __HTTP__ _E_
Getting rdy to leave for France @ the invitation of President Macron to celebrate & honor Bastille Day and 100yrs since U.S. entry into WWI. _E_
Six hours left to #VoteTrump Connecticut! __HTTP__ _E_
Patience is the greatest of all virtues. Cato _E_
One of the best moves I ever made was staying out of last decade's artificial real estate boom. But I used the downturn to my advantage. _E_
True! __HTTP__ _E_
I'll be interviewed by Greta Van Susteren @Gretawire tonight at 10 pm ET on Fox. _E_
Must read @AmSpec article by Jeffrey Lord: "The Ruling Class Liberty Medal" __HTTP__ _E_
Entrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_
Reuters just announced that Secret Service never spoke to me or my campaign. Made up story by @CNN is a hoax. Totally dishonest. _E_
Donald E. Ballard on behalf of the people of the United States THANK YOU for your courageous service. YOU INSPIRE US ALL! #ALConv2017 __HTTP__ _E_
Thanks. __HTTP__ _E_
Now the Chinese are hacking @nytimes __HTTP__ & Twitter __HTTP__ When will we hold these thieves accountable? _E_
.@Richard_Meier a highly overrated architect has had many problems with buildings he has designed. _E_
With great patriots in Mason City who also want to bring the American Dream back! We can Make America Great Again! __HTTP__ _E_
Obama talks about what he is going to do why the hell didn't he just do it especially in the first 2 years when he had all votes necessary _E_
Thank you Las Vegas Review Journal! EDITORIAL: 'Donald Trump for president' __HTTP__ via @reviewjournal _E_
Pat Buchanan gave a fantastic interview this morning on @CNN way to go Pat way ahead of your time! _E_
Wow interview released by Wikileakes shows quid pro quo in Crooked Hillary e mail probe.Such a dishonest person & Paul Ryan does zilch! _E_
.@BarbaraJWalters called my office to ask me to do election night coverage with her sadly I won't be able to do it. _E_
The US is always getting ripped off! China gets cheap oil from Iran and Iraq as US pays for Hormuz Patrols to (cont) __HTTP__ _E_
Why is @MittRomney the only guy who talks about getting tough with China and their currency manipulation? _E_
Entrepreneurs: Keep your focus and keep your momentum. Believe in yourself if you don't no one else will either. _E_
A spectacular lake front club w/ dramatic course designed by @SharkGregNorman @Trump_Charlotte is NC's top club __HTTP__ _E_
Thank you to @AmSpec & Jeffrey Lord for the lovely article "Governor Trump? The conservative Nelson Rockefeller." __HTTP__ _E_
Congratulations to Martin Kaymer for winning the 2014 #USOpen. #USGA Great playing from beginning to end! _E_
My @CNN interview with @wolfblitzercnn yesterday discussing my meeting with @MittRomney __HTTP__ _E_
"As you go through life you've got to see the valleys as well as the peaks." – Neil Young _E_
The SECRET meeting between Bill Clinton and the U.S.A.G. in back of closed plane was heightened with FBI shouting go away no pictures. _E_
Marco Rubio is a member of the Gang Of Eight or very weak on stopping illegal immigration. Only changed when poll numbers crashed. _E_
This is a time for big ideas. This is a time for real reform for a real recovery. @PaulRyanVP _E_
Just received a standing ovation at #NCGOPCon when I said We need to bring the American Dream back better and stronger than ever before! _E_
The Trump Organization continues to expand internationally at a record pace. Many new announcements to come soon. _E_
Thank you Louisville Kentucky. Together we will MAKE AMERICA SAFE AND GREAT AGAIN! __HTTP__ _E_
One of the big problems facing Atlantic City are the ridiculously high real estate taxes which I fought for years before leaving.Corruption! _E_
RT @benshapiro: Pope on Trump: A person who thinks only about building walls...is not Christian. This is Vatican City. __HTTP__ _E_
.@Mayor_Nutter of Philadelphia who is doing a terrible job should be ashamed for using such a disgusting word in referring to me.Low life! _E_
I am supportive of Lamar as a person & also of the process but I can never support bailing out ins co's who have made a fortune w/ O'Care. _E_
The only Forbes 5 Star & 5 Diamond hotel with a 5 Star & 5 Diamond restaurant @TrumpNewYork offers elite luxury __HTTP__ _E_
Age wrinkles the body. Quitting wrinkles the soul. General Douglas MacArthur _E_
Stopped by @TrumpDC to thank all of the tremendous men & women for their hard work! __HTTP__ _E_
Hillary just said that she will not use the term radical Islamic but was incapable of saying why. She is afraid of Obama & the e mails! _E_
Obama said he never met his uncle Oscar who was arrested for whatever. Turns out he lived with his uncle in Boston. SO MANY LIES! _E_
Watch a powerful and frank interview with Donald Trump about the economy on Greta Van Susteren's On The Record: __HTTP__ _E_
America is proud to stand shoulder to shoulder w/a free & ind UK. We stand together as friends as allies & as a people w/a shared history. _E_
The invention of email has proven to be a very bad thing for Crooked Hillary in that it has proven her to be both incompetent and a liar! _E_
At the debate the President kept talking of what he is going to do. I kept saying why didn't he do it? He lost me a long time ago. _E_
I didn't start the fight with Lyin'Ted Cruz over the GQ cover pic of Melania he did. He knew the PAC was putting it out hence Lyin' Ted! _E_
Pres. O a bump in the road in reference to our Ambassador's (and others) killing in Libya _E_
'Trump lays out policies for first 100 days in White House' __HTTP__ _E_
Today both @BarackObama and @MittRomney are giving speeches on their economic policies in Ohio. The choice is (cont) __HTTP__ _E_
...owed to Wall Street and the banks which sadly must be dealt with. Food water and medical are top priorities and doing well. #FEMA _E_
Why did the failing @nytimes refuse to use any of the names given to them that I was so proud to have helped with their careers. DISHONEST! _E_
I love @LibertyUniversity such great people! _E_
...(enthusiastic dynamic and fun) and the American Legion V.A. (respectful and strong).To bad the Dems have no one who can change tones! _E_
'U.S. Consumer Comfort Just Reached Its Highest Level in a Decade' __HTTP__ __HTTP__ _E_
Trump Nat'l Jupiter's @jacknicklaus designed course is a challenging & innovative 7531 yds w/special features __HTTP__ _E_
Does anybody think that @CNBC will get their fictitious polling numbers corrected sometime prior to the start of the debate. Sad! _E_
Happy to have @ralphreed and the FFC's endorsement of the Newsmax @iontv debate. FFC is a great organization. _E_
If the Senate Democrats ever got the chance they would switch to a 51 majority vote in first minute. They are laughing at R's. MAKE CHANGE! _E_
I will bring jobs back and get wages up. People haven't had a real wage increase in almost twenty years. Clinton killed jobs! _E_
Will be interviewed by @StephenAtHome tonight by phone a late show first @CBS @colbertlateshow. Enjoy! #Colbert #LSSC _E_
How dumb is our president to send thousands of poorly trained and ill equipped soldiers over to West Africa to fight Ebola. Stop all flights _E_
Thank you Attorney General Gonzales so many people feel this way. __HTTP__ _E_
While I'm beating my opponents in the polls I'm also beating lobbyists special interests & donors that are supporting them with billions. _E_
I told you whenever I go to a @Yankees game the @Yankees win. _E_
Must read article: "Conservative Fury at Rove Erupts" __HTTP__ By Jeffrey Lord @AmSpec _E_
Gov. Gary Johnson pulling votes from @MittRomney Don't waste your vote. Obama must go! _E_
"We have a system that increasingly taxes work and subsidizes nonwork." Milton Friedman _E_
Via @BreitbartNews: "EXCLUSIVE TRUMP COUNSEL 'CANNOT CONFIRM OR DENY' INTEREST IN BUYING NEW YORK TIMES" __HTTP__ _E_
62 years ago this week a brave seamstress in Montgomery Alabama uttered one word that changed history... __HTTP__ _E_
My friend Ronald Kessler explains in @washingtonpost that Secret Service problems are much bigger than prostitutes __HTTP__ _E_
Big poll comes out today on Face The Nation at 10:30 on @CBSNews. _E_
26000 sexual assaults in the military last year way up from previous years. Armed Forces are in total turmoil! _E_
Just landed in the Philippines after a great day of meetings and events in Hanoi Vietnam! _E_
Keep up the GREAT work. I am with you 100%! ISIS is losing its grip... Army Colonel Ryan Dillon CJTF–OIR __HTTP__ __HTTP__ _E_
President Obama I have an idea! Pretend that West Africa is Israel and then you will be able to stop the Ebola area flights. _E_
Don't tread water. Get out there and go for it. There's nothing wrong with bringing your talents to the surface. _E_
My @KWWL int. from @WartburgCollege discussing how politicians have failed us & Making America Great Again! __HTTP__ _E_
'Hillary Clinton Had Gun Control Supporters Planted In Town Hall Audience' __HTTP__ _E_
Bernie Sanders gave Hillary the Dem nomination when he gave up on the e mails. That issue has only gotten bigger! _E_
With Spitzer & Anthony Weiner running for office New York is pervert central! Pathetic _E_
Join me tomorrow in Sanford or Tallahassee Florida! Sanford at 3pm: __HTTP__ at 6pm: __HTTP__ _E_
Join me tonight in Fayetteville North Carolina at 7pm! #ThankYouTour2016 Tickets: __HTTP__ __HTTP__ _E_
My warmest condolences and sympathies to the victims and families of the terrible Las Vegas shooting. God bless you! _E_
Ed Gillespie will be a great Governor of Virginia. His opponent doesn't even show up to meetings/work and will be VERY weak on crime! _E_
My @foxandfriends interview discussing my meeting with @newtgingrich the Newsmax @iontv debate and #TimeToGetTough __HTTP__ _E_
Unemployment is down to 4.1% lowest in 17 years. 1.5 million new jobs created since I took office. Highest stock Market ever up $5.4 trill _E_
Remember when you hear the words sources say from the Fake Media often times those sources are made up and do not exist. _E_
Also tomorrow night I will be going to Boone and Ames. Really look forward to seeing all of my friends in Iowa. _E_
RT @EricTrump: .@LaraLeaTrump and I look forward to being on @JudgeJeanine tonight at 9pm! @FoxNews #MakeAmericaGreatAgain __HTTP__ _E_
"Actions are the seed of fate deeds grow into destiny." – Harry S. Truman _E_
A message for Hollywood __HTTP__ _E_
A friend is one who has the same enemies as you have. Pres. Abraham Lincoln _E_
On Taxes: "This is the biggest corporate rate cut ever going back to the corporate income tax rate of roughly 80 years ago.This is a huge pro growth stimulus for the economy. Every year the Obama WH overstated how the economy would grow. Now real economics and jobs." @WSJ Report _E_
Another radical Islamic attack this time in Pakistan targeting Christian women & children. At least 67 dead 400 injured. I alone can solve _E_
Via @SteveKingIA's Steve King for Congress Facebook Page: "Donald Trump has a special announcement!" __HTTP__ _E_
If you are interested in balancing work and pleasure you will never succeed! _E_
Why would college graduates want Crooked Hillary as their President? She will destroy them! __HTTP__ _E_
Brits spent $57.8M on the royal family. Obamas cost us $1.4B in expenses including entertainment __HTTP__ Living large on us. _E_
Really looking forward to watching The Masters this weekend one of THE GREATEST SHOWS ON EARTH! _E_
Crooked Hillary refuses to say that she will be raising taxes beyond belief! She will be a disaster for jobs and the economy! _E_
I was viciously attacked by Mr. Khan at the Democratic Convention. Am I not allowed to respond? Hillary voted for the Iraq war not me! _E_
Baltimore had a really tough night only great leadership can solve the many inner city problems facing our country. Jobs jobs jobs! _E_
Crooked Hillary Clinton overregulates overtaxes and doesn't care about jobs. Most importantly she suffers from plain old bad judgement! _E_
It's Friday. How many millions has the White House wasted on the ObamaCare website today? _E_
Watch me on Late Night with Jimmy Fallon tomorrow night at 12:35 a.m. on NBC I'll be making a big announcement! _E_
Via @USATODAY Amateur hour with the Iran nuclear deal __HTTP__ _E_
#WeeklyAddress __HTTP__ _E_
I'm very proud of my daughter Ivanka. Great interview. __HTTP__ _E_
Isis terror group has now fully taken over large sections of Iraq and will soon have control of massive oil reserves. I told you so. _E_
Don't underestimate yourself or your possibilities keep your focus intact and focus on the positives. _E_
Not looking good for our great Military or Safety & Security on the very dangerous Southern Border. Dems want a Shutdown in order to help diminish the great success of the Tax Cuts and what they are doing for our booming economy. _E_
We are being embarrassed by Russia and China on Snowden (and much more) yet Obama is talking about global warming on Tuesday. _E_
Global warming is based on faulty science and manipulated data which is proven by the emails that were leaked __HTTP__ _E_
What I am saying is that we never should have been in Iraq in the first place. Bush was terrible Obama is worse! Make America GREAT again. _E_
Visited some very beautiful golf courses this weekend...this is one... __HTTP__ _E_
Steve Bannon will be a tough and smart new voice at @BreitbartNews...maybe even better than ever before. Fake News needs the competition! _E_
"All Star Celebrity Apprentice" ranked #1 for the 10 o'clock hour among ABC CBS and NBC with a season high 19% margin. _E_
Thank you @kayleighmcenany for your nice words great knowledge and style! We are doing really well in South Carolina. @CNN @donlemon _E_
Just got to the #USWomensOpen in Bedminster New Jersey. People are really happy with record high stock market up over 17% since election! _E_
Made in America? @BarackObama argues that his long form birth certificate is irrelevant in court. __HTTP__ _E_
Even if you're on the right track you'll get run over if you just sit there. Will Rogers _E_
#CelebrityApprentice Who will win? __HTTP__ Find out tonight live Season Finale at 9PM ET on NBC. _E_
Congratulations to @TrumpCollection's @TrumpPanama for receiving the Certificate of Excellence & Top 10 Hotels in Panama from @TripAdvisor! _E_
Gas prices are the lowest in the U.S. in over ten years! I would like to see them go even lower. _E_
Via @kcautv: "Donald Trump Coming to Sioux City in May" __HTTP__ _E_
Remarks from the Roosevelt Room with @SenateMajLdr Mitch McConnell @SpeakerRyan and Secretary of Defense General James Mattis. __HTTP__ _E_
Looking forward to being with @SenTedCruz at our big rally in D.C. on Wednesday (1:00 P.M. at the Capitol) to protest insane Iran nuke deal! _E_
"The worst thing you can possibly do in a deal is seem desperate to make it." – The Art of The Deal. _E_
I'm sick of always reading about outsourcing. Why aren't we talking about onshoring ? We need to bring manufa... (cont) __HTTP__ _E_
#Trump2016 #IACaucus Finder: __HTTP__ __HTTP__ _E_
WH refused a meeting with the Israeli Defense Minister. If only Obama hated Iran as much as he dislikes Israel. _E_
The only American who has met with the North Korean man child is Dennis Rodman. Isn't that frightening and sad? _E_
#ObamacareFail __HTTP__ _E_
RT @TeamTrump: A @realDonaldTrump Administration will bring JOBS BACK! #Debates2016 __HTTP__ _E_
Someone just wrote that "you predicted every single major event that's now happening—and they knock you instead of giving you credit." _E_
The only one to fix the infrastructure of our country is me roads airports bridges. I know how to build pols only know how to talk! _E_
Russia must be laughing up their sleeves watching as the U.S. tears itself apart over a Democrat EXCUSE for losing the election. _E_
Representative Devin Nunes a man of tremendous courage and grit may someday be recognized as a Great American Hero for what he has exposed and what he has had to endure! _E_
.@BarackObama sent over 100000 jobs and Canadian oil to China all because he would not approve Keystone XL. _E_
Interesting reading re September 11th __HTTP__ _E_
Read Ivanka's blog about last night's Apprentice on Entertainment Weekly ... __HTTP__ _E_
This is the summer of box office bombs. Who is green lighting this garbage? The scripts are terrible. _E_
Thank you Greta. __HTTP__ _E_
Trump shows complete domination of Facebook conversation __HTTP__ _E_
Free enterprise is still the greatest force for upward mobility economic security and the expansion of the middle class. @MittRomney _E_
ObamaCare Tragedy Primed to Further Explode the Deficit __HTTP__ And @Obama transferred $500 billion from Medicare to fund it! _E_
A great honor to welcome President Juan Manuel Santos of Colombia to the White House today! Joint Press Conf... __HTTP__ _E_
My sons Don and Eric are in Ireland looking at my new club. It will be phenomenal! @LodgeatDoonbeg _E_
Together we are MAKING AMERICA GREAT AGAIN! __HTTP__ _E_
Dennis—Thank you for being honest. Somebody put words in your mouth & you wouldn't take it. Great! @dennisrodman _E_
.@JebBush today said he didn't want to be the front runner he would rather be where he is now 2%. That is the talk of a loser can't win! _E_
On @FoxNews at 7:00 P.M. Special: Meet the Trumps Hope you enjoy! _E_
Tony Romo just made a great play Giants are getting killed! _E_
Iraq was one of our biggest mistakes. We got absolutely nothing for our sacrifices.The country will collapse (cont) __HTTP__ _E_
Hillary Clinton just had her 47% moment. What a terrible thing she said about so many great Americans! _E_
Via @usweekly: "Donald Trump Sounds Off on Joan Rivers' Death: 'I Think The Doctors Made a Terrible Mistake'" __HTTP__ _E_
.@davidaxelrod use Buffet Icahn Sam Zell Leon Black Kravis Caesars and many more when talking about using the bankruptcy laws not me! _E_
Via @WSOC_TV: "Blair Miller talks with Donald Trump about Charlotte ventures" __HTTP__ _E_
Derek must move back into one of my buildings immediately. It will be lucky for him like in past. _E_
I'm in Scotland getting ready for a major news conference on the Great Dunes of Scotland announcing the second North Sea course amazing! _E_
Get ready for @Oreillyfactor tonight at 8 always interesting! _E_
Amazing. People are sending letters of support for @TrumpChicago's sign to my other properties including even @TrumpScotland. Thank you! _E_
Get your ballots in Colorado I will see you soon and we will win! #MakeAmericaGreatAgain __HTTP__ _E_
After 200 days rarely has any Administration achieved what we have achieved..not even close! Don't believe the Fake News Suppression Polls! _E_
Any deal completed before the fiscal curb must have tangible cuts on expenditures in baseline spending so we can get our credit back. _E_
China has so much of our debt that they can't put us in default w/o killing themselves US needs our toughest negotiator and fast! _E_
Melania and I will be appearing on The View tomorrow at 11 a.m. on CBS. Tune in for some great fun! _E_
RT @IvankaTrump: Very proud of Arabella and Joseph for their performance in honor of President Xi Jinping and Madame Peng Liyuan's official... _E_
Even with lower profit projections American firms are still throwing money into China __HTTP__ Obama is killing investment. _E_
Great meeting w/ coal miners & leaders from the Virginia coal industry thank you! #MAGA __HTTP__ __HTTP__ _E_
Look when it comes to China America better stop messing around. China sees us as a naive gullible foolish (cont) __HTTP__ _E_
With long gas lines & total disarray from storm the hurricane may yet be a negative for Obama. _E_
As a favor to my friends at EXTRA I am co hosting tonight at 7 p.m. on @nbc _E_
The American dream is back. We're going to create an environment for small business like we haven't had in many ma... __HTTP__ _E_
Thanks to @SteveKingIA for the kind introduction at the IA Freedom Summit & congrats to @David_Bossie & @Citizens_United on a great success! _E_
Be sure to watch "The History of WrestleMania" on @netflix. My interview explains how I supported the event early on. I'm proud of it. _E_
The failing @NYDailyNews which just raised its prices because it's dying said I wear a "wig" when they know I don't. Dishonest. _E_
Even though I have the legal right to use Steven Tyler's song he asked me not to. Have better one to take its place! _E_
Thx Mark I appreciate your words about the school. You sound like you're doing well happy for you. @businessinsider __HTTP__ _E_
The Establishment and special interests are absolutely killing our country. Stop them now: __HTTP__ _E_
Thank you for joining me this afternoon New Hampshire! Will be back soon. #FollowTheMoneySpeech transcript:... __HTTP__ _E_
Thank you New Orleans Louisiana! #MakeAmericaGreatAgain #VoteTrump __HTTP__ __HTTP__ _E_
Weak and low energy @JebBush whose campaign is a disaster is now doing ads against me where he tries to look like a tough guy. _E_
If you like having the world collapse and being told America is leading from behind vote Obama. _E_
I will be on @foxandfriends at 7:00 A.M. So much to talk about but not much good news for the U.S.A. MAKE AMERICA GREAT AGAIN! _E_
I told Rex Tillerson our wonderful Secretary of State that he is wasting his time trying to negotiate with Little Rocket Man... _E_
The ring announcers are working hard to justify the Mayweather victory. They should be ashamed of themselves! A TOTAL JOKE. _E_
Many of the Syrian rebels are radical jihadi Islamists who are murdering Christians. Why would we ever fight with them? _E_
Just arrived at Camp David where I am closely watching the path and doings of Hurricane Harvey as it strengthens to a Category 3. BE SAFE! _E_
It is amazing how rude much of the media is to my very hard working representatives. Be nice you will do much better! _E_
America's relationship with China is at a crossroads. We only have a short window of time to make the tough (cont) __HTTP__ _E_
Develop your gut instincts and act on them. You will have your biggest successes when you go with your gut but be very smart & careful. _E_
I will be interviewed by @oreillyfactor tonight on @FoxNews at 8pm. Enjoy! _E_
My interview with @IngrahamAngle discussing the real unemployment number and how the 7.8% number is a fraud __HTTP__ _E_
Negotiations on DACA have begun. Republicans want to make a deal and Democrats say they want to make a deal. Wouldn't it be great if we could finally after so many years solve the DACA puzzle. This will be our last chance there will never be another opportunity! March 5th. _E_
Congratulations to Roy Moore on his Republican Primary win in Alabama. Luther Strange started way back & ran a good race. Roy WIN in Dec! _E_
#AmericaFirst #RNCinCLE __HTTP__ _E_
.@AScottPGA Really solid playing keep going! _E_
I always enjoy watching young entrepreneurs enter the business world. I can tell who reads my books and who doesn't. #MidasTouch _E_
The 2012 election is the most important in my lifetime. We must nominate a candidate who will win and will roll back @BarackObama's damage. _E_
So many false and phony T.V. commercials being broadcast in Indiana. Reminds me of Florida where thousands were put up I won in a landslide! _E_
Mitt's got it right: @RickSantorum's attacks on @MittRomney's pro growth tax cut proposal are foolish. _E_
I am the only one who can beat Hillary Clinton. I am not a Mitt Romney who doesn't know how to win. Hillary wants no part of Trump _E_
"The world is changing very fast. Big will not beat small anymore. It will be the fast beating the slow." @rupertmurdoch _E_
Such a great honor! __HTTP__ _E_
Why should we have any defense cuts in any deal? America must remain strong. _E_
RT @seanhannity: HRC mishandles and destroys classified info NO PROBLEM! Pay/play on Uranium one NO PROBLEM! Lynch BC tarmac: it's a matte... _E_
Wow! What a great honor from @DRUDGE_REPORT __HTTP__ _E_
Congratulations to our new #VASecretary Dr. David Shulkin. Time to take care of Veterans who have fought to protect... __HTTP__ _E_
Via @BreitbartNews TRUMP WINS NASHVILLE GRASSROOTS STRAW POLL WITH 52 PERCENT __HTTP__ _E_
Great thanks. __HTTP__ _E_
...If you plan for the worst—if you can live with the worst—the good will always take care of itself. _E_
Looking forward to speaking @acnnews International Convention tomorrow morning in Charlotte NC __HTTP__ _E_
Enjoy the #SuperBowl and then we continue: MAKE AMERICA GREAT AGAIN! _E_
You can listen to my interview today with Jay Sekulow Live and the @JordanSekulow show here __HTTP__ @12PM EST. _E_
Thank you Ohio! VOTE so we can replace Obamacare and save healthcare for every family in the United States! Watch:... __HTTP__ _E_
I let @pennjillette come back on the record 13th season of 'All Star' @CelebApprentice after he relentlessly begged me to good t.v. _E_
When a country is no longer able to say who can and who cannot come in & out especially for reasons of safety & security big trouble! _E_
#MakeAmericaGreatAgain! __HTTP__ _E_
My @TMZ interview with @HarveyLevinTMZ discussing how I will see my $5M lawsuit against @billmaher to the end __HTTP__ _E_
#MakeAmericaGreatAgain! __HTTP__ _E_
We should tell China that we don't want the drone they stole back. let them keep it! _E_
The President of the U.S. is the leader of the Free World. He should dress like it at all times. Wear a suit and a tie for major interviews. _E_
I wonder how much our leaders have promised or given Russia in order for them to behave and not make the U.S. look even worse? _E_
Magician extraordinaire @pennjillette is back in the All Star @ApprenticeNBC. This time he has even more tricks up his sleeve. _E_
Americans who can afford to buy enough food is now at a 3 year low. Is this @BarackObama's 'recovery'? __HTTP__ _E_
Entrepreneurs: Realize that success requires 100% effort and 100% focus. Nothing less. _E_
I fully support the @NYPD @MayorBloomberg and @CommissionerKelly. They should all be honored for protecting us since 9/11 not demonized. _E_
The bus driver who saved the woman from jumping off the bridge was really cool great guy. I'm going to send him $10 000 he deserves it! _E_
Hillary Clinton failed all over the world. LIBYA SYRIA IRAN IRAQ ASIA PIVOT RUSSIAN RESET BENGHAZI... __HTTP__ _E_
Wow Hillary Clinton was SO INSULTING to my supporters millions of amazing hard working people. I think it will cost her at the Polls! _E_
I will be in California this weekend making a speech for Clint Eastwood. Then to Arizona and Vegas. Big crowds. Discussing illegals & more! _E_
"There are no environments where you're only going to win because life just isn't like that." Bobby Orr _E_
It is my great honor to send $25000 to Sgt. Andrew Tahmooressi. #marinefreed _E_
Despite major outside money FAKE media support and eleven Republican candidates BIG R win with runoff in Georgia. Glad to be of help! _E_
Six months in it is the hope of GROWTH 📈 that is making America FOUR TRILLION DOLLARS RICHER. Stuart @VarneyCo __HTTP__ __HTTP__ _E_
Another solar company @BarackObama funded with our money has filed for bankruptcy __HTTP__ One (cont) __HTTP__ _E_
My @gretawire interview re: how the debt ceiling is key point the fiscal curb & why we must & can make a great deal. __HTTP__ _E_
The biggest winner of Obama's '08 win Vladimir Putin. Ultimately he could be tied with Iran after Tehran becomes a nuclear power. _E_
Once Iran has nuclear weapons they will shut down the Strait of Hormuz. Oil will be over $300/Barrel. Iran'... (cont) __HTTP__ _E_
People of Ohio are fantastic. Thank you so much. What an evening! __HTTP__ _E_
2 million more people just dropped out of ObamaCare. It is in a death spiral. Obstructionist Democrats gave up have no answer = resist! _E_
Via @NewsInTheBurg: "@chefjoseandres to open restaurant in Trump Int'l Washington D.C." __HTTP__ _E_
Merry Christmas & Happy Holidays!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Once again under@BarackObama the US has fallen down the ranks of global competitiveness __HTTP__ We must do better. _E_
No American should be separated from their loved ones because of preventable crime committed by those illegally in our country. Our cities should be Sanctuaries for Americans – not for criminal aliens! __HTTP__ _E_
My thoughts on Gadhafi's death @BarackObama and the misery index... __HTTP__ #trumpvlog _E_
Big meetings today at the United Nations. So many interesting leaders. America First will MAKE AMERICA GREAT AGAIN! _E_
Mariano Rivera is one of top @Yankees of all time. Greatest closer of all time. A true warrior. Last night's MVP award well deserved. _E_
Congratulations to my children Don and Tiffany on having done a fantastic job last night. I am very proud of you! _E_
#CNNDebate __HTTP__ _E_
Look Snowden is bad done tremendous damage to our country and standing but we have far worse in our government (guess who?). _E_
Thank you to all of the men and women who protect & serve our communities 24/7/365! #LawEnforcementAppreciationDay... __HTTP__ _E_
Why does HI Revised Statute 338 17.8 allow an HI resident who doesn't have to be US citizen to procure an official Hawaii birth certificate? _E_
Trump: Rove 'Made a Fool Out of Himself' __HTTP__ via @cnsnews _E_
Heading to New Hampshire. Will be talking about the disaster known as ObamaCare! _E_
Russia beat the United States in the Olympics another Obama embarrassment! Isn't it time that we turn things around and start kicking ass? _E_
Kim Jong Un of North Korea who is obviously a madman who doesn't mind starving or killing his people will be tested like never before! _E_
The racial divide in our country is almost at an all time high and getting worse every time you turn on the television. _E_
Make the Boston killer talk before our doctors make him better. Once he is well he will say speak to my lawyers. _E_
Last night's @extratv 's interview by @MarioLopezExtra of gorgeous 2012 @MissUniverse @oliviaculpo __HTTP__ Great job! _E_
Once again someone we were told is ok turns out to be a terrorist who wants to destroy our country & its people how did he get thru system? _E_
I just left the Doral in Miami it is going to be amazing! __HTTP__ _E_
The Clinton's are the real predators... __HTTP__ _E_
For all of today's voters please remember that I am the only candidate that is self funding my campaign I am not bought and paid for! _E_
.@MarkHalperin showed a focus group on @Morning_Joe me using a very bad word. I never said the word left an open blank. Please apologize! _E_
Watching @loudobbsnews fantastic show! Has very interesting take on Paul Ryan. _E_
I am signing copies of my book CRIPPLED AMERICA. Order yours now makes a great holiday gift! __HTTP__ _E_
When I intelligently turned down The Club For Growth crazy request for $1000000 they got nasty.What a waste of money that would have been _E_
My joint @seanhannity int. on @FoxNews with @GeraldoRivera recapping @ApprenticeNBC & discussing the 2016 election __HTTP__ _E_
Dine With The Donald and Mitt: __HTTP__ _E_
Will be participating in a Town Hall tonight on @SeanHannity at 10pmE from Austin Texas. Enjoy! __HTTP__ _E_
Weiner says many more pictures may be out there—this is just what NYC needs a pervert Mayor. _E_
Wow two candidates called last night and said they want to go to my event tonight at Drake University. _E_
Let's be honest if Obama thought he could get away with campaigning during the storm then he would have been in Ohio on Monday. _E_
38 stories high @TrumpWaikiki's 462 luxury guest rooms & suites offer exceptional services __HTTP__ _E_
.@VanityFair's 2013 dwindling sales continue to sink at an even faster record rate under Graydon Carter __HTTP__ Disaster! _E_
Wall Street paid for ad is a fraud just like Crooked Hillary! Their main line had nothing to do with women and they knew it. Apologize? _E_
I'm getting ready to be inducted tonight into the WWE Hall of Fame at Madison Square Garden a great honor for me and the Trump family! _E_
As a presidential candidate I have instructed my long time doctor to issue within two weeks a full medical report it will show perfection _E_
TRUMP: GOP MUST DUMP 'USELESS' ROVE TO WIN PRESIDENTIAL ELECTIONS __HTTP__ by @mboyle1 @BreitbartNews _E_
Another new post debate poll. THANK YOU! #VoteTrump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Paul Ryan is far from my first choice but a very nice guy. The Republicans should go for tough and (very) smart this time no games! _E_
RT @markets: What Is Trump worth to Twitter? One analyst estimates $2 billion __HTTP__ __HTTP__ _E_
Sue them Tom! #TrumpVlog __HTTP__ _E_
Today is a big day for us and for Toronto: Trump International Hotel & Tower Toronto opens today. (cont) __HTTP__ _E_
For all those sick degenerates contemplating a knockout attack please remember the late great Charles Bronson no more crime! _E_
I will be interviewed by Anderson Cooper at 8pm on @CNN from New Hampshire. Should be very interesting! _E_
The Republicans never discuss how good their healthcare bill is & it will get even better at lunchtime.The Dems scream death as OCare dies! _E_
Just leaving Florida. Big crowds of enthusiastic supporters lining the road that the FAKE NEWS media refuses to mention. Very dishonest! _E_
As our Country rapidly grows stronger and smarter I want to wish all of my friends supporters enemies haters and even the very dishonest Fake News Media a Happy and Healthy New Year. 2018 will be a great year for America! _E_
Hillary Clinton conceded the election when she called me just prior to the victory speech and after the results were in. Nothing will change _E_
Too many people rely on auto correct...an assistant of mine apologizes! _E_
HAPPY 4TH OF JULY TO EVERYONE! MAKE AMERICA GREAT AGAIN! _E_
I will implement effective missile defenses to protect against threats. On this there will be no flexibility with Vladimir Putin. Mitt _E_
Happy Birthday President Reagan #FlashbackFriday __HTTP__ _E_
Buy American & hire American are the principles at the core of my agenda which is: JOBS JOBS JOBS! Thank you @exxonmobil. _E_
The reality is that no gun bill will ever stop tragedies. And as we have learned from ObamaCare Washington only makes things worse! _E_
I WILL DEFEAT ISIS. THEY HAVE BEEN AROUND TOO LONG! What has our leadership been doing?#DrainTheSwamp __HTTP__ _E_
Can you imagine the Boston killer being lovingly tended to in a hospital room right next to his victims who lost their arms legs and worse! _E_
The single greatest Witch Hunt in American history continues. There was no collusion everybody including the Dems knows there was no collusion & yet on and on it goes. Russia & the world is laughing at the stupidity they are witnessing. Republicans should finally take control! _E_
A detainee released from Gitmo has killed an American. When will our so called leaders ever learn! _E_
I don't want to be the only billionaire in America I want all Americans to be rich. _E_
Frank "FX" Giaccio On behalf of @FLOTUS Melania & myself THANK YOU for doing a GREAT job this morning! @NatlParkService gives you an A+! __HTTP__ _E_
Talking with @SammartinoBruno backstage __HTTP__ #WWEHOF _E_
.@realDonaldTrump will PROTECT and DEFEND the Constitution #Debate #BigLeagueTruth #DrainTheSwamp __HTTP__ _E_
My son @EricTrump is in Memphis at St. Jude Children's Research Hospital... _E_
Flashback – Jeb Bush received a $4M tax payer bailout in 1990 __HTTP__ Guess who was POTUS then? _E_
"Donald Trump to headline SC Tea Party Convention" __HTTP__ via @wyffnews4 _E_
El Chapo comes to the U.S. often thru our border—it's been revealed he has CA drivers license. __HTTP__ _E_
Members from Obama's own job council are endorsing @MittRomney __HTTP__ Not surprising _E_
Watch Coach Mike Ditka a great guy and supporter tonight at 8pmE on #WattersWorld with @jessebwatters @FoxNews. _E_
It never ends! __HTTP__ _E_
It was only after I informed NBC that I wouldn't do the Apprentice that they became upset w/ me. They couldn't care less about "inclusion _E_
Malfeasance at Fannie Mae and Freddie Mac helped cause our current financial meltdown. _E_
Does anybody remember when Bill Clinton in 2008 worked long and hard for Hillary? She LOST! Now Bill is at it again. Just watch. _E_
The greatest commodity to own is land. It is finite. God is not making any more of it. _E_
Glad to hear my @foxandfriends' Monday interview continues to get big ratings. Great way to start your week _E_
I will be interviewed on @FaceTheNation this morning at 10:00 A.M. Have a great day! _E_
.@KarlRove is a failed Jeb Bushy. Never says anything good & never will even after I beat Hillary. Shouldn't be on the air! _E_
Little Marco Rubio treated America's ICE officers like absolute trash in order to pass Obama's amnesty. __HTTP__ _E_
The Iranians are sure happy with Obama's nomination of Hagel. Already praising Hagel as 'Anti Israel' __HTTP__ _E_
How long will it take for chants bring back the replacement refs when a bad call is made? _E_
When will @davidaxelrod realize he is on a fool's errand trying to defend @BarackObama's ineptitude? _E_
Hypocrite! @HillaryClinton claims she needs a "public and a private stance" in discussions with Wall Street banks. #Debate _E_
"The road to Easy Street goes through the sewer." – John Madden _E_
How can Hillary run the economy when she can't even send emails without putting entire nation at risk? _E_
So what will happen to the Big O on Celebrity Apprentice tonight. Remember I only fire people when it is deserved not for other reasons! _E_
We will fight the #FakeNews with you! __HTTP__ _E_
Diligence is the mother of good luck. Benjamin Franklin _E_
Meeting with "Chuck and Nancy" today about keeping government open and working. Problem is they want illegal immigrants flooding into our Country unchecked are weak on Crime and want to substantially RAISE Taxes. I don't see a deal! _E_
'Presidential Executive Order on the Establishment of Presidential Advisory Commission on Election Integrity'... __HTTP__ _E_
If ObamaCare should not be repealed then why has Obama & Congress exempted their staffs? _E_
We ALL must be united & condemn all that hate stands for. There is no place for this kind of violence in America. Lets come together as one! _E_
I wonder why @BarackObama is not going to the NAACP Convention. Is it because he can't answer questions about 14.7% Black unemployment? _E_
.@FoxNews is much better and far more truthful than @CNN which is all negative. Guests are stacked for Crooked Hillary! I don't watch. _E_
Just had a great legal victory in Ft. Lauderdale won trial now will receive tremendous $ in legal fees from losers. Love it! _E_
Wow just announced that Lyin' Ted and Kasich are going to collude in order to keep me from getting the Republican nomination. DESPERATION! _E_
Via @NOLAnews by @DaveWalkerTV: Donald Trump praises @Joan_Rivers as 'strong' 'vibrant' in @ApprenticeNBC return __HTTP__ _E_
National Review is a failing publication that has lost it's way. It's circulation is way down w its influence being at an all time low. Sad! _E_
I will be interviewed on @megynkelly's The Kelly File tonight. Be sure to watch on @FoxNews! _E_
"Donald Trump to visit metro Detroit in May" __HTTP__ via @wxyzdetroit _E_
I am continuing to get rid of costly and unnecessary regulations. Much work left to do but effect will be great! Business & jobs will grow. _E_
After watching all about the horror story that is A Rod I realized again that it is time to let Pete Rose into the Baseball Hall of Fame! _E_
I will be doing Fox and Friends at 7 A.M. this morning. _E_
My #TrumpTuesday @SquawkCNBC interview discussing golf VP choices the real estate market & healthcare reform __HTTP__ _E_
Looking forward to tonight's conversation w/ David Rubenstein @TheEconomicClub. Airing live on @cspan at 7PM EST __HTTP__ _E_
Wow @Macys shares are down more than 40% this year. I never knew my ties & shirts not being sold there would have such a big impact! _E_
RT @DonnaWR8: @realDonaldTrump I wonder what this BRAVE American would give to stand on his OWN two legs just ONCE MORE for our #Anthem?... _E_
Watch me explain on the @Late_Show how my charitable offer to Obama changes the election and is about transparency __HTTP__ _E_
He @BarackObama believes that the War on Terror is over __HTTP__ Who does he think won? _E_
both countries will perhaps work together to solve some of the many great and pressing problems and issues of the WORLD! _E_
Join me for my #WeeklyAddress __HTTP__ __HTTP__ _E_
Pocahontas bombed last night! Sad to watch. _E_
Thank you for your incredible support Wisconsin and Governor @ScottWalker! It is time to #DrainTheSwamp & #MAGA!... __HTTP__ _E_
.@JebBush's opening and closing in the debate were said by all to be terrible fumbled around incoherent. _E_
.@SenJohnMcCain should be defeated in the primaries. Graduated last in his class at Annapolis dummy! _E_
The Fed's pumping is great news in the short term but it can't last forever. Be prudent in your market investing. _E_
Looks like my work here is done bringing a close to the first ever #NBC #SweepsTweet. Keep watching @ApprenticeNBC every Sunday 9/8c. _E_
.@Israel could very well be close to attacking Iran. Could be this election's big October surprise... _E_
Thank you to Bob Woodward who said That is a garbage document...it never should have been presented...Trump's right to be upset (angry)... _E_
Via @BreitbartNews by @j_strong: "Obama Administration Quietly Prepares 'Surge of Millions' of New Immigrant Ids" __HTTP__ _E_
Via @FoxSportsGolf: Trump's protégé earns US Open spot __HTTP__ _E_
Some jerk fraudulently tweeted that his parents said I was a big inspiration to them + pls RT—out of kindness I retweeted. Maybe I'll sue. _E_
I am going to repeal and replace ObamaCare! Read more about my positions on healthcare reform here: __HTTP__ _E_
This is one of the COLDEST WINTERS ever freezing all over the country for long periods of time! So much for GLOBAL WARMING. _E_
Discussing the 9/11 attack and coverage with @kingsthings while hosting the 25th anniversary of his @CNN show __HTTP__ _E_
Doing Fox and Friends in two minutes! _E_
Will be doing Fox & Friends at 7 2 minutes. _E_
The Trans Pacific Partnership will lead to even greater unemployment. Do not pass it. _E_
I guess I have reached yet another ceiling 49.7% with four people. My highest Reuters poll yet! Thank you! __HTTP__ _E_
Even the @NYTimes and @WashingtonPost Editorial Boards condemned Justice Ginsburg for her ethical and legal breach. What was she thinking? _E_
Via @ConMonitorNews by @CMonitor_JVF: Donald Trump guest speaker at event honoring James Foley __HTTP__ _E_
Donald Trump returns to the 'Apprentice' boardroom __HTTP__ via @BW _E_
I campaigned on creating a merit based immigration system that protects U.S. workers & taxpayers. Watch: __HTTP__ #RAISEAct __HTTP__ _E_
Nation's Immigration And Customs Enforcement Officers (ICE) Make First Ever Presidential Endorsement: __HTTP__ _E_
Congratulations to @PiersMorgan on winning @BritishGQ TV Personality Of The Year. Piers deserves his success! _E_
.@DannyZuker Danny—Let your bosses on Modern Family lend you the money to play the game. Show courage! _E_
I never want someone working for me who doesn't want to be there and in the same way you shouldn't want to be there either. _E_
If elected POTUS I will stop RADICAL ISLAMIC TERRORISM in this country! In order to do this we need to... __HTTP__ _E_
Getting ready for the big news conference in Dubai. It should all be happening in the U.S. but it isn't SAD! _E_
President Obama has a personal responsibility to visit & embrace all people in the US who contract Ebola! _E_
Melania and I just had interview with the legendary @BarbaraJWalters. Watch #abc2020 this Friday. Tonight we talk ISIS @WNTonight _E_
WaPo attack on alleged high school incidents by @MittRomney is a hit job to me. Where are @BarackObama's high school and college records? _E_
Obama should meet with Putin snd convince him to do what is good for the U.S. It's called good dealmaking or simply leadership! Cajole. _E_
Great leaders listen to and support law enforcement officials. Police discuss no go areas: __HTTP__ __HTTP__ _E_
Diane Black of Tennessee the highly respected House Budget Committee Chairwoman did a GREAT job in passing Budget setting up big Tax Cuts _E_
Hopefully Republican Senators good people all can quickly get together and pass a new (repeal & replace) HEALTHCARE bill. Add saved $'s. _E_
Since November 8th Election Day the Stock Market has posted $3.2 trillion in GAINS and consumer confidence is at a 15 year high. Jobs! _E_
No cuts to welfare no cuts to food stamps & NOT A SINGLE CUT TO OBAMACARE yet the new budget cuts military benefits. Sad! _E_
"Build your reputation on intelligence responsibility and results. That's building the right way." – Think Like a Champion _E_
Iran's nuclear program must be stopped – by any and all means necessary. _E_
Before I bought the site the Sun Times had the biggest ugliest sign Chicago has ever seen. Mine is magnificent and popular. _E_
THANK YOU SYRACUSE! #NYPrimary __HTTP__ __HTTP__ _E_
Set the example. You can motivate others as well as yourself by remembering that you are setting the example. _E_
Not since Watergate have we been going thru a time like this Benghazi IRS wiretapping of @AP... _E_
My @CNN interview with @wolfblitzercnn where I discuss @BarackObama's 'birth certificate' and why @CNN has low ratings __HTTP__ _E_
There must be a higher standard of accuracy in the media. Incredible that some so called journalists can make up lies and get away with it _E_
RT @TrumpInaugural: Counting down the days until the swearing in of @realDonaldTrump & @mike_pence. Check in here for the latest updates. #... _E_
Rowanne Brewer the most prominently depicted woman in the failing @nytimes story yesterday joined @foxandfriends. __HTTP__ _E_
Why do we always know how the four liberals are going to rule but have to think about which side the Republican judges will go. _E_
In the span of two months @BarackObama the habitual vacationer has called America soft and lazy. He loves to criticize America. _E_
Must read @IBDinvestors editorial: "Child Alien Crisis Obama's Fault But GOP Won't Pounce" __HTTP__ _E_
.@CPACnews had its largest ever ticket sales the day of my announcement. Really an honor. Can't wait to see everyone. _E_
Congrats to the new Gov. of Texas @GregAbbott_TX for taking a tough & bold stance at the border. Should have been done long ago by Perry. _E_
Sadly there is no way that Ted Cruz can continue running in the Republican Primary unless he can erase doubt on eligibility. Dems will sue! _E_
Entrepreneurs: Keep an open mind! Business is a creative endeavor. There are always opportunities and possibilities. _E_
Lyin'Ted Cruz is weak & losing big so now he wants to debate again. But according to DrudgeTime and on line polls I have won all debates _E_
My father Fred Trump left me a relatively small amount of money (compared to where I am today over $10 billion) but vast amount of knowledge _E_
Why isn't Hillary Clinton 50 points ahead?#DebateNight __HTTP__ _E_
Despite the false @nytimes story about Jeb Bush being happy with the Trump surge he fell more than anybody & is miserable. _E_
Emails prove WH knew ObamaCare website wouldn't work in October why didn't they delay the launch? __HTTP__ _E_
I look forward to attending Saturday Night Live on Sunday night. I am sure it will be a great show. @nbcsnl __HTTP__ _E_
.@megynkelly used this poll (nobody else did) when I was down—wonder if she'll use it now that I'm up? __HTTP__ _E_
My @FoxNews interview last night with @Gretawire __HTTP__ _E_
To all the Bernie voters who want to stop bad trade deals & global special interests we welcome you with open arms. People first. _E_
Intrinsic means basic inborn elemental. If you have an intrinsic value it cannot be taken away. Think Like a Champion _E_
The Chinese must still be laughing at Kerry's trip to China. He got nothing gave them everything and promised even more. _E_
The Donald J. Trump Signature Collection available @Macys offers this fall's top styles in ties shirts & suits __HTTP__ _E_
I am in IstanbulTurkey. Just opened magnificent #TrumpTowers a big hit. _E_
RT @DonaldJTrumpJr: Someone please fact check her coal comments. Give me a break. #debates _E_
RT @netanyahu: Ever Strongerחזקים תמיד 🇱 __HTTP__ _E_
Thank you Hilton Head South Carolina! @SCTeamTrump #Trump2016 __HTTP__ __HTTP__ _E_
Upstate New York needs jobs. Frack Now & Frack Fast! Pay off NY State debt. _E_
Via Politico: Trump Extends Lead in New Hampshire Poll __HTTP__ _E_
RT @JaydaBF: VIDEO: Islamist mob pushes teenage boy off roof and beats him to death! __HTTP__ _E_
Chuck Jones who is President of United Steelworkers 1999 has done a terrible job representing workers. No wonder companies flee country! _E_
Despite the Fake News Media in conjunction with the Dems an amazing job is being done in Puerto Rico. Great people! _E_
Everything you can imagine is real. —Pablo Picasso _E_
Isn't it interesting that now that I'm #1 in the polls the networks show polls that are a month old! _E_
More radical Islam attacks today it never ends! Strengthen the borders we must be vigilant and smart. No more being politically correct. _E_
I am at the Saturday Night Live Studio electricity all over the place. We will be doing a tweeting skit so stay tuned! _E_
Trump National Golf Club Los Angeles on the Palos Verdes Peninsula overlooking the Pacific Ocean spectacular! __HTTP__ _E_
We should start an immediate investigation into @SenSchumer and his ties to Russia and Putin. A total hypocrite! __HTTP__ _E_
Will be doing a joint press conference in Hanoi Vietnam then heading for final destination of trip the Phillipines. _E_
I know some of you may think l'm tough and harsh but actually I'm a very compassionate person (with a very high IQ) with strong common sense _E_
.@WSJ Editorial says Clinton primary vote total is 8646551.Trump's is 7533692 a knock. But she had only 3 opponents I had 16.Apologize _E_
Bob Schieffer will do a great job tonight. Always treated me fairly. _E_
RT @foxandfriends: Yesterday's hearings provided zero evidence of collusion between our campaign and the Russians because there wasn't any... _E_
With all of the Fake News coming out of NBC and the Networks at what point is it appropriate to challenge their License? Bad for country! _E_
How quality a woman is Rowanne Brewer Lane to have exposed the @nytimes as a disgusting fraud? Thank you Rowanne. _E_
Maybe Boehner will stop this one sided deal in the House...I hope so! _E_
NEW FBI TEXTS ARE BOMBSHELLS! _E_
Great news! Just out the highly respected USA Today/Suffolk University Poll. Enjoy! __HTTP__ _E_
RT @MollyCBraswell: WHAT?! @realDonaldTrump is speaking at #CPAC2013? This conference just became like a hundred times more awesome! _E_
Thank you so nice. _E_
The Supercommittee is a disaster. The Republicans made a crucial mistake agreeing to this debt deal. They hat... (cont) __HTTP__ _E_
Looking over New York City with luxurious 5 Star hotel rooms @TrumpNewYork top dining & amenities __HTTP__ _E_
Wow just in ObamaCare projected to cause large scale drop in jobs even Dems are shocked by 2.5 million number. DISASTER! _E_
The U.S. needs to protect our intelligence assets especially in China. If the Chinese want to spy on us then we need to return the favor. _E_
Paula Broadwell's book on Gen. Petreus is titled All In. Did she know something? _E_
It's time for politicians to be reminded they work for us! We can get it done. Let's Make America Great Again! __HTTP__ _E_
RT @RSBNetwork: LIVE Stream now: Donald Trump press conference #TrumpTrain #Trump2016 __HTTP__ _E_
Interesting polls on who won the GOP debate. __HTTP__ _E_
At the Old Post Office __HTTP__ _E_
Congratulations to Miss Mexico Jimena Navarrete our new Miss Universe 2010 and congratulations to everyone for a fantastic show. _E_
Donald Trump Leads Polls in Florida __HTTP__ _E_
Even though I beat him in the first six debates especially the last one Ted Cruz wants to debate me again. Can we do it in Canada? _E_
.@StephenBaldwin7 and me at a press event for All Star @ApprenticeNBC earlier today at @TrumpTowerNY.... __HTTP__ _E_
Obamacare premiums continue to rise and bend up the cross curve. And the back end of the website does not even work. _E_
I was on CNN last night with @ErinBurnett. _E_
.@WhoopiGoldberg had better surround herself with better hosts than Nicole Wallace who doesn't have a clue. The show is close to death! _E_
If Trump became president he would do an amazing job if Obama took over Celebrity Apprentice he'd fail. What's your opinion? I agree! _E_
Looking forward to a big rally in Nashville Tennessee tonight. Big crowd of great people expected. Will be fun! _E_
Economic growth can save Social Security Medicare and America. _E_
THANK YOU Council Bluffs Iowa! The silent majority is silent no more!#Trump2016 #FITN __HTTP__ __HTTP__ _E_
The Council is concerned over the health & safety for the village of Blackdog w/ placement of sub station. @AlexSalmond @pressjournal _E_
.@MarkHalperin I totaly won the RJC meeting yesterday. Know many members who said not even close. Only FULL standing O. But don't want $'s _E_
"I don't think you should ever run from history. You should learn from it and embrace it." @LAClippers Coach Doc Rivers _E_
The failing @NYTimes would do much better if they were honest! __HTTP__ _E_
Yesterday there was yet another massive intelligence leak by the @BarackObama administration. __HTTP__ _E_
MILITARY LIVES MATTER! END GUN FREE ZONES! OUR SOLDIERS MUST BE ABLE TO PROTECT THEMSELVES! THIS HAS TO STOP! _E_
New and great selection of ties shirts and cufflinks@Macy's check them out! _E_
I was right—TV ratings for US Open are way down from last year. People don't want to look at a burned out ugly course! _E_
He @RickSantorum should get out of the race so Republicans can focus on @BarackObama. _E_
Please send a psychiatrist to help @Rosie she's in a bad state. To @Rosie's girlfriend's parents get (cont) __HTTP__ _E_
I will be on @foxandfriends this morning at 8:30. Enjoy! _E_
The ruling @GOP consultant class of losers like @KarlRove have no respect for the Tea Party. They do this at their own peril! _E_
.@Yankees are making a big mistake sending the doping @AROD to rehab assignment. Should suspend him until investigation is over. _E_
.@MittRomney will create 2 million new jobs if elected POTUS. If reelected @BarackObama will create over $12T in new debt. Easy choice. _E_
Brian Williams who is not the nice guy that people think he is has now become totally irrelevant. He will never again hold court! _E_
My @greta interview discussing why we do not need another Bush __HTTP__ _E_
Absentee Governor Kasich voted for NAFTA and NAFTA devastated Ohio a disaster from which it never recovered. Kasich is good for Mexico! _E_
This season's cast of @ApprenticeNBC brings excitement to the Board Room. Lots of surprises & great tasks. Enjoy – Jan. 4th! _E_
Celebrating New Year's Eve in the Windy City? Join @TrumpChicago for the chic & elegant Cirque Soiree Celebration __HTTP__ _E_
Via @eonline: Donald Trump wants Katherine Webb for Miss USA judge __HTTP__ _E_
.@MacMiller has over 79M hits on YouTube & just hit platinum with his Donald Trump song—screw you Mac! _E_
I want to thank my @Cabinet for working tirelessly on behalf of our country. 2017 was a year of monumental achievement and we look forward to the year ahead. Together we are delivering results and MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_
Great news Chinese companies who were fixing prices and accounting are leaving the US stock market __HTTP__ #TimeToGetTough _E_
Good news is Melania's speech got more publicity than any in the history of politics especially if you believe that all press is good press! _E_
In game 7 of the World Series tonight the Giants are making a big mistake in not starting their ace against K.C. even with two days rest. _E_
.@GMA at 7:00 A.M. _E_
If your actions inspire others to dream more learn more do more and become more you are a leader. – John Quincy Adams _E_
So nice thank you very much. __HTTP__ _E_
.@AlexSalmond Heatwave in Scotland makes wind turbines useless. Big problem expensive mess. _E_
.@THR The Donald Trump Ratings Bump: Who's Benefiting Most? __HTTP__ _E_
My major hotel conversion of The Old Post Office on Pennsylvania Avenue in D.C. is under budget and ahead of schedule. Should be U.S.A. _E_
I visited our Trump Tower campaign headquarters last night after returning from Ohio and Arizona and it was packed with great pros WIN! _E_
Via @washingtonpost's @goingoutguide by @timcarman: " @gzchef open the National at the Old Post Office Pavilion" __HTTP__ _E_
Great day today in South Carolina. Fantastic capacity crowd amazing people! _E_
Today Judge St. Eve ruled in my favor on the two remaining claims brought by Goldberg in Chicago. The case is now officially over... _E_
My @Newsmax_Media int. with @SteveMTalk on my Iowa @theFAMiLYLEADER speech @jonkarl 2016 & Benghazi __HTTP__ _E_
RT @shawgerald4: @realDonaldTrump Thank you President TRUMP!! __HTTP__ _E_
Leaving today for California to inspect my fantastic golf course & club on the Palos Verdes peninsula. Big success. __HTTP__ _E_
Dave Brubeck was great and will be missed! _E_
Jeb Bush should stop trying to defend his brother and focus on his own shortcomings and how to fix them. Also Rubio is hitting him hard! _E_
Via @Newsmax_Media by "Donald Trump: Don't Give Obama Fast Track Trade Authority" __HTTP__ _E_
A strong Poland is a blessing to the nations of Europe and a strong Europe is a blessing to the West and to the world. __HTTP__ _E_
The measure of who we are is what we do with what we have. Vince Lombardi _E_
Wow Jeb Bush just lost three of his top fundraisers they quit! _E_
I've helped pass and signed 38 Legislative Bills mostly with no Democratic support and gotten rid of massive amounts of regulations. Nice! _E_
Thank you Indiana we were just projected to be the winner. We have won in every category. You are very special people I will never forget! _E_
Miss Universe Paulina Vega criticized me for telling the truth about illegal immigration but then said she would keep the crown Hypocrite _E_
"Strive for wholeness and keep your sense of wonder intact & you will find yourself ready for a grand slam." Think Like A Champion _E_
All eyes are on Florida today. I will be watching the GOP primary results very closely. We need the right candidate to beat @BarackObama. _E_
Why do you need a photo ID to buy a drain cleaner __HTTP__ not to vote? _E_
"Trump Tiger Team Up to Create 'Stunning' Golf Course in Dubai" __HTTP__ via @Newsmax_Media by @Jlorenz _E_
Via @fitsnews by Will Folks: "'THE DONALD' REBUKES OBAMATRADE" __HTTP__ _E_
THANK YOU! #AmericaFirst __HTTP__ _E_
Clinton campaign & DNC paid for research that led to the anti Trump Fake News Dossier. The victim here is the President. @FoxNews _E_
Arriving to check out the border. __HTTP__ _E_
Leaking and even illegal classified leaking has been a big problem in Washington for years. Failing @nytimes (and others) must apologize! _E_
We have got to take our country back. It's time! _E_
Major League Baseball: The best thing you can do is let @PeteRose_14 your all time hits leader into the Hall of Fame. It's time! _E_
Eric Trump on @foxandfriends now! _E_
The jury was not told the killer of Kate was a 7 time felon. The Schumer/Pelosi Democrats are so weak on Crime that they will pay a big price in the 2018 and 2020 Elections. _E_
Beyond eliminating the wasteful spending we need to get tough in cracking down on the hundreds of billions of (cont) __HTTP__ _E_
I have been saying it for sometime now!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Exclusive Davi: Trump The Lion We Need __HTTP__ _E_
.@TraceAdkins the winner of @ApprenticeNBC after last night's victory __HTTP__ _E_
"Go as far as you can see when you get there you'll be able to see farther." J. P. Morgan _E_
RT @rupertmurdoch: As predicted Trump reaching out to make peace with Republican establishment . If he becomes inevitable party would be... _E_
.@seanhannity should have corrected Jeb Bush when he said that I ran for president twice. Never ran merely considered running! _E_
"Always remember: Dress for the job you want not the job you have." – Think Like a Billionaire _E_
Thank you to everyone who came out & joined us @TrumpTurnberry yesterday! @EricTrump @IvankaTrump @DonaldJTrumpJr __HTTP__ _E_
Losers and haters are invited to watch Celebrity Apprentice along with the many great and productive people in the hope that you will learn. _E_
I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
Bowe Bergdahl walked off the base after he was told not to. Solders died looking for him. U.S. should NEVER have made the deal! PUNISHMENT? _E_
Obama's promise to build an international coalition against ISIS is already broken. No one trusts him at home or abroad. _E_
...It's old electrical grid which was in terrible shape was devastated. Much of the Island was destroyed with billions of dollars.... _E_
My sense is that people are far angrier at the President than they are at Congress re the shutdown—an interesting turn! _E_
Thanks @SherriEShepherd 4 your nice comments today on The View. U were terrific! _E_
He's saddled our children with more debt than we accumulated in 225 years in America. @BarackObama has done an (cont) __HTTP__ _E_
From The Desk Of Donald Trump two new videos up at __HTTP__ and __HTTP__ _E_
You're hired! The @CENTURY21 ad is airing during the #SuperBowl and you need to get voting! Vote for me & @CENTURY21: __HTTP__ _E_
Congrats to Team USA & Capt. @AllenWronowski for retaining the PGA Cup! Well done and well deserved! _E_
Met a big fan today! __HTTP__ _E_
Talks on Repealing and Replacing ObamaCare are and have been going on and will continue until such time as a deal is hopefully struck. _E_
Keep lightweight Marco and his friends out of the White House. #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
If we want to renew our PROSPERITY restore OPPORTUNITY & re establish our economic DOMINANCE then we need tax reform that is pro growth.. __HTTP__ _E_
Thank you @DallasPD! __HTTP__ _E_
'Dem Operative Who Oversaw Trump Rally Agitators Visited White House 342 Times' #DrainTheSwamp __HTTP__ _E_
Thanks. __HTTP__ _E_
Getting ready to watch the debate as they say let's get ready to rumble ! _E_
Remember the harder you work the luckier you get! _E_
"Some people dream of great accomplishments while others stay awake and do them." Anonymous _E_
ObamaCare is a broken mess. Piece by piece we will now begin the process of giving America the great HealthCare it deserves! _E_
Legal immigrants want border security. It is common sense. We must build a wall! Let's Make America Great Again! __HTTP__ _E_
Our country is totally divided and our enemies are watching. We are not looking good we are not looking smart we are not looking tough! _E_
Thank you to Prime Minister of Australia for telling the truth about our very civil conversation that FAKE NEWS media lied about. Very nice! _E_
Together we will show the world that the forces of destruction and extremism are NO MATCH for the BLESSINGS of PROSPERITY and PEACE! __HTTP__ _E_
Will be joining @jimmyfallon on @FallonTonight at 11:35pmE tonight. Enjoy! _E_
My @amtalker int. on @whoradio w/@SteveKingIA discussing my upcoming campaign visit for Steve this Sat. in Iowa __HTTP__ _E_
Via @TV3Xpose: "@IvankaTrump: Think pink in the boardroom." __HTTP__ _E_
Boston incident is terrible. We need energy and passion but we must treat each other with respect. I would never condone violence. _E_
Last night William Shatner had more airtime than any winner. It should have been called the William Shatner show... _E_
Massive crowds already forming in Jacksonville will be and incredible day 12 noon! MAKE AMERICA GREAT AGAIN! _E_
Hope we all enjoy @60Minutes tomorrow night. I do believe they will treat me fairly! _E_
Thank you to Sue Kruczek who lost her wonderful and talented son Nick to the Opioid scourge for your kind words while on @foxandfriends. We are fighting this terrible epidemic hard Nick will not have died in vain! _E_
Still time to get out and VOTE!#WIPrimary #Trump2016 #MAGA __HTTP__ _E_
I told the Republicans the debt ceiling talks should come before election & we would have a Republican president—they wouldn't listen. _E_
Q/A @stalkinpeople Yes I'd give the real numbers. _E_
I'll be in one of my favorite places this morning Staten Island. Big crowd will be fun! _E_
Join us in Salt Lake City Utah tonight!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
In getting the endorsement of the 16500 Border Patrol Agents (thank you) the statement was made that the WALL was very necessary! _E_
Via @Newsmax_Media by @wandacarruthers: Donald Trump: US Defeating ISIS Only in John Kerry's Imagination __HTTP__ _E_
I hope the @GOP realizes that if they blow this election the Tea Party won't be with them next time. _E_
When candidate John Kasich on the @oreillyfactor talked about dismantling Medicare and Medicaid he was referring to Ben Carson. _E_
Amazing rally in Florida this is a MOVEMENT! Join us today at __HTTP__ __HTTP__ _E_
Fascinating to watch people writing books and major articles about me and yet they know nothing about me & have zero access. #FAKE NEWS! _E_
Today @FLOTUS hosted a Military Mother's Day Event in the East Room of the WH. It was an honor to stop by say hel... __HTTP__ _E_
"A leader does not deserve the name unless he is willing occasionally to stand alone." Henry A. Kissinger _E_
We need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_
Failing @GlennBeck lost all credibility. Not only was he fired @ FOX he would have voted for Clinton over McCain. __HTTP__ _E_
Will be on @Morning_Joe in 5 minutes at 7:00. Enjoy! _E_
The @CelebApprentice will be broadcast tonight on @CNBC at 9 PM. _E_
Tomorrow is #TrumpTuesday on @SquawkCNBC 7:30 AM. Tune in! _E_
Ashley Judd has just thanked Karl Rove for all the attention he has given her—unreal!—how stupid can we get? _E_
Waste! With a $16T debt and $1T budget deficit @BarackObama is sending $770M overseas to fight global warming __HTTP__ _E_
Seems to be the next election must be about jobs and gas prices not birth control. _E_
Vote for your favorite TRUMP HOTEL COLLECTION hotels in Travel + Leisure's 2012 World's Best Awards Survey __HTTP__ _E_
How long did it take your staff of 823 people to think that up and where are your 33000 emails that you deleted? __HTTP__ _E_
I truly LOVE all of the millions of people who are sticking with me despite so many media lies. There is a great SILENT MAJORITY looming! _E_
Now @BarackObama has decided there are 5 million Palestinian refugees __HTTP__ He always goes against @Israel's interest. _E_
Marco Rubio lost big last night. I even beat him in Virginia where he spent so much time and money. Now his bosses are desperate and angry! _E_
Do you notice that because of Ebola ISIS etc. ObamaCare has gone to the back burner despite horrible results coming out. A disaster! _E_
The massive TAX CUTS/REFORM that I have submitted is moving along in the process very well actually ahead of schedule. Big benefits to all! _E_
The Democrats in the Southwest part of Virginia have been abandoned by their Party. Republican Ed Gillespie will never let you down! _E_
Gas prices are at crazy levels fire Obama! _E_
I hope everybody is having a FANTASTIC Christmas! No matter how tough things may seem remember that you will ride it out & go on to victory! _E_
Is everyone seeing how incompetently our country is being run by watching the mess with Syria? Our leaders don't know what they are doing! _E_
.@AlexSalmond –the man who let terrorist (Pan Am Flight 103) al Megrahi go lost another battle over ugly wind turbines in Blackdog. _E_
Via @FOXSports: Trump 'blowing up' @DoralResort after WGC @CadillacChamp __HTTP__ by @AP _E_
A fantastic day and evening in Washington D.C.Thank you to @FoxNews and so many other news outlets for the GREAT reviews of the speech! _E_
I employ many people in Hawaii at my great hotel in Honolulu. I'll be there very soon. Vote for me Hawaii! _E_
It is my great honor to support our Veterans with you! You can join me now. Thank you! #Trump4Vets __HTTP__ _E_
Tune in tonight at 1 AM EST to the QVC network to watch Melania Trump debut her first 2012 Melania Timepieces & (cont) __HTTP__ _E_
VOTE 4 @mariamenounos & derekhough#01 tonight! She's doing a great job on Dancing with the Stars #DWTS (& a good person). 1 800 868 3401 _E_
The Yankees really have to be embarrassed losing all four games to the Mets my great friend George Steinbrenner would be going nuts! _E_
Getting ready to lift off for Laredo. Will land at 1:OO P.M. Should be exciting and informative! _E_
It was a great honor to represent the United States at the magnificent #BastilleDay parade. Congratulations President @EmmanuelMacron! __HTTP__ _E_
.@MELANIATRUMP @IvankaTrump @EricTrump @DonaldJTrumpJr & I thank our loyal fans for another great season of @ApprenticeNBC! _E_
RT @WhiteHouse: #Obamacare has led to higher costs and fewer health insurance options for millions of Americans. It has failed the American... _E_
Later today I'm being honored at the Park Hyatt in Washington D.C. by the Wharton Club. The Joseph Wharton Award Dinner. A great honor. _E_
My son Donald will be interviewed by @seanhannity tonight at 10:00 P.M. He is a great person who loves our country! _E_
No wonder Sony is doing so badly. Really stupid leadership that wants Al Sharpton to help. Watch him turn the tables on chief Amy Pascal. _E_
RT @charliekirk11: Incredible video: @CBS does a special on the GOP tax plan The result?Every middle class family they sat down with SA... _E_
Verlander is great but very beatable. Does not have a good ERA in playoff games _E_
Heading to D.C. to speak at Faith and Freedom Coalition and visit OPO. _E_
"No matter how good you get you can always get better and that's the exciting part." @TigerWoods _E_
Yesterday was amazing—5 victories. Lyin' Ted Cruzhad zero. Things are going very well! _E_
Thank you for your support!#AmericaFirst #ImWithYou __HTTP__ _E_
Thank you Louisiana! #Trump2016#SuperSaturday _E_
I am greatly honored to receive Sarah Palin's endorsement tonight. Video: __HTTP__ __HTTP__ _E_
ObamaCare is one of the worst political disasters of all time 4992343 AMERICANS LOSING COVERAGE LESS THAN 50OOO NEW SIGNUPS. _E_
.... but you only want to talk about 10 years later when I still win 10PM in all key demos.@DannyZuker _E_
Via @todayshow by @ReeHines: "Donald Trump reveals new @ApprenticeNBC cast talks Joan Rivers' role on show" __HTTP__ _E_
Sometimes your best investments are the ones you don't make. _E_
I will be visiting Trump Int'l Golf Links in Scotland tomorrow. Always great to see the Great Dunes of Scotland. __HTTP__ _E_
I'll soon be leaving for Washington where @AmSpec will give me the T. Boone Pickens Entrepreneur Award. Very exciting! _E_
Crooked Hillary Attacks Foreign Government Donations While Ignoring Her Own: __HTTP__ _E_
I think that both candidates Crooked Hillary and myself should release detailed medical records. I have no problem in doing so! Hillary? _E_
Yogi Berra was not only a great baseball player he was a great guy. Yogi will be missed. __HTTP__ _E_
RT @piersmorgan: Trump makes a funny obvious joke about Russia going after Hillary's emails & U.S. media goes insane with fury.He plays t... _E_
Back by popular demand @latoyajackson returns to the 13th season of All Star @CelebApprentice. She is fierce in the Board Room! _E_
Anyone reading this profile of Marco Rubio would never vote for him. Never made ten cents & is totally controlled! __HTTP__ _E_
Join me in Sacramento California tomorrow evening @ 7pm! #Trump2016 __HTTP__ __HTTP__ _E_
LIVE on #Periscope __HTTP__ _E_
The three great essentials to achieve anything worth while are: Hard work stick to itiveness and common sense. Thomas A. Edison _E_
A test: tweet me the reason @billmaher was fired from @ABC (other than his bad ratings). _E_
The @USNavy is conducting search and rescue following aircraft crash. We are monitoring the situation. Prayers for all involved. __HTTP__ _E_
Meryl Streep one of the most over rated actresses in Hollywood doesn't know me but attacked last night at the Golden Globes. She is a..... _E_
Can you believe that the U.S. will be sending 3000 troops to Africa to help with Ebola.They will come home infected? We have enough problems _E_
Honored to welcome Georgia Prime Minister Giorgi Kvirikashvili to the @WhiteHouse today with @VP Mike Pence.... __HTTP__ _E_
.@BernieSanders who blew his campaign when he gave Hillary a pass on her e mail crime said that I feel wages in America are too high. Lie! _E_
The failing @politico news outlet which I hear is losing lots of money is really dishonest! _E_
I am happy to announce that the original Apprentice which will offer job opportunities to those in need is coming back. _E_
If Obama keeps pushing wind turbines our country will go down the tubes economically environmentally & aesthetically. _E_
Trust yourself. Create the kind of self that you will be happy to live with all your life. Golda Meir _E_
RT @IvankaTrump: Beautiful article about @realDonaldTrump written by my friend the incredibly talented golfer Natalie Gulbis: __HTTP__ _E_
Guess what folks the ObamaCare website just went down again. What a disaster. _E_
As we are learning the hard way both domestically & internationally hope is not a strategy. _E_
My @gretawire int. on Obama scandals not resonating no retribution on Benghazi Obama not being engaged __HTTP__ _E_
Jerry Buss was a great guy and friend. He will be missed! _E_
Deportations are now at a record low. Obama manipulated the numbers to lie to the public that they were at a record high. Secure the border! _E_
.@AGSchneiderman has never once said that he didn't ask for campaign contributions during the investigation. _E_
...and will be very embarrassed unless they get smart fast. _E_
The US is stupidly closing all of its coal fired plants while at the same time we're selling our coal to (cont) __HTTP__ _E_
Democrats are holding our Military hostage over their desire to have unchecked illegal immigration. Can't let that happen! _E_
#timetogettough The White House Correspondents' Dinner in my new book Time To Get Tough .....watch the #trumpvlog __HTTP__ _E_
Tim Kaine is and always has been owned by the banks. Bernie supporters are outraged was their last choice. Bernie fought for nothing! _E_
Always nice to see the terrific @mariamenounos at the #WWEHOF. __HTTP__ _E_
Obama told @NBC that Egypt is no longer an ally. They used to be until he pushed out Mubarak. _E_
Her instincts are suboptimal. __HTTP__ _E_
After 14 years U.S. beef hits Chinese market. Trade deal an exciting opportunity for agriculture. __HTTP__ _E_
Donald Trump appeared on the final episode of The Jay Leno Show to deliver a very special message: __HTTP__ _E_
.@AndreBauer Great job and advice on @CNN @jaketapper Thank you! _E_
.@johnboehner—if you can't make a great deal go over the cliff & negotiate new deal along with debt ceiling in February!—Trump 101. _E_
Ben Bradlee was truly one of the greats. What an amazing life he led. My warmest condolences to Dino & the whole family. #BenBradlee _E_
.@RealDonaldTrump wants a SAFE America w/ stronger borders no amnesty and an END to sanctuary cities. He is... __HTTP__ _E_
If the @yankees can somehow beat Verlander tonight then they can still salvage the series. And I will go to games 6& 7 so they will win! _E_
The replacement refs are getting blamed for everything. I've seen many bad sports calls over the years. _E_
Bernie Sanders has been treated terribly by the Democrats—both with delegates & otherwise. He should show them and run as an Independent! _E_
"A list golfing buddy! @Tegan__Martin enjoys golf w/Donald Trump ahead of@MissUniverse" __HTTP__ via @DailyMailCeleb _E_
It's Thursday how many more bias press reports will be released against @MittRomney? _E_
Congratulations to Bill O'Brien on being named the Republican Speaker of the NH House. Well earned & well deserved. A great guy. _E_
Statement Regarding Recent Executive Order Concerning Extreme Vetting: __HTTP__ _E_
The electric power grid in Puerto Rico is totally shot. Large numbers of generators are now on Island. Food and water on site. _E_
We fully support @SaveCulzean in Turnberry great for beauty & tourism. Wind turbines are death to environment. __HTTP__ _E_
It is that time of the year. The Trump Wollman Skating Rink is open to the public in Central Park. The greatest ice rink in the country. _E_
Failed presidential candidate @MittRomney was made to look like a fool by Senator Harry Reid & didn't release his tax returns until 9/21/12. _E_
.@Deadspin will never make it—they don't understand graciousness or money—and best guy is leaving? _E_
Will be on @foxandfriends now. _E_
"You have to have a good reason for doing what you're doing because people connect with the why." – Midas Touch _E_
.@tuckercarlson is doing a really good job on Fox especially when talking politics. He has come a long way fast! _E_
Thank you James Freeman of the @WSJ for the very nice words. All polls said I won the debate except NBC (3rd). Explain to Daniel Henninger! _E_
Why would a very low ratings radio talk show host like Hugh Hewitt be doing the next debate on @CNN. He is just a 3rd rate gotcha guy! _E_
Obama thinks he can just laugh off the fact that he refuses to release his records to the American public. He can't. _E_
By folding Penn State leadership made things worse. The deal is ridiculous & punishes the wrong people. I hope the alumni sue to overturn. _E_
What does.Obama know about the VA or business nothing just look at the five billion dollar ObamaCare website. We need a real leader! _E_
Trump Chicago was featured in Transformers 3. Trump Tower was featured in Dark Knight Rises. Both are summer blockbusters. #MidasTouch _E_
#TrumpVlog Be careful with Iran. __HTTP__ _E_
THe WH should not have hosted the Muslim Brotherhood. @BarackObama's friends are enemies of the US and @Israel. The Islamist winter is here. _E_
Make your NYC getaway memorable @TrumpNewYork provides both true luxury and top access to Midtown West __HTTP__ _E_
Unemployment is up in 44 states showing July's unemployment numbers to be broad based __HTTP__ @BarackObama is a job killer. _E_
Brian @kilmeade wrote a wonderful book called George Washington's Secret Six that is truly worth reading. __HTTP__ _E_
"If you strike out nobody is going to help you not your friends not the government. You have to look to look out for yourself." Think Big _E_
I will be interviewed on @NewDay @CNN at 7:00 A.M. _E_
Illegal immigration is a wrecking ball aimed at US taxpayers. Washington needs to get tough and fight for W... (cont) __HTTP__ _E_
Entrepreneurs: Difficulties mistakes & setbacks are an inevitable part of business & life. Remember to keep your equilibrium intact. _E_
How can Ted Cruz be an Evangelical Christian when he lies so much and is so dishonest? _E_
Fact – while Jeb was governor & Rubio was House Majority Leader Florida's debt more than doubled. Conservatives? _E_
"I love it when people doubt me. It makes me work harder to prove them wrong." – Derek Jeter _E_
NEVADA! Tomorrow is the deadline to register Republican.Visit: __HTTP__ from @IvankaTrump: __HTTP__ _E_
John Sununu was more right than he even knew yesterday @BarackObama indeed needs to learn how to be an American. _E_
Must read @WSJ column by Senator Phil Gramm "The Multiple Distortions of Wind Subsidies" __HTTP__ _E_
Played golf today with Prime Minister Abe of Japan and @TheBig_Easy Ernie Els and had a great time. Japan is very well represented! _E_
Such a great experience in New Hampshire amazing people! Will be leaving for a big event in South Carolina today. _E_
AN AMERICA FIRST ENERGY PLAN#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Amazing NH poll released! We are getting ready to Make America Great Again! __HTTP__ _E_
"A general is just as good or just as bad as the troops under his command make him." Gen. Douglas MacArthur _E_
Welcome to the United States @IsraeliPM Benjamin & Sara!#ICYMI Joint Press Conference: __HTTP__ __HTTP__ _E_
Would love to send the NYC terrorist to Guantanamo but statistically that process takes much longer than going through the Federal system... _E_
This is what we can expect from #CrookedHillary. More Taxes. More Spending. #BigLeageTruth #DrainTheSwamp #Debates __HTTP__ _E_
More $ thrown away @BarackObama gave $20M to Amonix and praised its success in '10. It just filed for bankruptcy __HTTP__ _E_
19 firefighters killed in Arizona terrible tragedy! _E_
Trump Links will be a great championship golf course that will host many major tournaments and bring tremendous $'s & prestige to N.Y.C.! _E_
Printing money is neither a short or long term solution to our country's economic woes. The Fed is destroyin... (cont) __HTTP__ _E_
Republicans and Democrats must come together now to make America great again! _E_
Thank you! #MAGA #AmericaFirst __HTTP__ _E_
I will rebuild the military take care of vets and make the world respect the US again! Join me today. Info: __HTTP__ _E_
"Listen to others but never negate your own instincts." – Trump Never Give Up _E_
Do you believe this Iran wants to trade our 3 prisoners (not 4) for 19 prisoners held by the U.S. Should have been let go with last deal! _E_
Low energy candidate @JebBush has wasted $80 million on his failed presidential campaign. Millions spent on me. He should go home and relax! _E_
#NEPrimary #VoteTrump #Trump2016 __HTTP__ __HTTP__ _E_
.@SecShulkin's decision is one of the biggest wins for our VETERANS in decades. Our HEROES deserve the best! ... __HTTP__ _E_
Join me in Oklahoma tomorrow night!#MakeYoutubeGreatAgain #Trump2016 __HTTP__ _E_
Go to Trump National Doral Miami and watch Tiger Phil Ernie Rory and all of the other great players compete in The WGC Cadillac Champ! _E_
If Hillary thinks she can unleash her husband with his terrible record of women abuse while playing the women's card on me she's wrong! _E_
RT @IsraelUSAforevr: @realDonaldTrump __HTTP__ _E_
The Herschel Walker interview on The Tim McCarver Show was fantastic much can be learned from watching. Congrats to Herschel and Tim! _E_
I have been drawing very big and enthusiastic crowds but the media refuses to show or discuss them. Something very big is happening! _E_
Entrepreneurs: See yourself as an organization. Pay attention to every facet of your life. What's strong? What's weak? What's missing? _E_
Sadly because president Obama has done such a poor job as president you won't see another black president for generations! _E_
Entrepreneurs: See yourself as victorious. Look at the solution not the problem. Keep your focus positive. _E_
An investment in knowledge pays the best interest. Benjamin Franklin _E_
Meeting w/ Washington D.C. @MayorBowser and Metro GM Paul Wiedefeld about incoming winter storm preparations here... __HTTP__ _E_
Veterans please call 855 VETS 352 or email address veterans@donaldtrump.com to share your stories about the need to reform the VA. _E_
#CelebApprentice What do you think of the new teams/PMs? _E_
Just got back from Las Vegas. @TrumpLasVegas Hotel was fantastic in every way but the fight was a total waste of time. The aggressor lost? _E_
WWII vs. Now! During the 3 1/2 years of World War II that started with the Japanese bombing of Pearl Harbor (cont) __HTTP__ _E_
Thank you Denver Colorado! #MakeAmericaGreatAgain! __HTTP__ _E_
... The Republicans just didn't resonate with the people—but they will have better days. _E_
The White House is continuing to be openly uncooperative with the Fast and Furious investigation. American lives were lost. We need answers. _E_
ObamaCare premiums could jump as high as 51% __HTTP__ Terrible for economy. Repeal & Replace with free market solution! _E_
Just arrived at @trumpdoral for the @cadillacchamp starting tomorrow __HTTP__ _E_
Iron Mike Tyson was not asked to speak at the Convention though I'm sure he would do a good job if he was. The media makes everything up! _E_
Keep an open mind business is a creative endeavor. Strive for innovative ideas. _E_
RT @CLewandowski_: Please watch @foxandfriends today at 7:30 AM to watch me discuss @realDonaldTrump. _E_
Thank you. __HTTP__ _E_
"The more predictable the business the more valuable it is. Predictability also means consistency of brand experience." Midas Touch _E_
The failing @nytimes hates the fact that I have developed a great relationship with World leaders like Xi Jinping President of China..... _E_
Crooked Hillary Clinton has zero natural talent she should not be president. Her temperament is bad and her decision making ability zilch! _E_
Pres. Obama was touting Yemen as a great success story it just fell. Obama doesn't know what he is doing. Saudi Arabia is in big trouble. _E_
Eliot Spitzer has failed at everything he has ever done and now he wants to be comptroller. Thrown out of politics and off of TV CRAZY! _E_
It is so important to audit The Federal Reserve and yet Ted Cruz missed the vote on the bill that would allow this to be done. _E_
.@Gracematters Thank you a very wise bet! Best wishes. _E_
Via @trdmiami: "@TrumpDoral project will boast 800 hotel rooms" __HTTP__ $250M renovation on 800 acres in sunny Miami. _E_
Michelle Obama made a terrible mistake in Iowa. When endorsing Bruce Braley before a large crowd she called him Bruce Bailey seven times. _E_
.@CarlyFiorina Carly not just you I also told Gov. Kasich to "let Jeb talk give him a chance" because Kasich was constantly cutting in. _E_
It was great seeing @Schwarzenegger at the #WWEHOF. __HTTP__ _E_
How come the @TODAYshow & @chucktodd show the new @NBCNews Poll for Hillary vs Bernie but do not show the SAME poll where I am killing Cruz? _E_
JFK Files are being carefully released. In the end there will be great transparency. It is my hope to get just about everything to public! _E_
Looks like @bwilliams is having some problems with his Rock Center with Brian Williams show I hate to see such bad ratings for @NBC. _E_
.@CNN is #FakeNews. Just reported COS (John Kelly) was opposed to my stance on NFL players disrespecting FLAG ANTHEM COUNTRY. Total lie! _E_
China's Financial Institutions are expanding overseas. __HTTP__ They will own everything if we don't stop them now. _E_
I will be going to Trump National Doral in Miami early today to check on the construction of the hotel and the new Blue Monster. AMAZING! _E_
I can't believe that President Obama isn't able or willing to make just one phone call to the family of Kate Steinle.Come on Pres MAKE CALL! _E_
On Tuesday I visited with the incredible men & women of @ICEgov & @DHSgov Border Patrol in Yuma AZ. Thank you. We respect & cherish you! __HTTP__ _E_
Stock Market at all time high unemployment at lowest level in years (wages will start going up) and our base has never been stronger! _E_
My prayers are with the victims and hostages in the horrible Paris attacks. May God be with you all. _E_
I cannot imagine that Congress would dare to leave Washington without a beautiful new HealthCare bill fully approved and ready to go! _E_
For the haters out of hundreds of deals or transactions I have used the bankruptcy laws 4 times in order to cut better deals. _E_
Watch video of Ivanka Trump sharing business advice with 4 entrepreneurial women on GMA: __HTTP__ _E_
The people of Scotland love the golf course I have built it is now considered perhaps the greatest ever built! Thank you also to Robb Report _E_
So nice to get an endorsement from the founder and owner of Pizza Ranch in Iowa! A great guy and great places! #CaucusForTrump _E_
Have we ever had a POTUS before @BarackObama who earned over 1/3 of his income from foreign sources and paid taxes to another country? _E_
Who's the outsourcer? @BarackObama's campaign is using a travel company with outsourced jobs in China and India. __HTTP__ _E_
this election. That is a direct threat to our democracy. She then said We have to accept the results and look to the future Donald _E_
Never let them see you sweat! __HTTP__ _E_
Pakistani intelligence had full knowledge that Bin Laden was living in Abbottabad. They were sheltering him. _E_
The @NBCNews story has just been totally refuted by Sec. Tillerson and @VP Pence. It is #FakeNews. They should issue an apology to AMERICA! _E_
In case you missed it last week's @extratv interview with @AJCalloway discussing Tiger Woods & much more __HTTP__ _E_
Obama can open the Mall for illegals to protest our country yet he continues to barricade WWII memorial. That's an absolute disgrace. _E_
Prior to the election it was well known that I have interests in properties all over the world.Only the crooked media makes this a big deal! _E_
Right now we are running a massive $300 billion trade deficit with China. That means every year. China is (cont) __HTTP__ _E_
Re: immigration. Do the Republicans not realize that Dems will get 100% of 11 million votes no matter what they do? _E_
After the way I beat Gov. Scott Walker (and Jeb Rand Marco and all others) in the Presidential Primaries no way he would ever endorse me! _E_
WELCOME HOME AYA!#GodBlessTheUSA __HTTP__ _E_
RT @foxandfriends: Wall Street hits record highs after Trump pulls out of Climate pact __HTTP__ _E_
Obama should stop talking about wind turbines they are a disaster for a country or community & are very expensive & unreliable. _E_
I always felt I would be running and winning against Bernie Sanders not Crooked H without cheating I was right. _E_
I will be making the announcement of my Vice Presidential pick on Friday at 11am in Manhattan. Details to follow. _E_
#TBT Here I am with @gwenstefani and @donaldjtrumpjr __HTTP__ _E_
Subject to the receipt of further information I will be allowing as President the long blocked and classified JFK FILES to be opened. _E_
Working in Bedminster N.J. as long planned construction is being done at the White House. This is not a vacation meetings and calls! _E_
.@kimguilfoyle just watched you on @OutnumberedFNC thank you! _E_
This week we saw what Obama Care actually does when implemented. It is a losing issue for @BarackObama and must be repealed. _E_
I'll be on @foxandfriends this morning at 7:00. So much to talk about! _E_
High above the city @TrumpLasVegas' pool deck mixes business & pleasure over a soaring bar of sky bound gold __HTTP__ _E_
Obama called Reverend Wright his friend counselor & great leader then dumped him like a dog! _E_
even those registered to vote who are dead (and many for a long time). Depending on results we will strengthen up voting procedures! _E_
Why did @BarackObama let Iran keep our drone? Now it is going straight to the Chinese. He should have taken it out. _E_
Major rescue operations underway! _E_
Can't believe I finally got a good story in the @washingtonpost. It discusses the enthusiasm of Trump voters through campaign.... _E_
Truly honored to receive the first ever presidential endorsement from the Bay of Pigs Veterans Association. #MAGA... __HTTP__ _E_
Treasury has refused to name China a currency manipulator even though the yuan "remains significantly undervalued" __HTTP__ _E_
Why would I call China a currency manipulator when they are working with us on the North Korean problem? We will see what happens! _E_
Trump Virginia Office Announces Statewide TV Ad Strategy and Leadership Team: __HTTP__ __HTTP__ _E_
He @RickSantorum is now losing in the latest @ppppolls to @MittRomney in Pennsylvania __HTTP__ Rick is wasting everyone's time. _E_
Van Jones: 'There Is A Crack in the Blue Wall' — It Has to Do With Trade: __HTTP__ _E_
Via @BreitbartNews @biggovt: DONALD TRUMP TO SPEAK AT CPAC __HTTP__ by @michaelpleahy _E_
The United Nations Security Council just voted 15 0 in favor of additional Sanctions on North Korea. The World wants Peace not Death! _E_
Wife Huma wants @RepWeiner to pull a @billclinton by giving a tell all interview. Unlike Clinton Anthony is a sick puppy. _E_
Under his administration oil and gas production on public land is down over 10% __HTTP__ Obama did not tell truth last night. _E_
Via @nypost by @GeoffEarle: "Polls show 'President Trump' may not be so far fetched" __HTTP__ _E_
Obama's own top donor is now laying employees off and lowering hours in anticipation of Obama Care __HTTP__ The new reality. _E_
Entrepreneurs: You have to have passion. If you love your work success will follow. _E_
Big things going on today at Trump National Westchester! _E_
Together we will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_
With 3.5 million Americans receiving bonuses or other benefits from their employers as a result of TAX CUTS 2018 is off to great start!✅Unemployment rate at 4.1%.✅Average earnings up 2.9% in the last year.✅200000 new American jobs.✅#MAGA __HTTP__ _E_
The polls are really looking good—#1 everywhere despite all lobbyist & special interest $ being spent against me. I'm turning down millions. _E_
The White House Correspondents' dinner was so boring this year I guess that's because I didn't attend(even... __HTTP__ _E_
"Never think of learning as a burden. It may require some discipline but it prepares you for a new beginning."– Think Like a Champion _E_
So General Flynn lies to the FBI and his life is destroyed while Crooked Hillary Clinton on that now famous FBI holiday "interrogation" with no swearing in and no recording lies many times...and nothing happens to her? Rigged system or just a double standard? _E_
I want to thank Elizabeth Steve Brian and all of the great folks of @foxandfriends for the long and successful run we had together. NICE! _E_
It's Wednesday how much money is China stealing from us today? _E_
ObamaCare has cut workers' pay by over $22B & eliminated 350000+ small business jobs __HTTP__ Repeal before it's too late! _E_
Great new poll thank you!#MakeAmericaGreatAgain __HTTP__ _E_
I try to learn from the past but I plan for the future by focusing exclusively on the present. That's where the fun is. ~Donald Trump _E_
The Fed should not bail out the EU. Europe's financial mess is their problem not our problem! _E_
Democrats are laughingly saying that McCain had a moment of courage. Tell that to the people of Arizona who were deceived. 116% increase! _E_
I knew disgusting and unwanted porn star @REPWEINER was a sleazebag the first time I met him. Thank goodness he was revealed (so to speak). _E_
If @amazon ever had to pay fair taxes its stock would crash and it would crumble like a paper bag. The @washingtonpost scam is saving it! _E_
China is raising its defense budget by 11% __HTTP__ @BarackObama wants to cut ours by over $1Trillion. Wrong policy. _E_
"Big jobs usually go to the men who prove their ability to outgrow small ones." Theodore Roosevelt _E_
Was in Iowa yesterday great people. Record crowds at both speeches. Something big is happening. Pols are all talk. Make America great again! _E_
Wow! Does Eliot Spitzer have a girlfriend? This is getting exciting. _E_
First Minister of Scotland released bomber of Pan Am flight #103 on compassionate grounds. Do you believe? _E_
Thank you Kevin. With unification of the party Republican wins will be massive! __HTTP__ _E_
YOU NEED BOTH A PUBLIC AND A PRIVATE POSITION @HillaryClinton #Debates2016 __HTTP__ _E_
.@MRbelzer is a stone cold loser with no talent why did they ever put him on Law and Order? _E_
Weekly jobless claims are up once again. The economy cannot recover with Obama in office. _E_
Thank you Springfield Ohio. Get out and #VoteTrumpPence16!#ICYMI watch here: __HTTP__ __HTTP__ _E_
I have long stated that Brian Williams was not a very smart guy all you have to do is look at his past. Now he has proven me correct! _E_
I will be on ON THE RECORD @gretawire tonight at 10 pm _E_
Donald Trump's back with 14 'Apprentice' All Stars __HTTP__ via @AP _E_
Thank you for your support! Together we can #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_
#MakeAmericaWorkAgain #TrumpPence16 #RNCinCLE __HTTP__ __HTTP__ _E_
In Charlottesville VA @trumpwinery is Virginia's largest winery with 200 acres of French vinifera varieties __HTTP__ _E_
country and with the massive cost reductions I have negotiated on military purchases and more I believe the people are seeing big stuff. _E_
Derek Jeter's baseball and more in today's #trumpvlog... __HTTP__ _E_
I believe @BarackObama is manipulating the jobless numbers __HTTP__ _E_
While Hillary said horrible things about my supporters and while many of her supporters will never vote for me I still respect them all! _E_
ObamaCare does indeed ration care. Seniors are now restricted to comfort care instead of brain surgery. Repeal now! __HTTP__ _E_
I am doing On the Record With Greta Van Susteren at 10 P.M. on Fox. We will be talking about the bad economy and other subjects of interest! _E_
How ironic that @BarackObama's campaign would call me a charlatan. Have they looked at their boss's record? _E_
And the FAKE NEWS winners are... __HTTP__ _E_
Lightweight Senator Marco Rubio is VERY weak on immigration knows nothing about finance and would be incapable of making great trade deals! _E_
I have determined that it is time to officially recognize Jerusalem as the capital of Israel. I am also directing the State Department to begin preparation to move the American Embassy from Tel Aviv to Jerusalem... __HTTP__ _E_
.@NRO @JonahNRO Wow just looked at the stats for National Review. Dying fast doing very little business. Save this conservative voice! _E_
Robin Williams was a truly wonderful actor & comedian. One of the few people who could make me laugh. Very tragic. _E_
Great going to all of Dubai in winning what will be a fantastic #Expo2020 we will all be there! _E_
.@redcross CEO's salary in 2011 was $951957. Where is the outrage? _E_
Where is the outrage for this Disney book? Is this the 'Star of David' also? Dishonest media! #Frozen __HTTP__ _E_
.@JonahNRO You should be totally focused on trying to save the badly failing National Review instead of focusing on me. Work hard! @NRO _E_
Via @WBJonline by @WBJHolan: "Donald Trump hints at presidential run promises 'great luxury hotel' for D.C." __HTTP__ _E_
The Chinese are smart. They bought up over $7B in US housing last year __HTTP__ U.S. is busy making China even richer. _E_
A general is just as good or just as bad as the troops under his command make him. Douglas MacArthur _E_
Have you ever seen our country look weaker or more pathetic: Snowden ObamaCare VA Russia jobs decimated military debt and so much more _E_
The now $1.2B ObamaCare website is as bad as ever insurers not getting the proper data. __HTTP__ _E_
Have a happy successful and healthy New Year! _E_
Congratulations to the 2016 @ClemsonFB Tigers!Full ceremony: __HTTP__ __HTTP__ _E_
Congratulations to @AmericansElect for winning a spot on the California 2012 ballot. A major feat! __HTTP__ _E_
"See yourself as victorious! That will focus you in the right direction." – Trump Never Give Up _E_
Will be in Alabama tonight. Luther Strange has gained mightily since my endorsement but will be very close. He loves Alabama and so do I! _E_
Michelle Nunn will be a solid vote for Obama. She supports ObamaCare & opposes 2nd Amendment. Vote for @Perduesenate to change things! _E_
President Obama is losing on so many fronts in fact all fronts that I am concerned he will do something totally irrational. He can't lead! _E_
We just have to get tough get smart and get a president willing to stand up for America and stick it to the (cont) __HTTP__ _E_
Michelle Nunn will be a rubber stamp for Barack Obama. @Perduesenate. GOTV for David this Tuesday! _E_
I will be in Maryland this afternoon for a major rally. Things are looking good for Tuesday! _E_
I will be interviewed on @greta at 7:00 P.M. @FoxNews _E_
Re build the United.States not places that hate our country and everything we stand for! _E_
The Obama's Spain vacation cost taxpayers over $476K __HTTP__ They love to spend money. _E_
I watched the last two minutes of the @dallasmavs game last night I just loved watching them lose. _E_
Not only does ObamaCare have at least 21 new taxes but it will lead to a tremendous doctor shortfall. _E_
I'm a star maker Adrian has continued to receive many fans in @TrumpTowerNY and @AmandaTMiller is definitely on the map! #CelebApprentice _E_
China's new AND ADVANCED currency manipulation is killing the U.S. Help! _E_
Whitney Houston was a great friend and an amazing talent. We will all miss her and send our prayers to her family. _E_
We have millions in our country unemployed yet we are wasting millions arming Syrian 'rebels.' What is wrong with Washington?! _E_
Good luck to my new friends on your testimony in DC. You are amazing people doing something so important stopping illegal immigration! _E_
Via @worldnetdaily: JAILED U.S. PASTOR'S WIFE PRAISES TRUMP: 'I hope more people like him will speak out' __HTTP__ _E_
Crooked Hillary says we must call on Saudi Arabia and other countries to stop funding hate. I am calling on cont'd: __HTTP__ _E_
I've always been a fan of Steve Jobs especially after watching Apple stock collapse w/out him – but the yacht he built is truly ugly. _E_
Have to go now to sign a great and job producing deal! Good night. _E_
Hope to see you tomorrow in Trump Tower (5th Ave betw 56 and 57) I'll be signing copies of my book #TimeToGetTough from noon until 2 pm _E_
Nobody will protect our Nation like Donald J. Trump. Our military will be greatly strengthened and our borders will be strong. Illegals out! _E_
Old Post Office Building in DC will be a world class Trump property. Honored to be doing this historic building Washington will be proud. _E_
Move slowly carefully and then strike like the fastest animal on the planet! _E_
Thank you to our amazing law enforcement officers! #MAGA __HTTP__ _E_
American steel & American hands have constructed a 100000 ton message to the world: American MIGHT IS SECOND TO NONE!#USSGeraldRFord #USA __HTTP__ _E_
Hurricane is good luck for Obama again he will buy the election by handing out billions of dollars. _E_
We will never cut spending until we actually work off of a budget. The Democrats haven't passed one in over 3 years. What a joke. _E_
The Better Business Bureau report with an A rating for Trump University. #GOPDebate __HTTP__ __HTTP__ _E_
RT @DRUDGE_REPORT: RICE ORDERED SPY DOCS ON TRUMP? __HTTP__ _E_
.@RNC leadership should not be afraid of a government shutdown. They should be afraid of not defunding ObamaCare. _E_
Trump's Campaign Hat Becomes an Ironic Summer Accessory The New York Times. __HTTP__ _E_
Even though I have a very biased and unfair judge in the Trump U civil case in San Diego I have thousands of great reviews & will win case! _E_
Will be speaking with Italy this morning! _E_
Tremendous investment by companies from all over the world being made in America. There has never been anything like it. Now Disney J.P. Morgan Chase and many others. Massive Regulation Reduction and Tax Cuts are making us a powerhouse again. Long way to go! Jobs Jobs Jobs! _E_
There's only one candidate who cut medicare and that's Barack Obama. Cut over $700M to move into ObamaCare. _E_
My @gretawire interview where I discuss fixing the economy killing Bin Laden the John Edwards trial and fair trade. __HTTP__ _E_
Isn't the WORLD tired of hearing President Obama say he knew nothing about anything time to take responsibility for all of your mistakes! _E_
I am happy to see the majority of the GOP candidates agree with me that the tax code must be simplified and the rates dropped. _E_
Will be doing Fox and Friends in two minutes! _E_
The only reason I am critical of the Pinehurst look is because I'm a lover of golf—and that look on TV hurts golf badly. _E_
Vegas' top destination @TrumpLasVegas is a 64 story tower of golden glass __HTTP__ What goes on there stays there! _E_
.@ThisWeekABC with @GStephanopoulos had fantastic numbers last Sunday Trump interview. Nice! _E_
#TBT Taking piano lessons from my friend Elton John. __HTTP__ _E_
Just got a great new selection of ties & shirts @Macys. Go buy them now for Father's Day—they're beautiful! _E_
Going to Charlotte NC to speak before more than 20000 people on Saturday morning—total sellout crowd—will be great! _E_
I hear @billmaher really bombed in Springfield people were leaving show way early stupid guy! _E_
Great POLL numbers are coming out all over. People don't want another four years of Obama and Crooked Hillary would be even worse. #MAGA _E_
I'll be on @foxandfriends on Monday at 7:30 AM. _E_
"Invincibility lies in the defence the possibility of victory in the attack." Sun Tzu _E_
They found Jessica in Colorado body was mutilated death to the pervert killer. _E_
Looking forward to being the special guest at tonight's Dutchess County #GOP dinner to a SOLD OUT crowd. It will be great fun. _E_
Just landed in the Philippines after a great day of meetings and events in Hanoi Vietnam! __HTTP__ _E_
Respected Morning Consult poll just out. I lead all Republicans and beat Hillary head to head by a wide margin 45 to 40! _E_
'Presidential Executive Order on Promoting Agriculture and Rural Prosperity in America'Executive Order:... __HTTP__ _E_
I received calls from the President of Mexico and the Prime Minister of Canada asking to renegotiate NAFTA rather than terminate. I agreed.. _E_
'Scandals surround Clinton's gatekeeper at State'#DrainTheSwamp __HTTP__ _E_
Yesterday Obama compared Nelson Mandela to George Washington in Africa. Do you think he really believes it? _E_
Had a meeting with the terrific @GovPenceIN of Indiana. So excited to campaign in his wonderful state! __HTTP__ _E_
I will be in Cincinnati Ohio tomorrow night at 7:30pm join me! #OhioVotesEarly #VoteTrumpPence16 Tickets:... __HTTP__ _E_
I will be on @oreillyfactor at 8:00 P.M. Enjoy! _E_
My comments on a larger screen iPhone were in addition to existing unit not a replacement. Screen should be 10% larger than Samsung. _E_
Why is the UN condemning @Israel and doing nothing about Syria? What a disgrace. _E_
The podium in the Oval Office looks odd! Not good but the words will be the key. _E_
Majority of Independents want Obamacare overturned __HTTP__ The best way to do it is by voting out @BarackObama _E_
He made a great contribution to the press @AndrewBreitbart will be missed. _E_
Our President should stop trying to be an economist to the world and start fighting for our economy. Instead (cont) __HTTP__ _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
I am honored that the great men and women of the @Teamsters have created a movement from within called Teamsters for Trump! Thank you. _E_
I knew Chris Matthews when he was sane and quite honestly wonderful. Now he's gone off the deep end as an Obama surrogate. @hardball_chris _E_
Looking forward to visiting the Trump Vineyard Estates today in Charlottesville VA for a press conference and the grand opening. _E_
Exclusively @Macys The Donald J. Trump Signature Collection features the best ties & shirts at the best prices. __HTTP__ _E_
... Is a third party coming? I hope not. _E_
My conversation from ON THE RECORD @Gretawire __HTTP__ _E_
.@GovernorPerry stopped by to say hello. __HTTP__ _E_
If @BarackObama had to use the same labor participation he had when he entered office then the unemployment number would be 11.2% _E_
"Competitive golf is played mainly on a five and a half inch course... the space between your ears." Bobby Jones _E_
Former Obama White House economic adviser @Austan_Goolsbee gave his old boss a 'C' on the economy __HTTP__ Pretty generous! _E_
The @MissUSA 2012 contestants pose for a picture with me at Trump Tower in New York City __HTTP__ _E_
It's Friday. How much money has been wasted on defunct ObamaCare website today? _E_
.@KevinHart4real joined @woodmank104 @katek104 @K1047 & was asked about his thoughts on @realDonaldTrump #Trump2016 Thanks Kevin so nice! _E_
There is no question who will handle the threat of terrorism best as #POTUS. #Trump2016 __HTTP__ __HTTP__ _E_
President Obama has a major meeting on the N.Y.C. Ebola outbreak with people flying in from all over the country but decided to play golf! _E_
Fake News is at an all time high. Where is their apology to me for all of the incorrect stories??? _E_
I will be on @SeanHannity tonight at 10pmE talking about my new book #CrippledAmerica and much more! #MakeAmericaGreatAgain #Trump2016 _E_
Further proof that Gang of Eight member Marco Rubio is weak on illegal immigration is Paul Singer's Mr. Amnesty endorsement.Rubs can't win _E_
.@DanaPerino & @BradThorThank you so much for the wonderful compliment. Working hard! #MAGA __HTTP__ _E_
For what is the best choice for each individual is the highest it is possible for him to achieve. Aristotle _E_
Now the world is looking to China for an economic 'lift' __HTTP__ @BarackObama has ruined our economic hegemony. _E_
Economics behind ugly bird killing wind turbines do not work will destroy Scotland's beautiful coastline. (cont) __HTTP__ _E_
.@aaronschock Aaron it was great to meet you at Trump Tower. Also really good job on television! _E_
The Greater Miami area and numerous others are fighting hard to get the Miss Universe Pageant. A decision will be made very soon! _E_
RT @MichaelCohen212: I have never been to Prague in my life. #fakenews __HTTP__ _E_
Should not pass bad deal! __HTTP__ _E_
The only thing that can stop this corrupt machine is YOU. The only force strong enough to save our country is US.... __HTTP__ _E_
People forget it was Club for Growth that asked me for $1 million. I said no & they went negative. Extortion! __HTTP__ _E_
I will be on the @colbertlateshow tonight at 11:30 __HTTP__ _E_
Negotiation is persuasion more than power. Be reasonable and flexible and never let anyone know exactly where you're coming from. _E_
Thank you @TheFix Chris Cillizza. It is a true person of character that can change his opinion & do what is right. __HTTP__ _E_
I think somebody should pick Johnny Football he will be a star. _E_
As long as we have faith in each other and confidence in our values then there is no challenge too great for us to conquer! #ALConv2017 __HTTP__ _E_
Looking forward to receiving the T. Boone Pickens Entrepreneur Award at tomorrow's @AmSpec Robert L. Bartley Gala dinner. _E_
Beautiful weather all over our great country a perfect day for all Women to March. Get out there now to celebrate the historic milestones and unprecedented economic success and wealth creation that has taken place over the last 12 months. Lowest female unemployment in 18 years! _E_
.@AlexSalmond Ireland just ended the bird killing wind farm near my great resort on the Atlantic Ocean. The reason would hurt tourism! _E_
...they do NOTHING for us with North Korea just talk. We will no longer allow this to continue. China could easily solve this problem! _E_
Photo from @IvankaTrump of Trump International Golf Links & Hotel Ireland __HTTP__ _E_
Let's get out of Afghanistan. Our troops are being killed by the Afghanis we train and we waste billions there. Nonsense! Rebuild the USA. _E_
The best vision is insight. Malcolm Forbes _E_
Reporter @AlHunt is one boring and low vision guy! _E_
We all have the capability to read or sense what's happening with others. It can often give you the edge (cont) __HTTP__ _E_
.@GiulianaRancic & @nickjonas are co hosting Miss USA 2013 Sunday night at 9 PM ET on NBC. @JonasBrothers will be performing. Tune in! _E_
.@THEGaryBusey is definitely different. #CelebApprentice _E_
Democrats are smiling in D.C. that the Freedom Caucus with the help of Club For Growth and Heritage have saved Planned Parenthood & Ocare! _E_
GOP Voters Trust Donald Trump to Keep Our Country Safe __HTTP__ _E_
I am the only potential owner of the @buffalobills who will keep the team in Buffalo where it belongs! _E_
Congratulations to @Yankees Derek Jeter on passing Eddie Murray last night to become the 11th all time @MLB hit leader. _E_
Does anyone really believe that President Obama found out about Petraeus immediately after the election? _E_
... among ABC CBS and NBC in the key news demo of adults.... _E_
I'm with YOU! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
Sleazy Adam Schiff the totally biased Congressman looking into Russia spends all of his time on television pushing the Dem loss excuse! _E_
Political strategist Stuart Stevenswho led Romney down the tubes in what should have been an easy victoryhas terrible political instincts! _E_
My interview yesterday with @foxandfriends discussing the failure of the Super Committee and GOP 2012.... __HTTP__ _E_
Biggest Tax Bill and Tax Cuts in history just passed in the Senate. Now these great Republicans will be going for final passage. Thank you to House and Senate Republicans for your hard work and commitment! _E_
In times of tragedy the bonds that sustain us are those of family faith community and country. These bonds are stronger than the forces of hatred and evil and these bonds grow even stronger in the hours of our greatest need. __HTTP__ __HTTP__ _E_
NBC Wall St Journal Poll of African American voters: 94% @BarackObama 0% @MittRomney.Even worse than Hillary's old numbers. Is that racism? _E_
I am impressed with how clearly @PaulRyanVP explains the challenges we face and the solutions @MittRomney will bring as President. _E_
Have a great game today @USArmy and @USNavy I will be watching. We love our U.S. Military. On behalf of an entire Nation THANK YOU for your sacrifice and service! #ArmyNavyGame #USA __HTTP__ _E_
"Golfer bids $130000 for round with Donald Trump" in Scotland for charity __HTTP__ via Evening Express _E_
Via @Reuters: Donald Trump takes steps toward 2016 presidential run __HTTP__ _E_
Entrepreneurs: Keep your eyes on your ideals as well as reality. Accentuate the positive without being blind to the negative. _E_
Chinese demand is raising the price of oil to$123/Barrel __HTTP__ We need to use our own energy resources. _E_
2011 #CelebrityApprentice winner @JohnRich and @MarleeMatlin interviewed the final four in this week's episode __HTTP__ _E_
.@HuffingtonPost is doing very badly. Also very inaccurate stories. Like AOL when will they fail? _E_
Why is @BarackObama delaying the sale of F 16 aircraft to Taiwan? Wrong message to send to China. #TimeToGetTough _E_
I am offering the chance for Barack Obama to redistribute $5M to any charity of his choice. Everyone wins. Take the deal. _E_
China is pushing North Korea! _E_
#WVPrimary #VoteTrump #Trump2016 __HTTP__ __HTTP__ _E_
ObamaCare is a major threat to America's entrepreneurial spirit and competitiveness. Small businesses will b... (cont) __HTTP__ _E_
An Iranian nuclear scientist's car exploded in Tehran yesterday lots of problems to come @BarackObama we need real leadership. _E_
If our border is not secure we can expect another attack. A country with open borders is open to the terrorists. _E_
Do not allow our very stupid leaders to sign a deal that keeps us in Afghanistan through 2024 with all costs by U.S.A. MAKE AMERICA GREAT! _E_
Yesterday was Veterans Day. I hope our armed service members felt appropriately honored. This nation loves and respects all of you. _E_
Just had a very nice meeting with @Reince Priebus and the @GOP. Looking forward to bringing the Party together and it will happen! _E_
Crooked's top aides were MIRED in massive conflicts of interests at the State Dept. We MUST #DrainTheSwamp __HTTP__ #Debate _E_
Someone unknown tweeted incorrectly that I'm for Sen. Mitch @McConnellPress for speaker. I'm supporting him for Senate Majority Leader _E_
Time Magazine called to say that I was PROBABLY going to be named "Man (Person) of the Year" like last year but I would have to agree to an interview and a major photo shoot. I said probably is no good and took a pass. Thanks anyway! _E_
Donald Trump plans return to Iowa __HTTP__ via @KCCINews _E_
"Iowa hirings suggest Donald Trump serious about 2016 White House bid" __HTTP__ via @WashTimes by @SethMcLaughlin1 _E_
...It's called intellectual property rights something they know nothing about. _E_
The ObamaCare website still is not complete. $5 billion and no progress. Scary and sad! _E_
David Brooks of the New York Times is closing in on being the dumbest of them all. He doesn't have a clue. _E_
These Tsarnaev brothers did not work alone. They had help and assistance from other cell members. Be vigilant and on the lookout. _E_
The tragedy in Newtown really makes you understand how life is so fragile. Must appreciate every minute! _E_
Trump vows to fight 'epidemic' of human trafficking __HTTP__ _E_
It all begins today! I will see you at 11:00 A.M. for the swearing in. THE MOVEMENT CONTINUES THE WORK BEGINS! _E_
Honor to have been interviewed by the very wonderful @bishopwtjackson in Detroit last week tune in at 9pmE. Enjoy! __HTTP__ _E_
The New Black Panthers are back at the same Philly polling station from '08 __HTTP__ Don't let them intimidate you! _E_
Time to start building in our country with American workers & with American iron aluminum & steel. It is time to... __HTTP__ _E_
"To keep your momentum going you must have intrinsic values as well as monetary values. Know when to give back." – Think Big _E_
I will be going to the funeral of my friend Joan Rivers today. I got to know her really well when she became the winner of The Apprentice! _E_
Wouldn't it be nice if our government could build a wall on the border under budget and ahead of schedule?! my @SRQRepublicans speech. _E_
Excited that @OurCountryPAC's @Amy Kremer has endorsed the Newsmax @iontv debate. The Tea Party Express is a great group. _E_
Why is the @GOP congress focusing on amnesty when so many Americans are unemployed? _E_
Thank you Andrew Jackson! #POTUS7 #USA __HTTP__ _E_
Hillary Clinton made a speech today using the biggest teleprompter I have ever seen. In fact it wasn't even see through glass it was black _E_
RT @Reince45: Promise kept. @POTUS exits flawed #ParisAccord to seek better deal for U.S. workers & economy. This WH will always put #Ameri... _E_
Many people walked out on Madonna's concert when she told them to vote for Obama. Years ago I walked out because the concert was terrible! _E_
It was an honor to welcome President Al Sisi of Egypt to the @WhiteHouse as we renew the historic partnership betwe... __HTTP__ _E_
Watch to see the new cast of @ApprenticeNBC __HTTP__ _E_
A big day for the U.S. at the United Nations! _E_
Just out: TRUMP GOP DEBATE 18000000. CLINTON DEMOCRAT DEBATE 6700000. And they were on major network vs. cable! _E_
Be weak on immigration and ensure Democratic victory. _E_
Everyone should cancel HBO until they fire low life dummy Bill Maher! Get going now and feel good about yourself! _E_
Rickie Fowler @therealrickiefowler Instagram photos | Websta __HTTP__ via @websta _E_
Even NY Democrats are avoiding @BarackObama's convention __HTTP__ He is dragging his own party down with him _E_
Getting ready to take off for Nashua New Hampshire. Big crowd will be there soon. Fun! _E_
.@cyndilauper Condolences on the passing of your uncle and best wishes. _E_
...really hard to help but many have lost their homes. Military is now on site and I will be there Tuesday. Wish press would treat fairly! _E_
Jeff Sessions is an honest man. He did not say anything wrong. He could have stated his response more accurately but it was clearly not.... _E_
I have many great people but also an amazing number of haters and losers responding to my tweets why do these lowlifes follow nothing to do! _E_
Less than ten days until I keynote @bobvanderplaats' @theFAMiLYLEADER Leadership Summit. Tix going fast. __HTTP__ _E_
Everybody is arguing whether or not it is a BAN. Call it what you want it is about keeping bad people (with bad intentions) out of country! _E_
In the last 2 weeks I had $35M of negative ads against me in Florida & I won in a massive landslide.The establishment should save their $$! _E_
I really liked everyone at the @WWE Hall of Fame ceremony fantastic people! _E_
Amazing various celebrities were far harsher than me with political statements but media doesn't care about... __HTTP__ _E_
RT @realDonaldTrump: I will be interviewed tonight on @FoxNews by @SeanHannity at 9pmE. Enjoy! _E_
Putin is having such a good time. Our President is making him look like the genius of all geniuses. Do not fearwe are a NATION OF POTENTIAL _E_
RT @EricTrump: Friends: If you live in AL AK AR CO GA MA MN OK TN TX VT or VA get out and VOTE on Tuesday! #Trump2016 __HTTP__ _E_
Here I am with Whitney Houston at a party at Mar a Lago. __HTTP__ _E_
Thank you! #GOPDebate MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
Here I am with @IvankaTrump and erictrump presenting the WGC @CadillacChamp Trophy to Tiger Woods at... __HTTP__ _E_
Rated Toronto's #1 hotel the 65 story 5 Star @TrumpTO is located in the heart of the city's finest attractions __HTTP__ _E_
Thank you for having me! I enjoyed the tour and spending time with everyone. See you soon. #MAGA __HTTP__ _E_
Dishonest reporters knowingly write lies that I said "children should not get vaccinated." I believe fully (cont) __HTTP__ _E_
RT @MoskowitzEva: .@BetsyDeVos has the talent commitment and leadership capacity to revitalize our public schools and deliver the promise... _E_
If @RepMarkMeadows @Jim_Jordan and @Raul_Labrador would get on board we would have both great healthcare and massive tax cuts & reform. _E_
Home ownership is at a 19 year low. If you can buy now. You will thank me later. _E_
Via @AmSpec by Jeffrey Lord: "Donald Trump: America's Entrepreneur" __HTTP__ Wow thank you to Jeffrey Lord & @AmSpec! _E_
While @BarackObama tries to push gun control __HTTP__ He still has not answered for Project Gun Runner __HTTP__ _E_
Fight is over Mayweather lost big but lets see what judges say! _E_
If Obama mentions Mitt's tax returns in tomorrow's debate then Mitt should immediately ask for Obama's college records & applications _E_
The White House never looked more beautiful than it did returning last night. Important meetings taking place today. Big tax cuts & reform. _E_
Congratulations to @DanaPerino on your book going to number one on Amazon. Great book Great job! _E_
Can someone explain to me how a Chechnyan permanent resident non citizen in our country is planning Jihad while on welfare? _E_
Ed Gillespie worked hard but did not embrace me or what I stand for. Don't forget Republicans won 4 out of 4 House seats and with the economy doing record numbers we will continue to win even bigger than before! _E_
Watch my live book signing now! __HTTP__ _E_
Just watched lightweight Marco Rubio lying to a small crowd about my past record. He is not as smart as Cruz and may be an even bigger liar _E_
Thank you @BillKristol. I am going to Make America Great Again! _E_
American corporations and entrepreneurs are masters of technological and business innovation but the Chinese (cont) __HTTP__ _E_
Clive Davis gave a great eulogy at my friend Whitney Houston's funeral absolutely amazing! _E_
RT @DonaldJTrumpJr: Thanks New Hampshire!!! #NH #NewHampshire #MAGA __HTTP__ _E_
Which team do you think has the edge in this interactive photo experience task assignment? _E_
Via @EW: "@CelebApprentice All Stars' first trailer" __HTTP__ _E_
Everyone join me tomorrow at 11 AM in Trump Tower atrium. _E_
Great Tax Cut rollout today. The lobbyists are storming Capital Hill but the Republicans will hold strong and do what is right for America! _E_
RT @DRUDGE_REPORT: GREAT AGAIN: FEDS ARREST MURDER SUSPECT IN 'FAST AND FURIOUS' SCANDAL... __HTTP__ _E_
Since the Democrats decided to kill the filibuster they now own it.Republicans should keep the new rule when they're in the majority. _E_
Broken promises. A broken billion dollar website. ObamaCare can't be fixed. Repeal! _E_
Serious stuff IRS Commissioner visited White House 157 times far more than Sec. of State or Defense. What a big story this is! _E_
Taking a photo with my family on the opening day of Trump International Golf Links Scotland __HTTP__ _E_
The U.S. is now begging Russia to give back Edward Snowden. In a letter they promised no death penalty for the traitor. No respect! _E_
So great to have the endorsement and support of Paul Ryan. We will both be working very hard to Make America Great Again! _E_
We should be focused on magnificently clean and healthy air and not distracted by the expensive hoax that is global warming! _E_
According to new WPOST ABC poll Obama has just lost 14 points on public trust with economy _E_
On my way to South Carolina. Big Crowd look forward to it! _E_
Dee Dee Sorvino @deedeegop I am betting on Trump _E_
If someone made a nasty or controversial statement about me to the president do you really think he would come to my rescue? No chance! _E_
Via @ABCPolitics by @rickklein: Trump Blasts Romney Bush Says GOP Has 'Nobody Like Trump' __HTTP__ _E_
I have a judge in the Trump University civil case Gonzalo Curiel (San Diego) who is very unfair. An Obama pick. Totally biased hates Trump _E_
Apple must make the IPhone screen bigger. Losing major market share. _E_
Many think that the Championship Course at Turnberry home of The Duel In The Sun will be the worlds best after the renovation. _E_
To @TigerWoods He is truly a great champion and we were honored to have him at Trump National Doral. @DoralResort #Trump _E_
While the Republicans and Democrats in Congress are working hard to come up with a solution to DACA they should be strongly considering a system of Merit Based Immigration so that we will have the people ready willing and able to help all of those companies moving into the USA! _E_
MUST READ @IBDeditorials: "President Obama's Amnesty At Any Price" __HTTP__ Congress Use the Power of Purse! Defund Amnesty! _E_
My speech at yesterday's @SteveKingIA @Citizens_United Iowa Freedom Summit __HTTP__ via @FoxNews _E_
Just the beginning & it is going to get worse. Rates & deductibles are so high nobody is going to be able to use it. __HTTP__ _E_
Today Obama will give another speech on the economy. Tomorrow our country will still be $17T+ in debt with 18% real unemployment. _E_
Went to the Yankees game last night with Bill O'Reilly we had a great time watching the Yankees win! _E_
Wow Kasich didn't qualify to run in the state of Pennsylvania not enough signatures. Big problem! _E_
On @seanhannity show @FoxNews now. ENJOY! _E_
Join me on Wednesday May 25th at the Anaheim Convention Center!#Trump2016 #MAGA Tickets: __HTTP__ __HTTP__ _E_
The failing @nytimes reporters don't even call us anymore they just write whatever they want to write making up sources along the way! _E_
Thank you @FLGovScott. __HTTP__ _E_
I will be interviewed on @foxandfriends at 8:30 A.M. ENJOY! _E_
Business is an art in itself and powerful negotiation skills are one of the techniques necessary to facilitate success. _E_
Put big game trophy decision on hold until such time as I review all conservation facts. Under study for years. Will update soon with Secretary Zinke. Thank you! _E_
As the phony Russian Witch Hunt continues two groups are laughing at this excuse for a lost election taking hold Democrats and Russians! _E_
.@TrumpNewYork on CPW in NYC is the home of the globe that has become an icon in the city. #CelebApprentice _E_
90 stories over midtown New York Trump World Tower's glass curtain wall is a true landmark __HTTP__ _E_
Realize that an entrepreneur's most important gift to the world is jobs security and well being for others. Midas Touch _E_
The #MissUniverse Pageant is the biggest pageant of them all—by far! _E_
Behind the scenes photo of @Gretawire and I filming an interview __HTTP__ Watch tonight at 10PM ET on @FoxNews. _E_
When the economy is bad @BarackObama wants to raise taxes. When the economy is good @BarackObama wants to raise taxes. Notice a trend? _E_
If these scandals happened before the election Obama could not have won. _E_
RT @billoreilly: Hannity crushing MSNBC at 9. Good for him! Check the No Spin News on __HTTP__ Killing England a huge bests... _E_
A great honor to visit the 9/11 Memorial Museum with my wife @MELANIATRUMP today. #NewYorkValues __HTTP__ _E_
I am surprised that Hugo Chavez can keep power in his weak physical condition! _E_
Leaving Hamburg for Washington D.C. and the WH. Just left China's President Xi where we had an excellent meeting on trade & North Korea. _E_
House of Representatives shouldn't give anything to Obama unless he terminates Obamacare. _E_
The money losing @politico is considered by many in the world of politics to be the dumbest and most slanted of the political sites. Losers! _E_
Michelle Obama likes to be addressed as Your Excellency. __HTTP__ She is an excellent spender of taxpayer money on herself. _E_
All new @ApprenticeNBC starts right now! __HTTP__ _E_
Who should win Celebrity Apprentice on Monday night? Show will be telecast LIVE! _E_
If a player wants the privilege of making millions of dollars in the NFLor other leagues he or she should not be allowed to disrespect.... _E_
It's a shame the ruling class of Republicans don't attack Obama and the Democrats the way they hit Senators Cruz & Lee. _E_
.@DavidGregory got thrown off of TV by NBC fired like a dog! Now he is on @CNN being nasty to me. Not nice! _E_
Make sure to catch @history's season finale of "The Men Who Built America" on Sun November 11th. Great show. _E_
Thank you Dan I agree! Best wishes. __HTTP__ _E_
Remember don't believe sources said by the VERY dishonest media. If they don't name the sources the sources don't exist. _E_
Thank you for the incredible support Melania Barron Ivanka Jared Tiffany Don Vanessa Eric and Lara! __HTTP__ _E_
I don't blame China I blame the incompetence of past Admins for allowing China to take advantage of the U.S. on trade leading up to a point where the U.S. is losing $100's of billions. How can you blame China for taking advantage of people that had no clue? I would've done same! _E_
When @BarackObama is not vacationing he is hosting his top donors in the White House __HTTP__ Always having a good time! _E_
There is no way my friend Bob Kraft agreed not to appeal the NFL decision without making a deal to at least get something. We love Tom Brady _E_
If a person is #1 at Harvard and comes from Europe or Asia they can't get into the U.S. From Mexico etc. with a criminal record no problem _E_
"What you dream about is what you will do. If you cannot even dream of doing big things you will never do anything big in life." Think Big _E_
Our thoughts and prayers are with everyone in the path of California's wildfires. I encourage everyone to heed the advice and orders of local and state officials. THANK YOU to all First Responders for your incredible work! __HTTP__ _E_
Obama has not passed a single budget in 4 years. Democrats don't even vote them in Congress. He has failed to lead! _E_
RT @DonaldJTrumpJr: Great pic from a friend on @CBPflorida @CustomsBorder who have been helping with #harvey recovery and now with #irma. T... _E_
Thank you Governor @Mike_Pence!Lets MAKE AMERICA SAFE AND GREAT AGAIN with the American people. #AmericaFirst... __HTTP__ _E_
Leaving for New Hampshire now. Will be doing the @TODAYshow there live at 7:00 A.M. New @CBSNews Poll of New Hampshire: Trump 38 Carson 12! _E_
Huge crowd expected tomorrow night! VT Police say first come first serve. Arrive early! _E_
I have offered DACA a wonderful deal including a doubling in the number of recipients & a twelve year pathway to citizenship for two reasons: (1) Because the Republicans want to fix a long time terrible problem. (2) To show that Democrats do not want to solve DACA only use it! _E_
The #G20Summit was a wonderful success and carried out beautifully by Chancellor Angela Merkel. Thank you! _E_
RT @DRUDGE_REPORT: Fears of new terror attack after van 'mows down 20 people' on London Bridge... _E_
Together we dream of a Korea that is free a peninsula that is safe and families that are reunited once again! __HTTP__ _E_
A NEW ERA IN AMERICAN ENERGY! #MadeInTheUSAWatch here: __HTTP__ __HTTP__ _E_
Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight! _E_
#ICYMI: Governor @mike_pence and I were in Valley Forge Pennsylvania today. You can watch it here:... __HTTP__ _E_
Via @AP: "Donald Ivanka Trump say DC's Old Post Office Pavilion will be 1 of country's finest hotels" __HTTP__ _E_
Donald Trump bids to buy the Oreo Double Stuf Racing League. Check it out: __HTTP__ _E_
Do not go back into Iraq unless they agree in a signed formal instrument to give the U.S. 50% of their oil reserves.Make the deal dummies! _E_
What is the standard for which you want to be known? Identify that standard and follow it. _E_
Recently opened @TrumpToronto it's beautiful and here is a video of the ribbon cutting ceremony.. __HTTP__ _E_
Via @beforeitsnews: "WATCH: See How Trump Just Torched Obama Biden Kerry For Snubbing Paris Anti Terror March" __HTTP__ _E_
Crooked Hillary is wheeling out one of the least productive senators in the U.S. Senate goofy Elizabeth Warren who lied on heritage. _E_
Every day Mexico continues to hold Sgt. Tahmooressi is an insult to our country. _E_
Of course @hardball_chris attacked 'birthers' in praising @CondoleezzaRice's speech. Chris has completely lost it. _E_
Honor of a lifetime to meet His Holiness Pope Francis. I leave the Vatican more determined than ever to pursue PEAC... __HTTP__ _E_
Read about my @LibertyU speech in @jameshohmann's @politico Morning Score __HTTP__ _E_
"Don't find fault. Find a remedy." – Henry Ford _E_
ObamaCare premiums are going up up up just as I have been predicting for two years. ObamaCare is OWNED by the Democrats and it is a disaster. But do not worry. Even though the Dems want to Obstruct we will Repeal & Replace right after Tax Cuts! _E_
"Trump hails liberation of Raqqa as critical breakthrough in anti ISIS campaign" __HTTP__ _E_
I am pleased to inform you that I have just named General/Secretary John F Kelly as White House Chief of Staff. He is a Great American.... _E_
Big news just out NEW @CNN POLL TRUMP 39 and leads in every major category. Likeability way up. CRUZ 18 CARSON 10 RUBIO 10 _E_
'Must Act Immediately': Clinton Charity Lawyer Told Execs They Were Breaking The Law __HTTP__ _E_
Watch me tonight at 9PM ET on @CNN full hour. @Piersmorgan won @ApprenticeNBC before taking over Larry King's slot should be interesting. _E_
.@BarbaraJWalters Barbara—get better fast & stay healthy forever. _E_
Despite all the statements to the contrary Obama's policies will increase taxes on everyone __HTTP__ Enjoy! _E_
Another great charity that the $5M could go to just a recommendation to the Pres. the Wounded Warriors represented so well by @TraceAdkins _E_
Crooked Hillary Clintons foreign interventions unleashed ISIS in Syria Iraq and Libya. She is reckless and dangerous! _E_
It's hard to believe that we are rationing gas in NYC. OPEC is laughing all the way to the bank. _E_
I will be asking for a major investigation into VOTER FRAUD including those registered to vote in two states those who are illegal and.... _E_
A big part of the country even the southern states is under massive attack from snow and freezing cold. Global warming anyone? _E_
"Image is important and speaks more than the words or fine print that goes along with the product." – Midas Touch _E_
Iran is rapidly taking over more and more of Iraq even after the U.S. has squandered three trillion dollars there. Obvious long ago! _E_
Last night's live show was so much fun. Congrats to the entire cast they are all winners! From beginning over $13 million for charity. _E_
Via @ConroeCourier by @StephenGreen91:"Trump talks 2016 run jobs at @TXPatriotsPAC" __HTTP__ _E_
ICYMI via @PageSix by @Mohris: "Donald Trump honored at Marine Corps charity gala" __HTTP__ _E_
Thank you #Biloxi #Mississippi! Remember this night & spread the word to get out & #VoteTrump2016! __HTTP__ _E_
Democrat Jon Ossoff would be a disaster in Congress. VERY weak on crime and illegal immigration bad for jobs and wants higher taxes. Say NO _E_
Staff at Trump Park Avenue disliked A Rod to put it mildly The staff at Trump World Tower loves Derek Jeter. _E_
Karl Rove is a total loser. Money given to him might as well be thrown down the drain. _E_
By the way if Russia was working so hard on the 2016 Election it all took place during the Obama Admin. Why didn't they stop them? _E_
Thanks. __HTTP__ _E_
Iran's threats are no excuse for the 9 month high price of oil. OPEC is ripping us off while @BarackObama watches. __HTTP__ _E_
Entrepreneurs: Success is good. Success with significance is even better. Work on what you will be proud to be associated with. _E_
My thoughts and prayers are with those affected by the tragic storms and tornadoes in the Southeastern United States. Stay safe! _E_
Entrepreneurs: Having a product requires something very important you have to think about the market. Do your due diligence. _E_
I just saw my new tie & shirt collection—it's fantastic—unbelievable look. Go to Macy's now to buy! _E_
China owes us money.... __HTTP__ #trumpvlog _E_
Now AP is banning the term illegal immigrants What should we call them? 'Americans'?! This country's political press is amazing! _E_
.@HillaryClinton's Careless Use Of A Secret Server Put National Security At Risk: __HTTP__ #VPDebate#BigLeagueTruth _E_
Just landed in D.C. __HTTP__ _E_
Glad to see no charges against Greg Kelly. His accusers' charges never made sense! _E_
Will go back on for a final question now! _E_
We are going to bring steel and manufacturing back to Indiana! _E_
Obamacare is far toooo expensive far toooo complicated (thousands of pages) and most importantly doesn't work. WE CAN DO MUCH BETTER! _E_
Be yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_
Why didn't the writer of the twelve year old article in People Magazine mention the incident in her story. Because it did not happen! _E_
Thank you for a wonderful evening in Washington D.C. #Inauguration __HTTP__ _E_
Leightweight @Lord Sugar virtually begged my reps to have me stop mocking him. Every time this dope goes on Apprentice I make money too easy _E_
Thank you Wilkes Barre Pennsylvania! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
More of your questions answered in today's video at __HTTP__ here is my appearance on Neil Cavuto __HTTP__ _E_
Churches in Texas should be entitled to reimbursement from FEMA Relief Funds for helping victims of Hurricane Harvey (just like others). _E_
.@Lord_Sugar If you think ugly windmills are good for Scotland you are an even worse businessman than I thought... _E_
Meeting with African American Pastors at Trump Tower was amazing. Wonderful news conference followed. Now off to Georgia for big speech! _E_
You have enemies? Good. That means you've stood up for something sometime in your life. Winston Churchill _E_
Occupy Wall Street is at it again go out and get a job. It's actually easier work and far more rewarding. _E_
Performing live on the Miss Universe Pageant from the Mandalay Bay Resort & Casino will be Telemundo Orianthi John Legend and The Roots. _E_
South Korea must in some form pay for our help the U.S. must stop being stupid! _E_
...and people like Ms. Heyer. Such a disgusting lie. He just can't forget his election trouncing.The people of South Carolina will remember! _E_
Dummy @mcuban made up a story about a visit to Mar a Lago last night on Leno. It never happened—I don't talk that way. _E_
RT @KellyannePolls: more media #polls showing @realDonaldTrump ahead in states Pres Obama won twice. __HTTP__ _E_
9 million fewer people voted for Obama this election than last & yet the Republicans lost—do you think they might be doing something wrong! _E_
Receiving @AmericanCancer Lifetime Achievement Award & chairing @FollowLola debut @CarnegieHall on Jan.19 __HTTP__ _E_
I'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
.@JonahNRO You stated that I started "relentlessly tweeting like a 14 year old girl..." Horrible insult to women. Resign now or later! _E_
The S&P are losers. They did this for personal publicity in order to straighten out their terrible reputatio... (cont) __HTTP__ _E_
Will be on @megynkelly tonight at 9:00. WILL BE TALKING ABOUT EVERYTHING! _E_
Ebola patient Duncan lied on his exit papers by saying he never came into contact with a person with Ebola. He knew he did and person died. _E_
All the contestants have arrived to compete in Trump Miss Universe Pageant in Las Vegas. Today's welcoming ceremony will be terrific! _E_
.@MittRomney must ask for Obama's college records & applications why is he not doing this? _E_
If Americans understood just how many hidden government fees and taxes are absorbed into the prices of the (cont) __HTTP__ _E_
There is an incredible spirit of optimism sweeping the country right now—we're bringing back the JOBS! __HTTP__ _E_
Graydon Carter is laughing at the stupidity of Chuck Townsend on his contract renewal even he doesn't believe it! @CondeNastCorp _E_
The @nytimes states today that DJT believes more countries should acquire nuclear weapons. How dishonest are they. I never said this! _E_
Best Apprentice episode EVER tonight at 8:00. _E_
Experience knowledge and prescience are a formidable combination of powers. Do not underestimate them. Think Like a Champion _E_
NBC news is #FakeNews and more dishonest than even CNN. They are a disgrace to good reporting. No wonder their news ratings are way down! _E_
Thank you Bobby Bowden for the intro tonight and your support! I hope I can do as well for Florida as you have done! __HTTP__ _E_
.@_KatherineWebb with some of my memorabilia. __HTTP__ _E_
RT @EricTrump: Thank you to @GolfDigest for this incredible feature! Golfer in Chief @RealDonaldTrump __HTTP__ __HTTP__ _E_
Was there another loan that Ted Cruz FORGOT to file. Goldman Sachs owns him he will do anything they demand. Not much of a reformer! _E_
Rick Perry a good man a great family and a patriot. _E_
.@ForbesInspector 5 Star & @TripAdvisor #1 Luxury Hotel @TrumpToronto offers style luxury & impeccable service __HTTP__ _E_
Will be on Fox & Friends at 7.00 Enjoy! _E_
"Whether you realize it or not your brand can be many times more valuable than your business." – Midas Touch _E_
I'll be on @foxandfriends at 7:30 AM Monday _E_
.@CNN just doesn't get it and that's why their ratings are so low and getting worse. Boring anti Trump panelists mostly losers in life! _E_
Rick Perry failed at the border. Now he is critical of me. He needs a new pair of glasses to see the crimes committed by illegal immigrants. _E_
We _E_
RT @IvankaTrump: Thank you Angie Phillips for inviting me to tour your plant Middletown Tube Works. #Ohio __HTTP__ _E_
I will be speaking at the NRA event today in Nashville. Many friends will be there. _E_
I guess @BillMaher saw my ratings on the @Late_Show the other night where Letterman beat Leno. Bill you are no Letterman. _E_
From my first day in office we've taken swift action to lift the crushing restrictions on American energy. Remarks... __HTTP__ _E_
I explained to the President of China that a trade deal with the U.S. will be far better for them if they solve the North Korean problem! _E_
If Scotland would have gone independent predicated on $100 $150 oil they would now be bust! _E_
"ACU ANNOUNCES DONALD TRUMP TO ADDRESS CPAC 2013" __HTTP__ via @CPACnews _E_
Briarcliff Manor should get a better town manager. Philip Zegarelli has no clue—bad roads a total puppet of the mayor? @westchestergov _E_
Amazing that Crooked Hillary can do a hit ad on me concerning women when her husband was the WORST abuser of woman in U.S. political history _E_
Will be signing the biggest ever Tax Cut and Reform Bill in 30 minutes in Oval Office. Will also be signing a much needed 4 billion dollar missile defense bill. _E_
Autism Speaks' Bob and Suzanne Wright will address the Pontifical Council on Health Care Workers at the Vatican in Rome. November 20 22 _E_
Marco Rubio is totally weak on illegal immigration & in favor of easy amnesty. A lightweight choker bad for #USA! _E_
Via @WashTimes by @dsherfinski __HTTP__ _E_
President Obama strongly considering a plan to bring non U.S. citizens with Ebola to the United States for treatment. Now I know he's nuts! _E_
RT @DanScavino: 'Trump as Commander in Chief Making the Hard Decisions' by LTG (Ret) Kellogg a highly decorated Vietnam War Vet: __HTTP__ _E_
Honolulu's best @TrumpWaikiki features a dozen distinct tropically decorated Hawaii hotel rooms and suite layouts __HTTP__ _E_
The Tea Party delivered the House for @GOP so they could be fiscally responsible. Instead they have been irresponsible! _E_
.@JebBush had a tiny 300 person crowd at Senator Tim Scott's forum. I had thousands and they had real passion! __HTTP__ _E_
Thank you Arkansas! #Trump2016#SuperTuesday _E_
I will take full credit for Mitt Romney dropping out of the race—looks like he won't be endorsing Trump any time soon. _E_
RT @foxandfriends: FOX NEWS ALERT: U.S. flexes its defense muscles destroys incoming test missile off coast of Alaska __HTTP__ _E_
Via @todayshow: Trump: Attorney general behind lawsuit a 'total lightweight' __HTTP__ _E_
Join me in Colorado Springs Colorado tomorrow at 1:00pm! #MAGA Tickets: __HTTP__ _E_
Happy Lá Fheile Phadraig to all of my great Irish friends! _E_
Will CNN send its cameras to the border to show the massive unreported crisis now unfolding or are they worried it will hurt Hillary? _E_
How can General Martin Dempsey tell Obama that delaying the Syria bombardment will have no consequences? He is no Patton or MacArthur. _E_
Great to see @RandPaul looking well and back on the Senate floor. He will help us with TAX CUTS and REFORM! _E_
Thank you America! #Trump2016 __HTTP__ __HTTP__ _E_
Tune in at __HTTP__ and get the word out #BigLeagueTruth #Debate Help us spread the TRUTH stop the... __HTTP__ _E_
While Bernie has totally given up on his fight for the people we welcome all voters who want a better future for our workers. _E_
Sally Yates made the fake media extremely unhappy today she said nothing but old news! _E_
If only the morons @AP were as concerned with Obama's inconsistent statements on the Embassy attacks as they are (cont) __HTTP__ _E_
#AmericaFirst! __HTTP__ _E_
Thank you to Joe Passov (Travelin' Joe) of Golf Magazine for the great article... __HTTP__ __HTTP__ _E_
Philly FOP Chief On Presidential Endorsement: Clinton 'Blew The Police Off' __HTTP__ _E_
Can't wait to meet patriotic small business owners next week in Sarasota and Tampa! Hey @BarackObama We Did Build It! _E_
So what did you think of my decision? What would you have done? #CelebApprentice _E_
Happy Birthday @IvankaTrump! You are an amazing daughter! _E_
Our views trump the rest for the #Thanksgiving #MacysParade. Stay @TrumpNewYork for exclusive parade access __HTTP__ _E_
See Charles Gasparino's article in today's NYPost about Eric Schneiderman's witch hunt against Republicans __HTTP__ _E_
Time for Sebelius to be fired. She has admitted that the Administration did not vet the ObamaCare website __HTTP__ _E_
The Democrats sent a very political and long response memo which they knew because of sources and methods (and more) would have to be heavily redacted whereupon they would blame the White House for lack of transparency. Told them to re do and send back in proper form! _E_
My ties and shirts are doing very big numbers @Macy's beyond my wildest thoughts! Thanks @GoAngelo and the rest of the losers for mentions! _E_
The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_
Syrian ceasefire seems to be holding. Many lives can be saved. Came out of meeting. Good! _E_
Thank you to Matt Boyle @BreitbartNews for analytical & well written piece on sleazebag blogger @mckaycoppins & irrelevant @BuzzFeed _E_
27 days until America's greatest test since our founding. In this election we decide whether we become great again. _E_
...didn't do it so now we have a big deal with Dems holding them up (as usual) on Debt Ceiling approval. Could have been so easy now a mess! _E_
I will be working late into the evening closing a big real estate deal—soon to be announced. Happy Easter and/or Holiday to all. _E_
Let Pete Rose into the Hall of Fame now 35 years is enough! _E_
Governor John Kasich of the GREAT GREAT GREAT State of Ohio called to congratulate me on the win. The people of Ohio were incredible! _E_
Via @limbaugh: "Trump Doubled Down and It Worked" __HTTP__ _E_
You've got something unique to offer find out what it is. Ask yourself: What can I provide that does not yet exist? Innovation can follow.. _E_
Welcome to the new ObamaCare reality – Doctor spent 2 hours on hold w/insurance company to get approval for surgery __HTTP__ _E_
Capital isn't scarce vision is. Sam Walton _E_
I wonder how @JoeBiden feels after last night's love fest between Obama and Hilary on @60Minutes. Can't be too happy. _E_
.@MileyCyrus – don't worry about Liam. You can do much better and you have plenty of time—remain strong! _E_
.@BrandenRoderick returns in All Star @ApprenticeNBC 2001 Playmate of the Year is a determined competitor. She is terrific! _E_
I will be interviewed at 7:00 A.M. on @foxandfriends Enjoy. _E_
My new book #TimeToGetTough is the best present of the holiday season. A great gift for anyone who cares about this country. _E_
Will be interviewed by @ainsleyearhardt on @foxandfriends Enjoy! _E_
Thank you. __HTTP__ _E_
.@DeeSnider @StephenBaldwin7 and the rest of your favorites are back! All Star @ApprenticeNBC premieres Sunday... __HTTP__ _E_
Jeb's policies in Florida helped lead to its almost total collapse. Right after he left he went to work for Lehman Brothers—wow! _E_
Be sure to watch my interview on @Gretawire tonight! _E_
RT @Carl_C_Icahn: 2/2 How many of our presidents even our great presidents would have handled the antics that went on in that auditorium... _E_
"Intellectuals solve problems geniuses prevent them." – Albert Einstein _E_
I laugh when I see Marco Rubio and Jeb Bush pretending to love each other with each talking of their great friendship. Typical phony pols _E_
I received such a nice letter today from someone who took refuge in Trump Tower during Sandy. It was my pleasure to help. _E_
... Also if they're at home who the hell knows what they're doing (a second job maybe). _E_
No matter what happens in the election @davidaxelrod deserves a lot of credit. He has kept Obama in it even with his terrible record. _E_
Maybe Derek Jeter should ask A Rod about renting his apartment next year. Very soon A Rod won't need a place in NYC. _E_
UL has lost all credibility under Joe McQuaid w circulation dropping to record lows. They aren't worthy of representing the great people NH. _E_
Met with @RepCummings today at the @WhiteHouse. Great discussion! _E_
Two years ago I told everybody to start looking & buying houses—I hope you listened! (but there is still time). _E_
HAPPY THANKSGIVING your Country is starting to do really well. Jobs coming back highest Stock Market EVER Military getting really strong we will build the WALL V.A. taking care of our Vets great Supreme Court Justice RECORD CUT IN REGS lowest unemployment in 17 years....! _E_
For the sake of transparency @BarackObama should release all his college applications and transcripts both from Occidental and Columbia. _E_
#ArizonaPrimary message from @IvankaTrump! #AZPrimary #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
This is a terrible deal for the country and an embarrassment for Republicans! _E_
Crooked Hillary has been fighting ISIS or whatever she has been doing for years. Now she has new ideas. It is time for change. _E_
New poll WOW 53% say President Obama is not honest & trustworthy. What took them so long. Go back and look at his house purchase in Chicago _E_
Lots of people are asking whether or not I should have run for President—stay tuned for the answer. _E_
Do you think I will get credit for keeping Ford in U.S. Who cares my supporters know the truth. Think what can be done as president! _E_
This will be the biggest TAX CUT in the history of our country and we need it! #TaxReform Read more: __HTTP__ __HTTP__ _E_
"Sure the home field is an advantage but so is having a lot of talent." @DanMarino _E_
Wow Trump International Hotel & Tower Toronto was just ranked #1 out of 138 hotels in Toronto! @TrumpToronto _E_
The Clintons spend millions on negative ads on me & I can't tell the truth about her husband? Don't feel sorry for crooked Hillary! _E_
My thoughts on The O'Reilly Factor and more here... __HTTP__ _E_
Very proud to announce that Mar a Lago was awarded top Historic building in the state by the illustrious (cont) __HTTP__ _E_
A letter written to one of my many critics! __HTTP__ _E_
N.Y.C. has the worst Mayor in the United States. I hate watching what is happening with the dirty streets the homeless and crime! Disgrace _E_
While Jon Stewart is a joke not very bright and totally overrated some losers and haters will miss him and his dumb clown humor. Too bad! _E_
Join me in Phoenix Arizona tomorrow at 4pm! #Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_
So nice thank you! __HTTP__ _E_
Thousands of fans have been sending letters to Trump Tower in anticipation of @CelebApprentice. Really good show. _E_
Morning Consult poll: Trump Leads __HTTP__ _E_
#trumpvlog My thoughts on the State of the Union address Apple and a great @WSJ article.... __HTTP__ _E_
Outrageous @BarackObama has spent over $2.7B on implementing @ObamaCare since the oral arguments at SCOTUS __HTTP__ _E_
You mean the fact that my father left me some money (as a good father will) and I multiplied it many many times to over $10 billion is bad? _E_
Don't be fooled. In 2008 @BarackObama promised immigration reform in his 1st yr of his 1st term. Now promising (cont) __HTTP__ _E_
I really like Chelsea Clinton an amazing young woman. She got the best of both parents. (@IvankaTrump agrees) _E_
Trade between China and North Korea grew almost 40% in the first quarter. So much for China working with us but we had to give it a try! _E_
When Obama took office in 2009 employer provided premiums cost $13375. Today they are $18142. Thanks Obama. _E_
Watch my interview with @ericbolling on @FoxNews today at 11:30AM ET _E_
#sweepstweet @johnrich and @marleematlin were on #CelebrityApprentice—and they're back! _E_
Check out the last webisode in our 3 part series featuring me with Serta. Which one was your favorite? www.youtube.com/user/mattressserta _E_
Remember Republicans are 5 0 in Congressional Races this year. The media refuses to mention this. I said Gillespie and Moore would lose (for very different reasons) and they did. I also predicted "I" would win. Republicans will do well in 2018 very well! @foxandfriends _E_
The Carson story is either a total fabrication or if true even worse trying to hit mother over the head with a hammer or stabbing friend! _E_
I'm glad President Obama followed my lead and lowered the flags half staff. It's about time! _E_
RE: FB Vanity URLs: SF Chronicle David Beckham was one of the first along with Britney Spears & Donald Trump. __HTTP__ _E_
__HTTP__ _E_
FoxNewsInsider with comments on my speech at CPAC in Washington DC __HTTP__ _E_
It was a great honor to be with President @EmmanuelMacron of France this afternoon with his delegation. Great bilateral meeting! #UNGA __HTTP__ _E_
Join me in Cedar Rapids Iowa tomorrow at 7:00pm! #MAGA __HTTP__ __HTTP__ _E_
Great poll numbers out of @UMassAmherst. Thank you! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Strong debate by @PerdueSenate. No question he won. We need more business leaders with bold vision to fix Washington. #GASen _E_
Had dinner this week at @MEGUNYC (at Trump World Tower) opposite the United Nations—fantastic food! 212.964.7777 _E_
Massive crowds expected in Mississippi tomorrow night. Look forward to it! 2015 IN PHOTOS: __HTTP__ __HTTP__ _E_
Obama once again just missed a self imposed deadline with Iran. Our leadership is weak & ineffective. Double the sanctions! _E_
Congrats @LindseyGrahamSC. You just got 4 points in your home state of SC—far better than zero nationally. You're only 26 pts behind me. _E_
#CrookedHillary = Obama's third term which would be terrible news for our economic growth seen below. __HTTP__ _E_
Just arrived in Syracuse NY. Big crowd great place! We will bring back the desperately needed jobs. #NYPrimary __HTTP__ _E_
You get what you vote for. 21% of small business owners planning to cut their workforce in 2013 __HTTP__ _E_
Watched Crooked Hillary Clinton and Tim Kaine on 60 Minutes. No way they are going to fix America's problems. ISIS & all others laughing! _E_
You have all been waiting the response has been amazing! Watch my announcement now press release to follow at 12:15. __HTTP__ _E_
Spoke with Governor @PatMcCroryNC of North Carolina today. He is doing a tremendous job under tough circumstances. _E_
My thoughts on last night's Celebrity Apprentice __HTTP__ also an observation I made recently __HTTP__ _E_
Today I hosted an immigration roundtable ahead of two votes taking place in Congress tomorrow. Watch and read more... __HTTP__ _E_
Obamacare will bankrupt our country and lead to socialized medicine. We must all focus now on electing @MittRomney this November. _E_
Don't talk about Rolling Stone Magazine but most importantly don't buy it. This degenerate killed and maimed so many wonderful people! _E_
"Home Prices Reach New All Time Highs in August" Read more: __HTTP__ __HTTP__ _E_
Which team is your favorite? _E_
.@jamieaydt Happy Birthday Jamie! _E_
On my way to Cedar Falls Iowa now. Will be great I love the people of Iowa! _E_
Remember Russia still has Snowden. When are we going to bring that piece of human garbage back home to stand trial? He caused great damage! _E_
.@TigerWoods is playing like his old self in the Farmers Insurance Open. He will have a great year. _E_
...We cannot keep FEMA the Military & the First Responders who have been amazing (under the most difficult circumstances) in P.R. forever! _E_
Republicans seem intent on negotiating against themselves. Many senior Senators are doing Obama's bidding. Can't win this way. _E_
Really dumb @CheriJacobus. Begged my people for a job. Turned her down twice and she went hostile. Major loser zero credibility! _E_
The #SOTU speech is really boring slow lethargic very hard to watch! _E_
Tomorrow! Las Vegas NV 11a: __HTTP__ CO 4p: __HTTP__ NM 7p: __HTTP__ _E_
Great to see Tony La Russa manage one last game last night. Congratulations to the National League on winning the @MLB All Star Game. _E_
China is worried. The polls are trending for @MittRomney. They won't be able to steal from us anymore. _E_
Deportations are "plummeting" __HTTP__ while Obama continues to grant amnesty. _E_
Great reporting by @foxandfriends and so many others. Thank you! _E_
It's record cold all over the country and world where the hell is global warming we need some fast! _E_
Unbelievable how he gets away with it: @BarackObama is flying around on Air Force One laughing at everybod... (cont) __HTTP__ _E_
July U.S. construction had biggest drop in 12months. Bad indicator on economic numbers for rest of the year. _E_
I am so honored by all the great NY State Repubs who came to my office called & wrote for me to run for Governor. If I do I will win. _E_
Congress must defund ObamaCare. It is destroying Medicare and breaking promises to our Seniors including veterans. _E_
Via @dallasnews' @neighborsgo by Heather Noel: Shelton School graduate receives handwritten note from Donald Trump __HTTP__ _E_
ATTN: @HillaryClinton Why did five of your staffers need FBI IMMUNITY?! #BigLeagueTruth #Debates _E_
.@TrumpChicago's river lake and skyline views in each of its deluxe 5 Star guestrooms __HTTP__ _E_
Thank you for all of the nice compliments and reviews on the State of the Union speech. 45.6 million people watched the highest number in history. @FoxNews beat every other Network for the first time ever with 11.7 million people tuning in. Delivered from the heart! _E_
If I were president Sgt. Andrew Tahmooressi would be let out of jail with one phone call. If not Mexico would pay a price like never before! _E_
Morning Joe Panel is stealing many of my statements and ideas to better America without giving credit the story of my life! _E_
Obama & the Democrats want this shutdown. They think it helps their electoral prospects for 2014. Don't believe! _E_
Anybody who watched all of Ted Cruz's far too long rambling overly flamboyant speech last nite would say that was his Howard Dean moment! _E_
Via @wbtwnews13 by @elizabethk_wbtw: "Donald Trump will deliver keynote address to the SC Tea Party Convention" __HTTP__ _E_
Our athletes in the Olympics are proving once again to be the greatest competitors in the world. Makes us proud to be Americans. _E_
I will be releasing the full interview with a guy named Baxter @antbaxter only to show the bias and stupidity of him and @BBCWorld. Clowns! _E_
.@LaToyaJackson & @Omarosa are not likely to become friends –ever! #CelebApprentice _E_
Sorry for such silence—spent weekend at closing of Ritz Carlton in Jupiter Florida—just bought it will be great! _E_
So how and why are they so sure about hacking if they never even requested an examination of the computer servers? What is going on? _E_
Honoring the men and women who made the ultimate sacrifice in service to America. Home of the free because of the brave. #MemorialDay _E_
...Based on that the Military has hit ISIS much harder over the last two days. They will pay a big price for every attack on us! _E_
I am thrilled to nominate Dr. @RealBenCarson as our next Secretary of the US Dept. of Housing and Urban Development... __HTTP__ _E_
Senator @lisamurkowski of the Great State of Alaska really let the Republicans and our country down yesterday. Too bad! _E_
Congratulations to @ScottKWalker of Wisconsin a great victory. A smart and tough guy. Great going. _E_
A general is just as good or just as bad as the troops under his command make him. General Douglas MacArthur _E_
"Take calculated risks. That is quite different from being rash." George S. Patton _E_
New Reuters Poll just came out and has me at 32% highest number yet. The silent majority is back and we will MAKE AMERICA GREAT AGAIN! _E_
Pervert alert serial sexter @repweiner is polling to test the waters for NYC political run. __HTTP__ _E_
RT @FLOTUS: Had a wonderful visit from @JBA_NAFW children today at the @whitehouse! #WhiteHouseChristmas __HTTP__ _E_
The rescue icebreaker trying to free the ship of the GLOBAL WARMING scientists has turned back the ice is massive (a record). IRONIC! _E_
A government shutdown will be devastating to our military...something the Dems care very little about! _E_
No surprise Saudis turned down spot on UN Security Council. They don't want responsibility. Just have us do their heavy lifting. _E_
The brand new Blue Monster Golf Course at Trump National Doral is doing fantastic business. Also the new driving range is open at night! _E_
This will be a big week for Infrastructure. After so stupidly spending $7 trillion in the Middle East it is now time to start investing in OUR Country! _E_
With the run on our dollar about to take place commodity prices will rise. Gold silver & timber will spike also certain real estate. _E_
Someone should look into who paid for the small organized rallies yesterday. The election is over! _E_
I am on my way! See you all soon! __HTTP__ _E_
Wow great news from Wisconsin. Just made two speeches there with a big one coming tonight. Thank you! __HTTP__ _E_
RT @SpeakerRyan: For individuals and families the final Tax Cuts & Jobs Act:✔lowers individual taxes✔nearly doubles the standard deducti... _E_
I had a great time answering your questions in the latest #AskTheDonald. Watch and see if your question made it in __HTTP__ _E_
Congrats to Roger Clemens he showed great courage. This case never should have been brought to trial. Andy Pettitte did the right thing. _E_
Press conference at the opening of the @GaryPlayer Villa at @TrumpDoral . __HTTP__ _E_
.@FoxNews is MUCH more important in the United States than CNN but outside of the U.S. CNN International is still a major source of (Fake) news and they represent our Nation to the WORLD very poorly. The outside world does not see the truth from them! _E_
Interesting...the last time a Democrat succeeded a two term Democratic pres. was in 1836 when Martin Van Buren succeeded Andrew Jackson. _E_
Thank you Louisiana! #Trump2016 __HTTP__ _E_
People having a great time in the Trump Tower atrium unlike others I stayed open. __HTTP__ _E_
What has happened in Orlando is just the beginning. Our leadership is weak and ineffective. I called it and asked for the ban. Must be tough _E_
The meeting with Republican Senators yesterday outside of Flake and Corker was a love fest with standing ovations and great ideas for USA! _E_
Not so smart after all ... Man with name on Duke law library must pay me legal fees after Trump trial victory. _E_
I'm watching Knicks game I'd bet all of those guys with the terrible tattoos wish they never got them too bad too late! _E_
My @gretawire interview discussing @BarackObama's economic failures attack on capitalism and playing class warfare. __HTTP__ _E_
Statement on John McCain __HTTP__ _E_
.... Do I get the credit for this? Thank you! __HTTP__ _E_
I agreed to take the worst spot at CPAC because nobody else wanted it and it was the only time I could be there it was great fun! _E_
Our country needs to reestablish the work ethic. In NY welfare pays better than jobs __HTTP__ Zero incentive. _E_
To all @MittRomney supporters make sure you have taken advantage of early voting now so you can GOTV on election day. _E_
The UK is seriously thinking about halting wind turbine subsidies. Good news killing country. _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
My thoughts are with all those observing Yom Kippur the holiest day of the Jewish year. __HTTP__ _E_
Join me LIVE on my Facebook page in St. Augustine Florida! Lets #DrainTheSwamp & MAKE AMERICA GREAT AGAIN!... __HTTP__ _E_
There will be NO change to your 401(k). This has always been a great and popular middle class tax break that works and it stays! _E_
RT @WhiteHouse: President Trump proclaims today as #WorldAIDSDay: __HTTP__ __HTTP__ _E_
WHO IS GOING TO GET IRAQ'S OIL??????? _E_
Together we will MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
We are making great progress with healthcare. ObamaCare is imploding and will only get worse. Republicans coming together to get job done! _E_
It was my great honor to join our wonderful Veterans at AMVETS Post 44 in Youngstown Ohio this evening. A grateful nation salutes you! __HTTP__ _E_
#WeeklyAddress __HTTP__ __HTTP__ _E_
.@NicolleDWallace Your father is a brilliant man with wonderful sense therefore you must be good! _E_
Looking forward to receiving 2015 Statesman of the Year award tonight by @SRQRepublicans. A record 2000+ sell out __HTTP__ _E_
My @foxandfriends interview discussing @BarackObama's #WHCD lowering tax rates Republic of Georgia & (cont) __HTTP__ _E_
.@DJohnsonPGA We are so proud of you Dustin. Your reaction under pressure was amazing. First of many Majors. You are a true CHAMPION! _E_
I will be on @foxandfriends at 7.00 45 minutes. Talking about Ebola Obama and other strange U.S. happenings! _E_
My @gretawire interview discussing @MittRomney debate responses Obama's hidden records my tweets and unemployment __HTTP__ _E_
Is President Obama going to finally mention the words radical Islamic terrorism? If he doesn't he should immediately resign in disgrace! _E_
Does anybody really believe that a reporter who nobody ever heard of went to his mailbox and found my tax returns? @NBCNews FAKE NEWS! _E_
.@IsraeliPM @netanyahu delivered an excellent speech yesterday at the UN. Too bad @AmbassadorRice wasn't there. _E_
My @SquawkCNBC interview discussing today's primary contests @MittRomney's lead and my stock picks __HTTP__ _E_
I will be the best by far in fighting terror. I'm the only one that was right from the beginning & now Lyin' Ted & others are copying me. _E_
Watched @davidaxelrod on @oreillyfactor and the dog hit me even after I made a big contribution to his charity. I never went bankrupt! _E_
I really enjoy doing @foxandfriends every Monday at 7 AM. @sdoocy @ehasselbeck and @kilmeade are great people. _E_
"Manufacturing Optimism Rose to Another All Time High in the Latest @ShopFloorNAM Outlook Survey" __HTTP__ _E_
Very nice article from Daily Mail __HTTP__ _E_
The statement about leaving the base came directly from CBS Evening News. _E_
Gallup poll proves that @BarackObama's regulation and Obamacare are stopping small business owners from hiring __HTTP__ SHOCK! _E_
#MakeAmericaGreatAgain __HTTP__ _E_
Then ask: What am I pretending not to see? These two simple questions can pave the way for some very clear answers. _E_
The 9.11.12 attack on the Benghazi consulate was a sophisticated multi prong wave attack. When will all the 50+ fighters face justice? _E_
Lyin' Ted Cruz can't win with the voters so he has to sell himself to the bosses I am millions of VOTES ahead! Hillary would destroy him & K _E_
"Never judge someone by their job title. You'd be surprised at the talents people can have." – Midas Touch _E_
The plane I saw on television was the hostage plane in Geneva Switzerland not the plane carrying $400 million in cash going to Iran! _E_
Whatever you are doing right now make sure to stop for a minute focus and ask yourself "Am I thinking BIG?" _E_
.@David_Cameron Why do you give Scotland so much money to destroy their magnificent land with wind turbines causing massive taxes & E bills _E_
We can make Washington work for us. It's time for real leadership. Let's #MakeAmericaGreatAgain! __HTTP__ _E_
Iran has been formally PUT ON NOTICE for firing a ballistic missile. Should have been thankful for the terrible deal the U.S. made with them! _E_
How will Mitt Romney defend his record on jobs and Romneycare in tonight's debate? _E_
Thank you Las Vegas Nevada I love you! Departing for Greeley Colorado now. Get out & VOTE! #ICYMI watch here:... __HTTP__ _E_
My @foxandfriends interview re: @SuperBowl blackout @BobbyJindal's stupid comment & suing @billmaher f/$5M __HTTP__ _E_
Obama is trying to block sequester layoff notices in Virginia __HTTP__ Another example of sleazy politics! _E_
As dishonest as @RollingStone is I say @HuffingtonPost is worse. Neither has much money sue them and put them out of business! _E_
President Obama's literary agent (in 1991) promoted a book about the first African American president of the (cont) __HTTP__ _E_
Entrepreneurs – be tough resolute & trustworthy. The most crucial time to build your reputation is when you start making deals. _E_
The NFL has decided that it will not force players to stand for the playing of our National Anthem. Total disrespect for our great country! _E_
Countries charge U.S. companies taxes or tariffs while the U.S. charges them nothing or little. We should charge them SAME as they charge us! _E_
The lights went out at the White House today __HTTP__ Symbolic of the Obama presidency. _E_
I wonder why somebody doesn't do something about the clowns @politico and their totally dishonest reporting. _E_
Press release. Video response to follow. __HTTP__ _E_
The banks were bailed out by us. They should start lending to private entrepreneurs. The banks are slowing American growth. _E_
Unlike the other Republican candidates I will be in Nevada all day and night I won't be fleeing in and out. I love & invest in Nevada! _E_
Will be in Bangor Maine today! Join me 4pmE at the Cross Insurance Center! __HTTP__ __HTTP__ _E_
I hope the Republicans are happy. Just as I predicted that stupid deal they voted for only whetted Obama's appetite for more taxes. _E_
"President Trump?" __HTTP__ via @MiamiHerald by Wayne E. Williams _E_
Via @HorsetalkNZ: "Florida's Trump Invitational to kick off showjumping year" __HTTP__ Mar a Lago's 3rd ann. Trump Grand Prix! _E_
The Government spends 30% more than it admits __HTTP__ @BarackObama is out of control with his deficit spending. _E_
Obama administration fails to screen Syrian refugees' social media accounts: __HTTP__ _E_
Via @DailyCaller: Trump on Obama and Congress: 'Lock them up' in a room like Vatican conclave __HTTP__ by @NicholasBallasy _E_
Very tacky set! _E_
With Boston terrorist cell widening in suspects it's now clear that it was a mistake to read the bomber the Miranda warning so early. _E_
I love the Lakers and when you love the Lakers you want them to win so badly that you will work tirelessly. Dr. Jerry Buss _E_
#JointSession #MAGA __HTTP__ _E_
With taxes set to go up and Obama about to cut the mortgage deduction now is the time to buy a house if you can. Can get a great deal. _E_
Does anyone know that Crooked Hillary who tried so hard was unable to pass the Bar Exams in Washington D.C. She was forced to go elsewhere _E_
Bill Clinton has been Obama's most effective surrogate out on the trail. _E_
It doesn't cost any money to think bigger. The Art of the Deal _E_
"Did you know that with the natural gas reserves we have in the United States we could power America's (cont) __HTTP__ _E_
Via @MarketWatch: "@TrumpSoHo New York Unveils $50 Million Presidential Penthouse" __HTTP__ _E_
Thank you. __HTTP__ _E_
.@redstate I miss you all and thanks for all of your support. Political correctness is killing our country. weakness. _E_
RT @ricardorossello: Briefed @POTUS @realDonaldTrump in #SituationRoom and thanked him for his leadership quick response & commitment to o... _E_
FEMA & First Responders are doing a GREAT job in Puerto Rico. Massive food & water delivered. Docks & electric grid dead. Locals trying.... _E_
RT @ScottAdamsSays: Trump's speech today is the best persuasion I have ever seen. Game over. Now running unopposed: __HTTP__ _E_
.@mkhammer a Fox contributor isn't smart enough to know what is going on at the border. @TheJuanWilliams made the point far better! _E_
Such a great honor to be the Republican Nominee for President of the United States. I will work hard and never let you down! AMERICA FIRST! _E_
RT @EricTrump: Who has voted today??? Feedback from the polls? I'm like a kid on Christmas! #SuperTuesday #MakeAmericaGreatAgain __HTTP__ _E_
.@timkaine is wrong for defense: __HTTP__ #BigLeagueTruth #VPDebate _E_
Leaving for Mobile Alabama right now can't be late! _E_
It is going to be a long and tough road to turn around CNN they are looking at the wrong people! _E_
Great read by @VDHanson: "Mexico's Hypocrisy Is Evident In Its Own Strict Policy Toward Immigrants" __HTTP__ _E_
Via @DailyCaller by @rpollockDC: "NYC Mayor Action Against Donald Trump Is 'Not the American Way'" __HTTP__ _E_
Do you believe Obama just said that America would be less safe with a travel ban from West Africa. This is the thinking of a total mad man! _E_
The Solyndra Scandal @BarackObama's $500Million photo op. He loves wasting our money. _E_
If you love it own it. @TrumpCondosLV bring unparalleled style elegance and world class amenities to Las Vegas __HTTP__ _E_
Join me at 2:00pmEST today live from Trump Tower via Facebook & Periscope! __HTTP__ _E_
The Boston Bomber got immediate emergency surgery for a gunshot yet our vets die on waiting lines at the VA. We must do better! _E_
What team would you choose to win? #CelebApprentice _E_
.@AllenWest Great seeing you last night at record setting Mar a Lago Republican event. The crowd loved you! _E_
Presidential Memorandum for the @CommerceGov @SecretaryRoss re: Aluminum Imports and Threats to National Security:... __HTTP__ _E_
Thank you Minnesota! It is time to #DrainTheSwamp & #MAGA! #ICYMI watch: __HTTP__ __HTTP__ _E_
Wow Lyin' Ted Cruz really went wacko today. Made all sorts of crazy charges. Can't function under pressure not very presidential. Sad! _E_
I feel sure that my friend @RandPaul will come along with the new and great health care program because he knows Obamacare is a disaster! _E_
I strongly pressed President Putin twice about Russian meddling in our election. He vehemently denied it. I've already given my opinion..... _E_
I will be doing Fox & Friends tomorrow morning at 7.00. Will be discussing all sorts of current disasters! _E_
Wow @CNN ratings are up 75% because it's all Trump all the time. The networks are making a fortune off of me! MAKE AMERICA GREAT AGAIN! _E_
Al Qaeda taking over Libya after we made it possible really amazing. _E_
Obama has unilaterally & unconstitutionally drawn 4 ObamaCare exemptions for his friends. All @GOP wants is (cont) __HTTP__ _E_
RT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_
Thank you Marco I agree! __HTTP__ _E_
In 2008 @BarackObama campaigned against $3.50 gas __HTTP__ It is now $6 in Florida and on the rise. He is a disaster! _E_
Happy Canada Day to all of the great people of Canada and to your Prime Minister and my new found friend @JustinTrudeau. #Canada150 _E_
via Bloomberg: Fox News Couldn't Kill Trump's Momentum Made Him Stronger @FoxNews @business __HTTP__ _E_
If the Republicans need a chief negotiator I am always available or can recommend some really good ones! _E_
.@BillRancic Bill fantastic job this morning on @foxandfriends you are a total winner and I am proud of you as first Apprentice CHAMP! _E_
Isn't it sad the way Putin is toying with Obama regarding Snowden. We look weak and pathetic. Could not happen with a strong leader! _E_
Great players in sports make the game fun to watch. @DerekJeter has continued to impress with another amazing season. Absolute professional. _E_
Our ally Canada wants to send their oil down south to us. @BarackObama is forcing Canada to send it west to China. _E_
Sorry folks I'm just not a fan of sharks and don't worry they will be around long after we are gone. _E_
I will be interviewed by Chris Wallace at 2:00 P.M. on @FoxNews Turn off the football for 15 minutes Make America Great Again! _E_
Via @pbpost: "Faldo calls team up for golf course with Trump 'entertaining'" __HTTP__ _E_
Great job on the Larry King Live Gulf Telethon last night $1.3 million was raised in 2 hours. _E_
I wish everyone including the haters and losers a very happy Easter! _E_
RT @TheFive: Media bias is not just about what they report it's also about what they don't report. @jessebwatters #thefive _E_
We need to fix our broken education system! #StopCommonCore #MakeAmericaGreatAgain Video: __HTTP__ __HTTP__ _E_
WOW great new poll New York! Thank you for your support! #Trump2016#NewYorkValues __HTTP__ __HTTP__ _E_
If dummy Bill Kristol actually does get a spoiler to run as an Independent say good bye to the Supreme Court! _E_
The national debt continues to rise at record levels and today @BarackObama is in Disney World. He lives in a fantasy. _E_
Photo from yesterday's USGA announcement that Trump National Golf Club Bedminster will host the 2017 U.S. Women's Open __HTTP__ _E_
The special interests and people who control our politicians (puppets) are spending $25 million on misleading and fraudulent T.V. ads on me. _E_
Not only is @Toure a racist (and boring) he's a really dumb guy! _E_
Join me in San Jose California tomorrow evening at 7pm! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Thanks for your support! __HTTP__ _E_
I hear the very ungrateful @ArsenioHall has a show that is absolutely dying in the ratings. Really too bad! _E_
Best wishes to the Republic of Korea on hosting the @Olympics! What a wonderful opportunity to show everyone that you are a truly GREAT NATION! __HTTP__ _E_
I don't know Putin have no deals in Russia and the haters are going crazy yet Obama can make a deal with Iran #1in terror no problem! _E_
.@ArsenioHall just got "fired"—the people spoke ratings were terrible. The Apprentice brought him back from the dead but he blew it! _E_
Great conversations with President @EmmanuelMacron and his representatives on trade military and security. _E_
Just out the new nationwide @FoxNews poll has me alone in 2nd place closely behind Jeb Bush but Bush will NEVER Make America Great Again! _E_
Hillary's 33000 deleted emails about her daughter's wedding. That's a lot of wedding emails. #debate _E_
I can't believe he would choose @OMAROSA as his first choice! She is hard to handle. #CelebApprentice _E_
Crowd is booing the hell out of that phony decision place is angry and going wild. Fight was not even close! DISGUSTING. _E_
Unfortunately @BarackObama's continued attack on the US $ will lead to ever rising gas prices at the pump and lots of other really bad things _E_
Gas prices are up 30 cents this month rising 21 days in a row __HTTP__ Don't worry @BarackObama has a solution ALGAE! _E_
Heading to Pennsylvania for a big rally tonight. We will MAKE AMERICA GREAT AGAIN! _E_
Middle Eastern countries must participate militarily (no running away) and big league financially in order for us to go in and save them! _E_
RT @TeamTrump: ONLY @realDonaldTrump will end what even @BillClinton called a CRAZY SYSTEM. #BigLeagueTruth #Debate __HTTP__ _E_
HAPPY BIRTHDAY to our 40th President of the United States of America Ronald Reagan! __HTTP__ _E_
"In the end you're measured not by how much you undertake but what you finally accomplish. The Art of the Deal _E_
I have NOTHING to do with The Apprentice except for fact that I conceived it with Mark B & have a big stake in it. Will devote ZERO TIME! _E_
Donald Trump Partners with TV1 on New Reality Series Entitled Omarosa's Ultimate Merger: __HTTP__ _E_
Today we remember and honor our fellow Americans and NYPD and FDNY who fell 11 years ago. _E_
.@chucktodd is so dishonest in his reporting...and to think he was going off the air until I came along no ratings. I will beat Hillary! _E_
I hope everyone (especially the haters and losers) goes to Macy's today and buys some DJT ties shirts and suits and SUCCESS Fragrance love! _E_
Has anyone looked at the really poor numbers of @VanityFair Magazine. Way down big trouble dead! Graydon Carter no talent will be out! _E_
.#RogerStone was just banned by @CNN their loss! Tough loyal guy. _E_
I just had an amazing day in Mumbai India. Building an almost 80 story building super luxury which is doing great! Press is going wild. _E_
Going to South Carolina now great place SRO crowd. Iowa was amazing yesterday! _E_
Join me in Wilmington Ohio tomorrow at 4:00pm! It is time to #DrainTheSwamp! Tickets: __HTTP__ __HTTP__ _E_
Where's the press? 1484: 72% of Afghan Casualties Have Occurred Under Obama __HTTP__ _E_
Hillary can never win over Bernie supporters. Her foreign wars NAFTA/TPP support & Wall Street ties are driving away millions of votes. _E_
Great day of bilateral meetings at #ASEANSummit on trade which we are turning around to be great deals for our country! __HTTP__ _E_
I wonder if Apple is upset with me for hounding them to produce a large screen iPhone. I hear they will be doing it soon—long overdue. _E_
RT @SLandinSoCal: When you kneel for our #NationalAnthem you aren't protesting a specific issue you are protesting our Nation and EVERYTH... _E_
When will the unemployment numbers be corrected? Sadly after the election! _E_
Melania and I will be appearing on Larry King Live tonight 9 p.m. on CNN. Be sure to tune in for some great conversation! _E_
Do you believe that Secy. KERRY just went to Egypt to talk about human rights problems and this as everything is being blown up around him _E_
Are you a young professional getting ready for a big meeting? Pick up a #Trump suit @Macys __HTTP__ Look your best! _E_
THANK YOU! #Trump2016#IACaucus finder: __HTTP__ __HTTP__ _E_
Tweet me tonight who your favorite is during the live telecast of the Miss Universe Pageant. _E_
"60 Minutes" treats President Obama with kid gloves Mike Wallace is spinning in his grave! _E_
Proud to welcome our great Cabinet this afternoon for our first meeting. Unfortunately 4 seats were empty because S... __HTTP__ _E_
.@CarlyFiorina Ben Carson said in his own book that he has a pathological temper & pathological disease. I didn't say it he did. Apology? _E_
I will be announcing my decision on Paris Accord Thursday at 3:00 P.M. The White House Rose Garden. MAKE AMERICA GREAT AGAIN! _E_
.@TheAlabamaBand was great last night in D.C. playing for 147 Diplomats and Ambassadors from countries around the world. Thanks Alabama! _E_
If @rihanna is dating @chrisbrown again then she has a death wish. A beater is always a beater just watch! _E_
Via @KingwoodNews by @JayRJordan:: "@TXPatriotsPAC gives public a chance to hear Trump speak" __HTTP__ _E_
.@oreillyfactor Please correct I WON Virginia! _E_
Now @BarackObama wants us to believe the Republicans cancelled Keystone and are responsible for $4 gas. He (cont) __HTTP__ _E_
How badly will the Country be hurt by the three scandals and the very poor implementation and cost of Obama Care? _E_
But @mcuban is physically weak he has no clubhead speed or game! _E_
The Democrats' solution is the same solution they have for everything tax tax tax. Just one problem: it doesn't work #TimeToGetTough _E_
I turned down going to the debate tonight so that I could do live tweets to my many followers. _E_
I'm very proud of my new crystal collection. Here's a sneak peak of my favorite collection Elmsford __HTTP__ (cont) __HTTP__ _E_
My shirts ties & suits (and fragrance Success) are doing great go over & check out Macy's now—beautiful new selection! _E_
ObamaCare is on LIFE SUPPORT it will soon be DEAD ON ARRIVAL A bad concept that was incompetently administered! _E_
Remember I am the only candidate who is self funding. While I am given little credit for this by the voters I am not bought like others! _E_
The only people who are not interested in being the V.P. pick are the people who have not been asked! _E_
Happy 94th birthday to Nelson Mandela! _E_
Ask yourself Is this a blip or is it a catastrophe? and your equilibrium will be kept in check if/when hard times hit. _E_
Funny if you listen to @FoxNews the Democrats did not have a good day. If you listen to the other two they are fawning. What a difference _E_
"Trump on Obama: 'I never thought it would be this bad'" __HTTP__ via @breitbarttv _E_
Obama Care's taxes vest in 2014. Conveniently after the 2012 election. Coincidence? _E_
Wishing everyone a wonderful Independence Day weekend. We have a lot to be thankful for. _E_
Last Saturday A Rod was 0 3 and left 6 stranded. But he was still hitting on girls from the dugout __HTTP__ He is very selfish! _E_
Despite my great respect for King Abdullah II I will not be visiting Jordan at this time. This is in response to the false @AP report. _E_
The W.H. is functioning perfectly focused on HealthCare Tax Cuts/Reform & many other things. I have very little time for watching T.V. _E_
On Fifth Avenue the iconic @TrumpTowerNY is one of NYC's most heavily visited tourist attractions __HTTP__ _E_
Awarded the prestigious 2014 @ForbesInspector 5 Star Guide @TrumpToronto is located in beautiful downtown Toronto __HTTP__ _E_
I wish the 23 million who are unemployed were able to celebrate like the Democrats in Charlotte right now. _E_
Great first day with world leaders at the #G20Summit here in Hamburg Germany. Looking forward to day two! #USA __HTTP__ _E_
My opponents big bosses lobbyists and donors are trying to do damage. They will fail! Money down the drain! _E_
.@TomLlamasABC cannot report the news truthfully. Why not apologize for your fraudulent story on World News Tonight.Gang members & criminals _E_
Cadillac Championship at Doral a great success I just bought Doral it will be amazing. Cadillac a great American car. _E_
No president in history has lied to the American people more than President Obama in fact it is not even close! _E_
Innovation distinguishes between a leader and a follower. Steve Jobs _E_
Great poll out of Nevada thank you! See you soon. #MAGA #AmericaFirst __HTTP__ __HTTP__ _E_
.@Rosie get better fast. I'm starting to miss you! _E_
Congrats @marklevinshow on fantastic article when "B" writes so nicely about you it really means something. __HTTP__ _E_
I will be on Face The Nation this morning at various times across the U.S. @CBSNews Enjoy! _E_
.@NBCNews is so knowingly inaccurate with their reporting. The good news is that the PEOPLE get it which is really all that matters! Not #1 _E_
What controversy? 2 'active' @BarackObama supporters at Bain have confirmed that @MittRomney left in '99 __HTTP__ No story here. _E_
Why Donald Trump Won't Touch Your Entitlements: DES MOINES Iowa—Donald Trump says if he runs for p... __HTTP__ _E_
Thank you! #ImWithYou __HTTP__ _E_
Young entrepreneurs – Remember that your first deals are the most important of your career. Win & gain confidence. _E_
Why does Obama continue to release the worst of the worst from Gitmo?! Look at Paris and wake up! _E_
I know of no more encouraging fact than the unquestionable ability of man to elevate his life by conscious endeavor. Henry David Thoreau _E_
Waste! The CBO now estimates that @BarackObama's stimulus cost $831B and a ridicuous $4.1M per job created __HTTP__ _E_
I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
Behind the scenes video with "Uncle Sam" (eagle's name) and me. __HTTP__ _E_
Great night in Denver Colorado thank you! Together we will MAKE AMERICA GREAT AGAIN! #ICYMI watch rally here:... __HTTP__ _E_
Iraq is far worse and of more danger to the U.S. now than it ever was under Saddam Hussein and this after $2 trillion and so many lives! _E_
I had a great time at @TwitterNYC #AskTrump __HTTP__ _E_
If Bernie Sanders after seeing the just released e mails continues to look exhausted and done then his legacy will never be the same. _E_
That said the rich Arab countries should get involved with the Syrian mess not us.We should start rebuilding our own country & military. _E_
Looking like my 5 victories on Tuesday will be just as good as if I won Ohio. Two more days and Ohio was mine! _E_
The Democrats want to shut down the Government over Amnesty for all and Border Security. The biggest loser will be our rapidly rebuilding Military at a time we need it more than ever. We need a merit based system of immigration and we need it now! No more dangerous Lottery. _E_
This is the time for the United States to be strengthening all important military components not rolling over and dealing from weakness! _E_
Notice the first word in my Think Big credo: Think = the 1st step. Use everything in your power to utilize & develop that capacity. _E_
Why doesn't Fake News talk about Podesta ties to Russia as covered by @FoxNews or money from Russia to Clinton sale of Uranium? _E_
.@BLTPrimeMiami @TrumpDoral's signature restaurant has set the standard for today's modern steakhouse __HTTP__ _E_
The Mar a Lago Club has the best meatloaf in America. Tasty. __HTTP__ _E_
RT @ProgressPolls: Who is a better President of the United States? #ObamaDay _E_
.@Graeme_McDowell Great playing this weekend. You are a true winner. We look forward to having you back to Trump National Doral. _E_
RT @RealJamesWoods: I've never witnessed such hatred for a man who is willing to work for free to make his beloved country a better place.... _E_
Why isn't @chucktodd using the much newer @CNN Poll when discussing how well I am doing instead of the older Q Poll? CNN even better! _E_
They just called this the biggest scandal in U.S. sports history (GMA). Roger looks really weak and indecisive must put on a better front! _E_
The failing @nytimes was forced to apologize to its subscribers for the poor reporting it did on my election win. Now they are worse! _E_
Gov. @BobbyJindal referred to the Republicans as "the stupid party". Now he has given Dems a phrase to use. _E_
#WheresHillary? Sleeping!!!!! _E_
Chuck Hagel's nomination has been held up for at least 12 more days. A lot can happen. _E_
Restoring American wealth will require that we get tough. The next president must understand that America's (cont) __HTTP__ _E_
Spent full day with contractors at Trump National Doral it will be amazing! __HTTP__ _E_
Double digit premium increases because of ObamaCare. Dems trying to delay showing numbers until after election but news is spreading fast! _E_
"TRUMP HITS BACK AT CHRIS MATTHEWS' BIRTHER RANT: 'HE USED TO BE A MUCH MORE INTELLIGENT MAN' __HTTP__ @MadeleineBlaze _E_
The media must denigrate ISIS at all levels or youth will continue to be drawn to it. These are low level degenerates NOT masterminds! _E_
Thank you! __HTTP__ _E_
On rugged Aberdeenshire coastline@TrumpScotland's Par 72 course is 7400 picturesque yds. threaded through dunes __HTTP__ _E_
RT @ericbolling: Good morning friends! The Swamp out today. President Trump has a copy... get yours here __HTTP__ #maga... _E_
Donna Brazile Shreds Obama Economy Acting DNC chair says 'people are more in despair about how things are' __HTTP__ _E_
We cannot solve our problems with the same thinking we used when we created them. Albert Einstein _E_
My @SteveDeaceShow int. on China the economy and my upcoming keynote at @theFAMiLYLEADER Leadership Summit __HTTP__ _E_
The number of unemployed Americans has increased over 60% during Obama's term. The economy can't survive another 4 years. _E_
Know when to walk away from the table. The Art of the Deal _E_
Many people have commented that my fragrance "Success" is the best scent & lasts the longest. Try it & let me know what you think! _E_
Another horrific attack this time in Nice France. Many dead and injured. When will we learn? It is only getting worse. _E_
I am a Republican...but the Republicans may be the worst negotiators in history! _E_
Jeff Flake with an 18% approval rating in Arizona said a lot of my colleagues have spoken out. Really they just gave me a standing O! _E_
Who is paying for that tedious Smokey Bear commercial that is on all the time enough already! _E_
Story will be released today at 12 noon EST on Twitter and Facebook. _E_
I can't wait to read this...RT @Newsmax_Media: SEAL Book Explodes Obama Furious __HTTP__ _E_
Achievers move forward at all times. Achievement is not a plateau it's a beginning. _E_
.@MittRomney was a disaster candidate who had no guts and choked! Romney is a total joke and everyone knows it! _E_
Tomorrow we celebrate Independence Day America's 236th birthday. Here is America's actual birth certificate __HTTP__ _E_
My interview with @BarbaraJWalters in her @ABC special 'Most Fascinating People of 2011' __HTTP__ _E_
Welcome to Twitter @melaniatrump! _E_
RT @TrumpNV: #NVcaucus locator > __HTTP__ _E_
Via @NRO by @JOELMENTUM: "Matchless Name Recognition and Deep Pockets Make Trump a Threat in Iowa" __HTTP__ _E_
Much as it pays to emphasize the positive there are times when the only choice is confrontation. #TheArtofTheDeal _E_
Just got back from New Hampshire. Great crowd great people! Will be back soon! _E_
RT @LouDobbs: #AmericaFirst @KellyannePolls: The Middle Class & businesses will benefit from @POTUS' historic tax revolution. #Dobbs #MAGA... _E_
My interview on @ASavageNation discussing why @MittRomney will defeat @BarackObama with a strong campaign. __HTTP__ _E_
Via @espn: Donald Trump would fire A Rod __HTTP__ _E_
Enter the contest.... __HTTP__ stay at Trump International Hotel Las Vegas _E_
Huma calls it a MESS the rest of us call it CORRUPT! WikiLeaks catches Crooked in the act again.#DrainTheSwamp __HTTP__ _E_
RT @TomOdell: .@FoxNews Pope who lives in a Vatican city fortified with huge walls thinks it's wrong to build walls? Really? __HTTP__ _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
All Star Celebrity @ApprenticeNBC continues to surprise our loyal viewers each and every week. More and bigger coming... _E_
For someone who demanded 20 years of Mitt's tax returns you would think my offer to donate $5M to charity for his records is an easy go. _E_
"Don't emphasize the problem so much emphasize the solution. It's a mindset that works." – Think Like a Champion _E_
Just finished my second speech. 20K in Dayton & 25K in Cleveland perfectly behaved crowd. Thanks I love you Ohio! __HTTP__ _E_
A number of months ago I was not expected to win South CarolinaTed Cruz was and yet I won in a landslide every group and category. WIN! _E_
Asking why my dislike of A Rod dishonorable dealings with me on an apartment deal _E_
The day after @BarackObama blocks a Texas voter photo ID law @JamesOkeefeIII exposes more dead people getting ballots __HTTP__ _E_
Nothing ever happened with any of these women. Totally made up nonsense to steal the election. Nobody has more respect for women than me! _E_
I look forward to going to the Land Investment Expo in Iowa on Jan. 23. Record crowd—sold out venue. @LandExpo @PeoplesCompany _E_
Getting ready to go to Las Vegas (Freedom Fest) great crowd. Then on to amazing Phoenix that will be a total happening! Love America. _E_
#TrumpVlog Quarantine the nurse! __HTTP__ _E_
Storm turned Hurricane is getting much bigger and more powerful than projected. Federal Government is on site and ready to respond. Be safe! _E_
Be sure to watch @IvankaTrump's @FoxBusiness @FBNATB interview from the @NYU #HospitalityConference __HTTP__ _E_
Via @Entrepreneur by @MDMSEO : "8 Lessons for Entrepreneurs That @ApprenticeNBC Emphasizes" __HTTP__ _E_
Dress your best. Trump Signature Collection exclusively available @Macys tops all male business attire __HTTP__ _E_
I am promising you a new legacy for America. We're going to create a new American future. Thank you OHIO! #ImWithYou __HTTP__ _E_
.@FinancialTimes writes that @BarackObama should pray that China overtakes US __HTTP__ Don't worry he is making it happen. _E_
.@ariannahuff is unattractive both inside and out. I fully understand why her former husband left her for a man he made a good decision. _E_
The WALL which is already under construction in the form of new renovation of old and existing fences and walls will continue to be built. _E_
Remember when guns are outlawed only outlaws will have guns! _E_
.@CelebApprentice wins 10 11 o'clock hour in all key ratings demographics including most importantly the 18 49 age group. _E_
Asians are very offended that JEB said that anchor babies applies to them as a way to be more politically correct to hispanics. A mess! _E_
Mar a Lago my club in Palm Beach and one of the greatest mansions ever built has been nominated as one of (cont) __HTTP__ _E_
Obama will send troops back into Iraq combat zone. Don't believe anything he says. Just covering for his mistakes. _E_
With America's debt topping $21T by the end of his presidency Obama will have effectively bankrupted our country. @davidaxelrod _E_
RT @Scavino45: #WeThePeople#USAatUNGA #UNGA __HTTP__ _E_
Trump International Golf Links was just rated one of the greatest courses in the world. Virtually all reviews are saying the same thing. _E_
Via @eonline: @IvankaTrump Wears @MissUniverse 2014's $300000 Crown Nails Beauty Pageant Winner Look __HTTP__ _E_
Mexico sent USMC Andrew Tahmooressi back to jail after court hearing. Mexico does not respect our border or U.S. Boycott! #FreeOurMarine _E_
Signing The Facebook Wall __HTTP__ _E_
Ted Cruz Was For Welcoming Syrian Refugees Before He Was Against It __HTTP__ _E_
You know the world is crazy when New York gets hit by a hurricane and Florida doesn't. _E_
We will have to see what Russia's next move will be. They may have given him an out of an embarrassing situation or drove into deeper mess! _E_
"In this game knowledge is the key to power." Think Big _E_
My @SquawkCNBC #TRUMPTUESDAY interview discussing the upcoming debates the real state of unemployment & bias media __HTTP__ _E_
I highly recommend the just out book THE FIELD OF FIGHT by General Michael Flynn. How to defeat radical Islam. _E_
.@megynkelly the most overrated anchor at @FoxNews worked hard to explain away the new Monmouth poll 41 to 14 or 27 pt lead. She said 15! _E_
My @SquawkCNBC interview discussing Jamie Dimon the Doral Miami purchase OPEC's output & @PGATOUR Open __HTTP__ _E_
The Democrats are turning down services and security for citizens in favor of services and security for non citizens. Not good! _E_
Donald Trump could again defy the conventional wisdom of the chattering class in November. @Newsmax_Media's cover The Trump Effect _E_
How come Snowden and ObamaCare have access to all records and information but don't have even the smallest tidbits on President Obama? _E_
This Sunday at 9/8C the real playoffs begin with the premiere of @apprenticenbc! Game on in the Boardroom... __HTTP__ _E_
Marco Rubio had no idea what he was doing on Chris Wallace show. Said Iraq was not a mistake. He looked clueless! _E_
A great evening in Iowa! Thank you Des Moines Area Community College for a great forum! #Trump2016 #IAForums __HTTP__ _E_
'President Donald J. Trump Proclaims September 3 2017 as a National Day of Prayer' #HurricaneHarvey #PrayForTexas __HTTP__ __HTTP__ _E_
We need to secure our borders ASAP. No games we must be smart tough and vigilant. MAKE AMERICA GREAT AGAIN & MAKE AMERICA STRONG AGAIN! _E_
My @gretawire int. discussing business difficulties with ObamaCare & how it is stopping businesses from hiring __HTTP__ _E_
They changed the name from "global warming" to "climate change" after the term global warming just wasn't working (it was too cold)! _E_
Ted Cruz is totally unelectable if he even gets to run (born in Canada). Will loose big to Hillary. Polls show I beat Hillary easily! WIN! _E_
Congratulations to @IvankaTrump and Jared on the big news. I will have yet another grandchild this fall! _E_
The reason lyin' Ted Cruz has lost so much of the evangelical vote is that they are very smart and just don't tolerate liars a big problem! _E_
Do you think @SenTedCruz knows about @bobvanderplaats dealings? Actually I doubt it! _E_
....and Japan will put up with this much longer. Perhaps China will put a heavy move on North Korea and end this nonsense once and for all! _E_
Record crowd in Tampa Florida thank you! We will WIN FLORIDA #DrainTheSwamp in Washington D.C. and MAKE AMERICA... __HTTP__ _E_
Nothing great was ever achieved without enthusiasm. Ralph Waldo Emerson _E_
"Stay confident even when something bad happens. It is just a bump in the road. It will pass." – Think Big _E_
China is already preparing to benefit economically from this mess. They will pick up the pieces and make yet another fortune & laugh at us! _E_
Olympic Gold Medalist Evan Lysacek just left my office. He is in town and wanted to meet me he's a fanastic guy and a true champion. _E_
Thank you for all of your support South Carolina! #Trump2016 __HTTP__ _E_
Watch @DonaldJTrumpJr and @EricTrump accept my #ALSIceBucketChallenge __HTTP__ _E_
Once the Bloomberg administration selected Trump to take over the very expensive and years late project I kicked ass and got it done fast _E_
Truth will ultimately prevail where there is pains to bring it to light. George Washington _E_
Goofy Elizabeth Warren has been one of the least effective Senators in the entire U.S. Senate. She has done nothing! _E_
Ted Cruz lifts the Bible high into the air and then lies like a dog over and over again! The Evangelicals in S.C. figured him out & said no! _E_
Via @FoxNews: Donald Trump sends Bill Maher birth certificate awaits $5 million __HTTP__ _E_
Thank you Maryland! #Trump2016 __HTTP__ _E_
ObamaCare is clearly unconstitutional. Hopefully the USC rules correctly but in the end repealing ObamaCare requires a political solution. _E_
Today we continued a wonderful American Tradition at the White House. Drumstick and Wishbone will live out their days in the beautiful Blue Ridge Mountains at Gobbler's Rest... __HTTP__ _E_
HISTORIC rainfall in Houston and all over Texas. Floods are unprecedented and more rain coming. Spirit of the people is incredible.Thanks! _E_
Thank you Miss Katie's Diner!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
The Democrats are in a total meltdown but the biased media will say how great they are doing! E mails say the rigged system is alive & well! _E_
Creativity and control can go hand in hand. Brainpower is the ultimate leverage. _E_
"Never Ignore Donald Trump" __HTTP__ by Jeffrey Lord @AmSpec _E_
Now the @BarackObama campaign is fundraising off of me. I should get a tax rebate! __HTTP__ _E_
.@chelseahandler stop calling my office for me to do your rather gross show I have less interest in you than Andre. _E_
Pervert Weiner is dead in his race for mayor of NYC but WOW Eliot Spitzer has dropped way down in recent poll for comptroller. SLEAZE! _E_
Big wins against ISIS! _E_
Will be interviewed on @SeanHannity on @FoxNews from #Wisconsin tonight. My wife Melania will join me for the entire show. _E_
I suggest that we add more dollars to Healthcare and make it the best anywhere. ObamaCare is dead the Republicans will do much better! _E_
In a clumsy move to get out of his anchor babies dilemma where he signed that he would not use the term and now uses it he blamed ASIANS _E_
My deepest condolences to the victims of the terrible shooting in Douglas County @DCSheriff and their families. We love our police and law enforcement God Bless them all! #LESM _E_
I will be interviewed on @foxandfriends at 7:00 30 minutes. Some very interesting topics. _E_
.@acuconservative's #CPACCO kept up the momentum from the debate. @MittRomney even made a surprise appearance. Now go win CO! _E_
Why did @BarackObama liberate Libya and do nothing for the Iranian protestors? Iran is a threat to our national security. _E_
I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
"I've found that people work harder when they are held accountable and their confidence rises along with that." – Midas Touch _E_
It is now a FACT that President Obama lied in order to get ObamaCare passed that is fraud and the legislation should be recinded INTERESTING _E_
Flying to New Hampshire to keynote the @LoebSchool First Amendment Award Ceremony. Always great to visit the Granite State! _E_
The RNC which is probably not on my side just illegally put out a fundraising notice saying Trump wants you to contribute to the RNC. _E_
Our country has become so politically correct that it has lost all sense of direction or purpose. We are unable to move forward painlessly. _E_
I predict that President Obama will at some point attack Iran in order to save face! _E_
.@GOP congress needs to actually defund ObamaCare not waste time passing non binding resolutions. _E_
I have raised/given a tremendous amount of money to our great VETERANS and have got nothing but bad publicity for doing so. Watch! _E_
So sad that Burt Reynolds has lost all of his money. I wish he came to me for advice he would be rich as hell! _E_
#CrookedHillary sending U.S. intelligence info. to Podesta's hacked email is 'unquestionably an OPSEC violation' __HTTP__ _E_
...do the typical political thing and BLAME. The fact is ObamaCare was a lie from the beginning. Keep you doctor keep your plan! It is.... _E_
The House of Representatives seeks contempt citations(?) against the JusticeDepartment and the FBI for withholding key documents and an FBI witness which could shed light on surveillance of associates of Donald Trump. Big stuff. Deep State. Give this information NOW! @FoxNews _E_
The constant interruptions last night by Tim Kaine should not have been allowed. Mike Pence won big! _E_
Entrepreneurs: See yourself as victorious: Look at the solution not the problem. _E_
Won't be a buyer's market for long. If you can purchase a home but remember I told you this three years ago. _E_
It was my great honor to welcome President @JC_Varela & Mrs. Varela from Panama this afternoon. ... __HTTP__ _E_
Great news that @FoxNews has cancelled the additional debate. How many times can the same people ask the same question? I beat Cruz debating _E_
...hasn't worked agreements violated before the ink was dry makings fools of U.S. negotiators. Sorry but only one thing will work! _E_
"@MissUSA Nia Sanchez of Nevada is ready for @missuniverse" __HTTP__ via @lasvegassun by @Robin_Leach _E_
The weak illegal immigration policies of the Obama Admin. allowed bad MS 13 gangs to form in cities across U.S. We are removing them fast! _E_
RT @SecretService: Our thoughts & prayers are with the families friends & colleagues of #Virginia's @VSPPIO Lt Cullen & Tpr Bates #Charlot... _E_
Sadly Vanity Fair is a rapidly dying magazine. Needs new blood and fast! Going the way of SPY Magazine. _E_
A nation without borders is no nation at all. We must build a wall. Let's Make America Great Again! __HTTP__ _E_
Entrepreneurs: Work on what you will be proud to be associated with. Make your work count. _E_
Trump National Golf Club Los Angeles will be the host in October for the @PGAGrandSlam. __HTTP__ _E_
We deserve all the answers on Benghazi! __HTTP__ @RepWOLFPress _E_
Nice mention by Brian Kelly __HTTP__ Conservative Action Alerts _E_
Why didn't A.G. Sessions replace Acting FBI Director Andrew McCabe a Comey friend who was in charge of Clinton investigation but got.... _E_
"My advice to you regarding momentum is definitive: Get yours going!" – Think Like a Champion _E_
I love watching these poor pathetic people (pundits) on television working so hard and so seriously to try and figure me out. They can't! _E_
Revisionist history. Now Obama claims he never told us that everyone could keep their healthcare plans. Crazy! _E_
Never never never give up. Winston Churchill _E_
My friend Larry King @kingsthings asked me to do an interview with him—he was always great to me—& I agreed. Watch tonight 9 PM on RTV. _E_
There is a good possibility that a person who treated patients in West Africa and who FLEW into New York has Ebola. Touched many bedlam! _E_
I am self funding my campaign & don't owe anybody anything! I only owe it to the American people! #Trump2016Watch: __HTTP__ _E_
.@BarackObama said he doesn't take the Navy Seals campaigning against him too seriously. _E_
The failing @nytimes has been wrong about me from the very beginning. Said I would lose the primaries then the general election. FAKE NEWS! _E_
Will Barack Obama personally read the Boston terrorist his Miranda Rights? _E_
About to begin a rally here in Henderson Nevada. New Reuters poll just out thank you! Join the MOVEMENT:... __HTTP__ _E_
Our worst threat to unemployment is @ObamaCare. It will also destroy our country's basic standards. _E_
Where's the leadership? Obama only met with Sebelius ONCE since ObamaCare passed __HTTP__ His signature legislation... _E_
People don't know that Eliot's father is very rich. Eliot likes to pretend he's poor to appeal to voters. _E_
I will make my final decision on the Paris Accord next week! _E_
Sun Sentinel says: Rubio lacks the experience work ethic and gravitas needed to be president. HE HAS NOT EARNED YOUR VOTE! _E_
Good luck to Derek on his operation. I know it will be a success he is a great champion. _E_
Life is very fragile and success doesn't change that. If anything success makes it more fragile. The Art of the Deal _E_
The Green Party just dropped its recount suit in Pennsylvania and is losing votes in Wisconsin recount. Just a Stein scam to raise money! _E_
Do not settle for remaining in your comfort zone. Being complacent is a good way to get nowhere. Take control and move forward every day. _E_
"The object of war is not to die for your country but to make the other bastard die for his." Gen. George S. Patton _E_
We better be vigilant careful and strong. __HTTP__ _E_
RT @NRA: .@RealDonaldTrump is right. If @HillaryClinton gets to pick her anti #2A #SCOTUS judges there's nothing we can do. #NeverHillary _E_
.@foxandfriends at 7:00 A.M. _E_
In January '12 3 turbines were wrecked in rough weather ... __HTTP__ _E_
.@FoxNews legal analyst & former prosecutor @kimguilfoyle destroyed hack Schneiderman's suit on @FNTheFive yesterday.She's very sharp! _E_
The EPA is caught saying that their philosophy is to crucify oil companies __HTTP__ That will sure lower the price of gas. _E_
Will be doing @seanhannity at 10 PM on @FoxNews. As always with Sean will be interesting! _E_
RT @realDonaldTrump: Unemployment is down to 4.1% lowest in 17 years. 1.5 million new jobs created since I took office. Highest stock Mark... _E_
South Carolina rally last night was so unbelievably exciting (and fun). I am now off to Iowa for two big rallies packed houses. Love it! _E_
Thank you very much for the nice story I greatly appreciate it __HTTP__ _E_
We need a tax system that is FAIR to working FAMILIES & that encourages companies to STAY in AMERICA GROW in AMERICA and HIRE in AMERICA! __HTTP__ _E_
Sarah Jessica Parker voted "unsexiest woman alive" – I agree. She said "it's beneath me to comment on the... __HTTP__ _E_
Thank you Naples Florida! Get out and VOTE #TrumpPence16 on 11/8. Lets #MakeAmericaGreatAgain! Full Naples rally... __HTTP__ _E_
Greece's financial calamity should serve as a warning. @BarackObama's massive deficit spending is unsustainable. _E_
Home of the 2022 @PGAChampionship Trump Nat'l Bedminster features 36 holes designed by famed architect Tom Fazio __HTTP__ _E_
#ImWithYou #AmericaFirst __HTTP__ _E_
If Vera Coking had taken my millions of $'s like she should have she would have lived for many years in Palm Beach Florida. _E_
Our Q1 GDP was 2.9%. Worst in memory ObamaCare killing jobs stopping growth and making small business insecure. _E_
Just arrived in Italy for the G7. Trip has been very successful. We made and saved the USA many billions of dollars and millions of jobs. _E_
.@KieranLalor I created far more jobs and success in Dutchess than you you should be Fired. _E_
In other words our military has a very big problem! _E_
Why is lightweight A.G. Eric Schneiderman allowed to ask for campaign contributions from my people during settlement negotiations? _E_
Scary Obama's budget deficits are so out of control that he has to borrow 40 cents on every dollar he spends. _E_
Via @KCRG by @markwcarlson: "Donald Trump stops in Coralville" __HTTP__ _E_
Lightweight @JebBush is spending a fortune of special interest against me in SC. False advertising desperate and sad! _E_
How amazing the State Health Director who verified copies of Obama's "birth certificate" died in plane crash today. All others lived _E_
American Exceptionalism and the Navy Yard shooting do not go hand in hand. Foreign countries in particular Russia are mocking the U.S. _E_
Oil should not cost more than $40 a barrel. Ideally it should be $25. Cheap to produce and we protect the OPEC countries. _E_
Wing bangers the name given to wind turbines by bird lovers for the thousands of birds they kill in the U.S. _E_
Good Morning America weather headline for U.S. NEVER ENDING COLD _E_
Spoke to Jerry Jones of the Dallas Cowboys yesterday. Jerry is a winner who knows how to get things done. Players will stand for Country! _E_
Just landed in Paris France with @FLOTUS Melania. __HTTP__ _E_
I love Mexico but not the unfair trade deals that the US so stupidly makes with them. Really bad for US jobs only good for Mexico. _E_
Hope everyone is watching the Finale rerun of Celebrity Apprentice on CNBC especially the haters and losers! It is on right now. _E_
I am in Iowa. Will be making two speeches today. Good luck to all of the great folks on the East coast. Enjoy the beauty of the storm! _E_
My @SquawkCNBC interview discussing why the Fed shouldn't do a QE3 @BarackObama's college records & 2012 election __HTTP__ _E_
My condolences to those involved in today's horrible accident in NJ and my deepest gratitude to all of the amazing first responders. _E_
Was the brother of John Podesta paid big money to get the sanctions on Russia lifted? Did Hillary know? _E_
Good response on jobs by @MittRomney. _E_
In addition to doing a lousy job in taking care of our Vets John McCain let us down by losing to Barack Obama in his run for President! _E_
Lightweight @AGSchneiderman is pushing for the Moreland Commission to be disbanded immediately—because he is being looked at! _E_
The ObamaCare website is in the news again it is turning out to cost even more than previously thought AND IT DOESN'T WORK! Big trouble! _E_
It's Monday. How many fundraisers will Obama hold today? _E_
"The only place success comes before work is in the dictionary." – Vince Lombardi _E_
The person that Hillary Clinton least wants to run against is by far me. It will be the largest voter turnout ever she will be swamped! _E_
I thought and felt I would win big easily over the fabled 270 (306). When they cancelled fireworks they knew and so did I. _E_
Hillary Clinton strongly stated that there was absolutely no connection between her private work and that of The State Department. LIE! _E_
The polls are now showing that I am the best to win the GENERAL ELECTION. States that are never in play for Repubs will be won by me. Great! _E_
I look forward to tonight's debate but look far more forward to making America great again. It can happen! _E_
THEBillMcGee @realDonaldTrump after a year of wear your shirts still look great! Glad I made the purchase! Thank you. _E_
The Trump Organization Finalizes Purchase of Legendary Turnberry Resort in Scotland. It's absolutely... __HTTP__ _E_
THANK YOU Baton Rouge Louisiana! WE will #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_
We are now at 1001 delegates. We will win on the first ballot and are not wasting time and effort on other ballots because system is rigged! _E_
The Consumer Financial Protection Bureau or CFPB has been a total disaster as run by the previous Administrations pick. Financial Institutions have been devastated and unable to properly serve the public. We will bring it back to life! _E_
The Unaffordable Care Act will soon be history! _E_
...big dollars ($700000) for his wife's political run from Hillary Clinton and her representatives. Drain the Swamp! _E_
Can you envision Jeb Bush or Hillary Clinton negotiating with 'El Chapo' the Mexican drug lord who escaped from prison? .... _E_
.@timkaine oversaw unemployment INCREASE by 179249 while @mike_pence DECREASED unemployment in Indiana by 113826.... __HTTP__ _E_
Both of our New York hotels are on the Top Ten list of the most luxurious hotels in NYC... __HTTP__ Congrats to all! _E_
WSJ covers Ride of Fame __HTTP__ _E_
Just made the point at #NCGOPcon that we have to protect our border & I think everyone here knows nobody can build a wall like Trump! _E_
Obama was very disloyal to Wisconsin Democrats. @BarackObama never showed up to help them even though he (cont) __HTTP__ _E_
I find it really hard to listen to @BarackObama's speeches. He doesn't have a clue. _E_
In the ridiculous @JebBush ad about me Jeb is speaking to me during the debate but doesn't allow my answer which destroys him SO SAD! _E_
.@mcuban Mark okay with me but don't start your bullshit again! _E_
Change before you have to. Jack Welch _E_
My interview yesterday with @MyFoxNY __HTTP__ _E_
The Miss Universe Pageant will be on August 23 (9 11 p.m. on NBC ET) with Bret Michaels and Natalie Morales to co host live from Las Vegas _E_
Look great for Thanksgiving. Trump Signature Collection exclusively available @Macys offers top men's styles __HTTP__ _E_
I'm in Moscow for Miss Universe tonight picking a winner is very hard they are all winners. Total sellout of arena. Big night in Russia! _E_
.@THEGaryBusey as Project Manager... is Team Power in trouble?? #CelebApprentice _E_
First Obama says Egypt is not an ally. Then he promises to keep handing over aid __HTTP__ Incompetent and unqualified. _E_
#MakeAmericaGreatAgain I will be in Cedar RapidsIA this Saturday. Get your tickets __HTTP__ _E_
There's definitely no love lost between Piers and Omarosa. _E_
.@KyleStephens30 #asktrump __HTTP__ _E_
.@MannyPacquiao and friends at @TrumpDoral __HTTP__ _E_
Our country is in a major crisis of incompetent leadership. We cannot continue to go on with these politicians who do nothing but talk. _E_
Gov. Cuomo's Moreland Comm should be looking at AG Schneiderman shaking down those under investigation/ in litigation for campaign $$$ _E_
.@RollingStone admitted their scam. Phony @HuffingtonPost and others are no better total joke! _E_
The Fed is destroying the dollar. When inflation hits the economy then even more jobs will go overseas. _E_
Everyone is talking about how Trump Tower is the exterior for Wayne Enterprises in Dark Knight Rises it's true. __HTTP__ _E_
Mark Udall was the deciding vote for ObamaCare & now 250000 Coloradans were dropped from their plans. Vote @CoryGardner! _E_
#ObamacareFail #HillarycareFail __HTTP__ _E_
Jeb why did your brother attack and destabilize the Middle East by attacking Iraq when there were no weapons of mass destruction? Bad info? _E_
Entrepreneurs: Be a cautious optimist. I call it positive thinking with a lot of reality checks. _E_
The only place where our border is protected is from Europeans. We educate them in our finest institutions & then have them deported. _E_
Thank you Vermont! #VoteTrumpVT __HTTP__ _E_
Thank you High Point NC! I will fight for every neglected part of this nation & I will fight to bring us together... __HTTP__ _E_
A tough negotiator can make the Chinese back off. We've done it before. #TimeToGetTough __HTTP__ __HTTP__ _E_
Continued success is built on building a brand people know will deliver. Unless you're @KarlRove. Then you just blame the Tea Party. _E_
Via @Newsmax_Media by @JAGERFILE: "Donald Trump: 'Morally Unfair' to Use Soldiers in Ebola Fight" __HTTP__ _E_
THANK YOU! #VoteTrump __HTTP__ _E_
Entrepreneurs: See yourself as victorious. Look at the solution not the problem. Be tough be strong be tenacious. _E_
Charlie Hebdo reminds me of the satirical rag magazine Spy that was very dishonest and nasty and went bankrupt. Charlie was also broke! _E_
China is stealing our jobs. We need to demand China stop manipulating its currency and end its rampant corporate espionage. _E_
Our country does not feel 'great already' to the millions of wonderful people living in poverty violence and despair. _E_
Doesn't want to remove Assad worries what comes next. _E_
.@DiamondandSilk Just watched you on #WattersWorld with a large group of people. Everybody loves you two amazing people! #Trump2016 _E_
Watch this great behind the scenes video of @IvankaTrump's Spring 2013 photo shoot __HTTP__ _E_
Join me! #Trump20166/10: Richmond __HTTP__ Tampa __HTTP__ Pittsburgh __HTTP__ _E_
The reporter that called Kevin Durant Mr. Unreliable should be fired or at least apologize. He is a truly great player and a winner! _E_
Anytime you see someone talking about celebrity weight loss on my twitter it is a total scam! _E_
I appreciate the kind words of Mike Huckabee a fine American __HTTP__ _E_
The #2A to our Constitution is clear. The right of the people to keep & bear Arms shall not be infringed upon. __HTTP__ _E_
Those who lack courage will always find a philosophy to justify it. Albert Camus _E_
Today I signed an Executive Order @ the U.S. Dept. of @Interior: 'Review of Designations Under the Antiquities Act... __HTTP__ _E_
Is President Obama trying to destroy Israel with all his bad moves? Think about it and let me know! _E_
Watch the 63rd Annual @MissUniverse Pageant tomorrow on NBC at 8PM! __HTTP__ _E_
Thank you Wayne Root we will #MakeAmericaGreatAgain! __HTTP__ _E_
Congratulations to Gabby Douglas on winning the Gold for the USA in gymnastics. She is terrific! _E_
Very sad & dangerous that soon to be ex Intelligence Chair Dianne Feinstein released the CIA report. Glad she is losing her Comm. Chair. _E_
The National Border Patrol Council (NBPC) said that our open border is the biggest physical & economic threat facing the American people! _E_
The 2013 @NJPGA Course of the Year Trump Nat'l Bedminster is honored to be hosting the 2022 @PGAChampionship __HTTP__ _E_
I just don't know why some of these NFL teams with lousy quarterbacks don't give Tim Tebow a chance what do they have to lose? _E_
When you're "hot" the lowlifes really shoot at you... and they try hitting from every angle! Never let the bastards get you down. _E_
UK is freezing through longest & coldest winter in over 50 years __HTTP__ Where's the global warming? @gatewaypundit _E_
Watching the show. #WWEHOF __HTTP__ _E_
On top of the disrespect shown by Russia don't forget they still have Snowden who has given them (& everyone) massive US secrets. _E_
We signed our deal to take over the historic Old Post Office on Pennsylvania Ave. from the U.S. and convert it into super luxury hotel jobs! _E_
... I will soon start naming magazines that I think will fold I predicted Newsweek. _E_
Libya is adopting a more radical form of Sharia Law now under their new leadership. Is this what @BarackObama wanted? _E_
Lucky for New York highly respected John Cahill is running for NY State AG against incumbent lightweight dope @AGSchneiderman @CahillForAG _E_
If you read my last number of tweets only one opinion can be formed that our President and therefore leader is grossly incompetent! _E_
The golden rule of negotiation: He who has the gold makes the rules. _E_
When will Sleepy Eyes Chuck Todd and @NBCNews start talking about the Obama SURVEILLANCE SCANDAL and stop with the Fake Trump/Russia story? _E_
.@KirschnerDavid @realDonaldTrump Congrats Mr. Trump on making @Forbes list of wealthiest in the world. Thanks! _E_
...expect the country to be further downgraded in the future. The rich are all leaving! _E_
I am being investigated for firing the FBI Director by the man who told me to fire the FBI Director! Witch Hunt _E_
The WGC @CadillacChamp leadership board is available here: __HTTP__ @DoralResort _E_
All those politicians in Washington and not one good negotiator. _E_
China is not our friend. They are not our ally. They want to overtake us and if we don't get smart and tough soon they will. _E_
As bad as Qaddafi was what comes next in Libya will be worse just watch. _E_
Bill Kristol has been wrong for 2yrs an embarrassed loser but if the GOP can't control their own then they are not a party. Be tough R's! _E_
Thank you Maine New Hampshire and Iowa. The waiting is OVER! The time for change is NOW! We are going to... __HTTP__ _E_
.@SharkGregNorman @Trump_Charlotte Looking great love the improvements to the buildings and grounds! not to mention course Thank you. _E_
The Islamists are taking over Egypt through the election. __HTTP__ Why did @BarackObama force Mubarak out? He was an ally. _E_
When someone attacks me I always attack back...except 100x more. This has nothing to do with a tirade but rather a way of life! _E_
Graydon Carter whose reign over failing @VanityFair has been a disaster has acted in two movies both bombed & got bad reviews. _E_
The @BarackObama recovery US unemployment is 9.1% US underemployment is 19.1% __HTTP__ Businesses won't hire under Obama. _E_
For many years our country has been divided angry and untrusting. Many say it will never change the hatred is too deep. IT WILL CHANGE!!!! _E_
Libyan Rebels should have given us 50% of the oil in return for our military support we don't even ask! _E_
#CelebApprentice I will be live tweeting(no spoilers) during tonight's all new @ApprenticeNBC at 8PM ET. _E_
Alabama will shine tomorrow. It will be a big and glorious day! _E_
.@stuartpstevens made some of the dumbest political decisions of all time in helping Romney to get destroyed by Obama. Should have won! _E_
.@Toure when you are fired from MSNBC for your bad ratings and racist coverage stop by and say hello. _E_
Thank you @SenatorSessions!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
I see where Mayor Stephanie Rawlings Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! _E_
I see @FLGovScott poll numbers are improving. Good man doing a good job. _E_
We must stop outsourcing our jobs overseas and end our multi billion dollar trade deficits. _E_
Wow in the new CBS Poll I went way up into the forties! Thank you! _E_
I will be making a big speech tomorrow to discuss the failed policies and bad judgment of Crooked Hillary Clinton. _E_
Incredible progress at @trumptowerpde – Punta del Este Uruguay the views are going to be fantastic! __HTTP__ _E_
Via @WalidShoebat: "Watch Donald Trump: He Is Patriotic And He Can Fix America" __HTTP__ _E_
Remember as a senator Obama did not vote for increasing the debt ceiling __HTTP__ I guess things change when President?! _E_
California gas prices going thru the roof others to follow. An election losing event for Obama. _E_
...a real loser named Tim O'Brien and it's never recovered. _E_
Naghmeh Abedini the lovely wife of the Christian Pastor Saeed being held in an Iranian jail just left my office. #savesaeed _E_
Will be on @jimmykimmel in 20 minutes on @ABC. #Kimmel #Trump2016 #MakeAmericaGreatAgain _E_
The United States must immediately institute strong travel restrictions or Ebola will be all over the United States a plague like no other! _E_
Democrats jeopardizing the safety of our troops to bail out their donors from insurance companies. It is time to put #AmericaFirst _E_
Do you think the three UCLA Basketball Players will say thank you President Trump? They were headed for 10 years in jail! _E_
#MadeInAmerica📸 __HTTP__ __HTTP__ _E_
A great day in Puerto Rico yesterday. While some of the news coverage is Fake most showed great warmth and friendship. _E_
A lot of comments re @MELANIATRUMP vs. Milania last week. I think spelling has taken on a new significance. #CelebApprentice _E_
I won the debate if you decide without watching the totally one sided spin that followed. This despite the really bad microphone. _E_
It's amazing that people can say such bad things about me but if I say bad things about them it becomes a national incident. _E_
My thoughts and prayers are with the @KissimmeePolice and their loved ones. We are with you!#LESM _E_
#PeaceOfficersMemorialDay and#PoliceWeek Proclamation: __HTTP__ __HTTP__ _E_
Trump right: Illegal families crossing border set to double 51152 so far __HTTP__ _E_
Joe thanks for not running! __HTTP__ _E_
A sneak peek at Sunday's episode of The Celebrity Apprentice... __HTTP__ #trumpvlog _E_
.@KarlRove is a biased dope who wrote falsely about me re China and TPP. This moron wasted $430 million on political campaigns and lost 100% _E_
Via @peoplemag by @amandamichl: "@IvankaTrump: @Joan_RiversWas 'Very Warm' During Appearance on @ApprenticeNBC" __HTTP__ _E_
"Don't toss off your problems and don't dwell on them either. Deal with them!" – Think Like a Champion _E_
Via David Ebner re Stanley Cup & Trump poster: "If you're going to be thinking anything you might as well think big" __HTTP__ _E_
One of @GolfWorldUS top private clubs @TrumpNationalNY features a Jim Fazio designed 7291 yd par 72 course __HTTP__ _E_
.@ritter1025 Wishing your wife a Happy Birthday _E_
Standing room only in Mason City Iowa! Thanks to the record crowd of over 400 supporters! __HTTP__ _E_
.@karlrove's ad is the best thing that ever happened to Ashley Judd—simply increases her profile. _E_
I was on @SquawkBox this morning __HTTP__ _E_
I would like to wish everyone including all haters and losers (of which sadly there are many) a truly happy and enjoyable Memorial Day! _E_
It's important that we help poor people to become independent self sufficient individuals who gain the benefits of work. #TimeToGetTough _E_
He @newtgingrich is sounding more and more like a real team player...he is a really good guy! _E_
.... I only respond to people that register more than 1% in the polls. I never thought he had a chance and I've been proven right. _E_
Just got back from Wisconsin. Great day great people! _E_
Karzai of Afghanistan is not sticking with our signed agreement. They are dropping us like dopes. Get out now and re build U.S.! _E_
China's top academics are working w/ PLA in cyber espionage of our state secrets & R&D __HTTP__ They are laughing at us! _E_
Congratulations to @DavidWright of the #Mets. What a great season he is having batting over 400 and clutch hitting. Also a fantastic guy. _E_
Why the nation's debt keeps growing a Dept of Agriculture employee made over $242K with a $63K bonus __HTTP__ Ridiculous. _E_
If Karl Rove & @GOP Establishment continue to attack the Tea Party who delivered in 2010 then there will be a 3rd Party in 2016. _E_
Will be on Fox & Friends at 7.00 this morning ENJOY! _E_
On Jimmy Fallon tonight. _E_
Remember when you vote Obamacare is a disaster! _E_
Get ready this should be informative and fun! #VPDebate _E_
Trump Int'l Golf Club Turnberry Scotland. A legendary course ... and rightly so. __HTTP__ _E_
Golf is a brain game & is a great way to improve your business skills. Concentration assessment technique & passion...it's all there. _E_
The Fed continues to recklessly flood the market with dollars. This will eventually create record inflation. It has to stop. #TimeToGetTough _E_
What a night! 10000 amazing supporters in Greenville South Carolina! THANK YOU!VOTE on Saturday! #VoteTrumpSC __HTTP__ _E_
Had @SenScottBrown asked me to do a robo call for him I would have done it and he would have won. _E_
Haim Saban: Hillary Clinton's Top Hollywood Donor Demands Racial Profiling of Muslims __HTTP__ _E_
A letter to @CNN President Jeff Zucker __HTTP__ _E_
It would be really nice if the Fake News Media would report the virtually unprecedented Stock Market growth since the election.Need tax cuts _E_
Thanks Dave! __HTTP__ _E_
Jay Sekulow on @foxandfriends now. _E_
Thrilled to hear that @RakutenTravelJP has awarded @TrumpWaikiki the 'Rakuten Diamond Award' for the 4th consecutive year! Congrats! _E_
Here's the deal: when your secretary of defense tells you that your proposed cuts will erode America's military (cont) __HTTP__ _E_
Entrepreneurs: Remember the golden rule of negotiating he who has the gold makes the rules. _E_
RT @realDonaldTrump: Thank you to our GREAT Military/Veterans and @PacificCommand.Remember #PearlHarbor. Remember the @USSArizona!A day... _E_
It's springtime and it just started snowing in NYC. What is going on with global warming? _E_
Many NATO countries have agreed to step up payments considerably as they should. Money is beginning to pour in NATO will be much stronger. _E_
Made additional remarks on Charlottesville and realize once again that the #Fake News Media will never be satisfied...truly bad people! _E_
The lobbyist and political hack that President Obama just appointed as the Ebola Czar just missed his first major meeting on Ebola A joke _E_
I was on The View this morning. We talked about The Apprentice. Tonight's episode is a great one tough exciting and surprising. 10 pm/NBC _E_
Today it was my pleasure and great honor to announce my nomination of Jerome Powell to be the next Chairman of the @FederalReserve. __HTTP__ _E_
Via @TheStreet by @swan_investor: Trump Tees Up Another 'Hole in One' in Scotland __HTTP__ _E_
Great defense by the @nyjets this weekend—congratulations to @woodyjohnson4—only 6 points allowed! _E_
wanting to sell their product cars A.C. units etc. back across the border. This tax will make leaving financially difficult but..... _E_
Some day when things calm down I'll tell the real story of @JoeNBC and his very insecure long time girlfriend @morningmika. Two clowns! _E_
.@SheriffClarke Great insight in dealing with the media today. You are a wonderful representative of calm and reason a real pro! _E_
Donald Trump's commercial free WWE Raw does big rating: __HTTP__ _E_
.@antbaxter I tried watching but fell asleep. _E_
There are many editorial writers that are good some great & some bad. But the least talented of all is frumpy Gail Collins of NYTimes. _E_
Hillary Clinton may be the most corrupt person ever to seek the presidency. Donald J. Trump _E_
Great news! Thank you Governor Ralph DLG Torres! #Trump2016 __HTTP__ _E_
Thank you to our amazing Wounded Warriors for their service. It was an honor to be with them tonight in D.C.... __HTTP__ _E_
Shirley B did a very good job singing Goldfinger! Not easy. _E_
So I have spent almost nothing on my run for president and am in 1st place. Jeb Bush has spent $59 million & done. Run country my way! _E_
One of the most accurate polls last time around. But #FakeNews likes to say we're in the 30's. They are wrong. Some people think numbers could be in the 50's. Together WE will MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Do you think I made the right decision? #CelebApprentice _E_
"To succeed one must be creative and persistent." John H. Johnson _E_
You can be an @nfl player with murder charges and not be suspended. Yet with NO EVIDENCE @nfl targeted Tom Brady. B.S.! _E_
Obama is finally stopping the Chinese from buying something in America – windfarms __HTTP__ What a joke! _E_
Many meetings today in Bedminster including with Secretary Linda M and Small Business. Job numbers are looking great! _E_
Learn work and think in equal proportions and you'll be going in the right direction. _E_
Here we go again via @timesunion.com __HTTP__ ... another bad deal. _E_
The first ever All Star Celebrity @ApprenticeNBC premieres Sunday March 3rd! __HTTP__ _E_
This is a crossroads in the history of our civilization that will determine whether or not We The People reclaim c... __HTTP__ _E_
I am deeply committed to preserving our strong relationship & to strengthening America's long standing support for... __HTTP__ _E_
To be yourself in a world that is constantly trying to make you something else is the greatest accomplishment. Ralph Waldo Emerson _E_
Trump Tower at Century City brings luxury to Makati the financial & social capital of Philippines __HTTP__ _E_
It is finally happening for our great clean coal miners! __HTTP__ _E_
Things will work out fine between the U.S.A. and Russia. At the right time everyone will come to their senses & there will be lasting peace! _E_
Amazing that while I lead by big numbers in the new Q and USA Today polls the press only wants to report on the phony WSJ/NBC poll. _E_
... will happen when you go against the tide when you take a risk and it works. Think Big _E_
"Failure is simply the opportunity to begin again this time more intelligently." Henry Ford _E_
.... to do The Apprentice but I approved you anyway. Without my show you'd be nothing! _E_
...design or negotiations yet. When I do just like with the F 35 FighterJet or the Air Force One Program price will come WAY DOWN! _E_
Thank you @FrankLuntz for saying I was a winner tonight. It is my great honor. #Trump2016 _E_
Jeb has been confused for forty years __HTTP__ _E_
Good news. Voters give @MittRomney the edge over @BarackObama on handling the economy according to @gallupnews __HTTP__ _E_
I will be on Fox & Friends at 7 A.M. 10 minutes. Much to talk about enjoy! _E_
THANK YOU @MayorGimenez for following the RULE OF LAW! Sanctuary cities make our country LESS SAFE! Full remarks: __HTTP__ __HTTP__ _E_
RT @TeamTrump: RT if you agree @HillaryClinton & @timkaine are WRONG for America! #VPDebate #MAGA __HTTP__ _E_
I am going to keep our jobs in the U.S. and totally rebuild our crumbling infrastructure. Crooked Hillary has no clue! @Teamsters _E_
New polls are good because the media has deceived the public by putting women front and center with made up stories and lies and got caught _E_
#FlashbackFriday #CrookedHillary __HTTP__ _E_
Why would anyone in Kentucky listen to failed presidential candidate Rand Paul re: caucus. Made a fool of himself (1%.)KY his 2nd choice! _E_
The Donald J. Trump Signature Collection exclusively available @Macys offers the finest style in menswear __HTTP__ _E_
The shale boom is saving our economy __HTTP__ Good for jobs national security & trade balance. Frack Now & Frack Fast! _E_
...In other words Secy John Kerry is so out of his element... _E_
'Presidential Executive Order on Identifying and Reducing Tax Regulatory Burdens' Executive Order:... __HTTP__ _E_
THE HARDER YOU WORK THE LUCKIER YOU GET! _E_
I am really beginning to respect Mark Halperin and John Heilemann as political reporters they truly get why Trump poll numbers are high. _E_
Incompetent Hillary despite the horrible attack in Brussels today wants borders to be weak and open and let the Muslims flow in. No way! _E_
China has a business tax rate of 15%. We should do everything possible to match them in order to win with our economy. Jobs and wages! _E_
The $85B sequester is just 2% of Obama's $3.5T record deficit spending budget. Our leaders are ruining our children's future. _E_
Should I do the #GOPdebate? __HTTP__ _E_
"Success depends...on how effectively you learn to manage the game's two ultimate adversaries: the course and yourself." @jacknicklaus _E_
All because of me people don't care about you Cher. @cher My week on twitter 1k retweets 29 new listings 15k new followers 2k mentions. _E_
In analyzing the Alabama Primary race Fake News always fails to mention that the candidate I endorsed went up MANY points after Election! _E_
Barack Obama has everything to gain. Why would anyone ever deny $5M to charity? _E_
Sorry there is no STAR on the stage tonight! _E_
"I succeeded by saying what everyone else is thinking." @Joan_Rivers _E_
Our law enforcement officers deserve our appreciation for the incredible job they do. Video: __HTTP__ __HTTP__ _E_
"It's not that I'm so smart it's just that I stay with problems longer." Albert Einstein _E_
I am a registered Republican. __HTTP__ With @MittRomney as the nominee we can defeat @BarackObama. _E_
Democrats don't want massive tax cuts how does that win elections? Great reviews for Tax Cut and Reform Bill. _E_
Our country's debt crisis cannot be solved by tax increases. We must cut government spending. _E_
Central America's tallest building @TrumpPanama's sleek design evokes a majestic sail fully deployed in the wind __HTTP__ _E_
Don't let them build a wind turbine in your backyard (or near your house). It will destroy your property value. _E_
More poll results from last nights Commander in Chief Forum. #AmericaFirst #TrumpTrain __HTTP__ _E_
Appreciate the congrats for being right on radical Islamic terrorism I don't want congrats I want toughness & vigilance. We must be smart! _E_
Great to see @MittRomney being well received in Poland __HTTP__ The Poles understand the value of freedom through strength _E_
Dems had a very good and professional convention. The Republicans must be smart and tough and fast! _E_
So many people who know nothing about me are commenting all over T.V. and the media as though they have great D.J.T. insight. Know NOTHING! _E_
California shooting looks very bad. Good luck to law enforcement and God bless. This is when our police are so appreciated! _E_
"If you like your healthcare plan you can keep it." = "I was born in Hawaii." _E_
Another false story this time in the Failing @nytimes that I watch 4 8 hours of television a day Wrong! Also I seldom if ever watch CNN or MSNBC both of which I consider Fake News. I never watch Don Lemon who I once called the "dumbest man on television!" Bad Reporting. _E_
I win awards for speaking but the enemies either won't comment or will say only bad...leave Clint alone! _E_
Thank you @EricTrump! __HTTP__ _E_
No investor would be stupid enough to pour their money into the bottomless Vattenfall pit. They totally gave up __HTTP__ _E_
'It's just a 2 point race Clinton 38% Trump 36%' __HTTP__ _E_
The American work ethic is what led generations of Americans to create our once prosperous nation. (cont) __HTTP__ _E_
He @MittRomney had another impressive win last night in Illinois. His delegate lead is insurmountable. It is (cont) __HTTP__ _E_
.@TrumpChicago is the Windy City's sole skyscraper to feature a 4 star hotel 4 star restaurant & spa __HTTP__ _E_
Radical Islamic Terrorism must be stopped by whatever means necessary! The courts must give us back our protective rights. Have to be tough! _E_
Over $1T in annual deficit spending and adding over $6T to the debt for what? May jobless numbers are horrendous. The great Obama recovery. _E_
A working dinner tonight with Prime Minister Abe of Japan and his representatives at the Winter White House (Mar a Lago). Very good talks! _E_
Where were all the @VanityFair exposes on When Rev. Wright disciples go to Washington? Sad! _E_
Jonah Goldberg @JonahNRO of the once great @NRO #National Review is truly dumb as a rock. Why does @BretBaier put this dummy on his show? _E_
Shoplifting is a very big deal in China as it should be (5 10 years in jail) but not to father LaVar. Should have gotten his son out during my next trip to China instead. China told them why they were released. Very ungrateful! _E_
I hope everyone had a great Memorial Day! _E_
Americans never quit. General Douglas MacArthur _E_
I'll be going to the Old Post Office Building on Pennsylvania Avenue in D.C. today. Will create one of world's great hotels. Lots of jobs! _E_
7 of 10 Americans prefer 'Merry Christmas' over 'Happy Holidays' __HTTP__ No surprise. _E_
I wish tonight's debate would cover more than foreign policy. _E_
RT @RightlyNews: What's a high priced Clinton attorney doing representing a low level IT staffer for the Democrats? @jessebwatters on t... _E_
Check out Trump International Hotel & Tower New York spectacular! __HTTP__ _E_
If you're interested in 'balancing' work and pleasure stop trying to balance them. Instead make your work more pleasurable. _E_
The Miss Universe Pageant raked in some great ratings! A great job by everyone. _E_
...Senators should focus their energies on ISIS illegal immigration and border security instead of always looking to start World War III. _E_
One of @GolfWorldUS' top public courses @TrumpGolfLA's course stands as a testament to the greatness of golf __HTTP__ _E_
Read this @BarackObama's birth certificate cannot survive judicial scrutiny because of phantom numbers __HTTP__ _E_
Looks like yet another terrorist attack. Airplane departed from Paris. When will we get tough smart and vigilant? Great hate and sickness! _E_
"It's often necessary to boast but it's even better if others do it for you." – Think Like A Billionaire _E_
Great new poll. Thank you Texas! #VoteTrump #MakeAmericaGreatAgain __HTTP__ _E_
"On 1/20 the day Trump was inaugurated an estimated 35000 ISIS fighters held approx 17500 square miles of territory in both Iraq and Syria. As of 12/21 the U.S. military estimates the remaining 1000 or so fighters occupy roughly 1900 square miles..." via @jamiejmcintyre __HTTP__ _E_
Statement on House Passage of Kate's Law and No Sanctuary for Criminals Act. __HTTP__ _E_
ICYMI you can watch my full press conference with @SteveKingIA on @shanevanderhart's @CaffThoughts __HTTP__ _E_
"Even such traits as who makes the most eye contact in conversation can be an indication of who seeks to dominate." Think Like A Billionaire _E_
What a great day it was yesterday showing the public Trump Links at Ferry Point. I took over a disaster and made it GREAT! Good job to all! _E_
Via @thestate by @andyshain: "Donald Trump joins other 2016 prospects speaking at SC Tea Party Convention" __HTTP__ _E_
...Bad decisions can be devastating. _E_
"In N.H. Trump says his business experience would play well in government" __HTTP__ via @ConMonitorNews by @AP _E_
My prayers and condolences to the families of the victims of the terrible Florida shooting. No child teacher or anyone else should ever feel unsafe in an American school. _E_
A vote to CUT TAXES is a vote to PUT AMERICA FIRST. It is time to take care of OUR WORKERS to protect OUR COMMUNITIES and to REBUILD OUR GREAT COUNTRY! __HTTP__ __HTTP__ _E_
Under @MittRomney Bain had an 80% success rate with annual returns of over 50%. Under @BarackObama America has added over $6T in debt. _E_
Obamacare has to be killed now before it grows into an even bigger mess as it inevitably will. #TimeToGetTough _E_
Trump: Something 'mentally wrong' with Weiner __HTTP__ via @hilltube by @DanielStrauss4 _E_
They changed the name global warming to climate change because the concept of global warming just wasn't working! _E_
New National Rasmussen Poll: __HTTP__ _E_
"Winning takes talent to repeat takes character." John Wooden _E_
Just as I have been predicting for years Iraq will fall to the people that hate the U.S. the most just outside of Baghdad. Keep the oil _E_
Crazy Joe Scarborough and dumb as a rock Mika are not bad people but their low rated show is dominated by their NBC bosses. Too bad! _E_
What people don't know about Kasich he was a managing partner of the horrendous Lehman Brothers when it totally destroyed the economy! _E_
I have not heard any of the pundits or commentators discussing the fact that I spent FAR LESS MONEY on the win than Hillary on the loss! _E_
Good morning Wisconsin! The polls are now open! #VoteTrump today & we will MakeAmericaGreatAgain! __HTTP__ _E_
Great decision by @SpeakerBoehner in placing @TGowdySC as chairman of the Benghazi select committee. Gowdy is a seasoned prosecutor. _E_
"Discovery breeds discovery as in success breeds success. Questions are thoughts with a quest." – Think Like a Champion _E_
ICYMI my speech this past Monday at the South Carolina Tea Party Convention in Myrtle Beach __HTTP__ #SCTeaParty15 _E_
Benghazi is now a full blown training center for jihadists __HTTP__ Congratulations to the Obama administration. _E_
Do you think Crooked Hillary will finally close the deal? If she can't win Kentucky she should drop out of race. System rigged! _E_
Congratulations to Adam Scott and all of the folks at Trump National Doral on producing a really great WGC Tournament. Amazing finish! _E_
New York State's lightweight A.G. is driving business & jobs out of N.Y. Look into his past he shouldn't even be allowed to hold office! _E_
"He who defends everywhere defends nowhere." – Sun Tzu _E_
When people find out how bad a job Scott Walker has done in WI they won't be voting for him. Massive deficit bad jobs forecast a mess. _E_
Join me in Manheim Pennsylvania on Saturday at 7pm! #TrumpRallyTickets: __HTTP__ __HTTP__ _E_
When you have exhausted all possibilities remember this you haven't. Thomas A. Edison _E_
Report raises questions about 'Clinton Cash' from Russians during 'reset' __HTTP__ _E_
#CelebApprentice fans watch today's #trumpvlog __HTTP__ to find out about our new App __HTTP__ _E_
A family in Las Vegas just stopped a violent home invasion by shooting one of the perpetrators the other fled and will be captured. Great! _E_
.....Guy in front asked for picture said he was the biggest fan never saw the guy in back. _E_
I spoke with Fox and Friends today watch here __HTTP__ _E_
The Trump Signature Collection exclusively available @Macys offers high end fashion for men. Dress your best. __HTTP__ _E_
#sweepstweet @lisalampanelli wins $100000 for her charity and that's a nice gift. _E_
The Prayer Breakfast was used by @BarackObama to say that the Bible commands higher income taxes. That's not the way it is! _E_
Today is April 15th Obama's favorite day of the year. T E A. TAXED ENOUGH ALREADY! _E_
The attack on Mosul is turning out to be a total disaster. We gave them months of notice. U.S. is looking so dumb. VOTE TRUMP and WIN AGAIN! _E_
A woman is suing one of my businesses despite the fact that she loved her classes. Our legal system is a mess. Watch __HTTP__ _E_
My daughter @IvankaTrump will be on @Greta tonight at 7pm. Enjoy! __HTTP__ _E_
Great story on @TrumpToronto in @globeandmail about our new Sky and Wellness Suites: __HTTP__ _E_
Mr. President tell Iran to immediately free the CHRISTIAN PASTOR as a sign of good faith & if they refuse break off talks big sanctions _E_
Re Negotiation: Know what you want & think about what the other side wants. Know where they're coming from & do not underestimate them. _E_
Donald Trump CPAC Speech: U.S. Is Run By 'Very Stupid People' __HTTP__ via @HuffPostPol by @elisefoley _E_
For all of the morons who have been complaining about my comment on sexual assault & rape in the military (cont) __HTTP__ _E_
Celebrity Apprentice on in 5 minutes on CNBC it's great! _E_
I think Joe Biden made correct decision for him & his family. Personally I would rather run against Hillary because her record is so bad. _E_
No complaints but how many people would be watching these really dumb but record setting debates if I wasn't in them? Interesting question! _E_
Thank you New York! I love you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
The news about our beautiful Miss Venezuela Monica Spear is devastating to all who knew her. A spectacular woman she will be missed. _E_
Be tough be smart be personable but don't take things personally. That's good business. _E_
.@BenFergusonShow just watched you on @CNN. Thank you for your nice comments. _E_
Congratulations to Bret Michaels the new Celebrity Apprentice. Bret's a true champion all of us were happy to see him and to see him win! _E_
My @foxandfriends int. on Benghazi cover up the ObamaCare mess & firing @TheRealMarilu on @ApprenticeNBC __HTTP__ _E_
Enthusiasm is a vital element in individual success. ― Conrad Hilton _E_
While everyone is waiting and prepared for us to attack Syria maybe we should knock the hell out of Iran and their nuclear capabilities? _E_
HAPPY THANKSGIVING! __HTTP__ _E_
I will be doing @foxandfriends at 8.00 a.m. _E_
Via @TheYBF: "@msvivicafox Attends A Private Screening + Donald Trump DONATES $25K To @peachespulliam'S Kamp Kizzy" __HTTP__ _E_
.@CNBC Titans: Donald Trump' is available to live stream on @netflix and @hulu. Watch! _E_
Main Street is BACK! Strongest Holiday Sales bump since the Great Recession beating forecasts by BILLIONS OF DOLLARS. __HTTP__ _E_
Today it was my honor to welcome President Nursultan Nazarbayev of Kazakhstan to the @WhiteHouse! __HTTP__ _E_
Just returned from Ireland Scotland and Dubai. Amazing trip great places but always good to be back. _E_
.@BarackObama's Super PAC has continually called @MittRomney a murderer __HTTP__ Ironic since Obama is destroying Medicare. _E_
Joy Behar who was fired from her last show for lack of ratings is even worse on @TheView. We love Barbara! _E_
MUST READ – via @IBDinvestors: "VA Scandal Grows As Bonuses Went To Worst Hospitals" __HTTP__ _E_
I believe @BarackObama made a deal with the Saudis to increase oil production until after the election. Then (cont) __HTTP__ _E_
China is expanding its military bases abroad. We must expand our naval fleet. Now is no time for defense cuts. (cont) __HTTP__ _E_
A great honor to spend time with our brave HEROES at the @USMC Air Station Yuma. THANK YOU for your service to the United Staes of America! __HTTP__ _E_
Big win in Montana for Republicans! _E_
Just as I predicted ObamaCare is a complete disaster which is failing on its own. May never be fully implemented. _E_
.@TrumpNationalNY a great place! __HTTP__ @TrumpGolf _E_
I never gave anybody hell! I just told the truth and they thought it was hell. Harry S. Truman _E_
It's Tuesday. How many terrible predictions and advice will Karl 1.6% Rove make today? _E_
Entrepreneurs: Ask yourself: What can I provide that does not yet exist? Be open to new ideas. Be innovative! _E_
Big announcement by Ford today. Major investment to be made in three Michigan plants. Car companies coming back to U.S. JOBS! JOBS! JOBS! _E_
...Never let yourself be pushed around but treat the good folks great. _E_
Loved doing the debate last night on @CNBC. Check out all of the polls! Everyone agrees that Harwood bombed! _E_
War on the families. Price of electricity hit record high in October __HTTP__ Terrible especially during holiday season. _E_
$5 a gallon gas and we have yet to approve the Keystone XL Pipeline. OPEC is laughing at us. _E_
Flashback from October 2013: "Donald Trump demands larger iPhone screen" __HTTP__ You're welcome! Apple listened. _E_
.@HillaryClinton has been a foreign policy DISASTER for the American people. I will #MakeAmericaStrongAgain #Debate... __HTTP__ _E_
Looks like another great day for the Stock Market. Consumer Confidence is at Record High. I guess somebody likes me (my policies)! _E_
The Blue Monster at Trump National Doral. __HTTP__ _E_
The U.S. is going to substantialy reduce taxes and regulations on businesses but any business that leaves our country for another country _E_
I WILL BE ON @foxandfriends AT 7:30 NOW! _E_
They should have got Darrell Hammond as the Donald Trump impersonator. #CelebApprentice _E_
Great article by Chris Ruddy @Newsmax_Media: @AnnDRomney and Jackie's Example __HTTP__ _E_
Watch my speech at CPAC in Washington DC yesterday ... __HTTP__ _E_
Read about how this hotel came into being in my book "Never Give Up"—it's quite a story. #CelebApprentice @TrumpNewYork _E_
Obama's speech on climate change was scary. It will lower our standard of living and raise costs of fuel & food for everyone. _E_
One hit wonder @DannyZuker I notice you are not disputing all of the failures that I said you had. Let's talk about it! _E_
#CrookedHillary "was at center of negotiating $12M commitment from King Mohammed VI of Morocco" to Clinton Fdn. __HTTP__ _E_
My @USATOpinion piece: Trump: I don't need to be lectured __HTTP__ _E_
We should be building up our military and our missile defense systems to their highest levels ever. Must be very strong to prosper & survive _E_
As hard as it is to believe sexting pervert Anthony Weiner is leading in some polls for Mayor of NYC. _E_
.@JordanSpieth Great job you are a true champion! See you soon. _E_
will only get higher. Car companies and others if they want to do business in our country have to start making things here again. WIN! _E_
The time has come to take action to IMPROVE access INCREASE choices and LOWER COSTS for HEALTHCARE! __HTTP__ __HTTP__ _E_
Prayers and condolences to all of the families who are so thoroughly devastated by the horrors we are all watching take place in our country _E_
RT @billoreilly: FNC dominated ratings last night. MSNBC disaster demonstrating folks don't trust the network. __HTTP__ toni... _E_
When Ted Cruz quits the race and the field begins to clear I will get most of his votes no problem! _E_
Perhaps this is the kind of thinking we need in Washington ... __HTTP__ _E_
Wow! Letterman show @Late_Show won the ratings last night big time and guess who was his guest? DJT _E_
Congratulations to Obama and the @DNC. The federal deficit has topped $1T for a fourth year in a row __HTTP__ Nice work! _E_
Winning isn't everything but the will to win is everything. Vince Lombardi _E_
Today will be a great day at work have only one word in mind VICTORY! _E_
Look here's the deal: @BarackObama has been a total disaster. He has spent this country into the ground and destroyed jobs #TimeToGetTough _E_
Dine with The Donald and Mitt __HTTP__ _E_
Thank you @RepReneeEllmers! __HTTP__ __HTTP__ _E_
Will be on @oreillyfactor tonight. Signing a copy of Crippled America for Bill! __HTTP__ _E_
The evidence continues to mount against lightweight @AGSchneiderman. It is time for JCOPE and Moreland Commissions to act. _E_
"Being true to yourself...will give you a lot of power over any negatives thrown your way." – Midas Touch _E_
April is Autism Awareness Month join me in raising awareness get your "Light It Up Blue" sign here! #LIUB __HTTP__ _E_
Howard Stern will do a great job on @America'sGotTalent. He's very smart and really gets what talent is. @HowardStern _E_
Via @AmSpec by @JeffJlpa1: Exclusive: Trump Says Obama Shows 'Total Desperation' on Iran __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
.@antbaxter Only the stupid @BBC would air your garbage—no wonder they are in such deep trouble. _E_
A lot of people have imagination but can't execute you have to execute with the imagination. Donald J. Trump __HTTP__ _E_
Numerous polls have me beating Hillary Clinton. In a race with her voter turnout will be the highest in U.S. history I get most new voters! _E_
RT @mike_pence: History teaches us that weakness arouses evil. America needs to be strong on the world stage. #VPDebate __HTTP__ _E_
.@Morning_Joe is so off on Iowa which I am leading big in new @CNN poll. I will win Iowa. Also I beat Hillary easily! _E_
Well the year has officially begun. I have many stops planned and will be working very hard to win so that we can turn our country around! _E_
.@jessebwatters Watching your show from Arizona where we just had a big rally. It is fantastic everybody loves it!#MakeAmericaGreatAgain _E_
Obama and the Democrats are laughing at the deal they just made...the Republicans got nothing! _E_
Obama's budget spends $2B making our navy ships algae powered __HTTP__ The strong world is laughing at us. _E_
"My view is that not only has Trump been vindicated in the last several weeks about the mishandling of the Dossier and the lies about the Clinton/DNC Dossier it shows that he's been victimized. He's been victimized by the Obama Administration who were using all sorts of....... _E_
I've just released my position papers on The Second Amendment. __HTTP__ _E_
Fast and Furious put semi automatics in the hands of Mexican drug lords that killed Americans @BarackObama should answer all questions. _E_
China has been unfairly subsidizing the export of cars & auto parts. I've been saying this for 3 years... _E_
"@Algemeiner Honors @Joan_Rivers Donald Trump @YuliEdelstein at Second Annual 'Jewish 100′ Gala" __HTTP__ via @Algemeiner _E_
Don Jr. will present the Keynote Address in South Africa on Dec. 1st @TheInvestShow _E_
I look forward to being in Lowell Massachusetts today. I hear a very big crowd is expected we will have lots of fun! _E_
Only 15 days until ObamaCare is implemented. Congress must waive the monstrosity for regular Americans. Why should they be punished? _E_
My sons Don and Eric are on @foxandfriends now 7:35. Great kids enjoy! _E_
Getting the support of @DanaWhite of UFC means a lot. A total winner who has done an amazing job. Just ordered his fight to watch tonight! _E_
Shining over Fifth Avenue @TrumpTowerNY (a NY icon) offers a full service restaurant bar cafe ice cream parlor and Gucci. _E_
"It's a tough game and you never want to take that aspect out of the game." – @NYRangers Stanley Cup Champion Mark Messier _E_
Better off? The $16T US debt works out to $136260 per household a 50% increase since @BarackObama took office. _E_
Tune in tonight at 9 pm on TV One for The Ultimate Merger starring the one and only Omarosa and twelve brave bachelors ... _E_
Tune in for #TrumpTuesday on @SquawkCNBC tomorrow morning. _E_
.@foxandfriends in five minutes. Enjoy! _E_
Sgt.Thamooressi has been held in Mexico for 115 Days. Mexico has zero respect for our border & our servicemen. Boycott! #freeourmarine _E_
The Veterans of our country have been treated like third class citizens for many years... _E_
Many people have said I'm the world's greatest writer of 140 character sentences. _E_
"Build up your weaknesses until they become your strong points." Knute Rockne _E_
My @foxandfriends interview discussing Chuck Hagel nomination Republicans terrible deal making & where we go next __HTTP__ _E_
Interview with David Muir of @ABC News in 10 minutes. Enjoy! _E_
Via @BreitbartNews: GAME ON: TRUMP RESPONDS TO JEB __HTTP__ _E_
Remember save your evening to watch Celebrity Apprentice tonight at 9 increased to a full two hours great episode watch Gary B. _E_
"Age is whatever you think it is. You are as old as you think you are." @MuhammadAli _E_
40 days until the election. Crunch time. @MittRomney must stay on offense and take the fight to Obama. _E_
Announced 3 years ago that Scottish course would close in winter like Kingsbarns and others too cold. _E_
The shirts and ties at Macy's are so good beautiful and do so well that guys like the one that sued me wrongly want a piece l kicked his ass _E_
RT @APCampaign:Trump to Obama: $5 million donation to charity if you release passport and college records __HTTP__ #Election2012 _E_
See you in Arizona on Friday and Saturday. __HTTP__ _E_
Learn more about @TrumpIntRealty's @Mgriffithnyc and some of our spectacular real estate in NYC __HTTP__ _E_
How many more of our soldiers have to be shot by the Afghanis they are training? Let's get the hell out of there and focus on U.S. _E_
Bernie Madoff and Tony La Russa in today's #trumpvlog... __HTTP__ _E_
RT @RealBHorowitz: @VinceMcMahon @realDonaldTrump @WWE My two favorite billionaires! _E_
By @BarackObama's design the middle class will be hit with record taxes under ObamaCare through inflation __HTTP__ REPEAL! _E_
China is going to complete 59 new theme parks by 2020 over $23B in expansion. That would take over 100 years in our country. _E_
Congratulations to @IvankaTrump on being named @FoxNewsSunday Power Player of the Week. Ivanka is doing a great job w/ DC Post Office. _E_
Florida Power & Light has disgusting rotting utility poles outside Doral in Miami. They should put in new ones or will be sued. _E_
Yet another weak hit by a candidate with a failing campaign. Will Jeb sink as low in the polls as the others who have gone after me? _E_
The people of Scotland have spoken—a great decision. I wish @AlexSalmond well & look forward to playing golf with him at Aberdeen! _E_
.@EdGoeas thank you for your support tonight on @JudgeJeanine. _E_
.@FrankLuntz is a total clown. Has zero credibility! @FoxNews @megynkelly _E_
Congratulation to Roy Moore and Luther Strange for being the final two and heading into a September runoff in Alabama. Exciting race! _E_
.@marthamaccallum Martha great interview with my son @EricTrump smart tough & professional. Thank you! @FoxNews _E_
My golf club @TrumpNationalNY in Westchester a great place! __HTTP__ _E_
#sweepstweet @3nVMusic I very much rely on my own 'take' of the situation and people involved. My instincts (cont) __HTTP__ _E_
Congressman John Lewis should spend more time on fixing and helping his district which is in horrible shape and falling apart (not to...... _E_
Very few people read the National Review because it only knows how to criticize but not how to lead. _E_
By the end of this year China will be the number one economic power on earth and the U.S. will owe 20 trillion dollars much of it to China! _E_
Amtrak crash near Philadelphia train derails many hurt some badly. Our country has horrible infrastructure problems. Pols can't solve! _E_
Great job Kevin we are all proud of you! __HTTP__ _E_
Don't miss my Fabulous World of Golf now in its second season on Golf Channel beginning tonight at 9 pm ET __HTTP__ _E_
#MerryChristmas __HTTP__ _E_
The GOP primary schedule is a disaster. Not enough time. _E_
What do we get from our economic competitor South Korea for the tremendous cost of protecting them from North Korea? NOTHING! _E_
Be focused be disciplined be patient there are very few cases of instant gratification. _E_
Freedom is never more than one generation away from extinction. Ronald Reagan #MakeDCListen #DefundObamaCare _E_
If the U.S. does not win this case as it so obviously should we can never have the security and safety to which we are entitled. Politics! _E_
Nominating Chuck Hagel for SOD is the wrong move for Obama. He doesn't need the fight. Too much political capital will be wasted. _E_
Entrepreneurs: Believe in yourself. If you don't no one else will either. _E_
.@ericbolling did a fantastic job on O'Reilly tonight. Way to go Eric! _E_
Entrepreneurs: Set the example and you'll be a magnet for the right people. Great leaders determine the teams they assemble. _E_
When I look at all of the money the special interests and lobbyists are giving to candidates beware the candidates are mere puppets $$$$! _E_
Wow really nice and unexpected from Ed Schultz. Thank you Ed! @edshow __HTTP__ _E_
I would like to wish everyone A HAPPY AND HEALTHY NEW YEAR. WE MUST ALL WORK TOGETHER TO FINALLY MAKE AMERICA SAFE AGAIN AND GREAT AGAIN! _E_
SEAL who shot Bin Laden is unemployed & can't feed his family __HTTP__ Everyone can get welfare but this SEAL can't eat! _E_
Heading over to the Miss USA Pageant. The young women participating are amazing and accomplished. Competition is very tough. ENJOY THE SHOW! _E_
Great meeting with @NaghmehAbedini the wonderful wife of Christian Pastor Saeed who is in Iranian prison. #savesaeed __HTTP__ _E_
Just arrived in Las Vegas for a packed house speech tomorrow. Big poll results today Leading big everywhere. MAKE AMERICA GREAT AGAIN! _E_
I'll bet Obama goes down just like Washington because he doesn't use our(this country's) best people to win. _E_
Heading to South Carolina really big crowd! Will be back in New Hampshire tomorrow.#MakeAmericaGreatAgain _E_
.@seanhannity Carly whose campaign is dead is making false statements about me in order to salvage hope! Sad. _E_
"There is a point in every contest when sitting on the sidelines is not an option." Dean Smith _E_
I got George Zimmerman right watch __HTTP__ _E_
Romney campaign used me in 6 primary states and won every one they should have used me in Florida and Ohio & he would be President. _E_
.@AP and @HuffingtonPost should change their fraudulent story to say THAT I DROPPED @NBC & The Apprentice to run for President! _E_
For the sake of New York City all recent sexting victims of Anthony 'Carlos Danger' Weiner should come forward. _E_
Lucky to have been chosen for the purchase of the magnificent The Point Lake and Golf Club on Lake Norman in (cont) __HTTP__ _E_
"President Trump is not getting the credit he deserves for the economy. Tax Cut bonuses to more than 2000000 workers. Most explosive Stock Market rally that we've seen in modern times. 18000 to 26000 from Election and grounded in profitability and growth. All Trump not 0... _E_
RT @Team_Trump45: @realDonaldTrump We won. Move on. __HTTP__ _E_
THANK YOU for another wonderful evening in Washington D.C. TOGETHER we will MAKE AMERICA GREAT AGAIN __HTTP__ _E_
"Failure has a thousand explanations. Success doesn't need one." Alec Guinness _E_
I can't believe that the judge in the Oscar Pistorious case has found him not guilty of murder. No one has been more guilty since O.J.! _E_
I've always defended @jayleno but he never defends me. He's not a loyal person & I now understand why everybody dumped him. Jay sucks! _E_
The Republican Establishment has been pushing for lightweight Senator Marco Rubio to say anything to hit Trump.I signed the pledge careful _E_
Save your time @rosie and focus on your horrible ratings and don't mention my name on talk shows anymore or you will get more of the same. _E_
I do not know the reporter for the @nytimes or what he looks like. I was showing a person groveling to take back a statement made long ago! _E_
...is all of the illegal leaks of classified and other information. It is a total witch hunt! _E_
Sleepy eyes Chuck Todd a man with so little touch for politics is at it again.He could not have watched my standing ovation speech in N.C. _E_
Thank you New Hampshire! #MakeAmericaGreatAgain #FITN __HTTP__ _E_
Why did @BarackObama and his family travel separately to Martha's Vineyard? They love to extravagantly spend on the taxpayers' dime. _E_
...the entire World WAS laughing and taking advantage of us. People like liddle' Bob Corker have set the U.S. way back. Now we move forward! _E_
The Fake News is working overtime. As Paul Manaforts lawyer said there was no collusion and events mentioned took place long before he... _E_
The protesters blocked a major highway yesterday delaying entry to my RALLY in Arizona by hours and the media blames my supporters! _E_
Congratulations to Emmanuel Macron on his big win today as the next President of France. I look very much forward to working with him! _E_
Not only did Egypt destroy its civil society w/ the Muslim Brotherhood now it is a complete economic mess __HTTP__ _E_
Major grudge match this weekend between @nyjets & @Patriots. I have a dilemma I am good friends w/ both Woody (cont) __HTTP__ _E_
Building a brand is like building a skyscraper the foundation comes first. The bigger the building the deeper the foundation needs to be _E_
The final Wisconsin vote is in and guess what we just picked up an additional 131 votes. The Dems and Green Party can now rest. Scam! _E_
.@CNN Poll just came out amazing numbers for those who want to MAKE AMERICA GREAT AGAIN! TRUMP 36% a 20 point lead over 2nd place. Thanks. _E_
Join me at Clemson University on Wednesday February 10th! #MakeAmericaGreatAgain __HTTP__ _E_
The YouTube of the 2012 Miss USA contestants @GiulianaRancic and me singing Call Me Maybe __HTTP__ has over 2M views. _E_
Trump Invitational at Mar a Lago was a huge success. Raised millions for charity and was the 1st equestrian event held in Palm Beach. _E_
Thank you Henderson NV. This is a MOVEMENT like never seen before! Watch some of the rally via my Facebook page:... __HTTP__ _E_
Entrepreneurs: Everything starts with you. Realize that you're in charge. Whatever happens you're responsible. _E_
The Federal government has increased its employment by 12% since 2007. We need to stop replacing retired workers unless position is needed. _E_
During the GOP convention CNN cut away from the victims of illegal immigrant violence. They don't want them heard. __HTTP__ _E_
Big progress being made in ridding our country of MS 13 gang members and gang members in general. MAKE AMERICA SAFE AGAIN! _E_
Somebody got rich building the ObamaCare website which doesn't even come close to working where has the money gone? _E_
Working hard to get the Olympics for the United States (L.A.). Stay tuned! _E_
Just out but lightly reported: Fewest jobless claims since 1973 show firm U.S. Job Market Lowest since March 1973. @bpolitics _E_
People do business with those people they like and trust. Ralph J. Roberts Founder of Comcast _E_
So nice when media properly polices media. Thank you @BreitbartNews. __HTTP__ _E_
Weiner and Spitzer are on top of the latest polls. A sad day for the greatest city on earth! They will spend lots of time together. _E_
While I greatly appreciate the efforts of President Xi & China to help with North Korea it has not worked out. At least I know China tried! _E_
We need strong tough and brilliant leadership now more than ever! MAKE AMERICA GREAT AGAIN! _E_
Together we are going to MAKE AMERICA GREAT AGAIN!#AmericaFirst __HTTP__ _E_
Had a fantastic dinner last night at Quattro in the Trump SoHo Hotel. It's already one of the hottest new restaurants in the city. _E_
Raising the capital gains tax in this fragile economic time is the dumbest thing Washington could do. So they will probably do it. _E_
Just heard that crazy and very dumb @morningmika had a mental breakdown while talking about me on the low ratings @Morning_Joe. Joe a mess! _E_
Thank you Kansas! Thousands of people inside and thousands outside who couldn't get into the hall. Really amazing! #CaucusForTrump _E_
Back by popular demand TV personality @TheRealMarilu returns in the record 13th season of 'All Star' @CelebApprentice. Marilu does great! _E_
For those of you that have conveniently fotgotten dummy Jon Stewart is a bad filmmaker. His last effort was a real bomb (in all ways)! _E_
Our heartfelt prayers go out to our fellow Americans suffering from the storms & tornadoes. _E_
My @foxandfriends interview discussing how @BarackObama should release his college applications & records __HTTP__ _E_
Really bad ratings for Lawrence O'Donnell on MSNBC O'Reilly is killing him! _E_
Hope you enjoy the story in the highly respected Real Estate Weekly __HTTP__ _E_
See you in D.C. tomorrow at 1:00 P.M. at the Capitol to protest the horribly negotiated deal with Iran. Really sad! _E_
Is it even slightly possible that Jodi Arias could be set free wow what a miscarriage of justice that would be! _E_
Five people killed in Washington State by a Middle Eastern immigrant. Many people died this weekend in Ohio from drug overdoses. N.C. riots! _E_
I just arrived at Trump National Doral in Miami where I'll spend the day checking work just completed by contractors. This place is amazing! _E_
Mitt Romney who was one of the dumbest and worst candidates in the history of Republican politics is now pushing me on tax returns. Dope! _E_
about that...Those Intelligence chiefs made a mistake here & when people make mistakes they should APOLOGIZE. Media should also apologize _E_
Thanks to @SenateMajLdr McConnell and the @SenateGOP we are appointing high quality Federal District... _E_
2. The celebrity with the highest totals by Tuesday noon ET gets an extra donation to his or her charity... _E_
The Democrats want to shut government if we don't bail out Puerto Rico and give billions to their insurance companies for OCare failure. NO! _E_
Just arrived in New Hampshire. Thank you to all of my supporters!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Congratulations to the dedicated professionals of the USSS as they celebrate their 152nd anniversary. Thank you! __HTTP__ __HTTP__ _E_
I made a fortune in Atlantic City got out years ago (great timing) and havn't been back in many years. I have NOTHING to do with A.C. _E_
Thank you Rep. @MarshaBlackburn! __HTTP__ __HTTP__ _E_
I would gain a whole new respect for President Obama if he would say look we made a big mistake sorry! No more lies or deception. _E_
RT @ABCNewsRadio: Global fund championed by Ivanka Trump to help women entrepreneurs begins operations __HTTP__ __HTTP__ _E_
Thank you to our U.S. Navy for protecting our country both in times of peace & war. Together WE WILL MAKE AMERICA... __HTTP__ _E_
With the high prices of corn to continue expect even more inflation on the price of food. _E_
Happy Birthday to my wonderful daughter @IvankaTrump. _E_
In politics and in life ignorance is not a virtue. This is a primary reason that President Obama is the worst president in U.S. history! _E_
Watch Gary B tonight on Celebrity Apprentice some really crazy things happen! _E_
Because of me the Republican Party has taken in millions of new voters a record. If they are not careful they will all leave. Sad! _E_
Crooked Hillary Clinton made up facts about me and forgot to mention the many problems of our country in her very average scream! _E_
It's amazing how many people still come up to me to thank me for 'The Art of The Deal.' The book has changed a lot of lives. _E_
Obama has blocked ICE officers and BP from doing their jobs. That ends when I am President! _E_
What did you think of @THEGaryBusey's mechanical dog idea? _E_
My kids never negatively discussed my criticism of President Obama with me or anyone...it's not in their nature! _E_
The least number of hurricanes in the U.S. in decades. So they change global warming (too cold) to climate change now what will they call it _E_
Tonight's episode of The Apprentice has a big surprise at the top of the show don't miss it! 10 p.m. on NBC. _E_
If ObamaCare is so amazing then why is Obama delaying significant parts of the bill before the election? #MakeDCListen _E_
The voting booth process was a total disaster—it could and should be much better and more efficient—tremendous room for error! _E_
Within the heart of beautiful Somerset County Trump Nat'l Bedminster is the proud host of the 2022 @PGAChampionship __HTTP__ _E_
Very proud of Trump Int'l Golf Links in Aberdeen Scotland. Just got the five star award from @VisitScotNews __HTTP__ _E_
Thank you North Carolina! #MAGA __HTTP__ _E_
PM Sarah Westcot Williams incompetence should not be rewarded. You should vote for anyone who runs against her—loser! @PrimeMinisterSX _E_
Glad to see RomneyCare/ObamaCare architect Gruber being eviscerated on the Hill today. He should return all taxpayer money he was paid. _E_
#MakeAmericaGreatAgainVideo: __HTTP__ __HTTP__ _E_
North Carolina's most exclusive club @Trump_Charlotte's features @SharkGregNorman designed golf course which fronts the biggest lake in NC _E_
If you accept the expectations of others especially negative ones then you will never change the outcome. Michael Jordan _E_
When will we see @BarackObama's passport records (sealed)? _E_
Thank you for all of the great comments on the debate last night. Very exciting! _E_
"@BrandiGlanville @KenyaMoore Talk @ApprenticeNBC Feud" __HTTP__ via @ChristianPost by Virnelli Mercader _E_
My experience yesterday in Poland was a great one. Thank you to everyone including the haters for the great reviews of the speech! _E_
This very expensive GLOBAL WARMING bullshit has got to stop. Our planet is freezing record low tempsand our GW scientists are stuck in ice _E_
The developer of the Scottish wind monstrosities Vattenfall just laid off 2500 people & has serious financial difficulties. _E_
My lawyers want to sue the failing @nytimes so badly for irresponsible intent. I said no (for now) but they are watching. Really disgusting _E_
.@GiulianaRancic & @nickjonas both did a wonderful job hosting @MissUSA! Everyone loved @JonasBrothers & @DJPaulyD's performances! _E_
.@SteveRattner While I think you should have gone to prison for what you did I guess Obama saved you. But watch – I will win! _E_
Has everyone forgotten our marine who now sits in a Mexican prison because we have a president too incompetent or too lazy to make a call? _E_
Thank you to Donald Rumsfeld for the endorsement. Very much appreciated. Clinton's conduct has been disqualifying. _E_
Back by popular demand @GiulianaRancic and @BravoAndy are co hosting tonight's #MissUniverse pageant. They are great! _E_
A great afternoon in Tampa Florida. Thank you! #TrumpPence16 __HTTP__ _E_
Let me sum this up for you... __HTTP__ _E_
People are struggling to get gasoline for their cars we are like a third world country. _E_
All I heard in the SOTU was proposals for more govt more spending and more bureaucrats. Very bad! _E_
Congrats to Jim Lipton and Inside the Actors Studio for winning the Emmy Award for the 250th Episode. I was honored to appear in it. _E_
A Rod's forgery defense is blown __HTTP__ The more he lies the worse it's going to get. @yankees want out of his contract _E_
I will be live on all of the major morning talk shows. Enjoy! _E_
Rick Santorum making a strong point on the Newsmax @iontv debate: @RickSantorum. __HTTP__ _E_
Trump Int'l Palm Beach offers an award winning par 72 Championship measuring 7326 yards. Florida's top course __HTTP__ _E_
Senator Dicky Durbin totally misrepresented what was said at the DACA meeting. Deals can't get made when there is no trust! Durbin blew DACA and is hurting our Military. _E_
Thank you to the great crowd of supporters in Newtown Pennsylvania. Get out & VOTE on 11/8/16. Lets #MAGA! Watch:... __HTTP__ _E_
The World is falling apart around us but we don't have people who know how to play the game. The U.S. is in big trouble no leadership! _E_
Great line from @TheGaryBusey: "I am an angel in an earth suit." Do you agree? #CelebApprentice _E_
Always remember I was the one who got Obama to release his birth certificate or whatever that was! Hilary couldn't McCain couldn't. _E_
Donald Trumps Speech Is a Game Changer. __HTTP__ __HTTP__ _E_
Now that African Americans are seeing what a bad job Hillary type policy and management has done to the inner cities they want TRUMP! _E_
Today's announcement by @BarackObama on immigration was done for reelection. He is using the office of the presidency as a campaign tool. _E_
The cast and producers of Hamilton which I hear is highly overrated should immediately apologize to Mike Pence for their terrible behavior _E_
A legitimate article about me... __HTTP__ _E_
Fraud lightweight Marco made a TV ad on TrumpU featuring 2 people who signed these letters: __HTTP__ _E_
I will hold a press conference in the near future to discuss the business Cabinet picks and all other topics of interest. Busy times! _E_
How does Ben Carson survive this problem – really big. Similar story on front page of New York Times. __HTTP__ _E_
Also the more desperate you are to close a deal the less likely it will happen. Stay calm and focused on your ultimate goals. Be smart! _E_
#VoteTrump #SuperTuesday✅Florida✅Illinois✅Missouri✅North Carolina✅Ohio #TrumpTrain __HTTP__ __HTTP__ _E_
All former Bush administration officials should have zero standing on Syria. Iraq was a waste of blood & treasure. _E_
The fact is right now and for the foreseeable future the planet runs on oil and that means we need to get (cont) __HTTP__ _E_
Tonight be sure to watch Melania and Ivanka on Larry King Live for a Celebrity Relief Telethon __HTTP__ _E_
So I speak badly of China but I speak the truth and what do the consumers in China want? They want Trump. (cont) __HTTP__ _E_
Jackie Evancho's album sales have skyrocketed after announcing her Inauguration performance.Some people just don't understand the Movement _E_
Join @mike_pence at the University of Northwestern Ohio tonight at 7pm. Tickets: __HTTP__ _E_
When it comes to Iran's nuclear weapons program here's my advice: Distrust dismantle and verify. @IsraeliPM @netanyahu _E_
Looking forward to keynoting the Nackey S. Loeb School of Communications First Amendment Awards event tomorrow in New Hampshire. _E_
Strange statement by Bob Corker considering that he is constantly asking me whether or not he should run again in '18. Tennessee not happy! _E_
The only place where success comes before work is in the dictionary. Vidal Sassoon _E_
Via @BBCNews: "Donald Trump golf clubhouse at Menie approved" __HTTP__ _E_
Thank you to all of the television viewers that made my speech at the Republican National Convention #1 over Crooked Hillary and DEMS. _E_
The Keystone pipeline will create 20000 jobs and lower gas prices. But Obama says No. Dumb. _E_
Our trade deficit continues to rise at record rates __HTTP__ The US manufacturing sector is being (cont) __HTTP__ _E_
Trading Shots with Donald Trump a great article in the Wall Street Journal __HTTP__ _E_
I feel sorry for the 4000 soldiers who are being forced to go the West Africa to fight Ebola. Their families are up in arms. Not trained. _E_
The Fed should not do another 'stimulus.' We can't keep spending our children's future away on waste. _E_
I am having a great time in Iowa at Jack Trice Stadium! Unbelievable people. _E_
.@mike_pence and I will defeat #ISIS. __HTTP__ #VPDebate _E_
RT @OCChoppers: Bike we built for @realDonaldTrump. The gold flakes in the paint out in the sunlight looked amazing! __HTTP__ _E_
Lines for my @CPACnews address start at 7:00AM outside the Potomac Ballroom. ACU has asked that you get there early. #CPAC2013 _E_
Rex Tillerson never threatened to resign. This is Fake News put out by @NBCNews. Low news and reporting standards. No verification from me. _E_
For those on TV defending my use of the word schlonged bc #MSM is giving it false meaning tell them it means beaten badly. Dishonest #MSM _E_
While not at all presidential I must point out that the Sloppy Michael Moore Show on Broadway was a TOTAL BOMB and was forced to close. Sad! _E_
Via @reviewjournal "Event offers glimpse of Trump high life" by Holly Ivy Dore __HTTP__ Great interview @EricTrump! _E_
Does anyone remember this @BillMaher clip when he got fired from ABC in fact fired like a dog! __HTTP__ _E_
.@THEGaryBusey doesn't need instructions. Couch time is more fun. #CelebApprentice _E_
Ice storm rolls from Texas to Tennessee I'm in Los Angeles and it's freezing. Global warming is a total and very expensive hoax! _E_
Watch @JudgeJeanine on @FoxNews tonight at 9:00 P.M. _E_
Remembering the fallen heroes on #DDay June 6 1944. __HTTP__ _E_
Thank you! WE will MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
"You can't put a limit on anything. The more you dream the farther you get." @MichaelPhelps _E_
Wow despite the switch to Monday night @ApprenticeNBC ratings were higher than even the Sunday night show. _E_
"America is too great for small dreams." — Pres. Ronald Reagan _E_
Many of life's failures are people who did not realize how close they were to success when they gave up. Thomas A. Edison _E_
We need leaders who can negotiate great deals for Americans. It is common sense. Let's Make America Great Again! __HTTP__ _E_
He @BarackObama made a deal with Saudi Arabia to pump the hell out of oil until after the election. Watch what (cont) __HTTP__ _E_
Donald Trump retains national lead in new ABC News/WaPo poll with 37%: __HTTP__ __HTTP__ _E_
I promoted the hell out of Trump Tower but I also had a great product. The Art of the Deal _E_
We need a President who isn't a laughing stock to the entire World. We need a truly great leader a genius at strategy and winning. Respect! _E_
People are smart. They know you can't be for jobs but against those who create them. It doesn't work. (cont) __HTTP__ _E_
Statement on Clinton Foundation: __HTTP__ _E_
These politicians like Cruz and Graham who have watched ISIS and many other problems develop for years do nothing to make things better! _E_
Lawyer Elizabeth Beck was easy for me to beat. Ask her clients if they are happy with her results against me. Got total win and legal fees. _E_
In business you make decisions that are in your best interests. Time for the US gov't to do the same. Let's Make America Great Again! _E_
I will be on @piersmorganlive tonight at 9PM. __HTTP__ _E_
Just watched the very incompetent Mitt Romney Campaign Strategist Stuart Stevens. Now I know why Mitt lost so badly. Stevens is a clown! _E_
Looking forward to being interviewed by Sam Clovis tomorrow at @MorningsideEdu in Sioux City at 10AM CT! Let's Make America Great Again! _E_
Via @scotsmandotcom: Via Donald Trump makes plans for Menie Estate marquee __HTTP__ _E_
.@OMAROSA You were fantastic on television this weekend. Thank you so much – you are a loyal friend! _E_
Thank you Charlotte North Carolina! We are going to have an AMAZING victory on November 8th...because this is all... __HTTP__ _E_
Great article in @torontodotcom @DonaldJTrumpJr: the original apprentice __HTTP__ _E_
Congratulations to @NYCParks on quickly repairing the Lasker Rink. Record skaters this past Thanksgiving! _E_
Congratulations on the GREAT job done by POLICE and law enforcement on the California shootings. Give credit where credit is due. _E_
Hey @POTUS WE AGREE!#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_
It's more important to be smart than tough. I know businessmen who are brutally tough but they're not smart." – Think Like A Billionaire _E_
Why do losers & haters always say I wear a "wig" when they know I don't. Like it or not it's all mine—just ask Barbara Walters. _E_
Our thoughts and prayers are w/ the families of the 19 brave firefighters who died fighting the Arizona wildfire. God bless them. _E_
Now Sebelius is "'urging' insurers to cover people who haven't paid" __HTTP__ Complete mess. Enrollment Numbers are a sham. _E_
.@billmaher says that the Iraelis are controlling our government __HTTP__ @HBO. Let's fire him a second time. _E_
Thank you for all of your support! Most importantly we need to get everyone out to VOTE! #VoteTrump2016 __HTTP__ _E_
Arriving at Joint Base Andrews with @SecretaryPerry @SecretaryZinke and @SecPriceMD..... __HTTP__ _E_
It's amazing @hardball_chris has completely lost all connections to reality. He is a complete shill for Obama. _E_
John McCain couldn't get him to release "it" and neither could Hillary Clinton—but Donald did! _E_
Will be doing Fox and Friends in 10 minutes at 7.05 enjoy! _E_
USA should take oil from Iraq in repayment for their liberation. __HTTP__ _E_
Thank you to Brad Blakeman on @FoxNews for grading year one of my presidency with an "A" and likewise to Doug Schoen for the very good grade and statements. Working hard! _E_
So if Iran is going to take over the oil I say we take over the oil first by hammering out a cost sharing plan with Iraq. #TimeToGetTough _E_
What is better advice The Art of the Deal or Rules for Radicals ? I know which one @BarackObama prefers. _E_
.@WayneDupreeShow A fantastic guy! _E_
Ex Presidential Pollster Pat Cadell says most voters sick of both parties and their failure. _E_
I would have done even better in the election if that is possible if the winner was based on popular vote but would campaign differently _E_
HILLARY FAILED ALL OVER THE WORLD. #BigLeagueTruth LIBYA SYRIA IRAN IRAQ ASIA PIVOT RUSSIAN RESET BENGHAZI... __HTTP__ _E_
It is really too bad that the scientists studying GLOBAL WARMING in Antarctica got stuck on their icebreaker because of massive ice and cold _E_
He's back! @THEGaryBusey returns to cause even more trouble in the13th season of All Star @CelebApprentice. _E_
Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban Habitat __HTTP__ _E_
Thank you Iowa! #Trump2016#MakeAmericaGreatAgain #FITN __HTTP__ _E_
Via @dcexaminer: @realDonaldTrump to speak at @LibertyU __HTTP__ _E_
President Obama has made one mistake after another for a very long time and the people of the United States are just plain tired of it! _E_
ISIS is advancing even against Obama's airstrikes. Obama is disengaged and making the Middle East even more dangerous. _E_
Just got to listen to Rush Limbaugh the guy is fantastic! _E_
I hope voters in Mississippi cast their ballot for @senatormcdaniel. He is strong he is smart & he wants things to change in Washington. _E_
This is just not the right time for Jeb Bush. His campaign is in total disarray too much staff being paid way too much money = U.S. GOVT. _E_
What do you think of my suing @billmaher for $5M for charity? He made an offer I accepted. _E_
Do you really believe our once great country can continue to survive with incompetent leadership. The answer is no and we better move fast! _E_
Tell Saudi Arabia and others that we want (demand!) free oil for the next ten years or we will not protect their private Boeing 747s.Pay up! _E_
How is ABC Television allowed to have a show entitled Blackish ? Can you imagine the furor of a show Whiteish ! Racism at highest level? _E_
Re Kerry admitting to "working" for Pastor Abedini's release why has US already released Iranian spies & nuclear scientist? Dumb! _E_
A GREAT day in South Carolina. Record crowd and fantastic enthusiasm. This is now a movement to MAKE AMERICA GREAT AGAIN! _E_
THANK YOU Arkansas! Get out & #VoteTrump on Tuesday. We will MAKE AMERICA SAFE & GREAT AGAIN! __HTTP__ _E_
The Trump Tower atrium is such a great place & kept thousands of people warm & safe during the storm thanks staff! _E_
2016 Republican Primary Morning Consult Poll was just released. TRUMP 32 CARSON 12 BUSH 11 FIORINA 6 RUBIO 5 CRUZ 5. Taken after debate _E_
A winning attitude will put everything in perspective. Keep negative thoughts and people where they belong out of the big picture. _E_
Where's the transparency? Despite Obama's denial @sfchronicle stands by report he just talked with Jeremiah Wright. _E_
Phoenix Convention Center officials did not want to have thousands of people standing outside in the heat so they let them in. A GREAT day! _E_
I would bet that we have many great American technology companies that would build and fix the pathetic ObamaCare website for ZERO dollars! _E_
...Hence I would fully expect Corker to be a negative voice and stand in the way of our great agenda. Didn't have the guts to run! _E_
Just put in ad for a real estate executive: "Hard work low pay mean boss!" _E_
Great time last night in Louisiana. Big and energetic crowd. Go out and vote now polls open. MAKE AMERICA GREAT AGAIN! _E_
Dr. Ben Carson blasted Ted Cruz for deceit and dirty tricks and lies. _E_
Spoke at the Congressional @GOP Retreat in Philadelphia PA. this afternoon w/ @VP @SenateMajLdr @SpeakerRyan. Th... __HTTP__ _E_
EXCLUSIVE: FBI Agents Say Comey 'Stood In The Way' Of Clinton Email Investigation: __HTTP__ _E_
Left Paris for U.S.A. Will be heading to New Jersey and attending the#USWomensOpen their most important tournament this afternoon. _E_
I don't know if President Obama isn't stopping the flights from Ebola torn West Africa because he is stubborn stupid or just doesn't care! _E_
"When you can't make them see the light make them feel the heat." – Ronald Reagan _E_
"Concentration comes out of a combination of confidence and hunger." Arnold Palmer _E_
China is buying our shale and gas fields __HTTP__ & Obama still won't approve Keystone __HTTP__ Pathetic! _E_
Thanks to @pnehlen for your kind words very much appreciated. _E_
Dummy @GoAngelo who had 11 people show up for 15 min. at his "massive" rally at Macy's is trying to get publicity for self by using me _E_
WRONG!@BarackObama capitulated to China by releasing Chen Guangcheng out of the US Embassy __HTTP__ China really has our number _E_
In terms of energy we need to be exploring and developing numerous approaches...and I also include in that (cont) __HTTP__ _E_
On Saturday a great man Elie Wiesel passed away.The world is a better place because of him and his belief that good can triumph over evil! _E_
Most people can learn from their own experiences quite well but many ignore the experiences and lessons of others. The Way To The Top _E_
To all young entrepreneurs entering the business world stay positive focused and remember everything has its ups and downs. _E_
.@FoxNews Outgoing CIA Chief John Brennan blasts Pres Elect Trump on Russia threat. Does not fully understand. Oh really couldn't do... _E_
Union Leader refuses to comment as to why they were kicked out of the ABC News debate like a dog. For starters try getting a new publisher! _E_
Just terrible! #Oscars _E_
Celebrity Apprentice tonight at 9 on NBC some amazing things happen! _E_
.@KarlRove is far more to blame for Obama's victory than the Tea Party. _E_
Did @BarackObama try to bribe Rev. Wright with $150K? __HTTP__ I am sure the media will be all over this. _E_
.@realDonaldTrump will do more in the first 30 days in office than Hillary has done in the last 30 years! #Debate... __HTTP__ _E_
Can you imagine if @billmaher said about Obama what he said about me (orangutan etc)—the press would run him out of the country... _E_
Must watch @IvankaTrump interview on @gma discussing #Girlpower __HTTP__ _E_
Will be on @foxandfriends at 7:00 A.M. Enjoy! _E_
It was an honor to welcome @GLFOP to the @WhiteHouse today with @VP Pence & Attorney General Sessions. THANK YOU fo... __HTTP__ _E_
The U.S. has gained more than 5.2 trillion dollars in Stock Market Value since Election Day! Also record business enthusiasm. _E_
The only place success comes before work is in the dictionary. Vince Lombardi _E_
If you can't say great things about yourself who do you think will? Think Like a Champion _E_
LIVE on #Periscope: Tax Plan Press Conference#Trump2016 __HTTP__ _E_
.@ashleycam2883 Re: Libya Hillary took the blame for Obama. _E_
W/ signature Trump amenities 5 star rooms & world class restaurants @TrumpWaikiki brings excellence to Hawaii __HTTP__ _E_
It's the Democrats' total weakness & incompetence that gave rise to ISIS not a tape of Donald Trump that was an admitted Hillary lie! _E_
STATEMENT ON MELANIA SPEECH __HTTP__ _E_
In 1999 @BarackObama said that he didn't support Welfare Reform __HTTP__ He just gutted the entire program. _E_
Do you think Putin will be going to The Miss Universe Pageant in November in Moscow if so will he become my new best friend? _E_
US Army Reserve @leezeldin will bring Conservative solutions to DC. Next Tuesday vote for Lee in the NY 1 primary. #zeldinforcongress _E_
.@ralphreed is doing a great job! _E_
For the record I have ZERO investments in Russia. _E_
Just arrived in Scotland. Place is going wild over the vote. They took their country back just like we will take America back. No games! _E_
Chelsea Clinton will be very successful in the world of politics. She's always been a great person a winner. (cont) __HTTP__ _E_
Thank you Arizona! #Trump2016#MakeAmericaGreatAgain #TrumpTrain __HTTP__ __HTTP__ _E_
The $200M in renovations of Trump Int'l Washington DC are on track. The Old Post Office is being transformed into true luxury. _E_
To be a winner think like a winner. Practice positive thinking with reality checks. _E_
Many of life's failures are people who did not realize how close they were to success when they gave up. Thomas A. Edison _E_
Join us at 10pmE on @ABC2020 @ABC with @BarbaraJWalters! #MeetTheTrumps #ABC2020 __HTTP__ _E_
Today it was my honor to join the great men and women of @DHSgov @CustomsBorder @ICEgov and @USCIS at the U.S. Customs and Border Protection National Targeting Center in Sterling Virginia. Fact sheet: __HTTP__ __HTTP__ _E_
.@IvankaTrump and I are looking forward to visiting Vancouver next week. Big announcement... _E_
RT @TeamTrump: .@realDonaldTrump is here to talk about the REAL issues #BigLeagueTruth #Debates2016 __HTTP__ _E_
#FullRepeal: Stopping Obamacare is now up to the American people. We must elect @MittRomney this November. _E_
Looking forward to RALLY in the Great State of Pennsylvania tonight at 7:30. Big crowd big energy! _E_
their country (the U.S. doesn't tax them) or to build a massive military complex in the middle of the South China Sea? I don't think so! _E_
My @SquawkCNBC interview discussing the GOP primary gas prices the Doral purchase and my outlook on the economy. __HTTP__ _E_
Shock @BarackObama's DNC Convention has a $27M deficit and events are starting to be canceled. __HTTP__ _E_
Great going. _E_
RT @IvankaTrump: It was an honor to meet with you Prime Minister Modi. Thank you for co hosting the 8th annual Global Entrepreneurship Summ... _E_
Donald Trump Jr. Ivanka Trump Eric Trump and myself in front of The Old Post Office D.C. on Pennsylvania... __HTTP__ _E_
Glad to hear @ehasselbeck will be staying on @theviewtv. Elizabeth has great presence & doesn't back down from sharing her views. _E_
Oil would be $25 a barrel if our government would let us drill. Our country would be rich again who needs OPEC. _E_
My fragrance Success is flying off the shelves @Macys. The perfect Christmas gift! _E_
"Successful people don't have fewer problems.They have determined that nothing will stop them from going forward." Dr. Benjamin Carson _E_
Always be prepared to start." @JoeMontana _E_
Clinton betrayed Bernie voters. Kaine supports TPP is in pocket of Wall Street and backed Iraq War. _E_
Entrepreneurs: always remember that deals are fluid. Terms are always negotiable and time can be the best option for success. _E_
The Miss USA Pageant #MissUSA was a big ratings hit for @nbc NBC won the evening. Thank you Donald. _E_
First candidate in Virginia with over 16000 validated signatures for the ballot. An honor thank you! #Trump2016 #MakeAmericaGreatAgain _E_
Passion gives great momentum and can be the catalyst for great achievement. _E_
A TRULY GREAT CHAMPION WILL SELDOM FAIL AND ALWAYS COME BACK. NEVER UNDERESTIMATE THE POWER OF GREATNESS! _E_
Waterboarding KSM gave us the intelligence that lead to Bin Laden. _E_
Tune in to The Marriage Ref onThursday night at 10 p.m. on NBC I'm on the panel of experts along with Gloria Estefan & Adam Carolla. _E_
In order to get elected @BarackObama will start a war with Iran. _E_
Via @EWErickson: "Stop Complaining About Donald Trump" __HTTP__ _E_
'S&P 500 Edges Higher After Trump Renews Jobs Pledge' __HTTP__ _E_
I just got Mike Leach's new book Swing Your Sword. He's a great coach and he's written a great book. It's definitely worth reading. _E_
Lyin' Ted Cruz steals foreign policy from me and lines from Michael Douglas— just another dishonest politician. _E_
Trump National Golf Club Jupiter is close to Palm Beach and designed by Jack Nicklaus a masterpiece of a course. __HTTP__ _E_
How did Obama go to a Las Vegas fundraiser on 9.12 the day after he refused to send help to Americans in Benghazi? _E_
It's good to see that @FLGovScott is protecting the sanctity of this November's elections Voter fraud must be broken. _E_
Everybody loves @bretmichaels! He's a great champion and this is where he should be. He agrees! _E_
Wonderful weekend at Camp David. A very special place. A lot of very important work done. Heading back to the @WhiteHouse now. __HTTP__ _E_
We will stop heroin and other drugs from coming into New Hampshire from our open southern border. We will build a WALL and have security. _E_
Our American comeback story begins 11/8/16. Together we will MAKE AMERICA SAFE & GREAT again for everyone! Watch:... __HTTP__ _E_
Thank you Illinois! Great news! #VoteTrumpIL on 3/15!Trump 28%Cruz 15%Rubio 14%Kasich 13%Bush 8%Carson 6%Simon Poll/SIU _E_
It's amazing how people can talk about me but I'm not allowed to talk about them. _E_
If you look at the horrible picture on the front page of the NY Times of the rebels executing prisoners you would say forget the rebels! _E_
America's primary goal with Iran must be to destroy its nuclear ambitions. Let me put this as plainly as I know (cont) __HTTP__ _E_
Don't miss my fabulous World of Golf now in its second season on Golf Channel beginning January 31 at 9 pm ET. Celebrity matches and more... _E_
Got the endorsement of Brian France and @NASCAR yesterday in Georgia. Also many of the sports great drivers. Thank you Nascar and Georgia! _E_
General Motors is sending Mexican made model of Chevy Cruze to U.S. car dealers tax free across border. Make in U.S.A.or pay big border tax! _E_
It seems there is never a problem for which @BarackObama cannot find a reason for another speech and another tax. _E_
#TBT With Tommy Lee Jones at Mar a Lago. __HTTP__ _E_
Big increase in traffic into our country from certain areas while our people are far more vulnerable as we wait for what should be EASY D! _E_
"Failure isn't fatal but failure to change might be" – John Wooden _E_
Do not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_
Happy Birthday to my friend @garyplayer... __HTTP__ _E_
Don't forget next Friday December 9th: I'll be signing my new book @HowToGetTough in Trump Tower from 11 a.m... (cont) __HTTP__ _E_
Army officer who led a sexual abuse prevention unit was just fired after being charged with violently going after his wife.What is going on? _E_
Remember get TIME magazine! I am on the cover. Take it out in 4 years and read it again! Just watch... _E_
Still waiting to hear from @billmaher. Every day he dodges me is one less day that $5M is being used for charity. _E_
All civilized nations must join together to protect human life and the sacred right of our citizens to live in safety and in peace. _E_
A big fat hit job on @oreillyfactor tonight. A total waste of time to watch boring and biased. @brithume said I would never run a dope! _E_
Leaving now for a one night trip to Scotland in order to be at the Grand Opening of my great Turnberry Resort. Will be back on Sat. night! _E_
Celebrity Apprentice on CNBC tonight at 9. _E_
Boycott all Apple products until such time as Apple gives cellphone info to authorities regarding radical Islamic terrorist couple from Cal _E_
Cover your bases know everything you can about what you're doing. Keep your focus by being well informed on a daily basis. _E_
"If we ever forget that we are One Nation Under God then we will be a nation gone under." Ronald Reagan (Feb. 6 1911–June 5 2004) _E_
Congratulations to my brother Robert & Ann Marie on the success of @MontesKitchen in Dutchess County New York (Amenia.) Great food! _E_
Entrepreneurs: See yourself as victorious look at the solution not the problem. _E_
The @nytimes sent a letter to their subscribers apologizing for their BAD coverage of me. I wonder if it will change doubt it? _E_
The Chicago machine is scared. @PaulRyanVP shows that @MittRomney will run on a conservative & coherent platform. 85 days until victory! _E_
Q1 GDP has just been revised down to 1.9% __HTTP__ The economy is in deep trouble. _E_
ISIS is in retreat our economy is booming investments and jobs are pouring back into the country and so much more! Together there is nothing we can't overcome even a very biased media. We ARE Making America Great Again! _E_
Unsustainable. With our $17T debt & $90T in unfunded liabilities government "blatantly" wasted $30B this year __HTTP__ _E_
RT @LouDobbs: We are Watching A Leader Who for the First Time in Three Presidencies Will Put America and Americans First! @realDonaldTrump... _E_
The highly respected Suffolk University poll just announced that I am alone in 2nd place in New Hampshire with Jeb Bust (Bush) in first. _E_
Fidel Castro is dead! _E_
I hope everyone that read @DanAmira's reprehensible statement will cancel their subscription to @NYMag in protest. Let me know. _E_
from Donald Trump: I saw Lady Gaga last night and she was fantastic! _E_
.@lisarinna is the last lady standing in All Star Celebrity @ApprenticeNBC. Watch out men she's sharp and tough. _E_
The legendary @BarbaraJWalters will be asking me questions about the Presidential campaign on @WNTonight at 6:30 PM. _E_
#Trump360 Watch this 360 video of my speech last night at Trump Tower __HTTP__ _E_
Entrepreneurs: In the best negotiations everyone wins. This is a possibility and it's the ideal situation to strive for. _E_
President Obama & Putin fail to reach deal on Syria so what else is new? Obama is not a natural deal maker. Only makes bad deals! _E_
Via @PVPatch by Paige Austin: "Trump to Donate 12 Acres for Conservation in Palos Verdes" __HTTP__ _E_
RT @SecretarySonny: Serious @Cabinet meeting today called by @POTUS at Camp David. Reports on #Irma's track potential impact fed & state... _E_
A Rod is now looking for an expensive home in Beverly Hills why aren't the @Yankees terminating his contract for misrepresentation? _E_
Via @DMRegister by @KObradovich: "Donald Trump: Next president needs to be 'a great one'" __HTTP__ _E_
Third quarter GDP was lowered to 2% . There won't be any economic recovery until @BarackObama is defeated. _E_
George Will one of the most overrated political pundits (who lost his way long ago) has left the Republican Party.He's made many bad calls _E_
Wow  I never saw the Petraeus thing coming. A straight laced guy! Very sad for him and his family. _E_
Join me live in Toledo Ohio. Time to #DrainTheSwamp & #MAGA! __HTTP__ _E_
Go to the website for the Judge's full decision re Trump University: __HTTP__ _E_
Vera Coking made a big mistake in Atlantic City by turning down many millions of $'s years ago for property that just sold for $530000. _E_
I will be interviewed by @GStephanopoulos on @ABC at 10:00 A.M. _E_
As I predicted 1 year ago gasoline prices hit a record high today...OPEC is having a ball at our expense. _E_
The President has until tomorrow at 12 noon to pick up $5M for his favorite charity. Looking like he won't be doing it. What is he hiding? _E_
I will be talking about my wonderful experience in Iowa and the simultaneous unfair treatment by the media later in New Hampshire. Big crowd _E_
Hope he won't spend too much time ripping apart the 2nd. Amendment! _E_
Today we heard the experiences of law enforcement professionals and community leaders working to combat the threat of MS 13 and the reforms we need from Congress to defeat it. Watch here: __HTTP__ __HTTP__ _E_
RT @MikeHolden42: @foxandfriends @realDonaldTrump He's a fascist so not unusual. _E_
Thank you Albany New York!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
.@MelindaDC Don't misrepresent in order to make a point. I was always tough on ISIS as you'll find out after I get elected. _E_
The United States has been reminded time and again in recent years that economic security is not merely RELATED to national security economic security IS national security. It is vital to our national strength. #APEC2017 __HTTP__ _E_
I was never a fan of Bush 2 FOR MANY REASONS including the fact that we should never have gone into Iraq but once there kept the oil! DUMB _E_
Fact – the tighter the gun laws the more violence. The criminals will always have guns. _E_
I hope everybody goes to Macy's today to get Donald J. Trump shirts ties suits and cufflinks they are really beautiful at low price _E_
Heroin overdoses are taking over our children and others in the MIDWEST. Coming in from our southern border. We need strong border & WALL! _E_
Gas prices are still too high. We really need to pressure OPEC to lower the price of oil. _E_
Texas is lucky to have him @GovernorPerry is a great guy! _E_
Why does @FoxNews give @KarlRove so much airtime. He (and other Fox pundits) is so biased. Still thinks Romney won. Unfair coverage of Trump _E_
.@IvankaTrump and @PiersMorgan will be wonderful advisors. #CelebApprentice _E_
ObamaCare will cost 3 times as much as Obama promised – $2.6T __HTTP__ It is not sustainable. (h/t @gatewaypundit) _E_
Thank you Pennsylvania! #Trump2016 __HTTP__ __HTTP__ _E_
Watching the returns at 9:45pm. #ElectionNight #MAGA __HTTP__ _E_
"If you're still in school pay attention. Education is a money machine." – Think Like a Billionaire _E_
Isn't it time that Obama release his college records and applications? Boy would that create a mess! He is not who you think. _E_
Re run of O'Reilly on Fox NOW! _E_
We're singlehandedly transferring hundreds of billions of dollars a year... _E_
Last night in Orlando Florida was incredible massive crowd THANK YOU FLORIDA! Today at 3:00 P.M. I will be in Alabama for last rally! _E_
Now Obama is having our army coordinate with Iran against ISIS. What's next? _E_
Oh wow lightweight Governor @BobbyJindal who is registered at less than 1 percent in the polls just mocked my hair. So original! _E_
Good messaging and staying on point. @MittRomney called @BarackObama anti investment anti business anti jobs __HTTP__ _E_
Congrats everyone we topped 4 million today on Twitter and heading up fast! _E_
Happy #VeteransDay to all. And it is nice to have Sgt. Andrew Tahmooressi back home. _E_
I was nice to loser @rosie and she attacked me it just shows never let up with a bully. They only fade when you hit them hard! _E_
Looking forward to seeing the World Champion Yankees today on opening day! _E_
Only a grossly incompetent government led by an equally incompetent president could have made the terrible trade for Bergdahl. #OrangeRoom _E_
A must watch: Legal Scholar Alan Dershowitz was just on @foxandfriends talking of what is going on with respect to the greatest Witch Hunt in U.S. political history. Enjoy! _E_
Few people know that @FortuneMagazine is still in business. Tell your writer Alisa Soloman that I left The Apprentice to run for president _E_
If America unlocked its energy potential we would once again be the most powerful country in the world. Washington is holding us back. _E_
Welcome to the new reality. 23116928 US households on food stamps __HTTP__ Obama's Hope & Change. _E_
In '08 @BarackObama said that Bush adding $4T to the debt was unpatriotic. __HTTP__ @BarackObama has already added $6T. _E_
Thank you Peter if elected I will think big for our country & never let the American people down! #AmericaFirst __HTTP__ _E_
Justice Roberts did the Republican Party and @MittRomney a great favor. He essentially said ObamaCare is a tax (cont) __HTTP__ _E_
...Such poor leadership ability by the Mayor of San Juan and others in Puerto Rico who are not able to get their workers to help. They.... _E_
Jack Nicklaus II gave the best tribute to a parent I have ever heard at yesterday's Congressional Gold Medal Ceremony honoring @jacknicklaus _E_
Ask Sally Yates under oath if she knows how classified information got into the newspapers soon after she explained it to W.H. Counsel. _E_
China taxing imports from the US 22% why aren't we taxing China? _E_
Egypt is going the exact opposite of what it was. They will soon be very strongly against Israel. Thanks President Obama. @BarackObama _E_
Just met with General Petraeus was very impressed! _E_
I am signing copies of my book CRIPPLED AMERICA. Order yours now makes a great holiday gift! __HTTP__ ... _E_
Tell me which is "cooler"—my induction into the @WWE Hall of Fame or my Star on the Hollywood Walk of Fame? _E_
The Republican Party is racking up record amounts of small dollar donations fueled by Trump supporters..... @nypost Thank you! _E_
Our major airports are decaying. It's embarrassing. We need to have them renovated by competent professionals and fast. _E_
"You don't necessarily need the best location. What you need is the best deal." – The Art of The Deal _E_
.@NBA Hall of Famer @dennisrodman rebounds for a tremendous performance in his return to this year's All Star @ApprenticeNBC! Great guy! _E_
Big day in Texas tomorrow! Having a rally in Fort Worth. Tremendous crowd. Will be exciting! #Trump2016 __HTTP__ _E_
Bloomberg News Spain's renewable projects lead by money losing wind turbines facing bankruptcy. Hopefully Scotland is watching! _E_
I was just told by a television pro thay @DannyZucker is one of the truly dumbest guys in the business he's obsessed with T so many flops! _E_
Reports say #ISIS now has a passport machine to have its believers infiltrate our country. I told you so. __HTTP__ _E_
"Faldo to rework two Doral courses" __HTTP__ via @FOXSports _E_
New Q poll out we are going to win the whole deal and MAKE AMERICA GREAT AGAIN! #Trump2016 __HTTP__ _E_
Via @AFP: Trump tees off on new golf course in Scotland __HTTP__ _E_
Biden @VP Spends $1 Million Annually for Weekend Trips __HTTP__ _E_
My wife Melania will be interviewed tonight at 8:00pm by Anderson Cooper on @CNN. I have no doubt she will do very well. Enjoy! _E_
I am no fan of President Obama but to show you how dishonest the phony Washington Post is: __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Congrats to the Senate for taking the first step to #RepealObamacare now it's onto the House! _E_
Must read column by Bob Woodward explaining how Obama pushed for sequestration & promised no tax increase __HTTP__ _E_
Wow I just had two very good Iowa polls and a phenomenal just out National Poll from @ABC @washingtonpost 38%. MAKE AMERICA GREAT AGAIN! _E_
WEST VIRGINIA #VoteTrump TODAY!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Thank you Texas! 10000 amazing supporters! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
I hope @Official1MCD is recuperating well in LA. Get better! @OMAROSA _E_
.@montgomeriefdn Your commentary this weekend was fantastic. People love what you say and how you say it. _E_
Is it legal for @BarackObama to make campaign donor calls from Air Force One? __HTTP__ Obama is always fundraising on our dime. _E_
Via @DMRegister by @brianneDMR:"Trump: Bring back jobs from overseas" __HTTP__ Let's Make America Great Again! _E_
Money pouring into Insurance Companies profits under the guise of ObamaCare is over. They have made a fortune.Dems must get smart & deal! _E_
I will be on The Tonight Show with Jimmy Fallon tonight at 11:30. Should be fun! @jimmyfallon _E_
Via RealClear Politics __HTTP__ _E_
"Do more be more give more and everyone will benefit." – Think Like a Champion _E_
It was my great honor to celebrate the opening of two extraordinary museums the Mississippi State History Museum & the Mississippi Civil Rights Museum. We pay solemn tribute to our heroes of the past & dedicate ourselves to building a future of freedom equality justice & peace. __HTTP__ _E_
I know you don't like to hear this @DannyZuker but the biggest nights of The Apprentice were far bigger than the biggest nights of Mod Fam _E_
Entrepreneurs: See yourself as victorious. Look at the solution not the problem. _E_
The Democrats in Congress don't want ObamaCare for themselves or big businesses. So why are they forcing it on the American people? _E_
Gas prices have doubled under Obama. Over $5/gallon now in California. We must start drilling from our own resources to become independent. _E_
Passion motivates. Passionate people don't give up their zeal eliminates fear. Passion can also create business opportunities. _E_
...then he who continues the attack wins." Ulysses S. Grant _E_
The Scottish windfarm was conceived by the same mind that released terrorist al Megrahi for humanitarian reasons. .. _E_
Everyone loves @AmandaTMiller here she is with @Joan_Rivers and me. __HTTP__ _E_
Oprah will end up doing just fine with her network she knows how to win. @Oprah _E_
How do you take care of our people if you don't make anything? We don't make anything. We are rapidly losing our manufacturing to China etc. _E_
Obama now just wants to save face Russia is now telling him don't do it . He waited too long and the other side is much better prepared. _E_
Wow a really great review of my golf club in Scotland @TrumpScotland in todaysgolferco.uk. Thank you! __HTTP__ _E_
Negotiation tip: Think about what the other side wants. Know where they're coming from. Try to create a win/win situation. _E_
Put Kathleen Sebelius out of her misery and lovingly say YOU'RE FIRED! Let her go home to her family and rest. BRING IN TOP FLIGHT PEOPLE! _E_
The United Kingdom is trying hard to disguise their massive Muslim problem. Everybody is wise to what is happening very sad! Be honest. _E_
To be a big success in any field you need to build momentum. Momentum is all about energy and timing. Think BIG _E_
Like it or not haters and losers everybody is talking about Miss U.S.A. and Miss Utah. By the way she is a fine young woman unfair to her. _E_
True! __HTTP__ _E_
White House relaxes penalty for canceled health policies a major blow to the sustainability (and concept) of ObamaCare! They are desperate _E_
We will repeal & replace #Obamacare which has caused soaring double digit premium increases. It is a disaster! __HTTP__ _E_
The incompetence of our current administration is beyond comprehension. TPP is a terrible deal. _E_
MY PRO GROWTH Econ Plan:✅Eliminate excessive regulations! ✅Lean government!✅Lower taxes!#Debates ... __HTTP__ _E_
Based on the shoots which silent film do you think will be better? #CelebApprentice _E_
Just left news conference at @TrumpTowerNY with @TheGaryBusey people love @TheGaryBusey! __HTTP__ _E_
Thoughts and prayers with the victims and their families along with everyone at the Berrien County Courthouse in St. Joseph Michigan. _E_
The record 13th season of 'All Star' @CelebApprentice features the return of the beautiful @BrandenRoderick. The fans love her! _E_
The Republicans should use everything against @BarackObama just as @BarackObama is going to use everything (cont) __HTTP__ _E_
RT @IvankaTrump: We're working to make tax cuts & the expanded Child Tax Credit a reality for American families. The time is now! #TaxRefor... _E_
With multiple space options @TrumpChicago is the ideal venue to hold your dream wedding __HTTP__ _E_
Puerto Rico is devastated. Phone system electric grid many roads gone. FEMA and First Responders are amazing. Governor said great job! _E_
Crooked Hillary Clinton who I would love to call Lyin' Hillary is getting ready to totally misrepresent my foreign policy positions. _E_
On behalf of @FLOTUS Melania and myself thank you Poland! #ICYMI watch here __HTTP__ #POTUSinPoland __HTTP__ _E_
Crooked Hillary Clinton likes to talk about the things she will do but she has been there for 30 years why didn't she do them? _E_
...are now fighting back like never before. There is so much GUILT by Democrats/Clinton and now the facts are pouring out. DO SOMETHING! _E_
I am in Toronto checking the great Trump International Hotel highest rated hotel in Canada. It is a beauty! _E_
.@BradPaisley came up to see me. A really nice and talented guy. __HTTP__ _E_
Wow @GolfMagazine just rated the renovation of The Blue Monster the best of the year. Even better they stated it may be best of all time! _E_
Congrats to R. Emmett Tyrrell Jr of @AmSpec for the fantastic piece on Benghazi. __HTTP__ _E_
I have had the pleasure of getting to know @AnnDRomney & @MittRomney this past year. They love America. Let's push them over the top today. _E_
Via @baltimoresun by @ErinatTheSun: "Maryland GOP book Trump for major fundraiser" __HTTP__ _E_
Thank you to our wonderful team @USUN and their families. Keep up the GREAT work! #USA __HTTP__ _E_
I hate hearing after all of the hard work that @MittRomney never wanted to become President. _E_
If The Art of the Deal is a must read then #TimeToGetTough is my opus. It is available Dec 5th! _E_
Thank you Pennsylvania! Together we are going to MAKE AMERICA GREAT AGAIN! Watch here: __HTTP__ __HTTP__ _E_
Thank you Kirkwood Community College. Heading to the U.S. Cellular Center now for an 8pmE MAKE AMERICA GREAT AGAIN... __HTTP__ _E_
Crooked's State Dept gave special attention to Friends of Bill after the Haiti Earthquake. Unbelievable! __HTTP__ _E_
Big new @ABC Poll to be announced at 9:00 A.M. on This Week with @GStephanopoulos. I will be interviewed on show! _E_
Credit the Bloomberg administration for having the foresight and courage to get this decades old project finished will be BIG for NY. _E_
Join us tomorrow in Kiawah South Carolina! #SCPrimary #VoteTrumpSC#Trump2016 __HTTP__ _E_
Thank you for all of your support Iowa!#MakeAmericaGreatAgain #Trump2016#IACaucus finder: __HTTP__ __HTTP__ _E_
I feel sorry for Rosie 's new partner in love whose parents are devastated at the thought of their daughter being with @Rosie a true loser. _E_
.@CNN is looking at Jeff Zucker to lead them out of the forest Jeff would be a great choice. _E_
Top searched candidate by state as seen in the #GOPDebate media filing center. WE WILL MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
RT @realDonaldTrump: Democrats are holding our Military hostage over their desire to have unchecked illegal immigration. Can't let that hap... _E_
Trump Int. Hotel & Tower Vancouver will transform the skyline w/ its 616 ft twisting & beautiful tower __HTTP__ _E_
Did Hillary Clinton ever apologize for receiving the answers to the debate? Just asking! _E_
Even though every poll Time Drudge etc. has me winning the debate by a lot @FoxNews only puts negative people on. Biased a total joke! _E_
Money is really cheap so this is a great time to buy a house but be sure to lock in long term financing (without which don't buy). _E_
A suggestion for the dishonest media __HTTP__ _E_
"If you can accept losing you can't win." Vince Lombardi _E_
China has been taking out massive amounts of money & wealth from the U.S. in totally one sided trade but won't help with North Korea. Nice! _E_
Melania and I were thrilled to join the dedicated men and women of the @USEmbassyFrance members of the U.S. Military and their families. __HTTP__ _E_
I said don't invade Iraq from the very beginning. my @SRQRepublicans speech _E_
fires its employees builds a new factory or plant in the other country and then thinks it will sell its product back into the U.S. ...... _E_
RT @GMA: WATCH: @IvankaTrump on women who work empowering campaign celebrates modern women. __HTTP__ _E_
Great column by @howardfineman on @HuffPostPol: Karl Rove Is Done __HTTP__ _E_
Via @ChristianToday: "Donald Trump vows to be the 'greatest representative of Christians' if he wins White House" __HTTP__ _E_
Remember we don't get any oil from Iraq China gets whatever ISIS hasn't already taken. So why isn't China sending the troops? Too smart! _E_
New National GOP Zogby Poll#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
So disgraceful that a person illegally in our country killed @Colts linebacker Edwin Jackson. This is just one of many such preventable tragedies. We must get the Dems to get tough on the Border and with illegal immigration FAST! _E_
AMAZING how the press protected President Obama when he did the so called comedy routine with Zach G. He looked like a fool they said cute _E_
The failing @nytimes story is so totally wrong on transition. It is going so smoothly. Also I have spoken to many foreign leaders. _E_
I hear the Rickets family who own the Chicago Cubs are secretly spending $'s against me. They better be careful they have a lot to hide! _E_
Remember politicians are all talk and no action they will never be able to MAKE OUR COUNTRY GREAT AGAIN! Controlled by lobbyists & donors _E_
We are going through contentious primaries now but the GOP must unite. Let's take the Senate and stop Obama's dangerous agenda. _E_
But while Dallas dropped to its knees as a team they all stood up for our National Anthem. Big progress being made we all love our country! _E_
So many people are angry at my comments on Mexico—but face it—Mexico is totally ripping off the US. Our politicians are dummies! _E_
...instead of biting the hand that feeds you! Don't bother just keep making me money! _E_
Wow! Ted Cruz received $487K in campaign contributions $11M from a NY hedge fund mogul & $1M low int. loan from Goldman Sachs. Hypocrite _E_
We need to worry about the American worker first! _E_
Direct view of crane from apartment window. Crane was never properly secured blowing in the breeze. __HTTP__ _E_
With the fantastic ratings last weekend @meetthepress & @ThisWeekABC I think it's only fair that I go on @FoxNewsSunday w/ Chris Wallace. _E_
I was sorry to decline headlining the Reagan Dinner last Saturday due to a prior business commitment. Pres. Reagan was one of the greats. _E_
RT @ColumbiaBugle: @realDonaldTrump Love our @FLOTUS! __HTTP__ _E_
Vote for the next Miss USA... __HTTP__ #VEGASusa11 #MissUSA _E_
America deserves a commander in chief who respects the challenges and realities our Armed Forces face in our (cont) __HTTP__ _E_
Thank you to Jack Morgan Tamara Neo Cheryl Ann Kraft and all of my friends and supporters in Virginia. GREAT JOB! _E_
The NFL is now thinking about a new idea keeping teams in the Locker Room during the National Anthem next season. That's almost as bad as kneeling! When will the highly paid Commissioner finally get tough and smart? This issue is killing your league!..... _E_
If you don't have passion everything you do will ultimately fizzle out or at best be mediocre. Is that how (cont) __HTTP__ _E_
Fox News Sunday With Chris Wallace will be re broadcast on @FoxNews at 6:00 P.M. _E_
The FDA must immediately stop allowing massive dose vaccinations in babies. It is mind boggling that they allow this practice to continue. _E_
China is ripping wealth out Africa and yet as usual refuses to put anything back to help with Ebola. Let the stupid Americans do it! SAD _E_
Thx and from a better quotation source: You miss 100% of the shots you don't take. Wayne Gretzky _E_
RT @foxandfriends: Former President Obama's $400K Wall Street speech stuns liberal base Sen. Warren saying she was troubled by that __HTTP__ _E_
The top leadership of the New York State Republican Party is totally dysfunctional they haven't won a major election in many years. _E_
"Donald Trump on 'Brutal' New Season of @ApprenticeNBC" __HTTP__ via @YahooTV _E_
Great job today by the NYPD in protecting the people and saving the climber. _E_
CNBC poll: Trump won #GOPDebate #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Great job @AdamScott you deserve it! _E_
Why would the people of Texas support Ted Cruz when he has accomplished absolutely nothing for them. He is another all talk no action pol! _E_
Leaving now for Tennessee. Big crowd! _E_
Departing for Long Island now. An area under siege from #MS13 gang members. We will not rest until #MS13 is eradicated. #LESM __HTTP__ _E_
If Obama resigns from office NOW thereby doing a great service to the country—I will give him free lifetime golf at any one of my courses! _E_
A wonderful afternoon in Iowa! Great people! Heading now to Florida tomorrow South Carolina! #MakeAmericaGreatAgain #Trump2016 _E_
Congrats to Charles @krauthammer for his statements on climate change formerly known as global warming! _E_
I'm leading big in every poll and we are going to WIN! Remember Trump NEVER gives up! _E_
Great meeting with a wonderful woman today former Secretary of State Condoleezza Rice! #USA __HTTP__ _E_
Entrepreneurs: Trust your instincts even after you've honed your skills. They're there for a reason. _E_
A country must enforce its borders. Respect for the rule of law is at our country's core. We must build a wall! __HTTP__ _E_
The new Ebola czar will report to the WH & NSA adviser Susan Rice. More mismanagement & duplicity with CDC. Obama is terrible executive. _E_
I hope Tom Brady sues the hell out of the @nfl for incompetence & defamation. They will drop the case against him and he will win. _E_
Congrats to @KarlRove on blowing $400 million this cycle. Every race @CrossroadsGPS ran ads in the Republicans lost. What a waste of money. _E_
RT @charliekirk11: 100 days ago a new message leader & movement took the Oval Office! A government FOR the people BY the people. This is... _E_
Jeb Bush is weak on illegal immigration in favor of common core bad on women's health issues and thinks the Iraq war was a good thing. _E_
Do you believe that highly overrated political pundit @krauthammer said this is the best Republican field in 35 years. What a dope! _E_
To the people of Kentucky Rand Paul didn't want you. Now he runs back due to his presidential failure. #VoteTrump #MakeAmericaGreatAgain _E_
A MUST READ! @AndrewBreitbart's last article The Vetting Part I: @BarackObama's Love Song to Alinsky __HTTP__ _E_
Hillary's staff thought her email scandal might just blow over. Who would trust these people with national security? __HTTP__ _E_
It's very sad that the administration isn't sending anyone to Margaret Thatcher's funeral. She was a big U.S. supporter. _E_
With Jemele Hill at the mike it is no wonder ESPN ratings have tanked in fact tanked so badly it is the talk of the industry! _E_
.@Graeme_McDowell Great playing Graeme you are a true champion! _E_
Canada has made business for our dairy farmers in Wisconsin and other border states very difficult. We will not stand for this. Watch! _E_
Black politicians are in prison based on Shirley Huntley's statements but not white @AGSchneiderman RACISM! __HTTP__ _E_
.@USATODAY Poll and @QuinnipiacPoll say that I beat both Hillary and Bernie and I havn't even started on them yet! _E_
Cruz did not renounce his Canadian citizenship as a US Senator only when he started to run for #POTUS. He could be Canadian Prime Minister. _E_
Negotiation: It is persuasion more than power. _E_
Will be doing Fox & Friends at 7 A.M. 20 minutes. ENJOY! _E_
Thank you Colorado Springs. If I'm elected President I am going to keep Radical Islamic Terrorists out of our count... __HTTP__ _E_
RT @TeamTrump: .@HillaryClinton just claimed she has a positive optimistic view for America. #Debates __HTTP__ _E_
Just as I predicted people are going to be shocked by the rise in premium prices thanks to Obama Care __HTTP__ Enjoy! _E_
Part 2 of my @jimmyfallon interview giving away some @CelebApprentice spoilers & discussing 2012 Miss Universe Pageant __HTTP__ _E_
Thank you Plymouth New Hampshire! #FITN #NHPrimary __HTTP__ _E_
and knew they were in big trouble which is why they cancelled their big fireworks at the last minute.THEY SAW A MOVEMENT LIKE NEVER BEFORE _E_
.@Jrprotalker Thanks Judy for the wonderful statements on @TrumpTurnberry. Great seeing you there & you did a fabulous job on commentary. _E_
Five U.S. soldiers killed in Afghanistan by so called friendly fire. What are we doing? _E_
Inflation is here. Record beef prices are hitting consumers pockets __HTTP__ Bad for family grills. _E_
Today I am standing with patriots in Arizona for border security! Build a wall! Let's Make America Great Again! __HTTP__ _E_
The harder you work the harder it is to surrender. @ProFootballHOF @buffalobills Head Coach Marv Levy _E_
Wow Macy's numbers just in Trump is doing better than ever thanks for your great support! _E_
I am asking the chairs of the House and Senate committees to investigate top secret intelligence shared with NBC prior to me seeing it. _E_
Why is it that when Warren Buffett uses the bankruptcy laws to his benefit nobody cares but with me they go nuts! _E_
Letterman @Late_Show begging me to go back on his low rated show calls lots must apologize for racist comment. _E_
Thank you @FoxNews Huge win for President Trump and GOP in Georgia Congressional Special Election. _E_
The World Bank is tying poverty to 'climate change' __HTTP__ And we wonder why international organizations are ineffective. _E_
Goofy Elizabeth Warren who may be the least productive Senator in the U.S. Senate must prove she is not a fraud. Without the con it's over _E_
{Crooked Hillary Clinton} created this mess and she knows it. #DrainTheSwamp __HTTP__ __HTTP__ _E_
.@secupp who can't believe that her candidate has bombed so badly is one of the dumber pundits on T.V. Hard to watch zero talent! @CNN _E_
Former President Vicente Fox who is railing against my visit to Mexico today also invited me when he apologized for using the f bomb. _E_
It seems @BarackObama had our tax dollars buy guns for Mexican drug lords that were used to kill Americans. We need answers now. _E_
I have been leading big in all polls with two more today @nbc and @CNN. The NBC poll is more than double next at 29%. Fiorina has 11%. _E_
Pres. Obama should leave the baseball game in Cuba immediately & get home to Washington where a #POTUS under a serious emergency belongs! _E_
Read about my victory against sleazebag @AGSchneiderman. More people should fight when they're right! __HTTP__ _E_
Young entrepreneurs – in an economic climate like this only the strong survive. You can do it. Think Big! _E_
Boycott @Macys no guts no glory. Besides there are far better stores! _E_
.@Graeme_McDowell You are the toughest guy there is. If you were a boxer you'd be the champ. Great going! _E_
Via @scotsmandotcom: "Awards for Trump's golf course" __HTTP__ _E_
Visit the highly acclaimed Trump International Hotel & Tower Chicago and its exceptional 'Sixteen' restaurant __HTTP__ _E_
Will be interviewed on @foxandfriends at 7:00 5 minutes. Then I head to New Hampshire great people! _E_
After destroying the Middle East & our economy the Bushes last gift was having Justice Roberts legalize ObamaCare. No more Bushes! _E_
Get respect and do not give a damn if people like you. Think Big _E_
The worst thing you can possibly do in a deal is seem desperate to make it. #TheArtofTheDeal _E_
A new INTELLIGENCE LEAK from the Amazon Washington Postthis time against A.G. Jeff Sessions.These illegal leaks like Comey's must stop! _E_
"TRUMP TO CPAC: BUILD A GREAT ECONOMY" __HTTP__ via @BreitbartVideo _E_
#TBT Filming an Oreo commercial with Eli Manning Peyton Manning and Darrell Hammond __HTTP__ _E_
Had a great time in Myrtle Beach and Charleston this past Saturday and Monday. Looking forward to going back soon. _E_
The majority of Americans agree with @MittRomney's comments on @Israel and Iran. _E_
LinkedIn Workforce Report: January and February were the strongest consecutive months for hiring since August and September 2015 _E_
#FlashbackFriday Just after I did my renovation in Central Park of @TrumpRink __HTTP__ _E_
The dishonest media likes saying that I am in Agreement with Julian Assange wrong. I simply state what he states it is for the people.... _E_
Via @NorthvillePatch: Donald Trump to Speak in Novi This May __HTTP__ _E_
The GOP doesn't waste an opportunity to waste an opportunity. Defunding Obamacare should be central to any deal. _E_
Just found out I won the Rockingham County Republican Booth Straw Poll at the Deerfield Fair in New Hampshire this past weekend. 39% Wow! _E_
My thoughts and prayers are with the two police officers shot in Sebastian County Arkansas. #LESM _E_
The Central Park Five documentary was a one sided piece of garbage that didn't explain the.horrific crimes of these young men while in park _E_
Since the first day I took office all you hear is the phony Democrat excuse for losing the election Russia RussiaRussia. Despite this I have the economy booming and have possibly done more than any 10 month President. MAKE AMERICA GREAT AGAIN! _E_
My father's 4 step formula for success: Get in get it done get it done right and get out. Fred C. Trump _E_
Watch @AC360 on NOW! @CNN _E_
Crooked Hillary Clinton is spending a fortune on ads against me. I am the one person she doesn't want to run against. Will be such fun! _E_
Join me in Reno Nevada on Wednesday at 3:30pm at the Reno Sparks Convention Center! #MAGATickets:... __HTTP__ _E_
Congrats to @greggutfeld on his new @FoxNews show! Greg makes great TV and is a terrific guy. _E_
W/state of the art Clubhouse & our signature amenities @Trump_Charlotte brings true luxury to The Tar Heel State __HTTP__ _E_
Looking forward to speaking at @Citizens_United & @SteveKingIA's "Iowa's Freedom Summit" on January 24th __HTTP__ _E_
Paulina @MissUniverse Vega will be introduced tonight at the Finale of Celebrity Apprentice.She is a great beauty and a monster star in S.A. _E_
.@TrumpDoral offers multiple award winning dining options in our all new signature restaurant and lounges __HTTP__ _E_
Fracking poses ZERO health risks __HTTP__ In fact it increases our national security by making us energy independent. _E_
My speech is right now on C SPAN 1 _E_
I will be meeting with Henry Kissinger at 1:45pm. Will be discussing North Korea China and the Middle East. _E_
Watch @extratv's spot covering the first annual Trump Invitational at Mar a Lago __HTTP__ _E_
Congratulations to my head pro of Trump International Golf Club (Florida) John Nieporte for qualifying for the U.S. Open! @usopengolf _E_
Obama stop the flights to and from West Africa NOW before it is too late! Can't you see what's happening? Can you be that thick (stupid)? _E_
Does @BarackObama ever work? He is constantly campaigning and fundraising on both the taxpayer's dime and time not fair! _E_
When it comes to the future of America's energy needs we will FIND IT we will DREAM IT and we will BUILD IT.... __HTTP__ _E_
Thank you Roseanne very much appreciated. __HTTP__ _E_
.@politico has no power but so dishonest! _E_
Entrepreneurs: Take responsibility for yourself. It's a very empowering attitude. _E_
Govt. collapsing in Iraq only 2 weeks after withdrawal of our troops. Sadly I called this one and please remember I alone called it. _E_
I think @megynkelly should take another eleven day unscheduled vacation. _E_
Is Jon Stewart a racist? See video that includes clip... __HTTP__ #thedailyshow _E_
.@kevinjonas was great but he brought the wrong person into the boardroom. Had he brought Lorenzo in he would not have been fired. _E_
.@antbaxter should really be ashamed about his massive box office disaster. Take a hint and get out of the film (cont) __HTTP__ _E_
Statement Regarding British Referendum on E.U. Membership __HTTP__ _E_
In three years people won't be building wind turbines anymore they are obsolete & totally destroy the environment in which they sit. _E_
Get ready for fireworks...@Joan_Rivers & @THEGaryBusey face off in the Board Room this Sunday on All Star Celebrity @ApprenticeNBC. _E_
Lots of autism and vaccine response. Stop these massive doses immediately. Go back to single spread out shots! What do we have to lose. _E_
Still looking to give away a RECORD $1M reward on @fundanything for a crowd funding campaign __HTTP__ _E_
In addition to those without health coverage those that have disastrous #Obamacare are seeing MASSIVE PREMIUM INCR... __HTTP__ _E_
.@Neilyoung A few months ago Neil Young came to my office looking for $$ on an audio deal & called me last week to go to his concert. Wow! _E_
Why won't Obama release his college applications? Is there something 'foreign' about them? _E_
An HR solutions company polled 1000 employed adults to find out who would make ideal bosses... __HTTP__ _E_
If the Palestinians want statehood then why are they run by the terrorist group Hamas? _E_
Great time in Burlington Vermont. Crowd was amazing. _E_
My heartfelt condolences to the family of Kathryn Steinle. Very very sad! _E_
Watch this behind the scenes video of @IvankaTrump's Fall 2012 collection photo shoot __HTTP__ _E_
Be a yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_
Failing @NYTimes will always take a good story about me and make it bad. Every article is unfair and biased. Very sad! _E_
RT @FoxNews: .@POTUS: I'm not against the media. I'm against the FAKE media. #CashinIn __HTTP__ _E_
Met @newtgingrich at Trump Tower today. He's a big thinker. _E_
Outrageous @BarackObama is suing to suppress the military vote in Ohio __HTTP__ Our Commander in Chief should be ashamed. _E_
Be sure to watch #MissUniverse tonight at 8PM on @nbc with its first simulcast on @Telemundo! _E_
I'd bet the horrible look of Pinehurst translates into poor television ratings. This is not what golf is about! _E_
Right To Play uses the power of play to educate and empower children facing adversity. A great cause check it out. __HTTP__ _E_
The Mar a Lago Club was amazing tonight. Everybody was there the biggest and the hottest. Palm Beach is so lucky to have best club in world _E_
Mullet Bay Golf Course looks like a slum on the beautiful island of St. Maarten. @PrimeMinisterSX should be ashamed for allowing this. _E_
My @IngrahamAngle interview discussing @JebBush's comments a united 2012 GOP #CelebApprentice & Trump#Miss Universe __HTTP__ _E_
"If you can't explain it simply you don't understand it well enough." Albert Einstein _E_
I spoke with President Moon of South Korea last night. Asked him how Rocket Man is doing. Long gas lines forming in North Korea. Too bad! _E_
Just as I predicted today Obama called for even more tax increases. The Republicans played right into his hands and blew their cards. _E_
"Always strive to outdo yourself." – Think Big _E_
No wonder the @nytimes is failing—who can believe what they write after the false malicious & libelous story they did on me. _E_
"Attitude is a little thing that makes a big difference." Winston Churchill _E_
Do you believe that Obama is giving weapons to moderate rebels in Syria.Isn't sure who they are. What the hell is he doing.Will turn on us _E_
I really enjoyed doing the show circuit this AM discussing lightweight AG Eric Schneiderman & the terrible job he has done for NY. _E_
One of the reasons I am no fan of John McCain is that our Vets are being treated so badly by him and the politicians. I will fix VA quickly. _E_
The banks need to start lending again otherwise the economy will continue its downturn. This is why we bailed the banks out! _E_
Puerto Rico Governor Ricardo Rossello just stated: The Administration and the President every time we've spoken they've delivered...... _E_
Lightweight @AGSchneiderman is driving business out of New York for his own public relations benefit. A real dope! _E_
Why does @FoxNews keep George Will as a talking head? Wrong on so many subjects! _E_
So Obama's top people responsible for ObamaCare think the American Public is stupid! All based on lies and deception! Repubs should sue. _E_
We should not attack Syria but if they make the stupid move to do so the Arab Leaguewhose members are laughing at us should pay! _E_
Thanks to everyone who has waited in the long lines at the #TimeToGetTough book signings. It is great to meet fellow patriots. _E_
Conde Nast Traveler Readers' Choice Awards Best Resorts in Europe: Trump Int'l Hotel & Golf Links Doonbeg voted #1. __HTTP__ _E_
Thomas Jefferson wrote the Senate filibuster rule. Harry Reid & Obama killed it yesterday. Rule was in effect for over 200 years. _E_
.@ericbolling you can do much better than you did tonight on @oreillyfactor. Better luck tomorrow! _E_
.@BarackObama Hood: Rob our children's future by borrowing from the Chinese to pay for socialist programs that will bankrupt us. _E_
The Church is yet another victim to his liberal agenda: @BarackObama lied to his Catholic supporters to pass ObamaCare. _E_
...the beauty that is being taken out of our cities towns and parks will be greatly missed and never able to be comparably replaced! _E_
Horrible and cowardly terrorist attack on innocent and defenseless worshipers in Egypt. The world cannot tolerate terrorism we must defeat them militarily and discredit the extremist ideology that forms the basis of their existence! _E_
Without passion you don't have energy without energy you have nothing! _E_
The Dallas event in two weeks at the American Airlines Center is filling up fast. Get your tickets fast before it is too late! _E_
So impt Rep Senators under leadership of @SenateMajLdr McConnell get healthcare plan approved. After 7yrs of O'Care disaster must happen! _E_
The @BarackObama administration is far more enthusiastic about boosting food stamp enrollment than about preventing fraud. #TimeToGetTough _E_
VERY IRONIC: In 2010 video Clinton lectured underlings on cybersecurity and guarding 'sensitive information' __HTTP__ _E_
.@VattenfallGroup will never solve the issues with the Ministry of Defense. Besides they smartly just left the project. _E_
Highly overrated & crazy @megynkelly is always complaining about Trump and yet she devotes her shows to me. Focus on others Megyn! _E_
Hillary whose decisions have led to the deaths of many accepted $ from a business linked to ISIS. Silence at CNN. __HTTP__ _E_
The U.S. Consumer Confidence Index for December surged nearly four points to 113.7 THE HIGHEST LEVEL IN MORE THAN 15 YEARS! Thanks Donald! _E_
.@CNN is so negative getting even worse as I get closer. Just had two anti Trump losers with zero rebuttal from my team. Turning off! _E_
Join me this Saturday at Ladd–Peebles Stadium in Mobile Alabama! #ThankYouTour2016 Tickets:... __HTTP__ _E_
.@chucktodd said today on @meetthepress that attacking Bill to get to Hillary has never worked before. Wrong attacked him in '08 & won! _E_
...long he doesn't know how to win anymore just look at the mess our country is in bogged down in conflict all over the place. Our hero.. _E_
We are one step closer to delivering MASSIVE tax cuts for working families across America. Special thanks to @SenateMajLdr Mitch McConnell and Chairman @SenOrrinHatch for shepherding our bill through the Senate. Look forward to signing a final bill before Christmas! __HTTP__ _E_
Entrepreneurs: Set the example. You can motivate others as well as yourself by remembering that you are setting the example. _E_
Thank you Colorado! #MAGA __HTTP__ __HTTP__ __HTTP__ _E_
Just as I predicted @Joe_Biden was a complete disaster in China. He condoned the Chinese one child policy an... (cont) __HTTP__ _E_
Via @DC_Decoder: Donald Trump to 'surprise' GOP convention. What might he do? __HTTP__ Answer: Something major! _E_
What is happening in Atlantic City casino closures is very sad but does anybody give me credit for getting out before its demise? Timing _E_
How can Jeb Bush expect to deal with China Russia + Iran if he gets caught doing a "plant" during my speech yesterday in NH? _E_
The best thing you can do is deal from strength and leverage is the biggest strength you have. Leverage is (cont) __HTTP__ _E_
Watching my beautiful wife Melania speak about our love of country and family. We will make you all very proud.... __HTTP__ _E_
Today there were terror attacks in Turkey Switzerland and Germany and it is only getting worse. The civilized world must change thinking! _E_
Tomorrow I will be tweeting on only one subject! _E_
Looking forward to the 2010 Miss USA Pageant Sunday May 16 on NBC 7 p.m. ET hosted by Curtis Stone and Natalie Morales. _E_
My interview on @gretawire last night Our Leaders Are Leading Us Into 'Oblivion' __HTTP__ _E_
Who is @Macys to pretend innocence when they "racial profile" all over the place? Paid big fine! _E_
I will be commenting LIVE on Sunday night (9 to 11) on TWITTER Celebrity Apprentice will be great this season amazing cast! _E_
I'll be turning the table on Larry King this Saturday night. I'll be interviewing him in honor of the 25th Anniversary of his show. _E_
China is primed to continue to rob us and steal our jobs through their exports __HTTP__ We need @MittRomney to rein them in. _E_
Take a chance! All life is a chance. The man who goes farthest is generally the one who is willing to do and dare. Dale Carnegie _E_
The public is learning (even more so) how dishonest the Fake News is. They totally misrepresent what I say about hate bigotry etc. Shame! _E_
A Rod has disgraced the blessed @Yankees organization lied to the fans & embarrassed NYC. He does not deserve to wear the pinstripes. _E_
THANK YOU IOWA!#ThankYouTour2016 __HTTP__ _E_
.@EricTrump was FANTASTIC on @foxandfriends this morning. He may be my son but he is a special guy! _E_
Being politically correct takes too much time. We have too much to get done! #Trump2016 __HTTP__ __HTTP__ _E_
RT @JerryTravone: @realDonaldTrump __HTTP__ _E_
Getting ready to leave for Michigan will be an amazing evening! See you there. _E_
Join me at 4pm over at the Lincoln Memorial with my family!#Inauguration2017 __HTTP__ _E_
The Fake News is at it again this time trying to hurt one of the finest people I know General John Kelly by saying he will soon be..... _E_
do this under the law I feel it is visually important as President to in no way have a conflict of interest with my various businesses.. _E_
LIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_
When the Super Committee fails @BarackObama will get exactly what he really wants automatic cuts in defense spending. This is his plan. _E_
RT @dcexaminer: EXCLUSIVE: How Donald Trump's 30 million followers are crashing the Internet __HTTP__ __HTTP__ _E_
Charles McCullough the respected fmr Intel Comm Inspector General said public was misled on Crooked Hillary Emails. "Emails endangered National Security." Why aren't our deep State authorities looking at this? Rigged & corrupt? @TuckerCarlson @seanhannity _E_
Just finished speaking in Jacksonville Florida. Incredible crowd fantastic people. Thank you! _E_
Rio de Janeiro joins the @TrumpCollection in 2016. It's going to be a spectacular hotel! __HTTP__ _E_
This George Zimmerman is really a mess he really has to just disappear! (He attacked his wife last night). _E_
Congratulations to Woody Johnson and @nyjets on acquiring @TimTebow.@TimTebow is not only a winner but a leader. (cont) __HTTP__ _E_
Looking forward to the 2010 Miss USA Pageant Sunday May 16 on NBC 7 p.m. ET hosted by Curtis Stone & Natalie Morales live from Las Vegas. _E_
Thank you Oregon! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Via CNET: Donald Trump Bests Jeb Bush in Website Performance Experts Say __HTTP__ _E_
Only 109 people out of 325000 were detained and held for questioning. Big problems at airports were caused by Delta computer outage..... _E_
ObamaCare is dead and the Democrats are obstructionists no ideas or votes only obstruction. It is solely up to the 52 Republican Senators! _E_
Apple is finally considering a large screen for the I Phone they better get moving fast. When I told them to do this last year they scoffed _E_
Thoughts and prayers for those in the floods affecting the great people of South Carolina. _E_
Will be on @foxandfriends at 7:02 A.M. Enjoy. _E_
I certainly hope the Democrats do not force Nancy P out. That would be very bad for the Republican Party and please let Cryin' Chuck stay! _E_
N.Y.Times headline states Obama suffers setbacks in Japan trade deal. Can somebody please tell him that with all they sell us WE HAVE CARDS _E_
.@mcuban has less TV persona than any other person I can think of. He's an arrogant crude dope who met some very stupid people... _E_
With few exceptions only really smart people are able to make a lot of money. Hard work is also important but brains will supersede. _E_
A terrible deal with Iran! __HTTP__ _E_
Egypt's Muslim Brotherhood just made its first visit to Hamas led Gaza. Why did @BarackObama promote the Arab Spring ? _E_
My @gretawire int. on Leon Panetta's critique of Obama Ebola rise of ISIS Obama's lack of common sense & 2016 __HTTP__ _E_
"Positive thinking is not merely wishful thinking... _E_
Dress for success. The Donald J. Trump Signature Collection exclusively available @Macys.com __HTTP__ _E_
Interestingly the hurricane may now be a disaster for Obama's reelection because of his grandstanding. _E_
A great honor from somebody that knows how to win! __HTTP__ _E_
Great to hear that @nfl legend and hall of famer John Elway has endorsed @MittRomney in Colorado. CO is a must win state for Mitt. _E_
.@TheJuanWilliams you never speak well of me & yet when I saw you at Fox you ran over like a child and wanted a picture. Please share pic! _E_
Congratulations to our great resident of Chicago Trump Tower Patrick Kane @88PKane for the #StanleyCup win & winning MVP of series. _E_
With @stuartpstevens expected to represent @GovChristie in the Presidential race Chris will have a very hard time winning. _E_
....your release possible and HAVE A GREAT LIFE! Be careful there are many pitfalls on the long and winding road of life! _E_
We need your vote. Go to the POLLS! Let's continue this MOVEMENT! Find your poll location: __HTTP__ __HTTP__ _E_
.@serenawilliams is a special player. After winning the Gold for the US in the Olympics it looks like she will (cont) __HTTP__ _E_
They are great people! __HTTP__ _E_
Congratulations to @spurs on their @NBA championship. Well deserved. _E_
I will be interviewed by @donlemon tonight on @CNN at 10PM. _E_
.@bwilliams knows that I think his newscast has become totally boring so he took a shot at me last night. _E_
The passage of the @DeptVetAffairs Accountability and Whistleblower Protection Act is GREAT news for veterans! I lo... __HTTP__ _E_
Trump: Obama is 'Unlucky President' __HTTP__ via @Newsmax_Media _E_
If traveling to the Windy City to celebrate 100th anniversary of Wrigley Field @TrumpChicago is Chicago's #1 hotel __HTTP__ _E_
.@SenMikeLee refuted every point Karl 1.6% Rove made on the need to defund ObamaCare.Must listen __HTTP__ @TheRightScoop _E_
Was with great people last night in Fort Myer Virginia. The future of our country is strong! _E_
#Trump2016 #TrumpInstagram: __HTTP__ __HTTP__ _E_
Do you notice that the polling establishment doesn't put me in polls but put in folks who hardly register. MAKE AMERICA GREAT AGAIN! _E_
Often times being 'innovative' is simply putting together pre existing elements into something new. Be resourceful & expect success. _E_
STATEMENT IN RESPONSE TO PRESIDENT OBAMA'S FAILED LEADERSHIP: __HTTP__ _E_
The economy cannot take four more years of these same failed policies.#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_
Thank you Willie Robertson! #VoteTrump #MakeAmericaGreatAgain __HTTP__ _E_
You miss 100% of the shots you don't take. Wayne Gretzky _E_
Perhaps Miss USA can lure Snowden back? _E_
DACA is probably dead because the Democrats don't really want it they just want to talk and take desperately needed money away from our Military. _E_
I never quit trying. I never felt that I didn't have a chance to win. Arnold Palmer @KingdomMag _E_
My @Gretawire interview where I discuss why @BarackObama is an economic ignoramus and how OPEC is inflating gas prices. __HTTP__ _E_
Via @digitaljournal: Donald Trump tweets Obama is 'an incompetent President' __HTTP__ _E_
Great evening with the @AmSpec & the T. Boone Pickens Entrepreneur Award. Amazing crowd—thank you! _E_
.@MissUSA Erin Brady is doing a fantastic job representing Trump Miss USA. Smart gorgeous a really positive force! _E_
Any and all weather events are used by the GLOBAL WARMING HOAXSTERS to justify higher taxes to save our planet! They don't believe it $$$$! _E_
I hope you can go to @oreillyfactor and vote for Donald Trump in order to Make America Great Again! Thanks. _E_
Thank you Rep. Collins! #Trump2016 __HTTP__ _E_
Time Warner cable out AGAIN in Manhattan no television. They have a real problem! _E_
Via @ConservReview by @JeffJlpa1: Why Donald Trump is Right __HTTP__ _E_
"Polling strong Donald Trump starting to get serious" __HTTP__ via @bostonherald by @JaclynCashman _E_
Via @lohud by @hoopsmbd: "Buzz builds for @TrumpFerryPoint" __HTTP__ _E_
Trump Towers Istanbul Sisli will be one of the country's top landmarks __HTTP__ _E_
Dopey @GeorgeWill the most overrated political pundit in the business continues to downgrade the Republican (cont) __HTTP__ _E_
Republicans had all the cards but not the guts to make a great deal! _E_
How does this cast look to you? Pretty amazing. #CelebApprentice _E_
The ratings for the Republican National Convention were very good but for the final night my speech great. Thank you! _E_
I'm going to be live with @ericbolling and @kimguilfoyle to ring in the New Year 2016. Everybody should tune in to @foxnews tomorrow night! _E_
Unprecedented success for our Country in so many ways since the Election. Record Stock Market Strong on Military Crime Borders & ISIS Judicial Strength & Numbers Lowest Unemployment for Women & ALL Massive Tax Cuts end of Individual Mandate and so much more. Big 2018! _E_
Prime Minister @David_Cameron is very foolish in giving @AlexSalmond so much money to build wind turbines which r destroying Scotland. _E_
Watch my interview on @CBSNews Face The Nation now and also the new CBS POLLS which if good for me the media won't report! _E_
The Wall is a very important tool in stopping drugs from pouring into our country and poisoning our youth (and many others)! If _E_
While Hillary profits off the rigged system I am fighting for you! Remember the simple phrase: #FollowTheMoney... __HTTP__ _E_
Do you believe @algore is blaming global warming for the hurricane? _E_
Congrats @JanineTurner on new book A Little Bit Vulnerable you're a breath of fresh air in the political forum __HTTP__ _E_
The hedge fund guys (gals) have to pay higher taxes ASAP. They are paying practically nothing. We must reduce taxes for the middle class! _E_
What's with this rap stuff with me and Ebenezer Scrooge? __HTTP__ _E_
The Chinese want to steal our jobs and technology that includes so called green energy which they make but (cont) __HTTP__ _E_
Join @TeamTrump on Facebook & watch tonight's rally from Geneva Ohio our 3rd rally of the day. #AmericaFirst #MAGA __HTTP__ _E_
With the $635 million dollar website fiasco getting caught tapping phones of WORLD LEADERS and so much more U.S. is looking really stupid! _E_
We are taking care of hundreds of people in the Trump Tower atrium they are seeking refuge. Free coffee and food. _E_
.@MacMiller's Donald Trump just hit 60 million hits. Maybe I should go into a new business. _E_
Thoughts & prayers with the millions of people in the path of Hurricane Matthew. Look out for neighbors and listen... __HTTP__ _E_
Talent is cheaper than table salt. What separates the talented individual from the successful one is a lot of hard work. Stephen King _E_
MAKE AMERICA GREAT AGAIN!#AmericaFirst #Trump2016 __HTTP__ _E_
.@TrumpGolfLA has panoramic Pacific Ocean views features a 7242 yard public course designed by Pete Dye __HTTP__ _E_
Congratulations to @nyknicks on winning their first Atlantic Division title since 1994. @carmeloanthony is a great New Yorker and Knick! _E_
.@KarlRove stated clearly that he wants to repeal the 2nd Amendment. I thought @FoxNews was going to fire that jerk after his Romney fiasco? _E_
.@GolfMonthly re: my Scottish course "Quite simply this is not the best new links course in the UK it is the best links course full stop _E_
Today I signed the Holocaust Remembrance Proclamation: __HTTP__ #ICYMI My statement last night at... __HTTP__ _E_
My parents: Trust in God and be true to yourself. Mary MacLeod Trump Know everything you can about what you're doing. Fred C. Trump _E_
Great poll numbers all over and beating Hillary Clinton one on one. Thank you! _E_
It is time for Iran to face serious consequences. This regime is a threat to our national security. _E_
.@CNBC has just agreed that the debate will be TWO HOURS. Fantastic news for all especially the millions of people who will be watching! _E_
Don't like @SamuelLJackson's golf swing. Not athletic. I've won many club championships. Play him for charity! _E_
The Democratic National Committee would not allow the FBI to study or see its computer info after it was supposedly hacked by Russia...... _E_
NEW @MittRomney TV AD Dream For these small businesses hope and change was not so kind: __HTTP__ #tcot _E_
Why are Democrats fighting massive tax cuts for the middle class and business (jobs)? The reason: Obstruction and Delay! _E_
I hope we never find life on other planets because there's no doubt that the U.S. Government will start sending them money! _E_
The ratings for the Celebrity Apprentice were fantastic and everyone had a great time. It was a terrific season congrats to everyone! _E_
A Rod @Yankees had hip surgery & will be out 6 months. Do you notice all the "druggies" have bad hips. _E_
Turnberry one of the most beautiful places in the world.... soon to be Trump Turnberry a Luxury... __HTTP__ _E_
RT @DRUDGE_REPORT: GREAT AGAIN: +235000 __HTTP__ _E_
Thank you @JeffJlpa1 and @AmSpec for the wonderful and very true article "Total Desperation on Iran" __HTTP__ _E_
I hear that dopey political pundit Lawrence O'Donnell one of the dumber people on television is about to lose his show no ratings?Too bad _E_
I'd bet the lawyers for the Central Park 5 are laughing at the stupidity of N.Y.C. when there was such a strong case against their clients _E_
Scary – in the past 90 days Obama has set over 6125 regulatory burdens __HTTP__ Terrible for the economy. _E_
Thanks. __HTTP__ _E_
Ted Cruz attacked New Yorkers and New York values we don't forget! __HTTP__ _E_
Standing with Jamiel Shaw Sabine Durdin Don Rosenberg Lupe Moreno Brenda Sparks Robin Hvidston & their spouses. __HTTP__ _E_
Crooked Hillary Clinton will be a disaster on jobs the economy trade healthcare the military guns and just about all else. Obama plus! _E_
Via @USNewsTravel: "Best New York City Hotels: @TrumpNewYork" __HTTP__ _E_
Muhammad Ali is dead at 74! A truly great champion and a wonderful guy. He will be missed by all! _E_
Negotiation tip: View any conflict as an opportunity this will expand your mind as well as your horizons. Persistence can go a long way. _E_
Word is spreading that I got a tattoo no way I am not a fan! _E_
Obama sent weapons through Benghazi to ISIS yet he is holding up shipments to Israel. What is he thinking? _E_
The Patch a total loser for @AOL will be a good deal compared to @HuffingtonPost. @ariannahuff laughs at "stupid" Armstrong! _E_
As I have always said let ObamaCare fail and then come together and do a great healthcare plan. Stay tuned! _E_
Thank you Ohio! #AmericaFirst __HTTP__ _E_
Remember THE HARDER YOU WORK THE LUCKIER YOU GET! _E_
Happy New Year to all including to my many enemies and those who have fought me and lost so badly they just don't know what to do. Love! _E_
Saudis just cut oil supplymaking prices rise "immediately" while we are fighting ISIS for them __HTTP__ What are we doing! _E_
"It is hard to fail but it is worse never to have tried to succeed." Theodore Roosevelt _E_
The U.S. once again condemns the brutality of the North Korean regime as we mourn its latest victim. Video: __HTTP__ _E_
Thank you so many people have given me credit for winning the debate last night. All polls agree. It was fun and interesting! _E_
The latest update on Bret Michaels is that he's making every effort to attend the live finale of Celebrity Apprentice on Sunday so tune in! _E_
Great piece by @EWErickson @RedState exposing how Karl 1.6% Rove cooked a poll in support of ObamaCare __HTTP__ _E_
Thank you! #VoteTrump __HTTP__ _E_
.@CNN @jaketapper at 9:00 A.M. _E_
Keep talking about me: use #TrumpRoast to tweet about how good I look on @ComedyCentral tonight at 10:30/9:30c __HTTP__ _E_
Tonight I will be signing copies of #TimeToGetTough in Westbury at Costco 1250 Old Country Rd from 6 pm to 8 pm _E_
Thank you Iowa! #ImWithYou __HTTP__ _E_
.@FoxNews will be re running Objectified: Donald Trump the ratings hit produced by the great Harvey Levin of TMZ at 8:00 P.M. Enjoy! _E_
Congratulations to @MittRomney on Tuesday night's sweep. He also delivered a 'Killer Speech' __HTTP__ _E_
I love that in addition to everything else so much money is raised for such great causes on Celebrity Apprentice all proud of that! _E_
My @gretawire interview discussing @BarackObama's USC comments insurance premiums @SarahPalinUSA on the (cont) __HTTP__ _E_
When you think big you will automatically trigger more details because details are the major component of making anything big. _E_
Via @eonline by @BrettMalec: "2014 @MissUniverse Contestants" __HTTP__ _E_
RT @TeamTrump: RT if you believe @HillaryClinton is the one who owes America an apology! #BigLeagueTruth #Debates __HTTP__ _E_
Lying traitor Snowden now claims that he did not give any information to the Russians or Chinese. Why doesn't he come home then? _E_
I can't believe that Mitt Romney would run for president again. He had his chance and blew it in the last weeks of the race. _E_
Why is Douglas Durst allowed to use the World Trade Center to get out of a lease with Conde Nast? _E_
Another new poll. Thank you for your support! Join the MOVEMENT today! #ImWithYou __HTTP__ __HTTP__ _E_
not anymore. The beginning of the end was the horrible Iran deal and now this (U.N.)! Stay strong Israel January 20th is fast approaching! _E_
We can create jobs in the American economy by protecting our own manufacturing sector. _E_
Budget that just passed is a really big deal especially in terms of what will be the biggest tax cut in U.S. history MSM barely covered! _E_
Thank you @JerryJrFalwell will see you soon. #TrumpPence16 __HTTP__ _E_
.@HillaryClinton channels John Kerry on trade: she was for bad trade deals before she was against them. #TPP #Debates2016 _E_
Goofy Elizabeth Warren and her phony Native American heritage are on a Twitter rant. She is too easy! I'm driving her nuts. _E_
Why is it that the horrendous protesters who scream curse punch shut down roads/doors during my RALLIES are never blamed by media? SAD! _E_
.kimguilfoyle great job tonight! _E_
More of my #TRUMPTUESDAY @SquawkCNBC interview discussing how the US gets killed negotiating with other countries __HTTP__ _E_
Trump Int'l Hotel & Tower New York includes Central Park views & our signature restaurant Jean Georges. Perfection! __HTTP__ _E_
Lance Armstrong just got sued by the Federal Government they want their money back I told you so! What was he thinking when he did that int? _E_
Entrepreneurs: See yourself as victorious. Look at the solution not the problem. And never give up! _E_
RT @FoxNews: Jobs added during @POTUS' time in office. __HTTP__ _E_
#CelebApprentice Photo from last night's boardroom. __HTTP__ _E_
RT @Reince: With a strong candidate in @POTUS & @GOP revolutionary data program Republicans carried WI for 1st time in 30 years __HTTP__ _E_
Marine Plane crash in Mississippi is heartbreaking. Melania and I send our deepest condolences to all! _E_
It is my opinion that many of the leaks coming out of the White House are fabricated lies made up by the #FakeNews media. _E_
RT @Team_Trump45: @realDonaldTrump __HTTP__ _E_
My robocall on behalf of @MittRomney playing across the state of Michigan __HTTP__ _E_
President Obama our great leader wants to declare martial law in New York City as a means of helping out with the massive storm. _E_
Hagel's performance yesterday was the worst I have ever seen before a committee of any kind! _E_
Our great project in South America Trump Tower Punta Del Este in Uruguay will have spectacular views and the... __HTTP__ _E_
Congratulations to @MittRomney on getting the @DMRegister @NewYorkPost @NewYorkObserver & @NashuaTelegraph endorsements! _E_
Huckabee is a nice guy but will never be able to bring in the funds so as not to cut Social Security Medicare & Medicaid. I will. _E_
With an elite course designed by @SharkGregNorman @Trump_Charlotte is North Carolina's most desirable club __HTTP__ _E_
WRONG: A China court ordered @apple to pay $60M to a Chinese company that registered iPad before @apple __HTTP__ _E_
One of the things that has been lost in the politics of this situation is that the Russians collected and spread negative information..... _E_
The Senate should be more concerned about actually passing a budget than spreading lies about @MittRomney's taxes. _E_
Yesterday was a referendum on ObamaCare & all other Obama fiascos. Republicans can now rein him in. _E_
Logic will get you from A to B. Imagination will take you everywhere. Albert Einstein" _E_
Honored to have received the endorsement of Lou Holtz a great guy! #INPrimary #Trump2016 __HTTP__ _E_
Ted Cruz went down big in just released Reuters poll what's going on? Is it Goldman Sachs/Citi loans or Canada? _E_
Rush is right. @limbaugh and I have both created more jobs than @BarackObama...in fact far more jobs! _E_
Tuesday will be a big day for our country to do a complete turnaround. MAKE AMERICA GREAT AGAIN! _E_
A message to my fellow Americans#IrmaHurricane2017 __HTTP__ __HTTP__ __HTTP__ _E_
The ONLY bad thing about winning the Presidency is that I did not have the time to go through a long but winning trial on Trump U. Too bad! _E_
Why is that Hillary Clintons family and Dems dealings with Russia are not looked at but my non dealings are? _E_
.@politico covers me more inaccurately than any other media source and that is saying something. They go out of their way to distort truth! _E_
My @FoxNews @megynkelly int. on why I am considering running for POTUS negotiations & making America great again __HTTP__ _E_
With the great vote on Cutting Taxes this could be a big day for the Stock Market and YOU! _E_
I am no fan of Bill Cosby but never the less some free advice if you are innocent do not remain silent. You look guilty as hell! _E_
Immigration reform is fine—but don't rush to give away our country! Sounds like that's what's happening. _E_
There's nothing like fall in #NewYorkCity. See where @TrumpCollection recommends you take in the season's beauty: __HTTP__ _E_
without retribution or consequence is WRONG! There will be a tax on our soon to be strong border of 35% for these companies ...... _E_
We were let down by all of the Democrats and a few Republicans. Most Republicans were loyal terrific & worked really hard. We will return! _E_
Don't let the fake media tell you that I have changed my position on the WALL. It will get built and help stop drugs human trafficking etc. _E_
Obama said not optimal to Ambassador & embassy killings bad word usage for a Harvard graduate. _E_
See you tonight Huntington West Virginia!#MakeAmericaGreatAgainTickets: __HTTP__ __HTTP__ _E_
.@Franklin_Graham: Great job on @foxandfriends this morning. You beautifully stated what most people are thinking! Say hi to all. _E_
Via CNN: Trump now leads in odds to win GOP nomination __HTTP__ _E_
Nancy Pelosi and Fake Tears Chuck Schumer held a rally at the steps of The Supreme Court and mic did not work (a mess) just like Dem party! _E_
Will be interviewed on the @oreillyfactor tonight at 8:00 P.M. Will be talking about the debate and more! _E_
The writer of the now proven false story in the @nytimes Michael Barbaro who was interviewed on CBS this morning was unable to respond. _E_
No matter how far down a path you go if it's the wrong path turn around and go back home before it is too late. _E_
Mar a Lago is Florida's most lavish and exclusive private club and spa with world class amenities __HTTP__ _E_
Departing NH now great morning with record crowd in Portsmouth in a snow storm! Thank you! __HTTP__ __HTTP__ _E_
You must be kidding zero chance he is innocent! _E_
Be sure to tune in for Melania's second QVC show for Melania Timepieces & Jewelry tonight live from 9 10 pm on QVC __HTTP__ _E_
RT @IvankaTrump: Check out my May Redbook magazine cover. Very exciting! #Redbook __HTTP__ _E_
A rough night for Hillary Clinton ABC News. _E_
Pay attention to global news and developments in today's world that is a requirement not an elective. _E_
Join me in San Jose California tonight!#MakeAmericaGreatAgain #Trump2016Tickets: __HTTP__ __HTTP__ _E_
Explain to @brithume and @megynkelly who know nothing that I will beat Hillary and win states (and dem indie votes) that no other R can! _E_
#CrookedHillary #PayToPlay __HTTP__ _E_
If the U.S. Government doesn't give the money necessary for the burials of our military personnel I will.The U.S. under Obama's leadership! _E_
She's back! Champion @Joan_Rivers returns to the boardroom in this year's All Star @ApprenticeNBC. Joan is ferocious. _E_
A clip from last night's @Late_Show where I detail my charitable offer to Obama and Dave describes his terrible grades __HTTP__ _E_
In making any decision you need all the facts. But after exhausting all due diligence in the end you have to go with your gut! _E_
Thank you @CarlHigbie. Great work on @CNN. #Trump2016 _E_
President Obama and other world leaders don't know how close they were to being seriously injured (or worse) standing next to psycho in SA. _E_
Sorry folks but Donald Trump is far richer and much better looking than dopey @mcuban! _E_
#ICYMI: Joint Statement with Prime Minister Shinzo Abe on North Korea. __HTTP__ _E_
Remember but for Conservatives Bush would have given us not only Roberts but also Harriet Miers. Face it Bush was terrible! _E_
It would have been much easier for me to win the so called popular vote than the Electoral College in that I would only campaign in 3 or 4 _E_
"@DonaldJTrumpJr: 'We want to build everything in Dubai" __HTTP__ via @CWO_dotcom _E_
Thank you for your support at this mornings Town Hall in Salem New Hampshire. #FITN #NHPrimary __HTTP__ _E_
The Spa @TrumpWaikiki offers unique treatments that use traditional Hawaiian botanicals & healing techniques __HTTP__ _E_
My @SquawkCNBC interview discussing interest rates the deficit @RepPaulRyan's timing @TimTebow and the Doral __HTTP__ _E_
That's what I find so morally offensive about welfare dependency: it robs people of the chance to improve. Work (cont) __HTTP__ _E_
TIME #DebateNight poll over 800000 votes. Thank you! #AmericaFirst #MAGA __HTTP__ _E_
RT @AnnCoulter: RUMSFELD: Trump has a touched a nerve in our country...in a way that most politicians have not been able to do. __HTTP__ _E_
John @CahillForAG is one of the most respected people in politics. Dopey @AGSchneiderman is one of the least respected! _E_
Great op ed from @RepKenBuck. Looks like some in the Freedom Caucus are helping me end #Obamacare. __HTTP__ _E_
Lots of great new polls big leads! __HTTP__ __HTTP__ __HTTP__ _E_
Obama our Welfare & Food Stamp President is praising himself for expanding welfare __HTTP__ He doesn't believe in work. _E_
I am very supportive of the Senate #HealthcareBill. Look forward to making it really special! Remember ObamaCare is dead. _E_
RT @Realjmannarino: @realDonaldTrump The ungratefulness is something I've never seen before. If you get someone's son out of prison he sho... _E_
Entrepreneurs: Set the example. You can motivate others as well as yourself by remembering you are setting the example. _E_
For all of those that were hoping I was wrong and this is a very unimportant subject to me Dwight Howard just officially announced Houston _E_
The only place success comes before work is in the dictionary. Vince Lombardi _E_
People must remember that ObamaCare just doesn't work and it is not affordable 116% increases (Arizona). Bill Clinton called it CRAZY _E_
.@CharlesGKoch is looking for a new puppet after Governor Walker and Jeb Bush cratered. He now likes Rubio next fail. _E_
Looking forward to speaking at the @ARGOP Reagan Rockefeller Dinner tonight! Record crowd. We are no longer silent! #MAKEAMERICAGREATAGAIN! _E_
RT @hughhewitt: #NeverTrumpers elite MSMers and virtue signalers are persuading themselves that @realDonaldTrump supporters are deserting.... _E_
"Learn work and think in equal proportions and you'll be going in the right direction." – Think Like a Champion _E_
Do executives at @msnbc know that the business of TV centers on viewers & ratings? @msnbc is #19 on cable __HTTP__ Sad. _E_
I had a great time today visiting Facebook NY. __HTTP__ _E_
It's Monday how many more excuses will Obama make today about the economy? _E_
.@SouthJerseyMag "According to the Pros" just named Trump National Golf Club Philadelphia the #1 private club. Thanks! _E_
I'm really saddened to see that @Cher was voted "the 4th ugliest celebrity" according to @listverse.... _E_
We are leaving Iraq after expending a tremendous amount of blood and treasure. We should be reimbursed with oil! Don't give it to Iran. _E_
Watch Late Night with Jimmy Fallon on NBC at 12:35 EST tonight I'll be bringing a couple of surprises with me. _E_
Business is no place for stream of consciousness babbling. Keep it short fast and direct. Think Like a Champion _E_
NY State Republican Party must unify or November will be another disaster. _E_
.@MichaelRCaputo Thank you for all of your support you have been amazing! _E_
People who lost money when the Stock Market went down 350 points based on the False and Dishonest reporting of Brian Ross of @ABC News (he has been suspended) should consider hiring a lawyer and suing ABC for the damages this bad reporting has caused many millions of dollars! _E_
After settling for a ridicilous 13 billion dollars J.P.Morgan's lawyer is critical of the amount of the fine why did they settle then DUMB! _E_
Terrific response to my previous tweet: I'll be in Dallas at the American Airlines Center on Sept 14th at 6 PM. __HTTP__ ... _E_
Hypocrite! in '06 @BarackObama called private equity the best opportunity for long term economic vitality __HTTP__ _E_
Emmys telecast is way down & lowest telecast on record among young adults. Emmys have no credibility Should have nominated Apprentice again! _E_
#CelebrityApprentice ranked #1 among ABC CBS and NBC in all key demos from 10 11PM. It won the 10PM hour by a 53% margin in 18 49 rating. _E_
If crazy @megynkelly didn't cover me so much on her terrible show her ratings would totally tank. She is so average in so many ways! _E_
West Virginia was incredible last night. Crowds and enthusiasm were beyond GDP at 3% wow!Dem Governor became a Republican last night. _E_
: @realDonaldTrump @HelpUServe When we have people eating out of trash cans in this country we have no business helping any other country _E_
I will be going to Sarasota Florida today for a big rally with amazing people! I have one goal on mind: MAKE AMERICA GREAT AGAIN! _E_
Made a speech in Arkansas last night before a record GOP crowd. Great spirit and amazing people. MAKE AMERICA GREAT AGAIN! _E_
Jeffrey Lord @AmSpec—Thank you for the presentation—terrific job! _E_
I am making a major speech in West Palm Beach Florida at noon. Tune in! _E_
We just had the worst jobs report since 2010. _E_
The ObamaCare enrollment numbers are a lie.They will be 'readjusted' by the White House at an opportune time probably after '14 election _E_
We are one nation. When one state hurts we all hurt. We must all work together to lift each other up. __HTTP__ _E_
HAPPY BIRTHDAY to my son @DonaldJTrumpJr! Very proud of you! #TBT __HTTP__ __HTTP__ _E_
There is only one way to avoid criticism: do nothing say nothing and be nothing. – Aristotle _E_
Win a dinner with @MittRomney and me in New York this June 28th. It's selling like hotcakes! __HTTP__ _E_
Bob Corker who helped President O give us the bad Iran Deal & couldn't get elected dog catcher in Tennessee is now fighting Tax Cuts.... _E_
Leaving South Korea now heading to China. Looking very much forward to meeting and being with President Xi! _E_
Mitt Romney had his chance and blew it. Lindsey Graham ran for president got ZERO and quit! Why are they now spokesmen against me? Sad! _E_
Crooked Hillary Clinton knew everything that her servant was doing at the DNC they just got caught that's all! They laughed at Bernie. _E_
As families prepare for summer vacations in our National Parks Democrats threaten to close them and shut down the government. Terrible! _E_
.CNN & @CNNPolitics Lawyer Elizabeth Beck did a terrible job against me she lost (I even got legal fees). I loved beating hershe was easy _E_
Remember when Jeb gave Hillary a medal on the 1 year anniversary of Benghazi?! __HTTP__ Guess he would have invaded Libya too! _E_
Congratulations to the House of Representatives for passing the #TaxCutsandJobsAct — a big step toward fulfilling our promise to deliver historic TAX CUTS for the American people by the end of the year! __HTTP__ _E_
The Mexican legal system is corrupt as is much of Mexico. Pay me the money that is owed me now and stop sending criminals over our border _E_
If Michael Bloomberg ran again for Mayor of New York he wouldn't get 10% of the vote they would run him out of town! #NeverHillary _E_
China is building 50 brand new airports while our country continues to rott! Very sad. _E_
Congratulations to @KingJames on winning Athlete of the Year in last night's @ESPYS. LeBron is also a great guy! _E_
I'm thrilled to announce that my new tailored clothing line has officially launched at Macy's. In business it'... (cont) __HTTP__ _E_
I will be at the Cadillac World Golf Championship @TrumpDoral in Miami tomorrow! Rory Phil Bubba Adam and Dustin all at the top! _E_
If ObamaCare is hurting people & it is why shouldn't it hurt the insurance companies & why should Congress not be paying what public pays? _E_
Thank you! #Trump2016 __HTTP__ _E_
RT @DeptofDefense: VIDEO: Elements of the #DoD and @FEMA are providing humanitarian relief for #PuertoRico and #USVI 🇻🇮. __HTTP__ _E_
Placing the ball in the right position for the next shot is eighty percent of winning golf. Ben Hogan _E_
My @CNBCClosingBell interview discussing America's financial uncertainty due to @BarackObama and the job report __HTTP__ _E_
Policy towards our enemies: Hit them hard hit them fast hit them often & then tell them it was because they are the enemy! _E_
You must promise that you will never cheat off Manti Te'o's test papers. _E_
...goodwill and friendship was formed but only time will tell on trade. _E_
The FAKE NEWS media (failing @nytimes @NBCNews @ABC @CBS @CNN) is not my enemy it is the enemy of the American People! _E_
ObamaCare is imploding. It is a disaster and 2017 will be the worst year yet by far! Republicans will come together and save the day. _E_
Twitter is on @BarackObama's enemies list __HTTP__ _E_
While Putin is scheming and beaming on how to take over the World President Obama is watching March Madness (basketball)! _E_
RT @Team_Trump45: @realDonaldTrump __HTTP__ _E_
Wow big lines in Kansas. _E_
The Republicans should NOT give @BarackObama the authority to raise the debt another $1.2Trillion (cont) __HTTP__ _E_
.@lancearmstrong really blew it went down in flames too bad! _E_
Good advice from my mother: Trust in God and be true to yourself. Mary Trump _E_
It was a great honor to welcome Prime Minister Najib Abdul Razak of Malaysia and his distinguished delegation to the @WhiteHouse today! __HTTP__ _E_
Why aren't the same standards placed on the Democrats. Look what Hillary Clinton may have gotten away with. Disgraceful! _E_
.@CNN should stop apologizing for the mistake they made the other day & get back to reporting! _E_
My @FoxNews interview with @gretawire discussing how @BarackObama is delusional and how a 3rd party candidate can win. __HTTP__ _E_
Being at the Army Navy Game was fantastic. There is nothing like the spirit in that stadium. A wonderful experience and congrats to Army! _E_
Just did Howard Stern Show great time. Now doing The Today Show with Ivanka. ENJOY! _E_
What are Hillary Clinton's people complaining about with respect to the F.B.I. Based on the information they had she should never..... _E_
Which National Costume do you think should win? __HTTP__ _E_
Congratulations to our Olympic team for by far winning the most medals including first place gold. _E_
It's freezing and snowing in New York we need global warming! _E_
Polls close at 6pm! #INPrimary #Trump2016 #VoteTrump __HTTP__ _E_
.@foxandfriends in five minutes! _E_
After consultation with my Generals and military experts please be advised that the United States Government will not accept or allow...... _E_
.@FLOTUS & I were honored to host our first WH Congressional Picnic. A wonderful evening & tradition. @MarineBand:... __HTTP__ _E_
The Trump Signature Collection exclusively available at @Macys is the pinnacle of style and prestige __HTTP__ _E_
Pathetic @BarackObama is 'sweetening' his offer to the Taliban __HTTP__ Read 'The Art of The Deal.' _E_
Great going @themichellewie –you showed the world that all of that amazing talent is for real. We love you at Trump Jupiter @TNGCJ _E_
.@DonaldJTrumpJr's @CNBC interview discussing the starving demand that is fueling high end luxury __HTTP__ _E_
The safest way to preserve Medicare is with a robust and vibrant economy. We should lower corporate and capital gain taxes immediately. _E_
Clinton is trying to wash away her bad judgement call on BREXIT with big dollar ads. Disgraceful! _E_
I had a great time doing press interviews with @LisaLampanelli and @Teresa_Giudice earlier today __HTTP__ _E_
70 years ago today the National Security Council met for the first time. Great history of advising Presidents then & now! Thanks NSC Staff! _E_
Congratulations to the 2016 #StanleyCup Champions Pittsburgh @Penguins! _E_
I will be in beautiful Burlington Vermont tonight for a rally. Will be great fun. MAKE AMERICA GREAT AGAIN! _E_
Canada will now sell its oil to China because @BarackObama rejected Keystone. At least China knows a good deal when they see it. _E_
Very good speech by @MichelleObama and under great pressure Dems should be proud! _E_
Why was @BarackObama selling guns to Mexican drug dealers? _E_
Trump Turnberry is a spectacular place and home to four of the greatest Open Championships of all time. __HTTP__ _E_
Join us via our new #AmericaFirst APP! #TrumpPence16 __HTTP__ __HTTP__ _E_
Thank you Pennsylvania! Going to New Hampshire now and on to Michigan. Watch PA rally here: __HTTP__ __HTTP__ _E_
Paul Ryan said that I inherited something very special the Republican Party. Wrong I didn't inherit it I won it with millions of voters! _E_
Congratulations to @Yankees Derek Jeter on being named to 2014 @MLB @AllStarGame! _E_
I don't like the opening even a little bit! _E_
Today is Donald Trump's Birthday! Send him your B'day wishes here: __HTTP__ _E_
Do your homework. Wasting other people's time due to poor planning or thoughtlessness leaves a bad impression. – Think Like a Billionaire _E_
Congress now has 6 months to legalize DACA (something the Obama Administration was unable to do). If they can't I will revisit this issue! _E_
Will be interviewed on @foxandfriends at 8:30 A.M. Eastern. ENJOY! _E_
Wow one of the all time greats in fashion OSCAR DE LA RENTA has just died at 82. Great fashion achievements but also a really nice guy! _E_
Via @dcexaminer by @rebeccagberg: "Trump: 'I'm the only one who can beat' Hillary" __HTTP__ _E_
Last week's episode of the Celebrity Apprentice set the stage for a great new season. Tune in this Sunday on NBC for even more excitement. _E_
I have over seven million hits on social media re Crooked Hillary Clinton. Check it out Sleepy Eyes @MarkHalperin @NBCPolitics _E_
Obama has no problem leaking national security secrets. Why can't he release his records? Especially when $5M is going to charity. _E_
Via @reason: Donald Trump: I Can Fix America __HTTP__ _E_
....we need to keep America safe including moving away from a random chain migration and lottery system to one that is merit based. __HTTP__ _E_
To be really successful it is always good to have A COOL HEAD WARM HEART AND BEAUTIFUL COMMON TOUCH! _E_
My @FoxNews interview on @gretawire discussing The China Curse __HTTP__ _E_
Yes Arnold Schwarzenegger did a really bad job as Governor of California and even worse on the Apprentice...but at least he tried hard! _E_
Join me in Indianapolis Indiana tomorrow at 3pm! #Trump2016#MakeAmericaGreatAgainTickets: __HTTP__ __HTTP__ _E_
Remember official campaign merchandise (hats apparel etc.) can only be bought at __HTTP__ Be careful don't get ripped off _E_
Marco Rubio would keep Barack Obama's executive order on amnesty intact. See article. Cannot be President. __HTTP__ _E_
Big game trophy decision will be announced next week but will be very hard pressed to change my mind that this horror show in any way helps conservation of Elephants or any other animal. _E_
#TrumpVlog @Rosie needs to rest and relax. It's not working. __HTTP__ _E_
Happy Halloween! __HTTP__ _E_
By failing to prepare you are preparing to fail. Benjamin Franklin _E_
First there was the Declaration of Independence then there was the Constitution. Now there is #TimeToGetTough. Available today. _E_
Who would have thought that an @ApprenticeNBC champion would return to compete? @bretmichaels returns to All Star @CelebApprentice... _E_
My new radio ad airing today in Wisconsin! See you soon!#WIPrimary #Trump2016 __HTTP__ _E_
Come on goAngelo don't give up now just because your rally at Macy's drew only eleven people for twenty minutes! I love@ Macy's. _E_
Solyndra's government loan and subsequent bankruptcy prove that @BarackObama is both corrupt and inept. _E_
Trump Int'l Golf Links Ireland in County Clare fronts the Atlantic Ocean & is #1 Resort in Europe/Conde Nast Traveler __HTTP__ _E_
The Democrats have been told and fully understand that there can be no DACA without the desperately needed WALL at the Southern Border and an END to the horrible Chain Migration & ridiculous Lottery System of Immigration etc. We must protect our Country at all cost! _E_
I've sent a 10 wheeler filled with 358 master cases of food and supplies to my hometown of Queens today #TrumpCares _E_
.@RalphGilles of Chrysler should focus on design rather than filthy language not very professional. _E_
Young Entrepeneurs: Think Big Stay Motivated & Always Remain Confident. The Sky is the Limit. _E_
House GOP better get its act together.Defund ObamaCare. Out negotiate on debt ceiling. Form commissions on Benghazi & IRS. No excuses! _E_
It's freezing in New York—where the hell is global warming? _E_
Congratulations to @sethmeyers on "Emmy's Rating Tumble" __HTTP__ Just as I predicted Seth bombed! . _E_
Is this true about Univision and Fusion? Wow!?! __HTTP__ _E_
Via @thehill by @timdevaney: Donald Trump: GOP nominee 'can't be Mitt can't be Bush' __HTTP__ _E_
My @Live5News int. with @WilliamLive5 in South Carolina with @citadelgop cadets on my 757 discussing 2016 __HTTP__ _E_
Today the House votes on two crucial bills:#NoSanctuaryForCriminalsAct #KatesLaw Pass these bills & lets... __HTTP__ _E_
Had dinner with @RickPerry last night great guy straight shooter impressive record. _E_
Is it the Neil Patrick Harris show or the Emmy Awards?How was he ever put in this position to start with? CRAZY! _E_
Everytime someone tweets that I wear a wig realize to yourself that you are dealing with them just another sad & lonely hater and loser! _E_
Do you notice that Hillary spews out Jeb's name as often as possible in order to give him status? She knows Trump is her worst nightmare. _E_
Voting for @GovGaryJohnson is voting for Obama don't waste your vote! _E_
Our country must get very strong and very tough and fast before it is too late. We have zero leadership and never WIN! We want victory. _E_
"How much money can you stand to lose? That's how much risk you should assume." – Think Like a Billionaire _E_
.@BarackObama wants to see 10 yrs of @MittRomney's tax returns tell him ok but we want to see your college applications first.' _E_
I hope @TGowdySC does better for Rubio than he did at the #Benghazi hearings which were a total disaster for Republicans & America! _E_
Canada's legal immigration plan starts with a simple and smart question: How will any immigrant applying fo... (cont) __HTTP__ _E_
Remember victims of Hurricane Sandy during Thanksgiving. Many will not be celebrating the holiday in comfort.Their lives are in turmoil! _E_
Thank you Northern Mariana Islands!#SuperTuesday #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Join us Monday February 8th @ the Verizon Wireless Arena in Manchester New Hampshire! #FITN #NHPolitics #Trump2016 __HTTP__ _E_
Via @fitsnews: "Donald Trump: John McCain Is 'A Loser'" __HTTP__ _E_
Signing orders to move forward with the construction of the Keystone XL and Dakota Access pipelines in the Oval Off... __HTTP__ _E_
Thank you Indiana! Will be back soon!#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
If I am elected President I will immediately approve the Keystone XL pipeline. No impact on environment & lots of jobs for U.S. _E_
Watch What's America Worth? hosted and narrated by me this Sunday at 9PM on @Discovery __HTTP__ _E_
Think. That's the first step. Use all your power to utilize and develop that capability Donald J. Trump __HTTP__ _E_
Crooked Hillary no longer has credibility too much failure in office. People will not allow another four years of incompetence! _E_
RT @SLandinSoCal: @foxandfriends @realDonaldTrump Nothing can stop the #TrumpTrain __HTTP__ _E_
Goofy political pundit George Will spoke at Mar a Lago years ago. I didn't attend because he's boring & often wrong—a total dope! _E_
Mitt Romney had his chance to beat a failed president but he choked like a dog. Now he calls me racist but I am least racist person there is _E_
Coincidence? More than half of @BarackObama's 47 biggest fundraisers have been given administration jobs. __HTTP__ _E_
The Celebrity Apprentice delivers the goods and the puppets Sunday at 9 pm on NBC __HTTP__ _E_
"Trump Gives 'Em Hell" __HTTP__ via @limbaugh _E_
The dummies left Iraq (and Libya) without the oil! _E_
Did you know Donald Trump is on Facebook? __HTTP__ Become a fan today! _E_
We should immediately stop sending our beautiful American tax dollars to countries that hate us and laugh at our President's stupidity! _E_
Always remember that as your success grows you will be asked for more favors. Learn how to say 'No.' It is critical. _E_
Death spiral!'Aetna will exit Obamacare markets in VA in 2018 citing expected losses on INDV plans this year' __HTTP__ _E_
84% of US troops wounded & 70% of our brave men & women killed in Afghanistan have all come under Obama. Time to get out of there. _E_
Wow just saw an ad Cruz is lying on so many levels. There is nobody more against ObamaCare than me will repeal & replace. He lies! _E_
"You have to have confidence in yourself and confidence to know that what you are doing is right." – Think Big _E_
I will not let you down! #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
"If you want the best you'd better be the best – in all aspects of business." – Think Like a Billionaire _E_
Obama still will keep all military recruitment centers & bases Gun Free Zones! It has to stop. MILITARY LIVES MATTER! _E_
Hillary Clinton is a major national security risk. Not presidential material! _E_
Well this is it the final debate let's see how it goes. I'll be tweeting live. _E_
RT @EricTrump: #MakeAmericaGreatAgain!!! __HTTP__ _E_
Check out Serta's Counting Sheep (and me) at the Trump International Hotel New York __HTTP__ _E_
My thoughts and prayers are with the victims and families of those affected by two powerful earthquakes in Italy and Myanmar. _E_
Sen. @DavidVitter & @David_Bossie w/@seanhannity __HTTP__ demand 'Congress Live By Your Laws' __HTTP__ _E_
Today's assignment: read Chapter 7 'Trump Tower: The Tiffany Location' of The Art of the Deal. Focus on how I marketed the property. _E_
Obama lied when he said "you can keep your plan" so why would anyone believe his bogus ObamaCare enrollment numbers?! _E_
By popular request I will also be tweeting live during the Vice Presidential debate Thursday night. It will be very interesting I promise. _E_
The United States is considering in addition to other options stopping all trade with any country doing business with North Korea. _E_
Of the 9 battleground states we only carried North Carolina. I'm proud of @NCGOP & glad I delivered keynote at their state convention. _E_
Yankees should have dropped A Rod long ago not even bothered with arbitration. They would have saved a fortune! _E_
Going to D.C. for big groundbreaking on Old Post Office site. Will be spectacular new hotel. Lots of jobs! _E_
Disproven and paid for by Democrats "Dossier used to spy on Trump Campaign. Did FBI use Intel tool to influence the Election?" @foxandfriends Did Dems or Clinton also pay Russians? Where are hidden and smashed DNC servers? Where are Crooked Hillary Emails? What a mess! _E_
I like Mexico and love the spirit of Mexican people but we must protect our borders from people from all over pouring into the U.S. _E_
On @FallonTonight with @jimmyfallon at 11:30 PM. Enjoy! _E_
Father's Day is Sunday. Find the perfect gift.Trump Signature Collection is exclusively available @Macys __HTTP__ _E_
I'm sending lots of bottled water out to Staten Island & Long Island. _E_
Little Marco Rubio the lightweight no show Senator from Florida is set to be the puppet of the special interest Koch brothers. WATCH! _E_
Attorney General Bill Schuette will be a fantastic Governor for the great State of Michigan. I am bringing back your jobs and Bill will help _E_
I'm turning down millions of dollars of campaign contributions—feel totally stupid doing so but hope it is appreciated by the voters. _E_
THANK YOU IOWA! Highly respected @OANN @GravisMarketing poll just released. #VoteTrump #IowaCaucus __HTTP__ _E_
The Trump Hotel Collection is currently nominated for Conde Nast Traveler Readers Choice Awards Travel & Leisure and World Travel Awards. _E_
Getting ready to land in Hawaii. Looking so much forward to meeting with our great Military/Veterans at Pearl Harbor! _E_
I spent Friday campaigning with John Kennedy of the Great State of Louisiana for the U.S.Senate. The election is over JOHN WON! _E_
The great workers who just completed the skylight at Trump International Hotel D.C. (Old Post Office) __HTTP__ _E_
Obama should work on a ceasefire in Chicago as well as Gaza. _E_
Canadians: My ultra luxury private plane will be featured on Sunday's episode of #MightyPlanes on @DiscoveryCanada don't miss it at 8 ET! _E_
When I think big which is often you can be sure I'm aware of the enormous amount of little things that we will have to account for. _E_
I'm saying that the Tea Party perhaps by another name will soon have another big moment and will be a major factor in victory! _E_
The fastest way we can start saving Social Security is to get Americans back to work. #TimeToGetTough (cont) __HTTP__ _E_
To Jamie Dimon—I love kicking lightweight @AGSchneiderman's ass. Stop settling and fight! _E_
"There is no worse feeling than being trapped in a job you do not enjoy. You have to love what you do." Think Big _E_
My @SquawkCNBC #TrumpTuesday interview discussing how @MittRomney can win the first debate & the last 35 days __HTTP__ _E_
Thank you @FaithandFreedom Coalition! An honor joining you today to discuss our shared values.#RTM2016 #Trump2016 __HTTP__ _E_
"Partner with people who share your values attitude and drive." – Midas Touch with @theRealKiyosaki _E_
"House votes on controversial FISA ACT today." This is the act that may have been used with the help of the discredited and phony Dossier to so badly surveil and abuse the Trump Campaign by the previous administration and others? _E_
Interview with @LouDobbs coming up at 7pmE on @FoxBusiness. Enjoy! __HTTP__ _E_
RT @foxandfriends: Head of the NYPD union slams Mayor de Blasio for skipping vigil for assassinated cop Miosotis Familia __HTTP__ _E_
...So far he has been a complete failure at doing so. He should read The Art of the Deal and use his energy to focus on a new career. _E_
.@GovernorPerry in my office last cycle playing nice and begging for my support and money. Hypocrite! __HTTP__ _E_
It's not enough that we do our best sometimes we have to do what's required. Winston Churchill _E_
Crude is at $85 right now – isn't even worth half that. OPEC is ripping us off. _E_
Crooked Hillary Clinton lied to the FBI and to the people of our country. She is sooooo guilty. But watch her time will come! _E_
I've realized that success requires 100% effort and 100% focus. Nothing less. Get out there and go for it. _E_
Only very stupid people think that the United States is making good trade deals with Mexico.Mexico is killing us at the border and at trade! _E_
Check out my most recent interview with CNN... __HTTP__ _E_
Coach W to his basketball players BE QUICK BUT DON'T HURRY! _E_
The Democrats are all talk and no action. They are doing nothing to fix DACA. Great opportunity missed. Too bad! _E_
Stop flights into the U.S. from West Africa immediately! _E_
Tom Ridge should be focused on trying to bring the party together rather than ripping it apart w/ your faulty thought process. I will win! _E_
The Miss Universe Pageant will be broadcast live from MOSCOW RUSSIA on November 9th. A big deal that will bring our countries together! _E_
Two dozen NFL players continue to kneel during the National Anthem showing total disrespect to our Flag & Country. No leadership in NFL! _E_
Great honor to receive today's endorsement of @RickSantorum. Really nice! #Trump2016 _E_
.@DanaPerino wrote a wonderful book "And the Good News is.. Dana has a fabulous perspective on life & politics—go get it! _E_
The upcoming All Star season of @CelebApprentice has @lisarinna returning to compete. She doesn't disappoint! _E_
If we could force Russia China and other competitors to use ObamaCare we would be able to instantly destroy their great economic success! _E_
Heading to Myrtle Beach South Carolina. Really big crowd—so much to talk about! _E_
Agreed @piersmorgan says he and @OMAROSA have a "communication malfunction." #CelebApprentice _E_
Our trade deficit is still on pace to be over $500B. This is killing our manufacturing sector and sending jobs overseas. _E_
Ebola's spread is 'unprecedented' says CDC chief __HTTP__ _E_
Central Park's top locale @TrumpRink is open throughout the holidays. Our Skating School is excellent & acclaimed __HTTP__ _E_
Almost every television network wants me badly—but I stay loyal to @NBC. _E_
Thank you @CharlesHurt for the nice words on @seanhannity. I will win and Make America Great Again! _E_
The real war on women over 175000 fewer held jobs in July & 94000 dropped out of labor force __HTTP__ We must do better. _E_
Thank you @JerryJrFalwell! __HTTP__ _E_
Lance Armstrong fought for 7 years & then just ran out of energy. Very sad story although they caught him red handed.He definitely cheated! _E_
In one of the biggest stories in a long time the FBI now says it is missing five months worth of lovers Strzok Page texts perhaps 50000 and all in prime time. Wow! _E_
The failing @nytimes wrote yet another hit piece on me. All are impressed with how nicely I have treated women they found nothing. A joke! _E_
Never quit and always hit back The Art of the Comeback _E_
Remember I said Derek don't sell your Trump World Tower apartment...its been lucky for you. The day after he sold it he broke his foot. _E_
Signing a recent tax return isn't this ridiculous? __HTTP__ _E_
The electoral college is a disaster for a democracy. _E_
I think it was terrible that Tim Cook of Apple apologized to China. What the hell is he apologizing for? Steve Jobs wouldn't. _E_
The 48000 sq. ft. Spa @TrumpDoral boasts 33 treatment rooms and over 100 signature spa services and treatments __HTTP__ _E_
During the campaign I promised to MAKE AMERICA GREAT AGAIN by bringing businesses and jobs back to our country. I am very proud to see companies like Chrysler moving operations from Mexico to Michigan where there are so many great American workers! __HTTP__ _E_
Thank you Farmington New Hampshire! #FITN #Trump2016 __HTTP__ _E_
Where are the 50000 important text messages between FBI lovers Lisa Page and Peter Strzok? Blaming Samsung! _E_
FAKE NEWS media knowingly doesn't tell the truth. A great danger to our country. The failing @nytimes has become a joke. Likewise @CNN. Sad! _E_
To be a visionary you have to chase impossibilities. Few ever get rich easily. Think Like a Billionaire _E_
I will be doing @foxandfriends this morning at 8 (not 7). _E_
I want to see people make lots of $$ and live better lives. I really think they can do that through TheTrumpNetwork __HTTP__ _E_
Today we remember the crew of the Space Shuttle Challenger 31 years later. #NeverForget __HTTP__ _E_
The cast for next season looks really good! _E_
All recent Presidents have released their transcripts. What is @BarackObama hiding? _E_
Congratulations to @SenScottBrown on running an aggressive & fair campaign. Vote for Scott today New Hampshire! _E_
Remember new environment friendly lightbulbs can cause cancer. Be careful the idiots who came up with this stuff don't care. _E_
Trump Tuesday: I'll be on @SquawkCNBC tomorrow morning at 7:30 AM. Be sure to tune in. _E_
....getting great border security and healthcare. #VoteRalphNorman tomorrow! _E_
The new line of Trump ties shirts and cufflinks are out at Macy's and are really beautiful at a really reasonable.price. Go check them out! _E_
I will be going to Aberdeen Scotland today to help my team celebrate the great success of Trump International Golf Links press conference. _E_
#TrumpVlog Obama stop chewing gum! __HTTP__ _E_
Iran looks like it is toying with John Kerry on nuclear talks he is begging for a deal to save face. Negotiation is just not his thing! _E_
Attending Chief Ryan Owens' Dignified Transfer yesterday with my daughter Ivanka was my great honor. To a great and brave man thank you! _E_
Traitor Snowden has requested asylum in Russia. Why would Russia grant it? Snowden already gave them all the intel he stole! _E_
While I hear the Koch brothers are in big financial trouble (oil) word is they have chosen little Marco Rubio the lightweight from Florida _E_
Obama was beaten but not knocked out. He lives to fight another day. But in the real world presidents are not given a second chance... _E_
.@oreillyfactor called me a master marketeer last night I am not. I am a great builder I build great things & people come. _E_
RT @FLOTUS: I had a wonderful time with the students at the American International School #Riyadh today. #SaudiaArabia __HTTP__ _E_
"Exclusive: Donald Trump wants to build a luxury hotel in Dubai" __HTTP__ via @itp_ab by @ctrenwith _E_
Politician @SenatorCardin didn't like that I said Baltimore needs jobs & spirit. It's politicians like Cardin that have destroyed Baltimore. _E_
Workers of firm involved with the discredited and Fake Dossier take the 5th. Who paid for it Russia the FBI or the Dems (or all)? _E_
Discovery breeds discovery as in success breeds success. Questions are thoughts with a quest. Think Like a Champion _E_
#TrumpVlog Hagel quits __HTTP__ _E_
While I was in Moscow I see that President Obsma apologized for his lie I mean statement on ObamaCare! How nice of him to be so forthright _E_
So @BarackObama is celebrating his 'birthday' with a fundraiser in his home he bought with the help of Rezko __HTTP__ _E_
Great to see @SarahPalinUSA back on @FoxNews. She's a wonderful woman and commentator. _E_
How much money are the lawyers for the Central Park Five getting out of the 40 million dollars or are they paid by the City (or both)? _E_
RT @DanScavino: .@NikkiHaley in 2012 w/ Romney on tax returns🤔(political ploy.) Fast forward..2016 w/ Robot Rubio🤖#FAIL👎#Politician __HTTP__ _E_
RT @JacobAWohl: @realDonaldTrump The #MAGA great again movement is WINNING and the left wing media can't stand it! _E_
We need economic growth and jobs not blue ribbon panels to study the problem. _E_
Looks like @OMAROSA is up to the challenge. #CelebApprentice _E_
Florida Power & Light did a fantastic job of providing service & energy during the big storm in Palm Beach. @insideFPL _E_
Everybody is asking why the Justice Department (and FBI) isn't looking into all of the dishonesty going on with Crooked Hillary & the Dems.. _E_
Thank you @SenatorDole very kind! __HTTP__ _E_
Just in—all efforts to stop sexual abuse in the military have totally failed—in fact the stoppers have become the abusers. _E_
Dopey Sugar @Lord_Sugar I'm worth more than $8 billion acknowledged almost no debt ... _E_
RT @CLewandowski_: The Scrum: Video Emerges to Suggest WaPo Reporter Ben Terris Misidentifies Lewandowski in Fields Incident Breitbart __HTTP__ _E_
I can't believe @VanityFair would renew Graydon Carter's contract...... _E_
"Trump: 'Very much inclined' to enter GOP White House race" __HTTP__ via @McClatchyDC by @LightmanDavid _E_
A great day in New Hampshire and Maine. Fantastic crowds and energy! #MAGA _E_
FBI Deputy Director Andrew McCabe is racing the clock to retire with full benefits. 90 days to go?!!! _E_
Great news @BarbaraJWalters has fully recovered and will be back on @theviewtv this coming Monday. Barbara is wonderful! _E_
Will be having many meetings this weekend at The Southern White House. Big 5:00 P.M. speech in Melbourne Florida. A lot to talk about! _E_
.@_Just_Mads_ #asktrump __HTTP__ _E_
Terrible attacks in NY NJ and MN this weekend. Thinking of victims their families and all Americans! We need to be strong! _E_
#trumpvlog @BarackObama's dismal record in today's video blog.... __HTTP__ _E_
Gov Mike Pence has just stated that Donald Trump has taken a strong stance on Hoosier jobs and he thanks me! I will bring back jobs to USA. _E_
Had a great time with @MittRomney last night. He is focused and ready for the battle ahead. Lots of money was raised. _E_
Lightweight @AGSchneiderman is fighting with @NYGovCuomo –Cuomo wins that one easily. Schneiderman is a total loser. _E_
I was referring to the fact that Jeb Bush wants to keep common core. _E_
The countdown is on. The 13th season of All Star @ApprenticeNBC premieres this Sunday March 3rd at 9PM EST on @nbc. Big! _E_
I'll be speaking at the first ever National Achievers Congress at the San Jose Convention Center (San Jose CA) (cont) __HTTP__ _E_
Very little reporting about the GREAT GDP numbers announced yesterday (3.0 despite the big hurricane hits). Best consecutive Q's in years! _E_
Even Usain Bolt from Jamaica one of the greatest runners and athletes of all time showed RESPECT for our National Anthem! 🇲 __HTTP__ _E_
Billions of dollars in investments & thousands of new jobs in America! An initiative via Corning Merck & Pfizer: __HTTP__ __HTTP__ _E_
I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
I will bring our jobs back to the U.S. and keep our companies from leaving. Nobody else can do it. Our economy will sing again. _E_
Happy belated birthday wishes to @BarbaraJWalters. Barbara is terrific! _E_
Gov. John Kasich has really failed on the campaign trail. I thought he would have been far more talented. He is just wasting time & money! _E_
"If you want to succeed you should strike out on new paths rather than travel worn paths of accepted success." John D. Rockefeller _E_
Very interesting election currently taking place in France. _E_
A great honor to host PM Paolo Gentiloni of Italy at the White House this afternoon! #ICYMI Joint Press Conference... __HTTP__ _E_
Announced w/ @pgaofamerica that we will bring @seniorpgachamp to @TrumpGolfDC & @pgachampionship to Trump Bedminster _E_
Attention to detail is critical choose scents that exude sophistication & confidence. Find out more 4/18 5:30 pm @Macys Herald Square. _E_
Great event last night @trumpwinery with @GovernorVA to support @TheVFoundation @UVA @VCU __HTTP__ _E_
A beautiful article by @IvankaTrump on my newly opened golf course in NYC Trump Links Ferry Point __HTTP__ _E_
I will be having lunch at the White House today with Republican Senators concerning healthcare. They MUST keep their promise to America! _E_
Adversity is a fact of life. Be bigger than the problems be ready to fight for your rights & all will be well – Trump Never Give Up _E_
If you have any doubt that @BarackObama must be defeated see @DineshDSouza's '2016: Obama's America.' Amazing film! _E_
.@MittRomney if Obama gets wise tonight just ask for his college records & transcripts he will quiet down quickly. _E_
While I have never met @nytdavidbrooks of the NY Times I consider him one of the dumbest of all pundits he has no sense of the real world! _E_
Our debt finances China's military. It's time to get tough – we hold all the cards. Let's Make America Great Again! __HTTP__ _E_
There's only only one person who has defunded Medicare. His name is @BarackObama. _E_
....People are angry. At some point the Justice Department and the FBI must do what is right and proper. The American public deserves it! _E_
Via @DailyCaller by @AlexPappas: "Donald Trump To Blast Obama Trade Pact In Radio Ads: 'A Bad Bad Deal'" __HTTP__ _E_
Without momentum there's a lack of energy that can lead the best of ideas to nowhere. Get your momentum going and keep it going. _E_
Coming up soon: The two hour premiere of The Apprentice. Next Thursday September 16th at 9 pm on NBC. __HTTP__ _E_
Who handed Iraq over to Iran yesterday? @BarackObama. We have gotten nothing from the Iraqis we should have them pay us back with oil. _E_
A @senatormcdaniel win is a victory for our country. Chris is a Constitutional Conservative who'll make a difference in Washington. _E_
President Obama you have a big job to do. Go to Baltimore and bring both sides together. With proper leadership it can be done! Do it. _E_
Today I announced an Air Traffic Control Initiative to take American air travel into the future finally!... __HTTP__ _E_
Be sure to read my column in @cnni "Europe is terrific place for investment" __HTTP__ _E_
...whether there are tapes or recordings of my conversations with James Comey but I did not make and do not have any such recordings. _E_
Even @PiersMorgan is impressed by @THEGaryBusey. #CelebApprentice _E_
Just received from @PeteRose_14. Thank you Pete! #VoteTrump on Tuesday Ohio! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_
A sad day for America with Snowden being granted asylum in Russia. Putin is laughing at Obama. _E_
RT @Jim_Jordan: President Trump did the right thing by withdrawing us from Paris treaty it would hurt American companies and American wor... _E_
All successful people are high energy people who are passionate about what they do. Find a passion that energizes you. Think Big _E_
Ted Cruz said on @oreillyfactor that illegals sent out of country by my administration would come right back as citizens. Another lie crazy! _E_
The onus of the Chicago teachers' strike falls squarely on the teachers & their union. Inexcusable to leave children without school. _E_
Get the big picture but be prepared for the picture to change. Be persistent and alert every single day. _E_
Great honor to be endorsed by popular & successful @gov_gilmore of VA. A state that I very much want to win THX Jim! __HTTP__ _E_
Leaving Superior Wisconsin now. Thank you! #Trump2016 #WIPrimary __HTTP__ __HTTP__ _E_
Kasich has helped decimate the coal and steel industries in Ohio. I will bring them back! #MakeAmericaGreatAgain _E_
Russia just said the unverified report paid for by political opponents is A COMPLETE AND TOTAL FABRICATION UTTER NONSENSE. Very unfair! _E_
Newsmax is a great news org and and its pres debate in IA on 12 27 will be fair balanced and informative. @ralphreed _E_
I'll be on @foxandfriends Monday at 7:30 AM. Tune in! _E_
"@Letterman to Donald Trump: 'Fire @geraldorivera'" __HTTP__ via @Mediaite by @TheMattWilstein _E_
A very good NBC/Wall Street Journal Poll was just released wherein I went up from last month and am in the lead. Nice! _E_
11AM #MakeAmericaGreatAgain __HTTP__ _E_
A true piece about the standing ovations I got yesterday __HTTP__ _E_
After 13 seasons @ApprenticeNBC easily beat Shark Tank in ratings last year better demos as well. _E_
Must watch for all Georgians @Perduesenate's new ad "Secure Our Border __HTTP__ Michelle Nunn supports amnesty & ObamaCare _E_
Today I spoke @LibertyU Convocation a great crowd... __HTTP__ _E_
I attended @Aerosmith concert last night in Newark NJ. Doesn't get any better than that. @IamStevenT was fantastic great energy! _E_
Dope Frank Bruni said I called many people including Karl Rove losers true! I never called my friend @HowardStern a loser he's a winner! _E_
It's Wednesday. How many times will A Rod sue the @Yankees today? A Rod has no one to blame but himself for his predicament. _E_
Watch @MissUSA Olivia Culpo crowned as @MissUniverse 2012 in the Trump #MissUniverse Pageant __HTTP__ _E_
I will be interviewed on @foxandfriends at 6:00 A.M. Enjoy! _E_
You have to set higher and higher goals. You have to want more or you will start slipping backwards fast. Think Big _E_
Congratulations to @FLGovScott on winning access to federal database __HTTP__ He is making FL a safe & legal election for 2012 _E_
.@lancearmstrong revise your decision to quit go back and fight. _E_
Entrepreneurs: Be ready for problems you'll have them every day. Keep open to new ideas that's where innovation begins. _E_
#DemDebate was really boring but had a lot of fun live tweeting and picked up by far the most followers. _E_
Secretary Kerry cannot get other nations to join us in fighting ISIS. They are afraid and he is a poor salesman who reps a pathetic leader! _E_
After Solyndra @BarackObama is stil intent on wasting our tax dollars on unproven technologies and risky companies. He must be accountable. _E_
Presidential Proclamation Honoring the Victims of the Tragedy in Parkland Florida: __HTTP__ __HTTP__ _E_
Frankly for a writer I don't think @DannyZuker's stuff is good. In fact it's terrible. _E_
Whether we like it or not oil is the axis on which the world's economies spin. It just is. When the price o... (cont) __HTTP__ _E_
The only problem I have with Mitch McConnell is that after hearing Repeal & Replace for 7 years he failed!That should NEVER have happened! _E_
The new reality China and Japan are warning us not to default __HTTP__ Reckless government spending has made us weak. _E_
Strong leader: @IsraeliPM Netanyahu explained at AIPAC the threat Israel faces from Iran's nuclear drive. He is (cont) __HTTP__ _E_
Via @Law360: "Trump's $200M Old Post Office Project Gets Early Approval" __HTTP__ _E_
.@alexsalmond @pressjournal RT @GailLorene Ask our Canadian neighbors who abhor the windfarms. And poor Scotland _E_
I'll be on The Late Show with David Letterman tonight be sure to tune in for a great show. 11:30 pm on CBS. _E_
Isn't it interesting that immediately after September 11th everybody was asking for and indeed demanding torture of any kind. No reports! _E_
RT @foxandfriends: .@jasoninthehouse: Comey went silent when I asked him about his memos which raised a lot of eyebrows. __HTTP__ _E_
Great night in Iowa special people. Thank you! _E_
My heart goes out to the people of Boston on this terrible day! _E_
Donald Trump Has Given Millions To Pro Romney SuperPACs and His Whole Family Is Cutting Checks to Mitt's Campaign __HTTP__ _E_
I will be interviewed on @seanhannity tonight at 10:00. Many things mostly bad to talk about! _E_
Ohio Gov.Kasich voted for NAFTA from which Ohio has never recovered. Now he wants TPP which will be even worse. Ohio steel and coal dying! _E_
.@VattenfallGroup wants out of their Aberdeen windfarm fiasco so badly but @AlexSalmond won't let them—he's (cont) __HTTP__ _E_
With all of its phony unnamed sources & highly slanted & even fraudulent reporting #Fake News is DISTORTING DEMOCRACY in our country! _E_
Congratulations to @PGA_JohnDaly on his big win yesterday. John is a great guy who never gave up and now a winner again! _E_
The era of division is coming to an end. We will create a new future of #AmericanUnity. First we need to... __HTTP__ _E_
Interesting that @Macys criticized me but just paid $650000 in fines for racial profiling. Are they racists? _E_
New York Times Apologizes to Donald TrumpA recent story in the New York Times incorrectly stated that Donald (cont) __HTTP__ _E_
In that @TimeWarner has @HBO with really dumb racist Bryant Gumbel(and I mean dumb) and no CBS (which fired Bryant) I am switching bldgs. _E_
Happy 102nd birthday to President Ronald Reagan. Every day that passes Reagan's presidency looks better and better. _E_
Why did @AGSchneiderman have to fill out 3 successive ballots on Election Day? And this is our A.G. _E_
#ThankYouTour2016 Tue: West Allis WI. Thur: Hershey PA. Fri: Orlando FL. Sat: Mobile AL. Tickets:... __HTTP__ _E_
Steps away from Waikiki's famous beaches @TrumpWaikiki is Hawaii's top destination w/our signature amenities __HTTP__ _E_
The Republican platform is most pro Israel of all time! _E_
My @FoxNews interview last night on @hannityshow discussing OWS and @BarackObama's incompetent leadership. __HTTP__ _E_
Arriving @TrumpScotland with @DonaldJTrumpJr & @EricTrump. Back to New York tonight. Video: __HTTP__ _E_
...massive regulation cuts 36 new legislative bills signed great new S.C.Justice and Infrastructure Healthcare and Tax Cuts in works! _E_
Have some fun with this __HTTP__ _E_
Check out my speech from last Friday __HTTP__ as well as my appearance this morning on @foxandfriends __HTTP__ _E_
MAKE AMERICA GREAT AGAIN!#INPrimary #VoteTrump __HTTP__ _E_
The success of Shark Tank over the years is a total joke compared to the success of The Apprentice one of the biggest hits in T.V. history. _E_
To the geniuses at 'Americans United for Change': the more you tax me the less people I employ. Get it? _E_
My friend @AriEmanuel of @IMG bought the Miss Universe pageants from me and they are on tonight on #Fox! Tune in! _E_
The past 4 years have seen the weakest multiyear recovery since WWII __HTTP__ Need to loosen regulations and lower taxes. _E_
Thank you Pittsburgh Pennsylvania! Will be back soon! #AmericaFirst __HTTP__ _E_
Iraq should be paying us while we fight ISIS. Give the money to the families of our brave soldiers. _E_
Everyone makes mistakes but it's what you do with them and what you learn from them that matters. Midas Touch _E_
Thank you South Dakota! #Trump2016 __HTTP__ __HTTP__ _E_
Entrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_
Woody Johnson's comments that he would rather have @MittRomney win the election than his @nyjets win games shows real patriotism. _E_
Afghani soldiers those on our side killed 7 Marines last month. __HTTP__ They don't want us what (cont) __HTTP__ _E_
My thoughts and prayers are with all of the victims involved in this mornings train collision in South Carolina. Thank you to our incredible First Responders for the work they've done! _E_
Too bad @morningmika did not allow her interview with @SpitzerForNYC to go on another few minutes...would have been interesting... _E_
Plans to build wind farm near Trump Turnberry in Scotland have been dropped. GREAT! @GolfDigest @GolfweekMag @GolfChannel @ESPNGolf _E_
Via @BreitbartNews: TRUMP TO REPUBLICANS: 'PLAY THE DEBT CEILING CARD' __HTTP__ by @joelpollak _E_
Not only does the media give a platform to hate groups but the media turns a blind eye to the gang violence on our streets! __HTTP__ _E_
Autism rates through the roof why doesn't the Obama administration do something about doctor inflicted autism. We lose nothing to try. _E_
.@AlexSalmond sought my support after he released terrorist Al Megrahi who blew up Pan Am #103 killing all aboard. I said "no way!" _E_
The Iranians are having 'difficulties' with their nuclear program __HTTP__ But no thanks to us! _E_
I am proud to announce our newest project Trump Tower Mumbai. Together with the Lodha Group it will be incredible! __HTTP__ _E_
Will be interviewed on Media Buzz with Howie Kurtz on Fox Sunday at 11:00 A.M. _E_
Way to go @serenawilliams you are a true champion. Proud of you! _E_
#MakeAmericaGreatAgain From my speech in South Carolina yesterday __HTTP__ _E_
Great boardroom! What did you think? #CelebApprentice _E_
Big win by @Yankees last night to take control of AL East. Jeter & company now control their destiny. _E_
The Republican House Freedom Caucus was able to snatch defeat from the jaws of victory. After so many bad years they were ready for a win! _E_
Karen Handel's opponent in #GA06 can't even vote in the district he wants to represent.... _E_
I'll bet Lance Armstrong wishes he didn't do the interview with Oprah he's saying to himself what was I thinking? _E_
.@NBC really happy with how well the #MissUniverse pageant went. _E_
The Fake Media is working overtime today! _E_
Not under my watch __HTTP__ _E_
There are great campaigns on @fundanything __HTTP__ Be sure to take a look and donate to one today. _E_
Heading to rally with Bobby now! See you soon! __HTTP__ _E_
RT @IvankaTrump: Thank you New Hampshire! __HTTP__ _E_
RT @ChuckGrassley: Jerusalem Embassy Act of '95 (Senate vote 93 5 & I voted for it) states embassy should be in Jerusalem by 5/31/99. For 1... _E_
Today is the day that ObamaCare website was supposed to be up and working. WRONG website is closed down a total disaster! 90 million doomed _E_
That's right we need a TRAVEL BAN for certain DANGEROUS countries not some politically correct term that won't help us protect our people! _E_
I absolutely support Kate's Law—in honor of the beautiful Kate Steinle who was gunned down in SF by an illegal immigrant. _E_
Crooked Hillary Clinton deleted 33000 e mails AFTER they were subpoenaed by the United States Congress. Guilty cannot run. Rigged system! _E_
At your request I will be doing live tweeting during tonight's @ApprenticeNBC. #CelebApprentice _E_
I just beat a lawyer from Yale and a lawyer from Harvard who teamed up against me in a major case worth millions ($). They were so dumb! _E_
" Pennies don't fall from heaven they have to be earned here on earth. – PM Margaret Thatcher (October 13 1925 – April 8 2013) _E_
Luther Strange has been shooting up in the Alabama polls since my endorsement. Finish the job vote today for Big Luther. _E_
Border Patrol Officer killed at Southern Border another badly hurt. We will seek out and bring to justice those responsible. We will and must build the Wall! _E_
"In every battle there comes a time when both sides consider themselves beaten... _E_
The rules DID CHANGE in Colorado shortly after I entered the race in June because the pols and their bosses knew I would win with the voters _E_
Sorry won't be doing Fox & Friends this morning will be in India on a couple of major business deals! _E_
Trump volunteers were out early today to offload cases of food and supplies for hard hit Rockaways residents #Sandy _E_
#FlashbackFriday At Military Academy second from left. __HTTP__ _E_
It's Tuesday. How many jobs has ObamaCare cost the economy today? _E_
Just saw the phony ad by Cruz totally false more dirty tricks. He got caught in so many lies is this man crazy? _E_
China's domestic economic and political problems prove how pathetic our leadership is in allowing China to rip us off __HTTP__ _E_
As we come together to celebrate the extraordinary contributions of African Americans to our nation our thoughts turn to the heroes of the civil rights movement whose courage and sacrifice have inspired us all. Proclamation: __HTTP__ __HTTP__ _E_
#JFKFiles __HTTP__ _E_
Thank you for your support! __HTTP__ _E_
Certain people are ruining their reputations tonight really sad! #Oscars _E_
Good news @AFPhq is going to fight back against Rove's attack on the Tea Party __HTTP__ Go get em! @marklevinshow _E_
Congratulations to my friend @seanhannity on @hannityshow 1000th show consecutively #1 in his time slot! Great going! _E_
Just spoke to President XI JINPING of China concerning the provocative actions of North Korea. Additional major sanctions will be imposed on North Korea today. This situation will be handled! _E_
The @EricTrumpFDN is doing amazing work helping the children... _E_
especially how to get people even with an unlimited budget out to vote in the vital swing states ( and more). They focused on wrong states _E_
From 1954 to 1960 there were 10 major hurricanes that hit the East Coast. _E_
This morning I will be going to the Commissioning Ceremony for the largest aircraft carrier in the world The Gerald R. Ford. Norfolk Va. _E_
We spend billions of dollars helping nations all over the World but with hurricane Sandy and Oklahoma tornado not one nation helped us! _E_
Thank you Washington! Together WE will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_
.@Cher attacked @MittRomney. She is an average talent who is out of touch with reality. Like @Rosie O'Donnell a total loser! _E_
What could be better than dinner with @MittRomney and me? __HTTP__ _E_
#IceBucketChallenge For those of you who wanted a picture here it is __HTTP__ _E_
.@VattenfallGroup has topped Carbon Data's rankings of the most carbon intensive companies in the EU's emissions trading scheme. _E_
Via @worldnetdaily by @MichaelCarl7: "Trump: Obama blew chance to free U.S. pastor" __HTTP__ _E_
Crooked Hillary Clinton wants to flood our country with Syrian immigrants that we know little or nothing about. The danger is massive. NO! _E_
I was saddened to see how bad the ratings were on the Emmys last night the worst ever. Smartest people of them all are the DEPLORABLES. _E_
... and pay per view records with "Battle of the Billionaires" in Detroit. It was a wild day! _E_
Being the best requires full time attention and application." – Midas Touch _E_
.@AJDelgado13 Thank you so much for the nice words and support really enjoy listening to your ideas and thoughts. _E_
"Deals are my art form. I like making deals preferably big deals." – The Art of The Deal _E_
The FAKE NEWS media (failing @nytimes @CNN @NBCNews and many more) is not my enemy it is the enemy of the American people. SICK! _E_
Mariano Rivera Yankee pitcher is the greatest ever. Get well fast. _E_
With magnificent views @TrumpChicago is the perfect venue to host impact events & business meetings __HTTP__ _E_
The Obama Administration has a very important duty to provide a budget and then negotiate! OUR COUNTRY is a laughingstock! _E_
Are you ready for the All Star @CelebApprentice? @TraceAdkins is back in the upcoming season...which is the best yet! _E_
Thank you! #Trump2016 #WIPrimary __HTTP__ _E_
I was speaking with Don Imus this morning.... __HTTP__ _E_
The Democrats have said some of the worst things about James Comey including the fact that he should be fired but now they play so sad! _E_
They call it climate change now because the words global warming didn't work anymore. Same people fighting hard to keep it all going! _E_
Amazingly with all of the money I have raised for the vets I have got nothing but bad publicity from the dishonest and disgusting media. _E_
Scary. Over 8332000 Americans left the work force during Obama's first term __HTTP__ How did Romney lose that election? _E_
.@ARealSuperMan #asktrump __HTTP__ _E_
We should stop talking stay out of Syria and other countries that hate us rebuild our own country and make it strong and great again USA! _E_
The Cruz campaign issued a dishonest and deceptive get out the vote ad calling voters in violation. They are now under investigation. Bad! _E_
"Trump to build second Scottish course" __HTTP__ via @UPI _E_
Staff Sgt. Salvatore A. Giunta received the Medal of Honor from Pres. Obama this month. It was a great honor to have him visit me today. _E_
Amazing story in @BreitbartNews about the sleazebag blogger Coppins who fabricated nonsense about me for irrelevant @BuzzFeed. CONGRATS! _E_
Merry Christmas to all. Have a great day and have a really amazing year. Together we will MAKE AMERICA GREAT AGAIN! It will be done! _E_
Sorry to hear of yesterday's passing of General Norman Schwarzkopf. He was a terrific general and leader we could use more like him. _E_
Really great numbers on jobs & the economy! Things are starting to kick in now and we have just begun! Don't like steel & aluminum dumping! _E_
Thank you New Jersey! #Trump2016 __HTTP__ __HTTP__ _E_
I am being proven right about massive vaccinations—the doctors lied. Save our children & their future. _E_
My shirts ties & cufflinks @Macys have never been better or more beautiful. Great holiday gifts great price. _E_
Colorado was amazing yesterday! So much support. Our tax trade and energy reforms will bring great jobs to Colorado and the whole country. _E_
The Fake Media (not Real Media) has gotten even worse since the election. Every story is badly slanted. We have to hold them to the truth! _E_
The deficits under @BarackObama are the highest in America's history. Why is he bankrupting our country? _E_
Rumor has it Pataki Kasich & Senator Lindsey Graham are dropping out of the race very soon. Hope it's not true they're so easy to beat! _E_
#Trump2016 #MakeAmericaGreatAgain #ECONOMY VIDEO: __HTTP__ __HTTP__ _E_
Lincoln never sounded like that! _E_
The Countryside Party just formed in Scotland to fight ugly wind turbines & @AlexSalmond. Congrats to Jim Crawford & Countryside Party. _E_
Trump National Hudson Valley's 7693 yd par 72 course features one of the country's great golf courses. __HTTP__ _E_
#MakeAmericaGreatAgain #NYPrimary __HTTP__ _E_
Via @MailOnline: "But did his hair survive? @MissUniverse & @MissUSA dump water over Donald Trump" __HTTP__ _E_
To all young (and old) entrepreneurs: Believe in yourself talk yourself up! Energize yourself and you'll energize others. _E_
Wow a really nice lead in New Hampshire an increase since my last poll! __HTTP__ _E_
Do not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_
Mike Bloomberg is doing a great job as Mayor of New York City. Ray Kelly is a great Police Commissioner. @MikeBloomberg _E_
Really bad shooting in Orlando. Police investigating possible terrorism. Many people dead and wounded. _E_
Re Negotiation: Patience is an enormous virtue & needs to be cultivated for successful negotiations on any level. _E_
...have it. Fake News said 17 intel agencies when actually 4 (had to apologize). Why did Obama do NOTHING when he had info before election? _E_
While @BarackObama watches China is trying to have the yuan overtake our dollar as the international (cont) __HTTP__ _E_
Have a great and peaceful Memorial Day but remember there are people out there who don't want us to have peace. WE MUST BE STRONG!!!! _E_
The press is so totally biased that we have no choice but to take our tough but fair and smart message directly to the people! _E_
The only reason Obama gave a speech last night was because it was on the schedule Putin is laughing and the reviews have been really bad! _E_
Brian Thanks dummy I picked up 70000 twitter followers yesterday alone. Cable News just passed you in the ratings. @NBCNightlyNews _E_
Pathetic excuse by London Mayor Sadiq Khan who had to think fast on his no reason to be alarmed statement. MSM is working hard to sell it! _E_
Attended last night's @Yankees game Derek Jeter is both a great player and a great guy. _E_
Roger Goodell of NFL just put out a statement trying to justify the total disrespect certain players show to our country.Tell them to stand! _E_
Entrepreneurs: Learn to trust yourself. Being an entrepreneur is not a group effort. _E_
THANK YOU Daytona Beach Florida!#MakeAmericaGreatAgain __HTTP__ _E_
Thank you Florida Ohio and Pennsylvania! #CrookedHillary is not qualified. #ImWithYou __HTTP__ _E_
"@WestJournalism Exclusive – We Asked Donald Trump What Jobs He Would Offer ISIS" __HTTP__ _E_
The brand new season of @CelebApprentice starts filming in less than 5 weeks. The 'All Star' cast will be announced very soon. _E_
Will be working all weekend in choosing the great men and women who will be helping to MAKE AMERICA GREAT AGAIN! _E_
If Democrats do not start opposing ObamaCare and fast Republicans will have a massive victory in 2014 far greater than any predictions! _E_
CRIPPLED AMERICA is the perfect gift for friends & family. Order signed copy & join me at 7:30pm live streaming! __HTTP__ _E_
Crazy Maureen Dowd the wacky columnist for the failing @nytimes pretends she knows me well wrong! _E_
I only wish my wonderful father Fred gave me $200 million to start my business like lightweight Rubio says. He didn't total fabrication! _E_
Bush is pretending that the Trump surge is great for him and the @nytimesworld is reporting Bush delight con job a Bush nightmare! _E_
The @BarackObama campaign keeps highlighting a web video of John McCain being nice & respectful. I'll bet John (cont) __HTTP__ _E_
I look forward to Tuesday night's presidential debate I wonder if Obama will use my name again. _E_
Bobby Jindal did not make the debate stage and therefore I have never met him.... _E_
JUST IN: A jury awarded a complete and total victory in buyer's remorse lawsuit against me in Ft. Lauderdale. _E_
Jay Carney won't answer reporters questions of Why Obama won't release his college transcripts Come on Jay! _E_
We should stay the hell out of Syria the rebels are just as bad as the current regime. WHAT WILL WE GET FOR OUR LIVES AND $ BILLIONS?ZERO _E_
.@lancearmstrong should immediately reconsider or his legacy is ruined. _E_
"I believe anybody who is not afraid to fail is a winner." @JoeTorre _E_
'Better Be Careful':Donald Trump Warns GOP On Immigration Creating '12 Million' New Dem Voters __HTTP__ via @Mediaite _E_
Thank you for the support South Carolina! #USSYorktown #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Spain's government is closing down wind turbines the maintenance is higher than the income. _E_
Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_
Check out OAN and compare to what you are watching now! _E_
Have a GREAT weekend everybody enjoy yourself but always keep your goals and aspirations in mind. Never lose sight of the victory ahead! _E_
The Democrats have zero intention of coming to any deal on the fiscal cliff. They will raise taxes and blame it on the Republicans. _E_
THANK YOU WISCONSIN! #VoteTrump next Tuesday April 5th! #WIPrimary __HTTP__ __HTTP__ _E_
Wow thank you Pensacola FL. See you Friday at 7pm join me! __HTTP__ __HTTP__ _E_
Celebrated doctor @BillCassidy will be a tremendous Senator. Louisiana – send a Conservative to the Senate vote for Bill this November! _E_
Corrupt @BarackObama's largest bundlers are fundraisers linked to the Obama Solyndra boondoggle __HTTP__ Chicago cronyism _E_
The Fake News Media has never been so wrong or so dirty. Purposely incorrect stories and phony sources to meet their agenda of hate. Sad! _E_
"Advertising is totally unnecessary. Unless you hope to make money." Jef I. Richards _E_
.@TrumpCollection continues to deliver the goods __HTTP__ _E_
RT @FLOTUS: The decorations are up! @WhiteHouse is ready to celebrate! Wishing you a Merry Christmas & joyous holiday season! __HTTP__ _E_
.@MikeAndMike I will be on the Mike & Mike Show at 7.05 a.m. (ESPN) 10 minutes. Will be fun great guys! Radio and T.V. _E_
What is wrong with the @GOP? Now they want to give all authority on the sequester cuts to Obama __HTTP__ Pathetic. _E_
My interview with Michael Patrick Shiels on WJIM in Lansing on behalf of @MittRomney __HTTP__ _E_
Please read __HTTP__ and watch a recent trip made to Trump Vineyard Estates by @EricTrump __HTTP__ _E_
years as a pol in Connecticut Blumenthal would talk of his great bravery and conquests in Vietnam except he was never there. When.... _E_
The reason I originally endorsed Luther Strange (and his numbers went up mightily) is that I said Roy Moore will not be able to win the General Election. I was right! Roy worked hard but the deck was stacked against him! _E_
So generous and pious! After spending millions of our tax dollars on his campaign through travel @BarackObama donated to himself. _E_
Remarks by President Trump on the Policy of the U.S.A. Towards Cuba Video: __HTTP__ __HTTP__ _E_
.@foxandfriends in 15 minutes! _E_
Via @Inc by @steelwire: "Donald Trump – To Micromanage or Not To Micromanage?" __HTTP__ _E_
This is a buyers' market. Buy now. You will thank me in 3 years. _E_
Congratulations to @foxandfriends on its unbelievable ratings hike. _E_
Lightweight A.G. Eric Schneiderman sued school with a 98% approval rating while billions in corruption goes unpunished. A total crook? _E_
Best book ever on dealmaking (or so they say) TRUMP: THE ART OF THE DEAL. Go get it and others Washington you really can do better! _E_
Via @scotsmandotcom: Trump joins with Chandler in bid to attract events __HTTP__ _E_
Thank you Georgia! See you soon!#Trump2016 __HTTP__ _E_
Sad.@BarackObama has already exempted major oil importers on Iranian sanctions and is negotiating a waiver with China. __HTTP__ _E_
A really bad night for President Obama. Now the Republicans have to get together and get the job done! _E_
Stocks rose yesterday during the first day of government shutdown. Markets like being left alone for a day. _E_
Which is worse and which is more dishonest the #Oscars or the Emmys? _E_
Today's @WSJ Editorial is WRONG again. I know that China is not in the new T.P.P. trade deal but would come in latter through a back door. _E_
John Kasich fell right into President Obama's trap on ObamaCare and the people of Ohio are suffering for it. Shame! _E_
#TrumpAdvice __HTTP__ _E_
Benghazi was a massive cover up. _E_
"Lifestyle unveils Trump Home brand in GCC" __HTTP__ via @TradeArabia _E_
In order to stop the Ebola outbreak in Africa perhaps the President should put all Africans on ObamaCare rather than sending the troops! _E_
Ron Estes is running TODAY for Congress in the Great State of Kansas. A wonderful guy I need his help on Healthcare & Tax Cuts (Reform). _E_
THANK YOU PORTLAND Maine!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
The United States troops which were sent to West Africa have only gotten 4 hours of Ebola training very unfair to them and their families! _E_
Wow—Golf Magazine just named Trump Scotland "best new course." __HTTP__ _E_
Congratulations to the @NYRangers on bringing the series home last night. _E_
Karl Rove's ads are the worst in political history! _E_
I am calling on Congress to TERMINATE the diversity visa lottery program that presents significant vulnerabilities to our national security. __HTTP__ _E_
Will be interviewed on @SquawkCNBC by @JoeSquawk coming up at 6:00amE from Davos Switzerland. Enjoy! #WEF18 __HTTP__ _E_
Thank you Georgia! #AmericaFirst#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
Turnberry in Scotland is a far superior golf course to Pinehurst and it isn't even close! Likewise the Blue Monster at Doral. _E_
Trump SoHo opens this Friday and it is fantastic! Check out the Trump Hotel Collection... __HTTP__ _E_
.@WestwoodLee Great going this weekend. You are a true champion! _E_
Trump Int'l Hotel & Tower Chicago has won many awards & accolades as has Sixteen its signature restaurant. __HTTP__ _E_
Is Anthony Weiner a jerk or what! _E_
In the general course of human nature a power over a man's subsistence amounts to a power over his will. Alexander Hamilton _E_
During my trip to Saudi Arabia I spoke to the leaders of more than 50 Arab & Muslim nations about the need to confront our shared enemies.. __HTTP__ _E_
An attack on our Embassy is an attack on our soil. We have been attacked by Libya. Go into Libya & take the (cont) __HTTP__ _E_
.@HillaryClinton loves to lie. America has had enough of the CLINTON'S! It is time to #DrainTheSwamp! Debates __HTTP__ _E_
"Learn to think continentally." Alexander Hamilton _E_
While in Charlotte this weekend will visit my Trump National Golf Club on Lake Norman—a magnificent place & doing really well! _E_
Just received a copy of @SarahPalinUSA new book a great read! Sarah is a terrific person. _E_
I spoke with other candidates to a Jewish group many friends in D.C. I said I'm a negotiator like you Got standing O rated best of day! _E_
The price of greatness is responsibility. Winston Churchill _E_
British Prime Minister May was very angry that the info the U.K. gave to U.S. about Manchester was leaked. Gave me full details! _E_
Great meeting with automobile industry leaders at the @WhiteHouse this morning. Together we will #MAGA! __HTTP__ _E_
I wonder what the late great Vince Lombardi would say about the Rutgers football player who says he is being bullied because coach yelled? _E_
A hurricane will be coming to Tampa. My @RNC convention surprise hits Monday night! _E_
Establishment flunky @KarlRove is going crazy with the just released CBS poll that has me way ahead. New Fox poll has me beating Hillary. _E_
There is nothing I would be happier to do than to donate the $5M to a charity of Obama's choice once he releases all of his records. _E_
So wonderful to be in Las Vegas yesterday and meet with people from police to doctors to the victims themselves who I will never forget! _E_
Current @NYMag really sad not only boring but highly inaccurate. Use better paper product looks like a death march (which it is!). _E_
House Republicans should be doing everything possible to defund ObamaCare. Instead Leadership is funding it __HTTP__ _E_
Greatly dishonest of @TedCruz to file a financial disclosure form & not list his lending banks then pretend he is going to clean up Wall St _E_
There's no bigger name in America than Donald Trump political or nonpolitical. Sarasota GOP Chair Joe Gruters _E_
No surprise welfare spending is up over 30% under Obama. __HTTP__ He is the food stamp & welfare king _E_
Obama now wants to deny due process to the police. He'll give all constitutional rights to the terrorists but not our cops. _E_
Last week was a first in #CelebApprentice when I fired 2 celebrities at once. Wish I could FIRE @RickSantorum! (cont) __HTTP__ _E_
Who else could take 16 vacations play over 100 rounds of golf and hold over 300 fundraisers while serving as (cont) __HTTP__ _E_
If Obama is concerned about the border he should stop vacationing. Gov't will save millions which it can use to stop illegal migration. _E_
Great now Supreme Court Justices are talking about a constitutional right to a cell phone __HTTP__ Obama just stop already. _E_
Take responsibility for yourself it's a very empowering attitude. _E_
Mitt Romney did great in the debate last night. _E_
Tucson killer Loughner should be given the death penalty not his plea bargained life in prison which will cost (cont) __HTTP__ _E_
Billions of dollars spent on Baltimore and it's still a total mess. Leadership is needed not dollars. Our whole country is going to hell! _E_
This is the simple fact about @HillaryClinton: she is a typical politician all talk no action. #Debates2016 _E_
Thank you! #MakeAmericaGreatAgain __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Again more dead people voted in the last election than enrolled in ObamaCare. Congratulations America! _E_
We're going to use American steel we're going to use American labor we are going to come first in all deals. ... __HTTP__ _E_
Obama's second term is going to very tough for the Republicans. The Republicans must pick their battles wisely and play smart. _E_
Letter to @Univision Re: @TrumpDoral __HTTP__ _E_
Will be on @CNBC at @7:22. Enjoy! _E_
Will be another Sean success! __HTTP__ _E_
Wow...NYT reports @celebrityapprentice was the number 1 show in branding on television for all of 2012. _E_
Thoughts and prayers are with everyone in West Virginia dealing with the devastating floods. #ImWithYou _E_
The $25 Billion settlement with the banks on mortgages will slow the housing market down even more and create higher user fees. Stupid! _E_
Job tip: If you were the employer what kind of person would you most desire as an employee? Be that person. _E_
I will be on @letterman tonight. Be sure to watch! Always a great time. @LateShow _E_
We are now at a time perhaps more than ever before when the World needs GREAT leadership! _E_
Entrepreneurs: See yourself as victorious and the best way to be victorious is to be passionate. Find something you love doing! _E_
The radar defense shipping and civil aviation problems will stop the ugly windfarm. #EOWDC _E_
The media coverage this morning of the very average Clinton speech and Convention is a joke. @CNN and the little watched @Morning_Joe = SAD! _E_
It is a shame that the biased media is able to so incorrectly define a word for the public when they know that the definition is wrong. Sad! _E_
A clip from @KatieShow where I take @katiecouric's audience on the Katie Coach __HTTP__ _E_
When the achiever achieves it's not a plateau it's a beginning. Donald J. Trump __HTTP__ _E_
Every business has surprises hidden dangers beneath the surface and little known opportunities that can lead to huge success. _E_
Fans like winners. They come to watch stars great exciting players who do great exciting things. #TheArtofTheDeal _E_
Imagine how much money the average American would save if we busted the OPEC cartel. (cont) __HTTP__ _E_
Here we go! I stated long ago that we should cancel all flights from West Africa. Now we have Ebola in U.S. AND IT WILL ONLY GET WORSE! _E_
Will be cutting ribbon at 10 A.M. with Mayor Bloomberg and Jack Nicklaus for the opening of TRUMP LINKS at FERRY POINT. _E_
"Spend your time enjoying your big dreams." Think Big _E_
Do you ever notice that lightweight @megynkelly constantly goes after me but when I hit back it is totally sexist. She is highly overrated! _E_
Wow new polls just came out from @CNN Great numbers especially after total media hit job. Leading Ohio 48 44. _E_
Mitt Romney matches sitting President in fundraising for April not an easy thing to do. Bad news for (cont) __HTTP__ _E_
Designed by @IvankaTrump @TrumpDoral's New Villa Deluxe Guestrooms include vintage artwork of golf legends __HTTP__ _E_
Tomorrow at 11AM #MakeAmericaGreatAgain __HTTP__ _E_
We need the Wall for the safety and security of our country. We need the Wall to help stop the massive inflow of drugs from Mexico now rated the number one most dangerous country in the world. If there is no Wall there is no Deal! _E_
Yesterday on the same day I had meetings with Russian Foreign Minister Sergei Lavrov and the FM of Ukraine Pavlo... __HTTP__ _E_
Turn to QVC now to watch Melania really good stuff! _E_
With eleven Republican candidates running in Georgia (on Tuesday) for Congress a runoff will be a win. Vote R for lower taxes & safety! _E_
The Failing New York Times foiled U.S. attempt to kill the single most wanted terrorist Al Baghdadi. Their sick agenda over National Security _E_
Will be doing Fox and Friends at 7 A.M. (in 20 minutes). _E_
What a shame that @msnbc's ratings have sunk even lower in 2013. Prime time down 50%. @TheRevAl's are (cont) __HTTP__ _E_
"Surround yourself with people who are smarter than you." @UncleRUSH _E_
The Democrats are most angry that so many Obama Democrats voted for me. With all of the jobs I am bringing back to our Nation that number.. _E_
Must read editorial today about lightweight New York State Attorney General Eric Schneiderman. Is he a crook? __HTTP__ _E_
Thank you West Virginia!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
The polling numbers show a close race. @MittRomney needs all of our support. _E_
.@BarbaraJWalters @theviewtv will apologize to me just like she did when I was right about @Rosie. Besides I get great ratings on The View. _E_
Great evening with President @EmmanuelMacron & Mrs. Macron. Went to Eiffel Tower for dinner. Relationship with France stronger than ever. __HTTP__ _E_
The Obama administration gives better medical care to Al Qaeda at Gitmo than to our vets. _E_
The great @MarianoRivera in my office with my son @EricTrump __HTTP__ _E_
Response to @LindseyGrahamSC: __HTTP__ _E_
Do you notice the silence lately on wind turbine monstrosities? The people of Scotland & many other countries are fighting back. _E_
"You owe it to yourself and to your community to make your property the best it can be." – Think Like a Billionaire _E_
The MOVEMENT in Lakeland Florida. Voter registration extended to 10/18. REGISTER ASAP @ __HTTP__ &... __HTTP__ _E_
This is not a media event or about Donald J. Trump this is about the United States of America. I will be (cont) __HTTP__ _E_
If @BarackObama wanted the Super Committee to succeed he would have lead. Instead he has been campaigning. Where is the leadership? _E_
It is important to think positively. Negative thinking will kill your focus and destroy any chance you have of being successful. _E_
Little @MacMiller you illegally used my name for your song "Donald Trump" which now has over 75 million hits. _E_
Now that Mitt is gone all we have to do is get Bush to drop out and Trump to run—and we will win! _E_
I will be in Washington D.C. on Wednesday 1 P.M. in front of the Capitol to protest the horrible and incompetent deal being made with Iran. _E_
DACA has been made increasingly difficult by the fact that Cryin' Chuck Schumer took such a beating over the shutdown that he is unable to act on immigration! _E_
I am happy to hear how badly the @nytimes is doing. It is a seriously failing paper with readership which is way down. Becoming irrelevant! _E_
Rapidly failing @VanityFair magazine hits me for my strong stance against Obama's brilliant 5 killers for 1 deserter trade. Amazing! _E_
The Iran deal is terrible. Why didn't we get the uranium stockpile it was sent to Russia. #SOTU _E_
Looking forward to the Florida rally tomorrow. Big crowd expected! _E_
Ratings starved @CNN and @CNNPolitics does not cover me accurately. Why can't they get it right it's really not that hard! _E_
W/ the ransom Obama paid for deserter Bergdhal getting Mexico to release USMC Sgt Andrew Tahmooressi is much harder. #BringBackOurMarine _E_
.@katyperry must have been drunk when she married Russell Brand @rustyrockets – but he did send me a really nice letter of apology! _E_
Just arrived at the studio the place is going wild! LIVE AT 8 P.M. #CELEBRITYAPPRENTICE _E_
Now China is threatening our allies who share defense pacts with us the latest is the Philippines __HTTP__ Very aggressive _E_
"If it doesn't sell it isn't creative." David Ogilvy _E_
My @FoxNews interview from yesterday with @TeamCavuto discussing the economy my trip to Australia @MittRomne... (cont) __HTTP__ _E_
Those five hotels include Trump International Hotel & Tower New York Trump Soho New York Trump International Hotel & Tower Chicago... _E_
More reports of voting machines switching Romney votes to Obama. Pay close attention to the machines don't let your vote be stolen _E_
Moving forward f/tonight's competitive primaries it is crucial that the Tea Party & @GOP remain united towards November. Take the Senate! _E_
#trumpvlog Why I cancelled the great debate ..... __HTTP__ _E_
"The problem is that no government can create real jobs. Only entrepreneurs can do that." – Midas Touch _E_
Speech on Veterans' Reform: __HTTP__ _E_
I am tired of @BarackObama talking about @MittRomney's father. Why don't we discuss Barack Obama Sr.! _E_
Thank you to @GolfweekMag for naming Trump International Golf Links Scotland #1 GB&I Best Modern Course A great honor! _E_
Pete Rose should now be allowed in The Baseball Hall of Fame. The all time hits leader has paid the price already! _E_
Looking forward to keynoting the @NCGOP #NCGOPcon dinner tomorrow night! @NCGOP is a top state party! _E_
The woman who is the Secret Service Director looks like she is way over her head.Why can't the president appoint the best and the brightest? _E_
The only way for Medicare and Social Security to remain solvent is if our economy is healthy. @BarackObama doesn't get it. _E_
Wow great Ohio poll. Shows me leading by 5 points beating K! _E_
I find hope in the darkest of days and focus in the brightest. Dalai Lama _E_
Sooner or later those who win are those who think they can. Paul Tournier _E_
My interview yesterday on the S&P downgrade with Wolf Blitzer on CNN __HTTP__ _E_
"Courageous people do not fear forgiving for the sake of peace." – Nelson Mandela _E_
On behalf of @FLOTUS Melania & myself THANK YOU for today's update & GREAT WORK! #SouthernBaptist @SendRelief @RedCross & @SalvationArmyUS __HTTP__ _E_
The Honolulu accommodations of @TrumpWaikiki are the perfect merger of beauty and function __HTTP__ _E_
Just as I predicted @Rosie would fail on The View __HTTP__ _E_
.@BillMaher's so called show on HBO must be the cheapest special produced in the history of television it sucks! _E_
via __HTTP__ Donald Trump announces launch of his first Indian project in Pune __HTTP__ _E_
It was great being on @MikeAndMike in the Morning (ESPN)—two great guys fantastic show! _E_
It will be interesting to see how Jenna Talackova does as Miss Universe Canada. We all wish her luck. _E_
Make sure to enjoy your time with your family during the holiday. It is a special time. Love and appreciate your family. _E_
Getting China to stop playing its currency charades can begin whenever we elect a president ready to take (cont) __HTTP__ _E_
Now @RonWyden is also "concerned" about ObamaCare along with @MaxBaucus __HTTP__ Program may fold through its own doing. _E_
AGAIN TO OUR VERY FOOLISH LEADER DO NOT ATTACK SYRIA IF YOU DO MANY VERY BAD THINGS WILL HAPPEN & FROM THAT FIGHT THE U.S. GETS NOTHING! _E_
Just arrived in Indianapolis Indiana to make an announcement on #TaxReform! Together we are going to MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Big WIN today for building the wall. It will secure the border & save lives. Now the full House & Senate must act! __HTTP__ __HTTP__ _E_
Other than a small group of people who have suffered massive and embarrassing losses the party is VERY united. Great love in the arena! _E_
As President of the United States of America I will ALWAYS put #AmericaFirst#UNGAFull remarks: __HTTP__ __HTTP__ _E_
I commend Roger Ailes for publicly supporting @FoxNews' employees against the Obama administration's intimidation of its reporters. _E_
Congratulations to John Rich and to Marlee Matlin for a terrific job throughout the season. You are both great! __HTTP__ _E_
Entrepreneurs: Cover your bases. Know everything you can about what you're doing. Then go with your gut. Your instincts r there for a reason _E_
Prayers go out to the victims of the terrible fire in New Jersey. Stay strong and remember it will soon get better. _E_
Problem with @GOP is not their message it's that they are incapable of controlling the message. _E_
Jimmy Fallon show will be great tonight I'm on! _E_
Fewer Americans are now insured through their employers due to higher premiums. Obamacare must be fully repealed. __HTTP__ _E_
The Washington Post calls out #CrookedHillary for what she REALLY is. A PATHOLOGICAL LIAR! Watch that nose grow! __HTTP__ _E_
I promise you that I'm much smarter than Jonathan Leibowitz I mean Jon Stewart @TheDailyShow. Who by the way is totally overrated. _E_
Once again @Cher tweets nonsense about @MittRomney. She needs to stop tweeting & start worrying about some of her many problems. _E_
Joint Press Conference with Prime Minister Saad Hariri of Lebanon beginning shortly. Join us live! __HTTP__ __HTTP__ _E_
Spent the full day at meetings and a major rally yesterday in South Carolina. Great people and spirit. Today will be more of the same. _E_
I don't know whether I will win or lose the @billmaher lawsuit but had an obligation to sue for charity. _E_
We grieve for the officers killed in Baton Rouge today. How many law enforcement and people have to... __HTTP__ _E_
.@nbc did a great job last night with the @GoldenGlobes! _E_
Via @nytpolitics by @AshleyRParker: "Strong Showings for Donald Trump in Iowa and New Hampshire Polls" __HTTP__ _E_
Windmills are destroying every country they touch and the energy is unreliable and terrible. __HTTP__ _E_
Watch my wife Melania Trump tonight on @QVC at 1 a.m. So proud of her! _E_
Via @bpolitics by @emtitus: "Defying Doubters Donald Trump Makes Presidential Bid Official" __HTTP__ _E_
Via @newhampshirecom:"Tickets on sale for Loeb School Event featuring Donald Trump" __HTTP__ _E_
I have accepted @billmaher's $5 million offer paid to me for charity (made on the @jayleno show). _E_
That was a great football game. _E_
Are you a Democrat running in a race you should lose? Get @KarlRove to run an ad against you and you will win. _E_
Looking forward to speaking at @ralphreed's @FaithandFreedom Gala Dinner on Friday in D.C. His staff has been great! _E_
Melania and I saw American Idiot on Broadway last night and it was great. An amazing theatrical experience! _E_
Congrats to @BarbaraJWalters on winning the @MadeinNY Mayor's Award for Lifetime Achievement! I love Barbara! _E_
Happy 5th Anniversary to @TrumpWaikiki __HTTP__ ! Can't believe it's already been 5 years.. _E_
On this Memorial Day holiday we honor our fallen soldiers who have made the greatest sacrifice for freedom. They are our country's finest. _E_
#trumpvlog Windfarms in today's video blog... __HTTP__ _E_
Wow some new and even greater polls thank you! _E_
The U.S. cannot allow EBOLA infected people back. People that go to far away places to help out are great but must suffer the consequences! _E_
The Council was shocked by the exuberance of the demonstration in Blackdog. @AlexSalmond @pressjournal _E_
Via @Newsmax_Media: "Trump: @KarlRove 'The Most Over rated Man in Politics'" __HTTP__ _E_
.@VattenfallGroup had no answers at demonstration last night. It's a failing company. Aberdeen windmills will destroy it. _E_
Hillary Clinton has been involved in corruption for most of her professional life! _E_
Entrepreneurs: Get and keep your momentum going. Without momentum a lot of great ideas go nowhere. _E_
Great article in the @guardian Donald Trump opens £100m golf course __HTTP__ _E_
In one of the biggest stories in a long time the FBI says it is now missing five months worth of lovers Strzok Page texts perhaps 50000 all in prime time. Wow! _E_
Join me in Henderson Nevada on Wednesday at 11:30am! #MAGA Tickets: __HTTP__ _E_
Wind Farms are not only disgusting to look at but also cause tremendous damage to their local ecosystems. __HTTP__ _E_
Via @Slate: Who won the #GOPDebate? __HTTP__ _E_
Fracking will lead to American energy independence. With price of natural gas continuing to drop we can be at a tremendous advantage. _E_
The Voter Violation certificate gave poor marks to the unsuspecting voter(grade of F) and told them to clear it up by voting for Cruz. Fraud _E_
The Saudis are taking credit for a meager 2% drop on crude __HTTP__ They always play this game (cont) __HTTP__ _E_
I enjoyed meeting with @MattBlunt @TrumpTowerNY to discuss why our government must address currency manipulation. Many US jobs are at stake. _E_
The signature restaurant of @TrumpNewYork @jeangeorges is both Forbes Five Star & AAA Five Diamond restaurant __HTTP__ _E_
Via @BreitbartNews by @mboyle1: Donald Trump Slams Liberals In 'Dishonest Press': 'I'm Going To Start Naming Names' __HTTP__ _E_
Thank you for joining us at the Lincoln Memorial tonight a very special evening! Together we are going to MAKE AM... __HTTP__ _E_
A signed copy of CRIPPLED AMERICA makes a great gift. Order & join my live streaming book signing event on 12/3 __HTTP__ _E_
Congratulations to @BarackObama he is the first POTUS to run trillion dollar deficits in all four years of his term! _E_
A mediocre person tells. A good person explains. A superior person demonstrates.... _E_
The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_
Thank you to my great supporters in Wisconsin. I heard that the crowd and enthusiasm was unreal! _E_
Dummy I'm asking a question look at the question mark at the end of the sentence! Use your head. _E_
John Heilemann the lightweight reporter begging to be on@morning joe looks like a timebomb waiting to explode he's a nervous and sad mess! _E_
Gary as the Cat in the Hat? He can work it out. _E_
How much is New York State spending on that obnoxious T.V. commercial that is being played endlessly for a tax incentive that doesn't work? _E_
I have directed that U.S. Cyber Command be elevated to the status of a Unified Combatant Command focused on....cont: __HTTP__ _E_
USSS did an excellent job stopping the maniac running to the stage. He has ties to ISIS. Should be in jail! __HTTP__ _E_
Via The Brody File: The Lesson Evangelicals Can Learn From Donald Trump Thank you David & CBN News so nice. __HTTP__ _E_
If the Wall Street protesters are upset about the economy then they should really be protesting @BarackObama at the White House. _E_
New CNN/WMUR New Hampshire poll just released. Thank you! #FITN #Trump2016 __HTTP__ _E_
to the U.S. but had nothing to do with TRUMP is more FAKE NEWS. Ask top CEO's of those companies for real facts. Came back because of me! _E_
Virtually no one has spent more money in helping the American people with disabilities than me. Will discuss today at my speech in Sarasota _E_
Great to see that Dr. Kelli Ward is running against Flake Jeff Flake who is WEAK on borders crime and a non factor in Senate. He's toxic! _E_
The failing @nytimes should focus on fair and balanced reporting rather than constant hit jobs on me. Yesterday 3 boring articles today2! _E_
The Chinese are illegally dumping bird killing wind turbines on our shores. Only one of many grievances we should act. _E_
Happy Thanksgiving to everyone. We will together MAKE AMERICA GREAT AGAIN! _E_
Photo from a recent episode of @ApprenticeNBC saying those two famous words! __HTTP__ _E_
...Spread shots out over long period and watch positive result. _E_
Featuring @BLTPrime & Palm Grill @TrumpDoral offers a wide array of acclaimed top dining options __HTTP__ _E_
This morning @nbc @todayshow played some of the @RNC video I filmed for the Tampa Convention __HTTP__ _E_
Raised a lot of money for the Republican Party. There will be a big gasp when the figures are announced in the morning. Lots of support! Win _E_
See you tonight in North Carolina. Making keynote for the Republican party will be fun. _E_
ObamaCare will increase individual market premiums by 99% for men and 62% for women __HTTP__ DEFUND!! #MakeDCListen _E_
#AmericaFirst #ImWithYou __HTTP__ _E_
RT @Carl_C_Icahn: 1/2 Believe Trump gave a great speech. _E_
Russia and the world has already started to respect us again! __HTTP__ _E_
Big win for Republicans as Democrats cave on Shutdown. Now I want a big win for everyone including Republicans Democrats and DACA but especially for our Great Military and Border Security. Should be able to get there. See you at the negotiating table! _E_
Will be spending the day campaigning in Connecticut another state where jobs are being stolen by other countries. I will stop this fast! _E_
Regardless of the USC's ruling ObamaCare can only be defeated politically. It must be legislatively repealed or America will go bankrupt. _E_
Via @HotelierME: Olympic golf course designer named by Trump Damac __HTTP__ _E_
...al Megrahi was the man who blew up Pan Am Flight 103 over Lockerbie Scotland. _E_
.@oreillyfactor please explain to the very dumb and failing @glennbeck that I supported John McCain big league in 2008 not Obama! _E_
As I predicted long ago the war in Iraq was a disaster for the U.S. Heading for civil war there are bombings all over the place.Iran happy _E_
Without passion you don't have energy without energy you have nothing. Find work that you love and the energy will be there. _E_
Entrepreneurs: Never give up. Be tough. Apply your skills and talent but above all be tenacious. _E_
Great to be back in Arizona!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Hillary Clinton wants to create the most liberal Supreme Court in history #debate #DrainTheSwamp __HTTP__ _E_
I try to learn from the past but I plan for the future by focusing exclusively on the present. That's where the fun is! _E_
The event with me and @V4SA in L.A on 9/15 is turning out to be huge. Get your tickets before they're gone __HTTP__ _E_
While @JoeBIden is a gaffe machine yesterday's comments that @MittRomney will put y'all back in chains was not at all proper. _E_
The people of Cuba have struggled too long. Will reverse Obama's Executive Orders and concessions towards Cuba until freedoms are restored. _E_
All these polls released by news outlets are oversampling Democrats. They want to influence public perception of the race. _E_
Great event in Columbus taking off for Cincinnati now. Great new Ohio poll out thank you!OHIO NBC/WSJ/MARIST POLLTrump 42% Clinton 41% _E_
Bad move @BarackObama released $147M in aid to the Palestinians __HTTP__ That money is going to Hamas. _E_
Wow with all this talk @MissUniverse is going to Russia on November 9th __HTTP__ _E_
Soaring 92 stories @TrumpChicago boasts a @ForbesInspector 5 Star rating for both its hotel & restaurant __HTTP__ _E_
I always said Obama is lucky for himself but unlucky for the country. The storm could be very good for him as he (cont) __HTTP__ _E_
Somebody please inform Jay Z that because of my policies Black Unemployment has just been reported to be at the LOWEST RATE EVER RECORDED! _E_
The decision on Sergeant Bergdahl is a complete and total disgrace to our Country and to our Military. _E_
President Obama has absolutely no control (or respect) over the African American community they have fared so poorly under his presidency. _E_
I totally respect that Angelina Jolie has shown such great bravery in the face of danger she has really come a long and positive way! _E_
Good news @MittRomney is now leading in North Carolina according to @ppppolls. The NC GOP is united after their (cont) __HTTP__ _E_
There is incredible progress on the site of Trump Tower Punta del Este Uruguay situated on the sands of Playa Brava __HTTP__ _E_
Interesting that certain Middle Eastern countries agree with the ban. They know if certain people are allowed in it's death & destruction! _E_
"Keep your focus global and you may very well find yourself ahead of the game." – Trump Never Give Up _E_
Join me in Bedford New Hampshire tomorrow at 3:00pm. Can't wait to see everyone! #AmericaFirst #MAGA... __HTTP__ _E_
Another example of the destruction caused by wind turbines. Unnecessary waste horrible! __HTTP__ _E_
....because he doesn't live there! He wants to raise taxes & kill healthcare. On Tuesday #VoteKarenHandel. _E_
Will be participating in a town hall event hosted by @SeanHannity tonight at 10pmE on @FoxNews. Enjoy! __HTTP__ _E_
Instead of driving jobs and wealth away AMERICA will become the world's great magnet for INNOVATION & JOB CREATION. __HTTP__ _E_
I can't believe that @CNN would allow the very nice Jeffrey Lord to be savaged by a panel of seven Trump haters. 7 to 1 Don't watch CNN! _E_
The election is still close but trending toward @MittRomney. He leads all national polls and Obama's likeability is imploding. VOTE! _E_
Newly released documents show Geithner to be laughing as the financial crisis loomed __HTTP__ _E_
If once you forfeit the confidence of your fellow citizens you can never regain their respect and their esteem. Abraham Lincoln _E_
To all haters and losers: I am NOT anti vaccine but I am against shooting massive doses into tiny children. Spread shots out over time. _E_
In Nashville Tennessee! Lets MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Has Charles @krauthammer ever apologized for being so totally wrong on Iraq? I called it right in every way—Make America Great Again! _E_
A great morning with everyone @LibertyU! Thank you! Off to New Hampshire now. #Trump2016 __HTTP__ __HTTP__ _E_
Will be interviewed on @ThisWeekABC this morning. Enjoy! _E_
'The goal is to be the winner': Donald Trump's campaign is for real. Via The Guardian __HTTP__ _E_
.@AGSchneiderman must take a drug test immediately—make results public. NY Attorney General cannot be a cokehead. _E_
Rubio was very disloyal to Bush his mentor when he decided to run against him. Both said they love each other.They don't word is hate! _E_
Toyota Motor said will build a new plant in Baja Mexico to build Corolla cars for U.S. NO WAY! Build plant in U.S. or pay big border tax. _E_
Once again @RickSantorum proves he can't run a professional campaign. He is ineligible in large section of (cont) __HTTP__ _E_
Many people are saying that the Iranians killed the scientist who helped the U.S. because of Hillary Clinton's hacked emails. _E_
RT @Scavino45: Florida Governor Rick @FLGovScott. #HurricaineIrma __HTTP__ _E_
ObamaCare enrollment lie: Obama counts an enrollee as a web user putting a plan in "their online shopping carts" __HTTP__ _E_
"There are 2 things I've found I'm very good at: overcoming obstacles and motivating good people to do their best work."–The Art of The Deal _E_
Iran admits to aiding the Libyan Rebels and Ahmadinejad received a letter of thanks when will Washington learn? __HTTP__ _E_
The Great State of Michigan was just certified as a Trump WIN giving all of our MAKE AMERICA GREAT AGAIN supporters another victory 306! _E_
David Wright of the NY Mets should have been on the 1st Team All Stars. He's having a great year. _E_
I always enjoy speaking to young aspiring entrepreneurs. They are hungry motivated and eager to learn. Proves America can still be great. _E_
I had a great time in Texas yesterday. A tremendous crowd of wonderful and enthusiastic people. Will be back soon! _E_
Together we will Make America Great Again!#AmericaFirst __HTTP__ _E_
Weakness of attitude becomes weakness of character. Albert Einstein _E_
After four years of getting the run around America needs a turnaround and the man for the job is Governor Mitt Romney. @PaulRyanVP _E_
....This is real collusion and dishonesty. Major violation of Campaign Finance Laws and Money Laundering where is our Justice Department? _E_
Check out Donald Trump's new iGoogle Showcase page: __HTTP__ _E_
HAPPY THANKSGIVING EVERYONE ENJOY YOUR DAY! _E_
Another health insurer is pulling back due to 'persistent financial losses on #Obamacare plans.' Only the beginning! __HTTP__ _E_
I like thinking big. I always have. To me it's very simple: if you're going to be thinking anyway you might as (cont) __HTTP__ _E_
Great visit to Detroit church fantastic reception and all @CNN talks about is a small protest outside. Inside a large and wonderful crowd! _E_
.@AlexSalmond's insane release of the terrorist—for humanitarian reasons will go down as a better decision.. _E_
Tonight despite everything put A Rod in the lineup. _E_
Entrepreneurship is engine of American success. I bring it to crowdfunding w/ @fundanything's $1M RECORD reward __HTTP__ _E_
Just read the nice remarks by President Jimmy Carter about me and how badly I am treated by the press (Fake News). Thank you Mr. President! _E_
Via @THESHARKTANK1: Donald Trump's Controversial Mexican Comments Are Accurate __HTTP__ _E_
No person who is enthusiastic about his work has anything to fear from life. Samuel Goldwyn _E_
Word is that despite a record amount spent on negative and phony ads I had a massive victory in Florida. Numbers out soon! _E_
Isn't it amazing that the U.S. and NSA can listen to the highly protected phone conversations of world leaders but can't get O's records! _E_
#MakeAmericaGreatAgain #GOPdebate __HTTP__ _E_
Senator Bob Corker begged me to endorse him for re election in Tennessee. I said NO and he dropped out (said he could not win without... _E_
Great article by @AmSpec's Jeffrey Lord: "The Reagan Revolution. And now... the @realDonaldTrump Revolution?" __HTTP__ _E_
Nice interview in the @The Atlantic of Sarasota GOP Chair Joe Gruters on my 2012 'Statesman of the Year' award __HTTP__ _E_
The Arab League stated that it wants nothing to do with an attack on Syria but they want us to attack.Are our leaders insane or just stupid _E_
"You have to keep going and moving forward no matter what is happening around you or to you." – Think Like a Champion _E_
RT @DailyCaller: Guam Governor To Trump: I've Never Felt Safer Than 'With You At The Helm' __HTTP__ __HTTP__ _E_
Along with two championship courses on the Potomac River @TrumpGolfDC's also offers limitless social events __HTTP__ _E_
I will be having a general news conference on JANUARY ELEVENTH in N.Y.C. Thank you. _E_
RT @FoxNews: .@EricTrump: My father was elected for one reason and that's because he actually believes in putting America first which is... _E_
Honored to sign S.442 today. With this legislation we support @NASA's scientists engineers and astronauts in the... __HTTP__ _E_
Via @gatewaypundit: "Please Pray for Me... I Am Losing My Insurance" __HTTP__ Just one of the millions of cases like this... _E_
More Bush cronyism – "Jeb Bush and the Common Core Money Trail" __HTTP__ It's the Bush way! _E_
That would mean that Eliot Spitzer has failed at everything he's done politics TV & even real (cont) __HTTP__ _E_
Good timing: @TraceAdkins won big for American Red Cross last night on @ApprenticeNBC. Now the Red Cross is in Oklahoma doing a great job. _E_
Remember the old saying The more you learn the more you realize you don't know it's true. Learning is a daily challenge. _E_
My heartfelt thoughts and prayers are with the 7 @USNavy sailors of the #USSFitzgerald and their families. ... __HTTP__ _E_
A suicide bomber has just killed U.S. troops in Afghanistan. When will our leaders get tough and smart. We are being led to slaughter! _E_
Great job to Missy Franklin. She's got a smile that can take over the world. She's also a major talent. Great going Missy! _E_
America will have record growth and prosperity during his adminstration: @MittRomney's success in the private sector is a tremendous asset. _E_
Just spoke to President Macri of Argentina about the five proud and wonderful men killed in the West Side terror attack. God be with them! _E_
Today I announced another historic breakthrough for the VA. We are working tirelessly to keep our promises to our GREAT VETERANS! #USA __HTTP__ _E_
RT @EricTrump: So proud to be out on the campaign trail with @realDonaldTrump thanks for an amazing night #Biloxi #Trump2016 __HTTP__ _E_
I am watching the NFL DRAFT will be interesting! A lot of talent but only a few will become STARS. _E_
"What America Needs: The Case for Trump" Great new book by the esteemed Jeffrey Lord @JeffJlpa1 Available now. __HTTP__ _E_
Entrepreneurs: Remember to think big by expanding your horizons at the same time you're expanding your net worth. _E_
The ultimate Golf experience @TrumpTurnberry is a unique destination located on the beautiful Ayrshire coastline __HTTP__ _E_
'President Elect Donald J. Trump Intends to Nominate Congressman Tom Price and Seema Verma.' __HTTP__ __HTTP__ _E_
Book on Bin Laden is a terrible violation of code makes @BarackObama's story a big lie. _E_
Lots of comments—Do you really believe these two brothers operated alone without influence of others? _E_
Congratulations to @Likud_Party MK @dannydanon on being offered Deputy Defense Minister of IDF by @IsraeliPM @netanyahu. _E_
THANK YOU AMERICA! #Trump2016#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
Heading for Ohio really big crowd of amazing people! Much to talk about! _E_
Rupert Murdoch is a great guy who likes me much better as a very successful candidate than he ever did as a very successful developer! _E_
Heading to South Carolina now meeting with fantastic people! _E_
Obama says a WALL at our southern border won't enhance our security (wrong) and yet he now wants to build a much bigger wall (fence) at W.H. _E_
Rick Perry is a good guy who had a really tough evening. @RickPerry _E_
Between Iraq war monger @krauthammer dummy @KarlRove deadpan @GeorgezWill highly overrated @megynkelly among others @FoxNews not fair! _E_
Peter Navarro: 'Trump the Bull vs. Clinton the Bear' #DrainTheSwamp __HTTP__ _E_
The man made climate change that our great president should be focused on is of the NUCLEAR variety brought upon us because of weakness! _E_
I love watching dummy @mcuban promote on ok show named Shark Tank—but he is just a small part of that show. _E_
The Architect @KarlRove is directly responsible for losing both houses & @BarackObama becoming President. Ignore him. _E_
RT @IvankaTrump: Thank you for the warm welcome. I'm excited to be in Hyderabad India for #GES2017. __HTTP__ _E_
I will be on State of the Union @CNN with @jaketapper at 9am. Enjoy! _E_
Scary & Unsustainable: On Monday the US added more debt than from 1776 through Pearl Harbor __HTTP__ _E_
Trump International Hotel & Tower Vancouver will include Vancouver's first pool bar nightclub & Trump Spa __HTTP__ _E_
Joe Biden called America the Problem vis a vis Iran __HTTP__ He never wastes an opportunity to say something stupid.@JoeBiden _E_
Also great comeback by the New York Jets. That game was over until a really dumb defensive play by Tampa. Amazing. _E_
Brainpower is the ultimate leverage. Keep your focus intact! _E_
To vote for me and CENTURY 21 for the best #Superbowl commercial click the following link and "Like" the page. __HTTP__ _E_
Our thoughts are with the forces fighting ISIS in Iraq. We must never back down against this extreme radical Islami... __HTTP__ _E_
Tried watching low rated @Morning_Joe this morning unwatchable! @morningmika is off the wall a neurotic and not very bright mess! _E_
Scenes from last night's episode of @OCChoppers where @DonaldJTrumpJr and I visit the OCC HQ __HTTP__ _E_
Rosechem1 One of the reasons that I like you is because I feel that old American greatness in your mentality. It makes me feel hope! Thx. _E_
Important editorial by John Faso in @nydailynews: "Spitzer's reckless leadership" __HTTP__ _E_
Dopey @Lord_Sugar I'm worth $8 billion and you're worth peanuts...without my show nobody would even know who you are. _E_
.@MissUSA Olivia Culpo has been a star a young Audrey Hepburn. _E_
A true honor. @PressSec considers asking for @BarackObama's college transcripts a Donald Trump question. __HTTP__ Release it! _E_
Why do we always try to destroy our true champions and winners in this country while at the same time leaving the losers alone? STUPID! _E_
The damage that Democrats weak Repubicans and this disaster of a president have inflicted on America has put (cont) __HTTP__ _E_
I will once again write a $1 MILLION check to our campaign if we hit our million dollar end of month goal! __HTTP__ _E_
The #IranDeal is a catastrophe that must be stopped. Will lead to at least partial world destruction & make Iran a force like never before. _E_
We need to bring manufacturing jobs back home where they belong. #TimeToGetTough __HTTP__ _E_
THE ROLLOUT OF OBAMACARE IS A TOTAL DISASTER AND AN EMBARRASSMENT TO OUR COUNTRY. THE WORLD IS WATCHING AND LAUGHING.$635000000 WEBSITE! _E_
Concentration is a fine antidote to anxiety. Jack Nicklaus _E_
I will end common core. It's a disaster. __HTTP__ #Trump2016 __HTTP__ _E_
Thank you Pastor Robert Jeffress! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
It is amazing that after lambasting Donald Sterling on @foxandfriends some DISHONEST press only reported my GIRLFRIEND FROM HELL statement! _E_
My @SquawkCNBC interview discussing the close election the enthusiasm gap between Mitt & Obama & the fiscal cliff __HTTP__ _E_
Why won't @BarackObama repeal the Defense of Marriage Act if he supports gay marriage? __HTTP__ He is gaming the issue. _E_
I never heard of @DannyZucker until his very dumb and endless tweets started pouring out of insecure mind but I have a great deal for him! _E_
Likewise the primary victims of violent crimes are in the African American and Hispanic communities. These people want LAW AND ORDER now! _E_
Today's job report is dismal. Now a record 88921000 Americans are no longer in the work force. _E_
Again immigration reform is fine—but don't rush to give away our country. That's what's happening! _E_
Just finished reading a poorly written & very boring book on the General Motors Building by Vicky Ward. Waste of time! @WileyBiz _E_
CEO's most optimistic since 2009. It will only get better as we continue to slash unnecessary regulations and when we begin our big tax cut! _E_
.@BarackObama's dismal job record is reason alone that he must be defeated this November. _E_
On Anthony Wiener I TOLD YOU SO! _E_
I am getting great credit for my press conference today. Crooked Hillary should be admonished for not having a press conference in 179 days. _E_
I am always on the front page of the failing @nytimes but when I won the GOP nomination I'm in the back of the paper. Very dishonest! _E_
Via @nypost's @PageSix: "Trump researching 2016 run" __HTTP__ _E_
Is Chris Jackson as dumb as I hear but I still like that he follows me like a good little soldier! _E_
Here we go! _E_
Losers and haters even you as low and dumb as you are can learn from watching Apprentice and checking out my tweets you can still succeed! _E_
The Christmas Story begins 2000 years ago with a mother a father their baby son and the most extraordinary gift of all—the gift of God's love for all of humanity.Whatever our beliefs we know that the birth of Jesus Christ and the story of his life... __HTTP__ _E_
If Scotland doesn't stop insane policy of obsolete bird killing wind turbines country will be destroyed. @AlexSalmond @AberdeenCC _E_
Look what the President of NBC sent me recently about his stay in my Las Vegas hotel. Very loyal guy. __HTTP__ _E_
I've just started blocking out some of the repetitive and boring (& dumb) haters and losers. They are a waste of time and energy! _E_
ObamaCare Horror Story: "Navigators Tell Applicants To Lie Like Administration" __HTTP__ @JamesOKeefeIII strikes again! _E_
#VoteTrumpMI! #Trump2016 __HTTP__ _E_
Tom Brady has done a great job tonight amazing New England comeback. Good game not over yet! _E_
After foolishly spending two trillion $'s and losing so many great young people the U.S. will be the only one who won't get the oil in Iraq _E_
Here's a sneak peek at the @DNC convention theme: It's not our fault. Blame Bush. Oh and government built it. _E_
At least 24 players kneeling this weekend at NFL stadiums that are now having a very hard time filling up. The American public is fed up with the disrespect the NFL is paying to our Country our Flag and our National Anthem. Weak and out of control! _E_
To entrepreneurs: Watching you could be the motivation for your employees so make it an example that will best serve your success. _E_
The lightweight @JonHuntsman used my name in a debate for gravitas it didn't work. Sad! _E_
Economists on the TAX CUTS and JOBS ACT:"The enactment of a comprehensive overhaul complete with a lower corporate tax rate will IGNITE our ECONOMY with levels of GROWTH not SEEN IN GENERATIONS..." __HTTP__ _E_
Congratulations to Paul Ryan Kevin McCarthy Kevin Brady Steve Scalise Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes! _E_
The Democrats lead by head clown Chuck Schumer know how bad ObamaCare is and what a mess they are in. Instead of working to fix it they.. _E_
ICYMI: PENCE: I RAN A STATE THAT WORKED KAINE RAN A STATE THAT FAILED. __HTTP__ _E_
I wonder if President Obama would have attended the funeral of Justice Scalia if it were held in a Mosque? Very sad that he did not go! _E_
Check out a list of Donald Trump's books for summer reading at the Trump University Blog: __HTTP__ _E_
Congratulations to @BillCassidy on a decisive win this past Saturday. Bill will be a pro growth & pro energy Senator. _E_
Melania will be on QVC tomorrow night at 9 p.m. ET to introduce her beautiful and inspiring Melania Timepieces & Fashion Jewelry collection. _E_
Twisted Sister frontman @deesnider shines in the record 13th season of 'All Star' @CelebApprentice. The Iron Man of Rock and Roll is great! _E_
Backstage with @jimmyfallon before opening skit great fun! @fallontonight __HTTP__ _E_
My interview with WMUR's @JoshMcElveen at #NHFreedomSummit __HTTP__ _E_
Honored to meet this years @SenateYouth delegates w/ @VP Pence in the East Room of the @WhiteHouse. Congratulations... __HTTP__ _E_
Be tenacious. Being tenacious means you're tough and patient at once so it's a formidable combination. _E_
Ted Cruz should be disqualified from his fraudulent win in Iowa. Weak RNC and Republican leadership probably won't let this happen! Sad. _E_
FLASHBACK October 9 2012: "Donald Trump: Jobs Numbers Are 'A Lot Of Monkey Business'" __HTTP__ Proven right again! _E_
Thank you to FEMA our great Military & all First Responders who are working so hard against terrible odds in Puerto Rico. See you Tuesday! _E_
Join me LIVE with @VP @SecretaryPerry @SecretaryZinke and @EPAScottPruitt. #UnleashingAmericanEnergy __HTTP__ _E_
The Iran nuclear deal is a terrible one for the United States and the world. It does nothing but make Iran rich and will lead to catastrophe _E_
Big day tomorrow in Georgia and South Carolina. ObamaCare is dead. Dems want to raise taxes big! They can only obstruct no ideas. Vote R _E_
The terrorists in Syria are calling themselves REBELS and getting away with it because our leaders are so completely stupid! _E_
Every poll done on debate last night from Drudge to Newsmax to Time Magazine had me winning in a landslide. #MakeAmericaGreatAgain! _E_
For all of those (DACA) that are concerned about your status during the 6 month period you have nothing to worry about No action! _E_
I'm on @foxandfriends every Monday morning at 7:30... _E_
.@Omarosa's new name via @DennisRodman: "Ms. Saboteur" sounds rather elegant. #CelebApprentice _E_
Just arrived in Arizona! #ImWithYou __HTTP__ _E_
The entire village of Blackdog in Scotland protested to the Council last night about the ugly windmills. @AlexSalmond @pressjournal _E_
My interview in @politico with @pwgavin discussing being awarded the 2012 Statesman of the Year by Sarasota GOP __HTTP__ _E_
There's a reason @mcuban's partners can't stand him and on top of that the team sucks! _E_
Whenever you see the words 'sources say' in the fake news media and they don't mention names.... _E_
"Donald Trump: $200 Million D.C. Hotel Will Be Among World's Best" __HTTP__ via @WNEW _E_
So @BarackObama will attack @MittRomney's career at Bain Capital but won't return donations from Bain executives __HTTP__ _E_
RT @realDonaldTrump: Consumer confidence soars to highest level since 2004 📈 __HTTP__ __HTTP__ _E_
Thank you! We are at 35% in new Reuters poll with #2 coming in at 12%. Time to #MakeAmericaGreatAgain!#Trump2016 __HTTP__ _E_
I made a great deal of money in Atlantic City but left years ago when I saw so many political mistakes being made. I have ZERO involvement! _E_
Just got back from Georgia. The crowds and love for U.S. was so amazing! We all had a great day together will be back soon! _E_
"The biggest doers often suffer the biggest setbacks in life. So if you want to aim high you have be able to handle the bumps."–Think Big _E_
If Republicans are going to pass great future legislation in the Senate they must immediately go to a 51 vote majority not senseless 60... _E_
Why isn't Mexico releasing our Marine. U.S. should come down really hard on them. They have ZERO respect for our so called leader _E_
Via @Ammoland by Fredy Riehl: "Donald Trump Talks: Gun Control Assault Weapons Gun Free Zones & Self Defense" __HTTP__ _E_
Thank you! Vote in 2016! #MakeAmericaGreatAgain __HTTP__ _E_
The VA scandal shows the fatal ineptitude of big central planning government. When will we learn? _E_
#FraudNewsCNN #FNN __HTTP__ _E_
.@Morning_Joe: Marco only won the debate in the minds of desperate people. I won every on line poll even crazy @CNBC. Marco good looking? _E_
He @BarackObama wants to release 5 senior Taliban detainees back to the Taliban. __HTTP__ The Taliban out negotiates him! _E_
The Club For Growth said in their ad that 465 delegates (Cruz) plus 143 delegates (Kasich) is more than my 739 delegates. Try again! _E_
Once again we will have a government of by and for the people. Join the MOVEMENT today! __HTTP__ __HTTP__ _E_
Keep it fast short and direct whatever it is. Donald J. Trump __HTTP__ _E_
The middle class has become the new poor in this country and our incompetent politicians are unable to do anything about it.They don't care! _E_
"Consider the fact that for every gallon of gas you put in your car you pay 45.8 cents in state local and federal taxes." #TimeToGetTough _E_
Thank you for such a beautiful welcome Hawaii. My great honor to visit @PacificCommand upon arrival. Heading to Pearl Harbor w/ @FLOTUS now. __HTTP__ _E_
Just arrived at the Pensacola Bay Center. Join me LIVE on @FoxNews in 10 minutes! #MAGA __HTTP__ _E_
HAPPY EASTER HAVE A GREAT DAY! _E_
.@katyperry I watched Russell Brand and I think his mind is fried he looks really bad. Russell is a total joke a dummy who is lost! _E_
With the complete Ft. Lauderdale victory I will now sue for millions of $'s in attorney fees for which plaintiffs are liable. _E_
Remember @dannyzuker you are not even the real boss of Modern Family no big $$$$$$'s for you! _E_
Check out a picture of the custom made Trump Bike that Paul Teutul Sr. presented to me today in Trump Tower __HTTP__ _E_
Such a wonderful statement from the great @LouDobbs. We take up what may be the most accomplished presidency in modern American history. _E_
Young entrepreneurs across the US are trying to make deals & build businesses daily. Stay positive think big & big things will happen _E_
The @nfl games are so boring now that actually I'm glad I didn't get the Bills. Boring games too many flags too soft! _E_
Please remember I am the ONLY candidate who is self funding his campaign. Kasich Rubio and Cruz are all bought and paid for by lobbyists! _E_
You are right the media is always offending Donald Trump they have no limits but they will do anything not to offend the Boston killer! _E_
Ernie Els and myself at Trump National Doral. __HTTP__ _E_
We must fix our education system for our kids to Make America Great Again. Wonderful day at Saint Andrew in Orlando. __HTTP__ _E_
Spoke yesterday with the King of Saudi Arabia about peace in the Middle East. Interesting things are happening! _E_
Entrepreneurs: Use your imagination. Use your intelligence to execute what your imagination presents to you. _E_
Why isn't the House Intelligence Committee looking into the Bill & Hillary deal that allowed big Uranium to go to Russia Russian speech.... _E_
.@JRubinBlogger one of the dumber bloggers @washingtonpost only writes purposely inaccurate pieces on me. She is in love with Marco Rubio? _E_
I am now in Texas doing a big fundraiser for the Republican Party and a @FoxNews Special on the BORDER and with victims of border crime! _E_
#TBT With my friend @muhammadali __HTTP__ _E_
Just won the lawsuit on leadership of Consumer Financial Protection Bureau CFPB. A big win for the Consumer! _E_
Congratulations to @SixteenChicago @TrumpChicago for being honored with a @MichelinGuideChi two star rating again this year! _E_
John Menard of Menards home improvement stores in Midwest treats employees horribly should they form a union? __HTTP__ _E_
It's a national embarrassment that an illegal immigrant can walk across the border and receive free health care and one of our Veterans..... _E_
#ObamacareFail __HTTP__ _E_
Honored to serve as Commander in Chief to the courageous men and women of our U.S. Armed Forces. A grateful nation thanks you! __HTTP__ _E_
Video: Trump Golf Links at Ferry Point @TrumpFerryPoint __HTTP__ _E_
Weak JEB getting thrown out by management during speech. Do you think he will be this tough on Putin & others? __HTTP__ _E_
RT @paulsperry_: Fusion GPS firm behind disputed Russia dossier retracts its claim of FBI mole in Trump camp __HTTP__ _E_
Just got back from Iowa. Fantastic evening with truly fabulous people. Will be back again soon. Thanks! _E_
.@katyperry will do much better __HTTP__ _E_
Entrepreneurs: We win in our daily lives by being careful with every day every moment. _E_
Go with your gut. Take chances. If you think you have the ingredients that you need take chances because your biggest successes... _E_
Via @nypost: Trump's links getting green __HTTP__ _E_
ICYMI This week we hosted a #MadeInAmerica event right here at the @WhiteHouse! If it is MADE IN AMERICA it is the BEST! USA __HTTP__ _E_
The Answer to both Social Security and Medicare is a robust growing economy not cuts on the elderly. _E_
Learning to expect problems saved me from a lot of wasted energy. Winners see problems as just another way to prove themselves. _E_
The premier landmark in midtown NYC Trump Tower features our signature amenities w/a magnificent waterfall __HTTP__ _E_
Will be interviewed on @Morning_Joe at 7:20. Great crowd in Las Vegas yesterday! _E_
National Review @NRO may be going out of business because of the really pathetic job being done by @JonahNRO. No talent means death sad! _E_
RT @TeamTrump: .@HillaryClinton is RAISING your taxes to a disastrous level. @realDonaldTrump is going to LOWER your taxes BIG LEAGUE! #D... _E_
"Obama's promises on the Iran deal are like him promising 'if you like your healthcare plan you can keep it'" @marklevinshow _E_
If Mitt Romney were in the private sector & he suffered the horrendous loss of 2012 do you think he'd rehire himself for 2016?—I don't! _E_
If other countries benefit from our armed forces protecting them those countries should pay for the protection. #TimeToGetTough _E_
Re CRIPPLED AMERICA I am signing books for the next two weeks. Order yours for holiday gifts! __HTTP__ _E_
$1B down another $1B to go. ObamaCare website is 40% unfinished. This is beyond pathetic. _E_
Trump locks down Delaware GOP delegates. #Trump2016 #MAGA __HTTP__ _E_
Money was never a big motivation for me except as a way to keep score. The real excitement is playing the game! _E_
This Sunday's @CelebApprentice will shock you! Big Development...Be sure to tune in on @NBC this Sunday at 9PM EST! _E_
Republicans Senators are working hard to get their failed ObamaCare replacement approved. I will be at my desk pen in hand! _E_
#TimeToGetTough: Making America #1 Again my new book available today. The book both China and OPEC do NOT want you to read. _E_
Think positively. There are always opportunities. Keep your focus and don't give up! _E_
I still don't get how @KarlRove spent $400 million & lost all. _E_
I had a fun time doing the #CallMeMaybe video featuring the @MissUSA contestants @BravoAndy and @GiulianaRancic __HTTP__ _E_
What is vital now is a swift restoration of law and order and the protection of innocent lives.#Charlottesville __HTTP__ _E_
In light of Newtown our country has to pull together. _E_
How can Crooked Hillary put her husband in charge of the economy when he was responsible for NAFTA the worst economic deal in U.S. history? _E_
"I can accept failure everyone fails at something. But I can't accept not trying." – Michael Jordan _E_
Home of the iconic Ailsa a four time @The_Open course @TrumpTurnberry is a landmark on the Ayrshire coastline __HTTP__ _E_
...the Uranium to Russia deal the 33000 plus deleted Emails the Comey fix and so much more. Instead they look at phony Trump/Russia.... _E_
Europe and the U.S. must immediately stop taking in people from Syria. This will be the destruction of civilization as we know it! So sad! _E_
Will be on @FallonTonight with @JimmyFallon on @NBC at 11:35pmE. Enjoy! #Trump2016 __HTTP__ _E_
.@Rosie If America's Got Talent uses you the show will fail like all your others! _E_
Congrats to Congress on their 112 'gold tier' healthcare plans __HTTP__ Why should they suffer like regular Americans? _E_
Life is very fragile and success doesn't change that. If anything success makes it more fragile. Anything can (cont) __HTTP__ _E_
In 2016 the Old Post Office will be fully transformed into an iconic destination Trump Int'l Washington DC __HTTP__ _E_
Bernie Sanders supporters have every right to be apoplectic of the complete theft of the Dem primary by Crooked Hillary! _E_
Thank you Redding California!#MakeAmericaGreatAgain #CAPrimary __HTTP__ _E_
The last thing our country needs is another BUSH! Dumb as a rock! _E_
Lyin' Ted and Kasich are mathematically dead and totally desperate. Their donors & special interest groups are not happy with them. Sad! _E_
Crooked Hillary Clinton who called BREXIT 100% wrong (along with Obama) is now spending Wall Street money on an ad on my correct call. _E_
Obama's wind turbines kill "13 39 million birds and bats every year!" __HTTP__ Save our bald eagles symbol of our nation! _E_
Entrepreneurs: Have your own vision and stick with it. Don't be afraid to be unique. Don't tread water get out there and go for it. _E_
"Winning is habit. Unfortunately so is losing." Vince Lombardi _E_
I can't believe @Denver_Broncos allowed final touchdown—dumbest defensive play I have ever seen in football. _E_
The Al Qaeda flag is now flying over Benghazi. @BarackObama spent over $3Billion of our money for this? _E_
Beautiful rally in Albuquerque New Mexico this evening thank you. Get out & VOTE! #DrainTheSwampWatch rally:... __HTTP__ _E_
"You can attack defend counterattack sell or ignore. " Roger Ailes to Pres. Reagan during prep for 2nd Mondale debate/ '84 election _E_
The CIA report should not be released. Puts our agents & military overseas in danger. A propaganda tool for our enemies. _E_
Entrepreneurs: Vision remains vision until you focus do the work and bring it down to earth where it will do some good. _E_
I will be meeting General Kelly General Mattis and other military leaders at the White House to discuss North Korea. Thank you. _E_
Then we attended the Scottish fashion show that benefits veterans Dressed to Kilt 2010 which I co hosted with Sir Sean and Lady Connery. _E_
God bless the people of Mexico City. We are with you and will be there for you. _E_
Here we go A healthcare worker who treated Thomas Duncan the man who flew into the U.S. from West Africa infected with Ebola caught it! _E_
Located in beautiful Briarcliff NY @TrumpNationalNY features a 7291 yard course just 25 minutes outside NYC __HTTP__ _E_
Reality TV's #1 Bad Girl @OMAROSA is back on the upcoming 13th season of All Star @CelebApprentice. She is great as always. _E_
The Boston killer applying today for ObamaCare. He demands that medical bills be taken care of immediately. Does this include dental? _E_
I am attracting the biggest crowds by far and the best poll numbers also by far. Much of the media is totally dishonest. So sad! _E_
Snowden is a liar.and a fraud! _E_
RT @TeamTrump: When @realDonaldTrump is POTUS families are going to be safe and secure. Law and order will be RESTORED! #MAGA #Debates #De... _E_
Thanks Geraldo you're a champion. __HTTP__ _E_
SECURE THE BORDER! BUILD A WALL! _E_
Hope you liked it. Tune in tomorrow night at 8:00 and 9:00 for two episodes and two boardrooms! Will be a great evening of television! _E_
The LIVE FINALE of @ApprenticeNBC is this Sunday at 9/8C. Watch and see who will be the first ever All Star Celebrity Apprentice. _E_
I'll be discussing a variety of topics tonight with Greta Van Susteren 10 p.m. on Fox News. It will be the first of a two part series. _E_
Iowa was amazing today. Great crowd great people. Thanks will be back soon! _E_
There is no comparison between @ApprenticeNBC and Shark Tank in the ratings. The Apprentice beats Shark Tank hands... __HTTP__ _E_
New report from DOJ & DHS shows that nearly 3 in 4 individuals convicted of terrorism related charges are foreign born. We have submitted to Congress a list of resources and reforms.... _E_
The top course on the west coast @TrumpGolfLA overlooks Pacific Ocean & offers a luxurious public golf experience __HTTP__ _E_
President Obama do not attack Syria. There is no upside and tremendous downside. Save your powder for another (and more important) day! _E_
We will always ENFORCE our laws PROTECT our borders and SUPPORT our police! #LESMHarrisburg Pennsylvania #FlashbackFriday #MS13 __HTTP__ _E_
In presidential voting so far John Kasich is ZERO for 22. So why would he be a good candidate? Hillary would beat him I will beat Hillary! _E_
Always great to see the wonderful people of South Carolina. Thank you for the beautiful welcome at Greenville Spartanburg Int'l Airport! __HTTP__ _E_
Tens of millions of dollars in airstrikes had no impact because key leaders fled after hearing ON NEWS REPORTS the strikes were coming. DUMB _E_
This memo totally vindicates "Trump" in probe. But the Russian Witch Hunt goes on and on. Their was no Collusion and there was no Obstruction (the word now used because after one year of looking endlessly and finding NOTHING collusion is dead). This is an American disgrace! _E_
I am truly honored and grateful for receiving SO much support from our American heroes... __HTTP__ __HTTP__ _E_
Since Election Day on November 8 the Stock Market is up more than 25% unemployment is at a 17 year low & companies are coming back to U.S. _E_
Entrepreneurs: Successful negotiation means knowing what the other side wants. You've got to know where they're coming from. Pay attention! _E_
much worse just look at Syria (red line) Crimea Ukraine and the build up of Russian nukes. Not good! Was this the leaker of Fake News? _E_
I'm doing The David Letterman Show tonight should be interesting! _E_
What will happen to Omarosa tonight? One of our all time great episodes! _E_
Another terrorist attack in Paris. The people of France will not take much more of this. Will have a big effect on presidential election! _E_
The new Dark Knight Rises Trailer is great __HTTP__ The movie filmed scenes in Trump Tower last October. _E_
Hey Missouri let's defeat Crooked Hillary & @koster4missouri! Koster supports Obamacare & amnesty! Vote outsider Navy SEAL @EricGreitens! _E_
Vanity Fair Magazine which used to be one of my favorites is failing badly. Newsstand sales are plummeting (cont) __HTTP__ _E_
My @todayshow show interview with @IvankaTrump discussing the fierce competition in All Star @CelebApprentice __HTTP__ _E_
Using Alicia M in the debate as a paragon of virtue just shows that Crooked Hillary suffers from BAD JUDGEMENT! Hillary was set up by a con. _E_
I want to express our support and extend our prayers to all those affected by the vile terror attack in Spain last month. __HTTP__ _E_
"When everyone works with the same energy loyalty and focus it makes for smooth sailing all around." – Midas Touch _E_
Fact – Amnesty lowers wages and invites more lawlessness. Obama has unilaterally cancelled any chance of immigration reform. _E_
The winner of Best in Show at the Westminster Kennel Club Show Miss P will be coming to my office this morning. _E_
Remember that Marco Rubio is very weak on illegal immigration. South Carolina needs strength as illegals and Syrians pour in. Don't allow it _E_
Have the right mindset for the job. See your work as an art form which means paying attention to every detail. _E_
Could this be my newest apprentice? __HTTP__ ...Enter the contest .. . __HTTP__ _E_
Enjoy the Super Bowl! _E_
Government is shut down yet Obama is now harassing the privately owned @Redskins to change its name.He needs to focus on his job! _E_
Looking forward to being in Council Bluffs Iowa later today. Despite weather rally is on will be fantastic! #MakeAmericaGreatAgain! _E_
People are pouring into Washington in record numbers. Bikers for Trump are on their way. It will be a great Thursday Friday and Saturday! _E_
Our country and our leaders are getting dumber all the time. Now they are about to release full documentation on torture. Will destroy CIA _E_
Bob Dole Warns of 'Cataclysmic' Losses With Ted Cruz and Says Donald Trump Would Do Better via New York Times: __HTTP__ _E_
President Obama spoke for me and every American in his remarks in #Newtown Connecticut. _E_
RT @AnnCoulter: Trump's speech today was Churchillian only better. You can tell by the spluttering hysteria on TV about @realDonaldTrump. _E_
Entrepreneurs: Ask yourself is this a blip or is it a catastrophe? and your equilibrium will be kept in check if/when hard times hit. _E_
If you want to be successful two important considerations are passion and efficiency. Think Like a Champion _E_
Today on Earth Day we celebrate our beautiful forests lakes and land. We stand committed to preserving the natural beauty of our nation. _E_
Heading to Washington this morning. Much work to do. Focus on trade and military. #MAGA _E_
I just filed a major ethics complaint against crooked New York State Attorney General Eric Schneiderman he should resign from office! _E_
We are winning and the press is refusing to report it. Don't let them fool you get out and vote! #DrainTheSwamp on November 8th! _E_
.@alexsalmond @pressjournal RT @rdowns @realdonaldtrump Margaret Thatcher NEVER would have allowed those wind mill monstrosities. _E_
.@williebosshog watched you on @foxandfriends. You were great and I appreciate the nice statements. I'm sending out for your new book now! _E_
Another @BarackObama investment triumph the $500Billion American funded Finnish plug in cars are all being recalled __HTTP__ _E_
Obama and all others have been so weak and so politically correct that terror groups are forming and getting stronger! Shame. _E_
.@washingtonpost by @OConnellPostbiz:"Donald Trump lands @chefjoseandres for Old Post Office flagship restaurant" __HTTP__ _E_
I am in Dubai with Damac. PLACE IS BOOMING AMAZING! Major news conference in two hours. Announcing luxury villas and major golf course. _E_
If you want to be successful in business you must take risks. Make sure each risk is calculated and can have a positive fallback. _E_
Congrats @TrumpToronto for being ranked #1 on @TripAdvisor and a Travellers' Choice 2013 Winner! _E_
Big week coming up! _E_
There usually is an easy solution to every problem. For instance a lot of our country's problems can be solved in next year's election. _E_
Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time. Thomas A. Edison _E_
.@MattBevin: As someone well versed in job creation and the Private Sector if you lie on your resume You're Fired! _E_
Judge Jeanine Slams GOP Establishment: __HTTP__ _E_
March 5th is rapidly approaching and the Democrats are doing nothing about DACA. They Resist Blame Complain and Obstruct and do nothing. Start pushing Nancy Pelosi and the Dems to work out a DACA fix NOW! _E_
Bad break for @TigerWoods hits a great shot which hits the pin and kicks into the water gets a bogey on hole with another great shot Champ! _E_
Just spoke with @NYGovCuomo and @NYCMayor de Blasio to let them know that the federal government... _E_
It begins Republican Party of Virginia controlled by the RNC is working hard to disallow independent unaffiliated and new voters. BAD! _E_
Re Lance Armstrong—not only was it a big lie but a big lie that lasted too long! _E_
The primary plaintiff in the phony Trump University suit wants to abandon the case. Disgraceful! _E_
Dems failed in Kansas and are now failing in Georgia. Great job Karen Handel! It is now Hollywood vs. Georgia on June 20th. _E_
Record crowd and standing ovation at Simpson College in Iowa lots of fun wonderful audience! _E_
A great night in Raleigh North Carolina! THANK YOU! #Trump2016 __HTTP__ _E_
I'm a skeptical guy but I don't believe Petraeus used this to get out of the Benghazi hearings. _E_
It's Thursday. I wonder how much money @BarackObama drained from Medicare today to finance ObamaCare. _E_
.@billmaher has not yet sent me the $5M he owes which I am giving to various charities. Come on Bill—you made a deal. _E_
Wow I'm at 2200000 followers but I'd love to get rid of the haters & losers—they're such a waste of time! _E_
Third Gun Linked to 'Fast and Furious' Identified at Border Agent's Murder Scene. When will the White House come clean? _E_
Vanity Fair which looks like it is on its last legs is bending over backwards in apologizing for the minor hit they took at Crooked H. Anna Wintour who was all set to be Amb to Court of St James's & a big fundraiser for CH is beside herself in grief & begging for forgiveness! _E_
Now China is publicly supporting the OWS protests __HTTP__ It's time for the protesters to go home. _E_
Entrepreneurs: Brainpower is the ultimate leverage. Don't underestimate yourself or your possibilities. _E_
Trump International Golf Club Turnberry Scotland has been home to four of the greatest Open Championships in history __HTTP__ _E_
Congratulations to my Catholic friends on the selection of Pope Francis I to lead the Catholic Church. People that know him love him! _E_
Brent Musburger did himself a great favor by saying what everyone was thinking he is much more popular now than before. _E_
I am a handwriting analyst. Jack Lew's handwriting shows while strange that he is very secretive—not necessarily a bad thing. _E_
Via @mrctv by Ben Graham: Border Reports Back Up Trump's 'Rapists' Claim __HTTP__ _E_
RT @realDonaldTrump: As the phony Russian Witch Hunt continues two groups are laughing at this excuse for a lost election taking hold Dem... _E_
A rare case where the U.S. should help __HTTP__ _E_
I hate when the news media so afraid to offend anyone always refers to the BOSTON KILLER as the suspect . _E_
Empty pockets never held anyone back. Only empty heads and empty hearts can do that. Norman Vincent Peale _E_
"He who is not courageous enough to take risks will accomplish nothing in life." Muhammad Ali _E_
Making speech tonight in New Hampshire leaving now. Fantastic people fantastic crowd! _E_
The San Fran crash was totally the pilot's fault may be too late for drug testing RIDICULOUS! _E_
The Mayweather decision is a disgrace! _E_
Thank you Nevada! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
We need someone with experience to rebuild America. #MakeAmericaGreatAgain __HTTP__ _E_
What did our very stupid & ineffective A.G. Eric Schneidean during his trips to MY office tell me about President Obama & Governor Cuomo? _E_
Thank you Wisconsin! Tuesday was a great success for #WorkforceWeek at @WCTC w/ @IvankaTrump & @GovWalker. Remarks... __HTTP__ _E_
Josh8J4 @realDonaldTrump I have a dream that you will be president to make this country great again. #USA Thank you. _E_
When will Mayor Vescio and manager Zegarelli repave Pine Road in @BriarcliffManor? It is a disgrace! _E_
Nobody knows for sure that the Republicans & Democrats will be able to reach a deal on DACA by February 8 but everyone will be trying....with a big additional focus put on Military Strength and Border Security. The Dems have just learned that a Shutdown is not the answer! _E_
He @MittRomney gets the China problem why don't the others? _E_
Patrick Reed—We are proud to have you as our champion at Doral. Love the attitude & the play. See you in March at the Cadillac WGC. _E_
I'll be at Liberty University Monday 10 AM for speech. Looking forward to meeting students all sold out! _E_
RT @foxnation: .@SenTedCruz: I want to Get to a 'Yes' Vote: __HTTP__ _E_
"The best luck of all is the luck you make for yourself." – General Douglas MacArthur _E_
Mr. President it is time to lead on the Korean crisis. Make a statement from the Rose Garden and send a strong message to the man child! _E_
Today is the day! Knock on doors and make calls with us on National Day of Action! #TrumpTrain #MAGA... __HTTP__ _E_
Why is Obama playing basketball today? That is why our country is in trouble! _E_
Consumer confidence is at a 16 year high....and for good reason. Much more regulation busting to come. Working hard on tax cuts & reform! _E_
The Bernie Sanders supporters are furious with the choice of Tim Kaine who represents the opposite of what Bernie stands for. Philly fight? _E_
Information is being illegally given to the failing @nytimes & @washingtonpost by the intelligence community (NSA and FBI?).Just like Russia _E_
A Rod must be dropped in the Yankees line up tonight if they want to win. He simply can't perform without drugs. _E_
October has a 7% foreclosure increase last month. Is this @BarackObama's economic recovery? _E_
Wise words from my father: "Know everything you can about what you're doing." Fred C. Trump _E_
A @aahs5star Diamond & Green Star Diamond Award winner @TrumpGolfLA is the nation's top public course __HTTP__ _E_
70 stores above Punta Pacifica's pristine peninsula @TrumpPanama offers fine dining five pools & luxury rooms __HTTP__ _E_
I will be doing @foxandfriends at 7.00 (15 minutes). _E_
Another great cause Obama could send my $5M donation to is a charity for 9/11 First Responders. They are American heroes. _E_
Such bad reporting: A puff piece on Ben Carson in the @nytimes states that Carson is trying to solidify his lead. But I am #1 easily! Sad _E_
Our next Vice President of the United States of America Gov. @Mike_Pence!#GOPinCLE #GOPConvention#AmericaFirst __HTTP__ _E_
I look forward to paying my respects to our brave men and women on this Memorial Day at Arlington National Cemetery later this morning. _E_
They are saying that tickets to tonight's Saturday Night Live are the hardest to get in the history of this great show! Off to a good start! _E_
I've gotten many letters from people fighting autism thanking me for stating how dangerous 38 vaccines on a (cont) __HTTP__ _E_
Wow Ted Cruz falsely suggested Marco Rubio mocked the Bible and was just forced to fire his Communications Director. More dirty tricks! _E_
I enjoy meeting tourists in #TrumpTower. People travel from across the world to see the five level Atrium & waterfall. _E_
I will be going to Indiana on Thursday to make a major announcement concerning Carrier A.C. staying in Indianapolis. Great deal for workers! _E_
I have a feeling the emphasis by @johnrich and @marleematlin will be on the charities and the money raised. (cont) __HTTP__ _E_
Snow and freezing weather all over mid section of Country. Global warming specialists better start thinking fast! _E_
Obama:"I will destroy ISIS" = Obama: "If you like your healthcare plan you can keep your plan." _E_
Watch me get inducted into the #WWEHOF tonight at 10PM on USA. I will be posting exclusive behind the... __HTTP__ _E_
Scotland is beautiful. I spent several years looking for the right place visiting over 200 sites and this is absolutely the right place! _E_
George Will was pushing for @JonHuntsman for the GOP nomination in December...said he was going to win. (cont) __HTTP__ _E_
Thank you. __HTTP__ _E_
China is closing a massive oil deal w/ Russia taking advantage of the Ukraine conflict __HTTP__ Smart unlike our leaders. _E_
I havn't seen @tonyschwartz in many years he hardly knows me. Never liked his style. Super lib Crooked H supporter. Irrelevant dope! _E_
It was my great honor to welcome Prime Minister Alexis Tsipras of Greece to the WH today! __HTTP__ 📸 __HTTP__ __HTTP__ _E_
Crooked Hillary Clinton perhaps the most dishonest person to have ever run for the presidency is also one of the all time great enablers! _E_
Join me LIVE in South Korea🇰 #NationalAssembly #POTUSinAsia __HTTP__ __HTTP__ _E_
My @ WCNC News interview w/ @DianneG touring the magnificent Trump National Charlotte course & facilities __HTTP__ _E_
Thank you Arizona. Beautiful turnout of 15000 in Phoenix tonight! Full coverage of rally via my Facebook at: __HTTP__ __HTTP__ _E_
#TheArsenioHallShow Well it had to happen. People that are disloyal in the long run never make it. Arsenio was just cancelled! _E_
... at St. Jude Children's Research Hospital __HTTP__ I am proud of you Eric. _E_
Hillary Clinton has bad judgment and is unfit to serve as President. __HTTP__ _E_
The Middle East is blowing up we didn't back Egypt and now they riot against us. Iran is using Iraqi airspace (cont) __HTTP__ _E_
For those asking the Republicans only have 51 votes in the Senate and they need 60. That is why we need to win more Republicans in 2018 Election! We can then be even tougher on Crime (and Border) and even better to our Military & Veterans! _E_
Join me live as we recognize the first responders to the June 14th shooting involving @SteveScalise. #TeamScalise __HTTP__ __HTTP__ _E_
.@DannyZuker Don't lie @ApprenticeNBC was #1 in all major demos at 10. Do not lie! _E_
RT @Team_Trump45: @realDonaldTrump __HTTP__ _E_
Sadly the overwhelming amount of violent crime in our major cities is committed by blacks and hispanics a tough subject must be discussed. _E_
#DrainTheSwamp! __HTTP__ _E_
With Hillary and Obama the terrorist attacks will only get worse. Politically correct fools won't even call it what it is RADICAL ISLAM! _E_
Thank you so much for the wonderful article Robert Davi. __HTTP__ _E_
Sen. McCain should not be talking about the success or failure of a mission to the media. Only emboldens the enemy! He's been losing so.... _E_
I just released my financial disclosure forms the largest numbers in the history of the F.E.C. Even the dishonest media thinks great! _E_
So professional of @ABC news to throw out the failing @UnionLeader newspaper from their debate. Paper won't survive highly unethical! _E_
A great afternoon. Thank you South Carolina! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Today's job report is not a good sign & we could be facing another recession. No real job growth. We need over 300K new jobs a month. _E_
We should have gotten more of the oil in Syria and we should have gotten more of the oil in Iraq. Dumb leaders. _E_
Thank you to all for your wonderful comments on my speech. I could feel the electricity in thr air. Great reviews most votes ever recieved _E_
What do you think @amandatmiller is writing? #CelebApprentice _E_
Our country is being torn apart from the inside it's getting nasty out there. _E_
It was a pleasure to have President Ashraf Ghani of Afghanistan with us this morning! #USAatUNGA #UNGA __HTTP__ _E_
RT @NYPDnews: Many supported NYers when Sandy hit. Now our NY Task Force 1 can be there to help others during Harvey & #HurricaneIrma. Here... _E_
Based on the ovation last night from the Letterman @Late_Show audience I believe it will be hard for Obama to throw $5M down the drain.... _E_
Why would the USChamber be upset by the fact that I want to negotiate better and stronger trade deals or that I want penalties for cheaters? _E_
I am on @greta now! _E_
In one hour I will be making a major announcement from Trump Tower. Watch it live on Periscope! __HTTP__ _E_
Do you agree with the client's decision? #CelebApprentice _E_
Wow Bernie Sanders just admitted that the real unemployment rate is 10% (it is actually over 20%) and for African American youth 51%. _E_
Dopey Mort Zuckerman owner of the worthless @NYDailyNews has a major inferiority complex. Paper will close soon! _E_
92 stories above North Michigan Avenue @TrumpChicago's 5 Star @Forbes rated rooms have the best views of Chicago __HTTP__ _E_
Don't believe Chrysler (if Obama wins) see how fast @Jeep production will be moved to China and I'll be watching! _E_
Via @wsoctv by @BlairWSOC9: EXCLUSIVE: Donald Trump talks possible presidential run __HTTP__ _E_
I AM PLEASED TO INFORM YOU THAT CELEBRITY APPRENTICE HAS BEEN RENEWED FOR ANOTHER SEASON BY NBC. SEE YOU AT THE NBC UPFRONTS TOMORROW. _E_
It was Rosie O'Donnell who ate the cake in the vicious Hillary commercial about me not Crooked Hillary! @marthamaccallum _E_
Today I will meet with Canadian PM Trudeau and a group of leading business women to discuss women in the workforce. __HTTP__ _E_
Hillary Clinton is not qualified to be president because her judgement has been proven to be so bad! Would be four more years of stupidity! _E_
The middle class has worked so hard are not getting the kind of jobs that they have long dreamed of and no effective raise in years. BAD _E_
Just announced that because of Trump advertising rates for debate on @CNN are going from $5000 to $200000 a 4000% increase.PAY CHARITY? _E_
State Senator Shirley Huntley ratted on black politicians & was believed when she ratted on @AGSchneiderman nobody listened. Racism! _E_
Response to Huffington Post __HTTP__ _E_
Crowd gathers to hear Trump speech in Las Vegas __HTTP__ _E_
Thank you Des Moines Iowa! Governor @Mike_Pence and I appreciate your support! #MAGA #TrumpTrain __HTTP__ _E_
FEMA and first responders are working hard (yet again) on Hurricane Nate. Military helping. Very much under control! _E_
Exciting news—After massive construction the Blue Monster at Trump National Doral is open for business today. __HTTP__ _E_
Iran and the United States just pushed deadline back SEVEN MONTHS on working out a nuclear deal. Iran is tapping along our bad negotiators! _E_
Thank you South Carolina! #Trump2016 __HTTP__ _E_
.@VP Mike Pence is working hard on HealthCare and getting our wonderful Republican Senators to do what is right for the people. _E_
Entrepreneurs: Apply your skills and talent but above all be tenacious. See yourself as victorious which means never giving up. _E_
Hagel has been endorsed by China __HTTP__ & Iran __HTTP__ for SOD. Welcome to Obama's second term! _E_
Johnny Miller correctly very critical of greens at Pinehurst. Said they should be redone _E_
.@Peggynoonannyc Interesting article but I will beat Hillary easily. People that have given up on the system will come out to vote for me! _E_
The GOP primary is getting very nasty. The candidates need to remember that @BarackObama is the main target. He must not be reelected. _E_
Via @DMRegister by @BylineAndyDavis: "Donald Trump speaks to veterans residents in Coralville" __HTTP__ _E_
Little @MacMiller sent me an expensive plaque for making his song "Donald Trump" such a big hit. Mac you still... __HTTP__ _E_
The debates are going to have a big impact on the election. @MittRomney has proved in Florida he delivers under pressure. _E_
Even Crazy Jim Acosta of Fake News CNN agrees: "Trump World and WH sources dancing in end zone: Trump wins again...Schumer and Dems caved...gambled and lost." Thank you for your honesty Jim! _E_
The @erictrumpfdn Golf Invitational featuring a performance by @BretMichaels was a great event. Enjoy the video.... __HTTP__ _E_
#TrumpVine Opinion on Egypt __HTTP__ _E_
Only a Reagan or a Trump like figure in the White House will achieve this goal. __HTTP__ _E_
This election is a choice between law order & safety or chaos crime & violence. I will make America safe again for everyone. #ImWithYou _E_
.@Ed_Klein's book 'The Amateur' is out in paper back. Lots of insights. _E_
If you have passion confidence resilience & vision you could become an entrepreneur. Add focus to the list & you're off to a good start _E_
Russia is sending a fleet of ships to the Mediterranean. Obama's war in Syria has the potential to widen into a worldwide conflict. _E_
We will never forget the 241 American service members killed by Hizballah in Beirut. They died in service to our nation. __HTTP__ _E_
China is taking the oil from Iraq after we spent 1.5 trillion dollars and thousands of lives for their freedom . Our leaders are so stupid! _E_
I will be on Fox and Friends at.7.00 A.M. Enjoy! _E_
Look forward to being in Tampa this afternoon. Wonderful crowds. Thank you Florida! _E_
NYC's top cop acted wisely and legally to monitor activities of some in the Muslim community. Vigilance keeps us (cont) __HTTP__ _E_
Do you believe what is going on in Washington with respect to Syria these people don't have a clue! _E_
Just left Sioux Center Iowa. My speech was very well received. Truly great people! Packed house overflow! _E_
If the great Si Newhouse were still running @CondeNastCorp he would fire Graydon Carter immediately circulation tanking. _E_
RT @Scavino45: Hurricane force winds hit Florida Keys. 390 shelters have been opened in Florida. Shelters near you __HTTP__ _E_
Be sure to keep following announcements on the development of Trump International Golf Club Dubai. Will be spectacular. _E_
It is impossible for the FBI not to recommend criminal charges against Hillary Clinton. What she did was wrong! What Bill did was stupid! _E_
Great @Esquiremag piece '@DonaldJTrumpJr: What I've Learned' __HTTP__ _E_
Highly respected Constitutional law professor Mary Brigid McManamon has just stated Ted Cruz is not eligible to be President. Big problem _E_
When will @BarackObama release his college and law school transcripts? __HTTP__ _E_
The last thing we need in Alabama and the U.S. Senate is a Schumer/Pelosi puppet who is WEAK on Crime WEAK on the Border Bad for our Military and our great Vets Bad for our 2nd Amendment AND WANTS TO RAISES TAXES TO THE SKY. Jones would be a disaster! _E_
Call @MELANIATRUMP today on @QVC at 5 PM EST say hello and buy buy buy! _E_
...can't change history but you can learn from it. Robert E Lee Stonewall Jackson who's next Washington Jefferson? So foolish! Also... _E_
Amazing comeback by The Heat your friends at your favorite golf club Trump National Doral are proud of you. NOW for game 7! _E_
Departing New York with General James 'Mad Dog' Mattis for tonight's rally in Fayetteville North Carolina! See you... __HTTP__ _E_
Don't ever forget we will together MAKE AMERICA GREAT AGAIN! _E_
....Also there is NO COLLUSION! _E_
If the disgusting and corrupt media covered me honestly and didn't put false meaning into the words I say I would be beating Hillary by 20% _E_
Ebola patient will be brought to the U.S. in a few days now I know for sure that our leaders are incompetent. KEEP THEM OUT OF HERE! _E_
The winner of Best In Show of the 139th @WKCDOGS Miss P visited @TrumpTowerNY today __HTTP__ _E_
Watch @PaulRyanVP explain how 'It's irrefutable' that President Obama is damaging Medicare' __HTTP__ _E_
What the hell is going on with GLOBAL WARMING. The planet is freezing the ice is building and the G.W. scientists are stuck a total con job _E_
Looking forward to Friday night in the Great State of Alabama. I am supporting Big Luther Strange because he was so loyal & helpful to me! _E_
Today I was thrilled to announce a commitment of $25 BILLION & 20K AMERICAN JOBS over the next 4 years. THANK YOU... __HTTP__ _E_
Obama wanted Putin to reset. Instead Putin laughed at him and reloaded. _E_
Mexico doesn't respect our border hourly __HTTP__ Release USMC Tahmooressi NOW! Time for a boycott? #SaveOurMarine _E_
I don't know @SamuelLJackson to best of my knowledge haven't played golf w/him & think he does too many TV commercials—boring. Not a fan. _E_
Thank you @SeanHannity & @BoDiet! #MakeAmericaGreatAgain _E_
June 16th __HTTP__ _E_
The election is absolutely being rigged by the dishonest and distorted media pushing Crooked Hillary but also at many polling places SAD _E_
This Tweet from @realDonaldTrump has been withheld in response to a report from the copyright holder. _E_
America's Olympic uniforms are manufactured in China. Burn the uniforms!#U.S.OlympicCommittee _E_
Some low life journalist claims that I made a pass at her 29 years ago. Never happened! Like the @nytimes story which has become a joke! _E_
"@IvankaTrump: 'Trump Estates Dubai unlike anything else in the region'" __HTTP__ via @aawsat_eng by Musaid Al Zayani _E_
The wimps that run Penn State should be forced to resign (and be sued) for the pathetic settlement they made and destruction of great legacy _E_
Worst ever issue of @VanityFair magazine—bad food Graydon Carter should be fired! _E_
1. Each week you the audience can choose an MVP among the celebrities @CelebApprentice using Twitter...... _E_
Donald Trump Announcement: $5 Million for Obama College Records __HTTP__ via @Newsmax_Media _E_
.@bobbyjindal watched you on @TeamCavuto. Made some excellent points. Best Wishes. _E_
Via @HorsetalkNZ: "NY's Central Park Horse Show a huge success" __HTTP__ _E_
It is time to #DrainTheSwamp in Washington D.C! Vote Nov. 8th to take down the #RIGGED system! __HTTP__ _E_
I loved watching Clint Eastwood last night he was terrific! _E_
Obama's attack on the internet is another top down power grab. Net neutrality is the Fairness Doctrine. Will target conservative media. _E_
Another great poll result! Thank you! __HTTP__ _E_
Obama is the most profligate deficit & debt spender in our nation's history. Doubled debt (cont) __HTTP__ _E_
Now a small country like Sudan tells Obama he can't send any more Marines __HTTP__ We are a laughing stock. _E_
15K in OK! Had to turn away 5k but we are coming back soon to take care of them! So much love in the crowd! Thanks! __HTTP__ _E_
If only the illegals were Tea Party members then Obama would get them out of the country immediately. _E_
Via @Newsmax_Media by @melaniebatley: Donald Trump: France's Strict Gun Laws Enabled Attack __HTTP__ _E_
Look here's the deal: @BarackObama has been a total disaster. He has spent this country into the ground and (cont) __HTTP__ _E_
.@MarissaMayer is right to expect Yahoo employees to come to the workplace vs. working at home. She is doing a great job! _E_
Leaving for Liberty University. I'll be speaking today in front of a record crowd. #Trump2016 _E_
By continuing to give massive subsidies to Scotland's ugly wind turbines @David_Cameron is playing right into @AlexSalmond's hands. _E_
China has 5 oil projects in Iraq and we didn't get anything from the Iraqis except asked to leave. Iraq is going (cont) __HTTP__ _E_
Crooked Hillary Clinton is unfit to serve as President of the U.S. Her temperament is weak and her opponents are strong. BAD JUDGEMENT! _E_
This week we came one step closer to reaching the goal of aligning the skills taught in our nation's classrooms with the jobs of the future. __HTTP__ _E_
America needs @MittRomney and @PaulRyanVP and we need them right now. @GovChristie _E_
...these days...we could all use a little of the power of Trumpative thinking. –BarnesandNoble.com __HTTP__ _E_
How come nobody mentions that the Nielsen Ratings of the Apprentice after 12 seasons as shown by Howard Stern totally blow away... _E_
The phony Club For Growth which asked me in writing for $1000000 (I said no) is now wanting to do negative ads on me. Total hypocrites! _E_
New York City's iconic architectural masterpiece @TrumpTowerNY houses prime commercial residential & retail space __HTTP__ _E_
.@IvankaTrump's @FoxNewsSunday "Power Player of the Week" interview with Chris Wallace __HTTP__ _E_
Senator Mitch McConnell said I had excessive expectations but I don't think so. After 7 years of hearing Repeal & Replace why not done? _E_
Via @BreitbartNews by @mboyle1: "EXCLUSIVE — DONALD TRUMP TO SPEAK AT CPAC" __HTTP__ @CPACnews _E_
Watching other networks and local news. Really good night! Crazy @megynkelly is unwatchable. _E_
This is more than a campaign it is a movement. #MakeAmericaGreatAgainSIGN UP TODAY & WE WILL WIN! __HTTP__ _E_
Join me in Pueblo Colorado on Monday afternoon at 3pm! #TrumpRally __HTTP__ _E_
FACT – the reason why Americans have to worry about a government shutdown is because Obama refuses to pass a budget. _E_
Be sure to stop by Trump Tower today I'll be signing copies of my new book Time To Get Tough from 11 am to 2 pm. _E_
My warmest condolences to the families of the horrible Roseburg Oregon shootings. _E_
Trump Int'l Golf Links Scotland awarded 5 star status by Scottish Tourism chiefs. Via MailOnline __HTTP__ _E_
I know our complex tax laws better than anyone who has ever run for president and am the only one who can fix them. #failing@nytimes _E_
The POLICE in Paris did a fantastic job. Very brave not easy! _E_
Have confidence work hard and keep your focus on the small things that matter while keeping the big picture in mind. _E_
Many of the released Guantanamo detainees are now fighting for ISIS and other enemy groups.We need proper leadership before it is too late! _E_
According to @pewresearch 2/3 of Mexican LEGAL immigrants do not pursue citizenship because of 'no interest' __HTTP__ _E_
US interest payments on the debt have already passed $375B this year __HTTP__ China is laughing at us as usual. _E_
Now A Rod doesn't even show up to his single A rehab games. Maybe the @Yankees will get lucky and @MLB will suspend A Rod. _E_
1/5 households is on food stamps __HTTP__ We must do better. Americans need to have a work ethic. _E_
Crooked Hillary's brainpower is highly overrated.Probably why her decision making is so bad or as stated by Bernie S she has BAD JUDGEMENT _E_
Via @LatinoVoices by @CaritoJuliette: "Meet The Latina 2014 @MissUniverse Candidates" __HTTP__ _E_
President Obama wants @MittRomney to hand over even more past tax returns he should when @BarackObama reveals his college applications. _E_
My @CNN interview with @piersmorgan explaining why Mitt should not apologize __HTTP__ _E_
All the haters & losers must admit that unlike others I never attacked dopey Jon Stewart for his phony last name. Would never do that! _E_
Disaster! The @BarackObama tax hikes set for 2013 are going to throw us back into a recession according to the CBO __HTTP__ _E_
Looking forward to speaking at prestigious @TheEconomicClub on December 15th __HTTP__ _E_
Thank you Illinois! #SuperTuesday #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
It's almost like the United States has no President we are a rudderless ship heading for a major disaster. Good luck everyone! _E_
I can't believe Republican leadership allowed such a stupid deal to be made. They are rapidly giving up all of their cards. _E_
.@foxandfriends will be showing much of our successful trip to Asia and the friendships & benefits that will endure for years to come! _E_
Democrat Jon Ossoff who wants to raise your taxes to the highest level and is weak on crime and security doesn't even live in district. _E_
Must read editorial via @IBDeditorials: ObamaCare's Bitter Irony: It May Increase Number Of Uninsured __HTTP__ _E_
Sexual assault and rape in the Armed Forces is a Massive problem that nobody wants to talk about or do anything about the big dark secret! _E_
Herman Cain handled the pressure of the debate really well. @THEHermancain _E_
Rosie is back on the View which tells you how desperate they must be. It is the standard short term fix and long term disaster. _E_
Diligence is the mother of good luck. Benjamin Franklin _E_
I delivered a speech in Charlotte North Carolina yesterday. I appreciate all of the feedback & support. Lets #MAGA... __HTTP__ _E_
Rev. @BillyGraham is a great man and so is his son Franklin Graham. _E_
.@ByronYork Great numbers from @CBSNews Poll. Also from ABC Washington Post Poll. Thank you! @CNN _E_
The Failing @nytimes has totally gone against the Social Media Guidelines that they installed to preserve some credibility after many of their biased reporters went Rogue! @foxandfriends _E_
The Daily Snooze publishes lies about me. They should be ashamed but it will die very soon. _E_
Think positively. Zap negativity immediately. Focus on the solution not the problem. Be persistent and alert every single day. Momentum! _E_
Don't think my statement on @ariannahuff was harsh if you knew her and the phony Huffington Post you would understand more to follow. _E_
It was an honor to welcome the Prime Minister of Vietnam Nguyễn Xuân Phúc to the @WhiteHouse this afternoon. __HTTP__ _E_
The @MissUniverse contestants review their amazing stay at @TrumpDoral __HTTP__ _E_
.@ErinBurnett should have stayed at CNBC—she was never smart but people liked her. @OutFrontCNN Jeff Zucker's got problems! _E_
I have been hitting Obama and Crooked Hillary hard on not using the term Radical Islamic Terror. Hillary just broke said she would now use! _E_
Was President Obama in charge of this years Academy Awards they remind me of the ObamaCare website! #Oscars. _E_
.@MattGinellaGC Have you ever seen Trump National/Bedminster or Trump International Golf links in Scotland. Both far better than Pinehurst! _E_
Demand by China continues to raise the price of oil __HTTP__ We must become energy independent through our vast resources. _E_
.@danawhite Great job last night very exciting! You have come a long way from those difficult early days I am proud of you. _E_
Every economic climate whether an uptick or downturn presents new opportunities and challenges. _E_
Join me! 6/10: Richmond VA 8pm6/11: Tampa FL 11am6/11: Pittsburgh PA 3pm6/13: Portsmouth NH 2:30pm __HTTP__ _E_
Congratulations to @David_Bossie & his team @Citizens_United on their important court win for the First Amendment! __HTTP__ _E_
.@BretBaier Why do you have George Will on your show he's exhausted boring and not even a little relevant! Waste of good air time! _E_
Bernie Sanders is being treated very badly by the Dems. The system is rigged against him. He should run as an independent! Run Bernie run. _E_
.@nypost: "Dozens of key staffers fleeing @AGSchneiderman's office" __HTTP__ _E_
There's nothing wrong with bringing your talents to the surface. Having an ego and acknowledging it is a healthy choice. _E_
I don't think Ted Cruz can even run for President until he can assure Republican voters that being born in Canada is not a problem. Doubt! _E_
Looking forward to meeting with Prime Minister @Netanyahu shortly. Peace in the Middle East would be a truly great legacy for ALL people! _E_
Congratulations to THE MOVEMENT we have just won THE GREAT STATE OF OREGON. The vote percentage is even higher than anticipated! Thank you. _E_
In last night's #CNNDebate @MittRomney proved once again why he is the steady conservative who can restore America's future. _E_
"Arrests of MS 13 Members Associates Up 83% Under Trump" __HTTP__ _E_
In making big money knowledge is far more important than any other ingredient including money itself! _E_
Wonderful @pastormarkburns was attacked viciously and unfairly on @MSNBC by crazy @morningmika on low ratings @Morning_Joe. Apologize! _E_
Mayor Bill Vescio of Briarcliff Manor Westchester is doing a terrible job. Horrible roads high taxes housing down. @westchestergov _E_
If you can't handle the hard times that come with business then you will never be able to celebrate the successes. Focus & Stay Positive. _E_
Wacky @NYTimesDowd who hardly knows me makes up things that I never said for her boring interviews and column. A neurotic dope! _E_
Ben Carson has never created a job in his life (well maybe a nurse). I have created tens of thousands of jobs it's what I do. _E_
Chicago don't forget tix for @EricTrumpFdn Wine Tasting Fundraiser @TrumpChicago 11/22. Proceeds benefit @StJude __HTTP__ _E_
Wow television ratings just out: 31 million people watched the Inauguration 11 million more than the very good ratings from 4 years ago! _E_
The American people have waited long enough. There has been enough talk and no action for seven years. Now is the time for action! __HTTP__ _E_
"Take the time to move yourself forward. In other words think work and be lucky." – Think Like a Champion _E_
Via @shinysheet: Mar a Lago to host top equestrian jumpers: Trump Invitational will benefit 90 area charities. __HTTP__ _E_
Congrats to @AlCardenasACU and @CPACnews. I really enjoyed being there—the response was so terrific! _E_
RT @Scavino45: #USNSComfort en route to #PuertoRico from Norfolk Virginia to support Hurricane Maria relief efforts. __HTTP__ _E_
Thank you Florida can't wait to see you Friday in Miami! Join me: __HTTP__ __HTTP__ _E_
President Obama created a VERY BAD precedent by handing over five Taliban prisoners in exchange for Sgt. Bowe Bergdahl. Another U.S. loss! _E_
Entrepreneurs: Be curious. Discovery breeds discovery just as success breeds success. Don't sell yourself short. _E_
My interview with @EWErickson of @RedState discussing #TimeToGetTough GOP primary and my 2012 options __HTTP__ _E_
Via @BreitbartNews: "DONALD TRUMP AT SUMMIT: OBAMACARE A 'FILTHY LIE' CAN BUILD 'A BEAUTY' OF A BORDER FENCE" __HTTP__ _E_
Constitutional law expert #Laurence Tribe of Harvard says wrong to say it (natural born citizen) is a settled matter it isn't settled). _E_
Thank you Governor @TerryBranstad! #AmericaFirst #Debates2016 __HTTP__ _E_
"No one remembers who came in second." – Walter Hagen _E_
With so many scandals plaguing Obama it seems that they all hit him at the right time. Could help him get away w/ all of them. _E_
He @BarackObama is using the IRS to sabotage the Tea Party __HTTP__ What about the Occupy Wall Street groups? _E_
Congratulations to Roy Moore on his Republican Primary win in Alabama. Luther Strange started way back & ran a good race. Roy WIN in Nov! _E_
Played the Trump International Golf Club in Palm Beach last weekend. One of the best golf courses in the country. Perfect weather. _E_
Congratulations to the Rolling Stones on marking their 50th anniversary in London. _E_
There is no instance of a nation benefitting from prolonged warfare. Sun Tzu _E_
I will also be going to a wonderful state Missouri that I won by a lot in '16. Dem C.M. is opposed to big tax cuts. Republican will win S! _E_
Such great support in New Hampshire. So many people are working so hard to #MakeAmericaGreatAgain! _E_
Via @BreitbartNews by @LarryOConnor: TRUMP: NY MAG AILES STORY 'TOTAL BULLS**T' __HTTP__ It was total bullshit! _E_
The @USCHAMBER must fight harder for the American worker. China and many others are taking advantage of U.S. with our terrible trade pacts _E_
Just watched Hillary deliver a prepackaged speech on terror. She's been in office fighting terror for 20 years and look where we are! _E_
FLASHBACK – "Donald Trump Blasts Obama for Failing to Secure Christian Pastor's Freedom in Iran __HTTP__ via @theblaze' _E_
"You're never a loser until you quit trying." Mike Ditka _E_
Democrats are trying to bail out insurance companies from disastrous #ObamaCare and Puerto Rico with your tax dollars. Sad! _E_
#CrookedHillary __HTTP__ _E_
"The President has accomplished some absolutely historic things during this past year." Thank you Charlie Kirk of Turning Points USA. Sadly the Fake Mainstream Media will NEVER talk about our accomplishments in their end of year reviews. We are compiling a long & beautiful list. _E_
Snowden is sitting in China and taunting the U.S. He is mocking us as a Country. Great time to place a tax on China trade if not turned over _E_
What's more important? Rebuilding our military or bailing out insurance companies? Ask the Democrats. _E_
Thank you Geneva Ohio. If I am elected President I am going to keep RADICAL ISLAMIC TERRORISTS OUT of our countr... __HTTP__ _E_
I will do far more for women than Hillary and I will keep our country safe something which she will not be able to do no strength/stamina! _E_
I put @DonnyDeutsch on Apprentice at his request I did his failed cable show as a favor to him then he knocks me for my Obama announcement. _E_
How can George Osborne reduce UK debt while spending billions to subsidize Scotland's garbage wind turbines that are destroying the country? _E_
'Remarks by President Trump at Signing of H.J. Resolution 41' __HTTP__ __HTTP__ _E_
We pause today to remember the 2403 American heroes who selflessly gave their lives at Pearl Harbor 75 years ago... __HTTP__ _E_
Our wonderful new Healthcare Bill is now out for review and negotiation. ObamaCare is a complete and total disaster is imploding fast! _E_
Thanks to @TheRealMarilu a great woman for her wonderful defense of the Miss USA pageant. _E_
We need a dealmaker in the White House who knows how to think innovatively and make smart deals. #TimeToGetTough. _E_
Wow honored to just pass 2.5M followers on @twitter. Thanks to all my followers. We are going to have a great year together. _E_
Crooked Hillary Clinton said she is used to dealing with men who get off the reservation. Actually she has done poorly with such men! _E_
Congratulations to Michelle and Barack Obama on their 20th anniversary. _E_
I will be doing @hannityshow tonight on Fox at 9 o'clock. Will be interesting and tough! _E_
We call for the full restoration of democracy and political freedoms in Venezuela and we want it to happen very very soon! __HTTP__ _E_
Let Pete into the Hall of Fame __HTTP__ @PeteRose_14 _E_
Ron Fournier: Clinton Used Secret Server To Protect #CircleOfEnrichment" __HTTP__ _E_
Big day for healthcare. Working hard! _E_
Great job by @EricTrump on interview with @BillHemmer on @FoxNews. #ImWithYou #TrumpTrain _E_
Everybody is talking about the protesters burning the American flags and proudly waving Mexican flags. I want America First so do voters! _E_
Trump Nat'l Golf Club Philadelphia is a 360 acre beauty and an award winning Tom Fazio designed course fantastic! __HTTP__ _E_
In order to preserve my options and guarantee that @BarackObama is defeated I changed my voter registration to independent. _E_
Tonight at 8:00 is a really big one for a double episode of Celebrity Apprentice. Watch you won't believe what happens! _E_
Statement by me last night in Florida: "Honestly I don't think the Democrats want to make a deal. They talk about DACA but they don't want to help..We are ready willing and able to make a deal but they don't want to. They don't want security at the border they don't want..... _E_
Looking forward to speaking at 1:30PM tomorrow in Nashua at @NHGOP @FITNsummit!. Let's Make America Great Again! #FITN _E_
On beautiful Lake Norman @Trump_Charlotte offers a state of the art Clubhouse to complement its championship course __HTTP__ _E_
Great poll numbers! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Wow the ObamaCare website which President Obama said would be working TODAY is a total mess with many functions not even thought about! _E_
Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time. Thomas A. Edison _E_
Be sure to set exceptional goals for your 2015 resolutions. Push yourself you can do it. Think Big! _E_
The unemployment numbers are tragic. We are letting the world take our jobs. It has to stop! _E_
Had a great time on the @HowardStern show this morning—he will and should never change! _E_
Why doesn't the failing @nytimes write the real story on the Clintons and women? The media is TOTALLY dishonest! _E_
Hillary Clinton's weakness while she was Secretary of State has emboldened terrorists all over the world..cont: __HTTP__ _E_
#SweepsTweet @clayaiken might get some use out of the Chi Touch digital hairdryer. Not the same for @arsenioofficial. _E_
Wow! @FoxNews poll just came out. #1 with 26%! Almost as importantly I am the strongest on economic issues by far! #Trump2016 _E_
HRC is using the oldest play in the Dem playbook when their policies fail they are left w/this one tired argument! __HTTP__ _E_
Never let the fear of striking out get in your way. Babe Ruth _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
.@IamStevenT gave one of the greatest endings to a show ever @MissUniverse. Standing ovation! _E_
Fast trial and death penalty for maniac in Colorado immediately pass speed up legislation. _E_
Last week's boardroom was truly epic and the dust hasn't settled yet. #CelebApprentice _E_
Thank you Jonathan. Greatly appreciated! __HTTP__ _E_
My twitter has become so powerful that I can actually make my enemies tell the truth. _E_
My great honor! __HTTP__ _E_
China's leadership is sneaky and underhanded they significantly underreport their actual defense budget and (cont) __HTTP__ _E_
He's back and causing more trouble than ever before! @THEGaryBusey returns in the record 13th season of 'All Star' @CelebApprentice. _E_
Off to Nashville and the NRA. _E_
... It is very effective and a commonly used business tool. _E_
Obama's economic policies are causing inflation on hard working families. The price of corn alone has risen over 200% since he was elected. _E_
I cannot believe the Republicans are extending the debt ceiling—I am a Republican & I am embarrassed! _E_
See you soon Arizona! #Trump2016 __HTTP__ _E_
Congratulations to @FLGovScott the state is really making progress and fast! _E_
Part 1 of my @jimmyfallon interview discussing my $5M offer to Obama #TRUMP Tower atrium my tweets & 57th st. crane __HTTP__ _E_
Think of it 20% of our country is essentially unemployed. _E_
Quote of the Day: Donald Trump Decrees Boycott on Glenfiddich Scotch __HTTP__ via @Zagat _E_
Great news that the New York Stock Exchange won't be owned by a German company. European regulators turned the (cont) __HTTP__ _E_
Are we talking about the same cyberattack where it was revealed that head of the DNC illegally gave Hillary the questions to the debate? _E_
Hillary's Aides Urged Her to Take Foreign Lobbyist Donation And Deal With Attacks: __HTTP__ _E_
A storied franchise with a loyal fanbase @buffalobills should remain in Buffalo. _E_
Video from Michigan last night. After asking for months the media panned their cameras! __HTTP__ __HTTP__ _E_
It's hard to read the Failing New York Times or the Amazon Washington Post because every story/opinion even if should be positive is bad! _E_
RT @MittRomney: For nearly 4 years Barack Obama has refused to crack down on China's cheating & American workers have paid the price. _E_
Thank you Windham New Hampshire! #TrumpPence16 #MAGA __HTTP__ _E_
Problems are never truly hardships to winners & if you haven't got any then you must not have a business to run. _E_
Trump: Zimmerman Trial 'Traumatic Period for Country' __HTTP__ @Newsmax_Media _E_
.@NY_POLICE Commissioner Ray Kelly has done a top job keeping NYC safe. Stop & Frisk has been a critical tool for the NYPD. _E_
Why does Conde Nast allow dopey Graydon Carter to run bad food restaurants while running failing @VanityFair magazine? _E_
A commander in chief has to possess the right instincts. That's one of the biggest problems with @BarackObama: (cont) __HTTP__ _E_
Thank you Iowa! I appreciate all of your support @IowaCentral & @ethanolbyPOET this evening! #Trump2016 #IACaucus __HTTP__ _E_
Tune in for my interview with @gretawire tonight at 10 pm @FoxNews _E_
Looking forward to the GOP debate and the outcome of the Ames straw poll. We must get a real leader. _E_
I am interviewed on This Week on @ABC this morning. Enjoy! _E_
In answer to your questions about my favorite impersonator the answer is Darrell Hammond. _E_
...(enthusiastic dynamic and fun) and the American Legion V.A. (respectful and strong). To bad the Dems have no one who can change tones! _E_
Less than one week away from implementation ObamaCare's small business exchanges are not ready! __HTTP__ A disaster! _E_
I am so disappointed that the Yankeed haven't terminatrd A Rod's contract. There is no way they would not win in court! Hard to believe. _E_
Thank you Kenansville North Carolina! Remember on November 8th that special interest gravy train is coming to a... __HTTP__ _E_
ObamaCare strikes again. Major insurer announced that over 53000 New Yorkers will be dropped from their plans __HTTP__ _E_
My thoughts on the Republican Party in today's #trumpvlog... __HTTP__ _E_
Wow. Unbelievable. __HTTP__ _E_
Some dope said I deleted a tweet about James G. There was no tweet and there was no delete a totally fabricated story (nobody saw tweet). _E_
The current tax code is a burden on American taxpayers & harmful to job creators. Americans need #TaxReform! More: __HTTP__ __HTTP__ _E_
Via @Newsmax_Media: "Trump Iowa Visit Raises 2016 Speculation" __HTTP__ _E_
Let's go! #CelebApprentice _E_
Via @NYDailyNews: Joan Rivers' last work for @ApprenticeNBC will run on two shows next season says Donald Trump __HTTP__ _E_
Congratulations to Tom Brady @Patriots he is a great quarterback and a great champion! _E_
China just agreed that the U.S. will be allowed to sell beef and other major products into China once again. This is REAL news! _E_
The lawyer I just beat in Chicago was a buffoon but was a lot smarter and sharper than @DannyZuker. Come on Danny make the bet! _E_
.@TheRevAl came to my Trump Tower office to apologize for calling me a racist very nice apology accepted! _E_
Interesting that Roberts said it was a tax in order to come out with his good public relations decision when (cont) __HTTP__ _E_
President Obama will go down as perhaps the worst president in the history of the United States! _E_
Congrats @GretchenCarlson's new Fox show debuts w/ very strong ratings __HTTP__ Guess who her first guest was? Donald Trump. _E_
Tune into the legendary @BarbaraJWalters at 10pmE on@ABC2020 tonight. #MeetTheTrumps for a full hour @ABC #ABC2020! __HTTP__ _E_
Good advice from my father: Know everything you can about what you're doing. Fred C. Trump _E_
Unbelievable evening. Just made a speech in front 17000 amazing New Yorkers in Bethpage Long Island great to be home! _E_
Obama asked a 7 yr old for his birth certificate. He's in your face because the Republicans dropped the ball. (cont) __HTTP__ _E_
WE ARE WITH YOU FLORIDA!Emergency Information 1 800 342 3557 __HTTP__ 1 800 FL HELP 1 __HTTP__ __HTTP__ _E_
What a 'nice guy' 97% of @BarackObama's campaign ads have been negative attacks on @MittRomney __HTTP__ Give it back Mitt! _E_
.@MissTeenUSA visited today __HTTP__ _E_
'Good Chance' Trump Will Run for President __HTTP__ via @Newsmax_Media by @melaniebatley _E_
.@CNN is in a total meltdown with their FAKE NEWS because their ratings are tanking since election and their credibility will soon be gone! _E_
It is an exciting time for our country!#WeeklyAddress #ConfirmGorsuch __HTTP__ _E_
The threat from radical Islamic terrorism is very real just look at what is happening in Europe and the Middle East. Courts must act fast! _E_
Yes All Star @ApprenticeNBC contestant @THEGaryBusey is a little out there. But he uses his 'uniqueness' to his advantage. _E_
Via @STVAberdeen: Donald Trump reveals first image of his new Aberdeenshire hotel __HTTP__ _E_
New jobs report: 432000 left workforce manufacturing & durable goods go __HTTP__ We need leaders who understand business. _E_
Our big and very popular Tax Cut and Reform Bill has taken on an unexpected new source of "love" that is big companies and corporations showering their workers with bonuses. This is a phenomenon that nobody even thought of and now it is the rage. Merry Christmas! _E_
Obama just stated It's always good to ignore Donald Trump. I state he is right especially when the truth is against him. _E_
First responders have been doing heroic work. Their courage & devotion has saved countless lives – they represent the very best of America! __HTTP__ _E_
He says he will spend $1 B to get re elected: @BarackObama. I can match him preserving my options. _E_
Justice Kennedy should be proud of himself for sticking to his principles in light of Justice Roberts' bullshit! _E_
Going to The Citadel tonight getting The Nathan Hale Patriot Award. Very nice! _E_
Negotiation tip #1: The worst thing you can possibly do in a deal is seem desperate to make it. _E_
Winner of the 5 Star Diamond Award @TrumpGolfLA brings luxury & elite amenities to LA's top public golf course __HTTP__ _E_
John McCain called thousands of people crazies when they came to seek help on illegal immigration last week in Phoenix. He owes apology! _E_
Trump International Hotel & Tower New York has received great acclaim as has our signature restaurant Jean Georges __HTTP__ _E_
Sheena Monnin acted terribly...she got what she deserved! _E_
.@EliseChristine #asktrump __HTTP__ _E_
My @SquawkCNBC interview from this morning discussing the price of oil windfarmsDoral Hotel & Country Club and more... __HTTP__ _E_
Art Laffer just said that he doesn't know how a Democrat could vote against the big tax cut/reform bill and live with themselves! @FoxNews _E_
I'm not proud of my locker room talk. But this world has serious problems. We need serious leaders. #debate #BigLeagueTruth _E_
Via @nypost by @JonathonTrugman: Donald Trump's resume backs his run for president __HTTP__ _E_
Thank you Lake Worth Florida. @foxandfriends _E_
All Star @ApprenticeNBC has done the impossible. TV's greatest villain @OMAROSA & @THEGaryBusey are in competition. Fireworks! _E_
Hawaii: __HTTP__ __HTTP__ __HTTP__ __HTTP__ _E_
Please to inform that the Champion Pittsburgh Penguins of the NHL will be joining me at the White House for Ceremony. Great team! _E_
I hope the Fake News Media keeps talking about Wacky Congresswoman Wilson in that she as a representative is killing the Democrat Party! _E_
Stock Market at new all time high! Working on new trade deals that will be great for U.S. and its workers! _E_
Very excited to be addressing the @RepLeadConf next Friday in New Orleans. There is much to discuss! _E_
Via @BreitbartNews by @ASwoyer: Exclusive: Trump Slams Obamatrade Stands Up For American Jobs __HTTP__ _E_
America will never be destroyed from the outside.If we falter and lose our freedomsit will be because we destroyed ourselves. A. Lincoln _E_
RomneyCare/ObamaCare architect Gruber apologized for his comments. He should apologize for the $2T monstrosity & return all taxpayer money. _E_
It was great to appear on Piers Morgan Tonight last night as his first live guest. Piers won the Celebrity Apprentice and he's fantastic. _E_
Jay Leno and his people are constantly calling me to go on his show. My answer is always no because his show sucks. They love my ratings! _E_
Congrats to @Reince Priebus a really good and talented man. We're proud of you Reince! __HTTP__ _E_
Thousands of great people showed up from Liberty University yesterday. I love standing ovations! __HTTP__ _E_
I'm helping the Serta Counting Sheep get back to work. Enter the contest __HTTP__ and win a trip to Las Vegas.. _E_
Hurricane Irma is raging but we have great teams of talented and brave people already in place and ready to help. Be careful be safe! #FEMA _E_
ObamaCare will explode and we will all get together and piece together a great healthcare plan for THE PEOPLE. Do not worry! _E_
.@HillaryClinton Obama #ISIS Strategy Has Allowed It To Expand To Become A Global Threat #DebateNight __HTTP__ _E_
What is never said is that people take a big risk with their money and can lose it all. We should be given credit for taking this risk. _E_
"You had Hillary Clinton and the Democratic Party try to hide the fact that they gave money to GPS Fusion to create a Dossier which was used by their allies in the Obama Administration to convince a Court misleadingly by all accounts to spy on the Trump Team." Tom Fitton JW _E_
....the wall is not built which it will be the drug situation will NEVER be fixed the way it should be!#BuildTheWall _E_
Does Bush's library have a wing featuring Supreme Court Justice Jon Robert's ObamaCare ruling? Roberts was his prize appointee! _E_
If taxes are raised to avoid the fiscal cliff then they must be accompanied by tangible hard cuts on spending everywhere. _E_
Rapper Mac Miller's song Donald Trump has reached close to 72 million hits. He owes me big! _E_
Why doesn't somebody study the horrible charges brought against @Macys for racial profiling? Terrible hypocrites! _E_
Chris Ruddy is always on point: Trump Opens 'Greatest Golf Course In the World' __HTTP__ via @Newsmax_Media _E_
The biggest story yesterday the one that has the Dems in a dither is Podesta running from his firm. What he know about Crooked Dems is.... _E_
SHOCK Hugo Chavez endorses @BarackObama __HTTP__ Will he be in Chicago on election night too? _E_
Republicans are always saying Obama is such a nice guy. When will they learn that he is not? _E_
All are very scripted and rehearsed two (at least) should not be on the stage. _E_
RT @FoxNews: TUNE IN: @EricTrump joins @seanhannity TONIGHT at 9p ET on @FoxNews Channel! #Hannityat9 __HTTP__ _E_
.@Joan_Rivers Get well soon Joan keep fighting! _E_
Via @bostonherald by @ ChrisCassidy_BH: Donald Trump says Jeb Bush is wrong about Iraq __HTTP__ _E_
Congratulations to @RealSheriffJoe on his successful Cold Case Posse investigation which claims @BarackObama's 'birth certificate' is fake _E_
I am happy that The Job on CBS the 16th. knockoff of the Apprentice was just cancelled. I love to see my opponents lose (not nice)! _E_
I always said that Debbie Wasserman Schultz was overrated. The Dems Convention is cracking up and Bernie is exhausted no energy left! _E_
Crooked Hillary colluded w/FBI and DOJ and media is covering up to protect her. It's a #RiggedSystem! Our country d... __HTTP__ _E_
Take a tour of this amazing residence at Trump World Tower..... __HTTP__ _E_
Lyin' Hillary Clinton told the FBI that she did not know the C markings on documents stood for CLASSIFIED. How can this be happening? _E_
13 BILLION 4.5 BILLION these are the stupid settlements that J.P.Morgan just made. Why don't they FIGHT? No wonder they keep getting sued. _E_
Word is that little Morty Zuckerman's @NYDailyNews loses more than $50 million per year can that be possible? _E_
Join me in Wisconsin tomorrow or Colorado on Tuesday!Green Bay 6pm __HTTP__ Springs 1pm... __HTTP__ _E_
There is no way that Carly Fiorina can become the Republican Nominee or win against the Dems. Boxer killed her for Senate in California! _E_
RT @gatewaypundit: The Trump Hotel Waikiki looks like a lovely resort @realDonaldTrump #Hawaii _E_
RT @foxandfriends: Sen. Ted Cruz: Trump's air traffic control plan is a 'win win' for Democrats and Republicans __HTTP__ _E_
"Do your homework before you invest. A dumb investor is a dead investor." – Think Like a Billionaire _E_
Great new poll from NH. Thank you! We need to keep this country safe! #Trump2016 __HTTP__ __HTTP__ _E_
Obama friend got a no bid $635M contract to build website __HTTP__ And now she will get more to fix it. _E_
I had thousands join me in New Hampshire last night! @HillaryClinton had 68. The #SilentMajority is fed up with what is going on in America! _E_
Just out the POLAR ICE CAPS are at an all time high the POLAR BEAR population has never been stronger. Where the hell is global warming? _E_
RT @brunelldonald: I thought about jobs that went overseas failing schools open borders not my skin color when I voted @realDonaldTrump! I... _E_
I always believed @BretMichaels was making a mistake in coming back as a competitor. I disagree with him but... __HTTP__ _E_
Will be on @Morning_Joe live from New Hampshire 7:00 A.M. Talking about the debate and more! _E_
We have wasted an enormous amount of blood and treasure in Afghanistan. Their government has zero appreciation. Let's get out! _E_
Millions of $'s of false ads paid for by lobbyists special interests of cheater @SenTedCruz and sleepy @JebBush are now running in S.C. _E_
What is Mitch McConnell thinking?...make the big deal! _E_
Wow CNN had to retract big story on Russia with 3 employees forced to resign. What about all the other phony stories they do? FAKE NEWS! _E_
Another one of my predictions just came true Iraq is a total disaster with government losing all control—so sad. _E_
RT @GovChristie: .@POTUS has done more to combat the addiction crisis than any other President. __HTTP__ _E_
We should not cut any aid to Egypt. Their country is in chaos and now they must form a normal civil government. _E_
I have clearly stated that if the New York State Republican Party is able to unify I would run for Governor and win. They can't unify SAD! _E_
My appearance this morning on Good Morning America... __HTTP__ _E_
Refugees from Syria are now pouring into our great country. Who knows who they are some could be ISIS. Is our president insane? _E_
The women played great today at the @USGA #USWomensOpen I look forward to being there tomorrow for the final round! __HTTP__ _E_
Heading to New Hampshire will be talking about Hillary saying her brain SHORT CIRCUITED and other things! _E_
26000 unreported sexual assults in the military only 238 convictions. What did these geniuses expect when they put men & women together? _E_
That Seth Meyers is hosting the Emmy Awards is a total joke. He is very awkward with almost no talent. Marbles in his mouth! _E_
The one positive from the plunge in household wealth is that we are in a buyer's market. This is the time to buy! _E_
Paul Begala the dopey @CNN flunky and head of the Pro Hillary Clinton Super PAC has knowingly committed fraud in his first ad against me. _E_
Dem Senator Schumer hated the Iran deal made by President Obama but now that I am involved he is OK with it. Tell that to Israel Chuck! _E_
I am so happy that I was able to do something really good for the Bronx and lots of jobs! _E_
Watch yesterday Obama continued to evade questions on his security failures in the Benghazi consulate attack. __HTTP__ _E_
Totally unauthorized do not pay. I am self funding my campaign! Notice has just been withdrawn. #Trump2016#MakeAmericaGreatAgain _E_
Just leaving for @LandExpo in Iowa standing room only. My great honor. @PeoplesCompany __HTTP__ _E_
Really enjoyed discussing @yankees yesterday with @RealMicihaelKay. I am a long time Yankee fan. _E_
Wow 25000 in San Diego California!Thank you!! #Trump2016 __HTTP__ _E_
The virtually incompetent Republican Strategist who has had a failed career Cheri Jacobus is incoherent with anger that her puppets died! _E_
RT @seanhannity: BOOM!! Tick Tock __HTTP__ _E_
Thank you to all of those who gave me such wonderful reviews for my performance on @nbcsnl Saturday Night Live. Best ratings in 4 years! _E_
As your President I have no higher duty than to protect the lives of the American people. __HTTP__ _E_
The Republicans must get Virgil Goode out of the race in Virginia. He will take votes away from @MittRomney. _E_
I am proud of the Rep. House & Senate for working so hard on cutting taxes {& reform.} We're getting close! Now how about ending the unfair & highly unpopular Indiv Mandate in OCare & reducing taxes even further? Cut top rate to 35% w/all of the rest going to middle income cuts? _E_
ObamaCare/RomneyCare architect Gruber was paid over $6M with our tax dollars yet Obama only claims he 'was some adviser.' _E_
My @gretawire interview discussing @IvankaTrump wanting me to run for POTUS @BarackObama's SOTU and his China policy __HTTP__ _E_
.@Omarosa has another meltdown ... while giving a check for $40000 to Michael's charity the Sue Duncan Center. #CelebApprentice _E_
According to @RasmussenPoll @MittRomney has a 12 point advantage over @BarackObama on the economy __HTTP__ Look for it to grow. _E_
Big day for HealthCare. After 7 years of talking we will soon see whether or not Republicans are willing to step up to the plate! _E_
.@GovernorPerry is a terrific guy and I wish him well I know he will have a great future! _E_
Secure your place at the National Achievers Congress in London. It will be an amazing event with a great surprise. __HTTP__ _E_
The United States condemns the terror attack in Barcelona Spain and will do whatever is necessary to help. Be tough & strong we love you! _E_
Gen. Petraeus has agreed to testify in the Senate on Benghazi. I will be watching. _E_
Australia New Zealand and more. I am always available to them. @nytimes is just upset that they looked like fools in their coverage of me. _E_
A smart negotiator would use the leverage of our dollars our laws and our armed forces to get a better deal (cont) __HTTP__ _E_
Looking forward to keynoting the South Carolina Tea Party Convention in Myrtle Beach on Monday at 3:20PM! __HTTP__ _E_
.@Omarosa on the cover of Soap Opera Digest? That's a credential... #CelebApprentice _E_
It was an honor to welcome President @MarianoraJoy of Spain. Thank you for standing w/ us in our efforts to isolate the brutal #NoKo regime. __HTTP__ _E_
"Most entrepreneurs do not realize that wealth does not come from work but from the assets they build." – Midas Touch _E_
RT @realDonaldTrump: DACA is probably dead because the Democrats don't really want it they just want to talk and take desperately needed m... _E_
.@KatyTurNBC & @DebSopan should be fired for dishonest reporting. Thank you @GatewayPundit for reporting the truth. #Trump2016 _E_
"Trump on Romney: 'You Just Can't Give Him Another Chance':Some golfers can't sink the 3 ft. putt." __HTTP__ via @PJMedia_com _E_
Hopefully the House of Representatives can hold our country together for four more years...stay strong and never give up! _E_
We are TRYING to fight ISIS and now our own people are killing our police. Our country is divided and out of control. The world is watching _E_
Lolo Jones our beautiful Olympic athlete wants to remain a virgin until she gets married she is great. @Followlolo _E_
Thanks. __HTTP__ __HTTP__ _E_
"Appreciate your property and your property will appreciate for you." – Think Like a Billionaire _E_
What people don't know about @BillMaher is that he was a terrible student and not considered smart in his early (cont) __HTTP__ _E_
If the people of Massachusetts found out what an ineffective Senator goofy Elizabeth Warren has been she would lose! _E_
Entrepreneurs: Identify your goals and see each day as an opportunity to show what you can do at the highest level. _E_
Eliot Spitzer was a horrible Governor and A.G. who ruined many good people and cost the Country billions of dollars in losses (and jobs). _E_
"The greatest discovery of all time is that a person can change his future by merely changing his attitude." @Oprah _E_
Check out my new book Time To Get Tough: Making America #1 Again __HTTP__ _E_
In @oreillyfactor's No Spin Zone re: ObamaCare causing unemployment negotiating with China & my $5M court win __HTTP__ _E_
Erin Burnett who has no ratings on CNN in prime time now wants more money to move to the morning slot. @CNN should say no way . _E_
General John Kelly is doing a great job as Chief of Staff. I could not be happier or more impressed and this Administration continues to.. _E_
Impossible is a word to be found only in the dictionary of fools. Napoleon Bonaparte _E_
People very unhappy with Crooked Hillary and Obama on JOBS and SAFETY! Biggest trade deficit in many years! More attacks will follow Orlando _E_
The Stock Market is setting record after record and unemployment is at a 17 year low. So many things accomplished by the Trump Administration perhaps more than any other President in first year. Sadly will never be reported correctly by the Fake News Media! _E_
WSJ/NBC Poll: Donald Trump Widens His Lead in Republican Presidential Race. #Trump2016 __HTTP__ _E_
We have spent over $1 Billion on the Libya operation. What are we getting back? _E_
Trump organisation backs community battle against substation __HTTP__ via @STVNews _E_
The more predictable the business the more valuable it is. Predictability also means consistency of brand experience. Midas Touch _E_
Thank you to everyone for the wonderful reviews of my speech on Thursday night. From the heart! _E_
.@antbaxter Your documentary died many deaths. You have in my opinion zero talent. _E_
Wow Senator Luther Strange picked up a lot of additional support since my endorsement. Now in September runoff. Strong on Wall & Crime! _E_
Heading to a packed house in Waterloo Iowa! Will celebrate today's great poll numbers together. See you soon! _E_
Shouldn't there have been increased security at our embassies on the anniversary of 9/11? _E_
"Study: Insurance costs to soar under Obamacare" __HTTP__ Men in NC get 305% hike. Women in NE suffer an average 237% hike. _E_
#BuyAmericanHireAmericanWatch __HTTP__ __HTTP__ _E_
.@antbaxter—Heard your documentary cost you less than $3000 to make—where did you get that kind of money? _E_
Give great credit to @GeorgeClooney for exposing the atrocities taking place in Sudan. _E_
My support of Anna Wintour for Ambassador got a lot of coverage. She is smart and will be a strong advocate for the US. _E_
President Obama Gruber and all of the other Obama cronies got ObamaCare passed by lies and fraudulent statements. Courts should overturn! _E_
Thank you @IvankaTrump for the kind words. I am very proud of the role model you are for so many. NH & IA radio ad: __HTTP__ _E_
A new radical Islamic terrorist has just attacked in Louvre Museum in Paris. Tourists were locked down. France on edge again. GET SMART U.S. _E_
Sad. Our food stamp rolls now surpass the entire population of Spain __HTTP__ We must do better or we will be Greece. _E_
Boardroom time which team do you think had the best presentation? #CelebApprentice _E_
.@BretBaier's newly released book 'Special Heart' brings a message of hope. All sales donated to heart charities __HTTP__ _E_
... Icahn Kravis Apollo and most others but nobody says they went bankrupt! _E_
Dear @kimguilfoyle Thank you so much for your nice words today on @TheFive. Will not be forgotten! In Iowa now. Packed house! _E_
US Gov't is on the hook for more than a third of the world's entire debt & we wonder why China & OPEC are laughing all the way to the bank! _E_
ObamaCare must be fully repealed or it will destroy America's small businesses. _E_
In Las Vegas getting ready to speak! _E_
Country music star @TraceAdkins returns to All Star @CelebApprentice. Competing for @RedCross Trace is great! _E_
Congratulations are in order! @TrumpPanama ranks #5 Top Hotel in Panama by @TripAdvisor's #TravelersChoice Awards! __HTTP__ _E_
Via @bizjournals by @BrandonSawalich: 3 lessons about loyalty that I learned from Donald Trump __HTTP__ _E_
My thoughts on Anthony Weiner in today's #trumpvlog... __HTTP__ _E_
ISIS threatens us today because of the decisions Hillary Clinton has made along with President Obama. Donald J. Trump _E_
America has lost its AAA rating and gained over $6T in debt under @BarackObama and now he wants to raise the debt ceiling SCARY! _E_
Working hard on the biggest tax cut in U.S. history. Great support from so many sides. Big winners will be the middle class business & JOBS _E_
Last night's horrific execution style shootings of 12 Dallas law enforcement officers... __HTTP__ _E_
America's debt is greater than our GDP. Time for new thinking. _E_
#TBT @DonaldJTrumpJr @IvankaTrump @EricTrump and I 20 years ago __HTTP__ _E_
Thank you Florida. My Administration will follow two simple rules: BUY AMERICAN and HIRE AMERICAN! #ICYMI Watch:... __HTTP__ _E_
I believe the James Comey leaks will be far more prevalent than anyone ever thought possible. Totally illegal? Very 'cowardly!' _E_
Thank you Bridgeport Connecticut!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
The President must get Congressional approval before attacking Syria big mistake if he does not! _E_
No surprise serial sexter Anthony continues to be a sick pervert. He was sexting a 'young' girl last summer __HTTP__ _E_
Will be in Missouri today with Melania for the funeral of a wonderful and truly respected woman Phyllis S! _E_
Good investors are good students. It's as simple as that. Think Like a Billionaire _E_
Houston TX: __HTTP__ Vegas NV __HTTP__ AZ: __HTTP__ __HTTP__ _E_
Congress use the power of the purse. STOP AMNESTY! _E_
The failing @nytimes which has made every wrong prediction about me including my big election win (apologized) is totally inept! _E_
"Donald Trump on Fiscal Cliff and Obama" __HTTP__ via @Livetradingnews _E_
The @HuffingtonPost is a total joke & laughing stock of journalism as is gross Arianna Huffington. They don't report the facts! _E_
Tom Ridge is a failed 'Bushy' & PA Governor. Him & his friend @KarlRove shouldn't be allowed to do their bias commentary nobody listens! _E_
If @BarackObama really loved this country he wouldn't be destroying it. He has ruined our credit and killed jobs with ObamaCare. _E_
I took a failed club in Dutchess County & made it a great success plus many jobs. @KieranLalor should be thankful. _E_
Too bad I don't get this for political speeches they cost me a fortune! __HTTP__ _E_
Club for Growth is the group that came to my office seeking $1 million dollars. I told them no and now they are doing negative ads. _E_
No deal is better than a bad deal. America out negotiated again. #Iran _E_
Join me live from the @WhiteHouse via #Periscope __HTTP__ _E_
Just got back from Tampa. It was an amazing evening with an even more amazing crowd fantastic people! Will be in South Carolina tomorrow. _E_
US government's foreign indebtedness has grown over 72% under @BarackObama. He is bleeding us dry to China. _E_
DAMAC & #Trump Organization are developing a 2nd Trump #golf course Trump World Golf Club #Dubai at AKOYA Oxygen! __HTTP__ _E_
We must stop Common Core from controlling state & local curriculums. It is a federal grab of education. Keep education local! _E_
If @BarackObama's policies are so advantageous then why is he constantly invoking Ronald Reagan on the Stump? __HTTP__ _E_
During my recent trip to the Middle East I stated that there can no longer be funding of Radical Ideology. Leaders pointed to Qatar look! _E_
Our relationship with Russia is at an all time & very dangerous low. You can thank Congress the same people that can't even give us HCare! _E_
Trump Vineyard Estates is a breathtaking location to hold special events for all occasions. Watch the video for a look __HTTP__ _E_
My int. on @FoxNews' @oreillyfactor: "Donald Trump presidential politics and 'The Factor'" __HTTP__ _E_
Why do we keep broadcasting when we are going to attack Syria. Why can't we just be quiet and if we attack at all catch them by surprise? _E_
#TrumpVlog Trouble in paradise for Clintons __HTTP__ _E_
RT @greta: Thank you @realDonaldTrump this is important to so many of us __HTTP__ _E_
Does he look sharp smart and presidential his hands keep hitting the podium making a loud and distracting noise microphone too sensitive. _E_
Ugly industrial wind turbines are ruining the beauty of parts of the country and have inefficient unreliable energy to boot. _E_
Daily Caller: Trump Surpasses Field Flirts With 40 Percent in Alabama Poll __HTTP__ _E_
Just arrived for the #GOPdebate #MakeAmericaGreatAgain __HTTP__ _E_
Had a great time on @gretawire last night. Greta always does great interviews. _E_
RT @FLOTUS: Looking forward to hosting the annual Easter Egg Roll at the @WhiteHouse on Monday! __HTTP__ _E_
"Donald Trump: The View Will be Better without Joy Behar (Video)" __HTTP__ via @gatewaypundit _E_
Honored to be attending Rev. @BillyGraham's 95th birthday. His life & work has brought hope & faith to millions worldwide. _E_
Always good to have @ArsenioHall back as advisor as well as @DonaldJTrumpJr. They have their own fan clubs at this point. #CelebApprentice _E_
In '08 America voted for Hope & Change. Instead we got incompetency. Now it is time to put a real job creator in office. Vote 4 Mitt! _E_
RT @PressSec: .@POTUS historic tax cuts + doubling of the child tax credit will do infinitely more to empower working moms than liberals' p... _E_
Our hearts & prayers go out to the people of London who suffered a vicious terrorist attack.... __HTTP__ _E_
Thank you Diamond and Silk! __HTTP__ _E_
Wow @GeorgeWill said some very nice things about me today on @FoxNewsSunday with Chris Wallace. I am making progress thanks George! _E_
Delusional @BarackObama claims that his economic plan worked __HTTP__ Is the 16% real unemployment part of the plan? _E_
"If we get tough and make the hard choices we can make America a rich nation—and respected—once again." – Time to Get Tough _E_
A top firm like Cooley will only submit a case they believe in and can win. _E_
As one of Miamii's largest landowners I am pulling for the @MiamiHEAT in the @NBA finals. Lebron's time is now! @KingJames _E_
So excited to have @SantanaCarlos performing at the 2015 #CadillacChampionship at @TrumpDoral: __HTTP__ _E_
Getting ready to go to the great State of Michigan. Big crowd tonight. Make America Great Again! _E_
I still can't believe we left Iraq without the oil. _E_
"Money was never a big motivation for me except as a way to keep score.The excitement is playing the game."–The Art of The Deal _E_
Great optimism for future of U.S. business AND JOBS with the DOW having an 11th straight record close. Big tax & regulation cuts coming! _E_
"Get to know yourself.You can't improve upon something you don't understand.The more you ask the better you'll know." Vince Lombardi _E_
Via @paramuspost: "@TrumpSoHo New York Debuts Sizzling Summer Offerings" __HTTP__ _E_
Obama & his people did a brilliant job of delaying these scandals until after the election. Mitt must be going wild thinking about it! _E_
"Design your business from the start so that it is leverageable expandable predictable and financeable." – Midas Touch _E_
Obama weak on immigration. All words no action. He's been Prez 4 years. _E_
.@washtimes @BrettMDecker: Five Questions w/ @realDonaldTrump 'Lack of Leadership is the biggest threat to America' __HTTP__ _E_
Taking risks & making mistakes is the best way to learn something new. Most of the time you will surprise yourself Trump Never Give Up _E_
Who wants the endorsement of a guy (@EricCantor) who lost in perhaps the greatest upset in the history of Congress? _E_
.@MittRomney and I are working out a great dinner for someone I hope it's you! __HTTP__ _E_
CLINTON REFUGEE PLAN COULD BRING IN 620000 REFUGEES IN FIRST TERM AT LIFETIME COST OF OVER $400 BILLION. __HTTP__ _E_
Being nice to Rocket Man hasn't worked in 25 years why would it work now? Clinton failed Bush failed and Obama failed. I won't fail. _E_
The bend in the road is not the end of the road unless you refuse to take the turn. – Anonymous _E_
Zogby Poll: Trump Widens Lead After GOP Debate __HTTP__ _E_
Formerly of the New York Times @frankrichny was a poor theatre critic who was forced out. Sadly he is an even (cont) __HTTP__ _E_
We are delivering HISTORIC TAX RELIEF for the American people!#TaxCutsandJobsAct __HTTP__ _E_
I am convinced that if @AlexSalmond had not pushed ugly wind turbines all over Scotland the vote would have been much better for him! _E_
Thank you to Ford for scrapping a new plant in Mexico and creating 700 new jobs in the U.S. This is just the beginning much more to follow _E_
Deals are my art form. Other people paint beautifully or write poetry. I like making deals preferably big deals. That's how I get my kicks. _E_
If @DannyZuker competed against me and.won (which not too many people do) he could win millions of $'s for himself or his charity! _E_
It's really cold outside they are calling it a major freeze weeks ahead of normal. Man we could use a big fat dose of global warming! _E_
.@NRO Not much is as dead or irrelevant as National Review thanks to guidance of Goldberg a total loser! Get some real talent or fold! _E_
Thank you @GolfMagazine for putting my Scotland course on your cover and a Top 100 course in the world. __HTTP__ _E_
Join the MOVEMENT to #MAGA! __HTTP__ __HTTP__ _E_
The dopes at the @nytimes bought the Boston Globe for $1.3 billion and sold it for $1.00. Their great old headquarters gave it away! So dumb _E_
#VoteTrump at clerk's offices & 185 ballot drop boxes in #ORPrimary!Closes at 8pm! __HTTP__ _E_
Happy 4th of July! #Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_
Via @Newsmax_Media: 14 Reasons Donald Trump Is Really Running — and Doing Well __HTTP__ _E_
To put on your calendar for May: Miss USA 2010 live from Las Vegas on May 16th 7 p.m. ET on NBC. I'll be there tune in for a great show! _E_
Had a great meeting at CIA Headquarters yesterday packed house paid great respect to Wall long standing ovations amazing people. WIN! _E_
RT @DRUDGE_REPORT: DEAD HEAT: CLINTON VS TRUMP __HTTP__ _E_
China is now given preference to buy US debt by going directly to Treasury. I don't believe @BarackObama knows that he selling us out. _E_
Today's final round of the WGC Cadillac Championship will be amazing. A lot of pressure on leader who has played great. Big names hunting! _E_
Trump Tycoon App for iPhone & iPod Touch It's $2.99 but the advice is priceless! __HTTP__ _E_
Obama will grant amnesty to millions of illegals yet he has not lifted a finger for USMC Sgt. Tahmooressi! . #BringBackOurMarine _E_
Via @Zawya: "Trump home partners with lifestyle to launch an exclusive collection of home décor" __HTTP__ _E_
RT @foxnation: . @TuckerCarlson : #Dems Don't Really Believe #Trump Is a Pawn of #Russia That's Just Their Political Tool __HTTP__ _E_
Our FIFTH 1K milestone of 2017!#DOW24K #MAGA __HTTP__ _E_
.@stephenfhayes: I heard you were a joke on the media panel this weekend in New Hampshire. You just don't have what it takes! @JoeNBC _E_
My son @EricTrump will be interviewed by @SeanHannity tonight at 10pm on @FoxNews. Enjoy! _E_
Imagine how much stronger economic shape we would be in if we made the Iraqi government agree to a cost sharing (cont) __HTTP__ _E_
being a movie star and that was season 1 compared to season 14. Now compare him to my season 1. But who cares he supported Kasich & Hillary _E_
Via @UnionLeader by @tuohy: "Trump says he will decide on a presidential run by June" __HTTP__ _E_
I am getting bad marks from certain pundits because I have a small campaign staff. But small is good flexible save money and number one! _E_
Ted Cruz has been playing an ad about me that is so ridiculously false no basis in fact. Take ad down Ted. Biggest liar in politics! _E_
ObamaCare is one of the greatest threats our country faces. It is unsustainable and will lead America into complete insolvency. _E_
Ukrainian efforts to sabotage Trump campaign quietly working to boost Clinton. So where is the investigation A.G. @seanhannity _E_
.@Univision cares far more about Mexico than it does about the U.S. Are they controlled by the Mexican government? _E_
Clinton Aides: 'Definitely' Not Releasing Some HRC Emails: __HTTP__ _E_
As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_
CNN anchors are completely out of touch with everyday people worried about rising crime failing schools and vanishing jobs. _E_
.@morning_joe Wow Ticket sales go through the roof after Trump asked to speak at CPAC _E_
The New York Times/Bill Carter/Sept.26 2011: On MSNBC meanwhile Lawrence O'Donnell has lost 100000 viewers (cont) __HTTP__ _E_
Thank you New Hampshire!#Trump2016 __HTTP__ _E_
Republicans want to fix DACA far more than the Democrats do. The Dems had all three branches of government back in 2008 2011 and they decided not to do anything about DACA. They only want to use it as a campaign issue. Vote Republican! _E_
A house divided against itself cannot stand. Abraham Lincoln _E_
Chance favors the prepared mind. Louis Pasteur _E_
.@MeghanMcCain was terrible on @TheFive yesterday. Angry and obnoxious she will never make it on T.V. @FoxNews can do so much better! _E_
Thanks! __HTTP__ _E_
"The true competitors are the ones who always play to win." – Tom Brady @Patriots _E_
People believe CNN these days almost as little as they believe Hillary....that's really saying something! _E_
.@alexsalmond @pressjournal @BBCNews RT @DanScavino one would think the photo & caption says it all.... __HTTP__ _E_
Via @washingtonpost: Donald Trump will speak at CPAC by @rachelweinerwp __HTTP__ @CPACnews @AlCardenasACU @RGreggKeller _E_
Trump Organization's first project in India Trump Towers Pune will epitomize inspired living and timeless elegance __HTTP__ _E_
Order a signed copy of CRIPPLED AMERICA & submit a question for my live streaming book signing on 12/3 at 7:30 pm __HTTP__ _E_
We are not retreating we are advancing in another direction. Douglas MacArthur _E_
Missouri just confirmed #Trump2016 as the official winner with an additional 12 delegates. #MakeAmericaGreatAgain __HTTP__ _E_
Thank you New Hampshire! Great people see you next week! __HTTP__ _E_
Here's to a safe and happy Independence Day for one and all Enjoy it! Donald J. Trump _E_
If the wind will not serve take to the oars. Latin Proverb _E_
Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government! _E_
Great honor Rev. Jerry Falwell Jr. of Liberty University one of the most respected religious leaders in our nation has just endorsed me! _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
The Republicans must face reality & create a strong & positive immigration policy if not they will continue to lose elections. _E_
We are rebuilding other countries while our own country is going to HELL. Time to rebuild the U.S.A.! Tell our stupid politicians ENOUGH _E_
Must read @ConservReview article by @JeffJlpa1: "Jeb Bush and the Outsiders" __HTTP__ _E_
My transition team which is working long hours and doing a fantastic job will be seeing many great candidates today. #MAGA _E_
North Korea just stated that it is in the final stages of developing a nuclear weapon capable of reaching parts of the U.S. It won't happen! _E_
Thank you Travis County Texas!#MakeAmericaGreatAgain __HTTP__ _E_
China is advocating on behalf of Iran's nuclear program the Chinese oppose both sanctions and any militar... (cont) __HTTP__ _E_
The United States made some of the worst Trade Deals in world history.Why should we continue these deals with countries that do not help us? _E_
Hillary Clinton lied when she said that ISIS is using video of Donald Trump as a recruiting tool. This was fact checked by @FoxNews: FALSE _E_
People have been forced to resign positions for far less than @JonahNRO's "tweeting like a 14 year old girl" _E_
Via @BW: Donald Trump Vows to Fight Scottish Wind Farm Plan in Courts __HTTP__ _E_
Thank you South Carolina! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Via @inventorspot by Myra Per Lee: "Got A Great Idea? Get Donald Trump To Fund It" __HTTP__ _E_
In the end you're measured not by how much you undertake but by what you finally accomplish! _E_
New Hampshire vote today MAKE AMERICA GREAT AGAIN! _E_
Just sat down for a great interview with @PHussionWYFF in Greenville today. Watch at 5pm. An amazing day in South Carolina! #VoteTrumpSC _E_
Make it special! No better place to celebrate St. Patrick's Day in the Windy City than @TrumpChicago __HTTP__ _E_
Glad to hear Bella Santorum is recovering. @RickSantorum has a beautiful family. _E_
I was invited to be with Mitt Romney tonight win lose or draw I'll be there! _E_
Our very weak and ineffective leader Paul Ryan had a bad conference call where his members went wild at his disloyalty. _E_
An: Media fell all over themselves criticizing what DonaldTrump may have insinuated about @POTUS. But he's right: __HTTP__ _E_
Celebrity Apprentice is nearing the end of a wonderful and very successful season. Watch tonight at 8:00. _E_
Just received a wonderful letter from a new father who bought his son his first book The Art of the Deal. Great parent! _E_
Thank you Vermont! #Trump2016#SuperTuesday _E_
An incredible honor to receive the endorsement of a person Ihave such tremendous respect for. Thank you Sheldon! __HTTP__ _E_
.@bobvanderplaats is a total phony and con man. When I wouldn't give him free hotel rooms and much more he endorsed Cruz. @foxandfriends _E_
It is hard to believe I am winning by so much when I am treated so badly by the media. New @CNN Poll amazing in ALL categories. 21 pt. Lead _E_
Great afternoon in Ohio & a great evening in Pennsylvania departing now. See you tomorrow Virginia! __HTTP__ _E_
Final #'s just announced in the GREAT State of MO. TRUMP WINS! New certified #'s show a 365 vote increase for me @ least 12 more delegates! _E_
Record setting cold and snow ice caps massive! The only global warming we should fear is that caused by nuclear weapons incompetent pols. _E_
Can you believe we still have not gotten our Marine out of Mexico. He sits in prison while our PRESIDENT plays golf and makes bad decisions! _E_
.@Morning_Joe just went off the rails. I will beat Hillary easily she does not want to run against me. I am tuning them out waste of time _E_
China controls North Korea. So now besides cyber hacking us all day they are using the Norks to taunt us. China is a major threat. _E_
Why would they announce a finding of the grand jury in Ferguson at 9:00 in the evening a prime time for riots! Not smart. _E_
Rexnord of Indiana made a deal during the Obama Administration to move to Mexico. Fired their employees. Tax product big that's sold in U.S. _E_
After all of these years of suffering thru ObamaCare Republican Senators must come through as they have promised! _E_
Pervert alert! Sexter Anthony Weiner will be running for Mayor of New York City. _E_
My @foxandfriends interview discussing my possible GOP endorsement @MittRomney's taxes and the Florida primary. __HTTP__ _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
The weak jokers who so badly hurt great Penn State University should have fought the NCAA instead of making a deal __HTTP__ _E_
ObamaCare is such a national treasure that @BarackObama has waived over 1200 companies from the law __HTTP__ _E_
See problems as a mind exercise. Enjoy the challenge and remember to keep focused on your goals. _E_
Newt attacks on @MittRomney record at Bain an attack on free enterprise and entrepreneurship. Mistake! _E_
Government waste fraud and abuse should be immediately addressed. This will help solve our deficit crisis both short and long term. _E_
China steals United States Navy research drone in international waters rips it out of water and takes it to China in unprecedented act. _E_
We are what we repeatedly do. Excellence therefore is not an act but a habit. Aristotle _E_
Thank you for your support of my candidacy! #MAGA #ImWithYou __HTTP__ _E_
Wow! I hear you Warren Michigan. Streaming live join us America. It is time to DRAIN THE SWAMP!Watch: __HTTP__ _E_
We are getting reports from many voters that the Cruz people are back to doing very sleazy and dishonest pushpolls on me. We are watching! _E_
In Texas now leaving soon for BIG rally in Florida! _E_
With @C_Soules from #TheBachelor in Iowa __HTTP__ _E_
The new unemployment numbers are terrible. 522000 more people are out of the labor force to 88419000. __HTTP__ _E_
Happy to announce we are awarding $1M to Las Vegas in order to help local law enforcement working OT to respond to last Sunday's tragedy. _E_
I will be making a major statement from the @WhiteHouse upon my return to D.C. Time and date to be set. _E_
When will Pakistan apologize to us for providing safe sanctuary to Osama Bin Laden for 6 years?! Some ally. _E_
Via @BW: Thomas Jefferson Donald Trump Share Love of Grapes in Virginia __HTTP__ @trumpwinery @EricTrump _E_
Editorial by @DonaldJTrumpJr in the DailyCaller: Defending Innovation in America __HTTP__ _E_
I promise that our administration will ALWAYS have your back. We will ALWAYS be with you! __HTTP__ _E_
Remember go vote we need real change this time. _E_
RT @FoxBusiness: .@JerryJrFalwell: I was so impressed by [@realDonaldTrump's] speech yesterday. He was the best I've ever seen him. __HTTP__ _E_
#CrookedHillary is nothing more than a Wall Street PUPPET! #BigLeagueTruth #Debate __HTTP__ _E_
Obama projected a 2012 budget deficit of $557B. It is actually double that at $1.1T __HTTP__ We can't afford four more years. _E_
All the guys that said @MittRomney would lose are rapidly coming on board. Mitt will remember the early helpers. _E_
Via @theblaze: Falwell on Trump: He 'was willing to say publicly' what conservatives said 'privately' __HTTP__ _E_
"Had the information (Crooked Hillary's emails) been released there would have been harm to National Security.... Charles McCulloughFmr Intel Comm Inspector General __HTTP__ _E_
Find out what Success smells like. I'll be @Macys Herald Square April 18 5:30pm to sign my new fragrance first (cont) __HTTP__ _E_
"Winners embrace hard work." @ESPNDrLou _E_
On my way to @TrumpSoHo to receive the AAA Five Diamond Award. _E_
It's very sad that Republicans even some that were carried over the line on my back do very little to protect their President. _E_
All 50 of the WORLD'S TOP 50 PLAYERS will be at TRUMP NATIONAL DORAL on Thursday Sunday for the Cadillac World Golf Championship. _E_
RT @NWSHouston: Historic flooding is still ongoing across the area. If evacuated please DO NOT return home until authorities indicate it i... _E_
My daughter Ivanka is being honored by the Wharton School of Finance with the 2012 Young Leadership Award. Also (cont) __HTTP__ _E_
RT @FoxNews: Geraldo Blasts 'Fake News' Reports About Trump's Visit to Puerto Rico __HTTP__ _E_
Do not underestimate the UNITY within the Republican Party! _E_
'Hillary Clinton Deleted Emails With Her Email Server Technician' __HTTP__ _E_
China is buying so many of our companies it's really getting bad. _E_
Another historic first under Obama businesses are collapsing faster than they're being formed __HTTP__ New leadership now! _E_
The contract to build the ObamaCare website was given to a CANADIAN company for $55 744 081. It then bloated to $292 071067 INCOMPETENCE _E_
Via @Newsmax_Media by @OwenTew: "Trump on 2016 Run: I Would Self Fund Appoint Wall Street Experts" __HTTP__ _E_
One of the dumber and least respected of the political pundits is Chris Cillizza of the Washington Post @TheFix. Moron hates my poll numbers _E_
Baltimore just set a record for the coldest day in March in a long recorded history 4 degrees. Other places likewise. Global warming con! _E_
Live tweeting during tonight's VP debate...should be a great time _E_
Thank you Faith and Freedom Forum & @UrbandaleSchool. I had a great time in Iowa today! __HTTP__ _E_
We want our companies to hire & grow in AMERICA to raise wages for AMERICAN workers & to help rebuild our AMERICAN cities & towns! #USA __HTTP__ _E_
Arrived in Palm Beach drove by a gas staion $4.50 a gallon. Result of failed @BarackObama leadership. _E_
Small business owners are the DREAMERS & INNOVATORS who are powering us into the future!Read more and watch here: __HTTP__ __HTTP__ _E_
Thank you Colorado Springs. Get out & VOTE #TrumpPence16 in November! __HTTP__ _E_
To show you how shallow politicians can be many are jealous of my @CPACnews speaking slot & also their fellow Republicans! Not good! _E_
RT @DRUDGE_REPORT: Trump: 'Is the Boston Killer Eligible for Obama Care to Bring Him Back to Health?' __HTTP__ _E_
It is very sad to see what @BarackObama has done with NASA. He has gutted the program and made us dependent on the Russians. _E_
It was great having @ArsenioHall back on this week's @ApprenticeNBC! __HTTP__ _E_
Obama will be going on @theviewtv & fundraising while in NYC for the UN Assembly... _E_
Not good or smart for Obama to be calling Russia a regional power or to mention the concept of a nuclear weapon going off in NYC. _E_
RT @realDonaldTrump: ATTN: @HillaryClinton Why did five of your staffers need FBI IMMUNITY?! #BigLeagueTruth #Debates _E_
I'm leaving now for Ireland Spain Scotland and elsewhere crazy life! _E_
People are LOVING the Trump sign on the Chicago building. Big league tweets letters and calls... _E_
Just leaving Virginia really big crowd great enthusiasm! _E_
#2. Be totally focused. Being successful requires nothing less than 100% of your concentrated effort. _E_
"Destiny has a part to play in your life and in your business so give it a chance to work." – Think Like a Champion _E_
I beat Hillary in the new @FoxNews Poll head to head. SHE HAS NO STRENGTH OR STAMINA both of which are needed to MAKE AMERICA GREAT AGAIN! _E_
Thank you to the Robb Report The Best of the Best issue for just naming Trump International Golf Links the Best New Golf Course In World! _E_
Sad to see the history and culture of our great country being ripped apart with the removal of our beautiful statues and monuments. You..... _E_
Thank you for all of the really nice comments and reviews concerning my speech today at the National Press Club. It was my great honor! _E_
Chinese spies stole our F 35 Joint Strike Fighter design __HTTP__ We should offset the cost from our Chinese debt _E_
Never seen such Republican ANGER & UNITY as I have concerning the lack of investigation on Clinton made Fake Dossier (now $12000000?).... _E_
Crooked Hillary has zero imagination and even less stamina. ISIS China Russia and all would love for her to be president. 4 more years! _E_
Bernie should pull his endorsement of Crooked Hillary after she decieved him and then attacked him and his supporters. _E_
Just at a news conference from Trump Turnberry in Scotland. Everybody was there & will be all over television tonite. Back on trail Saturday _E_
.@StephenBaldwin7 You were fabulous on CNN last night I greatly appreciate your support. Best wishes. _E_
"Don't bunt. Aim out of the ball park. Aim for the company of immortals." David Ogilvy _E_
As usual Hillary & the Dems are trying to rig the debates so 2 are up against major NFL games. Same as last time w/ Bernie. Unacceptable! _E_
Politicians are all talk and no action. Bush and Rubio couldn't answer simple question on Iraq. They will NEVER make America great again! _E_
More and more reporters are using the word TRUMP when referring to winning just used on Bloomberg News. Gee I wonder why? _E_
#LawandOrder #ImWithYouVideo: __HTTP__ __HTTP__ _E_
Downtown Manhattan's trendiest hotel @TrumpSoHo 46 stories of luxurious rooms fine dining & The Spa __HTTP__ _E_
I don't know if Hillary will be able to run she is a walking time bomb! _E_
Our country and it's leadership has to be so careful and so smart these are treacherous times like no other. The world is a crazy place! _E_
It was an honor to be @GretchenCarlson's inaugural guest on her new show 'The Real Story.' Gretchen will be a big success! _E_
As Bernie Sanders said Hillary Clinton has bad judgement. Bill's meeting was probably initiated and demanded by Hillary! _E_
"I'm a great believer in asking everyone for an opinion before I make a decision. It's a natural reflex." – The Art of The Deal _E_
My @foxandfriends interview discussing #MissUSA Olivia Culpo the job numbers & the waste of the Obama stimulus __HTTP__ _E_
.@genesimmons really great job handling the wise guys so easy for you such talent! I won't forget. _E_
.@NJPGA Club of the Year Trump Nat'l Bedminster is NJ's top family country club with two award winning courses __HTTP__ _E_
Thank you for a great afternoon South Carolina! See you next Tuesday! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
You mean George Bush sends our soldiers into combat they are severely wounded and then he wants $120000 to make a boring speech to them? _E_
"Do not allow fear to settle into place in any part of your life. It is a defeating attitude & a negative emotion Think Like a Champion _E_
Will Plan B miss Trace? _E_
.... It is a very effective & commonly used business tool. _E_
Welcome to the new reality. @BarackObama is now letting China buy US banks __HTTP__ The US government is selling us out. _E_
Crooked Hillary should not be allowed to run for president. She deleted 33000 e mails AFTER getting a subpoena from U.S. Congress. RIGGED! _E_
I never did a day's work in my life. It was all fun. Thomas A. Edison _E_
The @nytimes purposely covers me so inaccurately. I want other nations to pay the U.S. for our defense of them. We are the suckers no more! _E_
George Steinbrenner would have done a major number on A Rod there is no way he would have gotten paid even with the help of the union! _E_
Labor Unions Giving Serious Thought to Endorsing Trump via Washington Examiner __HTTP__ _E_
I will be interviewed on @oreillyfactor tonight at 8:00 P.M. Enjoy! _E_
Now the Chinese are planning a war game w/ the IraniansSyrians & Russians along Syrian coast. __HTTP__ Laughing at @BarackObama _E_
Why doesn't OPEC lower the price of crude to help avert the European crisis? Crude keeps rising during the dow... (cont) __HTTP__ _E_
My @FoxNews @TeamCavuto interview discussing the @RNC Convention businesses making products in China and unemployment __HTTP__ _E_
Today in Bedminster I signed the Harry W. Colmery Veterans Educational Assistance Act of 2017 joined by @DeptVetAffairs @SecShulkin. __HTTP__ _E_
The TIME Magazine cover showing late age breast feeding is disgusting sad what TIME did to get noticed. @TIME _E_
Always make a total effort even when the odds are against you. Arnold Palmer @KingdomMag _E_
As President I will bring jobs back and get wages up for Americans who need it most. __HTTP__ _E_
Bernie Sanders says that Hillary Clinton is unqualified to be president. Based on her decision making ability I can go along with that! _E_
Wind energy is a complete economic disaster.... __HTTP__ @AlexSalmond @AberdeenCC @David_Cameron @Aberdeenshire @ScotParl _E_
Weakness cow towing and not standing firm is provocative. We are getting pushed around and robbed under this President. _E_
Just found out that @tedcruz is spending a fortune on Iowa push polls negative to me. Not nice but OK! New polls are great. _E_
Visiting LA? Be sure to make a reservation at Trump National Golf Club __HTTP__ The #1 public course in the country! _E_
Congratulations to @bostonpolice on yesterday's successful and safe @bostonmarathon. The entire country is proud. _E_
Thank you Pueblo Colorado! #TrumpRally #AmericaFirst __HTTP__ __HTTP__ _E_
Great seeing @TheLeeGreenwood and Kimberly at this evenings VP dinner! #GodBlessTheUSA __HTTP__ _E_
The @washingtonpost which loses a fortune is owned by @JeffBezos for purposes of keeping taxes down at his no profit company @amazon. _E_
Always leave your ego at the door during negotiations. Remember it's only business and there will always be another day! _E_
.@meetthepress and @chucktodd did a 1 hour hit job on me today – totally biased and mostly false. Dishonest media! _E_
Is it the same Kaine that took hundreds of thousands of dollars in gifts while Governor of Virginia and didn't get indicted while Bob M did? _E_
We are now leading in many polls and many of these were taken before the criminal investigation announcement on Friday great in states! _E_
.@megynkelly recently said that she can't be wooed by Trump. She is so average in every way who the hell wants to woo her! _E_
I told everybody the Oscars were no good—Nielsen ratings confirmed one of the lowest ratings in history. _E_
Entrepreneurs: A winning attitude will put things in perspective. Keep negative thoughts & people where they belong out of the big picture. _E_
Russian officials must be laughing at the U.S. & how a lame excuse for why the Dems lost the election has taken over the Fake News. _E_
Our GDP has been growing less than 2% for the last 5 years. ObamaCare will slow us down even more. Has to be repealed. _E_
My thoughts and prayers are with the two police officers their families and everybody at the @WestervillePD. __HTTP__ _E_
The Oscar Pistorius disaster is a really interesting story to me—a very sad situation for everyone! _E_
RT @IvankaTrump: The Administration is committed to supporting military spouses in the workforce. Thanks Kim for sharing your story! __HTTP__ _E_
America needs a tough negotiator not a community organizer. _E_
Wow new polls just out have Trump up and Cruz down he is a nervous wreck! _E_
I am seriously considering Dr. Ben Carson as the head of HUD. I've gotten to know him well he's a greatly talented person who loves people! _E_
Can you imagine a Canadian company developing our website? Terrible way to put Americans back to work. _E_
RT @MeetThePress: Watch our interview with @KellyannePolls: Russia did not succeed in attempts to sway election __HTTP__ #... _E_
If we did all the things we are capable of we would literally astound ourselves. Thomas Edison _E_
RT @TeamTrump: .@timkaine's Abortion Flip Flops: From Valuing The Sanctity of Life > Pro Abortion Demagogue #VPdebate __HTTP__ _E_
Why does Barack Obama's ring have an arabic inscription? __HTTP__ Who is this guy? _E_
Mitt Romney called to congratulate me on the win. Very nice! _E_
Make sure to follow me on @periscopeco #MakeAmericaGreatAgain _E_
Just arrived at #ASEAN50 in the Philippines for my final stop with World Leaders. Will lead to FAIR TRADE DEALS unlike the horror shows from past Administrations. Will then be leaving for D.C. Made many good friends! _E_
Boeing is building a brand new 747 Air Force One for future presidents but costs are out of control more than $4 billion. Cancel order! _E_
Crooked Hillary wants a radical 500% increase in Syrian refugees. We can't allow this. Time to get smart and protect America! _E_
What will we get for bombing Syria besides more debt and a possible long term conflict? Obama needs Congressional approval. _E_
Via @SunSentinel by @JoanieCox: "In Palm Beach nothing trumps the Trump Invitational" __HTTP__ _E_
I believe Lance Armstrong had death wish when he did interview w/Oprah—as I predicted everybody is suing him he'll have nothing left _E_
People should be proud of the fact that I got Obama to release his birth certificate which in a recent book he "miraculously" found. _E_
"I have a very strict gun control policy: if there's a gun around I want to be in control of it." Clint Eastwood _E_
.@pastormarkburns You were great last night and we all very much appreciate it! Thank you! _E_
.@foxandfriends in 5 minutes. _E_
After decades of our leaders allowing China to steal our jobs & R&D the Chinese will 'overtake America' in 2016 ... _E_
Sadly I will no longer be doing @foxandfriends at 7:00 A.M. on Mondays. This is because I am running for president and law prohibits. LOVE! _E_
Iran is threatening to shut the Strait of Hormuz and @BarackObama won't approve the Keystone pipeline. His energy policy makes America weak. _E_
Golf match? I've won 18 Club Championships including this weekend. @mcuban swings like a little girl with no power or talent. Mark's a loser _E_
When is South Korea going to start paying us for the massive amounts of money we are spending to protect them from the North? _E_
Bought @JohnDeere stock a year ago for old fashioned reason—I love their product and service. _E_
Will be on @foxandfriends at 7:00 15 minutes! Enjoy. _E_
"When you're at a meeting monitor your behavior and work at being an observer – of yourself and others." – Think Like a Billionaire _E_
Why are we giving China foreign aid? Couldn't the Super Committee have agreed to at least cut that outlay? #TimeToGetTough _E_
I'm going to the BORDER tomorrow. Will be seeing some really brave people. Look forward to a big day! _E_
The reason that Ted Cruz lost the Evangelicals in S.C. is because he is a world class LIAR and Evangelicals do not like liars! _E_
Taking a helicopter to New Hampshire boarding now. Amazing activity planned. New UMASS poll very nice! __HTTP__ _E_
Only 1 mill. dollars @mcuban? Offer me real money and I'd consider it. Your team and networks lose so much money I doubt you have much left! _E_
Just like @Yankee organization I can't wait for @MLB to suspend A Rod. Will be a great day for the sport. _E_
How come every time I show anger disgust or impatience enemies say I had a tantrum or meltdown—stupid or dishonest people? _E_
Iraq in political turmoil one day after we leave I told you so. _E_
My interview on @theviewtv discussing #TimeToGetTough the GOP primary and the Newsmax @iontv debate(starts at 23:00) __HTTP__ _E_
The U.S. manufacturing sector has suffered its greatest order losses under @BarackObama. He has stood idle while China steals our jobs. _E_
Be prepared for a sensational episode of The Apprentice tomorrow night 10 pm on NBC. _E_
The new e mail release is a disaster for Hillary Clinton. At a minimum how can someone with such bad judgement be our next president? _E_
What a great group! __HTTP__ With @Schwarzenegger @SammartinoBruno and @TripleH. #WWEHOF _E_
WikiLeaks: 'Clinton Kaine Even Lied About Timing of Veep Pick' __HTTP__ _E_
The rigged Dem Primary one of the biggest political stories in years got ZERO coverage on Fake News Network TV last night. Disgraceful! _E_
...and says something is seriously wrong. He will never go down as great! _E_
Michele Bachmann just dropped out of prez race when she didn't do the Newsmax debate it showed great disloyalty and people rejected her. _E_
.@mcuban Baseball commissioner and owners were smart when they didn't want you to buy a team but I don't think you have the money anyway. _E_
He @BarackObama wants 23 years of @MittRomney's tax returns __HTTP__ Let's see BHO's school (cont) __HTTP__ _E_
ISIS exploded on Hillary Clinton's watch she's done nothing about it and never will. Not capable! _E_
I am encouraged by President Moon's assurances that he will work to level the playing field for American workers b... __HTTP__ _E_
Who knew this innocent kid would grow into a monster? #TBT #Trump __HTTP__ _E_
See what I have to say about the Occupy Wall Street protestors in today's #trumpvlog.... __HTTP__ _E_
I will be ON THE RECORD with Greta Van Susteren @gretawire tonight at 7 pm eastern/FOX News Channel _E_
Democracy cannot succeed unless those who express their choice are prepared to choose wisely... _E_
TAX CUTS will increase investment in the American economy and in U.S. workers leading to higher growth higher wages and more JOBS! __HTTP__ _E_
National Black Republican Association Endorses Donald J. Trump #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Obama Clinton inherited $10T in debt and turned it into nearly $20T. They have bankrupted... __HTTP__ _E_
Americans by & large hate ObamaCare. They see Obama lied to get it passed. They see big business & gov't got waivers. Defund! _E_
Thank you for the great rallies all across the country. Tremendous support. Make America Great Again! _E_
So many people who have children with autism have thanked me—amazing response. They know far better than fudged up reports! _E_
Thank you Sanford Florida. Get out & VOTE #TrumpPence16! #ICYMI watch this afternoons rally here:... __HTTP__ _E_
An investment in knowledge pays the best interest. Benjamin Franklin _E_
I am happy to donate $5 million to a charity Barack Obama chooses. All I am asking is that he is transparent with the American people _E_
It is not freedom of the press when newspapers and others are allowed to say and write whatever they want even if it is completely false! _E_
Thank you! #TrumpWon #MAGA __HTTP__ _E_
Another example of @BarackObama's diplomatic triumphs he gave the Queen of England an iIPod filled with his speeches. _E_
Wow the respected Monmouth University poll has me ahead of most Republican candidates nationwide and most people don't think I'm running! _E_
Congratulations to @TrumpSoHo for once again receiving the AAA Five Diamond Award for another year! _E_
After many years of LEAKS going on in Washington it is great to see the A.G. taking action! For National Security the tougher the better! _E_
Just watched @meetthepress and how totally biased against me Chuck Todd and the entire show is against me.The good news the people get it! _E_
Afghanistan's so called leader Karzai is toying with the U.S. _E_
The same people that built the ObamaCare website used as the face of the website someone who is not a US citizen. Incompetent. _E_
When will @CNN get some real political talent rather than political commentators like Errol Louis who doesn't have a clue! Others bad also. _E_
Very resource rich Canada our neighbor is looking to China for its growth. Just another sad commentary on the U.S. __HTTP__ _E_
Just got back from Colorado. The love and enthusiasm at two rallies was incredible. Big crowds! _E_
Let the Arab League take care of Syria. Why are these rich Arab countries not paying us for the tremendous cost of such an attack? _E_
Full transcript of economic plan delivered to the Economic Club of New York. #MAGA __HTTP__ __HTTP__ _E_
Orders for U.S. factory goods in March record biggest decline in 3 years __HTTP__ China is eroding the US manufacturing sector. _E_
Why haven't they released the final Missouri victory for us yet? Could it be because Cruz's guy runs Missouri? _E_
Watch – Obama will not fix the illegal immigrant loophole. Instead he will sign another executive action giving more amnesty. _E_
Thank you Iowa! #FITN #IACaucus#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
My wife Melania will be speaking in Pennsylvania this afternoon. So exciting big crowds! I will be watching from North Carolina. _E_
Some @OWS protesters are sincere people frustrated with the system others just in for the party. _E_
After seven horrible years of ObamaCare (skyrocketing premiums & deductibles bad healthcare) this is finally your chance for a great plan! _E_
Where is the President? It is time for him to come on TV and show strength against the repeated threats from North Korea and others. _E_
The F 35 program and cost is out of control. Billions of dollars can and will be saved on military (and other) purchases after January 20th. _E_
The United States has been reminded time and again in recent years that economic security is not merely RELATED to national security economic security IS national security. It is vital to our national strength. #APEC2017 __HTTP__ _E_
Honor Memorial Day by thinking of and respecting all of the great men and women that gave their lives for us and our country! We love them. _E_
Wow no longer Saturday delivery from U.S. Postal Service no money our poor poor Country! _E_
I can't believe that 60 Minutes is right now showing our nuclear facilities for the world to see (at request of U.S. leadership). STUPID! _E_
..... He knows I don't respect him. _E_
I would like to offer Vice President Biden my warmest condolences on the loss of his wonderful son Beau. Met him once great guy! _E_
Join me on @greta from Indianapolis Indiana at 7pmE! Enjoy! #Trump2016 __HTTP__ _E_
Host of the 2022 @PGAChampionship & 2017 #USWomensOpen Trump Nat'l Bedminster offers 36 holes of world class golf __HTTP__ _E_
RT @IvankaTrump: Ivanka is joining @realDonaldTrump to outline an innovative new child care policy to support American families. Tune in to... _E_
Will @anthonyweiner be fully clothed in his mayoral ads? _E_
David Letterman's show has become so boring and mundane. Somehow every time I look I can't help thinking of (cont) __HTTP__ _E_
Flattering. Over 500 upset people called Mar a Lago disappointed I am not running for President but Mitt Romney will do a great job. _E_
I never equated wind farms to the Pan Am Lockerbie disaster only stated that @AlexSalmond should never have released the terrorist BAD! _E_
Man did JEB throw his brother under the bus last night on @colbertlateshow . Probably true but not nice! _E_
Enough is Enough no more Bushes! __HTTP__ _E_
.@DLoesch played great audio from my @CPACnews press conference on her radio show. Glad she made it! _E_
The Oscars are a sad joke very much like our President. So many things are wrong! _E_
Thank you! __HTTP__ _E_
The press is going out of their way to convince people that I do not like or respect women when they know that it is just the opposite! _E_
For the NY State Repubs to waste time energy and money on a primary—then go against 3 1 Dems—is insane. _E_
Well we all did it together! I hope the MOVEMENT fans will go to D.C. on Jan 20th for the swearing in. Let's set the all time record! _E_
How did Snowden with not even a high school education get access to top secret U.S. records. He then gave or sold those records traitor! _E_
MAKE AMERICA GREAT AGAIN!#INPrimary #VoteTrump __HTTP__ _E_
On my way to New Hampshire expecting a big and spirited crowd! #FITN #Trump2016 __HTTP__ __HTTP__ _E_
Crooked Hillary will NEVER be able to handle the complexities and danger of ISIS it will just go on forever. We need change! _E_
It is time to #DrainTheSwamp! __HTTP__ _E_
Is this the New York that Ted Cruz is talking about & demeaning? __HTTP__ _E_
Wow Now experts are calling #Harvey a once in 500 year flood! We have an all out effort going and going well! _E_
#TrumpTODAY Watch my appearance on the @TODAYshow from this morning __HTTP__ _E_
The first book signing at Trump Tower for #TimeToGetTough was so popular that I'm doing another one today from noon to 2pm/Trump Tower _E_
Personally I think Douglas Durst's brother got screwed by Douglas—no wonder he's angry! _E_
Far more killed than anticipated in radical Islamic terror attack yesterday. Get tough and smart U.S. or we won't have a country anymore! _E_
Please explain to the dummies at the @WSJ Editorial Board that I love to debate and have won according to Drudge etc. all 11 of them! _E_
Crazy @megynkelly is now complaining that @oreillyfactor did not defend her against me yet her bad show is a total hit piece on me.Tough! _E_
My @SteveDeaceShow interview discussing Ebola Obama's incompetence & my trip to Iowa for @SteveKingIA on Sat. __HTTP__ _E_
Jeb Bush and Ted Cruz are not electable presidential candidates Hillary would destroy them. Ted may not be eligible to run born in Canada _E_
Thank you Cleveland. We love you and will be back many times! _E_
My @foxandfriends re: the sequestration failure of leadership in DC China playing us & taking over in 2016 __HTTP__ _E_
Iowa was amazing last night. The event could not have worked out better. We raised $6000000 for our great vets. They were so happy & proud _E_
Obamacare is a disaster as I've been saying from the beginning. Time to repeal & replace! #ObamacareFail __HTTP__ _E_
I see my friend @FlaGovScott is speaking at CPAC. Solid guy wonderful job. #sayfie @marcaputo _E_
RT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_
"Be ready for problems you'll have them every day. Keep your focus and be as big as your daily challenges." – Trump Never Give Up _E_
The U.S. recorded its slowest economic growth in five years (2016). GDP up only 1.6%. Trade deficits hurt the economy very badly. _E_
Are all the illegals pouring into our country vaccinated? I don't think so. Great danger to U.S. _E_
I was on CNN yesterday..... __HTTP__ _E_
I will be interviewed on @foxandfriends at 8:00 A.M. So much to talk about! _E_
Funny that Jeb(!) didn't want help from his family in his failed campaign and didn't even want to use his last name.Then mommy now brother! _E_
A very interesting piece by a very good writer @KirstenPowers of @USATODAY and @FoxNews. __HTTP__ _E_
Thanks. __HTTP__ _E_
Just won IOWA @CNN Poll BIG: Trump 33% Cruz 20% Rubio 11% but @WSJ reported Cruz momentum but nothing about the fact that I easily won! _E_
Rubio lied about my meeting w/ Hispanic activists. I didn't change my opinion but treated them w/ respect. Shame! __HTTP__ _E_
A simplified tax code would spur economic growth and help create jobs. Unfortunately Washington is incapable of simplifying anything. _E_
Via @Newsmax_Media by "Poll: Trump Surges Among GOP Hopefuls in NH" __HTTP__ _E_
Just got back from South Carolina. Going to Alabama tomorrow! _E_
Just did an interview with my friend @MarkSimoneNY. Congratulations to Mark on his new show on @WOR710. _E_
My sons Don and Eric are right now at Doonbeg in Ireland. There will be nothing like it! _E_
Entrepreneurs: Keep the big picture in mind. There are always opportunites and possibilities & thinking too small can negate a lot of them. _E_
.@scottienhughes you were fantastic on CNN. Thank you for the nice words. See you at the #GOPDebate. _E_
I want talented people to come into this country—to work hard and to become citizens. Silicon Valley needs engineers etc. _E_
THE SYSTEM IS RIGGED! _E_
Michigan Mississippi Idaho & Hawaii: Get out to VOTE and join the movement today! Video: __HTTP__ __HTTP__ _E_
I worked hard with Bill Ford to keep the Lincoln plant in Kentucky. I owed it to the great State of Kentucky for their confidence in me! _E_
A 34 story luxury highrise @TrumpParc offers elite amenities with residences that maximize every inch of space __HTTP__ _E_
Everyone is asking me to cover The Apprentice LIVE on twitter. I will do so. Tonight 9 to 11. IT WILL BE A GREAT EVENING OF TELEVISION! _E_
Fun to watch the Democrats working so hard to win the great State of South Carolina when I just won the Republican version amazing people! _E_
Fox & Friends going on now enjoy! _E_
Sleepy eyes @chucktodd whenever you mention me unfairly I will likewise mention you. _E_
Adopt the Arts campaign at @fundanything ensures that an underfunded public school has music and arts programs __HTTP__ _E_
RT @DanScavino: Join President elect Trump LIVE from Mobile Alabama via his #Facebook page! #ThankYouTour2016 Watch: __HTTP__ _E_
The outer boroughs of Manhattan are still devasted by Sandy. How would the press cover this if a Republican was President. _E_
"If you put the federal government in charge of the Sahara Desert in 5 years there'd be a shortage of sand." – Milton Friedman _E_
Via @thehill: Trump warns GOP moving too fast on immigration reform __HTTP__ by @JonEasley _E_
Between Libya the national security leaks and Fast & Furious Obama has had more national security scandals than any other President. _E_
The Miami Heat looked great tonight congratulations from all of your friends at your favorite place in Miami Trump National Doral. _E_
We should have a contest as to which of the Networks plus CNN and not including Fox is the most dishonest corrupt and/or distorted in its political coverage of your favorite President (me). They are all bad. Winner to receive the FAKE NEWS TROPHY! _E_
Thank you @DailyMail for setting the failing @NYTimes story straight. This is what the NYT's should have written! __HTTP__ _E_
As a show of support for our Armed Forces I will be going to The Army Navy Game today. Looking forward to it should be fun! _E_
Just got back from Iowa had a great time with amazing people. Will be back soon! _E_
Despite the upcoming election the cover of paper thin Time Magazine looks like an ad for the movie Lincoln sad! _E_
Thanks @MickyArison for your nice statement @BLTPrimeMiami @TrumpDoral. I just want to do as well as you have with @MiamiHEAT. See u soon _E_
Never make a concession during negotiations that could lead to more demands. Be prudent. It's best to have your concessions predetermined _E_
RT @GregAbbott_TX: Spoke with Pres. Trump & heads of Homeland Security & FEMA. They're helping Texas respond to #HurricaneHarvey. __HTTP__ _E_
Temperature at record lows in many parts of the country. 50 degrees below zero with wind chill in large area. Global warming folks iced in! _E_
I believe that Crooked Hillary sent Bill to have the meeting with the U.S.A.G. So Bill is not in trouble with H except that he got caught! _E_
Truly weird Senator Rand Paul of Kentucky reminds me of a spoiled brat without a properly functioning brain. He was terrible at DEBATE! _E_
We must change the laws of our land and seek fair but rapid trials for the perpetrators of terrorist acts (Boston) with harsh punishment! _E_
I've been warning about China since as early as the 80's. No one wanted to listen. Now our country is in real trouble. #TimetoGetTough _E_
My daughter Ivanka will be representing me today at the opening of our campaign office in Manchester NH #MakeAmericaGreatAgain! _E_
The public is about to learn a lot more information on Barack Obama and his true background in the coming weeks... _E_
RT @EricTrump: I will be always be incredibly proud of my work for @StJude raising $16.3+ million dollars over the last 10 years at a 9.2%... _E_
Corey Lewandowski Senior Political Adviser: Mr.Trump has the vision and leadership skills to bring our country back to greatness. _E_
Via @UrbanTurf_DC: Trump Releases Renderings For Old Post Office Building __HTTP__ _E_
Via @nydailynews: @IvankaTrump oversees new healthy room service menu at Trump Hotels __HTTP__ _E_
I will be interviewed on @60Minutes tonight after the NFL game 7:00 P.M. Enjoy! _E_
President Donald J. Trump and @FLOTUS Melania Participate in the Pardoning of the National Thanksgiving Turkey at the White House. __HTTP__ _E_
When do we sue the company for billions that robbed us in creating the hapless ObamaCare website? _E_
I never made the ridiculous comment about James G. and Obama Care somebody else put it out and attributed it to me. Not my style! _E_
If last night's election proved anything it proved that we need to put up GREAT Republican candidates to increase the razor thin margins in both the House and Senate. _E_
The polls show that I picked up many Jeb Bush supporters. That is how I got to 46%. When others drop out I will pick up more. Sad but true _E_
Snowden is showing how weak the U.S. has become. _E_
.@TraceAdkins says @Joan_Rivers is a gem. I agree. We all agree. #CelebApprentice _E_
Excited to host two great championships at two of our best properties @seniorpgachamp at Trump DC & @pgachampionship at Trump Bedminster _E_
Great optimism in America – and the results will be even better! __HTTP__ _E_
#CrookedHillary is unfit to serve. __HTTP__ _E_
Now Obama is keeping our soldiers in Afghanistan for at least another year. He is losing two wars simultaneously. _E_
.@mcuban Mark—nice picture thanks for the invite to the Mavs/Nets game. Next time I'll go and you'll win! _E_
.@TrumpSoho has just been awarded the AAA Five Diamond Award. Congratulations to the team for this great recognition of their amazing work. _E_
Thank you!! #Trump2016 __HTTP__ _E_
I'm at Trump Int'l Hotel in Las Vegas tallest/most beautiful building in town. Speaking to another great crowd at Treasure Island (12 noon) _E_
Idiot @billmaher always forgets to mention that I am suing him to collect the $5M for charity that he expressly offered. _E_
The Trump Spa @TrumpNewYork is a serene sanctuary featuring luxurious spa treatment rooms saunas and steam rooms __HTTP__ _E_
Thank you Ted. __HTTP__ _E_
Thank you Sean McGarvey & the entire Governing Board of Presidents for honoring me w/an invite to speak. #NABTU2017... __HTTP__ _E_
Dow Passes 23000 for the First Time Fueled by Strong Earnings #Dow23K📈 __HTTP__ __HTTP__ _E_
Via @starpulse: Donald Trump Calls Barack Obama 'Incompetent' __HTTP__ _E_
Meet the amazing mother whose letter I read during my speech. She lost her son to policies supported by Clinton. __HTTP__ _E_
Just saw Crooked Hillary and Tim Kaine together. ISIS and our other enemies are drooling. They don't look presidential to me! _E_
An honor to meet with the Polish American Congress in Chicago this morning! #ImWithYou Video:... __HTTP__ _E_
JOBS JOBS JOBS! __HTTP__ _E_
Via @IBTimes: Under Fire From Donald Trump Jeb Bush Focuses On 9/11 Even Though Hijackers Got Florida Licenses __HTTP__ _E_
I am thrilled to share that the Trump Home furniture collection by @doryainteriors just opened a new... __HTTP__ _E_
THANK YOU AMERICA!#MakeAmericaGreatAgain __HTTP__ _E_
I will be interviewed on @foxandfriends at 7:00 A.M. Enjoy! _E_
My @foxandfriends interview discussing ObamaCare the Romney Trump fundraiser & my plans for Jones Beach __HTTP__ _E_
My beautiful wife Melania will be appearing on QVC this evening from 8 to 9 pm. _E_
May God have mercy upon my enemies because I won't General George S. Patton _E_
My @TeamCavuto interview re: 2016 the need for leadership in our country Syria & China hacking our military __HTTP__ _E_
Presidency. Two of my children Don and Eric plus executives will manage them. No new deals will be done during my term(s) in office. _E_
Wow the ALIS just nominated my purchase of Doral in Miami as Transaction of the Year—thanks! _E_
Re Real Estate: You don't necessarily need the best location. What you need is the best deal... _E_
Join me in Westfield Indiana tomorrow night at 7:30pm! #Trump2016 Tickets: __HTTP__ __HTTP__ _E_
Forty seven million now on food stamps. When he came to office there were 32 million. He's added 15 million people. @MittRomney _E_
Brought to you by @HillaryClinton & her campaign in Chicago Illinois. #BigLeagueTruth #DrainTheSwamp __HTTP__ _E_
The Tax Cut/Reform Bill including Massive Alaska Drilling and the Repeal of the highly unpopular Individual Mandate brought it all together as to what an incredible year we had. Don't let the Fake News convince you otherwise...and our insider Polls are strong! _E_
Saying goodbye to some of my great workers at @TrumpDoral in Miami. __HTTP__ _E_
.@PiersMorgan and @OMAROSA really hate each other. #CelebApprentice _E_
We are getting rid of all Glenfiddich garbage alcohol from Trump properties. _E_
Control your own destiny or someone else will. Jack Welch _E_
A great evening in Springfield Illinois. Thank you for all of the support! #Trump2016 __HTTP__ _E_
Our thoughts and prayers remain with Bret Michaels and his family and for his speedy recovery. _E_
Because of Rodolfo Rosas Moya who owes me lots of money Mexico will never again host the Miss Universe Pageant. _E_
China talks about the so called carbon footprint and then behind our leaders backs they laugh. They could (cont) __HTTP__ _E_
It has just been confirmed by the City of Mobile Alabama that there were 30000 people at last nights event making it #1for pol season. _E_
James Clapper and others stated that there is no evidence Potus colluded with Russia. This story is FAKE NEWS and everyone knows it! _E_
Throughout my travels I've had the pleasure of sharing the good news from America. I've had the honor of sharing our vision for a free & open Indo Pacific a place where sovereign & independent nations w/diverse cultures & many different dreams can all prosper side by side. __HTTP__ _E_
"As someone once put it 'Marriage is the greatest 'anti poverty' program God ever created.'" #TimeToGetTough _E_
The joint statement of former presidential candidates John McCain & Lindsey Graham is wrong they are sadly weak on immigration. The two... _E_
THANK YOU America! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
If you are planning to visit the world famous Trump Tower Atrium be sure to come early. During the holiday season it is packed by 10AM _E_
Senator Tom Cotton was great on Meet the Press yesterday. Despite a totally one sided interview by Chuck Todd the end result was solid! _E_
I have long given the order to help Argentina with the Search and Rescue mission of their missing submarine. 45 people aboard and not much time left. May God be with them and the people of Argentina! _E_
I am returning to the Pensacola Bay Center in Florida Friday 9/9/16 at 7pm. Join me! __HTTP__ __HTTP__ _E_
Via @WSJ by @SenAlexander: Wind Power Tax Credits Need to be Blown Away __HTTP__ @alexsalmond _E_
Caught RED HANDED very disappointed that China is allowing oil to go into North Korea. There will never be a friendly solution to the North Korea problem if this continues to happen! _E_
They had no definitive proof against Tom Brady or #patriots. If Hillary doesn't have to produce Emails why should Tom? Very unfair! _E_
What's incredible is that @Obamacare hasn't even kicked in yet and aleady it's doing tremendous damage. (cont) __HTTP__ _E_
Donald Trump's __HTTP__ Breaks $1M for Comedian Adam Carolla: New crowdfunding site sets record __HTTP__ _E_
RT @EricTrump: #JournalismIsDead __HTTP__ _E_
I love the White House one of the most beautiful buildings (homes) I have ever seen. But Fake News said I called it a dump TOTALLY UNTRUE _E_
The @nytimes is so dishonest. Their hit piece cover story on me yesterday was just blown up by Rowanne Brewer who said it was a lie! _E_
Isn't it ridiculous starting today new Ebola screenings go into effect for people coming from West Africa. Just stop the flights dummies! _E_
Watch #CelebApprentice this Sunday at 9PM ESTon @NBC it has received many 4 star reviews. _E_
More questions answered... __HTTP__ #trumpvlog _E_
.@GStephanopoulos stupidly believes that Hillary wants to run against me because she said so. She says that so people believe it opposite! _E_
Some of your most popular questions answered in today's video __HTTP__ _E_
Bob & Suzanne Wright co founders of @autismspeaks have done an absolutely fantastic job—two real winners. __HTTP__ _E_
After hearing the news that they would not be able to extort $1M from me they went hostile w/ a series of incorrect & ill informed ads. _E_
Autism WAY UP I believe in vaccinations but not massive all at once shots. Too much for small child to handle. Govt. should stop NOW! _E_
On International Women's Day join me in honoring the critical role of women here in America & around the world. _E_
Isn't it ironic that President Obama of all people is pushing for 'universal background checks?!' _E_
Just returned from Mississippi a great evening. _E_
Throwing out the first pitch a few years ago at Fenway in Boston Boston will be better than ever. __HTTP__ _E_
In the 1920's people were worried about global cooling it never happened. Now it's global warming. Give me a break! _E_
President Obama must remember that the worst thing you can do in a deal is seem desperate to make it. Be cool move slowly and think! IRAN _E_
.@GeorgeTakei is doing really well & soon coming to Broadway. _E_
I will be on @LateNightJimmy tonight. Always have a good time with @jimmyfallon. Now we know he will get high ratings tonight. _E_
Maybe the millions of people who voted to MAKE AMERICA GREAT AGAIN should have their own rally. It would be the biggest of them all! _E_
The Veterans Administration is in shambles and our veterans are suffering greatly. John McCain has done nothing to help them but talk. _E_
Football coaches are no longer allowed to scream and yell at their players because it is discriminatoryracist and can be viewed as bullying _E_
Join me tomorrow in Plymouth New Hampshire! #FITN #NHPrimary __HTTP__ _E_
Inner city crime is reaching record levels. African Americans will vote for Trump because they know I will stop the slaughter going on! _E_
In analyzing the Alabama Primary raceFAKE NEWS always fails to mention that the candidate I endorsed went up MANY points after endorsement! _E_
Great news Former Mayor of Dallas Tom Leppert has just endorsed me! Thank you! Tomorrow is a big day VOTE! #VoteTrump #SuperTuesday _E_
Almost universal support that Trump won the debate. Only @FoxNews is consistantly fighting the Trump win and I got them the ratings! _E_
McAllen Texas 8 miles from U.S. Mexico border. #Trump2016 Video: __HTTP__ __HTTP__ _E_
Experience is the teacher of all things. Julius Caesar _E_
.@deedeesorvino was GREAT today on @FoxNews She gets what is going on in politics and sees it very clearly. Have her on more! _E_
As I stated at the press conference on Friday regarding David Duke I disavow. __HTTP__ _E_
#LawandOrder #ImWithYouTranscript: __HTTP__ _E_
One of the hardest jobs in politics must be cleaning up after @JoeBiden gaffes. I feel sorry for his spokespeople. _E_
Had a great time going over renovations for Trump National Doral this past weekend. It is going to be amazing. __HTTP__ _E_
My interview from yesterday with @seanhannity __HTTP__ _E_
HAPPY BIRTHDAY to our @FLOTUS Melania! __HTTP__ __HTTP__ _E_
Very good news—the new Quinnipiac poll just came out—I am #1 in Iowa. _E_
By raiding the defense budget to pay for his failed social programs @BarackObama continues to weaken our (cont) __HTTP__ _E_
My @gretawire interview discussing the economy unemployment numbers China Charles BarkleyFrance and the election __HTTP__ _E_
As I predicted Obama already caught lying on Ocare enrollment # by CBO who's sticking w/ "6 million enrollments" __HTTP__ _E_
RT @billoreilly: A free press is vital to protecting all Americans. A corrupt press damages the Republic. _E_
RT @EricTrump: #Wisconsin: To find your voting location visit __HTTP__ #MakeAmericaGreatAgain #TrumpTrain __HTTP__ _E_
.@VinceMcMahon @MikeTyson @HomerJSimpson I think I'm going to accept the #IceBucketChallenge stay tuned to my Twitter tomorrow.... _E_
Invincibility lies in the defence the possibility of victory in the attack. Sun Tzu _E_
The State Of The Union speech was one of the most boring rambling and non substantive I have heard in a long time. New leadership fast! _E_
Priorities. While Obama wastes billions on a broken website he is going to cut military pay __HTTP__ No surprise. _E_
Will be heading over to the debate soon. Can you believe @CNN is milking it for almost 3 hours? Too long too many people on stage! _E_
I am speaking today at the National Press Club totally sold out and will then be inspecting The Old Post Office on Pennsylvania Avenue! _E_
My @SquawkCNBC interview discussing @MittRomney's pick of @PaulRyanVP how to frame Medicare debate & @RNC convention __HTTP__ _E_
I am very proud of Ivanka! _E_
Entrepreneurs: Business is a creative endeavor. Being innovative = being open to new ideas. Keep an open mind! _E_
We're all thinking of you @SteveScalise! #TeamScalise __HTTP__ _E_
Great to be back in Iowa! #TBT with @JerryJrFalwell joining me in Davenport this past winter. #MAGA __HTTP__ _E_
I will be interviewed on @GMA Good Morning America at 7:00 A.M. @ABC will be announcing new poll numbers. MAKE AMERICA GREAT AGAIN! _E_
Today I introduced my Contract with the American Voter our economy will be STRONG & our people will be SAFE.... __HTTP__ _E_
"Success in golf depends less on strength of body than upon strength of mind and character." Arnold Palmer _E_
Check out Ivanka's new FaceBook page and keep up with what's happening from The Celebrity Apprentice to jewlery to free tickets and more.. _E_
RT @SarahPalinUSA: Trading in the beautiful snow of Iowa for the red dirt of Oklahoma as planned despite what the media is try's no... __HTTP__ _E_
...healthcare plan is on its way. Will have much lower premiums & deductibles while at the same time taking care of pre existing conditions! _E_
The @BarackObama campaign took in $39M in May but spent $44.6M. Sound familiar! _E_
When will Obama next go on vacation if he wins the election? The day after. _E_
It's a shame to hear that the @dcexaminer is failing. No one wants the paper even if it is being handed out for free. _E_
President Obama we need to protect our closest ally Israel. The situation in the Middle East is at a tipping point. _E_
The owner of California Gold just made a jerk (fool) out of himself. Just smile and congrat the winner. His wife was visibly embarrassed! _E_
Via @SaintPetersblog by @MitchEPerry: "Shock poll: Donald Trump leads Jeb Bush 26 20% ... in Florida" __HTTP__ _E_
My acquisition of the Doral in Maimi will be a major success for the Trump Organization. The re building is on schedule. _E_
First Minister @AlexSalmond will be destroying the beauty of Scotland with his insane desire for bird killing wind turbines. _E_
Contractors can blame Obama admin all day for their $600M failure but both parties are at fault pay taxpayers back. _E_
'Donald Trump leads Hillary Clinton by 19 points among military veteran voters: poll' #AmericaFirst #MAGA __HTTP__ _E_
The U.S. accidentally air dropped a large shipment of military weapons and supplies right into the middle of ISIS as enemy laughs! Very sad! _E_
We traveled the world to strengthen long standing alliances and to form new partnerships. See more at:... __HTTP__ _E_
Thank you Hershey Pennsylvania. Get out & VOTE on November 8th & we will #MAGA! #RallyForRiley #ICYMI watch here... __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Overlooking Central Park @TrumpNewYork brings both glamor and prestige to your Five Diamond hotel stay __HTTP__ _E_
I will be interviewed tonight on @seanhannity Enjoy! 10:00 P.M. _E_
When written in Chinese the word 'crisis' is composed of two characters. One represents danger and the other represents opportunity. JFK _E_
My interview with @ThisWeekABC w/@GStephanopoulos destroyed all Sunday competition w/ 2.52M total viewers...that's why they want me on! _E_
... while Tom Brady is guilty because he REPLACED his LEGAL cellphone? _E_
RT @JoeNBC: Trump +15 on Cruz in 2 weeks. Cruz may look back and ask why he ever attacked Trump. DT has killed him ever since. __HTTP__ _E_
I think @TheRevAl should take this challenge. Axelrod was too scared. RT: @RonKaufmanIntrn: Kaufstache vs. Sharpstache. _E_
Waste @BarackObama's Dep. of Energy was warned in advance by Treasury that it wasn't loaning $ out in good deals __HTTP__ _E_
I hope Oprah gives Lance Armstrong 100 million dollars because that's what that ridiculous interview will cost him! _E_
Winners never quit and quitters never win. Vince Lombardi _E_
.@mcuban Shark Tank was shoved to Friday evening Friday is considered "dead television." Besides you are not the star (& never will be). _E_
Gold just set another record high on price with the largest physical gold sales on record __HTTP__ Inflation is coming... _E_
.@JustinRose99 Great playing we are proud of you! _E_
Obama keeps saying that he will do something but why hasn't he done it? It's all talk. _E_
Success breeds success. The best way to impress people is through results. Think Like a Billionaire _E_
Just watched Senator John Barrasso on @FoxNews He was great! Thank you John. _E_
Congratulations to @Boston_Police @FBIBoston & all emergency first responders & doctors for their excellent work under fire yesterday _E_
The Republicans can absolutely win if they stick together but they are NOT sticking together. Sen. McCain just said we can't win .Very bad! _E_
Goofy Elizabeth Warren Hillary Clinton's flunky has a career that is totally based on a lie. She is not Native American. _E_
__HTTP__ Countdown to @AmericaNowRadio as my former _E_
Senate concludes "Benghazi could have been prevented" __HTTP__ _E_
The Democrats in the Super Committee want to raise taxes first in deficit talks. Huge mistake. Cut wasteful spending first. _E_
I commend @DrZuhdiJasser for defending the NYPD and Commissioner Kelly. The NYPD has done outstanding work in defending NYC from attacks. _E_
Chicago murder rate is record setting 4331 shooting victims with 762 murders in 2016. If Mayor can't do it he must ask for Federal help! _E_
... at Madison Square Garden followed by a ceremony with 80000 people at MetLife Stadium Wrestlemania. _E_
Watch Celebrity Apprentice on Sunday at 9 pm on NBC we're winding up for a terrific finale. What a season! __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
"Know from inside out that you have the power to succeed and you will. That's taking control." – Think Like a Champion _E_
#TrumpVine on ObamaCare website __HTTP__ _E_
My @foxandfriends int. on IRS targeting Tea Party the Benghazi death scandal & @TraceAdkins v. @pennjillette finale __HTTP__ _E_
The people that gave you global warming are the same people that gave you ObamaCare! _E_
Together we're going to restore safety to our streets and peace to our communities and we're going to destroy the vile criminal cartel #MS13 and many other gangs...'Hundreds arrested in MS 13 crackdown' __HTTP__ __HTTP__ _E_
Iran toys with U.S. days before we pay them ridiculously billions of dollars. Don't release money. We want our hostages back NOW! _E_
My @foxandfriends interview discussing the Make America Great Again Texas filing and the Iowa caucus __HTTP__ _E_
.@greta: Look forward to watching Greta's interview tonight at 7.00 p.m. with Marine Andrew Tahmooressi. #Marinefreed _E_
Thank you Greta. __HTTP__ _E_
Gain and use information to your advantage see every day as an opportunity to learn. _E_
URGENT: we've just announced a $2 million fundraising goal tonight. Please stand with us! __HTTP__ __HTTP__ _E_
Big defeat last night in Nevada for Ted Cruz and Marco Rubio. @KarlRove on @FoxNews is working hard to belittle my victory. Rove is sick! _E_
.@AnnCoulter has been amazing. We will win and establish strong borders we will build a WALL and Mexico will pay. We will be great again! _E_
The US Air Force won the war in Libya to clear the way for Islamic Extremist control of Libya. _E_
Needed: Leaders who negotiate smart trade deals.Only one knows The Art of The Deal. Let's Make America Great Again! __HTTP__ _E_
Hope and Change? Job numbers down. Time for @MittRomney _E_
I'll be doing Fox and Friends this morning at seven. _E_
"Sometimes you have to take a half step back to take two forward." @VinceMcMahon _E_
Many people talking with much agreement on my Iran speech today. Participants in the deal are making lots of money on trade with Iran! _E_
Via @CraveOnline: Donald Trump is NOT A Rod fan __HTTP__ _E_
Never bet against Bob Kraft Bill Belichick or Tom Brady! @Patriots _E_
RT @RSBNetwork: We are ALREADY LIVE in Everett WA for the Trump Rally. Come join us our cameras tonight! #TrumpinEverett __HTTP__ _E_
ALWAYS TRY TO LEARN FROM OTHER PEOPLES MISTAKES NOT YOUR OWN IT IS MUCH CHEAPER THAT WAY! _E_
If I'm the third most envied man in America the small group of haters and losers must be nauseas. _E_
Via the New York Times __HTTP__ _E_
On 9/11 we pray for the victims and their families of the attack and give thanks to all who have sacrificed for justice & our freedom. _E_
I will be on @foxandfriends at 7:00 A.M. ENJOY! _E_
Other networks are begging me to do a show I can't because I'm doing the Apprentice! _E_
The Chinese talk of climate change and carbon footprint but don't clean up their factories but they sell us the equipment to clean up ours! _E_
Rosie O'Donnell just said she felt shame at being fat not politically correct! She killed Star Jones for weight loss surgery just had it! _E_
The tournament at Trump National Doral was much more exciting than what is going on now! _E_
The polls have shown that DEAD PEOPLE voted for President Obama overwhelmingly and without hesitation he must be doing something right! _E_
Shocking two of @BarackObama's largest campaign bundlers are directly linked to Solyndra __HTTP__ What a coincidence! _E_
I will be on @seanhannity tonight at 10 PM @FoxNews. #Hannity _E_
Terrible CBO forecast for 2013 1.4% GDP growth and 7.5%+ unemployment (really 17%+) __HTTP__ You get what you vote for! _E_
Consumer spending is continuing to fall with weak June numbers. @BarackObama's policies have created a climate (cont) __HTTP__ _E_
After being forced to apologize for its bad and inaccurate coverage of me after winning the election the FAKE NEWS @nytimes is still lost! _E_
RT @RepKristiNoem: A lot of tough decisions got us to this point but we're closer than we've been in 30+ years to a fairer tax code that k... _E_
#TrumpVine A message for my hotel guest @MileyCyrus __HTTP__ _E_
I just realized that if you listen to Carly Fiorina for more than ten minutes straight you develop a massive headache. She has zero chance! _E_
.@FoxNews has been treating me very unfairly & I have therefore decided that I won't be doing any more Fox shows for the foreseeable future. _E_
Shock even more @BarackObama solar corruption. @VPBiden's chief of staff's firm got biggest DOE loan. __HTTP__ _E_
In November I think the people of Ohio will remember that the Republicans picked Cleveland instead of going to another state. Jobs! _E_
By the way where is @Oprah? Good question. 4 years ago she strongly supported Obama now she is silent. Anyway who cares I adore Oprah. _E_
Tom Brady played great today. He is a total champ and a really nice guy a rare combination! _E_
.@nbcnightlynews (Brian Williams anyone?) says women warriors are every bit as tough as the guys. Just think about that statement! _E_
I will be making my Supreme Court pick on Thursday of next week.Thank you! _E_
I have a lawsuit in Mexico's corrupt court system that I won but so far can't collect. Don't do business with Mexico! _E_
He @BarackObama said it would be 'unprecedented' if the USC rules that ObamaCare is unconstitutional. It was (cont) __HTTP__ _E_
My @NewsRadio610 int. w/@JackHeathRadio discussing Nickey S. Loeb 1st Amendment Awards Dinner & @SenScottBrown __HTTP__ _E_
RT @Team_Trump45: @realDonaldTrump __HTTP__ _E_
What about all of the contact with the Clinton campaign and the Russians? Also is it true that the DNC would not let the FBI in to look? _E_
.@scottienhughes Keep up the great work Scottie. Polls are best ever! _E_
Congratulations to all the Trump 2012 #MissUniverse contestants who came from across the world. You did great and made us all proud! _E_
Stop the flights! __HTTP__ _E_
Thank you Green Bay Wisconsin! Governor @Mike_Pence and I will be back soon. #TrumpPence16 #MAGA __HTTP__ _E_
Congratulations to @AnnDRomney on delivering a knock out speech last night. America can't wait to call her our First Lady. _E_
Everyone should calm down. @BenAffleck is going to do a great job as Batman. _E_
Congratulations to @TrumpPanama for being named one of the "Best of +VIP Access" hotels for 2014 by @Expedia! __HTTP__ _E_
$25 Million+ raised online in just one week! RECORD WEEK. #DrainTheSwamp Today we set a bigger record. Contribute > __HTTP__ _E_
It is now clear that the Embassy attack in Libya was a coordinated Al Qaeda operation and not based on some video. _E_
So many people are agreeing with me on not creating a highway for Ebola to the U.S. Started in small area of Africa and now spreading fast _E_
"Our Constitution was made only for a moral & religious people. It is wholly inadequate to the government of any other." John Adams _E_
Be sure to check out the new projects @fundanything __HTTP__ Giving away money! _E_
After these spirited primaries are over @GOP must be fully united for November. If we take the Senate we stop Obama's agenda. _E_
.@WayneNewtonMrLV Wayne such a pleasant surprise so nice. Thank you very much. _E_
After raising w/ no obligation almost $6M for Vets I couldn't believe protesters formed @ Trump Tower. JUST OUT SENT BY CROOKED HILLARY! _E_
As gas prices keep rising @BarackObama won't approve Keystone. Instead he is pushing algae yes algae as an (cont) __HTTP__ _E_
It's Tuesday. @AGSchneiderman is wearing Revlon eyeliner today. Governor Cuomo alerted all to this. _E_
Excited to speak at tomorrow night's @ocrp Lincoln Day dinner in Michigan "All time sales record over 2000." __HTTP__ _E_
Great book just out A Place Called Heaven by Dr. Robert Jeffress A wonderful man! _E_
Glad to hear @EWErickson has moved over to @FoxNews. Erick is a sharp political analyst. _E_
I never asked Comey to stop investigating Flynn. Just more Fake News covering another Comey lie! _E_
Great Army Navy Game. Army wins 14 to 13 and brings home the COMMANDER IN CHIEF'S TROPHY (last time was 1996). Wow! Congratulations! _E_
Congratulations John! __HTTP__ _E_
The big golf course project on the water by the Whitestone Bridge in NYC that has been under construction for many years now complete GREAT! _E_
Obama wants taxes to go up so he can take credit for lowering them next year. _E_
RT @narendramodi: Had a wonderful meeting with @IvankaTrump advisor to @POTUS and leader of the US delegation at the @GES2017. __HTTP__ _E_
Getting ready for my big foreign trip. Will be strongly protecting American interests that's what I like to do! _E_
I will be interviewed on the @TODAYshow at 7:00 A.M. this morning. Enjoy! _E_
Spitzer failed as A.G. failed as Governor in disgrace and was fired on all T.V. shows (boring and zero ratings) and he's at it again! _E_
...you can enhance location through promotion and work. _E_
The @ABC poll sample is heavy on Democrats. Very dishonest why would they do that? Other polls good! _E_
RT @DanScavino: Back to Cincinnati Ohio this Thursday (12/1/16) at 7pm for #PEOTUS @realDonaldTrump's #ThankYouTour2016! Join us! __HTTP__ _E_
TO ALL AMERICANS #HappyNewYear & many blessings to you all! Looking forward to a wonderful & prosperous 2017 as we... __HTTP__ _E_
Does anyone notice how the Montana Congressional race was such a big deal to Dems & Fake News until the Republican won? V was poorly covered _E_
Be careful – sexting pervert Anthony Weiner is upping his campaigning. When will new pictures be released? _E_
Why are people giving money to Karl Rove when he just wasted $400M without any victories? Use your head. _E_
Via The Hill No Tickets Left for Trump's Dallas Rally __HTTP__ _E_
Will be interviewed on @GMA this morning at 7:00. Thanks for the GREAT poll results! _E_
.@BarackObama's college application would be very very very very interesting! _E_
. @chrislhayes replaced @edshow on @msnbc to increase ratings. It's a shame Chris' are even worse. Sad to see. _E_
Champion @bretmichaels triumphantly returns to 13th season of All Star @CelebApprentice. Spoiler – Bret is back to his winning ways. _E_
#LasVegasStrong #USA __HTTP__ _E_
For the great people of Iowa find your #IACaucus location at __HTTP__ so important to vote! #MakeAmericaGreatAgain _E_
Looking forward to a press conference today about @adamcarolla on @fundanything movie project #roadhard __HTTP__ _E_
The @BarackObama administration is pressuring contractors to fix job loss estimates from environmental regulations __HTTP__ _E_
.@StephenBaldwin7 shines in the record 13th season of 'All Star' @CelebApprentice. The Baldwin clan will be proud of Stephen. _E_
Aberdeen tourism is booming because of my great Scottish golf club. _E_
Is PM Cameron a dummy? With monumental cuts in UK spending how come he continues to spend billions of pounds ... _E_
GET READY!! The #TrumpFerryPoint tee sheet opens TODAY @ 10am EST on our website for April 1st 30th! @TrumpFerryPoint _E_
Via @fitsnews: Donald Trump Knows How To (Tea) Party THE DONALD PLANS SPLASHY LANDING IN MYRTLE BEACH S.C. __HTTP__ _E_
As ISIS and Ebola spread like wildfire the Obama administration just submitted a paper on how to stop climate change (aka global warming). _E_
Just spoke w/ Governors Rick Scott of Florida Kenneth Mapp of the U.S. Virgin Islands & Ricardo Rosselló of Puerto Rico. WE ARE W/ YOU ALL! __HTTP__ _E_
George Steinbrenner was a great friend and a true legend. There will never be anyone like him in New York. We've lost a truly great man. _E_
Lightweight Senator Marco Rubio is polling very poorly in Florida. The people can't stand him for missing so many votes poor work ethic! _E_
Congratulations to @MikeTyson on the success of his new book Undisputed Truth & @HBO special and thanks for the nice words Mike. _E_
Winner of the 5 Star Diamond Award @TrumpGolfDC's two courses grace over 600 acres on the Potomac River __HTTP__ _E_
Should be interesting but too bad the three guys at <1% will be taking up so much time but who knows maybe a star will be born (unlikely) _E_
Follow @TrumpNH for all the updates on my New Hampshire political activities. Looking forward to returning to the Granite State on May 14! _E_
.@SenScottBrown is the most competitive GOP option against Obama's amnesty loving @SenatorSheehan. He can win! _E_
.@TraceAdkins presents the NJ Coast Red Cross a $40000 check for Sandy Relief. You can tell he's very pleased about that & rightly so. _E_
We should be concerned about the American worker & invest here. Not grant amnesty to illegals or waste $7B in Africa. _E_
Via @BreitbartNews by Steve Bannon: "'TIME TO GET TOUGH': TRUMP'S BLOCKBUSTER POLICY MANIFESTO" __HTTP__ _E_
Entrepreneurs: Believe in yourself. If you don't no one else will either. _E_
Shouldn't George Will have to give a disclaimer every time he is on Fox that his wife works for Scott Walker? _E_
.@BillMaher needs to cut back on the pot and maybe he will stop making offers he can't afford. _E_
Thank you. __HTTP__ _E_
Club for Growth letter trying to extort $1000000.00 from me. Remember I said NO! __HTTP__ _E_
My list of potential U.S. Supreme Court Justices was very well recieved. During the next number of weeks I may be adding to the list! _E_
Interesting to watch Senator Richard Blumenthal of Connecticut talking about hoax Russian collusion when he was a phony Vietnam con artist! _E_
I believe in the America that never gives up never stops striving never ceases believing in itself. @MittRomney 11.2.12 _E_
Lightweight @JebBush said tonight he didn't know his family used private eminent domain in Texas Lie! #GOPDebate _E_
... collusion which doesn't exist. The Dems are using this terrible (and bad for our country) Witch Hunt for evil politics but the R's... _E_
Today will be a Super Tuesday for @MittRomney he will win over 220 delegates from states across every region. He will be the nominee. _E_
Wow @CNN got caught fixing their focus group in order to make Crooked Hillary look better. Really pathetic and totally dishonest! _E_
Join me live in Springfield Ohio! __HTTP__ _E_
I hate what has happened to the once great @CNN. _E_
Nelson Mandela and myself had a wonderful relationship he was a special man and will be missed. __HTTP__ _E_
I'm urging my friends in Brooklyn to vote for Bob Turner tomorrow send @barackobama a message. _E_
Success seems to be connected w/ action. Successful people keep moving. They make mistakes but they don't quit. Conrad Hilton _E_
RT @EricTrump: What a scary statistic! Americans are working harder and making less! We need competent leadership! __HTTP__ _E_
Without passion you don't have energy without energy you have nothing. Nothing great in the world has been accomplished without passion! _E_
Everything comes to him who hustles while he waits. Thomas Edison _E_
#IACaucus 2/1/2016 6:30pm#MakeAmericaGreatAgain!Iowa caucus finder: __HTTP__ #GOPDebate __HTTP__ _E_
Great list of spring travel ideas from our @TrumpCollection properties: __HTTP__ _E_
Why doesn't President Obama just get the people from Google to fix the failed website. In fact why didn't he use them in the first place! _E_
A beautiful funeral today for a real NYC hero Detective Steven McDonald. Our law enforcement community has my complete and total support. _E_
Glad to see 9 more Iraq and Afghan war veterans joining the next Congress __HTTP__ They deserve to be there! _E_
Watch @IvankaTrump show you how easy it is to #CaucusForTrump in Iowa! #IACaucus Video: __HTTP__ __HTTP__ _E_
I hope @boyscouts of America handle their problems a lot better than the board at Penn State did. You can't do any worse! _E_
Join me in Houston Texas tomorrow night at 7pm! Tickets: __HTTP__ __HTTP__ _E_
Looking forward to being the guest of honor at @ralphreed's @FaithandFreedom Patriot's Award Gala Dinner in Washington DC _E_
We're going to cut taxes BIG LEAGUE for the middle class. She's raising your taxes and I'm lowering your taxes! __HTTP__ _E_
Why do the losers & haters always say I wear a "wig" when they know I don't. Like it or not it's all mine—just ask Barbara Walters. _E_
Mitt Romney didn't show his tax return until SEPTEMBER 21 2012 and then only after being humiliated by Harry R! A bad messenger for estab! _E_
Thank you Georgia!#SuperTuesday #Trump2016 _E_
Congratulations to our new Miss USA the beautiful Rima Fakih. Rima will represent us well at Miss Universe and be a wonderful Miss USA . _E_
Hillary when you complain about a penchant for sexism who are you referring to. I have great respect for women. BE CAREFUL! _E_
.@CNN poll just hit 49% for Trump. Interesting how my numbers have gone so far up since lightweight Marco Rubio has turned nasty. Love it! _E_
So with all of the Obama tough talk on Russia and the Ukraine they have already taken Crimea and continue to push. That's what I said! _E_
Thank you Arizona! #Trump2016 #WesternTuesday #TrumpTrain __HTTP__ _E_
"Donald Trump 2016: 7 Political Stances of GOP Presidential Hopeful" __HTTP__ via @Newsmax_Media _E_
2nd segment of my @seanhannity @FoxNews interview discussing @billmaher's insult of parents and sending him $5M bill __HTTP__ _E_
Thank you for your strong testimony when welcoming me to Liberty University yesterday @JerryJrFalwell. __HTTP__ _E_
One year ago I started calling President Obama INCOMPETENT and people thought it was too tough. Tonight everyone is using that word! _E_
Wow did the @nytimes fall into the Bush trap where his people convinced them how happy he was that I was hurting other candidates & not him _E_
This is all about American weakness and an incompetent President. _E_
Just met with courageous family of Sarah Root in Nebraska. Sarah was horribly killed by illegal immigrant but leaves behind amazing legacy. _E_
A great event in Las Vegas Nevada! __HTTP__ _E_
Remember when failed candidate @JebBush said that illegals came across the border as AN ACT OF LOVE? He's spent $59 million and is at 3%. _E_
Goodnight everyone sleep tight! _E_
Entrepreneurs: Believe in yourself! If you don't no one else will either. _E_
Be passionate. If you love what you're doing success will follow. _E_
The @MittRomney fundraiser last night was a tremendous success. _E_
Via @BreitbartNews by @pamkeyNEN: "Trump: ObamaCare Not Working for Business Going to Collapse" __HTTP__ _E_
The reason Ed Schultz said nice things about me is that I'm the only Repub who won't cut Social Security etc. I'll make America rich again! _E_
I just won a big Court decision (N.Y. Post) against some character who claimed I owed him licensing fees on success of my shirts and ties. _E_
"Live Free or Die." – motto of New Hampshire _E_
Dems warn not to underestimate Trump's potential win __HTTP__ _E_
30000 illegal immigrants with CRIMINAL RECORDS were released last year by our wonderful though highly incompetent government. So stupid! _E_
What would you choose Vampires or Cavemen? #CelebApprentice _E_
"Circumstances are beyond human control but our conduct is in our own power." Benjamin Disraeli _E_
Many on the team and staff of Bernie Sanders have been treated badly by the Hillary Clinton campaign and they like Trump on trade a lot! _E_
Dopey @Lord_Sugar People are calling in saying you are being beaten badly w/ the tweets... _E_
If you don't treat yourself like royalty no one else will. @TrumpWaikiki is Honolulu's most luxurious hotel __HTTP__ _E_
After seven years of talking Repeal & Replace the people of our great country are still being forced to live with imploding ObamaCare! _E_
Via @newsobserver by @RaleighReporter: In Raleigh Donald Trump all but announces presidential bid __HTTP__ _E_
#CelebApprentice With three wonderful but fired contestants __HTTP__ _E_
The Supreme Art of war is to subdue the enemy without fighting. Sun Tzu _E_
I don't believe the Democrats really want to see a deal on DACA. They are all talk and no action. This is the time but day by day they are blowing the one great opportunity they have. Too bad! _E_
Never confuse a single defeat with a final defeat. ― F. Scott Fitzgerald _E_
.@HollySandersGC. Remember it was Martin K who sank the big ten footer to win the Ryder Cup. He can handle the pressure! _E_
Strange but I see wacko Bernie Sanders allies coming over to me because I'm lowering taxes while he will double & triple them a disaster! _E_
Lance Armstrong's liability & lawsuits against him have just increased tenfold—his lawyers will be very happy—lots of fees! _E_
There is only one fix for ObamaCare REPEAL & REPLACE with a free market oriented alternative! _E_
I will be making a major announcement tomorrow (Thursday February 2) at 12:30 pm at Trump International Hotel & (cont) __HTTP__ _E_
Jerry Falwell of Liberty University was fantastic on @foxandfriends. The Fake News should listen to what he had to say. Thanks Jerry! _E_
Thinking small when you could think big limits you in all aspects of your life. _E_
In the end you're measured not by how much you undertake but by what you finally accomplish. _E_
A great guy (with great ratings)! __HTTP__ _E_
Today I filed my Statement of Candidacy with the FEC. Let's #MakeAmericaGreatAgain __HTTP__ _E_
The boardroom and @WrestleMania I'm watching great entertainment tonight! #CelebApprentice _E_
Roadway steel on beautiful Verrazano Narrows Bridge is rusting and rotting away. Scrape and paint before too late. _E_
Thank you John Nolte for wonderful analysis & reporting. _E_
Via @haaretzcom: "Donald Trump calls Obama Israel's greatest enemy" __HTTP__ _E_
I am so happy that people are boycotting Macy's __HTTP__ _E_
It's 4.35 a.m. and I am working on a very exciting (and hopefully very good) deal a major resort. THE HARDER I WORK THE LUCKIER I GET! _E_
Entrepreneurs: See each day as an opportunity to show what you can do at the highest level. Take responsibility for yourself! _E_
I am landing shortly. Can't wait to be with our GREAT MILITARY. See you soon! __HTTP__ _E_
Even though I am not mandated by law to do so I will be leaving my busineses before January 20th so that I can focus full time on the...... _E_
Iran is closing the Strait of Hormuz for a military exercise. Imagine what they will do with nukes?! _E_
From Donald Trump: Andrea Bocelli @ Mar a Lago Many say best night of entertainment in long history of Palm Beach __HTTP__ _E_
The immigration crisis is a horrible mess made worse by an incompetent president who doesn't have a clue. We need new leadership FAST! _E_
Because of the hurricane I am extending my 5 million dollar offer for President Obama's favorite charity until 12PM on Thursday. _E_
When times are difficult you must be even more focused. That's when you will find profitable opportunities. _E_
No surprise. @BarackObama is letting the Muslim Brotherhood in Egypt default on their US loans __HTTP__ Big mistake! _E_
Via @BreitbartNews @biggovt by @mboyle1: "EXCLUSIVE: NEVER AIRED 'APPRENTICE' PARODY OF TRUMP FIRING OBAMA" __HTTP__ _E_
A great night in Fayetteville North Carolina. Thank you! #ICYMI watch here: __HTTP__ __HTTP__ _E_
...that it was hard not to end up rooting for Trump... _E_
I don't know why the @yankees keep paying A Rod—they have a perfect out. _E_
Additionally two executives @VattenfallGroup are under major investigation & they are unable to get the many permits necessary. _E_
Lightweight choker Marco Rubio looks like a little boy on stage. Not presidential material! _E_
Wow Putin is really taking advantage of President Obama. It is important that Obama responds with strength and determination be smart cool! _E_
Off shore windfarms being abandoned in droves throughout world—too expensive to build & operate—don't work. (cont) __HTTP__ _E_
Keep an eye on Anthony Weiner. Weasels are hard to get rid of. _E_
This is such a special time to be in New York City. No better city in the world to celebrate Christmas! _E_
Obamacare continues to fail. Humana to pull out in 2018. Will repeal replace & save healthcare for ALL Americans. __HTTP__ _E_
With the debt limit approaching @GOP has even more leverage. If they stay united and on message they can win. _E_
Thank you Greensboro North Carolina! Will be back soon! #AmericaFirst __HTTP__ _E_
The difference between a successful person and others is not a lack of strength not a lack of knowledge but (cont) __HTTP__ _E_
In interview I told @AP that my taxes are under routine audit and I would release my tax returns when audit is complete not after election! _E_
Pat Caddell on Neil Cavuto tonight: I've watched Donald Trump take on the issues of energy and he ties it to (cont) __HTTP__ _E_
Watch my appearance on @Letterman from last night __HTTP__ _E_
Via @DMRegister: "@ShawnJohnson returns to reality TV with Donald Trump" __HTTP__ _E_
John McCain had a really hard time with his town hall meeting on immigration. They really went after him! _E_
We must restore the entrepreneurial spirit of our country. A small business boom. Let's Make America Great Again! __HTTP__ _E_
So now tha Matt Lauer is gone when will the Fake News practitioners at NBC be terminating the contract of Phil Griffin? And will they terminate low ratings Joe Scarborough based on the "unsolved mystery" that took place in Florida years ago? Investigate! _E_
Today we remember the courage and bravery of our troops that stormed the beaches of Normandy 73 years ago. #DDay... __HTTP__ _E_
RT @foxandfriends: Never give up....that's the worst thing you could do. There's always a chance. Kyle Coddington's message to those als... _E_
We have to combat the welfare mentality that says individuals are entitled to live off taxpayers. (cont) __HTTP__ _E_
#MakeAmericaGreatAgain#Trump2016  __HTTP__ _E_
Trump Int'l Hotel & Tower Chicago is one of very few hotels in No. America w/ a 5 Star 5 Diamond Hotel & a 5 Star 5 Diamond Restaurant... _E_
Addressing record crowd @ Madison County Iowa GOP Dinner. We can bring common sense to DC & Make America Great Again! __HTTP__ _E_
My job as President is to do everything within my power to give America a level playing field. #AmericaFirst... __HTTP__ _E_
Polls are starting to look really bad for Obama. Looks like he'll have to start a war or major conflict to win. Don't put it past him! _E_
Entrepreneurs: Believe in yourself. If you don't no one else will either. Realize that becoming an entrepreneur is not a group effort. _E_
Weekly jobless claims are now at an astronomical 365000. Manufacturing sector is suffering badly. We must do better. __HTTP__ _E_
Has the media picked up the new Zogby poll that was just put out? I doubt it! __HTTP__ _E_
Who's the flip flopper? @MittRomney has never flip flopped on gay marriage. _E_
Amazing five days developments in Aberdeen Turnberry (Scotland) and Ireland are fantastic the best anywhere in the WORLD. A lot of fun! _E_
.@TigerWoods has made a truly great comeback he is number one again! Give him credit comebacks are tough to do. Way to go Tiger. _E_
The Maryland Democrat Party attacked me with a racist flyer. @Hogan4Governor won 2nd GOP governor elected in 40 years. _E_
Big crowd expected today in Pensacola Florida for a Make America Great Again speech. We have done so much in so short a period of time...and yet are planning to do so much more! See you there! _E_
So lets get this right. Steve Jobs dies and leaves his wife everything billions of dollars. Now his wife has a boyfriend (lover). Oh Steve! _E_
GRETA IN A FEW MINUTES on Fox. _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
The new @BarackObama motto You Own Nothing Not Even Your Own Success ... _E_
The great people of New Hampshire who I love are not properly served by the dying Union Leader newspaper. _E_
CRIPPLED AMERICA is perfect gift for friends & family. Order signed copy & join me tonite live streaming 7:30 __HTTP__ _E_
.@dallasmavs is 1 12 against the Western Conference's top four seeds after Sunday's loss & @okcthunder swept the season series. _E_
Worst graphics and stage backdrop ever at the Oscars. Show is terrible really BORING! _E_
My @WMUR9 Commitment 2016 Conversation with @JoshMcElveen discussing leadership China healthcare & veterans __HTTP__ _E_
A Rod is just not making it. We want to give him a chance but it was only drugs that made him great. _E_
...addresses of any mentioned person who is still living. I am doing this for reasons of full disclosure transparency and... _E_
I am convinced that sleepy eyes Chuck Todd was only a placeholder for someone else at Meet the Press. He bombed franchise in ruins! @nbc _E_
Excited to announce that @GiulianaRancic & @BravoAndy will be hosting the 2012 Miss Universe Pageant. Great ratings for Miss Universe. _E_
A week after Biden says that the Taliban is not our enemy the Taliban demand that we pay Iraq for a 9 year occupation. __HTTP__ _E_
.@tracegallagher and @FredTecce discussing my case on @FoxNews __HTTP__ _E_
.@MittRomney Op Ed Culture Does Matter : __HTTP__ _E_
Dow hit a new intraday all time high! I wonder whether or not the Fake News Media will so report? _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
So far the hurricane is being handled very well in NY not nearly as bad as stated on news. Let's see what happens later. _E_
The Democrats made up and pushed the Russian story as an excuse for running a terrible campaign. Big advantage in Electoral College & lost! _E_
.@krauthammer pretends to be a smart guy but if you look at his record he isn't. A dummy who is on too many Fox shows. An overrated clown! _E_
...to terrorism and airline flight safety. Humanitarian reasons plus I want Russia to greatly step up their fight against ISIS & terrorism. _E_
Networks are all wanting me to do shows—like it or not a "ratings machine"! –but time I run a really big company! _E_
Dopey @Rosie I never went bankrupt ABC already apologized to me for your stupid statement in the past they didn't want a lawsuit. _E_
Congratulations to @rushlimbaugh on his recent 26th year anniversary. Rush has revolutionized talk radio! Sorry haters and losers! _E_
Who would be stupid enough to invest in @VattenfallGroup's ill conceived windfarm when it will lose £25M yearly? _E_
Despite so many false statements and lies total and complete vindication...and WOW Comey is a leaker! _E_
.@alexsalmond RT @NOBLE74 I live in Aberdeenshire & I'm with you you have made a big difference to that bit (cont) __HTTP__ _E_
Thank you America! __HTTP__ _E_
What?! LaToya is saying Omarosa is one of the nicest people she's met? _E_
Entrepreneurs: Do your best to your utmost ability every day. Make that your standard. _E_
Glad to see that @PeteRose_14 has been hired by @FOXSports as an analyst. Pete should be around baseball and in the Hall of Fame! _E_
Excited to be keynoting @bobvanderplaats' @theFAMiLYLEADER Leadership Summit in Iowa this Saturday __HTTP__ _E_
Shocker: study reveals that @msnbc is completely biased while @FoxNews is factual __HTTP__ What a surprise! _E_
We need a strong leader and fast! __HTTP__ _E_
The House's failure to pass the Balance Budget Amendment was another unforced error by the GOP. Very disappointing. _E_
Great job tonight on @FoxNews Tony. I am with you all the way! Make America Great Again @tperkins _E_
....earth shattering. He and his brother could Drain The Swamp which would be yet another campaign promise fulfilled. Fake News weak! _E_
The Benghazi terrorist is getting speedier care than our Vets at the VA. Obama has his priorities. _E_
To the African American community: The Democrats have failed you for fifty years high crime poor schools no jobs. I will fix it VOTE T _E_
Must read article via @fitsnews: DONALD TRUMP VERSUS MEXICO __HTTP__ _E_
Trump Int'l Golf Links & Hotel Ireland fronts the Atlantic Ocean & is host to the 2014 Great Irish Links Challenge __HTTP__ _E_
Another great shot from the beginning of construction at @DoralResort. __HTTP__ _E_
In moments like thiswe are all just Americans. I join with the President religious and civic leaders and encourage all to pray today. _E_
If victorious Republicans will be having a big press conference at the beautiful Rose Garden of the White House immediately after vote! _E_
Unbelievable evening in New Hampshire THANK YOU! Flying to Grand Rapids Michigan now. Watch NH rally here:... __HTTP__ _E_
...in order to put any and all conspiracy theories to rest. _E_
Thank you Newt! __HTTP__ _E_
Is @billmaher the dumbest man on television?—I think so. _E_
I will be on the @todayshow tomorrow morning to make a major announcement about a television show. Stay tuned! _E_
My H 1B reform plan will transform program so it delivers for country not lobbyists & will have bipartisan support: __HTTP__ _E_
I am sure the Chinese are getting anxious. They watch the polls. @MittRomney won't let them cheat us anymore. _E_
I am very proud of @IvankaTrump for her work with @Cookies4kids. @Cookies4kids is a great cause helping children __HTTP__ _E_
RT @GeraldoRivera: #NewYork tromps #Jonas. Day after storm of the century the big city is up and running unlike others in the northeast. Mu... _E_
My interview with Greta last night on Fox News Nation Has Become All Talk No Action' __HTTP__ _E_
Now there is talk of A Rod being shipped to @Marlins. If A Rod is not a @yankee next year the fans will be happy. _E_
LETS MAKE AMERICA GREAT AGAIN!Schedule & tickets: __HTTP__ __HTTP__ _E_
I was never a fan of Bush in fact he was so bad he gave us Obama! But Obama is truly a pathetic excuse of a president can't get any worse _E_
Crowd was amazing tonight at Trump National Doral in Miami. Love and excitement in the ballroom. Tomorrow at noon in Jacksonville! _E_
Good luck @RoccoMediate and nice hat! __HTTP__ _E_
.@CharlesHurt You were great on @seanhannity last night. Thanks for the nice words. MAKE AMERICA GREAT AGAIN! _E_
Our politically correct country will read the ISIS terrorists who beheaded the reporter their Miranda Rights prior to good food & care! _E_
....came to the campaign. Few people knew the young low level volunteer named George who has already proven to be a liar. Check the DEMS! _E_
.@CNN Kayleigh McEnany was great on you network today. You should have her on more often! Thank you Kayleigh for your nice words. _E_
Everyone is starting to feel the new tax hikes. You get what you vote for! _E_
President Obama is finally getting hammered even by his most loyal supporters and the press I guess they can only take so much! _E_
.@VanityFair looks like a dying magazine! Really really boring really really thin! _E_
The new President of OPEC is Mahmoud Ahmadinejad's confidant Rostam Ghasemi a commander of the Revolutionar... (cont) __HTTP__ _E_
"Courage is contagious. When a brave man takes a stand the spines of others are often stiffened." – Rev. @BillyGraham _E_
Republicans remember—debt ceiling debt ceiling debt ceiling—be smart and you will win! _E_
Just be tough be strong be willing to learn – and you will learn. Don't be afraid of mistakes or setbacks. Think Like a Champion _E_
.@BreitbartNews continues to do great work in exposing the left wing financing behind amnesty __HTTP__ _E_
The best deals are good for everyone which creates a win win situation. Negotiation is persuasion more than power. _E_
#VoteTrumpID! #Trump2016 __HTTP__ _E_
One of the dumbest political pundits on television is Chris Stirewalt of @FoxNews. Wrong facts check Fox debate rankings Trump #1. Dope! _E_
Thank you Michigan!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Unemployment has been over 8% for a record 40 straight months. @MittRomney's election will end the @BarackObama downturn. _E_
The party of the year in Palm Beach was the New Year's Eve celebration at the Mar a Lago Club it was amazing. __HTTP__ _E_
"@ApprenticeNBC: Donald Trump Talks Joan Rivers" __HTTP__ via @TVGrapevine by @TVG_Sammi _E_
Celebrate Thanksgiving @TrumpNewYork with exclusive viewing access to the 88th Annual @Macys Thanksgiving Parade® __HTTP__ _E_
South African Tourism North America will unveil its new ad campaign "What's Your BIG 5?" on All Star @ApprenticeNBC this Sunday. _E_
Mitt Romneywho totally blew an election that should have been won and whose tax returns made him look like a fool is now playing tough guy _E_
Yesterday I explained to @wolfblitzercnn on @CNNSitRoom why @BarackObama doesn't deserve credit for killing Bin Laden __HTTP__ _E_
Democrat Dianne Feinstein should never have released secret committee testimony to the public without authorization. Very disrespectful to committee members and possibly illegal. She blamed her poor decision on the fact she had a cold a first! _E_
Sanctions were not discussed at my meeting with President Putin. Nothing will be done until the Ukrainian & Syrian problems are solved! _E_
China has announced it is "fully prepared" for a currency war __HTTP__ Outrageous they have no fear of our leaders. _E_
Many people have asked recently when do you sleep? The answer is not much. _E_
Via @examinercom: The Miss Universe contestants glow with elegance during the Trump Holiday Party __HTTP__ _E_
Excellent preliminary meeting in Oval with @SenSchumer working on solutions for Security and our great Military together with @SenateMajLdr McConnell and @SpeakerRyan. Making progress four week extension would be best! _E_
ICYMI Via @nypost by Post Editorial Board: "@TrumpFerryPoint: New Gem in the Bronx" __HTTP__ _E_
The people of Tennessee yesterday were amazing. Thank you! _E_
The five fingers represent the five key factors every entrepreneur dreaming of success must master. (cont) __HTTP__ _E_
Great progress on healthcare. Improvements being made Republicans coming together! _E_
If you never want to be criticized for goodness' sake don't do anything new. Jeff Bezos _E_
Are people really afraid of @OMAROSA Would you be? #CelebApprentice _E_
We're stuck with the worst mayor in the United States. Too bad but New York City will survive! _E_
Join me in Denver Colorado tomorrow at 9:30pm!Tickets: __HTTP__ _E_
Via@politicalwire: Trump Offers to Fund White House Tours __HTTP__ _E_
Former winner @bretmichaels returns to All Star @ApprenticeNBC March 3rd on @NBC. Bret shows once again why he is a champion! _E_
I will be interviewed on This Week with George S this morning. Enjoy! _E_
It was an honor to stop by a #SchoolChoice event hosted by @VP Pence and @usedgov Secretary @BetsyDeVosED at the... __HTTP__ _E_
Gas prices are about to hit a record high during the Labor Day weekend. @BarackObama could have stopped this. _E_
Congratulations @TrumpSoHo for making @CNTraveler's #GoldList2015! __HTTP__ _E_
Your civil liberties mean nothing if you're dead. That's why the single most important function of the federal (cont) __HTTP__ _E_
.@ChuckTodd just informed us that my interview last week on @MeetthePress was their highest rated show in 4 years. Congrats! _E_
I'll be honored at the Family Business Dynasties Gala in NYC on December 5th. It will be a great event for a great cause. _E_
"I believe in spending what you have to. But I also believe in not spending more than you should." The Art of the Deal _E_
Thank you Illinois! #Trump2016 __HTTP__ _E_
"The idea flow from the human spirit is absolutely unlimited. All you have to do is tap into that well." @jack_welch _E_
The very dishonest @NBCNews refuses to accept the fact that I have forgiven my $50 million loan to my campaign. Done deal! _E_
Jeb's big ad buy against me paid for by lobbyists shows my face but doesn't have me answering Jeb's statements. He is really pathetic! _E_
Wind farms are now being paid to shut down __HTTP__ A complete waste. _E_
Mark your calenders for August 23rd: __HTTP__ _E_
THANK YOU ALABAMA! 32000 supporters tonight. Get out & VOTE on Tuesday! WE WILL MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Argyll grandmother takes UK and EU to the United Nations over plans to turn Scotland into windfarm 'hedgehog' __HTTP__ _E_
We just passed 1.9M followers & gained over 250000 followers in the last month.Thank you let's have fun and do business. _E_
RT @IvankaTrump: Ivanka penned an Op Ed that ran in the @WSJ this afternoon read it here. __HTTP__ @realDonaldTrump __HTTP__ _E_
V.P.....really! __HTTP__ _E_
Crooked Hillary said her husband is going to be in charge of the economy.If so he should runnot her.Will he bring the energizer to D.C.? _E_
According to @BarackObama the War on Terror is over __HTTP__ global warming is a national (cont) __HTTP__ _E_
It was announced this morning that unemployment rose this can't be good for Obama. _E_
Success is having to worry about every damn thing in the world except money. Johnny Cash _E_
In all my years in business and participating in politics I've never seen the country as divided as it is right (cont) __HTTP__ _E_
Happy belated birthday to Chris Wallace! Chris does a great job every week on @FoxNewsSunday. Like father like son. _E_
RT @foxandfriends: Anthem announces it will withdraw from ObamaCare Exchange in Nevada __HTTP__ _E_
"I also plan to keep making deals big deals and right around the clock." – The Art of The Deal _E_
Tickets for future debates should be put out to the general public instead of being given to the lobbyists & special interests the bosses! _E_
I would like to thank a great writer and person @JPappasPR. of REAL ESTATE WEEKLY for the wonderful story on me. Very much appreciated! _E_
Can't watch Crazy Megyn anymore. Talks about me at 43% but never mentions that there are four people in race. With two people big & over! _E_
Congratulations to my friend @TheSlyStallone on winning a #GoldenGlobe. A wonderful guy who has created something special well deserved! _E_
Keep hearing about tiny amount of money spent on Facebook ads. What about the billions of dollars of Fake News on CNN ABC NBC & CBS? _E_
Major story that the Dems are making up phony polls in order to suppress the the Trump . We are going to WIN! _E_
We cannot let the failing REPUBLICAN ESTABLISHMENT who could not stop Obama (twice) ruin the MOVEMENT with millions of $'s in false ads! _E_
Little @MacMiller I want the money not the plaque you gave me! _E_
46 stories above downtown New York @TrumpSoHo features loft inspired interiors designed by Fendi Casa __HTTP__ _E_
General John Kelly is doing a fantastic job as Chief of Staff. There is tremendous spirit and talent in the W.H. Don't believe the Fake News _E_
Again for all of the haters and losers I have NOTHING to do with Atlantic City got out a long time ago! _E_
Obama once said he "would be ignoring the law" by granting amnesty through executive action. Now he's about to do it. What will Congress do? _E_
Amazing rally in Reno Nevada thank you. Make sure you get out on 11/8 & VOTE #TrumpPence16. Together we will put... __HTTP__ _E_
LA Times USC Dornsife Sunday Poll: Donald Trump Retains 2 Point Lead Over Hillary: __HTTP__ _E_
Funny that the Democrats would have their convention in Pennsylvania where her husband and her killed so many jobs. I will bring jobs back! _E_
Wrong Policy: @BarackObama wants to cut defense spending by $487B while China is building their navy in the Pacific. __HTTP__ _E_
Lightweight NYS Attorney General Eric Schneiderman is trying to extort me with a civil law suit. See website __HTTP__ _E_
"Donald Trump—The Disrupter" will air on @FoxNews Saturday night and Sunday night at 8 PM ET. Anchored by @BretBaier. @johnrobertsFox _E_
Via @CNNPolitics by @mj_lee: Father of murder victim to introduce Trump in Phoenix __HTTP__ _E_
I am thinking about changing the name #FakeNews CNN to #FraudNewsCNN! _E_
Just to show you how dishonest certain reporters are here is my @foxandfriends interview __HTTP__ _E_
.@Mitt Romney strongly stated in one of the debates with Pres. OBAMA that Russia is the big problem. Obama scoffed. Mitt was 100% correct! _E_
Wind turbines are totally destroying the areas in which they are located—all for unreliable bad & expensive energy! _E_
Via @Newsmax_Media: Trump Says He'll Foot Bill for White House Tours __HTTP__ _E_
I am very excited about hosting @MittRomney today for a fundraiser. Looking forward to seeing @newtgingrich and many other friends. _E_
For a country like China being able to steal our military designs represents hundreds of billions in savings (cont) __HTTP__ _E_
Being true to yourself equals being true to your brand. That's the solid foundation that stands the test of time. Midas Touch _E_
Obama told the UN that "the world is more stable than it was 5 years ago." Is he delusional? _E_
RT @atensnut: How many times must it be said? Actions speak louder than words. DT said bad things!HRC threatened me after BC raped me. _E_
.@BlairKamin Sorry sucker as usual you lose again. You couldn't work for me for 10 seconds. Bad critic great sign. __HTTP__ _E_
I am increasingly concerned with the UN's ploy against @Israel this coming week and will monitor all events closely from Australia. _E_
Criminal deportations in the U.S. are the lowest number in many years. We are letting criminals knowingly stay in our country. MUST CHANGE! _E_
See June 2007 speech is Obama a total racist? _E_
WEEKLY ADDRESS __HTTP__ _E_
All Star Celebrity @ApprenticeNBC is down to the five final contestants. Getting fired now is when it really hurts! _E_
I will be interviewed on @Morning_Joe at 6:15 A.M. Enjoy! _E_
Self determination is the sacred right of all free people's and the people of the UK have exercised that right for all the world to see. _E_
Please don't pay attention to all of those phony tweets that mention my twitter handle relative to "diet" it is a total scam. _E_
Why is this reporter touching me as I leave news conference? What is in her hand?? __HTTP__ _E_
As it has turned out James Comey lied and leaked and totally protected Hillary Clinton. He was the best thing that ever happened to her! _E_
Obama's coal regulations will destroy the coal industry put Americans out of work raise electricity prices & lead to blackouts. _E_
.@DannyZuker Danny You're a total loser! _E_
Getting ready to leave for Washington D.C. The journey begins and I will be working and fighting very hard to make it a great journey for.. _E_
.@mcuban Letterman @Late_Show had his best ratings with me and you bombed. People don't care about Mark Cuban. _E_
A great honor to host and welcome leaders from around America to the @WhiteHouse Infrastructure Summit.... __HTTP__ _E_
RT @TravelGov: Continue to notify us of US citizens overseas impacted by #HurricaneIrma & #HurricaneJose. __HTTP__ __HTTP__ _E_
Terrorists are engaged in a war against civilization it is up to all who value life to confront & defeat this evil __HTTP__ _E_
Having a truly great imagination is often far more important than having even massive knowledge but still never underestimate knowledge! _E_
"Our country is the greatest force for freedom the world has ever known. We have big hearts big brains and (cont) __HTTP__ _E_
Bottom line I don't think President changed people's minds must hope for a lifeline from Putin a very dangerous lifeline at that! _E_
History lesson: There's a big difference between Hillary Clinton and Abraham Lincoln. For one his nickname is Hone... __HTTP__ _E_
.@MittRomney did a great job last night. Watch the clip! __HTTP__ _E_
Congratulations to @DLoesch on the release of her great new book #HandsOffMyGun! Check out @TheBlazeBooks excerpt __HTTP__ _E_
Anthony Hopkins is a truly great actor I love everything he does! _E_
Watch my latest video blog.... __HTTP__ _E_
Why doesn't @FoxNews quote the new Iowa @CNN Poll where I have a 33% to 20% lead over Ted Cruz and all others. Think about it! _E_
LIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_
The more you learn about the debt deal the worse it gets. _E_
Dummies @Deadspin had their big payday taken from them by others in the media. _E_
I told you in speeches months ago that Jeb and Marco do not like each other. Marco is too ambitious and very disloyal to Jeb as his mentor! _E_
Once the tragic mistake of going into Iraq was made we should have at least taken the oil (or at least some of it). Now Iran & China get it _E_
Remember what I said about @BarackObama attacking Iran before the election I hope the Iranians are not so (cont) __HTTP__ _E_
Nation's infrastructure is collapsing MAKE AMERICA GREAT AGAIN! _E_
RT @TeamTrump: #RattledHillary wants to talk about her 30 years in service. How about her 30 years of FLOPSFLOPS?! #BigLeagueTruth #Debat... _E_
Join me live in Reno Nevada! __HTTP__ __HTTP__ _E_
Thank you Warwick Rhode Island!#RIPrimary #VoteTrump __HTTP__ __HTTP__ _E_
Want to know why China is growing? They can build the world's tallest building in 90 days __HTTP__ Red tape would kill it here. _E_
Thank you Nevada! #AmericaFirst#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
If Cory Booker is the future of the Democratic Party they have no future! I know more about Cory than he knows about himself. _E_
of jobs and companies lost. If Mexico is unwilling to pay for the badly needed wall then it would be better to cancel the upcoming meeting. _E_
If Sheena Monnin apologized for her mistake as she should have I would have treated her very nicely. _E_
Egypt's Muslim Brotherhood President is visiting us next month. @BarackObama is so excited. _E_
RT @Scavino45: .@POTUS @realDonaldTrump @IvankaTrump Jared Kushner & Dina Powell in the Oval Office today w/ Aya & her brother Basel.#W... _E_
Politicians are ALL TALK NO ACTION! just look at our country. _E_
President Obama should bring Secretaty Sebelius into his office look right into her beautiful blue eyes and saywith emotion YOU'RE FIRED! _E_
As promised our campaign against the MS 13 gang continues. @ICEgov Busts 39 MS 13 Members in New York Operation __HTTP__ _E_
Thank you Georgia! I had a great afternoon with all of you! I will be back soon. #MakeAmerciaGreatAgain __HTTP__ _E_
RT @Heritage: We had a special visitor yesterday. @IvankaTrump thank you for meeting with @KayColesJames spending time with our team and... _E_
How can any Senator vote for Hagel as Sec. of Defense after that horrific hearing? He is not up for the job but will probably get it. _E_
.@TrumpGolfLA is proud to be hosting @PGAGrandSlam where all 4 Major Champions will square off. October 2015. __HTTP__ _E_
Had a great time yesterday on @theviewtv with @WhoopiGoldberg @JennyMcCarthy @SherriEShepherd & guest host @MrJerryOC! _E_
Iran is playing with fire they don't appreciate how kind President Obama was to them. Not me! _E_
A good question for would be entrepreneurs to ask themselves: What am I pretending not to see? There are a lot of opportunities out there. _E_
I hope all of the many thousands of people who are asking me to give up so much and RUN FOR PRESIDENT will fight hard for victory if I do! _E_
Republican Senate must get rid of 60 vote NOW! It is killing the R Party allows 8 Dems to control country. 200 Bills sit in Senate. A JOKE! _E_
Wow looks like James Comey exonerated Hillary Clinton long before the investigation was over...and so much more. A rigged system! _E_
The entrepreneur builds an enterprise the technician builds a job. Michael Gerber _E_
Many people still out of power in Staten Island. Absolutely ridiculous. Why can't they get service? _E_
I'll be on Greta Van Susteren's show tonight at 10 PM on FoxNews. Tune in. _E_
CORRUPTION CONFIRMED: FBI confirms State Dept. offered 'quid pro quo' to cover up classified emails __HTTP__ _E_
Thank you Tallahassee Florida! A beautiful evening with the MOVEMENT! Get out & VOTE!#ICYMI watch here: __HTTP__ _E_
When you're in a fight with a bully always throw the first punch—and don't telegraph it—hit hard & hit fast! _E_
...subject to the fact that if we do not reach a fair deal for all we will then terminate NAFTA. Relationships are good deal very possible! _E_
The Islamists have won. Just as I predicted the Muslim Brotherhood has taken over Egypt. @BarackObama never should have abandoned Mubarek. _E_
If Syria was forced to use Obamacare they would self destruct without a shot being fired. Obama should sell them that idea! _E_
A segment from last night's @piersmorgan interview discussing @CoryBooker and fighting fire with fire in a campaign __HTTP__ _E_
#TrumpVlog Free our Marine! __HTTP__ _E_
Via Breitbart Riding High in Polls Donald Trump Storms the American South to Overflow Crowds in Georgia __HTTP__ _E_
The economy is expected to slow down once again at the end of the year __HTTP__ The price of gas has to be lowered. _E_
A state legislator w/ a true record of accomplishments military vet @joniernst will make a tremendous US Senator. Iowa send Joni to DC! _E_
Nobody but Donald Trump will save Israel. You are wasting your time with these politicians and political clowns. Best! #SheldonAdelson _E_
Wow—Family Feud said I am the third most envied man in America. I respectfully disagree—I am very modest. _E_
While millions are being spent against me in attack ads they are paid for by the "bosses" and "owners" of candidates. I am self funding. _E_
"Donald Trump was proven right on another one of his top issues Thursday: 'gun free zones' at military bases." __HTTP__ _E_
More than a century after conquering flight the #WrightBrothers continue to motivate & inspire Americans who never tire of exploration & innovation. This GREAT AMERICAN SPIRIT can be found in the design of every new supersonic jet and next generation: __HTTP__ __HTTP__ _E_
The reason you don't generally hit runways is that they are easy and inexpensive to quickly fix (fill in and top)! _E_
Here I am with @RodStewart at Mar a Lago. __HTTP__ _E_
I look forward to working w/ D's + R's in Congress to address immigration reform in a way that puts hardworking citizens of our country 1st. _E_
Ben Carson was speaking in general terms as to what he would do if confronted with a gunman and was not criticizing the victims. Not fair! _E_
China has a backdoor into the Trans Pacific Partnership. This deal does not address currency manipulation. China is laughing at us. _E_
We are one nation. When one hurts we all hurt. We must all work together to lift each other up.#StandWithLouisiana __HTTP__ _E_
Check out the #trumpvlog to see the answers to your questions... __HTTP__ _E_
I will be going to Asheville North Carolina tonight for the 95th birthday party of the GREAT Billy Graham such a wonderful man! _E_
.And to think that just last week he was lecturing anyone who would listen about sexual harassment and respect for women. Lesley Stahl tape? _E_
Looks like I was right about NATO. I had no doubt. __HTTP__ _E_
President Obama just fired the ObamaCare website builder. My question is why were they hired in the first place? Sue them for damages! _E_
.@somelikeitlar hope you enjoyed the premiere of All Star Celebrity @ApprenticeNBC. Make sure @marklevinshow watches! _E_
Just started building one of the great hotels of the World in Washington D.C. the site of the Old Post Office. Will be amazing JOBS! _E_
If Obama goes after Mitt's private sector experience in the next debate then Mitt should ask for Obama's college records all of them. _E_
Via @postandcourier by @skropf47: "Donald Trump: Don't politicize Walter Scott shooting" __HTTP__ _E_
.@KarlRove still thinks Romney won! He doesn't have a clue! @FoxNews _E_
What an evening in Las Vegas Nevada! THANK YOU for your continued support. #Trump2016 __HTTP__ __HTTP__ _E_
Big vote tomorrow in the House. Tax cuts are getting close! _E_
RT @MELANIATRUMP "@ApprenticeNBC: Her beauty lives 5000 miles past Heaven. __HTTP__ " Thank u @THEGaryBusey! _E_
Everybody wants me to talk about Robert Pattinson and not Brian Williams—I guess people just don't care about Brian! _E_
I will be doing @SquawkCNBC at 7:30. _E_
Work often becomes problem solving. Problems come with the territory and they should never surprise you. Think Like a Champion _E_
Crazy @megynkelly supposedly had lyin' Ted Cruz on her show last night. Ted is desperate and his lying is getting worse. Ted can't win! _E_
Good night everyone sleep well and tomorrow have many victories! _E_
That was some episode last week we've got a great cast! _E_
Via @LINKSMagazine: "Only The Donald __HTTP__ _E_
Standing strong for his people @GovWalker is ignoring the Feds and keeping all Wisconsin parks open. Great! _E_
Amazing Race winning an Emmy again is a total joke. The Emmys have no credibility no wonder the ratings are at record lows. _E_
#AmericasMerkel __HTTP__ _E_
Dangerous. While Obama is cutting down our military China has announced plans to build more aircraft carriers __HTTP__ _E_
Congratulations to @MittRomney for an impressive win in Florida. He performed well under pressure. _E_
My supporters are the best! $18 million from hard working people who KNOW what we can be again! Shatter the record: __HTTP__ _E_
I have been very consistent and always said that Iraq would fall as soon as the U.S. left. What a terrible waste of lives and money! _E_
Make no mistake Fast and Furious goes ALL the way to the White House. _E_
The main stream media wants to surrender constitutional rights I believe #ISIS needs to surrender! _E_
The Senate Democrats have only confirmed 48 of 197 Presidential Nominees. They can't win so all they do is slow things down & obstruct! _E_
If Obama worked as hard on straightening out our country as he has trying to protect and elect Hillary we would all be much better off! _E_
Direct foreign investments continue to flow into China at over $100B a year __HTTP__ That's money that could be spent here. _E_
With Irma and Harvey devastation Tax Cuts and Tax Reform is needed more than ever before. Go Congress go! _E_
Goofy Elizabeth Warren didn't have the guts to run for POTUS. Her phony Native American heritage stops that and VP cold. _E_
It just shows everyone how broken and unfair our Court System is when the opposing side in a case (such as DACA) always runs to the 9th Circuit and almost always wins before being reversed by higher courts. _E_
Looks like the line has started be sure to join me for book signing @TImeToGetTough starting at 11am to 2 pm here in Trump Tower. _E_
LYIN' TED __HTTP__ _E_
I rarely agree with President Obama however he is 100% correct about Crooked Hillary Clinton. Great ad! __HTTP__ _E_
'Donald Trump: A President for All Americans' __HTTP__ _E_
My friend @eminofficial was fantastic on the @TODAYshow this morning—a star! _E_
.@LilJon once again made it to the Final Four. A true talent and great friend to #CelebApprentice @ApprenticeNBC. Great job! _E_
If Obama was willing to lie about ObamaCare then what else has he lied to us about... _E_
.@CNN is so negative it is impossible to watch. Terrible panel angry haters. Bill O @oreillyfactor said such an amazing thing about me! _E_
Hillary could lose to Trump in Democratic New York #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Re: CPAC "The crowd in the main room filled to capacity by the end of Trump's address something his operative said he planned to do... _E_
Join me live in Cincinnati Ohio!#TrumpRally #MAGA __HTTP__ _E_
Republicans Senators are working hard to pass the biggest Tax Cuts in the history of our Country. The Bill is getting better and better. This is a once in a generation chance. Obstructionist Dems trying to block because they think it is too good and will not be given the credit! _E_
Have your own vision & stick with it. Don't be afraid to be unique.Every day is an opportunity to show what you can do at the highest level. _E_
#TBT With Steven Spielberg in the old days a great guy! __HTTP__ _E_
It's Thursday. How much did OPEC steal from all of us today? _E_
The Fake News refuses to talk about how Big and how Strong our BASE is. They show Fake Polls just like they report Fake News. Despite only negative reporting we are doing well nobody is going to beat us. MAKE AMERICA GREAT AGAIN! _E_
To the @BarackObama administration saving money isn't the point expanding government and spending more (cont) __HTTP__ _E_
Via @ArabianBusiness: "Trump eyes PGA tour for Dubai golf course" __HTTP__ _E_
A great book for your reading enjoyment: REASONS TO VOTE FOR DEMOCRATS by Michael J. Knowles. _E_
Fake @NBCNews made up a story that I wanted a tenfold increase in our U.S. nuclear arsenal. Pure fiction made up to demean. NBC = CNN! _E_
Someone should ask @BarackObama in today's press conference how he accumulated more debt in 3 years than the first 42 presidents combined. _E_
Just landed in New York a one night stay in Scotland. Turnberry came out magnificently. My son Eric did a great job under budget! _E_
Congratulations to @TrumpDoral's #BlueMonster course for being named one of the 10 Toughest courses on Tour This Year __HTTP__ _E_
Even though Bernie Sanders has lost his energy and his strength I don't believe that his supporters will let Crooked Hillary off the hook! _E_
Great solidarity for our National Anthem and for our Country. Standing with locked arms is good kneeling is not acceptable. Bad ratings! _E_
Good news: Toyota and Mazda announce giant new Huntsville Alabama plant which will produce over 300000 cars and SUV's a year and employ 4000 people. Companies are coming back to the U.S. in a very big way. Congratulations Alabama! _E_
...We negotiated a ceasefire in parts of Syria which will save lives. Now it is time to move forward in working constructively with Russia! _E_
RT @FLOTUS: Thank u to Queen Fabiola University Hospital! Enjoyed creating paper flowers with amazing patients & getting a tour. #Brussels... _E_
As the days and weeks go by we see what a total mess our country (and world) is in Crooked Hillary Clinton led Obama into bad decisions! _E_
Thank you! #Trump2016 __HTTP__ _E_
Obama is under a great of pressure to perform well in the next debate. Let's see how he reacts under pressure. _E_
When people treat me badly or unfairly or try to take advantage of me my general attitude all my life has (cont) __HTTP__ _E_
RT @foxandfriends: SEN. CRUZ: It's crazy to go an August recess without having Obamacare repealed. We should work every day until it is don... _E_
Syria has been given so much time that much of the things we were going to bomb have been moved into civilian areas! A polititian's war. _E_
.@RinglingBros is retiring their elephants the circus will never be the same. _E_
'WikiLeaks Drip Drop Releases Prove One Thing: There's No Nov. 8 Deadline on Clinton's Dishonesty and Scandals' __HTTP__ _E_
Saudi Arabia should be paying the United States many billions of dollars for our defense of them. Without us gone! @AlWaleedbinT _E_
My wife @MELANIATRUMP will be joining @andersoncooper @AC360 tonight at 8pmE on @CNN. Enjoy! __HTTP__ _E_
People are so jealous of Tom Brady and the Patriots. No court could convict based on the evidence.They can't beat him on the field so this! _E_
"Here's something about Donald Trump he's got a top rated show on TV and everything he says becomes a headline." @DLoesch All true! _E_
Obama's speech in Las Vegas yesterday cost the taxpayer $520 per word and over $1.6M __HTTP__ More money borrowed from China. _E_
A $1.5B website that can only handle 50K users at a time is sad but no surprise! _E_
Ed Gillespie will turn the really bad Virginia economy #'s around and fast. Strong on crime he might even save our great statues/heritage! _E_
The booing at the NFL football game last night when the entire Dallas team dropped to its knees was loudest I have ever heard. Great anger _E_
Just out wonderful poll in North Carolina. #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
...people are now starting to recognize the amazing work that has been done by FEMA and our great Military. All buildings now inspected..... _E_
Via @myrbeachonline by @TSN_MPrabhu: Donald Trump states case for becoming POTUS at SC Tea Party convention __HTTP__ _E_
What separates the winners from the losers is how a person reacts to each new twist of fate. _E_
.@jessebwatters You did a great job hosting @oreillyfactor. Everybody loved it! Thank you for the nice words. _E_
Republicans in the Senate will NEVER win if they don't go to a 51 vote majority NOW. They look like fools and are just wasting time...... _E_
"Most of the time you will need to work hard and stay focused to get to the top – and then work even harder to stay there." Think Big _E_
Too many people on stage for debate. @RandPaul at 11th with 2% in @RealClearNews shouldn't be allowed to participate. _E_
So now it is reported that the Democrats who have excoriated Carter Page about Russia don't want him to testify. He blows away their.... _E_
"History does not long entrust the care of freedom to the weak or the timid." Dwight D. Eisenhower _E_
More than 70M people watched the Presidential Debate. A new record. See what happens when I am so prominently mentioned (just kidding)! _E_
Don't forget Benghazi. _E_
#CelebApprentice We always make sure to have great NYC locations for the task delivery. _E_
Last night in Phoenix I read the things from my statements on Charlottesville that the Fake News Media didn't cover fairly. People got it! _E_
The Federal government has $2.7T in assets & $17.5T in total liabilities plus another $4.7T in intergovernmental debt. Have a nice day. _E_
Who can figure out the true meaning of covfefe ??? Enjoy! _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Donald Trump praises @LilJon and welcomes him back to All Star @CelebApprentice __HTTP__ via @HipHopNews24x7 _E_
Hopefully all supporters and those who want to MAKE AMERICA GREAT AGAIN will go to D.C. on January 20th. It will be a GREAT SHOW! _E_
I will issue a lifetime ban against senior executive branch officials lobbying on behalf of a FOREIGN GOVERNMENT!... __HTTP__ _E_
Excited to keynote of the sold out Pottawattamie County Republican Party Lincoln/Reagan Dinner tonight! __HTTP__ Leaving now! _E_
Lance Armstrong made a really big mistake by opening up to Oprah. I'll bet he wishes he had the chance to do it over again. _E_
To have a government we can afford we need to eliminate the tremendous waste clogging the system. Almost every (cont) __HTTP__ _E_
I will be at @Macys Herald Square April 18 to sign my new fragrance #Success by Trump. First 100 customers receive a copy of my new book. _E_
Hillary just gave a disastrous news conference on the tarmac to make up for poor performance last night. She's being decimated by the media! _E_
Wind turbines threaten the migration of birds __HTTP__ Where's the outcry? _E_
Via @ScotlandNow: "Donald Trump starts £250million overhaul on @TrumpTurnberry golf resort" __HTTP__ _E_
This is Amateur Night who the hell is in charge of this production? #Oscars _E_
Great photo with @IvankaTrump and @Joan_Rivers from this week's @ApprenticeNBC __HTTP__ _E_
Via @Newsmax_Media by @OwenTew: "Trump: 'Maybe Something Miraculous Happens' and Obama Will Succeed" __HTTP__ _E_
Thank you Louisville Kentucky on my way! #MAGA __HTTP__ _E_
The Donald Goes to CPAC: TV star and hotel magnate gives his thoughts on the state of America __HTTP__ by @Kredo0 _E_
Sugar is nowhere near being a billionaire and I know he works for me! _E_
On the shores of Lake Norman @Trump_Charlotte presents the true luxury lifestyle and an elite golf course __HTTP__ _E_
RT @Scavino45: Today @POTUS @realDonaldTrump and @FLOTUS Melania visit the @USCG at the Lake Worth Inlet Station in Riviera Beach Florid... _E_
With our border not being secure Obama is giving a pathway to terrorists to enter our country. An attack is on him. _E_
Entrepreneurs who develop their Midas Touch do not work for money. They work to create or acquire assets. Focus on assets. _E_
Still waiting for an explanation about why @GiulianaRancic & @BillRancic did not name their son Donald. Unbelievable. _E_
Remarks at the United States Holocaust Memorial Museum's National Days of Remembrance. Full remarks:... __HTTP__ _E_
My @FoxNews interview from yesterday discussing my recent meetings in Trump Tower and also @GovChristie __HTTP__ _E_
Heading now for Reno Nevada for a big rally. Good poll numberd all over! _E_
It is time to bring competence to Washington. It is time get results. Let's Make America Great Again! __HTTP__ _E_
Keystone XL should be approved but more importantly we should be drilling & fracking our own resources. Would be an economic windfall. _E_
Ask China if their rapidly expanding (with our money) Navy or Armed Forces are going green They would laugh in your face! _E_
With all of the illegal acts that took place in the Clinton campaign & Obama Administration there was never a special councel appointed! _E_
I am especially grateful for the tremendous support I have received from the Evangelicals in the just out Iowa CNN poll. Thank you! _E_
Leading by 13 over Landrieu in a @FoxNews poll @BillCassidy will beat her in November. _E_
Just leaving Salt Lake City Utah fantastic crowd with no interruptions. Love Utah will be back! _E_
Man we had a great day today at Trump Tower lots of money was given to many people who really needed it good feelings and happiness! _E_
Back in D.C. big week for Tax Cuts and many other things of great importance to our Country. Senate Republicans will hopefully come through for all of us. The Tax Cut Bill is getting better and better. The end result will be great for ALL! _E_
Awarded 5 Stars by @ForbesInspector @TrumpChicago's @SixteenChicago offers Executive Chef @cheflents's new menu __HTTP__ _E_
Do your homework. Wasting other people's time due to poor planning will only leave a bad impression. Think Like a Billionaire _E_
When you can't say it or see it you can't fix it. We will MAKE AMERICA SAFE AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_
Iran continues to delay the nuclear deal while doing many bad things behind our backs. Time to WALK and double the sanctions. Stop payments! _E_
#TheRemembranceProject __HTTP__ __HTTP__ _E_
Our many loyal viewers should expect a major announcement very soon on next season's @CelebApprentice. Our fans will be pleased. _E_
"A failure or setback is not a defeat. Defeat is a state of mind. You are defeated only when you accept defeat." – Think Big _E_
If you live in a state with early voting you should be voting as soon as possible. Bring your friends and family with you. _E_
I give Secretary of State John Kerry credit for working and trying hard but he has zero negotiating ability! _E_
Just watched @NBCNightlyNews So biased inaccurate and bad point after point. Just can't get much worse although @CNN is right up there! _E_
My wife Melania will be on @Morning_Joe tomorrow morning at 8:00. Interviewed by @morningmika Enjoy! _E_
Consumer Confidence Hits Highest Level Since December 2000 Read more: __HTTP__ __HTTP__ _E_
Obama looks exhausted and beaten. He was never made or prepared for the job. Like it or not he doesn't have it _E_
We're all very happy to hear of Bret Michael's progress and send our best wishes for his full recovery. _E_
"I like thinking big. To me it's very simple: if you're going to be thinking anyway you might as well think big." – The Art of The Deal _E_
'America must decide between failed policies or fresh perspective a corrupt system or an outsider' __HTTP__ _E_
Hey @Rosie how is your recovery going? I hope you are doing well so we can start fighting again soon! _E_
The Oscar broadcast is really boring where is the glamour and beauty? _E_
#ICYMI: Weekly Address __HTTP__ __HTTP__ _E_
The misery of Obama's economic policies. US households with unemployed parent was at record high in 2011 __HTTP__ _E_
My interview yesterday with @TeamCavuto where I discuss Dick Cheney and China __HTTP__ _E_
The ratings for The View are really low. Nicole Wallace and Molly Sims are a disaster. Get new cast or just put it to sleep. Dead T.V. _E_
"Trump: 'No way' Bush Romney would win in 2016" __HTTP__ via @FoxNews by Barnini Chakraborty _E_
#FlashbackFriday With Mickey Rooney @Regis and @itstonybennett __HTTP__ _E_
Still a buyer's market but somewhat fragile. Be sure to calculate the risk of rising rates coming sooner than you think! _E_
The State of Iowa should disqualify Ted Cruz from the most recent election on the basis that he cheated a total fraud! _E_
RT @Reince: Flying to Dallas now with @realDonaldTrump...Reports of discord are pure fiction. Great events lined up all over Texas. Rs wil... _E_
Tonight in his SOTU @BarackObama won't talk about Keystone. He will continue to dissemble about his record and play class warfare. _E_
Entrepreneurs: Is the problem a blip or a catastrophe? Keep things in perspective. Learn to expect problems and keep moving forward. _E_
Obama's deal vs. Trump's deals __HTTP__ _E_
They should seriously look into the moron George Zimmerman who shot and killed the 17 year old kid Trayvon (cont) __HTTP__ _E_
Just landed in New Hampshire. Will be at the venue shortly. #FITN _E_
The Fake News Media works hard at disparaging & demeaning my use of social media because they don't want America to hear the real story! _E_
The voters the Republican Party of Virginia are excluding will doom any chance of victory. The Dems LOVE IT! Be smart and win for a change! _E_
Excited to announce Trump Rio de Janeiro our first South American @TrumpCollection hotel set to open in 2016 __HTTP__ _E_
Just cancelled my subscription to @USATODAY. Boring newspaper with no mojo must be losing a fortune. Founder (cont) __HTTP__ _E_
This Ebola patient Thomas Duncan who fraudulently entered the U.S. by signing false papers is causing havoc. If he lives prosecute! _E_
Scary now China's Development Bank is looking to buy U.S. homes and developments __HTTP__ They will own our country soon. _E_
I can't stress strongly enough that we are currently in a buyer's residential market.Try to buy directly from a bank. _E_
Thank you America! Together we will all #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_
Via @businessinsider by @BKcolin: "Donald Trump called the White House and offered to help fix the BP oil spill" __HTTP__ _E_
Resilience is part of the survival of the fittest formula make sure you remain adaptable. _E_
Will be calling the President of Egypt in a short while to discuss the tragic terrorist attack with so much loss of life. We have to get TOUGHER AND SMARTER than ever before and we will. Need the WALL need the BAN! God bless the people of Egypt. _E_
.@DanaPerino Have you released a copy of the beautiful thank you card you sent me? Would you like to see it? @ericbolling @kimguilfoyle _E_
Michigan has made great progress under Snyder Calley. @MIGOP is out early energizing the grassroots. Keep it up! #JoinMITeam _E_
The PGA tour just extended my Trump Doral contract for WGC for ten years. _E_
Watch me on the @oreillyfactor tonight at 8PM. _E_
Thank you Roanoke Virginia be back soon! #TrumpPence16 __HTTP__ __HTTP__ _E_
The $200M renovations at Trump @DoralResort are right on target. When completed the course will be as good as it gets. _E_
Like the worthless @NYDailyNews looks like @politico will be going out of business. Bad reporting no money no cred! _E_
Who would really believe I would say such a thing about a guy I truly liked James Gandolfini. Sadly very sick people use my name. _E_
Does the Fake News Media remember when Crooked Hillary Clinton as Secretary of State was begging Russia to be our friend with the misspelled reset button? Obama tried also but he had zero chemistry with Putin. _E_
I will be on @foxandfriends tomorrow morning at 7:00. Enjoy! _E_
Wisconsin has suffered a great loss of jobs and trade but if I win all of the bad things happening in the U.S. will be rapidly reversed! _E_
Getting to the point is appreciated by everyone. Here's some advice for public speaking: Be sincere be brief be seated. F.D. Roosevelt _E_
Donald Trump: Yahoo Marissa Mayer Are Right Employees Should Not Work From Home __HTTP__ via @HuffPostSmBiz _E_
I will be interviewed on @foxandfriends at 7:00 A.M. _E_
Governor Kasich whose failed campaign & debating skills have brought him way down in the polls is going to spend $2.5 million against me. _E_
I am very impressed by @dennisrodman. His return to this season's @ApprenticeNBC showed who Dennis really is which is very good. _E_
Same failing @nytimes reporter who wrote discredited women's story last week wrote another terrible story on me today will never learn! _E_
Donate Today To Help Make America Great Again! You Can Help Stop Crooked Hillary Clinton! __HTTP__ __HTTP__ _E_
The polls & momentum are trending towards @MittRomney. Don't let the hurricane change your thinking! _E_
Eliot had a terrible debate performance this morning against Scott Stringer. He can't spin his failing and contemptible public record. _E_
Meeting Former Speaker Newt Gingrich next week. On the Agenda defeating @BarackObama. _E_
RT @FoxBusiness: #StockAlert: U.S. markets since the election __HTTP__ _E_
Obama's foreign policy is a complete and total disaster the worst President we have ever had. _E_
If only @Obama was as focused on balancing the budget as he is on weakening Israel's borders then America would be on the path to solvency. _E_
After 5 SB victories since 2002 it was my honor to give Bob Kraft Coach Belichick and the players their first to... __HTTP__ _E_
#CrookedHillary's plan will add $1.15 TRILLION in new taxes. We cannot afford her! #DrainTheSwamp #Debate __HTTP__ _E_
Arizona had a 116% increase in ObamaCare premiums last year with deductibles very high. Chuck Schumer sold John McCain a bill of goods. Sad _E_
We fight to free Libya and they kill our Ambassador and other Americans. Obama's foreign policy is a joke. _E_
I told you so @politico just lost it's top person. Poor results and no money to pay him. If they were legit they would be doing far better! _E_
You pick it! #1. Anybody that says anything derogatory about @BarackObama is labeled stupid insane or (cont) __HTTP__ _E_
The Schumer Rounds Collins immigration bill would be a total catastrophe. @DHSgov says it would be "the end of immigration enforcement in America." It creates a giant amnesty (including for dangerous criminals) doesn't build the wall expands chain migration keeps the visa... _E_
Our deepest sympathies and most heartfelt prayers are with the victims of the train derailment in Washington State. We are closely monitoring the situation and coordinating with local authorities... __HTTP__ _E_
Third rate reporters Amy Chozick and Maggie Haberman of the failing @nytimes are totally in the Hillary circle of bias. Think about Bill! _E_
The commodity market is extremely fragile. Be wary of investing right now. The futures are way too dependent on the Fed. _E_
Radio interview w/ @seanhannity discussing @PhilMickels0n_ why NY must start fracking & staying in @GOP primary __HTTP__ _E_
Great comeback by Tom Brady New England! _E_
...our Great American Flag (or Country) and should stand for the National Anthem. If not YOU'RE FIRED. Find something else to do! _E_
Poll data shows that @marcorubio does by far the best in holding onto his Senate seat in Florida. Important to keep the MAJORITY. Run Marco! _E_
Our country has been unsuccessfully dealing with North Korea for 25 years giving billions of dollars & getting nothing. Policy didn't work! _E_
Tonight's episode of @ApprenticeNBC is not only the best episode ever it has a great lesson in life. Don't miss it! _E_
I am the only one that knows how to build cities pols are all talk and no action. Our cities need help and fast. They are crumbling! _E_
The election was a major setback for economy. All young entrepreneurs should be sure to calculate Obama's policies into their investments. _E_
Good article: What Happened to American Men from @Newsmax by Michael Cohen __HTTP__ _E_
In the end Andy Pettitte did not rat out his friend Roger Clemens. I like him again a lot. _E_
I will make this right for our great Vets! _E_
Join me tomorrow! #MAGA 10am Baton Rouge LA. Tickets: __HTTP__ Grand Rapids MI.Tickets: __HTTP__ _E_
Via @latimes' @LATshowtracker:"Monday's TV Highlights:@ApprenticeNBCon @nbc" __HTTP__ _E_
Wow Jeb Bush whose campaign is a total disaster had to bring in mommy to take a slap at me. Not nice! _E_
__HTTP__ _E_
RT @TeamTrump: WATCH: @realDonaldTrump on the stakes in this election #Debates2016 __HTTP__ _E_
Via @thestate by @AP: "Donald Trump: Giving 'serious thought' to presidential run" __HTTP__ _E_
Tweet me more of your questions to answer in the next video.... _E_
Via @theblaze: Donald Trump on how Rubio should have drank his water __HTTP__ _E_
Read what Donald Trump has to say about daughter Ivanka's upcoming new book The Trump Card: __HTTP__ _E_
Earlier today I spoke with @GovMattBevin of Kentucky regarding yesterday's shooting at Marshall County High School. My thoughts and prayers are with Bailey Holt Preston Cope their families and all of the wounded victims who are in recovery. We are with you! _E_
Donate Today To Help Make America Great Again! You Can Help Stop Crooked Hillary Clinton! __HTTP__ __HTTP__ _E_
Pervert alert–serial sexter @RepWeiner is making another step towards a comeback __HTTP__ All girls under 18 should block him. _E_
The Fake News refuses to report the success of the first 6 months: S.C. surging economy & jobsborder & military securityISIS & MS 13 etc. _E_
With the ridiculous Filibuster Rule in the Senate Republicans need 60 votes to pass legislation rather than 51. Can't get votes END NOW! _E_
As I said on @foxandfriends this a.m. you have to give Obama credit—he won! ... _E_
The debates especially the second and third plus speeches and intensity of the large rallies plus OUR GREAT SUPPORTERS gave us the win! _E_
I will end illegal immigration and protect our borders! We need to MAKE AMERICA SAFE & GREAT AGAIN! #Trump2016 __HTTP__ _E_
#3. Cover your bases. Know everything you can about what you're doing. _E_
.@LindseyGrahamSC and Lyin' Ted Cruz are two politicians who are very much alike ALL TALK AND NO ACTION! Both talk about ISIS do nothing! _E_
Give a lot of credit to Carlos Beltran for developing into a terrific baseball player and total winner for the Cardinals great going Carlos! _E_
The UK has run out of money and can't afford to borrow. __HTTP__ Neither can we but that doesn't stop @BarackObama. _E_
Thank you @AmSpec __HTTP__ _E_
RT @Inspire_Us: No color no religion no nationality should come between us we are all children of God. Mother Teresa _E_
Now he has made his Busey ism into a song. #CelebApprentice _E_
RT @IvankaTrump: Working families need #TaxReform & the time is now. This Administration is committed to ensuring all Americans can thrive... _E_
Another sign that @jack_welch is right. New government labor report casts even more doubt on the September jobs data __HTTP__ _E_
It wasn't only that Obama saluted a Marine with a cup of coffee in his hand but why the hell does he have to exit a heli holding coffee? _E_
5 year old Trey has terminal cancer. I'm helping him go to Disney won't you? __HTTP__ _E_
MAKE AMERICA SAFE AND GREAT AGAIN! #RNCinCLE __HTTP__ _E_
"Dem candidates are all folks who vote with me." – Barack Obama describing ALL Democrat Senate candidates _E_
RT @Scavino45: POTUS' @realDonaldTrump on Hurricane Response Efforts in #PuertoRico on Instagram part of 9/29/17 Weekly Address. __HTTP__ _E_
Great seeing @MarianoRivera w/@realDonaldTrump at @TrumpTowerNY for @EricTrumpFdn! __HTTP__ __HTTP__ _E_
Bill Clinton stated that I called him after the election. Wrong he called me (with a very nice congratulations). He doesn't know much ... _E_
An 'extremely credible source' has called my office and told me that @BarackObama's birth certificate is a fraud. _E_
Disappointed in GOP and Dems Giving Obama power to raise the debt limit next year is a mistake. _E_
RT @TwitterData: These are the 10 most Tweeted about world leaders during the first day of #UNGA General Debate __HTTP__ _E_
In just out book Secret Service Agent Gary Byrne doesn't believe that Crooked Hillary has the temperament or integrity to be the president! _E_
"A big key to winning is knowing where the other side is coming from." – Think Like a Champion _E_
It would be nice if our commander in chief was as concerned for our Veterans health as he is for illegal immigrants becoming citizens. _E_
Obama's disastrous judgment gave us ISIS rise of Iran and the worst economic numbers since the Great Depression! _E_
Ranked nationally in @GolfMagazine's top 100 Trump Int'l Golf Club in Palm Beach is a 27 hole masterpiece __HTTP__ _E_
In my speech on protecting America I spoke about a temporary ban which includes suspending immigration from nations tied to Islamic terror. _E_
Think of yourself like a one man army. You're not only the commander in chief you're the soldier as well. – Think Like A Billionaire _E_
...and a Great Leader. John has also done a spectacular job at Homeland Security. He has been a true star of my Administration _E_
The big and highly respected Cooley LLP is handling the @billmaher case for me. _E_
After strict consultation with General Kelly the CIA and other Agencies I will be releasing ALL #JFKFiles other than the names and... _E_
RT @LouDobbs: The stock market has gained an incredible 7.8 Trillion dollars in market value since @POTUS was elected! Looks like 4% econom... _E_
Now @BarackObama is telling @MittRomney how to control his own assets. __HTTP__ Obama is consumed by class warfare. _E_
Lyin' Ted Cruz will never be able to beat Hillary. Despite a rigged delegate system I am hundreds of delegates ahead of him. _E_
I cannot imagine that these very fine Republican Senators would allow the American people to suffer a broken ObamaCare any longer! _E_
If the press would cover me accurately & honorably I would have far less reason to tweet. Sadly I don't know if that will ever happen! _E_
The Mayor of San Juan who was very complimentary only a few days ago has now been told by the Democrats that you must be nasty to Trump. _E_
For political purposes only Obama is planning to hit Libya for the Benghazi embassy attack right before the election? _E_
Yesterday's failing @NYTimes fraudulently shows an empty room prior to my speech when in fact it was packed! __HTTP__ _E_
Obama is community organizing from the Oval Office on Ferguson today. More riots sure to follow. _E_
President Obama thinks the nation is not as divided as people think. He is living in a world of the make believe! _E_
Good news House just passed #KatesLaw. Hopefully Senate will follow. _E_
U.S. Senator Bob Corker (R Tenn.) issued the following statement today regarding the 2016 presidential election: __HTTP__ _E_
Many journalists are honest and great but some are knowingly dishonest and basic scum. They should.be weeded out! _E_
I am going to save Social Security without any cuts. I know where to get the money from. Nobody else does. my @SRQRepublicans speech _E_
Great evening in Canton Ohio thank you! We are going to MAKE AMERICA GREAT AGAIN! Join us: __HTTP__ __HTTP__ _E_
Mitt's proposed tax cuts for the middle class will spur record economic growth. _E_
Is this Hope & Change? A record 46.7M Americans were on food stamps this past June. We must do better. _E_
Be sure to visit the world renowned Trump Tower Atrium to see our holiday decorations. __HTTP__ _E_
The greatest threat to our security is our debt. It is already past 100% GDP. We need to make real budget cuts. _E_
The real war on women. Under @BarackObama 766000 more women are unemployed from when he took office __HTTP__ _E_
President Obama wants to change the name of the White House because it is highly discriminating and not at all politically correct! _E_
M.M. is a good choice also nice guy! #Oscars _E_
ObamaCare will continue to stop entrepreneurship slow growth and halt research & development. Defund Repeal & Replace! _E_
Via @politico: Donald Trump claims Barack Obama bombshell __HTTP__ _E_
Join us! #CaucusForTrump11am WATERLOO: __HTTP__ CEDER RAPIDS: __HTTP__ __HTTP__ _E_
The deal with Iran will go down as one of the most incompetent ever made. The U.S. lost on virtually every point. We just don't win anymore! _E_
The road to success is always under construction. Arnold Palmer _E_
Making a big speech in Alabama today. So many people we had to move to a football stadium! Come and join us! _E_
Just signed Bill. Our Military will now be stronger than ever before. We love and need our Military and gave them everything — and more. First time this has happened in a long time. Also means JOBS JOBS JOBS! _E_
Getting ready to meet President al Sisi of Egypt. On behalf of the United States I look forward to a long and wonderful relationship. _E_
Don't assume you have to accept the hand you were dealt. – Think Like A Billionaire _E_
Trump National Golf Club Los Angeles is situated on the Palos Verdes Peninsula overlooking the Pacific Ocean... __HTTP__ _E_
Me by a lot! _E_
Spoke to U.K. Prime Minister Theresa May today to offer condolences on the terrorist attack in London. She is strong and doing very well. _E_
Located in South Ayrshire Scotland @TrumpTurnberry offers diverse dining options suitable for any occasion __HTTP__ _E_
Just announced that as many as 5000 ISIS fighters have infiltrated Europe. Also many in U.S. I TOLD YOU SO! I alone can fix this problem! _E_
My @FoxNews interview with @TeamCavuto discussing the national housing market unemployment numbers and the FL (cont) __HTTP__ _E_
The same people that said I wouldn't run or that I wouldn't lead or do well (1st place and leading by 21%) now say I won't beat Hillary. _E_
Via @PoliticalTicker: "TRENDING: Trump a right leaning tower at CPAC" __HTTP__ by @KilloughCNN _E_
I will be addressing a fantastic Ames crowd at tomorrow's @bobvanderplaats' @theFAMiLYLEADER Leadership Summit __HTTP__ _E_
Entrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_
RT @JaydaBF: VIDEO: Muslim Destroys a Statue of Virgin Mary! __HTTP__ _E_
RT @GOP: In @timkaine's own words #Debates2016 __HTTP__ _E_
RT @greta: Prob w/ all pundits saying last fall @realDonaldTrump had no chance is that shows media so out of touch w/ Americans _E_
Scottish government having huge backlash on wind turbines. @AlexSalmond is becoming very unpopular. _E_
'Trump won the third debate' __HTTP__ _E_
China is our enemy they want to destroy us Redstate Interview _E_
"Trust your instincts especially if they are well honed." – Midas Touch _E_
On 59th & Park Avenue Trump Park Avenue transformed the legendary Hotel Delmonico into 120 luxury residences __HTTP__ _E_
Very organized process taking place as I decide on Cabinet and many other positions. I am the only one who knows who the finalists are! _E_
Can you believe that the Chinese would not give Obama the proper stairway to get off his plane fight on tarmac! __HTTP__ _E_
Lightweight shakedown artist AG Eric Schneiderman was exposed in today's New York Post editorial __HTTP__ _E_
RT @EricTrump: Wow! I am speechless! Thank you to my sidekick @LynnePatton who keeps me & the @EricTrumpFdn in line! __HTTP__ _E_
Little Barry Diller who lost a fortune on Newsweek and Daily Beast only writes badly about me. He is a sad and pathetic figure. Lives lie! _E_
An extended interview from the Super Bowl with @oreillyfactor airs tonight at 8:00 P.M. Enjoy! __HTTP__ _E_
Susan Rice the former National Security Advisor to President Obama is refusing to testify before a Senate Subcommittee next week on..... _E_
Do you believe that @FoxNews is still playing up the old Iowa poll numbers and no mention of the ABCWashington Post or just out CBS results? _E_
I've got news for President @BarackObama: America is not what's wrong with the world. I don't believe we need (cont) __HTTP__ _E_
Costs on non military lines will never come down if we do not elect more Republicans in the 2018 Election and beyond. This Bill is a BIG VICTORY for our Military but much waste in order to get Dem votes. Fortunately DACA not included in this Bill negotiations to start now! _E_
Don't forget to watch @ApprenticeNBC tonight—you will love it! 8 PM on NBC. #CelebApprentice _E_
"Be up front and direct with people and they will return the favor." – Think Like a Billionaire _E_
Republicans must start the Tax Reform/Tax Cut legislation ASAP. Don't wait until the end of September. Needed now more than ever. Hurry! _E_
Entrepreneurs: Look at the solution not the problem. Learn to focus on what will give results. _E_
.@THEGaryBusey returns to @CelebApprentice All Stars this season. His streak of chaos and havoc continues! _E_
... A great person inspires others to see for themselves. – Harvey Mackay _E_
Why has Barack Obama repeatedly told inconsistent stories about his religious background? __HTTP__ Who is he? _E_
Celebrated for its room views by @LuxTravelExpert @TrumpChicago soars a luxurious 92 stories over the Windy City __HTTP__ _E_
Heading to Colorado for a big rally. Massive crowd great people! Will be there soon the polls are looking good. _E_
.@newsbusters Thank you for a great and very accurate story well done! _E_
Will be in Nashville Tennessee tomorrow (Saturday) at 2:30 P.M. So much to talk about see you there! _E_
Why is Senator John McCain in Syria visiting with the rebels MAKE AMERICA GREAT AGAIN! _E_
Despite the long delays by the Democrats in finally approving Dr. Tom Price the repeal and replacement of ObamaCare is moving fast! _E_
.@EmilyMiller's book Emily Gets Her Gun exposes the attack on our Second Amendment __HTTP__ A must read! _E_
So the highly overrated anchor @megynkelly is allowed to constantly say bad things about me on her show but I can't fight back? Wrong! _E_
Via @DailyMail by @chriskitching: "Luxury penthouse at @TrumpChicago skyscraper sells for record $17M" __HTTP__ _E_
Do you believe Barack Hussein Obama (aka Barry Soetoro) looked like a president last night? I don't! _E_
Just leaving D.C. Had great meetings with Republicans in the House and Senate. Very interesting day! These are people who love our country! _E_
Americans Elect on track to put an Indy Presidential candidate on the ballot in all 50 states. _E_
Stephanie Cutter Attended WH Meetings With IRS Chief __HTTP__ Great investigative work by Jim Hoft @gatewaypundit _E_
MERRY CHRISTMAS!! __HTTP__ _E_
"Trump: I created tens of thousands of jobs" __HTTP__ via @thehill by @SmiloTweets _E_
I'm not sure about @teresa_giudice as Project Manager. @lisalampanelli can be formidable. But let's see what happens #sweepstweet _E_
Many of @TigerWoods' 'friends' were quick to abandon him in his time of crisis. Now Tiger knows who he can count on. _E_
One of the most obvious lessons on @ApprenticeNBC is for the candidates to learn to think quickly. Think Like a Champion _E_
A 7242 yd. masterpiece @TrumpGolfLA's $250 million course features 18 challenging holes with incredible views __HTTP__ _E_
I have fun I love what I do. You should too. Find out how at the National Achievers Conference this October. __HTTP__ _E_
RT @fundanything In case you missed it check out @washingtonpost story about @realDonaldTrump & @fundanything __HTTP__ _E_
The Chinese are planning on going to the Moon...I hope they stop and take a look at our flag that was put there 43 years ago. @MittRomney _E_
Via @MoscowTimes Donald Trump Planning Skyscraper in Moscow __HTTP__ _E_
Once again the Bush appointed Supreme Court Justice John Roberts has let us down. Jeb pushed him hard! Remember! _E_
Military solutions are now fully in placelocked and loadedshould North Korea act unwisely. Hopefully Kim Jong Un will find another path! _E_
Re Megyn Kelly quote: you could see there was blood coming out of her eyes blood coming out of her wherever (NOSE). Just got on w/thought _E_
If everyone is thinking alike then somebody isn't thinking. George S. Patton _E_
A must read for any country or community considering wind turbines. __HTTP__ _E_
I think the @yankees will win today. Unlike A Rod CC is good under pressure. I hope A Rod plays however. _E_
Congratulations to @ehasselbeck on her successful first day as co host of @foxandfriends! Great to be in studio today for Elisabeth. _E_
Bringing hundreds of billions of dollars back to the U.S.A. from the Middle East which will mean JOBS JOBS JOBS! _E_
Theresa @theresamay don't focus on me focus on the destructive Radical Islamic Terrorism that is taking place within the United Kingdom. We are doing just fine! _E_
Thank you Waterbury Connecticut!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Obama is now standing in a puddle acting like a President give me a break. _E_
When James Clapper himself and virtually everyone else with knowledge of the witch hunt says there is no collusion when does it end? _E_
.@sternshow My interview with Howard Stern this morning! __HTTP__ __HTTP__ _E_
#TBT With my family growing up I'm on the left. __HTTP__ _E_
Wow pres. candidate Ben Carson who is very weak on illegal Immigration just said he likes amnesty and a pathway to citizenship. _E_
Nick Adams Retaking America Best things of this presidency aren't reported about. Convinced this will be perhaps best presidency ever. _E_
Sanders said only black lives matter wow! Hillary did not answer question! _E_
We have all been following the Wisconsin recall election. @ScottKWalker's victory tonight will be well earned. A Governor who gets results. _E_
Yesterday China VP Xi stressed the benefits of trade with China to Congress __HTTP__ We need FAIR TRADE with China! _E_
ObamaCare is a disaster and Snowden is a spy who should be executed but if it and he could reveal Obama's recordsI might become a major fan _E_
The rally in Cincinnati is ON. Media put out false reports that it was cancelled. Will be great love you Ohio! _E_
I'm proud to accept the 2010 HollyRod Foundation Humanitarian Award from Holly Robinson Peete who raised $700000 on Celebrity Apprentice _E_
Via @FoxNews: "Trump: Politicians are all talk no action I'm the opposite" __HTTP__ _E_
R.P.Virginia has lost statewide 7 times in a row. Will now not allow desperately needed new voters. Suicidal mistake. RNC MUST ACT NOW! _E_
On Holocaust Remembrance Day we mourn and grieve the murder of 6 million innocent Jewish men women and children and the millions of others who perished in the evil Nazi Genocide. We pledge with all of our might and resolve: Never Again! __HTTP__ __HTTP__ _E_
Despite the establishment and the media's best efforts the people are speaking loudly and clearly. Thank you to my amazing supporters! _E_
Despite what you hear in the press healthcare is coming along great. We are talking to many groups and it will end in a beautiful picture! _E_
Former Prosecutor: The Clintons Are So Corrupt Everything 'They Touch Turns To Molten Lead' __HTTP__ _E_
Thank you Ohio! Just landed in Canton for a rally at the Civic Center. Join me at 7pm: __HTTP__ __HTTP__ _E_
RT @FoxNews: .@KellyannePolls on Harvey recovery: We hope when it comes to basic Hurricane Harvey funding that we can rely upon a nonpartis... _E_
Be sure to enjoy the '50th Anniversary Chicago International Film Festival' at @TrumpChicago the Windy City's top hotel! _E_
"When your brand begins to build you too will be faced with opportunities for greater recognition." – Midas Touch _E_
Lots of response to my Pattinson/Kristen Stewart reunion. She will cheat again 100 certain am I ever wrong? _E_
.@WSJ Editorial Board should review my debate statement re China and T.P.P. and apologize. China not part but will get their way in later. _E_
Do these very stupid politicians who got us involved in Iraq look bad or what? Everybody wants their oil only made possible by U.S.! _E_
.@lisarinna did amazing on #CelebApprentice @ApprenticeNBC. Raising over $505K for @StJude she made it to the Final Four. Congrats Lisa. _E_
Join us in Sparks Nevada today! #NevadaCaucus #VoteTrumpNV __HTTP__ _E_
The Republicans never should have agreed to this past summer's debt deal. Military cuts will now come along with tax increases. _E_
Wow the MSM is really going after me. 12000 in Sarasota a love fest hardly a mention. Only one negativity they only want negatives! _E_
Join me for a 3pm rally tomorrow at the Mid America Center in Council Bluffs Iowa! Tickets:... __HTTP__ _E_
RT @AbeShinzo: トランプ大統領による、初の、歴史的な日本訪問は、間違いなく、日米同盟の揺るぎない絆を世界に示すことができました。本当にありがとう、ドナルド。そして、アジア歴訪の大成功をお祈りしています。@realDonaldTrump __HTTP__ _E_
Let's continue to destroy the competitiveness of our factories & manufacturing so we can fight mythical global warming. China is so happy! _E_
.@FrankLuntz I won every poll of the debate tonight by massive margins @DRUDGE_REPORT & @TIME so where did you find that dumb panel. _E_
Obama admin. called @netanyahu chickenshit. Ironic since Bibi was an IDF Special Forces commando while Obama was a community organizer. _E_
"@jacknicklaus elated at official grand opening of @TrumpFerryPoint" __HTTP__ via @nypost by @NYPost_Willis _E_
'U.S. Small Business Optimism Index Surges by Most Since 1980' __HTTP__ _E_
Thanks everybody for the Happy Birthday greetings but it's actually the 10th birthday of The Apprentice. My birthday is June 14th.... _E_
Today marks the one year anniversary of @AndrewBreitbart's passing. Andrew's mission & legacy still lives on. @BreitbartNews _E_
Great @UnionLeader piece by @jdistaso on my visit to @saintanselm for @NECouncil & @nhiop Politics & Eggs __HTTP__ _E_
I'm convinced that about half of what separates successful entrepreneurs from the non successful ones is pure perseverance. Steve Jobs _E_
"TRUMP BATTLES THE NEW TOTALITARIANS: GOP elites join with leftists at Media Matters in targeting threat to both" __HTTP__ _E_
Via @FoxNewsInsider as seen on @foxandfriends: "Trump: Iran Nuke Talks Should Have Taken One Day" __HTTP__ _E_
Every strike brings me closer to the next home run. – Babe Ruth _E_
I will be on @foxandfriends at 7:00 in 10 minutes. HAVE A GREAT DAY ALL! _E_
"Concentration and mental toughness are the margins of victory." Bill Russell _E_
The freezing cold weather across the country is brutal. Must be all that global warming. _E_
RT @DonaldJTrumpJr: Last chance #Wisconsin: Find your polling location for today's primary & go vote! Visit __HTTP__ #T... _E_
A third rate architecture critic who I thought got fired—for the failing @chicagotribune likes the building but doesn't like the Trump sign _E_
The sequester is less than 2% of total 2013 budget. Why can't the WH re allocate funds and keep the tours open for children? #OpenOurWH _E_
The Democrats without a leader have become the party of obstruction.They are only interested in themselves and not in what's best for U.S. _E_
Isn't it sad that Weiner's first press conference with wife Huma was yesterday admitting to a sext he made post resignation & apology! _E_
#MakeAmericaGreatAgain #6Days __HTTP__ _E_
We must stop the crime and killing machine that is illegal immigration. Rampant problems will only get worse. Take back our country! _E_
The NY SAFE Act is an unconstitutional attack on 2nd Amendment rights. Will also increase crime. _E_
.@guardian_sport by @mrewanmurray:"Donald Trump's transformation will make @TrumpTurnberry Open worth the wait" __HTTP__ _E_
Great jobs numbers and finally after many years rising wages and nobody even talks about them. Only Russia Russia Russia despite the fact that after a year of looking there is No Collusion! _E_
Keep the big picture in mind. There are always opportunities and thinking too small can negate a lot of them. _E_
President Obama is under pressure from Democrats to undo his lie on ObamaCare. His problem is that such a move would end ObamaCare. _E_
Thanks for all the nice words on my keeping the Trump Tower atrium accessible to stranded victims of #Sandy. My honor. _E_
. #LaskerRink. We do not do the maintenance on Lasker Rink that is done by NEW YORK CITY. _E_
Facebook billionaire gives up his U.S. citizenship in order to save taxes. I guess 3.8 billion isn't enough for (cont) __HTTP__ _E_
So @JLin7 had another game winning shot last night. Looks like the Knicks have not only found a new point guard (cont) __HTTP__ _E_
EXCLUSIVE — Video Interview: Bill Clinton Accuser Juanita Broaddrick Relives Brutal Rapes: __HTTP__ _E_
Just 30 minutes from Manhattan @TrumpNationalNY is Westchester's most elite club offering a 7291 yard course __HTTP__ _E_
People are proud to be saying Merry Christmas again. I am proud to have led the charge against the assault of our cherished and beautiful phrase. MERRY CHRISTMAS!!!!! _E_
The media is pathetic. Our embassies are savaged by radicals while Obama does nothing and all they can do is criticize @MittRomney. _E_
Getting ready to land in Charlottesville Virginia at Trump Vineyards another job producing development that I bought and made AMAZING! _E_
I will be at the @USGA #USWomensOpen in Bedminster NJ tomorrow. Big crowds expected & the women are playing great should be very exciting! _E_
.@evaemery Thanks you sound great! _E_
69 Democrats voted in favor of the Keystone pipeline in the House this week __HTTP__ A major defeat for @BarackObama _E_
Great sign: We built this business without government help. Obama can kiss our a ! __HTTP__ Commonly heard now across America! _E_
The full video of my @LibertyU speech __HTTP__ Liberty's largest ever Convocation crowd. _E_
Merry Christmas to all have a fantastic day year and life! The World with great leadership will become a much more beautiful place! _E_
So happy about my daughter @IvankaTrump's announcement that she will be having a baby this spring. Congratulations! _E_
.@ArsenioOFFICIAL Thx for the good wishes you are going to have a really big year! _E_
#Obamacare premiums are about to SKYROCKET again. Crooked H will only make it worse. We will repeal & replace! __HTTP__ _E_
So many politically correct fools in our country. We have to all get back to work and stop wasting time and energy on nonsense! _E_
The FAKE MSM is working so hard trying to get me not to use Social Media. They hate that I can get the honest and unfiltered message out. _E_
What we are watching on our TV screens is the unraveling of the Obama foreign policy. @PaulRyanVP _E_
For China of all nations to search the massive Indian Ocean and pick up the ping from the black box of flight 370 sounds a bit far fetched _E_
The race for DNC Chairman was of course totally rigged. Bernie's guy like Bernie himself never had a chance. Clinton demanded Perez! _E_
No surprise that @BBC is in a major scandal for shoddy journalism. Any network that air's @antbaxter's garbage has zero credibility. _E_
Why isn't Obama protecting us from ridiculous gas prices? _E_
We all know that chess is a game of strategy. So is business. Think about that and develop a strategy starting today. _E_
Iran is desperate to develop nukes. Congress must increase sanctions against Iran. _E_
"Out of clutter find Simplicity. From discord find Harmony. In the middle of difficulty lies Opportunity."–Albert Einstein _E_
Lightweight @AGSchneiderman just got his ass kicked by Trump! _E_
RT @DRUDGE_REPORT: Obama Refers to Himself 119 Times During Hillary Nominating Speech... __HTTP__ _E_
My @foxandfriends int.on @FoxNewsInsider:"'We Have No Leadership': Trump Slams Obama for Skipping Paris Unity Rally" __HTTP__ _E_
Back by popular demand the record 13th season of 'All Star' @CelebApprentice features the return of @bretmichaels. Our fans will be happy. _E_
He @BarackObama received an early endorsement from the Soviet newspaper Pravda over @MittRomney (cont) __HTTP__ _E_
Just found out that at a charity auction of celebrity portraits in E. Hampton my portrait by artist William Quigley topped list at $60K _E_
I will be interviewed on @oreillyfactor tonight at 11pmE @FoxNews. Enjoy! _E_
But maybe my biggest beef with Obama is his view that there's nothing special or exceptional about America. #TimeToGetTough _E_
I don't watch or do @Morning_Joe anymore. Small audience low ratings! I hear Mika has gone wild with hate. Joe is Joe. They lost their way! _E_
The Boston killer will soon be asking for a Presidential pardon—don't give it to him Mr. President—hang tough! _E_
When will we see stories from CNN on Clinton Foundation corruption and Hillary's pay for play at State Department? _E_
Pathetic! Since @GovWalker is going to win the recall @BarackObama is trying to disown the endorsement of Tom Barrett __HTTP__ _E_
The new Libyan Government should turn over the Lockerbie bomber now. _E_
I will be interviewed on Fox News Sunday With Chris Wallace at 9:00 A.M. or 10:00 A.M. (depending on location). Will be tough but good! _E_
Join me tomorrow! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_
A Veteran & true Conservative @leezeldin will make a real difference in Washington. NY 1 GOP GOTV for Lee tomorrow! _E_
expensive mistake! THE UNITED STATES IS OPEN FOR BUSINESS _E_
JOIN ME TOMORROW IN FLORIDA!MIAMI 12pm __HTTP__ __HTTP__ __HTTP__ _E_
I told you so. Our country totally lost control of illegal immigration even with criminals. __HTTP__ _E_
Thank you to NC for last evenings great reception. The speech was a great success. Heading now to Louisiana & another speech tonight in MI. _E_
How can the NY Times show an empty room hours before my speech even started when they knew it was going to be packed? So totally dishonest! _E_
Our great VETERANS are being treated very badly because of corruption and incompetence at the V.A. That will stop I will fix this quickly! _E_
.@TrumpLasVegas was just rated "Best Room Service" in LV by The Daily Meal. Congrats to my Las Vegas staff! __HTTP__ _E_
Join me Tuesday in Everett Washington at the Xfinity Arena! Tickets: __HTTP__ __HTTP__ _E_
RT @TeamTrump: .@realDonaldTrump calling out @HillaryClinton's support for NAFTA = most searched moment during tonight's debate. #Debates20... _E_
The five prisoners our government so stupidly released for one pathetic traitor are now fighting and killing for ISIS BAD DEAL! Courtmarshal _E_
Congratulations to @RickSantorum for coming out of Iowa a winner! _E_
With @shawnjohnson and @lorenzolamas from @apprenticenbc two great people! __HTTP__ _E_
BIG NIGHT ON TWITTER TONIGHT. I WILL BE LIVE TWEETING PRESIDENT OBAMA'S SPEECH AT 7:50 P.M. ( EASTERN). MUST TALK RADICAL ISLAMIC TERRORISM! _E_
Money may not grow from trees but it does grow from talent hard work and brains. Think Like a Billionaire _E_
Thanks for the tremendous support for my shirts ties and suits at Macy's. They do great because of really high quality at a low price. _E_
President Obama just had a news conference but he doesn't have a clue. Our country is a divided crime scene and it will only get worse! _E_
Ron Paul is right when he says we are wasting lives and money in Iraq and Afghanistan. _E_
It's amazing that some of the dumbest people on television work for the Wall Street Journal in particular a real dope named Charles Lane! _E_
I can't believe no one has been fired over the ObamaCare website fiasco! _E_
My @foxandfriends interview from this morning __HTTP__ _E_
Legendary Illusionist v. Country Music Star. This Sunday's LIVE Finale of @ApprenticeNBC is a historic matchup. MUST SEE TV! _E_
Isn't it amazing that @CNN paid a fortune for an Iowa Poll which shows me in first place over Cruz by 13% 33% to 20% then doesn't use it _E_
Where's the accountability for the $635M website fiasco in the Obama administration? Heads should roll and officials should be fired _E_
Great new poll thank you! __HTTP__ _E_
Join me in Naples Florida this evening at 6:00pm! Tickets: __HTTP__ __HTTP__ _E_
Isn't it terrible that @megynkelly used a poll not used before (I.B.D.) when I was down but refuses to use it now when I am up? _E_
If the Dems (Crooked Hillary) got elected your stocks would be down 50% from values on Election Day. Now they have a great future and just beginning! __HTTP__ _E_
I wonder how much money dumb @BuzzFeed and even dumber Ben Smith loooose each year? They have zero credibility totally irrelevant and sad! _E_
A level will be reached where ObamaCare will be so out of control expensive and unwieldy that the biggest supporters will abandon ship. _E_
Barack Obama's delivery on Saturday night was excellent cute mention of Trump and I am flattered to be mentioned. @BarackObama _E_
I am watching @CNN very little lately because they are so biased against me. Shows are predictable garbage! CNN and MSM is one big lie! _E_
The long anticipated release of the #JFKFiles will take place tomorrow. So interesting! _E_
Remember @foxandfriends at 7:00 A.M. and Celebrity Apprentice at 8:00 P.M. Enjoy! _E_
.@mdamelincourt Thanks M you are doing a great job at Trump Toronto! _E_
Congratulations to @netanyahu on his electoral victory. He will now be the longest serving @IsraeliPM. A great leader. _E_
THE WEST WILL NEVER BE BROKEN. Our values will PREVAIL. Our people will THRIVE and our civilization will TRIUMPH! __HTTP__ _E_
Thank you Charlotte North Carolina. Great afternoon! #ICYMI I delivered a speech on urban renewal. Full speech:... __HTTP__ _E_
New reality. Yuan just passed the Euro as 2nd most traded finance currency __HTTP__ Our leaders better get smart fast. _E_
Great meeting with @SenateMajLdr Mitch McConnell and Republican leaders in D.C. #Trump2016 __HTTP__ _E_
I will not be able to attend the Miss USA pageant tomorrow night because I am campaigning in Phoenix. Wishing all well! _E_
Taliban targeted innocent Afghans brave police in Kabul today. Our thoughts and prayers go to the victims and first responders. We will not allow the Taliban to win! _E_
My interview with Don Imus on @77WABCradio discussing my @RNC convention surprise & @MittRomney's China policy __HTTP__ _E_
A great new book has been written about Crooked Hillary. Read it & you will never be able to vote for her. @Ed_Klein __HTTP__ _E_
Yesterday I was in Washington D.C. visiting the #TRUMP Old Post Office renovation. It will be magnificent. _E_
As I have long stated we are so tied in with China and Asia that their markets are now taking the U.S. market down. Get smart U.S.A. _E_
The new Rasmussen Poll one of the most accurate in the 2016 Election just out with a Trump 50% Approval Rating.That's higher than O's #'s! _E_
After today Crooked Hillary can officially be called Lyin' Crooked Hillary. _E_
#TBT A picture of my fantastic father and myself. Best teacher in the world! A great Father's Day... __HTTP__ _E_
Judge Gorsuch will be sworn in at the Rose Garden of the White House on Monday at 11:00 A.M. He will be a great Justice. Very proud of him! _E_
Have a great weekend everyone and for those of you that are young entrepreneurs have fun but never stop thinking of the task ahead victory _E_
I will be on Fox & Friends (@foxandfriends) at 7.00. Fighting Ebola will be a topic! _E_
The Republican House members are working hard (and late) toward the Massive Tax Cuts that they know you deserve. These will be biggest ever! _E_
.@THEGaryBusey is making no attempt to help. Is he in BuseyLand? Their team is short on help already... #CelebApprentice _E_
Our president could not make a proper website with $5B. The website still does not work. How can we feel safe about Ebola?! _E_
Incredible crowd in Richmond Virginia tonight! So much spirit and energy! #makeamericagreatagain __HTTP__ _E_
states instead of the 15 states that I visited. I would have won even more easily and convincingly (but smaller states are forgotten)! _E_
.@EricTrump unbelievable job on #FoxNews with @greta. That was better than I could do! #Trump2016 _E_
Tomorrow in DC: 1 PM West Front Lawn of the Capitol. Not even believable that we would do this deal with Iran. _E_
All raising taxes on businesses does is force business owners to lay off employees they can no longer afford. (cont) __HTTP__ _E_
My @gretawire interview discussing @BarackObama's misleading political ad @MittRomney's response and @Cher & @Rosie __HTTP__ _E_
Angelina and Sidney had a really strange vibe going! #Oscars _E_
We need a tax system that is fair and smart one that encourages growth savings and investment. #TimeToGetTough _E_
The Crooked Hillary V.P. choice is VERY disrespectful to Bernie Sanders and all of his supporters. Just another case of BAD JUDGEMENT by H! _E_
Healthcare listening session w/ @VP & @SecPriceMD. Watch: __HTTP__ #ReadTheBill:... __HTTP__ _E_
Back to work for the President to try and keep some dignity for the office and himself. The so called rebels must be thoroughly confused! _E_
Coincidence? Obama and Ahmadinejad each describe @Israel's warning over the Iranian nuclear program as just 'noise' __HTTP__ _E_
I'm self funding and I am going to take care of the people – not the special interests and insurance companies like the other candidates. _E_
My @nbc @todayshow interview discussing my @RNC video & why @MittRomney should not apologize __HTTP__ _E_
Sorry folks but Bernie Sanders is exhausted just can't go on any longer. He is trying to dismiss the new e mails and DNC disrespect. SAD! _E_
.@DavidLetterman @Late_Show fully apologized last night for calling me a racist. Thank you David we are again friends. _E_
Trump International Golf Links and Hotel Ireland is located on the Atlantic Ocean in County Clare. Spectacular! __HTTP__ _E_
Golf bookings for next season on Scottish course are already double our projections for April opening—great news... __HTTP__ _E_
#TrumpAdvice __HTTP__ _E_
This 'deal' @RNC voted for has $41 in tax increases for every $1 in spending cuts. It is pathetic. Obama is laughing at them. _E_
Will be back in Virginia tonight for a 6pm rally at the Berglund Center in Roanoke. Join me! Tickets:... __HTTP__ _E_
The recession was made worse by @BarackObama. A $900Billion deficit is not getting better. _E_
Success tip: See yourself as victorious. This will focus you in the right direction. Apply your skills and talent and be tenacious. _E_
Amazing that Ted Cruz can't even get a Senator like @BenSasse who is easy to endorse him. Not one Senator is endorsing Canada Ted! _E_
Watch – Obama in 2006: "I've stolen ideas from Jonathan Gruber" __HTTP__ And now Obama claims he is 'just some adviser.' _E_
Read this about @Lawrence.... __HTTP__ _E_
One who fears failure limits his activities. Failure is only the opportunity to more intelligently begin again. Henry Ford _E_
The Mayor of Baltimore said she wanted to give the rioters space to destroy another real genius! _E_
Exclusive–Donald Trump: Obama 'Totally Out Negotiated' by Iran Taliban 'Virtually Every Country in the World' __HTTP__ _E_
Lance Armstrong is having a breakdown. What is he doing—his life is now officially over! _E_
My @gretawire interview __HTTP__ _E_
Can't believe these totally phoney stories 100% made up by women (many already proven false) and pushed big time by press have impact! _E_
If the election were based on total popular vote I would have campaigned in N.Y. Florida and California and won even bigger and more easily _E_
Bob Turner great guy great businessman will be a great Congressman. Was happy to help him win. _E_
Is this boring or is it just me? #Oscars _E_
The reason for the plan negotiated between the Republicans and Democrats is that we need 60 votes in the Senate which are not there! We.... _E_
Via The Political Insider: "Donald Trump Just Received The Best News Possible!" __HTTP__ _E_
Entrepreneurs: Stay focused and be tenacious. Pay attention to people who know what they're talking about. Stay fixed on your goals! _E_
Housing prices will be going up big league a great time to buy good luck! _E_
I will be on Meet the Press with Chuck Todd on NBC this morning. Enjoy! __HTTP__ _E_
My new book tells some harsh truths and lays out some bold plans. Time for America to be #1 again. #TimeToGetTough _E_
Thank you Anaheim California! #Trump2016 __HTTP__ _E_
We agree @POTUS SHE'LL (Hillary Clinton) SAY ANYTHING & CHANGE NOTHING. IT'S TIME TO TURN THE PAGE President Obama _E_
Why has @BarackObama allowed the Muslim Brotherhood to visit the @whitehouse? What Hope & Change! _E_
The real unemployment rate according to the CBO is 15% __HTTP__ @BarackObama's economic recovery is all Hope _E_
Obama called August's job report progress. Overall 96K new jobs & over 173K new people on food stamps __HTTP__ _E_
I am leaving China for #APEC2017 in Vietnam. @FLOTUS Melania is staying behind to see the zoo and of course the Great WALL of China before going to Alaska to greet our AMAZING troops. _E_
I'm a conservative but the weakness of conservatives is that they destroy each other whereas liberals unite to win. _E_
RT @joshrogin: Pence is right. Clinton & Obama tried to negotiate an Iraq troop extension but failed. Bush admin always anticipated such an... _E_
LIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_
Tonight's official count 7943. An all time record for the Anderson Civic Center in SC! Thanks! #Trump2016 __HTTP__ _E_
Thanks to everyone for your kind birthday wishes very nice! _E_
The Fed's reckless monetary policy is going to create record inflation. _E_
Republicans better start listening to and respecting the Tea Party! _E_
.@Kstupples Thanks for the nice comments on Trump National Doral. I've long been your fan—now am an even bigger fan! @TrumpDoral _E_
I encourage EVERYONE in the path of #HurricaneIrma to heed the advice and orders of local & state officials! __HTTP__ _E_
People are anxiously awaiting my decision as to who the next head of the Fed will be.... __HTTP__ _E_
Go confidently in the direction of your dreams. Live the life you have imagined. Henry David Thoreau _E_
The irony is that the Freedom Caucus which is very pro life and against Planned Parenthood allows P.P. to continue if they stop this plan! _E_
3 Republicans and 48 Democrats let the American people down. As I said from the beginning let ObamaCare implode then deal. Watch! _E_
A message from @IvankaTrump! #SCPrimary #VoteTrumpSC #MakeAmericaGreatAgain Video: __HTTP__ __HTTP__ _E_
Congratulations to @TrumpIntRealty for the two top rentals in 2013! __HTTP__ #TIRNYC _E_
Will the Keystone XL pipeline finally be approved? Will create over 100000 jobs and make us more energy independent. _E_
"Sometimes people spend too much time focusing on problems instead of focusing on opportunities." – Think Like A Champion _E_
I don't consider writing books a small venture...writing books is essentially a sharing experience. @MidasTouch @theRealKiyosaki _E_
My @SquawkCNBC #TrumpTuesday interview discussing QE3 @MittRomney's leaked comments Middle East & US oil capability __HTTP__ _E_
We will all have fun and hopefully learn something tonight. I will shoot straight and call it as I see it both the good and the bad. Enjoy! _E_
"Any political leader who won't face the future head on is putting the American Dream at risk." – The America We Deserve _E_
This is the best deal the Republicans could get? _E_
With 15% US real unemployment and a 16T debt @Michelle Obama's luxurious Aspen vacation her 16th cost us over $1M __HTTP__ _E_
"Do what you can with what you have where you are." Theodore Roosevelt _E_
Take action every day and stay focused for the long haul." Think Big _E_
Stop the assault on American values. Stand w/ Trump to #MakeAmericaGreatAgain! #VotersSpeak: __HTTP__ __HTTP__ _E_
Defund it or own it. If you fund it you're for it. @SenMikeLee _E_
You can have the best product in the world but if people don't know about it it's not going to be worth much. The Art of the Deal _E_
Crime and killings in Chicago have reached such epidemic proportions that I am sending in Federal help. 1714 shootings in Chicago this year! _E_
Via @LasVegasSun by Eugene R. Dunn: "Impeach Obama and elect Trump" __HTTP__ _E_
CORRUPT with the national security leaks and Fast & Furious there are clearly at least two cover ups in @BarackObama's White House. _E_
Ted Cruz is mathematically out of winning the race. Now all he can do is be a spoiler never a nice thing to do. I will beat Hillary! _E_
The Fake News is now complaining about my different types of back to back speeches. Well their was Afghanistan (somber) the big Rally..... _E_
Join me in Dallas Texas on Thursday! #AmericaFirst #Trump2016 __HTTP__ __HTTP__ _E_
I will be live tweeting! _E_
Obama just bought the Afghan Police $288M in ammo __HTTP__ Make no mistake some of these will be shot at our troops. _E_
#SecondAmendment #2A #Debates __HTTP__ _E_
Yesterday was another big day for jobs and the Stock Market. Chrysler coming back to U.S. (Michigan) from Mexico and many more companies paying out Tax Cut money to employees. If Dems won in November Market would have TANKED! It was headed for disaster. _E_
Totally made up facts by sleazebag political operatives both Democrats and Republicans FAKE NEWS! Russia says nothing exists. Probably... _E_
'Huma Abedin told Clinton her secret email account caused problems' __HTTP__ _E_
RT @Team_Trump45: @realDonaldTrump __HTTP__ _E_
Only by enlisting the full potential of women in our society will we be truly able to #MakeAmericaGreatAgain... __HTTP__ _E_
.@EdWGillespie will totally turn around the high crime and poor economic performance of VA. MS 13 and crime will be gone. Vote today ASAP! _E_
Today it was my great honor to proclaim January 15 2018 as Martin Luther King Jr. Federal Holiday. I encourage all Americans to observe this day with appropriate civic community and service activities in honor of Dr. King's life and legacy. __HTTP__ _E_
Whether you like Obama or not Bob Gates turned out to be one disloyal dude! Personally I hate rats. _E_
Looking forward to being @TrumpSoHo this evening for Corporate Meeting Planners reception for Trump National Doral @TrumpDoral _E_
.@antbaxter Thanks for helping promote & make Trump International Golf Links Scotland so successful you stupid fool! _E_
The ISIS thug who murdered American journalist James Foley may have been Gitmo detainee __HTTP__ If so why was he released? _E_
"You miss 100% of the shots you don't take." Wayne Gretzky _E_
Joe Girardi did a great job of managing the Yankees this series. _E_
BarackObama set a record deficit last February $229 billion while borrowing 42 cents of every dollar it spent. @BarackObama is reckless. _E_
Kasich has already spent $6 million on ads in New Hampshire and his numbers have gone down. People from NH are smart! _E_
Great article on so called climate change formerly known as global warming. __HTTP__ _E_
ISIS just claimed the Degenerate Animal who killed and so badly wounded the wonderful people on the West Side was their soldier. ..... _E_
Thank you Evansville Indiana! #MakeAmericaGreatAgain __HTTP__ _E_
Crooked Hillary Clinton is 100% owned by her donors. #ImWithYou #MAGA __HTTP__ _E_
Putin says Russia can't allow a weakening of its nuclear deterrent—U.S. wants to reduce—are we crazy? _E_
We have to combat the welfare mentality that says individuals are entitled to live off taxpayers. #TimeToGetTough _E_
Remember how @ObamaCare did not have any tort reform? Now the trial lawyers are getting ready for even more lawsuits __HTTP__ _E_
Gee @meetthepress with @chucktodd was getting terrible ratings then with me he set records I saved his job but Chuck still not nice! _E_
Will be on @bloombergtv tomorrow with @sruhle. Enjoy! _E_
.@VattenfallGroup couldn't sell its money losing Aberdeen windfarm—so @AlexSalmond forced phony extension. @AberdeenCC @Aberdeenshire _E_
Work is fun deals are fun life is fun but love of a great family makes it all come together. Go out there and make your family proud. _E_
__HTTP__ _E_
Looks like a very good World Series game! _E_
RT @TeamTrump: We need STRONG BROAD SHOULDERED leadership like @mike_pence & @realDonaldTrump in the White House! #VPDebate #BigLeagueTrut... _E_
#2. Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_
At this point the legacy of the Obama Administration will be sadly that of THE GANG THAT COULDN'T SHOOT STRAIGHT what a pathetic mess! _E_
.@StephenBaldwin7 thinks @TheRealMarilu is ping ponging all over the place. Do you agree? #CelebApprentice _E_
Make no mistake Obamacare is the first step towards changing our health system into single payer. Just a disaster. _E_
A lot of complaints from people saying my name is not on the ballot in various places in Florida? Hope this is false. _E_
If Jeb Bush were more competent he could not have lost the skirmish with Marco in the debate. BAD facts for Marco if properly delivered! _E_
Via @BBCScotland: "Donald Trump's name 'will boost @TrumpTurnberry '" __HTTP__ _E_
With the whacko pervert Weiner about to be embarrassed all women need to be on the lookout. Sexting begins 9.11 @ 12:01 AM _E_
Obama's spending and borrowing is burying America and destroying our children's future. Does he even care? _E_
#TrumpVlog Make our country great again! __HTTP__ _E_
Amazing how fast all of Joe Paterno's friends abandoned him. They ran for the hills. _E_
What did we get for fighting in Libya besides a dead Ambassador. Demand their oil. _E_
In Hudson Valley @TrumpNationalNY's course has pristine fairways tour caliber greens & 64 strategic sand bunkers __HTTP__ _E_
House GOP wants to cut Medicare Obama took $500 billion from Medicare for Obamacare. Both Wrong! _E_
Via Union Leader: Trump leads tribute for slain journalist James Foley | New Hampshire First Amendment Awards __HTTP__ _E_
Funny how the failing @nytimes is pushing Dems narrative that Russia is working for me because Putin said Trump is a genius. America 1st! _E_
Is everything ok over there @Salon? I actually got some good press from them today. _E_
Iran has never had a better friend than Obama. _E_
Remember the golden rule of negotiating: He who has the gold makes the rules. _E_
The Fed must be reined in. In 2011 the Fed bought 61% percent of US debt even more than 2008. Unsustainable! __HTTP__ _E_
Ben Carson wants to abolish Medicare I want to save it and Social Security. _E_
NIELSEN RATINGS: 1.@ThisWeekABC 2.52 viewers 6 SHR1.91RTG .55 25 54 2.@meetthepress 2.24 total viewers 5 SHR1.61RTG .47 25 54 _E_
Barney Frank looked disgusting nipples protruding in his blue shirt before Congress. Very very disrespectful. _E_
Wonderful Frank Gifford has just passed away at age 84. He was my friend and a truly great guy! Warmest condolences to family. _E_
Only 10 more days until the premiere of All Star @ApprenticeNBC. On March 3rd at 9PM EST @NBC the fireworks return to the Board Room! _E_
To Tom Brady @patriots and Gisele Best wishes on the birth of your daughter. Tom is a great player and great friend. _E_
"The aesthetic the quality has to be carried all the way through." Steve Jobs _E_
Miss USA Tara Conner will not be fired I've always been a believer in second chances. says Donald Trump _E_
Little Andy Lassner who lives his life through Ellen and has nothing else going for himself is having a really bad night! #Oscars _E_
Please tell me what is going on with the Republicans? _E_
The U.S. has been talking to North Korea and paying them extortion money for 25 years. Talking is not the answer! _E_
.@FoxNews Objectified tonight at 10:00 P.M. Enjoy! _E_
Serious doubt in Illinois as to whether or not Cruz can run for President. First of many challenges. __HTTP__ _E_
RT @HeyTammyBruce: Coming up at 720a ET on @foxandfriends! See you there! #maga _E_
The reporting at the failing @nytimes gets worse and worse by the day. Fortunately it is a dying newspaper. _E_
By popular demand I will be tweeting during tomorrow's record 14th season premiere of @ApprenticeNBC on @nbc at 9/8c __HTTP__ _E_
.@jimmyfallon regularly features @ApprenticeNBC contestants on his show. We love his support & he's a terrific host.Tonight: Omarosa. _E_
What a shock – higher taxes are slowing retail spending __HTTP__ Wait until 2014 when Obama Care is fully implemented. _E_
I'm surprised that Gabriel Aubry has settled so quickly and easily with Halle—in the long run it was a wise decision. _E_
Now is no time to cut military spending. We must remain strong. Our enemies are looking for weakness. I'm i... (cont) __HTTP__ _E_
Can you imagine we spend billions of dollars protecting Saudi Arabia and now the King refuses to even meet with Obama. Great leadership! _E_
Heading to D.C. to see and hear ROLLING THUNDER. Amazing people that LOVE OUR COUNTRY. Great spirit! _E_
"Do whatever it takes to improve your public speaking skills. You'll absolutely need them." – Midas Touch _E_
If you're going through hell keep going. Winston Churchill _E_
Obama opposes sanctions on Iran __HTTP__ They are laughing at Kerry & Obama! _E_
I had a very respectful conversation with the widow of Sgt. La David Johnson and spoke his name from beginning without hesitation! _E_
Will be doing a sit down interview with @JakeTapper @CNN on Sunday morning at 9:00. Tough questions and hopefully very good answers! _E_
Bernie Sanders started off strong but with the selection of Kaine for V.P. is ending really weak. So much for a movement! TOTAL DISRESPECT _E_
If China had a tenth of the natural resources we do then they would already be energy independent. Instead we continue to buy oil from OPEC. _E_
Washington (D.C.) is such a mess nothing works! I will MAKE AMERICA GREAT AGAIN! It's not going to happen with anyone else. _E_
Looks like Anthony Weiner Is through most recent poll has him deeply in last place. GOOD NEWS _E_
The trip by @VP Pence was long planned. He is receiving great praise for leaving game after the players showed such disrespect for country! _E_
I am officially running for President of the United States. #MakeAmericaGreatAgain __HTTP__ _E_
Looking forward to live tweeting during the rest of the debates. Will be a lot of fun. _E_
Jeb Bush never uses his last name on advertising signage materials etc. Is he ashamed of the name BUSH? A pretty sad situation. Go Jeb! _E_
"Trump Rally: Stocks put 2017 in the record books" __HTTP__ _E_
With the strategy that I announced today we are declaring that AMERICA is in the game and AMERICA is DETERMINED to WIN! OUR FOUR PILLARS OF NATIONAL SECURITY STRATEGY: __HTTP__ _E_
I look very much forward to meeting Prime Minister Theresa May in Washington in the Spring. Britain a longtime U.S. ally is very special! _E_
"Do you want to know who you are? Don't ask. Act! Action will delineate and define you." Thomas Jefferson _E_
In the 10:30 PM ET lead in to local news @ApprenticeNBC delivered a 31 percent margin of victory... _E_
A lot of call ins about vote flipping at the voting booths in Texas. People are not happy. BIG lines. What is going on? _E_
Now @BarackObama is telling donors he will need to 'revisit' healthcare in his 2nd term __HTTP__ _E_
FAKE NEWS media which makes up stories and sources is far more effective than the discredited Democrats but they are fading fast! _E_
I love show Law and Order but the @MRbelzer casting is the worst ever. No talent unwatchable! _E_
I can't believe Mitch McConnell isn't way up in the Kentucky polls. Massive seniority brings so much power and status to State. Brings K.$'s _E_
In war the elememt of surprise is sooooo important.What the hell is Obama doing. _E_
Great decision by Donald Graham @Newsweek to sell. I'll now have to take my newsweek covers off the wall. _E_
Great – we are sending even more F 16's to the Muslim Brotherhood in Egypt __HTTP__ This is a total disaster. _E_
VOTER REGISTRATION DEADLINES TODAY. You can register now at: __HTTP__ and get out to... __HTTP__ _E_
When Mitt Romney asked me for my endorsement last time around he was so awkward and goofy that we all should have known he could not win! _E_
Just out Nevada poll shows Jeb Bush at 1% he should take his dumb mouthpiece @LindseyGrahamSC and just go home. _E_
Who's your pick @bretmichaels or @hollyrpeete ? Vote now on Ivanka's new Facebook page! __HTTP__ _E_
Why didn't Hillary Clinton announce that she was inappropriately given the debate questions she secretly used them! Crooked Hillary. _E_
Once the ISIS thug who beheaded Foley is identified 100% he should be bunker busted to hell. _E_
Look at the editorial I was just sent from the NY Post on 9/14/01 3 days after collapse of WTC. Any apologies? __HTTP__ _E_
You get what you vote for. US credit rating is about to be downgraded once again __HTTP__ _E_
In the spirit of transparency Obama should immediately release the 9.11 tape of Tyrone Woods pleading for military support in Benghazi. _E_
Negotiation: Think about what the other side wants. Know where they're coming from. Don't underestimate them. Create a win/win situation. _E_
We must immediately stop all air traffic coming from the Ebola infected areas of Africa—before it is too late. _E_
China Russia and Iran are laughing at us. We have weak leaders who are threatening our national security. Dangerous times. _E_
.@robbreport Best 2013 Golf Courses: Trump Int'l Golf Links Scotland. Great honor great magazine—thanks! __HTTP__ _E_
Wow did you see how badly @CNN (Clinton News Network) is doing in the ratings. With people like @donlemon who could expect any more? _E_
Will be on #Hannity @ 10pE @FoxNews discussing various subjects including immigration if elected we will #BuildTheWall & enforce our laws! _E_
Join me Monday in Columbus Ohio & Harrisburg Pennsylvania! #MAGA 3pm in OH: __HTTP__ in PA: __HTTP__ _E_
The Great Irish Links Challenge @Trump_Ireland & Lahinch Golf Club is coming this June. Don't miss it. __HTTP__ #Doonbeg _E_
I have helped many friends and colleagues in their business ventures. They always thank me after they succeed. #MIDASTOUCH _E_
Thank you Indiana! #Trump2016 __HTTP__ _E_
May God be w/ the people of Sutherland Springs Texas. The FBI & law enforcement are on the scene. I am monitoring the situation from Japan. _E_
...get things done at a record clip. Many big decisions to be made over the coming days and weeks. AMERICA FIRST! _E_
I will have set the all time record in primary votes in the Republican party despite having to compete against 17 other people! _E_
Being successful requires nothing less than 100% of your concentrated effort. Be totally focused. _E_
The United States cannot continue to make such bad one sided trade deals. There are only so many jobs we can give up. No more! _E_
Crooked Hillary is spending big Wall Street money on ads saying I don't have foreign policy experience yet look what her policies have done _E_
All predictions re: my 12 o'clock release are totally incorrect. Stay tuned! _E_
....is making. Working very hard on TAX CUTS for the middle class companies and jobs! _E_
Via @RadioIowa by @okayhenderson: "Trump touts business career but not TV show during Iowa speech" __HTTP__ _E_
RT @GOPChairwoman: The Trump Inaugural Committee is donating $3 million in surplus funds to victims of the latest hurricanes. __HTTP__ _E_
Via @NYDailyNews by @klnynews: "Donald Trump wins lawsuit against Joint Commission on Public Ethics" __HTTP__ _E_
Offering true luxury @Trump_Charlotte has spectacular restaurants Olympic pools & six professional tennis courts __HTTP__ _E_
Jeb Bush gave five different answers in four days on whether or not we should have invaded Iraq.He is so confused.Not presidential material! _E_
The Blue Monster @TrumpDoral was a sensation over the weekend. Really tough but players & critics alike loved it. _E_
Happy #CincoDeMayo! The best taco bowls are made in Trump Tower Grill. I love Hispanics! __HTTP__ __HTTP__ _E_
Jared Kushner did very well yesterday in proving he did not collude with the Russians. Witch Hunt. Next up 11 year old Barron Trump! _E_
Towering over trendy Bay Street @TrumpTO offers 118 stunning condominiums w/ multi angle views & elite amenities __HTTP__ _E_
Why do shows have @ananavarro—Ntl Hispanic Chair for the losing McCain '08 & Huntsman '12. She's a loser who doesn't deliver votes. _E_
Republicans don't extend the debt ceiling—make the great deal now! _E_
How much BAD JUDGEMENT was on display by the people in DNC in writing those really dumb e mails using even religion against Bernie! _E_
#ImWithYou __HTTP__ _E_
Thanks for all of the great support but I just don't see myself wanting to run for Governor of New York I have something else in mind! _E_
Something very important and indeed society changing may come out of the Ebola epidemic that will be a very good thing: NO SHAKING HANDS! _E_
The Fake News is going crazy with wacky Congresswoman Wilson(D) who was SECRETLY on a very personal call and gave a total lie on content! _E_
I am in Iowa today great STATE fantastic PEOPLE! Many speeches big crowds all sold out! MAKE AMERICA GREAT AGAIN! _E_
Honest Omarosa: she won't backstab she'll come at you from the front. _E_
Congresswoman Jennifer Gonzalez Colon of Puerto Rico has been wonderful to deal with and a great representative of the people. Thank you! _E_
Just arrived at Trump National Doral saying hello to all the great players. This place is amazing.Come Thursday & see for yourselves! _E_
Today I officially declared my candidacy for President of the United States. Watch the video of my full speech __HTTP__ _E_
Maybe some of the dead voters who helped get President Obama elected can be brought back to life after signing up for ObamaCare. _E_
In my opinion one of the worst utility companies in the country is Florida Power and Light. _E_
Katie Couric the third rate reporter who has been largely forgotten should be ashamed of herself for the fraudulent editing of her doc. _E_
.@GlennBeck got fired like a dog by #Fox. The Blaze is failing and he wanted to have me on his show. I said no because he is irrelevant. _E_
My people caught the person who committed forgery of the James Gandolfini Obama Care phoney quote attributed to me fraud. Arrest coming? _E_
South Carolina voters have the future of our country in their hands. Vote now (today) and MAKE AMERICA GREAT AGAIN! _E_
Just arrived in West Virginia for a MAKE AMERICA GREAT AGAIN rally in Huntington at 7:00pmE. Massive crowd expected tune in! #MAGA _E_
Now Assad is demanding that Obama stop supporting the rebels before he turns over his chemical weapons. What a mess! _E_
Thank you. __HTTP__ _E_
Congratulations to Alyssa Campanella Miss California our new MIss USA! __HTTP__ _E_
My @todayshow int. with @MLauer announcing the January 4th premiere & cast of the 14th season of @ApprenticeNBC __HTTP__ _E_
Thank you Piers for the wonderful article and also great writing. @piersmorgan __HTTP__ _E_
Red line statement was a disaster for President Obama. _E_
...come down hard tax the hell out of their imports and reduce our deficit fast. _E_
Via @AP: Donald Trump returns to the 'Apprentice' boardroom __HTTP__ _E_
The Democrats only want to increase taxes and obstruct. That's all they are good at! _E_
One of the world's tallest buildings @TrumpChicago is not only a 5 star hotel but has 5 star dining options __HTTP__ _E_
Agreed! __HTTP__ _E_
Mexico will pay for the wall! _E_
like the 116% hike in Arizona. Also deductibles are so high that it is practically useless. Don't let the Schumer clowns out of this web... _E_
And finally Cruz strongly told thousands of caucusgoers (voters) that Trump was strongly in favor of ObamaCare and choice a total lie! _E_
The issue of kneeling has nothing to do with race. It is about respect for our Country Flag and National Anthem. NFL must respect this! _E_
Hillary Clinton Dominates the Pack in Fake Twitter Followers __HTTP__ _E_
Another Obama disaster __HTTP__ _E_
Republicans should not be giving Obama fast track authority on trade. The Trans Pacific Partnership will squeeze our manufacturing sector _E_
Edddie24 Mr. Trump is a real American patriot. You have my vote if you ever ran. 👍 Thank you. _E_
Hillary Clinton reaches new low. #TrumpVlog __HTTP__ _E_
...Overall the Academy Awards were very average at best. _E_
With all that Congress has to work on do they really have to make the weakening of the Independent Ethics Watchdog as unfair as it _E_
I want to thank all my friends in Macon for the special evening and great reception. What a crowd of incredible people! _E_
My friend Derek is a special athlete and special person there is nobody like him. @Yankees _E_
The @BarackObama administration now claims to have done everything to reduce gas prices __HTTP__ What about Keystone? _E_
Things are looking great for Karen H! _E_
Crooked Hillary launched her political career by letting terrorists off the hook. #DrainTheSwamp... __HTTP__ _E_
The Mar a Lago club in Palm Beach is one of the most successful places on earth in raising money for charity a great feeling! _E_
I will be interviewed by @jdickerson on @FaceTheNation tomorrow morning. Enjoy! #Trump2016 _E_
Great sportscaster Al Michaels a friend of mine played golf with me on Saturday morning at Trump National LA. He was in perfect shape! _E_
.@Mark_Sanchez shouldn't be too upset over @EvaLongoria. He will always do great! _E_
TRUMP TUESDAY @SquawkCNBC tomorrow at 7:30 am Tune in! _E_
Congratulations to our new Attorney General @SenatorSessions! __HTTP__ _E_
The Miami Heat is getting it's ass kicked they better start playing or it will be a long Summer for them. _E_
Is Fake News Washington Post being used as a lobbyist weapon against Congress to keep Politicians from looking into Amazon no tax monopoly? _E_
Home values have sunk a record 15% under Obama. _E_
For America to be great again we must have a President who has been successful and Americans can learn from on how to succeed. _E_
Donald Trump Will Be on Pennsylvania Avenue in 2016 & There's Nothing You Can Do About It __HTTP__ by @lilsarg _E_
The rolling average of jobless claims is the highest in 5 months __HTTP__ ObamaCare continues to slow growth and cost jobs. _E_
The cast of the new season of apprenticenbc. Premieres January 4th on NBC. __HTTP__ _E_
Join me in Washington today!Spokane tickets: __HTTP__ tickets: __HTTP__ __HTTP__ _E_
Congratulations to @AllenWest on winning last night's primary! _E_
WH counsel met with IRS lawyer 3x in 2012 once in September __HTTP__ But Obama just learned through news reports? _E_
The Democrats ObamaCare is imploding. Massive subsidy payments to their pet insurance companies has stopped. Dems should call me to fix! _E_
Does anyone really believe that Chuck Hagel is sorry for any of his past comments or supports Israel? _E_
One of the tallest office buildings in downtown NYC 40 Wall Street is a classic Art Deco building __HTTP__ _E_
A reader just sent me the following: I wanted to share with you something rather startling. On page 103 of (cont) __HTTP__ _E_
Breitbart gets it! Vote now @BarackObama should release his college application records and grades. He says he (cont) __HTTP__ _E_
The phony story in the failing @nytimes is a TOTAL FABRICATION. Written by same people as last discredited story on women. WATCH! _E_
Glad to hear North Carolina is solid for @MittRomney. It started trending for Mitt solidly after my speech at the @NCGOP convention. _E_
What lies behind us and what lies before us are tiny matters compared to what lies within us. Ralph Waldo Emerson _E_
Big excitement last night in the Great State of Pennsylvania! Fantastic crowd and people. MAKE AMERICA GREAT AGAIN! _E_
I now see John Kasich from Ohio who is desperate to run is using my line "Make America Great Again". Typical pol no imagination! _E_
Thank you for the nice words @ktmcfarland. The debate was interesting and fun. Keep up the great work! _E_
Starting tomorrow it's going to be #AmericaFirst! Thank you for a great morning Sarasota Florida!Watch here:... __HTTP__ _E_
Excellent Jobs Numbers just released and I have only just begun. Many job stifling regulations continue to fall. Movement back to USA! _E_
Four more years of weakness with a Crooked Hillary Administration is not acceptable. Look what has happened to the world with O & Hillary! _E_
Sorry losers and haters but I LOVED the great energy in Madison Square Garden during my speech. The WWE thought it was incredible it was! _E_
Bernie's exhausted he just wants to shut down and go home to bed! _E_
I am honored to be chosen by Gray Line for their NY Ride of Fame Campaign. Today we had the ribbon cutting ceremony in front of Trump Tower. _E_
True. __HTTP__ _E_
I was proud to be one of Ronald Reagan's earliest supporters. Like Reagan it's time to Make America Great Again! __HTTP__ _E_
Via @JNSworldnews by @JacobKamarasJNS: Donald Trump says he is no apprentice when it comes to Israel __HTTP__ _E_
#HasJustineLandedYet Justine what the hell are you doing are you crazy? Not nice or fair! I will support @AidForAfrica. Justine is FIRED! _E_
CAMPAIGN STATEMENT: __HTTP__ _E_
My @foxandfriends int. re: Tiger's victory at Trump @DoralResort 's @CadillacChamp my WH tour offer and CPAC __HTTP__ _E_
America is proud to stand shoulder to shoulder with Poland in the fight to eradicate the evils of terrorism and extremism. #POTUSinPoland __HTTP__ _E_
Thank you to teachers across America! When I become POTUS we will make education a far more important component of our life than it is now. _E_
Despite Mexico's interest in again hosting the Miss Universe Pageant it will be because of Rodolfo Rosas Moya that it will never happen. _E_
For too long we've been pushed around used by other countries and ill served by politicians in Washington who (cont) __HTTP__ _E_
MAKE AMERICA GREAT AGAIN!#AmericaFirst #ImWithYou __HTTP__ _E_
Pres. Obama is meeting with China's Pres. this week __HTTP__ He will get zero deliverables. China laughs at us. _E_
The Republicans must use the debt ceiling as leverage to make a great deal! _E_
E mails show that the AmazonWashingtonPost and the FailingNewYorkTimes were reluctant to cover the Clinton/Lynch secret meeting in plane. _E_
Just returned from Trump Doral in Miami. Massive construction job. When completed will be the best resort in U.S. Blue Monster is amazing! _E_
Why would the great people of Florida vote for a guy who as a Senator never even shows up to vote worst record. Marco Rubio is a joke! _E_
They now say using the word thug is like so many other words not politically correct (even though Obama uses it). It is racist. BULL! _E_
"US tycoon Donald Trump in talks with Ryanair to bring more flights back to Prestwick Airport" __HTTP__ via @Daily_Record _E_
Join our next Vice President @Mike_Pence in Wisconsin tonight & Michigan Thursday!MI: __HTTP__ __HTTP__ _E_
Republicans have the right approach to ObamaCare – let it fail. Free market solutions will be embraced by Americans in 2016. _E_
Don't let Obama play the Iran card in order to start a war in order to get elected be careful Republicans! _E_
.@ICEgov HSI agents and ERO officers on behalf of an entire Nation THANK YOU for what you are doing 24/7/365 to keep fellow American's SAFE. Everyone is so grateful!#LawEnforcementAppreciationDayPresident @realDonaldTrump __HTTP__ _E_
'Small business says Trump is their pick for president' __HTTP__ _E_
America needs a President who can negotiate better deals for the American People. _E_
My interview on 9/13/01 with a German reporter after visiting Ground Zero __HTTP__ _E_
Trump Tuesday on @SquawkCNBC 7:30 AM is getting very good ratings as is @Foxandfriends on Mondays 7:30 AM. _E_
Via @Newsmax_Media: Robb Report: Trump Scotland Best Golf Course in the World __HTTP__ _E_
Again I have nothing to do with the Atlantic City closing I have not even been there in many years. Some press was accurate some not! _E_
Via @BloombergNews by Peter Millard: Trump Helps Rio Builders After Olympics: Corporate Brazil __HTTP__ _E_
Our country is now in serious and unprecedented trouble...like never before. _E_
If history teaches us anything it's that strong nations require strong leaders with clearly defined national (cont) __HTTP__ _E_
Great meeting w/ NATO Sec. Gen. We agreed on the importance of getting countries to pay their fair share & focus on... __HTTP__ _E_
Mitt Romney is right about the Chinese rip off of America. _E_
So Obama can host the Muslim Brotherhood Pres. Morsi in the White House __HTTP__ but doesn't have time for @netanyahu? _E_
Texas & Florida are doing great but Puerto Rico which was already suffering from broken infrastructure & massive debt is in deep trouble.. _E_
Great work being done by @FEMA @DHSgov w/state & local leaders to prepare for hurricane season. Preparedness is an investment in our future! __HTTP__ _E_
Whoever wins today remember that tomorrow we still have a country struggling. Our work is not done until America is strong again. _E_
"Experience knowledge & prescience are a formidable combination of powers. Do not underestimate any of them." Think Like a Champion _E_
Obama has missed 58% of his intelligence briefings. But our president does make 100% of his fundraisers. _E_
Top brand impact is what television is all about from the commercial standpoint—a big deal for @CelebApprentice. _E_
The charities I have designated for @billmaher's donations are: Police Athletic League New York March of Dimes Hurricane Sandy victims.... _E_
Must see morning clip: Donald Trump addresses Lil Wayne tweet and 'Celebrity Apprentice' __HTTP__ via @Salon _E_
My son Don will be giving the Keynote Address at The Investment Show in Sandton South Africa on Dec. 1. He's an (cont) __HTTP__ _E_
Muslim Brotherhood head of Egypt Morsi is already making demands on Obama before the WH visit. Obama's foreign policy is a complete failure. _E_
How come there are no protests in favor of the two young police officers gunned down in Mississippi by two deranged animals. DEATH PENALTY! _E_
WikiLeaks reveals Clinton camp's work with 'VERY friendly and malleable reporters' #DrainTheSwamp #CrookedHillary __HTTP__ _E_
Chicago is a shooting disaster they should immediately go to STOP AND FRISK. They have no choice hundreds of lives would be saved! _E_
Vast numbers of manufacturing jobs in Pennsylvania have moved to Mexico and other countries. That will end when I win! _E_
Remember when I said when Saddam Hussein fell the new leader of Iraq will be meaner and tougher and hate the U.S. even more. Welcome ISIS! _E_
Thanks. __HTTP__ _E_
Keystone: @johnboehner MUST pass Keystone by linking it to another bill. __HTTP__ _E_
Top suspect in Paris massacre Salah Abdeslam who also knew of the Brussels attack is no longer talking. Weak leaders ridiculous laws! _E_
Just out report: United Kingdom crime rises 13% annually amid spread of Radical Islamic terror. Not good we must keep America safe! _E_
As the nuclear crisis with Iran shows America needs to import oil from a reliable region. Keystone XL Pipeline (cont) __HTTP__ _E_
Word is that Sleepy Eyes Chuck Todd who has failed so badly with Meet the Press will be taking over for now irrelevant Brian Williams! _E_
Via @feminamissindia: "@MannyPacquiao among @MissUniverse 2015 judges" __HTTP__ _E_
The Club for Growth is a very dishonest group. They represent conservative values terribly & are bad for America. __HTTP__ _E_
Big G7 meetings today. Lots of very important matters under discussion. First on the list of course is terrorism. #G7Taormina _E_
...Why did Democratic National Committee turn down the DHS offer to protect against hacks (long prior to election). It's all a big Dem HOAX! _E_
As always & due to popular demand@TrumpRink will be open Christmas eve & day as well as New Year's eve & day __HTTP__ _E_
Be sure to watch the Larry King Show tomorrow night on CNN 9 p.m. I'll be the host Larry the guest. __HTTP__ _E_
RT @WhiteHouse: Do not allow anyone to tell you that it cannot be done. No challenge can match the HEART and FIGHT and SPIRIT of America. ... _E_
IN AMERICA WE DON'T WORSHIP GOVERNMENT WE WORSHIP GOD! __HTTP__ _E_
Will be interviewed on @foxandfriends tomorrow morning Monday at 8:00. Much to talk about! _E_
NO GAMES! HOUSE @GOP MUST DEFUND OBAMACARE! IF THEY DON'T THEN THEY OWN IT! _E_
What a coincidence Michelle Obama called Kenya @BarackObama's homeland in 2008 __HTTP__ _E_
Great news @TPPatriots are starting their own Super PAC to fight @KarlRove __HTTP__ (via @thehill) Go get em! _E_
Pennsylvania is in play @MittRomney. All undecideds in Philly suburbs should ask themselves who do you trust most on @Israel? _E_
Watch listen and learn. You can't know it all yourself. Anyone who thinks they do is destined for mediocrity. Donald Trump _E_
Will be interviewed on the @TODAYshow this morning at 7:00. Talking about politics polls and whatever. Enjoy! _E_
After the litigation is disposed of and the case won I have instructed my execs to open Trump U(?) so much interest in it! I will be pres. _E_
Re negotiation: Know exactly what you want and keep it to yourself. Think about what the other side wants and where they're coming from. _E_
Success breeds success. The best way to impress people is through results. Think Like a Billionaire. _E_
.@MittRomney can only speak negatively about my presidential chances because I have been openly hard on his terrible choke loss to Obama! _E_
.@Borisep was great on @JudgeJeanine tonight. Very smart commentary that will prove to be correct! _E_
protesters and the tears of Senator Schumer. Secretary Kelly said that all is going well with very few problems. MAKE AMERICA SAFE AGAIN! _E_
Everything comes to him who hustles while he waits. Thomas A. Edison _E_
Success is not final failure is not fatal: it is the courage to continue that counts. Winston Churchill _E_
Glad to see that Sacha Baron Cohen's new movie is not only a dud but not too good at the box office. He is talentless. @Sacha_B_Cohen _E_
Going to CPAC! _E_
Obama can kill Americans at will with drones but waterboarding is not allowed—only in America! _E_
Who do you think is going home? #CelebApprentice _E_
Stay confident even when something bad happens. It is just a bump in the road. It will pass. Think Big _E_
"@BarackObama may have been a good 'community organizer' but the man is a lousy international dealmaker." #TimeToGetTough _E_
We need a tax system that is FAIR to working families & that encourages companies to STAY in America GROW in America and HIRE in America __HTTP__ _E_
A penny saved is a penny earned. Benjamin Franklin _E_
I will be doing Fox & Friends at 7.00 will be discussing the the Donald Sterling (Clippers) MESS! _E_
Maybe Obama should donate my $5M to the families of the 17 who have lost loved ones during the storm? _E_
Even Barbara Bush agrees with me __HTTP__ _E_
Check out my interview on @GMA __HTTP__ _E_
ObamaCare has 21 tax hikes __HTTP__ There's now only one solution defeat @BarackObama this November! #GOMITT _E_
.@FoxNews You shouldn't have @KarlRove on the air—he's a clown with zero credibility—a Bushy! _E_
Happy birthday to the great @TheLeeGreenwood. You and your beautiful song have made such a difference. MAKE AMERICA GREAT AGAIN! _E_
Re: hiring contractors remember the cheapest isn't always the best. Their work may have to be redone & they may not be reliable. _E_
Check out the last webisode www.youtube.com/user/mattressserta in our 3 part series featuring me with Serta. Which one was your favorite? _E_
Where are the other candidates now that this tragic murder has taken place b/c of our unsafe border __HTTP__ We need a wall! _E_
If these guys have any integrity they'd say no to MSNBC a network that few watch and is very negative. @AndrewBreitbart re debate. _E_
I made my decision to allow Jenna Talackova to participate in Miss Universe Canada two days before Gloria Allred (cont) __HTTP__ _E_
I would rather run against Crooked Hillary Clinton than Bernie Sanders and that will happen because the books are cooked against Bernie! _E_
Thank you Orlando Florida! We are just six days away from delivering justice for every forgotten man woman and ch... __HTTP__ _E_
By the way Hillary & the MSM forgot to mention that Hillary is in the Al Shabaab terror video. __HTTP__ _E_
A clip of my @LibertyU speech talking about the importance of the election & our country's potential __HTTP__ via@washingtonpost _E_
.@LouDobbs just stated that President Trump's successes are unmatched in recent presidential history Thank you Lou! _E_
The failing @nytimes is greatly embarrassed by the totally dishonest story they did on my relationship with women. _E_
Let me put this as plainly as I know how: Iran's nuclear program must be stopped by any and all means necessary. Period. #TimeToGetTough _E_
I will be on @oreillyfactor tonight at 8:00. Enjoy! _E_
Getting closer and closer on the Tax Cut Bill. Shaping up even better than projected. House and Senate working very hard and smart. End result will be not only important but SPECIAL! _E_
Act as if what you do makes a difference. It does. William James _E_
Thank you. __HTTP__ _E_
Watch @ApprenticeNBC episode 2 online again via @nbc: "Nobody Out Thinks Donald Trump __HTTP__ _E_
If @RepMarkMeadows @Jim_Jordan and Raul_Labrador would get on board we would have both great healthcare and massive tax cuts & reform. _E_
Designed by @IvankaTrump @TrumpDoral's Deluxe Guestrooms feature impeccable furnishing and details __HTTP__ _E_
"Get to the essence immediately. Learn to economize. People appreciate brevity in today's world." – Think Like a Champion _E_
'President elect Donald J. Trump today announced his intent to nominate Steven Mnuchin Wilbur Ross & Todd Ricketts... __HTTP__ _E_
Just like its website ObamaCare is a disaster.Maybe all those who are fighting it are wasting their time it will fail on its own! _E_
October 2015 thanks Chris Wallace @FoxNewsSunday! __HTTP__ _E_
I said that Eliot Spitzer was going to lose when he was way up in the polls. I fought him when others retreated out of fear. NEVER GIVE UP! _E_
Via @NRO: Palin Trump Get Longer Speaking Slots at CPAC by @KatrinoTrinko __HTTP__ _E_
The rallies in Utah and Arizona were great! Tremendous crowds and spirit. Just returned but will be going back soon. _E_
I wonder if I run for PRESIDENT will the haters and losers vote for me knowing that I will MAKE AMERICA GREAT AGAIN? I say they will! _E_
Trending story on Miss Utah is very unfair. She simply lost her train of thought—could happen to anyone! @MissUSA @MissUniverse _E_
Opening in 2016 Trump Hotel Rio de Janeiro will be a 13 story 171 guestroom masterpiece with a beachside view __HTTP__ _E_
Obama and Clinton told the same lie to sell #ObamaCare. #Debates2016 __HTTP__ _E_
Thank you to all of our amazing military families service members and veterans. #ImWithYou __HTTP__ _E_
You wouldn't believe how tall and beautiful @_KatherineWebb is 6'5 in heels. She is also a total winner in... __HTTP__ _E_
The crackdown on illegal criminals is merely the keeping of my campaign promise. Gang members drug dealers & others are being removed! _E_
MAKE AMERICA GREAT AGAIN! _E_
Remember Cruz and Bush gave us Roberts who upheld #ObamaCare twice! I am the only one who will #MAKEAMERICAGREATAGAIN! _E_
Deepest condolences to the families & fellow officers of the VA State Police who died today. You're all among the best this nation produces. _E_
Thank you to our great Police Chiefs & Sheriffs for your leadership & service. You have a true friend in the... __HTTP__ _E_
RT @DineshDSouza: Finally as if by accident the @washingtonpost breaks down & admits the truth about where the violence is coming from ht... _E_
So many great polls like Reuters big leads everywhere. New Hampshire really special! We will win big and MAKE AMERICA GREAT AGAIN! _E_
Article: More illegals enter than people born in state each week. __HTTP__ _E_
RT @DeptofDefense: #HappyThanksgiving from @USArmy and @USNationalGuard #soldiers serving with Task Force Marauder in #Afghanistan. 🦃 __HTTP__ _E_
Can't wait for @DylanByers' follow up @politico piece discussing my large Sunday news shows ratings win because of my interview! _E_
Looking for an excuse not to cook for Thanksgiving? Many NYC outlets will delivery a full meal including @TrumpSoHo __HTTP__ _E_
...New Donna B book says she paid for and stole the Dem Primary. What about the deleted E mails Uranium Podesta the Server plus plus... _E_
Just finished two major speeches in South Carolina. Big crowds great people. Going for a third now! _E_
My thoughts on Dick Cheney and his new book... __HTTP__ #trumpvlog _E_
Join me in Carmel Indiana tomorrow at 4pm! #INPrimary __HTTP__ __HTTP__ _E_
Just leaving Nashville Tennessee. Had a great time with a fabulous crowd of people! Love Nashville back soon! __HTTP__ _E_
Under the leadership of Obama & Clinton Americans have experienced more attacks at home than victories abroad. Time to change the playbook! _E_
Dick Clark was a friend of mine he lived in one of my buildings on East 61st Street. Everybody loved him. He will be missed. _E_
General Kelly is doing a great job at the border. Numbers are way down. Many are not even trying to come in anymore. _E_
Sadly when it comes to using the energy industry to create American jobs Obama has been a total disaster. And (cont) __HTTP__ _E_
.@dixierhilton #asktrump __HTTP__ _E_
I have never liked the media term 'mass deportation' but we must enforce the laws of the land! _E_
My interview with @parademagazine from the Olympics 100 Day Countdown in Times Square __HTTP__ _E_
.#IranDeal will go down as one of the dumbest & most dangerous misjudgments ever entered into in history of our country—incompetent leader! _E_
.@CNN is all negative when it comes to me. I don't watch it anymore. _E_
With allies like Egypt and Libya who needs enemies?! _E_
RT @DRUDGE_REPORT: TRUMP STUMPS... __HTTP__ _E_
Make your life as groundbreaking as possible while also minding the tides and riptides around you. Think Like a Champion _E_
Nice guy @pennjillette needs your help to make his bad guy movie Directors Cut > __HTTP__ @fundanything _E_
Thank you @morningmika and @JoeNBC for all of your nice words and comments on the debate! _E_
RT @paulsperry_: Wray needs to clean house. Now we know the politicization even worse than McCabe's ties to McAuliffe/Clinton. It also infe... _E_
I love watching the dishonest writers @NYMag suffer the magazine's failure. _E_
We will never have great national security in the age of computers too many brilliant nerds can break codes (the old days were better). _E_
WATCH – WH official says that ObamaCare/RomneyCare architect Gruber was 'an important figure' in crafting the law __HTTP__ _E_
Weekly Address __HTTP__ __HTTP__ _E_
I hear @JoeNBC of rapidly fading @Morning_Joe is pushing hard for a third party candidate to run. This will guarantee a Crooked Hillary win. _E_
"@AP Interview: @MissUniverse Gabriela Isler reflects as her reign winds down" __HTTP__ via @YahooNews _E_
The US GDP in 2010 was 4.1% down to 2% in 2011 & now 1.5%. I guess @BarackObama's plan is not working! _E_
#CrookedHillary #ThrowbackThursday __HTTP__ _E_
11000 inside venue tonight in Tampa! Broke record set by Elton John in 1988 w/out musical instruments! Another 5000 outside. Will be back! _E_
For beauty and flight I'll take the @Boeing 757 over the @Boeing 787 any day! _E_
Wow just heard really bad stuff about the failing @politico. How much longer will they be around? Some very untalented reporters. _E_
If Cuba is unwilling to make a better deal for the Cuban people the Cuban/American people and the U.S. as a whole I will terminate deal. _E_
For the disciples of global warming in 150 summers (years) there have been 20 heat waves as bad or worse than current this has happened b4! _E_
I will be interviewed on @GMA at 7:00 A.M. and @foxandfriends at 7:50. Talking about my new book out today Crippled America. _E_
Loved the debate last night and almost everyone said I won but the RNC did a terrible job of ticket distrbution. All donors & special ints _E_
RT @DonaldJTrumpJr: Nevada: Here is a quick video @IvankaTrump created on How to Caucus very quick and simple! __HTTP__ ... _E_
Looking forward to attending the GREAT Rev. @BillyGraham's birthday party tonight there's nobody like him! _E_
My interview with @IngrahamAngle discussing @THEHermanCain @BarackObama's mistreatment of Israel and GOP 2012. __HTTP__ _E_
The entire cast will be back for the live finale of @ApprenticeNBC Monday night at 8 PM _E_
.@JebBush is slashing campaign salaries people making millions. If he can't manage his campaign how can he manage our countries finances? _E_
Donald Trump appearing today on CNN International's 'Connect the World' as 'Connector of the Day'. Submit questions: __HTTP__ _E_
Masa said he would never do this had we (Trump) not won the election! _E_
I wonder when we will be able to see @BarackObama's college and law school applications and transcripts. Why the long wait? _E_
I'll be co hosting @extratv tonight. Be sure to tune in! _E_
I watched Russell Brand @rustyrockets on the @jimmyfallon show the other night—what the hell do people see in Russell—a major loser! _E_
Almost every major dealmaker has used the bankruptcy laws as a business tool... _E_
That trip would be to the Trump International Hotel Las Vegas... __HTTP__ _E_
Donald J. Trump's History Of Empowering Women #BigLeagueTruth __HTTP__ _E_
A fine man Dr. Paul F. Crouch has just passed away. All Christians are grateful for his wonderful life and work. @TBN _E_
Trump National Golf Club Charlotte is the premiere club in North Carolina. __HTTP__ Will visit tomorrow. _E_
Lyin' Ted Cruz denied that he had anything to do with the G.Q. model photo post of Melania. That's why we call him Lyin' Ted! _E_
.@billmaher has continually degraded Catholic Church on the joke he calls a show __HTTP__ Catholics should boycott HBO. _E_
My daughter Ivanka will be on @foxandfriends tomorrow morning. Enjoy! _E_
Thanks Piers. Greatly appreciated. @piersmorgan __HTTP__ _E_
.@LilJon's take on @piersmorgan seems to be a classic love hate combo. Piers can be tough and everyone knows it. #CelebApprentice _E_
What ever happened to the good old days of The Academy Awards. This show is an insult to the past just plain bad! _E_
.@TrumpDoral will be featured on @GolfChannel this morning (now). _E_
Why isn't Hillary 50 points ahead? Maybe it's the email scandal policies that spread ISIS or calling millions of... __HTTP__ _E_
Remember that I predicted a long time ago that President Obama will attack Iran because of his inability to negotiate properly not skilled! _E_
I am running against the Washington insiders just like I did in the Republican Primaries. These are the people that have made U.S. a mess! _E_
Boston's Mayor Walsh wasted a lot of time and money on going for the Olympics and then he gave up. I don't want him negotiating for me! _E_
10 yrs ago today the Iraq war began. 4485 of our nation's finest have not returned home alive. Iran will soon control Iraq & its oil. _E_
Obama was guest at VP debate moderator Martha Raddatz's wedding __HTTP__ Do people think this is fair? _E_
Via @trscoop: WHOA: Trump changing venues for Saturday rally in Arizona due to OVERWHELMING RESPONSE __HTTP__ _E_
Hillary is the most corrupt person to ever run for the presidency of the United States. #DrainTheSwamp __HTTP__ _E_
They just arrested pol Shelly Silver in New York. Why aren't they arresting a far bigger crook @AGSchneiderman? _E_
Obama killed over 100k jobs by not approving Keystone XL pipeline and Canada is now selling the oil to China very dumb! _E_
Big news Budget just passed! _E_
.@ABCPolitics #GOPDebate#MakeAmericaGreatAgain #FITN __HTTP__ _E_
THANK YOU NEW YORK!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
The last time I visited China I couldn't believe all the construction. You can go up with a project in a week no red tape. _E_
Glad to hear Derek Jeter just removed his boot and is practicing on the field for @yankees. Derek is a true champion. _E_
Why doesn't @MittRomney just endorse @marcorubio already.Should have done it before NH or Nevada where he had a little sway. Too latenow! _E_
Rather than putting pressure on the businesspeople of the Manufacturing Council & Strategy & Policy Forum I am ending both. Thank you all! _E_
New Fox News PollThank you Iowa! #Trump2016 #IACaucus __HTTP__ _E_
Our new American Energy Policy will unlock MILLIONS of jobs & TRILLIONS in wealth. We are on the cusp of a true ene... __HTTP__ _E_
After spending $89 million @JebBush is at the bottom of the barrel in polls. He is ashamed to use the name Bush in ads. Low energy guy! _E_
There is no world problem which cannot be solved if people of good will & intelligence want it to be. _E_
Thank you Delaware! #Trump2016 #MakeAmericaGreatAgain #TrumpTrain __HTTP__ __HTTP__ _E_
.@bretmichaels and George Ross are back as advisors. Good to see them! #CelebApprentice _E_
The illegal immigrant crime problem is far more serious and threatening than most people understand. Along our (cont) __HTTP__ _E_
Capitalism requires capital. When government robs capital from investors through high taxes it takes away the (cont) __HTTP__ _E_
Our great team at @FEMA is prepared for #HurricaneNate. Everyone in LA MS AL and FL please listen to your local authorities & be safe! _E_
I left Atlantic City years ago good timing. Now I may buy back in at much lower price to save Plaza & Taj. They were run badly by funds! _E_
Business is a creative endeavor. Cultivate a sense of discovery and start thinking big. _E_
The Audacity of Ineptitude – ObamaCare website will cost over $1B __HTTP__ When will someone finally be held accountable? _E_
Offshore Wind in Europe: Lessons for the U.S. __HTTP__ via @HuffPostGreen The lesson should be that it's a lousy idea!!! _E_
.@oreillyfactor @KarlRove as per the show an even more serious Cruz charge is the fraudulent voter violation certificate sent to everyone. _E_
My @FoxNews interview from last night with @gretawire discussing yesterday's meeting with @MittRomney __HTTP__ _E_
In Bangladesh hostages were immediately killed by ISIS terrorists if they were unable to cite a verse from the Koran. 20 were killed! _E_
I will be interviewed by @SeanHannity tonight at 10pm on FOX! Enjoy! _E_
I had to fire General Flynn because he lied to the Vice President and the FBI. He has pled guilty to those lies. It is a shame because his actions during the transition were lawful. There was nothing to hide! _E_
See I told you so __HTTP__ _E_
Wow! I hear that thousands of people are cutting up their @Macys credit card. That's great. #MakeAmericaGreatAgain! _E_
Great basketball game going on right now! _E_
.@MonicaCrowley you were great with @SeanHannity on @FoxNews tonight. Thank you for your kind words. We will keep Americans safe. _E_
.@TrumpDoral's Red Course redesign is underway. Will be completed in September. Follow all the developments __HTTP__ _E_
Finally in the new ABC News/Washington Post Poll Hillary Clinton is down 11 points with WOMEN VOTERS and the election is close at 47 43! _E_
Price gouging at many gas stations $10 a gallon welcome to the new world. _E_
Housing prices are up in Feb over last Feb 9.3 per cent remember I told everyone two years ago to buy (but they will be going much higher) _E_
"Life is difficult no matter what but hard work and perseverance make it a lot easier." – Think Like a Billionaire _E_
A.G. Lynch made law enforcement decisions for political purposes...gave Hillary Clinton a free pass and protection. Totally illegal! _E_
RT @EricTrump: Congratulations @SeanHannity! Looking forward to being on the show tonight at 9pmET Hannity beats Maddow POLITICO __HTTP__ _E_
All I can say is that if I were President Snowden would have already been returned to the U.S. (by their fastest jet) and with an apology! _E_
Great works are performed not by strength but by perseverance. Samuel Johnson _E_
Just left Istanbul Turkey yesterday where #TrumpTowers was just opened magnificent! _E_
For the great people of Iowa find your #IACaucus location at __HTTP__ So important to vote! #MakeAmericaGreatAgain _E_
Make sure you get on the Trump line and are not mislead by the Cruz people. They are bad! BE CAREFUL. _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
"If you want to be successful at anything in life you have to be able to handle pressure." – Think Big _E_
Oscar Pistorious will likely only serve 10 months for the cold blooded murder of his girlfriend. Another O.J. travesty.The judge is a moron! _E_
They asked me to dress as Santa Claus to open Miss Universe tonight—I'm thinking about it! _E_
I don't cheat at golf but @SamuelLJackson cheats—with his game he has no choice—and stop doing commercials! _E_
I very much look forward to tomorrow's debate in New Hampshire—so many things to say so much at stake. It will be an incredible evening! _E_
Cyberattack on White House what's next? __HTTP__ _E_
Study what General Pershing of the United States did to terrorists when caught. There was no more Radical Islamic Terror for 35 years! _E_
Boycott @Macys and @Univision. MAKE AMERICA GREAT AGAIN! _E_
The stock market is having a horrendous day bad employment numbers. _E_
Thank you Iowa Get out & #VoteTrumpPence16! __HTTP__ __HTTP__ _E_
It was my great HONOR to present our nation's highest award for a public safety officer THE MEDAL OF VALOR to FIVE AMERICAN HEROES! __HTTP__ _E_
Moderator: "Respectfully you won't answer the pay to play question." #Debate #BigLeagueTruth _E_
Via @PostSports @barrysvrluga Donald Trump has major aspirations for his Trump National Golf Club in Virginia __HTTP__ _E_
Trade with China has killed over 29% of US manufacturing jobs in the US __HTTP__ China is robbing us blind! _E_
Moderator: Hillary paid $225000 by a Brazilian bank for a speech that called for "open borders." That's a quote! #Debate #BigLeagueTruth _E_
My just filed lawsuit against Univision. Always fight back when right. #MakeAmericaGreatAgain __HTTP__ _E_
I just started construction of The Old Post Office on Pennsylvania Avenue in D.C. Many jobs. Will be finest hotel in U.S. Watch it happen! _E_
.@LaurenScruggs who was badly injured by an airplane was great on The Today Show! _E_
Thank you @DonaldJTrumpJr. Proud of you! #RNCinCLE #TrumpPence2016 __HTTP__ _E_
All Star @ApprenticeNBC premiering March 3rd on @NBC features terrific TV stars competing in the toughest tasks yet. Will be great. _E_
The dirty poll done by @ABC @washingtonpost is a disgrace. Even they admit that many more Democrats were polled. Other polls were good. _E_
China's currency manipulation is one of our nation's greatest sovereign threats. The yuan has appreciated 40% against our dollar since 2005. _E_
Thanks Piers. __HTTP__ _E_
Despite all of China's cheating they are not doing that well we can beat them our country has great potential! _E_
Someone incorrectly stated that the phrase DRAIN THE SWAMP was no longer being used by me. Actually we will always be trying to DTS. _E_
Negotiations 101: The best deals you can make are the ones you walk away from...and then get them with better terms. _E_
My interview on @ThisWeekABC with @GStephanopoulos had a 40%+ ratings increase over same Sunday last year. 20% over last week. _E_
Many many people are thanking me for what I said about @autism & vaccinations. Something must be done immediately. _E_
The military generals are fuming at Obama. He has boxed them in against ISIS with a strategy that is destined to fail. Sad! _E_
In beautiful Pine Hill Trump Nat'l Philadelphia's award winning course provides amazing views of Philly skyline __HTTP__ _E_
If it were up to goofy Elizabeth Warren we'd have no jobs in America—she doesn't have a clue. _E_
President Obama has just reached an ALL TIME low approval rating! Is anybody surprised? The happiest person is former President Jimmy Carter _E_
Rumor has it that @politico is going out of business. Losing too much money. Great news! Likewise dopey Mort Zuckerman's @NYDailyNews _E_
We need a PRESIDENT with strength stamina heart and incredible deal making skill if our country is ever going to be able to prosper again! _E_
Why would Ohio listen to Bruce Springsteen reading his lines? Be careful or I will go to Ohio and @MittRomney will win it! _E_
.@FreeJesseJames Just read your complete statement. You are an amazing guy & I really appreciate your words & support. I will see you soon! _E_
This whole Super PAC scam is very unfair to a person like me who has disavowed all PAC's & is self funding. _E_
Again illegal immigrant is charged with the fatal bludgeoning of a wonderful and loved 64 year old woman. Get them out and build a WALL! _E_
RT @FiIibuster: @realDonaldTrump We have a President that is putting the security and prosperity of America first. Thank you President Tru... _E_
Carter Banned Iranians From Coming To U.S. During Hostage Crisis __HTTP__ _E_
Remember tonight (Monday) the second and third episodes of The Apprentice are on at 8:00 & 9:00. Great ratings last night 18 49. FUN! _E_
I look forward to being in South Carolina tomorrow a total sellout crowd! _E_
.@dbongino You were fantastic in defending both the Second Amendment and me last night on @CNN. Don Lemon is a lightweight dumb as a rock _E_
.@jasondhorowitz I am very proud of my sister your story was terrific. Thank you so much. _E_
Via @411mania: Donald Trump Comments on a Return to Wrestling __HTTP__ _E_
From Donald Trump: Wishing everyone a wonderful holiday & a happy healthy prosperous New Year. Let's think like champions in 2010! _E_
Always fun to read the @NewYorkObserver investigative piece re @AGSchneiderman his mascara and more! __HTTP__ _E_
Why would anybody listen to @MittRomney? He lost an election that should have easily been won against Obama. By the wayso did John McCain! _E_
Professional anarchists thugs and paid protesters are proving the point of the millions of people who voted to MAKE AMERICA GREAT AGAIN! _E_
Oil is starting to rise again despite the horrible times. OPEC continues to rip us off. Not worth $30. New leadership needed. _E_
Via @LinkedInPulse by @nicholas_wyman: "What All Hiring Managers Can Learn from Donald Trump" __HTTP__ _E_
Treat yourself to the pinnacle of luxury public golf at @TrumpGolfLA's white sand $250M premiere course __HTTP__ _E_
An excerpt of my @TheBrodyFile interview at the Sarasota 'Statesman of the Year' dinner discussing the Tea Party __HTTP__ _E_
RT @AmericaFirstPol: MAJOR IMPACT: @POTUS Trump is 50 Days in and moving swiftly to get America back on the right track. #MAGA __HTTP__ _E_
He makes a mistake every hour every day admits @BarackObama. __HTTP__ The problem is that we are paying for them. _E_
I am going to repeal and replace ObamaCare. We will have MUCH less expensive and MUCH better healthcare. With Hillary costs will triple! _E_
RT @FoxNewsSunday: Sunday our exclusive interview with President elect @realDonaldTrump Watch on @FoxNews at 2p/10p ET Check your local... _E_
70 stories over Panama Bay @TrumpPanama's deluxe rooms feature private balconies to enjoy the ocean views __HTTP__ _E_
Why aren't we getting any oil from Iraq before we leave? We are leaving the country wide open for Iran. Big mistake. _E_
I am in Scotland checking on my developments in Aberdeen and Turnberry. Just left Ireland property will be great. ALWAYS CHECKING! _E_
Go to @Macys now to see the incredible new selection of Trump Signature Collection ties shirts and suits. _E_
Sometimes we do things to build up experience and stamina to prepare but it's to prepare us for something bigger. _E_
Why would @BarackObama be spending millions of dollars to hide his records if there was nothing to hide? _E_
"Effective leadership is putting first things first. Effective management is discipline carrying it out." Stephen Covey _E_
The perfect Hawaiian getaway @TrumpWaikiki's 462 luxury guest rooms and suites each have spectacular views __HTTP__ _E_
My @TheBrodyFile int. discussing the persecution of Christians in the Middle East & Religious Liberty & Freedom __HTTP__ _E_
With our $250M in renovations @TrumpDoral offers a wide array of courses restored to perfection __HTTP__ _E_
Trump Victorious in Fort Lauderdale Litigation __HTTP__ _E_
"To keep your goals alive you must take action every day. No one should care about your money and success more than you do." Think Big _E_
#1. Be passionate you have to love what you're doing to be successful at it. _E_
We've just set a new goal: raise $4 million from our grassroots supporters by MIDNIGHT! __HTTP__ __HTTP__ _E_
Ronald Kessler's new book The Secrets of the FBI is a great book that should be read by everyone. _E_
(2/2) David brilliantly tells it like it is the real deal! Read it! __HTTP__ _E_
The @nyjets are going to have a terrific season. @Mark_Sanchez & @TimTebow will do great things on the field. _E_
I'd bet a good lawyer could make a great case out of the fact that President Obama was tapping my phones in October just prior to Election! _E_
Focus on your goals not your problems. Don't tread water. Get out there and go for it. _E_
An idealist is a person who helps other people to be prosperous. Henry Ford _E_
When will the Democrats give us our Attorney General and rest of Cabinet! They should be ashamed of themselves! No wonder D.C. doesn't work! _E_
Great parent teacher listening session this morning with @VP Pence & @usedgov Secretary @BetsyDeVos. Watch:... __HTTP__ _E_
I will be doing @oreillyfactor tonight at 8:00pmE from Mesa Arizona will be talking about the #GOPDebate & more. __HTTP__ _E_
My @SquawkCNBC interview discussing why I don't own Facebook stock and running a tough campaign against @BarackObama __HTTP__ _E_
I am soooo proud of my children Don Eric and Tiffany their speeches under enormous pressure were incredible. Ivanka intros me tonight! _E_
Basically nothing Hillary has said about her secret server has been true. #CrookedHillary _E_
.@megynkelly the @FoxNews poll said very plainly I came in second in the debate. All others Time Drudge Slate etc. said I came in 1st. _E_
Greta in a few minutes will be interesting! _E_
The ratings at @FoxNews blow away the ratings of @CNN not even close. That's because CNN is the Clinton News Network and people don't like _E_
I am very worried that if @BarackObama is re elected then Medicare will be destroyed. We must take care of our seniors. _E_
I was on CNBC this morning talking about the market and America's financial future __HTTP__ _E_
If you can't focus with unyielding resolve then you will never be successful. Believe in yourself and you can accomplish your goals. _E_
I have great confidence in King Salman and the Crown Prince of Saudi Arabia they know exactly what they are doing.... _E_
Bernie Sanders must really dislike Crooked Hillary after the way she played him. Many of his supporters because of trade will come to me. _E_
Stop congratulating Obama for killing Bin Laden. The Navy Seals killed Bin Laden. #debate _E_
The crowning moment – Conneticut's Erin Brady winning @MissUSA 2013 __HTTP__ _E_
CPAC attendees & fellow patriots lines for my @CPACnews start at 7:00AM outside the Potomac Ballroom. Make sure to get there early! _E_
Henry McMaster Lt. Governor of South Carolina who endorsed me beat failed @CNN announcer Bakari Sellers so badly. Funny! _E_
Lightweight @AGSchneiderman is pushing for the Moreland Commission to be disbanded immediately because he is being looked at! _E_
I was standing with @SHAQ when a young high school star Kevin Garnett @Celtics said to a crowd Forget Shaq I want to meet Donald Trump. _E_
Glad that @MittRomney is hitting @BarackObama on ending work requirements for welfare. Obama attacks the American work ethic. _E_
Thank you to General John Kelly who is doing a fantastic job and all of the Staff and others in the White House for a job well done. Long hours and Fake reporting makes your job more difficult but it is always great to WIN and few have won more than us! _E_
As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_
Sec of State Kerry said we would not go back to Iraq. We shouldn't but he should not have said that. So stupid! _E_
A nurse in Dallas who treated Ebola patient Thomas Duncan was allowed to fly to Cleveland.She should never have been so allowed! The real JV _E_
Dress your best! The Trump Signature Collection exclusively available @Macys offers the tops style in menswear __HTTP__ _E_
Scary while @BarackObama has been POTUS for 1.6% of America's history he has amassed 33.3% of the total debt. _E_
Obama betrays Israel yet again our strongest ally in the Middle East. He will recognize Hamas breaking long standing US policy. _E_
Looking forward to next week's unveiling of the Red Tiger @TrumpDoral. An 18 hole masterpiece w/two island greens __HTTP__ _E_
I am proud of the Tea Party. These great patriots have accomplished so much in strengthening our country in only 3 short years. _E_
President Donald J. Trump Proclaims October 9 2017 as #ColumbusDay __HTTP__ _E_
The best thing you can do is deal from strength and leverage is the biggest strength you have." – THE ART OF THE DEAL _E_
Thank you Newt! __HTTP__ _E_
Vince McMahon @WWE and I hold the all time ratings & pay per view record in the history of wrestling. _E_
Trust in God and be true to yourself. Mary MacLeod Trump Know everything you can about what you're doing. Fred C. Trump _E_
In order to be successful especially to be very successful you must have the ability to be able to handle pressure! _E_
Celebrity Apprentice will be rebroadcast tonight at 9 on CNBC. _E_
Flashbk – "Trump: 'I would build a border fence like you have never seen before'" __HTTP__ via @BreitbartNews by @rwildewrites _E_
The golden rule for every businessman is this: 'Put yourself in your customer's place.' Orison Swett Marden _E_
Romney was the architect of ObamaCare. Bush's Chief Justice legalized the monstrosity. Notice a trend? _E_
Who thinks that President Obama is totally incompetent? _E_
Our $17T national debt and $1T yearly budget deficits are a national security risk of the highest order. _E_
Just left the set of The Apprentice the live show tonight will be fantastic and something very big and very different is going to happen _E_
An appeaser is one who feeds a crocodile hoping it will eat him last. Winston Churchill _E_
How do third rate talents with no smarts like @ron_fournier get so much time on television news. Boring guy really bad for ratings! _E_
.@RogerJStoneJr was great on @TheKudlowReport last night. Roger and Larry are good friends! _E_
I'll be on @foxandfriends on Monday at 7:30 a.m. Always a great time. _E_
Thanks Matthew! _E_
Obama will go down as the worst President in history on many topics but especially foreign policy. _E_
Via @WashTimes by @EmilyMiller: Donald Trump says 'This country is going to hell in a handbasket' __HTTP__ _E_
.@IamStevenT stopped by my office to say hello a great guy! __HTTP__ _E_
I got to know @johnboehner very well—he is a great guy who will do the right thing for the country! _E_
...to Mar a Lago 3 nights in a row around New Year's Eve and insisted on joining me. She was bleeding badly from a face lift. I said no! _E_
.@brithume thinks that when Republicans drop out of the race someone will pick up ALL of that vote. The fact is I will get much of it! _E_
Via @UnionLeader by @tuohy: "Trump inches closer to a decision" __HTTP__ _E_
Anticipate change and embrace it. Recognize new developments that you can capitalize on and use to open new doors. _E_
Via @NewHampJournal by @jdistaso: "In NH 'The Donald' hammers Mitt Jeb as he again weighs a run for President" __HTTP__ _E_
I will be speaking about our great journey to the Republican nomination at 9:00 P.M. The movement toward a country that WINS again continues _E_
With oil below $50 the blighted views by windfarms of historic @CulzeanCastle will be very sad. #SaveCulzean __HTTP__ _E_
Happy Thanksgiving I hope everyone can get together to MAKE AMERICA GREAT AGAIN! It won't be easy nothing is but it can be done. _E_
.@BretBaier Thank you for the very fair and highly professional segment on me tonight. Many people watched and commented. _E_
Obama is not working. US Manufacturing orders fell a record 13.9% in August. Where's the recovery? __HTTP__ _E_
Aspirin gets the best press of almost anything I can think of fact or great PR? _E_
Very sad that a person who has made so many mistakes Crooked Hillary Clinton can put out such false and vicious ads with her phony money! _E_
Check out today's From The Desk Of Donald Trump at __HTTP__ I'm willing to answer your questions tweet me.... _E_
Working hard from New Jersey while White House goes through long planned renovation. Going to New York next week for more meetings. _E_
Will be leaving Trump Turnberry tomorrow place & Women's British Open are great. Will be back hitting hard tomorrow. @Turnberrybuzz _E_
Via @BreitbartNews: DONALD TRUMP: EXEC AMNESTY WILL MAKE ILLEGAL IMMIGRATION 'WORSE THAN IT'S EVER BEEN __HTTP__ _E_
Don't worry West Coast etc. we are not going to tweet who was fired or give any indication there of until after it airs. #CelebApprentice _E_
Today we are not merely transferring power from one Administration to another or from one party to another – but we are transferring... _E_
New York City hosted over 52 million visitors in 2012. __HTTP__ Record amount visited Trump Tower. _E_
A doctor on NBC Nightly News agreed with me we should not bring Ebola into our country through two patients but should bring docs to them. _E_
When the military informed Obama that they had Bin Laden is there anyone with a brain that would not have said Ok go get him ? _E_
Entrepreneurs: Put everything you've got into what you're doing. Know exactly what you want and go for it. Nothing should be haphazard. _E_
Boy did Pharrell & Robin Thicke get screwed. The Marvin Gaye song sounds nothing like theirs. Get new lawyers fast! _E_
I will be interviewed by @SeanHannity tonight at 10pm EST on @FoxNews! Enjoy! _E_
RT @Scavino45: LIVE Joint Statement by President Trump and Prime Minister Shinzo Abe: __HTTP__ _E_
Congrats to @cheflents of TrumpCollection's #TrumpChicago on being a James Beard semifinalist: __HTTP__ via @CrainsChicago _E_
Happy Birthday to the great @BillyGraham. He's done so many wonderful things not the least of which is his fantastic family. I love Billy! _E_
Join me in Cincinnati Ohio tomorrow evening at 7:00pm. I am grateful for all of your support. THANK YOU!Tickets:... __HTTP__ _E_
Hillary Advisers Wanted Her To Avoid Supporting Israel When Talking To Democrats: __HTTP__ _E_
Our campaign store is officially open! Visit __HTTP__ to shop the latest #MakeAmericaGreatAgain merchandise. _E_
As I have been saying. Only the beginning: ISIS Suspects Arrested in Turkey 150 European Passports Seized. __HTTP__ _E_
Great work Ivanka! __HTTP__ _E_
A clip from guest hosting @extratv yesterday on @nbc discussing Halle Angus and Gen. Petraeus __HTTP__ _E_
.@gerardtbaker Gerard—wonderful job last night as moderator of the debate. I told many "really smart and elegant." _E_
I am getting worried about Chris @hardball_chris Matthews. Is he drinking again? _E_
Join me live from Fort Myer in Arlington Virginia. __HTTP__ _E_
I win an election easily a great movement is verified and crooked opponents try to belittle our victory with FAKE NEWS. A sorry state! _E_
I will sign the first bill to repeal #Obamacare and give Americans many choices and much lower rates! _E_
With the very dangerous carjacking epidemic going on especially in New York and New Jersey you would be lucky to have a gun for protection _E_
Pocahontas wanted V.P. slot so badly but wasn't chosen because she has done nothing in the Senate. Also Crooked Hillary hates her! _E_
Merry Christmas and a very very very very Happy New Year to everyone! _E_
Via @BreitbartNews __HTTP__ _E_
Via WSOC_TV: Donald Trump's son says family thinking about expanding in uptown Charlotte __HTTP__ Great job @EricTrump _E_
Carly Fiorina is terrible at business the last thing our country needs! __HTTP__ _E_
.@BarackObama's assault on coal and gas and oil will send energy and manufacturing jobs to China. @MittRomney _E_
Thank you. __HTTP__ _E_
.@ConradMBlack what an honor to read your piece. As one of the truly great intellects & my friend I won't forget! __HTTP__ _E_
Today it was my privilege to welcome survivors of the #USSArizona to the @WhiteHouse. #HonorThemRemarks: __HTTP__ __HTTP__ _E_
.@FoxNews should be ashamed for allowing experts to explain how to make a nuclear attack! _E_
...way up. Regulations way down. 600000+ new jobs added. Unemployment down to 4.3%. Business and economic enthusiasm way up record levels! _E_
You talk tough Mr. President but have done nothing about China killing our jobs and economy. _E_
Watch Celebrity Apprentice on NOW! _E_
RT @LouDobbs: Making America Great Again @Kellyannepolls: After #Irma @POTUS is focused on saving lives not swamp shenanigans. #Dobbs #MA... _E_
I'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ __HTTP__ _E_
Congratulations to Thomas Perez who has just been named Chairman of the DNC. I could not be happier for him or for the Republican Party! _E_
...Corker dropped out of the race in Tennesse when I refused to endorse him and now is only negative on anything Trump. Look at his record! _E_
PAY TO PLAY POLITICS. #CrookedHillary __HTTP__ _E_
The Democrats are pushing for Universal HealthCare while thousands of people are marching in the UK because their U system is going broke and not working. Dems want to greatly raise taxes for really bad and non personal medical care. No thanks! _E_
Bird killing windfarm that I oppose in Aberdeen just got delayed by at least two years.@AlexSalmond forced the failing developers to delay! _E_
...You have little persona but The Apprentice concept is great and lucky for you! _E_
Happy to have just passed 1.3M Twitter followers. Love communicating with everyone daily. _E_
This shows what a complete & total liar Ted Cruz is he said he wouldn't have nominated John Roberts. Really? __HTTP__ _E_
"Americans are hungry to feel once again a sense of mission and greatness." – Pres. Ronald Reagan _E_
Jeb Bush will never secure our border or negotiate great trade deals for American workers. Jeb doesn't see & can't solve the problems. _E_
Rima Fakih our beautiful Miss USA rode with me on the Gray Line Ride of Fame yesterday... __HTTP__ _E_
Remember if you don't promote yourself then no one else will! Likewise believe in yourself or no one else will either. _E_
ALWAYS BORROW MONEY FROM A PESSIMIST BECAUSE HE WILL NEVER EXPECT IT TO BE PAID BACK! _E_
Great news. We are only just beginning. Together we are going to #MAGA! __HTTP__ __HTTP__ _E_
Will be on @CNN at 7:00 A.M. _E_
Wowthe Fake News media did everything in its power to make the Republican Healthcare victory look as bad as possible.Far better than Ocare! _E_
"Shutting down the government is a very serious thing. People die accidents happen. I don't know how I would vote right now on a CR OK?"Sen. Dianne Feinstein (D Calif) __HTTP__ _E_
The Afghan Security Forces who we are training have killed 52 U.S. soldiers __HTTP__ Time to get out of there! _E_
With China beating us like a punching bag daily OPEC vacuuming our wallets clean and jobs nowhere in sight (cont) __HTTP__ _E_
Both @BarackObama and China have embraced OWS. All want the decline of America. Time for the protesters to go home. _E_
Melania and I are honored to light up the @WhiteHouse this evening for #WorldAutismAwarenessDay. Join us & #LIUB.... __HTTP__ _E_
Obama's rollout of his ISIS war plan is another unmitigated disaster. The Generals must be furious. _E_
I will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting! _E_
For a president who likes to showcase how hip and tech savvy he is Obama also appears surprisingly clueless (cont) __HTTP__ _E_
Will be in New Hampshire and then on @CNN Special at 9 PM tonight. _E_
Under a Trump administration it's called #AmericaFirst! #ImWithYou __HTTP__ _E_
Amazing crowd outside @FallonTonight. Tune in tonight at 11:30. __HTTP__ _E_
Thank you @TeamTrump Florida. Keep me updated and lets get those 100000 registered voters!#MakeAmericaGreatAgain __HTTP__ _E_
Former Weather Underground radical Kathy Boudin spent 22 yrs in prison for armored car robbery that killed 2 cops & a Brinks guard... _E_
So I raised/gave $5600000 for the veterans and the media makes me look bad! They do anything to belittle totally biased. _E_
"Going with your instincts requires tuning in to everything around your decision." – Think Big _E_
Just put out a very important policy statement on the extraordinary influx of hatred & danger coming into our country. We must be vigilant! _E_
The last thing we need is another Bush in the White House. Would be the same old thing (remember read my lips no more taxes ). GREATNESS! _E_
Thank you for a great afternoon Birmingham Alabama! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Obama said in his speech that Muslims are our sports heroes. What sport is he talking about and who? Is Obama profiling? _E_
Ridiculous that they gave the 14 year old golfer from China a one stroke penalty for slow play at The Masters(see I can stick up for China) _E_
The real story on Collusion is in Donna B's new book. Crooked Hillary bought the DNC & then stole the Democratic Primary from Crazy Bernie! _E_
Stock Market hits an ALL TIME high! Unemployment lowest in 16 years! Business and manufacturing enthusiasm at highest level in decades! _E_
I hear they are very unhappy w/ Arianna and @huffingtonpost at @AOL. I'll bet she won't be there for long! _E_
Remember oftentimes the best deal you make is the deal you don't make! _E_
RT @KellyannePolls: Love and prayers for friends Adrienne & Eric Bolling. May Eric Chase know eternal peace. __HTTP__ _E_
Pennsylvania: Cast your vote for Trump for POTUS & ALSO vote for the TRUMP DELEGATES in your congressional district! __HTTP__ _E_
Why is no one talking about the horrible murder of Ana Charle by ex con thug West Spruill. Gunned down on street naked. Why no riots here? _E_
....instead of giving to a wonderful charitable cause. _E_
Via @BreitbartNews by @AWRHawkins: TRUMP PREACHES PEACE THROUGH STRENGTH IN PHOENIX __HTTP__ _E_
The Federal government spent over $3.7 trillion last year. This is unsustainable and a true danger. The American dream is being destroyed. _E_
It has been a pleasure to make so many friends and meet so many great people on the trail this past cycle. We will fight on! _E_
Once again Obama fails to classify China as a currency manipulator. He just helped China steal even more jobs and money from us. _E_
The pressure on the debt ceiling is on @BarackObama.... __HTTP__ #trumpvlog _E_
Thank you for all of the nice statements on the Press Conference yesterday. Rush Limbaugh said one of greatest ever. Fake media not happy! _E_
If you don't have a competitive advantage don't compete. Jack Welch _E_
Pres. Bill Clinton 5.31.12: @MittRomney had a sterling business career. _E_
Will be having meetings and working the phones from the Winter White House in Florida (Mar a Lago). Stock Market hit new Record High yesterday $5.5 trillion gain since E. Many companies coming back to the U.S. Military building up and getting very strong. _E_
Wow new Reuters Poll just out. Big lead if you want to MAKE AMERICA GREAT AGAIN! TRUMP 37 CRUZ 11 This is at the top of Drudge! _E_
Liberals can hardly belileve it they can't understand how health care costs could have risen so much when (cont) __HTTP__ _E_
Crooked H destroyed phones w/ hammer 'bleached' emails & had husband meet w/AG days before she was cleared & they talk about obstruction? _E_
"Learn know and show. It's a proven formula. Put it to use starting today." – Think Like a Champion _E_
#Imwithyou __HTTP__ __HTTP__ _E_
Today will be a big day @Team_Mitch for you in many ways. The country is lucky. _E_
Great read: "How New York's Veterans Day Parade Became 'America's Parade'" __HTTP__ _E_
Congrats to @JoeTorre @TonyLaRussa & Bobby Cox on all being unanimously elected to @MLB's @BaseballHall! Great leaders & managers. _E_
At the National Achievers Congress in London this October I'm going to talk about success and how to avoid failure __HTTP__ _E_
I have great respect for the people that represent China. What I don't respect is the way that we negotiate and (cont) __HTTP__ _E_
If Justice Roberts had done the right thing and voted against ObamaCare our country would be in a lot better shape right now! TOTAL TURMOIL _E_
"NBC FIRES TRUMP KEEPS SHARPTON: The bigots of the NBC executive suite look the other way" __HTTP__ via @AmSpec by @JeffJlpa1 _E_
Wow Rowanne Brewer the most prominently depicted woman in the failing @nytimes story yesterday was on @foxandfriends saying Times lied _E_
Featuring private living spaces oversized bathrooms & stunning views @TrumpSoHo = downtown NYC's premiere hotel __HTTP__ _E_
RT @DRUDGE_REPORT: MEXICO 2ND DEADLIEST COUNTRY TOPS AFGHAN IRAQ... __HTTP__ _E_
Today I signed an Executive Order on Improving Accountability and Whistleblower Protection at the @DeptVetAffairs:... __HTTP__ _E_
I'll bet Obama now uses the amendment for the debt ceiling. _E_
.@BillClinton was very nice to me as I am to him on the Piers Morgan Show (CNN). He is loyal to his friends. @piersmorgan _E_
I will be interviewed on @greta at 7:00 P.M. Enjoy! @FoxNews _E_
He @johnedwards is bad but @andrewyoung is worse not only is he a rat but it turns out he stole much of the money for himself. _E_
Putin has shown the world what happens when America has weak leaders. Peace Through Strength! _E_
My thoughts and prayers are with the great people of Tennessee during these terrible wildfires. Stay safe! _E_
The thousands of people that showed up for me in Phoenix were amazing Americans. @SenJohnMcCain called them crazies must apologize! _E_
1988 with Oprah discussing why I would never rule out a run for #POTUS.#Trump2016 #VoteTrumpNY #PrimaryDay __HTTP__ _E_
I don't know how much longer I can take this bullshit so terrible! #Oscars _E_
I feel so badly for Mark Cuban the Dallas Mavericks were just eliminated from the playoffs and his partners are pissed. Very sad! _E_
A great book by a great guy highly recommended! __HTTP__ _E_
No surprise Obama's Deputy Campaign Manager tweeted link from Chinese propaganda outlet __HTTP__ Did she also write it? _E_
For those of you that have conveniently forgotten dummy Jon Stewart is a bad filmmaker. His last effort was a real bomb (in all ways)! _E_
CHAIN MIGRATION must end now! Some people come in and they bring their whole family with them who can be truly evil. NOT ACCEPTABLE! __HTTP__ _E_
Lightweight A.G. Eric Schneiderman meets with President Obama (who he told me sucks as a president) and quickly files a suit against me! _E_
RT @DanScavino: Doesn't fit the MSM narrative so they wont share what @realDonaldTrump did for Jesse Jackson in 1999 so I will! __HTTP__ _E_
There are huge opportunities for profits if you can think big & create big solutions for the human needs brought by trends. Think Big _E_
Looking forward to my @theFAMiLYLEADER summit visit and speech. _E_
The April jobs report is terrible. If the labor forces didn't shrink under @BarackObama then real unemployment (cont) __HTTP__ _E_
I'll bet Jimmy Fallon gets great ratings tonight! _E_
Great interview on @foxandfriends with the parents of Otto Warmbier: 1994 2017. Otto was tortured beyond belief by North Korea. _E_
Yesterday was a big day for the stock market. Jobs are coming back to America. Chrysler is coming back to the USA from Mexico and many others will follow. Tax cut money to employees is pouring into our economy with many more companies announcing. American business is hot again! _E_
As an addition Apple must go to a larger screen now asap! They're losing their standing in the market! _E_
#NYCStrong #USA __HTTP__ _E_
Remember Anthony Wiener continued sending sick pics. long after his resignation from Congress and his apology zero control over himself! _E_
Great job on Fox this morning @KatiePavlich. I am sending out for your book immediately. Thank you very much! _E_
.@GovChristie is going to do a fantastic job tonight explaining why @MittRomney should be elected and @BarackObama has to go. _E_
Honored to be named as one of business's "Top Leaders Icons and Rebels" by @CNBC __HTTP__ Vote Trump! _E_
Getting ready to deliver a VERY IMPORTANT DECISION! 8:00 P.M. _E_
Destroying the world's finest health care system so that @BarackObama can have his socialized medicine program (cont) __HTTP__ _E_
Despite what you have heard from the FAKE NEWS I had a GREAT meeting with German Chancellor Angela Merkel. Nevertheless Germany owes..... _E_
The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved there will be a News Conference at The White House at approximately 1:00 P.M. _E_
While Obama is obsessed with green collar jobs blue collar workers aren't buying it. (cont) __HTTP__ _E_
We should never have gone into Iraq but once in should have gotten out a lot faster. MAKE AMERICA GREAT AGAIN! _E_
China will never go to war with us because if they won they would only take over property they already own! _E_
Now is the time to buy a house if you can DIRECTLY from a bank. They want to get rid of all their foreclosures. _E_
Watch the clip from my #C21 Super Bowl spot on @AccessHollywood tonight. _E_
...yet not one meeting with an ally (or an enemy!) Where's the media? _E_
Thank you West Virginia! All across the country Americans of every kind are coming together w/one simple goal: to MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Entrepreneurs: Pay attention to your negotiation skills. It's all about persuasion and persuasion is power. _E_
I will be speaking at 9:00 A.M. today to Police Chiefs and Sheriffs and will be discussing the horrible dangerous and wrong decision....... _E_
Only reason the hacking of the poorly defended DNC is discussed is that the loss by the Dems was so big that they are totally embarrassed! _E_
They're going to riot in Ferguson no matter what. _E_
THANK YOU Phoenix Arizona! Time for new POWERFUL leadership. Just imagine what WE can accomplish in our first 100... __HTTP__ _E_
Somerset County New Jersey SWAT Team really fantastic people! __HTTP__ _E_
A lot of undecided and independent voters have had enough with Obama's lack of transparency. I don't blame them. _E_
'Clinton Charity Got Up To $56 Million From Nations That Are Anti Women Gays' #CrookedHillary __HTTP__ _E_
As expected the media is very much against me. Their dishonesty is amazing but just like our big wins in the primaries we will win! _E_
.@TrumpGolfLA is ranked the top course in the West __HTTP__ If you're in the area book a round today. _E_
While @BarackObama continues to defend ObamaCare in the courts he is also granting companies waivers. Eve... (cont) __HTTP__ _E_
via __HTTP__ Only one man up for the job of president __HTTP__ _E_
Can you conquer the Blue Monster? Book a tee time @TrumpDoral right here __HTTP__ _E_
Via @AP March2013: Jeb said "he was open to...pathway for citizenship for illegal immigrants" __HTTP__ Lying on campaign trail! _E_
Congrats to @rushlimbaugh on the release of his new book "Rush Revere and the Brave Pilgrims." #1 on @amazon and @bnbooks. Must read! _E_
Obama just stated he didn't take school seriously made bad choices and GOT HIGH then how the hell did he get into Columbia & Harvard? _E_
Obama's complaints about Republicans stopping his agenda are BS since he had full control for two years. He can never take responsibility. _E_
A top rated NY course by @GolfDigestMag @TrumpNationalNY provides award winning services and exceptional facilities __HTTP__ _E_
...What is wrong with this story? Isn't this just ridiculous? Terrible! #KathyBoudin _E_
Leaving the White House for the Great State of North Carolina. Big progress being made on many fronts! _E_
Mexico has taken advantage of the U.S. for long enough. Massive trade deficits & little help on the very weak border must change NOW! _E_
Remember Sunday is National Prayer Day (by Presidential Proclamation)! _E_
#TBT For all who have been asking my mother was a great beauty and a wonderful person. Here we are with my father __HTTP__ _E_
At the Univision forum Obama continued to make excuses for Fast and Furious __HTTP__ His operation killed innocent Americans. _E_
Thank you New Hampshire! Together we will Make America Great Again! __HTTP__ _E_
Wouldn't it be great to Repeal the very unfair and unpopular Individual Mandate in ObamaCare and use those savings for further Tax Cuts..... _E_
In Iran deal we get 4 prisoners. They get $150 billion 7 most wanted and many off watch list. This will create great incentive for others! _E_
.@BarackObama is begging the Eurozone to keep Greece in until after 11.6.12. He thinks the world revolves around his re election. _E_
Thank you! #Trump2016 __HTTP__ __HTTP__ _E_
Look forward to going to Indiana tomorrow in order to be with the great workers of Carrier. They will sell many air conditioners! _E_
RT @JoeNBC: Explosive Trump attack on HRC Bill Monica Cosby and Weiner. Trump camp just upped the ante on women's rights __HTTP__ _E_
FLASHBACK: "Hiding evidence of global cooling" __HTTP__ @washtimes "Scientific data" is cooked! _E_
This is what @BarackObama thinks: that America would be better off if we acted more like European socialist (cont) __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016UNIFYING THE NATIONVideo: __HTTP__ __HTTP__ _E_
Am I morally obligated to defend the president every time somebody says something bad or controversial about him? I don't think so! _E_
Remember Bill Maher praised the animals who took down the World Trade Center and was fired by ABC. DROP@HBO until dopey Bill is canned! _E_
We have the Final Six—and @LilJon is the last remaining member of Team Power. He's done a great job. #CelebApprentice _E_
Just received the new Fox poll.Thank you America! #Trump2016 __HTTP__ _E_
Cadillac has made amazing strides in the beauty and quality of their cars. Great management team congratulations! @Cadillac _E_
Visit @Fund_Anything at __HTTP__ to see my picks! #FundAnything _E_
One of my many Twitter followers suggested Obama should take my offer & give $1250000 to each family of the four... __HTTP__ _E_
U.S. COAL PRODUCTIONUp📈7.8% past year. Down📉31.5% last 10 years. #EndingWarOnCoal __HTTP__ _E_
Great meeting with military spouses in Virginia joined by @IvankaTrump @LaraLeaTrump @GenFlynn & @MayorRGiuliani. __HTTP__ _E_
Very exciting week for @TrumpDoral. I will be in Miami opening what will soon be best resort in U.S. World Golf Championship this week! _E_
America's relationship with China is at a crossroads. We only have a short window of time to make the tough (cont) __HTTP__ _E_
We will push onward to victory w/hope in our hearts courage in our souls & everlasting pride in each & every one of you. God Bless America. __HTTP__ _E_
It was a great honor to be on @MikeAndMike on @espn. Wow the response was amazing! _E_
Trump Was Right: 'Obama's America' Tops 2012 Documentaries __HTTP__ via @Newsmax_Media _E_
Looking forward to being guest of honor at @ralphreed's @FFCoalition Patriot Gala Dinner on June 14th in DC. Flag day and my birthday. _E_
The U.S. should not be giving away our strategy & tactics to the enemy so they can prepare. Just go and do what you have to do! _E_
FOX debate advertising rates falling like a rock! Tune into my special event for the Veterans at 9pm EST! _E_
Secret Service members on break from Obama's $4M vacation are more than welcomed to relax at Hawaii's top hotel @TrumpWaikiki. _E_
Prime Minister @Netanyahu and @PresidentRuvi on behalf of @FLOTUS Melania and myself thank you for the invitation... __HTTP__ _E_
Central American presidents are blaming us for the influx of illegal immigration __HTTP__ Obama will soon apologize. _E_
The world is most peaceful and most prosperous when America is strongest. __HTTP__ _E_
Historic Change! Obama has spent over $44M of our money on travel expenses the most for any president __HTTP__ _E_
Cruz going down fast in recent polls dropping like a rock. Lies never work! _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Can you believe it—the model who mysteriously disappeared from the ObamaCare website is not a US citizen—she's from Colombia. _E_
I use Social Media not because I like to but because it is the only way to fight a VERY dishonest and unfair "press" now often referred to as Fake News Media. Phony and non existent "sources" are being used more often than ever. Many stories & reports a pure fiction! _E_
Obama's new excuse for his failures is that you can't change Washington from the inside. Not what he said in '09. __HTTP__ _E_
U.S. tuitions are completely out of control. In the last 4 years the average price has gone up by 15%. __HTTP__ Unsustainable! _E_
I believe in free markets but allowing a merger of US Air & American Airlines is totally ridiculous! Will control most of US market. _E_
HILLARY'S HEALTH CARE POLICIES#DrainTheSwamp #Debate __HTTP__ _E_
Tomorrow is #TrumpTuesday on @squawkCNBC 7:30 AM EST. Always interesting. _E_
In even the darkest moments the light of our people has shown through their goodness their courage and their love. #USA __HTTP__ _E_
The @GOP primary voter spoke last night in VA 7 & @DaveBratVA7th won going away. Now the party MUST stand behind him! Unity Unity Unity! _E_
Department of Homeland Security has spent $3.5 billion dollars building their new headquarters and is years late and billions over budget! _E_
Tea Party takes down Eric Cantor REALLY BIG WIN! _E_
Bashar Assad is stronger today than he was before Obama threatened military action. Obama really bungled this. _E_
Phoenix crowd last night was amazing a packed house. I love the Great State of Arizona. Not a fan of Jeff Flake weak on crime & border! _E_
RT @realDonaldTrump: National Pearl Harbor Remembrance Day "A day that will live in infamy!" December 7 1941 _E_
RT @townhallcom: ABC NBC And CBS Pretty Much Bury IT Scandal Engulfing Debbie Wasserman Schultz's Office __HTTP__ _E_
I will be on @60Minutes tonight at 7:00 P.M. with Mike Pence talking about LAW AND ORDER and many other subjects! Bad times for divided USA! _E_
Congratulations to my friend David Wright of the @mets who is now their all time hitting leader. _E_
The just out USA Today National Poll where I lead by big numbers shows that in a head to head matchup I beat both Hillary and Bernie. _E_
Poor @JohnKasich doesn't have what it takes __HTTP__ _E_
Thank you @SarahPalinUSA for your amazing help and support. Big win leaving now for Atlanta and Nevada.The people of South Carolina got it! _E_
Tomorrow is #TrumpTuesday on @squawkboxCNBC 7:30 AM don't miss it! _E_
I will be going to Mississippi tomorrow night hear the crowds are going to be massive! Look forward to it. _E_
Entrepreneurs: Don't expect anyone to be on your side. Sometimes we're all in this alone. So believing in yourself is mandatory. _E_
Sirius National News at 7:30 A.M. Steve Bannon. @BreitbartNews _E_
Obama has destroyed the middle class. In '09 median household income was $55198. Now it is $50678. Four more years? _E_
SHOCK! While attacking @MittRomney's private equity experience @BarackObama raises $2M from private equity bankers __HTTP__ _E_
When and how are the dummies at the @WSJ going to apologize to me for their totally incorrect Editorial on me. I want smart trade deals. _E_
I will be on @Morning_Joe live from New Hampshire tomorrow at 7am. #Trump2016 #MakeAmericaGreatAgain _E_
Washington must come together on a deal to avoid a fiscal cliff. If taxes are raised they must come with real hard cuts. _E_
I should have easily won the Trump University case on summary judgement but have a judge Gonzalo Curiel who is totally biased against me. _E_
"Much as it pays to emphasize the positive there are times when the only choice is confrontation." – The Art of the Deal _E_
Thanks @greggutfeld. Really nice! I'm glad I did your show. @GregGutfeldShow _E_
Election is being rigged by the media in a coordinated effort with the Clinton campaign by putting stories that never happened into news! _E_
.@ShawnJohnson Congratulations on your engagement he is a lucky guy. You are a true winner and will be an amazing couple. _E_
Once you consent to some concession you can never cancel it and put things back the way they are. Howard Hughes _E_
Did President Obama have a rough day yesterday or what? He has got to start telling the truth NO MORE LIES OR DECEPTION! _E_
Brian I hope @NBCNightlyNews isn't paying you too much look at what's happening to nightly news. _E_
Do the people of Ohio know that John Kasich is STRONGLY in favor of Common Core! In other words education of your children from D.C. No way _E_
Unlike U.S. China taxes things made in the U.S. and sold in China. China demands plants we don't. Stupid! _E_
.@HillaryClinton's 2008 Campaign And Supporters Trafficked In Rumors About Obama's Heritage #DebateNight __HTTP__ _E_
Keystone must be approved through Congress. @BarackObama is costing America over 20000 jobs and driving the price of gas high. _E_
House of Representatives needs to pass Government Funding Bill tonight. So important for our country our Military needs it! _E_
.@ErraticSLK Shout out = work hard! _E_
Everybody should contribute & fight in the long haul battle against autism. @autismspeaks _E_
GOPers eye Donald Trump for governor run __HTTP__ via @nypost by @fud31 _E_
If people knew how hard I worked to get my mastery it wouldn't seem so wonderful at all. Michelangelo _E_
The lady in Chicago that I'm fighting owes me $500 000 and is sophisticated & vicious. She made up a story & plays the age card bad! _E_
Congrats to people of Scotland on the Judge's ruling concerning bird killing land destroying environmentally disastrous windmills. _E_
Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_
O'Malley as former Mayor of Baltimore has very little chance. _E_
I'm eagerly awaiting the next polls. The debate performance could be devastating to the Obama team. Let's see what happens. _E_
Admitted:@BarackObama's Treasury Secretary admitted that their 2013 budget does nothing to address America's (cont) __HTTP__ _E_
In this book our second together we share what gives us the Midas Touch the ability to turn things we touch (cont) __HTTP__ _E_
Russians are playing @CNN and @NBCNews for such fools funny to watch they don't have a clue! @FoxNews totally gets it! _E_
My friend @TheSlyStallone lost his wonderful son Sage this weekend. We all send Sly our love and warmest wishes. (cont) __HTTP__ _E_
RT @CLewandowski_: Trump winning over Latino Republicans poll says | New York Post __HTTP__ _E_
A guy named @BobBeckel on FOX their resident liberal was not born with much of a brain. _E_
For the Republicans to have any success these next two years they must have a long game plan... _E_
It's important to listen to what people say. "Horrible" and "disgusting" are the words I used in response to Sterling's comments. _E_
The most important truth our FOUNDERS understood was: FREEDOM is NOT a gift from Govt. FREEDOM is a GIFT from GOD. __HTTP__ __HTTP__ _E_
Randy Moss should not be bragging about himself—I'm the only one who is allowed to do that! _E_
It is time for DC to protect the American worker not grant amnesty to illegals. Let's Make America Great Again! __HTTP__ _E_
I had amazing time in Charlotte. Great people & many new friends. I look forward to coming back very soon. Congrats to Gavin & Staff. _E_
Do you think that very dumb reporter(blogger) McKay Coppins has apologized to his wife for his very inappropriate behavior while in Florida? _E_
'Americans overwhelmingly oppose sanctuary cities' __HTTP__ _E_
Interesting how the U.S. sells Taiwan billions of dollars of military equipment but I should not accept a congratulatory call. _E_
Via @Newsmax_Media by @dpatten32: "Trump's Brand Gives Him 2016 Mojo" __HTTP__ _E_
Did a shoot in front of the Metropolitian Museum on 5th Ave for the 13th season of the Apprentice... _E_
Exxon donated $250g to Obama's inaugural __HTTP__ I guess the Democrats have no problem accepting money from 'big oil.' _E_
Thanks & I won't let you down. __HTTP__ _E_
There are many ways of going forward but only one way of standing still. Pres. Franklin D. Roosevelt _E_
Via @successmagazine by @MikeSeemuth: "Trump Power" __HTTP__ _E_
Today I announced a new Executive Order with re: to North Korea. We must all do our part to ensure the complete denuclearization of #NoKo. __HTTP__ _E_
"If you like your plan you keep it." = "Gruber is just some adviser." Two of Obama's greatest lies told to the American public. _E_
Experience is not what happens to you it's what you do with what happens to you. Aldous Huxley _E_
RT @IvankaTrump: 3/4: This Administration is deeply committed to those who serve & their families who make it possible through their love a... _E_
Join me live in Louisiana! Tomorrow we need you to go to the polls & send John Kennedy to the U.S. Senate. __HTTP__ _E_
.@BarbaraJWalters @theviewtv Barbara unfortunately you've missed the entire point of my announcement you just don't get it! _E_
"@OMAROSA is a bit toxic" per @BrandenRoderick. Being a bit PC? #CelebApprentice _E_
The Fed should not do QE3. Neither the economy nor the dollar can withstand another round of artificial liquidity. _E_
"Donald Trump: Karl Rove Has Done Ashley Judd A Favor" __HTTP__ via @SheKnows _E_
Will be interviewed by @GStephanopoulos on @ABC at 10:00 A.M. _E_
$716 Billion from Medicare by @BarackObama. When will it end? _E_
Karen Handle's opponent in #GA06 can't even vote in the district he wants to represent.... _E_
Obama still refuses to stop the flights. Is he stubborn or just plain incompetent I say both! _E_
Robert Pattinson should not take back Kristen Stewart. She cheated on him like a dog & will do it again just watch. He can do much better! _E_
Thank you @MikeOzanian for the nice comments on @FoxNews today. Great job! _E_
RT @foxandfriends: U.S. spy satellites detect North Korea moving anti ship cruise missiles to patrol boat __HTTP__ _E_
Today it was a tremendous honor for me to sign the #VAaccountability Act into law delivering my campaign promise... __HTTP__ _E_
It is the same Fake News Media that said there is no path to victory for Trump that is now pushing the phony Russia story. A total scam! _E_
Still time to #VoteTrump! #iVoted #ElectionNight __HTTP__ _E_
How can @JebBush beat Hillary Clinton if he can't beat anyone else on the #GOPDebate stage with $150M? I am the only one who can! _E_
A good head and a good heart are always a formidable combination. Nelson Mandela _E_
If you voted for Obama in 2008 to prove you were not a racist then vote for Romney in 2012 to prove you are not stupid. Thanks Walter D! _E_
Where's the global warming? 2013 was one of the least extreme years in weather on record __HTTP__ _E_
Just got back from the Iowa State Fair. Record crowds phenomenal people. Thank you IOWA I will never let you down! _E_
Re negotiation: Trust your instincts even after you've honed your skills. They're there for a reason. _E_
Rand Paul is a friend of mine but he is such a negative force when it comes to fixing healthcare. Graham Cassidy Bill is GREAT! Ends Ocare! _E_
OPEC has just raised oil to over $102/Barrel. And @BarackObama still won't approve the Keystone Pipeline. Does he want high gas prices? _E_
Join me in Council Bluffs Iowa today at 3pm! #MakeAmericaGreatAgain Tickets: __HTTP__ _E_
.@AlexSalmond If a country wants to rapidly destroy its economy I have an idea just put up subsidized wind (cont) __HTTP__ _E_
Via @ACLJ: Pastor Saeed's Wife Expresses Gratitude to Donald Trump for Raising Her Husband's Plight __HTTP__ _E_
I'll be speaking tomorrow at the San Jose Convention Center (CA) for the first ever National Achievers Congress __HTTP__ _E_
AngieApon I think you should try wearing your hair combed back. It looked good when you slicked it back Mr. Trump ) #ALS May happen thx _E_
Donald Trump's Speech Is a Game Changer. #Trump2016 __HTTP__ __HTTP__ _E_
Build your reputation on intelligence responsibility and results. That's building the right way. Think Like a Champion _E_
I am going to save Medicare and Medicaid Carson wants to abolish and failing candidate Gov. John Kasich doesn't have a clue weak! _E_
I can't believe that in New York we can't watch the PGA Championsip on CBS. How .much discount is Time Warner giving its customers? _E_
Happy 8th Anniversary to @MELANIATRUMP. __HTTP__ _E_
RT @IngrahamAngle: Trump Int'l Golf Club West Palm Beach is spectacular. Almost makes me wish I had time to play/learn/like golf. _E_
I have to admit @AlexSalmond is a tough smart guy. He is formidable by any standard! _E_
Why is Washington ready to spend billions on care for illegals while our VA is still in shambles? Vets should be the priority. _E_
THANK YOU to all of the great men and women at the U.S. Customs and Border Protection facility in Yuma Arizona & around the United States! __HTTP__ _E_
Look forward to being in DC tomorrow—big crowd expected for our protest against the truly stupid nuclear deal we are making with Iran. _E_
When is the media going to talk about Hillary's policies that have gotten people killed like Libya open borders and maybe her emails? _E_
Welcome to @BarackObama's America 8.74 million workers on 'Federal Disability __HTTP__ Where are the jobs?! _E_
I own Turnberry in Scotland one of great resorts in world. Women's British Open there this week. I'll go for two days & back on trail. _E_
A simplified tax code will help promote growth in the private sector. _E_
My @SquawkCNBC #TrumpTuesday interview with Ken Langone & Dick Grasso discussing the Chicago teachers' strike @ 2012 __HTTP__ _E_
Why would Texans vote for liar Ted Cruz when he was born in Canada lived there for 4 years and remained a Canadian citizen until recently _E_
Why can't @Politico get better reporters than Ben Schreckenger? Guy is a major lightweight with no credibility. So dishonest! _E_
Entrepreneurs: Ignorance is not bliss it's fatal. It's costly. Pay attention or get crushed. Watch listen and learn. _E_
I am self funding my campaign and only work for YOU the American people!#Trump2016 Video: __HTTP__ __HTTP__ _E_
"Patriotism is supporting your country all the time and your government when it deserves it." – Mark Twain _E_
If the Republican Convention had blown up with e mails resignation of boss and the beat down of a big player. (Bernie) media would go wild _E_
#BigLeagueTruth __HTTP__ _E_
This week's All Star Celebrity @ApprenticeNBC features another memorable Board Room rumble between @piersmorgan & @OMAROSA. _E_
Saudi Arabia and many of the countries that gave vast amounts of money to the Clinton Foundation cont'd: __HTTP__ _E_
...American Cancer Society and the Dana Farber Cancer Center. _E_
Thank you @loudobbsnews I will be trying very hard to prove you right great show! _E_
I hope A Rod has a great night for the Yankees he owes it to them especially with Derek hurt. _E_
The President's speech tonight will largely focus on class warfare. The Republicans don't know how to handle that—I do. _E_
Failing host @glennbeck a mental basketcase loves SUPERPACS in other words he wants your politicians totally controlled by lobbyists! _E_
From day one I said that I was going to build a great wall on the SOUTHERN BORDER and much more. Stop illegal immigration. Watch Wednesday! _E_
#CrookedHillary Job Application __HTTP__ _E_
Via @cnsnews by @CraigBBannister: "Poll: Hispanics Blacks Call for Tighter Borders Access to Illegals' Jobs" __HTTP__ _E_
Good move by Bernie S. _E_
The statement put out yesterday by @FoxNews was a disgrace to good broadcasting and journalism. Who would ever say something so nasty & dumb _E_
Snow and ice freezing weather in Texas Arizona and Oklahoma what the hell is going on with GLOBAL WARMING? _E_
Boycott Mexico until they release our Marine. With all the money they get from the U.S. this should be an easy one. NO RESPECT! _E_
They will soon be calling me MR. BREXIT! _E_
BIG NIGHT on Celebrity Apprentice tonight. IMPORTANT starts at 10 P.M. as scheduled but NBC just increased all future episodes to 2 hours! _E_
....Dopey @krauthammer should be fired. @FoxNews _E_
Everyone knows I am right that Robert Pattinson should dump Kristen Stewart. In a couple of years he will thank me. Be smart Robert. _E_
...vast sums of money to NATO & the United States must be paid more for the powerful and very expensive defense it provides to Germany! _E_
Where was all the outrage from Democrats and the opposition party (the media) when our jobs were fleeing our country? _E_
My recent statement re: @macys We must have strong borders & stop illegal immigration now!... __HTTP__ _E_
I'm self funding my campaign but lobbyists & special interests for Jeb & others are starting to do big ads—desperate! Don't believe them. _E_
Via @bluegreentweet: Scottish wind farm opposed by Donald Trump delayed __HTTP__ _E_
We will bring America together as ONE country again – united as Americans in common purpose and common dreams. #MAGA _E_
Thank you Senator David Perdue! __HTTP__ __HTTP__ _E_
With Mexico being one of the highest crime Nations in the world we must have THE WALL. Mexico will pay for it through reimbursement/other. _E_
Pastor #Nadarkhani must be released by Iran immediately. I applaud the @WhiteHouse & @StateDept for issuing (cont) __HTTP__ _E_
.@loudobbsnews did a fantastic interview with syndicated columnist Michelle Malkin. Congrats to both! _E_
I am more concerned about Biden in the debate than I am about Obama. Be careful on Thursday night! _E_
RT @USHCC: USHCC was delighted to host @IvankaTrump for a roundtable discussion w/ Hispanic women biz owners today in Washington #USHCCLegi... _E_
One of my first acts as President will be to deport the drug lords and then secure the border. #Debate #MAGA _E_
.@rushlimbaugh played 3 separate audio bites (the most of anyone) of my CPAC speech. Hour 3 in Friday's show. _E_
Crooked Hillary Clinton wants completely open borders. Millions of Democrats will run from her over this and support me. _E_
The Tonight Show @nbc will be amazing 11:30 P.M. ENJOY! _E_
Rush Limbaugh: Trump Has Changed the Entire Debate on Immigration __HTTP__ _E_
Sgt. Bowe Bergdahl should face the death penalty for desertion five brave soldiers died trying to bring him back. U.S. has to get tough! _E_
China's business interests reach far and wide even domestically within our borders. We need to reassess our relationship. _E_
Congratulations to @Mets @RADickey43 on becoming the first knuckleball pitcher to ever win the CY Young award! _E_
#TBT Do you believe once upon a time Jon Stewart really liked me? From 2004. __HTTP__ _E_
Via @AP: Miss USA Olivia Culpo is crowned Miss Universe Ratings increase 15% over last year. __HTTP__ _E_
My offer to Obama is about transparency. In 2008 American people were sold on hope and change. This our last chance to get the full record. _E_
A great night in Macon Georgia! Thank you for all of the support. Together we will #MakeAmericaGreatAgain! __HTTP__ _E_
You have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_
You have to feel bad for the Democrat Senators. They don't want Hagel either. Just following Obama's orders. _E_
So far the Super Bowl is very boring not nearly as exciting as politics MAKE AMERICA GREAT AGAIN! _E_
I am in New Hampshire having a great time! Loved the #GOPDebate last night! Everybody enjoy the Super Bowl. #SuperBowlSunday #SB50 _E_
Well Iran has done it again. Taken two of our people and asking for a fortune for their release. This doesn't happen if I'm president! _E_
Very low ratings radio host Hugh Hewitt asked me about Suleiman Abu Bake al Baghdad Hassan Nasrallah and more typical gotcha questions _E_
Hypocrite. Watch Senator Obama defend democratic debate' of Senate filibuster rules in 2005 __HTTP__ _E_
North Carolina is a fantastic state with wonderful people. I enjoy my time there when I visit Trump National Charlotte. _E_
I'm having a real hard time watching the Academy Awards (so far). The last song was terrible! Kim should sue her plastic surgeon! #Oscars _E_
...But while Dallas dropped to it knees as a team they ALL stood up for our National Anthem. Big progress being made we love our country! _E_
Rated "#1 Resort in Europe" by @CNTraveler @Trump_Ireland offers breathtaking golf & the 5 Star Lodge at Doonbeg __HTTP__ _E_
A 60% increase in Texas Blue Cross/Blue Shield through ObamaCare. I told you so there is panic and anger as healthcare costs explode! _E_
13 states have voter registration deadlines TODAY: FL OH PA MI GA TX NM IN LA TN AR KY SC.Register: __HTTP__ _E_
Can't fool Americans. 57% of uninsured hate ObamaCare __HTTP__ Reality is less will be insured b/c of this monstrosity. _E_
I'm impressed both teams have produced very entertaining silent films. #CelebApprentice _E_
Bad. @gallupnews survey shows 30% of businesses not hiring they are worried they won't be around in a year. __HTTP__ _E_
My @FoxNews interview with @gretawire discussing why I endorsed @MittRomney and why he will make a great President __HTTP__ _E_
Intelligence stated very strongly there was absolutely no evidence that hacking affected the election results. Voting machines not touched! _E_
If US Air and American Airlines are allowed to merge we are back to the days of "monopoly." _E_
The townhall question segment of my @WMUR9 Commitment 2016 Conversation @JoshMcElveen __HTTP__ Great questions/people #FITN _E_
James Clapper who famously got caught lying to Congress is now an authority on Donald Trump. Will he show you his beautiful letter to me? _E_
Just left Trump Golf Links at Ferry Point. Ribbon cutting w/@MayorBloomberg & @jacknicklaus was spectacular. Lots of people & jobs! _E_
Ever see @bluemangroup in performance? They're fantastic. And so are Penn & Teller. Don't miss them. #CelebApprentice _E_
The MAKE AMERICA GREAT AGAIN agenda is doing very well despite the distraction of the Witch Hunt. Many new jobs high business enthusiasm.. _E_
I hope @billmaher comes through with his $5 million offer which I fully accepted or I will be forced to sue him. All goes to charity! _E_
This Sunday's All Star @ApprenticeNBC features some of the biggest fireworks of the entire season. Get ready. _E_
Via @washingtonpost 9/18/01. I want an apology! Many people have tweeted that I am right! __HTTP__ __HTTP__ _E_
Happy 70th Birthday @USAirForce! __HTTP__ _E_
Success is good. Success with significance is even better. Work on what you will be proud to be associated with make your work count. _E_
Good news @RickSantorum did the right thing. I congratulate him on running a very good race. Now it's onto @BarackObama go get him Mitt! _E_
I unfairly get audited by the I.R.S. almost every single year. I have rich friends who never get audited. I wonder why? _E_
Getting ready to leave for the Great State of Indiana and meet the hard working and wonderful people of Carrier A.C. _E_
Vince McMahon shows the crowd one of the greatest moments in WWE History. #WWEHOF __HTTP__ _E_
The time has come. THEGaryBusey will be project mgr on this Sunday's All Star Celebrity @ApprenticeNBC. MUST SEE TV!!! Back to 2 hrs. _E_
See ungrateful Little @MacMiller's statement to me a year ago— __HTTP__ he was kissing my ass! _E_
....on ruining Scotland's beauty with ugly & costly wind turbines? _E_
The U 6 Unemployment Rate is over 14.9%. ObamaCare is stopping businesses from both hiring and expanding. _E_
Ratings challenged @CNN reports so seriously that I call President Obama (and Clinton) the founder of ISIS & MVP. THEY DON'T GET SARCASM? _E_
Will be in Terre Haute Indiana in a short while big rally! See you soon! _E_
Colin Montgomerie @montgomeriefdn You are not only a great golfer you are doing a great job of commentary @GolfChannel _E_
MAKE AMERICA SAFE AGAIN!#NoSanctuaryForCriminalsAct #KatesLaw #SaveAmericanLives __HTTP__ _E_
My Doral Country Club purchase was made just before Miami real estate market went through the roof—good timing! _E_
.@megynkelly must have had a terrible vacation she is really off her game. Was afraid to confront Dr. Cornel West. No clue on immigration! _E_
My @FoxNews int with @TeamCavuto on the state of world affairs economy the Bushes etc. __HTTP__ _E_
Interview with @oreillyfactor on Fox Network 4:00 P.M. (prior to Super Bowl). Enjoy! _E_
On the sands of Playa Brava waves will reflect on walls & circular architecture of Trump Tower Punta del Este __HTTP__ _E_
Thank you @SenOrrinHatch. Let's continue MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_
It's not that I'm so smart it's just that I stay with problems longer. Some good words from Albert Einstein. It pays to be tenacious. _E_
Crooked Hillary who embarrassed herself and the country with her e mail lies has been a DISASTER on foreign policy. Look what's happening! _E_
Be sure to watch The Celebrity Apprentice on Sunday at 9 pm on NBC. It's an episode you'll want to see and one you won't forget! _E_
It's time for Mountain State to have a Senator who will stop Obama's war on coal. This November send DC a message vote for @CapitoforWV! _E_
Ohio had the biggest budget increase in the U.S. If it were not for striking oil they would be bust! Governor Kasich in favor of TPP fraud! _E_
The Clinton Campaign at Obama Justice #DrainTheSwamp __HTTP__ _E_
Ratings for NFL football are way down except before game starts when people tune in to see whether or not our country will be disrespected! _E_
Our Founding fathers got it. They understood that nothing good in life religious freedom economic freedom (cont) __HTTP__ _E_
As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_
I am signing copies of my book CRIPPLED AMERICA. Makes a great holiday gift. Order yours now! __HTTP__ ... ... _E_
My @foxandfriends interview discussing @newsday's endorsement of @MittRomney tomorrow's election and Sandy's victims __HTTP__ _E_
A clip from my interview with @jimmyfallon discussing the cast of @ApprenticeNBC Season 5 __HTTP__ _E_
I play golf to relax. My company is in great shape. @BarackObama plays golf to escape work while America goes down the drain. _E_
Entrepreneurs always remember that every business relationship can lead to greater deals in the future. Be sure to cultivate relationships _E_
I would triple the sanctions on Iran if the American pastor is not released. my @SRQRepublicans speech _E_
$ave your $. Don't invest in @KarlRove. He doesn't have a clue. __HTTP__ _E_
Congratulations to @FLGovScott on getting an A grade from @CatoInstitute on his fiscal policy. Rick is a fantastic governor. _E_
Thank you @LuisRiveraMarin! __HTTP__ _E_
The Yankees are absolutely terrible what happened to this team? _E_
Did you ever think our country would become an economic basket case? So much for Hope & Change. _E_
The endorsement of me by the 16500 Border Patrol Agents was the first time that they ever endorsed a presidential candidate. Nice! _E_
.@TrumpDoral's record $200M renovations are on schedule. The hotel remains open for guests events and conferences. __HTTP__ _E_
RT @greta: interesting poll results so far (and go vote on __HTTP__ __HTTP__ _E_
I'm on Bill @oreillyfactor tonight at 8 PM. It will be another lively interview about how to #MakeAmericaGreatAgain! _E_
Tremendous backlash against the NFL and its players for disrespect of our Country.#StandForOurAnthem _E_
ObamaCare is an attack on our country's identity. The latest victim is the Catholic church. It must be full repealed. @BarackObama _E_
The signature restaurant of @TrumpNewYork @jeangeorges is both a Forbes Five Star & AAA Five Diamond restaurant __HTTP__ _E_
Obama is totally "tweaking" the Republicans because he doesn't respect them—they've got to change their ways. _E_
I hope when Rand Paul gets out of the race—he is at 1% his supporters come over to me. I will do a much better job for them. _E_
I was #1 on Twitter and so positive. Thank you! __HTTP__ _E_
"TEA TALK: Highlights from Monday convention speech from Donald Trump" __HTTP__ via @myrbeachonline by @TSN_MPrabhu _E_
Check out ShouldTrumpRun.... __HTTP__ _E_
Work begins on the Old Post Office in Washington D.C. in 3 months. It will soon become one of the great hotels of the world. _E_
At least ObamaCare/RomneyCare architect Gruber admitted albeit privately that we were lied to by Obama. Gang of Liars. _E_
Join me in Colorado Springs at 2pm or in Denver tonight at 7pm!Colorado Springs: __HTTP__ __HTTP__ _E_
Our nation has a duty to care for our vets & their families. It's time to do it! Let's Make America Great Again! __HTTP__ _E_
I am really happy that Hillary made her speech right under Trump World Tower! _E_
My @WMUR9 'Close Up' int. with @JoshMcElveen discussing the midterms the new Congress travelling to NH & 2016 __HTTP__ _E_
My new book Time To Get Tough comes out on December 5th. Pre order on Amazon.com. It's the best book I've ever written. _E_
RT @Scavino45: The Iran deal was one of the worst & most one sided transactions the United States has EVER entered into. @POTUS @realDona... _E_
Time Warner Cable went out on 5th Avenue for 2 plus days. They are a disaster. I think I'm going to switch. _E_
RT @DanScavino: Congratulations to the 2017 @PinstripeBowl (Yankee Stadium) Champions Iowa @HawkeyeFootball! __HTTP__ _E_
Entrepreneurs: Learn to be succinct. Can you tell someone your idea in three minutes or less? Be clear and concise. _E_
To show you how politicians act Bobby Jindal spent $1000 to register in New Hampshire & dropped out the next day. Such a waste! _E_
Great quote from the late Steve Jobs: Innovation distinguishes between a leader and a follower. _E_
Congratulations to @BarackObama for being reckless. In his first 38 months in office the debt has grown at a rate that is unthinkable. _E_
Whether you think you can or think you can't you are right. Henry Ford _E_
It is crucial for Republicans to remain united during this shutdown _E_
Senator Luther Strange has done a great job representing the people of the Great State of Alabama. He has my complete and total endorsement! _E_
Great piece on Extra tonight re. Celebrity Apprentice! _E_
"Your money should be at work at all times. Even in the worst economy there is no excuse. Think Like a Billionaire _E_
The average family has spent $4155 this year filling up the car on $3.50/gallon average. Both record highs. (cont) __HTTP__ _E_
"When mistakes are made and they will be the entrepreneur's true character emerges and further growth takes place." – The Midas Touch. _E_
I can't get over after all of the buildup what a terrible game that was the worst Super Bowl in history. The advertisers must be furious! _E_
Employees of @NYMag should have their resumes updated. It is very boring & will die in the near future. How much are they losing now? _E_
Just left a great event in Pella. Going to church tomorrow in Muscatine Iowa. _E_
Every sports fan is treated to an All Star game. The loyal and growing fan base of @CelebApprentice will be getting a much bigger treat! _E_
Join me live at the 2018 World Economic Forum in Davos Switzerland! #WEF18 __HTTP__ __HTTP__ _E_
.@club4growth should release the letter they sent me asking for $1000000. When I said no they came out against me. A scam operation? _E_
The most luxurious hotel in downtown Manhattan @TrumpSoHo is a top destination __HTTP__ _E_
Many Republicans support TPP. They are stupid. We have stupid Republicans too. We need to keep jobs here! my @SRQRepublicans speech _E_
Remember that in 2006 then Senator Obama voted NOT TO INCREASE THE DEBT CEILING. Now he acts in disbelief as others plan to do the same! _E_
Thank you Florida! #Trump2016 __HTTP__ _E_
I want to thank the people of Iowa for an unbelievable day. The crowds were amazing. Will be back Tuesday! _E_
All of the phony T.V. commercials against me are bought and payed for by SPECIAL INTEREST GROUPS the bandits that tell your pols what to do _E_
Thank you @Forbes for showing the @WSJ was wrong. So dishonest! __HTTP__ _E_
Thank you California! See you soon!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
No taxes in Boehner or Reid Plan important victory for America. _E_
From the great author of Rich Dad Poor Dad Robert Kiyosaki here is a very nice article. __HTTP__ _E_
Sometimes understanding other people's problems is the key to finding opportunities. Midas Touch w/@atheRealKiyosaki _E_
The Republican Party must spend its money wisely and do incredible television commercials. They must be tough and smart. _E_
The people of Colorado had their vote taken away from them by the phony politicians. Biggest story in politics. This will not be allowed! _E_
Have you seen the new #Trump fall collection exclusively available @Macys? Top selling brand nationwide.Ties shirts fragrance great gifts. _E_
WASTE HUD is spending $70M to teach grant recipients how to spend the money from their grants __HTTP__ Does it get any dumber? _E_
I am the best builder but if that were my building with the crane mishap I would have been lambasted from coast to coast. _E_
I love Twitter.... it's like owning your own newspaper without the losses. _E_
The WH yesterday defended Biden's comments that the Taliban aren't our enemy. When did the American people decide this? __HTTP__ _E_
RT @IvankaTrump: Thank you to the amazing men and women working tirelessly to bring relief to those in need. #PuertoRico #HurricaneMaria ht... _E_
Thank you Carl Higbie (former Navy Seal) for you support of my plan to straighten out the Veterans Administration a mess!Great job @kilmeade _E_
Luther Strange of the Great State of Alabama has my endorsement. He is strong on Border & Wall the military tax cuts & law enforcement. _E_
Wow @CNN is really working hard to make me look as bad as possible. Very unprofessional. Hurting in ratings bad television! _E_
Our military is building and is rapidly becoming stronger than ever before. Frankly we have no choice! _E_
My interview with @IngrahamAngle discussing @MittRomney's Super Tuesday and why @BarackObama must be defeated. __HTTP__ _E_
When will people realize that @billmaher is not an intellectual but actually a rather dumb guy—just look at his past. _E_
Our new @MissUniverse Olivia Culpo is not only beautiful but intelligent and accomplished. She is a wonderful role model. _E_
It was truly an honor to introduce my wife Melania. Her speech and demeanor were absolutely incredible. Very proud! #GOPConvention _E_
It was an honor to welcome the Teachers of the Year to the WH last month. Today we honor and thank all teachers!... __HTTP__ _E_
No member of Congress should be eligible for re election if our country's budget is not balanced deficits not allowed! _E_
Sarasota was an unbelievable success. We expected 5000 a record but 12000 showed up! Great love in the air! __HTTP__ _E_
We better get tough with RADICAL ISLAMIC TERRORISTS and get tough now or the life and safety of our wonderful country will be in jeopardy! _E_
I hope Arnold S. does well with the Apprentice because he is a nice guy and also because I get a big percentage of the profits! _E_
I am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_
On the way to the #GOPDebate with my wonderful wife @MelaniaTrump. __HTTP__ __HTTP__ _E_
Entrepreneurs: View any conflict as an opportunity. Being positive could lead you into a fortunate situation. _E_
Congratulations to Billy Payne and @AugustaNational on doing the right thing. _E_
Reporter should resign __HTTP__ _E_
The road to success is always under construction. Arnold Palmer _E_
Do we still want a President who bows to the Saudis and lets OPEC rip us off? Make America strong vote for @MittRomney. _E_
Why doesn't @CNN use the #CNN Iowa poll? @andersoncooper @andydean2014 _E_
... & all Obama is concerned about stopping them doing is buying wind farms __HTTP__ _E_
I don't know why but I feel so sorry for dummy reporter John Heilemann when I watch him on television. _E_
I look forward to the debate on Thursday night & it is certainly my intention to be very nice & highly respectful of the other candidates. _E_
It was my honor. THANK YOU! __HTTP__ _E_
My interview with @Jay_Severin on behalf of @MittRomney discussing why the GOP must nominate @MittRomney __HTTP__ _E_
Congratulations to our new National Security Advisor General H.R. McMaster. Video: __HTTP__ __HTTP__ _E_
My induction last night at Madison Square Garden into the WWE Hall of Fame was amazing I met some great people including Bruno. _E_
Why hasn't Obama created jobs? _E_
US froze $8B in Iranian assets during '79 Hostage Crisis. Now Obama is giving it back to Iran while Christian Pastor is jailed. Don't do it! _E_
Wind farms are ugly not cost effective and don't produce worthwhile returns or energy. No wonder governments are giving up on them. _E_
Kim Jong Un of North Korea made a very wise and well reasoned decision. The alternative would have been both catastrophic and unacceptable! _E_
A total refutation of the disgraceful David Brooks column in the failing @NYTimes by the @WashingtonPost: __HTTP__ _E_
Another four years not good for the country but we'll have to live with it! _E_
Trump is already delivering the jobs he promised America __HTTP__ _E_
Obama hasn't released a budget in over 2 years & for the 1st time House & Senate delivered budgets before him __HTTP__ _E_
Trump Turnberry news conference tomorrow at noon Scotland time. The place is amazing! _E_
"The vast majority felt she should be prosecuted... even senior FBI officials thought Crooked was guilty. __HTTP__ _E_
It is truly amateur hour at the White House and this is why we should not be doing the war thing right now! _E_
May God Forever Bless the United States of America. #NeverForget911 __HTTP__ _E_
We need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_
Keep Wednesday morning free. You will want to see this! _E_
My condolences to Dwyane Wade and his family on the loss of Nykea Aldridge. They are in my thoughts and prayers. _E_
America's debt officially became 100% of our GDP on @BarackObama's 50th birthday coincidence? _E_
Obama Care is already having a devastating impact on our economy. _E_
Will do thanks. __HTTP__ _E_
Let me say it as clearly as possible that the attack on my Catholic brothers and sisters is an attack on me. @GovMike Huckabee _E_
Scary thought is the sexual pervert Anthony Weiner now in Charlotte? Did he bring his phone with him? _E_
....and has been horrible on Virginia economy. Vote @EdWGillespie today! _E_
We still have not learned the full truth on Benghazi. Four Americans were killed. Congress must act! _E_
Eli Manning staged a great comeback in 4th quarter an elite quarterback. _E_
.@CarlyFiorina had to inject herself into my factual statements concerning Ben Carson in order to breathe life into her failing campaign! _E_
"Donald Trump's Miss USA Pageant Scores $5 Million Legal Victory Following Rigged Claims" __HTTP__ via @eonline _E_
Via @WashTimes by @SethMcLaughlin1: "Donald Trump: I want to run for president 'so badly'" __HTTP__ _E_
RT @joegooding: What's happening in our country isn't just an assault on our @POTUS @realDonaldTrump it's an assault on the American people... _E_
.@AP continues to do extremely dishonest reporting. Always looking for a hit to bring them back into relevancy—ain't working! _E_
Thanks Lou. __HTTP__ _E_
I called Chuck Schumer yesterday to see if the Dems want to do a great HealthCare Bill. ObamaCare is badly broken big premiums. Who knows! _E_
RT @FoxNews: U.S. Markets since election. __HTTP__ _E_
Great. Just reported on @FoxNews that many people who supported @JebBush are now supporting me. I knew that would happen pundits didn't! _E_
"The most important political office is that of the private citizen." Justice Louis D. Brandeis _E_
Glad to hear patriotic Americans are organizing a movement this August to boycott Chinese products __HTTP__ People get it! _E_
Report: "ANTI TRUMP FBI AGENT LED CLINTON EMAIL PROBE" Now it all starts to make sense! _E_
.@CNBC continues to report fictious poll numbers. Number one based on every statistic is Trump (by a wide margin). They just can't say it! _E_
Thank you to the 2500+ in North Augusta South Carolina. Lines down the block! Don't forget to VOTE on Saturday! __HTTP__ _E_
.@MarketMavensInc #asktrump __HTTP__ _E_
Environmental regulations stop Border Patrol from protecting 40% of the border __HTTP__ A coup for the migrant Democrats. _E_
Received a beautiful letter from Joe Paterno's son Jay. He really loved and respected his father. _E_
Wow did you just hear Bill Clinton's statement on how bad ObamaCare is. Hillary not happy. As I have been saying REPEAL AND REPLACE! _E_
Without focus it's just impossible to be successful at anything. Midas Touch _E_
Afghanistan leaders want the U.S. to keep 20 000 troops there for many more years fully paid for by the U.S. but first they want apology. _E_
ESPN is paying a really big price for its politics (and bad programming). People are dumping it in RECORD numbers. Apologize for untruth! _E_
Learning never exhausts the mind. Leonardo da Vinci _E_
Gross negligence by the Democratic National Committee allowed hacking to take place.The Republican National Committee had strong defense! _E_
The Art of the Deal = #1 business book. Over 3 million copies sold. Forbes Article from Oct. 20 2014. __HTTP__ _E_
After Super Tuesday every GOP candidate should take a long hard look at their prospects and drop out if they can't get the nomination. _E_
..Ryan died on a winning mission ( according to General Mattis) not a failure. Time for the U.S. to get smart and start winning again! _E_
Every sport evolves. Every sport gets bigger and more athletic and you have to keep up. Tiger Woods _E_
RT @Scavino45: .@POTUS @realDonaldTrump and @UN Secretary General @AntonioGuterres pose for📸prior to their expanded bilateral meeting. #USA... _E_
Vanity Fair is failing. Newstand sales are down 20 percent 2nd most for major magazines and the magazine has (cont) __HTTP__ _E_
Which National Costume do you think should win? __HTTP__ _E_
One of the reasons Hillary hid her emails was so the public wouldn't see how she got rich selling out America. __HTTP__ _E_
Really sad news: The great Arnold Palmer the King has died. There was no one like him a true champion! He will be truly missed. _E_
Yom Kippur blessings to all of my friends in Israel and around the world. #YomKippur _E_
Just met with David Perdue @Perduesenate. He's a fantastic guy who will fight hard against ObamaCare. He will win! _E_
I will write a $2 MILLION check to our campaign if we hit our million dollar end of month goal! __HTTP__ _E_
"The cheapest natural gas in the world is in the United States." @boonepickens _E_
Great job on @CNN tonight @heytana. We are all proud of you! Also congrats on a great son he is going places. _E_
Wow! Such nice words from Robert Redford on my running for President. Thank you Robert. __HTTP__ _E_
.@DennisRodman must be thinking of North Korea. #CelebApprentice _E_
This is a terrific day for downtown New York. Trump SoHo is unlike anything else. Be sure to visit this fantastic hotel soon! _E_
.@davidaxelrod I hope your book is better than the Obama second book but it is inaccurate as it pertains to me but no big deal boring! _E_
I think Senator Blumenthal should take a nice long vacation in Vietnam where he lied about his service so he can at least say he was there _E_
Thank you Wilmington North Carolina. We are 3 days away from the CHANGE you've been waiting for your entire life!... __HTTP__ _E_
"It's a good idea to take your own pulse once in a while instead of focusing on what the masses are doing." – Think Like a Champion _E_
I want to see @BarackObama's college records to see how he listed his place of birth in the application. _E_
Great news here comes the Tea Party! @MittRomney has received 42k donations online & raised over $4.2 million since the ObamaCare decision. _E_
Just landed in Bedminster New Jersey. #MAGA __HTTP__ _E_
QE3 is going to further sink the dollar into oblivion. Creates artificial numbers for short term market gains. (cont) __HTTP__ _E_
Yesterday our national debt topped a record $18T. Over 44% has accrued under Obama. A real mess. _E_
Reigning @ApprenticeNBC Champion @TraceAdkins does great work with @wwpinc. Donate to an Injured Warrior today __HTTP__ _E_
The Iran deal poses a direct national security threat. It must be stopped in Congress. Stand up Republicans! _E_
Great! __HTTP__ _E_
Unbelievable support in Florida last night thank you! #MAGA __HTTP__ _E_
Obama is making the Ebola problem much worse than it needs to be in the U.S. by not halting flights from West Africa. Airport testing a joke _E_
Ranked a top course @GolfMagazine & 6 Star Diamond Award Trump Int'l Palm Beach has been expanded to 27 holes __HTTP__ _E_
Do not view any failure as the end. Learn your lessons quickly then move on. Do not dwell on failure. Start thinking big again. _E_
Each and every new event space at @TrumpDoral looks stunning. See the transformation for yourself: __HTTP__ _E_
Yesterday was Matt Drudge's birthday Happy Birthday @DRUDGE and great job! _E_
RT @EricTrump: Mathematically it is statistically impossible for Kasich to get to 1237 he would need 112% of the remaining delegates to b... _E_
My @SquawkCNBC interview discussing the Republic of Georgia taxes the fledgling economy and Facebook __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Being true to yourself equals being true to your brand.That's the solid foundation that will keep your brand flourishing. Midas Touch _E_
The Super Committee is finding ways to raise all our taxes without admitting it. The Republicans made a big mistake agreeing to this deal. _E_
Trump arrives for SC Tea Party Convention in Myrtle Beach __HTTP__ via @WCBD _E_
Crooked Hillary Clinton just can't close the deal with Bernie. I had to knock out 16 very good and smart candidates. Hillary doesn't have it _E_
This is a storm of enormous destructive power and I ask everyone in the storm's path to heed ALL instructions from government officials. __HTTP__ _E_
I am sure the @NCGOP will do a great job bracketing the @DNC convention. They are a tremendous statewide organization. _E_
Notice how @BarackObama failed to mention ObamaCare last night in his SOTU. Even he knows it is terrible. _E_
A look at the Trump hotel planned for the Old Post Office pavilion __HTTP__ via @washingtonpost _E_
Conservative? Jeb Bush doubled Florida State debt! __HTTP__ _E_
Ignorance is inexcusable it's the surest way to fail. No acceptable reason exists for not being well informed. _E_
My daughter Ivanka has been treated so unfairly by @Nordstrom. She is a great person always pushing me to do the right thing! Terrible! _E_
With panoramic views of Central Park & the Manhattan skyline 5 Star @TrumpNewYork offers 176 newly renovated rooms __HTTP__ _E_
.@WhoopiGoldberg Don't let @Rosie speak badly of you or try to bring you down. She is rude crude & not smart. She is not in your league. _E_
Unbelievable crowd of supporters in Virginia Beach Virginia. Thank you! Next stop Cleveland Ohio.... __HTTP__ _E_
Thank you California Connecticut Maryland and Pennsylvania!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
#3. Look at the solution not the problem. Learn to focus on what will give results. _E_
I'm leaving now for Burlington Vermont. It will be wild! _E_
I will be interviewed on the @oreillyfactor tonight from Florida now. Enjoy! _E_
Druggies drug dealers rapists and killers are coming across the southern border. When will the U.S. get smart and stop this travesty? _E_
Our President is a great embarrassment to the U.S. How could anybody be so dumb or know so little as to make the very stupid 5 for 1 swap? _E_
Thanks Eric. __HTTP__ _E_
China does not negotiate from a position of strength we simply negotiate against ourselves. We have all the advantages but don't execute. _E_
New poll states that a record number of Americans have lost all faith in President Obama duh! _E_
Anyway I'm all about jobs & the economy & making America great again. We're falling fast! _E_
Sorry to hear of the passing of Neil Armstrong over the weekend. He was an American hero. _E_
Thanks to Donald Trump __HTTP__ via @AmSpec. My pleasure Jeffrey! _E_
Big cancer risk from new environmental light bulbs a big price to pay! _E_
My official #MakeAmericaGreatAgain hat is now available online. To shop please visit __HTTP__ it is selling fast! _E_
I am on @FoxNews with @greta doing a town hall from Wisconsin now! Enjoy!#MakeAmericaGreatAgain #Trump2016 _E_
Millions Could Get Surprise Tax Bills Under 'Obamacare' If They Don't Accurately Project Their Income __HTTP__ _E_
Stock Market could hit all time high (again) 22000 today. Was 18000 only 6 months ago on Election Day. Mainstream media seldom mentions! _E_
Browse Donald Trump's Summer Reading List for Business Success at the Trump University Blog: __HTTP__ _E_
Will be interviewed tonight at 7 by @greta re Sony & Bush _E_
"If you want to be the best you'd better be the best – in all aspects of business." Think Like a Billionaire _E_
A signed copy of CRIPPLED AMERICA is the ultimate gift. Order now & join my live streaming book signing on 12/3 __HTTP__ _E_
The Blue Monster is being torn up at Trump @DoralResort. On April 1 I go out & play it one more time until the new course opens. _E_
My @SquawkCNBC interview discussing housing prices the GDP numbers China spreading its wealth and my stock picks. __HTTP__ _E_
Roger Goodell must stop apologizing to everyone who will listen and toughen up. His street smart players are laughing at him and the NFL! _E_
It was an honor to meet with Republic of Rwanda President Paul Kagame this morning in Davos Switzerland. Many great discussions! #WEF18 __HTTP__ _E_
I will be developing the two tallest towers in the Republic of Georgia. __HTTP__ _E_
.@washingtonpost @BretBaier Please thank Charles Lane for his new found confidence. He has made a very good bet! _E_
The Boston terrorist thugs' mother is also a radical. I am sure she will be granted citizenship shortly. _E_
I'll be on @foxandfriends on Monday at 7:30 AM. Tune in! _E_
Wow great post debate poll: Trump Increases Lead via Breitbart __HTTP__ _E_
While in the Philippines I was forced to watch @CNN which I have not done in months and again realized how bad and FAKE it is. Loser! _E_
Thank you!Mitchell FOX2 Michigan Poll finds Trump holds 3 1 lead over closest GOP opponents. Trump 47% Clinton 43% __HTTP__ _E_
.@penn_state leadership has permanently scarred & perhaps destroyed a great university. They should have (cont) __HTTP__ _E_
I am pleased to announce that I had the Union Leader removed from the upcoming debate. __HTTP__ _E_
I am leaving for Sioux City Iowa great event (rally). _E_
"Donald Trump pledges to make Prestwick Airport 'really successful'" __HTTP__ via @STVNews _E_
Thank you North Carolina! #Trump2016 #SuperTuesday  #MakeAmericaGreatAgain __HTTP__ _E_
If the U.S. attacks Syria and hits the wrong targets killing civilians there will be worldwide hell to pay. Stay away and fix broken U.S. _E_
Broadcom's move to America=$20 BILLION of annual rev into U.S.A. $3+ BILLION/yr. in research/engineering & $6 BILLION/yr. in manufacturing. __HTTP__ _E_
The cast has been largely selected for next year's Celebrity Apprentice. Wait 'till you hear the names AMAZING! Season 14 many nights at #1 _E_
Won $5000000 against Miss Pennsylvania Sheena Monnin for her terrible and untrue statements about Miss USA Pageant. Not a nice person! _E_
Wow @SharylAttkisson just wrote the definitive piece on what I said about John McCain __HTTP__ _E_
"Donald Trump To Be In Mason City June 4th" __HTTP__ via @KCHA _E_
.@megynkelly I am in Nevada. Sorry to inform you Kellyanne is in the audience. Better luck next time. _E_
American league wins! _E_
... The NY Daily Snooze totally lied and never even called my kids! _E_
Expecting a great crowd of amazing people. Questions will be live! #TrumpToday _E_
"Never give up on yourself." – Think Big _E_
Oh no just reported that Ted Cruz didn't report another loan this one from Citi. Wow no wonder banks do so well in the U.S. Senate. _E_
Congratulations to Boys and Girls Nation. It was my great honor to welcome you to the WH today! Full Remarks: __HTTP__ __HTTP__ _E_
So much dishonest reporting (or non reporting) in political media—an amazing experience for me. @BretBaier _E_
Having a great time hosting Prime Minister Shinzo Abe in the United States! __HTTP__ __HTTP__ _E_
The @CNN panels are so one sided almost all against Trump. @FoxNews is so much better and the ratings are much higher. Don't watch CNN! _E_
#VoteTrumpHI! #Trump2016 __HTTP__ _E_
RT @DanScavino: #TrumpTrain🚂💨 __HTTP__ _E_
I will not let the families of The Remembrance Project down! #MakeAmericaSafeAgain __HTTP__ __HTTP__ _E_
An interesting cartoon that is circulating. __HTTP__ _E_
All the online polls have me winning the debate. I really enjoyed the evening. Not easy but good. __HTTP__ _E_
Re Negotiation: Persistence can go a long way. Being stubborn can be good. The key is to know when to loosen up. _E_
Dopey @ariannahuff should force her reporters to be accurate—if she has that power. _E_
TEXAS: We are with you today we are with you tomorrow and we will be with you EVERY SINGLE DAY AFTER to restore recover and REBUILD! __HTTP__ _E_
Just landed in Iowa to attend a great event in honor of wonderful Senator @JoniErnst. Look forward to being with all of my friends. _E_
"Donald Trump Takes on Apple @CPACnews" __HTTP__ via @kmbznews _E_
Go get the new book on Andrew Jackson by Brian Kilmeade...Really good. @foxandfriends _E_
Be sure to watch highlights from the record setting 14th season of @ApprenticeNBC here __HTTP__ _E_
Thank you to the greatest heroes __HTTP__ #DDay70 #WWII _E_
Hypocrites! @JamesOKeefeIII's new video shows Journal News reporters refusing to designate their homes as 'gun free' __HTTP__ _E_
My @todayshow interview where I reveal the new cast of Celebrity Apprentice and discuss the GOP primary field __HTTP__ _E_
Entrepreneurs: Achievers move forward at all times. Achievement is not a plateau it's a beginning. Get out there & go for it! _E_
With our national debt passing $16T during the @DNC convention @BarackObama has amassed more debt than the first 42 presidents. Scary. _E_
Hillary Clinton should not be given national security briefings in that she is a lose cannon with extraordinarily bad judgement & insticts. _E_
.@realbobmassi who does a show called Bob Massi Is The Property Man on @FoxNews really knows his stuff a total pro! _E_
So funny Jeb Bush called me a highly gifted politician and a great entertainer I assume that is a compliment! _E_
I hope everybody reads the @AmSpec article "Shakedown Schneiderman" – the AG of New York @AGSchneiderman __HTTP__ _E_
RT @KellyannePolls: #Polls showing @realDonaldTrump surging @hillaryclinton #slipping have HER camp on defense/lowering expectations goi... _E_
RT @DanScavino: OHIO GENERAL ELECTIONDonald Trump vs. Hillary Clinton#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
General John Kelly totally agrees w/ my stance on NFL players and the fact that they should not be disrespecting our FLAG or GREAT COUNTRY! _E_
Instinct has a lot to do with timing. You have to be patient & wait for your instincts to tell you the best time to make your move. _E_
Wacky @glennbeck who always seems to be crying (worse than Boehner) speaks badly of me only because I refuse to do his show a real nut job! _E_
... and opened a full month ahead of schedule. Case is taught in Wharton. _E_
Wow I have always liked the @nypost but they have really lied when they covered me in Iowa. Packed house standing O best speech! Sad. _E_
You may want to watch David Letterman tonight I am on! _E_
Isn't it ironic that a lot of the wealthy environmentalists use private jets and fight wind farms being placed near their property? _E_
True! __HTTP__ _E_
'Trump Helps Lift Small Business Confidence to 12 Yr. High' __HTTP__ __HTTP__ _E_
We must have Security at our VERY DANGEROUS SOUTHERN BORDER and we must have a great WALL to help protect us and to help stop the massive inflow of drugs pouring into our country! _E_
Our @TrumpNewYork is really starting the summer on the right foot with their #wellness program as seen in @TandCmag: __HTTP__ _E_
The storied success of Bain in private entrepreneurship and equity is one reason @MittRomney will be a great POTUS. _E_
Does anybody like Lyin' Ted? __HTTP__ _E_
I hereby demand a second investigation after Schumer of Pelosi for her close ties to Russia and lying about it. __HTTP__ _E_
Thank you America! @FoxNews post debate poll with +/ from previous poll. #VoteTrump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Thank you for your wonderful endorsement today @TGowdySC. It means a great deal to me. We will not disappoint! #Trump2016 _E_
There are now 119000 fewer Americans employed than there were in July. The economy is still terrible. _E_
Kasich just announced that he wants the people of Indiana to vote for him. Typical politician can't make a deal work. _E_
With all that is happening with Ebola including the doctor who so easily came back to New York Obama still refuses to stop the flights! _E_
Before you are a leader success is all about growing yourself. When you become a leader success is all about growing others. Jack Welch _E_
Via @DMRegister by @JenniferJJacobs: Trump adds events to his Iowa trip next month __HTTP__ _E_
Wow my poll numbers have just been announced and have gone through the roof! _E_
Does anyone else have two golf pros—John Nieporte & Jim Herman—who qualified for the U.S. Open? Could this be an all time record? _E_
...or mentally troubled (or a con). _E_
Offering two championship courses @TrumpGolfDC has been awarded the honor of hosting the 2017 @seniorpgachamp __HTTP__ _E_
I will be live tweeting during tonight's #CelebrityApprentice 9 PM ET @NBC _E_
26000 sexual assaults or rapes reported in military last year and that is just the number that is reported (many do not want to report). _E_
Which National Costume do you think should win? __HTTP__ _E_
Watching John Kasich being interviewed acting so innocent and like such a nice guy. Remember him in second debate until I put him down. _E_
Wow 25 degrees below zero record cold and snow spell. Global warming anyone? _E_
Total misnomer to call ObamaCare 'The Affordable Care Act.' Affordable for whom besides big businesses & Congress w/their exemptions? _E_
The meeting with the @nytimes is back on at 12:30 today. Look forward to it! _E_
The more you know the more you realize how much you don't know. How can you possibly discover anything if you already know everything? _E_
Great meeting with Ford CEO Mark Fields and General Motors CEO Mary Barra at the @WhiteHouse today. __HTTP__ _E_
.@unicef Caryl M. Stern CEO is driving around in a Rolls Royce... _E_
I am glad America is starting to get to know @MittRomney the way I know him. A wonderful & decent family man (cont) __HTTP__ _E_
They made up a phony collusion with the Russians story found zero proof so now they go for obstruction of justice on the phony story. Nice _E_
The New Hampshire drug epidemic must stop. If elected POTUS I will create borders & the drugs will stop pouring in. __HTTP__ _E_
After my tour of Asia all Countries dealing with us on TRADE know that the rules have changed. The United States has to be treated fairly and in a reciprocal fashion. The massive TRADE deficits must go down quickly! _E_
#VoteTrump2016 & together we will #MakeAmericaGreatAgain! THANK YOU for your support! __HTTP__ _E_
Always bear in mind that your own resolution to succeed is more important than any other. Abraham Lincoln _E_
Via @AP by @splaisance: @realmissnvusa NIA SANCHEZ CROWNED AS 63RD @MissUSA __HTTP__ _E_
.@History's wonderful The Men Who Built America with me on tonight at 9 bad timing I'll be live tweeting the debate _E_
In your planning know how much risk you can take. Evaluate whether the returns will be worth the risk. _E_
The Fake News Networks are working overtime in Puerto Rico doing their best to take the spirit away from our soldiers and first R's. Shame! _E_
After @BarackObama's speech tonight which should be well delivered reality will hit Friday morning when the new jobs report is released. _E_
.@BarackObama blocked Keystone. Now China is preparing a massive $1.5B oil deal with Canada. __HTTP__ A terrible deal for US! _E_
I endorsed Luther Strange in the Alabama Primary. He shot way up in the polls but it wasn't enough. Can't let Schumer/Pelosi win this race. Liberal Jones would be BAD! _E_
Thank you @Samsung! We would love to have you! __HTTP__ _E_
.@Theresa_May don't focus on me focus on the destructive Radical Islamic Terrorism that is taking place within the United Kingdom. We are doing just fine! _E_
Ivanka on @foxandfriends now! _E_
Via @PRNewswire "Streetsense Brings The National a Geoffrey Zakarian Restaurant to DC's New Trump Intl Hotel" __HTTP__ _E_
.@MELANIATRUMP just finished being on @theviewtv by any standard she was great! _E_
Thanks to all for your thoughtful birthday wishes – Donald Trump _E_
We should remember that during this entire Petraeus episodeover 50 of our nation's bravest have died in Afghanistan... _E_
I will be going to church in Iowa this morning with my wife Melania. After church I will be making two speeches and touring the State! _E_
Alex Rodriguez has played under 140 games in each of the last five seasons. He will miss half of next season. Really bad deal for @yankees. _E_
For last minute shopping my new book #TimeToGetTough is a great choice... __HTTP__ _E_
"You have to learn the rules of the game. And then you have to play better than anyone else." – Albert Einstein _E_
#TrumpVlog Will our brave soldiers catch Ebola? __HTTP__ _E_
Lance Armstrong is now going to admit guilt—can that be possible after many years of denying? Just go away Lance. _E_
Entrepreneurs: Don't sell yourself short. Don't ever think you've done it all already or that you've done your best. _E_
Obama's carbon tax plan will finance more windmills in America. More real estate depreciated wildlife killed incl. bald eagles _E_
RT @USArmy333: @804StreetMedia @realDonaldTrump He's done more in 9 months then obama did in8 yrs _E_
THANK YOU NEVADA! WE WILL MAKE AMERICA SAFE & GREAT AGAIN! __HTTP__ __HTTP__ _E_
#TrumpAdvice __HTTP__ _E_
RT @AmericaFirstPol: .@POTUS Trump led a historic journey to the White House. 50 days in that historic journey continues. Take a look 👉 ht... _E_
Thank you to all for the wonderful reviews of my foreign policy speech. I will soon be speaking in great detail on numerous other topics! _E_
.@RICKYMONEY I don't know a lot about failures. And as you know I never went bankrupt. _E_
Now A Rod is claiming that @MLB and @yankees are out to get him' __HTTP__ He should just get the hell out of NYC already! _E_
#1 for success: Find out what you love to do. Trust yourself enough to find out what is best for you and what you're best at doing... _E_
I will be live tweeting President Obama's prime time speech tonight starting at 7:50 P.M. (Eastern).Will he finally state the real problems? _E_
Headline reads Rubio passes Bush in Florida poll Unfair because Trump destroys them both! Trump 31.5% Rubio 19.2% Bush 11.3% _E_
Thank you Tennessee! #Trump2016#SuperTuesday _E_
Good news @MittRomney has pulled ahead in Wisconsin __HTTP__ WIth @PaulRyanVP on the ticket Wisconsin is in play. _E_
Thank you West Chester Pennsylvania!#PAPrimary #VoteTrump __HTTP__ __HTTP__ _E_
We must keep evil out of our country! _E_
THANK YOU LAS VEGAS NEVADA!#NevadaCaucus #VoteTrumpNV __HTTP__ __HTTP__ _E_
JOIN ME IN OHIO TOMORROW!Springfield 1pm: __HTTP__ 4pm: __HTTP__ 7pm... __HTTP__ _E_
Why isn't President Obama working instead of campaigning for Hillary Clinton? _E_
Why did Vince and the WWE give my speech and segment the most time last night on USA Network because that's what people want to see! _E_
Trump: 'Terrible traitor' Snowden embarrassing US __HTTP__ via @thehill by @JTSTheHill _E_
Attention Arnold Palmer: Happy Birthday Arnold. There is no one like you The King! @KingdomMag __HTTP__ _E_
Losers such as George Will and @Rosie use me to get publicity for themselves. They are strictly third rate. _E_
THANK YOU! _E_
Join me in Florida tomorrow!MIAMI 12pm __HTTP__ __HTTP__ __HTTP__ _E_
The Debt is our nation's greatest threat. @BarackObama is out of touch. _E_
COURT FINDS IN FAVOR OF TRUMP UNIVERSITY __HTTP__ _E_
Well the Special Elections are over and those that want to MAKE AMERICA GREAT AGAIN are 5 and O! All the Fake News all the money spent = 0 _E_
Winners I am convinced imagine their dreams first. They want it with all their heart and expect it to come true. Joe Montana _E_
Dummy Graydon Carter doesn't like me too much...great news. He is a real loser! @VanityFair _E_
Remember negotiations are fluid. Remain calm and don't settle easily. If you have the goods you will ultimately win. _E_
I heard because his show is unwatchable that @Lawrence has made many false statements last night about me. Maybe I should sue him? _E_
"You can't wear a blindfold in business. A regular part of your day should be devoted to expanding your horizons." – Trump: How to Get Rich _E_
Also The Donald J. Trump Signature mattress from SERTA is doing record business call Serta and see why! _E_
Remember the Republicans are 5 0 in Congressional races this year. In Senate I said Roy M would lose in Alabama and supported Big Luther Strange and Roy lost. Virginia candidate was not a "Trumper" and he lost. Good Republican candidates will win BIG! _E_
Via @politicalwire: "Trump Not Happy with Republicans" __HTTP__ _E_
I have chosen one of the truly great business leaders of the world Rex Tillerson Chairman and CEO of ExxonMobil to be Secretary of State. _E_
Dishonest media says Mexico won't be paying for the wall if they pay a little later so the wall can be built more quickly. Media is fake! _E_
The White House has just admitted Al Qaeda was involved in Benghazi __HTTP__ What about the video tape? _E_
Trump Finalizes Agreement For Trump International Hotel The Old Post Office Building Washington D.C. __HTTP__ _E_
Little Marco Rubio gave amnesty to criminal aliens guilty of sex offenses. DISGRACE! __HTTP__ _E_
Act NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_
The brand new hotel at Trump National Doral has the most beautiful rooms and suites in Miami. Enjoy! _E_
.@ZachJohnsonPGA You're one of the truly great competitiors. I've said it for years. Great going winning @OpenChampionship Not surprised! _E_
All polls have me winning debate big Drudge TIME etc. Dopey Charles Krauthammer still nasty. He has zero cred totally dishonest! _E_
Russia has more warheads than ever N Korea is testing nukes and Iran got a sweetheart deal to keep theirs. Thanks @HillaryClinton. _E_
President Obama be cool be smart be sharp and FOCUS (no more March Madness) and you can beat Putin at his own game. IT CAN BE DONE! _E_
Sorry but this is years ago before Paul Manafort was part of the Trump campaign. But why aren't Crooked Hillary & the Dems the focus????? _E_
Via @JTAnews and Jason Greenblatt Donald Trump is a Visionary With Talents Our Country Needs @JasonDovEsq __HTTP__ _E_
Never said anything derogatory about Haitians other than Haiti is obviously a very poor and troubled country. Never said "take them out." Made up by Dems. I have a wonderful relationship with Haitians. Probably should record future meetings unfortunately no trust! _E_
Glad to hear that @RobinRoberts is doing well. She is a terrific person. _E_
Via @itp_ab by @ctrenwith: 'Trump effect' will see Dubai properties rise 50%" __HTTP__ _E_
Obama has called Libya attack a bump in the road and not optimal. Just come clean already tell Americans the truth! _E_
Democrats purposely misstated Medicaid under new Senate bill actually goes up. __HTTP__ _E_
I said simply that the Mexican leaders and negotiators are smarter than ours and that the Mexican gov't is pushing their hard core to U.S. _E_
.@RGIII & @DangeRussWilson & Luck are very special players will be great playoff games. _E_
I am somewhat surprised that Bernie Sanders was not true to himself and his supporters. They are not happy that he is selling out! _E_
Thank you @krauthammer for your nice comments on @oreillyfactor. A lot of progress is being made! _E_
As the @BarackObama's took their 16th vacation this month unemployment is back to 9% and underemployment at (cont) __HTTP__ _E_
Cruz caught cold in lie after denial of push polls like lies w/ @RealBenCarson. How can he preach Christian values? __HTTP__ _E_
Got to know Senator @JohnKerry in Aspen Colorado years ago—a very solid and stand up guy. _E_
I can't believe David Letterman has announced his retirement he is a great guy! @Letterman _E_
Here's the deal: when your secretary of defense tells you that your proposed cuts will erode America's military (cont) __HTTP__ _E_
Great mtg w/ @Cabinet today. Tomorrow I will be announcing the new head of the Fed. I think you will be extremely impressed by this person! __HTTP__ _E_
Rated @GolfMagazine as 1 of the top courses in the country Trump Int'l Palm Beach has been expanded to 27 holes __HTTP__ _E_
Always pretend that you're working for yourself. You'll do a wonderful job. It's simple but it works. _E_
If you think big you will encounter big setbacks from time to time. What really matters is how you respond to them. Think Big _E_
Such a nice article in the New York Times about a wonderful developer Arthur Zeckendorf __HTTP__ _E_
Highest Stock Market EVER best economic numbers in years unemployment lowest in 17 years wages raising border secure S.C.: No WH chaos! _E_
Wishing everyone a wonderful Independence Day holiday weekend a great celebration for a great country. _E_
Miss Pennsylvania is just looking for free publicity at the expense of the real winner of Miss USA Olivia Culpo. _E_
Via CBSWashDC: "114 Year Old DC Building a Step Closer to Becoming Trump's Latest Hotel" __HTTP__ _E_
Departing The Pentagon after meetings with @VP Pence Secretary James Mattis and our great teams. #MAGA __HTTP__ _E_
General Petreus and his family are paying a big price! _E_
Great numbers on Stocks and the Economy. If we get Tax Cuts and Reform we'll really see some great results! _E_
There's no love lost between @latoyajackson & @OMAROSA Disrespectful? Who is being disrespectful? #CelebApprentice _E_
I had a great time answering as many questions as possible in sixty seconds at @facebook NY today __HTTP__ _E_
Press conference at The Old Post Office in D.C. __HTTP__ _E_
Thank you Massachusetts! #Trump2016 #SuperTuesday _E_
Get our Marine out of Mexico. __HTTP__ _E_
Pennsylvania poll just released. Two rallies there on Mon join me!Ambridge: __HTTP__ Barre:... __HTTP__ _E_
"How to travel like a billionaire! Inside Donald Trump's £63m private jet" __HTTP__ via @travelmail by @AndreaMagrath _E_
I don't know Dennis Kozlowski who made Tyco into a great company & then went to prison but he's up for parole—let him go! _E_
Senator Marco amnesty Rubio who has worst voting record in Senate just hit me on national security but I said don't go into Iraq. VISION _E_
Entrepreneurs: Keep your momentum going! It's a big factor in sustaining your success. Keep moving forward! _E_
Sleepy eyes @chucktodd—one of the dumbest voices in politics is angry that I'm doing @ThisWeekABC. _E_
Mika Brzezinski: Dem Criticism of Comey Reinforcing Idea 'There's Something There' __HTTP__ __HTTP__ _E_
A year ago today a diplomat and 3 security operatives were abandoned by our government while they were under attack. Never forget! _E_
Let's all take a moment to remember all of the heroes from a very tragic day that we cannot let happen again! _E_
President Obama NOW bring our 4000 innocent and ill trained soldiers home from West Africa before it is too late AND STOP THE FLIGHTS! _E_
Lightweight Marco Rubio was working hard last night. The problem is he is a choker and once a choker always a choker! Mr. Meltdown. _E_
Entrepreneurs: Keep the big picture in mind. There are always opportunities and thinking too small can negate a lot of them. _E_
The @washingtonpost loses money (a deduction) and gives owner @JeffBezos power to screw public on low taxation of @Amazon! Big tax shelter _E_
We finally agree on something Rosie. __HTTP__ _E_
A Call for Unity by Jason Greenblatt @JasonDovEsq __HTTP__ _E_
#CrookedHillary __HTTP__ _E_
For all of my millions of followers and at your request I will be tweeting tonight during President Obama's speech! 9pm ET _E_
.@michellemalkin & @BuzzFeedAndrew: "Vaccine court awards millions to two autistic children damaged by vaccine" __HTTP__ _E_
A special message to the staff of @TrumpWaikiki in celebration of the 2nd anniversary.... __HTTP__ _E_
Personally I'm glad the NYPD is monitoring the actions of certain extremists. New York's finest! I support them. _E_
What is your favorite @THEGaryBusey film? Tonight's short film? Point Break? Lethal Weapon? #CelebApprentice _E_
Serious voter fraud in Virginia New Hampshire and California so why isn't the media reporting on this? Serious bias big problem! _E_
Sanctions Relief From Clinton Obama Iran Nuclear Deal Likely Go to Terrorists: __HTTP__ #BigLeagueTruth #VPDebate _E_
Via @CBSmiami by @LisaPetrillo: "Trump Unveils Renovated @TrumpDoral Red Tiger Golf Course" __HTTP__ _E_
Sen. Lindsey Graham embarrassed himself with his failed run for President and now further embarrasses himself with endorsement of Bush. _E_
Heading back to Washington D.C. Much will be accomplished this week on trade the military and security! _E_
Crooked Hillary has ZERO leadership ability. As Bernie Sanders says she has bad judgement. Constantly playing the women's card it is sad! _E_
Via @Newsmax_Media: Trump at CPAC: What Really Happened __HTTP__ _E_
We spent over a billion on Libya and lead the way why is Europe getting the oil? _E_
thilan_GolfSwag @realDonaldTrump Played Doral for the first time. absolutely great course! Fantastic job! Thanks. _E_
See Schneiderman admit he spoke with Obama about "ongoing investigations. __HTTP__ _E_
Honored to welcome Republican and Democrat members of the House Ways and Means Committee to the White House today! #USA __HTTP__ _E_
Just said at #NCGOPcon that politicians are all talk and no action and we are all tired of it! We need action and results to move forward! _E_
See what I have to say about Iran and Iraq in today's #trumpvlog... __HTTP__ _E_
.@Neilyoung one of my favorite musicians in my office. __HTTP__ _E_
RT @VP: .@POTUS is committed to the health & well being of the US people & we are confident Dr. Jerome Adams will succeed as our new surgeo... _E_
Because Gov. Kasich cannot run in the state of Pennsylvania he cannot win the nomination & should not be allowed to compete in Ohio on Tue. _E_
This is good news: @MittRomney is now leading in Michigan by 6 points according to @RasmussenPoll __HTTP__ _E_
Get ready to turn to NBC for CELEBRITY APPRENTICE TONIGHT'S SHOW IS GREAT! _E_
Support Coach Kennedy and his right together with his young players to pray on the football field. Liberty Institute just suspended him! _E_
Yankees can win today. Kuroda is a highly underrated pitcher. _E_
The protesters in California were thugs and criminals. Many are professionals. They should be dealt with strongly by law enforcement! _E_
"The longer you play the better chance the better player has of winning." @jacknicklaus _E_
There won't be any new gun legislation. No surprise. Americans support the 2nd amendment. _E_
Scary thought what is the pervert Anthony Weiner doing with all the free time he has. Does he collect unemployment? _E_
Al Shabbab not ISIS just made a video on me they all will as front runner & if I speak out against them which I must. Hillary lied! _E_
What the hell is Obama doing in allowing all of these potentially very sick people to continue entering the U.S.! Is he stupid or arrogant? _E_
Via @limbaugh: "See Trump Told You So" __HTTP__ _E_
My video response to President Obama's lack of transparency. __HTTP__ _E_
State Department has not revoked a single passport of ISIS Americans __HTTP__ We should send them to Gitmo for some R&R. _E_
Iran is moving troops into Iraq under the guise that it is helping out. Actually they will take over Iraq and all of their oil. Stupid U.S. _E_
Don't worry getting rid of state lines which will promote competition will be in phase 2 & 3 of healthcare rollout. @foxandfriends _E_
.@TrumpSoHo features a striking glass walled building w/ loft inspired interiors __HTTP__ NYC's trendiest luxury hotel _E_
Great article by @jameshohmann @politico explaining why @KarlRove was biggest loser @CPACnews __HTTP__ James is sharp. _E_
I will be interviewed on the @TODAYshow at 7:30. Enjoy! _E_
Via @MiamiHerald: Donald Trump aims to bring luxury to Doral Golf Resort & Spa __HTTP__ @DoralResort _E_
President Reagan had it right: Social Security is here to stay. We must root out the fraud and make it more (cont) __HTTP__ _E_
People the lawyers and the courts can call it whatever they want but I am calling it what we need and what it is a TRAVEL BAN! _E_
"Most people think small because most people are afraid of success afraid of making decisions afraid of winning" The Art of the Deal _E_
True thanks. __HTTP__ _E_
I'll be on @foxandfriends Monday morning at 7:30 AM. Tune in! _E_
The failing @nytimes has disgraced the media world. Gotten me wrong for two solid years. Change libel laws? __HTTP__ _E_
Obama/Reid/Nunn's failed economic policies are not working. @PerdueSenate will bring fresh perspective to solving problems. #GASen _E_
Spent the weekend in LA checking out Trump National Golf Club on the Pacific Ocean. An amazing place! __HTTP__ _E_
RT @CBSNews: WATCH NOW: The @realDonaldTrump supporters you'd never expect __HTTP__ __HTTP__ _E_
Doing an interview with @SteveDeaceShow. Discussing the ObamaCare web disaster. Be sure to listen __HTTP__ _E_
Snowden is doing great damage to our relations with other countries and U.S.prestige. China is laughing at us as he continues illegal action _E_
Final poll results from NBC on last nights Commander in Chief Forum. Thank you! #ImWithYou #MAGA __HTTP__ _E_
The failing @NYDailyNews destroyed by little Morty Zuckerman is preparing to close and save face by going online. It's dead! _E_
.@ApprenticeNBC Season 13 still #1 at 10PM in all key demos despite having to serve as our own lead in from 9 10. 11PM News loves Trump! _E_
Do you believe that The State Department on NEW YEAR'S EVE just released more of Hillary's e mails. They just want it all to end. BAD! _E_
Millions without electricity across NY & NJ. The media has covered for Obama's massive failure. Can you imagine if this was another Pres? _E_
Just left Oklahoma the most amazing crowd and people! What a night! _E_
My @Newsmax_Media interview from Friday where I predicted that @newtgingrich in South Carolina would change the race. __HTTP__ _E_
Highly respected author Christopher Bedford just came out with book The Art of the Donald Lessons from America's.... Really good book! _E_
The best vision is insight. Malcolm Forbes _E_
Follow @MELANIATRUMP's jewelry line on @QVC site __HTTP__ _E_
The #CelebrityApprentice Sunday night on NBC at 9 PM. Another exciting episode is ready to go. __HTTP__ _E_
How is Bernie Sanders going to defend our country if he can't even defend his own microphone? Very sad! _E_
Republicans must unite to defund Obamacare it will drive our country into oblivion and by the way the healthcare is no good anyway! _E_
Will be doing a big interview tonight with Bret Baier at 6:00 P.M. on Fox. Don't miss it! _E_
Thank you Waukesha Wisconsin! Full transcript of my speech #FollowTheMoney: __HTTP__ __HTTP__ _E_
It was an honor to welcome the Prime Minister of Denmark Lars Løkke Rasmussen {@larsloekke} to the @WhiteHouse yes... __HTTP__ _E_
Wishing @FLOTUS Melania and all of the great mothers out there a wonderful day ahead with family and friends! Happy #MothersDay _E_
USMC Sgt. Tahmooressi has now been held in Mexican jail for over 150 days. When will Obama call for his release? #FreeOurMarine _E_
The oil reserve is a strageic asset for a time of war and an embargo. @BarackObama should open more land for drilling not tap the reserve. _E_
Briarcliff Manor Mayor Vescio is doing a terrible job. Taxes way too high roads in terrible condition—repave Pine Road. @BriarcliffManor _E_
Journal News readership is already down 50 percent over the years. _E_
Right now we have a president and a Treasury secretary who shrug while China tears away hundreds of thousands (cont) __HTTP__ _E_
"Happiness is not something ready made. It comes from your own actions." @DalaiLama _E_
Best of luck to @chucktodd on his @meetthepress debut this Sunday. _E_
.@McIlroyRory What a year it has been for you and this weekend topped it off. Fantastic job see u at Doral. _E_
I'll be tweeting live tonight starting at 9PM ET re:@ApprenticeNBC. Don't worry other time zones I will give nothing away! _E_
Great reception in D.C. At the Values Voter Summit. Now checking on my job at the Old Post Office... _E_
The problem w/ the concept of global warming is that the U.S. is spending a fortune on fixing it while China & others do nothing! _E_
Via @foxnewslatino: "Donald Trump Plans Huge Towers In Rio For Post Olympic Building Boom" __HTTP__ _E_
This is dangerous: @BarackObama is seeking to shrink Israeli military funding but gives $1.3Billion to Muslim (cont) __HTTP__ _E_
Looks like the Bernie people will fight. If not their BLOOD SWEAT AND TEARS was a total waste of time. Kaine stands for opposite! _E_
Obama and Kerry are bungling Syria by the hour. They have set America's deterrence & stature back by years. Amateurs! _E_
According to many ISIS was given so much time and so many signals as to when we would start bombing that they were able to prepare and hide _E_
.@VattenfallGroup lead investor in Aberdeen windfarm fiasco has dropped out—project not economically viable & protestors hate it. _E_
Eli Wallach was a great actor and a great guy. My opinion his performance in The Good the Bad and the Ugly was his all time best! _E_
...Get along & make deals for the good of the country! _E_
Due diligence includes increasing your financial IQ daily. _E_
Remember no one ever said success was easy.Good luck doesn't come overnight.But if u work hard & love it u will find success & luck. _E_
Obama is an easy target on foreign policy.@MittRomney has many openings to attack especially when Obama starts bragging about Bin Laden. _E_
I answered some of your questions in today's video... __HTTP__ _E_
According to many and while nominated I would have won the Emmy many times except for my politics. @PrimetimeEmmys _E_
It's not climate changeit's global warming.Don't let the dollar sucking wiseguys change names midstream because the first name didn't work _E_
Getting ready to leave for my GREAT resort Turnberry in Scotland. Hosting The Women's British Open (biggest tournament). Will be back Sat. _E_
...and safe. Questions were asked about why the CIA & FBI had to ask the DNC 13 times for their SERVER and were rejected still don't.... _E_
Worthless @NYDailyNews which dopey Mort Zuckerman is desperately trying to sell has no buyer! Liabilities are massive! _E_
A record high 6.7% of Americans are living in extreme poverty. This is tragic. We can do better. _E_
Governor Rick Perry said Donald Trump is one of the most talented people running for the Presidency I've ever seen. Thank you Rick! _E_
Re Life: Life is very fragile and success doesn't change that. If anything success makes it more fragile. _E_
The mother of the Boston killers (not suspects) says her boys are totally innocent and were set up I can see the 14 year long defense now! _E_
RT @DonaldJTrumpJr: Donald Trump Jr. On The Record: Why Trump International Hotels And Residences Are Still Winning via @forbes __HTTP__ _E_
#CNNDebate Winning the @drudge_report poll __HTTP__ _E_
I'm leaving for Iowa now will be great! _E_
Fan favorite @LilJon once again shines in the record 13th season of 'All Star' @CelebApprentice. He is an amazing & wonderful guy! _E_
Watching Hurricane closely. My team which has done and is doing such a good job in Texas is already in Florida. No rest for the weary! _E_
LIVE on #Periscope: Join me for a few minutes in Pennsylvania. Get out & VOTE tomorrow. LETS #MAGA!! __HTTP__ _E_
Whether you think you can or think you can't you're right. Henry Ford _E_
RT @piersmorgan: BOOM! Thank you Mr President. Trophy hunting is repellent. __HTTP__ _E_
.@WSJ and dopey Karl Rove made a mistake and purposely mischaracterized my statement on the terrible TPP deal. __HTTP__ _E_
Dark Knight Rises is projected to gross over $180 million this weekend. Remember to watch for Trump Tower! _E_
.@HillaryClinton's tax hikes will CRUSH our economy. I will cut taxes BIG LEAGUE. __HTTP__ __HTTP__ _E_
Be ready for problems. You'll have them every day so keep things in perspective. Ask yourself: Is this a blip or is it a catastrophe? _E_
Think of this: After we spent $2 trillion on Iraq Baghdad is about to be taken over by ISIS. _E_
... debut her first 2013 "Melania® Timepieces & Fashion Jewelry" collection! _E_
I made a lot of money in Atlantic City and left 7 years ago great timing (as all know). Pols made big mistakes now many bankruptcies. _E_
BTW The Miss USA pageant was the highest rated non sports telecast on the Big 4 networks. Congrats to our newly crowned @Nia_Sanchez_! _E_
I am following the Trayvon Martin case carefully. It's a terrible situation that should never have happened. (cont) __HTTP__ _E_
Bill Cosby is foolish stupid or getting bad advice in remaining silent if he is innocent. Probably guilty! Not a fan. _E_
We are suffering through the worst long term unemployment in the last 70 years. I want change Crooked Hillary Clinton does not. _E_
Thank you Delaware! #Trump2016 __HTTP__ _E_
Happy Birthday @DonaldJTrumpJr! __HTTP__ _E_
Great crowd in Johnstown Pennsylvania thank you. Get out & VOTE on 11/8! Watch the MOVEMENT in PA. this afternoon... __HTTP__ _E_
A great deal of good things happening for our country. Jobs and Stock Market at all time highs and I believe will be getting even better! _E_
#USAatUNGA#UNGA __HTTP__ _E_
We will now be helping Syria and Iran by attacking ISIS ironic isn't it! _E_
Via @WWE: Donald Trump announced for WWE Hall of Fame __HTTP__ _E_
Situated in the heart of downtown Toronto the 65 story @TrumpTO offers an elegant and wonderful lifestyle __HTTP__ _E_
The @WSJ Editorial Board is so wrong so often. They got info from an incorrect story in another pub. Why not watch and listen to debate. _E_
Remain open to new ideas. That's where innovation begins. _E_
A record 46.68M Americans are now on food stamps __HTTP__ Four more years? _E_
Ask Sally Yates under oath if she knows how classified information got into the newspapers soon after she explained it to W.H. Council. _E_
I am self funding my campaign putting up my own money not controlled. Cruz is spending $millions on ads paid for by his N.Y. bosses. _E_
China is sending an Envoy and Delegation to North Korea A big move we'll see what happens! _E_
In memory of Joan Rivers watch when she became my Celebrity Apprentice which meant so much to her! __HTTP__ _E_
.@TrumpChicago's award winning dining options also offer the best views of the city __HTTP__ _E_
My @CNNS interview with @wolfblitzercnn discussing my endorsement of @MittRomney and why he can beat @BarackObama __HTTP__ _E_
The opinion of this so called judge which essentially takes law enforcement away from our country is ridiculous and will be overturned! _E_
Entrepreneurs: Brainpower is the ultimate leverage. _E_
If amnesty is so popular according to the DC ruling class then why is Obama delaying his executive action until after the election? _E_
'Top Hillary Adviser Mocked Plotted Attacks on Pro Sanders Civil Rights Leader' #DrainTheSwamp __HTTP__ _E_
Nobody beats me on National Security. __HTTP__ _E_
As a stockholder in Apple they should get on with a larger screen iPhone as a supplement—immediately. _E_
The Coca Cola company is not happy with me that's okay I'll still keep drinking that garbage. _E_
Entrepreneurs: Being stubborn is a big part of being a winner. Never give up! _E_
Many of the great jobs that the people of our country want are long gone shipped to other countries. We now are part time sad! I WILL FIX! _E_
The most stringent gun laws in the U.S. happen to be in Chicago and look what is happening there! _E_
To all my fans sorry I couldn't do The Apprentice any longer—but equal time (presidential run) prohibits me from doing so. Love! _E_
American sanctions alone cannot stop Iran's nuclear drive and @BarackObama cannot get China and Russia to agree on new Iranian sanctions. _E_
Great article by @RichLowry on @POLITICOMag : "Sorry Donald Trump Has A Point" __HTTP__ _E_
My speech to @PressClubDC yesterday at the #NPCLunch on the topic of building a business brand via @cspan __HTTP__ _E_
China's Communist Party has now publicly praised Obama's reelection. They have never had it so good. Will own America soon. _E_
#LaborDay #AmericaFirstVideo: __HTTP__ __HTTP__ _E_
Congrats to winners from around the world who entered the Think Like A Champion signed book/keychain contest! __HTTP__ _E_
See yourself as an organization. Pay attention to every facet of your life. What's strong? What's weak? What's missing? _E_
I am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_
Heading over to @Kelly and Michael re. Apprentice! _E_
Consumer Comfort Reaches 16 Year High on U.S. Economic Optimism via Bloomberg __HTTP__ _E_
National Security Presidential Memorandum on Strengthening the Policy of the United States Toward Cuba Memorandum... __HTTP__ _E_
If someone says "I'll bet you ten dollars" and loses the bet it's pay up time. _E_
LIVE on #Periscope: Live with the Donald __HTTP__ _E_
Both Aberdeen and Turnberry in Scotland and the soon to open Doonbeg in Ireland blow Bandon Dunes away. Bandon is a toy by comparison! _E_
59% of the United States by area is now covered in snow highest % in many years. The global warming name isn't working anymore SORRY! _E_
Word is that they have far more evidence on A Rod than they have on Ryan Braun! Alex is over. _E_
Ted Cruz didn't win Iowa he stole it. That is why all of the polls were so wrong and why he got far more votes than anticipated. Bad! _E_
#CelebrityApprentice Boardrooms—can anything be more intense? #sweepstweet _E_
Senator Sessions will serve as the Chairman of my National Security Advisory Committee. __HTTP__ __HTTP__ _E_
I don't mind that @BarackObama plays a lot of golf. I just wish he used it productively to make deals with Congress! _E_
Congrats to @bubbawatson on winning the Masters. He did it without heavy reliance on coaches and the other hanger ons he just played golf. _E_
ICYMI "Raw video: Donald Trump speaks at Rep. Steve Stepanek's Amherst reception" __HTTP__ via @wmur9 _E_
Via @pressjournal by Ann Marie Parry: Plans revealed for course named after Trump's mother __HTTP__ _E_
Big protests in Iran. The people are finally getting wise as to how their money and wealth is being stolen and squandered on terrorism. Looks like they will not take it any longer. The USA is watching very closely for human rights violations! _E_
RT @BrazoriaCounty: __HTTP__ _E_
If I would have offered Obama a billion dollars to show his records he would have refused. _E_
Every Poll has me winning BIG.If you listen to dopey Karl Rove a Trump hater on @oreillyfactor you would think I'm doing poorly. @FoxNews _E_
I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
Ben Smith (is that really his last name?) of @BuzzFeed is a total mess who probably got his minion Coppins to do what he didn't want to do? _E_
So much Fake News is being reported. They don't even try to get it right or correct it when they are wrong. They promote the Fake Book of a mentally deranged author who knowingly writes false information. The Mainstream Media is crazed that WE won the election! _E_
How do you like Seth and Oscars so far? _E_
Senator Lindsey Graham called me yesterday very much to my surprise and we had a very interesting talk about national security and more! _E_
Wind Power is proving to be very costly and unsightly. _E_
Via @BostonDotCom by @lilsarg: "Donald Trump on Snow Salt Vaccines and the Oval Office" __HTTP__ _E_
Scary thought @JoeBiden is a heartbeat away from the Presidency. _E_
2013 is the worst year ever for Hollywood. Garbage released after garbage. What is going on in these studios?! _E_
Bob Beckel a commentator for FOX is bad for the @FoxNews brand: @BobBeckel is close to incompetent. _E_
The Midas Touch hand is the ideal metaphor to represent the attributes critical to entrepreneurial success. (cont) __HTTP__ _E_
Great Governor @Mike_Pence is in Indiana to help lead the relief efforts after tornadoes struck. True leadership. _E_
"Winning is the most important thing in my life after breathing. Breathing first winning next." George Steinbrenner _E_
Thank you Senator @TedCruz!#Debates2016 #MAGA __HTTP__ _E_
I watched POTUS speech from Europe same old tax and spend won't create jobs. _E_
Congratulations to Patrick Reed for winning at Trump National Doral. He told me The Blue Monster is the best course I've ever played _E_
How about President Obama fixing the gasoline situation instead of taking photo ops in the destruction. _E_
Watch me on @SeanHannity's show at 10PM tonight on @FoxNews _E_
Entrepreneurs: There are no guarantees but being ready sure beats being taken by surprise. Know everything you can about what you're doing. _E_
Thank you Rand! __HTTP__ _E_
W/ views of NYC's skyline Trump Stamford is Connecticut's most luxurious high rise featuring Trump amenities __HTTP__ _E_
'Uniforms 4 Everyone' campaign @fundanything has a $3000 goal to buy underprivileged kids school uniforms __HTTP__ _E_
Great debate poll numbers I will be on @foxandfriends at 7:00 to discuss. Enjoy! _E_
The Hillary Clinton staged event yesterday was pathetic. Be careful Hillary as you play the war on women or women being degraded card. _E_
Congratulations to @SpeakerRyan @GOPLeader @SteveScalise and to the Republican Party on Budget passage yesterday. Now for biggest Tax Cuts _E_
Gabriel Sherman's book on Roger Ailes is filled with falsehoods and inaccuracies. Publisher should be ashamed (and sued). _E_
Don't worry when our country starts hurting bad enough from all of the mistakes that are being made we will start doing the right things. _E_
"Don't expect to build up the weak by pulling down the strong." Calvin Coolidge _E_
The road to success is always under construction. Arnold Palmer _E_
Thank you Jason Greenblatt @JasonDovEsq For Our Children: Let's Elect Donald Trump __HTTP__ _E_
Receiving the @RobbReport trophy for best new golf course in the world Trump International Golf Links Scotland. __HTTP__ _E_
.@KathieLGifford Melania and I send our deepest condolences. Frank was a special and amazing person. He will be missed by all! _E_
.@FranksFight Keep fighting Frank! Never give up! _E_
"Always be prepared to start." Joe Montana _E_
Via @nypost by @StarrMSS: "Trump: @ApprenticeNBC contestants 'the meanest by far'" __HTTP__ _E_
With these record high gas prices what does it say about Obama that he was trying to brag about his energy policy in the debate? _E_
Looking forward to a full day of meetings with President Xi and our delegations tomorrow. THANK YOU for the beautiful welcome China! @FLOTUS Melania and I will never forget it! __HTTP__ _E_
I am doing Greta tonight on Fox talking about Obama Care and pervert Anthony Wiener! 10 P.M. _E_
Congratulations to Dubai on winning the rights to host Expo 2020! A great place winning a major global event.@damacofficial @dubaiexpo2020 _E_
Many Super Pacs funded by groups that want total control over their candidate are being formed to "attack" Trump. Remember when u see them _E_
The Unaffordable Care Act sometimes referred to as ObamaCare is not working. Millions of people are losing their plans and doctors fraud! _E_
Every time I speak of the haters and losers I do so with great love and affection. They cannot help the fact that they were born fucked up! _E_
China has control over North Korea! _E_
Congratulations to Barack Obama for having 2012's debt already surpass 2011 __HTTP__ _E_
My @WOR710 interview on The John Gambling Show discussing the 2012 election Trump real estate projects & our airports __HTTP__ _E_
Wow "FBI lawyer James Baker reassigned" according to @FoxNews. _E_
The Federal deficit crossed $15Trillion 100% of our GDP. Yet the Super Committee can't find $1.2Trillion i... (cont) __HTTP__ _E_
So funny Crooked Hillary called BREXIT so incorrectly and now she says that she is the one to deal with the U.K. All talk no action! _E_
Join me in Tampa Florida tomorrow at 1pmE! Tickets: __HTTP__ __HTTP__ _E_
We could make America great again by spreading ObamaCare throughout the World while at the same time dropping it from U.S.! _E_
We believe that every American should stand for the National Anthem and we proudly pledge allegiance to one NATION UNDER GOD! __HTTP__ _E_
Be sure to watch the Celebrity Apprentice on Sunday night 9 pm on NBC. __HTTP__ _E_
.@megynkelly is very bad at math. She was totally unable to figure out the difference between me and Cruz in the new Monmouth Poll 41to14. _E_
KAREN HANDEL FOR CONGRESS. She will fight for lower taxes great healthcare strong security a hard worker who will never give up! VOTE TODAY _E_
So nice being with Republican Senators today. Multiple standing ovations! Most are great people who want big Tax Cuts and success for U.S. _E_
.@GovMikeHuckabee Great job on @FoxNews tonight. Thanks for your nice words about my children. Class! _E_
Enter the Think Like A Champion signed book and keychain contest: __HTTP__ _E_
thought it would be hypocritical to attend Bush's swearing in....he doesn't believe Bush is the true elected president. Sound familiar! WP _E_
Speaking at the City Club of Chicago. Sold out in minutes with thousands on the wait list!... __HTTP__ _E_
Stock market hits another high with spirit and enthusiasm so positive. Jobs outlook looking very good! #MAGA __HTTP__ _E_
.@AC360 Anderson so amazing. Your mother is and always has been an incredible woman! _E_
Over 90% of American workers could lose their healthcare by 2020 thanks to ObamaCare. Repeal before it is too late! _E_
How the hell does the Libyan government get off telling our embassy security they can't have loaded guns for protection?! _E_
Jane Fonda and Michael Douglas look great! _E_
Afghanistan is a total disaster. We don't know what we are doing. They are in addition to everything else robbing us blind. _E_
My FoxBusiness interview with Don Imus discussing #TimeToGetTough the GOP primary and the Newsmax @iontv debate __HTTP__ _E_
As I anticipated Justice Roberts made the cover of Time Magazine etc. The liberal media now loves him he should be ashamed. _E_
Thank you @JoeTrippi for the nice and true words on #Media Buzz with terrific Howie Kurtz. Leading New Hampshire 30 to 12. @FoxNews _E_
Hillary has bad judgment! __HTTP__ _E_
Response to the Pope: __HTTP__ _E_
Trump to Liberty U Students: 'The World is Laughing at Us' __HTTP__ Via @Newsmax_Media _E_
"Perception about India has changed says Donald Trump" __HTTP__ via @EconomicTimes by Kailash Babar _E_
Why are the Republicans giving Obama fast track authority for TPP and the Iran agreement?! Obama gets more from the GOP than his own party. _E_
Sorry for all of the millions of people who long to hear my brilliant words of wisdom on Fox & Friends on Monday A.M. no go in Dubai. _E_
#ObamacareFail __HTTP__ _E_
When will President Obama issue the words RADICAL ISLAMIC TERRORISM? He can't say it and unless he will the problem will not be solved! _E_
#NeverForget __HTTP__ _E_
Entrepreneurs: Ask yourself: What am I pretending not to see? There may be some great opportunities right around you. _E_
Here's what I told @Gretawire on @FOX when it comes to singer @Cher's inappropriate attacks on @MittRomney __HTTP__ _E_
So terrible that Crooked didn't report she got the debate questions from Donna Brazile if that were me it would have been front page news! _E_
The real J.P.Morgan is spinning in his grave at the ridiculous settlements the bank is making to settle disputes. A settler is a soft target _E_
Obama: "I will control Ebola." = Obama: "If you like your health care plan you can keep your healthcare plan." _E_
Marco Rubio is a total lightweight who I wouldn't hire to run one of my smaller companies a highly overrated politician! _E_
My persona will never be that of a wallflower I'd rather build walls than cling to them Donald J. Trump _E_
"The most important thing in communication is hearing what isn't said." Peter Drucker _E_
I will beat Hillary easily but Lindsey Graham says I won't and yet he got zero against me no cred! Why does FOX put him on? _E_
The U.S. is spending fortunes at airports checking people coming in from West Africa with uncertain results. STOP THE FLIGHTS YOU DUMB B's! _E_
THANK YOU NEW YORK! #Trump2016 __HTTP__ _E_
Scotland is beautiful and Trump Internatonal Golf Links Scotland is progressing beautifully as well. __HTTP__ _E_
John Roberts arrived in Malta yesterday. Maybe we will get lucky and he will stay there. _E_
In 2011 I said that Mubarak never should have been ousted because whoever replaces him will be worse. Obama made a mistake. _E_
JOBS JOBS JOBS! __HTTP__ __HTTP__ _E_
Don't believe @BarackObama's whining Pro Romney SuperPAC spending is on par with Pro Obama SuperPAC __HTTP__ _E_
Fast and Furious gun running goes all the way to the White House. We need answers now! _E_
Just toured Baton Rouge Louisiana GREAT PEOPLE fantastic place doing really well. Miss USA Pageant totally sold out.Tomorrow night NBC _E_
I will be interviewed by @seanhannity tonight at 10:00 on @FoxNews . Much much much to talk about! _E_
.@Ynberg: Long term goal > > > to be the black @realDonaldTrump 4real .Great Dean and you will make it! _E_
Which National Costume do you think should win? __HTTP__ _E_
He @RickSantorum wants to decide what books people can read what movies they can see. #freespeech It doesn't work that way! _E_
Big announcement in Ames Iowa on Tuesday! You will not want to miss this rally! #Trump2016 __HTTP__ __HTTP__ _E_
We are asking law enforcement to check for dishonest early voting in Florida on behalf of little Marco Rubio. No way to run a country! _E_
What is our President doing? __HTTP__ _E_
I was so looking forward to being in Virginia Beach Virginia today. The demand for tickets was amazing. Good luck with storm back soon! _E_
these companies are able to move between all 50 states with no tax or tariff being charged. Please be forewarned prior to making a very ... _E_
Belated congratulations to @serenawilliams on winning the French Open. A great player & person! _E_
Ted Cruz poll numbers are down big. Because he was born in Canada and was until recently a Canadian citizen many believe he cannot run! _E_
Because of our terrible leaders it is now open season on every American throughout the world. Terrorists are thrilled. _E_
Success tip: Keep the big picture in mind. There are always opportunities & possibilities & thinking too small can negate a lot of them. _E_
The Obstructionist Democrats make Security for our country very difficult. They use the courts and associated delay at all times. Must stop! _E_
I aim very high and then just keep pushing and pushing to get what I'm after. The Art of the Deal _E_
Watching @TigerWoods on NBC playing great golf. Tiger won The WGC Cadillac Championship at Trump National Doral this year. I love Tiger! _E_
You can't build a reputation on what you're doing to do. Great quote by Henry Ford. _E_
Time to #DrainTheSwamp in Washington D.C. and VOTE #TrumpPence16 on 11/8/2016. Together we will MAKE AMERICA SAFE... __HTTP__ _E_
The Formula of Knowledge: The best way to learn is through studying the history of success and failures in your industry. _E_
WOW SO NICE AND SO TRUE. THANK YOU! @not_that_actor: @realDonaldTrump #TRUMP2016 TIME TO RETHINK THE CHOICES __HTTP__ _E_
I was always a big fan of Kim Novak and still am—a wonderful actress. _E_
Via @NJcomsomerset BY @wobriensomerset: @TigerWoods brings charity golf playoffs toTrump Nat'l/Bedminster __HTTP__ _E_
Why did @oreillyfactor give @davidaxelrod so much time to sell his third rate book. Bill should have hit stammering David MUCH harder! Waste _E_
Sadly when it comes to using the energy industry to create American jobs Obama has been a total disaster. #TimeToGetTough _E_
Just got back from Iowa great people! _E_
We are in the NAFTA (worst trade deal ever made) renegotiation process with Mexico & Canada.Both being very difficult may have to terminate? _E_
More than anything else I think deal making is an ability you're born with. It's in the genes. #TheArtofTheDeal _E_
RT @IvankaTrump: "The Trump economy is booming." One thing @realDonaldTrump "has done that has received little attention despite arguably d... _E_
.@realDonaldTrump is PRO LIFE PRO FAMILY #BigLeagueTruth #Debates2016 __HTTP__ _E_
I had fun appearing in the video for Carly Rae Jepsen's #CallMeMaybe for #MissUSA 2012 __HTTP__ _E_
That Saturday Night Live is able to joke about the Germanwings air tragedy is disgusting. They should apologize to all of those suffering! _E_
Will be on @foxandfriends at 7:00 this morning enjoy! _E_
Congratulations to @joniernst on her impressive @IowaGOP primary win last night. Now all should unite & defeat Bruce Braley this November _E_
Host of the @PGATOUR & @CadillacChamp @TrumpDoral is home to 4 unique courses including the famous Blue Monster __HTTP__ _E_
I will be on Greta @gretawire tonight at 10 PM on Fox News. _E_
Autism Speaks head up by Bob & Suzanne Wright does a fantastic job—if only we had more people like them! To help: __HTTP__ _E_
Congratulations to Obama on building a strong economy. There are 49500000 people on food stamps. A historic record! _E_
Animals representing Hillary Clinton and Dems in North Carolina just firebombed our office in Orange County because we are winning @NCGOP _E_
Via @newsbusters: "Donald Trump Issues Statement Regarding $5 Million Lawsuit Against Bill Maher" __HTTP__ _E_
The Republican Senators must step up to the plate and after 7 years vote to Repeal and Replace. Next Tax Reform and Infrastructure. WIN! _E_
Join me in Wichita Kansas tomorrow morning! Looking forward to it!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Pollster Trend National GOP Average223 national polls & 33 pollsters.#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
RT @EricTrump: Honored to speak at the RNC Summer Meeting in Nashville Tennessee this evening! @GOP #MAGA @GOPChairwoman __HTTP__ _E_
Via @BreitbartNews: 'MAJOR COUP': DONALD TRUMP PICKS UP TOP IOWA GRASSROOTS OPERATIVE FOR POTENTIAL 2016 CAMPAIGN __HTTP__ _E_
Great to be in Riyadh Saudi Arabia. Looking forward to the afternoon and evening ahead. #POTUSAbroad __HTTP__ _E_
Sexual pervert Anthony Weiner has zero business holding public office. _E_
Immigration reform is all risk for the @GOP. Their base doesn't want it and the 12M illegals will all vote Democrat. _E_
Gas is $6 already in California. Don' worry @BarackObama's Algae energy policy is going to pay major (cont) __HTTP__ _E_
The job plan by @BarackObama is nothing more than a second stimulus. The first failed and so will this one. _E_
In the latest poll Danger Weiner's numbers have sunk. I wonder how Carlos handled the stress? He is one whacko sicko sexter. _E_
.@johnhawkinsrwn Great speaking to you today we will speak again soon. _E_
I am working on a new system where there will be competition in the Drug Industry. Pricing for the American people will come way down! _E_
Kay Hagan profited off of the stimulus.She just skipped a debate. Kay supports amnesty weak border & __HTTP__ @ThomTillis! _E_
The @timestribune @EricTrump: Eyes are on Northeast Pa. with gas development __HTTP__ _E_
Stock Market hit another all time high yesterday despite the Russian hoax story! Also jobs numbers are starting to look very good! _E_
Hillary defrauded America as Secy of State. She used it as a personal hedge fund to get herself rich! Corrupt dangerous dishonest. _E_
My @foxandfriends interview from Monday discussing Obama's tone going over the curb and Republican debt ceiling card __HTTP__ _E_
Thank you @ATFD17! #ImWithYouVideo: __HTTP__ _E_
President Obama was terrible on @60Minutes tonight. He said CLIMATE CHANGE is the most important thing not all of the current disasters! _E_
Democrats used to support border security — now they want illegals to pour through our borders. _E_
"Confidence is contagious. So is lack of confidence." Vince Lombardi _E_
If the Boston killer applies for Obama Care the paperwork will be too complicated for him to understand! _E_
Congratulations to Chuck Hagel on one of the shortest tenures as Sec. of Defense. Another terrible appointee by Obama. _E_
I got to know @ScottWalker well—he's a very nice person and has a great future. _E_
I read @willweatherford's comments that "the lights are dimming on gambling in Florida"—nothing could be worse for the state. _E_
Congrats to @EricTrump and @LaraLeaYunaska on a great five years! _E_
These are facts: In 2001 the US opened its markets to China & since then more than 2 million Americans can't (cont) __HTTP__ _E_
If Obama attacks Syria and innocent civilians are hurt and killed he and the U.S. will look very bad! _E_
Meeting with Generals at Mar a Lago in Florida. Very interesting! _E_
.@AGSchneiderman should remove his eyeliner as pointed out by Cuomo when he does his commercials! _E_
Celebrity Apprentice returns to NBC Sunday 3/14 9 11PM ET/PT. Outstanding list of celebrities & season should be the best one yet! _E_
Via @Newsmax_Media by @OwenTew: "Donald Trump: Kerry Has to Walk If Iran Doesn't Make Deal" __HTTP__ _E_
Dummy political pundit @krauthammer constantly pressed the crazy war in Iraq. Many lives and trillions of dollars wasted. U.S. got NOTHING! _E_
Does anyone remember the fight @mcuban had w/ the referee—he was weak & pathetic—a non athlete trying to live life thru his players. _E_
.@NFL: Too much talk not enough action. Stand for the National Anthem. _E_
Working on major Trade Deal with the United Kingdom. Could be very big & exciting. JOBS! The E.U. is very protectionist with the U.S. STOP! _E_
Our foreign policy decisions are dumbest in U.S. history _E_
Ellen was so awkward and insecure last night. The pizza skit was terrible. She should dump Andy Lassner a guy with no absolutely no talent! _E_
"Partnerships also require negotiation. It should be a win win setup. Otherwise it's not a partnership." – 'Midas Touch' _E_
Join me in Wichita Kansas tomorrow morning! Looking forward to it!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
It is only the people that were never asked to be VP that tell the press that they will not take the position. _E_
Highly untalented Wash Post blogger Jennifer Rubin a real dummy never writes fairly about me. Why does Wash Post have low IQ people? _E_
Check out my interview from MSNBC at __HTTP__ _E_
Obama's ideas don't move us 'Forward' they take us 'Backwards.' These are ideas people come to America to get away from. @marcorubio _E_
Saw Michael Jordan and Ray Allen today playing golf at Trump National Doral the Blue Monster. Great guys! _E_
Capitalism requires capital. When government robs capital from investors it takes away the money that creates (cont) __HTTP__ _E_
Thanks everyone they all said I won the debate. Even won the @CNBC Poll! _E_
Success requires 100% of your focus and 100% of your effort. Don't sell yourself short. _E_
My 757 is incredible I think the teams agree on that. _E_
Economy growing! Excluding hurricane effects CEA estimates that real GDP growth would have been 3.9% in Q3.Stock market at a new high unemployment at a low. We are winning and TAX CUTS will shift our economy into high gear! __HTTP__ _E_
If everything seems under control you're not going fast enough. Mario Andretti _E_
We must stop releasing hard core criminals all over the United States. Our country must be strong again! _E_
Despite the fact that I have had great success with the words YOU'RE FIRED I do not like firing people. But ZERO on ObamaCare mess no way! _E_
"If you plan for the worst – if you can live with the worst – the good will always take care of itself." – The Art of the Deal _E_
Leaving Puerto Rico now for D.C. Will be in Las Vegas early tomorrow to pay my respects. Everyone is in my thoughts and prayers. __HTTP__ _E_
Late Night host are dealing with the Democrats for their very unfunny & repetitive material always anti Trump! Should we get Equal Time? _E_
#MakeAmericaGreatAgain __HTTP__ _E_
My @foxandfriends interview discussing the Super Bowl the real unemployment numbers Iran and @MittRomney's (cont) __HTTP__ _E_
I wonder what the work atmosphere is like @VanityFair. It must be hard working at a dying institution. _E_
Best ratings for the Dateline show were for six months not two months! _E_
Senate passed the VA Accountability Act. The House should get this bill to my desk ASAP! We can't tolerate substandard care for our vets. _E_
Post Debate via @OANN. Thank you!#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
"Trump Dana Farber waiting on Bill Maher" __HTTP__ via @BostonGlobe _E_
Why doesn't President Obama call upon the NSA to fix the badly broken website then they could spy on all of the many cheaters & arrest them! _E_
We need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_
If US Air & American Airlines are allowed to merge ticket prices will skyrocket—there will be no competition. _E_
Via @trscoop: "Mark Levin DEFENDS Trump: Hillary Clinton is a CROOK and a FRAUD and she's not treated this way!" __HTTP__ _E_
RT @RightlyNews: @realDonaldTrump @LouDobbs It is NOT a coincidence that the economy boomed immediately after the 2016 election. _E_
This is my pledge to the American people: __HTTP__ _E_
The reason I am staying in Bedminster N. J. a beautiful community is that staying in NYC is much more expensive and disruptive. Meetings! _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
It's going to get hotter in Las Vegas tonight! Watch the Miss Universe Pageant tonight on NBC at 9 p.m. I'm looking forward to being there! _E_
Brian Williams was never a smart guy but always passes himself off as such. People will learn the truth! @NBCNightlyNews _E_
.@bretbaier has a wonderful new book #specialheart and it's proving to be a great success already. Bret is a winner! _E_
I never thought I'd say it in my lifetime but President Barack Hussein Obama aka Barry Sotoro is a far worse president than Jimmy Carter! _E_
China has just overtaken us as the world's largest economy. We are busy wasting $'s while China builds airports & skyscrapers. _E_
Inauguration Day is turning out to be even bigger than expected. January 20th Washington D.C. Have fun! _E_
Awarded 5 Stars by @VisitScotland @TrumpScotland's MacLeod House & Lodge boutique hotel is an historic masterpiece __HTTP__ _E_
"What the mind can conceive and believe and the heart desire you can achieve." Norman Vincent Peale _E_
....and don't forget that Foxconn will be spending up to 10 billion dollars on a top of the line plant/plants in Wisconsin. _E_
If everything seems under control you're not going fast enough. Mario Andretti _E_
People are really liking my new book Crippled America. Check it out! _E_
It's about time for all Americans (Republicans & Democrats) to force our elected officials to start acting fiscally responsible! _E_
.@TheBrodyFile great job on @AC360. Thank you for the very smart and kind words! _E_
Chris Cuomo in his interview with Sen. Blumenthal never asked him about his long term lie about his brave service in Vietnam. FAKE NEWS! _E_
Congratulations to @TrumpPanama for winning the 2015 Traveler's Choice Award from @TripAdvisor __HTTP__ _E_
I don't care what people say I like Tom Cruise. He works his ass off and never ever quits. He's one of the few true movie stars. _E_
We will NEVER FORGET the victims who lost their lives one year ago today in the horrific #PulseNightClub shooting.... __HTTP__ _E_
Under Mayor @MikeBloomberg and Police Commissioner @Ray Kelly all violent crime in NYC is down dramatically. That's leadership. _E_
Expect the best from people. They will rise to the challenge and it's important to inspire confidence. _E_
Great @ANHQDC segment with @CharlesHurt: Breaking Down the Trump Factor __HTTP__ Let's Make America Great Again! _E_
The brand new Blue Monster just opened at Trump National Doral Miami. Also great new driving range which is open 'till midnight. GO SEE! _E_
Success tip: Achievers move forward at all times. Achievement is not a plateau it's a beginning. _E_
In Hillary Clinton's America things get worse. #TrumpPence16 __HTTP__ _E_
"Get it straight: Pakistan is not our friend. We've given them billions and billions of dollars and what (cont) __HTTP__ _E_
Message to Obama re: Iran: "The worst thing you can possibly do in a deal is seem desperate to make it." – The Art of the Deal _E_
I own @DannyZuker but he has his friends & haters & losers tweeting that he beat me. He can't beat me at anything! _E_
Still waiting for a response from @billmaher. Does he even have $5 million? _E_
Stupid George Will gave @MittRomney no chance 3 months ago. Take off his little spectacles and he's just another dummy. _E_
Just watched Jeb's ad where he desperately needed mommy to help him. Jeb mom can't help you with ISIS the Chinese or with Putin. _E_
.@TheBrodyFile was fantastic tonight on @CNN. Thank you we will MAKE AMERICA GREAT AGAIN! _E_
#WeeklyAddress __HTTP__ _E_
Instead of driving jobs and wealth away AMERICA will become the WORLD'S great magnet for innovation & job creation! __HTTP__ _E_
Success consists of going from failure to failure without loss of enthusiasm. Winston Churchill _E_
"Donald Trump turns over 11.5 ac.in Rancho Palos Verdes for recreational open space" __HTTP__ @DailyBreezeNews by @meg_barnes _E_
RT @Franklin_Graham: Join me in praying for @POTUS. He reminded the world "If the righteous many do not confront the wicked few then evil... _E_
CNBC Titans: Donald Trump will be shown Friday Nov 19th at 9 pm and 1 am Sunday 11/21 at 9 pm and 11/24 at 7 pm __HTTP__ _E_
So many stories about me in the @washingtonpost are Fake News. They are as bad as ratings challenged @CNN. Lobbyist for Amazon and taxes? _E_
The Democrats should be ashamed. This is a disgrace!#DrainTheSwamp __HTTP__ _E_
Feels good to be home after seven months but the White House is very special there is no place like it... and the U.S. is really my home! _E_
I never fall for scams. I am the only person who immediately walked out of my 'Ali G' interview _E_
Gas prices are going up big league—I told you so—payback to OPEC! _E_
.@KatrinaPierson you did a fantastic job tonight on @FoxNews. Thank you for your very tough and very smart representation! _E_
.@BilldeBlasio should focus on running #NYC & all of the problems that he has caused with his ineptitude & not be so focused on me! _E_
With Miami's top #NewYearsEve vacation package @TrumpDoral is the perfect option to celebrate the start of 2015. __HTTP__ _E_
"Obamacare Data Mismatch Could Leave Thousands Uninsured" __HTTP__ ObamaCare is not working and has missed all targets. _E_
I hope Bill Clinton starts talking about women's issues so that voters can see what a hypocrite he is and how Hillary abused those women! _E_
RT @foxandfriends: Senators learn the hard way about the fallout from turning on Trump __HTTP__ _E_
Common Core is a federal takeover of school curriculum. Department of Education should be disbanded not expanded. Focus on local education. _E_
I was asked about healthcare by Anderson Cooper & have been consistent I will repeal all of #ObamaCare including the mandate period. _E_
Food stamps up 45%. Federal handouts up 45%. Is @BarackObama happy? __HTTP__ _E_
Ultimately Trump Tower became much more than just another good deal. I work in it I live in it and I have a (cont) __HTTP__ _E_
With over 260 5 Star guest rooms & suites @TrumpTO is 65 stories of pure luxury in the center of downtown Toronto __HTTP__ _E_
The new joke in town is that Russia leaked the disastrous DNC e mails which should never have been written (stupid) because Putin likes me _E_
Finally an accurate story from the Washington Post! __HTTP__ _E_
Mayweather is getting absolutely killed! _E_
Speaking of our very stupid war with Iraq it is totally disintegrating and Iran (with Russia) will walk in and take it over (lots of oil)! _E_
Big dinner with Governors tonight at White House. Much to be discussed including healthcare. _E_
I'm not hearing much from Obama or his administration about my $5M offer to charity or to which charity the money will go. _E_
Stock Market just hit another record high! Jobs looking very good. _E_
NYC should hold a parade for returning Iraq and Afghanistan veterans. _E_
RT @DonaldJTrumpJr: If you live in Louisiana Maine Kentucky or Kansas remember to vote today! Together let's #MakeAmericaGreatAgain __HTTP__ _E_
Congratulations to @EmilyMiller @mboyle1 & @NolteNC on making @FishbowlDC's list of 10 Journos You Don't Want to Fight on Twitter. _E_
.@MarkHalperin works so hard but just doesn't have a natural instinct for politics. Others do and those are the people you want to follow! _E_
Isn't it crazy that people of little or no talent or success can be so critical of those whose accomplishments are great with no retribution _E_
Jodi if you're listening MAKE A DEAL! _E_
Today I'm in Aberdeen Scotland preparing for the July 10th opening of perhaps the world's greatest golf course __HTTP__ _E_
.@Larry_Kudlow 'Donald Trump Is the middle class growth candidate' __HTTP__ _E_
I am watching @FoxNews and how fairly they are treating me and my words and @CNN and the total distortion of my words and what I am saying _E_
The right leadership can help economy while creating security around the world. Let's make America great again! __HTTP__ _E_
Thanks to @johnrich for putting on such a great concert fot @Stjude. John was a winner on Celebrity Apprentice and is a fantastic guy. _E_
Thank you Michigan. This is a MOVEMENT. We are going to MAKE AMERICA SAFE AND GREAT AGAIN! #TrumpPence16 __HTTP__ _E_
"Watch what people are cynical about and one can often discover what they lack." General George S. Patton _E_
It's been stated that dopey NY @AGSchneiderman used cocaine while he was a state senator. __HTTP__ _E_
Today @BarackObama is in Ohio on a bus tour. Tomorrow Pennsylvania. How about actually running the country? _E_
SHOCK! ObamaCare will cost double what @BarackObama promised over $1.76 __HTTP__ and result (cont) __HTTP__ _E_
"Donald trump files statement of candidacy" __HTTP__ via @CBSNews _E_
Vancouver's most anticipated hotel & residences @TrumpVancouver will unveil Canada's first Mar a Lago Spa __HTTP__ _E_
Just left a great rally in Florida now heading to Ohio for two more. Will be there soon. _E_
QE3 a political favor for Obama will cause record inflation on food and fuel. This hits low income families the hardest. Big mistake. _E_
.@ABFAlecBaldwin P.S. Your brother @StephenBaldwin7 is doing very well on @ApprenticeNBC and he stated he adores you. _E_
Congrats to @MiamiHEAT on winning @NBA championship. @MickyArison is a tremendous owner & has done wonders for (cont) __HTTP__ _E_
Enviro friendly? AP IMPACT: Obama administration allows wind farms to kill eagles birds despite federal laws __HTTP__ _E_
Sen. Corker is the incompetent head of the Foreign Relations Committee & look how poorly the U.S. has done. He doesn't have a clue as..... _E_
Don't be easily pleased with yourself or with anything else. Be tough & fight to keep your standards high. Think Like a Champion _E_
In light the Benghazi emails released last night it is apparent that Obama has no problem lying to the American public... _E_
Don Butler and executives are doing a great job at @Cadillac the cars are fantastic. _E_
Huma should dump the sicko Weiner. He is a calamity that is bringing her down with him. _E_
He @BarackObama promised to close Gitmo in his first year. It is still open 3 years later and about to get a (cont) __HTTP__ _E_
Ashley Judd Targeted by @karlrove's Super PAC in Ad (Video) __HTTP__ _E_
Via @FurnitureToday by Cindy W. Hodnett: "Dorya to introduce Trump Home high end furniture" __HTTP__ _E_
People love @LilJon! __HTTP__ #CelebApprentice _E_
China just hacked our federal government & stole gov. workers' information. Why do our leaders let China get away with this?! No respect. _E_
I truly believe that our country has the worst and dumbest negotiators of virtually any country in the world. _E_
#TrumpVine @arod sucks! __HTTP__ _E_
Our inner cities have been left behind. We will never have the resources to support our people if we have an open border. _E_
Judy Garland was much better to put it mildly! #Oscars _E_
U.S. small businesses are truly worried about rising healthcare costs and taxes __HTTP__ I told you so! _E_
True courage is being afraid and going ahead and doing your job anyhow that's what courage is. Gen. Norman Schwarzkopf (1934 2012) _E_
Thank you Virginia! 15000 amazing supporters! Everyone get out and #VoteTrump tomorrow! __HTTP__ _E_
Great afternoon in Little Havana with Hispanic community leaders. Thank you for your support! #ImWithYou __HTTP__ _E_
Obama never consulted with Congress about a prisoner exchange. HE BROKE THE LAW AND SHOULD BE TRIED. OUR PRESIDENT IS A TOTAL DISASTER! _E_
Zegarelli and Vescio: Pine Road looks like hell. Must be re paved now—very bad for town. @BriarcliffManor _E_
Thank you for the wonderful welcome @WEF! #Davos2018 __HTTP__ _E_
If America was under the threat of imminent attack would Obama use torture or a kiss? _E_
.@NRO Really important to save National Review from going out of business. We need a true conservative voice! _E_
With only a very small majority the Republicans in the House & Senate need more victories next year since Dems totally obstruct no votes! _E_
Logic will get you from A to B. Imagination will take you everywhere. Albert Einstein _E_
"Develop success from failures. Discouragement and failure are two of the surest stepping stones to success." – Dale Carnegie _E_
The NYC casting call for The Apprentice is thisThursday April 1 at Trump Tower. For all the information you need go to NBC.com/casting. _E_
Pathetic attempt by @foxnews to try and build up ratings for the #GOPDebate. Without me they'd have no ratings! __HTTP__ _E_
We had a wonderful visit to Vietnam thank you President Tran Dai Quang! Heading to the #ASEANSummit 50th Anniv Gala in the Philippines now. __HTTP__ _E_
Excited that @OurCountryPAC's Amy Kremer has endorsed the Newsmax iontv debate. The Tea Party Express is a great group. _E_
Yesterday Barack Obama said he wants wind turbines manufactured here in China __HTTP__ I don't think this was a gaffe. _E_
George Ross could be right—@THEGaryBusey would be better in the adventure task than the romance task. #CelebApprentice _E_
Thank you @GolfMagazine for your fantastic review of The Blue Monster at Trump National Doral BEST U.S. RESORT RENOVATION & ALL TIME _E_
On my way to San Diego to raise money for the Republican Party. I am spending a lot myself and also helping others. _E_
I'll be doing @piersmorgan show tonight on CNN at 9 PM. Will be very interesting. (I hope!) _E_
Focus on your goals not your problems. Problems are a mind exercise learn to play beyond your comfort zone. _E_
Thank you our great honor! __HTTP__ _E_
GREAT EVENING last night in Pensacola Florida. Arena was packed to the rafters the crowd was loud loving and really smart. They definitely get what's going on. Thank you Pensacola! _E_
Remember to think big by expanding your horizons at the same time you're expanding your net worth. _E_
Just finished speaking in Sydney Australia in front of 20000 people and today I'm off to Melbourne for anot... (cont) __HTTP__ _E_
Glad to hear that @FLGovScott will be speaking at the @RNC Convention. He is a true conservative and fantastic governor! _E_
Entrepreneurs: Follow your own path—it will bring you to the places you were meant to be. _E_
I bought the great Turnberry Resort today considered by many to have the greatest golf course in the World. I will take good care of it! _E_
Plain & Simple: We should only admit into this country those who share our VALUES and RESPECT our people. __HTTP__ _E_
.@StephenBaldwin7's mother thinks I'm very handsome. Now I see where Stephen and Alec get their smarts. #CelebApprentice _E_
One positive from last week for Lance was that everyone was focused on Manti Te'o! Why did Lance do that interview? _E_
#TeamTrump is thinking of Captain Andrew Maitner. A true American hero. #MaitnerStrong __HTTP__ __HTTP__ _E_
Our country has to come together. We have to start working with and really liking each other. The whole world is watching Baltimore. _E_
An analysis showed that Bernie Sanders would have won the Democratic nomination if it were not for the Super Delegates. _E_
Watching biased Charles @krauthammer a @FoxNews flunky who didn't know that I won every debate in particular the last one. Check polls! _E_
For you newcomers George Ross was one of my first advisors on the original Apprentice. #CelebApprentice _E_
So many great things happening new poll numbers looking good! News conference at 11:00 A.M. today Trump Tower! _E_
RT @DonaldJTrumpJr: Nice piece and video today in the Wall Street Journal: Trump's three eldest children jump into campaign __HTTP__ _E_
The State of Florida is so embarrassed by the antics of Crooked Hillary Clinton and Debbie Wasserman Schultz that they will vote for CHANGE! _E_
I had a great day in D.C. even though the subject was an unpleasant one the horrible Iran Nuke deal. Amazing crowd and enthusiasm! _E_
The New York Times should never have moved out of their magnificent original home... _E_
The era of strategic patience with the North Korea regime has failed. That patience is over. We are working closely... __HTTP__ _E_
I do what I do out of pure enjoyment. Hopefully nobody does it better. Theres a beauty to making a great deal. It's my canvas. _E_
Hey @KimKardashian I hear you are undecided in the election. I can explain why you should vote for @MittRomney. _E_
China is about to acquire 82800 net acres of a Texas shale oil and gas field __HTTP__ What are we doing! _E_
Keep difficulties in perspective. Ask yourself is this a blip or is it a catastrophe? _E_
Just arrived in Italy after having a very successful NATO meeting in Brussels. Told other nations they must pay more not fair to U.S. _E_
Thank you @chucktodd for your commentary last night on @NBCNightlyNews. Very fair we are making progress together! _E_
.@alexsalmond @pressjournal @BBCNews RT @DanScavino the photos that they don't show the public... __HTTP__ _E_
The purpose of China's massive military buildup on the Nork's border is to intimidate us. China attacked us during the Korean War. _E_
The people of South Carolina are embarrassed by Nikki Haley! _E_
Be sure to tune in to another amazing episode of #CelebApprentice this Sunday on @nbc at 9PM EST! This Sunday's (cont) __HTTP__ _E_
You will love Celebrity Apprentice tonight 9 PM on NBC. Must watch from beginning two early firings! _E_
Hope & Change the number of 26 year olds living with parents has jumped 46% under Obama __HTTP__ Four more years? _E_
Thank you CBS & Breitbart total vindication! Will the mainstream media apologize? Many many witnesses. #Trump2016 __HTTP__ _E_
Remember: Obama turned down $5M to charity which I said I would increase by 10X to $50M just to show simple records. He's hiding lots! _E_
Thank you Grand Rapids Michigan! #ICYMI watch: __HTTP__ __HTTP__ _E_
Via @CBNNews: Exclusive: Backstage Interview w/ Donald Trump at CPAC __HTTP__ by @TheBrodyFile Great seeing you David! _E_
I'm getting The Commandant's Leadership Award from the U.S.Marines tonight at The Waldorf Astoria a great honor! @BretBaier _E_
I told @megynkelly that @oreillyfactor and I had identical views on a certain issue and she cut it out of the taped interview. Why? Too bad! _E_
My thoughts on @andyroddick in today's #trumpvlog.... __HTTP__ _E_
.@FrankLuntz your so called focus groups are a total joke. Don't come to my office looking for business again. You are a clown! _E_
The only way to do great work is to love what you do. – Steve Jobs _E_
.@Rosie—No offense and good luck on the new show but remember you started it! __HTTP__ _E_
I love New Hampshire will be an exciting evening! _E_
I would do same thing if I were China. They want Obama. __HTTP__ _E_
Just won The Club Championship at Trump International Golf Club in Palm Beach lots of very good golfers never easy to win a C.C. _E_
RT @ReutersPolitics: Trump to give $5 million to charity if Obama releases records __HTTP__ _E_
In a little reported event China has just overtaken the United States as the NUMBER ONE World economic power! Great going Washington! _E_
A Rod's appeal will go nowhere. He will get a long suspension. Good for the @Yankees. And sends strong message to @MLB players. _E_
Thank you @elvisduran for dedicating your birthday today to the @EricTrumpFdn for @StJude! Click here to donate: __HTTP__ _E_
People get what is going on! __HTTP__ _E_
#ICYMI: Will Media Apologize to Trump? __HTTP__ _E_
I don't get @billmaher and his terrible show he is dumb as a rock but tries so hard to pass himself off as a great intellect. Check past! _E_
He is a professional and true gentleman: @GeorgeTakei is one of my favorite contestants from #CelebApprentice. _E_
Tomorrow is the 10 year anniversary of the Apprentice one of the biggest hits in television history. How time flies! _E_
I hear that @SenTedCruz's $$ man Robert Mercer a good man is very angry because Cruz lied to him about liquidating his (Ted's) holdings.? _E_
DON'T LET HER FOOL US AGAIN. __HTTP__ _E_
Looking forward to speaking at tonight's gala for @MittRomney supporters at the Intrepid. Mitt's doing well. _E_
Donald Trump song is up to almost 60 million hits crazy! _E_
NobamaCare won't work never will work and can't work it is a total waste of time and energy except that it is hurting people (& economy!) _E_
RT @RealBenCarson: Please read my full endorsement of @realDonaldTrump for President of the United States: __HTTP__ _E_
The press has very inaccurately covered this event see for yourself! __HTTP__ _E_
Will be doing @greta interview tomorrow. So much to talk about! _E_
RT @IvankaTrump: Since @realDonaldTrump inauguration over 1 million net new jobs have been created in the American economy! #MAGA _E_
...Why did the DNC REFUSE to turn over its Server to the FBI and still hasn't? It's all a big Dem scam and excuse for losing the election! _E_
RT @usairforce: "#AirForce relief efforts in #PuertoRico & #VirginIslands" __HTTP__ _E_
Sugar @Lord_Sugar Unlike yours my financials are phenomenal. People don't know your real numbers & would not be impressed. _E_
A Clinton already defeated a Bush. The definition of insanity is doing the same thing twice & expecting a different result. _E_
We don't have a country if we don't have borders. #VoteTrump Video: __HTTP__ __HTTP__ _E_
North Korea has just launched another missile. Does this guy have anything better to do with his life? Hard to believe that South Korea..... _E_
Donald Trump Plans To Continue GOPLegacy Of Leading On Women's Civil Rights Against Racist Sexist Democrats __HTTP__ _E_
The media is so in the tank for Obama that it is amazing—the funny thing is he can't stand them! _E_
What would All Star @ApprenticeNBC be w/out a Baldwin? @StephenBaldwin7 is at the top of his game this season. Our fans will be happy. _E_
Dopey @BillKristol who has lost all credibility with so many dumb statements and picks said last week on @Morning_Joe that Biden was in. _E_
Can you imagine not taking Snowden's passport away before he jetted happily away to foreign lands (where he gave away many U.S. secrets). _E_
Catch the second part of my interview with Bill O'Reilly tonight at 8pm on Fox News.... _E_
#CrookedHillary is not qualified! __HTTP__ _E_
N.A.T.O. is obsolete and must be changed to additionally focus on terrorism as well as some of the things it is currently focused on! _E_
Thank you. __HTTP__ _E_
The approval process for the biggest Tax Cut & Tax Reform package in the history of our country will soon begin. Move fast Congress! _E_
Unsustainable @BarackObama has increased total federal budget outlays by over 24% during his term __HTTP__ He loves debt. _E_
Rosie O'Donnell has failed again. Her ratings were abysmal and Oprah cancelled her on Friday night. When will (cont) __HTTP__ _E_
Get rid of all of these commercials. #DemDebate _E_
I hearby demand a second investigation after Schumer of Pelosi for her close ties to Russia and lying about it. __HTTP__ _E_
Here is my statement. __HTTP__ _E_
The U.S. has 69 treaties with other countries where we would have to defend them and their borders. How nice but what do we get? NOT ENOUGH _E_
.@PrimeMinisterSX has no clue what's going on in St. Maarten. Mullet Bay is a third world slum. _E_
Heading to the Great State of Wisconsin to talk about JOBS JOBS JOBS! Big progress being made as the Real News is reporting. _E_
New GOP platform now includes language that supports the border wall. We will build the wall and MAKE AMERICA SAFE AGAIN! _E_
Have a fantastic beautiful and happy Easter everyone and then when Easter is over have great wins and triumphs in life. Never give up! _E_
Totally dishonest Donna Brazile chokes on the truth. Highly illegal! Watch: __HTTP__ __HTTP__ _E_
The failing @nytimes just announced that complaints about them are at a 15 year high. I can fully understand that but why announce? _E_
Doral Tournament was great best 18th hole in golf and a wonderful winner in @JustinRose _E_
RT @realDonaldTrump: Senator Dicky Durbin totally misrepresented what was said at the DACA meeting. Deals can't get made when there is no t... _E_
.@HillaryClinton lists litany of ways she plans to restrict gun rights. 2A will not survive a Hillary presidency. #Debate #BigLeagueTruth _E_
It is Clinton and Sanders people who disrupted my rally in Chicago and then they say I must talk to my people. Phony politicians! _E_
45000 construction & manufacturing jobs in the U.S. Gulf Coast region. $20 billion investment. We are already winning again America! _E_
Thank you Columbus Ohio! __HTTP__ _E_
America is at a great disadvantage. Putin is ex KGB Obama is a community organizer. Unfair. _E_
If you think you can do a thing or think you can't do a thing you're right. Henry Ford _E_
Besides an award winning golf course @TrumpGolfLA features exquisite estates on top the Palos Verdes Peninsula __HTTP__ _E_
The failing @nytimes has become a newspaper of fiction. Their stories about me always quote non existent unnamed sources. Very dishonest! _E_
Ebola has been confirmed in N.Y.C. with officials frantically trying to find all of the people and things he had contact with.Obama's fault _E_
Remember I am the only one who is self funding my campaign. All of the other candidates are bought and paid for by special interests! _E_
It does not cost anything to dream. Spend your time enjoying your big dreams. Think Big _E_
People are really unhappy with the endless security checks at the new World Trade Center. Durst is a terrible manager. Tenants furious! _E_
Congrats to Senator McConnell and @TheTeaParty_net's Kellen Guida on yesterday's successful Tea Party Caucus __HTTP__ _E_
"Trump Brand Expands To South America: The Donald Lends His Name To Luxury Tower In Uruguay" __HTTP__ via @Forbes _E_
Blue Ribbon Commission to find and agree to future spending cuts? Bad idea. _E_
Thank you Georgia! 15000 amazing supporters tonight! Everyone get out & #VoteTrump tomorrow! #SuperTuesday __HTTP__ _E_
.@GovernorPerry failed on the border. He should be forced to take an IQ test before being allowed to enter the GOP debate. _E_
Friends of mine who are driving Cadillacs it is becoming a very hot car are raving about what a great job @Cadillac has done. _E_
It means so much to me receiving an endorsement from Phyllis Schlafly. A truly great woman & conservative. __HTTP__ _E_
.@VanityFair magazine is doing so poorly that they make even @NYMag look good. Graydon Carter should've been fired a long time ago. _E_
I just gave lots of money away at Trump Tower to people who needed it...they were very happy and appreciative! _E_
Have a good chance to win Texas on Tuesday. Cruz is a nasty guy not one Senate endorsement and despite talk gets nothing done. Loser! _E_
If the government doesn't start working together the media is right & we will hit a fiscal cliff. We need to avoid this. _E_
The lobbyists & special interests have just put out an ad for Jeb which hits me just a little but is very false! _E_
Into our first week of filming @ApprenticeNBC the Celebrities are already turning up the heat. Major fireworks! _E_
Everyone is talking about the incredible event we had in Dallas last night. Spectacular crowd & arena! Thank you @mcuban. _E_
Via @CNNPolitics by @teddyschleifer: Trump: San Francisco killing shows perils of illegal immigration __HTTP__ _E_
.@Lexi Great job in winning your first of many majors . We are proud of you at Trump International. Work hard be an all time great! _E_
Fantastic job on @CNN tonight. @kayleighmcenany is a winner! @donlemon _E_
.@MELANIATRUMP and I are looking forward to watching @AnnDRomney's speech tonight. She is an amazing woman who will be a great First Lady! _E_
OPEC is setting crude at $94/barrel on 'signs US economy is improving.' OPEC uses any excuse to rip us off and our leaders just watch. _E_
A ship is only as good as the people who serve on it — and the AMERICAN SAILOR is the BEST in the world. @USNavy #USSGeraldRFord __HTTP__ _E_
Obama can sign an illegal executive action anytime for ObamaCare but he can't fix the illegal loophole. _E_
Love seeing union & non union members alike are defecting to Trump. I will create jobs like no one else. Their #Dem leaders can't compete! _E_
Join us tomorrow night in Charleston South Carolina! #SCPrimary #Trump2016 __HTTP__ _E_
They should close down Rolling Stone Magazine after the phony rape charge story. University of Virginia should sue them for big bucks! _E_
It was so great being in Nebraska last week. Today is the big day get out and vote! _E_
The Republican establishment out of self preservation is concerned w/ my high poll #'s. More concerned are Dems—I beat Hillary heads up! _E_
Meeting with biggest business leaders this morning. Good jobs are coming back to U.S. health care and tax bills are being crafted NOW! _E_
Andy Williams has died. He was a friend of mine and a great guy. _E_
That's Adrian in the elevator— he works at @TrumpTowerNY & he's got a lot of stories. #CelebApprentice _E_
So many people have told me that I should host Meet the Press and replace the moron who is on now. Just too busy especially next 10 years! _E_
Today the Democrats lose big. But tomorrow the Republicans must communicate a positive pro growth agenda. _E_
The scum that gets high on badly hurting old ladies and others through knockout assaults wouldn't feel that way with a gun at their head! _E_
It was a great honor to welcome President Petro Poroshenko of Ukraine to the @WhiteHouse today with @VP Pence.... __HTTP__ _E_
#TimeToGetTough presents bold solutions on taxes national security the debt dealing with OPEC and China and defeating @BarackObama. _E_
My interview with @HowardKurtz on #MediaBuzz will air tomorrow on @Fox at 11am and 5pm. Great job Howie very insightful. _E_
The Obama Economy workers added to disability and individuals added to food stamps more than doubles net jobs created __HTTP__ _E_
I met Prince on numerous occasions. He was an amazing talent and wonderful guy. He will be greatly missed! _E_
Great new numbers. Thank you! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Be sure to watch my #CPAC2015 speech with intro by @DLoesch and a Q&A with @seanhannity __HTTP__ _E_
#ElectionDay __HTTP__ __HTTP__ _E_
Unions who secure the border oppose the amnesty bill __HTTP__ Their expert opinions should at least be listened to. _E_
So Obama and Congress can waste billions in Iraq & Afghanistan building roads & schools but can't get money to the NJ & NY Sandy victims? _E_
Remember tonight's 8 o'clock episode of Celebrity Apprentice is the best ever—you will see nothing like it on tv. @ApprenticeNBC _E_
Mexico will pay for the wall 100%!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_
Our economy has had worst recovery under Obama since the Depression. Results of his policies speak for themselves. No new taxes! _E_
Russian leaders are publicly celebrating Obama's reelection. They can't wait to see how flexible Obama will be now. _E_
Bob Corker gave us the Iran Deal & that's about it. We need HealthCare we need Tax Cuts/Reform we need people that can get the job done! _E_
.@Franklin_Graham so many people have tweeted about your amazing words to me thank you! Heading to big crowd in South Carolina! _E_
Watch the Miss Universe competition LIVE from the Bahamas Sunday 8/23 @ 9pm (ET) on NBC: __HTTP__ _E_
Greece should get out of the euro & go back to their own currency they are just wasting time. _E_
Wow Twitter Google and Facebook are burying the FBI criminal investigation of Clinton. Very dishonest media! _E_
The Governor of Puerto Rico Ricardo Rossello is a great guy and leader who is really working hard. Thank you Ricky! _E_
I will be going to Texas and Louisiana tomorrow with First Lady. Great progress being made! Spending weekend working at White House. _E_
In the UK taxpayers are wasting £24 million on wind farms that don't even operate. __HTTP__ They (cont) __HTTP__ _E_
Thanks. __HTTP__ _E_
My thoughts on the situation in Norway and Amanda Knox... __HTTP__ #trumpvlog _E_
RT @Scavino45: .@POTUS @realDonaldTrump in the Oval Office w/senior U.S. military leaders prior to dinner hosted by the President & First L... _E_
President Obama's weakness and indecision may have saved us from doing a horrible and very costly (in more ways than money) attack on Syria! _E_
INTELLIGENCE INSIDERS NOW CLAIM THE TRUMP DOSSIER IS A COMPLETE FRAUD! @OANN _E_
... but @billmaher is allowed to say that about me. _E_
The reporter who pulled back from his 14 year old never retracted story is having fun. I don't know what he looks like and don't know him! _E_
Conde Nast made a big mistake going into the World Trade Center. The place is a total disaster and I feel this is only the beginning! _E_
RT @T_Lineberger: Thanks @IvankaTrump for coming to help win Michigan! More people here than a Hillary rally with less than 24 hours notice... _E_
For accurate reporting of my @CPACnews speech read @PoliticalTicker @Newsmax_Media @politico @HuffPostPol.... _E_
Not good news for Jeb Bush __HTTP__ _E_
My shirts ties and suits are selling great @Macy's because they are the best and most stylish at a really reasonable price thanks! _E_
It is a great victory for NYC that A Rod will never wear pinstripes again. _E_
Consumer spending fell in September __HTTP__ Another indicator the 7.8% unemployment number is cooked. _E_
The @nytimes was very nice in reporting that @CelebApprentice was #1 on all television for "top brand impact 2012." Thank you! _E_
Great poll Florida thank you! #ImWithYou #AmericaFirst __HTTP__ _E_
Even the SEALS who killed Bin Laden don't like @JoeBiden __HTTP__ _E_
Happy Birthday @EricTrump! __HTTP__ _E_
Thank you Alex! __HTTP__ _E_
Who is winning the debate so far (just last name)? #DemDebate _E_
Of course the Australians have better healthcare than we do everybody does. ObamaCare is dead! But our healthcare will soon be great. _E_
Has Pres. Obama or the White House told the public what happened in Algeria yet? Where's the media? _E_
It is amazing how @LindseyGrahamSC gets on so many T.V. shows talking negatively about me when I beat him so badly (ZERO) in his pres run! _E_
"The future is always beginning now." Mark Strand former Poet Laureate _E_
"Be sure you put your feet in the right place then stand firm." Abraham Lincoln _E_
Unless you catch hackers in the act it is very hard to determine who was doing the hacking. Why wasn't this brought up before election? _E_
Remember that things are cyclical so be resilient be patient be creative and remain positive. Think Like a Champion _E_
Trump Golf Links at Ferry Point an 18 hole public golf course in the Bronx New York is opening soon! __HTTP__ _E_
Join me tomorrow in Des Moines Iowa with Vice President Elect @mike_pence at 7:00pm!#ThankYouTour2016 #MAGA... __HTTP__ _E_
How can FBI Deputy Director Andrew McCabe the man in charge along with leakin' James Comey of the Phony Hillary Clinton investigation (including her 33000 illegally deleted emails) be given $700000 for wife's campaign by Clinton Puppets during investigation? _E_
In war there is no substitute for victory. Douglas MacArthur _E_
Mogul Donald Trump has many powerful friends. And it turns out one of them is Anna Wintour." __HTTP__ via @FoxNews _E_
Thank you for all of your support! Let's #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_
The Old Post Office building in Washington (D.C.) will soon be transformed into one of the great hotels anywhere in the world lots of jobs! _E_
I want to do negative ads on John Kasich but he is so irrelevant to the race that I don't want to waste my money. _E_
Thank you Sacramento California! #MakeAmericaGreatAgain __HTTP__ _E_
"Ability is nothing without opportunity." Napoleon Bonaparte _E_
President Obama was able to fool the Americans by getting elected but not able to fool Vladimir Putin. Too bad for us! _E_
Will be on @SeanHannity tonight at 10pmE delivering an important speech live from Wisconsin. #MakeAmericaGreatAgain _E_
Why has all time hits leader Pete Rose paid a 20 year price whrn A Rod gets 200 game penalty. It's time to let Pete into The Hall of Fame! _E_
Good Morning America is thrilled @Rosie is working for the @todayshow that means almost guaranteed success for @GMA _E_
Our great VPE @mike_pence is in Louisiana campaigning for John Kennedy for US Senate. John will be a tremendous help to us in Washington. _E_
Had a great time on @IngrahamAngle this morning. _E_
Great new poll numbers! Thank you for your support! #Trump2016 __HTTP__ _E_
.@MattGinellaGC Thx for the nice story @TrumpDoral. Look forward to showing you Trump Int'l in Aberdeen in the spring & Turnberry plans. _E_
#sweepstweet @teresa_giudice definitely fell under @lisalampanelli's negotiation skills—an important business tool. _E_
Via @Newsmax_Media: Trump @oreillyfactor Make Up After Digs at Each Other __HTTP__ _E_
I will be on @oreillyfactor tonight on @FoxNews at 8 PM and 11 PM. _E_
I build beautiful websites with very smart and imaginative people for almost NOTHING. OUR GOVERNMENT SPENT ALMOST $535 000 000 for NOTHING _E_
I am the only candidate (in many years) who is self funding his campaign. Lobbyists and $ interests totally control all other candidates! _E_
ISIS is on the run & will soon be wiped out of Syria & Iraq illegal border crossings are way down (75%) & MS 13 gangs are being removed. _E_
. @BBCNews' child molestation sex scandal is the latest in continued downward spiral of BBC.I know personally they do not check for accuracy _E_
As a candidate I promised we would pass a massive tax cut for the everyday working Americans. If you make your voices heard this moment will be forever remembered as a great new beginning – the dawn of a brilliant American future shining with PATRIOTISM PROSPERITY AND PRIDE! __HTTP__ _E_
Via @TIME by @lullintheaction: #REALTIME: Donald Trump Weighs a 2016 Run At #CPAC2015 __HTTP__ _E_
.@stuartpstevens did a horrible job for Mitt—is a refund in order? Sadly Stuart is a disaster! _E_
The Republicans look so weak and foolish—what the hell are they doing? _E_
Entrepreneurs: Set the bar high. Do the best you possibly can. Apply your skills and talent but above all be tenacious. _E_
The Trump Doctrine: Peace Through Strength. #Trump2016 __HTTP__ __HTTP__ _E_
Great win last night by Peyton Manning & @Denver_Broncos in San Diego coming from 24 points behind on the road. Very impressive. _E_
What a great time we just had in the atrium of Tump Tower for __HTTP__ The place was happy and packed! _E_
Admiral McRaven had full operational control of the Bin Laden mission __HTTP__ @BarackObama gave vague directions. _E_
Thank you for the massive turnout tonight Cleveland Ohio! Get out & VOTE #TrumpPence16 on 11/8.Watch rally here:... __HTTP__ _E_
.@DottieandBogey Thanks for nice comments over weekend re Turnberry. You and your husband have fantastic taste! Also great commentary. _E_
Congratulations to @BretBaier on his five year anniversary as the anchor @SpecialReport. Brett is great! _E_
When the stupid people start feeling sorry for the Boston killer and want to release him and give him medals remember the killings maimings _E_
Virginia's highest rated wine by @WineEnthusiast @trumpwinery is inspired by the regions of Bordeaux & Champagne __HTTP__ _E_
The United States needs to fix its own problems of which there are many first! _E_
The hardest thing Clinton has to do is defend her bad decision making including Iraq vote e mails etc. _E_
Great pick by Buffalo Sammy Watkins will be GREAT! _E_
General John Allen who I never met but spoke against me last night failed badly in his fight against ISIS. His record = BAD #NeverHillary _E_
Only 36 days until the election. @MittRomney needs to stay on offense. Make Obama's terrible record the issue. #TimeToGetTough _E_
.@IamStevenT visited me at @TrumpTowerNY what a great guy! __HTTP__ _E_
Take a look at what happened w/ Bill Clinton. The system is totally rigged. Does anybody really believe that meeting was just a coincidence? _E_
Trump's Menie golf resort enjoys bumper first year __HTTP__ via @TheScotsman _E_
I look forward to meeting @joniernst today in New Jersey. She has done a great job as Senator of Iowa! _E_
... By releasing his records he can come clean with the American people and have $5 million go to a charity. _E_
WikiLeaks emails reveal Podesta urging Clinton camp to 'dump' emails. Time to #DrainTheSwamp! __HTTP__ _E_
Our Southern border is totally out of control. This is an absolutely disgraceful. situation. __HTTP__ We need border security! _E_
Saw @mcuban try to hit a ball in Lake Tahoe while I played in tournament he's got no talent or strength!!!! @TMZ _E_
.@MattGinellaGC @GCMorningDrive Matt will be talking about Trump National Doral tomorrow A.M. Terrific guy looking forward to it! _E_
.@antbaxter I predict somebody is going to sue you! _E_
Many people are now saying I won South Carolina because of the last debate. I showed anger and the people of our country are very angry! _E_
Why does @KarlRove lie about his Reagan credentials? __HTTP__ He's a Bushie through and through. _E_
One good aspect of the Obama depression is that it will separate the winners from the losers. If you can make it now you deserve it! _E_
"Strong men have sound ideas and the force to make these ideas effective." Andrew Mellon _E_
.@danabrams editor of @mediaite explained on radio this morning that I am so widely covered because I draw high interest. True! _E_
My interview from yesterday on Fox and Friends GOP Crazy If They Don't Get Everything They Want __HTTP__ _E_
Thanks! __HTTP__ _E_
"Don't be afraid of mistakes. They can be learning tools on the way to building something great for yourself." Think Like a Champion _E_
Thank you ARIZONA! This is a MOVEMENT like nobody has ever seen before. Together we are going to MAKE AMERICA SAFE... __HTTP__ _E_
Lyin' Ted I have already beaten you in all debates and am way ahead of you in votes and delegates. You should focus on jobs & illegal imm! _E_
Government needs to stop pick pocketing your wallet. Every time it does it slows growth and kills jobs. #TimeToGetTough _E_
Thank you @GeraldoRivera @FoxandFriends. Agree! __HTTP__ _E_
Sleepy eyed @chucktodd thinks Las Vegas is a state see @todayshow this morning. _E_
I opposed going into Iraq. Hillary voted for it. As with everything else she's supported it was a DISASTER. __HTTP__ _E_
The Keystone pipeline will create 20000 jobs and make us less energy dependent from the Middle East. @BarackObama says No! _E_
Golf Odyssey just named Trump Scotland "Golf course of the year." __HTTP__ _E_
Marco Rubio was a complete disaster today in an interview with Chris Wallace @FoxNews concerning our invading Iraq.He was as clueless as Jeb _E_
CLINTON'S CLOSE TIES TO PUTIN DESERVE SCRUTINY: __HTTP__ #VPDebate _E_
Congratulations to @STEPHENATHOME I will see you on the show! _E_
The very foul mouthed Sen. John McCain begged for my support during his primary (I gave he won) then dropped me over locker room remarks! _E_
Shark Tank is a dead Friday night filler compared to the Apprentice which has been number one show for week in the T. V. ratings! _E_
Today as we Remember Pearl Harbor it was an incredible honor to be joined with surviving Veterans of the attack on 12/7/1941. They are HEROES and they are living witnesses to American History. All American hearts are filled with gratitude for their service and their sacrifice. __HTTP__ _E_
Apprentice will be amazing tomorrow night! _E_
Via @HuffPostPol: "Donald Trump: 'Republicans May Be The Worst Negotiators In History'" __HTTP__ _E_
Hillary Clinton is weak and ineffective no strength no stamina. _E_
Only those who will risk going too far can possibly find out how far one can go. T. S. Eliot _E_
The trade deficit rose to a 7yr high thanks to horrible trade policies Clinton supports. I will fix it fast JOBS! __HTTP__ _E_
I will be interviewed on @foxandfriends with the legendary Coach Bobby Knight tomorrow morning. Enjoy! #INDPrimary __HTTP__ _E_
The road to success is always under construction. Arnold Palmer _E_
President Xi thank you for such an incredible welcome ceremony. It was a truly memorable and impressive display! 📸 __HTTP__ __HTTP__ _E_
Great new ad from @CmteForIsrael: 'Next Year...President @MittRomney in Jerusalem the Capital of Israel' __HTTP__ _E_
Drugs are pouring into this country. If we have no border we have no country. That's why ICE endorsed me. #Debate #BigLeagueTruth _E_
CBO now estimates that over 2.5M will lose jobs directly because of ObamaCare. REPEAL now before it is too late. _E_
Thank you @thefix for your very honest commentary. One thing we do have great teams in IA NH SC and beyond. __HTTP__ _E_
.@KarlRove wasted $400 million + and didn't win one race—a total loser. @FoxNews _E_
.@richardroeper Perhaps one of the worst replacements in showbiz once you went on it was over! Your taste sucks! _E_
Do you believe that @UnionLeader in NH was demanding ads? Look at enclosed letter from them just received: __HTTP__ _E_
The lights went out in New Orleans...the Country's lights went out also. We are not the same place! _E_
Thank you for today's endorsement New York Veteran Police Association! #NewYorkValues __HTTP__ __HTTP__ _E_
New Gravis Poll in NH just out: Trump 32% Carson 13% __HTTP__ _E_
The silent majority is silent no more! Remember the importance of VOTING!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Hear Donald Trump discuss big gov spending banks & taxes on Your World w/Neil Cavuto: __HTTP__ _E_
Weekly Address from @WhiteHouse: __HTTP__ __HTTP__ _E_
Despite firing @StephenBaldwin7 in last night's All Star Celebrity @ApprenticeNBC Stephen had strong overall performance this season _E_
We will defend our country protect our communities and put the safety of the AMERICAN PEOPLE FIRST! Replay: __HTTP__ __HTTP__ _E_
We are going to make our country so strong again so great again. No more ripping off the United States. We will MAKE AMERICA GREAT AGAIN! _E_
Reporters say it's the Trump Bump I tell CNBC I am buying stocks and the market goes up. _E_
Congratulations to @piersmorgan on his new position as Editor at Large for the United States of @MailOnline! My Apprentice champ! _E_
Turn on @oreillyfactor now and enjoy true brilliance! _E_
After a great evening and packed auditorium in Iowa I am now in Colorado looking forward to what I am sure will be a very unfair debate! _E_
... where he raised 2 million dollars for the wonderful kids. Eric has a great heart! _E_
Putin just sent a Russian nuclear sub to the Gulf of Mexico. @BarackObama can't be bothered he is too concerned with @MittRomney's taxes. _E_
The Trump Organization Finalizes Purchase of Legendary Turnberry Resort in Scotland. __HTTP__ _E_
"A vampire with a day pass?" We are in @THEGaryBusey land. #CelebApprentice _E_
Watched Saturday Night Live hit job on me.Time to retire the boring and unfunny show. Alec Baldwin portrayal stinks. Media rigging election! _E_
The @Yankees should break A Rod's contract immediately—he misrepresented. _E_
#Trump2016 #IACaucus Finder: __HTTP__ __HTTP__ _E_
Very sad what happened last night at the Miss Universe Pageant. I sold it 6 months ago for a record price. This would never have happened! _E_
I'll be on @gretawire tonight at 10 PM Fox News _E_
Now that the election is over watch Chrysler ship @Jeep production to China my prediction. _E_
Two policemen just shot in San Diego one dead. It is only getting worse. People want LAW AND ORDER! _E_
The #1 trend on Twitter right now is #TrumpWon thank you! _E_
You must admit that Bryant Gumbel is one of the dumbest racists around an arrogant dope with no talent. Failed at CBS etc why still on TV? _E_
"In order to build your wealth and improve your business smarts you need to know about real estate." Think Like a Billionaire _E_
I am going to Iowa today sold out crowds. People don't want our country ripped off anymore. Must stop now! _E_
As we told the @nydailynews I was asked to speak at the RNC but said no because I will be doing something much bigger just watch! _E_
RT @FLOTUS: Preparations are underway to celebrate the holidays at the @WhiteHouse! __HTTP__ _E_
Amazing my tweets are covered across every spectrum from @espn to @politico to @WSJ. _E_
"Offshore wind is a dead duck in Scotland and it's time Alex Salmond manned up stopped blaming Westminster (cont) __HTTP__ _E_
"You just can't beat the person who never gives up." – Babe Ruth _E_
.@McIlroyRory Thanks for your nice note they love you at Trump National Doral. You are looking good will have a GREAT year! _E_
You are always there for us – THE MEN AND WOMEN IN BLUE.Thank you to our police thank you to our sheriffs and thank you to our law enforcement families. God Bless you all and GOD BLESS AMERICA! #LESM __HTTP__ _E_
The best social program by far is a JOB! Our jobs are being taken away from us by China and many other countries incompetent leader. _E_
On behalf of @FLOTUS Melania and myself thank you for a wonderful dinner and evening President Sergio Mattarella.... __HTTP__ _E_
A person who never made a mistake never tried anything new. Albert Einstein _E_
Carl Cameron @FoxNews is the only reporter I know who consistently fumbles & misrepresents poll results. He has been so wrong & he hates it! _E_
RT @TeamTrump: Law enforcement officers bring communities together & keep us safe. @mike_pence & @realDonaldTrump RESPECT & stand by them!... _E_
Worired that the USC will strike down ObamaCare @BarackObama is trying to implement his debacle in public schools __HTTP__ _E_
Trump Int'l Hotel & Tower Toronto. #1 in all of Canada. __HTTP__ _E_
.@BarackObama is promoting ugly inefficient unreliable bird killing noisy neighborhood destroying wind turbines. Big mistake. _E_
Is Anthony Weiner also delusional? Add him to NY Sex Offender list instead! _E_
January 20th 2017 will be remembered as the day the people became the rulers of this nation again. _E_
I said that Crooked Hillary Clinton is not qualified to be president because she has very bad judgement Bernie said the same thing! _E_
Iranian Pastor #Nadarkhani has just been sentenced to death by the Mullahs because he is a Christian (cont) __HTTP__ _E_
THE CHOICE IS CLEAR!#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_
After tearing W Bush down for 12 years now the media loves him. Why not? He gave them Obama. _E_
In my office with Banana Joe who just won the @WKCDOGS at @MSGnyc. __HTTP__ _E_
Congratulations to @seanhannity on his great ratings and ratings increase as reported by the @AP today. Amazing job! _E_
Alert...The president knew that the ambassador was being attacked in Benghazi. He did nothing...he is no leader. _E_
Jamiel Shaw was incredible on @foxandfriends this morning. His son who was viciously killed by an illegal immigrant is so proud of pop! _E_
To EVERYONE including all haters and losers HAPPY NEW YEAR. Work hard be smart and always remember WINNING TAKES CARE OF EVERYTHING! _E_
My @gretawire interview where I discuss the #ObamaCare USC argument gas prices & @IvankaTrump's new clothing line __HTTP__ _E_
Via @TMZ_Sports: "Donald Trump: Don't Mess Up @terrellowens' Name. 'I've Seen Him Go Crazy At People'" __HTTP__ _E_
Pres. Obama's steady support of @Israel throughout this crisis helped stop the war. He did a good job. _E_
Miami Dade Mayor drops sanctuary policy. Right decision. Strong! __HTTP__ _E_
Macy's was very disloyal to me bc of my strong stance on illegal immigration. Their stock has crashed! #BoycottMacys __HTTP__ _E_
Amy Pascal of Sony was totally used by Rev. Al Sharpton. She should be fired for stupidity. _E_
Melania our great and very hard working First Lady who truly loves what she is doing always thought that "if you run you will win." She would tell everyone that "no doubt he will win." I also felt I would win (or I would not have run) and Country is doing great! _E_
I hope we never find life on another planet because if we do there's no doubt that the United States will start sending them money! _E_
It is a MOVEMENT not a campaign. Leaving the past behind changing our future. Together we will MAKE AMERICA SAF... __HTTP__ _E_
After @TrumpScotland I will visit @TrumpDoonbeg in Ireland the magnificent resort fronting on the Atlantic Ocean. _E_
"Tomorrow is the first blank page of a 365 page book. Write a good one." — @BradPaisley _E_
#MakeAmericaSafeAgain!#GOPConvention #RNCinCLE __HTTP__ __HTTP__ _E_
National GOP Presidential Poll via @OANN @realDonaldTrump 35.6% #Trump2016 __HTTP__ _E_
Intelligence agencies should never have allowed this fake news to leak into the public. One last shot at me.Are we living in Nazi Germany? _E_
I was the first & only potential GOP candidate to state there will be no cuts to Social Security Medicare & Medicaid. Huckabee copied me. _E_
Rumor has it Apple is going to release iPhones with bigger screens. That's good news. _E_
Washington needs common sense conservative solutions. Let's make America great again! __HTTP__ _E_
My new book Time To Get Tough will be out Dec 5th. Solutions you won't hear from the politicians. The bes... (cont) __HTTP__ _E_
Thank you to @IvankaTrump for her wonderful acknowledgement this morning on @foxandfriends... _E_
There are many Jonathan Gruber types selling the global warming stuff and they really do believe the American public is stupid. _E_
__HTTP__ _E_
Great day in Virginia. Crowd was fantastic! _E_
Excited for tomorrow's Politics & Eggs @saintanselm co hosted by @NECouncil & @nhiop. Live stream here __HTTP__ _E_
Looking forward to meeting the great folks of Sarasota GOP party when I am honored as 'Statesman of the Year.' Should be a wonderful time. _E_
This assignment has stretched not just the imaginations but the patience quotas of @lisarinna and @pennjillette. #CelebApprentice _E_
I started to get very worried about Mitt's chances when I heard that A Rod donated to his campaign. Everything A Rod touches turns bad. _E_
Sorry I won't be able to do @foxandfriends at 7 AM on Monday—will be in India. _E_
Via @thehill by @HenschOnTheHill: "Trump says US roads are 'falling apart'" __HTTP__ _E_
I will be on @foxandfriends at 7:00 there is much to talk about (sadly)! Enjoy! _E_
Thank you @megynkelly for the nice things you said about Melania. You will like her great heart and smart always wanting to help people! _E_
Surprise In a post election delayed release food stamp rolls surged to biggest monthly increase and an all time high __HTTP__ _E_
Named best golf course in the world by @RobbReport Trump Int'l Golf Links Scotland is a 7400 yd par 72 __HTTP__ _E_
I really enjoyed last night's Tele Town Hall with @ralphreed's Faith and Freedom Coalition. Thanks to the thousands who joined. _E_
Congratulations to Bernie Marcus & Herman Cain @JobCreatorsUSA on the #TruthTour2012 All employers need to check this out! _E_
Why are we sending thousands of ill trained soldiers into Ebola infested areas of Africa! Bring the plague back to U.S.? Obama is so stupid. _E_
I'll be on @Foxandfriends Monday at 7:30 AM. _E_
We're not talking about religion we're talking about security. #GOPDebate __HTTP__ _E_
Looks like Obama will not stop the very potentially dangerous flights to and from West Africa. What the hell is wrong with this guy? _E_
THANK YOU to everyone in Little Rock Arkansas tonight! A record crowd of 12K. #Trump2016 __HTTP__ __HTTP__ _E_
On the luxurious Palos Verdes Peninsula @TrumpGolfLA features @GolfWorldUS' top public course & elite restaurants __HTTP__ _E_
Via @kmovnewsfeed: Photos: Tour Donald Trump's NC golf club __HTTP__ _E_
32º in New York it's freezing! Where the hell is global warming when you need it? _E_
I am the only Republican who will get large numbers of Dems and Indies (crossover). I will also get states that no other Republican can get. _E_
.@IvankaTrump is right—Plan B has descended into a state of total chaos. #CelebApprentice _E_
"George has a real twinkle about him" says @TheRealMarilu. Really? The shark should be scared. #CelebApprentice _E_
Just landed in New Hampshire a very exciting morning planned! _E_
#AmericaFirst #ImWithYou __HTTP__ _E_
Who do you like hate so far? _E_
released by Intelligence even knowing there is no proof and never will be. My people will have a full report on hacking within 90 days! _E_
Thank you to Time Magazine and Financial Times for naming me Person of the Year a great honor! _E_
Romney's failed advisors like campaign mgr Stuart Stevens are all over TV telling people how to win. But they lost don't know how to win! _E_
If we let Crooked run the govt history will remember 2017 as the year America lost its independence. #DrainTheSwamp __HTTP__ _E_
Via @DMRegister by @SharynJackson: "Trump: @SteveKingIA has 'the right views' __HTTP__ _E_
The 'brunt' of ObamaCare will be shouldered by folks making under $120K __HTTP__ _E_
I would like to thank @GolfMagazine for the really nice review of Trump National Doral Best Renovation of the Year (and maybe all time). _E_
My motto is: 'Never give up.' I follow this very strictly. I do not let problems and challenges stop me they are normal. _E_
Wow @Politico is in total disarray with almost everybody quitting. Goodnews bad dishonest journalists! __HTTP__ _E_
A great American Kurt Cochran was killed in the London terror attack. My prayers and condolences are with his family and friends. _E_
New York Fashion Week is really bad and used to be so glamorous and exciting! No stars no fun just boring. They need serious help. #NYFW _E_
Glad to hear @BrentBozell @marklevinshow @EWErickson & @TPPatriots are standing up to @KarlRove's attack on the Tea Party. _E_
Thank you America! #Trump2016Via @DRUDGE_REPORT __HTTP__ _E_
Our VISA system is broken like so much else in our country. We better get it fixed really fast. MAKE AMERICA GREAT AGAIN! _E_
Our wonderful future V.P. Mike Pence was harassed last night at the theater by the cast of Hamilton cameras blazing.This should not happen! _E_
To be yourself in a world that is constantly trying to make you something else is the greatest accomplishment. Ralph Waldo Emerson _E_
'Trump administration seen as more truthful than news media' __HTTP__ _E_
Yes I won the right to have my name taken off Trump Plaza in A.C. because it was not operated up to a very high standard and NO involvement _E_
I am in Las Vegas at the best hotel (by far) Trump International. I will be working with my wonderful teams and volunteers to WIN Nevada! _E_
Via @Newsmax_Media by @wandacarruthers: "Trump: Baghdad Likely to Fall to ISIS" __HTTP__ _E_
VOTE #TrumpPence16 on 11/8/16! __HTTP__ _E_
The media must immediately stop calling ISIS leaders MASTERMINDS. Call them instead thugs and losers. Young people must not go into ISIS! _E_
...want everything to be done for them when it should be a community effort. 10000 Federal workers now on Island doing a fantastic job. _E_
Obama has no understanding of how to create jobs or opportunity. He believes in Government. _E_
It was great seeing @MissUniverse and @MissTeenUSA yesterday __HTTP__ _E_
.@BillMaher's show is great for helping me get to sleep better than Sominex. _E_
In light of the horrible attack in Nice France I have postponed tomorrow's news conference concerning my Vice Presidential announcement. _E_
"It's sad—truly sad and disgraceful—the way Obama has allowed America to be abused and kicked around (cont) __HTTP__ _E_
The Tea Party is filled with great Americans. Despite being mistreated by everyone including @GOP they will continue to fight on _E_
A very big thank you to Bill Donohue head of The Catholic League for the wonderful interview on @CNN and article in Newsmax! Great insight _E_
I told you! Premiums are soaring! #RepealObamacare #Trump2016 __HTTP__ _E_
RT @glamourizes: @realDonaldTrump Only true Americans can see that president Trump is making America great. He's the only person who can! H... _E_
When somebody challenges you fight back be tough! _E_
Crooked Hillary Clinton knew that her husband wanted to meet with the U.S.A.G. to work out a deal. The system is totally rigged & corrupt! _E_
Will be in Novi Michigan this Friday at 5:00pm. Join the MOVEMENT! Tickets available at: __HTTP__ __HTTP__ _E_
I hope corrupt Hillary Clinton chooses goofy Elizabeth Warren as her running mate. I will defeat them both. _E_
I will be in Huntsville Alabama on Saturday night to support Luther Strange for Senate. Big Luther is a great guy who gets things done! _E_
I hope Washington makes a good deal to avert the fiscal cliff. Both sides need to work together. _E_
People ask about @AmandaTMiller. She is actually a VP of Marketing at the Trump Organization. #CelebApprentice _E_
Will be leaving Palm Beach for the 11 A.M. ceremony opening the magnificent GARY PLAYER VILLA at Trump Nationak Doral Miami. GARY IS GREAT! _E_
Really enjoyed my interview with @marklevinshow. He is terrific! _E_
The phony lawsuit against Trump U could have been easily settled by me but I want to go to court. 98% approval rating by students. Easy win _E_
Great making keynote speech at 2014 Lincoln Day Dinner hosted by Dan Isaacs & NY Republican County Committee. Wonderful people! _E_
.@katyperry is no bargain but I don't like John Mayer he dates and tells be careful Katy (just watch!). _E_
Crooked Hillary Clinton is soft on crime supports open borders and wants massive tax hikes. A formula for disaster! _E_
Join me in Reno Nevada tomorrow at 3:30pm! #AmericaFirst #MAGATickets: __HTTP__ _E_
The media and establishment want me out of the race so badly I WILL NEVER DROP OUT OF THE RACE WILL NEVER LET MY SUPPORTERS DOWN! #MAGA _E_
FLORIDA: Do not miss this opportunity to #MakeAmericaGreatAgain! Thank you @IvankaTrump: __HTTP__ __HTTP__ _E_
Oregon is voting today. Keep the big numbers going VOTE TRUMP! MAKE AMERICA GREAT AGAIN! _E_
An iconic building and top tourist attraction @TrumpTowerNY sets New York City's luxury standard __HTTP__ & great food! _E_
Laura Massive crowd had to move to Phoenix Convention Center. __HTTP__ _E_
Wow sleepy eyes @chucktodd is at it again. He is do totally biased. The things I am saying are correct. far better vision than the others _E_
Now that Iran ripped us off by making one of the best deals of any kind in history they have just moved to block any imports from the U.S. _E_
In today's #trumpvlog I answer your questions about what you should be doing in this uncertain economy... __HTTP__ _E_
Thank you to the BRAVE servicemen & women who have served and continue to serve the United States our true HEROES... __HTTP__ _E_
Follow Trump @DoralResort's WGC @CadillacChamp leadership board here at @nbc's @GolfChannel __HTTP__ _E_
Leadership: Whatever happens you're responsible. If it doesn't happen you're responsible. _E_
Amazing! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
According to @pewresearch illegal immigrants favor Dems 8:1 __HTTP__ @GOP pushing amnesty. Do they have death wish _E_
Putin's letter is a masterpiece for Russia and a disaster for the U.S. He is lecturing to our President.Never has our Country looked to weak _E_
...now it's the "greatest pageant on earth" broadcast in 190 countries to 1 billion people—"hot!" _E_
either elect more Republican Senators in 2018 or change the rules now to 51%. Our country needs a good shutdown in September to fix mess! _E_
Article from The Street The Donald's Trump Card: Himself __HTTP__ _E_
Republicans have the cards because of the debt ceiling—but it doesn't seem that way! _E_
Even the Left realizes that @BarackObama's policies have led to more jobs being outsourced out of this country. __HTTP__ _E_
Hillary says things can't change. I say they have to change. It's a choice between Americanism and her corrupt globalism. #Imwithyou _E_
Dennis Rodman was either drunk or on drugs (delusional) when he said I wanted to go to North Korea with him. Glad I fired him on Apprentice! _E_
Watch Kasich squirm if he is not truthful in his negative ads I will sue him just for fun! _E_
The best investors are visionaries—they look beyond the present. _E_
Young entrepreneurs – remember quality and results are the key metrics to success. _E_
Next she says she's being set up by Omarosa to fail....is somebody confused? _E_
Thank you Congressman Steven Palazzo! __HTTP__ __HTTP__ _E_
It was wonderful to have President Petro Poroshenko of Ukraine with us in New York City today. #UNGA __HTTP__ __HTTP__ _E_
Senator @LindseyGrahamSC made horrible statements about @SenTedCruz – and then he endorsed him. No wonder nobody trusts politicians! _E_
My @SquawkCNBC #TrumpTuesday interview discussing the 2012 election OPEC ripping us off & @MittRomney's job policy __HTTP__ _E_
Toyota & Mazda to build a new $1.6B plant here in the U.S.A. and create 4K new American jobs. A great investment in American manufacturing! _E_
No wonder @NYMag is doing so poorly with an idiot Sr. Editor like @DanAmira it will only get worse! _E_
None of Romney's leaked comments change the fact that Obama is a complete disaster. 20% real unemployment and $6T in deficit spending. _E_
Hillary is too weak to lead on border security no solutions no ideas no credibility.She supported NAFTA worst deal in US history. #Debate _E_
Wow! Honored to be chosen by the highly respected + accurate Washington & Lee Mock Convention. I hope you are right I will make you proud! _E_
Pres @BarackObama expects @MittRomney to play nice like @SenJohnMcCain it's not going to happen & the result is going to be much different. _E_
WEEKLY ADDRESS __HTTP__ _E_
No money wasted like bad ads—the Republicans spent more & got nothing for it. _E_
Watch this tour by @TrumpIntRealty's @M_Griffith1 of this luxurious penthouse in Trump Park Avenue __HTTP__ _E_
Another attack this time in Germany. Many killed. God bless the people of Munich. _E_
Crooked Hillary wants to take your 2nd Amendment rights away. Will guns be taken from her heavily armed Secret Service detail? Maybe not! _E_
RT @newtgingrich: Seems out of touch w/ reality to announce a VP nominee before securing 1237 delegates. __HTTP__ __HTTP__ _E_
#VoteTrumpNH #NHPrimary #FITN __HTTP__ _E_
I will be interviewed on @greta tonight at 7pm. Enjoy! __HTTP__ _E_
.@MittRomney shouldn't give additional tax returns until @BarackObama gives his passport records college records & applications... _E_
Every American needs to say 2 simple words to every Vet they meet: THANK YOU! John Wayne Walding __HTTP__ _E_
Job openings are at a 4 year high but businesses aren't hiring __HTTP__ Why? ObamaCare US debt & @BarackObama's tax plan. _E_
Seems like the teams are surprised when @THEGaryBusey comes back. #CelebApprentice _E_
"There can be no liberty unless there is economic liberty." – The Iron Lady Margaret Thatcher _E_
My appearance on The View... __HTTP__ and __HTTP__ _E_
It is almost time. I will be making a major announcement from @TrumpTowerNY at 11AM. Follow on social media! #MakeAmericaGreatAgain _E_
The federal gov. has handled Sandy worse than Katrina. There is no excuse why people don't have electricity or fuel yet. _E_
Stock Market hits new Record High. Confidence and enthusiasm abound. More great numbers coming out! _E_
Sometimes people spend too much time focusing on problems instead of focusing on opportunities Think Like a Champion _E_
Really big crowd expected tomorrow morning at # CPAC2013. I look forward to it! _E_
#MakeAmericaGreatAgain #Trump2016 Story: __HTTP__ __HTTP__ _E_
If only speeches could create jobs then @BarackObama wouldn't have such a dismal economic record. _E_
.@BradSteinle Great talking to you and your parents—fantastic people. Keep your sister's very important memory alive—big impact! _E_
"A savvy investor is a sponge for information. You have to read the newspapers... _E_
Priorities: @BarackObama wants to slash a Trillion dollars from military spending while raising the salaries of (cont) __HTTP__ _E_
I give the President's speech a 7 on the scale of 0 to 10! Not bad but room for improvement! _E_
Why was the Hanukah celebration held in the White House two weeks early? @BarackObama wants to vacation in Hawaii in late December. Sad. _E_
Iran must immediately allow Christian #PastorSaeed out of prison or we should put back sanctions (which should never have been lifted) _E_
I was never a fan of Colin Powell after his weak understanding of weapons of mass destruction in Iraq = disaster. We can do much better! _E_
I am in New Hampshire. Just received great news from Reuters poll. Thank you for your support! __HTTP__ _E_
The @FBIPressOffice police & others are doing an amazing job. How genius was it putting together that tape? _E_
Does anybody really believe that Bill Clinton and the U.S.A.G. talked only about grandkids and golf for 37 minutes in plane on tarmac? _E_
Miss Universe 2012 Pageant will be airing live on @nbc & @Telemundo december 19th. Open invite stands for Robert Pattinson. _E_
I will replace it with private plans health savings accounts & allow purchasing across state lines. Maximum choice & freedom for consumer. _E_
Word is that @NBCNews is firing sleepy eyes Chuck Todd in that his ratings on Meet the Press are setting record lows. He's a real loser! _E_
Now that China's own economy is slowing __HTTP__ watch how they start doing even bigger numbers in (cont) __HTTP__ _E_
Insurgents in Iraq show they can still mount horrifying attacks US wastes trillions. _E_
My meetings with President Xi Jinping were very productive on both trade and the subject of North Korea. He is a highly respected and powerful representative of his people. It was great being with him and Madame Peng Liyuan! _E_
Florida has been very good to me. I am really esxcited to give back at the Sarasota GOP event and @RNC convention. Will be fun! _E_
All weights are on crane's wrong side very precarious below move out! _E_
We should leave Afghanistan immediately. No more wasted lives. If we have to go back in we go in hard & quick. Rebuild the US first. _E_
Government can be efficient with the right leadership. Let's Make America Great Again __HTTP__ _E_
Via @GolfweekMag by @BKleinGolfweek: "Donald Trump reopens Doral's Blue Monster" __HTTP__ _E_
Entrepreneurs: Don't tread water. Get out there and go for it. _E_
Hillary said I really deplore the tone and inflammatory rhetoric of his campaign. I deplore the death and destruction she caused stupidity _E_
RT @foxandfriends: HAPPENING TODAY: House to vote on immigration bills including 'Kate's Law' and 'No Sanctuary for Criminals Act' __HTTP__ _E_
If you want to know about Hillary Clinton's honesty & judgment ask the family of Ambassador Stevens. _E_
Thank you Rhode Island! #Trump2016 __HTTP__ _E_
#ICYMI On Saturday I signed two EO's to help keep jobs & wealth in our country.EO1: __HTTP__ EO2:... __HTTP__ _E_
Kentucky has a chance to have the Senate Majority Leader Mitch McConnell representing it in Washington. Big power for State. Don't blow it _E_
massive increases of ObamaCare will take place this year and Dems are to blame for the mess. It will fall of its own weight be careful! _E_
What truly matters is not which party controls our government but whether our government is controlled by the people. _E_
Will be on @foxandfriends at 8:00. Enjoy! _E_
Will be in Orlando Florida this afternoon. 25000 people expected. This is a movement like our GREAT COUNTRY has never seen before! _E_
My @Shalom_TV interview discussing my video endorsement of @IsraeliPM @netanyahu and past visits to @Israel __HTTP__ _E_
...money to Bill the Hillary Russian reset praise of Russia by Hillary or Podesta Russian Company. Trump Russia story is a hoax. #MAGA! _E_
Watched low rated @Morning_Joe for first time in long time. FAKE NEWS. He called me to stop a National Enquirer article. I said no! Bad show _E_
On my way to Iowa just received new national poll numbers. Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Very productive bilateral meeting with Prime Minister Benjamin @Netanyahu of Israel in Davos Switzerland! #WEF18 __HTTP__ _E_
My thoughts on the Emmys in today's #trumpvlog.... __HTTP__ _E_
Why are we still giving billions of dollars we don't have in foreign aid to the Muslim Brotherhood in Egypt? _E_
Politicians are trying to chip away at the 2nd Amendment. I won't let them take away our guns! #Trump2016Watch: __HTTP__ _E_
"Do not give in to anger. It destroys your focus on goals and ruins your concentration." – Think Big _E_
The upcoming All Star @CelebApprentice puts the celebrities under the hardest tasks we have ever given. We really pushed the envelope _E_
I hope Mark Zuckerberg signs a prenup with his current girlfriend perhaps soon to be wife. Otherwise she can walk away with 9 billion. _E_
My @bostonherald interview on Tom Brady Hillary Clinton the Granite State & Making America Great Again! __HTTP__ _E_
Via @washingtonpost by @costareports: "Trump says he is serious about 2016 bid is hiring staff and delaying TV gig" __HTTP__ _E_
Only 88000 jobs were added this past March. Prediction was 190000. Businesses can't expand with Obama Care & high taxes on horizon. _E_
Just left Columbus rally of 14000 people a far bigger crowd than even I expected! Unbelievable evening incredible spirit in the arena! _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Thank you. __HTTP__ _E_
Consumer Confidence is at an All Time High along with a Record High Stock Market. Unemployment is at a 17 year low. MAKE AMERICA GREAT AGAIN! Working to pass MASSIVE TAX CUTS (looking good). _E_
Joan Rivers on The Apprentice tonight at 8:00. I will be live tweeting. JOAN WAS GREAT! _E_
Are you allowed to impeach a president for gross incompetence? _E_
Sleep eyes @ChuckTodd is killing Meet The Press. Isn't he pathetic? Love watching him fail! _E_
Via @thehill by @HugginsRachel: "Trump looking 'very seriously' at 2016 run" __HTTP__ _E_
"Representing your own brand yourself is the best way to go. If you can't sell it who will?" – Midas Touch _E_
For eight years Russia ran over President Obama got stronger and stronger picked off Crimea and added missiles. Weak! @foxandfriends _E_
The GOP needs to learn how to get tough and outnegotiate @BarackObama and his big spending allies in (cont) __HTTP__ _E_
Good luck to Bob Kraft Tom Brady and Coach Bill Belichick tonight. _E_
Internationally recognized as an iconic landmark @TrumpTowerNY beams over Fifth Avenue __HTTP__ _E_
A great day at the White House! _E_
It is fatal to enter any war without the will to win it. Douglas MacArthur _E_
Via @Mediaite by @evanmcmurry: "Trump Calls @AGSchneiderman a Cokehead" __HTTP__ Schneiderman is by his own admission! _E_
North Korea has conducted a major Nuclear Test. Their words and actions continue to be very hostile and dangerous to the United States..... _E_
Massive crowd in VT tonight. Venue not big enough. Officials say NO to outside event and sound system. Arrive early! _E_
Wow @CNN is so negative. Their panel is a joke biased and very dumb. I'm turning to @FoxNews where we get a fair shake! Mike will do great _E_
Via @MailOnline Trump still in the lead by a whopping 14 points after fluke survey had put Carson on top __HTTP__ _E_
RT @realDonaldTrump: On #PurpleHeartDay💜I thank all the brave men and women who have sacrificed in battle for this GREAT NATION! #USA __HTTP__ _E_
RT @realDonaldTrump: HAPPY 241st BIRTHDAY to the @USArmy! THANK YOU! __HTTP__ _E_
Good luck! Enjoy. __HTTP__ _E_
With all of the bad economic numbers and horrendous foreign policy Obama should be down by 12 points and he's not. _E_
WEEKLY ADDRESS __HTTP__ _E_
A great ad from @MittRomney showing A Few of the 23 Million unemployed who need economic change __HTTP__ Take it to him Mitt! _E_
His @BarackObama's budget: interest payments to China will exceed US defense spending by 2019 __HTTP__ @BarackObama's America! _E_
Angela Merkel is doing a fantastic job as the Chancellor of Germany. Youth unemployment is at a record low & she has a budget surplus. _E_
Now with the Danger Weiner campaign dead time to focus on crazy Eliot Spitzer. A man who has never earned 10 cents in his life. _E_
When ISIS caught the soldiers do you think they read them their legal rights prior to executing them? _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Watch Obama's favorability numbers drop even further if he doesn't accept my charitable offer. No one approves (cont) __HTTP__ _E_
Hurricane Irene and Libya in today's #trumpvlog.... __HTTP__ _E_
Wow more than 90% of Fake News Media coverage of me is negative with numerous forced retractions of untrue stories. Hence my use of Social Media the only way to get the truth out. Much of Mainstream Meadia has become a joke! @foxandfriends _E_
If you stop by Trump Tower (Fifth Avenue between 56th and 57th Streets) you can buy a pre signed copy of #TimeToGetTough. _E_
Thank you to the amazing law enforcement officers today in Daytona Beach Florida! #LESM #MAGA __HTTP__ _E_
By rejecting my ad on ugly windmills & @AlexSalmond's faulty thinking on the "Lockerbie bomber" the ad is now on worldwide newscasts. _E_
RT @GOPChairwoman: .@realDonaldTrump is the Paycheck President. Learn how the tax bill will put more money in your pocket & how to contact... _E_
Will be going to Pennsylvania today in order to give my total support to RICK SACCONE running for Congress in a Special Election (March 13). Rick is a great guy. We need more Republicans to continue our already successful agenda! _E_
Back by popular demand @TraceAdkins delivers in the upcoming @CelebApprentice All Stars season. Yes he sings. _E_
John Kasich should focus his special interest money on building up his failed image not negative ads on me. _E_
This Sunday's All Star Celebrity @ApprenticeNBC features the return of @Joan_Rivers. Sunday at 9 PM on @NBC full 2 hours. _E_
Wisconsin we will MAKE AMERICA GREAT AGAIN! _E_
Join me live from Bedminster New Jersey: __HTTP__ _E_
Getting ready to leave for Melbourne Florida. See you all soon! _E_
Via @11AliveNews by @JenniferJJacobs: "Trump heads to Iowa as '16 speculation rises" __HTTP__ _E_
Anytime you see a story about me or my campaign saying sources said DO NOT believe it. There are no sources they are just made up lies! _E_
Just heard that the great Golf Week Magazine named my Trump International Golf Course Scotland The Best Modern Day Golf Course In The World! _E_
Why do the Republicans keep apologizing on the so called birther issue? No more apologies take the offensive! _E_
Omarosa always promises and delivers high drama... _E_
Debate polls look great thank you!#MAGA #AmericaFirst __HTTP__ _E_
Now China is trying to take over a U.S. airbase __HTTP__ This is only the beginning. They only understand toughness! _E_
A massive tax increase will be necessary to fund Crooked Hillary Clinton's agenda. What a terrible (and boring) rollout that was yesterday! _E_
RT @Fuctupmind: @realDonaldTrump Donald Trump's amazing golf swing #CrookedHillary __HTTP__ _E_
The @rydercup is currently going on and is one of the truly great sporting events. _E_
Why is crude oil priced at $86/Barrel? OPEC is ripping us off. Not worth $30/Barrel. America needs new leaders. _E_
I will be announcing my decision on the Paris Accord over the next few days. MAKE AMERICA GREAT AGAIN! _E_
Leadership: the art of getting someone else to do something you want done because he wants to do it. Dwight D. Eisenhower _E_
What took investigators so long to interview the pilots of Asiana San Fran crash? WHY NO DRUG TESTS FOR PILOTS they were really off . _E_
.@Apprenticenbc cast will be announced tomorrow at 7:30am ET on the @todayshow with @MLauer _E_
Gary Johnson is asking people to waste their vote on him. Make it count vote for @MittRomney. _E_
THANK YOU to everyone who joined me at the @WhiteHouse yesterday. Together we are MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_
.@TheRealMarilu was very impressive and is a great person. The All Star Celebrity @ApprenticeNBC viewers loved her. _E_
.@HillaryClinton has been part of the rigged DC system for 30 years? Why would we take policy advice from her? #Debates2016 _E_
.@pennjillette and @dennisrodman as PM's I'm proud of Dennis and his performance this season. #CelebApprentice _E_
Good luck @MittRomney tonight have no doubt you will be great. _E_
Whitey Bulger's prosecution starts today. Will be one of the most interesting and intriguing trials. _E_
"Donald Trump unveils vision for @TrumpTurnberry" __HTTP__ via @BunkeredOnline by @MMcEwanBunkered _E_
There should be no further releases from Gitmo. These are extremely dangerous people and should not be allowed back onto the battlefield. _E_
"You should always feel comfortable bargaining for goods and services. I do it all the time." – Think Like a Billionaire _E_
Reason I canceled my trip to London is that I am not a big fan of the Obama Administration having sold perhaps the best located and finest embassy in London for "peanuts" only to build a new one in an off location for 1.2 billion dollars. Bad deal. Wanted me to cut ribbon NO! _E_
Great Kevin McCarthy drops out of SPEAKER race. We need a really smart and really tough person to take over this very important job! _E_
After Crooked @HillaryClinton allowed ISIS to rise she now claims she'll defeat them? LAUGHABLE! Here's my plan: __HTTP__ _E_
Donald Trump will be appearing on The View tomorrow morning to discuss Celebrity Apprentice and his new book Think Like A Champion! _E_
Mexican leaders and negotiators are much tougher and smarter than those of the U.S. Mexico is killing us on jobs and trade. WAKE UP! _E_
Fort Hood shooting should be declared a terror attack. Respect the wounded and dead. _E_
Wisdom comes as a result of both experience and knowledge. It's something you can't teach someone else you have to achieve it on your own. _E_
My @SquawkCNBC interview discussing Jamie Dimon banking regulations and Mark Zuckerberg's prenuptial __HTTP__ _E_
Trump Tees Up Another 'Hole in One' in Scotland __HTTP__ _E_
In Vegas? Enjoy Thanksgiving in @TrumpLasVegas' DJT lounge where the @nfl games will be playing all day __HTTP__ _E_
Would be really bad if columnist Mike Lupica left the @NYDailyNews. A wonderful and talented guy! _E_
Tune in & join me live in Albany New York! 7pmE start time! I love you New York! #Trump2016 #TrumpTrain __HTTP__ _E_
#ICYMI: I joined #OnTheRecord with @kimguilfoyle on @FoxNews this evening. #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Why is someone like George Pataki who did a terrible job as Governor of N.Y. and registers ZERO in the polls allowed on the debate stage? _E_
The West Coast's most luxurious public course @TrumpGolfLA features spectacular panoramic Pacific Ocean views __HTTP__ _E_
Trump Int'l Hotel & Golf Links Ireland (formerly The Lodge at Doonbeg) is a 5 star resort fronting the Atlantic Ocean __HTTP__ _E_
.@BarackObama should be careful questioning @MittRomney on diplomacy how many times has Obama apologized for our country on foreign soil?! _E_
Since the Obama Administration was told way before the 2016 Election that the Russians were meddling why no action? Focus on them not T! _E_
Benghazi is just another Hillary Clinton failure. It justnever seems to work the way it's supposed to with Clinton. _E_
.@LisaLampanelli You are terrific (always). Great job on the Apprentice. _E_
The opening of #TrumpScotland an exciting day on perhaps the world's best golf course watch the video __HTTP__ _E_
NBC terminates The Chris Matthews Show __HTTP__ _E_
"Get some face time in The Spa at @TrumpLasVegas" __HTTP__ via @Vegascom by Renée Libutti _E_
Apprentice = big hit. Miss Universe = Big hit. I always get big ratings. If I hosted Meet the Press instead of Sleepy Eyesa smash! @NBCNews _E_
Record low temperatures and massive amounts of snow. Where the hell is GLOBAL WARMING? _E_
I have traveled the world. America is the most beautiful country on Earth. _E_
Our country is being run by total amateurs. Let's just call it "amateur hour." _E_
Media silent when @BarackObama called @MittRomney a murderer & felon. Mitt mentions 'birth certificate' and they go nuts. Double standard! _E_
My @extratv interview before Hurricane Sandy explaining that I would be staying in Trump Tower during the storm __HTTP__ _E_
RT @JacobAWohl: @realDonaldTrump President Trump alone has succeeded in bringing the Stock Market Small Business Index and Consumer Comfor... _E_
The real story here is why are there so many illegal leaks coming out of Washington? Will these leaks be happening as I deal on N.Korea etc? _E_
#USA #Japan __HTTP__ _E_
I will be signing copies of my new book Time To Get Tough: Making America #1 Again in Trump Tower on Frida... (cont) __HTTP__ _E_
Listen – my Citizens United Political Victory Fund robo call for @leezeldin __HTTP__ #zeldinforcongress _E_
Third rate @politico took every negative tweet or response they could find & put it out when in fact the response is incredibly positive. _E_
Thank you @WayneAllynRoot.Very nice! #Trump2016 __HTTP__ _E_
Happy #MedalOfHonorDay to our heroes! __HTTP__ __HTTP__ _E_
Really sad that Republicans would allow themselves to be used in a Clinton ad. Lindsey Graham Romney Flake Sass. SUPREME COURT REMEMBER! _E_
Read my tweets you dopes of course he should get a trial but fast (not a 12 year disaster). _E_
All Presidential candidates should immediately disavow their Super PAC's. They're not only breaking the spirit of the law but the law itself _E_
I am on @oreillyfactor tonight a big special. @FoxNews at 8:00 P.M. ENJOY! _E_
.@VanityFair's terrible piece on Mitt's faith is a new low even for them. _E_
I'll be on Piers Morgan Tonight this evening 9 pm on CNN. Be sure to tune in. @PiersTonight _E_
Today is the 53rd anniversary of the March on Washington today we honor the enduring fight for justice equality and opportunity. _E_
Can the relationship between the mayor of New York City and the police force ever be fixed? Tune in to @foxandfriends at 7:15. _E_
Many people advised me not to buy the Miss Universe pageant. They were all wrong. The deal worked out to be a great one! _E_
Photos from the @ApprenticeNBC press conference __HTTP__ Premieres January 4th on @NBC. _E_
Be sure to listen to my interview today w/@SteveMTalk on @Newsmax_Media __HTTP__ Congratulations to Steve on his new show! _E_
If Democrats were not such obstructionists and understood the power of lower taxes we would be able to get many of their ideas into Bill! _E_
The people of Scotland love Trump International Golf Links. _E_
The failure of the Super Committee shows Washington has truly incompetent leaders. #TimeToGetTough _E_
Practice positive thinking—this will keep you focused while weeding out anything that is unnecessary negative or detrimental... _E_
Without passion you don't have energy and without energy you don't have anything! _E_
"Worry destroys focus." – Think Big _E_
Dateline NBC featuring yours truly just set a season high in households in the ratings—no wonder NBC likes me so much! @nbc _E_
A wonderful place. __HTTP__ _E_
.@MarkBurnettTV and his incredible wife @RealRomaDowney did a fabulous movie @SonofGodMovie see it! _E_
People (pundits) gave me no chance in South Carolina. Now it looks like a possible win. I would be happy with a one vote victory! (HOPE) _E_
Wow just won Missouri! _E_
Come join us at the Verizon Wireless Center Manchester New Hampshire on 2/8! Register now: __HTTP__ __HTTP__ _E_
#CrookedHillary __HTTP__ _E_
It seems that Justice Scalia originally wrote the majority on ObamaCare and Roberts then switched his position. __HTTP__ _E_
Interesting how President Obama so haltingly said I would never be president This from perhaps the worst president in U.S. history! _E_
Do not settle for remaining in your comfort zone. Being complacent is a good way to get nowhere. Get your momentum going and keep it going. _E_
...The ads made her look great and now she probably will run. _E_
Your most popular tweet answered why I'm holding off on a Presidential bid... __HTTP__ #trumpvlog _E_
Big crowd expected tomorrow night in Iowa. It will be interesting and fun great people! _E_
Wow the ratings for @60Minutes last night were their biggest in a year very nice! _E_
.@lancearmstrong teammate is angry and jealous he is no Lance. _E_
Another one of me on stage. #WWEHOF __HTTP__ _E_
Watch my appearances on Good Day NY... __HTTP__ and @FoxandFriends... __HTTP__ _E_
I'll be signing copies of my new book Time To Get Tough today at Trump Tower 11 am to 2 pm. Hope to see you there. #TimeToGetTough _E_
It's not that I'm so smart it's just that I stay with problems longer. Albert Einstein _E_
Does anyone agree with Marilu that Gary while 'adorable' is a distraction? _E_
Based on the fraud committed by Senator Ted Cruz during the Iowa Caucus either a new election should take place or Cruz results nullified. _E_
RT @realDonaldTrump: Happy to announce we are awarding $1M to Las Vegas in order to help local law enforcement working OT to respond to l... _E_
Hamas has warned Pres. Obama not to visit the Temple Mount during his trip to Israel __HTTP__ _E_
Today The Blue Monster is torn up. The Trump National @DoralResort is being revolutionized with $200M of renovations. _E_
RT @Newsmax_Media: Donald Trump: Mean Spirited GOP Won't Win Elections @REALDonaldTrump __HTTP__ via @Newsmax_Media _E_
Fallout from Iowa: Trump Speech Drew Greatest Response __HTTP__ via @Newsmax_Media by Jim Meyers __HTTP__ _E_
Fiscal cliff negotiations have officially begun between the President and Congress Washington must come together and make a deal. _E_
Congratulations to the @thenyrangers on taking a 2 1 lead over the @washcaps. Great game last night! _E_
The prestigious 800 acre @TrumpDoral boast luxurious event spaces and 5 Star restaurants __HTTP__ _E_
Realize that persistence can go a long way. Being stubborn is often an attribute. _E_
China is threatening Washington over the currency bill. We should pass it immediately. _E_
I am giving away money. Check the crowdfunding site @fundanything __HTTP__ Raise money for anything! _E_
Edward Snowden is absolutely killing the the U.S. with other countries! _E_
Hillary will never reform Wall Street. She is owned by Wall Street! _E_
Our country is on the precipice. Washington is broken. Where is the leadership? _E_
Thank you @IngrahamAngle for your strength & wonderful words last night on @FoxNews but @KarlRove is easy to beat! _E_
I know it has been many years since our country made great deals but isn't it about time we start right now. MAKE AMERICA GREAT AGAIN! _E_
RT @foxandfriends: Jeb is a weak guy. @EricTrump __HTTP__ _E_
.@Morning_Joe @mikebarnicle on @realDonaldTrump: He finished 2nd but he made the turn successfully like a pro _E_
Sadly it took a hit & run auto accident to make us aware of who our Secretary of Commerce is and such an important position! _E_
My @gretawire interview discussing why @BarackObama is not a nice guy and who will win the 2012 election __HTTP__ _E_
"Do not pray for easy lives. Pray to be stronger men." – Pres. John F. Kennedy _E_
Undecideds in OHPA and WI will make the difference. All should ask themselves if they want $6/gallon gas because it will come under Obama. _E_
The meeting next week with China will be a very difficult one in that we can no longer have massive trade deficits... _E_
Shows how weak and desperate Lyin' Ted is when he has to team up with a guy who openly can't stand him and is only 1 win and 38 losses. _E_
Jobless claims have dropped to a 45 year low! _E_
He @MittRomney would do a great job on Saturday Night Live. @nbcsnl _E_
Our economy is better than it has been in many decades. Businesses are coming back to America like never before. Chrysler as an example is leaving Mexico and coming back to the USA. Unemployment is nearing record lows. We are on the right track! _E_
RT @GOP: .@POTUS: I want to work with Congress Republicans and Democrats on a plan that is pro growth pro jobs pro worker and pro Amer... _E_
My thoughts and prayers are with everyone involved in the train accident in DuPont Washington. Thank you to all of our wonderful First Responders who are on the scene. We are currently monitoring here at the White House. _E_
........may be their number one act and priority. Focus on tax reform healthcare and so many other things of far greater importance! #DTS _E_
Mark They could use you. __HTTP__ _E_
Here we go Enjoy! _E_
"@PGAChampionship @seniorpgachamp both headed to Trump courses" __HTTP__ via @FoxNews _E_
Want to take a quiz with me? Download the @millonseconds app and watch @RyanSeacrest on Monday at 8/7c on @NBC _E_
Doing the @todayshow with @MLauer was great I really like Matt. _E_
It wasn't the White House it wasn't the State Department it wasn't father LaVar's so called people on the ground in China that got his son out of a long term prison sentence IT WAS ME. Too bad! LaVar is just a poor man's version of Don King but without the hair. Just think.. _E_
Thank you for all of the positive response on my Chicago lawsuit victory yesterday. Most of you saw through the phony age card ploy. _E_
Over 150000 more of our fellow Americans dropped out of the workforce in July. @BarackObama is a disaster! _E_
RT @EricTrump: Nevada we are on our way! #VoteTrumpNV #Trump2016Caucus locator: __HTTP__ __HTTP__ _E_
.@politico which is not read or respected by many may be the most dishonest of the media outlets and that is saying something. _E_
Bernie Sanders who has lost most of his leverage has totally sold out to Crooked Hillary Clinton. He will endorse her today fans angry! _E_
Bombings all over Iraq today.That country is falling apart such a horrible waste of lives and 1.5 trillion dollars (and I told you so!). _E_
DON'T LET HILLARY CLINTON DO IT AGAIN!#TrumpPence16 __HTTP__ _E_
Excited to have @SarahPalinUSA's endorsement of the Newsmax @iontv debate. Sarah is terrific. _E_
Thanks. __HTTP__ _E_
This has to stop! @BarackObama loves accruing American debt he missed his budget deficit goal by over $500 billion. __HTTP__ _E_
I will be interviewed on @oreillyfactor tonight at 8:00 P.M. (Eastern). Enjoy! _E_
RT @TeamTrump: Obama Clinton FAILED foreign policy: Bad nuclear deal Ransom payment to leading state sponsor of terror Sharing classifie... _E_
We are way over the fiscal cliff. And with Obama Care being fully implemented in less than 14 months it may be too late. _E_
Just leaving Las Vegas. Unbelievable crowd! Many Hispanics who love me and I love them! __HTTP__ _E_
"Revenge is sweet and not fattening." Alfred Hitchcock _E_
ObamaCare will destroy small business the backbone of America's economy. _E_
.@jimmykimmel is terrific but for Obama to fly on Air Force One ($'s) to do the show in these bad times is ridiculous. _E_
Thank you. __HTTP__ _E_
Via @CNNPolitics: Trump will have 'memorable' role at GOP convention __HTTP__ It's true just wait and see... _E_
Thank you Cedar Rapids Iowa!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
"You cannot push anyone up the ladder unless he is willing to climb." Andrew Carnegie _E_
Can you believe that President Karzai of Afghanistan is holding out for more more more and refuses to sign deal. Tell him to go to hell! _E_
"Iowans Drawn to Donald Trump Praise His Antiestablishment Bent" __HTTP__ via @WSJ by @heatherhaddon & @reidepstein _E_
What America needs: @MittRomney follows in steps of Kemp and Reagan with pro growth tax cut. _E_
"Do not underestimate yourself and know you are able to handle what comes your way." – Think Like a Champion _E_
RT @realDonaldTrump: The travel ban into the United States should be far larger tougher and more specific but stupidly that would not be... _E_
Via @ChristianPost @NaghmehAbedini to Testify at New Congressional Hearing on Persecution of Pastor Saeed Abedini __HTTP__ _E_
Congrats to @GovernorCorbett he's right to be suing @NCAA over the ridiculous deal made by the trustees of Penn State __HTTP__ _E_
The Iranians have just threatened to send warships to our coasts. They laugh at us. We can't allow them to develop nuclear weapons. _E_
The U.S. rocket that blew up and crashed yesterday is emblematic of the United States under Obama. Nothing works be it a rocket or website. _E_
If you include people who have left the work force unemployment rate is 15%. Labor participation rate is lowest in 70 yrs. _E_
#IndianaJones and #Ghostbusters what's wrong??? __HTTP__ _E_
I along with almost everyone else have so little confidence in President Obama. He has a horrible attitude a man who is resigned to defeat _E_
Thank you Maryland what a great way to conclude the day! Will be back soon. #Trump2016 __HTTP__ __HTTP__ _E_
Santorum calls Trump debate skippers hypocrites __HTTP__ @RickSantorum _E_
Via @Citizens_United: "Donald Trump To Speak At The Iowa Freedom Summit in Des Moines on January 24th" __HTTP__ _E_
I have tremendous respect for women and the many roles they serve that are vital to the fabric of our society and our economy. _E_
RT @JacobAWohl: @realDonaldTrump When Obama was President the #MSM LOVED talking about stock market rallies! Now they barely mention new a... _E_
....Transgender individuals to serve in any capacity in the U.S. Military. Our military must be focused on decisive and overwhelming..... _E_
Thanks. __HTTP__ _E_
Watch this video of my wonderful golf club @TrumpNationalCN in beautiful Colts NeckNJ __HTTP__ _E_
After one of the great chokes in the history of sports it will be hard for the Spurs to beat the Heat but who knows. Good game on now! _E_
I like doing this once a month for the haters & losers (and as they know) I don't wear a wig . Some may not like my hairstyle but all mine _E_
The talk in Albany is that JCOPE & Moreland Commissions are taking my complaint against lightweight (cont) __HTTP__ _E_
The @SuperCommittee will fail. The Republicans never should have agreed to the debt deal. _E_
China's the leading exporter of Iraqi oil yet they won't lift a finger against ISIS. Why should we do the heavy lifting for China's gain? _E_
To each member of the graduating class from the National Academy at Quantico CONGRATULATIONS! __HTTP__ _E_
RT @PaulaReidCBS: .@CBSNews confirms FBI found emails on #AnthonyWeiner computer related to Hillary Clinton server that are new & not p... _E_
"Change can't be measured in speeches. It is measured in achievements." @MittRomney yesterday in Fairfax VA. _E_
Paying attention is a cost effective way of protecting yourself and your interests. _E_
This will be a very interesting day for HealthCare.The Dems are obstructionists but the Republicans can have a great victory for the people! _E_
Why can't the leaders of the Republican Party see that I am bringing in new voters by the millions we are creating a larger stronger party! _E_
#ICYMI on Monday I had the great honor of welcoming India's Prime Minister @narendramodi to the WH. Full Remarks:... __HTTP__ _E_
The debate last night proved that Hillary is running against the "B" team. She won't be so lucky when it comes to me! _E_
RT @GOP: On National #VoterRegistrationDay make sure you're registered to vote so we can #MakeAmericaGreatAgain __HTTP__ ht... _E_
Why is @BarackObama constantly issuing executive orders that are major power grabs of authority? This is the latest __HTTP__ _E_
Isn't it intetesting that anybody who attacks President Obama is considered a racist by the real racists out there! _E_
If ObamaCare is not repealed then we can expect stagnant growth long term unemployment and record high premiums. _E_
.@FoxNews owes me an apology for allowing clueless pundit @RichLowry to use such foul language on TV. Unheard of! _E_
Lightweight Senator Kirsten Gillibrand a total flunky for Chuck Schumer and someone who would come to my office "begging" for campaign contributions not so long ago (and would do anything for them) is now in the ring fighting against Trump. Very disloyal to Bill & Crooked USED! _E_
In New York March was the coldest month in recorded history we could use some GLOBAL WARMING! _E_
China will now pass our economy this year way ahead of projections. Pres. Obama – China's greatest asset! _E_
Via @RealClearNews by @rebeccagberg: "Is the White House Big Enough for Donald Trump?" __HTTP__ _E_
It's Jan. 2. President Obama should end his vacation early & get back to Washington to straighten out the ObamaCare catastrophe or end it. _E_
Wow so far everyone running for office who I did a ROBOCALL for has taken the lead in the polls the smart pols know this. GREAT! _E_
Challenges present opportunities. Always keep your focus and stay calm. _E_
Via @Newsmax_Media by @spiccoli: "Donald Trump Taking 'Serious Look' at 2016 Presidential Run" __HTTP__ _E_
In beautiful Miami inspecting the progress of @TrumpDoral's $250 million conversion into the country's #1 resort. _E_
Pathetic: @BarackObama did not want to veto Keystone himself so he lobbied the Democrats in the Senate to defeat it. __HTTP__ _E_
Thoughts & prayers are w/ our @USNavy sailors aboard the #USSJohnSMcCain where search & rescue efforts are underway. __HTTP__ _E_
Great news in Georgia! The just out Landmark poll shows me in first with 43%! Wow. __HTTP__ __HTTP__ _E_
Tune in Sunday June 3 to NBC at 9pm ET for the 2012 Miss USA competition coming from Planet Hollywood Resort & Casino in Las Vegas _E_
Thank you Georgia! I appreciate all of your support. #Trump2016 __HTTP__ _E_
Congratulations to Evan Lysacek for being nominated SI sportsman of the year. He's a great guy and he has my vote! #EvanForSI _E_
Mexico was just ranked the second deadliest country in the world after only Syria. Drug trade is largely the cause. We will BUILD THE WALL! _E_
Chris Wallace @fox at 10:00 A.M. _E_
"@limbaugh: 'Trump Has Changed the Entire Debate on Immigration'" __HTTP__ via @Newsmax_Media by Jason Devaney _E_
#TBT Trump and Gekko __HTTP__ _E_
Looking forward to IA & WI with Gov. Pence tomorrow. Join us! #MAGA __HTTP__ __HTTP__ __HTTP__ _E_
Phyllis Schlafly's Eagle Forum: 'National Review Will Be Defunct In The Next Year' __HTTP__ _E_
Priorities while fundraising and campaigning on our dime Obama has skipped over 50% of his intel briefings __HTTP__ _E_
Via @foxnewslatino by @GeraldoRivera: "@ApprenticeNBC Diary: And Now There Are Two" __HTTP__ _E_
Leaving Miami Trump National Doral will be GREAT! _E_
The leader and negotiators representing Mexico are far smarter and more cunning than the leader and negotiators representing the U.S.! _E_
My son @EricTrump has just done another great event and raised a lot of money for @StJude. He is a really good boy who loves helping kids. _E_
Watch the first #TrumpVine re: Anthony Weiner __HTTP__ _E_
Praying for the families of the two Iowa police who were ambushed this morning. An attack on those who keep us safe is an attack on us all. _E_
Huff post gets it wrong re: Ferry Point...the only leakage of gas is from Arianna Huffington. _E_
Today it was my privilege to welcome survivors of the #USSArizona to the WH. Remarks: __HTTP__ __HTTP__ _E_
The biggest problem with A Rod is he is bad for the chemistry of the Yankees he must go. _E_
Only in America can a Jihadi thug who murdered women and children be nursed back to health & then get a @RollingStone cover. _E_
Failed show @DannyZuker season 1 of @apprenticenbc had 28 million viewers and 41.5 million watching..... _E_
Obama believes Benghazi is a "phony scandal." Nothing phony about Americans being killed by Islamists. _E_
The global warming we should be worried about is the global warming caused by NUCLEAR WEAPONS in the hands of crazy or incompetent leaders! _E_
Thank you Connecticut! #Trump2016 __HTTP__ _E_
Numerous states are refusing to give information to the very distinguished VOTER FRAUD PANEL. What are they trying to hide? _E_
.@carlosbeltran15 is playing great for St. Louis Cardinals. They made a wise decision. _E_
Crooked Hillary Clinton is bought and paid for by Wall Street lobbyists and special interests. She will sell our country down the tubes! _E_
Uncomfortable looking NBC reporter Willie Geist calls me to ask for favors and then mockingly smiles when he is told of my high poll numbers _E_
"Face reality as it is not as it was or as you wish it to be." @jack_welch _E_
Good news disloyal @Macys stock is in a total free fall. Don't shop there for Christmas! __HTTP__ __HTTP__ _E_
AMERICA'S FUTURE __HTTP__ _E_
James Comey will be replaced by someone who will do a far better job bringing back the spirit and prestige of the FBI. _E_
#GodBlessTheUSA __HTTP__ _E_
...Hopefully we will never have to use this power but there will never be a time that we are not the most powerful nation in the world! _E_
Rep.Tom Marino has informed me that he is withdrawing his name from consideration as drug czar. Tom is a fine man and a great Congressman! _E_
.@Boeing stock went way down because of 787 so I just bought stock in @Boeing great company! _E_
It was my great honor to have lunch with our INCREDIBLE U.S. and ROK troops at Camp Humphreys in South Korea. 🇰🇷 __HTTP__ __HTTP__ _E_
Anybody (especially Fake News media) who thinks that Repeal & Replace of ObamaCare is dead does not know the love and strength in R Party! _E_
I have captured the smell of success. Meet me and the new Success @Macys Herald Square April 18 5:30pm first (cont) __HTTP__ _E_
I was on a tele townhall with @TeamBachmann and hosted her 4 times in Trump Tower yet she declined the Newsmax @iontv debate. No loyalty. _E_
At least 12 dead and 50 wounded in Colorado bring back fast trials & death penalty for mass murderers & terrorists. _E_
Lyin' Ted Cruz and 1 for 38 Kasich are unable to beat me on their own so they have to team up (collusion) in a two on one. Shows weakness! _E_
1.5M have already lost their health care plans thanks to ObamaCare __HTTP__ Defund now and Repeal later! _E_
The innocent bystanders of American poverty are kids. Yet two thirds of childhood poverty in America is (cont) __HTTP__ _E_
I look forward to reading the @CommerceGov 232 analysis of steel and aluminum to be released in June. Will take major action if necessary. _E_
No I wasn't at the @Yankees game yesterday can't go today either. When I go they win. _E_
In order to #DrainTheSwamp & create a new GOVERNMENT of by & for the PEOPLE I need your VOTE! Go to __HTTP__ LET'S #MAGA! _E_
Congratulations to the Republicans in Congress. You are the only people Obama can out negotiate. #TimeToGetTough _E_
RT @PacificCommand: #USAF B 1B Lancer #bombers on Guam stand ready to fulfill USFK's #FightTonight mission if called upon to do so __HTTP__ _E_
Obama Administration official said they choked when it came to acting on Russian meddling of election. They didn't want to hurt Hillary? _E_
Let us never negotiate out of fear but let us never fear to negotiate. John F. Kennedy Inaugural Address January 1961 _E_
The roads and sidewalks airports and bridges are perfect in Dubai. Everything looks clean & strong. In U.S. everything is falling apart! _E_
Me voting it really is my hair! __HTTP__ _E_
Meet me at @TrumpTowerNY and get your copy of my new book CRIPPLED AMERICA signed on 11/3 at 12pm! __HTTP__ _E_
Great move to take A Rod out of game. Now terminate his contract based on misrepresentation (drugs). _E_
I am signing copies of my book CRIPPLED AMERICA. Order yours now makes a great holiday gift! __HTTP__ ... ... _E_
Never met but never liked dopey Robert Gates. Look at the mess the U.S. is in. Always speaks badly of his many bosses including Obama. _E_
Can you believe that the disrespect for our Country our Flag our Anthem continues without penalty to the players. The Commissioner has lost control of the hemorrhaging league. Players are the boss! __HTTP__ _E_
"The team with the best players wins." @jack_welch _E_
It takes guts to be a brand. You cannot be all things to all people if you want to be a brand. Midas Touch _E_
The secret of getting ahead is getting started. Mark Twain _E_
Thank you Arizona I love you! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
RT @mike_pence: .@EdWGillespie is fighting to grow the economy & cut taxes! He's fighting for a safer VA. And he's is fighting for affordab... _E_
No wonder @BBC is in such big trouble & boss was just fired they are lost. _E_
Just took a look at Time Magazine looks really flimsy like a free handout at a parking lot! The sad end is coming just like Newsweek! _E_
Why is the United States Post Office which is losing many billions of dollars a year while charging Amazon and others so little to deliver their packages making Amazon richer and the Post Office dumber and poorer? Should be charging MUCH MORE! _E_
'President Trump Congratulates Exxon Mobil for Job Creating Investment Program' __HTTP__ _E_
The rally in Cincinnati is ON. Media put out false reports that it was cancelled! #MakeAmericaGreatAgain #Trump2016 _E_
Just met with the incoming Speaker of the Florida House @SteveCrisafulli – a fantastic guy! He will be a truly great leader. _E_
I endorsed a book on ObamaCare & it just went to #2 on the New York Times bestseller list! _E_
Druggie @AROD is now scheming to sue the @Yankees. He will go down as the biggest sports embarrassment of all time. _E_
We are now sending thousands of additional troops to Iraq to teach them how to fight they will run billions wasted! WHAT DOES U.S. GET? _E_
My @WendyWilliams appearance re Sony Atlantic City @ApprenticeNBC & 2016 __HTTP__ Always love going on Wendy's show! _E_
Thank you to our law enforcement officers! #LESM #Trump2016 __HTTP__ _E_
Heading back to Washington after working hard and watching some of the worst and most dishonest Fake News reporting I have ever seen! _E_
... but like many other great business people have used the laws to corporate advantage. _E_
Oh the wonders of the Arab Spring. Our new 'ally' the Muslim Brotherhood hosted Ahmadinejad yesterday __HTTP__ No more aid. _E_
Via @MiamiNewTimes by @Munzenrieder : "Doral Mayor Declares Emergency to Give Donald Trump Key to the City" __HTTP__ _E_
Dummy Bill Maher did an advertisement for the failing New York Times where the picture of him is very sad he looks pathetic bloated & gone! _E_
"I'm not afraid of failing. I don't like to fail. I hate to fail. But I'm not afraid of it." @VinceMcMahon _E_
Rosie O'Donnell's show is dead can't keep going for long with such poor ratings. @Rosie is a stone cold (cont) __HTTP__ _E_
There is great unity in my campaign perhaps greater than ever before. I want to thank everyone for your tremendous support. Beat Crooked H! _E_
Mexico is killing the United States economically because their leaders and negotiators are FAR smarter than ours. But nobody beats Trump! _E_
Negotiation tip: Be patient be persistent be stubborn. Know exactly what you want and keep it to yourself. _E_
Django Unchained is the most racist movie I have ever seen it sucked! _E_
Thank you America great #CommanderInChiefForum polls! __HTTP__ _E_
.@MarcoRubio is weak on illegal immigration and will allow anyone into the country..... _E_
Thank you Florida we are going to MAKE AMERICA GREAT AGAIN! Join us: __HTTP__ #AmericaFirst __HTTP__ _E_
Problem is that the acting head of the FBI & the person in charge of the Hillary investigation Andrew McCabe got $700000 from H for wife! _E_
...conquests how brave he was and it was all a lie. He cried like a baby and begged for forgiveness like a child. Now he judges collusion? _E_
Via @rcpvideo: "Donald Trump on Who He Likes For President: Donald Trump" __HTTP__ _E_
U.S. jobless claims are at a 2 month high. __HTTP__ @BarackObama's gas policy and ObamaCare are directly killing jobs. _E_
Great parade in The Villages I love you all. We will #MAGA. Thank you for the incredible support I will not forget! __HTTP__ _E_
It is so sad to see what has happened to Atlantic City. So many bad decisions by the pols over the years airport convention center etc. _E_
Terrible economic numbers released today. US GDP only grew 0.4% during Oct Dec 2012 quarter __HTTP__ Great news for China. _E_
Congrats @Jean_GeorgesNYC for being named the 6th best hotel restaurant in the world! __HTTP__ _E_
A great article by @NolteNC spelling out the truth on Mexico trade the border & illegals. Thank you @BreitbartNews __HTTP__ _E_
Big crowds standing ovations in South Carolina MAKE AMERICA GREAT AGAIN! _E_
'Donald Trump is already helping the working class' __HTTP__ _E_
RT @mike_pence: Join me in Colorado today! Look forward to seeing you!Denver 2pm __HTTP__ Springs 6pm __HTTP__ _E_
He should be ignored: @RonPaul's foreign policy is a dream come true for our enemies. He has zero chance to beat @BarackObama. _E_
It amazes me that other networks seem to treat me so much better than @FoxNews. I brought them the biggest ratings in history & I get zip! _E_
Here we go again with another Clinton scandal and e mails yet (can you believe). Crooked Hillary knew the fix was in B never had a chance! _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
My @seanhannity interview where I discuss @BarackObama's Job Council @RealSheriffJoe's investigation & 2012 election __HTTP__ _E_
Obama is our unlucky President. Everything he touches turns into a mess. Some people just don't have it! _E_
HAPPY NEW YEAR! We are MAKING AMERICA GREAT AGAIN and much faster than anyone thought possible! _E_
You need to overcome the tug of people against you as you reach for high goals. General George S. Patton _E_
.@Morning_Joe can you believe Kasie Hunt's poor and purposely inaccurate reporting on my great night and crowd in Iowa. @politico is a scam! _E_
.@timkaine is the ANTI DEFENSE SENATOR. #VPDebate #BigLeagueTruth __HTTP__ _E_
.@EWErickson ran @RedState into the ground. A change was necessary. Congratulations to @RedState and good luck in the future! _E_
Dwight Howard just signed with Houston. _E_
Doesn't dummy @chucktodd realize that when I considered running for president I filed financial papers showing unbelievable numbers. _E_
Just left West Palm Beach Fire & Rescue #2. Met with great men and women as representatives of those who do so much for all of us. Firefighters paramedics first responders what amazing people they are! _E_
.@TrumpTurnberry's 149 award winning guest rooms offer a perfect blend of Edwardian tradition and timeless design __HTTP__ _E_
Just another desperate move by the man who should have easily beaten Barrack Obama. (2/2) _E_
Marble mouth @tombrokaw asks why do we think to have a successful eveving you have to have Donald Trump as your guest of honor? BORING TOM _E_
Don't believe the main stream (fake news) media.The White House is running VERY WELL. I inherited a MESS and am in the process of fixing it. _E_
Gas prices are soaring. $4.12 in CA. OPEC is laughing at how stupid we are. _E_
We just finished shooting a new season of Celebrity Apprentice and happily for all Joan plays my advisor in two episodes. She was great! _E_
Jeb's new slogan Jeb can fix it . I never thought of Jeb as a crook! Stupid message the word fix is not a good one to use in politics! _E_
I am a very calm person but love tweeting about both scum and positive subjects. Whenever I tweet some call it a tirade..totally dishonest! _E_
Thank you Carmel Indiana! Get out & #VoteTrump tomorrow! #INPrimary #MakeAmericaGreatAgain __HTTP__ _E_
Thank you to Brandon Judd of the National Border Patrol Council for his strong statement on @foxandfriends that we very badly NEED THE WALL. Must also end loophole of "catch & release" and clean up the legal and other procedures at the border NOW for Safety & Security reasons. _E_
Great bilateral meeting with President @Alain_Berset of the Swiss Confederation as we continue to strengthen our great friendship. Such an honor to be in Switzerland! #WEF18 __HTTP__ _E_
Gloria Allred is always talking about me. She needs publicity. She is by far a better PR agent than lawyer. _E_
Reporting that Orlando killer shouted Allah hu Akbar! as he slaughtered clubgoers. 2nd man arrested in LA with rifles near Gay parade. _E_
I will be interviewed on @foxandfriends this morning at 7:30. So much to talk about! _E_
.@AP has one of the worst reporters in the business @JeffHorwitz wouldn't know the truth if it hit him in the face. _E_
slaughter you. This is a purely religious threat which turned into reality. Such hatred! When will the U.S. and all countries fight back? _E_
At some point Sgt. Bergdahl will have to explain his capture. In 2009 he simply wandered off his base without a weapon. Many questions! _E_
.@latoyajackson informs @ArsenioHall that @Omarosa is a "conniving witch"—is he surprised? Are we surprised? #CelebApprentice _E_
Via @amspec by Jeffrey Lord: Is Eric Schneiderman a Crook? What a great writer & researcher amazing story. __HTTP__ _E_
The Republicans once again hold all the cards with the debt ceiling. They can get everything they want. Focus! _E_
Great advice from my mother: "Trust in God and be true to yourself." – Mary MacLeod Trump _E_
An architectural landmark @TrumpTowerNY offers sweeping panoramic views of Fifth Avenue __HTTP__ _E_
Trump: I Love the Tea Party They Love Me __HTTP__ via @Newsmax_Media (cross posted on @foxnation __HTTP__ _E_
It snowed over 4 inches this past weekend in New York City. It is still October. So much for Global Warming. _E_
I am not available to be in @adamcarolla's new movie #RoadHard.bit.ly/roadhardmovie _E_
Offering top amenities along w/ award winning architectural design @TrumpChicago's condominiums are world class __HTTP__ _E_
Something really bad happened to the @Yankees psyche much like our President! _E_
Thank you @SenJohnMcCain for your kind remarks on the important issue of PTSD and the dishonest media. Great to be in Arizona yesterday! _E_
Thank you Jacob! __HTTP__ _E_
Dummy @Clare_OC @Forbes: Tiny fragrance deal with Parlux means nothing. Still sold at Trump Tower... _E_
#StandForOurAnthem _E_
So many veterans groups are beyond happy with all of the money I raised/gave! It was my great honor they do an amazing job. _E_
Now is the time for the @GOP to be united with the mission of electing @MittRomney this November. Stop with the public divisions. _E_
I really enjoyed the debate last night.Crooked Hillary says she is going to do so many things.Why hasn't she done them in her last 30 years? _E_
Prediction: The disaster known as ObamaCare will only get worse and Republicans will gain far greater power than they have had in years! _E_
FMR PRES of Mexico Vicente Fox horribly used the F word when discussing the wall. He must apologize! If I did that there would be a uproar! _E_
Unbelievable crowd in Dallas! __HTTP__ _E_
Bruce Willis wearing my hat on @FallonTonight last Friday __HTTP__ _E_
Via @BreitbartNews by @BobPriceBBTX: "DONALD TRUMP HEADING TO TEXAS BORDER" __HTTP__ _E_
Being good in business is the most fascinating kind of art. Making money is art & working is art & good business is the best art. Andy Warhol _E_
.@oreillyfactor bad and very deceptive journalism. Show must be heading in wrong direction too bad! @SarahPalinUSA _E_
Congratulations! 'First New Coal Mine of Trump Era Opens in Pennsylvania' __HTTP__ _E_
You have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_
If you have built castles in the air your work need not be lost that is where they should be. Now put the foundations under them. Thoreau _E_
"Sixteen" @TrumpChicago is winning accolades and is a destination point restaurant—don't miss it! _E_
My beautiful daughter Ivanka just had a healthy baby boy. Jared and Ivanka are very proud! _E_
Wow NATO's top commander just announced that he agrees with me that alliance members must PAY THEIR BILLS. This is a general I will like! _E_
.@WSJ reports that @GOP getting ready to treat me unfairly—big spending planned against me. That wasn't the deal! _E_
"The four page memo released Friday reports the disturbing fact about how the FBI and FISA appear to have been used to influence the 2016 election and its aftermath....The FBI failed to inform the FISA court that the Clinton campaign had funded the dossier....the FBI became.... _E_
In all fairness to Anthony Scaramucci he wanted to endorse me 1st before the Republican Primaries started but didn't think I was running! _E_
AMERICA USED TO BE THE LEADER OF THE WORLD. THANKS TO OBAMA AMERICA ISN'T EVEN LEADING FROM BEHIND. _E_
.@SanDiegoPD Fantastic job on handling the thugs who tried to disrupt our very peaceful and well attended rally. Greatly appreciated! _E_
There has been a systematic targeting of the Tea Party by the Obama administration. Now Schneiderman goes after me. No coincidence. _E_
Our country needs a president with great leadership skills and vision not someone like Hillary or Barack neither of which has a clue! _E_
"Think of yourself as a one man army. You're not only the commander in chief you're the soldier as well." – Think Like a Billionaire _E_
I still love Derek he is a winner! _E_
It's true—@dennisrodman gets the comeback of the year award. I didn't like having to fire him. #CelebApprentice _E_
... Apprentice was #1 among ABC CBS and NBC from 10:30 11 p.m. in all key demos (adults men and women 18 34 18 49 and 25 54) Nielsen. _E_
Join me at 11:00am:Watch here: __HTTP__ __HTTP__ _E_
Thank you for your support & friendship Governor @ChrisChristie!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Obama will quarantine all soldiers returning from Africa for 21 days. But he still allows all who contract Ebola into country? Hypocrite. _E_
'Over 250000 to Lose Health Insurance in Battleground North Carolina Due to #Obamacare' __HTTP__ _E_
No one has done more for people with disabilities than me. I have spent many millions of dollars to help out and am happy to have done so! _E_
...but interestingly the same people seem to be lucky. _E_
.@RobinRoberts everyone adores you including me get well fast! _E_
It's Thursday. @billmaher is still a very dumb guy just look at his past. _E_
Congratulations to Doug Jones on a hard fought victory. The write in votes played a very big factor but a win is a win. The people of Alabama are great and the Republicans will have another shot at this seat in a very short period of time. It never ends! _E_
He @BarackObama will lose a delegate in Oklahoma he only got 57% of the vote in the Democrat primary __HTTP__ _E_
Join me in Fayetteville North Carolina tomorrow evening at 6pm. Tickets now available at: __HTTP__ _E_
If Graydon Carter's very dumb bosses would fire him for his terrible circulation numbers at failing Vanity Fair his bad food restaurants die _E_
Will be going to North Dakota today to discuss tax reform and tax cuts. We are the highest taxed nation in the world that will change. _E_
John Podesta says nominee will be Cruz b/c last person Hillary wants to face is Trump! Use your head folks! 46 41! __HTTP__ _E_
Don't forget to tune in to the Celebrity Apprentice this Sunday night 9 pm on NBC. The fireworks continue.... __HTTP__ _E_
So many major problems for the U.S. and no answers by our leaders. When will it all change? Many of our difficulties are so easy to solve! _E_
Now that I started my war on illegal immigration and securing the border most other candidates are finally speaking up. Just politicians! _E_
Just returned from Europe. Trip was a great success for America. Hard work but big results! _E_
Iran just test fired a Ballistic Missile capable of reaching Israel.They are also working with North Korea.Not much of an agreement we have! _E_
Congratulations @Jean_GeorgesNYC for 10 years of 3 #MichelinStars! Visit the restaurant in @TrumpNewYork for a meal you'll never forget. _E_
I will be using Facebook and Twitter to expose dishonest lightweight Senator Marco Rubio. A record no show in Senate he is scamming Florida _E_
Who are your favorites on Team Power? Team Plan B? #CelebApprentice _E_
Mr. Pesident @BarackObama you cannot attack free enterprise and expect to have a healthy economy! _E_
When will the Democrats and Hillary in particular say we must build a wall a great wall and Mexico is going to pay for it? Never! _E_
Join me Tuesday Nov. 3rd at 12pm in Trump Tower NYC. I'll be signing copies of my new book CRIPPLED AMERICA. Don't miss it! _E_
Big time in U.S. today MAKE AMERICA GREAT AGAIN! Politicians are all talk and no action they can never bring us back. _E_
Just spent two days in Ireland at Trump International Golf Links & Hotel absolutely magnificent. __HTTP__ _E_
Both being optimistic and remembering the big picture have served me well throughout my life. You need to stay positive. _E_
RT @Harlan: Watching MSM you would have no idea @realDonaldTrump clearly unambiguously & repeatedly condemned the bigotry & violence in Ch... _E_
.@ErinBurnett's @OutFrontCNN ratings are so pathetic she even loses to @hardball_chris at 7PM which is replay of 5PM __HTTP__ _E_
Four more days until the Miss Universe Pageant. Be sure to tune in on Monday night at 9 p.m. on NBC it will be an amazing show. _E_
To aspiring entrepreneurs always remember that if your enemies aren't talking about you then you aren't doing well...and must work harder! _E_
Michele Bachmann will finish dead last tonight in Iowa because she is disloyal and a terrible boss. Sadly it is over for Michele. _E_
Once again @LilJon has competed at a very high level on Celebrity All Star @ApprenticeNBC. He is a great competitor. _E_
While I am in OH & PA you can also join @Mike_Pence in Nevada on Mon!Carson City: __HTTP__ __HTTP__ _E_
Great! __HTTP__ _E_
RT @IBDeditorials: Was Barack Obama A Foreign Exchange Student? __HTTP__ _E_
NATO commander agrees members should pay up via @dcexaminer: __HTTP__ _E_
Doesn't the US have better things to do than to destroy an American hero for the world to see? Now other (cont) __HTTP__ _E_
Join me live at the @WhiteHouse. __HTTP__ __HTTP__ _E_
Via @TVGrapevine: "@ApprenticeNBC: Premieres Sunday January 4 2015" __HTTP__ _E_
With Obama and Bernanke destroying the value of the dollar gold and real estate should continue to rise in value. _E_
China's best friend @BarackObama wants to cut the US fleet down to 230 ships the lowest level since WWI. __HTTP__ _E_
They let Crooked & the Gang off the hook for the crime but it looks like the cover up is just as bad. Unbelievable! __HTTP__ _E_
The Federal Government is teaching citizens 'Financial Literacy' while it is running $16T in debt __HTTP__ Only in America! _E_
Entrepreneurs: Paying attention can be a cost effective way of protecting yourself. _E_
RT @RealBenCarson: I endorse @realDonaldTrump. It's time to unite behind the candidate who will beat Hillary Clinton and return government ... _E_
Great to hear that our loyal @CelebApprentice fans are happy with today's announcement of the new cast. This will be something special! _E_
We can't even stop the Norks from blasting a missile. China is laughing at us. It is really sad. _E_
Thanks for all of the accolades on my speech today it's all about the truth! _E_
Change has to come from outside our very broken system. #MAGA __HTTP__ _E_
Thank you to our amazing law enforcement officers! #AmericaFirst __HTTP__ _E_
Join me in North Carolina tomorrow at 7:30pm! #ImWithYou Tickets: __HTTP__ _E_
The truth is that we could have much better healthcare in our country at a much more affordable price everyone in U.S. would benefit! _E_
Uh oh... @OMAROSA & @piersmorgan once again reunite in the Board Room in next week's 'All Star' @ApprenticeNBC. Fireworks! _E_
6 @TrumpCollection hotels made @CNTraveler reader's choice! @TrumpNewYork @TrumpSoHo @TrumpChicago @TrumpToronto @TrumpPanama @Trump_Ireland _E_
.@alexsalmond RT @King_Pepp Driving through Indiana and seeing tons of ugly windmills. Now I know what @realDonaldTrump is talking about _E_
My @gretawire interview discussing my $5M charitable offer to Obama his lack of transparency & my tremendous support __HTTP__ _E_
EARLY VOTING: MN & IA already underway more states coming up in the next week: OH ME AZ IN — check w/local officials for details & VOTE! _E_
So many signs that the Florida shooter was mentally disturbed even expelled from school for bad and erratic behavior. Neighbors and classmates knew he was a big problem. Must always report such instances to authorities again and again! _E_
My comment last March "Anthony Weiner is a sick pervert you think he will change? He will never change." __HTTP__ _E_
Again the story that there was collusion between the Russians & Trump campaign was fabricated by Dems as an excuse for losing the election. _E_
Via @examinercom by @Mellyora13: "Trump: Was Benghazi the result of incompetence or something more sinister?" __HTTP__ _E_
With the economy still on a downward trajectory the best investment young people can make now is buying property... _E_
Action speaks louder than words but not nearly as often. Mark Twain _E_
...contributions. The RNC is taking in far more $'s than the Dems and much of it by my wonderful small donors. I am working hard for them! _E_
Now that the three basketball players are out of China and saved from years in jail LaVar Ball the father of LiAngelo is unaccepting of what I did for his son and that shoplifting is no big deal. I should have left them in jail! _E_
Getting ready to make a major speech to the National Assembly here in South Korea then will be headed to China where I very much look forward to meeting with President Xi who is just off his great political victory. _E_
My @BreitbartNews' @biggovt editorial: "'A COUNTRY THAT CANNOT PROTECT ITS BORDERS WILL NOT LAST" __HTTP__ _E_
Snowing in Texas and Louisiana record setting freezing temperatures throughout the country and beyond. Global warming is an expensive hoax! _E_
An old picture with Nancy and Ronald Reagan. __HTTP__ _E_
Dopey Sugar.@Lord_Sugar ...Your net worth doesn't even qualify you to host the Apprentice. Keep making me money. _E_
Obama and the Democrats have no respect for WWII vets trying to get into the memorial. _E_
Last night was the first time Obama said we instead of I in respect to Bin Laden's killing. _E_
'Hillary's Two Official Favors To Morocco Resulted In $28 Million For Clinton Foundation' #DrainTheSwamp __HTTP__ _E_
.@Andre_Reed83 Thanks for your nice words. You are a real champion. I'm pushing! _E_
Why does @CNN & @andersoncooper waste airtime by putting failed campaign strategist Stuart Stevens who lost BIG for Romney on the show? _E_
This will be one of the biggest and most beautiful Miss Universe events ever. _E_
If Senate Republicans don't get rid of the Filibuster Rule and go to a 51% majority few bills will be passed. 8 Dems control the Senate! _E_
The Penn State Board should resign based on the grossly incompetent way they handled the NCAA. They gave away (cont) __HTTP__ _E_
#TedCruz eligibility to be President not settled law says Cruz' Constitutional Law Professor #LaurenceTribe __HTTP__ _E_
I've just done a major Dateline for NBC March 3rd just ahead of Apprentice. _E_
Why Franklin Graham says Donald Trump is right about stopping Muslim immigration __HTTP__ _E_
Thank you @ScottWalker! #AmericaFirst #RNCinCLE __HTTP__ _E_
I had a lot of fun answering your questions in the latest round of #AskTheDonald. See if your question made it __HTTP__ _E_
.@marklevinshow has been saying very nice things about me on his show recently. He has a fantastic radio show that I always enjoy! _E_
I look so forward to debating Crooked Hillary Clinton! Democrat Primaries are rigged e mail investigation is rigged so time to get it on! _E_
John McCain never had any intention of voting for this Bill which his Governor loves. He campaigned on Repeal & Replace. Let Arizona down! _E_
Thank you St. Louis Missouri! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
The Senate must go to a 51 vote majority instead of current 60 votes. Even parts of full Repeal need 60. 8 Dems control Senate. Crazy! _E_
Eli Manning. Great Athlete. Great Guy. @NYGiants great teamwork! _E_
No @DannyZuker just the opposite lots of money can go to charity if you have the guts to play the game (deal)! _E_
Anybody whose mind SHORT CIRCUITS is not fit to be our president! Look up the word BRAINWASHED. _E_
Via @TheTodaysGolfer "Trump @TurnberryBuzz transformation on course" __HTTP__ _E_
Thank you New Jersey! #Trump2016 __HTTP__ _E_
Derek Jeter had a great career until 3 days ago when he sold his apartment at Trump World Tower I told him not to sell karma? _E_
3 Chief of Staffs in less than 3 years of being President: Part of the reason why @BarackObama can't manage to pass his agenda. _E_
RT @FoxBusiness: .@charliekirk11: What this president has done is truly historic and if a Democrat president achieved 1/10th of what @POT... _E_
When renovations are completed Trump National Doral will be the finest resort in the U.S. _E_
It wasn't Donald Trump that divided this country this country has been divided for a long time! Stated today by Reverend Franklin Graham. _E_
I agree with Pres. Obama on Afghanistan. We should have a speedy withdrawal. Why should we keep wasting our money rebuild the U.S.! _E_
Who would you rather have negotiating with Iran President Obama or Toronto Mayor Ford? My money is on Ford. _E_
People are really liking the new ties and shirts @Macy's they are amazing and selling great! _E_
We need a president who knows how to get things done who can keep America strong safe and free and who can (cont) __HTTP__ _E_
.@DannyZuker You're starting up again because people have forgotten you. You wouldn't take my bet but it's (cont) __HTTP__ _E_
Many people are now saying that this is the worst storm/hurricane they have ever seen. Good news is that we have great talent on the ground. _E_
Just left Florida for D.C. The people and spirit in THAT GREAT STATE is unbelievable. Damage horrific but will be better than ever! _E_
"I also protect myself by being flexible. I never get too attached to one deal or one approach." – The Art of The Deal _E_
The interview with Oprah will cause Lance Armstrong huge legal and financial problems sometimes it is better to go into a corner and hide. _E_
Great news as a result of our TAX CUTS & JOBS ACT! __HTTP__ _E_
It's time for @PeteRose_14 to enter @MLB's @BaseballHall. All time hits leader has paid the price. _E_
My son @EricTrump and @LaraLeaYunaska just announced their engagement. Great news! A wonderful couple! _E_
Donald Trump Returns For 'All Star Celebrity Apprentice' __HTTP__ via @HuffPostTV _E_
New poll by ABC News/Washington Post TRUMP 32 CARSON 22 RUBIO 10 BUSH 7 Wow how will the media put a negative spin on this one? _E_
Where is the main stream media reporting on Univision's new expose of Fast and Furious? Too busy looking at Mitt's taxes? _E_
.@yankees are privately ecstatic over A Rod's latest doping bust. The evidence is damning __HTTP__ @yankees don't want him. _E_
Everyone is telling me that @EliotSpitzer is going to run against lightweight @AGSchneiderman Spitzer would win! _E_
THANK YOU ILLINOIS! Let's not forget to get family & friends out to VOTE IN 2016! __HTTP__ __HTTP__ _E_
So to all Americans in every city near and far small and large from mountain to mountain... __HTTP__ _E_
Barack Obama used to mock Bush's 300K monthly job reports __HTTP__ Now Obama wishes he could have a month half as good. _E_
When an employee leaves me and begs to come back I never let them. Loyalty is very important. _E_
Great win in Kansas last night for Ron Estes easily winning the Congressional race against the Dems who spent heavily & predicted victory! _E_
RT @WhiteHouse: The current tax code is a burden on American taxpayers and harmful to American job creators. Learn more: __HTTP__ _E_
I can confirm the reports @BillRancic my first season winner will be returning to this All Star season of @CelebApprentice. _E_
Iran's attack on Israeli diplomats is an attack on the West _E_
RT @SpoxDHS: Schumer Rounds Collins destroys the ability of @DHSgov to enforce immigration laws creating a mass amnesty for over 10 millio... _E_
Rising over Bay Street @TrumpTO brings opulent luxury along with our famous world class amenities to the Queen City __HTTP__ _E_
The Irish government is too smart to destroy their beautiful coastline w/ bird killing ugly wind turbines. @AlexSalmond @AberdeenCC _E_
Thank you so much. Earnest must have been a great person. __HTTP__ __HTTP__ _E_
The city of Buffalo is struggling. Moving the @buffalobills would be catastrophic. The Bills belong in Buffalo! _E_
Immigration reform really changes the voting scales for the Republicans—for the worse! _E_
Texas is heeling fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow! _E_
I am in Trump International Hotel Las Vegas getting ready and waiting for the debate tonight. Look forward hope I get treated fairly! _E_
So great that John McCain is coming back to vote. Brave American hero! Thank you John. _E_
President Obama just told President Putin how important the Russian air strikes against ISIS have been. I TOLD YOU SO! _E_
We don't want to have a recount in any of the battleground states. Obama will steal it. Make sure all your friends and family vote. _E_
Class of 2013. #WWEHOF __HTTP__ _E_
I would like to congratulate @SenateMajLdr on having done a fantastic job both strategically & politically on the passing in the Senate of the MASSIVE TAX CUT & Reform Bill. I could have not asked for a better or more talented partner. Our team will go onto many more VICTORIES! _E_
The global warming scientists don't want to be airlifted off the ship they are having too much fun and that is too simple a solution FAME! _E_
"The unemployment rate remains at a 17 year low of 4.1%. The unemployment rate in manufacturing dropped to 2.6% the lowest ever recorded. The unemployment rate among Hispanics dropped to 4.7% the lowest ever recorded..."@SecretaryAcosta @USDOL __HTTP__ _E_
Mr. Trump removing the broken teleprompter in North Carolina in front of a massive crowd. He goes on&delivers the b... __HTTP__ _E_
Next week the Senate is going to vote on legislation to save Americans from the ObamaCare DISASTER. #WeeklyAddress __HTTP__ _E_
The off shore Aberdeen wind farm site is "experimental" & has no track record delivering energy. __HTTP__ @guardian _E_
Via @Golfmagic: Golden Bear and American business tycoon finish their unlikely masterpiece __HTTP__ _E_
The greatest influence over our election was the Fake News Media screaming for Crooked Hillary Clinton. Next she was a bad candidate! _E_
Remember that Carson Bush and Rubio are VERY weak on illegal immigration. They will do NOTHING to stop it. Our country will be overrun! _E_
Hillary's refusal to mention Radical Islam as she pushes a 550% increase in refugees is more proof that she is unfit to lead the country. _E_
Obama promised 5.2% unemployment by October 2012. His promises are worthless! _E_
...case against him & now wants to clear his name by showing the false or misleading testimony by James Comey John Brennan... Witch Hunt! _E_
How did the NCAA which is weak and becoming irrelevant extract such a big & reputation shattering settlement from Penn State. Others zero! _E_
Via @CBSLA: Donald Trump Fights To Keep Large American Flag Flying At Southland Golf Course __HTTP__ _E_
Why does the failing @WSJ write a false editorial about me and let dummy @KarlRove make the same mistake in the same edition of the paper? _E_
Watching Gates on @seanhannity looks like he got hit by a truck! Why didn't Obama get him and othersto sign a confidentiality agreement? _E_
Congrats @adamcarolla on #RoadHard raising $1M on @fundanything a record. _E_
Great poll out of Illinois! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
What a series the @nyrangers @NHLDevils is turning out to be! Tonight's game should be another close one. _E_
RT @EricTrump: Join @TeamTrump on Saturday for National Day of Action as we work to #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_
Ted Cruz does not have the right temperment to be President. Look at the way he totally panicked in firing his director of comm. BAD! _E_
Join me on Tuesday in Greensboro North Carolina! #Trump2016 #AmericaFirst __HTTP__ _E_
Certain Internet sites are like a bad epidemic that won't go away others are terrific _E_
Make sure you get out and vote...most important election of our generation...go Romney! _E_
Via @MiamiHerald by Bill Van Smith: @jacknicklaus reminisces amid honor at @TrumpDoral __HTTP__ _E_
The economy will come back but it will not be the same economy. The old economy of the Industrial Age is (cont) __HTTP__ _E_
Be sure to look for my beautiful wife Melania Trump tonight on QVC at 9 pm ET where she will be debuting her fantastic jewelry collection. _E_
I would have millions of votes more than Hillary except for the fact that I had 17 opponents and she just had a socialist named Bernie! _E_
Robert Pattinson is putting on a good face for the release of Twilight. He took my advice on Kristen Stewart...I hope! _E_
.@Joan_Rivers —I know you're watching what did you think of your impersonator? _E_
You can watch all the highlights of last night's record 14th season premiere of @ApprenticeNBC __HTTP__ _E_
"One thing I've learned about the press is they're always hungry for a good story the more sensational the better." Art of the Deal _E_
JOIN ME! #MAGATODAY:Springfield OH Toledo OH Geneva OH FRIDAY:Manchester NH Lisbon ME Cedar Rapids IA __HTTP__ _E_
Last Thursday Obama said investing in infrastructure would improve our economy for the long term The next day he again stopped Keystone _E_
Look at the solution not the problem. Learn to focus on what will give results. _E_
We left Iraq and it is quickly falling apart what a waste of lives and money and so obvious. _E_
. #JoeTheismann was great as a political analyst on @FoxNews. He knows far more than football. Thanks for the nice words Joe! _E_
Join me live at 9:00 P.M. #JointAddress __HTTP__ __HTTP__ _E_
Via @AmSpec by Jeffrey Lord: "Donald Trump Takes Ice Bucket Challenge – Dares Obama" __HTTP__ _E_
"Success is having to worry about every damn thing in the world except money." Johnny Cash _E_
You can change your vote in six states. So now that you see that Hillary was a big mistake change your vote to MAKE AMERICA GREAT AGAIN! _E_
My @FoxBusiness interview on @Varneyco discussing @BarackObama's dirty tactics & how @MittRomney should respond __HTTP__ _E_
Via @NYDailyNews by Eugene Dunn: "Trump the Nation's Great Hope" __HTTP__ _E_
The military threat from China is gigantic and it's no surprise that the Communist Chinese government lies (cont) __HTTP__ _E_
We have all the cards. Now is the time to make a great deal with Iran. _E_
White House Press Sec. had a hard time explaining why @BarackObama supported tax breaks for oil companies in (cont) __HTTP__ _E_
.@JohnLegere T Mobile service is terrible! Why can't you do something to improve it for your customers. I don't want it in my buildings. _E_
Go to @greta show will be talking about OPO and plenty else ENJOY! _E_
.@THEGaryBusey feels he's been abandoned by his team. Do you think so? #CelebApprentice _E_
With the signature services of Trump Attaché @TrumpWaikiki brings premiere luxury to the white sands of Waikiki __HTTP__ _E_
We as a country either have borders or we don't. IF WE DON'T HAVE BORDERS WE DON'T HAVE A COUNTRY! _E_
Hitting at home. Democrat Sen. Joe Donnelly's son had his healthcare plan dropped __HTTP__ _E_
The ObamaCare websites have cost over $5B & many still do not work __HTTP__ One of the greatest fiascos in modern history! _E_
#CelebApprentice @apprenticenbc returns tonight at 9/8c on NBC __HTTP__ _E_
We have to get tough with China before they destroy us. _E_
"The great question is not whether you have failed but whether you are content with failure." Laurence J. Peter _E_
Believe you can and you're halfway there. Theodore Roosevelt _E_
Lyin' Ted Cruz lost all five races on Tuesday and he was just given the jinx a Lindsey Graham endorsement. Also backed Jeb. Lindsey got 0! _E_
Is this what we want for a President? __HTTP__ _E_
WHY CAN'T THE MEDIA TELL THE TRUTH WE WOULD ALL BE SO MUCH BETTER OFF! _E_
Ted Cruz is lying again. Polls are showing that I do beat Hillary Clinton head to head. Check out __HTTP__ Poll snd Q Poll. _E_
Just been informed by @nbc they want to extend the run of the @ApprenticeNBC by two shows because it is doing so well. Two hours live. _E_
One of the most effective press conferences I've ever seen! says Rush Limbaugh. Many agree.Yet FAKE MEDIA calls it differently! Dishonest _E_
China 'scorns' US cyber espionage charges China does not respect us __HTTP__ and feels Obama is a dummy _E_
Jobs are returning illegal immigration is plummeting law order and justice are being restored. We are truly making America great again! _E_
Kathy Griffin should be ashamed of herself. My children especially my 11 year old son Barron are having a hard time with this. Sick! _E_
Thank you @Heritage! This is our once in a generation opportunity to revitalize our economy revive our industry & renew the AMERICAN DREAM! __HTTP__ _E_
Don't forget to watch me tonight on Late Night with Jimmy Fallon 12:35 a.m. on NBC. I'll be making a big announcement! _E_
A coincidence that the NSA leaker is living openly in Hong Kong?! At the same time the Chinese Pres. met with Obama in CA. _E_
The dying @NRO National Review has totally given up the fight against Barrack Obama. They have been losing for years. I will beat Hillary! _E_
Steven Tyler got more publicity on his song request than he's gotten in ten years. Good for him! _E_
Via @politico by "Poll: Trump has twice the support of Bush in New Hampshire" __HTTP__ _E_
The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non competitive. _E_
Today I will be rallying with with 15000 patriots in Arizona for border security! Let's Make America Great Again! __HTTP__ _E_
Achievers move forward at all times. Achievement is not a plateau it's a beginning. Don't waste time treading water. _E_
Iranian officials say that the WH is misleading public about the details of an interim nuclear agreement __HTTP__ _E_
I wonder what @JoeBiden was thinking last night as @PaulRyanVP delivered that knockout speech. Joe should call in sick for the VP debate. _E_
A dishonest slob of a reporter who doesn't understand my sarcasm when talking about him or his wife wrote a foolish & boring Trump hit _E_
Congratulations Stephen Miller on representing me this morning on the various Sunday morning shows. Great job! _E_
Corporations have NEVER made as much money as they are making now. Thank you Stuart Varney @foxandfriends Jobs are starting to roarwatch! _E_
Via @ShinySheet by @soapbox1: "Show jumping grand prix returns to Mar a Lago Sunday __HTTP__ _E_
Join us tomorrow in Scranton Pennsylvania at 3pm!#TrumpPence16 #MAGA Tickets: __HTTP__ __HTTP__ _E_
The Palestinian terror attack today reminds the world of the grievous perils facing Israeli citizens....continued: __HTTP__ _E_
I told you that the Giants starting Hudson was a mistake. Just got knocked out of the game. I love being right! _E_
Just read about my friend @HulkHogan he was set up too bad he has to use the court system instead of his muscles. _E_
Will miss @RealBenCarson tonight at the #GOPDebate. I hope all of Ben's followers will join the #TrumpTrain. We will never forget. _E_
During @BarackObama's presidency median family income has fallen 4.8% __HTTP__ Terrible for the middle class. _E_
Wishing all of those celebrating #Hanukkah around the world a happy and healthy eight nights in the company of those they love. __HTTP__ __HTTP__ _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
Wow USA Today did todays cover story on my record in lawsuits. Verdict: 450 wins 38 losses. Isn't that what you want for your president? _E_
Michael Morell the lightweight former Acting Director of C.I.A. and a man who has made serious bad calls is a total Clinton flunky! _E_
The @washingtonpost report on potential VP candidates is wrong. Marco Rubio and most others mentioned are NOT under consideration. _E_
Cruz just lied again I am and have been totally against #ObamaCare repeal and replace! _E_
Trump National Golf Club Washington D.C. is on 600 beautiful acres fronting the Potomac River. A fantastic setting! __HTTP__ _E_
Great boardroom. #CelebApprentice _E_
Phil Mickelson's final 66 round in @The_Open was amazing. Congrats on his well deserved win. Amazing competitor & a great guy! _E_
In more and more places throughout this region citizens of SOVEREIGN and INDEPENDENT nations have taken greater control of their destinies and unlocked the potential of their people. #APEC2017 __HTTP__ _E_
This election is being rigged by the media pushing false and unsubstantiated charges and outright lies in order to elect Crooked Hillary! _E_
Donald Trump helped expose the silliness of the move by offering to pay for the White House tours. __HTTP__ _E_
My @gretawire interview discussing why the sequestration cuts are necessary our $17T national debt & 2016 election __HTTP__ _E_
A market is never saturated with a good product but it is very quickly saturated with a bad one. Henry Ford _E_
I have just lost my beautiful & elegant long time exec. assistant Norma Foerderer. She passed away yesterday – a truly magnificent woman. _E_
via WSJ. Wake up @AlexSalmond before you destroy Scotland. @David_Cameron @AberdeenCC @pressjournal __HTTP__ _E_
Set high standards and meet them. The proof is in the doing: learn by doing and taking risks. _E_
In Nov. '11 Al Qaeda's flag flew over the 'birthplace' of Libya's revolution __HTTP__ In Sept. '12 it flew over our Embassy. _E_
Hillary Clinton does not have the STRENGTH or STAMINA to be President. We need strong and super smart for our next leader or trouble! _E_
Based on the incredibly inaccurate coverage and reporting of the record setting Trump campaign we are hereby: __HTTP__ _E_
Obama's '07 speech which @DailyCaller just released not only shows that Obama is a racist but also how the press always covers for him. _E_
Never allow your attitude to be a liability. Be positive and strong. Set your mind on winning and keep it there. _E_
Thank you Iowa see you soon!#Trump2016 #ImWithYou __HTTP__ __HTTP__ _E_
I loved beating these two terrible human beings. I would never recommend that anyone use her lawyer he is a total loser! _E_
Will be meeting at 9:00 with top automobile executives concerning jobs in America. I want new plants to be built here for cars sold here! _E_
Why are people upset w/ me over Pres Obama's birth certificate?I got him to release it or whatever it was when nobody else could! _E_
.@FloydMayweather Good luck tonight Floyd. _E_
Fifth Avenue's most iconic building @TrumpTowerNY features Trump Grill nestled in the corner of the Atrium __HTTP__ _E_
The G 20 Summit was a great success for the U.S. Explained that the U.S. must fix the many bad trade deals it has made. Will get done! _E_
China's Olympic training program is abusive __HTTP__ It is modern day slavery & shameful. Their (cont) __HTTP__ _E_
We need a president who is smart and tough enough to recognize the national security threat China poses in the (cont) __HTTP__ _E_
Less than two weeks until @WWE's @WrestleMania XXIX. @TheRock v. @JohnCena willbe epic! Excited to be inducted into the Hall of Fame. _E_
.@Newsmax by @melaniebatley: Donald Trump Tells Why He's Eyeing the White House.I'll Tell You Why He Could Win. __HTTP__ _E_
Don't forget to enter the Think Like A Champion signed book and keychain contest: __HTTP__ _E_
"If winning isn't everything why do they keep score?" Vince Lombardi _E_
.@weeklystandard I know your business is failing but you should try to get writers far better than @stephenfhayes. _E_
It's not that I'm so smart it's just that I stay with problems longer. Albert Einstein _E_
Senator Luther Strange has gone up a lot in the polls since I endorsed him a month ago. Now a close runoff. He will be great in D.C. _E_
I still can't believe we didn't t take the oil from Iraq. _E_
For reasons only they can explain the @USChamber wants to continue our bad trade deals rather than renegotiating and making them better. _E_
"Remember to keep going: if you stop your momentum will stop." – Think Big _E_
Could be a fight over red heads with @lisalampanelli—this could be good. #sweepstweet _E_
"Statement from President Donald J. Trump on #GivingTuesday" __HTTP__ _E_
Even if @BarackObama stays in DC taxpayers will pay millions for his Hawaii vacation when Americans are struggling __HTTP__ _E_
He who demands little gets it. Ellen Glasgow _E_
Via @MiamiHerald by Hannah Sampson: "BLT Prime coming to Trump's Doral resort" __HTTP__ _E_
.@TIME Magazine should definitely pick David Pecker to run things over there he'd make it exciting and win awards! _E_
The only thing more boring than @bwilliams newscast is his show Rock Center which is totally dying in the ratings—a disaster! _E_
The secret of success in life is for a man to be ready for his opportunity when it comes. Benjamin Disraeli _E_
I will be interviewed by @ericbolling tonight at 8pm on the @oreillyfactor. Enjoy! _E_
Time magazine should name David Pecker of American Media to be its top guy...but they are not smart enough to do that! _E_
"Remember that fear can be conquered. Go full throttle and the odds will be on your side." – Trump Never Give Up _E_
A special message for Martin Bashir __HTTP__ _E_
Jeff Zucker failed @NBC and he is now failing @CNN. _E_
Obama is in Texas but will not be visiting the border. He is too busy fundraising! _E_
This new Russian strategy guarantees victory for the Syrian government and makes Obama and U.S. look hopelessly bad. President in trouble! _E_
.@CarlyFiorina I only said I was on @60Minutes four weeks ago with Putin—never said I was in Green Room. Separate pieces—great ratings! _E_
Thank you Tennessee! #MAGA __HTTP__ _E_
Mattis Says Trump's Warning Stopped Chemical Weapons Attack In Syria __HTTP__ _E_
Your work will never be in vain if you work for a cause that is greater than yourself. _E_
Wow Corey Lewandowski my campaign manager and a very decent man was just charged with assaulting a reporter. Look at tapes nothing there! _E_
Last week to enter the Think Like A Champion signed book and keychain contest: __HTTP__ _E_
Great @FOXSports art. by @jillpainter on Doc River's annual golf charity event @TrumpGolfLA. Doc is a great friend! __HTTP__ _E_
__HTTP__ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
He who knows when he can fight and when he cannot will be victorious. Sun Tzu _E_
The reason great dealmakers do not OPENLY celebrate a deal especially one that is not complete is that it shows weakness to the other side _E_
The Ground Zero Mosque should not go up where planned. It is wrong. My offer still stands to buy the property. Good deal for everyone. _E_
#TBT Saturday Night Live __HTTP__ _E_
The United Nations Security Council just voted 15 0 to sanction North Korea. China and Russia voted with us. Very big financial impact! _E_
Initial reports say 2nd debate viewership dropped. See what happens when I am not mentioned. _E_
W/ spectacular panoramic Pacific Ocean views @TrumpGolfLA is the top luxury public golf course in the country __HTTP__ _E_
"Get in. Get it done. Get it done right. Get out." – Fred C. Trump (My father!) _E_
Be totally focused. Being successful requires nothing less than 100% of your concentrated effort. _E_
When I renovated Wollman Rink in Central Park it came in $750000 under budget.. _E_
I win a state in votes and then get non representative delegates because they are offered all sorts of goodies by Cruz campaign. Bad system! _E_
An ad hoc interview I filmed with a German journalist at Ground Zero hours after the attack __HTTP__ _E_
Hillary Clinton's open borders are tearing American families apart. I am going to make our country Safe Again for all Americans. #Imwithyou _E_
Just landed in Baton Rouge Louisiana. Reports are out that lines are three quarters of a mile to get in. Wow! #MakeAmericaGreatAgain _E_
.@DannyZuker I'm in front of the camera and behind the camera just looked at your picture you'll never be in front of the camera! _E_
I always enjoy being interivewed on @WOR710 by John Gambling. My father Fred used to listen to his father's show. _E_
Director Clapper reiterated what everybody including the fake media already knows there is no evidence of collusion w/ Russia and Trump. _E_
Things happen that make you question whether you should keep going. As long as you are enjoying what you are doing keep going. _E_
My interview last night with Greta on the GOP going El Foldo __HTTP__ _E_
Thanks Larry. Best wishes. __HTTP__ _E_
After witnessing first hand the horror & devastation caused by Hurricane Harveymy heart goes out even more so to the great people of Texas! _E_
Yesterday the Christmas tree arrived at Rockefeller Plaza. An iconic event for New York! _E_
The President of Taiwan CALLED ME today to wish me congratulations on winning the Presidency. Thank you! _E_
.@FoxNews is devastated that lightweight Senator Marco Rubio got trounced tonight and is the big loser. I won the two big states great! _E_
My @SquawkCNBC interview re: Europe's financial mess investing in Spain Germany's economy and the future of the Euro __HTTP__ _E_
Sorry the best and most beautiful ties and shirts made anywhere and at a really reasonable cost. Also fragrance is amazing. GO TO MACY'S. _E_
.@BarackObama is now taking credit for changing party platform language but he reviewed it prior to the convention __HTTP__ _E_
I should release the sad and totally apologetic letter that Penn @pennjillette hand delivered to me. Minds would be changed very fast! _E_
This is the Cruz voter violation certificate sent to everyone a misdemeanor at minimum. __HTTP__ _E_
Just had a very open and successful presidential election. Now professional protesters incited by the media are protesting. Very unfair! _E_
What would you do if a large group of Muslims had a very public meeting drawing horrible and mocking cartoons of Jesus? Oh really be cool! _E_
Congratulations to @woodyjohnson4 and @nyjets on yesterday's very exciting game. _E_
Wow the ridiculous deal made between Lyin'Ted Cruz and 1 for 42 John Kasich has just blown up. What a dumb deal dead on arrival! _E_
The ROLL CALL is beginning at the Republican National Convention. Very exciting! _E_
I thought that @CNN would get better after they failed so badly in their support of Hillary Clinton however since election they are worse! _E_
Get ALL the info then quick trial then death penalty for the Boston killer of innocent children and people! Do not be kind. _E_
Dummy @mcuban is at it again trying to use me to get publicity for himself! _E_
Iran was planning to attack the Israeli and Saudi DC embassies. We should respond accordingly. The diplomatic window is closed. _E_
Bringing true luxury to the Windy City @TrumpChicago soars 92 levels over the Chicago River __HTTP__ _E_
Would very much appreciate Saudi Arabia doing their IPO of Aramco with the New York Stock Exchange. Important to the United States! _E_
The episode of the Apprentice that everyone has been waiting for....Joan Rivers stars and she is and does GREAT! Next Monday night at 8:00 _E_
Congrats @MittRomney on a huge NV victory. Let's make @BarackObama a one term president __HTTP__ #OneTermFund _E_
Honest reporters stated that the Prayer Breakfast was going on during my CPAC speech and security was very slow to let people in long lines! _E_
It's amazing my weekly scheduled interviews on @foxnews and @CNBC draw the highest ratings. And they get bigger week by week thanks folks! _E_
"Donald Trump on 'cliff': 'Other countries are eating our lunch'" __HTTP__ via @BIZPACReview _E_
A great Christmas movie & perfect #TBT! #MakeAmericaGreatAgain Story: __HTTP__ __HTTP__ _E_
Entrepreneurs: Paying attention is a cost effective way of protecting yourself. _E_
I wonder if @BarackObama ever had an Indonesian passport. Did he become an Indonesian citizen when he lived there? _E_
Now Michelle Nunn will not admit she voted for Obama. Of course she did. Nunn supports ObamaCare & is anti Second Amendment. _E_
A total lightweight: @JonHuntsman continues to give the worst responses on China in the debates. I can see why (cont) __HTTP__ _E_
NYPD Officer Larry DePrimo has made the entire city proud with a his generous act of kindness __HTTP__ NYC loves the NYPD. _E_
.@BrookslawBrooks Thank you so much for your nice words. I will make you look very smart! _E_
Set the bar high do the best you possibly can and believe in yourself—because if you don't no one else will either. _E_
People will be very surprised by our ground game on Nov. 8. We have an army of volunteers and people with GREAT SPIRIT! They want to #MAGA! _E_
We're getting down to the wire on The Apprentice tune in tonight for some great action! 10 p.m. on NBC. _E_
.@FrankLuntz works really hard but is a guy who just doesn't have it a total loser! _E_
Check out this photo shoot video of @IvankaTrump's Spring 2012 collection.... __HTTP__ _E_
Is Supreme Court Justice Ruth Bader Ginsburg going to apologize to me for her misconduct? Big mistake by an incompetent judge! _E_
Let's together Make America Great Again! Vote Trump at __HTTP__ _E_
The new winter menu @SixteenChicago @TrumpChicago explores the evolution of fine dining @RobbReport __HTTP__ _E_
It now turns out that the phony allegations against me were put together by my political opponents and a failed spy afraid of being sued.... _E_
Entrepreneurs: Put everything you've got into what you're doing. Be totally focused nothing should be haphazard. _E_
Disgusting @BarackObama's supporters are launching an anti Mormon whisper campaign __HTTP__ Shameful but no surprise. _E_
Because of the tornado tragedy I will not be doing @piersmorgan tonight. I wish everyone well! _E_
My @FoxNews interview with @TeamCavuto discussing why I will not be moderating the Newsmax @iontv debate __HTTP__ _E_
Our country is blowing up and @BarackObama is out campaigning. _E_
The Republicans must be patient and smart ObamaCare could sweep them into office in far greater numbers than anyone ever thought possible! _E_
Convention Center officials in Phoenix don't want to admit that they broke the fire code by allowing 12 15000 people in 4000 code room. _E_
Everyone loves TV's darling @TheRealMarilu. But wait until you see her tough & competitive side in the upcoming @CelebApprentice! _E_
Will be working with contractors at Trump National Doral in Miami today. _E_
Thanks. __HTTP__ _E_
My @FoxNews interview with @gretawire explaining that I am keeping all my options available for 2012 __HTTP__ _E_
Druggie A Rod @MLB's biggest fraud is lucky George Steinbrenner is no longer with us. @Yankees would have voided his contract. _E_
RT @realDonaldTrump: "Arrests of MS 13 Members Associates Up 83% Under Trump" __HTTP__ _E_
Last weeks Dateline which I hosted was the highest rated Dateline since January! _E_
A great Father's Day gift—a stay at my 5 star hotel @TrumpNewYork along with items from my signature collection __HTTP__ _E_
Statement on Relationship with NBC __HTTP__ _E_
I am going to Trump National Doral in Miami today to check out the brand new and just opened BLUE MONSTER and the spectacular driving range. _E_
I hear by demand a second investigation after Schumer of Pelosi for her close ties to Russia and lying about it. __HTTP__ _E_
RT @FoxNews: .@EricTrump: People have seen a year that's incredible that's been filled with nothing but the best for our country America... _E_
The polls are close so Crooked Hillary is getting out of bed and will campaign tomorrow.Why did she hammer 13 devices and acid wash e mails? _E_
CBS's FACE THE NATION Posts Largest Audience Since 2001#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
With Democrats Spitzer Danger Weiner & Filner which party really has the war on women? _E_
The latest book on Hillary—Wow a really tough one! __HTTP__ @RogerJStoneJr _E_
It's time for government to stop picking winners & losers. Let's make sure everyone can achieve the American dream! __HTTP__ _E_
"Set the example and you'll be a magnet for the right people. That's the best way to work with people you like." – Think Like a Champion _E_
Maybe if Obama knew too much about the spying it would be worse than knowing nothing but either way it is just another disaster! _E_
China has done very well under Obama. Now they just released their first aircraft carrier. _E_
.@JustinRose99 The display you put on this weekend was unprecedented. Even the best putters couldn't believe it. You're amazing. See u soon. _E_
We are stupidly paying Iran billions of dollars that we should not be paying. Why isn't this part of the nuclear negotiations? Really dumb! _E_
This past Sunday's All Star Celebrity @ApprenticeNBC continued to win the key demographic of adults 25 54. An amazing run! _E_
This summer is very tough for the nation's worst AG @AGSchneiderman. Moreland Commission is his disaster. _E_
RT @seanhannity: @ericbolling To my dear friend please know we all love you will be here for you and your family. _E_
People are just now starting to find out how dishonest and disgusting (FakeNews) @NBCNews is. Viewers beware. May be worse than even @CNN! _E_
Congress must stop Obama's reckless deal with Iran. The framework is a pathway for Iran to develop nukes. _E_
Congrats to my friend @Schwarzenegger who is doing next season's Celebrity Apprentice. He'll be great & will raise lots of $ for charity. _E_
The Justice Dept. should ask for an expedited hearing of the watered down Travel Ban before the Supreme Court & seek much tougher version! _E_
Having a good relationship with Russia is a good thing not a bad thing. Only stupid people or fools would think that it is bad! We..... _E_
Congratulations to Miss Rhode Island on winning the Miss USA contest. She did an amazing job. _E_
Listen to my interview with @gretawire tonight at 10PM ET on @FoxNews. _E_
The Paley Center for Media is a great place to visit when you're in NYC. #CelebApprentice _E_
...use subsidies to buy health plans. In other words Ocare is dead. Good things will happen however either with Republicans or Dems. _E_
An insightful article on @BarackObama __HTTP__ _E_
I make no apologies for this country my pride in it or my desire to see us become strong and rich again. (cont) __HTTP__ _E_
Donald Trump's Guns by @EmilyMiller @washtimes __HTTP__ _E_
#CelebApprentice Who will hear those two famous words? @Apprenticenbc premieres tomorrow at 9/8c on NBC. __HTTP__ _E_
When I said that if within the Orlando club you had some people with guns I was obviously talking about additional guards or employees _E_
I was thrilled to be back @LibertyU. Congratulations to the Class of 2017! This is your day and you've earned it.... __HTTP__ _E_
Just got back from New Hampshire. Amazing people we all had a great time together! _E_
Via @zpolitics: "Donald Trump Sends Message to @GaRepublicans" __HTTP__ _E_
....countries which are doing badly. I want a merit based system of immigration and people who will help take our country to the next level. I want safety and security for our people. I want to stop the massive inflow of drugs. I want to fund our military not do a Dem defund.... _E_
The Dow just broke 24000 for the first time (another all time Record). If the Dems had won the Presidential Election the Market would be down 50% from these levels and Consumer Confidence which is also at an all time high would be "low and glum!" _E_
Jeb Bush had a tough night at the debate. Now he'll probably take some of his special interest money he is their puppet and buy ad's. _E_
Don't forget my book signing tonight at Costco on 1250 Old Country Road in Westbury NY from 6 8 pm. Hope to see you there. _E_
RT @BarackObama: RT if you agree: We need a President who is fighting for all Americans not one who writes off nearly half the country. _E_
Great! __HTTP__ _E_
What Bernie Sanders really thinks of Crooked Hillary Clinton. __HTTP__ _E_
.@JerryLawler was terrific. #WWEHOF __HTTP__ _E_
Watch this clip from earlier this year. Time & time again I have been right about terrorism. It's time to get tough! __HTTP__ _E_
Over the years I've discovered that for a brand to build the people surrounding it have to work exceptionally well together. _E_
There is no substitute for private sector experience. _E_
Jeb Bush "I am a conservative" = Barack Obama "If you like your healthcare plan you can keep your plan." _E_
#CongratsPeggy! __HTTP__ _E_
Via @CBSNewYork: "@TrumpFerryPoint Opens In The Bronx" __HTTP__ _E_
Thank you to the people of New Hampshire I love you! Now off to South Carolina. _E_
.@cher should spend more time focusing on her family and dying career! _E_
Can it just be new age that Manti Te'o fell in love with a girl he never met or is it a hoax? _E_
Congratulations to the White House. For every 1 ObamaCare enrollment there are 44 cancellation notices. Very unfair! _E_
'As Senator Clinton promised 200000 jobs in Upstate New York her efforts fell flat.' __HTTP__ __HTTP__ _E_
Thanks for all of the nice tweets re Sgt. Tahmooressi. Especially nice that the money will be sent today #VeteransDay. _E_
.@maddow Standing in front of wind turbines is sad. Rachel windmills are terrible for the environment— _E_
Flashback: Donald Trump: $200M plan for Doral __HTTP__ via @ESPNGolf. Trump Doral's @cadillacchamp is one week away! _E_
Today is a day that I've been looking very much forward to ALL YEAR LONG. It is one that you have heard me speak about many times before. Now as President of the United States it is my tremendous honor to finally wish America and the world a very MERRY CHRISTMAS! __HTTP__ _E_
Reports by @CNN that I will be working on The Apprentice during my Presidency even part time are ridiculous & untrue FAKE NEWS! _E_
#SuccessByTrump exclusively available @Macy's has set sale records for fastest selling cologne. Makes a great gift __HTTP__ _E_
Manufacturers' record high optimism reported in the 1st qtr has carried into the 2nd qtr of 2017 via @ShopFloorNAM: __HTTP__ __HTTP__ _E_
Trump Signature mattress is from Serta the best there is! Thanks _E_
The Muslim Brotherhood dictator in Egypt is bad news. He will never be our true ally! _E_
🚨BREAKING🚨: State Department's Kennedy pressured FBI to unclassify Clinton emails: FBI documents __HTTP__ _E_
A world famous testament to architectural excellence @TrumpTowerNY features a 60 ft waterfall __HTTP__ _E_
Don't be afraid of being unique it's like being afraid of your best self. Donald J. Trump __HTTP__ _E_
The Trump Organization is going revolutionize Rio de Janeiro's downtown port area with Trump Towers. Construction begins soon! _E_
On the cover of @TIME Magazine—a great honor! __HTTP__ _E_
.@VP Mike Pence will be speaking at today's #MarchForLife You have our full support! __HTTP__ _E_
Via @Newsmax_Media by Cathy Burke: "Donald Trump on 2016 Bid: On Scale of 1 10 I'm 'Much More Than Five'" __HTTP__ _E_
have enough problems around the world without yet another one. When I am President Russia will respect us far more than they do now and.... _E_
Nasty tactics being used by @BarackObama campaign against @MittRomney. Must stop saying Obama is a nice man he is not! _E_
..and now holds an adjunct professorship at Columbia University. Boudin also received an academic laurel from NYU Law School... _E_
Via HT Politics __HTTP__ _E_
#DrainTheSwamp __HTTP__ _E_
Snowden if you're such a hero then come back home and face justice. In reality you are just another wiseguy traitor. _E_
So sad to hear of the terrorist attack in Egypt. U.S. strongly condemns. I have great... _E_
The $9B that @BarackObama spent in 'Stimulus' for Solar Wind Projects created 910 total jobs costing $9.8M each. __HTTP__ _E_
Lawyers have sent @billmaher demand notice and necessary documentation. _E_
On #PurpleHeartDay💜I thank all the brave men and women who have sacrificed in battle for this GREAT NATION! #USA __HTTP__ _E_
Horrific incident in FL. Praying for all the victims & their families. When will this stop? When will we get tough smart & vigilant? _E_
Comic @sethmeyers21 bombed at University of Texas at Arlington—crowd was dismal as was his performance—I told you so! _E_
Looking forward to @THEGaryBusey's book of Buseyisms ! _E_
Wow. @nfl ratings are down big league. Glad I didn't get the Bills. Rather be lucky than good. _E_
What you get by achieving your goals is not as important as what you become by achieving your goals. Goethe _E_
We pay a disproportionate share of the cost of N.A.T.O. Why? It is time to renegotiate and the time is now! _E_
"The Trumps pay tribute to the late @Joan_Rivers" __HTTP__ via @azcentral _E_
For too many years our inner cities have been left behind. I am going to deliver jobs safety and protection for those in need. _E_
Going on Letterman now let me know what you think how did I do? Here we go! _E_
Matt Harvey @Mets Don't let the @NYDailyNews get you down nobody reads it. Play well. _E_
New home sales reach a 10 year high. Stock Market has more record gains. Hopefully Republican Senators will give us the much needed Tax Cuts to keep it all going! Democrats want big Tax Increases. _E_
Wisconsin's economy is doing poorly and like everywhere else in U.S. jobs are leaving. I will make our economy strong again bring in jobs _E_
Today it was my tremendous honor to visit Marine Helicopter Squadron One (HMX 1) at the Marine Corps Air Facility in Quantico Virginia. I am honored to serve as your Commander in Chief. On behalf of an entire Nation THANK YOU for your sacrifice and service. We love you! __HTTP__ _E_
Looking forward to @VinceMcMahon inducting me into @WWE Hall of Fame this Saturday in @TheGarden. #WWEHOF #WrestleMania _E_
.@AlexSalmond See attached article. Very frightening to people living around these monstrosities __HTTP__ _E_
.@TrumpNewYork's 176 rooms have floor to ceiling windows providing unparalleled views of Central Park & NYC __HTTP__ _E_
Join me in Denver Colorado tonight at 9:30pm: __HTTP__ Scranton Pennsylvania Monday @ 5:30pm: __HTTP__ _E_
See my picks at @Fund_Anything at __HTTP__ and giving away money!!! #FundAnything _E_
Thank you to Governor @ScottWalker for such warm support. Great speech! _E_
Great! __HTTP__ _E_
I still hold the all time attendance and pay per view record at @WWE. _E_
Alabama was great last night amazing people. 30000 folks was largest crowd of political season. Nice! _E_
The Democrats have become nothing but OBSTRUCTIONISTS they have no policies or ideas. All they do is delay and complain.They own ObamaCare! _E_
Donald Trump: If Bill Maher Does Not Pay Off His $5 Million Bet – 'Then I'll Sue Him' __HTTP__ via @gatewaypundit _E_
People often ask me the secret to my success and the answer is simple: passion focus and hard work. Momentum keeps it all going. _E_
Why does @ThisWeekABC w/ @GStephanopoulos allow a hater & racist like @tavissmiley to waste good airtime? @ABC can do much better than him! _E_
The Trump Signature Collection is the best menswear design for young entrepreneurs. Great style & design exclusively available @Macys. _E_
Jennifer is a terrific person. __HTTP__ _E_
Still a great time to buy residential property. The courts are holding up foreclosures. Buy directly from the banks. _E_
Don't forget to tune in tonight at 10 p.m. on NBC for another action packed episode of The Apprentice. __HTTP__ _E_
How will the client react? They've got both Elle Magazine and Chi to please. #sweepstweet _E_
In less than 30 minutes watch the season premiere of @ApprenticeNBC on NBC. _E_
Make sure to tune in to All Star Celebrity @ApprenticeNBC this Sunday at 9PM EST for another round of fireworks and surprises! _E_
I only go on shows that get ratings that's why I do @oreillyfactor @hannityshow and @gretawire. Your sho... (cont) __HTTP__ _E_
Back from Miami where my Cuban/American friends are very happy with what I signed today. Another campaign promise that I did not forget! _E_
In Britain more Muslims join ISIS than join the British army. __HTTP__ _E_
President @EmmanuelMacronThank you for inviting Melania and myself to such a historic celebration in France. #BastilleDay #14juillet __HTTP__ _E_
We enjoy hosting tourists in @TrumpTowerNY. They come from all over the world to see the Atrium a NYC landmark. __HTTP__ _E_
Wow did great in the debate polls (except for @CNN which I don't watch). Thank you! _E_
Obama is angry frustrated and desperate. He said "voting is the best revenge" __HTTP__ He is divisive. _E_
Hence legal documents are being crafted which take me completely out of business operations. The Presidency is a far more important task! _E_
.@FoxNews is so biased it is disgusting. They do not want Trump to win. All negative! _E_
Message to Edward Snowden you're banned from @MissUniverse. Unless you want me to take you back home to face justice! _E_
This was the Republicans election to win but they just blew it reasons why to follow. _E_
Dopey @Lawrence O'Donnell whose unwatchable show is dying in the ratings said that my Apprentice $ numbers were wrong. He is a fool! _E_
Am now in L.A. Will be going to the U.S.S. IOWA at 5:30 P.M. to speak to our great VETERANS and other friends! _E_
It's Tuesday. How many more 'The View' Execs will leak that they want @rosie gone? Show is failing. _E_
American homeownership rate in Q2 2016 was 62.9% lowest rate in 51yrs. WE will bring back the 'American Dream!' __HTTP__ _E_
Look where the world is today a total mess and ISIS is still running around wild. I can fix it fast Hillary has no chance! _E_
I started this campaign to Make America Great Again. That's what I'm going to do. #MAGA #debate _E_
In Tampa Florida thank you to all of our outstanding volunteers who want to #MakeAmericaGreatAgain! __HTTP__ _E_
.@HallieJackson Why didn't you report Hillary lying about the ISIS video. Bad reporting. Perhaps @NBC will do better next year but doubt it! _E_
Big thanks to @David_Bossie @Citizens_United & @AFPhq for hosting me at #NHFreedomSummit. Will be back to the Granite State soon! _E_
Tremendous support (except for some Republican leadership ). Thank you. _E_
Congratulations to @FoxNews for being number one in inauguration ratings. They were many times higher than FAKE NEWS @CNN public is smart! _E_
Great meeting with active & retired law enforcement officers at the Fraternal Order of Police lodge in Akron Ohio. __HTTP__ _E_
Gov. Scott Walker just left my office we had a really wonderful talk. Very interesting! @GovWalker _E_
Sources inside @AGSchneiderman's office are saying that they are very concerned with the allegations against their lightweight boss. _E_
51% of @JonHuntsman's NH voters are satisfied with @BarackObama as president __HTTP__ So is @JonHuntsman! _E_
Excited by my acquisition of Doral Hotel & Country Club in Miami already world class but will soon be The Best. _E_
Crooked Hillary Clinton spent hundreds of millions of dollars more on Presidential Election than I did. Facebook was on her side not mine! _E_
Entrepreneurs: Being stubborn is a big part of being a winner. Don't give in and don't give up! _E_
Jon Stewart @TheDailyShow is a total phony –he should cherish his past—not run from it. _E_
Obama can attend a fundraiser every day but can't be bothered to get briefed on national security. Commander in Chief?! _E_
Just left Florida amazing how well State is doing jobs way up taxes down. Congrats to @FLGovScott _E_
'How Trump Would Stimulate the U.S. Economy' __HTTP__ _E_
New Blog Post: Celebrity Apprentice Finale and Lessons Learned Along the Way: __HTTP__ _E_
It probably was not drugs that caused the San Fran crash but why aren't they testing who knows? _E_
The joke around town is that I freed El Chapo from the Mexican prison because the timing was so good w/ my statements on border security. _E_
I will be interviewed by @IngrahamAngle on @FoxNews at 10:00. Enjoy! _E_
Crazy Election officials saying that there is nothing stopping illegal immigrants from voting. This is very bad (unfair) for Republicans! _E_
Thank you Alabama! #Trump2016#SuperTuesday _E_
Just out @ApprenticeNBC was in first place in all demos during the 10PM hour in the ratings. _E_
Well back to the drawing board! _E_
I will be doing a Town Hall tonight at 10:00 P.M. on @seanhannity @FoxNews _E_
Looking forward to a speedy recovery for George and Barbara Bush both hospitalized. Thank you for your wonderful letter! _E_
I really like Jay Z but there is trouble in paradise. When his wife's sister starts whacking him not good! No help from B leads to a mess. _E_
I truly hope President Obama doesn't do something irrational and dangerous for our country in order to save face. He must sit back and chill _E_
The North Coast of Scotland is spectacular the sea the sand dunes the rolling bluffs we walked the course and it is fantastic. _E_
MUST READ ARTICLE: "Immigration reform could be bonanza for Democrats" __HTTP__ Are the @RNC & @GOP suicidal? _E_
Congratulations to @TrumpDoral for being named one of @LINKSMagazine's Great Destinations: __HTTP__ _E_
For all of those who have been asking about online sales the Donald J. Trump Signature Collection ties & shirts are sold @Macys.com _E_
Small Business Poll has highest approval numbers in the polls history. All business is just at the beginning of something really special! _E_
.@williebosshog such an honor to get your endorsement. You are a fantastic guy! It will not be forgotten. Don and Eric say hello! _E_
America's men & women in uniform is the story of FREEDOM overcoming OPPRESSION the STRONG protecting the WEAK & G... __HTTP__ _E_
Very different styles but each totally effective in his own way at the debate. _E_
Choose your own path: It doesn't have to be the path less traveled...What matters is that it's the right one for you. Vince Lombardi _E_
President Obama's approval rating at 38% is at an all time low. Gee I wonder why? _E_
My best wishes to everyone for a Happy Thanksgiving! _E_
The United States better address China's exchange rate before they steal our country and it is too late! China is laughing at us. _E_
Where the hell is global warming when you need it? _E_
So a woman in Chicago who never had a job has 9 kids with 7 different men (she is one of many). These kids will never work. Trouble! _E_
.@MacMiller's 'Donald Trump' song is at 64.5M views on YouTube __HTTP__ You're welcome Mac! _E_
Rev. Graham made a critical point. @BarackObama has turned a blind eye to the Christians being persecuted in (cont) __HTTP__ _E_
Welcome to Obama's America record high poverty and an 8% drop in median household family income __HTTP__ Four more years? _E_
Businesses have already started massive layoffs and reducing employees' hours due to Obama Care. Reality is setting in. _E_
A Great 4th of July! America a great country who's brightest days with wise leadership lie ahead. _E_
RT @JasonMillerinDC: Is @realDonaldTrump debating Crooked @HillaryClinton or the moderators @AC360 and @MarthaRaddatz? #rattledhillary _E_
In light of Boston immigration legislation will be much harder to get. _E_
You would think a paper like the Washington Post would be fair and objective. For the record almost all polls showed I won all debates. _E_
Hillary Clinton's Campaign Continues To Make False Claims About Foundation Disclosure: __HTTP__ _E_
.@AlexSalmond of Scotland may be the dumbest leader of the free world. I can't imagine that anyone wants him in office. _E_
The dying @UnionLeader newspaper in NH is in turmoil over my comments about them like a bully that got knocked out! _E_
Jeanne Shaheen was the deciding vote for ObamaCare. Premiums have skyrocketed 90% for New Hampshire. Send @SenScottBrown to the Senate! _E_
Kevin Garnett's response to Ray Allen last night was that of a great competitor nothing wrong in fact it was terrific. A champion! _E_
Why is @BarackObama always campaigning or on vacation? _E_
The Trump Doral's @cadillacchamp is Florida's premiere golf tournament. I'll be there! Tickets available here: __HTTP__ _E_
.@hardball_chris must have the lowest IQ on television—now telling people that domestic terrorists are from the right. _E_
RT @DRUDGE_REPORT: 43 39 __HTTP__ _E_
That the Obama administration didn't know the facts about who Bergdahl was before making the stupid 5 killers for one trade is pathetic! _E_
.@MittRomney is 100% right. The US Supreme Court should do the right thing & overturn ObamaCare or the country (cont) __HTTP__ _E_
For years even as a civilian I listened as Republicans pushed the Repeal and Replace of ObamaCare. Now they finally have their chance! _E_
.@pgaofamerica A really great tournament congrats to Monty Pete B and Ted Bishop. FANTASTIC JOB! _E_
Congratulations to Tom Brady on yet another great victory Tom is my friend and a total winner! _E_
In today's #trumpvlog @RepWeiner the Secret Service and Dick Clark..... __HTTP__ _E_
NEVER forget our HEROES held prisoner or who have gone missing in action while serving their country.Proclamation: __HTTP__ __HTTP__ _E_
Save Medicare. Vote for @MittRomney. He will repeal Obamacare on day one. _E_
Deja vu I can remember a time when our embassies were stormed under another failed President. Obama=Carter. _E_
I will be interviewed on @foxandfriends by @ainsleyearhardt starting at 6:00 A.M. Enjoy! _E_
Joining @oreillyfactor from Waukesha Wisconsin now live! Enjoy! _E_
Success requires 100% effort and 100% focus. Nothing less. _E_
What is your thought as to why Obama refused millions for charity and did not show his records and applications? _E_
THANK YOU Clemson South Carolina! #MakeAmericaGreatAgain #SCPrimary __HTTP__ _E_
People rarely succeed unless they have fun in what they are doing. Andrew Carnegie _E_
Important meetings and calls scheduled for today. Military and economy are getting stronger by the day and our enemies know it. #MAGA _E_
Can you believe the worst Mayor in the U.S. & probably the worst Mayor in the history of #NYC @BilldeBlasio just called me a blow hard! _E_
Based on the fact that Ted Cruz was born in Canada and is therefore a natural born Canadian did he borrow unreported loans from C banks? _E_
Just left Family Leadership Summit in Iowa got a standing ovation from many wonderful people. I will be back soon. _E_
I don't believe I have been given any credit by the voters for self funding my campaign the only one. I will keep doing but not worth it! _E_
Bird killing windfarm that I oppose in Aberdeen got delayed by at least two years.@AlexSalmond forced the failing developers to delay! _E_
This Man Is the Most Dangerous Political Operative in America via Bloomberg Politics __HTTP__ _E_
Hillary Clinton is not a change agent just the same old status quo! She is spending a fortune I am spending very little. Close in polls! _E_
Heading to Sioux County Iowa where the crowd is amazing. Dr. Robert Jeffress will make the introduction. Make America Great Again! _E_
Thank you Pennsylvania!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
"The only way to do great work is to love what you do. If you haven't found it yet keep looking. Don't settle." – Steve Jobs _E_
#TBT With the wonderful actor Jack Nicholson __HTTP__ _E_
Join the MOVEMENT! __HTTP__ __HTTP__ _E_
Via @fitsnews: The Donald Trump Show Is Returning To SC: BILLIONAIRE MOGUL HEADS BACK TO PALMETTO STATE __HTTP__ _E_
Now Syria is bombing Iraq and Secy. Kerry after we blew the hell out of the place says please don't do that. Syria is a front for Iran. _E_
Is it possible for @megynkelly to cover anyone but Donald Trump on her terrible show. She totally misrepresents my words and positions! BAD. _E_
Entrepreneurs: Do not view any failure as the final say for your efforts. Learn your lessons quickly then move on. _E_
.@HillaryClinton's Nuclear Agreement Paved The Way For The $400 Million Ransom Payment #DebateNight __HTTP__ _E_
Job numbers today terrible! So what else is new? _E_
I have always done well with properties fronting on oceans lakes and rivers. If something works stay with it. _E_
Sean Spicer is a wonderful person who took tremendous abuse from the Fake News Media but his future is bright! _E_
Via @TheBrodyFile: Iowa Evangelical Leader Says Donald Trump Is Bold And Transparent __HTTP__ _E_
The only reason I bid on @buffalobills was to make sure they stayed in Buffalo where they belong. Mission accomplished. _E_
Don't forget! Sunday night at 9 pm EST on @nbc Celebrity Apprentice is back! Tune in for a great show. @ApprenticeNBC _E_
70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow! _E_
He knows he won't have to spend much: @JonHuntsman has offered to match any donation dollar for dollar. _E_
Alaska had a 200% plus increase in premiums under ObamaCare worst in the country. Deductibles high people angry! Lisa M comes through. _E_
Both Obama administration and House leadership staffs are exempt from ObamaCare. Why not the American people? #MakeDCListen _E_
.@WWE: He's answered the call! @realDonaldTrump responds to @VinceMcMahon's #ALSIceBucketChallenge! __HTTP__ #SmackDownALS _E_
I was recently asked if Crooked Hillary Clinton is going to run in 2020? My answer was I hope so! _E_
Thank you Abingdon Virginia! #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
Knockout assaults are the new rage by sick and depraved youth. We better start getting tough in this country and they want to take our guns! _E_
"I always follow my own instincts but I am not going to kid you: it's also nice to get good reviews." The Art of the Deal _E_
Coming soon to Pennsylvania Avenue __HTTP__ _E_
Results of recovery efforts will speak much louder than complaints by San Juan Mayor. Doing everything we can to help great people of PR! _E_
Wow just heard that that next Tuesday's @saintanselm Politics & Eggs is the largest crowd ever. Looking forward to making new friends. _E_
Joining @SeanHannity tonight at 9pmE on @FoxNews. Enjoy! __HTTP__ _E_
Doing Fox & Friends at 7.00 A.M. ENJOY! _E_
.@BillBratton was a great choice for NYC Police Commissioner. He will make us proud and safe! _E_
Obama's own gun study proves gun control is ineffective __HTTP__ @BIZPACReview _E_
It's 46º (really cold) and snowing in New York on Memorial Day tell the so called scientists that we want global warming right now! _E_
What a rotten deal we made with Iran. We get nothing (except laughter at our stupidity). They get everything including delay and big cash! _E_
The people of Alabama will do the right thing. Doug Jones is Pro Abortion weak on Crime Military and Illegal Immigration Bad for Gun Owners and Veterans and against the WALL. Jones is a Pelosi/Schumer Puppet. Roy Moore will always vote with us. VOTE ROY MOORE! _E_
Congratulations to Justice Neil Gorsuch on his elevation to the United States Supreme Court. A great day for Americ... __HTTP__ _E_
It's Tuesday. How many fundraisers travelling on the taxpayer dime will Obama hold today? _E_
...They should realize that these relationships are a good thing not a bad thing. The U.S. is being respected again. Watch Trade! _E_
.@TMobile You service is absolutely terrible get on the ball! @JohnLegere _E_
.@BarackObama has completely failed the American people. U.S. annual incomes have fallen over 5% during his term __HTTP__ _E_
Waste. With 22 new taxes & $1.8T in added debt @BarackObama's disgraceful 'ObamaCare' will still leave 30M uninsured __HTTP__ _E_
I spell out some of the differences between Ben Carson and myself at 9:00 A.M. on @CNN @jaketapper. Ben is very weak on illegal immigration. _E_
Spent a beautiful weekend golfing at Trump National Golf Club Westchester and Trump National Golf Club Bedminster. _E_
RT @foxandfriends: France vehicle attack leaves at least six soldiers injured __HTTP__ _E_
... to OPEC countries that hate our guts. It's stupid policy." Time To Get Tough _E_
When @crowleyCNN defended Obama on Benghazi in the presidential debate she was defending a complete lie __HTTP__ _E_
I'm at Trump National DC @TrumpGolfDC watching the #2013JuniorPGA championship fantastic young players! @ThePGAofAmerica. _E_
"The way to get started is to quit talking and begin doing." – Walt Disney _E_
With our brand new Tennis Performance Center @TrumpGolfDC offers countless activities along with top courses __HTTP__ _E_
First Titantic sunk on its maiden voyage.Next the Hindenburg explodes on its first flight to America.Now we suffer the ObamaCare rollout! _E_
"President Donald J. Trump Proclaims January 16 2018 as Religious Freedom Day" __HTTP__ _E_
I recorded robo calls for @Perduesenate @leezeldin & @SteveKingIA. All had record wins. #MidasTouch _E_
"Mastering others is strength. Mastering yourself is true power." – Lao Tzu _E_
No surprise with the talk of amnesty in DC illegal immigration is picking up in Arizona __HTTP__ _E_
As I told everyone once before Wiener is a sick puppy who will never change 100% of perverts go back to their ways. Sadly there is no cure _E_
Certain Republicans who have lost to me would rather save face by fighting me than see the U.S.Supreme Court get proper appointments. Sad! _E_
Crooked Hillary's bad judgement forced her to announce that she would go to Charlotte on Saturday to grandstand. Dem pols said no way dumb! _E_
My @foxandfriends interview discussing the 9/11 Trials at Gitmo @MittRomney the job numbers and @CelebApprentice __HTTP__ _E_
Few if any Administrations have done more in just 7 months than the Trump A. Bills passed regulations killed border military ISIS SC! _E_
Great meeting all of you. This group knocked on 50K doors & counting here in Maine thank you! @MaineGOP __HTTP__ _E_
Congratulations to @ABC News for suspending Brian Ross for his horrendously inaccurate and dishonest report on the Russia Russia Russia Witch Hunt. More Networks and "papers" should do the same with their Fake News! _E_
Not only are wind farms disgusting looking but even worse they are bad for people's health __HTTP__ (cont) __HTTP__ _E_
.@Oprah was great amazing that she got Lance Armstrong to totally destroy his life. Why did he ever do that interview? _E_
#ThrowbackThursday #Trump2016 __HTTP__ _E_
Thank you New Hampshire! #FITN#Trump2016 #NHPolitics __HTTP__ _E_
Hey @SnoopDogg @ItstheSituation @SethMacFarlane: Oh I'm real scared. #TrumpRoast airs tonight at 10:30/9:30 on @Comedy Central. _E_
#TBT With @DonaldJTrumpJr almost 35 years ago __HTTP__ _E_
Not only giving out money but Obama will be seen today standing in water and rain like he is a real President don't fall for it. _E_
Great day in Kentucky with Wayne LaPierre Chris Cox & the @NRA! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
A classic China just signs massive oil and gas deal with Russia giving Russia plenty of ammo to continue laughing in U.S. face. _E_
America's top public course @TrumpGolfLA's greens on Palos Verdes Peninsula have been celebrated by @GolfMagazine __HTTP__ _E_
beepee2004 Thank you very much Donald. Here is another. __HTTP__ Thanks both do justice to a fantastic place. _E_
RT @TeamTrump: Mrs. Saucier's son is in prison for having classified info on an unsecured device. @HillaryClinton did FAR WORSE & is runnin... _E_
What message does it send when @BarackObama's campaign has to spin whether America is better off than it was 4 years ago? _E_
and here's another.... __HTTP__ _E_
Can you believe this fool Dr. Thomas Frieden of CDC just stated anyone with fever should be asked if they have been in West Africa DOPE _E_
Bain did not list @MittRomney as an Executive on its website in 2000 __HTTP__ @BarackObama's Saul Alinsky tactics won't work! _E_
Our great country is respected again in Asia. You will see the fruits of our long but successful trip for many years to come! _E_
China's corporate espionage is a continued threat to the American economy. With the right leadership it can be stopped. _E_
I am looking forward to being in New Hampshire tomorrow. The silent majority is taking our country back. We will MAKE AMERICA GREAT AGAIN! _E_
We will no longer be silent. We can take our country back! Let's Make America Great Again! __HTTP__ _E_
Such a great honor. Final debate polls are in and the MOVEMENT wins!#AmericaFirst #MAGA #ImWithYou... __HTTP__ _E_
Unbelievable. __HTTP__ _E_
I refuse to call Megyn Kelly a bimbo because that would not be politically correct. Instead I will only call her a lightweight reporter! _E_
Your higher self is in direct opposition to your comfort zone. Donald J. Trump __HTTP__ _E_
I watched parts of @nbcsnl Saturday Night Live last night. It is a totally one sided biased show nothing funny at all. Equal time for us? _E_
.@LukeDonald You are so good and so talented that I have no doubt you will conquer the 18th hole at the New Blue Monster @DoralResort _E_
The polls have been consistently great. The silent majority is speaking. Politicians are failing. #MakeAmericaGreatAgain! _E_
Dianne Gallagher @DianneG is a great reporter for News Channel 36 in Charlotte NC. Fantastic interview thanks! _E_
I am very proud of my friend @OMAROSA. Despite her recent lossshe gracefully performs in the upcoming All Star @ApprenticeNBC _E_
We must build a great wall between Mexico and the United States! __HTTP__ _E_
Didn't the Boston killer even run over his own brother with a car in order to get away? We are not dealing with an innocent baby here DEATH _E_
David Pecker would be a brilliant choice as CEO of TIME Magazine nobody could bring it back like David! __HTTP__ _E_
Boasting @AAAFiveDiamond & @ForbesInspector 5 Star ratings @TrumpNewYork's @jeangeorges features a superb menu __HTTP__ _E_
The @whitehouse has 'clarified' that the unemployment is actually 8.254% not 8.3% __HTTP__ A little sensitive are we? _E_
Great investor John Paulson just sought bankruptcy protection for a unit of his hedge fund very smart but he didn't go bankrupt you morons! _E_
With a stupid guy like Jonah Goldberg who uses "tweeting like a 14 year old girl" to hit me no wonder the NRO is doing so poorly. @JonahNRO _E_
.@DaveWeigel @WashingtonPost put out a phony photo of an empty arena hours before I arrived @ the venue w/ thousands of people outside on their way in. Real photos now shown as I spoke. Packed house many people unable to get in. Demand apology & retraction from FAKE NEWS WaPo! __HTTP__ _E_
Wow the failing @nytimes has not reported properly on Crooked's FBI release. They are at the back of the pack no longer a credible source _E_
Just as I predicted Iraq is deteriorating into utter chaos __HTTP__ The war was a waste. China is taking all the oil. _E_
Via @bpolitics by @BetBrod "Trump sets an aggressive tone as he insisted he's serious about running for POTUS. __HTTP__ _E_
Looking like a really big night for Republicans a tremendous refutation of President Obama and his failed policies! _E_
.@genesimmons Amazing! Thank you.   __HTTP__ _E_
.@BarbaraJWalters @theviewtv Why did you choose me as one of the 10 Most Fascinating People of the Year last season (and more than once?) _E_
I discuss South Korea in today's all new #TrumpVlog __HTTP__ _E_
It was my great honor to welcome the 2016 World Series Champion Chicago @Cubs to the @WhiteHouse this afternoon.... __HTTP__ _E_
After North Korea missile launch it's more important than ever to fund our gov't & military! Dems shouldn't hold troop funding hostage for amnesty & illegal immigration. I ran on stopping illegal immigration and won big. They can't now threaten a shutdown to get their demands. _E_
"The object of golf is not just to win. It is to play like a gentleman and win." Phil Mickelson @MickelsonHat _E_
The ObamaCare website was hacked. $5B dollars later and the site can't even secure your personal information. _E_
Instead of attacking me Ashish J. Thakkar should worry about the culture of corruption plaguing Uganda __HTTP__ _E_
.@MittRomney and his campaign manager should not be critical of candidates after they blew an election that should never have been lost! _E_
Te'o's imaginary girlfriend is one of the great cons of all time—or he's very stupid. _E_
In the just out @FoxNews Poll I easily beat Hillary Clinton and I havn't even focused on her yet. On our way: MAKE AMERICA GREAT AGAIN! _E_
If you've looked over the yearsI've been right on virtually every issue from Iraq (not going in but if so taking the oil) to jobs to China _E_
Wow Ted Cruz got booed off the stage didn't honor the pledge! I saw his speech two hours early but let him speak anyway. No big deal! _E_
Wind turbines are not only killing millions of birds they are killing the finances & environment of many countries & communities. _E_
As President I wanted to share with Russia (at an openly scheduled W.H. meeting) which I have the absolute right to do facts pertaining.... _E_
"Build confidence starting with small successes that lead to greater and greater successes there is nothing like winning. Think Big _E_
And Trump SoHo New York is one of the hottest new hotels anywhere.... __HTTP__ _E_
I've got news for President Obama: America is not what's wrong with the world. #TimeToGetTough __HTTP__ __HTTP__ _E_
Can you imagine if @BarackObama had passed Cap and Trade?! Energy costs would be double from already record highs. _E_
Entrepreneurs: Pay attention to details. If you don't know everything about what you're doing you'll be in for some big surprises. _E_
No one wants the government to shut down but if ObamaCare is fully implemented then our country will eventually shutdown anyway! _E_
Haters stop saying I went bankrupt it is not so. I never went bankrupt... _E_
Typical @BarackObama's Press Secretary deflects any criticism of Obama's constant celebrity visits by attacking me. My great honor. _E_
A Rod's lawsuit trying to overturn a binding arbitration agreement is going nowhere. He should be banned from spring training. _E_
My book Midas Touch with Robert Kiyosaki (Rich Dad Poor Dad) will be in bookstores tomorrow it's a grea... (cont) __HTTP__ _E_
Heading to New Hampshire. #MakeAmericaGreatAgain __HTTP__ _E_
Bernie Sanders has been treated terribly by the Democrats—both with delegates & otherwise. He should show them & run as an Independent. _E_
"All our dreams can come true if we have the courage to pursue them." – Walt Disney _E_
Congratulations to the winners of the Commander in Chief's Trophy the great Air Force Falcons! Watch:... __HTTP__ _E_
I have great confidence that China will properly deal with North Korea. If they are unable to do so the U.S. with its allies will! U.S.A. _E_
There's never been anyone more abusive to women in politics than Bill Clinton.My words were unfortunate the Clintons' actions were far worse _E_
Answer to your questions I will be voting at 10:30 AM at Lighthouse International 110 East 60th Street Manhattan _E_
#CelebApprentice contestants @DeeSnider and @DebbieGibson joined me for interviews today __HTTP__ _E_
Congrats to @nbc on the success of the new smash show @NBCBlacklist. Fantastic suspense. Great acting. Must see TV! _E_
.@foxandfriends int. on how the Boston thug deserves death penalty @FBI's great work & firing Brande Roderick __HTTP__ _E_
Okay I think I'm going to do it—I'll open the Miss Universe Pageant as Santa tonight at 8 pm on @NBC _E_
Don't forget to watch The Tonight Show with the wonderful @jimmyfallon at 11:30 P.M. You will not be disappointed! @NBC _E_
The Eric Trump Foundation has raised over $1000000 towards St. Jude Children's Research Hospital. __HTTP__ _E_
Today is #VeteransDay. Let us be thankful for our nation's finest who fight at all corners of the earth to protect our freedoms. _E_
Watch my latest appearance on Squawk Box .... __HTTP__ _E_
EXCLUSIVE: Newt Gingrich: 'The Country Is in Rebellion' Trump Can 'Kick Down the Doors' __HTTP__ _E_
Biden's sarcastic smiling may or may not be effective depending on who is watching. #VPDebate _E_
.@Natalie_Gulbis Thank you for the nice piece in @SInow / @Golf_com.Keep up the great work! __HTTP__ _E_
Libya is being taken over by Islamic radicals with @BarackObama's open support. _E_
Heading to Richmond Virginia now. Join me tonight! #Trump2016Tickets: __HTTP__ _E_
Beautiful evening in Kinston North Carolina thank you! Get out and VOTE!! You can watch tonight's rally here:... __HTTP__ _E_
The world economy is under deep stress with growth slowing everywhere. Yet crude is over $87/barrel. Should be $25 at the most. _E_
#CelebrityApprentice Listening to the advice from @johnrich and @marleematlin adds another insight into the Final 4. #sweepstweet _E_
Go to Macy's today and buy Trump ties shirts suits and cufflinks as a Christmas or holiday present.Great style great price! ONLY THE BEST _E_
The Pledge #MakeAmericaGreatAgain __HTTP__ _E_
....Because of the Democrats not being interested in life and safety DACA has now taken a big step backwards. The Dems will threaten "shutdown" but what they are really doing is shutting down our military at a time we need it most. Get smart MAKE AMERICA GREAT AGAIN! _E_
Interesting case from UK re @stellacreasy and abusive troll __HTTP__ _E_
Lyin' Crooked Hillary's email stories all have one thing in common. __HTTP__ _E_
I am in Ireland inspecting my great and very beautiful Atlantic Ocean property. It is one of the most spectacular hotels anywhere! DOONBEG _E_
Whether you like it or not the Russians did a great job in hosting the Olympics! Remember when Obama went to Europe to get Olympics fourth. _E_
We will remain fully engaged w/ open lines of communication as #HurricaneHarvey makes landfall. America is w/ you! @GovAbbott @FEMA @DHSgov __HTTP__ _E_
The worst negotiators in history (otherwise known as Republicans) have just offered to suspend debt ceiling for four months. Pathetic! _E_
Via @thehill by @HenschOnTheHill: "Trump: 'I'm disappointed' in many Republicans" __HTTP__ _E_
Everybody is raving about the Trump Home Mattress by @SertaMattresses. If you are looking for a mattress go buy (cont) __HTTP__ _E_
During primetime of the Iowa Caucus Cruz put out a release that @RealBenCarson was quitting the race and to caucus (or vote) for Cruz. _E_
The Eric Trump Foundation Golf Invitational benefiting St. Jude Children's Research Hospital is today and i... (cont) __HTTP__ _E_
Eventually but at a later date so we can get started early Mexico will be paying in some form for the badly needed border wall. _E_
Will be delivering a major speech tonight live on @oreillyfactor at 8:10pm from Pensacola Florida. _E_
For all of those fools that want to attack Syria the U.S.has lost the vital element of surprise so stupid could be a disaster! _E_
My @Yahoo 'Power Players' interview with @jonkarl Inside Donald Trump's new digs on Pennsylvania Avenue" __HTTP__ _E_
Looking forward to being at the convention tonight to watch all of the wonderful speakers including my wife Melania. Place looks beautiful! _E_
Next time you are waiting in an emergency room remember the Boston killer was rushed to intensive care within minutes of capture. _E_
I will be interviewed by @kimguilfoyle at 7pm on @FoxNews. #Enjoy! _E_
My @showbiztonight interview on @KhloeKardashian @ApprenticeNBC & my surprising TV career __HTTP__ _E_
ICYMI via @DMRegister by @JenniferJJacobs: "Donald Trump to give Iowa speech on education" __HTTP__ _E_
I had a great time in Des Moines Iowa tonight! Thank you for all of the support. #Trump2016 __HTTP__ __HTTP__ _E_
The Muslim brotherhood is sending tanks into the Sinai & saying it doesn't violate Camp David accord. _E_
Praying for everyone in Florida. Hoping the hurricane dissipates but in any event please be careful. _E_
Today is the first day of the rest of your life make the most of it! _E_
The Emmys are sooooo boring! Terrible show. I'm going to watch football! I already know the winners. Good night. _E_
I gave out the Male Athlete of the Year Award last night to my friend @MichaelPhelps—22 Olympic medals—a record that will never be broken. _E_
Debbie Wasserman Schultz is hard to watch or listen to no wonder our country is going to hell! _E_
Let the Arab countries take care of Egypt they have more to gain and plenty of money..It's time for the U.S. to stop being stupid.NO DOLLARS _E_
Today in history WrestleMania 23: I shave @VinceMcMahon's hair highest rated show in WWE history @WrestleFact __HTTP__ _E_
Has Barack Obama been caught red handed laundering money into his campaign from illegal online foreign donations? Media? _E_
My interview with Andy Dean on @americanowradio I told him what I really thought about the @FoxNews debate. __HTTP__ _E_
Obamacare is a disaster. Rates going through the sky ready to explode. I will fix it. Hillary can't!#ObamacareFailed _E_
Late last Friday @BarackObama announced his 2011 budget deficit was $1.299 trillion the second largest in US history. _E_
.@FoxNews should not put @KarlRove on—he has no credibility a bush plant who called all races wrong. _E_
Thank you Ohio! Together we made history – and now the real work begins. America will start winning again!... __HTTP__ _E_
Someone must be fired at @AOL for that stupid deal they made buying Huffington Post. _E_
The failing @NRO National Review Magazine has just been informed by the Republican National Committee that they cannot participate in debate _E_
Hopefully there won't be any problems in Baltimore tonight. Be calm be cool do not let anybody get hurt.There is just too much to live for! _E_
.@TrumpDoral. Thanks for the many nice statements and to the media and golf critics for the great reviews of the brand new BLUE MONSTER! _E_
A nation WITHOUT BORDERS is not a nation at all. We must have a wall. The rule of law matters. Jeb just doesn't get it. _E_
.@RichLowry is truly one of the dumbest of the talking heads he doesn't have a clue! _E_
Those who refuse to draw red line to Iran don't have the moral right to put a red line to @Israel. @IsraeliPM @netanyahu _E_
.@HillaryClinton is on the front page of the @nytimes waving to 200 people in New Hampshire. My crowd next door was 5000 people – no pic! _E_
......@DailyCaller @BreitbartNews @DRUDGE_REPORT & @gatewaypundit. _E_
Congratulations to @thomtillis on winning @NCGOP Senate primary. Time for the party to unite and defeat ObamaCare advocate Kay Hagan! _E_
.@KeithUrban is excellent on American Idol—great touch solid guy! _E_
Just spoke to Governor Rick Scott. We are working closely with law enforcement on the terrible Florida school shooting. _E_
.@TheView T.V. show which is failing so badly that it will soon be taken off thr air is constantly asking me to go on. I TELL THEM NO _E_
All the hotels currently open in the Trump Hotel Collection have been nominated for Travel & Leisure's World's Best Awards 2011 ..... _E_
My interview this morning on Good Morning America with George Stephanopoulos __HTTP__ _E_
Canadians kicked out the firm that the U.S. paid all that money to for the failed website. How stupid are our leaders ? This is a scandal! _E_
North Korea is reliant on China. China could solve this problem easily if they wanted to but they have no respect for our leaders. _E_
While I won't be running for Governor of New York State a race I would have won I have much bigger plans in mind stay tuned will happen! _E_
Teams are making a big mistake not taking Johnny Manziel he is going to be really good (and exciting to watch). _E_
Thank you Sarah Let's have pizza in New York soon with you & your great family __HTTP__ _E_
It was a GREAT day for the United States of America! This is a great plan that is a repeal & replace of ObamaCare.... __HTTP__ _E_
He would be crazy to play in L.A. really bad coach who can't adjust to his players! _E_
Obama planted that @nytimes story on Iran so it will be discussed in tonight's debate. He wants Libya and China off the table. _E_
Great poll numbers just coming out of New Hampshire. BIG lead for Trump according to @CNN! _E_
Michele Bachmann got less than 1200 more votes in the Caucus than she did in the Ames Straw Poll. Very sad for her a nice woman! _E_
Checking out the course at TNGC Westchester and it is fantastic. Should be a great season. __HTTP__ _E_
RT @foxandfriends: Sen. John McCain making his return to the Senate ahead of health care vote __HTTP__ _E_
Congrats to @BreitbartNews' @mboyle1 on being awarded the prestigious 'Eagle Award for Amnesty Reporting' __HTTP__ _E_
I was not scheduled to be on the @oreillyfactor. Pure fiction! _E_
Via @AmSpec BY Jeffrey Lord: "Donald Trump was right on Ebola" __HTTP__ _E_
We can't destroy the competitiveness of our factories in order to prepare for nonexistent global warming. China is thrilled with us! _E_
Dopey Sugar @Lord_Sugar The wind turbines are ruining the beauty & majesty of Scotland... _E_
Thank you American Legion Post 610 for hosting @Mike_Pence & I for a roundtable with labor leaders. #LaborDay #MAGA __HTTP__ _E_
GOP now viewed more favorably than Dems in Trump era (per NBC/WSJ poll) via @HotlineJosh: __HTTP__ _E_
I have founded and run one of the largest real estate empires in the world. I employ thousands of people. Why am I the enemy? _E_
If the UN unilaterally grants the Palestinians statehood then the US should cut off all its funding. Actions have consequences. _E_
Be a cautious optimist. Call it positive thinking with a lot of reality checks. _E_
Republicans gave Obama a free pass to the White House they just don't get it. _E_
.@FLGovScott: Amazing race tremendous courage you deserved this win for a very old fashioned reason you have been a great governor! _E_
More thoughts on the debt ceiling in today's #trumpvlog... __HTTP__ _E_
Bill Clinton did a great job last night the Democrats are lucky to have him. Do you really believe he likes @BarackObama? _E_
RT @KatiePavlich: Your boss pardoned a traitor who gave U.S. enemies state secrets he also pardoned a terrorist who killed Americans. Spar... _E_
Even the once great Caesars is bankrupt in A.C. Others to follow. Ask the Democrat City Council what happened to Atlantic City. _E_
I will be on @SeanHannity @FoxNews tonight at 10pmE w/ @MELANIATRUMP from Wisconsin. Enjoy! #WIPrimary #Trump2016 __HTTP__ _E_
.@eagles should sit Michael Vick. He is a great athlete but less than average quarterback. _E_
"No government ever voluntarily reduces itself in size. So governments' programs once launched never disappear." – Ronald Reagan _E_
We need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_
Crooked's stop in Johnstown Pennsylvania where jobs have been absolutely decimated by dumb politicians drew less than 200 with Bill VP _E_
I will be doing the A.L.S. Ice Bucket Challenge this morning on twitter. It is not something I look forward to doing but is for a good cause _E_
ISIS is starting its own currency. May be stronger than the dollar if ObamaCare is fully implemented. _E_
RT @transition2017: President elect Trump announces selections for Attorney General National Security Advisor CIA Director. More here: ht... _E_
I picked seven Super Bowl winners in a row & would have been right last night had the refs thrown the flag. _E_
Congrats to @Team_Mitch on winning a spirited primary. Great job Mitch. _E_
Vera Coking saved me "mucho" money by turning down my offer—thanks Vera! _E_
Together we can save American JOBS American LIVES and AMERICAN FUTURES! #Debates __HTTP__ _E_
. #RepMikeKelly Great job on @foxandfriends this morning. Thank you for the nice words! _E_
RT @DRUDGE_REPORT: REUTERS POLL: CLINTON TRUMP ALL TIED UP... __HTTP__ _E_
We are going to have a wild time in Alabama tonight! Finally the silent majority is back! __HTTP__ _E_
.@Team_Mitch Congratulations Mitch! _E_
I'll be playing golf tomorrow in Palm Beach at the number one rated golf course in the State of Florida Trump International Golf Club. _E_
.@ESPN's apology(Brent Musburger) was a disgrace to broadcasting stop being so politically correct! _E_
Don't let the FAKE NEWS tell you that there is big infighting in the Trump Admin. We are getting along great and getting major things done! _E_
It was great being in Michigan. Remember I am the only presidential candidate who will bring jobs back to the U.S.and protect car industry! _E_
Crooked Hillary Clinton has not held a news conference in more than 7 months. Her record is so bad she is unable to answer tough questions! _E_
.@AP is doing very badly. I can say from experience their reporting is terrible & highly inaccurate. Sadly they are now irrelevant! _E_
Congratulations @TrumpNewYork for being named in @CNTraveler's Top 10 US Hotels for Business Travelers! __HTTP__ _E_
RT @foxandfriends: Report accuses material James Comey leaked to a friend contained top secret information __HTTP__ _E_
Video game violence & glorification must be stopped—it is creating monsters! _E_
Great even in SC tonight! Fire Marshall would not let everyone in 5000 turned away. Thank you for coming! _E_
In today's #trumpvlog I talk about how well Will Smith handled the situation with the reporter __HTTP__ _E_
Big wins in West Virginia and Nebraska. Get ready for November Crooked Hillary who is looking very bad against Crazy Bernie will lose! _E_
.@MittRomney's entire life and career have built prosperity and growth. _E_
Explain how the women on The View which is a total disaster since the great Barbara Walters left ever got their jobs. @abc is wasting time _E_
Thank you @ASavageNation and keep up the great work! _E_
Based on John Sweeney's lousy reputation we are airing large parts of the interview that were not shown enjoy! __HTTP__ _E_
Why would @greta use @KarlRove as an election analyst when he has made so many mistakes. He still thinks Romney won. An establishment dope! _E_
#CelebApprentice We had lots of fun last night with the live tweeting so I will do it again tonight from 8 10pm. _E_
My great honor to join our incredible men and women of the @USCG at the Lake Worth Inlet Station in Riviera Beach Florida today!#HappyThanksgiving __HTTP__ _E_
Big speech tonight in South Carolina 7:00 P.M. Tremendous crowd! _E_
Republicans must stop listening to dopes like @KarlRove who still insists Mitt Romney won the last election. Think big & think strong! _E_
Obama did much better than he did last time but still lost decisively. _E_
Why is @BarackObama spending millions to try and hide his records? He is the least transparent President ever and he ran on transparency. _E_
Entrepreneurs: Realize that becoming an entrepreneur is not a group effort. You're in charge. Everything starts with you. _E_
Lying Cruz put out a statement "Trump & Rubio are w/Obama on gay marriage. Cruz is the worst liar crazy or very dishonest. Perhaps all 3? _E_
Wacky Congresswoman Wilson is the gift that keeps on giving for the Republican Party a disaster for Dems. You watch her in action & vote R! _E_
Getting ready to leave @TrumpDoral and the brand new Blue Monster course it's unbelievable! _E_
#CrookedHillary is outspending me by a combined 31 to 1 in Florida Ohio & Pennsylvania. I haven't started yet! __HTTP__ _E_
Before Star Jones begged me to put her on The Apprentice she was "professionally dead." I saved her tiny... __HTTP__ _E_
"Yesterday's home runs don't win today's games." – Babe Ruth _E_
Via @businessinsider by @hunterw: "TRUMP UNLOADS: Hillary Clinton was 'the worst' and is 'extremely bad'" __HTTP__ _E_
A day after Greece burned @BarackObama released a $3.8 Trillion budget for 2013 with a $900 Billion deficit.He will turn America into Greece _E_
I've had enough of this good night! _E_
Biggest story today between Clapper & Yates is on surveillance. Why doesn't the media report on this? #FakeNews! _E_
Who ever heard of a legal conviction statement "more probable than not" against Tom Brady? Sue them Tom and make lots of $. @nfl _E_
.@ShawnJohnson have a great Easter you are a real champion! _E_
Glad to see my interview with Ronald Kessler @Newsmax_Media. Hopefully the @GOP can get the message. _E_
This is the single greatest witch hunt of a politician in American history! _E_
RT @DanScavino: WE LOVE OUR DEPLORABLES!!!#TrumpTrain #Debates2016 __HTTP__ _E_
State Treasurer John Kennedy is my choice for US Senator from Louisiana. Early voting today election next Saturday. _E_
Here is another CNN lie. The Clinton News Network is losing all credibility. I'm not watching it much anymore. __HTTP__ _E_
We will MAKE AMERICA SAFE & GREAT AGAIN! #Trump2016 #VoteTrumpSC __HTTP__ __HTTP__ _E_
Watch Miss USA 2013 Sunday night at 9 PM ET. Live from Planet Hollywood Las Vegas. __HTTP__ _E_
Why did Mitt Romney BEG me for my endorsement four years ago? _E_
Happy #Hanukkah __HTTP__ _E_
Via @nbc6: "@MissUniverse Pageant Coming to @TrumpDoral in 2015" __HTTP__ _E_
How is @VanityFair editor Graydon Carter allowed to run bad food restaurant Beatrice Inn? Fire Graydon! _E_
.@drmoore Russell Moore is truly a terrible representative of Evangelicals and all of the good they stand for. A nasty guy with no heart! _E_
My @CENTURY21 Super Bowl commercial __HTTP__ which aired during the third quarter. _E_
Entrepreneurs: Money is not always the bottom line: it can be a score card not the final score. _E_
.@TrumpLasVegas is Sin City's most elite destination. Treat yourself to Vegas' most luxurious hotel rooms __HTTP__ _E_
What a surprise! Newly released audit proves that the IRS only targeted Tea Party groups __HTTP__ _E_
Clinton commented in Ohio today that @MittRomney is right the economy has not been fixed under Obama.I always said Bill was an honest man. _E_
"DONALD TRUMP TO BILL MAHER: PAY UP" __HTTP__ via @BreitbartNews _E_
If @megynkelly stopped covering me on her show her ratings would drop like a rock! My h to h interview with @AC360 beat her by millions! _E_
Republicans must stop relying on losers like @KarlRove if they want to start winning presidential elections. Be tough and get smart! _E_
I have a lot of @Apple stock and I miss Steve Jobs. Tim Cook must immediately increase the size of the screen... __HTTP__ _E_
Best of luck to my good friend Derek Jeter on his first game today back at shortstop. @Yankees Captain is a warrior & winner. _E_
ALso coming up: The Celebrity Apprentice returns. Sunday night March 6 at 9 pm EST __HTTP__ apprentice/ _E_
Just finished the wonderful event on the U.S.S. Iowa. VETERANS FOR A STRONG AMERICA endorsed me. Such a great honor thank you! _E_
I will be doing The Howard Stern Show at 7 a.m. (10 minutes). Always fun and interesting talking to Howard! _E_
Remember when @ariannahuff ran for Governor of California. She got 3 votes. _E_
Via Huffington Post Congrats America! Donald Trump Is Now A 2016 Presidential Front runner __HTTP__ by Igor Bobic _E_
The object of golf is not just to win. It is to play like a gentleman and win. Phil Mickelson _E_
Had a special visitor in my office yesterday for @TIME photo shoot. __HTTP__ _E_
.@GOP's election loss and failed negotiations will serve as a case study in how third parties come about. _E_
New York Magazine just named the most influential tweeters in N.Y. and one Donald Trump was #2 after ESPN. Actually I'm easily #1! _E_
The worst employee in today's #trumpvlog... __HTTP__ _E_
Why would anyone in Florida vote for lightweight Senator Marco Rubio. Check out his credit card scam his house sale & his no show voting! _E_
Myrtle Beach South Carolina #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Why are armed drones being released over our homeland by the Government? __HTTP__ Seems excessive. _E_
Ring in 2015 in downtown New York's most elite 5 Star hotel. @TrumpSoHo offers 46 luxurious stories of excellence __HTTP__ _E_
It makes me feel so good to hit sleazebags back much better than seeing a psychiatrist (which I never have!) _E_
Our online campaign store is officially open! Visit __HTTP__ to shop the latest #MakeAmericaGreatAgain merchandise. _E_
.@janinegibson __HTTP__ _E_
It is snowing in Jerusalem and across Lebanon. Global warming! _E_
Congratulations to Jeb Hensarling & Republicans on successful House vote to repeal major parts of the 2010 Dodd Frank financial law. GROWTH! _E_
Every day St. Maarten loses vital tourism dollars due to the incompetence of PM Sarah Westcot Williams. @PrimeMinisterSX _E_
RT @PressSec: The Trump effect: "The U.S. economy is running at its full potential for the first time in a decade" WSJ __HTTP__ _E_
Congratulations to @jdickerson of Face the Nation on his highest ratings in 15 years. 4.6 million people watched my interview! Thank you! _E_
Looking forward to keynoting @ChesterfieldGOP Lincoln Reagan Gala this Friday at The Country Club at The Highlands. Sold out record crowd! _E_
Just released that international gangs are all over our cities. This will end when I am President! _E_
Realize that being an entrepreneur is not a group effort. You're in charge. Everything starts with you. _E_
Champion @bretmichaels is back competing in the upcoming All Star @CelebApprentice. Premiere is March 3rd on @NBC at 9 p.m. EST. _E_
.@BBC should never have played that piece of garbage documentary & yet the phones are ringing off the hook to play the course. _E_
Trump National Golf Club Washington D.C. is located on 600 acres and fronts the Potomac River. Spectacular! __HTTP__ _E_
I can't believe that @CNN would waste time and money with @smerconish he has got nothing going. Jeff Zucker must be losing his touch! _E_
More Anti Catholic Emails From Team Clinton: __HTTP__ __HTTP__ _E_
.@hardball_chris' very small audience is shrinking rapidly because people finally understand that he is very very dumb! _E_
Thank you for the endorsement Coach Bobby Knight! I will never forget it! __HTTP__ __HTTP__ _E_
My Administration has identified three major priorities for creating a safe modern and lawful immigration system: fully securing the border ending chain migration and canceling the visa lottery. Congress must secure the immigration system and protect Americans. __HTTP__ _E_
Don't believe the millions of dollars of phony television ads by lightweight Rubio and the R establishment. Dishonest people! _E_
Be a yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_
Obama has exempted businesses his staff and all of Congress from ObamaCare. Why is he still forcing the monstrosity on the U.S.? _E_
Thank you Buffalo! #NYPrimary __HTTP__ __HTTP__ _E_
.@GovMikeHuckabee was great the other night. People love him. _E_
People ask me what I do in my free time. The answer I don't have any. _E_
I love you Arizona! Thank you!#Trump2016 #AmericaFirst __HTTP__ _E_
You are doing a great job the world is watching! Be safe. __HTTP__ _E_
It's amazing how badly the Knicks and Nets are playing. Everybody predicted they would be top teams with all of the money spent. Too bad! _E_
It all begins today WE WILL FINALLY TAKE OUR COUNTRY BACK AND MAKE AMERICA GREAT AGAIN! _E_
.@TrumpLasVegas' 7th floor provides the most urbane feel in Las Vegas w/private air conditioned cabanas & a massive 110 ft. heated pool. _E_
#CelebApprentice stay tuned for the 2nd half we have one more firing tonight! _E_
Now I know that Yahoo is in good hands. It took great courage for @marissamayer to take away the right of employees to work at home. _E_
.@KatrinaCampins You were absolutely great on @CNN! Thank you. _E_
3rd rate writer Vicky Ward who begged me for help see her letters to me. __HTTP__ _E_
Watch @ FoxNews' @ShannonBream @LisWiehl & former prosecutor Doug Burns destroy ridiculous lawsuit __HTTP__ _E_
You can watch 360 video live from the podium! __HTTP__ #RNCinCLE #TrumpIsWithYou #MakeAmericaGreatAgain _E_
Great to see @RedSox win big yesterday. Good for Boston and the country. Yesterday we were all @RedSox fans. _E_
If my offer is refused every undecided OH voter will be fully aware that Obama denied $5M to charity all because he is hiding something! _E_
Many political pundits are using the term Art of the Deal .... they should thank me. That is my term and book title. _E_
The sad truth is some Republicans in Congress are clueless when it comes to negotiation. #TimeToGetTough _E_
Good morning America! Thank you for all of your support in the latest Drudge poll! __HTTP__ __HTTP__ _E_
Will be interviewed by @SarahPalinUSA tonight at 10:00 on OAN Network. Enjoy! _E_
Wow the ratings are in and Arnold Schwarzenegger got swamped (or destroyed) by comparison to the ratings machine DJT. So much for.... _E_
I have nothing to do with Atlantic City sold years ago (great timing). For losers and haters I NEVER went bankrupt. Plus $10 billion sorry _E_
Don't forget to tune in tonight for another exciting episode of The Apprentice 10 p.m. on NBC. _E_
.@davidaxelrod I'm sending you a check to help find a cure. @IvankaTrump says hi. _E_
Like Al Sharpton @DonnyDeutsch apologized to me for calling me a racist on @todayshow apology accepted! _E_
Let this be the day you go for your dream. Focus don't give up and only accept total and complete victory. You can do it! _E_
Can you imagine how embarrassing it would have been for the country if the candidates actually did get into a fist fight? _E_
Texas Georgia & many more VOTE EARLY! This is a movement!#Trump2016 VOTE VIDEO: __HTTP__ __HTTP__ _E_
Donald Trump Sends @FallonTonight to Highest Friday Rating in 18 Months. @JimmyFallon that is #HUGE! __HTTP__ _E_
The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_
I'm with YOU! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
Hopefully the Republican National Committee can straighten out the total mess that is taking place in Virginia's Republican Party. FAST! _E_
Watch Eric at 9 am (EST) today on Fox 5 w/ @rosannascotto and David Price to discuss Eric Trump Foundation's $20 million donation to St Jude _E_
Bill Hemmer of @FoxNews was very nice in explaining the excitement and energy in the arena. More than in past years. _E_
Sorry folks got to go to work now but I'll be baaaaack ! _E_
I would like to thank Reince Priebus for his service and dedication to his country. We accomplished a lot together and I am proud of him! _E_
"Trump the orator outlines the greatness of America to Democrats' disgust" __HTTP__ _E_
RT @VP: Our President is choosing to put American jobs American consumers American energy and American industry first. __HTTP__ _E_
Congrats to @msnbc for firing Martin Bashir—don't feel badly he didn't get ratings anyway. @SarahPalinUSA _E_
While the Pres. of Iran tweets sweet nothings to Obama he forbids the Iranians to use twitter. Very revealing. _E_
"Don't fight the problem decide it." – General George C. Marshall _E_
Too busy playing golf? @BarackObama sends form letters with an electronic signature to the parents of fallen SEALs __HTTP__ _E_
Congratulations to @Graeme_McDowell and @kristinstape. Your baby has seriously good genes will be a champ! _E_
Nobody should be allowed to burn the American flag if they do there must be consequences perhaps loss of citizenship or year in jail! _E_
How does @michellemalkin get a conservative platform? She is a dummy just look at her past. _E_
of position. Then separately she stated He said something truly horrifying ... he refused to say that he would respect the results of _E_
Go to __HTTP__ to help my friend Scott Brown take back our Senate. _E_
Such an honor to have my good friend Israel PM @Netanyahu join us w/ his delegation in NYC this afternoon. #UNGA __HTTP__ __HTTP__ _E_
RT @RightlyNews: @realDonaldTrump @LouDobbs Trust in the media is at the lowest level in all of U.S. history. The American people see throu... _E_
Thank you America! #MAGA __HTTP__ __HTTP__ _E_
Isn't it interesting that the tragedy in Paris took place in one of the toughest gun control countries in the world? _E_
My thoughts condolences and prayers to the victims and families of the New York City terrorist attack. God and your country are with you! _E_
Obama wanted to meet with the Iranian president yet the Iranians denied the request. So much for Hope & Change. _E_
Dear @MaraLiasson I greatly appreciate your fairness. My history shows I never disappoint. Looking forward to meeting you soon. _E_
The Republicans owe an apology for blowing the 2012 election. How could they lose to Obama?! _E_
He @BarackObama is caught on tape making election promises to @MedvedevRussiaE on missile defense and national security __HTTP__ _E_
'Clinton Ally Aided Campaign of FBI Official's Wife' __HTTP__ _E_
We must build a wall to secure our border. It will save lives and help Make America Great Again! __HTTP__ _E_
Via @globegazette by John Skipper: North Iowan says Trump serious about POTUS run but he'll have to prove it __HTTP__ _E_
The NFL has just barred ball carriers from using helmet as contact. What is happening to the sport? The beginning of the end. _E_
We've all wondered how Hillary avoided prosecution for her email scheme. Wikileaks may have found the answer. Obama! __HTTP__ _E_
.@TrumpChicago's Spa at Trump® offers 12 treatment rooms & 53 spa guestrooms overlooking the Chicago skyline __HTTP__ _E_
Congrats to R. Emmett Tyrrell Jr of @AmSpec for the fantastic piece on Benghazi. _E_
ObamaCare's tax credit is underperforming by over 95% creating an even bigger cost to the debt __HTTP__ It must be repealed! _E_
The truth is a beautiful weapon. __HTTP__ _E_
New book by @ericbolling is absolutely terrific and a must read! #WakeUpAmerica _E_
The Republicans are funding ObamaCare and Amnesty. Obama beats them. __HTTP__ _E_
Providing backstage commentary at the Miss USA Pageant will be comedic mother daughter duo Joan and Melissa Rivers. A fantastic lineup! _E_
.@CNN & @CNNPolitics Please thank Alisyn Camerota David Chalian and John King for the very professional reporting of the new CNN Poll. _E_
People are always asking me about the very special word CONFIDENCE. The fact is there is (almost) nothing like it. Is derived from winning! _E_
.@FLOTUS Melania and I were honored to welcome Argentina President @MauricioMacri and First Lady Juliana Awada to t... __HTTP__ _E_
I hope Republican Senators will vote for Graham Cassidy and fulfill their promise to Repeal & Replace ObamaCare. Money direct to States! _E_
Purchase your copy of CRIPPLED AMERICA now & be on potential call list for my live streaming signing event tonite. __HTTP__ _E_
Wishing you and yours a very Happy and Bountiful Thanksgiving! _E_
A nation that cannot control its borders is not a nation. President Ronald Reagan _E_
Any negative polls are fake news just like the CNN ABC NBC polls in the election. Sorry people want border security and extreme vetting. _E_
Great crowd in Fletcher North Carolina thank you! Heading to Johnstown Pennsylvania now! Get out on November 8th... __HTTP__ _E_
Little Michael Bloomberg who never had the guts to run for president knows nothing about me. His last term as Mayor was a disaster! _E_
The class warfare being played by @BarackObama is the only way he can get reelected. He can't have America focus on his horrendous record. _E_
The liberal clown @ariannahuff told her minions at the money losing @HuffingtonPost to cover me as enterainment. I am #1 in Huff Post Poll. _E_
Great poll! Thank you North Carolina! #VoteTrumpNC on 3/15!Trump 36%Cruz 18%Rubio 18%Carson 10%Kasich 7%Via @SurveyUSA _E_
Wow Crooked Hillary was duped and used by my worst Miss U. Hillary floated her as an angel without checking her past which is terrible! _E_
Have you been to the @TrumpGrill in the Trump Tower Atrium? Best meatloaf in the City my mother's famous recipe. 212.836.3249 _E_
Enjoyed watching @ericbolling $ @SarahPalinUSA's @FoxNews special #PainatthePump over the weekend. (cont) __HTTP__ _E_
Those who believe in tight border security stopping illegal immigration & SMART trade deals w/other countries should boycott @Macys. _E_
My @morning_joe int. w/@morningmika @JoeNBC & @ThomasARoberts f/@trumpdoral on why Romney shouldn't be @GOP nominee __HTTP__ _E_
Persistence is a key for success. Don't give up. Continue to Think Big and you will be able to close deals. _E_
A very interesting take from @KatiePavlich: __HTTP__ _E_
Top Clinton Aides Bemoan Campaign 'All Tactics' No Vision: __HTTP__ _E_
The Washington Establishment will never rein in government spending waste fraud and abuse. A great thinker and outsider is needed. _E_
Thank you @JebBush you finally get it! __HTTP__ _E_
I will be on @SpecialReport with @BretBaier tonight at 6PM. __HTTP__ _E_
While @BarackObama is obsessed with 'green collar jobs' blue collar workers aren't buying it. (cont) __HTTP__ _E_
Spoke to President Xi of China to congratulate him on his extraordinary elevation. Also discussed NoKo & trade two very important subjects! _E_
From rags to riches and back to rags! __HTTP__ _E_
.@DianneG @WCNC To the "news bigs" elevate Dianne Gallagher immediately—she is terrific! _E_
The Intelligence briefing on so called Russian hacking was delayed until Friday perhaps more time needed to build a case. Very strange! _E_
Americans may no longer have access to their family doctors because of Obamacare. __HTTP__ via @Newsmax_Media _E_
Amazon is doing great damage to tax paying retailers. Towns cities and states throughout the U.S. are being hurt many jobs being lost! _E_
Thanks to @BarackObama rejecting the Keystone XL pipeline China has become Canada's biggest oil consumer. China is laughing at us! _E_
.@andersoncooper Anderson—Thank you for being so fair with your reporting & story last night. Greatly appreciated! _E_
.@nbc has increased @ApprenticeNBC to 2 hours until the end of the season full 2 hour episodes starting at 9 PM EST _E_
.@danabrams Dan of course stories on me do well. Glad you have found a medium you can actual do well on. TV was not your forte. _E_
Keep the big picture in mind. There are always opportunites & possibilities & thinking too small can negate a lot of them. _E_
I will be LIVE tweeting tomorrow (MONDAY) nights TWO shows starting at 8:00 P.M. They are both great. _E_
RT @seanhannity: Watch: Donald Trump OWNS A Heckler Who Said Illegal Immigrants Are The Backbone Of America __HTTP__ _E_
Many people are equating BREXIT and what is going on in Great Britain with what is happening in the U.S. People want their country back! _E_
Entrepreneurs: Problems are a mind exercise. Enjoy the challenge. _E_
I will be on @marklevinshow at 8PM tonight. Tune in! _E_
The #WomenWhoWork campaign from @IvankaTrump __HTTP__ ... _E_
Reading @nytdavidbrooks of the NY Times is a total waste of time he is a clown with no awareness of the world around him dummy! _E_
I dream for a living. Steven Spielberg _E_
Many people have been asking me to answer questions. You can ask me questions at any time. #TrumpQandA _E_
If @HillaryClinton is president she'll be all talk and nothing will get done. #Debate #BigLeagueTruth _E_
Straighten out The Republican Party of Virginia before it is too late. Stupid! RNC _E_
The new edition of The Apprentice will be on Thursdays this fall at 10 pm ET I'm putting people back to work! _E_
.@megynkelly Sorry there was only one breakout star this weekend in New Hampshire. Just check out the local New Hampshire media! _E_
Everybody that loves the people of New York and all they have been thru should get hypocrites like Ted Cruz out of politics! _E_
Loved being with my many friends in Tennessee. The crowd and enthusiasm was fantastic. I won the straw poll big! _E_
Republican Senators are working very hard to get there with no help from the Democrats. Not easy! Perhaps just let OCare crash & burn! _E_
I am going to expand the definition of LOBBYIST so we close all the LOOPHOLES! #DrainTheSwamp __HTTP__ _E_
Wow the Supreme Court passed @ObamaCare. I guess @JusticeRoberts wanted to be a part of Georgetown society more than anyone knew. _E_
Wow I am ahead of the field with Evangelicals (am so proud of this) and virtually every other group and Ben Carson just took a swipe at me _E_
Looking forward to touring the @sigsauerinc world headquarters tomorrow! One of the top gun manufacturers in the US! #GunRights #TCOT _E_
Oppressive regimes cannot endure forever and the day will come when the Iranian people will face a choice. The world is watching! __HTTP__ _E_
With President Obama it's all talk and no action. Our country is in desperate need of smart and decisive leadership before it is too late! _E_
Many people in our Country are asking what the "Justice" Department is going to do about the fact that totally Crooked Hillary AFTER receiving a subpoena from the United States Congress deleted and "acid washed" 33000 Emails? No justice! _E_
To the three UCLA basketball players I say: You're welcome go out and give a big Thank You to President Xi Jinping of China who made..... _E_
While on FAKE NEWS @CNN Bernie Sanders was cut off for using the term fake news to describe the network. They said technical difficulties! _E_
Rick Perry is right when he says we must stand by Israel in the UN. _E_
If China didn't play games with its currency and we played on a level economic playing field we could easily (cont) __HTTP__ _E_
Part of Obama's new found confidence is that the Republicans aren't using their power of ideas properly or effectively. _E_
My interview last night with Greta on Fox News __HTTP__ _E_
Thank you Arizona! #VoteTrump __HTTP__ _E_
The con artists changed the name from GLOBAL WARMING to CLIMATE CHANGE when GLOBAL WARMING was no longer working and credibility was lost! _E_
I predict that dying @UnionLeader newspaper which has been run into the ground by publisher Stinky Joe McQuaid will be dead in 2 years! _E_
Hillary Clinton: 'Architect of failure'#DrainTheSwamp #CrookedHillary __HTTP__ _E_
Congress must pass a budget and hold Obama to it. No more continuing resolutions and no more excuses. Republicans soon hold both houses. _E_
I am in Kansas. Will be an exciting day. Big speech this morning in Wichita and then go to caucus. Sorry CPAC (the format was fine!). _E_
The Time Magazine list of the 100 Most Influential People is a joke and stunt of a magazine that will like Newsweeksoon be dead. Bad list! _E_
Via @BBCNews Trump begins renewables mission in Scotland __HTTP__ _E_
What will be the response on Wednesday? If Obama doesn't take the 5 million dollars for charity. _E_
Join me in Pittsburgh tonight at 7pmE! #Trump2016 #TrumpTrainTickets: __HTTP__ _E_
Look for good ideas outside of your own areas of expertise. Find innovations approaches and practices that you could adapt in your field. _E_
Terrible jobs report just reported. Only 38000 jobs added. Bombshell! _E_
The military threat from China is gigantic and it's no surprise that the Communist Chinese government lies (cont) __HTTP__ _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
#MakeAmericaGreatAgain __HTTP__ _E_
Captured or not all our soldiers are heroes! _E_
I employ many people in the State of Virginia JOBS JOBS JOBS! Crooked Hillary will sell us out just like her husband did with NAFTA. _E_
Despite previous tweet Dennis Rodman would do a better job than the current (cont) __HTTP__ _E_
Will the Benghazi terrorist use the videotape as a defense? If so will Obama apologize to him? _E_
In some ways it is sad. We all wanted @BarackObama to succeed. It's not worked out that way. _E_
My two sons Eric & Don have long been expert hunters & marksmen @NRA. They go on safaris & give animals to the poor & starving villagers! _E_
It is time to get out of Afghanistan. We are building roads and schools for people that hate us. It is not in our national interests. _E_
...What about all of the Clinton ties to Russia including Podesta Company Uranium deal Russian Reset big dollar speeches etc. _E_
After Friday's Twilight release I hope Robert Pattinson will not be seen in public with Kristen she will cheat on him again! _E_
I hear @glennbeck is in big trouble. Unlike me his viewers & ratings are way down & he has become irrelevant—glad I didn't do his show. _E_
.@BillGates and @JimBrownNFL32 in my Trump Tower office yesterday two great guys! __HTTP__ _E_
My interview w/ @nbc6 re: @CadillacChamp & my $200M of future renovations invested in Trump @DoralResort __HTTP__ _E_
SCARY $6T in debt and $1T annual budget deficits later @BarackObama is asking for more time to fix the economy __HTTP__ _E_
Weakness of attitude becomes weakness of character. Albert Einstein _E_
Q/A @saychowder I receive a great many requests for interviews nationally and internationally. _E_
Thank you @CrainsChicago for featuring @TrumpChicago in your list of Best Private Dining Rooms in Chicago. __HTTP__ _E_
...Those stupid people bought @mcuban's company (of which he owned a piece). _E_
The travel ban into the United States should be far larger tougher and more specific but stupidly that would not be politically correct! _E_
I consider my health stamina and strength one of my greatest assets.The world has watched me for many years and can so testify great genes! _E_
Comey lost the confidence of almost everyone in Washington Republican and Democrat alike. When things calm down they will be thanking me! _E_
Why would Obama ever nominate someone for Sec. of Defense who opposes sanctions against Iran when Obama claims to support them? _E_
The @MittRomney healthcare plan post ObamaCare relies on consumer choices with more options __HTTP__ The perfect remedy! _E_
The sexual abuse that is so rampant has according to generals greatly weakened our military. They have failed to stop it. _E_
What do you think of @DennisRodman's Donald Trump head? The hair's not quite right for one thing. #CelebApprentice _E_
This is an incredible MOVEMENT WE are going to take our country BACK! #November8th #BigLeagueTruth #Debate __HTTP__ _E_
Dopey @Lord_Sugar—Look in the mirror and thank the real Lord that Donald Trump exists. You are nothing! _E_
Getting ready to open the magnificent Turnberry in Scotland. What a great day especially when added to the brave & brilliant vote. _E_
Mitt's subsequent rise in the polls post debate shows that the American public can still spot a real winner. _E_
Obama met with Chinese Premier Wen yesterday __HTTP__ and talked trade. The Chinese are robbing us blind be tough! _E_
Will @JebBush in his phony advertising campaign show himself asking me to apologize to his wife in the debate? _E_
RT @TeamTrump: We are going to be THRIVING again. @realDonaldTrump #BigLeagueTruth #Debates2016 __HTTP__ _E_
Speaking to a record crowd of over 20000 people in Charlotte Arena this Saturday morning—look forward to it! _E_
122 vicious prisoners released by the Obama Administration from Gitmo have returned to the battlefield. Just another terrible decision! _E_
RT @DonnaWR8: .@POTUS #TRUMP & @FLOTUS🌺When ALL seemed HOPELESS...YOU brought HOPE!You INSPIRE us ALL!#MAGA #Harvey @Scavino45 #USA... _E_
IMO Manti Te'o was involved in a hoax for sympathy to get the Heisman Trophy. _E_
NYC's sole hammam The Spa at @TrumpSoHo offers classic treatments inspired by wellness rituals f/around the world __HTTP__ _E_
Thank you @LtStevenLRogers. We will respond to terrorism with strength in 2017! __HTTP__ _E_
Our economy cannot stay competitive with policies like these: @BarackObama is proposing over $90 Billion in new regulations. _E_
The massive Blue Monster @TrumpDoral is getting rave reviews. I built it in one year—no easy feat! _E_
Watch out. Champion @Joan_Rivers returns to the Boardroom as a judge in this week's All Star Celebrity @ApprenticeNBC. Don't cross her! _E_
.@deneenborelli Thank you for your nice words greatly appreciated. _E_
China court: Apple pays $60M to settle iPad case. China is getting away with murder. __HTTP__ _E_
Donald Trump: Jeb Bush's Support of Common Core 'a Disaster' __HTTP__ via @BreitbartNews by Dr. Susan Berry _E_
I'm going to D.C. today to check on the hotel I'm building on Penn. AVE. and then being honored by the Wharton School of Finance the BEST! _E_
Networks other than low ratings @CNN have been very fair and exciting! _E_
Hypocrite: @HillaryClinton is the single biggest beneficiary of Citizens United in history by far. #debate #bigleaguetruth _E_
Such a total miscarriage of Justice in San Francisco! __HTTP__ _E_
Obama's offer to Iran will not stop Iran's breakout capability. It is a bad desperate deal negotiated from weakness. Pass sanctions! _E_
Great interview tonight @donlemon very professionally done. @CNN _E_
My latest Celebrity Apprentice video blog... __HTTP__ _E_
Sorry banks when we accused lightweight AG Eric Schneiderman of not going after banks he started going after banks—but years too late! _E_
ObamaCare could eat up your raise __HTTP__ Why isn't Congress defunding it? They're obsessed with amnesty. _E_
Mark Levin's @marklevinshow 'The Liberty Amendments: Restoring the American Republic is a truly great & important book. _E_
The addition of the iconic Doral Resort to the Trump portfolio is one of the most exciting transactions __HTTP__ _E_
The fact that we are taking the Ebola patients while others from the area are fleeing to the United States is absolutely CRAZY Stupid pols _E_
Thank you @Morning_Joe for explaining to @CNN and @andersoncooper and so many others that I am leading in almost all national & state polls. _E_
Liberal press won't look into why Obama ignored security warnings for embassies but is obsessed with Romney's private comments. _E_
Looking forward to seeing Joe McQuaid Curtis Barry and my many friends in the Granite State! _E_
Why does @BarackObama have such a fascination with my plane? He is more than welcomed to come for a ride. _E_
Looking forward to press conference on taxes at 11AM at @TrumpTowerNY. _E_
Lets go America! Get out & #VoteTrump! #Trump2016#MakeAmericaGreatAgain!#SuperTuesday __HTTP__ __HTTP__ _E_
Obama will let Ebola fly into US & drugrunners cross our border daily. But he won't pressure Mexico on Sgt. Tahmooressi. #FreeOurMarine _E_
Chuck Hagel: Wrong For Defense __HTTP__ via @NewYorkObserver _E_
Oscar Pistorius only gets five years in prison for killing his girlfriend. Ridiculous decision! Judge couldn't even read her own writings. _E_
When the New York Times sold their beautiful long time building for peanuts & the buyer flipped it for a massive profit—they lost me! _E_
...accountability say the Governor. Electric and all infrastructure was disaster before hurricanes. Congress to decide how much to spend.... _E_
.@TrumpNewYork is NYC's only @ForbesInspector 5 Star & @AAAnews 5 Diamond hotel w/a 5 Star & 5 Diamond restaurant __HTTP__ _E_
RT @FieldofFight: We Can Do Better We Must Do Better We Will Do Better By LTG (R) Keith Kellogg and LTG (R) Michael Flynn @GenFlynn __HTTP__ _E_
"No one remembers who came in second." Walter Hagen _E_
ObamaCare is a failure. Costs are rising much faster under Obama than other Presidents. _E_
Welcome to the @WhiteHouse Prime Minister @JustinTrudeau! __HTTP__ _E_
Looks like a lawsuit against GoAngelo won't work—my ties & shirts doing too well at Macy's he's actually helping. I have no damages! _E_
Glad to hear that @taylorswift13 will be co hosting the Grammy nominations special on 12.5. Taylor is terrific! _E_
"Mistakes are always forgivable if one has the courage to admit them." Bruce Lee _E_
Wow the Failing @nytimes said about @foxandfriends ....the most powerful T.V. show in America. _E_
Via @UnionLeader by @tuohy: "Trump: You're Hired" __HTTP__ _E_
Get out and vote West Virginia we will MAKE AMERICA GREAT AGAIN! _E_
I look forward to @MittRomney hitting Obama hard tonight for lying about Benghazi. CIA told Obama it was a terrorist attack after 24 hrs. _E_
#AMERICA FIRST! _E_
Saturday Night Live has some incredible things in store tonight. The great thing about playing myself is that it will be authentic! Enjoy _E_
Will be interviewed by @chucktodd on @meetthepress at 10:30 A.M. _E_
Time is on your side things do not continue downward forever. Think Big _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_
There is no longer a Bernie Sanders political revolution. He is turning out to be a weak and somewhat pathetic figurewants it all to end! _E_
Can you imagine what Putin and all of our friends and enemies throughout the world are saying about the U.S. as they watch the Ferguson riot _E_
#TrumpAdvice __HTTP__ _E_
RT @foxandfriends: NYT editor apologizes for misleading tweet about New England Patriots' visit to the White House (via @FoxFriendsFirst) h... _E_
...and borrow cheap! You will thank me someday. _E_
Word is that Ford Motor because of my constant badgering at packed events is going to cancel their deal to go to Mexico and stay in U.S. _E_
Isn't it crazy I'm worth billions of dollars employ thousands of people and get libeled by moron bloggers who can't afford a suit! WILD. _E_
I watched @todayshow this AM re: @MarthaStewart & dating. She looks terrific better than ever any guy would be lucky to be with her. _E_
I hear that sleepy eyes @chucktodd will be fired like a dog from ratings starved Meet The Press? I can't imagine what is taking so long! _E_
Board Room finale of this week's All Star @ApprenticeNBC will leave viewers wondering where the rest of the season goes...It's great! _E_
The terrorist came into our country through what is called the Diversity Visa Lottery Program a Chuck Schumer beauty. I want merit based. _E_
I don't think the voters will forget the rigged system that allowed Crooked Hillary to get away with murder. Come November 8 she's out! _E_
Trace delivers check to hospital in NYC: American Red Cross must be grateful to Trace and his team for their tremendous work. _E_
My response to the failing Des Moines Register the ultra liberal paper that has no power in Iowa __HTTP__ _E_
Leaving West Palm Beach Florida now heading to St. Augustine for a 3pm rally. Will be in Tampa at 7pm join me:... __HTTP__ _E_
Thank you @SahilKapur for the wonderful story. __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Just got great national poll numbers double digit lead! Thank you we will all MAKE AMERICA GREAT AGAIN! _E_
Trump at Tea Party __HTTP__ via @myrbeachonline _E_
Donald Trump donates land to conservation group in Palos Verdes __HTTP__ via @MyNewsLA _E_
Nobody would fight harder for free speech than me but why taunt over and over again in order to provoke possible death to audience. DUMB! _E_
To aspiring entrepreneurs: Be ready for problems. You'll have them every day. So remember to look at the solution not the problem. _E_
I will be interviewed on @MariaBartiromo @FoxBusiness at 7:30 _E_
Remember if you do not promote yourself no one else will. When you have success let people know about it. _E_
Unemployment for Black Americans is the lowest ever recorded. Trump approval ratings with Black Americans has doubled. Thank you and it will get even (much) better! @FoxNews _E_
George Will was a big Iraq fool. $2 trillion thousands of lives lost & we got nothing! Dummy. _E_
"Successful leaders see the opportunities in every difficulty rather than the difficulty in every opportunity." Reed Markham _E_
Rupert Murdoch Defends Trump: 'Complete Refugee Pause' Makes Sense' __HTTP__ _E_
Via @CNNMoney by @jtotoole: "U.S. taps Donald Trump to convert DC's Old Post Office into luxury hotel" __HTTP__ _E_
The original Apprentice returns with a two hour premiere on Thursday September 16th. Looking forward to a fantastic season! _E_
Crooked Hillary said that I couldn't handle the rough and tumble of a political campaign. ReallyI just beat 16 people and am beating her! _E_
Going to Scotland Ireland & other places in Europe to close up deals. Getting ready for the June 16th announcement @TrumpTowerNY! _E_
RT @IvankaTrump: Such a surreal moment to vote for my father for President of the United States! Make your voice heard and vote! #Election2... _E_
Alternatives are important but first Repubs must repeal ObamaCare. It's an unsustainable monstrosity that's destroying our healthcare. _E_
In the heart of midtown New York @TrumpTowerNY is a landmark which hosts tourists from the around the world daily __HTTP__ _E_
Big day at the United Nations many good things and some tricky ones happening. We have a great team. Big speech at 10:00 A.M. _E_
RT @IvankaTrump: My next project is pretty amazing...!xx Ivanka __HTTP__ __HTTP__ _E_
Kern County CA has secured $1.2B for windfarms __HTTP__ They also just secured more eagle deaths & low property values. _E_
People like lawyer Elizabeth Beck and failed writer Harry Hurt & others talk about me but know nothing about me—crazy! _E_
#TBT With the cast of GoodFellas __HTTP__ _E_
.@SarahPalinUSA did a great job @CPACnews. Much of what she said was plain old common sense. _E_
My @NewsRadio967 interview re Jeb Bush's absurd immigration comment & @Citizens_United @AFPhq Freedom Summit. __HTTP__ _E_
The horrible shooting that took place in San Bernardino was an absolute act of terror that many people knew about. Why didn't they report? _E_
Thank you Rep. @CynthiaLummis! __HTTP__ __HTTP__ _E_
A Rod hit ball hard first at bat. Time for him to step up and leave. _E_
Sometimes the best thing you can do is just let things ride let time go by. Donald J. Trump _E_
Dummy goAngelo keep letting people know how great my shirts ties and cufflinks (also Success) are at Macy's.The BEST now everyone's aware! _E_
I would have had many millions of votes more than Crooked Hillary Clinton except for the fact that I had 16 opponents she had one! _E_
Thank you for your nice words @MikeNeedham @Heritage for the nice words on @FoxNewsSunday with Chris Wallace. #FNS #Trump2016 _E_
Remember to watch the series finale of The Men Who Built America this Sunday at 8/7c on @History _E_
Go out and vote this will be the most important election of our time! _E_
I hope the Mexican judge is more honest than the Mexican businessmen who used the court system to avoid paying me the money they owe me. _E_
Do you believe this one Secretary of State John Kerry just stated that the most dangerous weapon of all today is climate change. Laughable _E_
Imitation is the sincerest form of flattery Huntsman goes Donald Trump __HTTP__ _E_
Polls close in 3 hours! Everyone get out and VOTE!#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Thank you @SenatorFischer! #TrumpPence16 __HTTP__ _E_
Via @advisorsource: Donald Trump speaks in Novi drawing largest crowd in Oakland County Republican Party's history __HTTP__ _E_
Trump Nat'l Westchester is among the most highly regarded clubs in New York. A great place. __HTTP__ _E_
Thank you @oreillyfactor for your wonderful editorial as to why I should have been @TIME Magazine's Person of the Year. You should run Time! _E_
Newsweek ending print edition sad. Now my Newsweek covers mean nothing they lost all credibility. TIME to follow? _E_
Everyone's wondering what's wrong with A Rod. Not one sports writer blames it on his not being able to use drugs anymore the real reason. _E_
.@HillaryClinton has been doing this for THIRTY YEARS....where has she been? #BigLeagueTruth _E_
Large Block Grants to States is a good thing to do. Better control & management. Great for Arizona. McCain let his best friend L.G. down! _E_
Let Pete Rose into the Baseball Hall of Fame. It's time he has paid a big and very long price! _E_
I am at the @USGA #USWomensOpen. An amateur player is co leading for the first time in many decades very exciting! _E_
Something must be done with dopey @KarlRove he is pushing Republicans down the same old path of defeat. Don't fall for it Karl is a loser _E_
.@Megynkelly spent a big part of her show talking about other shows spending so much time on me. Really weird she's being driven crazy! _E_
Thank you Speaker @PRyan!#AmericaFirst #Trump2016 __HTTP__ _E_
.@hardball_chris became a super liberal Obama fan only because he must need the money and on @MSNBC that's the way it is. _E_
"Expand your life every day." –Donald J. Trump __HTTP__ _E_
Tonight @FLOTUS Melania and I were thrilled to welcome so many wonderful friends to the @WhiteHouse – and wish them all a very #HappyHanukkah __HTTP__ __HTTP__ _E_
"Borrowing and spending is not the way to prosperity." @PaulRyanVP _E_
.@AlexSalmond See photo __HTTP__ _E_
Wishing everyone a Happy Memorial Day Weekend with a special thought for all the veterans who have done so much for our freedom. _E_
Re Omarosa: Nasty tough or smart...or all? _E_
From: @Newsmax_Media: @realDonaldTrump: Public not Worried About @MittRomney's Tax Returns __HTTP__ _E_
Sitting at the foot of the Whitestone Bridge @TrumpFerryPoint is an 18 hole @jacknicklaus signature course __HTTP__ _E_
The new @DarKnightRises trailer is fantastic __HTTP__ Trump Tower stood in for Wayne Enterprises during filming. _E_
Hillary Clinton is being badly criticized for her poor performance in answering questions. Let us all see what happens! _E_
.....Ahead of schedule and under budget! Will be in Oklahoma tonight! _E_
Lots of pressure on Obama tonight even more than A Rod. If he doesn't perform well it could be over. _E_
Beauty arrives to Moscow's Crocus City Hall this 11.9.! On @nbc the world will watch @MissUniverse 2013 crowned __HTTP__ _E_
Success is not the key to happiness. Happiness is the key to success. If you love what you are doing you'll be a success. A. Schweitzer _E_
It's a plain fact: free trade requires having fair rules that apply to everyone. (cont) __HTTP__ _E_
The totally unexpected loss of Supreme Court Justice Antonin Scalia is a massive setback for the Conservative movement and our COUNTRY! _E_
Today on #NationalAgDay we honor our great American farmers & ranchers. Their hard work & dedication are ingrained... __HTTP__ _E_
Do you believe the way Karzai talks down to the United States zero respect! _E_
When Americans are free to thrive innovate & prosper there is no challenge too great no task too large & no goal beyond our reach. We are a nation of explorers pioneers innovators & inventors. We are nation of people who work hard dream big & who never ever give up... __HTTP__ _E_
RT @EricTrump: 2016 was such an incredible year for our entire family! My beautiful wife @LaraLeaTrump made it even better! __HTTP__ _E_
Many countries including allies already see China as world superpower __HTTP__ We have greatest military yet no respect _E_
Watch the @nbc video where @realmissnvusa is crowned as the 63rd @MissUSA __HTTP__ The Crowning Moment! _E_
In the 1950's our climate was far more unstable than it has been over the last 5 years. _E_
I loved Walter Cronkite one of the all time greats. He couldn't stand Dan Rather I agree with Walter. @DanRatherReport _E_
.@AlexSalmond the Scottish politician who released the terrorist who blew up Pan Am flight 103 over Lockerbie... _E_
Iran will soon take all of the oil in Iraq...and Iraq itself Keep the oil. _E_
With all of the recently reported electronic surveillance intercepts unmasking and illegal leaking of information I have no idea... _E_
Go Republican Senators Go! Get there after waiting for 7 years. Give America great healthcare! _E_
With @IvankaTrump and crew at the start of a new @DoralResort. __HTTP__ _E_
The US Navy wants to go green. Our Navy should use the best & most powerful fuel & not play games. Give me a break! _E_
Rosie is crude rude obnoxious and dumb other than that I like her very much! _E_
We could only get a small fraction of this 25k crowd in. The movement to Make America Great Again is unbelievable! __HTTP__ _E_
Looking forward to being honored at @citadelgop's Patriot Dinner with @SenatorTimScott in Charleston SC this Sunday __HTTP__ _E_
Big news—WOW—U.S. economy shrinks! _E_
Investors are visionaries in some respects they look beyond the present. _E_
Obama's statement that illegals "can't stay" = Obama's promise "if you like your healthcare plan you can keep it." _E_
Today I signed an Executive Order on Enforcing Statutory Prohibitions on Federal Control of Education. EO:... __HTTP__ _E_
Entrepreneurs: Absorb assess and then act. Don't negate your own power. Whatever you've been dealt know you can deal with it. _E_
My @foxandfriends interview discussing Obama's failed and dangerous foreign policy and the real unemployment numbers __HTTP__ _E_
I've known @hardball_chris for a long time & sadly he gets dumber each & every year & started from a very low base. _E_
RT @mitchellvii: Trump always ends up being right. It's almost a little freaky. _E_
What I would do on my first day in office. #MakeAmericaGreatAgainWatch: __HTTP__ __HTTP__ _E_
Real estate is always a great asset to own but especially now. Try to take advantage if you can and buy (cont) __HTTP__ _E_
Bernie Sanders is pushing hard for a single payer healthcare plan a curse on the U.S. & its people... _E_
Per @rushlimbaugh: Why does Hillary Clinton get the benefit of the doubt (after she DESTROYS her illegal email server) ... _E_
Six days and counting until my offer to Barack Obama expires... _E_
The Apprentice will be very exciting and interesting tonight at 8:00. Joan Rivers puts on a great show! _E_
The recent Kansas election (Congress) was a really big media event until the Republicans won. Now they play the same game with Georgia BAD! _E_
Wake Up America! See article: Israeli Science: Obama Birth Certificate is a Fake __HTTP__ _E_
Tonight's episode of The Apprentice is one you won't want to miss! Be sure to tune in 10 p.m. on NBC. _E_
The devastation left by Hurricane Irma was far greater at least in certain locations than anyone thought but amazing people working hard! _E_
If only Obama would treat @IsraeliPM @netanyahu with the same respect he awards tyrants. Very strange & dangerous for our national security. _E_
Who are our generals that are allowing this fiasco to happen right before our eyes. Call it the PLENTY OF NOTICE WAR _E_
Would anyone in the music industry treat a Democrat like this? @RealMeatLoaf is being punished for his political views __HTTP__ _E_
Why are we building a $1Billion embassy in Iraq when the country kicked us out didn't give us any oil & is about to get taken over by Iran? _E_
I'll be speaking on Thursday April 12 at the first ever National Achievers Congress at the San Jose Convention (cont) __HTTP__ _E_
RT @GOPLeader: .@POTUS made the right call in leaving a deal that would have put an unnecessary burden on the United States. __HTTP__ _E_
Thank you! CNBC #DebateNight poll with over 400000 votes. Trump 61%Clinton 39%#AmericaFirst #ImWithYou... __HTTP__ _E_
Just spoke to Governor Kenneth Mapp of the U.S. Virgin Islands who stated that #FEMA and Military are doing a GREAT job! Thank you Governor! _E_
I wonder if @BarackObama has promised Iran and China that he can be more flexible after his last election? _E_
Getting ready to leave for South Korea and meetings with President Moon a fine gentleman. We will figure it all out! _E_
I am in Virginia @RegentU Presidential forum with Dr. Pat Robertson beginning now! Watch here: __HTTP__ _E_
Big Republican Dinner tonight at Mar a Lago in Palm Beach. I will be there! _E_
The biggest thrill in the world is entertaining the public there is no bigger thrill than that. Vince McMahon @WWE _E_
Just read in the failing @nytimes that I was not aware the event had to be held in Cleveland a total lie. These people are sick! _E_
Wonderful meeting with Canadian PM @JustinTrudeau and a group of leading CEO's & business women from Canada and th... __HTTP__ _E_
Everybody wants to see and talk to Dennis Rodman he will be on Celebrity Apprentice tonight at 9. _E_
Trump Int'l Hotel & Tower New York has the perfect Manhattan location & @jeangeorges is the signature restaurant. __HTTP__ _E_
Keep the big picture in mind. There are always opportunities & possibilities & thinking too small can negate a lot of them. _E_
RT @RealBenCarson: Many people fight for change in DC. @realDonaldTrump is a leader with an outsider's perspective & the vision guts & ene... _E_
We look forward to making the Old Post Office in DC one of the great hotels of the World. __HTTP__ _E_
Many people have been asking to see my plane The Apprentice's @AmandaTMiller will give you a tour... __HTTP__ _E_
Heading to Trump National Doral to check the progress prior to the start of the Cadillac Championship on Thursday. I'll be there all week _E_
True. __HTTP__ _E_
The interview was great for @Oprah and terrible for Lance Armstrong! _E_
China's submarines will soon be carrying nukes __HTTP__ They will be sent to patrol our coasts Obama won't do anything. _E_
I know the Governors and Jeb Bush who has gone nasty with lies is by far the weakest of the lot. His family used private eminent domain! _E_
Robust Economic growth is the answer to the Medicare Problem not cuts on the elderly. _E_
Thank you New Mexico! #Trump2016 __HTTP__ __HTTP__ _E_
Home Sales hit BEST numbers in 10 years! MAKE AMERICA GREAT AGAIN _E_
Melania and I send our thoughts and prayers to Senator McCain Cindy and their entire family. Get well soon. __HTTP__ _E_
Will be doing @oreillyfactor tonight at 8pm. Enjoy! _E_
Hillary Clinton will use American tax dollars to provide amnesty for thousands of illegals. I will put... __HTTP__ _E_
I will be interviewed on @TODAYshow and Good Morning America at 7:00 A.M. _E_
Shirts and ties are doing great @Macys thanks! _E_
I cannot believe how well certain areas are doing relative to the U.S. There is no reason for this other than poor leadership.WE SHOULD BE 1 _E_
RT @Scavino45: President Trump pays respects and delivers #MemorialDay remarks at Arlington National Cemetery. __HTTP__ _E_
If you can't adapt to new situations then you will never be successful. Every change is a new opportunity to use your talent. _E_
.@Andre_Reed83. Congratulations Andre you deserve it! _E_
Public Policy Polling (PPP) has just come out with a major poll putting me #1 with Hispanics leading all Republican candidates.Told you so _E_
Did Crooked Hillary help disgusting (check out sex tape and past) Alicia M become a U.S. citizen so she could use her in the debate? _E_
The Fed's reckless policies of low interest and flooding the market with dollars needs to be stopped or we will face record inflation. _E_
Liberal SD Dem candidate Rick Weiland wants to expand ObamaCare to single payer & opposes Ebola travel ban. Send @RoundsforSenate to Senate! _E_
Mainstream media never covered Hillary's massive "hacking"or coughing attack yet it is #1 trending. What's up? _E_
People are finally beginning to hit China and OPEC. They never give me credit for being the first by far but that's okay! _E_
Some good news for New York – Weiner has dropped 12 points in the polls & that is before more of the pervert's old texts are released. _E_
President Obama close down the flights from Ebola infected areas right now before it is too late! What the hell is wrong with you? _E_
Plane was carrying those terrible lithium ion batteries which are highly combustible as cargo. Fire could have started in cockpit. _E_
With the record $200M renovations on track & budget (a miracle in DC) Trump Int'l Washington DC is being built into a national marvel. _E_
.@TheHill Trump on Boehner resignation: 'It's a good thing' __HTTP__ _E_
Have a great Good Friday and a Happy Easter. _E_
In the just released SC poll I increased my lead by 4 points since last poll by same firm. Up by 14! Cruz dropped 3. __HTTP__ _E_
Speech in Dallas went really well. Big and wonderful crowd. Just arrived in L.A. Big day tomorrow! _E_
The basketball coach at Rutgers looks bad but I had a coach who made him look like a baby coaches can be tough! _E_
My @SquawkCNBC interview discussing @BarackObama's #WHCD my Scotland property & @BarackObama using Bin Laden's death __HTTP__ _E_
Thank you! #Trump2016 __HTTP__ __HTTP__ _E_
The wonderful people of Puerto Rico with their unmatched spirit know how bad things were before the H's. I will always be with them! _E_
Yesterday in Iowa was amazing two speeches in front of two great sold out crowds. They love that I am the only candidate self funding! _E_
'Podesta urged Clinton team to hand over emails after use of private server emerged' __HTTP__ _E_
Stock Market has increased by 5.2 Trillion dollars since the election on November 8th a 25% increase. Lowest unemployment in 16 years and.. _E_
The next ObamaCare disaster will be doctors being dropped from plans. _E_
Breitbart gets it! Vote now Obama should release his college application records & grades. He says he loves (cont) __HTTP__ _E_
Jodi Arias has stated that she follows me on twitter so I really hate to be saying that she is guilty but sadly she is as guilty as it gets _E_
.@VanityFair could come back if Graydon Carter paid as much attention as he does to his bad food restaurants. @CondeNastCorp _E_
Congratulations to @TrumpNewYork for being named #1 Best Business Hotel in NYC in @TravlandLeisure's 2014 World's Best Business Hotels. _E_
Thank you North Carolina get out & #VoteTrump on 11/8/2016!#MakeAmericaGreatAgain __HTTP__ _E_
Via @G_Liberty_Voice by Melody Dareing: "Donald Trump Wants to Build a Wall Between U.S. And Mexico" __HTTP__ _E_
Obama Putin Moscow meeting on 9.3 4 __HTTP__ On the agenda 2013 Trump @MissUniverse Pageant in Moscow on 11.9 on @nbc! _E_
When I bought the #MissUniverse pageant 13 years ago it was on life support... _E_
If everything seems under control you're just not going fast enough. Mario Andretti _E_
Jeb Bush George W and George H.W. all called to express their best wishes on the win. Very nice! _E_
2004 VIDEO:Pocahontas describing Crooked Hillary Clinton as a Corporate Donor Puppet. Time for change! #Trump2016 __HTTP__ _E_
The French police are afraid to go into many communities. How did France let this all happen and how did the female terrorist ever escape? _E_
Where serenity meets luxury: Trump Nat'l Jupiter's Spa offers treatments which help restore youthful vitality __HTTP__ _E_
Via @BET: "Donald Trump Blasts Beyoncé for Suggestive Super Bowl Show" __HTTP__ _E_
Little @MacMiller—I have more hair than you do and there's a slight age difference. _E_
Scotland is having a virtual revolt over obsolete wind turbines which are driving up energy costs and killing the bird population (and more) _E_
Does anybody really want to throw out good educated and accomplished young people who have jobs some serving in the military? Really!..... _E_
Breaking ground shortly Trump Int'l Washington DC will bring the DC Post Office far beyond its original grandeur __HTTP__ _E_
President @BarackObama's vacation is costing taxpayers millions of dollars Unbelievable! _E_
Everyone is excited for @THEGaryBusey's return to All Star @CelebApprentice. Be warned this time Gary is even more insane! _E_
The State Department's 'shadow government' #DrainTheSwamp __HTTP__ _E_
New rule for @billmaher: check the law before you make a public absolute offer. _E_
If this doctor who so recklessly flew into New York from West Africa has Ebola then Obama should apologize to the American people & resign! _E_
See you tomorrow Wisconsin!'Trump spurs small business optimism in Milwaukee area' __HTTP__ _E_
Check out Gray Line's site for the Donald Trump Ride of Fame... __HTTP__ _E_
Shows how dumb Joe McQuaid (@deucecrew) of the dying Union Leader is to put out the letter I wrote saying why I didn't do his failed debate! _E_
NYC's sole hamman the bi level @TrumpSoHo features indoor & outdoor relaxation lounges with luxury services __HTTP__ _E_
Terrible for the economy & middle class gas has now been over $3/gallon for a record 1245 days __HTTP__ FRACK NOW & FAST! _E_
Shocking over 92% of France who just elected a socialist for its new PM want @BarackObama re elected __HTTP__ _E_
Ray Kelly is the best Police Commissioner in NYC history. Keeping NYC safe thru vigilance. @RayKelly _E_
Be sure to watch #CelebApprentice on Sunday night at 9 pm on NBC. Another great episode! __HTTP__ _E_
Have you been watching how Saudi Arabia has been taunting our VERY dumb political leaders to protect them from ISIS. Why aren't they paying? _E_
.@McLaughlinGroup Greatly appreciate yr wonderful comments this weekend. People of "great accomplishment" should easily quality for prez. _E_
Come on @DannyZuker take the bet show your friends and family (& your bosses on Modern Family) that you're not chicken shit _E_
President Obama please take the $5M check for charity tomorrow. It is so easy and could do so much good! _E_
Thank you Hawaii! #Trump2016 _E_
Watched chief negotiator for Iran on @charlierose last night. He is far smarter than our reps—increase sanctions and walk! _E_
This is an outrage! Bias Free Language Guide claims the word 'American' is 'problematic' WHAT?! __HTTP__ _E_
Lets fight like hell and stop this great and disgusting injustice! The world is laughing at us. _E_
Lets go Pennsylvania! #VoteTrump __HTTP__ _E_
Congrats to great golfer @Frostpga on his big win last week. Always been best putter. Frost Wins for Trump _E_
Pay attention to details. If you don't know every aspect of what you're doing you're setting yourself up for some big surprises. _E_
Why didn't movie Lincoln use Ford's Theater for big scene instead of the stage of an unrelated theater? _E_
Can't wait for tonight's debate actually delayed my trip to Europe so I can watch. This is going to be a great night. _E_
Thank you @gawker! Call me on my cellphone 917.756.8000 and listen to my campaign message. _E_
Did anyone notice that Obama failed to get a coalition of other countries to go along with us. He couldn't even get Britain! NO LEADERSHIP. _E_
Our government now imports illegal immigrants and deadly diseases. Our leaders are inept. _E_
Unemployment has risen today and some other very bad news has just been reported the stock market is way down. _E_
My @piersmorgan interview on Snowden the traitor national security and China hacking us __HTTP__ _E_
Watch the 2011 #MissUniverse Pageant tonight at 9PM on NBC... __HTTP__ _E_
"Results are what matter...A series of efforts will add up to experience and achievement." Think Like a Champion _E_
#MakeAmericaGreatAgain #ImWithYou __HTTP__ __HTTP__ _E_
Drain the Swamp should be changed to Drain the Sewer it's actually much worse than anyone ever thought and it begins with the Fake News! _E_
.@georgewillf is perhaps the most boring political pundit on television. Got thrown off ABC like a dog. At Mar a Lago he was a total bust! _E_
"@NBCApprentice: And the fired celebrities are..." __HTTP__ via @ew by @DaltonRoss _E_
Thank you New York I will never forget! _E_
Please help @autismspeaks with their petition to the White House for a national strategy for the autism epidemic __HTTP__ _E_
If you entered our country illegally and are then granted amnesty why would you abide by other laws? No Amnesty! _E_
Rape is a huge problem in the U.S. military. Over 19000 rapes last year. _E_
Who will be the next @TheRealTeenUSA? Find out this Saturday at 8PM ET on missteenusa.com #TeenUSA _E_
The Democrats had to come up with a story as to why they lost the election and so badly (306) so they made up a story RUSSIA. Fake news! _E_
Bad sign for Obama's campaign now publicly admitting they are focused on 4 states. Their internals must be horrendous. _E_
So many people think I will not run for President.Wow I wonder what the response will be if I do. Even the haters and losers will be happy! _E_
From @FoxNews Bombshell: In 2016 Obama dismissed idea that anyone could rig an American election. Check out his statement Witch Hunt! _E_
.@GovernorPerry just gave a pollster quote on me. He doesn't understand what the word demagoguery means. _E_
Thank you to all of the men and women who have served our country. You are our true heroes! #ArmedForcesDay __HTTP__ _E_
#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
It is so nice that the shackles have been taken off me and I can now fight for America the way I want to. _E_
It's clear to me that @teresa_giudice needs some lessons in negotiation #sweepstweet _E_
Maybe @THEGaryBusey should stick to words... vs. barking. He's got a definite talent when he wants to use it. #CelebApprentice _E_
Ready to get mad?! We are sending foreign aid to China our greatest threat __HTTP__ We are financing our enemy. _E_
Thank you Worcester Massachusetts!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_
Just arrived in Wisconsin to discuss JOBS JOBS JOBS! #MAGA __HTTP__ _E_
'Jeff Sessions a Fitting Selection for Attorney General' __HTTP__ _E_
It is time for the airline pilots flight attendants and the airlines themselves to stop flights to and from West Africa. Do it right now! _E_
.@GovernorPataki couldn't be elected dog catcher if he ran again—so he didn't! _E_
The dishonest media will NEVER keep us from accomplishing our objectives on behalf of our GREAT AMERICAN PEOPLE!... __HTTP__ _E_
Via @TIMEPolitics by @zekejmiller: "Trump To Visit New Hampshire" __HTTP__ _E_
Looking forward to speaking @nranews Convention in Nashville __HTTP__ The 2nd Amendment is a right not a privilege! _E_
I will be on @foxandfriends tomorrow morning at 7:15 Hope you enjoy and agree! _E_
THANK YOU to the amazing staff and their families of the United States Embassy in the Philippines. Keep up the GREAT WORK! __HTTP__ _E_
Will be leaving the Philippines tomorrow after many days of constant mtgs & work in order to #MAGA! My promises are rapidly being fulfilled. _E_
When will CNN do a segment on Hillary's plan to increase Syrian refugees 550% and how much it will cost? _E_
We have to get tough on China. For every one American child there are four Chinese. China is out to steal our (cont) __HTTP__ _E_
There was a major diplomatic breakthrough yesterday w/the White House Iran & China. All celebrated Chuck Hagel being voted in as SOD. _E_
Big win in the House very exciting! But when everything comes together with the inclusion of Phase 2 we will have truly great healthcare! _E_
Cryin' Chuck Schumer stated recently I do not have confidence in him (James Comey) any longer. Then acts so indignant. #draintheswamp _E_
#USAatUNGA #UNGA __HTTP__ _E_
Terrible story on front page of NYTimes about lightweight @AGSchneiderman __HTTP__ Does Eric wear eyeliner? _E_
Being tough doesn't mean being nasty difficult or unreasonable. It means being tenacious and refusing to give in or give up. _E_
The Russia hoax continues now it's ads on Facebook. What about the totally biased and dishonest Media coverage in favor of Crooked Hillary? _E_
No amnesty. Protect the rule of law! Let's Make America Great Again __HTTP__ _E_
All NYC needs is the mentally unstable Elliot Spitzer in office again. _E_
The Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_
US trade deficit hit $64B+ in April 2 yr record high __HTTP__ We must do better. China is ripping us. Bring the jobs home! _E_
.@Betsy_McCaughey Thanks so much. Really appreciate your comments. I will help the veterans like no one else. __HTTP__ _E_
Why aren't the Democrats speaking about ISIS bad trade deals broken borders police and law and order. The Republican Convention was great _E_
RT @realDonaldTrump: Democrats are far more concerned with Illegal Immigrants than they are with our great Military or Safety at our danger... _E_
What's the primary ingredient for success? Passion. You have to love what you're doing or you won't get too far. _E_
Wow Eliot Spitzer has lost great news for New York City! _E_
.@BillMoyers is a liberal hack whose career is being laid to rest @PBS. Here Moyers coddles @JeremiahWright __HTTP__ _E_
Speaker @johnboehner seems to have gained strength in house—a good thing! _E_
Why do we continue to sit idly by while China steals our national security and corporate secrets? China is an enemy not a friend. _E_
Work is expected to begin today on my golf course in Scotland. It will be spectacular! __HTTP__ _E_
Defense Sec.Hagel has quit. Great news for our country. The guy didn't have a clue—grossly outmatched by our enemies. Couldn't even speak _E_
See the amazing views from @TrumpGolfLA located directly on the Pacific Ocean __HTTP__ _E_
Almost every major dealmaker has used the bankruptcy laws as a business tool. Icahn Black Zell—but nobody says they went bankrupt! _E_
"Always bear in mind that your own resolution to succeed is more important than any other." – Abraham Lincoln _E_
Many people look at successful people & don't see anything but the end result. They don't see all the work that went into getting there. _E_
#WeeklyAddress __HTTP__ _E_
South Korea is absolutely killing us on trade deals. Their surplus vs U.S. is massive and we pay for their protection. WHO NEGOTIATES? _E_
Today I am here to offer a renewed partnership with America to work together to strengthen the bonds of friendship and commerce between all of the nations of the Indo Pacific and together to promote our prosperity and security. #APEC2017 __HTTP__ _E_
Good morning Ohio! Some additional information from my daughter @IvankaTrump! #VoteTrump #SuperTuesday __HTTP__ _E_
.@MacMiller's "Donald Trump" __HTTP__ just crossed 73.5 million views on @YouTube. You're welcome Mac! _E_
Thank you Brian France Bill Elliott @chaseelliott @DavidRagan & @RyanJNewman! #NASCAR #Trump2016 #VoteTrump __HTTP__ _E_
How bad has our leader made us look on Syria. Stay out of Syria we don't have the leadership to win wars or even strategize. _E_
Hope & Change since @BarackObama has taken office the US debt has increased by an average of $64K per taxpayer. _E_
Via @CBNNews by @TheBrodyFile: Brody File Exclusive: Donald Trump Comes Out In Support Of 20 Week Abortion Ban __HTTP__ _E_
RT @foxandfriends: Another Dem 'queasy' over claim of Loretta Lynch meddling in Clinton case __HTTP__ _E_
Designed by @jacknicklaus Trump Golf Links at Ferry Point's 18 hole course sits by the Bronx's Whitestone Bridge __HTTP__ _E_
How incompetent are our leaders allowing these Ebola infected people to come into our country with all of the problems and danger entailed! _E_
This is why @TimTebow is a winner. He lays everything out on the field. He never quits and never gives up. That's why he is a success. _E_
Why would Kim Jong un insult me by calling me old when I would NEVER call him short and fat? Oh well I try so hard to be his friend and maybe someday that will happen! _E_
Don't miss the #MissUniverse Pageant tonight at 8/7c with performances by @NickJonas @PrinceRoyce and @GavinDeGraw __HTTP__ _E_
A quote from the late great golfer Sam Snead: Practice puts brains in your muscles! THIS IS TRUE ALSO IN LIFE. _E_
Thank you Costa Mesa California! 31000 people tonight with thousands turned away. I will be back! #Trump2016 __HTTP__ _E_
Myself with mother and father at New York Military Academy. See I can be very military. High rank!... __HTTP__ _E_
The Russians are playing a very smart game. In the meantime they are buying lots of time for Syria and making U.S. look foolish. Dangerous! _E_
I am now going to the brand new Trump International Hotel D.C. for a major statement. _E_
Thank you Graham Ledger of the Daily Ledger @OANN for your really fair coverage and your great interview with Peter Roff of U.S. NEWS & W.R. _E_
Must read article in @washtimes: @RealSheriffJoe probe could dwarf Watergate __HTTP__ _E_
ObamaCare premiums rising 13.2% in 2015 __HTTP__ Elections have consequences! _E_
What is Frank VanderSloot getting for agreeing to back Marco Rubio? Last victim was Mitt Romney see how that turned out. _E_
"UPDATE: Trump plans public event at @WartburgCollege" __HTTP__ via @wcfcourier: _E_
I will miss Mike Wallace. He did a major interview with me for 60 Minutes and it was totally fair and balanced. (cont) __HTTP__ _E_
My @foxandfriends int. on the Zimmerman trial & verdict courage of the jury and reactions! __HTTP__ _E_
The habitual vacationer @BarackObama spent 9 days before the critical Super Committee deadline traveling. He failed to lead again. _E_
Why do the networks continue to put dopey @BillKristol on panels when he has called every single shot about me wrong for 2 yrs? _E_
.@CNN is unwatchable. Their news on me is fiction. They are a disgrace to the broadcasting industry and an arm of the Clinton campaign. _E_
This Sunday's All Star Celebrity @ApprenticeNBC has the most beautiful boardroom judges ever w/ @IvankaTrump & @MELANIATRUMP together! _E_
When will @TedCruz give all the New York based campaign contributions back to the special interests that control him. _E_
Germany is going through massive attacks to its people by the migrants allowed to enter the country. New Years Eve was a disaster. THINK! _E_
Get smart on knockout assaults and crime we have to be slightly more vicious (and violent) than the assaulter and crime would end FAST! _E_
We can't let this happen. We should march on Washington and stop this travesty. Our nation is totally divided! _E_
NYC politicians better stop pandering ending stop & frisk would be a disaster. __HTTP__ _E_
I'll be in Dallas at the American Airlines Center on Sept 14th at 6 PM. Will be great to be back in Texas. __HTTP__ _E_
RT @EricTrump: Nevada: Reminder that today is the LAST day to register to vote in the February 23rd caucus! __HTTP__ __HTTP__ _E_
Get to the essence immediately. Learn to economize. People appreciate brevity in today's world. Think Like a Champion _E_
Required reading 4 success in politics & life read @kimguilfoyle's book #MakingTheCase. Brilliant Advice ! __HTTP__ _E_
Wow President Obama just landed in Cuba a big deal and Raul Castro wasn't even there to greet him. He greeted Pope and others. No respect _E_
Thank you Washington! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_
Congratulations to John Roberts for making Americans hate the Supreme Court because of his BS __HTTP__ _E_
Entrepreneurs: Focus on your goals not on fixed patterns. Do what's necessary and what's unnecessary will be made clear. _E_
At some point and for the good of the country I predict we will start working with the Democrats in a Bipartisan fashion. Infrastructure would be a perfect place to start. After having foolishly spent $7 trillion in the Middle East it is time to start rebuilding our country! _E_
My @MorningJoe interview with @JoeNBC & @morningmika discussing the Newsmax @iontv debate and #TimeToGetTough __HTTP__ _E_
We need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_
Via @DMRegister by @JenniferJJacobs: Trump: 'I would've won the race against Obama' __HTTP__ _E_
Sadly Democrats want to stop paying our troops and government workers in order to give a sweetheart deal not a fair deal for DACA. Take care of our Military and our Country FIRST! _E_
The so called 87 year old lady was a vicious and skilled investor who was trying to rip me off with made up facts and a blowhard lawyer. _E_
Dave Letterman @Late_Show said during my interview that Obama was probably born in the US the word probably is a disaster for Obama. _E_
This morning Chris Wallace has the best political show on television but that's only because I'm on it (kidding)! Have fun. _E_
.@CBSNews Poll WOW! New Hampshire TRUMP 38% CARSON 12% BUSH 8% South Carolina TRUMP 40% CARSON 23% CRUZ 8% Iowa TRUMP 27% CARSON 27% _E_
George Will may be the dumbest(and most overrated) political commentator of all time. If the Republicans listen to him they will lose. _E_
Morning Joe's weakness is its low ratings. I don't watch anymore but I heard he went wild against Rudy Giuliani and #2A sad & irrelevant! _E_
Wow @CNBC ratings are really low worst in many years. I guess I'll have to start doing my Tuesday morning interviews with them again! _E_
Obama just said @MittRomney was a very successful investor big mistake for Obama to admit he has less and less credibility. _E_
What's more dangerous for the country the Iranian nuclear threat or @BarackObama as President? _E_
Amazing view of @TrumpGolfLA __HTTP__ _E_
Major article in New York Times today discusses the cost of environmental damage in China and how it is RAPIDLY GROWING! Rest of World pays. _E_
Obama's statement on Egypt was terrible and dumb now being used by military as a rallying cry our foreign policy is worst in U.S. history. _E_
It was an honor to be the Grand Marshall in the Salute to Israel Parade back in 2004. __HTTP__ _E_
I'd like to wish all of my friends and even my many enemies a very Merry Christmas and Happy New Year. _E_
MAKE AMERICA SAFE AGAIN! __HTTP__ __HTTP__ _E_
Will be in Chicago tomorrow for a record setting (by far) luncheon. _E_
.@Franklin_Graham @BillyNungesser @SamaritansPurse so humbled by my time w/ you. You are in our thoughts & prayers. __HTTP__ _E_
Congratulations Jim Herman! We are all proud of you @TrumpGolf! __HTTP__ _E_
Had a very good call last night with the President of China concerning the menace of North Korea. _E_
A vote for Clinton Kaine is a vote for TPP NAFTA high taxes radical regulation and massive influx of refugees. _E_
Rally last night in San Jose was great. Tremendous love and enthusiasm in the hall. Big crowd. Outside small group of thugs burned Am flag! _E_
Wow was Ted Cruz disloyal to his very capable director of communication. He used him as a scape goat fired like a dog! Ted panicked. _E_
#MakeAmericaGreatAgain __HTTP__ _E_
LAWFARE: Remarkably in the entire opinion the panel did not bother even to cite this (the) statute. A disgraceful decision! _E_
Really disgusting that the failing New York Times allows dishonest writers to totally fabricate stories. _E_
Ashley Judd's candidacy was created by Karl Rove's terrible ads even before she thought seriously about running... _E_
Donald Trump Tells @theblaze About His Obama Announcement: PASSPORT APPLICATIONS TELL YOU A LOT __HTTP__ by @BillyHallowell _E_
I just wrapped up a Q&A @TwitterNYC. Thanks for all your questions! #AskTrump __HTTP__ _E_
In less than a week I'll be honored by Sarasota GOP as Statesman of the Year & then give my big surprise to @RNC convention. Will be fun! _E_
The Muslim Brotherhood @BarackObama's allies in Egypt will cancel the Camp David Agreement. __HTTP__ What a disaster! _E_
I know Mark Cuban well. He backed me big time but I wasn't interested in taking all of his calls.He's not smart enough to run for president! _E_
Re Super PAC scam: What the other candidates are doing is a disgrace. _E_
FLASHBACK – "Donald Trump Answers Boy's Prayer for New Bike" __HTTP__ via @FoxNewsInsider _E_
Thank you Greeley CO! REAL change means restoring honesty to the govt. Our plan will END govt. corruption! Watch:... __HTTP__ _E_
Just watched Cookie Roberts on @ABC. Her predictions have been so wrong for so long that she has lost all credibility. Just another sad case _E_
Eric did a great job with his Eric Trump Foundation annual charity outing. I'm proud of him. __HTTP__ _E_
Great speech by my good friend @GovChristie. He did something you won't hear at @BarackObama's convention tell the truth. _E_
The economy is broken. Entrepreneurship is being suppressed. See what I do Wednesday 11 AM at Trump Tower atrium. _E_
Gary Sinise is doing tremendous work for veterans through his foundation—check it out @GarySiniseFound _E_
Congratulations to @IsraeliPM @netanyahu on forming his new unity government. A major political success for the Jewish State of Israel. _E_
The Blue Monster at Trump National Doral recieved rave reviews from both players and architectural critics following the Cadillac WGC.Thanks _E_
Thank you Columbus Ohio! I will be back soon. #ImWithYou #MAGA __HTTP__ _E_
Hypocrite @BarackObama has major investments in companies that are outsourcing jobs overseas __HTTP__ _E_
I am at Trump National Doral best resort in U.S. Rory and Adam Scott are doing great! Watch on NBC at 3:00 P.M. MAKE AMERICA GREAT AGAIN! _E_
I am having a really hard time watching @FoxNews. _E_
Broken borders $18T debt ObamaCare failing & over budget. Don't worry our president is still fundraising __HTTP__ Priorities _E_
I'm at @WrestleMania tonight but will be doing a few tweets. I know the episode well.... #CelebApprentice _E_
RT @foxandfriends: Trump fires new warning shot at McConnell leaves door open on whether he should step down __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
It's time for Ted Cruz to either settle his problem with the FACT that he was born in Canada and was a citizen of Canada or get out of race _E_
Via @BreitbartNews by @mboyle1: "Obama's Amnesty Will Give Illegal Aliens Public Benefits" __HTTP__ _E_
Entrepreneurs: Negotiation is an art. Treat it like one. _E_
Spoke to President of Mexico to give condolences on terrible earthquake. Unable to reach for 3 days b/c of his cell phone reception at site. _E_
A strong military makes us respected by our allies & feared by our enemies. Let's Make America Great Again! __HTTP__ _E_
You have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_
Via @BBCNews: "US property tycoon Donald Trump confirms Turnberry buy" __HTTP__ _E_
Michael Vick of the Philadelphia @eagles is a great athlete but not a great quarterback. _E_
I would love to see the Republican party and everyone get together and unify.Video: __HTTP__ __HTTP__ _E_
Continuous effort not strength or intelligence is the key to unlocking our potential. Winston Churchill _E_
His @BarackObama's specialties? Vacations and campaigning. Jobs not so much! _E_
'Food Groups' – Emails Show Clinton Campaign Organized Potential VPs By Race And Gender: __HTTP__ _E_
A great day in New Jersey for Trump! __HTTP__ & __HTTP__ _E_
"Donald Trump on Mark Levin: Karl Rove is one of the most overrated people in politics" __HTTP__ via @TheRightScoop _E_
In my new book #TimeToGetTough I make a full financial disclosure detailing my net worth. __HTTP__ _E_
Making his case in a nice and articulate manner. _E_
Amazing how the haters & losers keep tweeting the name "F**kface Von Clownstick" like they are so original & like no one else is doing it... _E_
Be sure to watch The Apprentice tonight 10 p.m. on NBC it's an episode you won't forget! _E_
Remember the most hated part of ObamaCare is the Individual Mandate which is being terminated under our just signed Tax Cut Bill. _E_
It is my great honor to be speaking at CPAC 2013. They are all about what's good for America. _E_
Across the battlefields oceans and harrowing skies of Europe and the Pacific throughout the war one great battle cry could be heard by America's friends and foes alike:"REMEMBER PEARL HARBOR." __HTTP__ _E_
Dummy @BillMaher forgot to say that he made an absolute offer which I accepted. Hopefully charity gets $5M dollars. _E_
Join me in Phoenix Arizona today at 4pm! #Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_
Must read editorial co written by @weeklystandard editor William Kristol & @NRO editor @RichLowry 'Kill the Bill' __HTTP__ _E_
Under President @BarackObama China has experienced unusually fast gains and America unusually fast losses. #TimeToGetTough _E_
The MOVEMENT in Portsmouth New Hampshire w/ 7K supporters. THANK YOU! This is the biggest election of our lifetime... __HTTP__ _E_
Remember this: Obama wants to raise taxes @MittRomney wants to lower taxes need I say more! _E_
I can't believe the great @wjcarter got canned by @nytimes. He was a fantastic reporter & really knew entertainment. He will be missed! _E_
FLASHBACK: "Alex Salmond pleaded with Donald Trump to back release of Lockerbie bomber" __HTTP__ @telegraphnews ... _E_
I thought people weren't celebrating? They were cheering all over even this savage from Orlando. I was right. __HTTP__ _E_
Will be interviewed by @SeanHannity on @FoxNews at 10:00pm tonight. Enjoy! _E_
Big poll just out by @TheEconomist has me in 1st. place by a lot. A great honor but we have a long way to go to MAKE AMERICA GREAT AGAIN! _E_
My @IngrahamAngle interview discussing healthcare monopolies @MittRomney oil prices and @AnnRomney's birthday __HTTP__ _E_
Our way of life is under threat by Radical Islam and Hillary Clinton cannot even bring herself to say the words. _E_
.@foxandfriends Dems are taking forever to approve my people including Ambassadors. They are nothing but OBSTRUCTIONISTS! Want approvals. _E_
Via @OceanDriveMag by @SuzMcGeeNYC: Q&A: Ivanka Trump on the Business of Golf & the Championships __HTTP__ _E_
Change is the law of life. And those who look only to the past or present are certain to miss the future. John F. Kennedy _E_
Finally held our first full @Cabinet meeting today. With this great team we can restore American prosperity and br... __HTTP__ _E_
What is he reading? #Oscars _E_
President should not be telling the Washington Redskins to change their name our country has far bigger problems! FOCUS on themnot nonsense _E_
Enjoy the ratings of President Obama. __HTTP__ _E_
I am on @foxandfriends now! Tune in! _E_
Just left the #G7Summit. Had great meetings on everything especially on trade where.... _E_
The numbers at the @nytimes are so dismal especially advertising revenue that big help will be needed fast. A once great institution SAD! _E_
Thank you for such a wonderful and unforgettable visit Prime Minister @Netanyahu and @PresidentRuvi. _E_
Obama's Amnesty Executive Order can now be stopped by Majority Leader McConnell with riders. That's one reason we needed the Senate. _E_
This is a once in a generation opportunity to offer historic tax relief to the American people! Join me today: __HTTP__ __HTTP__ _E_
Bob Tyrrell @AmSpec—Thank you and also for the great work you do. _E_
Let us give thanks for all that we have and let us boldly face the exciting new frontiers that lie ahead. Happy Th... __HTTP__ _E_
Susan Rice is a good woman but Pres. O should not taunt the Republicans by appointing her S of S... _E_
Are you expanding your business? Interview returning soldiers. Give them strong consideration. Their sacrifices deserve it. _E_
A list from @Heritage: Top 10 Most Expensive Obamacare Taxes and Fees __HTTP__ _E_
Justice Roberts turned on his principles with absolutely irrational reasoning in order to get loving press from (cont) __HTTP__ _E_
What a year it's been and we're just getting started. Together we are MAKING AMERICA GREAT AGAIN! Happy New Year!! __HTTP__ _E_
Welcome back @SteveScalise!#TeamScalise __HTTP__ _E_
I would feel sorry for @JebBush and how badly he is doing with his campaign other than for the fact he took millions of $'s of hit ads on me _E_
RT @EricTrump: Friends in #FL #OH #NC #IL & #MO we would be honored to have your #VOTE! #SuperTuesday #LetsDoThis #MakeAmericaGreatAgain #T... _E_
Isn't it sad that lightweight Senator Bob Corker who couldn't get re elected in the Great State of Tennessee will now fight Tax Cuts plus! _E_
Mexican gov doesn't want me talking about terrible border situation & horrible trade deals. Forcing Univision to get me to stop no way! _E_
Next year I will be changing the name of 800 acre Doral to Trump National Doral. It will be the best resort in the country—Miami is hot! _E_
Crazy Dennis Rodman is saying I wanted to go to North Korea with him. Never discussed no interest last place on Earth I want to go to. _E_
The first meeting Jeff Sessions had with the Russian Amb was set up by the Obama Administration under education program for 100 Ambs...... _E_
Thank you Tampa Florida!#AmericaFirst #TrumpTrain __HTTP__ _E_
I will be on Fox & Friends tomorrow morning at 7.ºº _E_
What a STUPID deal for Verizon to buy AOL for $4.4 billion. AOL has been bad luck for everyone who touched it. Worth less than $1 billion! _E_
Why are we giving away our entire strategy and tactics we will deploy against ISIS? It puts our troops at a disadvantage. _E_
RT @Bet22325450ste: @FoxBusiness @foxandfriends Come on America. Get on the Trump Train. The winners already have boarded! The losers are w... _E_
Now @BarackObama is praising China's cooperation in negotiations over Chen Guangcheng __HTTP__ This is a sad episode for us. _E_
Don't ever think you've done it all already or that you've done your best. That's a shortcut to undermining your own potential. _E_
The Jets should have let them score to get the number one draft pick who will be really good. It will just never change for them! _E_
Hillary Clinton doesn't have the strength or the stamina to MAKE AMERICA GREAT AGAIN! #AmericaFirst __HTTP__ _E_
Under President Trump unemployment rate will drop below 4%. Analysts predict economic boom for 2018! @foxandfriends and @Varneyco _E_
Congratulations to @arsenioofficial on his new late night show! He will do really well. (It pays to win #CelebrityApprentice) _E_
Today it was my great honor to meet with the Crown Prince of Bahrain at the @WhiteHouse. Bahrain and the United States are important partners.During the Crown Prince's visit he is advancing $9 BILLION in commercial deals including finalizing the purchase of F 16's... __HTTP__ _E_
Glad to hear @InsideEdition has hired @_KatherineWebb to cover @SuperBowl. She will be absolutely terrific! Miss USA pageant is proud. _E_
Snowden has given serious information to China and Russia anyone who thinks otherwise is a dope! He is a traitor who fled he knew the crime! _E_
.@GOP need to face reality – not one of the illegal immigrants granted amnesty will vote Republican. _E_
How much is South Korea paying the U.S. for protection against North Korea???? NOTHING! _E_
France is losing its businesses and wealth rapidly and day by day. _E_
Take a tour of this amazing penthouse in Trump Park Avenue.... __HTTP__ _E_
THANK YOU ARIZONA! 20000 amazing supporters! Get out and #VoteTrump on Tuesday. I love you!#MakeAmericaGreatAgain __HTTP__ _E_
Last time lightweight @JebBush tried to knock off @marcorubio he made a total fool of himself. If he doesn't do better this time he is out! _E_
How does a dummy like @billmaher get a television show & his ratings stink. You'd think @HBO could do a lot better. _E_
Via @TWtravelnews by Robert Silk: "Renovations make Trump's Doral a showcase once again" __HTTP__ _E_
'ICE OFFICERS WARN HILLARY IMMIGRATION PLAN WILL UNLEASH GANGS CARTELS & DRUG VIOLENCE NATIONWIDE'... __HTTP__ _E_
The terrorists cut off the heads of Americans and laugh then want to sell us the bodies for $1000000. We fight over sleep deprivation! _E_
Via @BBCNews: "Donald Trump visits his newly purchased Turnberry golf resort" __HTTP__ _E_
Hopefully the violence & unrest in Charlotte will come to an immediate end. To those injured get well soon. We need unity & leadership. _E_
Scotland will be so lucky if this monstrosity is not built—I will tie them up in courts for years if necessary. _E_
This is going to be a special season truly great characters and cast. You will soon see! _E_
The Lincoln Day Dinner last night in Michigan was fantastic. Record attendance and tremendous enthusiasm I loved it! _E_
Do you notice the Fake News Mainstream Media never likes covering the great and record setting economic news but rather talks about anything negative or that can be turned into the negative. The Russian Collusion Hoax is dead except as it pertains to the Dems. Public gets it! _E_
Great rally in Iowa! Such wonderful people. Traveling now with @SarahPalinUSA to Tulsa massive crowd expected! __HTTP__ _E_
This afternoon I'll be speaking with Neil Cavuto on Your World with Neil Cavuto 4 p.m. on FOX News. _E_
The highly neurotic Debbie Wasserman Schultz is angry that after stealing and cheating her way to a Crooked Hillary victory she's out! _E_
After years of Comey with the phony and dishonest Clinton investigation (and more) running the FBI its reputation is in Tatters worst in History! But fear not we will bring it back to greatness. _E_
Many people will be surprised at what is about to be released concerning @BarackObama's background. I for one won't be. _E_
James Comey leaked CLASSIFIED INFORMATION to the media. That is so illegal! _E_
The GOP Debate Scorecard: Donald Trump and Energy by Wayne Allyn Root. __HTTP__ _E_
In order to try and deflect the horror and stupidity of the Wikileakes disaster the Dems said maybe it is Russia dealing with Trump. Crazy! _E_
I will be interviewed tonight at 7pm ET by @greta #OnTheRecord _E_
Hillary flunky who lost big. For the 100th time I never mocked a disabled reporter (would never do that) but simply showed him....... _E_
What is Obama thinking? __HTTP__ _E_
Many countries are cutting back big time on ugly industrial wind turbines. The energy is very inefficient & (cont) __HTTP__ _E_
I want to thank @RealSheriffJoe for all of his help in our historic Arizona win. Could not have done it without you Joe! _E_
RT @FoxNews: TONIGHT on Justice @JudgeJeanine talks to special guests @EricTrump and @LaraLeaTrump Tune in at 9p ET on Fox News Channe... _E_
I don't want to hit Crazy Bernie Sanders too hard yet because I love watching what he is doing to Crooked Hillary. His time will come! _E_
The dying @NYDailyNews asked me to do an Editorial on the Central Park 5 ripoff & then they pretend it was my idea. Loser newspaper! _E_
Irresponsible! In the last 6 months @BarackObama has held over 100 fundraisers and not a single meeting with his Job Council. _E_
.@DennisDMZ Thanks for the nice words. You are fantastic! _E_
"Be objective and strive to be your own counselor. Listen to others but know the final decision is yours." – Think Like a Champion _E_
Thank you Eau Claire Wisconsin. #VoteTrump on Tuesday April 5th!MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
You're never a loser until you quit trying. Mike Ditka _E_
Instead of trash talking @PMIsrael on the world stage @BarackObama should be defending @Israel. _E_
Wow Vanity Fair was totally shut out at the National Magazine Awards it got NOTHING. Graydon Carter is a loser with bad food restaurants! _E_
RT @Newsmax_Media: Trumps Warns of Obama Tipping Point that May Destroy America __HTTP__ via @Newsmax_media _E_
Jeb's brother George insisted on a $100000 fee and $20000 for a private jet to speak at a charity for severely wounded vets. Not nice! _E_
Welfare's purpose should be to eliminate as far as possible the need for its own existence. – Pres. Ronald Reagan _E_
What I am saying is stay out of Syria. _E_
By the US winning the Olympic medal count we proved that both the American spirit & talent is greater than a 1.4B population. USA! _E_
This just in re: FundAnything and producer Brad Wyman __HTTP__ _E_
If you think we have a problem with Social Security and Medicare now try taking in millions of new citizens all at once. _E_
ObamaCare continues to increase insurance premiums & raise record deductibles. New Congress must use every tool to defund. _E_
If you've got some problems today that's a good sign that's life. So give them some thought and make the most of the situation. _E_
The U.S. Coast Guard FEMA and all Federal and State brave people are ready. Here comes Irma. God bless everyone! _E_
RT @foxandfriends: U.S. Air Force jets take off from Guam for training ensuring they can 'fight tonight' __HTTP__ _E_
"Success breeds success. The best way to impress people is through results." – Think Like a Billionaire _E_
"Sometimes by losing a battle you find a new way to win the war. The Art of The Deal _E_
.@piersmorgan is back! Did I see @OMAROSA wince? #CelebApprentice _E_
Why wouldn't the @WSJ call for comment or clarification before writing an editorial which is so totally wrong. No wonder it is doing poorly! _E_
RT @AmbJohnBolton: Our country & civilians are vulnerable today because @BarackObama did not believe in national missile defense. Let's nev... _E_
They only changed the term to CLIMATE CHANGE when the words GLOBAL WARMING didn't work anymore. Come on people get smart! _E_
Word is that Crooked Hillary has very small and unenthusiastic crowds in Pennsylvania. Perhaps it is because her husband signed NAFTA? _E_
Rory Tiger Phil and Ernie will be fun to watch this weekend at Trump National Doral. _E_
I would like to wish all fathers even the haters and losers a very happy Fathers Day. _E_
Entrepreneurs: Let your actions show that you're the best. See each day as an opportunity to show you can do business at the highest level. _E_
Hillary and Sanders are not doing well but what is the failed former Mayor of Baltimore doing on that stage? O'Malley is a clown. _E_
One thing I will say about Rep. Keith Ellison in his fight to lead the DNC is that he was the one who predicted early that I would win! _E_
Despite thousands of hours wasted and many millions of dollars spent the Democrats have been unable to show any collusion with Russia so now they are moving on to the false accusations and fabricated stories of women who I don't know and/or have never met. FAKE NEWS! _E_
Thanks @PiersMorgan. You're great! _E_
If you want to know how to prevail through tough circumstances then read The Art of the Comeback. _E_
"True courage is being afraid and going ahead and doing your job anyhow!" General Norman Schwarzkopf _E_
#MakeAmericaGreatAgain #TrumpRallyAL __HTTP__ _E_
To aspiring entrepreneurs: Be focused! Know your goals. Put everything you've got into what you're doing every single day. _E_
I think everyone will like my new and very successful book Crippled America. Go get it and let me know what you think! _E_
Just tried watching Modern Family written by a moron really boring. Writer has the mind of a very dumb and backward child. Sorry Danny! _E_
RT @gatewaypundit: BREAKING POLL: Trump Gains 11 Points on Clinton Since March=> Now Leads Crooked Hillary 46 44 __HTTP__ vi... _E_
The forgotten men and women of our country will be forgotten no longer. From this moment on it's going to be #AmericaFirst _E_
Obama promised premiums would lower $2500/yr for family of 4. In truth healthcare will increase by $7450 __HTTP__ _E_
From Donald Trump: Ivanka and Jared's wedding was spectacular and they make a beautiful couple. I'm a very proud father. _E_
The federal gov. has handled Sandy worse than Katrina. There is no excuse why people don't have electricity or fuel yet. _E_
I'm right TPM is wrong @BarackObama did not issue a special statement for Christmas however he issued one (cont) __HTTP__ _E_
.@CNN is so embarrassed by their total (100%) support of Hillary Clinton and yet her loss in a landslide that they don't know what to do. _E_
Now @BarackObama's Vice Chief of Joint Staff is defending China while they cheat __HTTP__ Wrong course of action. _E_
Congratulations to @gretawire on the 11 year anniversary of @FoxNews 'On the Record.' Always enjoy being interviewed by Greta. She's great. _E_
Newly minted diplomat @dennisrodman is a completely different competitor in All Star @CelebApprentice. Dennis is a legend! _E_
President Donald J. Trump Proclaims 5/14/2017 through 5/20/2017 as #PoliceWeek Proclamation... __HTTP__ _E_
The Dunes here are amazing and they're how I learned about geomorphology which is the study of movement landforms. We've had a great trip _E_
I will be making a major announcement today at 12:30 pm PST at Trump International Hotel & Tower Las Vegas (cont) __HTTP__ _E_
Left New Hampshire for Turnberry in Scotland which I am renovating. This place is incredible! @TrumpTurnberry _E_
.@realDonaldTrump on ISIS&OIL FIELDS! Saying it for years! @AndersonCooper you should acknowledge this! #Trump2016 __HTTP__ _E_
Lightweight reporter Alex Pareene @pareene is known as a total joke in political circles. Hence he writes for Loser Salon. @Salon _E_
Why does the media with a strong push from Crooked Hillary keep pushing the false narrative that I want to raise taxes. Exactly opposite! _E_
President Reagan put it best: Welfare's purpose should be to eliminate as far as possible the need for its own existence. _E_
Ted Cruz is a cheater! He holds the Bible high and then lies and misrepresents the facts! _E_
Join me in Florida this Saturday at 5pm for a rally at the Orlando Melbourne International Airport!Tickets:... __HTTP__ _E_
Via @gazettedotcom by James Q. Lynch: "Trump to run typical caucus campaign 'but bigger'" __HTTP__ _E_
.@TraceAdkins is back—good news for Plan B. #CelebApprentice _E_
In Miami tracking @TrumpDoral's $250M renovations. Will be America's top resort. @PGATOUR just signed for 10 yr ext. __HTTP__ _E_
Whatever the United States can do to help out in London and the U. K. we will be there WE ARE WITH YOU. GOD BLESS! _E_
By popular request I will be live tweeting during Celebrity Apprentice (Sunday 9 P.M.). _E_
Listen to my interview with @KathieLGifford at @PodcastOne __HTTP__ _E_
For all of my many Jewish friends Happy Passover. _E_
Watch my video blog to see if your questions from my Facebook page were answered __HTTP__ _E_
After all is said and done more is said than done. Aesop _E_
.@ArsenioHall How quickly people forget but not me! You told me that without The Apprentice you could never have gotten your show Sad! _E_
Why can't the pundits be honest? Hopefully we are all looking for a strong and great country again. I will make it strong and great! JOBS! _E_
Sen. Kay Hagan voted for Amnesty & ObamaCare. She is a proven liberal who recklessly goes along with Obama. Vote @ThomTillis in November! _E_
Entrepreneurs: Keep an open mind. Business is a creative endeavor. _E_
.@ArsenioHall The only thing you don't mention in the nice Esquire piece about you is The Apprentice without which you would be nowhere! _E_
New polls out today are very good considering that much of the media is FAKE and almost always negative. Would still beat Hillary in ..... _E_
Amazing playing with an ankle injury @Yankees Captain Derek Jeter tied Willie Mays last night for #10 on (cont) __HTTP__ _E_
The Massive Tax Cuts which the Fake News Media is desperate to write badly about so as to please their Democrat bosses will soon be kicking in and will speak for themselves. Companies are already making big payments to workers. Dems want to raise taxes hate these big Cuts! _E_
Obama & Democrat leaders did a great disservice by releasing the papers on torture. The world is laughing at us—they think we are fools! _E_
If @BarackObama had such a wonderful academic record why wouldn't he want to show it? _E_
.@Macys stock just dropped. Interesting. So many people calling to say they are cutting up their @Macys credit card. Thank you! _E_
The EPA official who wants to crucify gas companies resigned __HTTP__ Good but his attitude is endemic in the EPA _E_
You can't compare anything to ObamaCare because ObamaCare is dead. Dems want billions to go to Insurance Companies to bail out donors....New _E_
While Jeb Bush is cutting staff and salaries after having paid ridiculous amounts of money why did he pay so much in the first place? _E_
"If you don't have time to do it right when will you have time to do it over?" John Wooden _E_
Each time I see one of Anthony Weiner's television ads for mayor I ask what the hell is he doing just wasting money & time go get a job! _E_
Getting the strong endorsement of the great coach Bobby Knight has been a highlight of my stay in Indiana. Big speech tomorrow with Bobby! _E_
Wow the highly respected Governor of Iowa just stated that Ted Cruz must be defeated. Big shoker! People do not like Ted. _E_
Getting rid of the mortgage interest deduction would be a disaster for homeowners who have suffered enough! _E_
Flashback: "NYers were grateful when Donald Trump finished ahead of schedule and under budget the Wollman Rink" __HTTP__ _E_
The weather has been so cold for so long that the global warming HOAXSTERS were forced to change the name to climate change to keep $ flow! _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
.@Merck Pharma is a leader in higher & higher drug prices while at the same time taking jobs out of the U.S. Bring jobs back & LOWER PRICES! _E_
WORKING TOGETHER we will defeat this #OpioidEpidemic & free our nation from the terrible affliction of drug abuse. __HTTP__ __HTTP__ _E_
I'm giving away money! 11AM Trump Tower. Be there or be left behind! _E_
Conservatives have to be smart in the way we speak. Using crazy language that terifies seniors accomplishes (cont) __HTTP__ _E_
Stupid Arianna @huffingtonpost hired the man who ruined the once great NYTimes Business Section... _E_
The response has been fantastic actually overwhelming! Thank you! _E_
Great job by all law enforcement officers and Boston Mayor @Marty_Walsh. _E_
Always remember SOMETIMES YOUR BEST INVESTMENTS ARE THE ONES YOU DON'T MAKE! _E_
Brande would have been fired immediately if she didn't raise $132000 a really large sum. Bret on the other hand raised very little... _E_
Entrepreneurs: Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_
Just returned from Colorado. Amazing crowd! _E_
Getting ready to make my speech at #KansasCaucus. A great honor! #MakeAmericaGreatAgain #Trump2016 _E_
The word is that Lance Armstrong will now implicate officials and others but who knows if he's telling the truth _E_
Iran will only get stronger in Iraq with the latest civil war. We should have taken the oil immediately after the invasion. _E_
Looking forward to the debate tonight and will be tweeting live with very honest assessment. _E_
Watch my appearance on @Morning_Joe great interview! __HTTP__ _E_
Along with a soaring bar of sky bound gold @TrumpLasVegas' pool deck overlooks the City of Lights __HTTP__ _E_
Getting China to stop playing its currency charades can begin whenever we elect a president ready to take (cont) __HTTP__ _E_
.@IanJamesPoulter Great going and almost as importantly your clothing line is selling well! _E_
Hillary Clinton colluded with the Democratic Party in order to beat Crazy Bernie Sanders. Is she allowed to so collude? Unfair to Bernie! _E_
Yet more evidence of a media rigged election: __HTTP__ _E_
Obama's Secret Service catastrophe has openly revealed a great lack of respect for our President. If they (cont) __HTTP__ _E_
Over 50 women were interviewed by the @nytimes yet they only wrote about 6. That's because there were so many positive statements. _E_
So so so important MAKE AMERICA GREAT AGAIN! _E_
"Failures are expected by losers ignored by winners." @CoachJoeGibbs _E_
Lots of response to my comment on Diet Coke  let's face it it doesn't work just makes you hungry. _E_
Playing golf with Prime Minister Abe and Hideki Matsuyama two wonderful people! __HTTP__ _E_
Many Democrats up for reelection in 2012 are skipping the DNC convention in Charlotte __HTTP__ Smart politics! _E_
Thank you to the @washingtonpost for the accurate and very discriptive story on my speech in Alabama last night. It was a great evening! _E_
First the Ninth Circuit rules against the ban & now it hits again on sanctuary cities both ridiculous rulings. See you in the Supreme Court! _E_
I don't believe in government picking winners or in the case of (@BarackObama) picking losers @MittRomney _E_
Must read column for all young people: Obama's war on young voters who elected him __HTTP__ _E_
Former Homeland Security Advisor Jeh Johnson is latest top intelligence official to state there was no grand scheme between Trump & Russia. _E_
Together our task is to strengthen our families to build up our communities to serve our citizens and to celebrate AMERICAN GREATNESS as a shining example to the world.... __HTTP__ _E_
Help fight autism go to __HTTP__ website for __HTTP__ donations & government activation. _E_
The ABC/Washington Post Poll even though almost 40% is not bad at this time was just about the most inaccurate poll around election time! _E_
In the heart of the city Trump International Toronto is the city's most elite property __HTTP__ True luxury at its finest. _E_
Josh Brolin a friend of mine was terrific in Men in Black. Congrats! _E_
.@JebBush was terrible on Face The Nation today. Being at 2% and falling seems to have totally affected his confidence. A basket case! _E_
There are only 22 days for @BarackObama to drop @JoeBiden. Obama is not a loyal guy. I think he is strongly considering it. _E_
The Republicans are always worried about the press they should just do what is right. _E_
ISIS gained tremendous strength during Hillary Clinton's term as Secretary of State. When will the dishonest media report the facts! _E_
Entrepreneurs: Don't ever think you've done it all already or that you've done your best. Don't sell yourself short! _E_
I wonder if Marshawn Lynch will now speak and call some coach a moron for not allowing him to run the ball three times for one yard? _E_
Playef golf today with Prime Minister Abe of Japan and @TheBig_Easy Ernie Els and had a great time. Japan is very well represented! _E_
The atrium of @TrumpTowerNY dressed up for Christmas __HTTP__ _E_
Going to Salt Lake City Utah for a big rally. Lyin' Ted Cruz should not be allowed to win there Mormons don't like LIARS! I beat Hillary _E_
I'm leading by big margins in every poll but the press keeps asking would you ever get out? They are just troublemakers I'm going to win! _E_
Rev. Wright called @BarackObama on tape a liar. Why isn't this being looked into? It would be a great commercial for the republicans. _E_
Really interesting President Obama was quick to shut down flights to Isreal but is totally unwilling to shut down flights from West Africa! _E_
If Saudi Arabia which has been making one billion dollars a day from oil wants our help and protection they must pay dearly! NO FREEBIES. _E_
Paul Teutul is always good on the show. #CelebApprentice _E_
Derek Jeter broke ankle one day after he sold his apartment in Trump World Tower. _E_
We have a sacred duty to care for our vets and their families. Veterans deserve universal access to care anywhere and anytime! _E_
The Fake News Media will not talk about the importance of the United Nations Security Council's 15 0 vote in favor of sanctions on N. Korea! _E_
Thank you Gettysburg Pennsylvania! #DrainTheSwamp __HTTP__ _E_
RT @dmartosko: This is the #NYTimes. Can you understand why so many reporters are cautious about working for them? __HTTP__ _E_
Love the people of South Carolina look very much forward to the debate tonight. _E_
If a new HealthCare Bill is not approved quickly BAILOUTS for Insurance Companies and BAILOUTS for Members of Congress will end very soon! _E_
The #USSJohnFinn will provide essential capabilities to keep America safe. Our sailors are the best anywhere in the world. Congratulations! __HTTP__ _E_
I hope that Crooked Hillary picks Goofy Elizabeth Warren sometimes referred to as Pocahontas as her V.P. Then we can litigate her fraud! _E_
I will unveil my first campaign ads on @Morning_Joe at 6:30am tomorrow. Enjoy! #MakeAmericaGreatAgain _E_
The new season of the Celebrity Apprentice begins Feb. 12 be prepared for the best season yet! __HTTP__ _E_
Derek Jeter is playing phenomenal baseball. He is a total winner and also a great guy. @DerekJeter _E_
I make good deals. That's what I do. I would make great deals for our country. my @SRQRepublicans speech _E_
I have fun I love what I do. You should too. Find out how at the National Achievers Conference this October in London __HTTP__ _E_
ALABAMA get out and vote for Luther Strange he has proven to me that he will never let you down! #MAGA _E_
Happy Thanksgiving to all. Have a great day and look forward to the future. We will MAKE AMERICA GREAT AGAIN! _E_
I will be on with @BretBaier tonight at 6PM. #Trump2016 _E_
The Trumping of Turnberry via Links Magazine @TrumpTurnberry __HTTP__ _E_
Watch Donald Trump's recent appearance on The Late Show with David Letterman: __HTTP__ _E_
I will be interviewed by @oreillyfactor tonight on @FoxNews at 11pm. Enjoy! _E_
Great @foxbusiness interview with @EricTrump on @TeamCavuto discussing the real estate economy & 2016 __HTTP__ _E_
Trump buys mansion adjacent to family winery __HTTP__ via @trdny _E_
When will we stop wasting our money on rebuilding Afghanistan? We must rebuild our country first. _E_
Join me LIVE at 5:45pmE from Harrisburg Pennsylvania! #TaxReform #USA __HTTP__ __HTTP__ _E_
We only want to admit those who love our people and support our values. #AmericaFirst _E_
Will be on @foxandfriends tomorrow morning at 7:00. _E_
Don't forget episodes 2 and 3 of @ApprenticeNBC are on tonight at 8PM and 9PM on @NBC. _E_
More on Benghazi cover up: "ATTORNEY FOR WHISTLEBLOWER: 400 U.S. MISSILES STOLEN IN BENGHAZI" __HTTP__ Really bad. _E_
Weiner is gone Spitzer is gone next will be lightweight A.G. Eric Schneiderman. Is he a crook? Wait and see worse than Spitzer or Weiner _E_
Obama wants to unilaterally put a no fly zone in Syria to protect Al Qaeda Islamists __HTTP__ Syria is NOT our problem. _E_
The Fake News is becoming more and more dishonest! Even a dinner arranged for top 20 leaders in Germany is made to look sinister! _E_
If you don't publicize your successes your competitors will be sure to belittle them. Get the word out! _E_
Obama is now warning North Korea on the Yongbyon nuclear reactor __HTTP__ After Syria our enemies are laughing! _E_
Go out and buy CRIPPLED AMERICA: How to Make America Great Again. Doing really well. Great Thanksgiving or Christmas present! _E_
Can't wait for @VanityFair to fold which under Graydon Carter will be sooner rather than later. _E_
Hmmm...can you imagine me speaking at the RNC Convention in Tampa? __HTTP__ That's a speech everyone would watch. _E_
RT @realDonaldTrump: More and more people are suggesting that Republicans (and me) should be given Equal Time on T.V. when you look at the... _E_
A great story in the New York Post really well written! __HTTP__ _E_
Yes it is true Carlos Slim the great businessman from Mexico called me about getting together for a meeting. We met HE IS A GREAT GUY! _E_
Today we witnessed an incredible moment in history – the presentation of Congress' highest civilian honor to our friend and true AMERICAN HERO Bob Dole. #CongressionalGoldMedal __HTTP__ _E_
.@FoxNews from multiple sources: There was electronic surveillance of Trump and people close to Trump. This is unprecedented. @FBI _E_
Canadians: My ultra luxury private plane will be featured on Sunday's episode of #MightyPlanes on @DiscoveryCanada don't miss it at 8 ET! _E_
Hillary Clinton just lost every Republican she ever had including Never Trump all farmers & sm. biz by saying she'll tax estates at 65%. _E_
Thinking big is the driving force that has forged all the great achievements in modern life. Think Big _E_
All time hit leader Pete Rose should now be in the Baseball Hall Of Fame. He has paid his penalty! _E_
MAKE AMERICA SAFE AND GREAT AGAIN! #TrumpPence16 __HTTP__ __HTTP__ _E_
Strange why didn't @BarackObama hold any special event to celebrate the 2 year anniversary of ObamaCare? __HTTP__ _E_
Thank you! #MakeAmericaGreatAgain __HTTP__ _E_
Jerry Falwell Jr. stated speech was best in University's history...my great honor. _E_
The more I get to know @MittRomney the more I like him. He has the judgment and private sector experience America needs in the White House. _E_
You have enemies? Good. That means you've stood up for something sometime in your life. Winston Churchill _E_
Thank you @DonaldJTrumpJr & @EricTrump. #Trump2016 __HTTP__ _E_
The failing @nytimes finally gets it In places where no insurance company offers plans there will be no way for ObamaCare customers to.. _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
AIR FORCE TRUMP: AHEAD OF 2016 THE DONALD SLAMS ROMNEY BUSH IN SOUTH CAROLINA __HTTP__ via @BreitbartNews by @mboyle1 _E_
Now that the ineffective Baltimore Police have allowed the city to be destroyed are the U.S. taxpayers expected to rebuild it (again)? _E_
Wow great ratings for @ApprenticeNBC __HTTP__ Don't forget watch 2 new episodes tonight at 8PM on @NBC. _E_
Received a standing ovation in packed house @MorningsideEdu after Sam Clovis intro! Let's Make America Great Again! __HTTP__ _E_
"Donald Trump: I'm Not Buying the @BrooklynNets" __HTTP__ via @TMZ_Sports _E_
Based on the fact that the very unfair and unpopular Individual Mandate has been terminated as part of our Tax Cut Bill which essentially Repeals (over time) ObamaCare the Democrats & Republicans will eventually come together and develop a great new HealthCare plan! _E_
Las Vegas' most elite destination @TrumpLasVegas' has 64 stories of golden glass & offers ultimate luxury __HTTP__ _E_
Achievers move forward at all times. Don't tread water. Get out there and go for it. _E_
RT @ABC: Pres. Trump: We cannot be defined by the evil that threatens us or the violence that incites such terror. __HTTP__ _E_
Doing Fox & Friends at 7 A.M. _E_
I wish the @WSJ Wall Street Journal had reported the just out @CNN Iowa Poll correctly. I lead by a wide margin13 points going up big! _E_
Read an excerpt from Think Like A Champion by Donald J. Trump: __HTTP__ _E_
Join me in Rome NY tomorrow!#Trump2016 #NYPrimaryTickets available: __HTTP__ _E_
I wish President @BarackObama the best of luck in his second term... _E_
Governor Alejandro García Padilla said presidential hopeful Sen. Marco Rubio "is no friend of Puerto Rico. __HTTP__ _E_
Ohio Senator @RobPortman: @MittRomney knows how to return prosperity: __HTTP__ #Mitt2012 #tcot _E_
I believe that President Obama is so overwhelmed by what is happening in the U.S. and throughout the World that he has totally given up! _E_
.@Yankees Kevin Youkilis is off to a terrific start. He's less than half the price and a much better player than a drug free A Rod. _E_
Via @BreitbartNews by @mboyle1: TRUMP: OBAMA SHOULDN'T ATTACK AMERICANS OVERSEAS HILLARY'S EMAIL WAS 'CRIMINAL __HTTP__ _E_
Don't let up keep getting out to vote this election is FAR FROM OVER! We are doing well but there is much time left. GO FLORIDA! _E_
Another Crooked Hillary Fan! __HTTP__ _E_
Even more @BarackObama crony capitalism & corruption. We are guaranteeing a $105M loan to another Obama donor __HTTP__ _E_
...Trump however would kick his ass! _E_
...LaVar you could have spent the next 5 to 10 years during Thanksgiving with your son in China but no NBA contract to support you. But remember LaVar shoplifting is NOT a little thing. It's a really big deal especially in China. Ungrateful fool! _E_
"We need more grown ups in Washington people who will shoot straight and level with the American people." #TimeToGetTough _E_
I'm sure the media will not report the highly respected new national poll that just came out via The Economist. 32%! __HTTP__ _E_
Getting ready to go to Iowa today. Big crowd will be a great day! _E_
Threatening phone calls from Obama supporters are being made to the Michigan GOP office __HTTP__ Don't be intimidated! _E_
#AskTrump Send me your questions to answer live from @TwitterNYC later this afternoon. _E_
We create success or failure on the course primarily by our thoughts. Gary Player _E_
Thank you! #Trump2016 __HTTP__ _E_
Today it was my great honor to sign the largest TAX CUTS and reform in the history of our country. Full remarks: __HTTP__ __HTTP__ _E_
Will be on Fox & Friends tomorrow morning at 7.00 hope you enjoy! _E_
Colorado Trump Delegates Scratched from Ballots at GOP Convention __HTTP__ _E_
Any American who fights with ISIS should have their passport revoked. Take them to Gitmo for interrogation. _E_
Wow Obama Care just got delayed by over a year because it is so complicated it cannot be understood the beginning of the end! _E_
JEB is a hypocrite! Used massive private Eminent Domain Just another clueless politician! __HTTP__ _E_
Penn State is doing a poor job in bringing its mess to a close.They should be ashamed for hiding Sandusky's crimes all these years... _E_
RT @opinionsamerica: @realDonaldTrump Strong administration leads to a strong response. _E_
The law requires individuals pay 15% on carried interest. Why would a potential President pay more than he or she is supposed to? _E_
.@Yankees manager Joe Girardi is a gritty leader who stands up for his players. Doing a great job! _E_
Obama is going to take away over 90M Americans' healthcare plans but he is letting Iran keep its nukes. Just think about that. _E_
Trump National Golf Club Los Angeles offers 18 holes fronting the Pacific Ocean on the Palos Verdes Peninsula. __HTTP__ _E_
The Fake News Media hates when I use what has turned out to be my very powerful Social Media over 100 million people! I can go around them _E_
See you tomorrow Dutchess County New York! #NYPrimary #TrumpTrain __HTTP__ __HTTP__ _E_
Big TAX REFORM AND TAX REDUCTION will be announced next Wednesday. _E_
Lifting off right now for U.S.S. Wisconsin in Norfolk. See ya' _E_
Next year @TomBrokaw should be the comedian at the White House Correspondents' dinner. The only problem is that... __HTTP__ _E_
Today we together won the Republican Nomination for President! __HTTP__ _E_
Obama has changed the Census so "it will be difficult to measure the effects" of O'Care __HTTP__ REAL data hidden _E_
The Inspector General's report on Crooked Hillary Clinton is a disaster. Such bad judgement and temperament cannot be allowed in the W.H. _E_
I will be on Fox & Friends tomorrow morning at 7.00. Ebola and ISIS will be topics. _E_
.@WhiteHouse #CEOTownHall __HTTP__ __HTTP__ _E_
Statement on Preventing Muslim Immigration: __HTTP__ __HTTP__ _E_
.@mcuban you were excellent on Howard Stern...thanks for the nice comments about my kids...yours are winners also! _E_
#trumpvlog My thoughts on gasoline prices skyrocketing...... __HTTP__ _E_
Only the Obama WH can get away with attacking Bob Woodward. _E_
To all of my twitter followers please contribute whatever you can to the campaign. We must beat Crooked Hillary. __HTTP__ _E_
.@TrumpDoral's golf courses The Red Tiger The Silver Fox & The Golden Palm are on track to open later this year __HTTP__ _E_
The attack on our Libyan consulate was the worst attack on the US since 9/11. Time for Obama to come clean. _E_
We have a MASSIVE trade deficit with Germany plus they pay FAR LESS than they should on NATO & military. Very bad for U.S. This will change _E_
I'm loyal to people who've done good work for me. #TheArtofTheDeal _E_
Is it a coincidence that the Middle East has blown up since Obama became president? _E_
Congratulations to @IvankaTrump on being named the @FoxNewsSunday Power Player of the Week __HTTP__ _E_
I started my business with very little and built it into a great company with some of the best real estate assets in the World. Amazing! _E_
Via @IBTimes: Miss Universe 2013: Contestants Stun in Gorgeous Gowns at National Gift Auction Gala __HTTP__ _E_
Via @USATODAY: "Trump endorses Wintour for ambassadorship" __HTTP__ _E_
Enjoyed watching @MonicaCrowley's analysis of my @BillOreilly interview. Great points! Thank you Monica. _E_
The CPAC speech went really well this morning first speaker standing ovation. I really enjoyed it. _E_
Ohio is losing jobs to Mexico now losing Ford (and many others). Kasich is weak on illegal immigration. We need strong borders now! _E_
The media can track down @PaulRyan's old girlfriend and marathon time but can't find @BarackObama's college applications or other info. _E_
Remember I am self funding my campaign the only one in either party. I'm not controlled by lobbyists or special interests only the U.S.A.! _E_
After years of long stops then starts why did dopey Eric Scheiderman tell people in The Trump Org. this case is going awaywe have no case _E_
Looking forward to visiting @SimpsonCollege on Wednesday to discuss education. Common Core is an attack on individual & local rights! _E_
The @Yankees should immediately stop paying A Rod—he signed his contract without telling them he was a druggie. _E_
Seven people shot and killed yesterday in Chicago. What is going on there totally out of control. Chicago needs help! _E_
Even Jimmy Carter just released a statement saying that Obama doesn't have a clue. That has to be a new low! _E_
RT @IngrahamAngle: "Far right"? You mean "right so far" as in @realDonaldTrump has been right so far abt how to kick the economy into high... _E_
Total fool @KarlRove is part of the Republican Establishment problem. An all talk no action dummy! __HTTP__ _E_
THANK YOU Grand Rapids Michigan! Time to end political correctness & secure our homeland! __HTTP__ __HTTP__ _E_
"Great effort springs naturally from great attitude." Pat Riley _E_
China is robbing us blind in trade deficits and stealing our jobs yet our leaders are claiming 'progress' __HTTP__ SAD! _E_
Thank you to @foxandfriends for the great review of the speech on immigration last night. Thank you also to the great people of Arizona! _E_
Hey @glennbeck see how I beat your boy Ted in your own Blaze poll? Your endorsement means nothing! #GOPDebate _E_
I am not angry at Russia (or China) because their leaders are far smarter than ours. We need real leadership and fastbefore it is too late _E_
RT @TeamTrump: We agree with Bill ObamaCare is "the craziest thing in the world." #BigLeagueTruth #Debates2016 __HTTP__ _E_
Thank you America! #Trump2016 __HTTP__ __HTTP__ _E_
Cryin' Chuck Schumer fully understands especially after his humiliating defeat that if there is no Wall there is no DACA. We must have safety and security together with a strong Military for our great people! _E_
Via @NewYorkObserver by @Bshapiro91: "Donald Trump @MelRivers Headline @Algemeiner Gala" __HTTP__ _E_
Obama is making speeches excoriating the Republicans and they never answer back. Why aren't they fighting? _E_
Great job on @donlemon tonight @kayleighmcenany @cherijacobus begged us for a job. We said no and she went hostile. A real dummy! @CNN _E_
I will be live tweeting the V.P. Debate. Very exciting! MAKE AMERICA GREAT AGAIN! _E_
Wow ISIS has just taken the City of Ramadi in Iraq. So many of our great soldiers died in originally going after it. Such a waste. _E_
Our spectacular ballroom under construction at the great Turnberry resort in Scotland. __HTTP__ _E_
Clinton's Top Aides Were Mired In Conflict Of Interest At The State Department: __HTTP__ #BigLeagueTruth _E_
.@yuSiddiqui @piersmorgan @rustyrockets I got much better—no contest—I got Melania! _E_
My thoughts on Joe Paterno and political analysts in today's #trumpvlog... __HTTP__ _E_
You should give the money back @HillaryClinton! #DrainTheSwamp __HTTP__ _E_
Weird why did BarackObama Sr. fail to list @BarackObama as his son in his 1961 INS application? __HTTP__ _E_
Illegal use of official Attorney General stationary by lightweight @AGSchneiderman. __HTTP__ _E_
Welcome to the new Egypt Muslim Brotherhood representatives who won't take questions from Israeli journalists __HTTP__ _E_
What my father really gave me is a good (great) brain motivation and the benefit of his experience unlike the haters and losers (lazy!). _E_
The tragedy in South Carolina is incomprehensible. My deepest condolences to all. _E_
I'm not hearing much from Obama or his administration about my $5M offer to charity or to which charity the money will go. _E_
What a convenient mistake: @BarackObama issued a statement for Kwanza but failed to issue one for Christmas. __HTTP__ _E_
I will be on @foxandfriends tomorrow morning at 7.00. Will be talking about sleazebag Jonathan Gruber ( Americans are stupid ) & exec order _E_
While the Fake News loves to talk about my so called low approval rating @foxandfriends just showed that my rating on Dec. 28 2017 was approximately the same as President Obama on Dec. 28 2009 which was 47%...and this despite massive negative Trump coverage & Russia hoax! _E_
Really looking forward to my address @CPACnews this Friday morning at 8:30. Will stress jobs etc. Can't wait to see my many friends. _E_
Sanders says he wants to run against me because he doesn't want to run against me. He would be so easy to beat! _E_
Congratulations to my friend @limbaugh on being named to the Hall of Famous Missourians. Rush is a great guy & a great character. _E_
Thanks Mark will be fun. __HTTP__ _E_
A great and important day at the United Nations.Met with leaders of many nations who agree with much (or all) of what I stated in my speech! _E_
Many people voted for Cruz over Carson because of this Cruz fraud. Also Cruz sent out a VOTER VIOLATION certificate to thousands of voters. _E_
My friend @ChristianJosi is making a very special LP. Follow him. Conservative leader by day likely 2015 GRAMMY winner by night. #LEGENDS _E_
Oh no another rapper doing a Trump song Young Jeezy Trump Lyrics. Why aren't these guys paying me? _E_
Was Susan Rice told to lie about Bergdahl? Obama and his representatives lie about virtually everything from ObamaCare to a deserter. _E_
USMC Andrew Tahmooressi should be freed immediately. He never should have been jailed in the first place. Weak leaders. #FreeOurMarine _E_
Adrian was recognized on a Disney cruise and has had many photo requests in @TrumpTowerNY. We have a new celebrity! #CelebApprentice _E_
Pageant people are really talking about Venezuela Brazil Mexico USA India Australia. _E_
What a great four days in Cleveland. So proud of the great job done by the RNC and all. The police and Secret Service were fantastic! _E_
On 800 pristine Miami acres @TrumpDoral boasts luxurious accommodations world class dining & championship golf __HTTP__ _E_
Watch @CNN at 9:00 A.M. @jaketapper. Then interviewed on @ABC @GStephanopoulos at 10:00 A.M. and then at 10:30 A.M. watch Face The Nation. _E_
Because the ban was lifted by a judge many very bad and dangerous people may be pouring into our country. A terrible decision _E_
It is time to remember that... __HTTP__ _E_
.@natalie_gulbis Thank you for your support this morning on @GolfChannel. Even more importantly play well this week! Say hi to all. _E_
Today we just passed 1.4 million twitter followers.. _E_
I will renegotiate NAFTA. If I can't make a great deal we're going to tear it up. We're going to get this economy running again. #Debate _E_
My @eonline interview discussing @_KatherineWebb's stardom and why @espn's apology was unwarranted __HTTP__ _E_
Degenerate former Congressman Anthony Weiner is trying to make a comeback. He is a sick & perverted man that New York does not want or need. _E_
For the nonbeliever here is a photo of @Neilyoung in my office and his $$ request—total hypocrite. __HTTP__ _E_
Thank you for your interest & support during last nights #GOPDebate! #IACaucus finder: __HTTP__ __HTTP__ _E_
... ...Do your research before donating this holiday season! _E_
My wife Melania Trump's show was a tremendous success last night. In case you missed her you can see her again tonight on @QVC at 7 pm ET _E_
RT @foxandfriends: Trump vows U.S. 'power' will meet North Korean threat __HTTP__ _E_
I would like to express my warmest regards best wishes and condolences to all of the families and victims of the horrible bombing in NYC. _E_
A very big poll is coming out at 6 PM in New Hampshire. Will be very interested in the results. _E_
Always great to speak with Veterans our nation's heroes. We will Make America Great Again! __HTTP__ _E_
'CNBC Time magazine online polls say Donald Trump won the first presidential debate' via @WashTimes. #MAGA __HTTP__ _E_
Wow Huffington Post just stated that I am number 1 in the polls of Republican candidates. Thank you but the work has just begun! _E_
The Apprentice was the #1 show on television last season on Sunday from 10 to 11 congratulations Donald! _E_
Scots should boycott Glenfiddich garbage for not choosing great Olympic & U.S. Open champ Andy Murray over total loser Michael Forbes. _E_
I will stand with police and protect ALL Americans! #Debates2016 #MAGA __HTTP__ _E_
Thank you Atlanta Georgia! Will be back soon! #AmericaFirst __HTTP__ _E_
A massive blow to Obama's message only 38000 new jobs for month in just issued jobs report. That's REALLY bad! _E_
True thanks. __HTTP__ _E_
AMERICA will once again be a NATION that thinks big dreams bigger and always reaches for the stars. YOU are the ones who will shape America's destiny. YOU are the ones who will restore our prosperity. And YOU are the ones who are MAKING AMERICA GREAT AGAIN! #MAGA __HTTP__ _E_
We had a GREAT year @Macys with ties shirts and suits thanks! New selections just arrived they are amazing! _E_
Young entrepreneurs – keep positive. Don't let the ObamaCare disaster stop your endeavors. There are great opportunities out there. _E_
Sure @BarackObama's literary agent claims the 1991 booklet was a 'mistake' __HTTP__ Pretty convenient. _E_
When will @AlexSalmond realize that he's destroying Scotland the most beautiful countryside in the world w/ his stupid wind turbines? _E_
Mike Huckebee a great guy said the President should appoint me Treasury Secretary. China and OPEC would not be happy. _E_
Forty six million Americans more than at any time ever in the history of this country now live under the poverty line. #TimeToGetTough _E_
This is no act of love as Jeb Bush said... __HTTP__ _E_
I'll be on @foxandfriends Monday at 7:30 AM. Be sure to tune in. _E_
"Trump: 'I like North Carolina we are looking at another deal'" __HTTP__ via @WSOC_TV _E_
Via @WSJPolitics by @reidepstein: "Trump Surges in Popularity in N.H." __HTTP__ _E_
.@WineEnthusiast's highest rated wine in Virginia @trumpwinery is the premier name in sophistication and quality __HTTP__ _E_
Via @Newsmax_Media by Courtney Coren: Trump: China Gets Iraq Oil US Gets Nothing __HTTP__ _E_
my presidency. Isn't this a ridiculous shame? He loves these kids has raised millions of dollars for them and now must stop. Wrong answer! _E_
...Trump International Hotel Las Vegas and Trump International Hotel & Tower Waikiki Beach Walk. __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Looking forward to the @CadillacChamp at Trump Doral next week 3.6. 3.10. Can't wait to meet the attendees. #WGCDoral _E_
So much SPIRIT in LA! Thank you to all of our HEREOS who saved many lives. An honor to spend time w/ @NationalGuard #LEOs & the #CajunNavy! __HTTP__ _E_
Via @Newsmax_Media: Romney Said Nothing Wrong __HTTP__ _E_
Watch Melania on QVC this morning from 10 a.m. to 11 a.m. with her third line of her Melania Timepieces & Jewelry collection... _E_
If Ebola is so non contagious how come an NBC cameraman caught it so quickly while over in West Africa? U.S. is behaving very foolishly! _E_
Oscar Pistorious is guilty as hell! _E_
.@CarlyFiorina Carly—I did graduate from Wharton and did very well. Who is your fact checker? Will you apologize? _E_
Eric Trump on @JudgeJeanine on @FoxNews now! _E_
...while her charity is getting less than 5 cents per donated dollar. She should be ashamed! _E_
The 2012 budget deficit is already $93 billion larger than earlier estimates __HTTP__ @BarackObama (cont) __HTTP__ _E_
Jodi should try but the Govt. should not make a deal no jury could be dumb enough to let her off (but you never know look at OJ & others) _E_
If you experience any harassment or heckling at the polling places from Obama supporters make sure you report it immediately. _E_
My @SquawkCNBC interview discussing why @MittRomney is a great nominee gas prices and why George Will is a loser. __HTTP__ _E_
You must be registered Republican by February 16th to vote TRUMP in the Florida primary. __HTTP__ _E_
Don't forget to watch Celebrity Apprentice tonight at 9 on NBC GREAT EPISODE! _E_
My Administration will continue to work around the clock with Governor @RicardoRossello & his team. Great progress being made! #PRStrong __HTTP__ _E_
Just got a call from my friend Bill Ford Chairman of Ford who advised me that he will be keeping the Lincoln plant in Kentucky no Mexico _E_
Will be on @foxandfriends at 7:00 5 minutes. _E_
The Chinese Envoy who just returned from North Korea seems to have had no impact on Little Rocket Man. Hard to believe his people and the military put up with living in such horrible conditions. Russia and China condemned the launch. _E_
#TrumpAdvice __HTTP__ _E_
Via @WashTimes by @CharlesHurt: Donald Trump declares war on lying street hustlers of Congress" __HTTP__ _E_
.@SpeakerRyan Congratulations and good luck you will do a GREAT job for our wonderful U.S.A.! _E_
.@bovanpelt. Bo I heard you were great at Trump National Westchester I am not at all surprised. Keep playing well you are a winner! _E_
Brand new selection of Trump Signature Collection shirts and ties @Macys. Go check them out. _E_
While the @Yankees look like they quit and are finished they won't quit for CC _E_
"Patriotism is supporting your country all the time and your government when it deserves it." Mark Twain _E_
It always seems impossible until it is done. Nelson Mandela _E_
My statement on NATO being obsolete and disproportionately too expensive (and unfair) for the U.S. are now finally receiving plaudits! _E_
"Labor disgraces no man unfortunately you occasionally find men who disgrace labor." Gen. Ulysses S. Grant _E_
Congratulations to @secupp on joining @newtgingrich on @CNN's Crossfire. Show will be excellent! _E_
Emin from Russia a very talented guy. All proceeds go to help the Philippines. @eminofficial #missuniverse __HTTP__ _E_
Which National Costume do you think should win? __HTTP__ _E_
Looking forward to joining @V4SA Tuesday 9/15 in L.A. aboard the @USSIowa The Battleship of Presidents! Join us! __HTTP__ _E_
Thanks @WWE @VinceMcMahon is an amazing guy. _E_
"Be flexible enough to adjust to changing circumstances." – Think Big _E_
We now have confirmation as to one reason Crooked H wanted to be sure that nobody saw her e mails PAY FOR PLAY. How can she run for Pres. _E_
Detroit's bankruptcy could just be the start __HTTP__ Many municipalities across US are over leveraged & losing citizens _E_
Giving away money and revolutionizing crowdfunding. Follow @fundanything to see which causes are financed daily _E_
#trumpvlog The Republicans must act now don't let @barackobama push you around.... __HTTP__ _E_
Via @theinquisitr: "Americans Agree With Donald Trump 58 Percent Want Flights Banned From Ebola Outbreak Countries" __HTTP__ _E_
__HTTP__ _E_
Obama administration is killing American industrial renaissance by stopping drilling and fracking. Terrible for economy. _E_
Republicans must get out today and VOTE in Georgia 6. Force runoff and easy win! Dem Ossoff will raise your taxes very bad on crime & 2nd A. _E_
It's too bad so few people showed up to @bobvanderplaats Family Leader dinner. Next year I'll try & be there and they'll have a huge crowd! _E_
Welcome to the @BarackObama recovery the labor force participation rate is at a NEW 30 year low of 64.3% __HTTP__ _E_
China is now attacking Japan's economy for leverage __HTTP__ Soon they will try the same with us. #TimeToGetTough _E_
Due to popular demand CNN will re broadcast the Larry King Live show I hosted in June in which I interview Larry. Monday July 5 9 pm CNN _E_
The late great William F. Buckley would be ashamed of what had happened to his prize the dying National Review! _E_
Be tough be focused. There are a lot of ups and downs but you can ride them out if you're prepared for them. _E_
Let's see whether or not Chuck Townsend @CondeNastCorp is smart enough to fire Graydon Carter who only cares about his bad food restaurants _E_
Looks like many anti police agitators in Boston. Police are looking tough and smart! Thank you. _E_
John Kasich despite being Governor of Ohio is losing to me in the Ohio polls. Pathetic! _E_
A former Secret Service Agent for President Clinton excoriates Crooked Hillary describing her as ERRATIC & VIOLENT. Bad temperament for pres _E_
Not his 'per se'? A Friday document dump shows @BarackObama all hands on deck as Solyndra collapsed __HTTP__ @BarackObama lies. _E_
Everybody's talking about my doing twitter during the likely very boring debate tonight. @realDonaldTrump #DemDebate _E_
It's Tuesday how much has China stolen from us today through cyber espionage? _E_
Remember to take time this weekend to relax and regroup. It will pay major dividends for the next week. _E_
For great success you need passion but make sure it's well directed. Learn everything you can about what you're doing. Be an expert. _E_
We have a sacred duty to care for our vets and their families. Our Vets are owed full access to healthcare anytime & anywhere! _E_
It was great to have @ApprenticeNBC veterans George Ross and @BretMichaels back in the boardroom. __HTTP__ #CelebApprentice _E_
Today we lost a great pioneer of air and space in John Glenn. He was a hero and inspired generations of future explorers. He will be missed. _E_
Via @FortuneMagazine by @mcasey1: "Donald Trump plans to build a Trump Tower in Mumbai" __HTTP__ _E_
An impromptu interview I did with German TV on 9/11 down by Ground Zero discussing the attack and WTC Towers __HTTP__ _E_
Great poll thank you America! Once we #DrainTheSwamp together we will #MAGA#Debate __HTTP__ _E_
Whose artwork was your favorite— and what team do you think will win? #CelebApprentice _E_
CHIP should be part of a long term solution not a 30 Day or short term extension! _E_
RT @TeamTrump: She calls our people deplorable and irredeemable. I will be a president for ALL of our people. @RealDonaldTrump #BigLeag... _E_
"Our runaway judiciary is badly in need of restraint by Congress." Phyllis Schlafly _E_
The Dunes of @TrumpScotland are a world treasure threading thru @GolfWorld1's Scotland top Par 72 7400 yd course __HTTP__ _E_
Yes this is a large scale version of when I built and saved the ice skating rink in Central Park (which all should go to). Great course! _E_
... That's why so many huge deals are closed on a golf course." – TRUMP 101 _E_
New @RNC report calls for embracing "comprehensive immigration reform." __HTTP__ Does the @RNC have a death wish? _E_
.@Omarosa's meltdown—was it for real? @DennisRodman thinks she could be an Oscar winner for that performance... #CelebApprentice _E_
THANK YOU Connecticut Delaware Maryland Pennsylvania and Rhode Island! #MakeAmericaGreatAgain __HTTP__ _E_
...people not interviewed including Clinton herself. Comey stated under oath that he didn't do this obviously a fix? Where is Justice Dept? _E_
Joan Rivers had great talent but also truly amazing stamina and drive she would never give up or quit. That is why she became a champion! _E_
The sex scandal at the CIA and Pentagon is rapidly unfolding getting more interesting by the minute! _E_
#AskTrump @TwitterNYC __HTTP__ _E_
A 'confidential source' has called my office and told me that @BarackObama has added over $6T to the new national debt & ruined US credit. _E_
Anna Wintour came to my office at Trump Tower to ask me to meet with the editors of Conde Nast & Steven Newhouse a friend. Will go this AM. _E_
My @foxandfriends interview where I discuss @Rosie being canceled yet again and how she just can't make it on TV __HTTP__ _E_
Wow so nice! Thank you Wayne Allyn Root. __HTTP__ _E_
FRACK NOW & FRACK FAST!!! American prosperity depends on it. Our economic renaissance is here. _E_
.@BarackObama is bankrupting this country. His budget adds another $4.4T to the debt putting us over $20T in total debt by 2016. _E_
Too bad about New York Magazine but there's a much bigger one out there currently doing a story on me to get even that I'll soon discuss! _E_
Get ready for some excitement the live finale of the Celebrity Apprentice is on this Sunday night don't miss it! __HTTP__ _E_
"On 1/20 the day Trump was inaugurated an estimated 35000 ISIS fighters held approx 17500 square miles of territory in both Iraq and Syria. As of 12/21 the U.S. military est the remaining 1000 or so fighters occupy roughly 1900 square miles.." @jamiejmcintyre @dcexaminer __HTTP__ _E_
Limited opportunity to get your OFFICIAL Trump gear! Shop now! __HTTP__ _E_
Obama now wants to give another $450M to the Muslim Brotherhood. Money we don't have going to people that hate us. Moronic. _E_
Beautiful morning thank you @ICLV! __HTTP__ _E_
"Remember the golden rule of negotiating: 'He who has the gold makes the rules.'" – Midas Touch _E_
7 million Americans are going to lose their jobs due to ObamaCare. 46 million face 300% premium increases. DEFUND! #MakeDCListen _E_
Heading to Phoneix. Will be arriving soon. Tomorrow a big day. Tremendous crowds expected! #Trump2016 #MakeAmericaGreatAgain _E_
Entrepreneurs: See yourself as victorious: Look at the solution not the problem. _E_
Even a mistake may turn out to be the one thing necessary to a worthwhile achievement. Henry Ford _E_
.@nbcsnl So much fun last night! _E_
Now that the Mexican drug lord escaped from prison everyone is saying that most of the cocaine etc. coming into the U.S. comes over border! _E_
This cannot be the the Academy Awards #Oscars AWFUL!!!!!!!!!!!!!!! _E_
Wow just released that $67 million in negative ads was spent on me. How am I still number one by a lot? _E_
.@meetthepress and @chucktodd very dishonest in not showing the new @CNN Poll where I am at 39% 21points higher than Cruz. Be honest Chuck! _E_
Joe Biden said that the Taliban 'is not our enemy.' I wonder how our troops in Afghanistant that are under attack view Biden's statement. _E_
I want to win for the people of this great country. The only people I will owe are the voters. #Trump2016 Video: __HTTP__ _E_
Remember politicians are all talk and NO action. Our country is a laughing stock that is going to hell. The lobbyists & donors control all! _E_
Almost daily more discrepancies in @BarackObama's biography continue to arise. Who is this guy? _E_
Via @FSMtweet: "Trump is Right: Illegal Alien Crime is Staggering in Scope and Savagery" __HTTP__ _E_
What do African Americans and Hispanics have to lose by going with me. Look at the poverty crime and educational statistics. I will fix it! _E_
My @TeamCavuto int. on simplifying the tax code our incompetent leaders Iran and making America great again __HTTP__ _E_
.@NBA hall of famer @dennisrodman brings his A game in the 13th season of All Star @CelebApprentice. This time Dennis is a star! _E_
The @SuperCommittee must cut spending not raise taxes. Washington has a spending problem not a revenue problem. _E_
I am a defender of @MileyCyrus who I think is a good person (and not because she stays at my hotels) but last night's outfit must go! _E_
For all of my fantastic supporters and for the U.S.A. we are going to win and MAKE AMERICA GREAT AGAIN maybe greater than ever before! _E_
My @FoxNews interview last night on @gretawire On 2012: I'll Wait and See __HTTP__ _E_
Joe Girardi @Yankees must play his starters even A Rod they got you there. _E_
#BuyAmericanHireAmerican __HTTP__ _E_
"The problem is that we have a president who is more concerned with pursuing some sort of bizarre ideological (cont) __HTTP__ _E_
Thank you St. Louis Missouri!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
Hillary Clinton Deleted Emails Using Program Intended To Prevent Recovery #CrookedHillary __HTTP__ _E_
Will be in Cleveland Ohio w/ @mike_pence tonight join us: __HTTP__ Florida tomorrow @ 6pm: __HTTP__ _E_
Part 1 of my @SpecialReport int. with @BretBaier discussing why I am strongly considering running for President __HTTP__ _E_
The Iraqi army has squandered the majority of the weapons & training we gave them for 10 long years. When will we learn? _E_
#TBT WrestleMania 23 __HTTP__ _E_
'Why Trump' __HTTP__ _E_
Thank you Maria B! __HTTP__ _E_
While I own properties across the world I am very excited about my new acquisition of @Doral in Miami. (cont) __HTTP__ _E_
For those few people knocking me for tweeting at three o'clock in the morning at least you know I will be there awake to answer the call! _E_
I'm in Scotland to open what we hope to be the greatest golf course in the world it's amazing. _E_
When in doubt Obama fundraises. He has held 393 fundraisers in six years. Another record. _E_
It's hardly any wonder that our country's manufacturing dominance has evaporated. #TimeToGetTough (cont) __HTTP__ _E_
I have met & spent a lot of time with families @ The Remembrance Project. I will fight for them everyday!... __HTTP__ _E_
.@bubbawatson What a great player you have turned out to be but also what a great guy! Congratulations on another fantastic Masters win. _E_
Join me in Ohio & Maine!Cincinnati Ohio tonight @ 7:30pm: __HTTP__ Maine Saturday @ 3pm... __HTTP__ _E_
Personally I think Douglas Durst's brother got screwed by Douglas no wonder he's angry. _E_
"All the things I love is what my business is all about." @MarthaStewart _E_
They succeed because they think they can. Virgil _E_
Fernando thank you for the GREAT review of The Blue Monster in South Florida Golf especially top 10 in the WORLD. I love @SOFLAGOLF! _E_
Trump Tower Punta del Este's cylindrical tower redefines the essence of luxury. On the sands of Playa Brava __HTTP__ _E_
.@Deadspin guys are total losers—they had their story stolen right from under their bad complexions—other media capitalized! _E_
After my meeting with the pastors it's off to Georgia for a big rally many thousands of great people will be there a beautiful movement! _E_
For the 1st time in American history America's 16500 border patrol agents have issue a presidential primary endorsement—me! Thank you. _E_
Thank you! #VoteTrump #ImWithYou __HTTP__ _E_
When @mcuban had his own show The Benefactor it totally "bombed!" _E_
Trump has big plans for improving @DoralResort __HTTP__ via @nbc's @GolfChannel @CadillacChamp _E_
Australia is a beautiful country with terrific people who love America. _E_
We must not allow ISIS to return or enter our country after defeating them in the Middle East and elsewhere. Enough! _E_
Even the liberal CRS is now reporting Obama Care will cause 200% premium increases __HTTP__ Surprised? @Newsmax_Media _E_
RT @DonaldJTrumpJr: A message from Donald J. Trump to NEW YORK! __HTTP__ _E_
Golf Channel & Donald Trump's World of Golf host a Celebrity Match 1/25 @ TNGC LA CA Mark Wahlberg vs. Kevin Dillon __HTTP__ _E_
So biased: @TIME made 'The Protester' as the person of the year. @TIME celebrates OWS but vilified the Tea Party last year. _E_
Happy 226th Birthday to the United States Coast Guard. Thank you @USCG! #CoastGuardDay __HTTP__ _E_
Thank you for joining me in Mandan ND Gov. @DougBurgum Lt. Gov. @BrentSanfordND @SenJohnHoeven @RepKevinCramer & @SenatorHeitkamp. __HTTP__ _E_
Very proud of my Executive Order which will allow greatly expanded access and far lower costs for HealthCare. Millions of people benefit! _E_
The media is on a new phony kick about my management style. I spend much less money & get much better results! What we need as Prez! _E_
Melania and I are hosting Japanese Prime Minister Shinzo Abe and Mrs. Abe at Mar a Lago in Palm Beach Fla. They are a wonderful couple! _E_
Ron Paul is right that we are wasting trillions of dollars in Iraq and Afghanistan. _E_
Crooked Hillary Clinton is guilty as hell but the system is totally rigged and corrupt! Where are the 33000 missing e mails? _E_
The real story that Congress the FBI and all others should be looking into is the leaking of Classified information. Must find leaker now! _E_
RT @TeamTrump: Quite simply @HillaryClinton mistreats women. #BigLeagueTruth #Debate2016 __HTTP__ __HTTP__ _E_
Despite spending $500k a day on TV ads alone #CrookedHillary falls flat in nationwide @QuinnipiacPoll. Having ZERO impact. Sad!! _E_
Celebrating 1237! #Trump2016 __HTTP__ _E_
I feel bad for all @VanityFair employees. Every day at work they see circulation going down as Graydon runs his bad food restaurants. _E_
Are NFL games getting boring or is it just my magnificent imagination? In any event I'm just not watching them much anymore! _E_
Bill O'Reilly calls Trump and campaign brilliant. In first place by 27 points. _E_
Come celebrate Thanksgiving in the Windy City at @TrumpChicago's 5 Star 5 Diamond Sixteen restaurant __HTTP__ _E_
Entrepreneurs keep this in mind: Great spirits have always encountered violent opposition from mediocre minds. Albert Einstein _E_
RT @paulsperry_: __HTTP__ _E_
Resolve never to quit never to give up no matter what the situation. Jack Nicklaus _E_
I heard that @Morning_Joe was very nice on Friday but that little Donny D a big failure in TV (& someone I helped) was nasty. Irrelevant! _E_
China's media is attacking @MittRomney while endorsing @BarackObama __HTTP__ Of course. Mitt knows it's Time To Get Tough. _E_
The stock of my shirt and tie maker just hit an all time high great going great product! _E_
Remember the huge amount of money raised by @JohnRich and company... #sweepstweet _E_
If Republican Senate doesn't get rid of the Filibuster Rule & go to a simple majority which the Dems would do they are just wasting time! _E_
Your questions about my desk answered in today's #trumpvlog... __HTTP__ _E_
"Once you learn to quit it becomes a habit." Vince Lombardi _E_
Join us today! Together we will #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_
Vanity Fair circulation down 20 percent. My third rate stalker should start looking for a new job. _E_
We need to be smart vigilant and tough. We need the courts to give us back our rights. We need the Travel Ban as an extra level of safety! _E_
Entrepreneurs who develop their Midas Touch do not work for money. They work to create or acquire assets. Focus on assets. Midas Touch _E_
So good to see the Saudi Arabia visit with the King and 50 countries already paying off. They said they would take a hard line on funding... _E_
As usual the weather people got it wrong in Tampa. They just look for headlines & ratings! _E_
"Faster And Cheaper Trump Finishes NYC Ice Rink @TrumpRink" __HTTP__ Gov. can be efficient w/leadership & business acumen. _E_
My new book #TimeToGetTough out Dec 5th outlines how to make America rich again. Order now through Amazon __HTTP__ _E_
Just purchased NBC's half of The Miss Universe Organization and settled all lawsuits against them. Now own 100% stay tuned! _E_
I did what was an almost an impossible thing to do for a Republican easily won the Electoral College! Now Tax Returns are brought up again? _E_
Crooked Hillary just can't close the deal with Bernie. It will be the same way with ISIS and China on trade and Mexico at the border. Bad! _E_
Trump Int'l Golf Links & Hotel Ireland fronts the Atlantic Ocean in County Clare for 2.5 miles. Extraordinary! __HTTP__ _E_
When we have big disasters no one comes to our aid or even suggests helping but we are always expected to come to the aid of others! _E_
By self funding my campaign I am not controlled by my donors special interests or lobbyists. I am only working for the people of the U.S.! _E_
I heard poorly rated @Morning_Joe speaks badly of me (don't watch anymore). Then how come low I.Q. Crazy Mika along with Psycho Joe came.. _E_
Had a fantastic time at yesterday's All Star @ApprenticeNBC press conference with @StephenBaldwin7 in @TrumpTowerNY. _E_
Join me tomorrow in Michigan!Grand Rapids at 12pm: __HTTP__ at 3pm: __HTTP__ __HTTP__ _E_
Why is the UN planning to attack @Israel's sovereignty and ignore Iran's nuclear program? The US should look at future funding. _E_
RT @TODAY_Clicker: Get ready @ApprenticeNBC fans! @realDonaldTrump promises plenty of mean and nasty action.. __HTTP__ _E_
Great job First Lady Melania! __HTTP__ _E_
Inspiration exists but it must find you working. Pablo Picasso _E_
We have got to get our Marine out of that disgusting Mexican jail. Would be so easy if we had a real leader. One tough phone call & he's out _E_
Fox and Friends _E_
Story written by a @HuffingtonPost reporter that the HuffPost refused to print. Total bias but we will prevail! __HTTP__ _E_
HAPPY 70th BIRTHDAY to the @USAirForce! The American people are eternally grateful. Thank you for keeping America PROUD STRONG and FREE! __HTTP__ _E_
President Obama is the greatest hoax ever perpetrated on the American people Clint Eastwood _E_
#TrumpVlog Why are we the sad suckers? __HTTP__ _E_
.@ericbolling Great job on The Five tonight and not only because you were so nice to The Apprentice. See you soon and thanks! _E_
#CelebrityApprentice Paul Teutul Sr. joined me for a press event in Trump Tower last week __HTTP__ _E_
Yes I will give my @SuperBowl pick tomorrow. Watch @_KatherineWebb cover it on @InsideEdition. _E_
The failing @WSJ Wall Street Journal should fire both its pollster and its Editorial Board. Seldom has a paper been so wrong.Totally biased! _E_
Via @GolfMonthly by @jake0reilly: "Trump to build five new holes at @TurnberryBuzz" __HTTP__ _E_
I love being in South Carolina. We are leading big in all of the State polls Saturday is a BIG day. MAKE AMERICA GREAT AGAIN! _E_
LETS GO AMERICA! Time to take backour country and #MakeAmericaGreatAgainWatch video & go#VoteTrump!  __HTTP__ _E_
Don't believe the lies every budget @BarackObama has delivered to Congress raises the income tax on EVERYONE __HTTP__ _E_
Congratulations to the Philadelphia Eagles on a great Super Bowl victory! _E_
The Amazon Washington Post fabricated the facts on my ending massive dangerous and wasteful payments to Syrian rebels fighting Assad..... _E_
Congratulations to @gohermie for winning the @ShellHouOpen. We are all proud of you @TNGCBedminster & all @TrumpGolf clubs! Great going! _E_
Russia took Crimea during the so called Obama years. Who wouldn't know this and why does Obama get a free pass? _E_
The Donald J.Trump Signature Collection exclusively available @Macys offers top styles in menswear. Dress your best __HTTP__ _E_
By popular(extremely) demand I will be live tweeting the #Oscars2014 on Sunday night. Tell all your friends I will not be pulling punches! _E_
LIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_
Jeb is fighting to defend a catastrophic event. I am fighting to make sure it doesn't happen again.Jeb is too soft we need tougher & sharper _E_
Remember Trump ties & shirts @Macys for Fathers Day your father will love you even more! _E_
Filming for @CelebApprentice Season 13 is now into the 2nd week. The 'All Star' cast is already hard at work. _E_
.@YoungDems4Trump Thank you! _E_
While I believe I will clinch before Cleveland and get more than 1237 delegates it is unfair in that there have been so many in the race! _E_
I will be in Evansville Indiana with the great Bobby Knight (who last night endorsed me) at 12:00 this afternoon. See you there! _E_
.@jorgeramosnews Please send me your new number your old one's not working. Sincerely Donald J. Trump _E_
RT @JeffTutorials: @realDonaldTrump __HTTP__ _E_
The good news is that their ratings are terrible nobody cares! __HTTP__ _E_
Government needs to stop pick pocketing your wallet. Every time it does it slows growth and kills jobs. It's (cont) __HTTP__ _E_
RT @foxandfriends: Israeli PM Netanyahu praises U.S. policy changes during meeting with Defense. Sec Mattis __HTTP__ _E_
Thank you for sharing Amy. __HTTP__ _E_
The real story is that President Obama did NOTHING after being informed in August about Russian meddling. With 4 months looking at Russia... _E_
A great article about how ObamaCare has even further complicated the tax code and will hurt housing market __HTTP__ _E_
Via __HTTP__ __HTTP__ _E_
I will be interviewed on @foxandfriends at 8:40. A.M. Enjoy! _E_
Enjoy Celebrity Apprentice tonight at 9 a really great episode! _E_
Margaret Thatcher was the Iron Lady of the West. She promoted freedom & democracy a great leader & ally of America. _E_
Republicans have very strong hand in their fight against Obamacare lets see if they are willing and able to play it tuff ! _E_
The Sarasota Florida rally today was amazing. 12000 people chanting their love for our country. It's going to happen this is a MOVEMENT! _E_
My @TheBrodyFile int. from Iowa on how I would build a wall to secure our Southern Border & deduct costs from Mexico __HTTP__ _E_
Be sure to buy this month's @AmSpec magazine. Read "A Trump Card" my interview with Jeffrey Lord. _E_
Lying #Ted Cruz just (on election day) came out with a sneak and sleazy Robocall. He holds up the Bible but in fact is a true lowlife pol! _E_
Hillary Lies to Benghazi Families#CrookedHillary __HTTP__ _E_
Republicans and @MittRomney must get tough very soon. _E_
"The minute that you're not learning I believe you're dead." – Jack Nicholson _E_
I'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
Why did Pres Obama remove sanctions against Iran prior to negotiating rather than completing successful negotiation & then remove sanctions? _E_
Remember the worst thing you can do in a negotiation is seem desperate to make the deal. _E_
Claims for unemployment are at a 3 month high __HTTP__ Where's the @BarackObama recovery? _E_
However beautiful the strategy you should occasionally look at the results. Winston Churchill _E_
My @todayshow discussing the @CelebApprentice discussing the cast __HTTP__ _E_
"Do not view any failure as the end. Learn your lessons quickly then move on." – Think Big _E_
I will be interviewed tonight on @FoxNews by @SeanHannity at 9pmE. Enjoy! _E_
...and they knew exactly what I said and meant. They just wanted a story. FAKE NEWS! _E_
..(enthusiastic dynamic and fun) and the American Legion V.A. (respectful and strong). Too bad the Dems have no one who can change tones! _E_
Superbowl Sunday is a great American tradition. The Colts and Saints are already champions but may the best team win! _E_
Hard to believe that Bernie Sanders has done such a complete fold. He got NOTHING for all of the time energy and money. The V.P. a joke! _E_
Jobs report is really bad beyond the worst projections.A bad day on Wall Street! _E_
....that has served our country is put on a waiting list and gets no care. _E_
"Remember that some things are worth waiting for. Plans can change sometimes for good reason." – Trump Never Give Up _E_
"Never confuse a single defeat with a final defeat." F. Scott Fitzgerald _E_
"Courage is being scared to death but saddling up anyway." John Wayne _E_
The @WashingtonPost quickly put together a hit job book on me comprised of copies of some of their inaccurate stories. Don't buy boring! _E_
New South Carolina poll from PPP. Thank you! #VoteTrumpSC __HTTP__ _E_
.@CNN is so disgusting in their bias but they are having a hard time promoting Crooked Hillary in light of the new e mail scandals. _E_
Back by popular demand this year's All Star @ApprenticeNBC sees the return of @claudiajordan! Our fans love her. _E_
Thank you Omarosa for your service! I wish you continued success. _E_
Hillary and the Dems were never going to beat the PASSION of my voters. They saw what was happening in the last two weeks before the...... _E_
The just released Public Policy Polling (PPP national result) is the best yet. MAKE AMERICA GREAT AGAIN! _E_
When will the U.S. stop sending $'s to our enemies i.e. Mexico and others. _E_
RT @dcexaminer: Emails show Washington Post New York Times reporters unenthusiastic about covering Clinton Lynch meeting __HTTP__ _E_
Thank you @HauteLivingMag for naming @TrumpDoral the #1 golf course in Miami __HTTP__ _E_
We have to make America great again! _E_
Tomorrow night's episode of The Apprentice delivers excitement at QVC along with appearances by Isaac Mizrahi and Cathie Black. 10 pm on NBC _E_
...Terrible for the economy and a job killer. China is laughing at us! _E_
Not the world only your tiny group of viewers the world doesn't care about you. @lawrence You're too stupid to (cont) __HTTP__ _E_
I will be doing @GMA @GStephanopoulos this morning at around 7:00. Likewise I will be doing @Morning_Joe at around 7:00. Figure it out! _E_
.@rushlimbaugh is right—the Republicans lost because they weren't conservative enough—or tough enough. _E_
It was a great honor to welcome the President of Turkey Recep Tayyip Erdoğan to the @WhiteHouse today! __HTTP__ _E_
The rigged system may have helped Hillary Clinton escape criminal charges but... __HTTP__ __HTTP__ _E_
Let's see what happens in the boardroom... #CelebApprentice _E_
Thank you! #Trump2016 __HTTP__ _E_
I have so much admiration and respect for the 2.4 million men and women of our Armed Forces. #TimeToGetTough _E_
....for the Middle Class. The House and Senate should consider ASAP as the process of final approval moves along. Push Biggest Tax Cuts EVER _E_
My @foxandfriends interview from yesterday discussing how @BarackObama failed to show any leadership on th... (cont) __HTTP__ _E_
I believe Putin will continue to re build the Russian Empire. He has zero respect for Obama or the U.S.! _E_
A country that cannot protect its borders is a country destined to fail. Another broken promise by our leaders in Washington. _E_
If I run and if I win our country will be great again. last line of my @SRQRepublicans speech _E_
Thank you Nebraska!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Why didn't President Obama just go inside when it started raining yesterday common sense! The two Marines looked very uncomfortable & wet. _E_
I was in San Jose CA on Saturday for a sit down interview for the ACN national meeting which was attended by over 20000 people. Huge! _E_
For all of the haters and losers out there sorry I never went Bankrupt but I did build a world class company and employ many people! _E_
The ObamaCare website is unfixable & rumor has it that they will stop checks & balances—a free for all that will cost the country trillions _E_
Obama loves wasting our money. He just made another guarantee of $197M to a solar company __HTTP__ Cronyism! _E_
#CrookedHillary __HTTP__ _E_
Thomas Kinkade died. I happen to love the beauty of his paintings. He took a lot of heat from art critics who (cont) __HTTP__ _E_
Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud! _E_
This week will mark the 1 year anniversary of the attack in Benghazi that left 4 Americans dead. No answers! _E_
Wow what a day. So many foolish people that refuse to acknowledge the tremendous danger and uncertainty of certain people coming into U.S. _E_
Just won a big federal lawsuit similar in certain ways to the Trump U case but the press refuses to write about it. If I lost monster story! _E_
The military and first responders despite no electric roads phones etc. have done an amazing job. Puerto Rico was totally destroyed. _E_
Our vets are the pride of our nation. The VA scandal is a disgrace.If you can get food stamps so fast our vets should get immediate care _E_
Congrats to @TrumpWaikiki for being named @Orbitz Best In Stay Elite Award Winner for Oahu for 2014! _E_
I've been saying for three months that the bridge tolls to Staten Island are far too high and unfair just got lowered but not nearly enough _E_
The system is rigged. General Petraeus got in trouble for far less. Very very unfair! As usual bad judgment. _E_
This is not a media event or about Donald J. Trump this is about the United States of America. I will be... __HTTP__ _E_
I will be going to Texas as soon as that trip can be made without causing disruption. The focus must be life and safety. _E_
Based on the tremendous cost and cost overruns of the Lockheed Martin F 35 I have asked Boeing to price out a comparable F 18 Super Hornet! _E_
Go to work today be smart think positively and WIN! _E_
Pres. Obama is about to embark on a 17 day vacation in his 'native' Hawaii putting Secret Service away from families on Christmas. Aloha! _E_
With Barry Diller & Tina Brown in charge did anyone doubt that @Newsweek would be a massive failure? _E_
Verlander pitched great but @Yankees look truly defeated. _E_
We've just set a new goal: raise $4 million from our grassroots supporters by MIDNIGHT! __HTTP__ __HTTP__ _E_
Why should he? He's only the POTUS and @BarackObama has no opinion on whether the Senate should pass a budget. __HTTP__ _E_
Great deal we swap 5 killer terrorists for a U.S. military deserter. That's how the U.S. negotiates nowadays. _E_
Some of the women on Celebrity Apprentice are absolutely crazy maybe the wildest thing ever on reality television. Watch tonight! _E_
My @FoxNews interview with @gretawire where I discuss my potential GOP endorsement and the NH primary __HTTP__ _E_
"Action is the foundational key to all success." Pablo Picasso _E_
Thank you Portland Maine! #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
I will Make Our Government Honest Again believe me. But first I'm going to have to #DrainTheSwamp in DC. __HTTP__ _E_
Champion @Joan_Rivers loves being on the other side of the table in the Boardroom. She leaves no punches out in @CelebApprentice! _E_
Dwyane Wade's cousin was just shot and killed walking her baby in Chicago. Just what I have been saying. African Americans will VOTE TRUMP! _E_
Just returned from Pennsylvania where we will be bringing back their jobs. Amazing crowd. Will be going back tomorrow to Gettysburg! _E_
Hillary Clinton lied last week when she said ISIS made a D.T. video. The video that ISIS made was about her husband being a degenerate. _E_
The Chinese laugh at how weak and pathetic our government is in combating intellectual property theft. (cont) __HTTP__ _E_
Time flies it's @TrumpTowerNY's 30th anniversary. To celebrate we made this video highlighting its amazing history __HTTP__ _E_
My @fox8news interview discussing the passing of my longtime friend Dick Clark. __HTTP__ A true TV legend who will be missed. _E_
Great bilateral meetings at Élysée Palace w/ President @EmmanuelMacron. The friendship between our two nations and ourselves is unbreakable. __HTTP__ _E_
Tim Kaine has been praising the Trans Pacific Partnership and has been pushing hard to get it approved. Job killer! _E_
I could fix tv talk shows that are doing poorly—there is tremendous talent out there waiting to be tapped—and nobody sees it! _E_
Congress is back.TIME TO CUT CAP AND BALANCE.There is no revenue problem.The Debt Limit cannot be raised until Obama spending is contained. _E_
RT @IngrahamAngle: The #CruzCrew prevailed! Smart for @MarcoRubio to keep his speech short & sweet. Ditto for @realDonaldTrump who was brie... _E_
Heading to Camp David for major meeting on National Security the Border and the Military (which we are rapidly building to strongest ever). _E_
Obamacare is a disaster. We must REPEAL & REPLACE. Tired of the lies and want to #DrainTheSwamp? Get out & VOTE... __HTTP__ _E_
RT @DRUDGE_REPORT: FORMER HOSTAGE SAYS PLANE WAITED UNTIL MONEY ARRIVED... __HTTP__ _E_
The reason Flake and Corker dropped out of the Senate race is very simple they had zero chance of being elected. Now act so hurt & wounded! _E_
The ratings of The Cycle on MSNBC a sad and pathetic show are way down. If they fired racist moron @Toure a truly stupid guy they live! _E_
Jeanne Shaheen wants amnesty for illegals placed the deciding vote for ObamaCare & opposes the 2nd Amendment. Vote her out in November! _E_
Are the Republicans going to blow their chance to take the Senate? Must focus on ObamaCare and amnesty. _E_
My @SquawkCNBC interview discussing the 57th St. crane damage from the storm and extending my $5M offer to Obama __HTTP__ _E_
Hillary's been failing for 30 years in not getting the job done it will never change. _E_
If you are steadfast in your efforts critics will be harmless. Achievers move forward and achievement is not a plateau it's a beginning. _E_
"Romney's $2 Billion Sacrifice for America" By Chris Ruddy @Newsmax_Media __HTTP__ _E_
RT @TeamTrump: 100% TRUE > @realDonaldTrump is right @HillaryClinton did call TPP 'the gold standard' #Debates2016 __HTTP__ _E_
...dwindling subscribers and readers.They got me wrong right from the beginning and still have not changed course and never will. DISHONEST _E_
Government Funding Bill past last night in the House of Representatives. Now Democrats are needed if it is to pass in the Senate but they want illegal immigration and weak borders. Shutdown coming? We need more Republican victories in 2018! _E_
Joe Paterno's family should sue the idiots @PennState that made that ridiculous deal and commissioned the one sided report. _E_
Website Exposing Marco 'Amnesty' Rubio Goes Live: A 'Donor Class Puppet'? Breitbart __HTTP__ _E_
For an advance preview of the Miss USA 2013 contestants as well as other show details go to __HTTP__ _E_
#BARACKTAX QUOTE: If you have health insurance you're not getting hit with a tax. _E_
See the Ashley Judd ad by @karlrove and you will definitely vote for her and love Obama. _E_
NoKo has interpreted America's past restraint as weakness. This would be a fatal miscalculation. Do not underestimate us. AND DO NOT TRY US. __HTTP__ _E_
WIshing everyone a happy healthy and prosperous New Year! _E_
Wow I have just exceeded 2 million followers and in such a short time! _E_
Unless the Republican Senators are total quitters Repeal & Replace is not dead! Demand another vote before voting on any other bill! _E_
The Comedy Central Roast of Donald Trump last week was the #1 highest rated Comedy Central Roast ever...it brought in 3.5 milion viewers _E_
You have to love what you do or you are never going to be successful no matter what you do in life. Think Big _E_
Our airports are Third World horrible. Let's rebuild them by people who know how to do it inexpensively. _E_
When you are in a war or even a battle losing is not an option! _E_
Thank you Jeffrey Lord for the great article discrediting third rate @BuzzFeed site & slimebag reporter McKay Coppins.@PiersMorgan @AmSpec _E_
President Obama put himself in a very bad position when he talked about Syria crossing the RED LINE. Amazingly now he denies he said that! _E_
Thank you for the nice words this morning @KellyRiddell. Well delivered and totally logical! @CNN @FoxNews _E_
RT @Jenniffer2012: Thank you @realDonaldTrump for all the help you are providing for Puerto Rico. We're are grateful and happy to welcome y... _E_
"45 year low in illegal immigration this year." @foxandfriends _E_
What do you think about the push to put women into high intensity combat situations? _E_
The Washington Times Presidential Debate Poll:TRUMP 77% (18290)CLINTON 17% (4100)#DrainTheSwamp #Debate __HTTP__ _E_
The media tries so hard to make my move to the White House as it pertains to my business so complex when actually it isn't! _E_
I love reading about all of the geniuses who were so instrumental in my election success. Problem is most don't exist. #Fake News! MAGA _E_
I hope people are looking at the disgraceful behavior of Hillary Clinton as exposed by WikiLeaks. She is unfit to run. _E_
Terrible. Wind farms are provided permits by the US government which causes the programmatic killing of bald eagles. _E_
"Pride yourself on your ability to find creative solutions to tough problems. Think Big _E_
The Electoral College is actually genius in that it brings all states including the smaller ones into play. Campaigning is much different! _E_
Weekly Address #KatesLaw#NoSanctuaryForCriminalsActStatement: __HTTP__ __HTTP__ _E_
Is everyone enjoying ObamaCare's 21 new 2014 taxes? __HTTP__ It's Obama's special gift added on to your rising premium. _E_
"Do your duty and a little more and the future will take care of itself." Andrew Carnegie _E_
Head of Air Force's anti sexual assault unit arrested for sexual assault! It just seems that our Country is not what it used to be. _E_
The ultimate vacation destination @TrumpPanama's sleek design evokes a majestic sail fully deployed in the wind __HTTP__ _E_
At the end of the day Obama won the battleground states by less than 500000 votes. This was a winnable race. GOP needs to do better! _E_
By popular demand I will be tweeting on the very tainted Academy Awards tonight! _E_
Happy New Year to all of my Jewish friends and supporters. Shana Tova. Hopefully it will be a great year! _E_
Glad to see that the Egyptian Army is releasing Mubarek. As we see Obama never should have abandoned him. He was an ally. _E_
Check out the Trump Fabulous World of Golf site to meet the Fazio family master golf course designers.... __HTTP__ _E_
The Euro put in place to hurt the U.S. is done! will have less negative impact than most think. _E_
Great rally last night in Massachusetts. 2000 people at a house must be a record! Unbelievable spirit to MAKE AMERICA GREAT AGAIN. _E_
Some dope tweeted my message to my friend Bill Belichick incorrectly they called him Bob. Sorry Bill! @Patriots _E_
Newsmax article: 'Trump Declines Prime Time GOP Convention Speech' __HTTP__ _E_
Fact – every successful GOP Senate candidate just elected ran on repealing ObamaCare. In January it's time to move! _E_
We all know that chess is a game of strategy. So is business. Think Like a Champion _E_
I could fix existing Tappan Zee Bridge for peanuts. Unfortunately Gov Cuomo will end up spending more than $10B on this project. $25 tolls? _E_
I agree getting Tax Cuts approved is important (we will also get HealthCare) but perhaps no Administration has done more in its first..... _E_
Via @DMRegister by @AP: "Donald Trump talks economy with Republicans in Davenport" __HTTP__ _E_
Celebrate Martin Luther King Day and all of the many wonderful things that he stood for. Honor him for being the great man that he was! _E_
'Economists say Trump delivered hope' __HTTP__ _E_
Will be doing @OutFrontCNN with @ErinBurnett tonight at 7 pm re: tax reductions and various other topics. _E_
"The thing about high corporate tax rates is that in the end companies aren't the ones who foot the bill consumers do." #TimeToGetTough _E_
Your tax dollars well spent. Over 1.295M ObamaCare enrollees will also be illegal immigrants __HTTP__ Are you surprised? _E_
.@KarlRove Had my best day ever in the polls one had me at 41% Morning Consult. Boston Globe Monmouth NBC and CNN all great. More! _E_
I had a great time in Iowa yesterday record crowds fantastic people! _E_
Weakness is very dangerous: @BarackObama is going to unilaterally disarm our nuclear arsenal. America keeps the world safe! _E_
I'm not a hunter and don't approve of killing animals. I strongly disagree with my sons who are hunters but (cont) __HTTP__ _E_
I have made my decision on who I will nominate for The United States Supreme Court. It will be announced live on Tuesday at 8:00 P.M. (W.H.) _E_
After allowing North Korea to research and build Nukes while Secretary of State (Bill C also) Crooked Hillary now criticizes. _E_
China is cooking up conspiracy theories that the Olympics are rigged. __HTTP__ They don't understand why they can't cheat. _E_
I am impressed with the scam @BarackObama pulled but the truth will come out. _E_
.@piersmorgan Russell has nothing going for himself except for energy & aggression. Without that he would be dead—a first class dummy! _E_
Crooked Hillary can't close the deal with Bernie Sanders. Will be another bad day for her! _E_
.@JohnKerry claims he has never stopped working" f/Pastor Abedini's release through "back channels. Where are the results? _E_
Vanity Fair party at Tribeca Film Festival was a bust. _E_
Adam Moss editor in chief of @NYMag is quickly losing his reputation in that @NYMag has become so boring and so irrelevant. _E_
Lying Ted Cruz and lightweight choker Marco Rubio teamed up last night in a last ditch effort to stop our great movement. They failed! _E_
Thank you @hardball_chris for your nice words. They are very much appreciated. I fully understand that you really get it. _E_
Rep. Lou Barletta a Great Republican from Pennsylvania who was one of my very earliest supporters will make a FANTASTIC Senator. He is strong & smart loves Pennsylvania & loves our Country! Voted for Tax Cuts unlike Bob Casey who listened to Tax Hikers Pelosi and Schumer! _E_
Do you think crooked @AGSchneiderman will ever challenge the NFL tax status? No—too many friends and contributors in @nfl? _E_
How can Crooked Hillary say she cares about women when she is silent on radical Islam which horribly oppresses women? _E_
ICYMI my @foxandfriends int. criticizing the GOP on ObamaCare the new Congress & 2016 __HTTP__ _E_
Make sure to verify the voting machine does not switch your vote. If you have any problems notify the poll workers. _E_
The fact that President Putin and I discussed a Cyber Security unit doesn't mean I think it can happen. It can't but a ceasefire can& did! _E_
Get rid of gun free zones. The four great marines who were just shot never had a chance. They were highly trained but helpless without guns. _E_
Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy and create many beautiful JOBS! _E_
RT @foxandfriends: .@Suffolk_Sheriff praises President Trump for making gang eradication a priority __HTTP__ _E_
The only deal the Republicans should accept is a complete repeal of ObamaCare. You have them on the run don't fold go for it! _E_
The Ryder Cup will be amazing this week. _E_
So many people are asking why isn't the A.G. or Special Council looking at the many Hillary Clinton or Comey crimes. 33000 e mails deleted? _E_
The failing @nytimes which never spoke to me keeps saying that I am saying to advisers that I will change. False I am who I am never said _E_
BREAKING NEWS: Obama has just made a trade with Russia. They get Florida California & our gold supply. We get borscht & a bottle of vodka. _E_
Today I was honored and proud to address the 45th Annual @March_for_Life! You are living witnesses of this year's March for Life theme: #LoveSavesLives. __HTTP__ _E_
Much bigger win than anticipated in Arizona. Thank you I will never forget! _E_
Entrepreneurs: Be passionate. You have to love what you're doing to be successful at it. _E_
We don't always think of our presidents as jobs and business negotiators but they are. Presidents are our (cont) __HTTP__ _E_
#CongressionalBaseballGame __HTTP__ _E_
We MUST have strong borders and stop illegal immigration. Without that we do not have a country. Also Mexico is killing U.S. on trade. WIN! _E_
I will be on @wolfblitzer for a @CNNSitRoom interview today. Please join us 5PM ET. _E_
A Warren Buffett corp. is currently ensnared in a bankruptcy. Likewise Icahn Kravis Apollo and many others have played the game.Thanks! _E_
Shameful. After trading 5 senior Taliban for a deserter the White House is now attacking Bergdahl's platoon __HTTP__ _E_
SUN newspaper/Scotland reports that Tourism jump is thanks to Trump. 8000 visitors in one month from 20 countries __HTTP__ _E_
A great crowd at Trump Tower for #TimeToGetTough book signing! _E_
On behalf of the entire family we would truly be honored to have your vote! Let's #MakeAmericaGreatAgain #EarlyVote __HTTP__ _E_
Very successful fund raising for @MittRomney yesterday. Good to see my friend Woody Johnson. _E_
RT @FoxNews: Jobs created in February. __HTTP__ _E_
From 2% to 27% in Texas quite a jump into first place! _E_
New orders for manufacturing down 9/10 months __HTTP__ Time for fair trade. Stop TPP! _E_
Great going Andy Roddick! Another victory for a fabulous player. Brooklyn Decker is good luck. _E_
.@McIlroyRory Great job Rory you have the heart and talent of a great champion. Work hard and win many more! See you at Turnberry. _E_
Just like I have warned from the beginning Crooked Hillary Clinton will betray you on the TPP. __HTTP__ _E_
Will be interviewed on @GMA at 7:00 A.M. Big wins last night! _E_
"A very good way to pave your own way to success is simply to work hard and to be diligent" – Think Like a Champion _E_
Alabama is sooo lucky to have a candidate like Big Luther Strange. Smart tough on crime borders & trade loves Vets & Military. Tuesday! _E_
I don't think Obama will do well in the second debate he is psyched out just like A Rod. _E_
Why are we continuing to train these Afghanis who then shoot our soldiers in the back? Afghanistan is a complete waste. Time to come home! _E_
"Always get even. When you are in business you need to get even with people who screw you." – Think Big _E_
The dishonest NY Daily News reporter advised my rep in writing story is dead and then put it out anyway. A total lie and she knew it! _E_
Prosperity is coming back to our shores because we are putting America WORKERS and FAMILIES first. #AmericaFirst __HTTP__ _E_
I fulfilled my campaign promise others didn't! __HTTP__ _E_
"WATCH: @MissUniverse contestants golf with Donald Trump @TrumpDoral" __HTTP__ via @KylePorterCBS by @CBSSports _E_
Memorial service today for beautiful and incredible Heather Heyer a truly special young woman. She will be long remembered by all! _E_
It came out that Huma Abedin knows all about Hillary's private illegal emails. Huma's PR husband Anthony Weiner will tell the world. _E_
You are witnessing the single greatest WITCH HUNT in American political history led by some very bad and conflicted people! #MAGA _E_
Entrepreneurs: Knowledge requires patience action requires courage. Put patience and courage together and you'll be a winner . _E_
Don't underestimate yourself or your possibilities. There are always opportunities. _E_
Featuring top spa in New York AAA Five Diamond Award @TrumpSoHo is Soho's most elite hotel & destination spot __HTTP__ _E_
MONDAY 11/7/2016Scranton Pennsylvania at 5:30pm. __HTTP__ Rapids Michigan at 11pm.... __HTTP__ _E_
I promised that my policies would allow companies like Apple to bring massive amounts of money back to the United States. Great to see Apple follow through as a result of TAX CUTS. Huge win for American workers and the USA! __HTTP__ _E_
According to the @nytimes a Russian sold phony secrets on "Trump" to the U.S. Asking price was $10 million brought down to $1 million to be paid over time. I hope people are now seeing & understanding what is going on here. It is all now starting to come out DRAIN THE SWAMP! _E_
Congratulations to Tom Scocca and Timothy Burke of @Deadspin for exposing the Manti Te'o fiasco. _E_
As I made very clear today our country needs the security of the Wall on the Southern Border which must be part of any DACA approval. _E_
.@GeraldoRivera Thanks my champion Geraldo and very true. _E_
In order to stay competitive in your industry it is imperative to keep up to date on all news. A great commodity is information. _E_
.@THEGaryBusey survives another week of All Star Celebrity @ApprenticeNBC. Gary is shifty and playing to win. _E_
Also tune in to the @TodayShow at 7:00am. I will be on to discuss the campaign my new ads and #CrippledAmerica. _E_
Somebody hacked the DNC but why did they not have hacking defense like the RNC has and why have they not responded to the terrible...... _E_
A new terror warning was issued for European cties. At what point do we say we have had enough and get really tough and smart. Weak leaders! _E_
Washington is simply incapable of any moderation because @BarackObama is such an extreme leftist. He must be defeated. #TImeToGetTough. _E_
RT @mike_pence: Congrats to my running mate @realDonaldTrump on a big debate win! Proud to stand with you as we #MAGA. _E_
WikiLeaks proves even the Clinton campaign knew Crooked mishandled classified info but no one gets charged? RIGGED! __HTTP__ _E_
I hope everyone read the brilliant article in American Spectator about leightweight A.G. Eric Schneiderman. He should be run out of office! _E_
ObamaCare contains marriage penalty taxes. Why should married couples be penalized for having healthcare? _E_
Will be on @ABC News tonight at 6:30. Interviewed by the legendary @BarbaraJWalters! Enjoy _E_
How did NBC get an exclusive look into the top secret report he (Obama) was presented? Who gave them this report and why? Politics! _E_
#TrumpAdvice __HTTP__ _E_
An investment in life luxury & leisure a Trump Nat'l Bedminster membership offers top amenities & services __HTTP__ _E_
I don't know why our allies are so surprised Obama is tapping their phones? Nothing changes! _E_
Today Barack Obama is standing in water in NJ. Remember on election day that he has put the US underwater. _E_
Via @LasVegasSun by Eugene Dunn: "2016 is the year of Donald Trump" __HTTP__ _E_
Our country is totally split right now but someday it will come together! _E_
My wife the beautiful @MELANIATRUMP will be appearing... #CelebApprentice _E_
Wow just 1 day after my offer to fund all WH tours Obama backtracks on decision to cancel all White House tours" ... _E_
Perhaps @BarackObama's biggest shortcoming as President is he failed to unite the country. _E_
$4 gasoline – wow—OPEC is very happy! _E_
.@JonahNRO watched on @seanhannity and appreciate your statements I have been waiting for them for a long time. Thank you. _E_
We win in our lives by having a champion's view of each moment. Donald J. Trump __HTTP__ _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
US News named the Top10 best hotels in the US and Trump Int'l Hotel & Tower NYC and Trump Int'l Hotel & Tower Chicago are on the list! _E_
Word is early voting in FL is very dishonest. Little Marco his State Chairman & their minions are working overtime trying to rig the vote. _E_
The White House is predicting 9% unemployment throughout 2012 – and when Obama Care takes effect in 2014 expect it to go even higher. _E_
As Governor of Texas Rick Perry could have done far more to secure the border but that's O.K. I like him anyway! @GovernorPerry _E_
It's a plain fact: free trade requires having fair rules that apply to everyone. #TimeToGetTough _E_
Thank you Virginia! #ImWithYou __HTTP__ _E_
Great to see how hard Republicans are fighting for our Military and Safety at the Border. The Dems just want illegal immigrants to pour into our nation unchecked. If stalemate continues Republicans should go to 51% (Nuclear Option) and vote on real long term budget no C.R.'s! _E_
'President elect Donald J. Trump's CIA Director Garners Praise' __HTTP__ __HTTP__ _E_
Wow I just found out that in a major poll of its readers the @NewYorkObserver voted me #1 on the power 100 list in NY...... _E_
While China screws us with every turn of its currency is the biggest commercial espionage threat we face (cont) __HTTP__ _E_
Golf is a game of respect and sportsmanship we have to respect its traditions and its rules. Jack Nicklaus _E_
.@GOP must stay focused on defunding ObamaCare and the impending budget battle. Don't let Syria rule the agenda. _E_
Weekly Address 11:00 A.M. at the @WhiteHouse! #MAGA __HTTP__ __HTTP__ _E_
When I said that Hillary Clinton got schlonged by Obama it meant got beaten badly. The media knows this. Often used word in politics! _E_
Additionally @CelebApprentice ranked as the #1 program in the 9 11 pm time period with adults in the 25 54 age group. _E_
Via @Mediaite: Donald Trump Trashes 'Tacky' 'Boring' Oscars Blasts 'Racist' Django Unchained __HTTP__ _E_
Friday is the last day to enter the Counting Sheep for Hire contest. Click here www.youtube.com/user/mattressserta and you could win a trip _E_
Why do the Republicans always negotiate against themselves in public? Watching them operate these fiscal negotiations is painful. _E_
Lyin' Ted Cruz even voted against Superstorm Sandy aid and September 11th help. So many New Yorkers devastated. Cruz hates New York! _E_
With Joan Rivers and ivankatrump from last night's great boardroom! __HTTP__ _E_
Just signed contract to purchase the Ritz Carlton in Jupiter Florida great land great location great future! _E_
North Korea has shown great disrespect for their neighbor China by shooting off yet another ballistic missile...but China is trying hard! _E_
Entrepreneurs: If you cannot handle the tough times you will never be successful in business. Stay positive & stay strong! _E_
Great trip to Mexico today wonderful leadership and high quality people! Look forward to our next meeting. _E_
Find out where to #VoteTrump on caucus night in Iowa on 2/1/16!#IACaucus #FITN #Trump2016 __HTTP__ __HTTP__ _E_
Even though I refused to pay a ridiculous price for the Buffalo Bills I would have produced a winner. Now that won't happen. _E_
Thank you for your service! __HTTP__ _E_
It is a disgrace that my full Cabinet is still not in place the longest such delay in the history of our country. Obstruction by Democrats! _E_
Entrepreneurs: Watching you could be the motivation for your employees.Make it an example that will best serve the success of your business. _E_
Totally biased @NBCNews went out of its way to say that the big announcement from Ford G.M. Lockheed & others that jobs are coming back... _E_
Must read opinion piece by @Gallup CEO Jim Clifton: "The Big Lie: 5.6% Unemployment" __HTTP__ Just as I have long been saying... _E_
By failing to prepare you are preparing to fail. Benjamin Franklin _E_
.@NBC just announced that all 1 hour @CelebApprentice episodes are being expanded to 2 hours—it's amazing what good ratings will do! _E_
Entrepreneurs: Get a momentum going. Listen apply then move forward. Do not procrastinate. See opportunity as the perk that it is. _E_
I just got back from Russia learned lots & lots. Moscow is a very interesting and amazing place! U.S. MUST BE VERY SMART AND VERY STRATEGIC. _E_
I hope when the MSM runs its "interruption counters" they consider the # of times the moderators interrupted me com... __HTTP__ _E_
Markets are crashing all caused by poor planning and allowing China and Asia to dictate the agenda. This could get very messy! Vote Trump. _E_
When we're talking about math that doesn't add up how about $5 trillion of deficits over the last four years. @MittRomney _E_
No matter how diligent you are in evaluating a business deal there is invariably one factor you have no control over luck... _E_
...time for Republicans & Democrats to get together and come up with a healthcare plan that really works much less expensive & FAR BETTER! _E_
Thanks @LilJon for coming to my defense in Rolling Stone Magazine. As I have often said you are a terrific guy! _E_
Moody's is out to make publicity. The bank downgrades from yesterday don't make up for @Moody's giving AAA (cont) __HTTP__ _E_
Our incompetent Secretary of State Hillary Clinton was the one who started talks to give 400 million dollars in cash to Iran. Scandal! _E_
Congrats to @Yankees on finishing 1st in the AL East. Derek Jeter is great good luck in the playoffs! _E_
Thank you! __HTTP__ _E_
Lightweight @DannyZuker is too stupid to see that China (and others) is destroying the U.S. economically and our leaders are helpless! SAD. _E_
Like it or not Edward Snowden is a SPY and should be tried as a SPY! He has stolen invaluable information and damaged us with other nations _E_
I am in Iowa watching all of these phony T.V. ads by the other candidates. All bull politicians are all talk and no action it won't happen! _E_
Rubio puts out ad that my pilot was a drug dealer not true not my pilot! Guy owned helicopter company don't think I ever even used. _E_
This is happening all over our country—great people being disenfranchised bypoliticians. Repub party is in trouble! __HTTP__ _E_
Today the U.S. flag flies at half staff at the @WhiteHouse in honor of National Pearl Harbor Remembrance Day. __HTTP__ __HTTP__ _E_
The final two @ArsenioOFFICIAL and @ClayAiken visited yesterday __HTTP__ _E_
.@LisaRinna looks better with her reduced lips. Good move Lisa. #CelebApprentice _E_
Due to the horrific events taking place in our country I have decided to postpone my speech on economic opportunity today in Miami. _E_
I'm glad that Mark Cuban won the ridiculous case with the S.E.C. It never should have been brought in the first place! _E_
If Obama doesn't accept my offer to be fully transparent what will he say? _E_
...and an optimist is one who makes opportunities of his difficulties. Harry S. Truman _E_
Dumbass @BillMaher has still not given me the 5 million he committed to charity we just presented him with a demand notice. _E_
Coming together is a beginning. Keeping together is progress. Working together is success. Henry Ford _E_
If we reelect @BarackObama the America we leave our kids and grandkids won't look like the America we were (cont) __HTTP__ _E_
13 Syrian refugees were caught trying to get into the U.S. through the Southern Border. How many made it? WE NEED THE WALL! _E_
I think it would be a good idea—and fair—to include @GovChristie & @MikeHuckabeeGOP in the debate. Both solid & good guys. @FoxBusiness _E_
Great numbers on the economy. All of our work including the passage of many bills & regulation killing Executive Orders now kicking in! _E_
.@DRUDGE_REPORT's First Presidential Debate Poll:Trump: 80%Clinton: 20%Join the MOVEMENT today & lets #MAGA!... __HTTP__ _E_
Via @Investopedia by @swan_investor: The Irreplaceable Brand Of Donald Trump __HTTP__ _E_
"Talent hits a target no one else can hit. Genius hits a target no one else can see." – Arthur Schopenhauer _E_
"Donald Trump Wishes Kristen Stewart A Happy Birthday" __HTTP__ via @HollywoodLife _E_
'Trump rally disrupter was once on Clinton's payroll' __HTTP__ _E_
Interesting how President Obama is flying around in a Boeing 747 on so called Earth Day! _E_
The protesters in New Mexico were thugs who were flying the Mexican flag. The rally inside was big and beautiful but outside criminals! _E_
I will be interviewed on @seanhannity tonight at 10:00. You will find it very interesting (I hope). Enjoy! _E_
Maybe I'm old fashioned but I don't like seeing women in combat. _E_
Don't reward Mitt Romney who let us all down in the last presidential race by voting for Kasich (who voted for NAFTA open borders etc.). _E_
Donna Brazile just stated the DNC RIGGED the system to illegally steal the Primary from Bernie Sanders. Bought and paid for by Crooked H.... _E_
Well the New Year begins. We will together MAKE AMERICA GREAT AGAIN! _E_
If you want to succeed keep your edge. Staying on top of all new developments in your sector = major advantage that pays dividends. _E_
Will be going to Richmond Virginia today. Big crowd! See you there. _E_
Failing comedian Bill Maher who I got an accidental glimpse of the other night is really a dumb guy just look at his past! _E_
"Create your own visual style... let it be unique for yourself and yet identifiable for others." Orson Welles _E_
I like Michael Douglas! _E_
I hope you are watching the Apprentice...tonight's show is great and Brett Michaels is back! _E_
Understand that difficulties mistakes and setbacks are an inevitable part of business and life...But always look for the opportunities. _E_
...a tool of anti Trump political actors. This is unacceptable in a democracy and ought to alarm anyone who wants the FBI to be a nonpartisan enforcer of the law....The FBI wasn't straight with Congress as it hid most of these facts from investigators." Wall Street Journal _E_
Re: Decisions: Cover your bases then ask yourself this question: What am I pretending not to see? This can save a lot of time & trouble. _E_
The Chinese are better off than they were 4 years ago. They have stolen even more from us in jobs & trade during @BarackObama's term. _E_
The fact that we are here today to debate raising America's debt limit is a sign of leadership failure. Sen. Obama 3/16/06 _E_
Looking forward to speaking at the @NHGOP #FITN Republican Leadership Summit on Saturday at 12PM! Let's Make America Great Again! _E_
The Fake News media is officially out of control. They will do or say anything in order to get attention never been a time like this! _E_
It's amazing how celebrities such as @Cher can say horrible untrue things about Republican politicians and it's (cont) __HTTP__ _E_
May God be with the people of Sutherland Springs Texas. The FBI and Law Enforcement has arrived. _E_
I don't know how Al Michaels could have been drunk and arrested on Friday night if he was totally sharp on Saturday morning. _E_
We must suspend immigration from regions linked with terrorism until a proven vetting method is in place. _E_
Join me live in Toledo Ohio!#MakeAmericaGreatAgain __HTTP__ _E_
The Democrats when they incorrectly thought they were going to win asked that the election night tabulation be accepted. Not so anymore! _E_
Making money is art and working is art and good business is the best art. Andy Warhol _E_
Hillary Clinton should have been prosecuted and should be in jail. Instead she is running for president in what looks like a rigged election _E_
Looking forward to hosting @NaghmehAbedini next week @TrumpTowerNY. The White House has abandoned her husband Christian Pastor Abedini. _E_
Donald Trump's birther event is the greatest trick he's ever pulled __HTTP__ _E_
Thank you America! Together we will #MakeAmericaGreatAgain! __HTTP__ _E_
With one of the worst and most prolonged cold spells in history with Atlanta Texas and parts of Florida freezing Global Warming anyone? _E_
Thank you to Doug Parker and American Airlines for all of the help you have given to the U.S. with Hurricane flights. Fantastic job! _E_
If you like to work hard you will attract people with the same ethic. Think Like a Billionaire _E_
70 stories over Panama Bay @TrumpPanama is the country's first five star development. A masterpiece __HTTP__ _E_
What a coincidence that Obama's good friends in Libya and Egypt picked 9/11 to attack our embassies. _E_
Trace and his team raised an amazing amount of $. Looks like a good season for charities. _E_
Thank you New Hampshire! #MakeAmericaGreatAgain __HTTP__ _E_
Sad case @USATODAY did article saying I don't pay bills false only don't pay when work is shoddy bad or not done! They should do same! _E_
I will be interviewed on @foxandfriends tomorrow morning at 7:00. Enjoy! _E_
On 1300 acres in Charlottesville @trumpwinery's wine has been awarded the coveted Virginia Double Gold Medal __HTTP__ _E_
Good.morning I'm going to work! _E_
RT @TeamTrump: It's US vs. them! @realDonaldTrump will fight for you! #BigLeagueTruth #Debates _E_
The @CadillacChamp returns to @TrumpDoral on March 6th __HTTP__ Watch top golfers of the world battle the Blue Monster! _E_
Rising gas prices are causing a steep rise in consumer prices and will slow any future economic growth. It is a tax on all Americans. _E_
Thank you @mcuban for your nice words. I am rapidly becoming a @dallasmavs fan! __HTTP__ _E_
Re Negotiation: Realize that persistence can go a long way. Being stubborn is often an attribute. _E_
Receiving thousands of thank you letters from @LibertyU students for my convocation speech. The honor was all mine! Great people. _E_
Irony! @BarackObama was in Florida yesterday fundraising. Gas also rose to $6/gallon for Florida drivers yesterday. __HTTP__ _E_
Other worthy people were taken off the @CNBC list as well. Stupid poll should be canceled—no credibility. _E_
You have to love what you do or you are never going to be successful no matter what you do in life." Think Big _E_
Stay tuned for my big Obama announcement probably on Wednesday. _E_
I will be on @MeetThePress with @ChuckTodd tomorrow morning at 10:30am ET on @NBC. Enjoy! _E_
Join me in Atlanta on Wednesday at noon! #Trump2016Tickets: __HTTP__ __HTTP__ _E_
Michael Forbes is a loser who failed to stop what was just named "the golf course of the year" and which has brought ... _E_
RT @DRUDGE_REPORT: CLINTON EMAIL LED TO EXECUTION IN IRAN? __HTTP__ _E_
.@heytana great job we are all proud of you! _E_
On this solemn day of remembrance we can all take joy in the fact that Bin Laden's last sight was a Navy SEAL pulling the trigger. _E_
Standing ovation after promising to bring the American Dream back and better than ever before! __HTTP__ _E_
Remember NBC increased Celebrity Apprentice to 2 hours starting this Sunday night at 9 P.M. through end of season great news for App lovers _E_
ObamaCare is already done. HHS Sec. Sebelius is trying to force private companies to finance implementation __HTTP__ _E_
.@TrumpPanama is Panama City's premiere hotel. 70 stories over Punta Pacifica excellence has arrived to So. America __HTTP__ _E_
The harder you work the luckier you get. Gary Player _E_
Dummy @Clare_OC from failing @Forbes magazine: NASCAR deal was 1 nite ballroom ESPN was small golf outing... _E_
I have hired renowned golf course architect Gil Hanse to rebuild The Blue Monster at Doral. He designed the 2016 (cont) __HTTP__ _E_
Deserter Bergdahl returns to active duty as parents of brave soldiers killed looking for him grieve. Obama trying to play this mistake down! _E_
I wonder what the answer is on @BarackObama's college application to the question: place of birth? Maybe the (cont) __HTTP__ _E_
Republicans and Democrats should get back to work immediately to work on resolving downgrade. This is not a go... (cont) __HTTP__ _E_
Hillary Clinton doesn't have the strength or stamina to be president. Jeb Bush is a low energy individual but Hillary is not much better! _E_
.@JoselynMartinez is a very brave woman who caught her father's killer __HTTP__ She visited Ivanka & me at Trump Tower today. _E_
Credibility is important to me hence must admit that both candidates did really well last night. #VPDebate _E_
Trump: If Republicans 'don't get tough they're not going to win this election' __HTTP__ via @thehill _E_
Thank you Great Faith Ministries International Bishop Wayne T. Jackson and Detroit! __HTTP__ _E_
As I've said many times before Jon Stewart @TheDailyShow is highly overrated. _E_
Will be interviewed by @seanhannity tonight for the full hour. Hope you enjoy it and more importantly hope you agree! _E_
Congress must end chain migration so that we can have a system that is SECURITY BASED! We need to make AMERICA SAFE! #USA __HTTP__ _E_
On my way to see the great people of Maine. Will be landing in Portland in 2 hours. Look forward to it! #Trump2016 _E_
A Rod's salary is more than the entire @astros. Half the players on @astros will have better seasons than him. A Rod is a joke! _E_
It's okay but why do the haters (& losers) want to follow me on twitter?? Get a life! _E_
#TrumpAdvice __HTTP__ _E_
The Hostess closing did not have to happen should have been an easy deal to make. _E_
Via @PatheosFamily by @BristolsBlog: Trump Weighs In on Saeed: Obama 'Didn't Even Ask' __HTTP__ Thanks Bristol! _E_
How many more times do we all have to watch and pay for that stupid and never ending #SmokeyBearHug commercial. How much is govt. spending? _E_
Trump's Tax Plan: A Proposal Reagan Would Approve? by Jeff Bell __HTTP__ _E_
The Donald J. Trump Signature Collection's new line is out @Macys ties shirts accessories great & going fast! __HTTP__ _E_
.@GolfMagazine is great thanks! _E_
Via Hardball with Chris Matthews __HTTP__ _E_
Trump Golf Links at Ferry Point will host many major championships over the years. Great thing for NYC—congratulations to all! _E_
Trump lays out big plans for Doonbeg resort: Billionaire says investment shows Ireland's economy recovering __HTTP__ _E_
RT @MittRomney: I am running for president to get us creating wealth again not to redistribute it. _E_
Thank you @FrankLuntz __HTTP__ _E_
The home of the boardroom @TrumpTowerNY __HTTP__ #CelebApprentice _E_
Just watched Full Metal Jacket can't believe R. Lee Ermey didn't win the Academy Award as the drill sergeant. Political nominations! _E_
British PM Cameron is making a fool of himself by wasting billions of pounds on unwanted & environment destroying Scottish windmills. _E_
I wonder what the next scandal will be in D.C.? Can we handle yet another? _E_
Will be in Phoenix Arizona on Wednesday. Changing venue to much larger one. Demand is unreal. Polls looking great! #ImWithYou _E_
China is our enemy. It's time we start acting like it...and if we do our job corectly China will gain a whole (cont) __HTTP__ _E_
WE ARE MAKING AMERICA GREAT AGAIN! __HTTP__ _E_
We should be focusing on beautiful clean air & not on wasteful & very expensive GLOBAL WARMING bullshit! China & others are hurting our air _E_
#TRUMP International Reality will be America's premiere real estate brokerage house __HTTP__ w/ the most distinctive services. _E_
Thank you Lexington South Carolina!#Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
Great knockout on Saturday by Juan Manuel Marquez on Manny Pacquiao. A great fight! _E_
"The most terrifying words in the English language are: I'm from the government and I'm here to help." – Pres. Ronald Reagan _E_
Congratulations Eric & Lara. Very proud and happy for the two of you! __HTTP__ _E_
#MidasTouch is divided into five sections. The second is the index finger which represents Focus __HTTP__ _E_
Join me live in Hershey Pennsylvania! #MakeAmericaGreatAgain LIVE: __HTTP__ __HTTP__ _E_
.@HighSock_Sunday #asktrump __HTTP__ _E_
Congratulations to the Houston @Astros 2017 #WorldSeries Champions#HoustonStrong #EarnHistory __HTTP__ _E_
Foreign leaders are already requesting meetings with @MittRomney to warn that we are viewed as in decline __HTTP__ _E_
So many positive things going on for the U.S.A. and the Fake News Media just doesn't want to go there. Same negative stories over and over again! No wonder the People no longer trust the media whose approval ratings are correctly at their lowest levels in history! #MAGA _E_
Via @BreitbartNews by @rwildewrites: "TRUMP: 'I WOULD BUILD A BORDER FENCE LIKE YOU HAVE NEVER SEEN BEFORE'" __HTTP__ _E_
Anyone who doubts the strength or determination of the U.S. should look to our past....and you will doubt it no longer. __HTTP__ _E_
If I win I am going to instruct my AG to get a special prosecutor to look into your situation bc there's never been anything like your lies. _E_
Will be interviewed by @MariaBartiromo on @FoxBizAlert at 7:30 A.M. Enjoy! _E_
Thank you! #Trump2016 __HTTP__ __HTTP__ _E_
In this time of economic turmoil where millions of Americans are unemployed our tax dollars are paying @BillMoyers' big @PBS salary! _E_
You can only smile when the losers of the world try so hard to put down successful people. Just remember they all want to be YOU! _E_
Watch my interview with Greta Van Susteren on her show On the Record tonight on Fox News in the 10 p.m. hour. _E_
A disgraceful verdict in the Kate Steinle case! No wonder the people of our Country are so angry with Illegal Immigration. _E_
CNN/ORC Poll results just out for Nevada—WOW! Trump 38 Carson 22 Fiorina 8 Bush 6 Cruz 4 __HTTP__ _E_
...Remember I told you so. _E_
Thank you Alabama! From now on it's going to be #AmericaFirst. Our goal is to bring back that wonderful phrase:... __HTTP__ _E_
Just got final renderings of Trump National Doral in Miami there will be nothing like it in the Country will be the best! _E_
China has hacked another US government body. __HTTP__ will we learn? _E_
The failing @nytimes wrote a story about my management style & that I don't have many people. I have 73 Hillary has 800 & I'm beating her. _E_
Ted Cruz complains about my views on eminent domain but without it we wouldn't have roads highways airports schools or even pipelines. _E_
ObamaCare is torturing the American People.The Democrats have fooled the people long enough. Repeal or Repeal & Replace! I have pen in hand. _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Amazing both Transformers & Dark Knight Rises featured Trump properties and each grossed over $1B. Just coincidence. _E_
....... I disagree but it's still cool. _E_
People are loving the new line of Trump ties and shirts at Macy's. Check them out! _E_
RT @IsraeliPM: PM Benjamin Netanyahu at weekly Cabinet meeting:In two weeks Israel will host @POTUS Trump on his first trip as President... _E_
It was an honor to welcome so many truckers and trucking industry leaders to the @WhiteHouse today! __HTTP__ _E_
I hear a failing New York newspaper is going to publish one of my old cell phone numbers. So original just one of many! _E_
Hopefully the violent and vicious killing by ISIS of a beloved French priest is causing people to start thinking rationally. Get tough! _E_
Looking forward to meeting the students of Urbandale High School tomorrow __HTTP__ _E_
Julian Assange said a 14 year old could have hacked Podesta why was DNC so careless? Also said Russians did not give him the info! _E_
With @TraceAdkins on top of the truck the crowd definitely buzzed. #CelebApprentice _E_
Join me on Saturday in Syracuse New York! #NYPrimary #Trump2016 __HTTP__ __HTTP__ _E_
My interview yesterday from Newsmax Obama Is 'Now Totally Lost' Boehner Must Not Fold __HTTP__ _E_
Congratulations to @GovMikeHuckabee on last night's tremendous speech. Mike united the party faithful and explained that we can do better. _E_
Why do so many people say I hate President Obama—I don't hate the President at all. I just disagree with his policies! _E_
If you treat people right they will treat you right...ninety percent of the time. Franklin D. Roosevelt _E_
Republicans are going for the big Budget approval today first step toward massive tax cuts. I think we have the votes but who knows? _E_
Pervert Anthony Wiener will never be able to get away from his perversion the cure rate is ZERO. _E_
Great! __HTTP__ _E_
"China presents three big threats to the United States in its outrageous currency manipulation its systematic (cont) __HTTP__ _E_
President Obama looks and sounds so ridiculous making his speech in Cuba especially in the shadows of Brussels. He is being treated badly! _E_
Negotiation tip: Know exactly what you want and focus on that. Trust your instincts even after you've honed your skills. _E_
The Jets just don't have it. Time for a quarterback change! _E_
How does Obama rationalize giving Iran $8B in sanction relief when a Christian pastor is being tortured in an Iranian prison? _E_
Via @MailOnline @dmartosko Donald Trump says it's morally unfair of Obama to send soldiers into Ebola hot zone __HTTP__ _E_
True thanks. __HTTP__ _E_
Ungrateful TRAITOR Chelsea Manning who should never have been released from prison is now calling President Obama a weak leader. Terrible! _E_
Why is @BarackObama continuing to lie? __HTTP__ has found that @MittRomney did not ship jobs overseas __HTTP__ _E_
CLINTON'S FLAILING SYRIA POLICY WAS JUDGED A FAILURE: __HTTP__ #VPDebate _E_
How many illegal foreign donations will Obama collect this final week? Another scandal ignored by the liberal media. __HTTP__ _E_
.@mike_pence was fantastic tonight. Will be a great V.P. _E_
I am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_
Discussing #NewYorkValuesin Buffalo last night on the eve of the #NYPrimary.LETS GO NY! #VoteTrump __HTTP__ _E_
I am making a big speech the night of the @FoxNews debate but I wish everyone well. Yesterday was a big day for me with 5 wins! _E_
The media has not covered my long shot great finish in Iowa fairly. Brought in record voters and got second highest vote total in history! _E_
Just named General H.R. McMaster National Security Advisor. _E_
Think big set your vision high and go for it. You'll be shocked by what you can accomplish when you do. Midas Touch _E_
Thank you! WE WILL MAKE AMERICA GREAT AGAIN! #Trump2016 __HTTP__ _E_
Always pretend that you're working for yourself. You'll do a wonderful job. It's simple but it works. Think Like a Billionaire _E_
New national Bloomberg poll just released thank you! Join the MOVEMENT: __HTTP__ #TrumpTrain... __HTTP__ _E_
.@IvankaTrump and me at the @todayshow this morning. __HTTP__ _E_
Great news out of New Hampshire! DonaldTrump is pulling away from the pack w/ 2nd is 17% behind him! #Trump2016 __HTTP__ _E_
I have an open door policy for my employees. I'm accessible because I like to know what's going on. The Midas Touch _E_
Our country has the slowest growth since 1929. #BigLeagueTruth #debate _E_
North Carolina lost 300000 manufacturing jobs and Ohio lost 400000 since 2000. Going to Mexico etc. NO MORE IF I WIN WE WILL BRING BACK! _E_
It's Tuesday. How many more customers has Glenfiddich lost today? _E_
After thousands lost and spending two trillion dollars Iraq (I told you so) is imploding. Really dumb pols put us and kept us there so sad! _E_
I have never seen a thin person drinking Diet Coke. _E_
Rick Perry did an absolutely horrible job of securing the border. He should be ashamed of himself. Gov. Abbott has since been terrific. _E_
I have been watching and loving the United States for many years and have NEVER seen it look weaker or less effective! _E_
My speech from last Saturday's @Citizens_United @AFPhq #NHFreedomSummit __HTTP__ via @cspan _E_
No one will work harder. No one will move heaven and earth like Mitt Romney to make this country a better place to live! @AnnDRomney _E_
Dopey Prince @Alwaleed_Talal wants to control our U.S. politicians with daddy's money. Can't do it when I get elected. #Trump2016 _E_
Victoria's Secret reps were nasty to @KateUpton and now she is doing great. _E_
DOW S&P 500 and NASDAQ close at record highs! #MAGA __HTTP__ _E_
RT @realDonaldTrump: ...big unnecessary regulation cuts made it all possible" (among many other things). "President Trump reversed the poli... _E_
The sub station in Blackdog is very dangerous on unregulated landfill—fire hazard! @AlexSalmond @pressjournal _E_
Exclusive Video–Broaddrick Willey Jones to Bill's Defenders: 'These Are Crimes' 'Terrified' of 'Enabler' Hillary __HTTP__ _E_
The Celebrity Apprentice has a two hour premiere this Sunday March 14th at 9 p.m. on NBC. This will be the best season yet see you then! _E_
CNN: New GOP polls show Trump's favorability is up __HTTP__ _E_
Located in the beautiful countryside of Mooresville @Trump_Charlotte has a superb clubhouse & top amenities __HTTP__ _E_
THe Chinese military is already hacking our satellites __HTTP__ The Chinese government is not an American ally. _E_
The Theater must always be a safe and special place.The cast of Hamilton was very rude last night to a very good man Mike Pence. Apologize! _E_
.@MarieLeff #asktrump __HTTP__ _E_
Obama is without question the WORST EVER president. I predict he will now do something really bad and totally stupid to show manhood! _E_
Lightweight A.G. Eric Schneiderman is perhaps the most incompetent and least respected A.G. in the U.S. He is a total joke! _E_
Mitch get back to work and put Repeal & Replace Tax Reform & Cuts and a great Infrastructure Bill on my desk for signing. You can do it! _E_
Happy Friday the 13th __HTTP__ _E_
Tweet me back if u think we should start a petition to fire @hardball_chris for his comments on Sandy & the death & destruction it caused. _E_
China's economy is now projected to overtake the US as the world's largest economy by 2027 __HTTP__ #TimeToGetTough _E_
Trump International Hotel & Tower Vancouver will be a fantastic addition to a spectacular city. __HTTP__ _E_
Donald Trump Explains Why He Called Django Unchained 'Racist' In Tweet __HTTP__ via @accesshollywood _E_
Via @Newsmax_Media by @melaniebatley: "Trump Backed Candidate @leezeldin Wins NY GOP Primary" __HTTP__ _E_
Thank you to Jeffrey Lord @AmSpec for his incredible & insightful article this weekend on failing & irrelevant @BuzzFeed _E_
Packed venue of people who want to #MakeAmericaGreatAgain __HTTP__ _E_
My @LateNightJimmy interview with @jimmyfallon discussing the new season of All Star @CelebApprentice __HTTP__ _E_
I will be speaking the night before the RNC in Sarasota FL when I receive the Statesman of the Year award. _E_
.@MittRomney will make us energy independent by 2020 __HTTP__ @BarackObama will keep wasting money on Solyndra projects. _E_
I'll be on with Larry Kudlow of the Kudlow Report tonight on CNBC at 7 p.m. We'll be discussing current affairs and politics. Tune in. _E_
Congrats to @BarackObama he has now had over 40 months straight of over 8% unemployment while accruing over $6T (cont) __HTTP__ _E_
Via @TVbytheNumbers:"TV Ratings Sunday 'Family Guy' & 'The Simpsons' Down 'All Star Celebrity Apprentice' Up" __HTTP__ _E_
Join me in Greensboro North Carolina tomorrow at 2:00pm! #TrumpRally __HTTP__ __HTTP__ _E_
Kate is donating a #kidney to her husband __HTTP__ . You can help! I did @fundanything #donate _E_
MAKE AMERICA GREAT AGAIN! MAKE AMERICA SAFE AGAIN!#Trump2016 #AmericaFirst __HTTP__ _E_
I have been consistent in my opposition to Common Core. Get rid of Common Core keep education local! _E_
.@ApprenticeNBC season premiere this Sunday at 9/8c on @NBC __HTTP__ _E_
"It's always great to be in business with Donald Trump" said @Telemundo president Emilio Romano. __HTTP__ _E_
The fact that Sneaky Dianne Feinstein who has on numerous occasions stated that collusion between Trump/Russia has not been found would release testimony in such an underhanded and possibly illegal way totally without authorization is a disgrace. Must have tough Primary! _E_
I will be on @foxandfriends Monday morning at 7.00. A lot to talk about! _E_
Thank you to former campaign adviser Michael Caputo for saying so powerfully that there was no Russian collusion in our winning campaign. _E_
Remember to keep going if you stop your momentum will stop. _E_
I have an idea for A Rod buy a home at @TrumpGolfLA overlooking the Pacific will bring you better luck. _E_
A letter from an amazing woman __HTTP__ _E_
The dollar always talks in the end although our pols are killing the dollar! _E_
Remember that Bill Clinton was brought in to help Hillary against Obama in 2008. He was terrible failed badly and was called a racist! _E_
Thank you to General Motors and Walmart for starting the big jobs push back into the U.S.! _E_
No surprise. @DNC displayed Russian ships in tribute to vets __HTTP__ Did they mean to honor the Russians? _E_
"Trump's Championship #BlueMonster Course Opens To Rave Reviews" __HTTP__ via @sacbee_news _E_
.@AlexSalmond don't worry my ad will be shown across the world and it is highly accurate! _E_
CLINTON IS WEAK ON NORTH KOREA: __HTTP__ #VPDebate _E_
I have always been the same person remain true to self.The media wants me to change but it would be very dishonest to supporters to do so! _E_
I hear @NBCNews / @WSJ came out with another one of their phony polls. While I am leading they are totally discredited after last S.C. poll _E_
Opening in 2016 @TrumpVancouver's original twisting design will transform the skyline at 616 ft. & 63 stories __HTTP__ _E_
Do as I say not as I do.The politicians who passed ObamaCare are now exempting themselves from the monstrosity __HTTP__ _E_
The Bay Bridge in San Francisco is being built by the Chinese tremendous cost overruns. A total mess. We should build our own bridges etc _E_
Masa (SoftBank) of Japan has agreed to invest $50 billion in the U.S. toward businesses and 50000 new jobs.... _E_
mention crime infested) rather than falsely complaining about the election results. All talk talk talk no action or results. Sad! _E_
The Obstructionist Democrats have given us (or not fixed) some of the worst trade deals in World History. I am changing that fast! _E_
Cleveland just made a very wise decision congrats! _E_
Thank you General. #Trump2016 __HTTP__ _E_
Trump International Golf Club Turnberry Scotland home to four of the greatest Open Championships of all time.. __HTTP__ _E_
Depression be careful of China! __HTTP__ _E_
The language used by me at the DACA meeting was tough but this was not the language used. What was really tough was the outlandish proposal made a big setback for DACA! _E_
Thank you for your support! We will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_
Incompetent @RichLowry lost it tonight on @FoxNews. He should not be allowed on TV and the FCC should fine him! _E_
...about then candidate Trump. Catherine Herridge @FoxNews. So why doesn't Fake News report this? Witch Hunt! Purposely phony reporting. _E_
You have enemies? Good. That means you've stood up for something sometime in your life. Winston Churchill _E_
Syria has prepared for an attack based on all of our talk they have moved targeted ammunition and supplies to new locations.Amazing! _E_
Becoming a US citizen is not a right it's a privilege. _E_
Does anybody really think that President Obama didn't know about our spying on the leaders of allies around the world not possible! _E_
.@TrumpNationalHV features wide open pristine fairways tour caliber greens 64 strategically placed sand bunkers __HTTP__ _E_
Congratulations to all of the "DEPLORABLES" and the millions of people who gave us a MASSIVE (304 227) Electoral College landslide victory! __HTTP__ _E_
The best way out is always through. Robert Frost _E_
Obstacles are those frightening things that become visible when we take our eyes off our goals. Henry Ford _E_
Nevada: A quick reminder that today is your last day to register to vote! __HTTP__ __HTTP__ _E_
.@jacknicklaus has done a GREAT job as the architect of my new golf course at Ferry Point. NYC is very proud! _E_
Obama's deal raises taxes on 77% of national households. With Obama Care taxes kicking in now everyone will be paying for his 2nd term. _E_
.@BrentBozell one of the National Review lightweights came to my office begging for money like a dog. Why doesn't he say that? _E_
We don't have the leadership including the Generals (who just said the element of surprise does not matter) to attack anyone! Cool it. _E_
Now we will never know if @BarackObama would have been able to fill Bank of America Stadium. Pretty convenient. _E_
Watching Senator Richard Blumenthal speak of Comey is a joke. Richie devised one of the greatest military frauds in U.S. history. For.... _E_
Anne Hathaway is a good winner! _E_
We launched a new series of #Trump2016 videos via Facebook. A new topic everyday! Watch: __HTTP__ __HTTP__ _E_
Congratulations to the $1B ObamaCare website on enrolling FOUR in Delaware. Cost to us $4M __HTTP__ _E_
Thank you! #AmericaFirst __HTTP__ _E_
Russia is on the move in the Ukraine Iran is nuking up & Libya is run by Al Qaeda yet Obama is busy issuing 'climate change" warnings. _E_
Just letting China know in advance that the USA will win the medal count in the Olympics. Even with your cheating you can't beat us. _E_
Hard for Biden to justify Libya mess but doing best he can. #VPDebate _E_
Eliot Spitzer's illegal frivolous & over reaching harassment of Hank Greenberg at AIG played a major part in 2008 financial meltdown. _E_
Getting ready for some big news with my friends at @pgaofamerica _E_
Will be on @foxandfriends at 8:00 A.M. _E_
Happy Birthday to my legendary friend Aretha Franklin. _E_
Great minds have purposes others have wishes. Washington Irving _E_
Courage is being scared to death... and saddling up anyway. John Wayne _E_
The failing @nytimes writes total fiction concerning me. They have gotten it wrong for two years and now are making up stories & sources! _E_
...Who says the death penalty is not a deterrent? _E_
I will be heading to Dubai where I am doing a GREAT project with Damac will be a massive success! _E_
RT @foxandfriends: Hannity: Russia allegations 'boomeranging back' on Democrats __HTTP__ _E_
Mexico is allowing many thousands to go thru their country & to our very stupid open door. The Mexicans are laughing at us as buses pass by. _E_
From 10 11 pm @ApprenticeNBC ranks #1 in 18 49 among ABCCBS and NBC. #CelebApprentice _E_
.@MittRomney should not give any other further information until @BarackObama releases the things that everyone wants to see _E_
I just finished a great meeting with the Republican Senators concerning HealthCare. They really want to get it right unlike OCare! _E_
The American people are sick and tired of not being able to lead normal lives and to constantly be on the lookout for terror and terrorists! _E_
Entrepreneurs are visionaries in some respects they look beyond the present. Keep that in mind when looking for opportunities. _E_
We want to make sure that we have the workforce development programs we need to ensure these jobs are.... __HTTP__ _E_
"It takes guts to win fortunately most people don't have guts! Donald J. Trump _E_
I will be on @meetthepress at 10:30. @nbc will be releasing their new poll numbers. Based on the debate results I should do well who knows? _E_
.@FoxNews Chris Wallace: "More evidence of Dem collusion with Russia than GOP" __HTTP__ _E_
Will be interviewed on @JudgeJeanine at 9:00 P.M. Enjoy! _E_
President Obama refuses to answer question about Iran terror funding. I won't dodge questions as your President. __HTTP__ _E_
New national poll released. Join the MOVEMENT & together we will #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_
Remember China is not a friend of the United States! _E_
Ringling Brothers is phasing out their elephants. Ifor one will never go again. They probably used the animal rights stuff to reduce costs _E_
Egypt is a total mess. We should have backed Mubarak instead of dropping him like a dog. _E_
Phony Club For Growth tried to shake me down for one million dollars & is now putting out nasty negative ads on me. They are total losers! _E_
No matter what you're managing don't assume you can glide by. You have to work to maintain your momentum. Trump: How to Get Rich _E_
Economic confidence is soaring as we unleash the power of private sector job creation and stand up for the American Workers. #AmericaFirst _E_
I am a cautious optimist. Call it positive thinking with a lot of reality checks. _E_
SEE YOU IN COURT THE SECURITY OF OUR NATION IS AT STAKE! _E_
.@donlemon on @CNN at 10:00 P.M. _E_
It's driving @ariannahuff & the money losing @HuffingtonPost post crazy that I am #1 in their poll and they only write bad stories about me! _E_
The Misery Index is at a 28 year high. _E_
My economic policy speech will be carried live at 12:15 P.M. Enjoy! _E_
Women defy media narrative love Trump at packed Michigan rally.VIDEO: __HTTP__ __HTTP__ _E_
Thank you for your support! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
For information on Trump University victory call Alan Garten Esquire at 212.836.3203 or Jeff Goldman Esquire at 212.867.4466. _E_
_xx_justme I still can't believe Donald Trump responded to my tweet. Respect #Trump2016 He would be the best Pres for this Country. Thx _E_
Wind turbines are a scourge to communities and wildlife. They are environmental disasters. _E_
Ivanka and Joan Rivers will be working hard tonight at the Live Finale everybody must watch the OPENING at 9. _E_
A new poll indicates that 68% of my supporters would vote for me if I departed the GOP & ran as an independent. __HTTP__ _E_
Via @DMRegister by @WilliamPetroski: Trump: I can make America great again __HTTP__ _E_
If Russia or some other entity was hacking why did the White House wait so long to act? Why did they only complain after Hillary lost? _E_
Sleazebag @BashirLive has just been forced to resign from @msnbc. His pathetic apology wasn't enough to save his job. @SarahPalinUSA _E_
Join me in Charleston WV tomorrow! __HTTP__ _E_
Wow CNN just said that Donald Trump won the DEBATE connected best with audience. Also Time Drudge Newsmax N.Y.Times and more! _E_
I should host the #Oscars just to shake things up this is not good! _E_
Ford said last week that it will expand in Michigan and U.S. instead of building a BILLION dollar plant in Mexico. Thank you Ford & Fiat C! _E_
Via @WashTimes By Eugene Dunn: "Trump could lead U.S. forward" __HTTP__ _E_
Today Americans everywhere remember the brave men and women of @NASA who lost their lives in our Nation's eternal quest to expand the boundaries of human potential. __HTTP__ __HTTP__ _E_
In '08 @PaulRyanVP predicted that US headed toward bankruptcy __HTTP__ @BarackObama has added over $6T in debt since. Scary. _E_
Make sure to vote today. Vote for real change. Change that will deliver jobs and a free & strong America. Vote for @MittRomney. _E_
I'll be on@SquawkCNBC tomorrow at 7:30 am #TrumpTuesday _E_
Under our President ISIS is gaining great strength __HTTP__ _E_
It was my great honor to deliver the #CGACommencement17 at the @USCGAcademy. CONGRATULATIONS to the Class of 2017!... __HTTP__ _E_
"Luck does not come around often. So when it does be sure to take full advantage of it even if it means working hard. Think Big _E_
Now another Obama speech from 2002 with him talking about taking the rich's 'stuff' __HTTP__ Who is this guy? Where's the media? _E_
Join @autismspeaks and light the world blue on 4/2. #LIUB will raise awareness for millions with autism! _E_
Big day in Alabama. Vote for Luther Strange he will be great! _E_
I will be doing Fox & Friends at 7 (15 minutes). Enjoy it and your day! _E_
Just announced that in the history of @CNN last night's debate was its highest rated ever. Will they send me flowers & a thank you note? _E_
Thank you Dallas Texas! __HTTP__ __HTTP__ _E_
The elites want Common Core so they can take education out of parental control. NO! Let's Make America Great Again! __HTTP__ _E_
I look forward to all meetings today with world leaders including my meeting with Vladimir Putin. Much to discuss.#G20Summit #USA _E_
According to Bill O'Reilly 80% of all the shootings in New York City are blacks if you add Hispanics that figure goes to 98%. 1% white. _E_
That was an amazing interview on @foxandfriends I hope the rest of the media picks it up to show how totally dishonest the @nytimes is! _E_
Now that it's almost over I can't believe that unions & management couldn't save Twinkies etc & management just got a $1.75M bonus. _E_
Record setting gas prices in the U.S. we're really looking dumb. Lots of $'s being made on us. _E_
He @MittRomney wrote a great piece on China __HTTP__ @JonHuntsman criticized him (cont) __HTTP__ _E_
At the foot of Whitestone Bridge in the Bronx @TrumpFerryPoint offers fantastic views of the Manhattan skyline __HTTP__ _E_
A note from the fabulous Mark Burnett: "Donald congratulations again we are #1 in the 10:00pm hour. I am tweeting about it." _E_
New Reuters poll thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
So don't forget to enter the Serta Counting Sheep for Hire contest! www.youtube.com/user/mattressserta _E_
...if Congress gives us the massive tax cuts (and reform) I am asking for those numbers will grow by leaps and bounds. #MAGA _E_
Being in Detroit today was wonderful. Quick stop in Ohio to meet with some of our great supporters. Just got back home! _E_
Ted Cruz is in trouble for not reporting his bank borrowing in his very important Financial Disclosure Form. Very low interest loans scam! _E_
"@realDonaldTrump: I would like to extend my best wishes to all even the haters and losers on this special date September 11th." _E_
Via @bpolitics by @tdopp: "In Iowa Trump Promises to 'Surprise a Lot of People'" __HTTP__ _E_
To show you how dishonest some of the press is they took my funny & (cont) __HTTP__ _E_
2016 GOP Nomination Polls have me as #1 as seen on @SpecialReport with @BretBaier. __HTTP__ _E_
It's inconvenient and inconsiderate: @BarackObama is doing a fundraiser tonight making it almost impossibl... (cont) __HTTP__ _E_
The election result in France is very disappointing. The Europeans have to embrace austerity in order for their economy to fully recover. _E_
I promise to rebuild our military and secure our border. Democrats want to shut down the government. Politics! _E_
State Department official accused of offering 'quid pro quo' in Clinton email scandal __HTTP__ _E_
I will be on FOX with the great @JudgeJeanine tonight at 9pm EST! Enjoy! #Trump2016 _E_
Hillary said at debate ISIS is going to people showing videos in order to recruit more radical jihadistst. She made up story want apology! _E_
Crooked Hillary Clinton is a fraud who has put the public and country at risk by her illegal and very stupid use of e mails. Many missing! _E_
I say we cannot continue to let Obama fly around on Air Force 1 at a cost of millions of dollars a day for the purpose of politics & play! _E_
Thank you to Chris Cox and Bikers for Trump Your support has been amazing. I will never forget. MAKE AMERICA GREAT AGAIN! _E_
I am very happy to have the civilian version of The Apprentice back on the air this fall. There will be excitement as well as opportunity. _E_
Wisconsin and Pennsylvania have just certified my wins in those states. I actually picked up additional votes! _E_
Great news! #MAGA __HTTP__ _E_
Michelle Obama's weekend ski trip toAspen makes it 16 times that Obamas have gone on vacation in 3 years. (cont) __HTTP__ _E_
Yesterday I was thrilled to be with so many WONDERFUL friends in Utah's MAGNIFICENT Capitol.It was my honor to sign two Presidential Proclamations that will modify the national monuments designations of both Bears Ears and Grand Staircase Escalante... __HTTP__ __HTTP__ _E_
Thank you @TrumpSoHo @TrumpNewYork for helping me celebrate #agreatcause @MarineCorpsLEF while accepting the Commandant's Leadership award! _E_
Never get good #'s from failing Des Moines Register/Bloomberg. I think something's going on w/them. Up 13 in IA according to respected CNN. _E_
ObamaCare is in serious trouble. The Dems need big money to keep it going otherwise it dies far sooner than anyone would have thought. _E_
The United States is prepared to work with each of the leaders in this room today to achieve mutually beneficial commerce that is in the interests of both your countries and mine. That is the message I am here to deliver today. #APEC2017 __HTTP__ _E_
Do you all remember how beautiful and safe a place Brussels was. Not anymore it is from a different world! U.S. must be vigilant and smart! _E_
Must read article on Obama's illegal fundraising from abroad __HTTP__ Foreign candidate getting foreign donations. _E_
ICYMI @nypost's @LoisWeiss described my Monday @ICSC speech @javitscenter as one of my "best and most riveting" __HTTP__ _E_
Hillary and her friends! __HTTP__ _E_
.@T Mobile has so many service complaints a total joke! _E_
I am in Las Vegas for the @MissUSA 2012 pageant. Watch live tonight on @NBC at 9PM ET. __HTTP__ _E_
NEBRASKA #VoteTrump TODAY!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Happy 70th Birthday @CIA! __HTTP__ _E_
We are fighting hard for Merit Based immigration no more Democrat Lottery Systems. We must get MUCH tougher (and smarter). @foxandfriends _E_
Big win in Montana for Republicans! We _E_
RT @EricTrump: #Arizona: We made it easy to find your polling location for today's primary! Simply visit __HTTP__ __HTTP__ _E_
Take a look at this amazing photo of the cast from the first ever All Star @CelebApprentice __HTTP__ _E_
Out of hundreds of deals & transactions I have used the bankruptcy laws a few times to make deals better. Nothing personal just business. _E_
Now an additional 600 700 jobs in America (2000) being eliminated for move to Mexico via Hartford Courant. __HTTP__ _E_
Outright disgusting the Obama administration has continually stonewalled and lied to US Amb. Sean Smith's mother __HTTP__ _E_
Hillary won't call out radical Islam! She will be soundly defeated. _E_
Why isn't the Arab League paying for everything and sending troops? They want us to do their dirty work with no involvement by themselves! _E_
The DC press corps is obsessed with my @CPACnews speech which is scheduled tomorrow 8:45AM in the Potomac Ballroom. Can't blame them. _E_
The haters and losers that assume I was a non athlete and know nothing about coaches should look into my past unlike our President open book _E_
.@MissUniverse final 3 on now. Great people great new owner @IMG. WATCH. _E_
RT @realDonaldTrump: Loser terrorists must be dealt with in a much tougher manner.The internet is their main recruitment tool which we must... _E_
Make sure to have fun and celebrate NYE with friends and family. Happy New Year everyone! _E_
Donald Trump Hands Bill O'Reilly Cable TV Viewership Win @deadline __HTTP__ _E_
Via @CBSNews by @ReenaJF: Donald Trump scolds Republicans: 'Toughen up' __HTTP__ _E_
The Fed continues to flood the market with US dollars. Wrong move. _E_
May jobless numbers have been readjusted to 8.2%. @BarackObama's economy is a disaster __HTTP__ New numbers tomorrow. _E_
There is only one person who should be crossing our southern border USMC Sgt. Tahmooressi. Boycott Mexico? #FreeOurMarine _E_
Thank you for the incredible support this morning Tampa Florida! #ICYMI watch here: __HTTP__ __HTTP__ _E_
The failing @nytimes should be focused on good reporting and the papers financial survival and not with constant hits on Donald Trump! _E_
Wake Up America China is eating our lunch. _E_
Great rally in New Mexico amazing crowd! Now in L.A. Big rally in Anaheim. _E_
I am in Indiana where we just had a great rally. Fantastic people! Staying at a Holiday Inn Express new and clean not bad! _E_
It is really a shame that Barack Obama may stop $5M from being generously donated to charity all because he refuses to be transparent. _E_
Just in big news I have been declared the winner of the CNMI Rep Caucus with 72.8% of the vote! Thank you! #SuperTuesday #VoteTrump _E_
I heard that the underachieving John King of @CNN on Inside Politics was one hour of lies. Happily few people are watching dead network! _E_
Here I am with @trishstratuscom #WWEHOF __HTTP__ _E_
What will be @RickSantorum's excuse tomorrow after @MittRomney wins Wisconsin and Maryland? Time for Rick to face reality and drop out. _E_
When it comes to China @BarackObama practices pretty please diplomacy. He begs and pleads and bows and it... (cont) __HTTP__ _E_
Guess which POTUS has held more fundraisers than the previous 5 combined? __HTTP__ @BarackObama is (cont) __HTTP__ _E_
Bernie Sanders endorsing Crooked Hillary Clinton is like Occupy Wall Street endorsing Goldman Sachs. _E_
Don King and so many other African Americans who know me well and endorsed me would not have done so if they thought I was a racist! _E_
Monday night at 8:00 will be must see television. Our wonderful Joan Rivers plays a major role as my advisor on the Apprentice. AMAZING! _E_
Why does @Greta have a fired Bushy like dummy John Sununu on spewing false info? I will beat Hillary by a lot she wants no part of Trump. _E_
There have been 17 shutdowns since 1976 14 under Reagan and Bush with Democrat Congresses who wanted more spending. _E_
Eight Syrians were just caught on the southern border trying to get into the U.S. ISIS maybe? I told you so. WE NEED A BIG & BEAUTIFUL WALL! _E_
There are 11 more Solyndras in the @BarackObama energy program __HTTP__  He loves to waste our (cont) __HTTP__ _E_
Uranium deal to Russia with Clinton help and Obama Administration knowledge is the biggest story that Fake Media doesn't want to follow! _E_
Military reps have attacked @BarackObama over Bin Laden leaks they believe he's just using this for his benefit. Not a big surprise... _E_
I will be making my announcement on the next Secretary of State tomorrow morning. _E_
Entrepreneurs are all unique. One way to build a business and turn it into a brand is to know who you are. Midas Touch _E_
.@dennisrodman looks like he really cleaned up his act. _E_
If you're going through hell keep going. Winston Churchill _E_
Hillary Clinton raked in money from regimes that horribly oppress women and gays & refuses to speak out against Radical Islam. _E_
To be successful never give up. My secrets to success will be shared at the National Achievers Congress in London. __HTTP__ _E_
Poll numbers way up making big progress! _E_
America's trade deficit with China is one of our greatest national security threats. Time for Fair Trade. We must produce our own products. _E_
My announcement is tomorrow! _E_
Sad to watch Bernie Sanders abandon his revolution. We welcome all voters who want to fix our rigged system and bring back our jobs. _E_
Trump rails on Romney as possible 2016 contender __HTTP__ via @nypost by @GeoffEarle _E_
The Mar a Lago Club the crown jewel of Palm Beach is a landmark in the National Register of Historic Places __HTTP__ _E_
Via @DailyCaller by @NeilMunroDC: "Trump Wants Ebola Travel Ban" __HTTP__ _E_
HAPPY BIRTHDAY to my son @EricTrump! Very proud of you! __HTTP__ __HTTP__ _E_
Listen to an interview with Donald Trump discussing his new book Think Like A Champion: __HTTP__ _E_
I as President want people coming into our Country who are going to help us become strong and great again people coming in through a system based on MERIT. No more Lotteries! #AMERICA FIRST _E_
If there is one more Ebola case in the U.S. a full travel ban will be instituted. This common sense move should have been done long ago! _E_
AMAZING @BarackObama has actually found a government program he can cut in half the Defense Department...bad (cont) __HTTP__ _E_
Iran's quest for nuclear weapons is a major threat to our nation's national security interests. We can't allow Iran to go nuclear. _E_
The Dallas event on September 14 at 6:00 P.M. at the American Airlines Center looks like it will be a giant success. Tickets are going FAST! _E_
Jeffrey Robinson's #TrumpTower has it all. The ultra rich powerful and beautiful. It's your summer must read. __HTTP__ _E_
The upcoming season of @CelebApprentice will be terrific a great cast. _E_
Have time to waste? Go to the ObamaCare website. _E_
With all of the jobs I am bringing back into the U.S. (even before taking office) with all of the new auto plants coming back into our..... _E_
Obama Spurns Trump Offer to Foot White House Tours __HTTP__ via @Newsmax_Media _E_
Thank you Florida! #SuperTuesday #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
I'll be on @foxandfriends on Monday at 7:30 AM... _E_
I will be interviewed on Face The Nation with @jdickerson this morning. Enjoy! _E_
Get out and vote Nebraska we will MAKE AMERICA GREAT AGAIN! _E_
The #CNBC 25 poll is a joke. I was in 9th place and taken off. (Politics?) No wonder @CNBC ratings are going down the tubes. _E_
Via @CBNNews by @TheBrodyFile: "Donald Trump To Brody File in 2011: People Send Me Bibles" __HTTP__ _E_
Our country is stagnant. We've lost jobs and business. We don't make things anymore b/c of the bill Hillary's husband signed and she blessed _E_
The two fake news polls released yesterday ABC & NBC while containing some very positive info were totally wrong in General E. Watch! _E_
Via @NRO by @LovelaceRyanD: Trump Slams Bush: 'I Don't See Him Winning I Don't See There's Any Way' __HTTP__ _E_
Obama's proposed budget has another middle class tax hike __HTTP__ Enjoy! _E_
Going to Ohio home of one of the worst presidential candidates in history Kasich. Can't debate loves #ObamaCare dummy! _E_
Just reported by CNN that the Trump halo effect caused a record shattering Democratic Debate rating of 15.3 million viewers. So true! _E_
United Nations Resolution is the single largest economic sanctions package ever on North Korea. Over one billion dollars in cost to N.K. _E_
Trump Will Make America GREAT!!!! #ChangeTheWorldIn5Words _E_
RT @atensnut: Hillary calls Trump's remarks horrific while she lives with and protects a Rapist . Her actions are horrific. _E_
Andy Roddick...a great tennis player is a fantastic guy with a wonderful wife. _E_
Thank you Michigan. We are going to bring back your jobs & together we will MAKE AMERICA GREAT AGAIN!Watch:... __HTTP__ _E_
The media is so after me on women Wow this is a tough business. Nobody has more respect for women than Donald Trump! _E_
.@garyplayer you were great on @MikeAndMike this morning—& the Gary Player Villa at @TrumpDoral is a hot ticket. _E_
Heading to Biloxi Mississippi. Massive crowds expected. Thank you for your support! #VoteTrump2016 __HTTP__ _E_
Obama's job approval is at 37% a record low. @GOP & @SpeakerBoehner have the leverage & momentum. Delay ObamaCare for all Americans! _E_
#MakeAmericaGreatAgain __HTTP__ _E_
.@foxandfriends interview re: North Korea firing @dennisrodman job report @MELANIATRUMP's debut & @WrestleMania __HTTP__ _E_
FORMAL ACCEPTANCE OF THE NOMINATION! #TrumpPence16 __HTTP__ _E_
.@Omarosa's emergency has put a new spin on Team Power's presentation—but it's not "show time" yet. #CelebApprentice _E_
Will be on @foxandfriends. Enjoy! _E_
.@CharlieRymerGC Charlie call me we'll set up a match with Gary and Damon. Doral now finished and doing great! _E_
I believe that in addition to the 5 terrorist leaders President Obama gave up for Bergdahl a great deal of CASH was also given. So stupid! _E_
I was relentless because more often than you would think sheer persistence is the difference between success and failure. NEVER GIVE UP! _E_
Great article on wind turbines by Robert Bryce in today's @NYPost __HTTP__ _E_
I loved the day Paul Goldberger got fired (or left) as N.Y.Times architecture critic and has since faded into irrelevance. Kamin next! _E_
I did interview with Chris Wallace of @FoxNews in order to be fair. He then puts on Rove Lane and Will three Trump bashers to discuss. _E_
Have you heard? China just told Obama to jump. Obama asked how high. _E_
Less than 1% of Obama's $4B immigration request will go towards immediate border security. A real scam. Enforce our laws now! _E_
Obama is addicted to spending America into insolvency. His record proves it. _E_
14 African nations have totally banned West Africans from entering their nations. Likewise many other nations. But the U.S. = COME ON IN _E_
George Will said best debate he ever saw . If you ever heard George Will speak(boring) anything is exciting. _E_
Victory press conference was over. Why is she allowed to grab me and shout questions? Can I press charges? __HTTP__ _E_
I will be on Bill O'Reilly's show tonight at 8 PM talking about Iran and politics. @oreillyfactor _E_
I look very much forward to meeting w/Paul Ryan & the GOP Party Leadership on Thurs in DC. Together we will beat the Dems at all levels! _E_
I love Bluffton SC what a great place what great people. _E_
Congrats to @JimmieJohnson a great guy on winning Daytona! _E_
Thank you Pennsylvania. This is a MOVEMENT like we have never seen before! #VoteTrumpPence16 on 11/8/16 together... __HTTP__ _E_
People like @KatyTurNBC report on my campaign but have zero access. They say what they want without any knowledge.True of so much of media! _E_
How does frumpy & little read @nytimes editorial writer Gail Collins keep her job? She is totally irrelevant! @nytimescollins _E_
Happy birthday to the great @leegreenwood83. You and your beautiful song have made such a difference. MAKE AMERICA GREAT AGAIN! _E_
I watched Mark Cuban on Jay Leno last night what a jerk! _E_
Michael Barbaro the author of the now discredited @nytimes hit piece on me with women has in past tweeted badly about me. He should resign _E_
ISIS made a big mistake with the beheading of the reporter. Even people against intervention want them blown into oblivion. LEADERSHIP! _E_
Twitter will soon be irrelevant if lowlifes are so easily able to hack into accounts. _E_
Wow sexual assaults in the military have gone through the roof far worse than anybody could have predicted! _E_
Bret had a target on his back from the get go... _E_
Trump Tower is located at 725 Fifth Avenue between 56th and 57th Streets... _E_
... Time for the Republicans to find someone new—and better. _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
I demand an apology from Hillary Clinton for the disgusting story she made up about me for purposes of the debate. There never was a video. _E_
Sacrificing our nation's bravest for ungrateful Iraqis = great for China. China is taking majority of the oil __HTTP__ _E_
My family and I just arrived in Scotland for the grand opening of Trump International Golf Links Scotland __HTTP__ _E_
To become a champion fight one more round. James J. Corbett long ago Heavyweight Champion _E_
Thanks to the historic TAX CUTS that I signed into law your paychecks are going way UP your taxes are going way DOWN and America is once again OPEN FOR BUSINESS! __HTTP__ _E_
The media refuses to talk about the three new national polls that have me in first place. Biggest crowds ever watch what happens! _E_
At 10:30 I will be interviewed on both @meetthepress by @chucktodd and @CBSNews Face The Nation by John Dickerson. This after long evening! _E_
.@tedcruz Conflicting Stances on Birthright Citizenship [14th Amendment] Gives #TeamTrump credit. __HTTP__ _E_
If U.C. Berkeley does not allow free speech and practices violence on innocent people with a different point of view NO FEDERAL FUNDS? _E_
More and more Americans seem fed up with both Parties I agree. _E_
Discussing #SyrianRefugees with @EricBolling on @FoxNews back on 10/3/2015. #ISIS __HTTP__ _E_
Alex Rodriguez should substantially reduce his salary from the Yankees in that he misrepresented his use of (cont) __HTTP__ _E_
RT @CLewandowski_: Gov Nikki Haley just became a liability for Rubio after this was published to social media! __HTTP__ _E_
Thank you Erie Pennsylvania! Together we will #MakeAmericaGreatAgain! __HTTP__ _E_
.@CharlesMBlow Why don't you use new polls instead of the single ancient national poll that was a tiny bit negative. Dishonest reporting! _E_
My ties & shirts at Macy's are doing great. Stupid @GoAngelo is making people aware of how good they are! _E_
I'm not saying to not give vaccines I am just saying give them small doses over a long period of time not one massive dose for a child. _E_
"You can have the most wonderful product in the world but if people don't know about it it's not worth much." The Art of the Deal _E_
Jamie Dimon just gave away $13B to government in settlement. Terrible move & bad precedent. Could have done much better by fighting. _E_
I love you North Carolina thank you for your amazing support! Get out and __HTTP__ tomorrow!Watch:... __HTTP__ _E_
New Bloomberg Poll: Trump Leads Big __HTTP__ _E_
It was a great honor to have spoken before the countries of the world at the United Nations.#USAatUNGA#UNGA __HTTP__ __HTTP__ _E_
Obama is giving Social Security & ObamaCare to illegals yet wants to cut military benefits __HTTP__ Disgrace! _E_
Entrepreneurs: Everything starts with you. Leadership is not a group effort if you're in charge then be in charge. _E_
US should have told Libya Rebels give us 50% of your oil for our military support. _E_
If Republicans don't Repeal and Replace the disastrous ObamaCare the repercussions will be far greater than any of them understand! _E_
China is now deploying drones across ocean routes used for trade __HTTP__ They stole the technology from us. _E_
If the people of our great country could only see how viciously and inaccurately my administration is covered by certain media! _E_
Exceptional dining matched with exceptional views @Trumpchicago offers a unique array of 5 star dining options __HTTP__ _E_
.@bobvanderplaats begged me to do an event while asking organizers for $100000 for himself—a bad guy! _E_
RT @DRUDGE_REPORT: WSJ: The Cold Clinton Reality... __HTTP__ _E_
Entrepreneurs: See yourself as victorious. This will focus you in the right direction. Put everything you've got into what you're doing. _E_
The dishonest media didn't mention that Bernie Sanders was very angry looking during Crooked's speech. He wishes he didn't make that deal! _E_
Based on @MegynKelly's conflict of interest and bias she should not be allowed to be a moderator of the next debate. _E_
It's Friday how many advertisers dropped @HuffPost today? _E_
Very excited for @LaraLeaYunaska and @EricTrump's wedding this weekend. _E_
#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Here are my thoughts on last night's episode of The Celebrity Apprentice... __HTTP__ _E_
Could be the hurricane helps @MittRomney people are rioting in the streets over gasoline _E_
Young entrepreneurs – never back down. Take the hits and get up. That's what makes a winner. _E_
My interview w/ @BloombergTV's Peter Cook re the Old Post Office Bldg becoming Trump Int'l Hotel Washington D.C. __HTTP__ _E_
Well that is it. Well done Megyn and they all lived happily ever after! Now let us all see how THE MOVEMENT does in Oregon tonight! _E_
.@BarackObama is practically begging @MittRomney to disavow the place of birth movement he is afraid of it and (cont) __HTTP__ _E_
A lot of people strongly advised me against doing @ApprenticeNBC. Next week we start filming the record 13th season Hence go with your gut! _E_
Sorry to hear @msnbc was dead last in the gutter in their Boston bombing coverage __HTTP__ @hardball_chris @Lawrence _E_
Hillary Clinton has announced that she is letting her husband out to campaign but HE'S DEMONSTRATED A PENCHANT FOR SEXISM so inappropriate! _E_
Via @PRNewswire: "Central Park Horse Show To Make Inaugural Debut in NYC Sept 18 21" __HTTP__ I am proud to be a sponsor! _E_
The same brilliant negotiators that gave up five Taliban leaders for one traitor are now making trade deals with China & others.No chance _E_
The debate was very interesting last night. There were numerous winners and Governor Romney did very well. _E_
"Some events will wipe out one person but will make another even more tenacious." – Think Like a Champion _E_
How come discredited reporter @mckaycoppins refused to write that the events in New Hampshire Buffalo and N.Y. were all record breakers! _E_
RT @TeamTrump: It's hard to fight terrorism when you're making cash payments to the world's LARGEST state sponsor of TERROR. Under Trump: N... _E_
Weak newscasters are asking is there a racial component to knockout attacks? Of course there is and weakness will only make it worse! _E_
A Rod should donate his contract to charity. He doesn't make the @yankees any money and he doesn't perform. He is a $30M/yr rip off. _E_
power from Washington D.C. and giving it back to you the American People. #InaugurationDay _E_
Great news from Ireland—Clare County Council turned down massive windfarm near my hotel & golf course in Doonbeg. __HTTP__ _E_
The only job @BarackObama cares about is his own. Everything he does is for his own reelection. _E_
We are getting ready to protect Saudi Arabia against Iran & others sending ships. How much are they going to pay us toward this protection. _E_
FLORIDA Visit __HTTP__ to find shelters road closures & evacuation routes. Helpful Twitter list: __HTTP__ __HTTP__ _E_
El Chapo and the Mexican drug cartels use the border unimpeded like it was a vacuum cleaner sucking drugs and death right into the U.S. _E_
.@lolojones given a raw deal in @nytimes story not fair. _E_
(1/2) Time Magazine has me on the cover this week. David Von Drehle has written one of the best stories I have ever had. _E_
The Wikileaks e mail release today was so bad to Sanders that it will make it impossible for him to support her unless he is a fraud! _E_
Democrats slam GOP healthcare proposal as Obamacare premiums & deductibles increase by over 100%. Remember keep your doctor keep your plan? _E_
... Supreme Court pick economic enthusiasm deregulation & so much more have driven the Trump base even closer together. Will never change! _E_
Join me tomorrow Nov. 3rd at 12pm in #TrumpTowerNY. I'll be signing copies of my new book CRIPPLED AMERICA. Don't miss it! _E_
Congrats to Obama & Democrats. CBO has just announced that ObamaCare missed its uninsured target by half & program costs extra $700B+. _E_
"Let other people talk. Any business conversation should be two sided." – Think Like a Billionaire _E_
"Trump could be great friend if 'Second Amendment' enthusiasm is real" __HTTP__ via @SFLuxe _E_
Apple's iPhone sales fell way short they must go to a larger screen as alternative fast (as I said long ago)! Samsung's size much better. _E_
Would be nice if @jmartNYT learned how to read the polls before writing his next story. Probably done on purpose but not good reporting! _E_
Via @GolfweekMag: "Trump reveals routing for second course in Scotland" __HTTP__ _E_
A great photo of @MittRomney and me __HTTP__ _E_
Republicans should have been much tougher on Obama. Just wait until you see what Obama does to Romney at the DNC! _E_
Will be on @Morning_Joe at 6:30 A.M. _E_
Christians need support in our country (and around the world) their religious liberty is at stake! Obama has been horrible I will be great _E_
Congratulations to @NHGOP & @AFPFNH for winning control of the State House & Executive Council while holding State Senate. Strong results! _E_
She'll say anything and change NOTHING! #MAGA #BigLeagueTruth __HTTP__ _E_
Rated Toronto's #1 hotel @TrumpTO has 261 guest rooms & suites furnished in elegant cosmopolitan style. __HTTP__ _E_
I am watching Crooked Hillary speak. Same old stuff our country needs change! _E_
Iran is toying with our president buying time and laughing at the stupidity of our leadership. Syria and now this! What's next? _E_
RT @foxandfriends: Chicago approves new plan to hide illegal immigrants from the feds plus give them access to city services __HTTP__ _E_
Located in Central Park the iconic @TrumpRink is NYC's top skating rink. VIP sessions are available for booking __HTTP__ _E_
A phony story that I am trying to buy a soccer team in Argentina is untrue. Never even heard of the team—no interest! __HTTP__ _E_
Hillary said with respect to ISIS we are finally where we need to be. Do we want 4 more years of incompetent leadership? MAGA! _E_
I call Jeb Bush the reluctant warrior he just doesn't want to be doing this he is not having fun! _E_
Socialists think profits are a vice I consider losses the real vice. Winston Churchill _E_
... Will be there front & center along with the 70 greatest players in the world. WGC @Cadillac Championship _E_
Statesman of the Year in Sarasota FL on Sunday night will be terrific a total sellout. _E_
.@foxandfriends we are in record territory in all things having to do with our economy! __HTTP__ _E_
No Question' Violent Crime Will Rise If Program (Stop & Frisk) Is Stopped" @NY_POLICE Commissioner Ray Kelly _E_
Heading to Youngstown Ohio now some great polls. #AmericaFirst __HTTP__ _E_
Why didn't Obama as part of the negotiation free the Christian Pastor Saeed Abedini? __HTTP__ _E_
Congratulations to the Republic of Korea on what will be a MAGNIFICENT Winter Olympics! What the South Korean people have built is truly an inspiration! __HTTP__ _E_
The 2nd Amendment is under siege. We need SCOTUS judges who will uphold the US Constitution. #Debate #BigLeagueTruth _E_
Watch me on the @hannityshow tonight at 9pm. More thoughts on Anthony Weiner in today's #trumpvlog... __HTTP__ _E_
In the last 24 hrs. we have raised over $13M from online donations and National Call Day and we're still going! Thank you America! #MAGA _E_
Dems don't want to talk ISIS b/c Hillary's foreign interventions unleashed ISIS & her refugee plans make it easier for them to come here. _E_
China loved Obama's climate change speech yesterday. They laughed! It hastens their takeover of us as the leading world economy. _E_
RECKLESS! @BarackObama has now increased the debt more than any other POTUS and the first 42 combined. __HTTP__ _E_
OPEC is ripping us off on oil. We are ripping ourselves off by investing in unproven green energy. #Solyndra _E_
For all of those who want to #MakeAmericaGreatAgain boycott @Macys. They are weak on border security & stopping illegal immigration. _E_
.@SkyscraperLive: Nick all of the folks at Trump International next door are wishing you well. We will block the strong winds! _E_
#VoteTrumpMS! #Trump2016 __HTTP__ _E_
RT @LouDobbs: Trump outlines new child care policy proposals via the @FoxNews App @realDonaldTrump seems a candidate of destiny __HTTP__ _E_
Ask yourself: What am I pretending not to see? There may be some great opportunities right around you. _E_
Negotiation tip #1: The worst thing you can possibly do in a deal is seem desperate to make it. @realDonaldTrump _E_
Photo of @Gretawire and me from yesterday's interview... __HTTP__ _E_
Happy Father's Day to all! I had a wonderful and loving father. __HTTP__ _E_
Thank you to the amazing law enforcement officers in Colorado!#MakeAmericaGreatAgain #LESM __HTTP__ _E_
Why doesn't the media want to report that on the two Big Thursdays when Crooked Hillary and I made our speeches Republican's won ratings _E_
Military has announced that China has successfully hacked our advanced weapon designs. China is our enemy.Should we offset this on our debt? _E_
Live from New York November 7th! @nbcsnl __HTTP__ _E_
Going now to make a major speech before some of the world's biggest investors in Dubai! _E_
Nothing on emails. Nothing on the corrupt Clinton Foundation. And nothing on #Benghazi. #Debates2016 #debatenight _E_
Hard to believe that the Democrats who have gone so far LEFT that they are no longer recognizable are fighting so hard for Sanctuary crime _E_
.@HillaryClinton : Bill "clarified" what he meant when calling Obamacare a "disaster." Actually "disaster" is pretty clear. #Debate _E_
I have no doubt that Mitt will do really well tonight. We'll all be watching @MittRomney. _E_
In a new poll a majority of people felt the president knowingly lied about health care pledge. Who are the fools who don't think he lied? _E_
Our amazing golf course @TrumpScotland __HTTP__ _E_
Trump: US Must Get Tougher Because China Is 'Eating Our Lunch' __HTTP__ via Moneynews @Newsmax_Media _E_
RT @BretEastonEllis: Just back from a dinner in West Hollywood: shocked the majority of the table was voting for Trump but they would never... _E_
With $250M of renovations Trump Int'l DC's 250 expansive guest rooms will be DC's top offering of amenities & views __HTTP__ _E_
Dishonest @politico just called to say that none of the polls including Fox NBC CNN Zogby & Morning Consult matter. Serious haters. _E_
.....you keep forgetting to mention the fragrance Success ! _E_
Identify your goals. Know precisely what you want to achieve study the best people in your field and then plan the best route for success. _E_
Actually I was very nice to Jimmy Carter during my standing room only (& standing ovation) speech for CPAC stated better Pres. than Obama! _E_
#ThankYouTour2016 Tonight Orlando Florida Tickets: __HTTP__ Mobile AlabamaTickets:... __HTTP__ _E_
THANK YOU ARIZONA! Get out and #VoteTrump on Tuesday! #AZPrimary #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
.@washingtonpost is going out of its way to tell failing candidates how to beat Donald Trump.The Post doesn't get that I'm good at winning! _E_
In Crooked Hillary's telepromter speech yesterday she made up things that I said or believe but have no basis in fact. Not honest! _E_
He @BarackObama wants record high gas prices drilling permits on federal land are declining under his regime __HTTP__ _E_
Tainted (no very dishonest?) FBI "agent's role in Clinton probe under review." Led Clinton Email probe. @foxandfriends Clinton money going to wife of another FBI agent in charge. _E_
With all of the Crooked Hillary Clinton's foreign policy experience she has made so many mistakes and I mean real monsters! No more HRC. _E_
A lot of people are concerned about which charity my $5M will be donated to. The onus is on Obama to first release his records. _E_
It's go time! See you at Trump Tower. I'm giving money away! #FundAnything _E_
Join me Thursday in Florida & Ohio!West Palm Beach FL at noon: __HTTP__ OH this 7:30pm: __HTTP__ _E_
Mexico has lost a brilliant finance minister and wonderful man who I know is highly respected by President Peña Nieto. _E_
What about the undocumented immigrant with a record who killed the beautiful young women (in front of her father) in San Fran. Get smart! _E_
I will be in Milwaukee Wisconsin tomorrow at 7pmE with @MELANIATRUMP. Join us! #WIPrimary #Trump2016 __HTTP__ _E_
America's competitors love @BarackObama. @MedvedevRussiaE says @BarackObama has been the best 3 years for Russia __HTTP__ _E_
Just as I predicted while Obama lifted sanctions 18 months ago Iran cheated & increased its nuclear fuel by 20%. We must DOUBLE sanctions! _E_
So many in the African American community are doing so badly poverty and crime way up employment and jobs way down: I will fix it promise _E_
Some exciting news the newest acquisition of Trump Golf Trump National Golf Club Charlotte NC formerly The (cont) __HTTP__ _E_
Via @ABC by @jonkarl & @JordynPhelps: Donald Trump Says Jeb Bush is the 'Last Thing We Need' __HTTP__ _E_
Our economy is at a standstill. Some are even predicting a possible double dip. We need to elect @MittRomney in November. _E_
Via @ArutzSheva_En by Moshe Cohen: "Donald Trump: French Gun Control Allowed Terrorists to Succeed" __HTTP__ _E_
Do not look for approval except for the consciousness of doing your best. Andrew Carnegie _E_
I never said that China was in the bad TPP trade deal but that China would come in the back door at a later date. @CNN @FoxBusiness _E_
Going to Columbus Ohio today for a tremendous rally of thousands. The silent majority is no longer silent! _E_
.@TrumpNewYork is the only Forbes 5 Star & 5 Diamond hotel with a 5 Star & Five Diamond restaurant in NYC __HTTP__ _E_
What took so long to catch only 1 of the Benghazi terrorists? Especially after the killer has been taunting the US in the press f/2 yrs. _E_
Thank you Wisconsin! My Administration will be focused on three very important words: jobs jobs jobs! Watch:... __HTTP__ _E_
As everybody knows but the haters & losers refuse to acknowledge I do not wear a "wig." My hair may not be perfect but it's mine. _E_
While @BetteMidler is an extremely unattractive woman I refuse to say that because I always insist on being politically correct. _E_
Re Negotiation: Think about what the other side wants. Know where they're coming from. View any conflict as an opportunity. Be flexible. _E_
Great speech on China by @PaulRyanVP yesterday where he explains why China is treating @BarackObama like a Doormat __HTTP__ _E_
The ObamaCare disaster will increase the amount of uninsured __HTTP__ What is the point of this Trillion $ monstrosity? _E_
Excited to see that @AnnDRomney has joined twitter. Melania and I are looking forward to hosting her next week (cont) __HTTP__ _E_
Via @digitalspyus: Donald Trump to Lord Sugar: 'Drop to your knees and thank me' __HTTP__ _E_
A person who never made a mistake never tried anything new. Albert Einstein _E_
Thank you @USNavy! #USA __HTTP__ _E_
Via @NRO: Donald Trump Eyes 2016 by @woodruffbets __HTTP__ _E_
.@AlexSalmond I hope you played well at Royal Aberdeen but u must admit the windmill hovering over hole 14 is disgusting & inappropriate. _E_
It's freezing outside where the hell is global warming ?? _E_
My Administration will follow two simple rules: __HTTP__ _E_
Great new poll thank you America!#Trump2016 #ImWithYou __HTTP__ _E_
Secy John Kerry has a tough job but he looks so totally lost negotiating w/ those characters who are cleaning his clock. Sad to watch... _E_
Via @qctimes by @EdTibbetts: "Trump: U.S. getting beat up" __HTTP__ _E_
It was great to have Governor @RicardoRossello of #PuertoRico with us at the @WhiteHouse today. We are with you! #PRStrong __HTTP__ _E_
Got to do something about these missing children grabbed by the perverts. Too many incidents fast trial death penalty. _E_
.@IvankaTrump looks like a movie star from the days of glamour and beauty. #CelebApprentice _E_
RT @foxnation: Grateful Syrians React To @realDonaldTrump Strike: 'I'll Name My Son Donald' __HTTP__ #SyrianStrikes _E_
My @CPACnews speech is scheduled Friday at 8:45AM in the Potomac Ballroom. Will also be telecast live on CSPAN & cable news networks. _E_
Do not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_
Via @BreitbartFeed why doesn't @BarackObama release his original book proposal which says he was born in Kenya? __HTTP__ _E_
Eric's Sept. 14th event will be held at Trump National Golf Club Westchester. __HTTP__ _E_
My @gretawire interview discussing my @MittRomney fundraiser in Trump Int'l Hotel Las Vegas and the state of the (cont) __HTTP__ _E_
.@Mediaite: Donald Trump Trashes @michellemalkin On Twitter:You're A 'Dummy' & 'Were Born Stupid' __HTTP__ @AndrewKirell _E_
Such amazing reporting on unmasking and the crooked scheme against us by @foxandfriends. Spied on before nomination. The real story. _E_
Someone just asked me who is my favorite Donald Trump impersonator? __HTTP__ _E_
.@foxandfriends We are not looking to fill all of those positions. Don't need many of them reduce size of government. @IngrahamAngle _E_
Then how come gasoline is hitting record high prices? _E_
The @Yankees must re negotiate @AROD's contract. He is not the same player without drugs. _E_
I am at Trump National Doral in Miami as the best golfers in the World start arriving for the World Golf Championship (Cadillac). A big week _E_
Just hit a million on Facebook __HTTP__ _E_
Obama Care stole more then $500M from Medicare. _E_
Thank you @tweetbypremier for selecting the Ocean View Suite @Trump_Ireland as one of your top 10 suites __HTTP__ _E_
Congratulations Treasury Secretary Steven Mnuchin! #ICYMI watch here: __HTTP__ __HTTP__ _E_
The @AmSpec article Shakedown Schneiderman about NY State lightweight @AGSchneiderman is amazing. __HTTP__ _E_
I hope everyone enjoyed Palm Sunday! _E_
With the exception of cheating Bernie out of the nom the Dems have always proven to be far more loyal to each other than the Republicans! _E_
The Democrats will make a deal with me on healthcare as soon as ObamaCare folds not long. Do not worry we are in very good shape! _E_
On stunning Aberdeenshire coastline @TrumpScotland features a classic Scottish link threaded through the dunes __HTTP__ _E_
Opportunity is missed by most people because it is dressed in overalls and looks like work. Thomas Edison _E_
We should not bail out any of the European countries or banks. _E_
RT @KellyannePolls: After a decent first debate @HillaryClinton is back to form: pedantic lawyerly technocratic (woefully untruthful) r... _E_
.@MittRomney's @RNC convention came in over $3M under budget. Barack's @DNC convention is over $10M in debt. What a surprise! _E_
Give yourself a chance make every day a discovery. _E_
My interview which recently aired on CNBC's Squawk Box __HTTP__ _E_
 _E_
One of the country's dumbest newspapers—The Palm Beach Post should be put to sleep. It's dying. @pbpost _E_
Such amazing people in India. This trip is very enlightening! _E_
Entrepreneurs: Take responsibility for yourself. It's a very empowering attitude. _E_
Karl Rove lost GOP both Houses of Congress and the White House gave us Obama. _E_
Go confidently in the direction of your dreams. Live the life you have imagined. Henry David Thoreau _E_
So many self righteous hypocrites. Watch their poll numbers and elections go down! _E_
Entrepreneurs: There's nothing wrong with bringing your talents to the surface. Having an ego and acknowledging it is a healthy choice. _E_
As I have long been saying South Africa is a total and very dangerous mess. Just watch the evening news (when not talking weather). _E_
Crooked Hillary called it totally wrong on BREXIT she went with Obama and now she is saying we need her to lead. She would be a disaster _E_
Signing my tax return.... __HTTP__ _E_
People buy deals & immediately put them into bankruptcy in order to make better deals... _E_
Join me today Nov 3rd in #TrumpTowerNYC at noon. I'll be signing copies of my new book CRIPPLED AMERICA. Don't miss it! _E_
Be a yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_
I won every debate so far according to all debate polls including @DRUDGE_REPORT @TIME @Slate and more. Too bad dopey @megynkelly lies! _E_
Don't blindly pursue a career that others suggest or insist is right for you. It may be worth taking a pay cut for a job you love. _E_
"Donald Trump: I've made up my mind on 2016" __HTTP__ via @msnbc by @janestreet _E_
Congratulations to America's new Secretary of @HHSGov Alex Azar! __HTTP__ _E_
Just as I predicted @BarackObama is preparing a possible attack on Iran right before November. __HTTP__ _E_
Con Ed has won its suit against the Ground Zero Mosque developers __HTTP__ The mosque is never going up. _E_
Obama's policies have led to food stamp rolls growing 75X faster than job production __HTTP__ We can't afford 4 more years. _E_
No surprise that all the foreign countries are celebrating Obama's win. They love a weak America that they can rip off. _E_
There are no short cuts to any place worth going. Beverly Sills _E_
All of my Cabinet nominee are looking good and doing a great job. I want them to be themselves and express their own thoughts not mine! _E_
My wife @MELANIATRUMP and my children will be featured on @FoxNews with @Greta 7pmE. Enjoy!#MeetTheTrumps #Trump2016 _E_
Obamacare premiums increasing 33% in Pennsylvania a complete disaster. It must be repealed and replaced!... __HTTP__ _E_
.@brandonhardest Love what you do and work hard. _E_
Just as I said last October census workers cooked the job numbers for Obama right before the election __HTTP__ _E_
And happy to welcome @ArsenioHall back as an advisor— he will have his own show and is doing great. #CelebApprentice _E_
Just did @OReillyFactor. Will be back on at 11pm on @FoxNews. _E_
Surprised @Eagles signed Michael Vick yesterday to be their 2013 QB. Vick is talented but brittle & probably won't last long. _E_
Just Introduced at #NCGOPcon as the country's highest paid speaker. Told the record crowd of 650 I am to be speaking here for free! _E_
On my way to Dayton Ohio. Will be there soon! _E_
Hillary's debate answer on delay: That is horrifying. That is not the way our democracy works. Been around for 240 years. We've had free _E_
HAPPY 241st BIRTHDAY to the @USArmy! THANK YOU! __HTTP__ _E_
My @gretawire interview on @FoxNewsInsider "Trump: 'Last Person I'd Want Negotiating for Me Is Obama'" __HTTP__ _E_
China continues to be on the move both technologically and militarily. Obama is sitting by and watching. _E_
Too bad I'll Have Another out of Belmont Stakes interest now way down. _E_
Is it true the DNC would not allow the FBI access to check server or other equipment after learning it was hacked? Can that be possible? _E_
Join me this Wednesday in Phoenix Arizona at 6pm! #ImWithYouTickets: __HTTP__ __HTTP__ _E_
'Trump Celebrates American Manufacturing Survey Showing Highest Level of Optimism in 20 Years' ... __HTTP__ _E_
We need a 21st century MERIT BASED immigration system. Chain migration and the visa lottery are outdated programs that hurt our economic and national security. __HTTP__ _E_
#TBT With Barbara Walters on my helicopter going somewhere. __HTTP__ _E_
It takes guts to win! _E_
I will be going to Puerto Rico on Tuesday with Melania. Will hopefully be able to stop at the U.S. Virgin Islands (people working hard). _E_
The new season of the Celebrity Apprentice is off to a great start last night it swept the 10 p.m. hour in every key demographic. _E_
You have to know when to call it quits and when to keep moving forward. Donald J. Trump __HTTP__ _E_
Lynne Ryan just read your great story in the NY Times I am proud of you. Thanks! __HTTP__ _E_
When it comes to violent crime and if we are going to solve the problem we must stop being so politically correct must tell it like it is! _E_
Via @PRNewswire: TRUMP HOTEL COLLECTION™ Announces Trump® International Hotel & Tower Baku __HTTP__ _E_
__HTTP__ Lights... Camera....You're Fired! All new @apprenticenbc tonight at 8PM ET on NBC! _E_
Frumpy and very dumb Gail Collins an editorial writer at The New York Times is so lucky to even have a job. Check her out incompetent! _E_
Vattenfall the promoter of the money losing wind farm plan in Aberdeen Scotland just took a loss of $4.6 billion after dumb European move _E_
You can't know it all yourself anyone who thinks that they do is destined for mediocrity." The Way To The Top _E_
740 Park Avenue is being robbed all over the place we come down hard on thieves at Trump buildings. _E_
We mourn the horrifying terrorist attack in NYC. All of America is praying and grieving for the families who lost their precious loved ones. __HTTP__ _E_
A great honor to easily finish FIRST in the @FoxNews poll tabulation even though some of my best polls were not used in determining winner! _E_
We will defend our people our nations and our civilization from all who dare to threaten our way of life...cont: __HTTP__ __HTTP__ _E_
I hate to say it but the Republican Convention was far more interesting (with a much more beautiful set) than the Democratic Convention! _E_
I am honored that @BarackObama has featured my plane in one of his attack ads. It was made in America! _E_
Now China 'calls in' US diplomats to lecture them on their illegal escapades. __HTTP__ The new reality. @BarackObama is weak. _E_
This month we celebrate the contributions of Asian Americans & Pacific Islanders that enrich our Nation. __HTTP__ _E_
Great advice from my father: Know everything you can about what you're doing. Fred C. Trump _E_
Happy New Year to all my Jewish friends celebrating the holiday. _E_
The Oil Companies collude with OPEC to keep oil artificially overvalued. They need to be reigned in. _E_
THANK YOU Youngstown Ohio! I love you! Get out & #VoteTrump tomorrow. #Trump2016 __HTTP__ _E_
The Phoenix V.A. it has just been reported is in worse shape than ever before. The wait is horrendous and people are dying. I will fix it _E_
Raffaele Sollecito was unfairly convicted. He didn't kill anyone. The Italian government should be ashamed. @Raffasolaries _E_
The Democrats want MASSIVE tax increases & soft crime producing borders.The Republicans want the biggest tax cut in history & the WALL! _E_
Leading in the Bloomberg Iowa poll. Also my favorability numbers went up at a record almost unheard of clip. Thank you Iowa! _E_
If your enemies end up liking you it's because they beat you. You want their respect not their friendship. _E_
I want to thank Steve Bannon for his service. He came to the campaign during my run against Crooked Hillary Clinton it was great! Thanks S _E_
The young intern who accidentally did a Retweet apologizes. _E_
Big protest march in Colorado on Friday afternoon! Don't let the bosses take your vote! _E_
Congress' greatest card against Obama is the power of the purse. Use it! _E_
It's not that I'm so smart it's just that I stay with problems longer. Albert Einstein _E_
Why gas prices will cost @BarackObama re election: pain at the pump not good for obama __HTTP__ _E_
I gave away money. Go to __HTTP__ to see how I'm helping people. #FundAnything #Entrepreneurs #GiveBack _E_
Mitt Romney is a mixed up man who doesn't have a clue. No wonder he lost! _E_
Watch me play both golf and baseball tonight on Donald J. Trump's Fabulous World of Golf 9PM ET on Golf Channel.. __HTTP__ _E_
Because of #FakeNews my people are not getting the credit they deserve for doing a great job. As seen here they are ALL doing a GREAT JOB! __HTTP__ _E_
It is time to take back our country and MAKE AMERICA GREAT AGAIN!#CaucusForTrump Video: __HTTP__ __HTTP__ _E_
Donald Trump explains celebrity feuds: 'I speak the truth' __HTTP__ via @DigitalSpyUS _E_
Via @WDesMoinesPatch by @DerekJ3031: "@ShawnJohnson on @ApprenticeNBC" __HTTP__ _E_
Watched Sean Hannity last night a great guy. _E_
He thinks that the wealth you create belongs to the government @BarackObama doesn't respect the fact that the (cont) __HTTP__ _E_
Looking for Father's Day gift? @Miamimagazine named the spa @TrumpDoral one of the best places for men to relax __HTTP__ _E_
RT @marcorubio: Good #AfghanStrategy & excellent speech by @POTUS laying it out to the nation. _E_
LIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_
.@BenSasse looks more like a gym rat than a U.S. Senator. How the hell did he ever get elected? @greta _E_
With respect to Iran we have all the cards they are scared stiff! I can't believe we aren't able to negotiate (cont) __HTTP__ _E_
Big meeting today with Republican leadership concerning Tax Cuts and Healthcare. We are all pushing hard must get it right! _E_
Obama administration had 4 years to prepare for the ObamaCare rollout. And of course they failed miserably. _E_
Convention speaker schedule to be released tomorrow. Let today be devoted to Crooked Hillary and the rigged system under which we live. _E_
The great Barbara Walters interviews Melania Trump and me on a Special Friday night at 10:00 on ABC.... __HTTP__ _E_
The President's speech was very combative toward Republicans—they have obviously not earned his respect! _E_
Only two weeks until we start shooting @CelebApprentice. We really have something amazing for the fans this year. _E_
The dealmaker is cunning secretive focused and never settles for less than he wants. The America We Deserve _E_
Entrepreneurs: Identify your goals know precisely what you want to achieve. Then study the best people in your field and learn from them. _E_
Thank you Christian Broadcasting Network @TheBrodyFile @CBNNews __HTTP__ _E_
Think BIG! You are going to be thinking anyway so you might as well think BIG! _E_
The Job on CBS the 15th copy of The Apprentice was just cancelled I love it! _E_
Lindsey Graham is all over T.V. much like failed 47% candidate Mitt Romney. These nasty angry jealous failures have ZERO credibility! _E_
Peyton Manning should have passed on 3rd down! _E_
Still a buyer's market. Residential home sales fall 7.1% in March. __HTTP__ Now is the time to buy property. _E_
Every on line poll Time Magazine Drudge etc. has me winning the debate. Thank you to Fox & Friends for so reporting! _E_
Is Hillary really protecting women? __HTTP__ _E_
My interview from yesterday with #Apprentice Andy on @AmericaNowRadio __HTTP__ _E_
.@antbaxter Dummythanks for increasing awareness of my big golf project in Aberdeen—sales are thru the roof & Aberdeen seeing big benefits. _E_
.@MittRomney looks much stronger and much more Presidential! _E_
I had a wonderful meeting with Likud Deputy Speaker of The Knesset @DannyDanon this past Friday in Trump Tower __HTTP__ #Israel _E_
I am watching the Democrats trying to defend the you can keep you doctor you can keep your plan & premiums will go down ObamaCare lie. _E_
It was just determined that the woman who passed out at Obama's press conference had just seen what her new premiums would be! _E_
Snowden is handing over to Russia a treasure trove of intel. Our politicians are incapable of dealing! _E_
You've got something unique to offer. Find out what it is. Ask yourself: What can I provide that does not yet exist? _E_
'Kept me out of jail': Top DOJ official involved in Clinton probe represented her campaign chairman: __HTTP__ _E_
Today is my birthday. My wish is for our country to be great and prosperous again. _E_
.@thehill Your story about me & the carbon tax is absolutely incorrect—it is just the opposite. I will not support or endorse a carbon tax! _E_
Before you vote think: Obama wants to raise taxes @MittRomney wants to lower taxes need I say more! _E_
THANK YOU to all of the incredible HEROES in Texas. America is with you! #TexasStrong __HTTP__ _E_
Congratulations to Connecticut's Erin Brady on being crowned the 2013 @MissUSA! America will be well represented in @MissUniverse! _E_
With millions of dollars of negative and phony ads against me by the establishment my numbers continue to go up. Can anyone explain this? _E_
Departing for Texas and Louisiana with @FLOTUS Melania right now @JBA_NAFW. We will see you soon. America is with you! __HTTP__ _E_
Only the Fake News Media and Trump enemies want me to stop using Social Media (110 million people). Only way for me to get the truth out! _E_
"Don't find fault find a remedy." Henry Ford _E_
Join me in Columbus Ohio tomorrow!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
President Obama said ISIL continues to shrink in an interview just hours before the horrible attack in Paris. He is just so bad! CHANGE. _E_
Paul Teutul Sr. is a fantastic guy. Although I fired him on #CelebApprentice we will remain great friends. I love the bike he made for me. _E_
Fox News PollThank you New Hampshire! #FITN#Trump2016 __HTTP__ _E_
.@foxandfriends Russia sent millions to Clinton Foundation _E_
In today's #trumpvlog I speak about Clint Eastwood the #DNC and Drew Peterson __HTTP__ _E_
...There is also something appropriate about keeping him in the home of the horrible crime he committed. Should move fast. DEATH PENALTY! _E_
Just as we won the Cold War in part by exposing the evils of communism and the virtues of free markets....Cont: __HTTP__ _E_
...the Ninth Circuit which has a terrible record of being overturned (close to 80%). They used to call this judge shopping! Messy system. _E_
So much for Washington shutting down Strasburg they deserved to lose. _E_
Surprising a future Nobel prize winner on today's @KatieShow: __HTTP__ _E_
"RUBIO'S GANG OF 8 BILL WOULD HAVE REWARDED SANCTUARY CITIES HARBORING ILLEGALS" __HTTP__ Marco is a politician he flip flops! _E_
Bloggers like McKay Coppins & @BuzzFeed are true garbage with no credibility. Record setting crowds & speech not reported. @PiersMorgan _E_
To get momentum you must first focus on a specific goal with passion and intensity. _E_
New Gravis national poll just out 36%! Very nice! #MakeAmericaGreatAgain _E_
Hillary Clinton said that it is O.K. to ban Muslims from Israel by building a WALL but not O.K. to do so in the U.S. We must be vigilant! _E_
Here's the solution on China: get tough. Slap a 25 percent tax on China's products if they don't set a real (cont) __HTTP__ _E_
Lightweight @AGSchneiderman is driving business out of NY so that he can get publicity for his failing political career. _E_
Less than a week after we leave Iraq the country is already unraveling. We got nothing from the Iraqis and now (cont) __HTTP__ _E_
My campaign for president is $35000000 under budget I have spent very little (and am in 1st place).Now I will spend big in Iowa/N.H./S.C. _E_
Our border is being breached daily by criminals. We must build a wall & deduct costs from Mexican foreign aid! __HTTP__ _E_
Mr. President you're entitled as the president to your own airplane and to your own house but not to your own facts. @MittRomney _E_
My @FoxNews interview with @TeamCavuto where I explain that we need to start using our own domestic energy resources. __HTTP__ _E_
.@CelebApprentice was #1 on network TV last night in its time slot and easily won the 10 o'clock hour in all major demographics. _E_
Now that Bush has wasted $120 million of special interest money on his failed campaign he says he would end super PACs. Sad! _E_
The great boxing promoter Don King just endorsed me. Nice! _E_
Via @BreitbartNews: "DONALD TRUMP: 'RICH PEOPLE DON'T LIKE ME'–POOR MIDDLE INCOME PEOPLE 'LIKE ME BEST'" __HTTP__ _E_
Wow great news! I hear @EWErickson of Red State was fired like a dog. If you read his tweets you'll understand why. Just doesn't have IT! _E_
I will be on @foxandfriends at 7:00 A.M. Will be talking about many things including The Apprentice! _E_
Despite the constant negative press covfefe _E_
George also appeared on Saturday Night Live when I was guest host in 2004. A great time! #CelebApprentice _E_
Melania and I offer our deepest condolences to the family of Otto Warmbier. Full statement: __HTTP__ __HTTP__ _E_
It is time to rebuild OUR country to bring back OUR jobs to restore OUR dreams & yes to put #AmericaFirst! TY O... __HTTP__ _E_
Keep focused on your goals. Practice positive thinking. View any conflict as an opportunity look at the solution not the problem. _E_
Watch my interview with Greta Van Susteren @Gretawire tonight at 10 p.m. on Fox News. _E_
Looking forward to keynoting @bobvanderplaats' @theFAMiLYLEADER Leadership Summit. Tickets selling out __HTTP__ _E_
.@OMAROSA as a cashier a big mistake by @BrandenRoderick. #CelebApprentice _E_
#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_
My @OraTV #Politicking interview w/@kingsthings on the govt. shutdown ObamaCare Putin 2016 and @TrumpDoral __HTTP__ _E_
Another great accolade for @TrumpGolf. Highly respected Golf Odyssey awarded @TrumpDoral Blue Monster with best redesign. Thank you! _E_
RT @foxandfriends: FOX NEWS ALERT: Jihadis using religious visa to enter US experts warn (via @FoxFriendsFirst) __HTTP__ _E_
My @SquawkCNBC interview discussing #TRUMPTUESDAYS high ratings @ToddAkin's statement & @MittRomney's policies __HTTP__ _E_
Set the bar high do the best you possibly can. Be focused disciplined and alert every single day. _E_
Just a few days until I keynote at @bobvanderplaats' @theFAMiLYLEADER Leadership Summit in Iowa __HTTP__ Very exciting _E_
At the request of many and even though I expect it to be a very boring two hours I will be covering the Democrat Debate live on twitter! _E_
'Clinton Campaign And Harry Reid Worked With New York Times To Smear State Dept Watchdog'Time to #DrainTheSwamp! __HTTP__ _E_
.@realDonaldTrump is going to cut taxes BIG LEAGUE Crooked is going to raise taxes BIG LEAGUE! #DrainTheSwamp... __HTTP__ _E_
My @FoxNews interview with @gretawire discussing the GOP primary my 2012 options and why @BarackObama must lose __HTTP__ _E_
The Democrats have a corrupt political machine pushing crooked Hillary Clinton. We have Paul Ryan always fighting the Republican nominee! _E_
I am happy to hear that Pres.Obama is considering giving Anna Wintour @voguemagazine an ambassadorship. She is a winner & really smart! _E_
Crooked @club4growth has given up advertising in Iowa on me—remember they wanted my million dollars—I said no—total frauds! _E_
Mark my words a gallon of gas will be $5 during the summer. OPEC is ripping us off. There's nobody in our (cont) __HTTP__ _E_
Crooked Hillary Clinton now blames everybody but herself refuses to say she was a terrible candidate. Hits Facebook & even Dems & DNC. _E_
Thank you Adam Levine The Federalist in interview on @foxandfriends "Donald Trump is the greatest President our Country has ever seen." _E_
RT @FoxNews: More than 1 million jobs added since @POTUS took office. __HTTP__ __HTTP__ _E_
Word is I am doing very well in Michigan and Mississippi! Wow and with all that money spent against me! Will be going to Trump Jupiter now! _E_
To have a government we can afford we need to eliminate the tremendous waste clogging the system #TimeToGetTough _E_
No taxes the only good thing about DC Debt Deal. _E_
"Success is getting what you want. Happiness is wanting what you get." Dale Carnegie _E_
RT @GovMikeHuckabee: Trump says the chaos in Chicago was a planned attack. But Hillary insists it was a spontaneous reaction to an internet... _E_
I wonder why @BarackObama is now spending $8B to postpone Obamacare's Medicare Cuts until after the election? __HTTP__ _E_
RT @DanScavino: On behalf of our next #POTUS & @TeamTrump #HappyNewYear AMERICA __HTTP__ __HTTP__ __HTTP__ _E_
Most politicians would have gone to a meeting like the one Don jr attended in order to get info on an opponent. That's politics! _E_
The election is trending towards @MittRomney. Americans know we can't afford another 4 years of the Obama economic decline. _E_
It's that time of the year. @TrumpRink in Central Park is now open best rink in the world. __HTTP__ A landmark. _E_
Our country is looking very bad right now! _E_
Our deficits are caused by runaway spending not inadequate taxing. Washington does not have a revenue problem. _E_
Wednesday's debate is day one of the election. Over 70 million voters will be watching. _E_
When will the US government finally classify China as a currency manipulator? China is robbing us blind and @BarackObama defends them. _E_
This is your land this is your home and it's your voice that matters the most. So speak up be heard and fight fight fight for the change you've been waiting for your entire life!MERRY CHRISTMAS and THANK YOU Pensacola Florida! __HTTP__ _E_
My son Donald did a good job last night. He was open transparent and innocent. This is the greatest Witch Hunt in political history. Sad! _E_
The $10 billion (net worth) is AFTER all debt and liabilities. So simple to understand but @CNN & @CNNPolitics is just plain dumb! _E_
To all struggling young entrepreneurs stay positive in this tough climate and keep looking for good deals. They are out there. _E_
...Brande was also smart in not bringing Omarosa to the boardroom. _E_
Happy to have passed 800000 followers. Looking forward to passing 1M sooner than later. _E_
Thank you Louisiana! Get out & vote for John Kennedy tomorrow. Electing Kennedy will help enact our agenda on behal... __HTTP__ _E_
"To be successful you must become very good at finding creative solutions to what appear to be impossible problems." – Think BIG _E_
On Greta 87% of the people said they would not watch the debate if I'm not in it. Wow what an honor! _E_
Many people think that WM23 @WrestleMania "the battle of the billionaires" was the greatest of all time—set all records _E_
The crowd in Ohio was amazing last night broke all records. We all had a great time in a great State. Will be back soon! _E_
On behalf of all Americans I want to wish Jewish families many blessings in the New Year. __HTTP__ __HTTP__ _E_
"Hard work is my personal method for financial success. You can do it too." Think Big _E_
I am the only one who can fix this. Very sad. Will not happen under my watch! #MakeAmericaGreatAgain __HTTP__ _E_
In anticipation of ObamaCare part time jobs are surging & full time jobs are falling and becoming scarce __HTTP__ _E_
No better place to celebrate New Year's Eve than @TrumpSoHo the most elite hotel in downtown NYC __HTTP__ _E_
RT @robertjeffress: Honored to pray for my friend @realDonaldTrump at tonight's Dallas rally. #TrumpDallas c: @DanScavino __HTTP__ _E_
Upstate New York is suffering with record unemployment. Fracking is the answer. Frack now and Frack fast! _E_
Wow FBI confirms report that James Comey drafted letter exonerating Crooked Hillary Clinton long before investigation was complete. Many.. _E_
Response to Hillary Clinton __HTTP__ _E_
Leaving for New York City and meetings on military purchases and trade. _E_
Is @karlrove incompetent? 400 million dollars down the drain and not 1 victory! _E_
#TrumpVlog Obama should be ashamed! __HTTP__ _E_
China and Saudi Arabia recently struck a deal which is the largest expansion by any oil company in the world (cont) __HTTP__ _E_
.@GoAngelo—the next time you have a rally @Macy's try getting 12 people instead of 11—it would be much more effective! _E_
Still a buyer's market. Home prices are dropping mortgages are low. Now is the time to take advantage for your gain. __HTTP__ _E_
Once again @BarackObama's speech at @AIPAC yesterday proved that he is more concerned about containing @Israel (cont) __HTTP__ _E_
Thank you to everybody for your wonderful comments on my debate performance it was a lot of fun! Today I will be speaking in Reno Nevada. _E_
If you love your work the difficulties will be balanced out by the enjoyment. Think Big _E_
The difference between @MittRomney and @BarackObama's campaign promises to @Israel is that Mitt will actually keep all of his. _E_
Thanks. __HTTP__ _E_
.@RuthMarcus of the @washingtonpost was terrible today on Face The Nation.No focus poor level of concentration but correct on Hillary lying _E_
I hope all workers demand that their @Teamsters reps endorse Donald J. Trump. Nobody knows jobs like I do! Don't let them sell you out! _E_
Congratulations to @DianeSawyer on her big ratings win for the evening news. Diane is a spectacular person. _E_
Mexico's totally corrupt gov't looks horrible with El Chapo's escape—totally corrupt. U.S. paid them $3 billion. _E_
Great bilateral meeting with Prime Minister Theresa May of the United Kingdom affirming the special relationship and our commitment to work together on key national security challenges and economic opportunities. #WEF18 __HTTP__ _E_
Sheldon Adelson is looking to give big dollars to Rubio because he feels he can mold him into his perfect little puppet. I agree! _E_
Just landing in Knoxville Tennessee! Massive crowd expected! Will all have a great time despite serious subject matter. _E_
Healthy young child goes to doctor gets pumped with massive shot of many vaccines doesn't feel good and changes AUTISM. Many such cases! _E_
If Ted Cruz is so opposed to gay marriage why did he accept money from people who espouse gay marriage? _E_
Cowards die many times before their actual deaths. Caesar _E_
Donald Trump reads Top Ten Financial Tips on Late Show with David Letterman: __HTTP__ Very funny! _E_
...While I fully agree it is not politically correct! __HTTP__ _E_
PM @David_Cameron should be run out of office for spending so much of England's money to subsidize windfarms in Scotland. _E_
Ebola is much easier to transmit than the CDC and government representatives are admitting. Spreading all over Africa and fast. Stop flights _E_
Please keep your thoughts & prayers with Melissa Young Miss Wisconsin 2005. __HTTP__ _E_
Get out and vote! I am your voice and I will fight for you! We will make America great again! __HTTP__ _E_
Notice that illegal immigrants will be given ObamaCare and free college tuition but nothing has been mentioned about our VETERANS #DemDebate _E_
I will be on @foxandfriends at 7:00 A.M. Enjoy! _E_
FACT on "red line" in Syria: HRC I wasn't there. Fact: line drawn in Aug '12. HRC Secy of State til Feb '13. __HTTP__ _E_
RT @foxandfriends: FOX NEWS ALERT: ISIS claims responsibility for hostage siege in Melbourne Australia that killed 1 person and injured 3... _E_
Obama's administration is now openly admitting it expects US credit downgraded again __HTTP__ Thanks for letting us know now _E_
I was invited by Caroline Wozniacki to sit with her family in her special box during her match at the U.S. Open yesterday. She's fantastic! _E_
Who do you like of the final two? #CelebApprentice __HTTP__ _E_
My @IngrahamAngle interview on the border crisis USMC Tahmooressi & my fight for the American flag __HTTP__ (15:00 mark) _E_
My friend @GovChristie called it @MittRomney recast the race. _E_
Try to develop a tempo when you're working momentum is something you have to work at to maintain & is an important element of success. _E_
.@TheBrodyFile: Trump's appeal to evangelicals is real #Trump2016 __HTTP__ __HTTP__ _E_
My bestselling book from last April Think Like a Champion is now available in paperback. It's inspiring entertaining and a great read. _E_
....Some of those they are harshly treating have been "milking" their country for years! _E_
Thank you for the incredible support Maryland! This is a movement!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Remember I was the one who said attack the oil (ISIS source of wealth) a long time ago. Everyone scoffed now they're attacking the oil. _E_
Be in Turnberry on Thurs AM for start of Women's British Open one of world's great golf tournaments. Back soon to #MakeAmericaGreatAgain! _E_
Tremendous day in Massachusetts and Maine. Thank you to everyone for making it so special! _E_
Thank you to @DailyTelegraph reviewer @NeilMidgley who stated 'You've Been Trumped' was so biased in favour of the protesters... _E_
Puerto Rico being hit hard by new monster Hurricane. Be careful our hearts are with you will be there to help! _E_
John Foust is a liberal who supports ObamaCare and opposes Ebola travel ban. Send Conservative @BarbaraComstock to Congress! _E_
As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_
Little Marco Rubio is just another Washington D.C. politician that is all talk and no action. #RobotRubio __HTTP__ _E_
Just received huge applause when I said Berghdal should be sent back to Afghanistan! @SRQRepublicans speech is sold out with record crowd. _E_
Report out that Obama Campaign paid $972000 to Fusion GPS. The firm also got $12400000 (really?) from DNC. Nobody knows who OK'd! _E_
Leadership is the capacity to translate vision into reality. Warren G. Bennis _E_
.@ABFalecbaldwin Alec it's not science it's a con read the e mails. _E_
Don't sell yourself short on something that is important. Today is just the beginning. Think Like a Champion _E_
I will represent our country well and fight for its interests! Fake News Media will never cover me accurately but who cares! We will #MAGA! _E_
I wonder if when Secy. Kerry goes to Iraq and Afghanistan he pushes hard for them to look at GLOBAL WARMING and study the carbon footprint? _E_
So many problems in the U.S. and leadership that is hopeless...and now on top of everything else we just hit $18 trillion in debt! _E_
In politics and sometimes in life FRIENDS COME AND GO BUT ENEMIES ACCUMULATE! _E_
Sunday night at 9 PM EST will be re run of last week's episode of Celebrity @ApprenticeNBC followed by new episode at 10 PM. _E_
.@serenawilliams we look forward to being with you a truly great champion tomorrow at Trump National D.C. for the Tennis Center dedication _E_
Our infrastructure plan has been put forward and has received great reviews by everyone except of course the Democrats. After many years we have taken care of our Military now we have to fix our roads bridges tunnels airports and more. Bipartisan make deal Dems? _E_
Texas is healing fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow! _E_
.@danielhalper Great job on @CNN today. Very wise indeed! _E_
Leaving for Albany New York now massive crowd expected. Very exciting! _E_
Don't give up Republican Senators the World is watching: Repeal & Replace...and go to 51 votes (nuke option) get Cross State Lines & more. _E_
the American people. I have no doubt that we will together MAKE AMERICA GREAT AGAIN! _E_
Join @ericbolling to get @vanessariddle to 100k followers. Beautiful girl with stage 4 cancer. __HTTP__ _E_
New polls join the MOVEMENT today. __HTTP__ #ImWithYou __HTTP__ _E_
Thank you Florida a MOVEMENT that has never been seen before and will never be seen again. Lets get out &... __HTTP__ _E_
...9 months than this Administration. Over 50 Legislation approvals massive regulation cuts energy freedom pipelines border security.... _E_
Numerous patriots will be coming to Bedminster today as I continue to fill out the various positions necessary to MAKE AMERICA GREAT AGAIN! _E_
RESPONSE TO THE LIES OF SENATOR CRUZ: __HTTP__ #VoteTrumpSC _E_
Should have settled ... Ft Lauderdale plaintiffs must pay me close to $400k in legal fees after Trump trial victory. _E_
The border is wide open for cartels & terrorists. Secure our border now. Build a massive wall & deduct the costs from Mexican foreign aid! _E_
Going to a Cabinet Meeting (tele conference) at 11:00 A.M. on #Harvey. Even experts have said they've never seen one like this! _E_
If we are going to continue to be stupid and go into Syria (watch Russia) as they say in the movies SHOOT FIRST AND TALK LATER! _E_
My official #MakeAmericaGreatAgain hat is now available online. To shop please visit __HTTP__ it is selling fast! _E_
America is going to build again. Under budget and ahead of schedule. Time to put #AmericaFirst! #InfrastructureWeek... __HTTP__ _E_
Heed the advice of @FLGovScott! If you're in an evacuation zone you need to get to a shelter...there's not many hours left. Gov. Scott __HTTP__ _E_
Via World Tribune The elites' problem with Donald Trump: He's not for sale by Jeffrey T. Kuhner __HTTP__ _E_
WOW! Thank you Massachusetts! See you soon. #VoteTrumpMA __HTTP__ _E_
Thank you Mark. #GOPDebate __HTTP__ _E_
The U.S. must immediately stop all flights from EBOLA infected countries or the plague will start and spread inside our borders. Act fast! _E_
Congrats to Pres.Obama on having 3 of @washingtonpost's "biggest Pinocchios of the year" __HTTP__ Great accomplishment! _E_
Credible Source on 9 11 Muslim Celebrations: FBI __HTTP__ via @WKRG _E_
When Strasburg leaves in a couple of years under free agency Washington will say what were we doing . _E_
.@chucktodd is a nice guy but just hopeless. He knows so little about politics and in particular winning! I fixed his rating problem. _E_
Sad only 36% think America's best days are ahead while 49% believe they are in the past __HTTP__ We can & must do better. _E_
Watch my interview with Greta Van Susteren @gretawire tonight on Fox News at 10 p.m. _E_
MERRY CHRISTMAS!!! __HTTP__ _E_
Donald Trump: Anna Wintour Ambassadorship Would Be 'A Favor To The Country' __HTTP__ via @mediaite _E_
This is how it starts. Obama is now threatening to use an Executive Order for gun control __HTTP__ Welcome to his 2nd term. _E_
Will be leaving for Missouri soon for a speech on tax cuts and tax reform so badly needed! _E_
I'm giving away money go to __HTTP__ . Take it from me! Proud of the #FundAnything team. _E_
"Obstacles are those frightful things you see when you take your eyes off your goal." Henry Ford _E_
What They Are Saying About @realDonaldTrump's GREAT Debate and @HillaryClinton's Bad Performance... __HTTP__ _E_
Congrats to fantastic All Star @ApprenticeNBC celebrity & illusionist @pennjillette on being honored at 2013 Hollywood Walk of Fame! _E_
Someone should inform @CNN that despite spending millions of $'s on graphics it is not the Democratic Debate rather the Democrat (s) D! _E_
I will be on @cbs @60minutes this Sunday. A great honor hope you enjoy it. _E_
The Republican Party must get tougher and smarter and fast or it will go down to a very big defeat just like the last two times! _E_
.@Neilyoung's song "Rockin' In The Free World" was just one of 10 songs used as background music. Didn't love it anyway. _E_
Must read quote by @EricTrump in @CNNMoney article "Builders race to develop sky high condo buildings" __HTTP__ _E_
Swisher should have caught ball in right field last night. _E_
Thank you Iowa! Great night see you soon! #Trump2016 __HTTP__ _E_
It's snowing & freezing in NYC. What the hell ever happened to global warming? _E_
Press Conference at Glasgow Prestwick Airport this Friday Nov. 14 at 11 AM with Donald J. Trump & Mr. Iain Cochrane __HTTP__ _E_
What a foolish statement by @davidaxelrod he said that a @marcorubio VP pick would 'insult' Hispanics __HTTP__ _E_
Obama through his cronies said the Keysyone pipeline was not political how much can one man lie about even the most obvious things? _E_
Thank you for your support Greensboro North Carolina. Next stop Charlotte! #MAGA __HTTP__ __HTTP__ _E_
Another must read from Jeffrey Lord @amspec: "Rove Email Leaks: Ideological War Opens in GOP" __HTTP__ _E_
.@DennisRodman is always hard to miss especially when dressed in silver finery. But not sure about the silver lipstick. #CelebApprentice _E_
Tonight is the Apprentice finale and it's a fantastic episode in every way with the great Liza Minnelli performing and a new Apprentice! _E_
Crooked Hillary just took a major ad of me playing golf at Turnberry. Shows me hitting shot but I never did = lie! Was there to support son _E_
The economy won't fully recover until @ObamaCare is fully repealed. It is a job killer! _E_
There is. __HTTP__ _E_
Fact – all the countries complaining about us spying on them spy on us. They just don't get caught stupid! _E_
I will take care of the Veterans who have served this country so bravely.#ThankAVet Video: __HTTP__ __HTTP__ _E_
China's military buildup is a major threat to the Free World. We must remain resolute and maintain our national defense at all costs. _E_
They say that if I participated in last night's Fox debate they would have had 12 million more & would have broken the all time record. _E_
I would not sign Graham Cassidy if it did not include coverage of pre existing conditions. It does! A great Bill. Repeal & Replace. _E_
.@NicolleDWallace is really hurting @TheView. She is boring predictable and has zero television it show no longer has ratings dying! _E_
I want to applaud the many protestors in Boston who are speaking out against bigotry and hate. Our country will soon come together as one! _E_
I am allowing Japan & South Korea to buy a substantially increased amount of highly sophisticated military equipment from the United States. _E_
You have to scratch your head when the president spends the last week talking about saving Big Bird. @MittRomney _E_
It was just announced that I will be hosting Saturday Night Live on Nov. 7th look forward to it! __HTTP__ _E_
In addition to winning the Electoral College in a landslide I won the popular vote if you deduct the millions of people who voted illegally _E_
.@JebBush has spent $63000000 and is at the bottom of the polls. I have spent almost nothing and am at the top. WIN! @hughhewitt _E_
Looking forward to Sunday's speech in the ExCel Centre. __HTTP__ _E_
Reckless @BarackObama is projecting $1.2T deficit from 2012 budget & a projected $25.4T debt in a decade __HTTP__ _E_
It is terrible that @BarackObama did not appoint an independent counsel to investigate the national security leaks. No accountability. _E_
"Pay attention to the small numbers in your finances such as percentages and cents... _E_
Thank you to @LOUDOBBS for giving the first six months of the Trump Administration an A+. S.C.reg cuttingStock M jobsborder etc. = TRUE! _E_
Sometimes when you innovate you make mistakes. It is best to admit them quickly and get on with other innovations. Steve Jobs _E_
Smart move by @BarackObama having Pres. Bill Clinton deliver the @DNC convention keynote. _E_
LAST thing the Make America Great Again Agenda needs is a Liberal Democrat in Senate where we have so little margin for victory already. The Pelosi/Schumer Puppet Jones would vote against us 100% of the time. He's bad on Crime Life Border Vets Guns & Military. VOTE ROY MOORE! _E_
My appearance on @foxandfriends from today.... __HTTP__ _E_
Reuters polling just out thank you!#MakeAmericaGreatAgain __HTTP__ _E_
.@cher I don't wear a "rug"—it's mine. And I promise not to talk about your massive plastic surgeries that didn't work. _E_
#TBT It is great being part of Home Alone 2 a holiday staple. __HTTP__ _E_
American professors were in Tehran for an Occupy Wall Street Conference __HTTP__ @BarackObama's diplomatic initiative?!?! _E_
The water damage to NYC is amazing. The winds were bad but the water was worse. _E_
.@mercedesschlapp thank you so much for your kind words on television fantastic job and greatly appreciated! _E_
Entrepreneurs: Set the bar high and resolve to be bigger than your problems. Who's the boss? _E_
He is out of real solutions @BarackObama's job bill is nothing more than a tax increase. _E_
Do you notice we are not having a gun debate right now? That's because they used knives and a truck! _E_
"Golf is deceptively simple and endlessly complicated. It satisfies the soul and frustrates the intellect. (cont) __HTTP__ _E_
The U.S. will invite El Chapo the Mexican drug lord who just escaped prison to become a U.S. citizen because our leaders can't say no! _E_
If I'd started in business thinking I knew everything I'd have been sunk before I got started. Think Like a Champion _E_
New Government data by the Center for Immigration Studies shows more than 3M new legal & illegal immigrants settled.. __HTTP__ _E_
Sean's interview with Bob Woodward on @hannityshow was very interesting Woodward was great. __HTTP__ _E_
In the upcoming New Year we will focus like never before if we do that we will have complete and total VICTORY in all we do! _E_
I loved beating John Kasich in the debates but it was easy—he came in dead last! _E_
Check it out 2nd video on Lying Crooked Hillary is now online! Watch it here: __HTTP__ #CrookedHillary #Trump2016 _E_
Entrepreneurs: Keep your focus and keep your momentum. Listen apply and move forward. Set the standard! _E_
Was with @jacknicklaus yesterday great golfer great architect great guy! _E_
Thank you Pittsburgh Pennsylvania!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
The Republicans better be careful. Obama is out to destroy them! _E_
Television ratings for @nbcsnl Saturday Night Live just came out and they were great the best since 2011. Very few protesters! _E_
Failure is simply the opportunity to begin again this time more intelligently. Henry Ford _E_
Selective memory @BarackObama says that he forgets the recession __HTTP__ Maybe that's why he is forgetting to create jobs. _E_
Do we really need another Bush in the White House we have had enough of them. __HTTP__ _E_
It's Tuesday. I wonder how much money @HuffPost lost today great purchase AOL _E_
One of Obama's greatest failures will be his legacy of making millions completely dependent on government handouts not work. _E_
Am leaving now for Florida to see our GREAT first responders and to thank the U.S. Coast Guard FEMA etc. A real disaster much work to do! _E_
Stop the EBOLA patients from entering the U.S. Treat them at the highest level over there. THE UNITED STATES HAS ENOUGH PROBLEMS! _E_
The new Pope is a humble man very much like me which probably explains why I like him so much! _E_
China is our enemy. It's time we start acting like it and if we do our job correctly China will gain a whole (cont) __HTTP__ _E_
After two days of very productive talks Prime Minister Abe is heading back to Japan. L _E_
Mitt Romney must start congratulating the Navy Seals and military on Bin Laden's killing not the President. _E_
We should not allow @Chrysler to move @Jeep jobs to China after they said they wouldn't stay tuned! _E_
"Score one for the Donald in his battle with @AGSchneiderman." __HTTP__ _E_
I still say Te'o did this in order to get sympathy for the Heisman vote—thankfully he did not win. _E_
Just returned from Pensacola Florida where the crowd was incredible. _E_
They should have allowed applause during the TRIBUTE to the departed Really bad production. Bette Midler sucked! #Oscars _E_
America will THINK BIG once again. We will inspire millions of children to carry on the proud tradition of American... __HTTP__ _E_
On Monday ObamaCare kicks in with all goodies of 300% increased premiums higher taxes and part time replacement employees. _E_
Hillary Clinton should ask why the Democrat pols in Atlantic City made all the wrong moves Convention Center Airport and destroyed City _E_
.@VenueMagazine_ highlights the opening of @TrumpDoral's brand new #RedTiger course: __HTTP__ _E_
Great new ad from @MittRomney titled Nothing's Free __HTTP__ detailing both the high costs and taxes of ObamaCare. _E_
Good news out of the House with the passing of 'No Sanctuary for Criminals Act.' Hopefully Senate will follow. _E_
Sadly I'm probably helping @billmaher's lowly rated show—but charity will benefit by $5 million so it's worth it. _E_
#sweepstweet @DonaldJTrumpJr and @EricTrump have the eyes and ears for total surveillance I wonder where they got that from? _E_
RT @DRUDGE_REPORT: LIMBAUGH: By not showing he's owning entire event... __HTTP__ _E_
The #MarchForLife is so important. To all of you marching you have my full support! _E_
Just watched @Patriots Bill Belichick's news conference. He did a great job—smart concise truthful! _E_
...I trounced him in ratings & Letterman beat @jayleno last Thursday. Brian—are you irrelevant? _E_
Sugar: @Lord_Sugar—unlike you I own The Apprentice. You were never successful enough... _E_
Entrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_
Aberdeenshire coast is spectacular. Its historic value & wildlife will be tarnished if these wind turbines are built but they won't be! _E_
I will be interviewed on @foxandfriends tomorrow at 7am. Enjoy! _E_
"If you have a crisis whether on a ship or wherever there are heroes who rise above it." Jerry Bruckheimer _E_
Thank you Montana! #Trump2016 __HTTP__ __HTTP__ _E_
.@TrumpChicago's The Spa offers 5 star services w/ 12 treatment rooms & 53 spa guestrooms overlooking the skyline __HTTP__ _E_
Marco Rubio will not win. Weak on illegal immigration strong on amnesty and has the appearance to killers of the world as a lightweight . _E_
"Compete with yourself to be the best you can be." – Think Like a Champion _E_
Another new Iowa poll just released. Thank you! #IACaucus #FITN __HTTP__ _E_
Jeb Bush has zero communication skills so he spent a fortune of special interest money on a Super Bowl ad. He is a weak candidate! _E_
.@CGasparino Good seeing you. Keep up the great work never stop! _E_
Rodolfo Rosas Moya and his pals in Mexico owe me a lot of money. Disgusting & slow Mexico court system. Mexico is not a U.S. friend. _E_
The results are in. I killed Wolf Blitzer in our debate. I like Wolf but he went for an ambush! #wolfblitzercnn _E_
More than $500 million designated for Iraqi Army disappeared. Where is it? Our sad sad country what have we come to? _E_
A gallon of gas is $3.523 today and has never before risen so high early in the year __HTTP__ The @BarackObama policy realized! _E_
As China is built on corporate espionage currency manipulation & cheap labor its economy is a ticking time bomb __HTTP__ _E_
Dopey Sugar @Lord_Sugar I never go silent. I was buying a major property in Florida a property worth more than you are! _E_
Great article in the @NewYorkPost by Ben Garrett Don't Blame Sandy on Global Warming __HTTP__ _E_
With @IvankaTrump and @EricTrump at the opening of the @GaryPlayer Villa at @TrumpDoral __HTTP__ _E_
With the two wacko perverts Spitzer and Weiner NYC politics has become a joke all over the world. _E_
James Holmes the Aurora Colorado guy who killed 12 people & injured 58 others is fighting hard to avoid the death penalty... _E_
America is mired in the longest job recession since the Great Depression. @MittRomney can get us out of it. (cont) __HTTP__ _E_
Looking forward to a great weekend in Iowa! #IACaucus #CaucusForTrump Tickets: __HTTP__ __HTTP__ _E_
Via @HuffPostPol by @_under_current: "Donald Trump Will End Outsourcing If President" __HTTP__ _E_
Just gave a speech to the great men and women at Yokota Air Base in Tokyo Japan. Leaving to see Prime Minister Abe. __HTTP__ _E_
Thank you to @GaryVanSickle & Sports Illustrated @SInow for the really nice piece about me. March 17 2014 issue __HTTP__ _E_
Will be at venue in wonderful South Carolina very soon. Big traffic back up tremendous crowd! Will be wild. _E_
"A true business only exists to solve a problem and to make life better." – Midas Touch _E_
Amazing Obama speaks market goes DOWN Trump tells CNBC he's buying stock market goes UP should not be that way! _E_
Join me in Florida tomorrow! #MakeAmericaGreatAgain Daytona | 3pm __HTTP__ | 7pm __HTTP__ _E_
My book with @theRealKiyosaki Midas Touch is divided into five sections. The first is the thumb __HTTP__ _E_
The @TODAYshow refused to use their just in poll numbers where I have a massive lead but instead used @CNN numbers where my lead is smaller. _E_
We need a President who understands the economy @gallupnews has US unemployment at 8.2% in July up from 8% in June __HTTP__ _E_
People are going crazy with my comments on Diet Coke (soda). Let's face it this stuff just doesn't work. It makes you hungry. _E_
#ICYMI: I agree To all Americans I see you & I hear you. I am your voice. Vote to #DrainTheSwamp with me on 11/8.... __HTTP__ _E_
We need a balanced budget Amendment because Congress has no fiscal discipline. _E_
Our national debt has grown by 30% and a gallon of gas has doubled so far under @BarackObama. He is a disaster. _E_
.@EricTrumpFdn continues to do important work for @StJude Children's Research Hospital. I am very proud of @EricTrump's philanthropy. _E_
I will be on CNN's State of the Union tomorrow morning at 9amE. __HTTP__ __HTTP__ _E_
I have been asking Director Comey & others from the beginning of my administration to find the LEAKERS in the intelligence community..... _E_
The Trump Administration has terminated more UNNECESSARY Regulation in just twelve months than any other Administration has terminated during their full term in office no matter what the length. The good news is THERE IS MUCH MORE TO COME! _E_
Windmills are a bigger safety hazard than either coal or oil __HTTP__ A 34% higher mortality rate than coal alone. Outrageous! _E_
Is Jon Stewart a racist? See video __HTTP__ @thedailyshow _E_
While @BarackObama criticizes the GOP budget his own party graded him with an F by voting down his budget in the House 414 0. _E_
I will be doing Fox & Friends in 10 minutes at 7.00. Many things to talk about! ENJOY _E_
Press conference after CPAC speech this morning was excellent lots of very professional reporters. _E_
Almost no news organizations are showing the satirical pictures. Gee I wonder why? The media is usually so brave! _E_
I don't know what it is but I'm getting totally bored watching NFL football. Too many penalties and far too soft! T.V. off and back to work _E_
The habitual vacationer: @BarackObama has campaigned on our dime more than any previous president in history... (cont) __HTTP__ _E_
President Obama looks absolutely exhausted in the Netherlands. He is not a natural leader was never ment to lead it is tough work for him _E_
Remember Univision apologized! _E_
I hope everyone is having a great Christmas then tomorrow it's back to work in order to Make America Great Again (which is happening faster than anyone anticipated)! _E_
Hitting the first ball at Trump International Dubai 272 right down the middle. __HTTP__ _E_
Today our entire nation pauses to REMEMBER PEARL HARBOR—and the brave warriors who on that day stood tall and fought for America. God Bless our HEROES who wear the uniform and God Bless the United States of America. #PearlHarborRemembranceDay __HTTP__ _E_
.@TrumpScotland provides luxury accommodations & a championship Par 72 7400 yd. course. Book your tee time now __HTTP__ _E_
As of September 30th we have a record trade deficit with China of over $217Billion. They are ripping us off. #TimeToGetTough _E_
"One of the keys to thinking big is total focus." – The Art of The Deal _E_
I gave millions of dollars to DJT Foundation raised or recieved millions more ALL of which is given to charity and media won't report! _E_
I'm on the David Letterman @LateShow tonight looking forward to it. 11:35 PM on CBS. _E_
Trump Miss Universe simulcast on @nbc and @Telemundo on December 19th will once again deliver an entertaining and 'beautiful' show! _E_
We were led to believe that Jeep would manufacture in U.S. and sell to China—like China does to us. _E_
Why is the GOP establishment so threatened by the Newsmax @iontv debate? More debate is always better. _E_
Scary Americans private wealth fell 40% from 2007 2010 __HTTP__ But @BarackObama thinks the private economy is doing fine. _E_
Hurricane Irma is of epic proportion perhaps bigger than we have ever seen. Be safe and get out of its wayif possible. Federal G is ready! _E_
A wonderfully written article concerning Israel by @JasonDovEsq __HTTP__ _E_
Sen. Jeff Flake(y) who is unelectable in the Great State of Arizona (quit race anemic polls) was caught (purposely) on "mike" saying bad things about your favorite President. He'll be a NO on tax cuts because his political career anyway is "toast." _E_
.@club4growth asked me for $1 million. I said no. Now falsely advertising that I will raise taxes. I'll lower big league for middle class. _E_
China must be worried that @MittRomney will win this November. They have never had such a pushover like @BarackObama. _E_
I played football and baseball sorry but said to be the best bball player in N.Y. State ask coach Ted Dobias said best he ever coached. _E_
Teachers in Chicago should go back to work immediately.Rahm Emanuel has offered them a fair deal. Now they're just acting for the cameras. _E_
The Manufacturing Index rose to 59% the highest level since early 2011 and we can do much better! _E_
Our great country has been divided for decades. Sometimes you need protest in order to heel & we will heel & be stronger than ever before! _E_
Donald Trump to Chris Christie: Don't hire @stuartpstevens __HTTP__ via @politico by @Hadas_Gold _E_
A certain whack job Go Angelo who doesn't have a life spends his time hopelessly attacking me re: Macy's.... _E_
Best thing my supporters can do if you don't like the way @megynkelly and her puppets unfairly treat us is don't watch her show! _E_
.@transition2017 update and policy plans for the first 100 days. __HTTP__ _E_
On Mike and Mike @espn in two minutes! _E_
It's late in July and it is really cold outside in New York. Where the hell is GLOBAL WARMING??? We need some fast! It's now CLIMATE CHANGE _E_
Thank you California! Will see you soon! #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
We must remember this truth: No matter our color creed religion or political party we are ALL AMERICANS FIRST. __HTTP__ _E_
#TrumpVlog #TheInterviewMovie A sad day for freedom of speech __HTTP__ _E_
The Chinese are the biggest beneficiary of this post Saddam oil boom in Iraq __HTTP__ _E_
Everyone is asking me to speak more on Robert & Kristen.I don't have time except to say Robert drop her she cheated on you & will again! _E_
Why are we fighting for the rebels that hate us only to save face for Obama! _E_
Real unemployment is 20%. We must simplify the tax code and start making our own products again to bring our jobs back from overseas. _E_
MAKE AMERICA GREAT AGAIN! __HTTP__ _E_
??? @BarackObama held a raffle with donors for a lunch in the White House. The winners were conveniently all (cont) __HTTP__ _E_
Congratulations to @JasonDufner on winning the PGA championship. Great job! _E_
THe people at shouldtrumprun.com have got it right! How are our factories supposed to compete with China and other countries... _E_
Achievers go for the challenge so the next deal is what they're thinking about. They have an obligation to best themselves. _E_
Just stated by a total pro: You are the only one who has the guts to say what we are all thinking. _E_
.@David_Cameron As Prime Minister why are you spending vast amounts of money to subsidize ugly wind turbines in Scotland that nobody wants? _E_
Gallup finds Des Moines Iowa has the highest community pride (76.5) of any large city. Congrats and I agree I love the place! @DesMoines _E_
Iraq buying $200000000 worth of weapons from Iran. Despite so many killed and trillions spent Iraq dumps U.S. I TOLD YOU SO LONG AGO! _E_
Watch @IvankaTrump's Ready To Wear Fashion Show at @LordandTaylor featuring @TrumpModels and @MissUSA..... __HTTP__ _E_
Never ever quit never give up Donald J. Trump The Art of the Deal. _E_
Just heard Fake News CNN is doing polls again despite the fact that their election polls were a WAY OFF disaster. Much higher ratings at Fox _E_
RT @DonaldJTrumpJr: FINAL PUSH! Eric and I doing dozens of radio interviews. We can win this thing! GET OUT AND VOTE! #MAGA #ElectionDay ht... _E_
It does matter! __HTTP__ _E_
Last night's All Star @ApprenticeNBC once again showed why the ultimate onus lies with the project manager. The buck stops there. _E_
He thinks that the wealth you create belongs to the gov't. @BarackObama doesn't respect the fact that the money he wastes belongs to us. _E_
Via @DailyCaller by @samsondunn: "Pastor To Hispanic Congregation Speaks Out On Trump Immigrant Crime Statement" __HTTP__ _E_
The OWS protesters are doing nothing to advance the interests of the 99%. Time for them to go home! _E_
It's extremely cold in NY & NJ—not good for flood victims. Where is global warming? _E_
Friends in NY 9 let @BarackObama know that you don't approve of his mistreatment of @Israel. Vote for @Bobturner9th tomorrow! _E_
Nobody wants wind turbines they are failing all over the world and need massive subsidy a disaster for taxpayers. _E_
No one has worse judgement than Hillary Clinton corruption and devastation follows her wherever she goes. _E_
Poll numbers are starting to look very good. Leading in Florida @CNN Arizona and big jump in Utah. All numbers rising national way up. Wow! _E_
At the request of the Governor of Texas I have signed the Disaster Proclamation which unleashes the full force of government help! _E_
.@TrumpGolfLA public golf course features spectacular panoramic Pacific Ocean views an elite attraction __HTTP__ _E_
Opportunities only present themselves if you are out there looking for them. Be aggressive and seize them when they come. _E_
Free enterprise is essentially a formula not just for wealth creation but for life satisfaction. Arthur C. Brooks _E_
It was an honor to welcome Republican and Democratic members of the Senate Finance Committee to the @WhiteHouse today. #TaxReform __HTTP__ _E_
...for safety. Thank you to the Governor of P.R. and to all of those who are working so closely with our First Responders. Fantastic job! _E_
Jeb Bush signed memo saying not to use the term anchor babies offensive. Now he wants to use it because I use it. Stay true to yourself! _E_
I suspect @JoeBiden could do well tonight. Don't be fooled by his gaffes. He is a seasoned and feisty debater. _E_
Thank you Kansas! The line going into the Orlando event is over a mile long. Massive crowd expected. Leaving Kansas now be there soon! _E_
Trump to host #Oscars? __HTTP__ _E_
Entrepreneurs: Listen and learn from others but make your own decisions. Take responsibility for yourself. It's a very empowering attitude! _E_
Happy birthday to U.S. ARMY and our soldiers. Thank you for your bravery sacrifices & dedication. Proud to be your Commander in Chief! _E_
7.8% unemployment number is a complete fraud as evidenced by the jobless claims number released yesterday.Real unemployment is at least 15% _E_
Just left hospital. Rep. Steve Scalise one of the truly great people is in very tough shape but he is a real fighter. Pray for Steve! _E_
RT @SecShulkin: Our Mobile Vet Center set up and ready to help #Veterans impacted by #HurricaneHarvey in Corpus Christi. __HTTP__ _E_
Many reports that I will be attending the Alvarez/Khan fight this weekend in Vegas. Totally untrue! Unfortunately I have other plans. _E_
"Today's put off objectives reduce tomorrow's achievements." Henry Banks _E_
I'm a former chief of police in a border town. I'm Hispanic I'm proud to be Hispanic and I'm 100% behind Trump. __HTTP__ _E_
Mar a Lago in Palm Beach is one of the most exclusive & elite clubs in the world w/award winning amenities __HTTP__ _E_
Good advice from my mother Mary MacLeod Trump: "Trust in God and be true to yourself." _E_
Iran the Number One State of Sponsored Terror with numerous violations of Human Rights occurring on an hourly basis has now closed down the Internet so that peaceful demonstrators cannot communicate. Not good! _E_
Thank you for your endorsement @GovernorSununu. #MAGA __HTTP__ _E_
I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
Tomorrow's the day! Knock on doors and make calls with us on National Day of Action! #TrumpTrain #MAGA... __HTTP__ _E_
Must watch – owner of a single restaurant anticipates that ObamaCare will cost over $1M for compliance __HTTP__ _E_
Q/A @thecelidebiasio The secret behind my success is that I love what I'm doing. That gives me energy focus (cont) __HTTP__ _E_
Things work out best for those who make the best of how things work out. John Wooden _E_
If the Republicans ever want to win a presidential election in the next 30 years they must get rid of @KarlRove. He is useless. _E_
More lies and deceptions @BarackObama is having his ex staffers write 'independent' studies for his reelection __HTTP__ _E_
True thanks. __HTTP__ _E_
"We build too many walls and not enough bridges." Isaac Newton _E_
I went to Wharton made over $8 billion employ thousands of people & get insulted by morons who can't get enough of me on twitter...! _E_
You have until 8pm to #VoteTrump Delaware! __HTTP__ _E_
Located in Tribeca each @TrumpSoHo hotel room features floor to window ceilings for a view of lower Manhattan __HTTP__ _E_
Crooked Hillary Clinton wants to essentially abolish the 2nd Amendment. No gun owner can ever vote for Clinton! _E_
Welcome to the 'Islamist Winter' the Muslim Brotherhood is now taking over the Egyptian military and possibly (cont) __HTTP__ _E_
I take great pride watching skaters enjoy the #TRUMP Rink in Central Park from my office world's best skating rink __HTTP__ _E_
South Korea is finding as I have told them that their talk of appeasement with North Korea will not work they only understand one thing! _E_
I'm not against vaccinations for your children I'm against them in 1 massive dose.Spread them out over a period of time & autism will drop! _E_
The way President Obama runs down the stairs of Air Force 1 hopping & bobbing all the way is so inelegant and unpresidential. Do not fall! _E_
Despite what the haters and losers like to say I never filed for bankruptcy but WOW the preeminent gaming company Caesars just did. _E_
.@thehill John Oliver had his people call to ask me to be on his very boring and low rated show. I said NO THANKS Waste of time & energy! _E_
I thought I was being nice to somebody re their parents. I guess this teaches you not to be nice or trusting. Sad! _E_
Our country has tremendous potential. Together we can fix Washington. Let's Make America Great Again! __HTTP__ _E_
Chrysler is moving a massive plant from Mexico to Michigan reversing a years long opposite trend. Thank you Chrysler a very wise decision. The voters in Michigan are very happy they voted for Trump/Pence. Plenty of more to follow! _E_
Crooked Hillary is being badly criticized (for a Wall Street paid for ad) by PolitiFact for a false ad on me on women. She is a total fraud! _E_
On behalf of a GRATEFUL NATION THANK YOU to all of the First Responders (HEROES) who saved countless lives in Las Vegas on Sunday night. __HTTP__ _E_
The fake news media is going crazy with their conspiracy theories and blind hatred. @MSNBC & @CNN are unwatchable. @foxandfriends is great! _E_
Rather than causing a big disruption in N.Y.C. I will be working out of my home in Bedminster N.J. this weekend. Also saves country money! _E_
THANK YOU to all of the incredible volunteers behind the scenes in Iowa! #CaucusForTrump __HTTP__ __HTTP__ _E_
Thank you South Carolina! Together WE WILL MAKE AMERICA GREAT AGAIN! #VoteTrumpSC __HTTP__ _E_
We will repeal and replace the horrible disaster known as #Obamacare! __HTTP__ _E_
.@mcuban When Apprentice became the #1 show on tv you tried copying me with The Benefactor a complete and total ratings disaster for @ABC. _E_
Sharks are last on my list other than perhaps the losers and haters of the World! _E_
"My office is at Yankee stadium. Yes dreams do come true." @Yankees Captain Derek Jeter _E_
Reverend Wright was dumped like a dog by @BarackObama he can't be feeling too good. _E_
The Yankees are sure lucky George Steinbrenner is not around. A lot of people would be losing their jobs. _E_
Gov Kasich voted for NAFTA which devastated Ohio and is now pushing TPP hard bad for American workers! _E_
Hard to believe that with 24/7 #Fake News on CNN ABC NBC CBS NYTIMES & WAPO the Trump base is getting stronger! _E_
Thank you South Carolina! Everyone get out and vote tomorrow! We will #MakeAmericaGreatAgain! __HTTP__ _E_
Zimmerman is no angel but the lack of evidence and the concept of self defense especially in Florida law gave the jury little other choice _E_
Barack Obama is not who you think he is. Most overrated politician in US history. _E_
People are happy that I left the Trump Tower atrium open as opposed to taking the easy way out. __HTTP__ _E_
I really like the Koch Brothers (members of my P.B. Club) but I don't want their money or anything else from them. Cannot influence Trump! _E_
"Be tough be smart be personable but don't take things personally. That's good business." – Think Like a Champion _E_
Power Lunching next to the #BlueMonster: __HTTP__ via @UrbanDaddy cc @TrumpDoral _E_
Still a buyer's market. Buy directly from a bank. They want to offload properties that have defaulted will give good prices & financing. _E_
Obama administration said that Saudi Arabia was on Syria's border __HTTP__ Wrong. These are the civilians planning the war. _E_
The Trump Organization is honored to be expanding our interests into Dubai. The golf course will be the top course in the Middle East. _E_
Ted Cruz purposely and illegally did not list on his personal disclosure form personally guaranteed loans from banks. They own him! _E_
Pervert Alert! Serial sexter @anthonyweiner has promised to use twitter as a "tool." Parentsmake sure your children have him blocked. _E_
The third mass attack (slaughter) in days by ISIS. 200 dead in Baghdad worst in many years. We do not have leadership that can stop this! _E_
Congratulations to Rex Tillerson on being sworn in as our new Secretary of State. He will be a star! _E_
A great interview of @DonaldJTrumpJr in the @ globeandmail on Trump Tower Toronto __HTTP__ _E_
As China and the rest of the World continue to rip off the U.S. economically they laugh at us and our president over the riots in Ferguson! _E_
Thank you! #TrumpPence16 __HTTP__ _E_
Just out: The Obama Administration knew far in advance of November 8th about election meddling by Russia. Did nothing about it. WHY? _E_
Did China ask us if it was OK to devalue their currency (making it hard for our companies to compete) heavily tax our products going into.. _E_
Getting ready to engage G7 leaders on many issues including economic growth terrorism and security. _E_
Congrats to @leezeldin on a great victory. I hope my robocalls helped! #NY1 _E_
Can you believe that Mitch McConnell who has screamed Repeal & Replace for 7 years couldn't get it done. Must Repeal & Replace ObamaCare! _E_
Crazy @megynkelly says I don't (won't) go on her show and she still gets good ratings. But almost all of her shows are negative hits on me! _E_
Tennessee GOP Poll __HTTP__ 32.7%Cruz 16.5%Carson 6.6%Rubio 5.3%Christie 2.4%Jeb 1.6% _E_
As a very active President with lots of things happening it is not possible for my surrogates to stand at podium with perfect accuracy!.... _E_
My @SquawkCNBC interview discussing 2012 election polls @MittRomney's current trip & the US housing & land market __HTTP__ _E_
One point I made last night and will continue to push is that the @GOP can't be pollitically correct. We must fight fire with fire. _E_
.@billmaher was so nervous talking about me on the @jayleno show—I've never seen him like that! _E_
China has just intervened to lower the yuan in other words they will continue to screw the U.S.! _E_
I am pleased to inform you that I have just granted a full Pardon to 85 year old American patriot Sheriff Joe Arpaio. He kept Arizona safe! _E_
He may be the worst reporter in all of sports: @RickReilly of @ESPN. He gets away with murder and most people (cont) __HTTP__ _E_
The Kate Steinle killer came back and back over the weakly protected Obama border always committing crimes and being violent and yet this info was not used in court. His exoneration is a complete travesty of justice. BUILD THE WALL! _E_
We will soon be at a point with our incompetent politicians where we will be treating illegal immigrants better than our veterans. _E_
Via __HTTP__ Interview with Donald Trump about Presidential Aspirations: It's all a deal __HTTP__ _E_
Wow @UnionLeader circulation in NH has dropped from 75000 to around 10—bad management. No wonder they begged me for ads. _E_
Isn't it a shame that the person who will have by far the most delegates and many millions more votes than anyone else me still must fight _E_
If @VattenfallGroup dropped out of the economically unfeasible wind farm development in Aberdeen who is (cont) __HTTP__ _E_
We cannot let this evil continue! #Debates2016 __HTTP__ _E_
Thank you South Carolina! We will MAKE AMERICA SAFE & GREAT AGAIN! __HTTP__ __HTTP__ _E_
Trump: If Republicans 'don't get tough they're not going to win this election' __HTTP__ Via @thehill _E_
Will be on Fox & Friends in 3 minutes 7.00 A.M. _E_
Entrepreneurs: Don't ever think you've done it all already or that you've done your best. You haven't so don't limit yourself! _E_
I just passed a 10 block long gas line going to LGA airport a terrible situation! _E_
Our country is totally fractured and with our weak leadership in Washington you can expect Ferguson type riots and looting in other places _E_
Advice from my father Fred C. Trump: Know everything you can about what you're doing. _E_
Have passion for what you do and be efficient at the same time. Think Like a Champion _E_
Ted Cruz is falling in the polls. He is nervous. People are worried about his place of birth and his failure to report his loans from banks! _E_
Video in honor of the 100th Anniversary of the Anti Defamation League (ADL): "Imagine a World Without Hate" __HTTP__ _E_
If authorities need direct view from top of Trump Tower call office. _E_
He @BarackObama should not be trying to intimidate the USC justices on ObamaCare. He is worried because SG (cont) __HTTP__ _E_
"A people that values its privileges above its principles soon loses both." Dwight D. Eisenhower _E_
Via @WTCommunities: Donald Trump to CPAC: Romney 'Didn't Talk Enough About Success' __HTTP__ by @HuizingaDanny _E_
I can't believe Apple isn't moving faster to create a larger iPhone screen. Bring back Steve Jobs! _E_
Congress should get back to Washington but @BarackObama doesn't want to interrupt his vacation in Martha's Vineyard. _E_
Entrepreneurs: When negotiating don't be an open book. Know that the only person on your side might be yourself. _E_
The more time you spend feeling sorry for yourself the more time you waste after a setback. Move on and quickly embrace the next challenge! _E_
See dummy Danny Zuker who I never heard until this started something that he couldn't finish gutless and unwilling to take my bet! _E_
Thank you Tennessee! #Trump2016 __HTTP__ _E_
"Becoming an entrepreneur is a personal development program. If you grow personally your business will grow." – Midas Touch _E_
Thank you @DonaldJTrumpJr! #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
Leaving for the GREAT STATE OF SOUTH CAROLINA now to make a speech about how to MAKE OUR COUNTRY GREAT AGAIN! _E_
Great rally in Fresno California great crowd! Thank you! #Trump2016 __HTTP__ _E_
With @greta in Washington D.C. Old Post Office under construction. Tune in tonight at 7PM EST! __HTTP__ _E_
See Lyin' Ted even the @DailyBeast (no fan of mine) says this story came from Rubio not Trump! __HTTP__ _E_
Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before _E_
If Obama has to re fight this fight next year he loses Watch the fine details in every deal The Art of the Deal _E_
Via @DMRegister BY @JenniferJJacobs: "@SteveKingIA ramps up with first TV ad Trump event" __HTTP__ _E_
New CNN Iowa poll Trump 33 Cruz 20. Everyone else way down! Don't trust Des Moines Register poll biased towards Trump! _E_
Today we honored our true American heroes on the first ever National Vietnam War Veterans Day.#ThankAVeteran... __HTTP__ _E_
Wow the @nytimes is losing thousands of subscribers because of their very poor and highly inaccurate coverage of the Trump phenomena _E_
Did my weekly phoner on Fox & Friends this morning...sounding off on issues of the day ... __HTTP__ _E_
Great job tonight @ericbolling _E_
Tired of being bullied by the economy? I'm going to help people. Wednesday 11 AM at Trump Tower _E_
I will do more in the first 30 days in office than Hillary has done in the last 30 years! #Debate  #BigLeagueTruth __HTTP__ _E_
The Democrats seem intent on having people and drugs pour into our country from the Southern Border risking thousands of lives in the process. It is my duty to protect the lives and safety of all Americans. We must build a Great Wall think Merit and end Lottery & Chain. USA! _E_
Trump International Hotel Washington D.C. will be one of the world's top luxury hotels __HTTP__ _E_
Thank you. __HTTP__ _E_
Many great business campaigns at @fundanything __HTTP__ Great way to support small upstarts. _E_
Lance Armstrong did himself great harm last night. Lawsuits & failure will follow him! _E_
On the red carpet at the NYC premiere of Dark Knight Rises with @melaniatrump via @NewYorkObserver's @velvet_roper __HTTP__ _E_
.@SarahPalinUSA was 100% correct when she stated that @oreillyfactor used us in day long tease to get people to watch but we were not on! _E_
Featuring five championship golf courses including the Blue Monster @TrumpDoral is South Miami's top destination __HTTP__ _E_
"All Star Celebrity Apprentice" is #1 in the time period among ABC CBS and NBC in 18 49 and all other key demos—Nielsen Ratings _E_
The real estate market is slowly improving. Still a great time to buy. You will thank me in 5 years. _E_
"Money may not grow on trees but it does grow from talent hard work and brains." – Think Like a Billionaire _E_
A former Miss New York is the designer behind the swimsuits featured in Sunday's Miss USA pageant—beautiful! __HTTP__ _E_
It was a great honor to be with King Abdullah II of Jordan and his delegation this morning. We had a GREAT bilateral meeting! __HTTP__ _E_
EXCLUSIVE — DONALD TRUMP ON THE GOP PRIMARY: 'IF I WIN I WILL BEAT HILLARY' __HTTP__ via @BreitbartNews by Katie McHugh _E_
Wind turbine syndrome is affecting tremendous numbers of people in their wake—stop ugly turbines. _E_
Floyd Mayweather is being beaten up badly through 10 rounds by Marcos Maidana but announcers say it is even. TWO ROUNDS LEFT. _E_
Wacky & totally unhinged Tom Steyer who has been fighting me and my Make America Great Again agenda from beginning never wins elections! _E_
RT @markets: U.S. job openings surge to record __HTTP__ via @ShoChandra __HTTP__ _E_
Trump Collection's summer line exclusively available @Macys is the pinnacle of style & prestige. Dress your best! __HTTP__ _E_
On December 19th the @MissUniverse pageant will be broadcast live in over 190 countries to one billion viewers. @nbc _E_
The situation with Russia is much more dangerous than most people may think and could lead to World War III. WE NEED GREAT LEADERSHIP FAST _E_
Club For Growth tried to extort $1000000 from me. When I said NO they went hostile with negative ads. Disgraceful! _E_
Dishonest media is trying their absolute best to depict a star in a tweet as the Star of David rather than a Sheriff's Star or plain star! _E_
RT @GregAbbott_TX: Thanks to the Texas National Guard for their help to rescue flooded Texans. #HurricaneHarvey __HTTP__ _E_
"When you have confidence you can have a lot of fun. And when you have fun you can do amazing things." @RealJoeNamath _E_
Great poll thank you Nevada!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
"Generals don't panic then the troops never panic." @SHAQ _E_
You can't tax business. Business doesn't pay taxes. It collects taxes. ― Ronald Reagan _E_
Thank you New York! I love you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
I know about the "rustic" look on golf courses—but see photo of highly rated Trump National Philadelphia—a real gem. __HTTP__ _E_
Thank you to @NYPost's Robert Rorke for the really nice review of #SNL. So many enjoyed it very gratifying! __HTTP__ _E_
The TODAY Show should call me about who to put on the show— I know more about people who get ratings than anyone. _E_
Obama wants Americans to keep buying crude from OPEC who is ripping us off instead of our ally Canada through (cont) __HTTP__ _E_
Just out new PPP NATIONAL POLL has me in first place by a wide margin at 29%. I wonder why only @FoxNews has not reported this? Too bad! _E_
As President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_
Trump National Golf Club Los Angeles fronts the Pacific Ocean and has an 18 hole Pete Dye course. Beautiful! __HTTP__ _E_
Trump Int'l Washington D.C. is a historic building which our entire nation can take pride in & enjoy Opening 2016 __HTTP__ _E_
.@rupertmurdoch is absolutely right it will be a nightmare for @Israel if Obama is re elected. _E_
I will be on Fox & Friends @foxandfriends at 7.00 a.m. (30 minutes). Enjoy! _E_
The journey to #MAGA began @CPAC 2011 and the opportunity to reconnect with friends and supporters is something I look forward to every year. See you at #CPAC2018! _E_
He is delusional: @BarackObama believes that he is the 4th best POTUS ever. _E_
Minorities line up behind.....Donald Trump #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_
Weak & ineffective @JebBush is doing ads where he shows his statement in the debate but not my response. False advertising! _E_
The UN is about to use its Assembly to attack @Israel. We should defund the UN entirely if they can't act resp... (cont) __HTTP__ _E_
I must say that some of these college football games are great tonight very exciting I wish I had more time to watch! _E_
.@seanhannity at 10:00. _E_
I'm convinced that about half of what separates successful entrepreneurs from the non successful ones is pure perseverance. Steve Jobs _E_
My @FoxNews interview with @TeamCavuto discussing the Newsmax @iontv debate #TimeToGetTough and the 2012 race __HTTP__ _E_
Another electric car firm that @BarackObama gave $118M just went bankrupt. __HTTP__ He loves to waste our tax dollars. _E_
"TRUMP DECLARES VICTORY ON IMMIGRATION AS OBAMA ADMITS SOME ILLEGALS ARE 'GANG BANGERS'" __HTTP__ via @BreitbartNews @ASwoyer _E_
#ICYMI: @foxandfriends this morning. __HTTP__ _E_
He admits his presidency has been flawed...but @BarackObama claims economy is stronger. __HTTP__ _E_
Ailsa Course changes: #TrumpTurnberry What a beautiful place! __HTTP__ _E_
RT @foxandfriends: White House calls out Senate Democrats for obstructing nominees __HTTP__ _E_
The judge opens up our country to potential terrorists and others that do not have our best interests at heart. Bad people are very happy! _E_
We will follow two simple rules: BUY AMERICAN & HIRE AMERICAN!#InaugurationDay #MAGA _E_
Crooked Hillary Clinton and her team were extremely careless in their handling of very sensitive highly classified information. Not fit! _E_
Don't forget the open call at Trump Tower tomorrow for The Apprentice. I look forward to seeing you there. _E_
I'm a Republican but not a fan of the last George Bush he also was a lousy President (Iraq etc.). In fact he was so bad he gave us Obama! _E_
...I told Republicans to approve healthcare fast or this would happen. But don't worry I will veto because I love our country & its people. _E_
Commodity prices are beginning to drop as a result of the Euro crisis __HTTP__ _E_
Russia should hand over Snowden to the U.S. but they are having too much fun taunting our leaders. _E_
Why is it that Eric Schneiderman is considered a lightweight by so many and has failed to go after Jon Corzine and big abusers for billions? _E_
I requested that Mitch M & Paul R tie the Debt Ceiling legislation into the popular V.A. Bill (which just passed) for easy approval. They... _E_
Why does @megynkelly devote so much time on her shows to me almost always negative? Without me her ratings would tank. Get a life Megyn! _E_
Do you think that Hillary Clinton will apologize to me for the lie she told about the video of me being used by ISIS. There is no video. _E_
Congratulations to @AlabamaFTBL on winning the BCS championship last night! _E_
Just arrived in Taormina with @FLOTUS Melania. #G7Summit #USA __HTTP__ _E_
The Amateur. On his trip to Afghanistan our commander in chief disclosed the CIA Chief's name. Unsafe disaster! __HTTP__ _E_
Wishing everyone a Happy Memorial Day and a thank you to all the soldiers who protected our great country. _E_
Hillary and the Dems loved and praised FBI Director Comey just a few days ago. Original evidence was overwhelming should not have delayed! _E_
As promised on the campaign trail we will provide opportunity for Americans to gain skills needed to succeed & thrive as the economy grows! __HTTP__ _E_
Trump offers $5 million for Obama college passport records __HTTP__ By @AlexPappasDC @DailyCaller _E_
Watch me tonight on The O'Reilly Factor at 8 pm and 11 pm EST FOX News _E_
Interesting article about Atlantic City __HTTP__ _E_
The American people agree. No free pass for #CrookedHillary! __HTTP__ _E_
Via @bostonherald by @ChrisCassidy_BH: "Trump: `The last thing we need is another Bush'" __HTTP__ _E_
First of all you don't necessarily need the best location. What you need is the best deal. The Art of the Deal _E_
Congratulations to @TrumpChicago @TrumpSoHo and @TrumpLasVegas all listed #1 on @TravelandLeisure World's Best Business Hotels _E_
The Blue Monster at Trump National Doral in Miami is doing record business everybody wants a piece of it. Great reviews. Thank you! _E_
A Rod was a great player when he lived at Trump Park Avenue even though he was on the juice! _E_
A friend of mine went to @CakeBossBuddy and sent me this beautiful cake which we put in the atrium of @TrumpTowerNY. __HTTP__ _E_
.@Lord_Sugar nice call on predicting that the iPOD would be dead finished gone kaput __HTTP__ Great business foresight. _E_
So many great endorsements yesterday except for Paul Ryan! We must put America first and MAKE AMERICA GREAT AGAIN! _E_
I hope the @RNC is ready for a Third Party if they blow this election because that is what they will face. They must fight hard. _E_
Word is that @Greta Van Susteren was let go by her out of control bosses at @NBC & @Comcast because she refused to go along w/ 'Trump hate!' _E_
Huffington Post is just upset that I said its purchase by AOL has been a disaster and that Arianna Huffington is ugly both inside and out! _E_
.@MatthewJDowd thank you for the nice comments recently especially on @BarbaraJWalters. My family & I greatly appreciate your kind words. _E_
Great being in Cincinnati Ohio last night thank you! Off to Washington D.C. now. #Trump2016 #AmericaFirst __HTTP__ _E_
The Failing @nytimes set Liddle' Bob Corker up by recording his conversation. Was made to sound a fool and that's what I am dealing with! _E_
Iraq's government is treating us like fools. We should demand their oil. _E_
All this from a guy who lectured Americans about tightening their belts: @BarackObama bashes rich people an... (cont) __HTTP__ _E_
We are the greatest country the world has ever known. I make no apologies for this country my pride in it or (cont) __HTTP__ _E_
Don't forget the Celebrity Apprentice Sunday night at 9 pm on NBC for another surprising and exciting episode __HTTP__ _E_
Trump is going to be our President. We owe him an open mind and the chance to lead. So much time and money will be spent same result! Sad _E_
The so called Commission on Presidential Debates admitted to us that the DJT audio & sound level was very bad. So why didn't they fix it? _E_
Scary. Obama and the Democrat Senate have accrued over $5T worth of debt without passing a budget in the last 3 years. 4 more years? _E_
Be on time. Wasting other people's time due to poor planning and thoughtlessness will only leave a bad impression. Think Like a Champion _E_
.@NYDailyNews the dying tabloid owned by dopey clown Mort Zuckerman puts me on the cover daily because I sell. My honor but it is dead! _E_
It's amazing how different all of the polling results are not an exact science. _E_
We must not let #CrookedHillary take her CRIMINAL SCHEME into the Oval Office. #DrainTheSwamp __HTTP__ _E_
.@andersoncooper did an excellent job of hosting the #DemDebate last night. Tough firm but fair. _E_
It is time to create jobs for Americans not D.C. We need a bold new direction. Let's Make America Great Again! __HTTP__ _E_
I told all of the haters and losers long ago that Iraq would fall take the OIL or get out fast! Massive waste of lives and trillions of $'s _E_
China just called. They want to lend Obama another $1B for the ObamaCare web site. _E_
RT @RSBNetwork: LIVE Stream: Donald Trump about to speak in Boca Raton FL. Protesters already before Trump speaks. #TrumpTrain __HTTP__ _E_
He has no respect for American exceptionalism. @BarackObama has outsourced our space program to the Russians __HTTP__ _E_
$30M a year and A Rod is now relegated to the bench. @yankees would have lost if Girardi hadn't benched him in the 9th (see my prediction) _E_
It's Thursday. How much has OPEC ripped us off today? _E_
.@reince is doing a fantastic job for the Republican Party hope he gets the credit he deserves. _E_
The VA scandal will only get worse over the time. Our vets deserve the best care possible. We must be open to private solutions. _E_
The super Liberal Democrat in the Georgia Congressioal race tomorrow wants to protect criminals allow illegal immigration and raise taxes! _E_
Watch @foxandfriends now on Podesta and Russia! _E_
This assignment has been a challenge to both teams. #CelebApprentice _E_
I will soon be releasing my response to the fact that President Obama refused to show his applications and records to the public. _E_
Things turn out best for the people who make the best of the way things turn out. John Wooden _E_
Just finished a very good meeting with the President of South Korea. Many subjects discussed including North Korea and new trade deal! _E_
The fight against ISIS starts at our border. 'At least' 10 ISIS have been caught crossing the Mexico border. Build a wall! _E_
Palm Springs CA has been destroyed absolutely destroyed by the world's ugliest wind farm at the Gateway on Interstate 10. Very very sad! _E_
I went to @MikeTyson's play. I will be doing a review in the next #trumpvlog. _E_
It's this simple. "Make America Great Again." #debate #BigLeagueTruth _E_
As to the U.N. things will be different after Jan. 20th. _E_
"You can't con people at least not for long. If you don't deliver the goods people will eventually catch on." The Art of The Deal _E_
Wow I was just informed that I'm being inducted into the @WWE Hall of Fame a great honor 4/6/13 at @MSGnyc __HTTP__ _E_
.@MittRomney & @PaulRyanVP get what needs to be done to reign in China. @BarackObama gets kicked around by the Chinese. _E_
Our national security starts at the border. Do you think ISIS & al Qaeda are just in the Middle East? _E_
Tell 'Top Scot' Michael Forbes to clean up his property—it is an embarrassment to Scotland. _E_
Via @washingtonpost by @jdelreal: About that Donald Trump speech at CPAC ... __HTTP__ _E_
Thank you @scottienhughes for the great job you did on @CNN. Great energy and smarts! I will not let you down. _E_
Distressed real estate opportunities can make great investments. You need the foresight and instincts to know the property's true potential. _E_
Rapper @MacMiller's song Donald Trump now has 57 million hits I created another star where's my cut? _E_
RT: @thedailybeast: Polling shows the @AmericansElect movement could still nominate a viable independent with a chance of victory... _E_
One 57 is one of the worst looking buildings I've seen in a long time in particular its very ugly skin. _E_
Funny to hear the Democrats talking about the National Debt when President Obama doubled it in only 8 years! _E_
have been allowed to run guilty as hell. They were VERY nice to her. She lost because she campaigned in the wrong states no enthusiasm! _E_
Just watched @marcorubio on television. Just another all talk no action politician. Truly doesn't have a clue! Worst voting record in Sen. _E_
Last nights results in poll taken by NBC. #AmericaFirst #ImWithYou __HTTP__ _E_
This is a MOVEMENT! #RNCinCLE __HTTP__ _E_
Don't forget the Miss USA Pageant live on Sunday night at 9 pm ET on NBC. And you can vote for your favorite beauty! __HTTP__ _E_
Donald Trump is confident that Ireland is ready for a big comeback __HTTP__ via @independent_ie by @AnitaActually _E_
Will be interviewed by @oreillyfactor tonight at 8 PM. _E_
Help save the lives of our troops.Our #vets suffering from TBI/PTS need treatment @makeitvisible Donate to __HTTP__ _E_
#CelebApprentice who do you think won? _E_
Fake News story of secret dinner with Putin is sick. All G 20 leaders and spouses were invited by the Chancellor of Germany. Press knew! _E_
Will be going to Detroit Michigan (love) today for a big meeting on bringing back car production to State & U.S. Already happening! _E_
Young entrepreneurs – be resolute in your drive for success. Gain momentum. Once you succeed promote yourself! _E_
So great to be in New York. Catching up on many things (remember I am still running a major business while I campaign) and loving it! _E_
Great that Pres. O is seeing @MittRomney today—lots of good things can happen. _E_
How does @HBO employ @BillMaher with a pathetic show that he does what kind of a special is that? Complete garbage! _E_
In Las Vegas for the Miss Universe Pageant—airing tonight on @nbc at 8 o'clock. _E_
Sleepy eyes @chucktodd is an absolute joke of a reporter. He is in the bag for Obama. He can't carry @jack_welch's jock. _E_
.@marcorubio what do you say to the family of Kathryn Steinle in CA who was viciously killed b/c we can't secure our border? Stand up for US _E_
The death tax should be abolished the Government is simply taxing you twice. It is also a job killer. _E_
Heading to Iowa to a packed house. Just released polls all first place are amazing. Thank you! _E_
Via @ProgressIndex: "Donald Trump to deliver keynote address at annual Chesterfield Republican Gala" __HTTP__ _E_
I will be interviewed on @FaceTheNation this morning. Enjoy! @jdickerson _E_
Another @BarackObama green car loan recipient is laying off staff. __HTTP__ How many billions of our money has he wasted? _E_
We are going to have a big event at the Verizon Wireless Arena in Manchester New Hampshire! 5K+! Join us tomorrow: __HTTP__ _E_
Wow Mitt Romney didn't know that Rand Paul was in the race for president. Very strange! @FoxNews _E_
Lawrence O'Donnell will soon have another cancelled show to go along with his three cancelled TV series Mister (cont) __HTTP__ _E_
RT @FoxNews: .@AlanDersh: Trump Has 'More Credibility' Than Obama With North Korea __HTTP__ __HTTP__ _E_
Crooked Hillary Clinton is being protected by the media. She is not a talented person or politician. The dishonest media refuses to expose! _E_
Hillary says take back Mosul? We would have NEVER lost Mosul if it wasn't for #CrookedHillary. #DrainTheSwamp __HTTP__ _E_
Visiting New York City? Make sure to skate in the world famous Trump Rink in Central Park __HTTP__ Great for the whole family! _E_
Another clip from my @greta interview discussing why Sony should not have capitulated to the hackers __HTTP__ No Courage! _E_
Obama has admitted that he spends his mornings watching @ESPN. Then he plays golf fundraises & grants amnesty to illegals. _E_
Wrong @BarackObama's '08 campaign manager & current Senior WH Advisor collected $100G fee from Iranian affiliate __HTTP__ _E_
We're worried about waterboarding as our enemy ISIS is beheading people and burning people alive. Time for us to wake up. _E_
Getting ready to leave for Poland after which I will travel to Germany for the G 20. Will be back on Saturday. _E_
Iran is flying supply planes to Syria through Iraqi airspace. Thank you United States for making this possible! _E_
Thank you Oklahoma & Virginia! #MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_
Honored to have passed 1 million twitter followers. We are making America #1 again. #TimeToGetTough _E_
The Trump Signature Collection available @Macys offers top new designs for your fall wardrobe. Dress your best! __HTTP__ _E_
When employees are working at home they can never have the same cohesivness as working together as a group... _E_
#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
I will be nominating Christopher A. Wray a man of impeccable credentials to be the new Director of the FBI. Details to follow. _E_
Today is armed forces day. Thank you  to our military service members! I love you all! _E_
Dopey Sugar—@Lord_Sugar Isn't it sad that my golf course in Scotland just got "best new course in the world"—it's worth more than you are! _E_
.@JuddApatow I agree! _E_
Obama's convention bounce is gone. @MittRomney has retaken the lead in the latest @RasmussenPoll __HTTP__ _E_
Thank you Nashville Tennessee! __HTTP__ _E_
Wow it is unbelievable how distorted one sided and biased the media is against us. The failing @nytimes is a joke. @CNN is laughable! _E_
My interview w/ @WendyWilliams on @WendyShow discussing @MichelleObama's bangs & All Star @CelebApprentice __HTTP__ _E_
Great win by the @nyjets yesterday. If they run the table they will make the playoffs. _E_
Remember when the two failed presidential candidates Lindsey Graham and Jeb Bush signed a binding PLEDGE? They broke the deal no honor! _E_
Many of the thugs that attacked the peaceful Trump supporters in San Jose were illegals. They burned the American flag and laughed at police _E_
Will be interviewed by @JudgeJeanine on @FoxNews at 9:00 P.M. (Saturday night). Enjoy! _E_
Congressman John Lewis should finally focus on the burning and crime infested inner cities of the U.S. I can use all the help I can get! _E_
.@andydean2014 Thank you you were great. You can defend me anytime. Amazing job. _E_
Cruz says I supported TARP which gave $25 million to Goldman Sachs the bank which loaned him the money he didn't disclose. Puppet! _E_
In the "old days" when good news was reported the Stock Market would go up. Today when good news is reported the Stock Market goes down. Big mistake and we have so much good (great) news about the economy! _E_
Biden's statements on Medicare are very effective. Ryan must now come back and combat. #VPDebate _E_
A clip from my @foxandfriends interview discussing how Newsmax @iontv debate is determining the GOP primary polls __HTTP__ _E_
Will be interviewed on @FoxNews at 10:00 P.M. Enjoy! _E_
I am in Iowa. Will be interviewed on This Week With @GStephanopoulos this morning. ENJOY! _E_
Be sure to listen to my interview on tonight's @SteveDeaceShow. Steve is a terrific guy! _E_
The great Mike Wallace covered me in a much more professional manner than his son Chris Wallace of @FoxNews. Mike was a total pro! _E_
Hagel committee vote has been postponed as Hagel refuses to disclose all his finances __HTTP__ _E_
NY should frack now. What's the hold up? Is Albany opposed to creating jobs and making gas cheaper for middle class? _E_
When will lightweight hack Attorney General be investigated for his repeated prosecutorial misconduct? __HTTP__ _E_
Figure out what really moves you. You've got to have the 'FIRE' in order to have the Midas Touch. Midas Touch _E_
Don't negate your own power. Whatever you've been dealt know you can deal with it. Fear is the opposite of faith. _E_
True. Thanks. __HTTP__ _E_
I love seeing that Graydon Carter and @VanityFair are failing so badly. He's only focused on his bad food restaurants. _E_
Very grateful for the 9 O decision from the U. S. Supreme Court. We must keep America SAFE! _E_
Need all on the UN Security Council to vote to renew the Joint Investigative Mechanism for Syria to ensure that Assad Regime does not commit mass murder with chemical weapons ever again. _E_
"What counts is not necessarily the size of the dog in the fight it's the size of the fight in the dog." Dwight D. Eisenhower _E_
Will be interviewed on @foxandfriends now! _E_
New Hampshire has a major decision to make today. Hopefully we won't have to hear any more Mandarin spoken in future debates. _E_
Amazing evening at Saturday Night Live! _E_
The harder I work the luckier I get. Samuel Goldwyn _E_
NYC is under constant threat from Jihadists & violent criminals. Stop & Frisk keeps streets & subways safe.Stand strong Ray Kelly _E_
Wow reviews are in THANK YOU! _E_
The Gang of Six yet another unmitigated disaster. ANY DEAL NEEDS TO REPEAL OBAMACARE. T E A. _E_
Lightweight Senator @RandPaul should focus on trying to get elected in Kentucky a great state which is embarrassed by him. _E_
Busy doing phoners this week with Neil Cavuto Wolf Blitzer Fox & Friends and Larry Kudlow....check out __HTTP__ _E_
Why didn't Gates resign if he was so unhappy about what he was being told by Obama? The fact is Iraq etc. have always been disasters! _E_
Millions losing healthcare plans despite President Obama's promise that this WOULD NOT HAPPEN! What about a massive protest march on D.C. _E_
My family has the honor of being interviewed for a full hour by the legendary @BarbaraJWalters tonight @ABC 10pmE. __HTTP__ _E_
You can benefit from others' wisdom. Not just their mistakes but the good decisions and insight they have to offer." The Way To The Top _E_
Looks like the U.S. will be having the coldest March since 1996 global warming anyone????????? _E_
Thank you Bangor Maine! Get out & #VoteTrumpPence16 on 11/8/16 and together we will MAKE AMERICA SAFE AND GREAT A... __HTTP__ _E_
I will be the featured guest on the season opener of @60Minutes this Sunday. There certainly is plenty to talk about! _E_
CHAIN MIGRATION cannot be allowed to be part of any legislation on Immigration! _E_
Frank was a great guy married to an absolutely wonderful woman @KathieLGifford. What a couple! __HTTP__ _E_
My @gretawire int. on Obama's falling poll numbers Americans losing incentive to work and Weiner's sexting __HTTP__ _E_
Heading to Birmingham Alabama and a massive crowd of incredible people! 12 noon will be wild. _E_
Leaving Nevada now for Iowa. Things are looking good great new polls! _E_
So nice great Americans outside Trump Tower right now. Thank you! __HTTP__ _E_
I am truly enjoying myself while running for president. The people of our country are amazing great numbers on November 8th! _E_
Isn't it great that Obama had time yesterday to fundraise with Jay Z and do @Late_Show while there is a record 21% real unemployment! _E_
Thank you Dayton Ohio! 20000 supporters largest in airport history! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Thank you South Carolina! __HTTP__ _E_
President Obama could totally solve the problem with Putin by demanding that Russia sign on to ObamaCare thereby destroying their economy! _E_
.@HillaryClinton you have failed failed and failed. #BigLeagueTruth Time to #DrainTheSwamp! __HTTP__ _E_
Wages in are country are too low good jobs are too few and people have lost faith in our leaders.We need smart and strong leadership now! _E_
Reports are out there that many CEOs of charities are getting overpaid while their causes are seeing very little... _E_
He (or she) who hesitates is lost: MAKE AMERICA GREAT AGAIN! _E_
.@CNN has to do better reporting if it wants to keep up with the crowd.So totally one sided and biased against me that it is becoming boring _E_
Was photo bombed yesterday by a wise guy when I left the set of @LateNightJimmy... _E_
The Senate should immediately vote on the Iranian sanctions bill. What is the delay? Iran is already breaking its agreement with Obama _E_
Donald Trump ready to end @ApprenticeNBC for White House run __HTTP__ via via @dcexaminer by @eScarry _E_
We are what we repeatedly do. Excellence then is not an act but a habit. Aristotle _E_
Via NYTimes What's Your Ideal Gadget? __HTTP__ _E_
Jeb failed as Jeb! He gave up and enlisted Mommy and his brother (who got us into the quicksand of Iraq). Spent $120 million.Weak no chance! _E_
RT @IamVicky4Trump: TUNE IN: Maria Bartiromo Has an Exclusive Interview With President Trump __HTTP__ _E_
.... we push for the removal of all trade distorting practices....to foster a truly level playing field. _E_
It is truly an honor that His Eminence Archbishop of New York @CardinalDolan will be delivering the benediction at the @RNC convention. _E_
"@joerepublic1 @mckaycoppins how nice of this punk with a pen to call a truce after he tries to show u up w/his bs! True thx _E_
Marco Rubio should pick a location that has working air conditioning next time especially when in Miami proper plan. Sweating profusely! _E_
Preview of Obama's SOTU: More taxes bigger government shrink the private sector end the Republicans & bankrupt the country. Enjoy! _E_
Think of it the Arab League doesn't want to get involved with Syria but they want us to do their dirty work. How stupid! _E_
I encourage everyone in the path of #HurricaneHarvey to heed the advice & orders of their local and state officials. __HTTP__ _E_
...to win. The Democrats are overplaying their hand. They lost the election and now they have lost their grip on reality. The real story... _E_
First Minister Salmond should stop his fruitless drive for obsolete wind turbines in Scotland he would become popular again! @alexsalmond _E_
I watched Sen. Graham @FaceTheNation. Why don't they say that I ran him out of the race like a little boy and in the end he had no support? _E_
By @kwrcrow: Hey Washington Post 'Only You Hate Donald Trump' or 'Is it FEAR?' __HTTP__ _E_
I will be on @meetthepress in an interview with @chucktodd on Sunday morning. So much to talk about! _E_
Wow just watching the news.ObamaCare and the website are TOTALLY OUT OF CONTROL. Costs are through the roof. This could be ruinous to U.S.! _E_
Offering river lake & skyline views @TrumpChicago's 339 5 Star rooms range from Deluxe Suites to Spa Guestrooms __HTTP__ _E_
There is no substitute for hard work. Thomas Edison _E_
Join me in Roanoke Virginia on Saturday evening at 6pm! #MAGA __HTTP__ _E_
Fox and.Friends now! _E_
Obama will eventually approve the Keystone XL pipeline has to happen but it is very late! _E_
The boardroom has never been as intense as in the upcoming13th season of All Star @CelebApprentice. Premieres March 3rd on @NBC! _E_
All Star Celebrity @ApprenticeNBC continues to dominate the Sunday 10PM slot in every key demographic. Still hot after 13 seasons! _E_
I will not be attending the White House Correspondents' Association Dinner this year. Please wish everyone well and have a great evening! _E_
So much Fake News being put in dying magazines and newspapers. Only place worse may be @NBCNews @CBSNews @ABC and @CNN. Fiction writers! _E_
Congratulations to our great military men and women for representing the United States and the world so well in the Syria attack. _E_
TODAY WE MAKE AMERICA GREAT AGAIN! _E_
Nielson Media Research final numbers on ACCEPTANCE SPEECH: TRUMP 32.2 MILLION. CLINTON 27.8 MILLION. Thank you! _E_
I didn't suggest a database a reporter did. We must defeat Islamic terrorism & have surveillance including a watch list to protect America _E_
Leaked e mails of DNC show plans to destroy Bernie Sanders. Mock his heritage and much more. On line from Wikileakes really vicious. RIGGED _E_
I'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ __HTTP__ _E_
Work Underway on First New Trump Course in Dubai Second Course in Planning __HTTP__ via @CybergolfNews _E_
Small businesses will have an ally in the White House with @MittRomney. Mitt gave a great interview yesterday __HTTP__ _E_
Family group shot. #WWEHOF __HTTP__ _E_
Just cannot believe a judge would put our country in such peril. If something happens blame him and court system. People pouring in. Bad! _E_
Via @DailyCaller @NeilMunroDC: "Obama's Border Policy Fueled Epidemic Evidence Shows" __HTTP__ _E_
RT @mike_pence: Good morning! Join me in Lima Ohio tomorrow evening at 7pm. #MAGA Tickets: __HTTP__ _E_
FLASHBACK via @Reuters from 2004: "Donald Trump Would 'Fire' Bush Over Iraq Invasion" It's called great vision. __HTTP__ _E_
Just arrived in Texas have been informed two @fortworthpd officers have been shot. My thoughts and prayers are with them. _E_
Trump Hotels are delivering lots of food to storm victims...we love doing it! _E_
RT @TeamTrump: Hillary's policies have made America less safe that's why 200+ general and military leaders have endorsed @realDonaldTrump!... _E_
The worst show in Las Vegas in my opinion is @pennjillette. Hokey garbage. New York show even worse! _E_
Awarded both @ForbesInspector Five Star & @AAAFiveDiamond ratings @TrumpNewYork's @Jean_GeorgesNYC is fantastic. __HTTP__ _E_
Thank you Charlotte North Carolina! #MakeAmericaGreatAgain __HTTP__ _E_
"Donald Trump dedicates second Scottish golf course to beloved mother Mary" __HTTP__ via @MailOnline _E_
My fellow Tea Party friends in Ohio make sure you take advantage of early voting so you can GOTV election day. Know you can! Must win Ohio. _E_
Everyone should watch the documentary 'Windfall' on @netflix. See an upstate NY town ruined by environmentalists & windfarms. _E_
HAPPY PRESIDENTS DAY MAKE AMERICA GREAT AGAIN! _E_
.@ArceePalabrica @realDonaldTrump Midas Touch is the manual for entrepreneurs who want to succeed. Thanks for sharing your knowledge _E_
One season ends and another starts. Already casting for the next @ApprenticeNBC. Great news for charity $13 million so far. _E_
Looking forward to my meeting with Benjamin Netanyahu in Trump Tower at 10:00 A.M. _E_
Wow I love stimulating debate and driving certain people crazy the Generals were forced to do something they didn't want to do (not me). _E_
Thank you @NYPost! #Trump2016 __HTTP__ _E_
The lightweight hack Schneiderman told Ivanka that the "case is weak and more. Meets with Obama & then files one day later. _E_
Sadly they and others are Fake News and the public is just beginning to figure it out! __HTTP__ _E_
Jeb Bush really blew his interview with @megynkelly should cost him big time. Said he would do the disastrous Iraq war all over again _E_
So Obama wants to bomb ISIS in Iraq & arm them in Syria? What is he doing! _E_
In my administration EVERY American will be treated equally protected equally and honored equally #Debate #BigLeagueTruth _E_
Interesting.@BarackObama's 1981 transfer class to Columbia declined in quality according to the Columbia Spectator __HTTP__ _E_
What's funny about the name "F**kface Von Clownstick" it was not coined by Jon Leibowitz he stole it from some moron on twitter. _E_
So many incredible friends said thanks for TT help I say thanks to you! __HTTP__ _E_
We threw our ally Mubarak overboard and Egypt is now our enemy. Great going Obama Israel is in trouble. _E_
Will be meeting on Monday at Trump Tower with a large group of African American Pastors. Many I know wonderful people! Not a press event. _E_
It doesn't matter that Crooked Hillary has experience look at all of the bad decisions she has made. Bernie said she has bad judgement! _E_
So many people don't understand I am a big proponent of vaccines for children—just not in one massive dose—spread them out over time. _E_
Sorry but @piersmorgan is a good & smart man who is doing really well. That's why he won @ApprenticeNBC. _E_
Wow President Obama's brother Malik just announced that he is voting for me. Was probably treated badly by president like everybody else! _E_
Supporters waiting to hear me speak in Oskaloosa Iowa. #MakeAmericaGreatAgain __HTTP__ _E_
With the record high February gas prices hurting the economy even more reason to start fracking. Will create jobs & lower prices. _E_
MAKE AMERICA GREAT AGAIN! #Trump2016 #VoteTrump __HTTP__ _E_
Congratulations to @marklevinshow on 'The Liberty Amendments' debuting at #1 on the NY Times' bestseller list. Must read! _E_
Leaving now for Texas! _E_
Wow! Such a wonderful article from fantastic people my great honor! __HTTP__ _E_
If you love what you do you are going to work harder you are going to try harder and you will be better at it. Think Big _E_
Trump Int'l Puerto Rico spreads luxury residences a world class golf resort & beach club across 1000 acres __HTTP__ _E_
A great job by @RickieFowlerPGA in winning The Players yesterday. Finally your jealous critics can go to hell! Good luck at The U.S. Open. _E_
The unemployment numbers released later this week will show no job growth. We must start making our own products again. #TimeToGetTough. _E_
Dummy writer @Clare_OC from failing @Forbes magazine works so hard to make such trivial license deals look important... _E_
Jennifer Aniston is engaged she's a great person and I wish her well. _E_
We have all got to come together and win this election. We can't have four more years of Obama (or worse!). _E_
#TrumpAdvice __HTTP__ _E_
Happy to announce I am nominating Alex Azar to be the next HHS Secretary. He will be a star for better healthcare and lower drug prices! _E_
Via @Newsmax_Media: Maher Being Sued by Trump Over Birth Certificate Bet on 'Tonight Show' __HTTP__ _E_
In the plane heading to Iowa State Fair. Will be great fun. Hopefully giving helicopter rides to some of the kids. _E_
.@MannyPacquiao was robbed in his title fight on Saturday night. No wonder boxing is dying. Bring back the 15 round fights. _E_
Good @FLGovScott is suing the Federal Government so he can protect the voter rolls __HTTP__ Florida must be a legal election. _E_
#FlashbackFriday Trump family final week of @Oprah's show @Oprah is terrific! __HTTP__ _E_
I would have had millions of votes more in the primaries (than Crooked Hillary) if I only had one opponent instead of sixteen. Broke record _E_
Whether you love like or hate Donald Trump I will be on Bill O'Reilly (Fox) tonight at 8.00. Bill knows Trump is great for ratings! _E_
Melania and I extend our deepest condolences to the family of Shimon Peres... __HTTP__ _E_
Now every time Islamic militants attack they will use that movie as an excuse __HTTP__ What was the excuse before the movie? _E_
With imposing dunes on the rugged Aberdeenshire coastline @TrumpScotland's Championship Course is a masterpiece __HTTP__ _E_
Al Sharpton said they are even making it more harder to register people to vote . Which is worse his grammar or his thoughts? _E_
Really bad news just announced concerning jobs. Far fewer jobs created in August than anticipated. Interest rates therefore to remain low. _E_
"Donald Trump: 'Karl Rove Is A Total Loser' So Why Are People Still Giving Him Money?" __HTTP__ via @Mediaite _E_
"Attitude is a little thing that makes a big difference." Winston Churchill _E_
Proud to see my friend Governor Chris Christie standing up for Israel on his visit. Standing tall! _E_
Wow because of the pressure put on by me ICE TO LAUNCH LARGE SCALE DEPORTATION RAIDS. It's about time! _E_
Voter fraud! Crooked Hillary Clinton even got the questions to a debate and nobody says a word. Can you imagine if I got the questions? _E_
The Tax Cuts are so large and so meaningful and yet the Fake News is working overtime to follow the lead of their friends the defeated Dems and only demean. This is truly a case where the results will speak for themselves starting very soon. Jobs Jobs Jobs! _E_
... while a 300ft turbine in Ardrossan North Ayrshire erupted in flames the previous month during gales of 165 mph __HTTP__ _E_
A big day for New York and for our COUNTRY! MAKE AMERICA GREAT AGAIN! _E_
Thank you Governor @ScottWalker & @GOP Chairman @Reince Priebus. #MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_
More people attend a @JonHuntsman rally than watch @Lawrence on @MSNBCtv all week. @Lawrence is very lonely. (cont) __HTTP__ _E_
#CelebApprentice what do you think of the choices for project manager? _E_
Weak and totally conflicted people like @TheRickWilson shouldn't be allowed on television unless given an I.Q. test. Dumb as a rock! @CNN _E_
As usual the storm of the century was not nearly as bad as forecast. What a waste of time energy and money! _E_
Via @worldnetdaily by @jerome_corsi: "Donald Trump: Obama's Jobless Figures 'Phony.' Economists agree." __HTTP__ _E_
I can't believe the Yankees continue to pay A Rod they have a perfect right to stop paying (and should have stopped a long time ago). _E_
"The achievements of an organization are the results of the combined effort of each individual." – Vince Lombardi _E_
TRUMP & CLINTON ON IMMIGRATION #Debate #BigLeagueTruth __HTTP__ _E_
"Donald Trump on Jeb Bush: 'The last thing we need is another Bush'" __HTTP__ via @fox5newsdc by @EmilyMiller _E_
ObamaCare has brought skyrocketing premium increases & unaffordable deductibles which will lead to less care & job losses. _E_
"Face reality as it is not as it was or as you wish it to be." @jack_welch _E_
SHOCK @BarackObama's people are sending paid political organizers to heckle at @MittRomney events __HTTP__ _E_
National Review Online: Kristin Davis's Libertarian 'Tough Love' __HTTP__ _E_
Congratulations to @bobmcdonnell on leading Virginia to be in the black for a 3rd straight year. He is a fantastic governor. _E_
When will our nation's sacrifices be respectfully appreciated? Iraq and Libya should reimburse us in oil. _E_
.@Lord_Sugar How did you enjoy Mar a Lago? It was nice having you there my people thought you were terrific! _E_
I remember when the Apprentice became the number one show on T.V. @tombrokow came up to me and thanked me on behalf of NBC (Yankee Stadium) _E_
Jusr watched #HarveyPitt on @TeamCavuto he was great! _E_
Via @Hometownlife: Donald Trump to speak at Lincoln Day Dinner at The Showplace in Novi __HTTP__ _E_
Selfishness ultimately begets only unhappiness. Unselfishness begets happiness. B.C. Forbes _E_
Enjoying the Olympics. Great coverage by @NBC as well. GO TEAM USA! _E_
.@MarthaRaddatz was so unprofessional and biased when discussing me on This Week. @GStephanopoulos should not allow this conduct! _E_
REPEAL AND REPLACE OBAMACARE! _E_
Clinton camp fumed when surrogate told supporters Clinton planned to betray labor on TPP post election: __HTTP__ _E_
After decades of lies and scandal Crooked Hillary's corruption is closing in. #DrainTheSwamp! __HTTP__ _E_
I cannot believe how bad Jeb Bush looks with his insane answer on Iraq and then his numerous corrections which made him look even worse. _E_
Home of @PGATOUR's @CadillacChamp @TrumpDoral represents all that is Miami: energy glamour innovation & luxury __HTTP__ _E_
Drew Peterson a real sleaze just convicted of killing wife. Change the law so he gets death penalty. _E_
Saudi Arabia should fight their own wars which they won't or pay us an absolute fortune to protect them and their great wealth $ trillion! _E_
Loved doing the debate...won Drudge and all on line polls! Amazing evening moderators did an outstanding job. _E_
RT @mike_pence: We are heading to Virginia. Looking forward to supporting my friend @EdWGillespie. He will make a great Governor for the Co... _E_
.@GovernorSununu who couldn't get elected dog catcher in NH forgot to mention my phenomenal biz success rate: 99.2% __HTTP__ _E_
Alison Grimes supports harsh restrictions to kill coal industry & supports Obama's anti gun legislation. Vote @Team_Mitch! _E_
A country that Crooked Hillary says has funded ISIS also gave Wild Bill $1 million for his birthday? SO CORRUPT! __HTTP__ _E_
I'll be appearing on Larry King Live for his final show Thursday night at 9 p.m. CNN. Larry's been on TV for 25 years... _E_
Hillary Clinton's Presidency would be catastrophic forthe future of our country. She is ill fit with bad judgment. _E_
The Generals and top military brass never wanted a mixer but were forced to do it by very dumb politicians who wanted to be politically C! _E_
'Clinton Campaign Tried to Limit Damage From Classified Info on Email Server' #DrainTheSwamp __HTTP__ _E_
Beautiful evening with Religious Leaders here at the WH last night. Join us now for a #NationalDayofPrayer LIVE:... __HTTP__ _E_
I have brought millions of people into the Republican Party while the Dems are going down. Establishment wants to kill this movement! _E_
#TBT With Darrell Hammond when I hosted SNL. __HTTP__ _E_
I watched lightweight Senator Marco Rubio who is all talk and no action defend his WEAK position on illegal immigration. Pathetic! _E_
Remember get out on November 8th & VOTE #TrumpPence16. It is time to #DrainTheSwamp this is our last chance! __HTTP__ _E_
The polling numbers for 2012 are very interesting will Americans ultimately want their leaders to be 'likeable' or 'competent'? _E_
A real president should take pride in saving and spending your money wisely not funneling it to his cronies (cont) __HTTP__ _E_
.@HillaryClinton and Obama policies increased debt by $9trillion over the last 8 years _E_
RT @Scavino45: U.S. MARKETS FROM ELECTION DAY {Since 11/8/2016} 📈 __HTTP__ _E_
DELUSIONAL Obama actually thought that he won the debate __HTTP__ What is he thinking? _E_
Congratulations to my friend @RoccoMediate on winning the big golf tournament today! _E_
In any business venture remember that branding is one of the most crucial aspects of your enterprise. Fight hard for that brand of yours. _E_
I endorsed @MittRomney not because I agree with him on every issue but because he will get tough with China. _E_
RT @EricTrump: I look forward to being on @CNN with @ErinBurnett at 7:40pmET. @realDonaldTrump _E_
.@SenTedCruz had a very good debate far better than Rand Paul. _E_
Congratulations to Karen Handel on her big win in Georgia 6th. Fantastic job we are all very proud of you! _E_
The @WTA released a new #StrongisBeautiful celebrity campaign today. Amazing athletes. Proud to be a part of this. __HTTP__ _E_
Beyond simple justice and beyond reducing our national debt another advanage of taking the oil is that it (cont) __HTTP__ _E_
Premiering Jan. 4th the record 14th season's @ApprenticeNBC cast is the nastiest yet __HTTP__ Major Boardroom fireworks! _E_
For those asking my son @EricTrump makes zero $$ running his charity & raises a great deal of $$ all of it for @StJude @EricTrumpFdn _E_
Excited to be speaking at @frankgaffney's @securefreedom Iowa National Security Action Summit tomorrow at 1:30PM! __HTTP__ _E_
I am leaving for Norfolk Virginia the great battleship U.S.S. Wisconsin for a big rally and really big crowd. See you soon! _E_
.@Playboy Playmate of the Year @BrandenRoderick returns to the 13th season of All Star @CelebApprentice she is smart & beautiful. _E_
.@Disney's acquisition of Lucas Film is a smart deal for both sides. Disney just bought a great brand which will keep producing revenue. _E_
Now Obama has set red line 2 with demand that Assad hands over Syria's chemical weapons or it will face an attack. _E_
Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_
RT @EPAScottPruitt: Thoughts and prayers for those in Texas & Louisiana. I am closely monitoring #Harvey developments along with @fema & @E... _E_
Thank you Senator @ChuckGrassley! #TrumpPence16 __HTTP__ _E_
Wind Power Company Fined $1 Million for Killing Birds. Golden eagles among victims... __HTTP__ @RSPBScotland @Natures_Voice _E_
I am very proud to have brought the subject of illegal immigration back into the discussion. Such a big problem for our country I will solve _E_
Voters understand that Crooked Hillary's negative ads are not true just like her email lies and her other fraudulent activity. _E_
"Statement by President Trump on the Apprehension of Mustafa al Imam for His Alleged Role in Benghazi Attacks" __HTTP__ _E_
Entrepreneurs Always remember that every day counts. Stay focused. Stay positive and develop momentum. _E_
Ratings for #MissUniverse pageant were highest in 4 years. @NBC likes me (and I like them!) _E_
Stop calling my office to do your show I have more important things to do with my time nobody's watching you! @lawrence _E_
My thoughts on last night's Celebrity Apprentice __HTTP__ as well as my latest video blog at __HTTP__ _E_
RT @TeamTrump: .@timkaine has a pay to play problem just like Crooked @HillaryClinton #VPDebates #BigLeagueTruth __HTTP__ _E_
Karl Rove's stupid ad made Ashley Judd hot—now everybody is talking about her. _E_
China has done great under Obama. Increased private US holdings by 500%. Hacks our military & R&D. Robs us blind daily.#timetogettough _E_
When little Morty Zuckerman closes his failing @NYDailyNews will I at least be given some credit? Will happen soon. _E_
Why does Obama believe he shouldn't comply with record releases that his predecessors did of their own volition? Hiding something? _E_
Via The Washington Times Mr. Trump buzzes the presidential radar __HTTP__ _E_
.@AlexSalmond Wind turbines are ripping your country apart and killing tourism.Electric bills in Scotland are skyrocketing stop the madness _E_
. @BarbaraJWalters made a great decision in firing @JoyVBehar from @theviewtv. The show will be better without her! _E_
RT @foxandfriends: .@DonaldJTrumpJr: Trump has had a lot more responsibility to deal with than any of the other GOP candidates __HTTP__ _E_
Thank you Nicole! __HTTP__ _E_
Thank you Reno Nevada. NOTHING will stop us in our quest to MAKE AMERICA SAFE AND GREAT AGAIN! #AmericaFirst... __HTTP__ _E_
#TrumpAdvice __HTTP__ _E_
Wow they are really killing Jay Leno let him go out with dignity! _E_
Will be doing @foxandfriends this morning at 7:00. ENJOY! _E_
The newly built Blue Monster at Trump National Doral is being considered a masterpiece by almost all who see it and play it THANK YOU! _E_
Crooked Hillary is flooding the airwaves with false and misleading ads all paid for by her bosses on Wall Street. Media is protecting her! _E_
The organized group of people many of them thugs who shut down our First Amendment rights in Chicago have totally energized America! _E_
Five Star @TrumpCondosLV are the most luxurious & elite residences in the Vegas market __HTTP__ "If you love it own it" _E_
Just returned from New Hampshire where the crowd was great and got a beautiful standing ovation! Wonderful people who truly love the U.S.A. _E_
Flags to be flown at Half Staff at all Trump Properties in Honor of the Five Fallen Soldiers __HTTP__ _E_
... and in my opinion should not be doing The Apprentice. _E_
RT @DarrenJJordan: CONSTRUCTIVE WINS! 💪 @realDonaldTrump @CLewandowski_ @DanScavino @MichaelCohen212 @KatrinaPierson @DefendingtheUSA __HTTP__ _E_
Jeb Bush who did poorly last night in the debate and whose chances of winning are zero just got Graham endorsement. Graham quit at O. _E_
Monitoring the terrible situation in Florida. Just spoke to Governor Scott. Thoughts and prayers for all. Stay safe! _E_
A good friend: @SarahPalinUSA. More importantly she is a tremendous voice for policies that would put America on (cont) __HTTP__ _E_
People Magazine: Donald Trump Was Right: He Gave SNL Its Best Ratings in Nearly 4 Years Plus What You Didn't See __HTTP__ _E_
Living in denial only 15% of Democrats think that recent economic news is poor __HTTP__ _E_
From Fox and Friends interview: Trump: We should not go back to Iraq __HTTP__ _E_
.@RepChrisCollins Chris thank you so much for your wonderful endorsement. I will not let you down! @CNN _E_
Republican Senators are working very hard to get Tax Cuts and Tax Reform approved. Hopefully it will not be long and they do not want to disappoint the American public! _E_
The Debate @BarackObama's mic and my Endorsement in today's #trumpvlog __HTTP__ _E_
No more massive injections. Tiny children are not horses—one vaccine at a time over time. _E_
The most elite private club in the world Mar a Lago is Palm Beach's legendary landmark. __HTTP__ _E_
"The Conservative does not despise government. He despises tyranny. @marklevinshow _E_
.@pennjillette doesn't like @StephenBaldwin7's cliché line and Stephen says Penn creeps him out. Do we sense conflict yet? #CelebApprentice _E_
Just sit back and watch ObamaCare is such a disaster it will fall like a house of broken cards. The website is the best part of this mess! _E_
"No person who is enthusiastic about his work has anything to fear from life." – Samuel Goldwyn _E_
America's top Army general has warned of a crisis unless sexual abuse in the military is quickly brought undet control.Forces greatly hurt! _E_
I will be on @FoxNewsSunday with Chris Wallace this morning. Enjoy! _E_
Beautiful thank you. __HTTP__ _E_
I'm looking forward to seeing you all this afternoon at Macy's Herald Square. 5:30 pm at the Crystal department on 8. _E_
Order signed copy of CRIPPLED AMERICA & submit a question for my live streaming book signing on 12/3 at 7:30 pm. __HTTP__ _E_
Failed presidential candidate Lindsey Graham should respect me. I destroyed his run brought him from 7% to 0% when he got out. Now nasty! _E_
The ones who are crazy enough to think that they can change the world are the ones who do. Steve Jobs _E_
He's hired! Listen to my #Apprentice Andy launchhis radio show @AmericaNowRadio with me tomorrow 6PM ET __HTTP__ _E_
Via @NRO:"Trump @KarlRove 'Most Overrated Man in Politics'Responsible for Ashley Judd's Rise" __HTTP__ @elianayjohnson _E_
Just leaving Knoxville TN what a crowd what amazing people! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_
Reckless! Why is @BarackObama wasting over $70 Billion on 'climate change activities?' Will he ever learn? __HTTP__ _E_
Celebrity Apprentice on tonight CNBC at 9 _E_
Whenever one of the morons say I wear a wig stop reading because they have no credibility & just hate. _E_
Four brave Americans died in Benghazi. Administration is still covering up the truth. We deserve to know the full truth. _E_
The NFL should have its non profit status immediately revoked while at the same time ending the giant tax scam which makes teams so valuable _E_
The people who support Hillary sit behind CNN anchor chairs or headline fundraisers those disconnected from real life. _E_
"Winners see problems as just another way to prove themselves." – Think Like a Champion _E_
Things are going really well for our economy a subject the Fake News spends as little time as possible discussing! Stock Market hit another RECORD HIGH unemployment is now at a 17 year low and companies are coming back into the USA. Really good news and much more to come! _E_
Guess who is talking to @MissUniverse at @TrumpTowerNY? Not terrible hair! __HTTP__ _E_
Much of the money I have raised for our veterans has already been distributed with the rest to go shortly to various other veteran groups. _E_
caught he cried like a baby and begged for forgiveness...and now he is judge & jury. He should be the one who is investigated for his acts. _E_
Here's my message to @BarackObama: America is a capitalistic country. Get over it and get on with it! #TimeToGetTough _E_
Mr. President take your campaign of division and anger and hate back to Chicago. @MittRomney _E_
The 2013 MISS UNIVERSE® Pageantwill take place in Russia for the very first time in the 62 year history of the contest. _E_
A day after @BarackObama released a trillion dollar budget deficit he is hosting China's future leader VP XiJinping. America's new reality. _E_
I developed the Wollman Rink under budget and in record time __HTTP__ If I hadn't gotten involved it would still be unused. _E_
My @foxandfriends interview discussing @BarackObama's reckless spending the Buffet Tax gimmick and #CelebApprentice __HTTP__ _E_
Great to meet everyone while having breakfast @ChezVachon this morning! #FITN #VoteTrumpNH __HTTP__ __HTTP__ _E_
.@RobertGBeckel Please thank your brother for his nice words on television. Seems like a great guy and character! @CNN _E_
RT @DonaldJTrumpJr: Thank you Elko County Nevada. So much amazing feedback from my forum today I really appreciate it #trump2016 #ICYMI ht... _E_
People have been asking to hear my Howard Stern interview—you can access it on @HowardTV. _E_
I am extremely pleased to see that @CNN has finally been exposed as #FakeNews and garbage journalism. It's about time! _E_
RT @EricTrump: Please stay safe #Florida! You are in our thoughts and we are praying for you! __HTTP__ _E_
Tune in to see me on @ThisWeekABC with @GStephanopoulos at 10am ET. Enjoy! _E_
Going to Charleston South Carolina in order to spend time with Boeing and talk jobs! Look forward to it. _E_
Obama's war on women. "Number of Unemployed Women Increased in July by 227000" __HTTP__ _E_
The ObamaCare website will cost over $1.5B when all is said and done. Crazy! _E_
Massive combined inoculations to small children is the cause for big increase in autism.... _E_
Don't forget to tune in tonight to see another unpredictable and exciting episode of The Apprentice 10 pm on NBC _E_
Getting ready to go on @KellyandMichael two great people! _E_
.@chelseahandler—stop trying to get your hotelier boyfriend back—a lost cause—he can do much better! _E_
Order signed copy of CRIPPLED AMERICA & have opportunity to submit question for my live streaming book signing 12/3 __HTTP__ _E_
Thank you @AnnCoulter for your nice words. The U.S. is becoming a dumping ground for the world. Pols don't get it. Make America Great Again! _E_
Tremendous pressure on President Obama to institute a travel ban on Ebola stricken West Africa. At some point this stubborn dope will fold! _E_
Low energy Jeb Bush just endorsed a man he truly hates Lyin' Ted Cruz. Honestly I can't blame Jeb in that I drove him into oblivion! _E_
Big storm in New Hampshire. Moved my event to Monday. Will be there next four days. _E_
Team Trump with the recipients of our donations in the Rockaways. #Sandy __HTTP__ _E_
Both Barack and @MittRomney were excellent at the Al Smith dinner last night! _E_
Congrats to Barack Obama on April's job report. Over 800000 left the work force w/average hourly wages & weekly hours staying flat. Bad! _E_
I'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_
To every action there is always opposed an equal reaction. Isaac Newton _E_
Via @nypost by Editorial Board: "New York's mute @AGSchneiderman" __HTTP__ Schneiderman is feckless and corrupt. _E_
Just leaving Miami for Houston Oklahoma and Colorado. Miami crowd was fantastic! _E_
Obama's China 'climate' deal binds America with language of 'will' curb emissions now while China only 'intends' to curb in 2030. Bad deal! _E_
Aside from having no ratings sleazy Ed Schultz lied about what I said. Thank you Scott Whitlock @ScottJW __HTTP__ _E_
Will be in Louisiana for the Miss USA Pageant which will be on NBC on Sunday night. Watch Miss Pennsylvaniaan interesting and amazing story _E_
We're spending a fortune looking for the lost plane with mostly Chinese passengers and that's OK but how much are Russia & China spending? _E_
Remain open to new ideas. That's where innovation comes from. _E_
Goofy Elizabeth Warren lied when she says I want to abolish the Federal Minimum Wage. See media—asking for increase! _E_
The documentary of me that @CNN just aired is a total waste of time. I don't even know many of the people who spoke about me. A joke! _E_
Ask yourself: What can I learn today that I didn't know before? Always be a student always be open to new ideas. _E_
People buy deals & immediately put them into bankruptcy in order to make better deals.. _E_
To every PATRIOT who will serve on the #USSGeraldRFord:Keep the watchProtect herDefend herLOVE HERGood Luck & Godspeed! __HTTP__ _E_
Via @StarsEntLive by Nick Ricko: "@kevinjonas @IanZiering In Celebrity @ApprenticeNBC First Look" __HTTP__ _E_
Newly released NH poll has @MittRomney with a 1 point lead. Mitt will pull away next week. _E_
Re Florida Power & Light—Most important is safety but they have to also cater to aesthetics & not ruin the beauty of Florida. _E_
I will be re tweeting some of your better most imaginative and hopefully insightful tweets. Make them good (great)! Important stuff. _E_
Tomorrow we'll be going to Panama for the opening of our new hotel. It's a fantastic building in a fantastic location. __HTTP__ _E_
Iran hides behind its assertion of technical compliance w/the nuclear deal while it brazenly violates the other limits.. Amb. @NikkiHaley __HTTP__ _E_
"Failed show @DannyZuker" I have never heard of you and was told you are a loser after reading your credits I have no questions about it! _E_
I know a great deal about websites etc. but I am unable to understand how our government spent $635 million on the ObamaCare site & disaster _E_
Another nasty season premieres Sunday March 3rd at 9/8c on NBC! __HTTP__ _E_
New Virginia poll thank you! We are going to show the whole world that America is back – BIGGER and BETTER and S... __HTTP__ _E_
A study says @Autism is out of control a 78% increase in 10 years. Stop giving monstrous combined vaccinations (cont) __HTTP__ _E_
Great job @EricTrump! Proud of you! #AmericaFirst #RNCinCLE __HTTP__ __HTTP__ _E_
"@TurnberryBuzz the jewel in Donald Trump golfing crown" __HTTP__ via @TheScotsman by @DempsterMartin _E_
Wow tremendous victory in the Trump University case against lightweight @AGSchneiderman just got the news! _E_
Casting sometimes is fate and destiny more than skill and talent from a director's point of view. Steven Spielberg _E_
So China is ordering us to raise the Debt Limit...How low have we as a nation sunk? _E_
We must bring the truth directly to hard working Americans who want to take our country back. #BigLeagueTruth... __HTTP__ _E_
As I have been saying Crooked Hillary will approve the job killing TPP after the election despite her statements to the contrary: top adv. _E_
Omarosa is very confident that the execs loved her concept & presentation. _E_
Looking forward to returning to the Hawkeye state this Saturday to support my friend and strong Conservative @SteveKingIA! _E_
Heading to U.S. Bank Arena in Cincinnati Ohio for a 7pm rally. Join me! Tickets: __HTTP__ _E_
Via @WPOffshore: "Donald Trump's Blackdog victory" __HTTP__ _E_
Today it was an honor to have @UNSecretary General @AntonioGuterres at the @WhiteHouse. Speaking for the U.S.A. we appreciate all you do! __HTTP__ _E_
Entrepreneurs: Review your work habits and make sure they are taking you in the right direction. Don't become complacent! _E_
The United States will be immediately implementing much tougher Extreme Vetting Procedures. The safety of our citizens comes first! _E_
Thank you! #MakeAmericaGreatAgain __HTTP__ _E_
It was my great honor to defend @dennisrodman on @ApprenticeNBC last night—he has come a long way and for the good! _E_
Via @RedState by @EWErickson: "Always Play On Offense" __HTTP__ _E_
My speech to @PressClubDC on Tuesday at the #NPCLunch on the topic of building a business brand via @cspan __HTTP__ _E_
In 2008 @BarackObama warned that electricity rates will necessarily skyrocket during his term. Mission Accomplished! __HTTP__ _E_
No surprise. Woman being cited by Kerry & McCain on Syrian rebels is a paid consultant of the rebels __HTTP__ _E_
.@MittRomney needs to make @BarackObama regret that he ever asked for his tax records. _E_
Little @MacMiller—I don't need your praise __HTTP__ just pay me the money you owe. _E_
Great meeting @GarySinise at @AmSpec dinner. Besides his great acting Gary does tremendous work for vets through his foundation. _E_
Join me Tuesday Nov. 3rd at 12 PM in #TrumpTower in NYC. I'll be signing copies of my book CRIPPLED AMERICA. Don't miss it! _E_
I'll be in Iowa tonight making a speech to a record setting crowd. The word is getting out MAKE AMERICA GREAT AGAIN! _E_
Tune in tonight to Greta van Susteren's show On the Record which airs on Fox News at 9 p.m. _E_
The networks are all driving me crazy to do television shows—"a ratings machine"—but because of Apprentice have been loyal to NBC. _E_
RT @realDonaldTrump: I as President want people coming into our Country who are going to help us become strong and great again people co... _E_
.@PamelaGeller is a total whack job who doesn't have a clue. Don't provoke the enemy go get them and make them pay. No signals just do it! _E_
Ivanka Trump defends her dad __HTTP__ via @politico _E_
He ruins the brand: @bobbeckel doesn't belong on @FoxNews. As CM for Mondale in '84 you lost 49 states. Sad! _E_
FM @AlexSalmond of Scotland spent more than $750000 of taxpayers $ to visit Ryder Cup in Chicago peanuts compared to his windmill folly. _E_
Phyllis Schlafly: Trump is 'last hope for America' __HTTP__ __HTTP__ _E_
The #MissUniverse women totally blow away the Victoria's Secret women! _E_
No @DannyZuker it's making you crazy because you don't have the guts to play the game. Come on Danny you can do it! _E_
Even Bill is tired of the lies SAD! __HTTP__ _E_
An honor to host President Mahmoud Abbas at the WH today. Hopefully something terrific could come out it between th... __HTTP__ _E_
Based on very popular demand I will be live tweeting tomorrow night during the Presidential debate. _E_
Miss Florida was great in her denial of Miss Pennsylvania's phoney statements. She blows Miss Pennsylvania away a different league . _E_
Congratulations to @tedcruz on his Texas primary victory last night. He will be an outstanding Senator. _E_
Just remember the birther movement was started by Hillary Clinton in 2008. She was all in! _E_
Shooting deaths of police officers up 78% this year. We must restore law and order and protect our great law enforcement officers! _E_
We pray for our fallen heroes who died while serving our country in the @USNavy aboard the #USSJohnSMcCain and their families. __HTTP__ _E_
It's not whether you get knocked down it's whether you get up. Vince Lombardi _E_
The premiere of Donald J. Trump's Fabulous World of Golf is tomorrow night at 9 p.m.ET on Golf Channel. Tune in for a great adventure! _E_
ObamaCare website fiasco was a SINGLE bid to a Canadian company terrible! _E_
Look what is happening to our country under the WEAK leadership of Obama and people like Crooked Hillary Clinton. We are a divided nation! _E_
All he does is go on television is talk talk talk but incapable of doing anything. _E_
If Justice Roberts had made the correct decision on ObamaCare our country would not be in turmoil right now! _E_
I will be in South Carolina all week. Saturday is BIG BIG BIG! Get out and vote MAKE AMERICA GREAT AGAIN _E_
Either Miss Pennsylvania will pay her father will pay or her lawyers will pay. She hurt many people! _E_
Via @BreitbartNews by @IanHanchett: "Trump: Obama 'Treats Our Known Enemies Much Better' Than Israel" __HTTP__ _E_
Via @Mediaite by forza_desiderio: "Donald Trump Blasts Obama on Ebola: Why Are You Sending Troops?" __HTTP__ _E_
Fun fact for my 2M+ followers the 'Architect' Karl Rove blew $400M in the 2012 election with a success rate of 1.6%. _E_
Thank you for a great day yesterday Rhode Island! #VoteTrump __HTTP__ _E_
Will be interviewed on @seanhannity tonight at 10pmE. Enjoy! #INPrimary _E_
The residential real estate market continues to provide opportunities for first time home owners. Buy now if you can! _E_
#CrookedHillary is not fit to be our next president! #TrumpPence16 __HTTP__ _E_
Just bought the Kluge Estate in Charlottesville Virginia (don't worry only business). See Washington Post article __HTTP__ _E_
Congratulations to Michael Jordan on his marriage over the weekend. _E_
RT @EricTrump: Debate ready!!! @realDonaldTrump #MakeAmericaGreatAgain #TrumpTrain __HTTP__ _E_
A big contingent of very enthusiastic Roy Moore fans at the rally last night. We can't have a Pelosi/Schumer Liberal Democrat Jones in that important Alabama Senate seat. Need your vote to Make America Great Again! Jones will always vote against what we must do for our Country. _E_
Watch Obama refuse to call Benghazi a terrorist attack on 9.12 __HTTP__ What took @CBS so long to release this footage? _E_
Life brings you many surprises. As a child I used to vacation with my family at the Doral in Miami. Now I own it. __HTTP__ _E_
Trump: Weiner a 'Sick Puppy' That NYC Doesn't Need __HTTP__ via @Newsmax_Media _E_
Thank you! __HTTP__ _E_
The Cruz Kasich pact is under great strain. This joke of a deal is falling apart not being honored and almost dead. Very dumb! _E_
The CBO has confirmed that @BarackObama's stimulus crowds out private investment while not creating any jobs. __HTTP__ _E_
My @FoxNews int. with @seanhannity on Obama being all talk & no action & making America Great Again! __HTTP__ _E_
We need a dealmaker in the White House who knows how to think innovatively and make smart (cont) __HTTP__ _E_
Crooked Hillary Clinton is the worst (and biggest) loser of all time. She just can't stop which is so good for the Republican Party. Hillary get on with your life and give it another try in three years! _E_
DAMAC Properties @DamacOfficial @realDonaldTrump Looking forward to welcoming you to Dubai! Have a great trip! Thank you! _E_
RT @FoxNews: Poll: @realDonaldTrump vs. @HillaryClinton among white Evangelicals. __HTTP__ _E_
In today's #trumpvlog I speak about the chopper recently made for me by @occhoppers.... __HTTP__ #CelebApprentice _E_
Watch @Seanhannity tonight on his show Hannity Fox News at 9 pm. I'll be on and we'll cover the Wall Stree... (cont) __HTTP__ _E_
Join us in Iowa tomorrow! #IACaucus #Trump2016 #MakeAmericaGreatAgain 3:00pm: __HTTP__ 7:30pm: __HTTP__ _E_
I will be live tweeting during the debate tonight. _E_
I have an idea for @JebBush whose campaign is a disaster. Try using your last name & don't be ashamed of it! _E_
How long did it take for Obama to call Hugo Chavez and congratulate him on his 'reelection?' Who do you think Chavez supports in ours? _E_
If you have a speech one that would put Winston Churchill to shame liberals would find a way to make it sound terrible! _E_
Congratulations @Trump_Ireland for being named #12 resort in Europe by the @CNTraveler #ReadersChoice2014 awards! _E_
I will be on @FoxNews live with members of my family at 11:50 P.M. We will ring in the New Year together! MAKE AMERICA GREAT AGAIN! _E_
Happy to hear that @ralphreed's Faith and Freedom chapters are at the @RNC convention supporting @MittRomney. We must be united to win! _E_
I am reading that the great border WALL will cost more than the government originally thought but I have not gotten involved in the..... _E_
The top Leadership and Investigators of the FBI and the Justice Department have politicized the sacred investigative process in favor of Democrats and against Republicans something which would have been unthinkable just a short time ago. Rank & File are great people! _E_
.@FLOTUS Melania and I were honored to stop by the Women's Empowerment Panel this afternoon at the @WhiteHouse.... __HTTP__ _E_
I'm looking forward to the Super Bowl but looking even more forward to Monday night at 8:00 best episode EVER of Celebrity Apprentice! _E_
A top Clinton Foundation official said he could name "500 different examples" of conflicts of interest. __HTTP__ _E_
After @TrumpTurnberry I will be visiting Aberdeen the oil capital of Europe to see my great club @TrumpScotland. _E_
All eyes are on @TigerWoods @The_Masters. He's in good position! _E_
Ellen is sadly having a hard time with her lines. #Oscars _E_
Thanks to ObamaCare's device tax Boston Scientific plans to cut 1500 jobs __HTTP__ ObamaCare will kill ingenuity. _E_
Jeb Bush has a photoshopped photo for an ad which gives him a black left hand and much different looking body. Jeb just can't get it right! _E_
New CBS poll. #Trump2016 __HTTP__ _E_
This has been a very difficult decision regarding the Presidential run and I want to thank all my twitter fans for your fantastic support. _E_
Tom Brokaw keeps calling Mitt Romney George (Mitt's father). Sadly time is up for Tom. _E_
.@AnnCoulter U were great last nite @ericbolling on FOX. Our country has become a dumping ground for the world I'll get it to stop & fast! _E_
The mind that opens to a new idea never comes back to its original size. Einstein _E_
A great honor to sign the Veterans Appeals Improvement & Modernization Act into law w/ @AmericanLegion @SecShulkin. __HTTP__ __HTTP__ _E_
The real estate market in Vietnam is booming. Growth is everywhere in the world except for the US. _E_
RT @FoxNews: .@davidwebbshow: Let's look at the calendar. It's January 20th. DACA expires on March 5th. That means this was a construct of... _E_
Thank you Ohio. Together we will MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_
Trump's National Lead Increases to 35.6% Going into the Third GOP Debate it's Trump Carson and Rubio __HTTP__ ... _E_
I predicted the 9/11 attack on America in my book The America We Deserve and the collapse of Iraq in @TimeToGetTough. _E_
Thanks @piersmorgan! Trump is the most unpredictable extraordinary entertaining&massively popular candidate this country has ever seen. _E_
#MayThe4thBeWithYou here is when Darth Vader and I did some firing __HTTP__ _E_
I would invite Edward Snowden to be a judge at the Miss Universe Pageant in Moscow but would be concerned that he would sell results early! _E_
The coolest story is that John Beale the man who headed up CLIMATE CHANGE for the government is a proven con man and total phoney.ARRESTED _E_
Great move on delay (by V. Putin) I always knew he was very smart! _E_
Via The Hindu @businessline: Realty brand Donald Trump's India venture to sport desi tag __HTTP__ _E_
Jeffrey Lord former Reagan adviser has endorsed the Newsmax @iontv debate with a great article __HTTP__ _E_
We will immediately repeal and replace ObamaCare and nobody can do that like me. We will save $'s and have much better healthcare! _E_
Camp David is a very special place. An honor to have spent the weekend there. Military runs it so well and are so proud of what they do! _E_
So proud of NASCAR and its supporters and fans. They won't put up with disrespecting our Country or our Flag they said it loud and clear! _E_
Dopey Sugar @Lord_Sugar You should thank me for having created the platform on which you became known The Apprentice. Say Thank you Donald _E_
China is heavily investing in building its own jet engine __HTTP__ They will end up stealing the design from us as usual. _E_
'How Trump won over a bar full of undecideds and Democrats' __HTTP__ _E_
If @Barack Obama is really concerned about carbon emissions and air pollution then maybe he should have (cont) __HTTP__ _E_
I am starting to think that there is something seriously wrong with President Obama's mental health. Why won't he stop the flights. Psycho! _E_
Just out: Boston Herald/Franklin Pierce Poll N.H. TRUMP 28 (up 10) CARSON 16 BUSH 9 RUBIO 6 CRUZ 5 Press will say they are surging! _E_
Crooked Hillary has once again been proven to be a person who is dishonest incompetent and of very bad judgement. _E_
...and did not want to rock the boat. He didn't choke he colluded or obstructed and it did the Dems and Crooked Hillary no good. _E_
Americans nationwide have their premiums double and work hours decreased. @GOP must do the right thing stand strong & defund! _E_
It's Thursday. Which brand of eyeliner is the nation's worst AG @AGSchneiderman wearing today? _E_
.@Deadspin's disgusting response will teach me & others not to be nice anymore—a sad lesson. _E_
Feeling sorry for yourself is not only a waste of energy but the worst habit you could possibly have. Dale Carnegie _E_
Via @scotsmandotcom: "Donald Trump hires top lawyer for wind farm battle" __HTTP__ _E_
Sweat equity is the most valuable equity there is. Know your business and industry better than anyone else in the world. @mcuban _E_
If Chicago doesn't fix the horrible carnage going on 228 shootings in 2017 with 42 killings (up 24% from 2016) I will send in the Feds! _E_
"Being true to yourself and your work is an asset. Remember that assets are worth protecting." – Think Like a Champion _E_
Anyone who wants strong borders and good trade deals for the US should boycott @Univision. _E_
Entrepreneurs: Set the example and you'll be a magnet for the right people. That's the best way to work with people you like. _E_
Thank you NH! We will end illegal immigration stop the drugs deport all criminal aliens&save American lives! Watc... __HTTP__ _E_
Do you think Iran would have acted so tough if they were Russian sailors? Our country was humiliated. _E_
It is so pathetic that the Dems have still not approved my full Cabinet. _E_
New Sugar deal negotiated with Mexico is a very good one for both Mexico and the U.S. Had no deal for many years which hurt U.S. badly. _E_
Goofy Elizabeth Warren is weak and ineffective. Does nothing. All talk no action maybe her Native American name? _E_
.@ABFAlecBaldwin They were rising in the 1950's then went back down they will go up and down through eternity. _E_
.@JebBush is a low energy stiff who should focus his special interest money on the many people ahead of him in the polls. Has no chance! _E_
A record 1.2 million Americans have left the job force during @BarackObama's recovery __HTTP__ Don't trust the job numbers. _E_
Pretty even debate no knockouts. However Ryan's closing statement somewhat stronger. What do you think? #VPDebate _E_
"Real estate is at the core of almost every business and it's certainly at the core of most people's wealth." – Think Like a Billionaire _E_
Back by popular demand the fabulous @LilJon returns to the record setting 13th season of All Star @CelebApprentice. The fans love him! _E_
Re @TWC TimeWarner I am going to be switching many of my buildings to another service—this is ridiculous! _E_
"Success breeds success. The best way to impress people is through results." – Think Like a Billionaire _E_
#trumpvlog The Republicans must defeat @BarackObama not themselves..... __HTTP__ _E_
If you want more you have to require more from yourself. Dr. Phil McGraw _E_
20000📈21000📈22000📈23000📈this year...FOUR one thousand milestones this year... #Dow23K #MAGA __HTTP__ _E_
So much for Hope and Change. @BarackObama has already spent over $100M on attack ads across the swing states __HTTP__ _E_
Glad to hear that @JimTalent has put some strong anti China referendums in the @GOP convention platform. _E_
Sexual pervert & deviant Anthony Weiner is polling to see if he can run for NYC Mayor... _E_
"To state the obvious if any business operated the way the government does it would go under." #TimeToGetTough _E_
#MakeAmericaGreatAgain #Trump2016LIFE CHANGING EXPERIENCEVideo: __HTTP__ __HTTP__ _E_
On November 9th @MissUniverse comes to Moscow! Hosted by the wonderful duo of @OfficialMelB & @ThomasARoberts in Crocus City Hall! _E_
I guess Obama's Cairo Speech really worked out. The Muslim Brotherhood stormed our embassy on 9.11. Imagine if Obama speaks in Beijing? _E_
.@daveweigel of the Washington Post just admitted that his picture was a FAKE (fraud?) showing an almost empty arena last night for my speech in Pensacola when in fact he knew the arena was packed (as shown also on T.V.). FAKE NEWS he should be fired. _E_
REMEMBER the terrible 5 for 1 trade whereby the Taliban got back leaders (killers) and we got back a NOTHING WILL COME BACK TO HAUNT U.S.! _E_
I have always liked Ellen done her show numerous times but she was not good last night fumbling and stumbling! _E_
In new Quinnipiac Poll 66% of people feel the economy is "Excellent or Good." That is the highest number ever recorded by this poll. _E_
China has copied our military's F 22 Raptor design __HTTP__ We should offset their theft from our debt. _E_
Officials behind the now discredited Dossier plead the Fifth. Justice Department and/or FBI should immediately release who paid for it. _E_
Great everyone is saying I did much better on @60Minutes last week than President Obama did tonight. I agree! _E_
Median household income is down for the middle class since Obama took office. It will only go further down under Clinton. _E_
Sometimes your best investments are the ones you don't make. The Art of The Deal _E_
Congratulations to Bob Kraft and Coach Bill Belichick for having built an amazing team. @Patriots _E_
"You cannot escape the responsibility of tomorrow by evading it today." – Pres. Abraham Lincoln _E_
Boring & failing @NYMag's 3rd rate political reporter @jheil had flunky @DanAmira write a totally false report about me today...... _E_
My message MAKE AMERICA GREAT AGAIN is beginning to take hold. Bring back our jobs strengthen our military and borders help our VETS! _E_
What do you think Obama will do when Putin seizes Alaska? _E_
Big day in Washington D.C. even though White House & Oval Office are being renovated. Great trade deals coming for American workers! _E_
Via @WTOC11: Donald Trump headlines Tea Party Convention in Myrtle Beach __HTTP__ Looking forward to visiting SC on Monday! _E_
Study your area of business. All business involves risk but risk can be reduced when you learn everything you can about what you're doing. _E_
My @foxandfriends interview re: firing @bretmichaels on the premiere of All Star @ApprenticeNBC & politics __HTTP__ _E_
RT @IvankaTrump: .@realDonaldTrump stock market rally is close to becoming the greatest in 85 years __HTTP__ _E_
Trump at CPAC: 'We Have to Get the Momentum Back' __HTTP__ via @WSJ's @WSJVideo _E_
I bet the dumbest political commentator on television @Lawrence will soon be thrown off the air for poor (cont) __HTTP__ _E_
Tonight's episode of The Apprentice is one of the best ever we're down to the final 3 and it's high excitement all the way. 10 pm on NBC. _E_
"Hook your career to a big trend. There are huge opportunities for profits if you can create big solutions." – Think Big _E_
I call my own shots largely based on an accumulation of data and everyone knows it. Some FAKE NEWS media in order to marginalize lies! _E_
I have recieved and taken calls from many foreign leaders despite what the failing @nytimes said. Russia U.K. China Saudi Arabia Japan _E_
China's stock market rose yesterday after 4 consecutive days of losses __HTTP__ Their market gains the day we are hit by storm _E_
It's Tuesday how much inflation has @BarackObama's spending caused today on the price of food and gas? _E_
Al Qaeda terrorist Al Libi was immediately read his rights & is now being treated for 'pre existing' medical (cont) __HTTP__ _E_
19000 RESPECTING our National Anthem! #StandForOurAnthem __HTTP__ _E_
My twitter followers will soon be over 2 million & all the biggies. It's like having your own newspaper. _E_
Trump International in Dubai will be one of the great projects anywhere in the world. Congratulations to @damacofficial for their genius! _E_
China just landed a jet on an aircraft carrier stolen from a U.S. design. __HTTP__ We should offset the thievery from our debt.. _E_
Looking at Air Force One @ MIA. Why is he campaigning instead of creating jobs & fixing Obamacare? Get back to work for the American people! _E_
Some people dream of great accomplishments while others stay awake and do them! _E_
Entrepreneurs: Do not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_
It is outrageous and disgusting that families of U.S. MILITARY personnel killed in action will not be given money for burials. SAD! _E_
Little respected Club For Growth asked me for $1000000 I said NO . Now they are spending lobbyist and special interest money on ads! _E_
Don't go around saying the world owes you a living. The world owes you nothing. It was here first. Mark Twain _E_
Getting ready to celebrate the 4th of July with a big crowd at the White House. Happy 4th to everyone. Our country will grow and prosper! _E_
Just watching NBC News where our potential attack is being detailed the exact ships the stealth bombers the destinations so ridiculous! _E_
Strive for wholeness and keep your sense of wonder intact. Donald J. Trump __HTTP__ _E_
Trump Int'l Golf Links & Hotel Ireland is on 400 beautiful acres & fronts the Atlantic Ocean for 2.5 miles. Spectac! __HTTP__ _E_
Today I was pleased to announce the official approval of the presidential permit for the #KeystonePipeline. A grea... __HTTP__ _E_
I am watching two clown announcers on @FoxNews as they try to build up failed presidential candidate #LittleMarco. Fox News is in the bag! _E_
Yesterday I signed the #INTERDICTAct (H.R. 2142) with bipartisan members of Congress to help end the flow of drugs into our country. Together we are committed to doing everything we can to combat the deadly scourge of drug addiction and overdose in the United States! __HTTP__ _E_
I support K 9's for Warriors a wonderful organization that trains service dogs for veterans. Please contact __HTTP__ _E_
For Entrepreneurs: A good question to ask yourself –"What can I provide that does not yet exist?" _E_
.@jessebwatters is terrific at hosting on @FoxNews he really gets it! _E_
Mr. Khan who does not know me viciously attacked me from the stage of the DNC and is now all over T.V. doing the same Nice! _E_
Ms. Goldberg & her blowhard lawyer should be ashamed for having brought this frivolous case. They should pay me damages! _E_
English taxpayers should stop subsidizing the destruction of Scotland by paying massive subsidies for ugly wind turbines. _E_
Spoke to a capacity crowd at Horry County Republican event earlier today. __HTTP__ _E_
Donald Trump Reviews Oscars: Django 'Racist' Ceremony 'Boring' Set 'Tacky'... __HTTP__ via @eonline _E_
Democrats are not interested in Border Safety & Security or in the funding and rebuilding of our Military. They are only interested in Obstruction! _E_
Lord grant that I may always desire more than I can accomplish. Michelangelo _E_
Via @AmericanThinker by Malcolm Unwell: "Taking Trump Seriously" __HTTP__ _E_
How far has the United States gone down when we are reduced to accept the imbecilic deal just agreed to with Iran. Read THE ART OF THE DEAL! _E_
Huff Post His early morning speech drew a large crowd far larger than remarks at the same time on Thursday and packed by end! The facts. _E_
I can't believe my friend Derek Jeter is out for whole season injured day he left Trump World Tower. Lucky bldg. Move back fast! _E_
Make sure to watch Celebrity Apprentice tonight at 9 on NBC. A GREAT SHOW JUST LIKE THE MASTERS. 9 _E_
So how did I do on Face The Nation? _E_
So sad that Obama rejected Keystone Pipeline. Thousands of jobs good for the environment no downside! _E_
Thank you! #Trump2016 __HTTP__ _E_
Not one American flag on the massive stage at the Democratic National Convention until people started complaining then a small one. Pathetic _E_
Our great country has been divided for decades. Sometimes you need protest in order to heal & we will heal & be stronger than ever before! _E_
I'll always like @OMAROSA because she constantly defends me. #CelebApprentice _E_
...and now Alex Salmond pushes ugly turbines! _E_
I will be doing the @TodayShow live from New Hampshire at 7am on Monday morning. #TrumpToday _E_
.@garyplayer As a true champion you must have enjoyed how difficult but fair The Blue Monster played last weekend. Gary Player Villa loved! _E_
"Inside Donald Trump's Scottish golf course" __HTTP__ via @TelegraphSport _E_
My @greta int. discussing $25000 gift to USMC Tahmooressi Obama's trip to China & the 2014 election results __HTTP__ _E_
.@newtgingrich just said a historic victory for Trump. NICE! _E_
Keep stimulating your mind with big ideas. Be a collector of big ideas. Constantly fill your mind with new information. Think Big _E_
Congratulations to @RobinRoberts on celebrating 100 days in her bone marrow transplant recovery. Robin is a special person. _E_
.@stuartpstevens horrible advise to Mitt Romney made victory an impossibility. Don't blame Mitt! Now Stevens can't get a job! _E_
Never in U.S.history has anyone lied or defrauded voters like Senator Richard Blumenthal. He told stories about his Vietnam battles and.... _E_
The Democratic Convention has paid ZERO respect to the great police and law enforcement professionals of our country. No recognition SAD! _E_
.@BarackObama should release all his records (like other Presidents).... _E_
The first General killed in a combat zone since Vietnam it is a travesty that Obama did not attend Major General Harold Greene's funeral _E_
Thank you Terre Haute Indiana!#MakeAmericaGreatAgain __HTTP__ _E_
Carl Icahn said this about me: I think at this moment in time he's the only candidate that speaks out about the country's problems. _E_
Individual commitment to a group effort is what makes a team work a company work a society work a civilization work. Vince Lombardi _E_
Must read via @FoxNews by @JaySekulow: "Mr. President: Will you bring home American pastor imprisoned in Iran?" __HTTP__ _E_
Via @Newsmax_Media by @ChrisRuddyNMX: Donald Trump and the End of Free Speech __HTTP__ _E_
.@melaniatrump will be on @theviewtv today at 11am ET discussing @apprenticenbc #celebapprentice & her skin care collection. Tune in! _E_
4.2 million hard working Americans have already received a large Bonus and/or Pay Increase because of our recently Passed Tax Cut & Jobs Bill....and it will only get better! We are far ahead of schedule. _E_
Putting Pelosi/Schumer Liberal Puppet Jones into office in Alabama would hurt our great Republican Agenda of low on taxes tough on crime strong on military and borders...& so much more. Look at your 401 k's since Election. Highest Stock Market EVER! Jobs are roaring back! _E_
In any event we are EXTREME VETTING people coming into the U.S. in order to help keep our country safe. The courts are slow and political! _E_
Why do people give @KarlRove contributions when they know he is a loser who has no idea how to win? __HTTP__ _E_
Join me tomorrow! #Trump2016 #MakeAmericaGreatAgain Omaha Nebraska: __HTTP__ Oregon: __HTTP__ _E_
Entrepreneurs: Follow your instincts and keep your focus intact. You alone know where you really want to go. _E_
"@marklevinshow: 'PLUNDER AND DECEIT'" __HTTP__ via @AmSpec by @JeffJlpa1 _E_
Lightweight A.G. Eric Schneiderman asked us for political contributions DURING his investigation of usthen sued for $40 million.Dopey guy! _E_
The @Yankees acquisition of Ichiro was a smart move. I look forward to watching him play. _E_
Everyone is asking if and when I will endorse a candidate in the NYC mayoral race. Doing my due diligence... _E_
The brass in #TRUMP Tower's atrium is polished twice a month like clockwork. I keep the atrium impeccable. Key to its success! _E_
Vision remains vision until you focus do the work and bring it down to earth where it will do some good. _E_
North Korea disrespected the wishes of China & its highly respected President when it launched though unsuccessfully a missile today. Bad! _E_
This election is a total sham and a travesty. We are not a democracy! _E_
Thoughts and prayers with the sailors of USS Fitzgerald and their families. Thank you to our Japanese allies for th... __HTTP__ _E_
I'm always amazed when I travel to my foreign properties.Seeing the Trump brand across 4 continents proves that excellence can be universal. _E_
NO WAY JUDGES SAY MAYWEATHER WON. INVESTIGATION SHOULD TAKE PLACE. FIX? _E_
Today it was an honor to celebrate the Collegiate National Champions of 2016/2017 at the @WhiteHouse! #NCAAChampions Photos: __HTTP__ __HTTP__ _E_
Thanks to Giovanni's Coal Fire Pizza of Florida for donating enough pizza to feed 750 Police Athletic League youngsters in NY this Friday. _E_
My interview yesterday with @IngrahamAngle __HTTP__ _E_
"Trump: 'Seriously Considering' a Presidential Bid" __HTTP__ via @NBCNews _E_
If the press can report stories from @MittRomney's dorm years then why can't it find @BarackObama's college and law school transcripts? _E_
With terrific Steve Wynn at dinner last night. __HTTP__ _E_
Via @Newsmax_Media: Trump: Americans 'Desperate for Leadership' __HTTP__ _E_
If everybody sued the Journal News for revealing their info (guns) paper would go out of business. _E_
If the Saudis are so concerned about Syria then they should go in themselves. Stop telling us to do their dirty work. _E_
"Every big thinker has had to start as a nobody. Just think big & that immediately distinguishes you from the majority." – Think Big _E_
Est. in 1906 @TrumpTurnberry is home to the iconic Ailsa @The_Open Championship course four times over __HTTP__ _E_
Why isn't AG Schneiderman going after Democrat Jon Corzine and the $1.4 billion that is "missing?" _E_
These Islamists chop Americans' heads off and want to destroy us. We should be applauding the CIA not persecuting them. _E_
Thank you Texas! If you haven't registered to VOTE today is your last day. Go to: __HTTP__ & get ou... __HTTP__ _E_
The girlfriend of Lubitz the wacko co pilot who took down the plane knew he was insane and should have reported him. Put her through hell _E_
Trump Golf Links at Ferry Point is a Jack Nicklaus Signature Design 18 hole course just minutes from Manhattan __HTTP__ _E_
If United Steelworkers 1999 was any good they would have kept those jobs in Indiana. Spend more time working less time talking. Reduce dues _E_
.@RNC report was written by the ruling class of consultants who blew the election. Short on ideas. Just giving excuses to donors. _E_
The people of Scotland are really starting to fight the ugly industrial wind turbines. See Press and Journal __HTTP__ _E_
GIVE AMERICA BACK ITS DREAM! Donald J. Trump _E_
"MSNBC'S TOURÉ HAS EPIC RACE BAITING MELTDOWN ON CNN" __HTTP__ It's Toure's modus operandi. He is so angry. _E_
See yourself as having a lot already and keep your integrity intact. It's the best path to comprehensive success. Think Like a Champion _E_
Love making correct predictions. National Review is over. __HTTP__ _E_
My nomination would increase voter turnout. #VoteTrump #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_
Congress was elected last November to reign Obama in not to give him 'fast track' authority for bad trade deals for the American worker! _E_


================================================
FILE: assignments/word_transform/common.en.vocab
================================================
,
.
the
</s>
of
-
in
and
'
)
(
to
a
is
was
on
s
for
as
by
that
it
with
from
at
he
this
be
i
an
utc
his
not
–
are
or
talk
which
also
has
were
but
have
#
one
rd
new
first
page
no
you
they
had
article
t
who
?
all
their
there
been
made
its
people
may
after
%
other
should
two
score
her
can
would
more
if
she
about
when
time
team
american
such
th
do
discussion
links
only
some
up
see
united
years
into
/
school
so
world
university
during
out
state
states
national
wikipedia
year
most
city
over
used
then
d
than
county
external
m
where
will
de
what
delete
any
these
january
march
august
july
being
film
him
many
south
september
like
between
october
three
june
well
use
war
under
them
april
we
born
december
link
while
c
later
part
november
further
players
list
please
following
my
february
known
second
u
name
group
history
series
just
e
north
work
before
since
season
both
high
st
through
district
now
!
comments
because
football
music
however
diff
century
league
edits
debate
title
articles
john
same
including
could
english
album
number
against
family
user
based
area
became
york
b
life
me
british
international
game
"
above
club
your
until
early
best
west
house
company
general
left
very
here
don
living
day
several
place
party
college
result
keep
appropriate
four
subsequent
even
class
government
how
called
did
each
found
center
per
style
com
long
country
back
way
does
www
modify
end
make
public
played
p
won
another
released
added
f
support
games
former
those
films
church
east
line
major
members
good
much
image
show
still
think
below
town
last
system
right
song
non
notable
section
single
included
align
home
women
television
—
seed
member
goals
sources
book
station
order
old
information
set
own
text
band
point
local
around
river
top
main
language
french
https
named
off
us
note
career
original
age
service
established
located
re
said
website
population
air
german
law
military
}
great
ii
within
clubs
published
president
park
official
$
r
case
>
london
times
although
small
third
different
due
get
village
closed
g
art
player
final
l
community
held
n
again
began
army
award
without
death
built
men
large
site
+
using
deletion
white
along
five
central
road
children
free
took
england
include
association
down
j
given
source
x
california
man
version
written
created
media
black
though
php
report
building
la
take
division
comment
having
king
edit
stadium
died
ship
research
record
archive
places
undo
cup
records
often
few
received
side
power
education
know
category
water
political
species
field
near
&
co
australia
video
need
go
island
form
find
served
play
project
o
according
radio
am
works
proposed
every
development
example
live
union
india
next
special
court
region
h
little
short
v
william
province
western
son
france
council
others
royal
current
street
full
red
too
department
w
san
help
among
ve
preserved
james
open
force
position
head
director
father
track
http
canada
never
australian
id
george
jpg
level
late
summer
society
moved
office
period
championship
round
story
songs
various
file
days
land
business
tv
reason
america
million
european
term
al
six
uk
post
why
produced
making
subject
young
total
david
science
related
rock
archived
railway
become
led
students
started
news
described
role
election
albums
present
indian
kingdom
books
important
northern
love
run
canadian
press
rather
k
type
act
editor
came
schools
program
once
issue
social
germany
production
male
might
awards
points
similar
professional
say
background
enough
lead
either
common
overlap
data
color
better
•
person
services
museum
battle
went
sports
already
currently
hall
buildings
historic
date
deleted
considered
change
location
seems
must
yes
our
southern
least
lost
something
review
together
robert
fact
less
japanese
groups
content
involved
isbn
board
japan
control
policy
modern
human
half
design
event
events
available
done
washington
real
start
personal
action
space
areas
doesn
notability
star
really
china
possible
paul
working
taken
far
going
minister
lake
reported
popular
married
founded
europe
author
away
independent
process
teams
character
low
michael
pages
light
big
seen
release
want
episode
wrote
republic
thomas
companies
via
russian
thanks
put
race
worked
route
recorded
someone
civil
police
charles
listed
users
template
instead
eastern
body
question
italian
featured
week
editors
texas
chief
close
himself
upon
match
q
roman
come
opened
tour
sea
actually
cross
playing
health
institute
caps
forces
green
rights
evidence
originally
aircraft
arts
range
probably
consensus
bar
problem
look
issues
alumni
average
network
win
shows
wife
returned
night
magazine
centre
joined
usually
middle
completed
elected
significant
african
able
google
stage
addition
ireland
today
academy
saint
self
itself
continued
stations
mother
appeared
africa
culture
spanish
grand
committee
things
fire
changed
gold
female
course
directed
months
whether
chinese
previous
developed
size
mentioned
add
festival
peter
basketball
across
move
performance
standard
means
give
training
artist
word
blue
primary
announced
value
christian
private
catholic
artists
includes
view
thus
almost
baseball
seven
appears
ever
provide
technology
olympics
future
formed
census
nd
images
los
results
return
quality
construction
zealand
front
cover
model
despite
read
material
strong
coach
henry
footballers
mark
rev
organization
studies
federal
richard
html
virginia
car
attack
conference
outside
study
brother
names
throughout
writer
characters
musical
nothing
border
medical
countries
past
writing
makes
interest
provided
killed
medal
signed
dr
largest
label
fair
search
bay
reference
especially
refer
removed
library
eventually
management
references
features
navy
guitar
hill
sure
historical
lower
daughter
appointed
reading
yet
systems
debut
movement
fc
specific
always
actor
natural
clear
coast
let
got
chicago
championships
ll
pennsylvania
ten
performed
individual
designed
rule
etc
lists
paris
thought
brown
hand
needs
reliable
smith
generally
base
sometimes
florida
capital
valley
bank
gave
ground
reached
italy
energy
believe
leader
active
online
block
bridge
families
changes
y
followed
industry
collection
request
soon
leading
olympic
sold
writers
professor
studio
mexico
competition
campaign
org
theatre
anything
particular
empire
length
islands
singer
create
redirect
additional
soviet
market
words
producer
notes
hockey
novel
code
referee
fourth
sport
van
mary
airport
sound
status
irish
placed
child
perhaps
idea
foreign
municipality
isn
register
eight
problems
native
coverage
channel
parliament
username
edition
minor
says
whose
foundation
units
movie
runs
ice
simply
limited
unit
student
previously
stated
governor
complete
test
nominated
bill
parts
vocals
theory
regional
km
account
vote
computer
none
carolina
tournament
poland
behind
wales
winning
lot
hospital
mid
taking
mountain
higher
cases
angeles
editing
replaced
food
multiple
likely
terms
sir
thing
square
try
topic
woman
officer
categories
greek
recent
sent
copyright
speed
templates
money
saw
senior
selected
introduced
politician
true
required
regular
awarded
commercial
cities
contains
trade
mr
degree
anti
birth
sun
finished
longer
rugby
earth
access
prior
seasons
journal
beginning
software
famous
religious
appear
martin
el
god
bit
hours
running
brought
missing
economic
structure
rural
remained
decision
certain
quite
hit
minutes
spain
plays
whole
joseph
lord
web
decided
operations
function
louis
assembly
queen
security
uses
ohio
owned
jan
operation
call
successful
legal
russia
prince
mean
jewish
staff
establishments
goal
towards
agree
bad
attendance
populated
nature
allowed
captain
mount
township
calculated
structures
hard
saying
manager
earlier
elections
meet
box
lines
democratic
success
·
associated
singles
traditional
rest
highway
matter
particularly
wide
month
care
admin
cultural
commission
didn
plan
therefore
practice
command
nomination
jersey
parties
michigan
entire
en
anyone
seem
overlaps
approximately
master
noted
usa
stop
cannot
feature
engine
response
needed
illinois
afd
experience
highest
engineering
silver
separate
takes
secretary
dutch
lee
recording
prime
le
themselves
rules
uploaded
trying
youth
scotland
iii
houses
heart
room
stone
shown
deal
drama
scores
dead
key
shot
turn
occupation
scottish
executive
plant
promoted
whom
villages
languages
internet
leave
feel
covered
merge
mostly
numerous
ancient
attempt
property
programs
picture
finally
ships
fiction
looking
secondary
nations
majority
edward
annual
digital
mission
wp
lived
claim
seat
bbc
profile
dance
prize
doing
georgia
port
pacific
castle
pass
transport
organizations
ratio
recently
fall
global
era
wing
opinion
commander
fort
effect
opening
fine
purpose
winter
genus
congress
overall
activities
met
income
massachusetts
comes
older
peak
lack
bass
super
complex
academic
stars
accounts
appearance
asian
asked
friends
kind
et
till
financial
entry
asia
sense
meaning
actress
map
→
intended
bishop
boston
rate
literature
forest
voice
jack
pre
justice
britain
champion
double
polish
numbers
columbia
temple
defeated
administration
ended
claims
z
jones
mm
parish
israel
actors
sister
nine
scored
table
attended
pop
cd
else
newspaper
friend
unknown
winner
chart
initially
loss
sites
starting
architecture
relations
upper
supported
tracks
contract
face
directly
spent
girl
clearly
junior
francisco
politics
greater
presented
mar
ed
cause
volume
caused
pagename
tom
flight
candidate
passed
matches
claimed
except
oil
assistant
surface
victory
regiment
stories
represented
gets
speedy
weeks
listing
allow
jr
branch
retired
communities
train
paper
adding
provides
remains
victoria
metal
wrong
larger
direct
frank
miles
blocked
launched
mass
chairman
comedy
relationship
knowledge
format
creek
meeting
failed
officers
draft
goes
fight
figure
faculty
camp
ran
variety
owner
statistics
raised
heavy
alexander
alone
understand
episodes
gives
educational
daily
williams
latin
completely
products
dark
attention
religion
referred
von
mind
oppose
corps
administrative
cut
scott
becoming
footballer
jean
mayor
pro
beach
descent
nearly
latter
leaving
highly
cast
territory
write
towns
forms
joe
inside
wanted
solid
individuals
authority
mention
projects
del
continue
cost
vice
drive
notice
johnson
forced
basis
looks
reasons
job
photo
hope
log
parents
entered
mike
basic
scientific
amount
spring
oxford
kong
opera
tried
critical
simple
founder
hong
told
husband
useful
technical
necessary
believed
operated
mountains
importance
musicians
hotel
girls
crew
feb
boy
ontario
nation
defense
wiki
champions
golden
districts
faith
racing
mainly
auto
unless
lives
swedish
hot
entertainment
turned
net
soccer
creation
product
tower
increased
votes
squadron
px
contemporary
subsequently
regarding
focus
marriage
questions
naval
details
forward
memorial
peace
ip
kept
iran
korea
analysis
winners
poor
grade
cricket
judge
electric
bc
exist
corporation
hold
featuring
campus
brazil
chris
beyond
fifth
increase
summary
remaining
statement
broadcast
getting
piano
des
novels
serving
hour
moving
resolution
concept
alternative
brothers
attacks
encyclopedia
republican
representatives
politicians
difficult
ability
studied
host
wall
immediately
urban
pakistan
becomes
marine
physical
dec
troops
interview
coming
semi
suggest
emperor
letter
couple
fellow
duke
tell
gallery
follow
windows
tree
hits
jazz
protection
relevant
count
situation
reviews
containing
classical
offered
lady
netherlands
reports
influence
address
linear
consider
machine
domain
elements
minnesota
types
nov
serve
sydney
ministry
blood
distance
bottom
giving
boys
potential
toronto
edited
infantry
jun
formerly
oct
conflict
workers
steve
philadelphia
helped
der
nationality
dispute
scene
method
titles
berlin
conditions
arms
races
maybe
discovered
iron
extended
churches
otherwise
positive
santa
nom
imperial
composed
ball
width
quickly
correct
responsible
possibly
indiana
soldiers
examples
korean
timezone
genre
fish
senate
effects
gun
check
appearances
plans
renamed
sign
reporting
sweden
consists
heritage
tag
primarily
doctor
leaders
lies
inc
rivers
du
crime
liberal
stand
bob
existing
publishing
industrial
answer
split
apr
sex
mixed
acting
personnel
rail
die
premier
approach
wisconsin
sentence
root
ago
standards
comics
earned
miss
specifically
horse
actual
contributions
carried
lieutenant
wood
plants
initial
origin
environment
pretty
rank
bus
gas
direction
guide
resources
affairs
accepted
animals
nor
activity
levels
laws
jim
creating
cambridge
ones
composer
remove
agency
reserve
atlantic
supreme
weight
pp
ask
fighting
jackson
widely
rose
operating
treatment
linked
andrew
trial
expanded
daniel
certainly
info
vs
sciences
fame
everything
avenue
travel
scale
break
oregon
housing
produce
capacity
smaller
fictional
exchange
actions
cited
typically
settlement
agreement
translation
males
kansas
managed
bring
charge
fails
dedicated
estate
nearby
residents
piece
growth
trust
applied
drums
issued
murder
normal
twenty
commonly
avoid
tony
norwegian
criteria
context
suggested
revolution
fully
wars
prominent
aug
leaves
advanced
distribution
medicine
garden
reach
turkey
females
publications
impact
households
survey
height
morning
honor
deep
argument
publication
arthur
elizabeth
disambiguation
worth
colorado
median
maryland
falls
zone
solo
learning
pay
resolves
choice
vol
flag
engineer
cars
farm
wilson
principal
acquired
constructed
secret
poet
build
remain
orchestra
versions
follows
fixed
fm
efforts
documentary
equipment
ray
yellow
guard
pressure
grant
prison
freedom
norway
twice
sportspeople
store
taylor
quarter
designated
independence
platform
rome
ad
teacher
copy
effort
nuclear
pictures
models
sep
everyone
easily
thank
di
description
agreed
£
institutions
covers
facilities
target
stack
rationale
stat
combined
bronze
sort
hosted
programming
sri
railroad
unique
defined
ocean
cell
missouri
concert
improve
biography
loan
shortly
contact
holy
tennessee
sub
safety
competed
stephen
policies
painting
price
entirely
mexican
leadership
flying
message
municipal
serious
headquarters
officially
cemetery
memory
×
fields
generation
join
copies
finals
fox
continues
representative
destroyed
feet
guy
philippines
revealed
organized
serves
conservative
share
maria
disease
sections
philosophy
ways
arrived
divided
floor
labour
logo
meets
yard
largely
cancer
offer
tax
expected
traffic
concerns
graduated
guest
jews
formation
meant
economy
storm
tells
mile
protected
bowl
letters
providing
begins
classic
damage
harry
offers
davis
challenge
views
marked
allows
density
literary
fa
htm
ben
transportation
kentucky
sales
fleet
supporting
captured
extra
recognized
arizona
compared
theme
francis
moscow
interested
heard
behavior
transferred
environmental
blank
musician
starring
assigned
seats
tennis
percent
logs
display
convention
ring
joint
brian
deputy
planned
universities
yards
communist
agent
difference
animal
czech
positions
exactly
stay
titled
combat
palace
card
ordered
opposition
attempts
understanding
stub
wrestling
critics
growing
establish
hands
participated
revert
poetry
materials
ga
turkish
paid
promotion
apparently
battalion
mobile
additions
row
merged
metropolitan
figures
existence
eye
longs
louisiana
lewis
melbourne
austria
brigade
screen
risk
conducted
lats
ban
da
labor
legislative
definition
indeed
draw
application
un
steel
presence
expansion
earl
max
wild
planning
comic
adopted
easy
plus
happy
acts
classes
iowa
grew
save
wins
theater
exists
roles
chance
prevent
candidates
object
felt
powers
birds
spread
defeat
cape
identified
gained
regions
mine
sides
jul
showing
teaching
guidelines
simon
depth
lyrics
christmas
declined
greece
express
federation
journalist
intelligence
occurred
connection
displayed
portuguese
declared
constitution
presidential
standing
sons
plot
dates
firm
proper
ends
pilot
relatively
receive
educated
opposed
manchester
queensland
americans
introduction
directors
vehicle
stock
vehicles
israeli
frequently
hills
performing
northwest
drug
visit
portion
residence
walter
pov
interesting
moon
limit
minute
bell
athletics
reduced
wind
oklahoma
architect
ideas
electronic
crown
younger
anderson
step
weapons
unable
neutral
connected
switzerland
expatriate
armed
weekly
rating
programme
squad
medalists
multi
dynasty
cold
granted
socorro
alliance
methods
sr
sam
alabama
albert
tropical
vietnam
dvd
refers
heat
fans
surrounding
purposes
credit
commons
boat
iv
boxes
ethnic
speaking
fell
arena
roads
core
dog
kill
athletic
oldest
negative
confirmed
sixth
edge
jesus
tools
colonel
weak
chosen
brand
resulting
nfl
rise
supply
tradition
elementary
household
spirit
task
slightly
howard
incident
develop
southeast
sunday
discuss
stats
climate
topics
purchased
communications
chapter
broken
singapore
alongside
situated
ca
license
haven
deaths
passing
citizens
guns
trees
gone
greatest
improved
visual
pope
officials
sat
glass
miller
resulted
posted
estimated
contain
brazilian
sexual
defence
respectively
concerning
rich
myself
fast
properties
taught
extensive
exhibition
speech
les
proposal
straight
ff
internal
effective
solution
fashion
foot
orange
argentina
brief
performances
adult
allowing
newly
identity
nominator
singers
inspired
discussed
require
ex
facility
transfer
egypt
cells
patrick
quebec
connecticut
scoring
anthony
permanent
phase
audience
motion
blues
hungarian
arab
trains
sets
wasn
ranked
unlike
begin
setting
eyes
database
studios
criminal
commonwealth
finish
communication
scope
accused
divisions
accept
warning
alan
objects
diego
contest
fighter
finds
coaches
beat
extremely
ford
swiss
sorry
houston
worldwide
showed
holds
cathedral
losing
advance
reality
broadcasting
adam
vandalism
enemy
entitled
youtube
assessed
billion
buried
belgium
respect
rare
detroit
graduate
colleges
explain
authorities
killing
maximum
neither
fan
notify
painter
hamilton
returning
attempted
universe
passes
obvious
suffered
pieces
apply
actresses
competitions
aid
driver
folk
dan
khan
baby
denmark
tokyo
billboard
calling
anne
happened
danish
wants
formula
interior
kevin
weather
powerful
muslim
registered
publisher
preceding
sounds
eric
approved
achieved
douglas
provincial
fund
portugal
athletes
bird
bands
audio
cat
bureau
centuries
valid
chemical
items
lane
holding
counties
update
ncaa
speak
finding
domestic
ali
false
equivalent
caught
christ
ending
toward
puerto
perform
partner
romania
aviation
wouldn
failure
ward
strength
onto
knight
nominations
hungary
concern
keeping
recordings
ep
juan
functions
mississippi
ok
calls
criticism
involving
magic
gordon
treaty
antonio
selection
rear
colonial
motor
obtained
circuit
wish
compilation
harvard
islamic
determined
geography
arkansas
fuel
artillery
medieval
locations
inclusion
recognition
northeast
chamber
moment
somewhat
grounds
anyway
succeeded
historian
condition
physics
newspapers
instance
represent
allen
watch
kitt
protect
grey
launch
dave
philip
dc
iraq
changing
ukraine
municipalities
mix
tamil
shift
shared
austrian
door
investigation
institution
princess
trail
ultimately
parks
applications
hundred
aired
requirements
talking
kim
ltd
metres
gray
sector
dean
agricultural
unincorporated
incorporated
escape
orders
corner
commissioned
founding
mill
mrs
subjects
temperature
settled
remember
miami
promote
values
spot
progress
learn
planet
oh
occupied
usage
southwest
refused
borough
truth
clark
sufficient
equal
administrator
persons
factory
fought
derived
outstanding
magazines
flow
peer
attacked
generate
shape
creator
requires
option
lincoln
starts
stands
carry
establishment
selling
causes
mp
budget
battles
sky
legend
sourced
arrested
forum
metro
broke
strike
injury
ryan
zero
converted
violence
significantly
statements
controlled
welsh
dropped
roger
pdf
distinguished
samuel
translated
papers
detail
chapel
frederick
thousands
banks
herself
offensive
kings
factor
rename
replace
museums
resistance
junction
tries
tim
engines
contributed
medium
device
profit
dream
enter
twelve
universal
typical
seeing
skills
bought
passenger
cleveland
funding
agriculture
parent
decades
receiving
signal
reform
organisation
prix
column
defunct
utah
managers
qualified
indicate
ukrainian
gay
amateur
obviously
flora
gene
soul
op
alt
discussions
montreal
turns
walker
entrance
path
nice
string
influenced
occur
developing
abandoned
humans
pair
flat
sample
contained
banned
moore
strongly
visited
pm
increasing
attorney
arm
mathematics
canal
charts
thinking
dublin
suggests
whatever
surname
brain
pittsburgh
blog
economics
seventh
alex
employed
heavily
authors
paintings
concerned
recipients
navigational
scholars
controversial
controversy
reverted
expressed
josé
bodies
conservation
maps
ahead
marie
arguments
chain
focused
readers
carl
cm
violation
offices
wave
circle
apart
invasion
jimmy
opportunity
determine
orthodox
voted
formal
describes
seconds
cycle
doubt
golf
walls
productions
constituency
closely
occurs
huge
andy
representing
indonesia
sell
aren
mon
drawn
diocese
tank
advice
senator
manner
generated
malaysia
asking
finland
causing
leads
lawyer
seattle
gain
index
saints
runner
crisis
cinema
matt
hollywood
reaction
medals
documents
reader
lawrence
pattern
archives
atlanta
voting
reviewed
looked
bear
perfect
restored
bruce
baltimore
baron
pan
commune
fantasy
duty
chair
scenes
broad
opposite
stuff
aged
streets
nick
anna
billy
extension
kent
parliamentary
kelly
shooting
ready
pick
ma
songwriter
aware
jordan
dictionary
composition
salt
stating
bangladesh
bot
successfully
benefit
lands
interests
scheduled
teachers
closing
advertising
contribution
maine
retirement
scientists
dam
ny
blocks
las
print
techniques
participate
anniversary
requested
discovery
explained
expedition
citation
assist
und
meanwhile
hampshire
creative
maintain
pierre
detailed
facts
frame
finance
socialist
script
camera
returns
engaged
assistance
experienced
underground
sale
beautiful
jane
abc
supposed
successor
classification
tool
mining
producing
cabinet
fr
bytes
ross
russell
citations
maintained
evening
singing
fifa
gender
venues
lakes
mail
jeff
electoral
emergency
mode
christopher
heads
proved
priest
funds
investment
romanian
session
capture
aspects
reduce
trophy
abuse
prefecture
walk
faced
normally
regarded
snow
shop
dakota
bush
coal
inhabitants
headed
gary
employees
error
invited
cable
protein
accident
decade
measure
watched
patients
downtown
animated
satellite
johnny
combination
courts
sequence
hook
clean
wed
owners
twin
distributed
describe
~
defensive
islam
photos
ottoman
trained
affected
routes
ministers
wine
elsewhere
biggest
li
lanka
carlos
landing
collected
revival
rio
communes
saturday
mps
guess
drop
sarah
laid
swimming
membership
edinburgh
fit
harris
dallas
degrees
bachelor
personally
briefly
files
conduct
extreme
courses
hence
homes
reaching
na
sought
vision
demand
vertical
updated
marketing
jason
consisted
appeal
plane
quick
victor
dyk
solar
ages
neighborhood
fairly
wings
acid
scheme
°
matters
rfc
constant
additionally
hip
admins
nova
ceremony
chile
composers
nazi
scholar
liverpool
hero
designer
learned
instruments
welcome
hair
consecutive
movies
adjacent
pool
tue
norman
collections
belgian
corporate
austin
ensure
driving
phone
fly
ian
window
document
adams
collaboration
margaret
kennedy
leg
videos
assume
attached
dry
expand
bible
matthew
depending
serbian
instrument
covering
random
represents
participants
thorough
mentions
portrait
drivers
airlines
franklin
viewers
finnish
differences
venue
vocal
cricketers
element
regularly
rejected
relative
illegal
stewart
roof
leagues
argued
colour
morgan
prisoners
facebook
attend
nelson
survived
insurance
expert
steam
cards
manufacturing
testing
coastal
yorkshire
rescue
territories
thu
thailand
struck
choose
vienna
journey
storage
costs
singh
distinct
notably
soldier
il
colony
evolution
taiwan
hurricane
judges
gardens
poems
consisting
removing
driven
responsibility
sentences
birmingham
engineers
visible
ft
substantial
gulf
installed
revolutionary
inner
trip
restaurant
graham
\
stores
rice
happen
prove
reasonable
skin
committed
volleyball
_
chose
factors
hundreds
injured
devices
phrase
stanley
lemmon
thompson
suicide
advantage
automatically
disc
minimum
goods
charges
alfred
operator
merely
finishing
fred
identify
producers
ss
ann
campbell
portland
helps
latest
releases
victims
explanation
operate
threat
crossing
slow
poets
stopped
strategy
wayne
ranking
disney
wright
residential
associate
hi
significance
ruled
excellent
shouldn
observed
threatened
friendly
redirects
temporary
masters
peninsula
networks
passengers
assumed
artistic
safe
earliest
festivals
compete
png
hunter
moths
alaska
mi
partnership
maintenance
monitoring
evil
relief
charlie
poverty
hop
cc
fri
encyclopedic
suspected
filled
nba
decide
breaking
argentine
resigned
oblast
handed
drew
hawaii
brooklyn
whilst
historians
pa
speaker
moth
permission
wounded
racial
marshall
kg
gate
springs
roy
photography
helping
knights
roll
progressive
contrast
continuing
processes
terminal
executed
shall
svg
spouse
infrastructure
principle
painters
painted
properly
frequency
shaped
joining
robinson
waters
ridge
bridges
ceo
monument
mental
carter
karl
mac
orleans
portal
parallel
regardless
thirty
giant
qualifying
murray
afghanistan
assessment
counter
bears
purchase
expression
backing
uefa
improvement
madrid
closure
wheel
ambassador
desert
bringing
iranian
reign
uncle
severe
rain
admiral
fishing
existed
raise
broadway
principles
grow
tests
roughly
tech
trouble
rico
paragraph
bat
prepared
measures
robin
hired
fear
merit
participation
massive
designs
agencies
whereas
technique
alberta
egyptian
clerk
knew
narrow
adapted
commissioner
rapid
credited
dating
businesses
bomb
capable
poem
stages
honorary
dragon
charged
propose
modified
fired
mlb
send
proof
practices
arabic
attractions
carrying
mouth
fix
licensed
symbol
organ
damaged
warren
exception
costa
unfortunately
jerusalem
replacement
indians
besides
soundtrack
virgin
thousand
vancouver
legislation
beauty
credits
buy
organisations
serbia
christianity
opinions
cavalry
tribe
richmond
chess
channels
claiming
exact
baker
allied
involvement
anime
referring
donald
sisters
willing
requests
unusual
yourself
impossible
colors
cook
drawing
wikimedia
jonathan
removal
moves
indicates
admitted
ownership
shore
monitored
nebraska
regulations
crash
guitarist
enforcement
supports
abbey
deleting
nevada
barry
tone
operates
indigenous
personality
reception
transit
buffalo
flowers
bond
jay
adventure
definitely
guinea
horror
rangers
pointed
apple
popularity
occasionally
coalition
franchise
starred
critic
journals
rolling
percentage
silent
laboratory
microsoft
movements
charter
suitable
alternate
offering
missions
sc
experimental
rooms
concluded
reputation
accurate
versus
websites
interpretation
tagged
endemic
chemistry
achieve
knows
manga
journalists
forests
cbs
comprehensive
symphony
promotional
electrical
tags
meters
jerry
tigers
commerce
remix
addressed
phil
automatic
gang
afterwards
printed
oak
warner
tend
ms
quote
separated
bishops
glasgow
essentially
wait
input
battery
favor
benjamin
apparent
shopping
patrol
eagle
mainstream
pc
angel
martial
restoration
delhi
hans
indicated
morris
railways
centers
mills
helpful
delivered
components
victorian
legislature
tourism
treated
extent
kids
barbara
essay
circumstances
repeated
plain
superior
strategic
similarly
duties
effectively
blp
considering
arranged
ken
grammar
amendment
alleged
relation
habitat
spoken
eu
shell
mounted
entries
conflicts
philippine
montana
appearing
triple
boundary
caribbean
hosts
signs
seriously
bristol
warring
mitchell
industries
colombia
comparison
basin
eleven
ill
putting
pradesh
charity
output
dna
carbon
boats
desc
architectural
representation
commentary
rising
visitors
markets
plate
giants
processing
landscape
dick
hunt
em
summit
rr
psychology
ride
greatly
guardian
closer
terminus
losses
balance
democracy
submarine
nicholas
unsourced
usual
peru
eighth
instrumental
hindu
amongst
defender
riding
arrival
evans
turning
imply
prose
cargo
hidden
volunteer
bio
holder
sugar
daughters
wildlife
fun
integrated
partners
rates
grace
feed
childhood
accompanied
milan
photographs
honour
soil
server
manual
concrete
possibility
ghost
confused
tunnel
larry
styles
elevation
muhammad
considerable
stood
inter
lose
phoenix
sweet
waste
operational
tall
ongoing
qualify
constitutional
sporting
peoples
acceptable
fruit
decisions
depression
perspective
longest
midfielder
crystal
monastery
resident
seek
cincinnati
tied
surgery
steps
carrier
stream
alice
dj
kick
furthermore
strange
predecessor
bernard
nigeria
pain
ph
influential
punk
wooden
suggestion
interaction
retained
achievement
mechanical
drugs
missed
expect
trinity
classified
minority
businessman
grown
coat
powered
alive
nbc
nhl
keith
bobby
harbor
behaviour
croatian
maritime
terry
virtual
indoor
periods
spiritual
easier
croatia
lions
archbishop
luis
merchant
azerbaijan
lots
contested
editorial
initiative
charlotte
pure
borders
persian
marks
armenian
romantic
replacing
talent
unlikely
panel
jump
animation
agents
employment
trading
parker
statue
ac
dated
wonder
filed
provinces
friday
jobs
cuba
couldn
aside
são
scientist
schedule
waiting
familiar
suspect
disagree
suggestions
turner
forming
formally
locomotives
barcelona
se
consistent
recommended
desire
happens
patient
bulgaria
vi
vincent
outer
hear
texts
belief
visitor
vessels
basically
continental
hole
fail
passage
sees
wedding
archaeological
layer
designation
clan
revenue
seeking
couples
entering
suit
soft
weekend
approval
democrats
crimes
collins
expatriates
horses
wear
visiting
overs
supporters
cash
somewhere
dennis
resource
sculpture
practical
harrison
pink
oliver
limits
cooper
illustrated
sur
hell
statistical
referenced
wolf
warriors
incidents
fresh
editions
roots
signature
clinical
premiered
volumes
worst
adults
contribute
necessarily
immediate
feeling
theories
essential
completion
conclusion
technologies
strip
bound
praised
stayed
hull
−
diamond
origins
empty
eliminated
valuable
cite
doubles
branches
@
honors
brick
experiences
beijing
tie
lgbt
sa
wickets
liberty
repeatedly
siege
baptist
ron
hebrew
affect
portrayed
decline
widespread
coaching
alpha
equipped
identical
submitted
enterprise
touch
transmission
rs
platforms
cave
filmed
inch
cool
bulgarian
debuted
liga
manhattan
destruction
activist
weapon
clay
keyboards
dangerous
viewed
lp
email
biology
increasingly
bold
bowling
os
compare
treaties
affiliated
sock
assault
regards
monthly
foster
cousin
urls
hispanic
logic
craig
trivial
pioneer
muslims
lay
rated
absence
amsterdam
publishers
tribes
percussion
runners
themes
benefits
guards
flows
attributed
stubs
athens
herbert
celebrated
sponsored
raf
regard
asks
delaware
neil
pole
ref
historically
tail
tours
stable
decides
vessel
identification
delta
telling
dealing
writes
mediterranean
volunteers
reply
attempting
stuart
marvel
luke
grave
odd
hearing
uss
mall
penalty
solutions
secure
hugh
steven
sole
architects
characteristics
falling
spin
clinton
villa
select
metric
criticized
surviving
roberts
standings
biological
lloyd
munich
belongs
adelaide
belong
harold
norfolk
butler
coi
rival
acoustic
posts
adaptation
greg
reporter
url
absolutely
nobody
scholarship
vast
exit
facing
inquiry
dual
belt
noticed
patent
mathematical
relating
rarely
submission
demographics
crowd
rick
governments
bonus
tourist
mystery
click
settlements
walking
nevertheless
voters
rifle
component
civilian
partial
encouraged
birthday
eddie
christians
denver
petersburg
researchers
partly
photographer
runtime
jon
obama
picked
seemed
clock
violin
highways
holiday
distinction
artwork
makeup
catherine
font
farmers
occasions
au
guideline
photograph
struggle
timestamp
produces
yale
options
pen
procedure
jacob
convicted
touring
transition
anglo
legacy
denied
relationships
ottawa
derby
surrounded
libraries
competing
speakers
grades
hudson
administrators
sacred
signing
rob
citizen
dogs
argue
believes
annually
cardinal
nepal
intersection
discussing
reveals
defeating
disputes
beam
overseas
perry
nickname
ruling
syria
wells
contributing
ultimate
ranks
danny
retail
favorite
vermont
begun
download
trusted
appointment
ballet
jefferson
anywhere
sand
angle
sessions
recreation
wearing
kenya
accessible
ralph
thread
disruptive
spend
ninth
arrest
choir
trials
mines
injuries
rapidly
rounds
competitive
opportunities
meetings
commented
wang
woods
exercise
jacques
objective
demolished
preferred
resort
pedro
robot
venezuela
segment
studying
edwards
aim
dancing
eagles
demonstrated
tribute
continuous
encourage
spider
acted
convinced
heroes
describing
rocks
bed
gap
reflect
mars
participating
cooperation
obtain
gothic
protest
hunting
rfa
frequent
conversion
stress
manufacturers
voiced
innings
traditionally
jose
adventures
tiger
totally
voyage
concentration
sing
rocket
electricity
shadow
boxing
senators
doc
stanford
machines
vegas
clearer
saved
jury
calendar
noble
tommy
guilty
leo
affair
handle
extinct
responded
shares
scotia
manufacturer
tales
implementation
truck
spelling
item
load
customers
adds
spaces
cap
orphaned
ferry
prefer
push
lie
berkeley
lebanon
madison
throne
attracted
ie
lion
retrieved
manor
promoting
saudi
serial
abroad
rogers
lights
grandfather
gauge
concerts
elder
reads
renaissance
uniform
chase
aka
computers
brisbane
susan
raymond
flower
col
thai
disaster
survive
involves
clothing
murphy
sharp
behalf
explains
yugoslavia
buddhist
publicly
meat
literally
spam
telephone
moral
sung
partially
lawyers
citing
interviews
brunswick
radar
spending
grove
tea
ap
elite
bright
improving
sierra
heaven
athlete
aspect
answers
ted
consumer
funded
exclusive
ibn
manuel
allies
reviewer
missile
mechanism
helen
withdrawn
intention
mini
casualties
establishing
diseases
rhythm
pat
catch
poll
deck
newcastle
antarctic
leeds
lasted
ranges
listings
ordinary
insects
suffering
flash
worship
boundaries
blind
pakistani
assuming
interstate
patterns
arrangement
globe
honours
gross
gilbert
applies
gradually
youngest
managing
experiment
radical
gov
legs
opponent
diameter
supplies
pitch
utility
cleanup
opponents
regime
revised
plenty
genera
diplomatic
germans
seal
gregory
corresponding
concepts
sword
purple
pending
virus
populations
bull
drummer
presents
holland
congressional
bias
merger
remote
sean
messages
rebellion
premiere
physician
ha
victim
con
cloud
angels
noise
heading
duo
beer
palestinian
copper
jurisdiction
implemented
improvements
ski
peaked
hms
loop
renaming
drum
dramatic
saskatchewan
talks
earthquake
rhode
hat
requirement
den
tanks
presidents
societies
min
defending
alcohol
dominated
sang
eat
graphics
constituencies
asp
coffee
batting
chancellor
destroy
tons
cruz
warsaw
exclusively
connections
rush
heights
playstation
outcome
apartment
cardinals
fill
recipient
correctly
traditions
fundamental
copyrighted
thin
chan
resolved
mario
departments
dame
thereafter
shield
lowest
fighters
ivan
writings
bosnia
sentenced
violent
caption
harbour
margin
auckland
postal
pirates
collective
diesel
liberation
confederate
devil
activists
sultan
rider
amazon
florence
marc
wider
arnold
shah
si
blogspot
reduction
contents
genetic
somerset
locally
milk
romance
lacking
intellectual
latino
failing
mason
pete
advisory
es
arbitration
interface
hitler
default
accessed
lifetime
sheffield
departure
hindi
anglican
suggesting
mistake
residing
remainder
raising
embassy
murdered
sox
sleep
suspended
sum
mythology
bengal
confusion
bakhsh
oscar
programmes
therapy
occasion
exposed
assisted
possession
defend
devoted
graphic
warfare
milwaukee
informed
anonymous
reverse
soap
territorial
lisa
paulo
northwestern
playoffs
boss
nasa
sockpuppets
quoted
byzantine
idaho
poster
geographic
rebounds
ho
congo
venture
cricketer
worse
hoax
restricted
advocate
doors
naming
situations
instructions
sullivan
tables
leaf
shoot
substitute
restaurants
contributor
spoke
errors
enjoyed
framework
rocky
kerala
shakespeare
quantum
immigration
mirror
certified
assets
potentially
presentation
cotton
sitting
tournaments
syndrome
checked
forty
aimed
sourcing
journalism
upload
unsuccessful
towers
conductor
hospitals
bone
essex
rebuilt
wellington
ideal
raw
sharing
labels
leonard
watson
governors
posting
harvey
bases
hello
rabbi
hardware
ensemble
monster
pitcher
emphasis
irst
recovery
respond
aaron
lesser
qualification
organic
exposure
palestine
thoughts
needing
drafted
maurice
immigrants
variant
lap
legitimate
autonomous
wallace
†
succession
throw
monday
reserves
donated
increases
kid
guidance
delivery
joan
fifty
slave
feedback
columbus
stones
manage
cgi
initiated
favour
printing
variable
theology
todd
parameters
traveled
md
canton
han
reed
celtic
characteristic
commanded
searching
inappropriate
switch
ties
tube
otto
debt
outdoor
navigation
eligible
experts
expensive
tier
gospel
newton
essays
shanghai
conventional
campaigns
formations
feelings
bath
venice
cats
variations
emerged
socks
ch
connecting
flood
documented
custom
touchdown
profession
layout
academics
settlers
merging
sony
competitors
phillips
hasn
grass
reservoir
artificial
novelist
tip
prague
abu
faces
guitars
laura
fellows
internationally
attacking
johann
dreams
hughes
suburb
understood
specialized
warned
pearl
chorus
dependent
restrictions
killer
oakland
romanized
trio
influences
blocking
mtv
wikipedians
à
cattle
gear
gabriel
traded
skating
fifteen
palm
wikis
tale
demonstrate
vary
liquid
cycling
princeton
respective
voices
faster
friedrich
jet
horn
erected
burning
worker
atmosphere
characterized
syrian
welfare
java
monitor
ye
graduating
columns
reportedly
repair
bin
stick
dollars
organised
parameter
truly
resolve
buenos
parade
backed
awareness
depends
define
spencer
republicans
conspiracy
dies
clarke
rough
engage
pine
equation
feels
democrat
permitted
cutting
button
attending
brands
queens
abraham
neck
forever
est
drink
sheriff
miguel
aires
irrelevant
poorly
montgomery
vanity
gift
riders
functional
crossed
diverse
ru
numbered
quotes
slowly
xi
attitude
mouse
justin
protests
gods
amounts
variation
smart
prices
prayer
terrorism
beta
durham
householder
counts
iraqi
detective
josh
placing
somebody
linking
compositions
oval
extensively
filming
perfectly
mw
pr
indianapolis
fn
funeral
recovered
southeastern
farmer
protestant
lt
cameron
focuses
ranging
unclear
indonesian
mixing
mumbai
nashville
danger
rally
narrative
camps
surprise
manufactured
deployed
kate
solely
molecular
unnecessary
isle
theorem
homepage
colonies
cyprus
wake
brings
winds
magnetic
conversation
sussex
ab
fl
fastest
gates
ram
plastic
electronics
restore
stockholm
inn
buses
connect
forth
guests
radiation
receives
lancashire
playoff
cork
generals
intermediate
ba
verifiable
cheers
filipino
reaches
oriented
hamburg
creates
orbit
massacre
dialogue
illness
wc
dress
codes
dawn
isolated
nancy
violations
perth
tenure
ladies
autumn
ratings
incorrect
scout
difficulty
pupils
wealth
hart
toured
allegations
regulation
watching
lodge
eggs
disputed
citizenship
specialist
tasks
intent
instruction
ceased
pride
banner
friendship
panama
corruption
sunk
harm
ernest
pilots
pursue
tape
emigrants
cancelled
revenge
revision
dominant
fee
computing
examination
chen
matrix
das
biographical
kiss
nationalist
luck
crosses
heavyweight
bid
appreciate
ce
enemies
mercury
interactive
math
debuts
preserve
nobel
grande
keeps
structural
marry
airports
veterans
airline
axis
execution
cult
reducing
sp
colin
chester
ticket
belonging
im
entity
judicial
explicitly
bombing
recognised
originated
applicable
founders
fitted
wilhelm
suddenly
parking
absolute
françois
locomotive
preparation
nintendo
declaration
presumably
burial
governing
jamaica
knowing
vladimir
beating
avg
methodist
utf
challenges
kenneth
evolved
celebration
discipline
bearing
belonged
fauna
manuscript
experiments
chiefs
compound
tampa
arabia
associations
equally
dealt
shut
targets
alien
withdrew
depicted
sergeant
diffs
subsidiary
thirteen
thick
extend
dismissed
neo
wire
phd
measured
fat
visits
linux
teach
flights
verse
bennett
warm
dynamic
shaw
breaks
grandson
monuments
lying
lords
michel
treat
raid
congregation
shorter
temperatures
testament
drinking
companion
manila
km²
punjab
imagine
consideration
veteran
doctors
eldest
carries
ruler
wise
shipping
afc
worthy
registration
directory
wyoming
manitoba
vietnamese
ronald
cuban
burns
justify
divine
suppose
sequel
fate
rovers
cole
oral
trans
deemed
boards
span
bryan
santiago
episcopal
terrorist
okay
waves
invented
landed
sandy
acres
paint
actively
indication
stops
excellence
integration
thinks
bibliography
farming
nonsense
marathon
beliefs
redundant
freestyle
aerial
€
preservation
altitude
freely
landforms
simultaneously
psychological
fernando
cultures
taxes
marcus
stakes
dominican
franz
coins
oxygen
^
incumbent
civic
hardly
isaac
spell
craft
inspiration
pairs
vector
arc
professionals
vii
contrary
accusations
approaches
baronet
slaves
mad
spectrum
client
dozen
travels
symbols
plaza
banking
inherited
legion
symptoms
mosque
guys
lab
sailing
orientation
virtually
generic
reasoning
stroke
unions
efficient
opens
impression
discover
relocated
novelists
roosevelt
dancer
phenomenon
preliminary
recognize
anchor
arguing
abilities
procedures
emotional
timber
fisher
prod
cartoon
disorder
fled
demands
lithuania
continent
fellowship
lock
relegated
warrant
pictured
recurring
overview
wealthy
acquisition
eve
filter
addresses
independently
slovenia
observation
roster
disestablished
challenged
threats
fallen
protocol
judgment
grammy
colours
distinctive
namely
opposing
landmark
package
controls
completing
sabha
prisoner
signals
owen
capita
inaugural
intervention
arriving
cylinder
tenth
liu
tested
renowned
shops
dome
philosopher
epic
stem
specified
kinds
davies
collapse
allan
sight
albanian
canyon
samples
perceived
celebrity
priests
louise
workshop
herzegovina
claude
fortune
bars
cornwall
palmer
presidency
tiny
fk
appeals
istanbul
hp
rookie
expanding
calgary
shock
pulled
stevens
employee
yang
housed
tomb
earning
innovation
streams
unity
lucas
grows
armenia
interchange
sized
proteins
proposals
swimmers
mainland
seminary
hamlet
timeline
realize
coup
newport
negotiations
exhibitions
malta
hate
westminster
installation
enters
goalkeeper
julian
morocco
efficiency
chapters
aboard
helicopter
fewer
fortress
ani
burned
displays
compiled
ips
contributors
torpedo
giovanni
chat
catholics
herald
chuck
pit
supplied
optional
desk
garrison
sprint
exile
surprised
achievements
biblical
rebels
te
denis
geographical
sit
alpine
bills
glacier
aa
binding
indicating
estonia
eating
saving
chi
developer
indie
difficulties
doctrine
worn
fork
simpson
moreover
maintaining
theological
upcoming
vocalist
temporarily
hotels
edmonton
developments
literacy
currency
missionary
arrives
hammer
dollar
ambassadors
reverts
twitter
centres
solomon
recommend
descendants
ruth
handling
customs
collect
grid
secured
certificate
destination
albania
indies
euro
consumption
feat
pushing
constantly
survivors
mansion
cardiff
temples
blake
sheet
lift
confidence
cuisine
frankfurt
galaxy
ecuador
breeding
outbreak
legendary
handball
georgian
copenhagen
trek
ignored
arch
keys
proceedings
enjoy
quartet
aims
propaganda
wu
disk
realized
ne
neat
funny
punishment
accuracy
businesspeople
meter
theoretical
suspension
graduation
flew
seeds
lighting
jennifer
smooth
ah
customer
armstrong
southwestern
involve
philosophical
escaped
powell
kills
taste
allmusic
requiring
bros
assertion
boulevard
northeastern
brooks
sending
atomic
antarctica
strikes
reconstruction
chronicle
traveling
leslie
ellis
devon
ghana
gen
rebel
duncan
pianist
canon
nc
reformed
pack
iceland
solve
cyclists
payment
suburbs
militia
pronounced
exhibit
mph
glen
eugene
compromise
tactical
discovers
switched
uganda
jail
yeah
townships
somehow
withdraw
holmes
promise
deals
convert
dos
afternoon
noting
recall
arrive
warrior
mammals
dimensions
surrey
gaming
lutheran
ports
amy
survival
responses
badly
collegiate
scandal
widow
swing
nights
polo
linda
adr
consist
probability
farms
conferences
zhang
crazy
witness
nephew
sensitive
mutual
hd
diet
clients
fringe
passion
rings
stronger
millions
dialect
orlando
undergraduate
relay
wet
cruise
henri
publish
joy
julia
kitchen
abstract
snake
comedian
motorcycle
nadu
reverting
arsenal
millennium
assists
thereby
bow
andré
serie
dimensional
travelled
eurovision
firing
suite
doug
gravity
stored
departed
optical
frontier
evaluation
graph
hybrid
oslo
earn
metre
keyboard
inducted
nearest
jamie
decorated
complicated
nathan
slavery
circular
operators
armor
mechanics
bradford
leon
rachel
footage
strings
header
hood
inspector
warnings
relatives
plains
defended
wheels
criterion
ace
arrangements
penn
approached
joke
sailed
religions
authored
grants
andrews
moderate
stolen
tributary
commanding
pin
carol
owns
prototype
copied
canterbury
midnight
quarterback
duchy
bailey
arbitrators
performers
handled
exploration
diversity
sixteen
findings
repeat
brussels
imdb
planets
theatrical
reconnaissance
shots
complaint
batman
exhibited
espn
investigate
verify
discontinued
absent
girlfriend
resignation
fossil
explaining
tang
inches
proven
yu
franco
dying
tribal
tyler
surrender
glenn
substance
focusing
luxembourg
colored
scholarly
administered
explosion
pushed
generations
duck
porter
permanently
memphis
salvador
emma
mit
zoo
gibson
wording
emerging
mere
notre
portions
macedonia
ethics
depot
curtis
rescued
gaelic
slovakia
elevated
jeremy
listen
impressive
bradley
surely
egg
conquest
rod
cdp
algorithm
burn
thesis
lover
capitol
comprises
remembered
ferdinand
marshal
judaism
balls
nacional
wrestlers
ahmed
sin
holocaust
edgar
saxophone
retain
curriculum
wishes
prepare
ruins
ibm
rochester
nigerian
pitched
jesse
malaysian
atlas
telegraph
performer
cannon
encounter
emily
dissolved
catalogue
discrimination
er
refs
myspace
reveal
wizard
teen
spots
bomber
foods
quest
connor
screenplay
motors
minimal
muscle
prestigious
sustainable
chelsea
strict
kingston
sheep
andrea
complaints
xs
née
connects
nursing
defenders
richardson
triangle
nato
teeth
occasional
strictly
harper
fluid
bigger
fed
newfoundland
disbanded
comparable
documentation
brien
compounds
pointing
edmund
instances
naturally
forcing
ussr
laser
lat
sculptor
guild
observer
worlds
imprisoned
wrestler
praise
parishes
bones
css
cox
contracts
consequences
provisions
circulation
butterfly
hugo
abolished
algeria
edu
sufficiently
armies
separation
spy
cliff
technically
reactions
lithuanian
trick
curve
accidents
horizontal
uploader
legends
enzyme
freight
lacks
hydrogen
broadcasts
viii
caroline
pull
plymouth
twentieth
cuts
mediation
airfield
catalog
dale
synthesis
rape
seoul
engagement
coin
lucy
consequently
platinum
twins
memories
robertson
verified
anthology
milton
geological
defining
dinner
hosting
thriller
retreat
albany
abdul
ignore
migration
carefully
magnitude
sudan
closest
manages
duration
henderson
explorer
marco
fusion
aids
gathered
privately
reflected
afraid
presbyterian
automobile
estates
fault
pound
allegedly
delay
developers
semifinals
belfast
arctic
ps
kurt
mayors
windsor
assumption
plates
fourteen
nominee
disruption
monroe
hearts
belgrade
victories
extending
pale
pursuit
glory
destroyer
deeply
lectures
affiliate
preston
deceased
speaks
gathering
angry
incomplete
enrolled
configuration
brad
skill
intense
tasmania
commitment
loved
reforms
rulers
uruguay
sustained
napoleon
confirm
breed
auxiliary
enabled
discography
licence
refugees
adrian
pipe
karen
altered
budapest
designers
fe
heir
advisor
illustrate
authorized
hide
announcement
compact
tissue
particles
refuses
receiver
civilians
marsh
vinyl
delayed
unrelated
encountered
wednesday
checking
chilean
hey
chambers
demo
nationwide
agrees
ahmad
santos
paying
interpreted
submit
desired
followers
observatory
problematic
springfield
kit
remarks
burton
mo
inns
coached
monarch
observations
footnotes
beetle
promised
palomar
cream
presenter
potter
su
favourite
transformation
mcdonald
bavaria
kumar
nineteenth
severely
gaining
mixture
browser
endangered
mate
everybody
lyon
illustration
kyle
afl
brook
geometry
ping
extends
aggregate
variants
baroque
iso
collapsed
neighboring
integral
jake
hopes
cornell
modes
servant
gt
td
kenny
hurt
mk
maker
inline
carlo
lynn
stability
hoping
beneath
imposed
confusing
mt
summaries
beetles
joel
cf
jets
logos
vital
malcolm
winnipeg
kilometers
songwriters
buddhism
nose
respected
pace
thunder
centered
physicians
bolivia
forget
implies
crops
halifax
toll
monk
extraordinary
lessons
pub
paralympics
monte
maría
segments
deer
wireless
whenever
commenced
mysterious
consultant
fraser
formats
jam
chicken
enable
idol
reid
births
amazing
pet
upset
loves
stretch
nominate
striking
striker
accidentally
louisville
hopkins
eds
goddess
burials
resumed
satisfy
notion
voltage
betty
marion
geology
consistently
cyclone
export
lightning
impressed
maintains
logical
aggressive
jin
julie
fbi
yankees
ludwig
fi
pond
suburban
enlisted
moments
conjunction
interim
argues
lucky
targeted
lon
speedway
regiments
picks
prevented
toy
bicycle
purely
pd
interactions
fraud
lang
arcade
lecture
sanctuary
dragons
copa
careful
nurse
rivals
module
supplement
lens
patron
commands
trend
superintendent
gerald
rap
geneva
ash
blade
disappeared
array
patrolling
predominantly
committees
loose
boom
sailors
beaten
smoke
assassination
lancaster
reynolds
divorce
dust
saxon
healthcare
separately
grain
executives
translations
zimbabwe
thrown
cohen
puts
diving
neighbouring
carroll
accounting
mesa
prussia
intelligent
cherry
underlying
tobacco
cleaned
varieties
bench
directions
ellen
padding
measurement
paradise
alexandria
complement
witch
attraction
diana
personalities
colleagues
busy
cia
screenwriter
rankings
aboriginal
commanders
salem
wagner
firms
sanctions
americas
endings
instructor
nobility
divorced
varies
tomorrow
manuscripts
unified
clarify
scouts
investigations
silva
derek
agenda
provision
humanity
admit
terror
contestants
trinidad
distant
burke
circles
assignment
releasing
recalled
shrine
sail
willie
karnataka
celebrate
ranch
jo
collaborated
vampire
unfree
playwright
sick
associates
heinrich
ethiopia
flags
tel
drove
learns
shorts
drives
accomplished
autobiography
recruited
uprising
edwin
velocity
terminology
raiders
coordinates
brighton
viola
para
morrison
propulsion
boxer
finale
sh
shoulder
disabled
joins
div
tactics
ernst
innocent
rapper
settle
privacy
boeing
cites
bunch
emmy
indo
distinguish
rosa
accordance
thermal
flute
marines
feminist
trustees
sculptures
nationally
bacteria
introduce
landmarks
disorders
rivalry
prevention
honored
healthy
circus
speculation
burma
sec
ka
ar
quiet
knee
deliver
threw
hypothesis
referendum
travelling
estonian
pastor
sofia
tribune
lasting
permit
priority
pounds
cent
consequence
rica
conducting
furniture
macdonald
honest
innovative
estimate
atp
rotation
syracuse
lecturer
automated
obscure
kosovo
classics
julius
appreciated
naples
sebastian
activated
varied
offense
advised
barnes
acknowledged
exceptions
martha
quarters
drawings
barely
hitting
refuge
maharashtra
conventions
elliott
diplomat
unused
searches
brigadier
particle
malayalam
thursday
icon
ulster
genes
infinite
considerably
vale
portraits
paste
randy
ec
saxony
convoy
annie
excessive
believing
rhine
mineral
implement
surgeon
badge
charleston
clause
infection
electron
walt
cnn
likewise
tonight
confederation
accommodate
casino
doctorate
aux
guatemala
settings
mask
shelter
dorothy
ethnicity
hopefully
elimination
heath
pregnant
richards
theodore
delegates
blair
fac
phrases
crashed
preference
janeiro
concerto
headquartered
bits
construct
tune
unofficial
bulk
lighthouse
stan
highland
mascot
squadrons
acceptance
tight
considers
ai
hub
mess
wilderness
routine
reviewing
dubbed
dozens
spotted
harmony
entrepreneur
wwe
apollo
runway
ji
naked
anton
moses
legally
wa
nominating
fake
biased
división
revolt
equity
varying
providence
investors
reliability
tenor
fights
pocket
sad
troy
treasure
ion
rendered
transformed
roberto
nn
adoption
decrease
reserved
forgotten
lok
crop
licensing
advocacy
collecting
treasury
trumpet
johnston
uncertain
norton
collector
cluster
dear
georges
roller
pt
clothes
sovereign
enhanced
compensation
consent
outline
holdings
jorge
darkness
penalties
sk
bombers
hometown
holes
blow
cooking
vfd
aftermath
trainer
rican
measuring
lawsuit
retiring
chip
consciousness
archaeology
latvia
telugu
blogs
protecting
hardy
nicknamed
scorer
stamp
nat
fur
redirected
estimates
lit
ritual
locality
trace
marble
foundations
politically
nottingham
derivative
boxers
dimension
touchdowns
crawford
bats
yugoslav
tanzania
succeed
motto
streak
concentrated
dirty
hayes
xbox
identifying
likes
genres
galleries
forbes
councils
adequate
brass
bach
alias
inland
counsel
wore
comprising
tough
advertisement
protagonist
trails
demanded
claire
mistakes
bruno
dylan
bag
throwing
churchill
tan
spelled
climb
witnesses
storyline
thames
anybody
kazakhstan
bots
constitute
presenting
highlights
jumping
prof
slovak
skull
missionaries
ordained
eventual
hoped
myth
mandatory
stern
fees
bet
monks
dancers
quantity
endorse
inventor
cairo
graves
proximity
seemingly
sue
armament
barrier
creatures
logan
je
erik
leicester
ds
silence
jessica
plateau
finite
precedent
stationed
walsh
zones
intensity
exterior
murders
paragraphs
costume
bike
neighborhoods
imprisonment
suffolk
forwards
remarkable
undelete
differ
tin
garcia
madagascar
cameras
ammunition
fires
viewing
explore
builder
minneapolis
occurring
bullet
kerry
subway
arrow
economist
bread
lou
strategies
rubber
precise
rifles
cognitive
governorate
nest
slam
ancestry
portsmouth
miscellany
convince
audiences
boarding
bonds
joshua
inhabited
casey
attract
nonetheless
eb
kilometres
pump
feeding
prey
ain
mathematician
diary
vulnerable
inscription
dubai
michelle
lebanese
productive
guided
happening
accordingly
gp
researcher
della
baden
upgraded
demonstration
equality
philosophers
spacecraft
trap
gb
clara
invitation
marking
expertise
admission
sacramento
certification
precisely
casting
reassessed
submarines
prohibited
supposedly
governance
sometime
frog
vague
tackles
mhz
secular
tracking
spa
publicity
armoured
cleared
watts
gibraltar
renewed
reflects
fever
melody
supporter
elaborate
jeffrey
discusses
surnames
useless
swift
tuesday
silly
empress
capabilities
newman
scales
onwards
beatles
ko
clergy
jacksonville
sara
lifestyle
bee
holders
baltic
czechoslovakia
brandon
loaded
maya
evangelical
enterprises
imo
mature
physically
sequences
breast
beast
raja
für
anymore
educator
bang
griffin
rhodes
preparing
proportion
enrollment
itv
ana
ceiling
rainbow
demon
prussian
equations
answered
←
ist
perception
distributor
entities
jackie
dynamics
fiji
insufficient
algebra
homer
larvae
limestone
johns
bce
chaos
chang
layers
crater
ki
chad
db
seized
webster
depicting
excess
bombardment
hurling
ashley
dot
gif
translator
cowboys
counted
hanging
soprano
interviewed
governmental
workshops
terrain
belarus
liked
console
nascar
teenage
applying
vandal
graduates
jungle
ballot
placement
fairy
tourists
reasonably
performs
quarterly
shifted
romans
rpm
diploma
circa
environments
collaborative
swan
carpenter
petition
boris
berry
invention
southampton
prairie
bend
app
finalist
questioned
explicit
bo
draws
governed
slight
drag
maxwell
quarterfinals
planes
everyday
oriental
manufacture
airing
acclaimed
coordinator
bombs
mohammad
bassist
superman
colombian
philippe
felix
bengali
greene
voluntary
floating
montenegro
sketch
lo
mann
flooding
escort
dressed
astronomy
sudden
variables
arbitrary
skiing
timothy
cello
rainfall
rafael
sphere
ought
rewrite
georg
cinematography
canvas
chest
krishna
provider
va
frances
crowned
wanting
carved
poles
cabin
civilization
broader
avoided
pl
lisbon
tongue
endorsed
newer
eliminate
gng
panels
darwin
cheese
easter
rat
papua
insert
descriptions
debates
informal
castles
cry
loyal
surfaces
nicolas
institutes
humor
madonna
worcester
cooperative
substantially
trips
winston
introducing
llc
drake
lunar
fountain
consulting
tends
patricia
ers
garcía
quit
rid
maple
aberdeen
verb
illustrations
mechanisms
reds
posters
underwent
leipzig
santo
engaging
roland
expelled
evident
staged
telecommunications
mc
strait
nationals
unanimous
misleading
sherman
stefan
sleeping
alto
crucial
thomson
directing
yo
prehistoric
communicate
alphabet
zagreb
fu
chef
ix
tubes
carnegie
hostile
prizes
eighteen
shirt
weird
namespace
pursued
loading
edges
plantation
vernon
acre
practiced
wonderful
missiles
br
guides
finger
garage
savage
technological
rely
rises
shoes
climbing
barrel
biographies
spite
assembled
ham
hon
geoffrey
valve
histories
commenting
osaka
beck
analog
monkey
tackle
listening
integrity
sits
definitions
critically
operas
baldwin
troop
objections
marina
detection
sixty
differential
linguistic
venus
salmon
monarchy
stade
depicts
francesco
abortion
monica
elephant
polar
prompted
trademark
ton
provisional
counting
tu
corn
trucks
sank
airborne
lengthy
deutsche
absorbed
par
tension
pablo
aviv
cleaning
judgement
physicist
catalina
preventing
disasters
ni
lesbian
rays
withdrawal
walks
realm
trailer
janet
tornado
aunt
distribute
genesis
shallow
makers
mentioning
requesting
floors
violate
scattered
boyfriend
consolidated
determining
catholicism
luther
pleasure
seeks
constructive
webb
battalions
alberto
magical
deliberately
sacrifice
faction
olive
corridor
compatible
receptor
molecules
kiev
lb
vista
castro
preceded
wei
pleasant
miscellaneous
lineup
ultra
nowhere
bulletin
clinic
itunes
batted
patriots
pollution
treasurer
…
herman
questionable
sally
rex
mvp
stuck
proceeded
openly
afghan
inferior
vatican
folklore
affiliation
cruiser
princes
silk
agreements
trilogy
armour
eng
stamps
harder
specimens
consumers
differently
tallest
defines
cage
vicinity
ferguson
correspondence
illustrates
tehran
cheshire
pageant
amanda
pos
bye
employer
browns
static
tibetan
nixon
sociology
recreational
capability
addressing
panthers
objection
systematic
violated
nl
chocolate
helsinki
bedford
cambodia
grandmother
rounded
fitness
halls
midlands
jung
brilliant
libya
transported
angela
creature
gather
volcanic
android
filmography
deposits
mentor
freeman
apartments
csd
uci
puppet
carey
hiding
sends
meyer
strongest
gameplay
topped
lima
freeway
kirk
batteries
chains
ucla
recover
wong
ear
servants
combine
decent
cubs
deputies
hawk
contestant
cumberland
rational
altar
ufc
organize
exhibits
subjective
determination
chennai
dong
threshold
myanmar
proud
custody
surveillance
ng
threatening
allegiance
discharge
penny
ra
kannada
astronomical
rector
petroleum
bug
proceed
chapman
mud
decree
regulatory
honey
diagram
infringement
broadcaster
complexity
riley
displacement
cr
sponsor
ira
condemned
careers
haiti
wisdom
pierce
stupid
alma
alexandra
continuously
shuttle
hampton
eva
mickey
atlético
gastropod
relegation
yuan
sv
lynch
johannes
ecclesiastical
destroying
trunk
proclaimed
genuine
adolf
sharon
acquire
guitarists
hung
corrected
theatres
gardner
artifacts
handful
mandate
locked
mohammed
dish
terrible
revived
reagan
rebecca
toledo
indefinitely
divide
recommendations
gymnastics
alter
ideology
carriers
constantinople
fatal
modeling
monsters
notified
bulls
harassment
censorship
favorable
bore
commercially
inserted
paramount
luxury
yahoo
ecology
reviewers
emirates
neutrality
chronicles
deployment
lepidoptera
blast
jenkins
omaha
detected
attribution
harsh
cn
bundesliga
overcome
monarchs
facilitate
hermann
morton
ya
sake
conceived
roma
accredited
echo
bangkok
achieving
bergen
hawks
availability
organizing
lung
andreas
cedar
stake
bankruptcy
flagship
lankan
unassessed
basement
spirits
torture
toyota
wheat
examined
membrane
knockout
coleman
homeland
maiden
dhaka
dental
christine
curious
tucker
evolutionary
responsibilities
viscount
communists
pavilion
fantastic
ceremonies
augustus
loans
altogether
joey
corporations
tagging
moss
parody
yield
beside
sizes
armored
neighbourhood
cheap
stuttgart
dialects
backup
forbidden
sanskrit
neill
councillor
knife
wade
humanities
premiership
gonzález
yesterday
percy
expressway
transform
vegetation
tickets
sf
dams
adopt
braves
throws
plural
filmmaker
pomeranian
habitats
sicily
justified
speeds
zip
updates
replied
restriction
lin
siblings
measurements
uttar
courthouse
tide
laps
dynamo
predicted
uncredited
charitable
feast
poker
promising
po
kane
trevor
reprinted
attractive
sidney
factual
holidays
wondering
initiatives
comfortable
sw
outlets
fletcher
airways
floyd
snail
genocide
saga
undertaken
scots
kingdoms
violating
weren
archdiocese
lu
fossils
buying
commissioners
smoking
enormous
subdivision
ta
leather
def
patch
unreliable
porto
valencia
subfamily
tracy
downloaded
breakfast
subspecies
andhra
mercy
botanical
peaks
ore
lone
bottle
tunisia
kw
angola
humanitarian
decorations
slide
invalid
generating
trapped
rao
guru
freshwater
telescope
dense
satisfied
criticised
coral
xavier
obituary
helena
rowing
albeit
anatomy
binary
merchants
controlling
wound
conviction
publishes
dixon
kuwait
ridiculous
suffer
nsw
blacklisted
nord
caves
regent
riverside
shane
balanced
fitzgerald
highlight
payments
tendency
lópez
joyce
belle
lifted
davidson
decorative
rouge
clue
rockets
cure
bolton
slope
manning
boyd
organisms
paraguay
preview
dodgers
qatar
import
buddy
iucn
dock
clare
penguin
rapids
exceptional
papal
shed
cologne
providers
beings
norwich
challenging
anthem
supervision
investigator
ming
lease
yours
cecil
toys
conrad
buck
hang
bulldogs
italia
portable
dell
sd
irving
bells
tibet
composite
councillors
salary
blacks
enhance
staying
realizes
sysop
dropping
smallest
fragments
objectives
caesar
tragedy
macedonian
unesco
investigated
cameroon
carson
slot
rams
icelandic
sophie
granite
jenny
deciding
comparative
apostolic
venezuelan
pirate
paralympic
slovenian
accompanying
seasonal
delegation
combining
portfolio
machinery
miners
embedded
keen
charted
gandhi
fiscal
testimony
duchess
cu
reflection
middlesex
carr
freshman
booth
rudolf
león
courage
midland
riots
develops
pier
isles
formatting
woodland
toss
auction
cbc
specifications
acclaim
secrets
crest
sutton
decreased
ups
repairs
que
bypass
prominence
slopes
phantom
hartford
alps
liner
ease
cet
investments
auburn
flame
exeter
din
kay
rent
highlands
induced
verses
crane
simulation
schemes
latitude
lance
networking
keeper
wanderers
fixing
asset
prospect
knocked
merits
prejudice
highlighted
stance
gymnasium
gloria
trent
choosing
horizon
yemen
readily
riot
arose
renovation
quinn
excluded
elena
successive
anger
hierarchy
ml
nm
decoration
nerve
disability
giuseppe
boot
exceed
scan
rca
pioneers
canberra
mb
coupled
newark
flowing
fungi
sensors
battlefield
shells
ashore
arguably
cottage
honduras
resides
shepherd
reservation
elderly
pbs
mapping
mongolia
commentator
bi
nash
sharks
ricky
shirley
stopping
exercises
stepped
ribbon
infant
destroyers
mild
vikings
peers
correspondent
renovated
proceeds
reproduction
sings
mercedes
compliance
sided
bombay
jointly
barack
premises
hawaiian
poly
handbook
rené
screenshot
imported
transactions
statute
congressman
jerome
shark
choices
doom
bare
raids
diagnosis
referencing
firearms
stoke
bride
prolific
selective
criminals
lindsay
advocates
automotive
wartime
worry
vacuum
heating
rushing
jp
lengths
radius
johan
inactive
advances
valued
celebrities
meta
estadio
rehabilitation
synonym
asserted
corp
ambiguous
leone
sailor
racism
negro
tears
inspection
ancestors
honda
barn
controller
assert
abbreviated
singular
vacant
depend
ag
burden
encounters
fruits
ruby
lunch
nationalism
photographers
mick
classroom
btw
delegate
clyde
forums
cups
challenger
recommendation
shi
instant
falcon
loses
planted
nordic
hannah
coordination
hammond
dances
barracks
skilled
qualifier
saves
hardcore
precipitation
bucharest
ranger
lily
surveys
peterson
albion
wicket
destinations
calm
photographic
schmidt
subtropical
contacts
asylum
deserves
hyderabad
visa
twelfth
contacted
matching
pregnancy
blatant
noteworthy
commit
stale
expectations
carmen
erie
harvest
flies
kashmir
theft
relates
occupy
flames
wesley
pga
umpires
beats
arlington
sterling
chances
donna
petty
lets
sonic
tr
transparent
dedication
clarence
katherine
nervous
facto
excluding
everywhere
sunset
anthropology
minds
reward
narrator
wingspan
antoine
ally
steep
primitive
gateway
leigh
slavic
derbyshire
lopez
chronic
donations
imaging
suited
europa
duet
ignoring
intel
winchester
broncos
buddha
pig
whale
beaches
approaching
archipelago
attributes
minerals
justification
frozen
inherently
fischer
gains
dee
comparing
accurately
participant
peaceful
typhoon
hank
manually
verlag
shortened
tram
aerospace
alignment
arabian
ivory
hastings
taxa
oz
holly
traces
constellation
allocated
healing
frames
rey
lorenzo
specification
risks
professors
viking
atmospheric
pottery
bloody
eden
dd
mormon
nielsen
consecrated
peruvian
coloured
relevance
interference
candy
bitter
flown
longtime
struggled
monetary
trends
individually
funk
basque
bangalore
synagogue
latvian
elect
pot
interact
crews
yi
paths
var
guarantee
intact
lacrosse
raj
immigrant
aus
patriarch
themed
blessed
tailed
oversight
servers
ecological
rabbit
phi
sectors
corners
powder
mega
vfl
conservatives
trivia
clips
coventry
wives
plaque
institutional
inform
vic
obtaining
summers
detect
precision
conservatory
sunshine
motivated
diaspora
linguistics
gm
lc
accepting
flowering
nazis
savings
underwater
deeper
mccarthy
avoiding
toxic
georgetown
screening
unaware
phenomena
inaugurated
rolls
liver
aided
benedict
commodore
hollow
mob
dawson
emi
brett
alert
seventeen
bounded
insisted
loyalty
reunion
patterson
collision
lighter
linnaeus
zl
filling
enacted
matthews
willis
secretly
vowel
whatsoever
encouraging
finest
drainage
disciplines
shooters
lambert
seth
kai
emigrated
bermuda
wow
noun
reversed
factories
valentine
plasma
virtue
heated
phillies
bryant
carnival
notices
bates
partisan
amended
ballad
dairy
insight
sorts
wolves
hay
kolkata
trafficking
momentum
alternatively
peerage
labeled
ricardo
devils
solved
endurance
blame
pearson
gaza
hal
teachings
convent
dealer
ct
neal
frost
bout
bugs
consistency
demanding
wrecked
steady
explored
fuller
shannon
watershed
specially
hansen
implied
mg
extant
reverend
lanes
whitney
mistaken
pupil
intend
adequately
limitations
lateral
varsity
lovers
intentions
comply
securities
prohibition
acute
container
emissions
upgrade
notation
omar
taipei
torres
dining
survivor
corpus
advancing
shaft
nicole
turks
marker
locals
observe
dishes
cove
seating
dont
pour
loud
hu
suspicious
expense
prosecution
noah
hyde
eternal
terrestrial
ethical
countess
tender
reflecting
yearly
wholly
pushpin
masses
modest
dudley
barrett
explosive
fathers
pas
revealing
juvenile
similarities
conquered
relate
colts
violates
surrendered
denomination
feminine
contentious
softball
compression
micro
sovereignty
reef
brady
walked
scenic
eleventh
shadows
breach
behavioral
monasteries
rope
owing
spare
cache
wimbledon
practitioners
duplicate
wards
notorious
parkway
ronnie
odds
andre
½
mos
maternal
transmitter
steal
identifies
tertiary
pension
hc
scenario
constantine
specimen
mare
dolphins
derives
gloucestershire
metals
teaches
incorporate
oceania
galway
mighty
lightweight
processor
prints
somalia
unblock
og
switching
possessed
hr
phillip
bilateral
descended
gloucester
pyramid
eps
emeritus
nina
val
wolfgang
gill
cooling
hiv
electorate
sexuality
dash
doctoral
rita
outright
constituted
nam
timing
peters
tribunal
stalin
stadion
organizational
memoirs
pitchers
presumed
mayer
beaver
ja
sm
gonna
gustav
motorsport
facade
spike
titans
possess
apps
cal
carlton
klein
hassan
dundee
translate
flesh
medalist
clayton
pistol
mikhail
expired
unnamed
clube
watchlist
valleys
benson
profits
acids
travis
distances
enclosed
asteroid
cow
accommodation
catalan
warwick
textile
brewery
sanders
clarification
deadly
qualities
marching
assess
gotten
witnessed
marx
chassis
intensive
usd
phones
marcel
weber
diane
detachment
operative
gazette
thoroughly
specify
fritz
marketed
cloth
doyle
statues
nigel
qing
sheikh
cafe
sandbox
breeds
orphan
monaco
folks
meaningful
invaded
norm
sen
berkshire
illustrator
sl
celebrations
beverly
macmillan
strain
motorway
contracted
constitutes
williamson
pronunciation
disco
presently
talented
deletions
forestry
rodríguez
exam
coined
struggling
advantages
mood
wheeler
karachi
blackburn
unsuccessfully
educators
beth
swimmer
tunnels
wagon
masculine
offshore
darren
gerard
consul
excuse
unreleased
investigating
mohamed
salon
checks
trigger
reinforced
enabling
shoots
pipes
atoms
invisible
ibrahim
memoir
intro
inception
implications
coronation
tobago
werner
gifts
livestock
companions
differs
confirmation
volcano
jupiter
prophet
taxi
disagreement
dresden
originating
refuse
cyclist
deserve
detention
persistent
scripts
diagnosed
iris
dad
abbot
elvis
dorset
mel
laboratories
localities
responding
lorraine
fabric
seas
shin
starter
clip
villain
chemicals
replaceable
multiplayer
nr
astronomer
immune
comprised
bologna
flexible
modifications
samoa
tai
gamma
reformation
convincing
cherokee
posthumously
homeless
triumph
prefix
carriage
obsolete
communism
unreferenced
consortium
stark
locate
greeks
inclined
effectiveness
advocated
surgical
souls
retire
cs
renewable
safely
cameo
dissolution
unemployment
racist
torn
baghdad
malay
dana
clarinet
sodium
moldova
backs
crow
interrupted
columnist
quantities
arise
pseudonym
eclipse
organs
philharmonic
christie
incorporating
château
christina
countryside
conception
angelo
dir
lincolnshire
wished
regina
rectangular
transfers
yacht
picking
secondly
algorithms
nassau
constituent
sánchez
falcons
termed
bart
genius
defeats
informational
punch
specialty
clash
continuation
charities
abdullah
remedy
judged
josef
loosely
uninvolved
comfort
branded
broadly
cadet
helicopters
nave
shooter
lotus
enlarged
dwarf
logged
affecting
packers
antwerp
glad
fold
knox
billed
terrace
flanders
tri
socialism
sandstone
renault
fleming
judy
infected
informative
happiness
warehouse
miracle
leisure
kindergarten
leopold
barker
synthetic
waterloo
sage
adviser
accepts
trim
basilica
sitcom
cartoons
irregular
atom
normandy
financing
admits
embarked
passive
modification
shiva
tudor
breakdown
alfonso
badminton
fiber
fiba
mice
carlisle
unveiled
subdivisions
tired
mu
wage
fingers
cest
loving
worried
gambling
basel
randolph
allison
mothers
capitals
auditorium
vault
haute
persuaded
descendant
poison
gujarat
royals
yan
damages
eleanor
parsons
invested
calvin
dover
mafia
resembles
offs
anon
deny
shapes
madras
wears
professionally
essence
emotions
myers
saturn
transmitted
paperback
freed
domains
hanover
mets
pike
sketches
cellular
augusta
sinclair
examine
unexpected
sponsorship
ae
dab
bmw
uncommon
aria
realistic
barton
miranda
dodge
stint
oc
sunderland
plug
orbital
spreading
puzzle
symbolic
diamonds
jesuit
viable
flee
lined
roses
competitor
munster
midway
germanic
urdu
rides
scratch
lahore
pipeline
protective
levy
rodriguez
seymour
affects
pac
jules
archer
kicked
promises
remake
forts
accusation
reopened
phases
rendering
progression
complained
liberals
indefinite
lacked
clement
batsman
mackenzie
judith
chartered
salisbury
discoveries
pursuing
primera
polls
reactor
cope
beds
howe
calcutta
stems
dubious
desperate
lafayette
uniforms
outcomes
tips
partition
ministries
conclusions
destiny
sigma
reporters
daytime
aurora
goodbye
relisted
webpage
coordinate
steering
pepper
drunk
folded
seals
scouting
swim
cunningham
lexington
persecution
expenses
theaters
controversies
agnes
higgins
projected
functioning
query
seine
byron
nowadays
accomplishments
pérez
averaged
fool
downs
brave
mozart
gil
dirt
mathematicians
aquatic
noel
inventory
ethiopian
evaluate
meanings
reluctant
rgb
imagination
barber
slogan
reich
hurricanes
desktop
demonstrates
amusement
hunters
drops
letting
sept
demolition
pi
hague
bronx
dukes
natives
ancestor
preparatory
fence
retaining
rna
strikeouts
turin
crush
algerian
accusing
exploring
reunited
wii
narrowly
cone
nz
württemberg
urged
taiwanese
capturing
continuity
enforce
clouds
verification
owl
struggles
namibia
pratt
tate
romeo
reduces
regulated
lonely
deaf
shire
structured
newsletter
ink
dwight
swept
menu
steelers
pointless
declare
paolo
supernatural
privileges
pleased
commercials
faithful
knock
launching
jubilee
finishes
lamb
resemble
hesse
storms
simpsons
storey
flemish
logistics
consort
satellites
kb
dub
elisabeth
apache
antenna
ashes
fin
advise
trump
kidnapped
enables
wounds
alternatives
turtle
greenwich
cake
tributaries
heather
neologism
hobart
pulse
camden
meyrick
eaten
irrigation
kyoto
labrador
vera
honestly
ninja
midfielders
transcription
validity
employers
winger
redesignated
pays
curse
breakthrough
cycles
node
europeans
expressing
recreated
thirds
privy
habit
viewer
russians
shields
analyst
exposition
interval
decommissioned
igor
oath
filing
generator
welcomed
pg
bahrain
pressed
stays
twist
griffith
aston
bordered
nicaragua
bacon
bury
frankly
smithsonian
premium
ned
mayo
walton
stevenson
colleague
curves
fraternity
xii
encoded
jacobs
hire
isolation
raven
dig
introduces
verbal
appropriately
noticeboard
patents
usc
holt
tx
utilized
searched
clifford
sonata
cb
hemisphere
solving
slang
thrust
trout
aesthetic
predators
panic
tulsa
aaa
teresa
warming
advancement
hunger
ears
tract
laurent
vandals
resign
burnt
splitting
courtesy
rodney
encyclopaedia
treating
subscription
perspectives
augustine
eighteenth
midwest
pasha
amber
dayton
universidad
revelation
designing
overtime
rows
functionality
ave
pts
mastering
deity
shocked
autonomy
kuala
conway
frequencies
magnus
protesters
inflation
academia
rc
hancock
unacceptable
passport
interpretations
creators
appealed
baba
viral
clearing
woody
ventures
tab
marginal
meantime
inscriptions
shame
genetics
motivation
yoga
rotterdam
marvin
teenager
northumberland
vocational
afterward
laurel
overhead
balloon
sponsors
firstly
wha
greenland
demographic
rahman
attained
ivy
eg
screw
talked
wigan
accreditation
vicar
surprising
sandra
overwhelming
rats
mali
baku
stranger
installations
payne
mod
projection
ferrari
collectors
sands
subtle
strips
shoe
basket
moroccan
fisheries
gfdl
tally
hadn
voter
resolutions
silicon
rage
brent
aliens
cao
imagery
nucleus
ceremonial
practically
simmons
circuits
strand
definitive
promotes
tooth
uranium
complications
sega
clusters
zambia
employ
ky
isabella
abandon
marriages
tunes
probable
amino
accent
incorporates
garnered
playwrights
robots
slate
weston
parachute
escapes
lauren
skiers
belarusian
clearance
snails
martínez
mercer
crimson
incorrectly
meal
islander
pharmaceutical
dominion
chemist
medley
proves
porn
mirrors
remnants
expressions
vegetables
teammate
leningrad
psychiatric
laos
traced
wiltshire
tense
ant
hectares
fears
curling
fits
af
inning
rotten
fernández
fury
underneath
bihar
kraków
confident
pioneering
suzuki
mobility
fascist
marquis
registry
numerical
catches
olivier
halt
chronology
polytechnic
randall
aligned
dunn
scrapped
alike
leased
chandler
sophisticated
spur
acknowledge
laurence
textbook
arabs
bother
xx
rewritten
sealed
verde
landscapes
curved
reilly
spatial
commissions
databases
approximate
inheritance
counterpart
cease
ugly
fraction
traits
rivera
transaction
decay
lineage
substances
sued
guam
ensuring
syndicated
fencing
zhou
finalists
mir
electrons
§
avant
turbine
install
majesty
zürich
outlet
eduardo
vince
mint
brotherhood
iconic
katie
bizarre
ign
gr
swansea
announces
wheelchair
refusing
slip
shri
gore
herb
rep
stripped
crescent
possibilities
syntax
genome
hale
railroads
sonny
favored
jonas
hawkins
similarity
sheets
suspicion
molecule
winters
launches
tonnes
nautical
preferences
cds
casa
tricks
burlington
partnerships
confined
terminated
salvation
adjusted
bb
guaranteed
swamp
uc
wwii
overly
hanna
dove
jennings
beneficial
annexed
sutherland
gym
patriotic
mollusk
suspects
squares
ads
collectively
staffordshire
eet
revenues
müller
bremen
licenses
seventy
cultivation
credible
ingredients
benny
stafford
stephens
shipped
xiii
dramatically
racer
translators
beef
password
modules
comeback
precious
dei
computational
apologize
moist
aging
garrett
dinosaur
emblem
plague
broadcasters
wan
christchurch
pie
spiral
marcos
tensions
admiralty
orchestral
screened
ev
cement
refugee
continually
chaired
madame
viewpoint
miniseries
luigi
ieee
donation
numbering
café
hare
correction
jensen
crusade
talbot
developmental
brooke
extensions
cube
migrated
goa
vacation
maggie
boots
submerged
ambulance
disabilities
sanctioned
ligue
tyne
byrne
importantly
luna
flint
skeleton
burmese
kurdish
lawn
école
weaver
mai
einstein
forgot
gould
crosby
livingston
titular
wreck
niagara
demons
treatments
sinking
capt
vintage
perkins
gdp
pius
stereo
troubles
tina
resume
unlimited
footnote
molly
robbie
attitudes
thoroughbred
decisive
mans
boost
norse
neighbors
restoring
portrayal
ti
afford
rim
procedural
assam
abundant
wendy
gran
feud
jockey
breathing
alison
niger
prone
whereby
emmanuel
birthplace
doll
oaks
umbrella
elliot
bahá
statewide
edith
campuses
showcase
rotating
angles
tolerance
survives
insignia
berg
combines
resistant
paula
affiliates
memorable
traders
sink
concentrate
prop
fp
um
hiking
mansfield
refusal
rama
paved
afb
airplane
demonstrating
violet
northampton
blamed
wordpress
globally
constable
abbreviation
ku
gentleman
cosmic
shawn
piper
banker
villagers
definite
bullets
holden
lionel
demonstrations
daniels
drinks
prosecutor
photographed
inaccurate
blown
bollywood
irene
chamberlain
dive
sworn
anita
apartheid
ada
simplified
reject
abbott
stephanie
longitude
mack
lucia
smile
classrooms
feared
injection
calcium
redskins
jill
nurses
sophia
swords
pokémon
prominently
multimedia
buchanan
declaring
proving
judging
sanction
employs
sant
horace
darker
emerson
mandarin
garde
ipswich
borrowed
bred
vaughan
criticisms
ambitious
philanthropist
discourse
rajasthan
collar
honolulu
inherent
jamaican
fcc
advent
priory
inability
brigades
sunny
talents
jakarta
natalie
chin
lester
macau
barbados
activism
consumed
spectators
wages
undertook
silesian
inmates
homosexuality
teamed
recruit
sensor
cp
floods
thereof
weakened
twilight
lowe
nitrogen
assassinated
harmful
denotes
sergei
kirby
displaced
occurrence
firefox
strengthen
américa
leonardo
settling
categorized
irc
brittany
lds
rite
proportional
speeches
una
contributes
digit
lobby
supervisor
minorities
spectacular
skaters
bays
bean
bg
blades
nyc
pitt
printer
genoa
brandenburg
residences
thumb
wikipedian
lublin
isabel
civility
sol
luftwaffe
seated
overnight
marty
segunda
kerr
butterflies
televised
hostage
lennon
fowler
labs
instructed
casual
rude
cubic
commentators
soup
loch
battleship
dissertation
denominations
python
extract
peasants
throat
celebrating
finn
mound
mozambique
fortified
inadequate
uzbekistan
stained
punjabi
caste
prisons
stirling
limiting
senegal
beirut
touched
repeating
sexually
wildcats
deposit
olivia
colombo
observers
hubert
basil
ou
batch
nh
greenwood
breath
baton
spokesman
caution
meadows
turnout
aluminum
revisions
stein
jokes
italics
bark
azerbaijani
diabetes
flickr
framed
lesson
tenants
captains
factions
conscious
alley
schneider
bert
subset
bowie
geo
meditation
rue
undergo
promotions
diplomats
critique
browne
succeeding
purchasing
pastoral
liz
horns
emergence
canceled
marched
damaging
trustee
resorts
taxation
miniature
conclude
dartmouth
clarity
incredible
resided
forewings
chevrolet
qualifications
poetic
void
bosnian
cab
hereditary
passages
mater
principality
increment
contests
omitted
gentlemen
nj
underway
tumor
torah
catalonia
cricinfo
memorials
warn
geoff
jimbo
locks
bees
vanderbilt
commemorate
equilibrium
dorsal
realism
adaptations
johannesburg
bent
nonprofit
intentionally
becker
bs
eponymous
assumes
advertisements
humorous
shores
hoffman
enforced
chung
juniors
oman
renewal
flank
okinawa
outlined
helmet
straw
grouped
downstream
tens
anticipated
incivility
steele
goodman
nutrition
havana
messenger
wi
robbery
cultivated
unusually
ducks
castile
screens
peaking
bowler
huang
proposing
packaging
int
nu
apology
trauma
tt
reyes
subjected
capped
draught
whites
vols
laying
extinction
troll
divisional
pulling
attribute
proto
proxy
coding
canoe
gasoline
podcast
commuter
bloc
encompasses
cpu
borderline
displaying
geometric
chaplain
tended
chord
barrow
angus
leinster
costumes
domingo
cowboy
jew
ge
preparations
qualifies
sculptors
koch
sts
bedroom
prevents
minded
declining
choral
interpret
grape
expeditions
airs
limerick
khorasan
haunted
offspring
xp
marxist
aluminium
permits
tomatoes
savannah
tapes
quarry
australians
disappointed
cola
crafts
irvine
curator
rbi
davenport
receivers
accidental
marseille
opted
gibbs
rand
insect
sergio
combinations
appreciation
enzymes
awkward
spanning
pressing
dioxide
theologian
progressed
receptors
encourages
corrupt
donovan
ht
hogan
sg
nominees
logging
ole
evacuated
doubled
conversations
questioning
gavin
targeting
implementing
ceylon
depiction
converts
symmetry
excavations
reuters
hercules
martín
devi
sheridan
unfair
pile
muscles
reptiles
repertoire
sorted
samurai
cassette
stripes
ak
juice
proprietary
fierce
necessity
stealing
hilton
blu
giles
compelling
stomach
rushed
abusive
buttons
gram
hinduism
fortifications
reggae
homosexual
apprentice
partnered
hussein
sells
disposal
pharmacy
synthesizer
liability
cam
dinamo
sci
contexts
cannes
willow
hbo
starr
statutory
winged
concurrency
recruiting
periodic
slower
pagan
matched
defendant
specializing
crashes
pizza
evan
rwanda
mls
avengers
thornton
ghosts
exploitation
xml
ul
colt
ensuing
pulitzer
supervised
flooded
repaired
integer
instantly
mistress
securing
ethnicities
mortar
plausible
cop
activation
lyric
cypriot
assumptions
methodology
ios
gale
dominic
nominal
bud
melissa
skip
hapoel
invite
rt
reside
martinez
affiliations
medina
debris
successes
frustrated
engineered
arranger
»
infinity
corresponds
cretaceous
bite
nodes
bavarian
marilyn
rumors
neighbor
suffix
fare
ordering
lifelong
campeonato
fuselage
wanna
capitalism
harlem
warwickshire
delivering
pigs
centennial
rituals
justices
joão
bolt
wakefield
infamous
automobiles
damascus
listeners
simpler
orioles
xv
dried
ke
hai
assignments
porch
hymn
⋅
indirect
verifiability
uruguayan
vhs
sioux
convenience
statesman
detached
practicing
chu
abs
hiatus
banning
sul
paired
relied
watt
kite
offence
sylvia
frigate
convey
dramas
shade
trades
marian
coastline
clive
lamp
camping
chips
earthquakes
assisting
indicator
bloom
emil
prayers
attorneys
napoleonic
mortality
bruins
processed
pathway
treatise
malik
res
cliffs
caliber
bordeaux
concludes
lowell
atop
relies
marino
scorecard
recognizes
heavier
prevalent
ruined
kidney
emphasized
uncertainty
documentaries
teddy
omega
cheng
denial
sunni
granting
outreach
ordnance
neville
brush
rolled
alain
mama
evaluated
ein
styled
turbo
pietro
demise
hydraulic
eminent
apologies
baxter
addiction
anders
gubernatorial
emerge
zu
solidarity
posed
maid
confluence
physiology
algebraic
brake
loire
jacket
spinning
brendan
scarlet
explorers
impacts
dante
gerry
regained
granada
cemeteries
royalty
ins
confronted
boulder
announcer
inconsistent
amphibious
spears
mariners
mutant
oracle
slim
erosion
borneo
positioned
coordinated
explanations
transformers
evelyn
profiles
beethoven
digits
usgs
raises
parole
constraints
notification
vein
frankie
economists
terra
bahamas
spark
kappa
potato
interred
impose
acc
kathleen
unconscious
filmmakers
cl
endless
absurd
apex
evacuation
lil
mortgage
considerations
zur
suits
lagos
hm
realizing
quotation
representations
pilgrimage
truman
ashton
yields
taliban
concise
remarked
olga
prostitution
flyers
prestige
contractor
xiv
expeditionary
bubble
mauritius
deliberate
incoming
panzer
antiquity
convenient
wines
directorate
approve
gifted
wilmington
tempo
dexter
skipped
upstream
islanders
bp
perennial
kerman
rai
fulfill
api
hatch
unionist
unfortunate
immunity
snakes
nile
rm
duel
assessments
bern
daisy
authentic
attachment
xvi
lawson
laureate
rupert
mccartney
relativity
nikolai
cord
exclude
fencers
shipyard
entirety
emotion
cobra
sensitivity
intimate
soils
plots
holstein
investor
huntington
neighbours
retains
barons
enrique
roth
psychologist
kaiser
darling
judiciary
vicente
streaming
jeanne
replies
converting
klaus
anarchist
oxide
removes
overlooking
occupies
stella
dose
zombie
canadians
dwelling
backwards
liberia
blackpool
modeled
offline
norris
serbs
harp
rental
relieved
botanist
guilt
fried
alexandre
concentrations
markings
verdict
cyril
mnm
hertfordshire
discovering
aces
shropshire
spiders
compulsory
woo
bonnie
crack
incarnation
oldham
liam
positively
mgm
everett
quincy
firmly
rode
utrecht
volcanoes
britannica
amendments
proposition
threads
triggered
humphrey
bust
fitting
discrete
peterborough
zurich
deliveries
lumpur
wolfe
unfinished
drill
exotic
teenagers
sauce
probation
loaned
archie
javier
exports
cartridge
clifton
nagar
deluxe
commando
deportivo
feeds
optimal
rhythmic
kernel
springer
austro
dramatists
bans
tragic
credibility
resurrection
conditional
ind
gaa
usb
belize
researched
prosperity
standardized
comet
pools
troubled
negotiated
tasked
markers
kenyan
archibald
lava
victorious
promo
creativity
protocols
beloved
asserts
esther
objected
peel
heidelberg
chooses
groove
lantern
savoy
saunders
facial
confrontation
ladder
bohemia
thief
guyana
vanessa
halloween
sentiment
appointments
levi
rodgers
homestead
realised
plc
sainte
pune
sarajevo
announcing
dinosaurs
dominance
precursor
laugh
financed
lars
viktor
scrutiny
sandwich
apparatus
utterly
pornography
toulouse
tap
incredibly
alarm
cruisers
preserving
rover
granddaughter
exams
halfway
acceleration
raced
shelley
adverse
competes
intervals
nichols
induction
privilege
trombone
aforementioned
airplay
edison
eyed
magnet
martyrs
suffers
captive
codex
wool
forensic
exciting
majors
surprisingly
vowels
seller
platoon
dia
fog
honorable
cornish
bt
triangular
yuri
hiroshima
selections
wash
freddie
nationale
undrafted
signatures
treason
babies
persia
wilkinson
whoever
sacked
glaciers
sustainability
leaning
recognise
lumber
receptions
ballads
pillars
turret
residency
reginald
doubts
zhu
une
owens
lately
veterinary
guggenheim
reputable
hector
lounge
undergoing
presenters
sacks
tara
jumped
curry
goat
nightmare
burst
yokohama
behaviors
secretariat
spans
illusion
wta
icons
electro
teens
romanesque
shake
elias
resist
planetary
pseudo
alba
biodiversity
shifting
bluff
anxiety
zhao
rogue
dolls
pitching
barney
sikh
suffrage
feathers
richie
marathi
habits
blend
extraction
courtyard
turf
desirable
expo
bhutan
guerrilla
esp
deadline
ea
bon
exchanges
whip
farewell
cardiac
sensible
deities
replica
smiles
sympathetic
reproductive
cousins
terrorists
acquiring
heinz
songwriting
financially
lizard
zen
remixes
exiled
negotiate
axe
flour
jade
insane
pose
fry
ida
goose
scientology
chiang
dom
pact
garbage
regency
fulton
reorganized
sixteenth
riga
mom
xu
accompany
nw
hasan
mosaic
icc
ark
subcategories
joachim
bahn
rutgers
comma
crude
taxonomy
alcoholic
rom
caucasus
charlton
villains
eliminating
eighty
occupying
superhero
mao
baptiste
paz
sid
loads
lime
raleigh
crossover
karate
strengthened
brig
dickinson
drought
delays
socially
maccabi
ajax
chargers
rejection
schooling
caucus
counterparts
reconstructed
investigative
catcher
prev
thor
amnesty
tracked
surfaced
thirteenth
bohemian
failures
soviets
wichita
barriers
lottery
grateful
outskirts
turtles
meadow
electromagnetic
conan
likelihood
endowment
wiley
formatted
patriot
deacon
infrared
dioceses
specialists
julio
paso
kang
bourbon
tf
promoter
nineteen
surgeons
greco
militant
gable
quoting
thatcher
weigh
parma
librarian
chairs
vc
comedians
stressed
watches
undefeated
communal
tablet
palatinate
copying
flip
descriptive
shelf
upright
nursery
synod
packages
unsure
disclosure
emission
visually
correlation
ideals
av
disappearance
gong
derivatives
sammy
tong
possesses
adobe
deficit
premise
identities
avon
vega
rosario
tko
leap
transparency
everton
ri
lakers
sic
debated
ruin
sophomore
retailers
malawi
payload
subdivided
gf
substantive
devotion
punches
audit
méxico
steadily
noon
hist
inequality
lowland
strauss
informs
confession
mba
stunt
monopoly
niece
surf
spells
nationalists
nathaniel
gentle
patronage
transferring
kc
playboy
penguins
transgender
eisenhower
opus
linebacker
chatham
communion
jewelry
probe
sharma
gases
malls
ro
organist
pertaining
intersections
russo
candidacy
concurrently
laguna
elevator
discs
ironically
lenin
breton
patience
pedestrian
peggy
tucson
cares
issuing
nickel
verbs
catching
orchid
calculate
comprise
screenwriters
maj
manned
alexis
instituted
namesake
libyan
mentally
ir
liang
quotations
oxfordshire
townsend
cw
tear
unpublished
subordinate
heroic
shine
regain
determines
onset
sounding
branding
pf
recreate
adjoining
peasant
rebuilding
infections
violinist
mongol
botswana
exclusion
magistrate
unstable
murderer
deborah
uncivil
leicestershire
promptly
arte
alejandro
killings
orient
gymnasts
balkan
pascal
readings
venetian
vocabulary
cum
destructive
judo
signpost
translates
differing
eli
cane
innocence
npr
arrows
calculation
innovations
croix
abundance
cyber
pauline
ego
litigation
ua
hooks
nwa
bangladeshi
motives
coats
mia
possessions
meridian
examinations
bayern
cv
satirical
reissued
wrapped
harriet
audition
enjoys
sampling
brackets
insists
newest
amor
hubbard
merry
backgrounds
fragment
nottinghamshire
beginnings
busch
ancestral
dalton
magnificent
lethal
banana
guerrero
usaf
friction
comparisons
madness
routledge
mutations
assassin
histoire
canals
harding
conceptual
daddy
occupational
guinness
inscribed
adler
acronym
forthcoming
carpet
felipe
hamlets
buzz
unfamiliar
recorder
intake
unhappy
seventeenth
fundraising
laden
byrd
homage
chiefly
fuck
cooke
hulk
grandchildren
suppression
mae
establishes
antony
natal
contention
unix
conform
ernie
chandra
beard
coca
usable
teammates
concord
macarthur
maltese
discretion
sorting
tottenham
bel
worcestershire
danube
garland
builders
wetlands
côte
mapped
cooled
bas
temporal
misunderstanding
boyle
moody
detained
beacon
coaster
baja
blah
nobles
oilers
arches
examining
hazard
titan
cables
psychedelic
qaeda
forrest
realise
obligations
osborne
somali
mma
reminiscent
recruitment
flats
obligation
libertarian
weiss
corrections
wembley
debts
answering
rigid
flores
enlightenment
sect
focal
fielding
abolition
gps
citadel
gravel
secretaries
oswald
noir
martyr
institut
myths
anterior
sticks
nb
suppressed
analogy
golfers
labelled
zinc
beans
mclean
shrewsbury
turbines
ourselves
textbooks
ang
tractor
typing
borne
sting
pic
cents
excited
speedily
scandinavian
atari
unblocked
inlet
fairfield
pounder
minimize
substituted
chronological
satisfaction
remedies
polynomial
butter
fourteenth
posterior
float
persuade
cyrus
wherever
om
laptop
spartak
derry
rhetoric
sunrise
equestrian
render
nhs
plantations
enthusiasm
repository
propeller
morse
stadiums
nk
maturity
outfit
inflammatory
habsburg
bombings
schwartz
drain
nate
strasbourg
lemon
norte
brennan
ions
workforce
honourable
predict
bullying
graveyard
afro
mortal
±
nuts
visions
poisoning
combustion
commandant
enduring
mn
exceeded
por
clans
tuberculosis
warships
eddy
caldwell
eco
foul
bentley
physicists
ankara
geelong
organism
beaumont
gorge
mcgill
retrospective
nolan
procession
rb
weir
panther
foremost
aragon
palermo
jaw
explores
baritone
kilkenny
annals
ceramic
pony
cornelius
detainees
neural
moor
ssr
abbas
collaborations
tidal
hui
announce
calculations
congregations
unification
cartoonist
improper
panorama
dividing
nt
grouping
mural
torque
hatred
productivity
dans
exempt
soundtracks
futsal
monumental
anaheim
spends
economically
wolverhampton
spire
centro
brakes
predecessors
jays
expresses
basal
versa
packed
landings
categorization
accomplish
warden
wholesale
dial
asphalt
clarified
blockade
laurie
middleton
flynn
toby
mole
nicholson
cheaper
piedmont
refrain
otters
patrons
corporal
sparks
berger
jain
trolling
coliseum
aero
turnpike
historia
offerings
smell
moreno
oversaw
bamboo
lockheed
meals
charging
dal
unchanged
foo
observing
setup
metallic
respiratory
militants
correspond
rowers
lean
proposes
sweep
meredith
purdue
nissan
calculus
steals
samantha
constructing
babylon
huddersfield
rabbis
donor
smash
putnam
drowned
hut
salzburg
devised
dillon
pressures
mountainous
rented
musée
veronica
brock
galicia
pal
gus
abused
famed
tiles
drift
brewing
canary
humour
olympia
radial
bk
córdoba
nude
currents
reservoirs
feminism
resembling
québec
transitional
straightforward
waterford
divers
shia
insee
averaging
famine
willy
greens
ramsey
honoured
guangzhou
teatro
unsuitable
metacritic
ling
summoned
indirectly
reflections
jurisdictions
wyatt
cfd
manifesto
shan
cadets
depictions
clicking
utilities
figured
explosives
paradox
minsk
conferred
chrome
scroll
tramway
solicitor
niche
crap
lifting
expecting
doncaster
regulate
defenses
experiencing
agf
shirts
ts
marquess
undue
wax
motive
hutchinson
overturned
tango
lara
strokes
infectious
reinstated
mont
pigeon
lyons
stole
daylight
fertile
stairs
patrols
updating
slender
ut
botany
dignity
madhya
ideological
grip
shortage
analyses
skater
clone
ravens
gu
foreigners
takeover
westward
recognizing
retrieve
traction
brewers
humboldt
alternating
lenses
opposes
pulp
salle
visibility
plata
picnic
decks
czechoslovak
concur
worms
boone
lam
lagoon
soo
cruel
threatens
allocation
buffy
recovering
río
affordable
presley
randomly
timor
mackay
tire
chestnut
pillar
accumulated
dt
diagnostic
dem
imho
unanimously
popularly
choreographer
simone
bernie
bags
champagne
norms
düsseldorf
musicals
complexes
endorsement
neighbourhoods
concurrent
hydroelectric
carrie
mughal
monmouth
forested
mccoy
args
ignorance
squash
conductors
invasive
closes
burgess
tavern
wmf
petit
reno
az
racehorse
jong
kitty
reinforcements
seahawks
workplace
offset
benz
cha
racecourse
reissue
rebuild
motorcycles
sevens
covenant
robust
dislike
minus
weimar
hoover
dolphin
conditioning
ella
investigators
glasses
bowen
hindus
handicap
ware
jurassic
amphibians
implying
postgraduate
siding
trench
spi
weighed
redevelopment
sanchez
inactivated
wishing
imaginary
revue
incidentally
hs
imam
radioactive
consultation
tipperary
tonga
adapt
lovely
erotic
hg
manipulation
belmont
farrell
thickness
discharged
torch
lois
ramos
filters
damon
mongolian
employing
premature
preacher
ballots
rubin
pornographic
katrina
etymology
attracting
ambient
subdistrict
feudal
antagonist
dare
insult
diplomacy
claudia
neglected
literal
middleweight
complaining
crushed
seniors
brunei
dots
postponed
lowered
vegetable
siberia
collects
birch
syndicate
crowds
wwf
scholarships
polite
confirms
stall
shifts
wired
directive
aide
theresa
biographer
gma
tissues
benton
nos
marijuana
commemorative
gnu
tuition
resemblance
lsu
gao
sundays
lac
watkins
passionate
waterfall
genealogy
discouraged
centenary
empirical
charting
bd
hq
react
snyder
psychiatry
prescribed
educate
fairfax
devastated
confronts
testified
rails
westphalia
routing
exhaust
twisted
vitamin
alvin
routinely
chromosome
mecklenburg
weakness
weekends
puppets
nippon
jealous
brutal
absorption
shaun
kung
canonical
worm
akin
viruses
odi
butcher
farther
lim
disagreed
andersen
separating
excavated
eligibility
césar
weasel
graphical
itn
mock
agreeing
kara
atlantis
inductees
freak
amtrak
wien
accounted
inclusive
eliot
unrest
specials
speculative
semitic
mla
dismissal
harmonica
outlook
elegant
mast
crystals
resting
climbed
dug
heirs
profound
mitch
uae
depict
abel
colonists
temperate
alexa
dar
enthusiastic
cromwell
annoying
iihf
frustration
kathy
kensington
guiding
surroundings
kidnapping
playground
pomerania
det
incorporation
cfm
españa
exported
texture
fancy
tor
doris
digging
cocaine
rites
bauer
erich
mainz
dwellings
spinal
ramp
socialists
semester
che
unwilling
prediction
rollback
upheld
samsung
albuquerque
ó
reconciliation
freelance
stretching
topology
neurons
assertions
retention
woodlands
standalone
cobb
halo
graphs
grange
mendoza
aquatics
lip
speculated
raphael
unprecedented
baseman
sadly
amherst
builds
researching
seeded
lyrical
colchester
gallagher
generates
wherein
amos
pitchfork
adopting
scarborough
quasi
northamptonshire
cooked
optimization
vacancy
aggression
dressing
contingent
sympathy
lea
juliet
emperors
staging
df
paternal
principally
schleswig
fresno
clever
suzanne
ee
uncovered
prolonged
disappointing
liaison
polling
abd
sunlight
tyrone
syed
compressed
humid
assyrian
touching
gravitational
accession
tutor
darlington
tar
complain
geologic
singaporean
integrate
ing
pioneered
sar
execute
sedan
antique
rf
morales
choi
disappear
stocks
surplus
furious
buccaneers
mutation
ghetto
satire
vp
velvet
astronaut
gaps
concacaf
punt
ljubljana
cfl
archaeologists
irwin
pad
autobiographical
yukon
interceptions
instrumentation
rockefeller
interception
captained
shining
spokesperson
toilet
pol
orchard
rutherford
rfd
kramer
romney
nas
advocating
pueblo
nuremberg
flavor
hypothetical
‎
finch
grammatical
knots
remotely
utilize
divinity
fixtures
invest
straits
jumps
retreated
bacterial
théâtre
shy
buckinghamshire
sai
sino
amid
aiming
surveyed
misuse
continents
refined
solitary
spectral
desmond
odyssey
hiring
mysteries
phosphate
bombed
wesleyan
imprint
caledonia
exploded
portals
darts
animalia
cancellation
autism
knoxville
peacock
syllable
pianists
depths
michele
shipbuilding
sleeve
cumbria
quo
theologians
reigning
pamela
montevideo
andrei
assemblies
stanton
tones
saddle
disturbed
blessing
inevitable
reprise
selecting
iaaf
portray
jasper
eaton
fb
pits
hanson
mcmahon
robbins
vine
sparta
espionage
fifteenth
poznań
clown
paddy
troupe
relying
yankee
vaccine
welch
captivity
packet
replay
qi
boiler
belly
iphone
excerpt
competent
nightclub
symposium
jewel
generous
statutes
entertaining
odessa
cockpit
nets
bucks
detailing
headline
tremendous
mailing
hicks
fiat
alessandro
alec
centred
stretches
clashes
leiden
damn
surveyor
paterson
yong
aristotle
dáil
tent
nouns
atkinson
persona
mig
distributions
playable
nuns
rotary
angular
foley
slaughter
switches
rejoined
distress
ariel
corpse
peripheral
accelerated
prasad
fixture
voluntarily
accord
conscience
ass
daytona
accountability
novi
burnett
coconut
giorgio
drilling
khz
anniversaries
travelers
dominate
lazy
ordinance
semifinal
piston
cody
gómez
bravo
crete
bravery
theorists
novgorod
analytical
inventions
extracted
metabolism
provence
stud
stratford
bella
recruits
countless
posthumous
originates
amir
morality
fife
tombs
credentials
proclamation
sahara
presided
papa
rus
boring
oceans
ismail
intercontinental
cain
choke
ahl
compass
freeze
profitable
haifa
southbound
reeves
gmt
elaine
compton
blonde
sultanate
curtain
deposited
ot
royce
dispatched
ud
submissions
crossings
operatic
buckley
golfer
vita
mirza
fra
termination
hitter
burkina
reliance
superseded
propelled
liquor
blackwell
ciudad
flexibility
arbor
monastic
pe
adjective
boer
wicked
hewitt
bilingual
constance
bleeding
perez
vilnius
loser
fond
lasts
stranded
bottles
monkeys
sheila
exchanged
participates
reel
kicks
invites
bureaucrat
relics
washed
mx
ew
jessie
blunt
olsen
sims
hk
skinner
canoeists
elm
resonance
faso
declares
franchises
kurdistan
coffin
sights
italians
bothered
recipe
alright
elephants
greenhouse
automation
hampson
cascade
forge
aquarium
romero
tsar
disciples
donnell
specialised
cutter
sustain
scream
pavel
approx
ratified
generalized
activate
processors
garner
satisfies
northbound
andes
shareholders
evergreen
kicking
killers
postseason
meteorological
digest
handles
mclaren
subscribers
sparrow
marin
dynasties
shankar
mat
wally
primetime
snowman
grapes
crusaders
boroughs
underworld
headmaster
ravi
substrate
cheltenham
melodies
mankind
prompting
spies
tuning
insulting
creed
senses
specializes
mona
reorganization
confederacy
stockton
accessories
supportive
programmer
swami
torpedoes
motif
itf
cortex
epidemic
ambrose
za
unsigned
lyricist
courtney
edo
mustafa
shrub
germain
whales
française
encoding
concluding
crossroads
consolidation
calcio
willem
telecom
goldman
briggs
alonso
sumatra
anchored
kapoor
boycott
museo
forks
consulate
firearm
banjo
frogs
pork
contemporaries
emphasize
arises
kazan
surpassed
inverse
reddy
colonization
assured
obliged
eruption
analogous
friedman
ideally
exits
keller
remark
nad
fiddle
horrible
economies
entrants
pasadena
fungus
escaping
scanned
libretto
benin
ecosystem
navies
portrays
joanna
usda
graffiti
mystic
obstacles
fda
bing
blanked
ants
reddish
barlow
lent
deeds
doe
plugin
futures
horton
brasil
cannabis
serv
entertainer
glance
soloist
repetition
sparked
verona
attracts
armenians
coupe
wit
chrysler
hobby
jaime
merlin
sindh
wight
cropped
lama
connector
mccain
tm
magician
guangdong
wizards
advertised
mediator
burger
bernardino
catalyst
glacial
rhodesia
tbilisi
protestants
hindwings
stretched
gossip
metropolis
beatrice
undercover
authoritative
díaz
cannons
naturalist
glider
illegitimate
juventus
xxx
disciplinary
occupations
ty
internally
sheer
arithmetic
spokane
newcomer
sami
earnings
programmed
aba
ns
overthrow
allah
rancho
dump
merchandise
patches
humble
shelby
avery
denote
worded
javascript
panchayat
padres
unverifiable
rewarded
presentations
hurdles
versailles
generators
happily
dungeons
seville
systemic
daly
haitian
patented
gig
renovations
stellar
med
mates
sans
convinces
strengthening
porsche
undertake
skyscrapers
buckingham
diaries
arrests
wilde
mandated
adjust
immense
rot
kv
hungry
fremantle
tna
midst
sgt
waterfront
celestial
levine
combatant
nicola
minh
engined
exceptionally
sheldon
halted
playback
giro
hee
poe
proponents
inauguration
bind
needle
courier
excavation
spurs
exodus
quad
climax
potassium
ascent
volkswagen
lydia
reprint
connell
lattice
unicode
godfrey
rossi
gonzalez
prospects
decreasing
rains
hymns
qf
admired
orion
pledge
modernist
blacklist
monitors
academies
derive
shit
undergone
garfield
vishnu
evidently
fest
cbn
overhaul
flawed
cynthia
degradation
bracket
pray
tex
utilizing
abe
pam
artery
appalachian
plagiarism
leopard
piers
sensory
evidenced
bunker
spherical
regret
ulrich
socio
vickers
supermarket
customary
malone
weights
elders
tornadoes
corey
vacated
charm
petrol
blanc
ik
signaling
buffer
melting
sensation
subcommittee
finances
caracas
vernacular
regimental
judoka
psychic
gundam
denny
fatalities
zach
ju
burgundy
exemption
otago
simultaneous
unite
eager
composing
rothschild
booker
weighing
calais
hint
punished
crying
bunny
spl
libel
antioch
gangs
castillo
arrondissement
appoint
urge
palestinians
favoured
hernández
backward
ambiguity
approximation
grocery
restrict
cyrillic
shoulders
harley
dealers
diminished
unopposed
ret
surge
reservations
bald
seminars
rudolph
vijay
wagons
devastating
remind
bn
tallinn
praising
campaigned
nasty
pants
fleeing
analyzed
apocalypse
archaeologist
grief
dispersed
allegheny
consulted
hydro
legislators
staircase
bernstein
bundle
commencement
textual
prospective
moose
chancel
consuming
minas
consonant
nun
variously
melodic
knot
null
tsunami
adventist
defendants
protested
valves
brewer
barred
ruiz
weekday
ordination
rpg
hillary
spun
racehorses
erin
balkans
prep
ville
yiddish
entrepreneurs
crimean
sq
intersects
welterweight
bratislava
mushroom
mosques
humidity
alicia
emilio
dixie
©
len
gradual
trash
litre
chasing
ponds
greenville
midi
convex
rejects
seminar
cart
russ
insist
stationary
toni
larsen
orchestras
bandwidth
seize
plato
auf
tc
fibers
interfere
punctuation
clair
collingwood
tn
portraying
imports
gradient
respects
gregg
philips
megan
quiz
alterations
howell
guardians
highlighting
tasmanian
mf
surround
ol
loops
symphonic
hospitality
rae
intellectuals
junk
cod
bf
winding
sb
estuary
discount
axle
reliably
chun
willingness
neoclassical
painful
amelia
hussain
exhausted
responds
provost
luca
umpire
indiscriminate
ngo
behave
podium
quentin
jakob
pneumonia
lao
ethan
committing
eliza
deficiency
coherent
rudy
mantle
woodward
sac
julien
copyediting
liberties
therapeutic
arising
spill
estádio
semantic
chloride
confront
vanguard
vendors
baptism
rv
famously
planting
valle
í
bonn
claus
mono
intends
penal
lips
opt
eurasian
ramon
maxim
zion
uploading
alfredo
ia
gs
policeman
treasures
ceramics
africans
embrace
culminating
bliss
wonders
bowls
universally
characterised
playhouse
goldberg
caretaker
guadalajara
archaic
risen
ranged
terminals
campaigning
pedal
wen
managerial
immortal
marrying
suppress
cambridgeshire
rappers
deposed
mistakenly
recycling
intentional
bei
fishermen
alloy
malmö
assurance
lan
stevie
marne
contractors
spine
maximilian
gala
glucose
parallels
awesome
migrants
quaker
zionist
detract
palazzo
doping
ramsay
js
ministerial
blanche
moran
crab
butt
canadiens
hanged
electrified
burt
ambush
flotilla
equatorial
moderately
minors
subsidiaries
conflicting
herd
luc
serb
shea
collier
nickelodeon
apostles
sherwood
conducts
archery
cyclones
oceanic
potomac
conversely
captures
shootout
eton
loc
adb
blanket
paraphrasing
expose
schooner
departing
lbs
drv
awaiting
disguise
brenda
nora
beams
connie
como
hayden
ld
wehrmacht
warbler
rhineland
advert
toe
astros
ditch
polymer
yorker
subsection
thanksgiving
transverse
houghton
neutron
morphology
mythological
cho
locke
modelling
bois
façade
leafs
piracy
evolve
compliant
fulham
superstar
mechanic
perimeter
exceeding
hmm
franciscan
detector
arrange
wires
vertex
bethlehem
wharf
gi
carmel
medication
infants
auguste
broadband
bali
rift
henrik
delivers
mitsubishi
leak
nme
sharply
formulation
bisexual
sichuan
sincerely
bricks
kendall
countdown
supplier
plea
affinity
het
fw
replaces
fined
walters
analysts
kyiv
marketplace
kits
lamps
lviv
boyer
preserves
expulsion
favourable
biologist
debbie
stephenson
tanker
domination
margins
skate
herring
disrupt
worthwhile
ffff
steward
proceeding
jacqueline
cindy
watford
theodor
restructuring
mysore
baronets
diver
philipp
disguised
gmbh
accuse
convergence
prophecy
nuevo
wills
outfielder
sanitation
tortured
luton
govt
ankle
backlog
coil
collaborate
cinematographer
undisclosed
demos
predator
tops
livery
coefficient
sentinel
recalls
alphabetical
inserting
ponce
defences
volunteered
wilkes
overlooked
vogue
leaked
middlesbrough
torpedoed
soyuz
janata
milestone
imposing
shades
deed
freud
campo
rodrigo
redesigned
gwen
masonic
summarize
jstor
monterey
denise
spear
zoe
graf
dev
fertility
carla
vertices
successors
pleaded
ventura
sins
mastered
culminated
expectation
asteroids
wat
prima
serpent
stepping
farmland
fixes
viaduct
christoph
initiate
remixed
dunedin
grenade
ao
analyze
satan
français
folding
earls
christi
flux
invaders
nail
modular
squirrel
offences
sloan
boilers
liturgical
ballroom
vida
scenarios
tablets
martins
neon
trader
tails
saxe
lamar
thessaloniki
dictatorship
sperm
differentiate
conjecture
taft
mckay
melville
kris
mating
ridges
tabloid
northward
decreases
battleships
descending
polk
announcements
hara
supplying
état
hears
otis
milano
nikki
pearce
proton
weaker
rainy
diffusion
clarkson
bordering
hostilities
awakening
sherlock
tyson
vengeance
doi
pont
slalom
comune
bowman
sack
leroy
elk
applicants
mister
nobleman
hamas
vectors
disagreements
fs
thats
freezing
mounting
quintet
baronetage
counseling
khmer
beaux
fascism
reproduce
andrés
walled
costly
jurist
malaya
gerhard
executions
flagged
foil
jammu
urgent
cerebral
tajikistan
typo
colorful
whig
deception
mariana
hooker
akron
crimea
neolithic
narrated
viva
fia
krai
feasible
immigrated
canvassing
qualifiers
awful
hübner
engraved
coke
conquer
introductory
raaf
hazardous
certificates
directorial
hume
dl
practitioner
disused
periodically
ferries
pathways
abuses
scrap
meaningless
anand
docks
illustrating
falkland
shale
annex
whistle
glamorgan
isa
aft
creations
sms
jaguar
hazel
shu
sellers
vaudeville
tenant
willard
boca
uni
nagoya
ransom
stokes
redirecting
curiosity
disqualified
emerald
fars
shear
nokia
interfaces
pereira
genuinely
chalk
pest
steamer
illegally
guillaume
mixtape
compelled
decimal
ascension
technician
wasted
denying
melanie
mutiny
hind
impaired
unidentified
openings
michaels
donetsk
ching
confirming
presiding
motifs
defects
urging
capsule
buyers
trailing
gomez
astronomers
clues
disciple
jared
apostle
grossing
compiler
jackets
obe
gan
acquitted
magna
nan
preface
ensign
uh
dracula
mandolin
patton
turkic
naomi
unmarried
snooker
lena
annexation
wasting
yen
tying
dull
concession
valerie
amiga
donors
purge
algae
jesuits
sinatra
disastrous
pathology
containers
airlift
jpeg
blanco
rory
handsome
dvds
kabul
archbishops
rip
brigham
ginger
bangor
registers
pembroke
diagrams
disappointment
champ
indy
nicely
unexpectedly
uncomfortable
kahn
caring
cinemas
summarized
postage
nut
peculiar
loyola
equals
ww
authorship
obsessed
veracruz
tunisian
escorted
wavelength
spawned
relocation
headlines
squads
colon
fist
frigates
insults
guillermo
avalanche
unpopular
dickens
deported
asa
fulfilled
retaliation
miner
lol
grounded
yin
settler
cbe
segregation
exercised
substitution
kelley
incidence
kinetic
bernhard
fearing
blurb
skeptical
hereford
zheng
benedictine
hz
nairobi
sinai
gypsy
falsely
optics
touches
tanner
hitchcock
manifold
nests
moe
dependence
pixels
yves
prefers
xiao
denounced
gymnast
mop
helm
eduard
bis
vie
pilgrims
merrill
bail
rigorous
sha
gem
circulated
saul
duffy
totals
fashioned
landfall
ramps
hyun
offended
rockies
waltz
medicinal
epa
ambition
disturbing
pardon
moot
linguist
strangers
camille
tb
uninhabited
beverages
vila
lend
tandem
semiconductor
palaces
od
nomenclature
browning
muse
silesia
antigua
isis
tires
simplicity
fuels
interdisciplinary
fluent
barony
swindon
padma
bounds
hostility
gabon
theoretically
bankrupt
masked
poole
maud
mohan
ritchie
insurgency
moisture
correctional
mckenzie
burnley
sermon
venom
tha
elton
capitalist
dominique
compilations
ramón
priesthood
awb
geologist
revive
whitish
veins
sarawak
organizer
tehsil
galloway
bengals
referees
sud
patel
tripoli
protects
cantonese
wr
sulfur
eccentric
qc
pba
vivian
desires
pak
fatty
scorers
feng
expiry
protectorate
ottomans
tobias
adaptive
federico
nike
emanuel
manners
tuscany
documenting
tao
scripture
rusty
mediated
shout
stronghold
spray
eastward
rhythms
rooted
pixel
tile
ornamental
intercepted
suns
webber
cis
josephine
preaching
distortion
roofs
hail
evaluating
bayer
culturally
paradigm
dewey
indicators
distinguishing
fg
markup
paranormal
compatibility
grasp
kyrgyzstan
hardcover
atoll
amalgamated
ensured
mythical
rufus
atheist
warns
masterpiece
mis
booklet
montréal
postwar
île
heats
notoriety
ortiz
lever
symmetric
doo
blowing
lobbying
exploit
ib
choreography
saloon
thieves
sabbath
uhf
zeppelin
bernardo
mv
mystical
société
fundamentally
qb
attested
metaphor
chesapeake
lokomotiv
faa
salvage
oasis
beverage
sufi
stefano
galaxies
shelters
anchorage
upwards
reminded
sexy
threaten
quantitative
guessing
parentheses
instituto
hutton
lai
treats
rink
arid
trams
hailed
washing
stony
skies
barrister
flourished
vampires
gum
bathroom
bartlett
benfica
crowded
harmonic
psychiatrist
guido
fas
qin
ethel
glossary
cavity
aziz
forgive
sardinia
transylvania
stadio
dai
reggie
repetitive
uncopyrighted
dismantled
currie
miracles
roc
fam
moines
reassigned
pumps
kindly
sniper
pod
intercollegiate
kin
obsession
ezra
ninety
thy
gertrude
guthrie
lola
anthropologist
goodwin
blanking
hellenic
hairs
mutually
harrington
parkinson
sums
hormone
audrey
gut
archers
drummond
aperture
goalie
digitally
misconduct
mammal
knowles
spotlight
seldom
spice
galerie
assistants
fitzroy
ic
outlaw
cougars
harald
genetically
rotor
mas
splits
peabody
encouragement
instability
drafts
periodical
multinational
anhalt
rayon
sylvester
archival
mil
claudio
witches
onward
tomas
destroys
apples
arenas
medallists
sabah
motorsports
napier
lucius
oxidation
lighthouses
realms
vargas
headings
pulls
grazing
commentaries
resisted
emails
dictator
croydon
enthusiasts
montenegrin
periodicals
commitments
laughing
efficiently
tk
negatively
ames
unavailable
reluctantly
usl
predictions
preferably
precedence
clergyman
potatoes
debating
costello
libre
opener
screenplays
frederic
offenders
ars
announcers
lede
reminds
sweeping
fore
psi
sooner
transports
nil
antrim
kilda
purchases
stalking
protagonists
cigarette
académie
stamford
racers
clinics
upgrades
tl
snap
dunes
griffiths
mca
chick
recipes
ghanaian
initiation
ballistic
appealing
eh
barrels
roche
inspire
satisfying
attic
attain
consult
tuned
ala
matthias
chesterfield
viceroy
disturbance
besieged
tau
lauderdale
dumb
sawyer
tacoma
holloway
maldives
vuelta
langley
barnett
lightly
slater
liège
cassidy
jaguars
ks
existent
dart
boiling
ferreira
cullen
browsers
insertion
dortmund
macintosh
undated
lille
packs
université
chittagong
resolving
reproduced
glover
millionaire
synonymous
dion
organizers
urine
sicilian
influx
pets
noticeable
mer
beckett
fukuoka
nanjing
pledged
wes
buyer
mal
stripe
envelope
rosenberg
overlapping
trenton
bestowed
faber
consonants
richest
neptune
barr
khuzestan
characterization
tolkien
forged
nero
cecilia
edible
dice
asserting
breeders
separates
skier
mausoleum
monty
reelection
yearbook
shafts
masks
faculties
encompassing
dismiss
santana
swallow
clint
prevailing
transcript
enjoying
massacres
ensembles
malaria
oro
staple
telangana
fender
trait
lange
outdated
contamination
cska
differentiation
advisors
gilles
downloads
grains
psychologists
tow
axel
wt
tattoo
siena
depressed
cass
rowland
lund
hearings
rosemary
parrot
adhere
lindsey
kemp
ryder
peninsular
blaze
limbs
furnace
sergey
fools
phelps
dickson
slovene
pretend
erect
rainforest
enclosure
analogue
legitimacy
tirana
recession
affection
ska
ef
shipwrecks
aesthetics
hayward
aol
waited
np
tito
djs
mag
perpetual
swap
adjustment
bertrand
navigate
fairs
mourning
mounts
steiner
fanny
postcode
draper
fortunes
cancel
hides
spartans
sears
fullback
lal
lex
stimulus
tactic
presume
cabaret
thou
transforming
confiscated
undertaking
canopy
inverted
graeme
drained
withdrawing
titanic
airfields
gaston
engraving
wonderland
spontaneous
warranted
spirituality
dharma
ying
propagation
textiles
olds
gesture
alumnus
kamen
scandinavia
bonaparte
repeats
undoubtedly
knowledgeable
reconsider
magnum
richter
clemson
parry
nfc
grandparents
miriam
pontifical
diocesan
harmless
dictionaries
mart
fumble
gettysburg
bey
shortest
cylindrical
tiffany
physiological
safari
screaming
centimeters
faults
owed
proliferation
limb
alliances
malicious
farmhouse
admissions
commodity
intending
ndp
inputs
abdomen
discarded
félix
impulse
stricken
crowley
jiang
penned
vineyard
businessmen
yielded
rationales
saxophonist
kobe
arbitrator
louisa
admirals
texans
orthodoxy
dirk
chattanooga
creole
drafting
garry
bloomberg
fuji
cummings
gothenburg
pamphlet
patty
stiff
marries
honneur
scheduling
cheek
bucket
flaws
vapor
overturn
byu
protector
carleton
woodstock
lastly
geographically
freiburg
ufo
prelude
cory
lynx
liechtenstein
examiner
sharpe
atkins
ffa
blogger
priorities
akbar
meg
airbus
isil
translating
homicide
avid
sanford
heels
discographies
levin
lau
shotgun
emigration
slated
homework
fascinating
casualty
guernsey
populous
concealed
jumper
diaz
waived
techno
lending
theorist
compose
lively
relieve
masonry
armistice
camel
revolves
edt
waterfalls
til
madeleine
titus
catering
delicate
quietly
glorious
redemption
injunction
isfahan
nana
liturgy
cosmos
togo
baroness
exploited
improves
fig
chant
quran
derrick
chairperson
trance
elmer
respectable
trophies
bari
dangers
haryana
taekwondo
microwave
morrow
allegation
ras
assessing
insights
gangster
viewpoints
yunnan
danielle
marshes
io
msg
dino
bishopric
deserted
internationale
pricing
cz
heron
unmanned
avatar
yates
aleksandr
walnut
marguerite
seneca
nrl
confidential
interpreter
navarre
remembrance
gemini
torino
pfc
chords
fireworks
acquisitions
scaled
scanning
compromised
pointer
pitches
dye
oversee
betrayed
serena
readable
unreasonable
petersen
gdańsk
gardiner
convictions
collaborators
craters
entrusted
satisfactory
emilia
coincidence
susceptible
industrialist
lawsuits
feather
competence
nasal
roe
metadata
elevations
denoted
dyer
transporting
coupling
norwood
kiel
elbow
hats
understands
forecast
ample
dispatch
traps
franks
thistle
pb
partisans
duff
billie
heavenly
huskies
katz
instructors
accessibility
robotics
lausanne
perpendicular
brains
plaster
rumours
knesset
buster
trusts
token
guantanamo
brest
coma
preferable
zeller
opp
simplest
centralized
gee
fernandez
goalkeepers
barnard
submitting
cathy
believers
prototypes
pops
raped
collaborating
cheyenne
heroine
citrus
timely
empires
salford
zoom
encyclopaedic
bilbao
dissent
aground
inclination
intervene
gail
cairns
murdoch
commemorated
vows
slayer
interacting
siberian
vinci
rowan
oliveira
baylor
wilder
boise
keynes
ridden
dragged
cerro
excel
jeremiah
addison
inventors
huron
celebrates
gators
frontal
murals
denies
sharif
harrisburg
evolving
installment
wai
energetic
bafta
craven
prepares
palais
provoked
popularized
monsoon
mara
attendees
berth
assure
safer
bismarck
whitman
matilda
weed
sails
mellon
surfing
biotechnology
cary
queue
scattering
haul
subgroup
arturo
consume
yeshiva
erwin
coordinating
carolyn
hartley
bournemouth
mata
genealogical
torre
formulated
authorised
soda
rendition
suriname
prefect
insurgents
agg
sinister
rec
sim
phyllis
parental
reminder
rp
fishes
seaside
marlborough
cy
rhys
dodd
nails
cylinders
browse
immaculate
sounded
intensified
accordion
embraced
joker
hendrix
elector
voyager
ita
juno
virgil
grab
pilgrim
strawberry
bounty
vicious
towed
collaborator
sabre
celtics
remarkably
disclosed
decca
uniquely
synchronized
microphone
fang
dhabi
fracture
colliery
brethren
maze
comparatively
iberian
asleep
succeeds
‘
sprinter
conceded
hidalgo
hack
johor
hum
unbeaten
owls
congregational
pentagon
categorize
rests
continuum
computation
schumacher
sas
freddy
liable
somme
kangaroo
mlas
qui
taxonomic
yerevan
barnsley
augmented
westwood
ari
galactic
superiority
ioc
heck
fiona
chiba
emotionally
illuminated
kidd
interventions
slade
ale
absorb
vain
robotic
staten
prevalence
braun
selo
shandong
thorpe
wolverines
hints
tug
lied
cambodian
commemorating
knives
zeus
edmond
bluegrass
enrico
zaragoza
averages
clerks
sax
concordia
appendix
familia
baird
otter
spw
gillespie
seminal
nf
rana
wrap
mead
casablanca
qur
babe
coincide
penis
mckinley
verdi
severity
complementary
superliga
veto
accountant
theo
charming
colbert
ineffective
rushes
sui
criticizing
donkey
roadway
encryption
puzzles
misunderstood
hokkaido
worries
hairy
serra
unemployed
jsp
retailer
hotspur
anal
haynes
bartholomew
untitled
wooded
judah
deco
entropy
helens
abnormal
analytic
reinforce
sonia
romani
vt
blows
cows
clutch
gupta
bolivian
ramírez
manipulate
knighted
bahia
sliding
shower
ipa
everest
audi
shetland
cooler
outlines
orbits
purity
hawthorn
marianne
skyline
ignorant
emerges
implication
policing
mariano
hoc
turkmenistan
limitation
prosecutors
weaving
transforms
aubrey
peck
busiest
wikipedias
prosperous
rewards
precinct
bu
novella
wikia
diagonal
bowled
sbs
authenticity
journeys
detectives
kinda
basics
putin
aviator
churchyard
alderman
culinary
rosie
luzon
connectivity
ava
panda
bankers
prescott
entrepreneurship
destined
goaltender
biomedical
doha
morley
outing
newcomers
schedules
sire
samson
cheryl
liberalism
caucasian
dolly
flu
traverse
ceded
pieter
coasts
grossed
foothills
collided
tricky
wb
envoy
seizure
erupted
sweeney
contra
disrupted
morale
enhancing
caravan
fortunately
barra
disappears
bahadur
presses
independents
rack
reactors
designations
printers
algiers
lehigh
tam
wexford
fibre
tory
navarro
chimney
yellowish
romano
modernization
schultz
robson
lyndon
wandering
rowe
tutorial
ignacio
iq
distributing
vertically
remastered
alta
streetcar
gloves
impressions
invade
warrants
imminent
reese
rematch
unitary
mei
sampled
cinematic
federally
volvo
kosmos
halle
hernandez
refurbished
ineligible
mayoral
rhyme
suppliers
esperanto
wentworth
javelin
herrera
landowners
cooperate
heroin
talmud
alsace
ely
wee
royalist
melvin
chico
divides
incentive
constructions
chili
kern
landslide
cochrane
compensate
deposition
flyer
gina
terence
slots
weightlifting
paraguayan
vh
donate
marius
fins
corbett
jihad
ate
marquette
orphans
serge
mecca
wrongly
muller
sap
palatine
goats
borussia
handy
upward
analyzing
cheung
landowner
refinery
indexed
manson
ethanol
recognizable
runaway
corona
synopsis
cebu
johnstone
tightly
consoles
crewe
indicative
informing
jens
vance
scare
coincided
mari
bloomington
scared
nara
jargon
scandals
saddam
neglect
wnba
moldovan
mcgraw
zoology
spanned
confuse
guerre
slowed
metz
drowning
bsc
pasted
engagements
connolly
edged
communicating
mcdonnell
fonts
elongated
picasso
nueva
disagrees
certifications
regeneration
futebol
tristan
mercenaries
telenovela
nikola
wrist
motions
hornets
awarding
beyoncé
dreaming
inevitably
scar
mein
dora
lombardy
rochdale
prostitute
dk
pistols
implicit
saturdays
hygiene
armagh
emphasizes
maclean
orléans
coptic
alpes
authorization
amenities
yun
thorn
owning
bureaucracy
articulated
timed
bedfordshire
aleppo
deprived
invitational
banners
dmitry
transmit
compartment
grimsby
natasha
massey
amc
runoff
highlanders
weighted
abbr
endowed
sabotage
wasp
wedge
ulysses
pins
fir
iss
evenings
augsburg
roach
carriages
mold
saskatoon
improvised
alzheimer
shoreline
censored
oaxaca
münchen
excerpts
originate
delight
galileo
tendencies
exploits
haas
avenues
flanked
hostages
invasions
discourage
positioning
symbolism
chased
gaga
foreman
contender
bison
encyclopedias
drummers
ecumenical
hazards
disregard
ballard
injuring
discus
contaminated
hilary
barangay
bios
clocks
bandits
styling
remembers
mcgee
delist
lego
systematically
initials
learnt
astronauts
manfred
sermons
westbound
messiah
nationalities
invading
cookie
confer
repertory
lansing
preseason
kaplan
coated
boogie
belts
fx
wrocław
curated
epithet
jarvis
impacted
milford
junta
petra
drastically
harcourt
akira
calhoun
amour
slash
obstacle
repealed
coded
sickness
stm
neumann
explosions
eileen
sensing
imagined
proponent
trojan
cher
stockport
oricon
jewels
insisting
neuroscience
bids
gotta
duplication
condensed
negotiation
realization
inviting
cathedrals
doherty
epoch
sociological
département
populace
seychelles
prc
laundry
catchment
bourne
fragile
nes
pickup
oblique
youths
cabal
informally
breakup
rye
investing
organising
proportions
rees
beg
prompt
anatolia
kicker
intercity
eastbound
legged
adjutant
elgin
olson
zeta
blew
princely
fern
prohibit
warrington
cristina
blitz
strains
exaggerated
truce
rods
oclc
induce
thee
legislator
suez
mortimer
vandalized
carver
metabolic
ong
integers
cavaliers
pennant
markus
confessed
revisited
meiji
conveyed
shareholder
shaping
huffington
mainline
memorandum
overland
accumulation
peach
certainty
migrant
efficacy
listener
mussolini
congestion
upside
amounted
oils
compares
nyt
khalid
chinatown
inspiring
remnant
striped
pumping
göttingen
sumo
wilkins
mixes
broker
navajo
faroe
vascular
knocking
admittedly
inflicted
hua
possessing
unconstitutional
ue
spruce
enforcing
fairness
paramilitary
yds
accuses
repression
yamaha
abbreviations
jiangsu
backdrop
shreveport
feeder
stout
benito
ensg
gareth
pivotal
rolf
densely
pharaoh
dynamite
astrology
predominant
chichester
italianate
rehearsal
evasion
jing
liberated
drains
remembering
medicines
doubtful
classmates
spheres
requiem
startup
insistence
ernesto
giacomo
topical
pretoria
idiot
longitudinal
ratios
helium
fargo
penetration
conserved
saharan
sediment
licensee
sundance
ecosystems
yvonne
convened
legislatures
maharaja
routed
electors
biochemistry
rodeo
exposing
queer
easton
damian
differed
handel
commemoration
revolutionaries
lore
ur
eugène
bicycles
ellington
lookout
interpreting
ek
gull
sleeper
medici
trimmed
replication
hyundai
auspices
moto
facilitated
arroyo
hacker
sour
ob
alfa
idf
butte
starters
downloadable
inverness
traveller
sociologist
maureen
andorra
hunted
heel
suites
annapolis
damien
isp
gunn
scarce
proteam
rosen
consultants
katy
plagued
controllers
energies
linguists
badges
houten
ness
cruft
grassland
vineyards
bikes
ccc
smuggling
narration
ensures
dane
puebla
marta
bully
martian
donegal
dyke
raided
calder
contradictory
lenny
exceeds
salesman
artworks
betrayal
doubling
olympiad
distributors
cappella
abruptly
negotiating
pencil
prescription
statehood
weekdays
novelty
directional
bureaucrats
wolverine
heal
contempt
reversion
spectator
ger
fairbanks
anyways
professions
soto
gustavo
rebranded
joints
grandmaster
bharatiya
antiquities
sacrifices
rashid
nutrients
sudanese
joaquin
witchcraft
lithium
ironic
nepalese
durban
defect
conspicuous
regents
spaced
musa
maynard
unwanted
ich
martina
pubs
anchors
sighted
warship
chaplin
dow
marjorie
foundry
johansson
muir
feminists
travellers
fuse
nicknames
opium
guiana
shrines
groundbreaking
benefited
amplifier
munro
framing
supplemented
christensen
bromwich
nouveau
foreseeable
undid
yusuf
kildare
wrath
baptized
shepard
silla
goldsmith
drone
novo
underwood
quota
karma
vis
mcc
ogden
cigarettes
kamal
geared
confessions
lieu
jazeera
skeletal
yr
cartridges
gardening
checklist
bog
dietrich
rhône
judd
globalization
sal
downward
polly
evangelist
diverted
launcher
euros
engages
stereotypes
hedge
forster
jae
navigator
polished
fabricius
implicated
xl
monde
delle
pk
henley
kawasaki
domenico
whitehead
uv
hannover
disadvantage
dietary
mesh
lucknow
pulmonary
inadvertently
counselor
compiling
rig
fisherman
kathryn
stephan
vojvodina
anglia
louvre
gregorian
kat
essendon
foreground
cant
vu
thom
posing
marches
brentford
monterrey
meps
louie
bidding
pensions
scenery
arched
almanac
siemens
jewellery
wallis
battling
mateo
spamming
unspecified
gibbons
potsdam
gladstone
caspian
revoked
fatigue
ensued
toro
hash
hooper
hopper
quay
souza
govern
selangor
claremont
assign
slipped
brandt
bose
pluto
balancing
highschool
aristocratic
guerra
goddard
windmill
iata
dept
homo
archeological
wolff
adjunct
perceive
projecting
montane
stylistic
carving
humanist
sahib
vanuatu
gerais
mcgrath
carlson
alf
sachs
usher
strengths
waterways
indications
sesame
leary
meritorious
breathe
scribe
vastly
linden
palma
clad
amidst
barley
sportsmen
sloop
spartan
meteor
balcony
bored
cute
gospels
ayrshire
neighbour
gigs
pinto
jailed
acacia
retarget
leach
asiatic
nantes
reefs
mandir
refueling
glee
coefficients
onion
maize
gogh
purse
lsp
amman
sequels
canons
annotated
lambda
fortification
toxicity
dependency
conversions
emir
sirius
wetland
auditor
goethe
cottages
baths
punish
abby
distinctions
armand
ecuadorian
norma
doctrines
entrances
sewage
axes
pretending
subcategory
idle
gems
nehru
pistons
shocking
nebula
laval
lungs
manuals
aztec
vendor
honoring
curb
incentives
manifest
whiskey
gallantry
salaries
graded
biennial
marcelo
unitarian
bakery
traveler
tanaka
webs
trainers
zombies
bai
contradiction
dude
eritrea
sexes
manny
distorted
sudbury
retro
restart
manly
köln
orphanage
jericho
newscast
aquino
nesting
ribs
jumpers
underside
disclose
penang
saratoga
convict
recommends
narratives
purported
stables
jürgen
yao
brownish
nair
takeoff
chávez
comets
heiress
wo
stargate
gently
admitting
kermanshah
uppsala
cypress
zhejiang
restrictive
economical
republics
integrating
nico
instructional
occupants
soc
pawn
secession
tyrol
avoids
tian
grady
anarchism
attends
downhill
guarded
natalia
sparse
ambitions
syllables
keyboardist
hungarians
ignores
jc
hated
bathurst
macon
psycho
hanoi
superb
buys
hodges
hl
disposition
suffice
rooney
chloe
hearted
levant
foundered
unto
bala
ritter
banquet
haley
clade
kan
turnover
lazio
ventilation
gears
lifts
shelton
mccormick
mcbride
jang
snout
rotherham
nemesis
reconcile
coating
teahouse
lombard
alvarez
expenditure
aviators
toes
parc
comedic
walden
outs
canucks
supervisors
cheating
harvesting
olaf
breuning
carthage
paraná
gottfried
examines
reputed
dunbar
elijah
fordham
idols
barge
rihanna
cartel
angered
antilles
radcliffe
capacities
fueled
woodrow
mustang
overseeing
organizes
wah
asturias
beaufort
palo
ama
quezon
darius
motown
westchester
trenches
dcc
contraction
shrimp
dungeon
supremacy
crust
scriptures
postmaster
tal
roanoke
covert
yeast
mongols
maxi
glove
gilmore
muscular
ste
weighs
deux
ewing
commissioning
clauses
dune
fictitious
razavi
storylines
coa
frankenstein
burundi
bearer
raul
stylized
pines
prakash
rani
dm
cleaner
indicted
kinase
dupont
guadalupe
progressively
landlord
fleetwood
migratory
rejecting
guise
brewster
nomadic
caliph
calculating
configurations
jar
readiness
insulin
dagger
goalscorers
ren
sasha
perfection
ursula
novak
wrexham
spaniards
orkney
donaldson
tata
auschwitz
halftime
transitions
testify
trolley
tae
earle
konstantin
oyster
severed
img
reversal
concessions
sampson
pisa
mcleod
trojans
paige
twinkle
antisemitism
comp
practised
caf
lining
suicides
collateral
xinjiang
rugged
fy
blizzard
esteem
stimulation
pity
inherit
zack
cadillac
frédéric
tokugawa
twain
shen
photon
munitions
incompatible
trolls
toad
revolver
prevailed
synonyms
ingredient
enhancement
overwhelmingly
légion
als
mounds
regis
login
bogotá
msn
fungal
aryan
revolutions
bonding
sven
nicky
packing
transformations
esq
tastes
keel
installing
bl
fis
vivid
plaintiff
decomposition
groom
phonetic
synth
censor
über
tianjin
hari
topographic
bwv
davey
expands
occurrences
relaxed
outward
reg
crafted
americana
implementations
experimentation
argyll
burr
konrad
gazetteer
basins
darrell
localized
deploy
lineman
messaging
spence
fei
locker
badger
gearbox
palin
cumulative
spellings
loyalist
wharton
peoria
knees
mannheim
yd
ser
argent
cradle
mclaughlin
confesses
detrimental
clockwise
prosecuted
locus
regulars
wakes
linkedin
jordanian
kaufman
domesday
disliked
viet
rust
notions
porcelain
semantics
southend
eclectic
cared
horseshoe
ci
topological
macleod
bloomfield
longevity
variance
highness
guo
harrow
ebert
insignificant
transplant
lib
anticipation
chao
lg
terre
voyages
sumner
razor
veil
luciano
arcadia
formidable
tides
argonauts
tick
notwithstanding
unpleasant
cantata
columbian
netball
chevalier
policemen
animator
weddings
quartz
dumont
abdominal
renew
affirmed
computed
primaries
laureates
filmfare
clones
flashback
utilizes
traumatic
outdoors
hoffmann
construed
jesús
pushes
riaa
locking
supplementary
agrarian
blossom
octave
jude
ptolemy
queries
ropes
albrecht
haydn
cdc
grenada
fade
aspen
roi
deteriorated
egyptians
boasts
suárez
lifeboat
groningen
sevilla
hybrids
babu
depart
skins
burrows
paisley
terminate
ghent
reigned
usernames
lowering
desperately
seismic
eastenders
pow
madden
crocodile
abrams
interiors
dent
marlins
betting
hagen
republished
reelected
chong
fiesta
projections
muddy
invertebrates
paleontology
novice
rower
accolades
prologue
cinderella
cyclic
amalgamation
berwick
blatantly
freedoms
transmissions
organise
reflective
stabbed
simulcast
reformer
denton
oppression
foam
monograph
gentry
chemists
gabrielle
dresses
lectured
maneuver
nerves
adulthood
bray
milne
daring
hamid
utter
herbs
harness
cleopatra
brno
emancipation
phoebe
reactive
christy
reset
bianca
implements
correcting
fugitive
cicero
bono
synagogues
invariant
mindanao
pleistocene
attackers
viz
hebei
defamation
relocate
superheroes
optic
dowager
fuzzy
specifies
anthologies
amin
poultry
helmut
comedies
beech
fivb
lori
northernmost
simulator
ding
southernmost
apprenticeship
msc
raceway
khyber
pensacola
royale
macro
insider
righteous
mirage
sweat
banda
phrasing
understandable
rutland
inuit
tumors
cavendish
voodoo
pun
gambia
thematic
altering
pyrénées
saitama
reacted
sint
tulane
boiled
regulating
progresses
warmer
usefulness
intrinsic
stainless
lulu
vegetarian
tracts
alam
kissing
annoyed
raúl
shattered
noaa
dusty
lilly
interestingly
foreword
ives
seo
neurological
vibration
essayist
poisoned
invoked
frontman
archdeacon
renal
jiu
triassic
priced
internacional
fabian
trailers
lillian
poses
uneven
turrets
harlan
cruelty
storytelling
virtues
gorilla
rochelle
gui
auditions
oboe
remarried
bounce
radicals
incidental
sonora
guarantees
orchids
myrtle
charters
superficial
patti
volga
hilda
hamburger
multitude
dire
aeronautical
programmers
quicker
clinched
att
tnt
packard
bubbles
stunning
garth
burroughs
crypt
rewriting
gonzales
restless
overwhelmed
miocene
fused
granville
influenza
tomato
nagasaki
medications
engraver
distinctly
waist
salvatore
tunis
puppetry
fremont
semitism
recycled
commendation
scorpion
damned
encore
inhabit
shapiro
argyle
bingham
interrogation
gamble
bridget
regulator
advertise
classifications
butch
thriving
wiener
jalisco
administer
henan
workplaces
newbie
librarians
ub
filmmaking
easiest
ngos
welles
devote
shrubs
attendant
aerodrome
follower
methyl
grayson
vaughn
bodyguard
byte
splash
peña
classify
lieutenants
bridgeport
technicians
manufactures
lennox
swans
canoeing
planck
eastwood
formulas
staffed
directs
gérard
envisioned
musik
repeal
mach
mori
rabbits
prostitutes
compassion
labeling
synthesizers
hen
delgado
bosch
cue
dh
kaye
fielded
hawker
barrie
hawke
ithaca
cornerstone
incapable
eureka
anastasia
cayman
barnet
fronts
clippers
napoli
deviation
quakers
keepers
mutants
peng
rum
hahn
spacing
britannia
muñoz
parasite
topography
conglomerate
amusing
outflow
offender
waller
mabel
intercept
iroquois
perceptions
nic
honesty
faulkner
mined
cluj
blazers
abide
lpga
pontiac
abusing
turmoil
rhino
kilometre
packaged
trois
aspiring
inhibitors
barrage
piazza
truncated
trondheim
capitalized
busan
phased
dank
outlaws
pronouns
ignition
evade
buddhists
kobayashi
woven
mute
fai
irony
cabinets
persisted
gc
potent
subsidies
gin
nuclei
procurement
eintracht
pictorial
maroon
prem
inexperienced
hid
designate
eats
macquarie
booking
adherents
icf
hove
caliphate
ox
tolerant
aristocracy
plumage
claw
backstroke
migrate
tilt
hillside
avalon
wasps
temper
corvette
edna
chopin
glendale
chaotic
assaulted
mahmoud
devotees
padua
matrices
dilemma
fide
eine
plum
attacker
googling
pertinent
bourgeois
mani
graz
mosquito
euclidean
cub
echoes
misses
assemble
ethernet
bait
scholastic
dip
schubert
mauritania
lev
crisp
totaling
multiplication
larson
breaststroke
chefs
suspicions
ngc
groves
ingram
adriatic
knicks
outpost
darmstadt
rhymes
commodities
fashionable
sediments
punitive
léon
skipper
irina
grassroots
sticking
blaine
capitalization
preached
sheppard
magistrates
nadia
explanatory
mina
tensor
signalling
euroleague
estimation
ivanov
keystone
imitation
biennale
salamanca
islamabad
connacht
converse
bradshaw
unseen
daryl
mbe
adolescent
skyscraper
montpellier
gag
dormant
vanished
partizan
eastman
nunavut
attach
volatile
caleb
moniker
cardiovascular
nec
reza
melt
disks
pri
broughton
dx
zoological
bodied
portage
supermarkets
assassins
rn
earnest
cosmology
amar
seaman
ejected
mandal
scrub
phylum
tyre
havilland
gotham
brabant
premiers
fay
skopje
decker
mermaid
outspoken
comrades
karim
climbs
archiving
slain
amplitude
appellate
fishery
bragg
excitement
horne
salute
inflammation
málaga
surrounds
friars
backbone
petals
pegasus
moselle
aisle
hobbs
armando
kharkiv
trafford
ridley
verge
altitudes
hates
midfield
contracting
cocktail
outsiders
experimented
maguire
bard
faded
alternately
unlawful
eternity
convection
kimberley
lute
huntsville
darryl
schizophrenia
mcdowell
grasses
ugandan
pollard
prophets
quito
truss
outsider
cambrian
newmarket
hound
staples
narayan
illustrators
drury
barclay
preferring
faust
maha
gage
alleging
thence
downing
elf
octopus
interceptor
deserved
fines
someday
hangar
prohibits
beau
ebay
albans
deutsch
lucien
contrasting
hannibal
aegean
mcguire
dil
filtering
hourly
johanna
stacy
naturalized
erica
compute
evenly
terribly
palms
kickoff
withstand
naive
kylie
vase
azt
dominica
azores
outgoing
rollins
internationals
mcpherson
barre
jd
bulb
crusader
spines
fielder
macy
nakamura
greenberg
goldwyn
ode
guildford
aqueduct
rubbish
vasco
simulated
arboretum
oleg
notch
greeted
choirs
lew
biologists
taller
centric
jamal
hermitage
footsteps
wiped
booked
trier
parramatta
iain
lakshmi
moravian
peppers
versatile
mundo
pedersen
whaling
binds
dim
bazaar
featherweight
halves
trieste
jedi
luz
firefighters
causeway
magdalena
mist
arranging
galveston
comte
forcibly
parasites
isabelle
amend
fijian
fidelity
sentencing
shenzhen
offenses
baked
nea
lure
roundabout
listened
pointe
parasitic
solvent
vested
modifying
karabakh
scotch
valuation
severn
diversion
goldstein
cas
mornings
hunan
dummy
resembled
hb
kelvin
beavers
calvert
injected
artifact
manipulated
musically
italiana
dialog
fluids
slab
walsall
grams
hillsborough
lizards
moonlight
cantor
carole
saigon
telecommunication
gunner
sj
stray
brightness
molina
pseudoscience
obey
prism
impending
octagonal
universiade
sorrow
jarrett
dolores
fronted
gunpowder
babylonian
curvature
colonels
vip
borg
torquay
antibodies
cracks
sinn
heidi
yep
bergman
christophe
marko
mavericks
siam
apologized
unauthorized
daphne
mozilla
jenna
replacements
frustrating
francesca
takahashi
passports
claudius
scent
charley
bmg
susquehanna
scam
danzig
stature
gunfire
rallies
emory
dependencies
serials
drunken
stalled
clapton
compile
huber
obesity
fourier
sn
infancy
hyper
palau
siegfried
candle
allowance
islamist
strikers
principals
oversees
stimuli
jai
hodge
mathews
parcel
welcoming
shouting
godfather
cuckoo
breeze
hrs
drying
mitochondrial
retrieval
minogue
duc
syriac
sebastián
hurley
cms
emery
madeira
chihuahua
kali
bloggers
civilizations
hezbollah
currencies
frankish
vibrant
ingrid
sentiments
indochina
electrification
danced
acquainted
chow
originals
wren
convoys
waterway
rotated
phylogenetic
welding
husbands
vigorous
congenital
fulfilling
tolerate
menace
eurobasket
spectroscopy
marek
mckenna
showdown
shrew
rebirth
gujarati
identifiable
unprotected
strained
lyle
booster
stealth
fayette
liking
hodgson
decatur
newsweek
moshe
musique
greyhound
koreans
contacting
zulu
galician
ferris
ripley
merseyside
ostensibly
abducted
floral
kilometer
mazda
sequential
entertainers
gaddafi
yielding
narrower
rivalries
croats
czar
coinage
ter
newtown
ghz
sousa
lynne
kepler
●
sheriffs
reworded
mohawk
hawthorne
laude
doses
ape
nazareth
doubleday
jess
magnesium
societal
intercourse
murdering
yamaguchi
armada
gilan
functioned
chapels
athena
canning
southport
taunton
favorites
gladys
vincenzo
clerical
disrupting
bogus
gatherings
privileged
lst
pollock
kant
warp
wcw
domino
pockets
ops
disneyland
resurrected
radiant
henrietta
mersin
sonar
acquaintance
clancy
contrasts
appliances
socket
spiegel
harmon
sikhs
modelled
brasileiro
muzzle
bitch
supervising
rotate
impress
pantheon
duran
ignatius
pv
lcd
astro
clermont
scary
genie
despair
undermine
joanne
buff
bosses
jurisprudence
andersson
dialogues
sabres
enfield
gastropods
cutler
rostov
congresses
jaws
grupo
goodwill
grim
trustworthy
atrocities
cowan
kassel
telenovelas
supplements
synthesized
steamship
proprietor
grimm
publicized
props
shortlisted
superfamily
erroneous
rouen
cheer
buena
amadeus
caledonian
guatemalan
chateau
commended
downfall
mazandaran
liar
paddle
opéra
appropriations
hostel
mandy
vedic
jaya
gwynedd
depended
josiah
theta
medial
regimes
sticky
mallorca
vent
motel
jena
karel
turing
superhuman
psalm
bal
hellenistic
exhibiting
winery
backstage
tipped
bromley
primate
historiography
discounted
rave
asst
taxon
mcintyre
bae
pause
zee
garment
nikolay
catfish
reworked
markham
kisses
botanists
disclaimer
sig
sabine
defiance
aj
lucrative
surveying
rudd
oneself
ara
biz
greensboro
campos
nguyen
topping
cactus
identifier
spaceflight
unhelpful
firth
milky
mule
beasts
scrolls
teller
serum
nevis
tarzan
kindness
tempest
swinging
administratively
amateurs
blacksmith
upstairs
deportation
bland
sincere
empowerment
motivations
mildred
boating
♦
awake
subsp
faulty
fran
closet
symphonies
intuitive
admiration
pepsi
alla
multiply
reuse
estrada
junctions
manpower
kei
converter
avoidance
marley
nsa
massif
fucking
federer
brawl
redundancy
erasmus
offending
fairchild
untrue
dramatist
thinkers
residues
advises
tesla
housemates
protesting
circulating
forerunner
galatasaray
rodents
soluble
zimbabwean
colloquially
brace
newborn
tangent
nominally
ruthless
valentin
corrosion
neue
regression
equator
ict
ascending
classmate
glam
groundwater
marylebone
vulcan
nih
sibling
clutter
stacey
blackout
erskine
dade
toulon
poisonous
derivation
believer
eindhoven
accompaniment
trajectory
apogee
batters
fallout
beers
auditioned
becky
qu
anarchy
arias
hacking
calculator
heraldry
gama
marital
scripted
bender
britney
leukemia
coyote
horizons
advising
thru
fitzpatrick
eun
esoteric
omnibus
slice
drastic
laughter
glands
samoan
professorship
salts
tchaikovsky
poured
austen
spd
leftist
charismatic
dominating
lesions
southward
simeon
halfback
abstraction
cid
mammoth
slides
gainesville
healy
fallon
volta
thorne
assaults
clemens
meme
fabricated
temp
meath
rumble
malacca
titanium
osman
grover
adele
och
pinned
elsa
moors
soho
accelerate
tun
bissau
copyrights
xm
claws
braga
homunculus
constantin
mustard
regal
bestseller
wessex
bolshevik
isla
karin
grenades
blackhawks
hun
steamboat
leonid
secrecy
grading
flock
fukushima
essen
breakout
prostate
tariff
exponential
yarmouth
acknowledging
upton
dumped
trillion
slant
williamsburg
analytics
judgments
descend
inhibitor
conservatism
merton
shortages
hierarchical
janice
reigns
scuttled
ellie
contrasted
improperly
wilfred
sheltered
lothian
methane
abandoning
thoughtful
bending
keane
mediate
battleground
pollen
revelations
unambiguous
dieter
ventral
leah
stresses
yoon
maru
chill
salazar
expos
liszt
abduction
morphological
murderers
turnbull
cater
jealousy
ui
mandela
diva
bethel
maximus
fontaine
mahmud
specifics
curtiss
wilcox
stakeholders
exiles
yellowstone
radically
captives
macbeth
intricate
horseback
thrash
burgos
coe
clarendon
amd
litter
israelis
sfr
summits
constabulary
residue
orr
hess
keating
recounts
pendleton
waits
zelda
laird
sikkim
ceasefire
gonzaga
stimulate
seperate
crushing
primates
emphasizing
typed
equivalence
spree
mora
incurred
permissions
unconventional
bites
sprinters
moi
legit
validation
rasmussen
humane
storing
cosmetic
showtime
bullock
eel
hampered
isotopes
abandonment
provocative
heraldic
concentrating
msnbc
lazarus
elastic
gideon
scully
elves
vulnerability
rockwell
relisting
larkin
hedgehog
bryce
adjustments
authentication
crook
hearst
anwar
consecration
olympian
amassed
ina
camouflage
weaknesses
temptation
turk
cisco
fontana
sql
stuffed
lobe
aden
lays
parted
pcs
squid
mesopotamia
dogg
jamestown
swamps
seafood
shortstop
milo
rms
cantons
curly
ryu
overture
kitts
ipad
crested
woodpecker
jimmie
fabulous
endeavour
ctv
throated
newell
partridge
evicted
jonah
mcmillan
indictment
alton
wellesley
oven
atm
captions
dormitory
univ
jolly
clubhouse
ubuntu
temperament
guarding
encompass
charcoal
indigo
havre
marxism
horizontally
anonymously
permitting
leung
lordship
zionism
lesotho
tufts
salad
atheists
nestor
spared
airplanes
magnolia
willoughby
pinch
needles
duluth
raoul
breaker
tee
excelled
dustin
socking
mats
eskimos
confinement
gal
vaguely
endured
unnecessarily
messina
haired
mons
indus
jasmine
moons
leyte
curt
haleakala
epsilon
gustave
manchu
pediatric
opole
constituents
yeomanry
browsing
innsbruck
aachen
olympique
eyre
intervened
snowy
banknotes
karlsruhe
hyperbolic
expenditures
orton
modulation
somerville
reeve
frescoes
frederik
isu
khalifa
cops
epsom
applicant
riddle
apocalyptic
foliage
günther
timetable
yoruba
gorman
therapist
predictable
gutenberg
affluent
hottest
casts
transcribed
guineas
jameson
münster
peyton
federalist
kraft
haarlem
inexpensive
peugeot
montrose
restricting
sustaining
standardization
strata
realises
derogatory
locale
ashland
mentoring
abigail
glow
ionic
radios
massimo
jst
chechen
proofs
alexei
purcell
dp
cookies
telegram
fips
boarded
branched
mic
deanery
harassed
pathogens
rene
demoted
breakers
criticize
sportsman
acknowledges
longstanding
restraint
swear
oops
nils
journalistic
protestantism
sous
este
orbiting
kk
outrage
commandos
flattened
ferrer
costing
lace
wilbur
blackish
veneto
retreating
deciduous
arrogant
elle
moderator
nepali
brescia
forgiveness
cones
aspirations
wilton
maitland
kota
tore
ellison
malabar
cesar
swaziland
collapsible
westfield
fabrication
seton
realities
injustice
barron
calabria
montagu
denison
syndication
csa
tp
coco
seater
praying
irs
correlated
aarhus
boo
kitchener
pei
nightingale
dentistry
fujian
fencer
fresco
pairing
lucha
chopra
villiers
accents
specialising
investigates
modena
cellar
dubois
hugely
sligo
hackney
paints
bellevue
isaiah
weaponry
foxes
fortuna
tanya
mathieu
avail
spices
hangs
gland
additive
angie
burnham
nexus
brit
headers
vivo
ls
educating
modernism
kr
kimberly
occult
parodies
harvested
sykes
overseen
menon
improvisation
wainwright
override
picturesque
kathmandu
redesign
unión
achilles
battled
cabins
thuringia
annette
ahmedabad
stir
merritt
gershwin
unbiased
spoon
thierry
lasers
magyar
bautista
oddly
helix
worldcat
supplemental
grill
baseline
ono
cautious
perl
branching
evacuate
ike
readership
rockford
tubular
pursuits
imperialism
havoc
malley
tolerated
faq
executing
atheism
hauled
chicks
gillian
manx
wondered
intern
worthless
sequencing
allegro
merges
shamrock
inference
cpc
sneak
medallist
antónio
arden
exemplifies
ric
peat
bop
asher
pharmaceuticals
festivities
barrington
princesses
bargaining
reuben
bam
ringo
rumored
rendezvous
reckless
margarita
venerable
unpaid
inferno
berber
omission
judas
seminole
harassing
steamed
fabio
mercenary
grafton
tempted
nicholls
wendell
sable
intermittent
svenska
rag
theirs
castes
azad
gardener
federated
coloring
dahl
spikes
gallo
billing
homogeneous
enraged
inca
bulldog
deprecated
whereabouts
residual
fountains
pleas
hilbert
hurst
katharine
royalties
rumor
jerzy
whichever
dissolve
duane
moritz
hemingway
zum
apes
springsteen
southland
jehovah
appleton
erika
penelope
regan
puck
mcgregor
heartland
differentiated
footprint
distinguishes
scoreboard
eras
blanchard
cognition
lowry
masjid
panoramic
kingsley
beforehand
adjectives
numerals
amp
yoo
microscope
balloons
aggressively
kellogg
respondents
colloquial
serviced
showcased
umar
santander
forehead
conqueror
hyphen
emile
augusto
alameda
cruises
epstein
discretionary
petitioned
faye
gnome
kwan
instinct
itu
mechanized
pulpit
flees
handing
extracts
tashkent
viper
quarterbacks
vassal
deutschland
crowns
losers
foolish
wheeled
almeida
substrates
penetrate
pau
reused
correspondents
townsville
stallion
ws
michelin
brill
dorchester
homme
colo
plotting
horsepower
valladolid
duplicated
raiding
hermit
cues
tung
mergers
crashing
eredivisie
adm
dns
chariot
pavement
braking
bureaucratic
gunnar
grinding
registrar
apa
sogn
imperative
rankin
nonlinear
orchards
charlemagne
combo
pentecostal
recaptured
beit
nasser
himalayan
transmitting
arson
singled
ellsworth
breasted
greenfield
transatlantic
hardin
sakura
ordo
timbers
maison
peptide
rescues
nawab
tg
consultancy
sway
invariably
descends
privateer
toilets
transient
premio
frazier
ayr
maths
tailor
nietzsche
supper
piotr
gaul
botanic
smackdown
aw
nocturnal
magdalene
contiguous
reprised
unilateral
hilly
classed
champaign
schloss
feasibility
borrowing
darby
trough
cured
ax
albanians
squire
meade
symmetrical
heller
limburg
subcontinent
deportes
telescopes
starvation
abdel
jeans
gamer
lid
rajshahi
terminating
dunlop
carmichael
embarrassing
snack
rampage
jalan
photons
naia
disturbances
azure
mun
cali
vittorio
tendentious
baronetcy
aiding
simulations
fenton
contradict
cj
baccalaureate
henson
deepest
izmir
assigns
prohibiting
nytimes
deng
barbarian
maori
nur
petitions
environmentally
ish
familiarity
sulawesi
massively
holm
kgb
indices
donnelly
flair
polynesia
antagonists
clover
soy
scaling
wingers
valiant
couch
hines
rubble
winslow
aramaic
polynomials
strive
confess
spreads
siegel
andrey
accomplishment
lexicon
declines
bakersfield
rant
colourful
sopranos
olympiacos
bump
athenian
paranoid
gel
slept
risky
bathing
nbl
duly
mixer
luís
inhibition
nonfiction
tracey
consolidate
reasoned
fission
barrio
lear
ascended
eviction
alteration
corinthians
coburg
borrow
bandar
rightly
temperance
fraudulent
catastrophic
kanagawa
gesellschaft
sabrina
scrolling
funky
sato
musica
pronoun
acidic
mania
mango
copeland
restarted
confederates
tracing
roberta
sargent
astra
excludes
banco
adolph
biting
dealings
mustered
teaming
punta
prank
aldershot
jelly
fredrik
turbulent
reinforcement
arg
climates
oder
jana
hires
godzilla
arbitrarily
maestro
davy
lega
tvb
eroded
icao
blanca
ambushed
nicolás
redistribution
capcom
gomes
battista
advisers
eminem
ivoire
scans
manifestation
screenshots
mentality
orson
sonoma
deletes
blended
cosmopolitan
triad
fills
crises
upgrading
pill
yue
marge
jonny
crooked
cbd
aqua
detecting
carbonate
fjord
godwin
iqbal
yuen
graders
motorola
betsy
folder
breadth
cheney
cores
ipod
conscription
batter
flush
hiroshi
vertigo
incarcerated
keynote
armory
jeopardy
poppy
shady
erroneously
hardest
beads
creditors
coward
mimi
grimes
tenerife
navigable
dakar
banded
slug
promoters
squared
pots
compendium
nazism
mimic
pu
endeavor
anatomical
mccann
runways
experimenting
faint
andrzej
nizhny
scala
insulation
horned
aromatic
depleted
registering
douglass
rodrigues
hatfield
misery
explode
tyres
breakaway
chickens
sporadic
unilaterally
intervening
mika
custer
perak
vines
lyman
rss
conditioned
undesirable
doesnt
cursed
urbana
configured
hadith
disposed
aeronautics
smallpox
crusades
lust
logically
maverick
gratitude
tentative
befriended
naga
fallacy
niels
enigma
insanity
platt
leighton
workings
whisky
woodbridge
selector
lycée
eocene
partido
ucoz
ami
parisian
pursuant
lucie
commemorates
cruising
mort
baggage
isotope
conical
kendrick
slick
launchers
binghamton
sigismund
aquitaine
m²
bodily
hurts
parliaments
saba
weakening
popes
sadie
balfour
homecoming
osborn
gaius
pasture
jimi
unionists
bassett
mira
+,
kia
unsupported
waffen
nadal
alcoholism
powerhouse
grasslands
outrageous
contend
millwall
tomás
segregated
hepburn
aix
router
centrally
plastics
francois
premieres
π
distracted
relational
stéphane
furnished
airliner
dockyard
webcast
anarchists
verbatim
collapses
inquisition
pst
wwi
abkhazia
euler
shortcut
letterman
aldo
zimmerman
taped
exhaustive
shootings
jeep
florian
ramirez
zachary
favors
mauro
optimistic
successively
gadget
cramer
józef
lacey
ligament
spores
presumption
ashford
persist
utopia
solids
choreographed
agatha
lad
bouts
kimball
cg
projective
dentist
adi
schwarz
sulfate
aichi
corsica
herefordshire
cosmetics
upland
goddesses
skirt
jaipur
esa
csi
inorganic
phosphorus
facilitating
accelerator
ifk
smyth
completes
kettle
fatally
char
mccall
triangles
pills
reflex
norbert
dauphin
arundel
hammersmith
rectory
elaborated
asahi
unicorn
diner
perpetrators
jharkhand
nilsson
primer
beatty
moray
wrestlemania
murcia
grenadier
reductions
javanese
sedimentary
ata
ours
crawley
mammalian
linn
scientifically
sheng
heresy
layered
frans
fyi
sejm
tak
initiating
lifespan
privatization
pumped
bukit
aguilera
irrational
finalized
widowed
uta
assassinate
indexing
klan
anthropological
rk
dfb
brody
seekers
stalls
coincidentally
oldenburg
venezia
chromosomes
steeplechase
gaulle
huh
enjoyable
gigantic
illumination
humber
{
proficiency
mbc
spit
fianna
pads
solos
vidal
apologise
hornet
kaunas
expects
polydor
porte
luka
maximize
chairmen
libertadores
legions
livingstone
dunne
nikita
dod
jura
originality
preschool
enhances
bjp
learners
influencing
membranes
forsyth
laborers
oskar
shiny
widened
apt
immature
visualization
femme
twinned
mcfarland
felony
marriott
ruben
gl
ymca
spouses
yemeni
mummy
edict
cries
sima
violently
flanagan
flycatcher
climbers
puget
horatio
ito
pointers
uniformly
eugen
horst
tds
deans
inmate
clashed
hartlepool
emitted
locating
characterize
fences
réunion
philanthropic
kowloon
nirvana
enchanted
gough
exemplary
falkirk
biplane
conn
transliteration
chavez
magdeburg
dwyer
filippo
undone
enthusiast
tabs
pronounce
wrought
ultraviolet
parades
berliner
remade
feral
francs
polynesian
emu
latina
bydgoszcz
casinos
tuna
mushrooms
español
whitaker
abstain
ima
zx
rebellions
sportswomen
asimov
resin
greer
joseon
vest
pagoda
wal
vendetta
zamora
gillingham
honduran
wiesbaden
commits
bayou
casimir
nutritional
kickboxing
cochran
tempered
humbug
chongqing
cholera
salsa
fusiliers
andalusia
braille
salman
jen
commence
ornate
malvern
hallmark
situ
egan
archduke
wd
bandit
vans
circumstance
afi
baptists
feldman
mcnamara
aguilar
bash
pamphlets
condor
necklace
bellied
perished
italic
hormones
liter
himachal
newscasts
sewer
spurious
multicultural
outlying
vanilla
doorway
quartermaster
constraint
faisal
socrates
dobson
kun
forefront
rigged
rx
quake
fulbright
robins
nicosia
selectively
calf
loyalists
enriched
billings
lorestan
antibiotics
ferrara
selena
heap
cymru
maher
observes
vertebrates
joshi
microscopic
lei
guerrillas
sleepers
vitro
seaplane
coarse
lister
peshawar
astor
bookstore
relist
harpercollins
cigar
orthogonal
rampant
hypotheses
vitória
sacrificed
aquinas
fwiw
hartman
illnesses
bless
standpoint
arterial
rudder
diablo
pia
carly
silas
patna
mana
indira
multilingual
deserving
outputs
blindness
revolving
wanda
iar
internment
embankment
abba
mojo
conus
asean
enactment
bellamy
hou
dumping
rosenthal
absorbing
vortex
roughriders
redwood
anxious
dalai
reclaimed
rizal
cocoa
kmh
electronica
kazakh
expressive
flaming
yucatán
externally
carvings
roadside
pains
rebellious
uniting
cassandra
kok
kew
bolívar
distracting
ammonia
mahal
rousseau
vr
vulture
trujillo
writ
peking
ethnically
equivalents
filly
convertible
accountable
notts
riff
ortega
impairment
anthropomorphic
sculpted
alleviate
magnate
spp
steer
impartial
rojas
ent
wadi
totaled
tad
mathew
jacobite
estranged
webcomic
cree
sham
chancellors
joaquín
cryptic
spoiler
nell
cham
mathias
saturated
martinique
pious
resigning
moravia
diversified
afp
hospitalized
doomed
notebook
cultivars
adoptive
pendulum
gonzalo
himalayas
leuven
inhibit
haig
raft
clements
disadvantages
yamamoto
boon
occidental
vogel
deserts
westport
authoritarian
zodiac
goodness
bain
marcia
transvaal
mcconnell
quan
sixties
matteo
bra
barbuda
topeka
grenoble
grabbed
cesare
cancers
braunschweig
traitor
uphold
convicts
hinted
starship
proctor
comfortably
cock
accessory
krasnodar
artefacts
debuting
weightlifters
ivo
adorned
haji
extras
exchequer
fullerton
continual
rhetorical
skepticism
internship
sync
philanthropy
marshals
spec
informatics
kristen
cuthbert
greeting
horticultural
lame
schuster
durable
memo
ddr
closures
rg
outnumbered
wellness
hermes
resurgence
ponte
yuki
orchestrated
gamespot
posse
beverley
uploads
slap
watanabe
lesley
fleets
trident
subunit
trousers
craftsman
orthography
waldo
wrestled
jiménez
kunst
packets
triathlon
bcs
croft
consultative
stratigraphic
molluscs
fina
plurality
naacp
warlord
saline
collisions
outset
majestic
aloud
spelt
rhein
vichy
ipc
bastard
launceston
nfcc
bryn
fascinated
cracked
entertain
binomial
simulate
afrikaans
gustaf
debra
brandy
knocks
messy
ratification
biota
tweed
chunk
iberia
shipment
cipher
pakhtunkhwa
hectare
timelines
mendes
fellowships
frontiers
jj
quartets
tatar
dissident
nowy
prefectures
negligible
valor
umm
delighted
nyu
alphonse
hackett
selby
sapporo
fenerbahçe
nagano
mobilization
djibouti
venerated
dmitri
protections
aek
revered
lana
macpherson
elizabethan
vhf
frivolous
cos
immortality
angolan
arrays
tyrant
folds
indoors
jia
aura
zanzibar
sine
quadrangle
immersion
specialization
simplify
thicker
sava
reunite
pinyin
papacy
escorts
cfa
nonsensical
adherence
paraphrase
factually
specifying
scarecrow
osprey
incheon
papyrus
maui
quadratic
motorized
gracie
mercantile
worshipped
amore
kanye
projectile
liaoning
serialized
refit
nightly
beautifully
stagecoach
féin
antennas
wynn
adapting
villas
willamette
violins
causal
adjustable
michelangelo
extraterrestrial
raju
lodges
wreckage
conservancy
labourers
assimilation
zappa
renato
thurston
accessing
entre
lubbock
unresolved
lorenz
statistic
intra
refurbishment
etiquette
creeks
swiftly
metallica
bertram
plainly
visionary
peanut
therein
persians
uncontroversial
heroism
trey
budgets
middletown
hays
xviii
degraded
routines
chet
noisy
disappearing
polygon
ova
hampden
dorado
spitfire
posture
eater
luisa
fo
samba
bantamweight
transmitters
ozone
muster
screenings
colegio
cara
defected
dempsey
waikato
bertha
clicked
deficient
getty
brighter
excessively
celia
schism
laredo
maternity
penultimate
assad
relax
axiom
evansville
bloch
pickering
distillery
upi
airmen
aiv
tres
mueller
tba
taxis
londonderry
trumpeter
gutiérrez
uri
portico
thrush
dumfries
hendrik
hurdle
allotted
enrichment
heterosexual
tijuana
biographers
fidel
renders
medicare
hadley
plaintiffs
splendid
mcintosh
mifflin
gateshead
psp
hirsch
ballarat
pinnacle
catalytic
unfounded
maneuvers
bladder
apc
servicemen
prematurely
singleton
dotted
mandates
ascii
marsden
ferns
devonian
cong
piloted
republika
livelihood
selects
bharat
regulators
obligatory
salim
vibe
hex
smoothly
responsive
lux
mortars
interned
respectfully
fearless
cabrera
kanji
opined
pandora
cst
siu
periphery
imperfect
lush
hooked
mustangs
harpsichord
sawmill
danes
chorale
kochi
nugent
lobes
anson
scranton
hurlers
alot
salinas
frs
libby
gallons
parti
iteration
quechua
rainer
constants
postmodern
werewolf
climatic
fayetteville
dissemination
diffuse
elevators
eta
quadrant
burbank
flavour
tatiana
antalya
flyweight
lourdes
gibb
seung
gestapo
tartu
kf
acton
huey
bumper
dusk
cleric
commonplace
tents
anzac
windy
paine
unc
illustrious
powys
brink
wildly
wounding
vandalizing
relaxation
chai
inversion
glide
reclamation
customized
psalms
puppy
gall
prodded
vicky
dumbarton
lehman
townland
pageants
crabs
akademi
buren
planner
juveniles
garlic
gar
violinists
mehmet
turku
oldies
yak
rediscovered
afforded
robbed
fitz
taboo
leverage
bm
bananas
général
biosphere
persistence
truths
recounted
cereal
enslaved
counterfeit
liquids
plovdiv
antelope
pied
martyn
roar
rohan
flex
widening
weymouth
widows
alerted
tierra
bao
mildly
reciprocal
rotational
kyung
mariner
champs
peacekeeping
airship
tombstone
brutality
reims
embassies
transitioned
essayists
stipulated
elektra
prequel
honeymoon
sanitary
grind
shalom
canto
gifford
ballpark
sébastien
manu
alas
norsk
philanthropists
clandestine
notifications
flinders
midwestern
ported
guadalcanal
distraction
avec
waking
vaccines
guadeloupe
roderick
tuvalu
saxons
silvio
intersect
lacy
duplicates
fractured
regatta
cracking
eared
whistler
tamara
hana
commuted
ascribed
stampeders
varna
elsie
corridors
elemental
winfield
heaviest
quorum
herzog
mohd
inaccessible
persecuted
handheld
utilization
burgh
bret
alloys
lowlands
delaney
hers
theorems
combatants
interpersonal
adjusting
transistor
pico
chaplains
sealing
sizable
seizures
spaceship
jerk
crows
moo
pianos
taj
barren
disable
begs
detectors
britten
attaining
prelate
eau
zenit
passions
hydra
kidnap
szczecin
revise
stampede
hogg
chalmers
boulogne
bochum
reincarnation
minnie
gottlieb
communicated
stills
millennia
bipolar
anthropologists
sentimental
hf
grays
billions
grizzlies
escalated
pedigree
expressly
banished
transformer
hepatitis
fd
dutton
peacefully
mulder
reborn
landau
stimulated
flavors
embodied
qualitative
backlash
colby
gliding
southwark
kristin
advertiser
dorsey
zebra
ditto
coeducational
breasts
stigma
nudity
leland
berman
statistically
olympus
fares
humanoid
finely
suspense
townshend
valencian
demi
baking
verizon
mfa
infinitely
surfer
cochin
bootleg
impoverished
predatory
schoolhouse
fokker
beak
hurry
prodigy
auditory
specialize
dorian
benchmark
rang
palette
fortunate
gables
laughs
researches
cafeteria
kts
cornerback
reclassified
fluorescent
accumulate
swat
planners
childish
eels
contenders
constrained
carp
perm
caller
dictated
augustin
interstellar
encompassed
coyotes
piercing
blockbuster
catarina
crowe
lark
sochi
accountants
tj
abolitionist
olimpia
expansions
retracted
ubiquitous
mayfield
enroll
wycombe
catalunya
linz
noticing
nellie
leith
fuchs
dani
heer
politely
polymers
hershey
fernandes
anticipate
antonia
stu
rj
helmets
extremist
robe
haut
rescuing
acm
lag
tripod
wavelengths
merle
nutrient
overdose
giulio
cahill
backyard
headwaters
interlude
schulz
kalamazoo
smartphone
snoop
pandit
chew
rehearsals
mri
intersecting
lucille
delisted
christgau
separatist
wilfrid
intellect
larsson
necked
caterpillar
varma
facilitates
bargain
jock
reunification
sarcastic
friar
loeb
suspend
kala
conf
haskell
antibody
jitsu
asthma
floats
microscopy
cultivar
leasing
qld
skye
gregor
oise
horde
asha
swarm
andover
frey
susanna
dizzy
spirited
rada
uno
spielberg
gallipoli
candles
hôtel
expire
poorer
chiapas
cinnamon
empowered
myriad
anytime
impedance
embryo
cans
hh
rko
salvaged
shang
alerts
biking
labyrinth
parochial
categorised
curate
refining
moderne
srpska
implicitly
metaphysics
suck
funnel
discredit
remington
ardabil
nasdaq
brahmin
tbs
yachts
parity
arjun
ante
davison
axial
barring
divisive
nasir
unexplained
maratha
ssp
cheat
marginally
sherry
boniface
wheelbase
sparsely
exploding
silvia
coolidge
brahma
goalscorer
yada
fats
epilepsy
aidan
sly
nath
mukherjee
holiness
nectar
cleansing
flaw
agra
bowers
midtown
spurred
pvt
palmerston
artemis
raider
noël
darcy
australasian
sausage
mezzo
kbs
thermodynamics
kilmarnock
brochure
cardboard
guzmán
proficient
barangays
amt
tallahassee
defenceman
adolescents
elliptical
breached
departmental
electrode
hampstead
erased
sistan
taxpayer
minerva
bonded
pusher
pastures
strands
whiting
greed
clearwater
condemnation
blackberry
mh
colón
jure
pullman
tariffs
sitcoms
gotha
cartwright
corpses
deh
kurds
resentment
behavioural
elects
warehouses
argentinian
smashing
apparel
meteorology
allahabad
bute
amplifiers
shipments
montague
huff
balochistan
ballets
extermination
stamped
preach
signage
unsafe
roscoe
avignon
referendums
emmett
txt
lászló
shaanxi
bearings
arne
machado
sana
tildes
davao
wary
bonnet
fantasia
permian
youngstown
rsa
nhk
magdalen
checker
haines
pinball
unicef
patsy
frequented
reclaim
ornaments
mcqueen
insurrection
azul
elliptic
supernova
fujiwara
crank
fabrics
rulings
kt
benoit
ousted
deterioration
eclipses
herbal
literate
vowed
exponent
verve
geologists
maximal
motherwell
linen
engel
spectra
englishman
guelph
pygmy
crucifixion
extracurricular
interfering
adultery
uzbek
esteban
bacterium
handicapped
fiery
cloak
héctor
accra
biases
rubens
hickory
gestures
obstruction
sheen
clarifying
ivor
erickson
béla
sewing
terrier
dew
ay
vito
lough
nv
pathogen
pearls
taxpayers
casper
corrupted
grabs
immensely
mehta
jeong
checkpoint
forte
echl
regulates
artisans
chola
tahiti
cassie
sulphur
gk
senegalese
cpi
modi
reversing
walkers
saône
eucalyptus
thrace
coppa
headlined
astoria
paced
ashok
grumman
ledger
figurative
propellers
fundraiser
narcotics
ems
competitiveness
sects
lambeth
looney
crustaceans
batavia
metrics
constructs
beale
fragmentation
vulgar
hose
cedric
directories
nagpur
dawkins
vox
duos
würzburg
pea
ecw
nit
hanley
nano
bundled
mooney
chengdu
simplex
catastrophe
unpredictable
interspersed
piero
marietta
gladiators
brahms
muriel
psychotherapy
mortally
sociedad
academically
smiley
villeneuve
biomass
carte
meyers
replicate
benevolent
gaye
walther
cad
weakly
fbs
weaken
cobalt
bursts
cranes
rad
dissipated
pigment
königsberg
tailored
superiors
lettres
dea
courtroom
ravenna
obverse
whereupon
falmouth
resultant
piped
sociologists
manipur
nagorno
promenade
knob
rallied
discoverer
mastery
boutique
reykjavík
rocco
generalization
clemente
víctor
parenting
mallory
plight
strategically
bien
nai
bosco
henning
danville
loft
shawnee
skinned
micropolitan
bandy
illicit
instrumentalist
linkage
dough
sigh
microbiology
demonstrators
pipelines
inoue
spectacle
hype
wv
relic
tedious
terraces
militias
callaghan
comcast
mins
cessna
priscilla
abyss
xxi
urges
aris
coates
gallon
radha
animations
jody
ayala
kampala
commencing
crystalline
purification
undertaker
antoinette
soloists
coimbra
dalmatia
dorothea
kl
monsieur
eruptions
favourites
reopen
mariah
craftsmen
domes
hubei
recital
devout
woolwich
pressured
pros
dispersion
bmx
huts
planar
foe
shanxi
fascination
sala
scanner
comprehension
coroner
covington
fledged
adapter
prentice
ripe
berries
pali
flashes
sanjay
flipped
cyborg
rennes
extremes
milestones
fcs
breweries
ulm
banksia
summed
durga
cookbook
duarte
ganga
dissatisfied
hiram
tintin
daley
miniatures
lawful
proxies
mala
skulls
concertos
dispersal
syn
lanarkshire
brevet
lovecraft
anjou
yangon
administrations
independiente
alouettes
richland
freshmen
spawn
nostalgia
nach
overrun
stochastic
wheaton
ligand
quarries
addis
yoko
roxy
unsolved
slippery
downey
folio
obedience
ardent
oakley
didier
napa
impetus
geek
summon
horticulture
sidebar
kampong
legality
peg
falun
polarization
naughty
stab
colonialism
panathinaikos
vf
syrup
patrolled
heavens
regnum
madurai
flushing
gael
thrive
pejorative
sigmund
salaam
aberdeenshire
kenyon
afterlife
chemotherapy
luggage
adopts
catania
troublesome
variability
trafalgar
pulaski
antiquarian
verne
undergoes
streamlined
lavender
vocalists
fáil
mountaineers
inspirational
bolts
horsemen
islington
cartoonists
seniority
thyroid
plank
camogie
timer
duets
leaflets
cristo
speculate
stadt
lodged
servicing
diamondbacks
exemplified
garments
racetrack
peanuts
eurasia
unarmed
salamander
nino
haunting
puri
ríos
keefe
reluctance
enjoyment
newbury
gunmen
draining
praises
omit
lite
bolsheviks
slightest
glue
lavish
sacrament
isolate
provoke
amiens
blur
aggregator
jg
worrying
rhinos
coasters
stroud
kremlin
teaser
lobbied
rl
grad
counters
explodes
wisden
mysticism
millar
snowfall
osama
modem
sadler
philology
francophone
bello
blending
bombardier
plutonium
pax
cakes
ses
billionaire
argus
postings
receipt
nelly
nativity
whitehall
calligraphy
andres
selma
ei
malware
signifies
israelites
warhol
rhapsody
lasalle
hussars
travers
commandments
duct
alexandru
narod
clumsy
marrow
tercera
meng
rijeka
subtitles
str
saeed
stemming
cossacks
niigata
converge
flanks
bethesda
alamo
yamato
reorganisation
reacts
alphabetically
coloration
alistair
schema
geese
yosemite
sweetheart
internazionale
daniela
yeovil
spongebob
intimidation
contradicts
tomé
ateneo
donnie
reina
nonstop
oprah
affirmative
scuba
dre
tweaked
schumann
observances
rufous
luo
localization
badgers
perugia
vow
mor
plotted
torneo
deva
haworth
eyewitness
realist
accompanies
winthrop
ting
conservatoire
stunts
coles
shelled
aquila
mississauga
bethany
erp
misguided
recap
nitrate
mennonite
midsummer
mahatma
anomaly
cai
vibrations
garonne
benign
thrill
prado
undisputed
italiano
preventive
mentors
sectarian
trainee
caen
malibu
bradbury
dundas
cheetah
terminates
schiller
sy
nac
rivière
operatives
irrespective
subsistence
aerodynamic
puff
redmond
odin
françoise
wardrobe
sos
worthington
eccentricity
vicki
lamont
pci
kristina
wm
tw
hubble
tees
aet
lyn
almond
symptom
childbirth
covent
flashbacks
postcards
inquiries
dragoons
trudeau
morally
xiang
appointing
grudge
chabad
captaincy
alves
blvd
docking
rib
buxton
shenandoah
stricter
gop
twenties
gg
rash
yogi
divergence
permissible
consolation
creep
début
boa
alegre
rodolfo
rahul
addicted
sorcerer
habib
defends
meuse
floppy
hoaxes
owes
cloudy
hmmm
showcases
franc
optimus
hao
fatima
collage
brampton
telecast
coli
fermentation
biathlon
deformation
tortoise
dessert
moat
reintroduced
dissenting
metaphysical
basalt
ideologies
lauded
adventurer
probes
optimized
favorably
conceal
supervisory
carnivorous
jerseys
forecasts
folly
karaoke
superleague
cervical
impeachment
smiling
cato
bribery
wbc
saleh
devoid
acquires
tobin
ludlow
meier
smoky
raptors
refloated
chieftain
fútbol
departs
pee
onstage
predicting
formative
gunnery
ghulam
porta
artur
morals
skirmish
rosary
modal
sectional
landfill
nuggets
sanderson
linesman
incompetent
bw
rebbe
rbis
chandigarh
trapping
modernized
austrians
cercle
strangely
glamour
robbers
olav
await
chadwick
frightened
boar
enoch
ringing
breda
insulted
strife
sorority
flea
slavs
impractical
boomerang
sheds
cults
calvinist
corinth
invoke
transcripts
vom
perfume
bony
lakeland
pla
bikini
potts
petite
racine
patriarchate
shahid
abolish
genital
circumcision
ogg
recommending
pelican
plaques
memorabilia
unearthed
tractors
glowing
relaunched
dominates
reformers
shutdown
slogans
dashes
marlon
jian
ripper
tern
appropriation
arranges
bullshit
squeeze
inventing
spade
cooperatives
lawton
highs
dg
fetus
electronically
disadvantaged
waco
hesitate
blaming
youthful
harass
artificially
calvary
exercising
pillai
cholesterol
stabilization
ppp
martini
ecm
thiruvananthapuram
nuestra
nope
imf
gunboat
basing
ukrainians
postcard
monographs
apertura
regensburg
rebelled
dysfunction
parked
vertebrate
conor
melted
kodak
montserrat
triggers
hoon
blessings
farrar
mojave
saab
benches
repairing
engravings
illusions
bowlers
asap
stringent
irregularities
huntingdon
andean
airdate
cartagena
uttarakhand
assigning
shaker
digestive
afonso
hike
mosley
kruger
kreis
stunned
acorn
bluetooth
iglesias
ine
greyish
humphreys
vfb
hopeless
antigen
lockhart
sturgeon
joo
vigo
teutonic
larva
crockett
monmouthshire
stormed
obs
insurgent
buckeyes
ganesh
loco
walla
bingo
embarrassment
sited
peas
disqualification
rested
bst
attire
paralysis
calderón
sao
senna
migrating
romanians
mansions
invent
ural
financier
vasily
mayhem
tammy
jacobson
som
endeavors
rensselaer
troopers
dementia
staunch
chop
abrupt
splinter
connors
sponsoring
bagh
thrissur
endorsing
kirsten
indexes
spawning
stepfather
newbies
vertebrae
suny
salient
skeletons
allentown
sunflower
chaim
gallen
neutrons
commas
hog
domestically
gilmour
hideout
adamson
furness
whitley
shui
massage
gd
melrose
suresh
reliefs
deficiencies
capo
tübingen
racially
lakeside
wipe
melon
nakhon
carney
rourke
rostock
sardar
ether
fetal
sash
daimler
sarasota
connectors
ruhr
entertained
martyrdom
prometheus
slough
bridgewater
earnhardt
confederations
forgetting
subsidy
ceilings
freemasonry
stacked
ericsson
headache
jb
farley
massa
foraging
noor
frontline
beograd
degli
geothermal
luiz
zoologist
penitentiary
reforming
summons
contended
gregorio
flare
comb
luxurious
chlorine
footed
bashir
escorting
formosa
arthritis
fragmented
timberlake
surpassing
negligence
barbarossa
¢
yadav
glossy
marcello
drumming
lizzie
onions
therapies
shuffle
cryptography
abi
cartier
upbringing
unlock
escalating
specificity
auvergne
falco
alaskan
rh
montage
nod
björn
quercus
kn
selfish
whitesmoke
restrained
torso
avro
dias
piccadilly
bertie
templar
boyce
ethnographic
childless
antisemitic
colleen
corresponded
callahan
exiting
ripped
uma
prom
seward
federations
mileage
shaking
concede
mubarak
cora
anglesey
countered
mcdermott
arsenic
ashanti
prefectural
reinhard
sari
geraldine
coleridge
leyton
fluctuations
antoni
pleasing
scooby
asheville
housewives
arun
mahler
horrors
lexical
showcasing
kinship
orbiter
pancreatic
madre
percival
avenge
thrower
lm
goblin
leases
brodie
filipinos
villanova
dons
reopening
cabot
inspectors
orwell
boyz
embracing
laundering
arisen
overlook
tome
bancroft
opel
rubio
tagalog
uncover
australasia
crawl
barbera
yung
salvadoran
oberlin
olya
hathaway
fractions
scalar
whalers
fables
adair
westmoreland
heightened
orissa
fridays
hibernian
elisa
duisburg
sup
electrically
surreal
latex
gsm
substitutes
sonatas
creighton
etienne
styria
narrows
trumpets
defective
huxley
broom
mehmed
manifested
dyson
erection
dun
bhopal
gown
brutally
cameroonian
falklands
stubbs
undead
counterattack
aruba
chevron
culver
recapture
clasp
riviera
manchuria
renumbered
geographer
expectancy
contractual
petar
achieves
wiring
bastion
pear
manipulating
rejoin
teal
psychoanalysis
aur
unofficially
fledgling
kibbutz
expansive
exceedingly
aldrich
importing
wildcat
lockwood
kingfisher
angrily
curran
pulses
nikon
monoplane
earns
kato
adolphe
maharaj
inferred
durant
tuba
lutz
hesitant
tre
supersonic
shek
annum
utica
barking
swanson
mcneil
reactivated
toast
orb
rosters
dumas
savanna
rounder
mckee
scunthorpe
benefactor
expiration
blackwood
begging
furlongs
freighter
khalil
throwers
swedes
rockingham
prehistory
licences
toc
sloane
shutter
minesweeper
anc
interwar
overcoming
classis
nicaraguan
cba
sightings
voltaire
fewest
batista
kilograms
populist
percentages
nicolae
scot
befriends
yoshida
rhin
marconi
pigeons
barbarians
gower
roux
forrester
picard
reversions
sutra
observance
rooster
stereotype
mangrove
shoemaker
decrees
poplar
unimportant
takashi
ola
puritan
linebackers
embarrassed
reptile
ons
oecd
concepción
signatories
promulgated
cádiz
pelham
cavalier
aggregation
interchangeable
cellist
ver
autopsy
hutt
reis
lancers
morrissey
lettering
devotional
enclave
dunfermline
spoof
categorizing
reversible
sadness
objectively
interrupt
triples
bandleader
lobos
disarmament
parque
wealthiest
hewlett
sankt
unrestricted
steppe
lui
kuhn
mckinney
redeveloped
zoos
petr
marche
shutout
cutoff
flap
leyland
magma
scorpions
mollusks
plume
esperanza
telford
bellator
sleeves
hajduk
sampler
rocker
guan
sleepy
scotty
blond
fortresses
waldorf
outraged
collapsing
sorbonne
padilla
spearheaded
wyndham
domesticated
maribor
garibaldi
harbours
ashby
cio
visuals
bantu
góra
kielce
gaines
futuristic
err
meteorite
panamanian
axioms
noticeably
macclesfield
denominational
taurus
quarterfinal
dix
constitutions
magicians
dunkirk
multiplied
riyadh
owe
walpole
rei
beheaded
eno
valence
validated
jima
recalling
gunners
cine
melinda
elegans
rwandan
durango
bhutto
latham
deane
volt
reconciled
filtered
bede
sponge
vizier
goran
wrongdoing
deutscher
expires
baptised
sprague
netflix
smells
wheeling
quadruple
bong
saginaw
liza
monaghan
pyramids
hendricks
pembrokeshire
blames
robb
carousel
grossman
introductions
vanishing
fraternal
xin
motivational
rui
preachers
micronesia
herr
juliette
lander
ardennes
aladdin
boolean
flamenco
deadliest
needless
ramakrishna
curaçao
har
suk
shaded
renee
dynastic
franck
clique
adolfo
raging
astrophysics
moffat
dolan
phonology
delicious
usenet
iconography
inductee
endure
nanny
masts
tabernacle
ballast
midlothian
fireplace
clinch
westinghouse
khanate
transporter
stain
inflow
fading
paddington
piccolo
shinto
governorship
sindhi
enquiry
greenish
comptroller
lopes
cons
rds
airspace
arnhem
tay
pryor
courageous
salerno
measurable
stump
gilded
archangel
adana
alkaline
asunción
zoning
gras
vigorously
milling
dwayne
toto
hilltop
psv
acetate
erase
numeric
ascot
hilarious
pragmatic
patriotism
unverified
devonshire
layton
dalian
leverkusen
kickstarter
freeing
emirate
conclusive
waugh
mathematically
arteries
mobilized
pedestal
spoiled
carboniferous
arb
kirkland
propellant
cossack
recurrent
bulbs
cdt
polled
modernity
gastrointestinal
tracker
stimulating
mau
hinton
specialties
whitby
qian
bedrock
inflated
encountering
grille
verifying
stalk
alligator
weld
terminator
souvenir
witty
photoshop
hebron
yeung
ipv
categorisation
fung
incarnations
mpeg
liao
kettering
wabash
pows
fairies
mediocre
tankers
precedents
fillmore
sapphire
angers
gianni
hayley
markedly
downed
rowley
ultrasound
aleksandar
marlene
tarn
hasty
galilee
relocating
northwards
anno
reindeer
champlain
carrington
faroese
whitfield
zimmer
myles
cabbage
stavanger
kollam
pomona
wastewater
emergencies
osbourne
brood
études
hexagonal
oran
barns
mitt
prefixes
smoked
dissatisfaction
pty
saviour
dismay
growers
carnatic
crippled
manifestations
balliol
smolensk
lax
exhaustion
margrave
marques
uncut
pitted
deccan
breeder
terri
optimum
hebrides
asbestos
samara
misc
homosexuals
pillow
neutrally
moro
gino
accademia
howie
stitch
northrop
unfit
dopamine
astana
obscene
gamers
relays
dubbing
lyme
gabriela
loren
invincible
excursion
enact
finlay
ghosh
revisit
byzantines
overboard
bequeathed
headlining
tasman
irb
waivers
breaches
amazed
cosby
uptown
challengers
usn
frenchman
ultimatum
unleashed
dsm
medallion
chromatic
cultura
heartbeat
oda
narayana
outbreaks
lipid
toxin
sublime
curricular
juniper
lübeck
stumbled
undeveloped
kickers
♫
sideways
mongo
outfits
ferenc
scoreless
unequal
ost
salted
somaliland
enzo
bea
upstate
interviewing
contesting
combs
bulgarians
odor
innate
tolstoy
partitions
breslau
discontent
lucifer
cena
riches
tuck
rembrandt
obsessive
primes
sinhalese
starboard
madeline
administering
camilla
compressor
pj
logistical
dio
relativistic
slips
waltham
frameworks
structurally
notifying
piles
inserts
agitation
waverley
persuasion
rune
nosed
ricci
viability
baluchestan
unorganized
seventies
honorific
robyn
interchanges
yamada
passers
pacers
kam
guangxi
roosters
iptv
respectful
abnormalities
hedges
evading
gdr
dewitt
chases
incomes
sharia
kearney
ly
callsign
pedestrians
drifting
hive
arcs
gemma
abbasid
richer
practise
bun
repulsed
whaaat
inconclusive
suffragan
epistle
perch
groupings
latent
partnering
zenith
canaan
hastily
hasidic
larouche
shorthand
ordre
philatelic
swelling
mcmanus
unlocked
naruto
calle
finishers
mangalore
forecasting
asians
eure
spontaneously
pfr
stockings
adriana
fries
championed
blink
shortcomings
stove
congolese
reprints
rustic
yangtze
guevara
screams
volunteering
catchy
niles
payroll
solitude
obscurity
toothed
drones
quotient
softer
bir
reliant
stravinsky
rebound
condemning
entrenched
palacio
cascades
lv
tangible
oratory
howitzer
bona
haus
intriguing
vous
remodeled
babel
burrell
caine
vaccination
shizuoka
horowitz
dimitri
anhui
addict
xia
persists
prowess
distract
wuhan
elites
sendai
hartmann
hwa
stabilize
refuted
tropics
tagore
kenji
scarlett
recorders
cavan
travancore
grasshopper
hanja
lyceum
mrna
hohenzollern
grooves
yeon
progressing
deter
dandy
philosophies
mcclellan
enamel
magellan
penrith
tit
readability
juilliard
tapping
anil
thwarted
gallant
terrell
tycoon
finley
elbe
tod
looted
yaroslavl
tula
moby
bonner
acta
elective
grease
cortés
digby
furnishings
demonic
signify
lick
thinker
bearers
kombat
begum
liber
excelsior
brightly
farmbrough
robber
weinberg
aitken
farc
rarity
rodent
shipwreck
detonated
antics
bypassed
dragging
lodging
equitable
universität
yom
vodka
zane
misrepresentation
mantra
thinner
caldera
angelina
padre
collège
oneida
inquirer
minted
gangsta
agile
boasted
hollis
malayan
conductivity
asymmetric
emulate
scars
scripting
disparate
insular
harmonies
aga
sclerosis
susie
alger
antiques
plenipotentiary
appetite
invaluable
rhodesian
landry
khaled
superfluous
roaring
wrc
jutland
enlargement
horner
devine
gis
dragoon
riemann
timeless
pyrenees
herbie
hanuman
showers
fray
ethos
jansen
reggio
sober
fundamentals
friesland
tentacles
carlyle
liquidation
involuntary
burlesque
outlawed
rehab
é
parnell
mondays
volley
juris
declarations
phnom
haggard
irt
bundles
patterned
nouvelle
istván
accorded
parr
mails
kaliningrad
duval
persuasive
dnp
sugarcane
inhibits
roofed
heyday
appliance
spas
moderated
oncology
kh
diminutive
siren
rajput
washburn
hospice
gwr
iodine
temps
misrepresenting
astonishing
pumpkin
evaluations
divergent
deforestation
usaaf
rents
inappropriately
cynical
swapped
confronting
divert
onboard
oranges
workload
likeness
mechanically
lakewood
cohn
wynne
ebook
swings
menzies
fong
ramesh
enlist
mccormack
replying
bobcats
unblocking
xix
redding
infirmary
chelmsford
renegade
imaginative
tac
fumbles
bests
greetings
photobucket
spaghetti
gamecube
coveted
realising
karan
norwalk
warcraft
pics
bmi
glastonbury
relentless
sion
lorentz
climber
intensely
incarceration
slit
haq
vo
holistic
mather
lerner
ayp
andrade
ester
venomous
intestinal
transcendental
institutionalized
arrivals
palmas
maude
pastry
hangzhou
namco
shorten
sunken
suisse
deferred
sinha
miroslav
príncipe
docked
subculture
tori
pong
stubborn
bedrooms
justine
alden
packer
tomlinson
puma
bongo
substituting
tinker
christened
dwellers
choctaw
amarillo
tess
exe
sortable
paulista
greenock
polka
subdued
spalding
stabilized
embargo
pilar
inefficient
dryden
maastricht
unnoticed
sta
dada
maroons
horrified
smashed
mitigate
dwarfs
nel
likened
korn
remakes
forbade
perpetrator
nader
gaia
grosvenor
disrepair
flanking
grosso
capacitor
costal
monoxide
columnists
recited
mcnally
resisting
artes
lovin
nationalistic
comanche
welt
provoking
revista
yuma
ppm
inconsistency
echoed
falk
epilogue
cnbc
gunshot
winnie
dsc
dario
visitation
seamen
liv
kano
triton
calendars
palladium
olomouc
stonewall
silhouette
awaited
pretends
simons
concourse
mountaineering
impressionist
heathrow
hobson
movable
barges
cervantes
alchemy
tapestry
yeh
mantis
schroeder
hoop
paving
lineages
embryonic
witnessing
discriminatory
farmed
distal
hinder
npa
implausible
cessation
glimpse
conner
bieber
epitaph
inspected
offaly
stratton
conserve
flutes
saipan
lobster
muscat
vikram
logistic
rossini
assent
nauru
sancho
doppler
gladly
altarpiece
py
sidi
noises
articulation
martino
ramat
citroën
thunderbolt
kwazulu
windham
mountaineer
vomiting
roh
courtenay
denoting
distributes
byzantium
gigi
riccardo
informant
felder
mahesh
anecdotes
subsets
triggering
keene
encrypted
bows
loki
limassol
québécois
relinquished
helper
jacinto
ccf
cheerleading
prosper
steinberg
namibian
tolls
suicidal
ordovician
previews
retribution
hardened
mays
avian
manifolds
maidstone
guinean
kuwaiti
fostering
turbulence
streamed
valdez
mendelssohn
politburo
pritchard
snowball
canned
prevail
chores
aborigines
appropriated
jaffa
walkway
surprises
parkland
dso
ros
pliny
toxins
saber
flo
strickland
oates
mello
père
throttle
valentino
towing
augustinian
largo
trademarks
holman
nye
coimbatore
stanza
cougar
hickman
afaik
ajay
shogunate
polonia
overt
sweets
sinks
cornelis
stormy
nazionale
correctness
placebo
ombudsman
patriarchal
hamlin
consular
hakim
garter
cummins
prized
manslaughter
complains
moldavia
standout
subgroups
cremated
siva
kirke
nightmares
leaks
dat
gladiator
hunts
bloomsbury
intrusion
tonal
contradicted
undoing
lullaby
grievances
orderly
conquests
pdc
tranmere
lemma
sleeps
slavia
brant
dina
witt
sunil
socioeconomic
clio
gaussian
skinny
fitch
romanticism
connotations
fractional
tonnage
boucher
dutt
kari
zeitung
olsztyn
dignitaries
eugenio
fisk
potosí
buick
inspections
mum
thugs
florentine
entomologist
amnesia
ordeal
hwang
psyche
stockhausen
manganese
jepson
ilya
monza
gills
serrano
vicksburg
sena
saffron
compensated
nagy
mta
moncton
tait
ecstasy
wreath
warhead
ashkenazi
blyth
ruse
pivot
lament
jagger
finer
arista
damp
bridging
forbid
gpl
cagliari
whorls
troubling
entrepreneurial
footwear
seeming
hoyt
allergic
inertia
plumbing
courtship
fractures
bibliothèque
georgi
assimilated
bela
hadrian
cornelia
complies
dishonest
leben
ado
keenan
facets
unregistered
retractable
fated
forgery
anonymity
gabe
shelly
chand
paranoia
luge
saud
electorates
deprivation
farmington
aik
practising
microbial
emit
unproductive
bunting
abstracts
baum
thumbnail
earldom
deteriorating
bateman
disapproval
blackmail
mikael
mitigation
minuscule
priestley
attila
spying
tamworth
clerics
ollie
dredd
goya
aisne
mash
restructured
interscope
boxed
richly
bruges
woodruff
diffraction
coll
openness
couture
waitress
española
futile
superstructure
handler
offshoot
carrera
moreau
accelerating
miserable
sher
mahabharata
sportscaster
slowing
driscoll
colonia
loretta
conspirators
looting
djokovic
steaua
conscientious
knighthood
granger
paxton
pir
lig
juárez
cornwallis
umberto
gillette
pokemon
overlord
fascists
bessie
salomon
maurer
obscured
liners
codified
hester
sama
clément
joplin
safeguard
infielder
skirts
fertilizer
surat
adolescence
opaque
sandman
levski
hue
neuron
flourish
grassy
melancholy
bonuses
cor
tat
solemn
chants
wadsworth
shimizu
kristian
agustín
explorations
digitized
boosted
strap
batsmen
chancery
thigh
shepherds
blasts
novosibirsk
crat
cursory
renounced
polaris
effected
lichfield
shrink
tcu
belo
brew
barbecue
pauli
biochemical
tending
reichstag
jfk
britons
electrodes
pharmacology
montoya
radars
gw
debatable
jingle
rajya
widest
coronary
sidekick
vas
sagar
gliders
soares
sporadically
podcasts
thunderbird
revamped
baskets
zeit
lesbians
devin
worsened
comoros
sexton
starling
blount
loughborough
bowden
paw
electing
unfavorable
tongues
nihon
alvarado
lewes
gaze
regrets
fearful
seanad
indra
pdp
diminish
yam
varanasi
shrinking
burnside
adversary
protracted
liberian
amur
gheorghe
sloping
romances
intuition
punishments
lobo
cctv
luminous
scrum
roommate
ftp
mayan
blum
pep
thayer
pilasters
overflow
stacks
albright
negros
vinegar
embark
undermined
helene
forwarded
elgar
crichton
rea
miraculous
mott
lucian
accustomed
lucerne
bronson
diy
epidemiology
jurists
mitra
overthrown
carcinoma
arnaud
pugh
grunge
pls
assemblyman
escarpment
calibre
rollers
lago
administers
shipyards
yee
aggravated
westerly
seasoned
nazarene
bassoon
magneto
mentorship
seizing
contextual
brightest
ravine
snp
capitalize
shakti
bleed
savior
sensational
dae
accommodations
finder
nuisance
pedagogical
familial
een
savoie
enix
antibiotic
conquering
derelict
dismissing
farce
scoop
collide
transnational
australis
retract
hoy
publicist
platte
fédération
claimant
surrealist
belleville
flowed
pretext
kosher
gadgets
pavia
gorgeous
fps
gurney
condemn
<
ealing
hopeful
foreigner
infringing
illustrative
celsius
hobbies
faiths
italo
katowice
nbsp
heisman
fraternities
retitled
downloading
riggs
camino
assamese
parliamentarian
grossly
antennae
hardship
discrepancy
stalker
drilled
atv
lukas
piping
gleason
shillings
idiom
kuomintang
shaman
bsa
lsd
atrium
nypd
crore
terrence
misty
equestrians
gian
pyotr
completeness
kyushu
epirus
grantham
civilisation
sturm
sorties
lest
punched
pastors
marston
firmware
cistercian
borden
professionalism
amg
svetlana
juliana
accommodated
musicologist
reboot
visas
ageing
leandro
inertial
textures
monash
pods
outfield
blurred
mundane
innes
harman
ids
billiards
constructors
knopf
agnostic
conformity
yup
lymphoma
ppg
inward
stewardship
xxiii
bogdan
multidisciplinary
thermodynamic
envy
girard
projector
loneliness
hubs
behest
detachments
uw
britton
nga
cutaneous
trunks
lends
melton
spins
starch
carvalho
kessler
patrice
estado
ashamed
feyenoord
cooks
awakens
ontology
hainan
fairview
ltte
hmong
streaks
activates
decreed
dreyfus
ning
stabbing
slaughtered
robes
‡
clipper
firefly
fansite
deceptive
circumvent
sonnets
gia
dispatches
gator
madman
dime
tanzanian
respecting
cloning
chisholm
universes
matheson
translucent
observable
malice
sidewalk
mover
dag
ephraim
smokey
psa
noms
sired
avi
fugue
hautes
shiraz
removable
dall
suggestive
vauxhall
disciplined
matsumoto
malignant
hoboken
minden
barbie
voicing
konami
resigns
conti
hof
entourage
chemically
storeys
murat
fiorentina
trios
viewership
tectonic
shearer
fable
deployments
nps
exporting
patriarchs
behaving
stooges
alban
galleria
profoundly
mato
brute
krzysztof
avenger
overshadowed
recreating
positives
unaffected
scooter
glazed
nearer
pedagogy
interconnected
shack
tq
rabbinical
subterranean
hominem
raman
slack
mirrored
chopped
probabilities
sanity
kelsey
gecko
multiplier
shook
chhattisgarh
eaters
supervise
phylogeny
schmitt
rocha
wastes
cavite
daniele
mcgovern
scripps
kedah
cristian
jacobsen
stevenage
necessitated
inducing
objectionable
subaru
reacting
razed
beware
pip
grail
dehydrogenase
barrymore
fergus
longhorns
subsections
ez
vicenza
toole
eisner
spindle
ast
heineken
egregious
scotsman
ossetia
islet
smiths
lehmann
interacts
cleavage
radford
subordinates
clausura
inconsistencies
ffd
cx
disgusting
bale
warszawa
disregarded
compartments
aiken
nightclubs
pepe
instigated
deduction
woolf
mcculloch
isthmus
septa
edmunds
tearing
maxine
lisp
operetta
tadeusz
baseless
senatorial
pacheco
♠
notoriously
leech
interpretive
oahu
nitro
tver
nighttime
bolster
stereotypical
cowell
aeroplane
kipling
hillman
thorax
iwo
harlow
delegations
propositions
midday
tributes
zhong
summarizes
sculptural
lugano
speeding
swine
weeds
molten
kanpur
critiques
weinstein
angelica
stepmother
predates
ganges
adept
ley
flourishing
crocker
butterfield
devlin
keaton
locust
hain
pharmacist
cordillera
parodied
allegory
lair
conveys
afternoons
huston
figueroa
macgregor
kml
greta
interacted
floated
moreton
margot
donating
malagasy
quaternary
pdt
bluffs
purana
ordinances
budd
oi
hangul
infiltration
mcgowan
anu
fiddler
exerted
dissidents
gaz
andromeda
mould
ig
mauricio
hitherto
actionable
hempstead
monique
vat
milner
nylon
harriers
sharqi
geophysical
shogun
wick
cyclops
mcclure
kazimierz
georgina
indycar
roque
purportedly
receipts
yokosuka
alchemist
recoil
tentatively
charente
pesticides
graphite
sounders
mantua
typeface
lees
liberator
intergovernmental
departures
defer
shelves
tricked
parchment
hindered
saturation
kemal
britt
oed
organisers
miyazaki
amr
miki
paola
vg
jive
alastair
gathers
parcels
originator
medellín
mouths
uav
mecha
shun
reaper
pneumatic
mace
ecole
hijacked
melo
msu
pudding
seasonally
quark
quintana
recovers
filler
bungalow
elusive
aqueous
consciously
subtitle
nanotechnology
bac
zamboanga
tiberius
convocation
barth
crc
loaf
dashboard
kaiserslautern
squirrels
akita
pens
carpets
duquesne
hama
petrov
pentathlon
focussed
poorest
bowles
beauchamp
tripura
seinfeld
oblivion
jams
sonnet
noodles
frieze
consequent
eastwards
charmed
mrt
doomsday
synthpop
hormozgan
appoints
treble
braxton
vin
treviso
esquire
bergamo
eighties
wolfsburg
durand
normans
fittings
waiter
veiled
subscriptions
malt
brandeis
clicks
chanting
stints
socialite
pickett
transplantation
brothel
meek
koi
surya
waterman
dyes
wittgenstein
chatterjee
kal
stemmed
pashtun
upscale
khrushchev
roper
excise
raza
ahmet
farah
eriksson
methodists
arad
madam
seán
mullen
midget
figaro
northerly
sault
yorke
jaffna
grenadines
bearded
passerine
dearborn
confucian
zac
tis
roswell
ignited
fascia
hatton
industrialization
rabbinic
dangerously
bitten
gare
dubrovnik
puja
helms
thani
pathological
richelieu
exploiting
rouse
infiltrate
mexicana
rectangle
chn
darfur
lorne
zia
cursor
supercup
brokers
smear
spartacus
hardness
mirren
sucks
beginners
bleach
cadre
ducal
sulfide
millard
organiser
resumes
chunks
lina
stare
arresting
humanism
deb
yat
dowry
cultivate
megatron
overlooks
totalling
meir
alder
waterhouse
gambit
ebenezer
anomalies
dole
superstars
boardwalk
chippewa
fandom
conte
andaman
valentina
traversed
spacious
concussion
soleil
pests
subunits
prosecute
eucharist
elise
haze
penetrating
haplogroup
gutierrez
mysteriously
synchronization
renown
marlowe
breech
bandung
bulge
crunch
nürnberg
predation
holliday
hypertension
pasta
crouch
spacetime
flick
undergraduates
banu
drexel
edmonds
untouched
herds
disconnected
unsubstantiated
carriageway
vert
sinaloa
eastbourne
extradition
truro
pennington
flashing
unconditional
sideline
rockers
kool
immersed
delphi
fredericksburg
sterile
fours
iloilo
zi
retina
bess
castilla
peacetime
mcmaster
audubon
wrestle
aaf
gansu
luce
unremarkable
exacerbated
cfb
stardust
mishra
surfers
articulate
fuentes
welded
takeshi
flaps
burg
kaohsiung
zoran
belinda
objectivity
dames
gunboats
chomsky
longford
waving
pskov
habitation
rubén
weil
ethic
messengers
disperse
tort
npc
animators
eliminates
blackstone
coney
syd
campania
vlad
commuters
hawkeye
golan
circumference
bv
optionally
carmine
rajendra
conveniently
borges
kirkpatrick
bakr
collin
jacks
fundamentalist
yap
stewards
catalogues
inspect
marlin
lowercase
definitively
devastation
leander
frisian
obelisk
yarn
famer
langford
vaults
nasl
knitting
cohort
gabled
crosse
birkenhead
bartender
migrations
dialing
profiled
unmarked
swearing
landlords
endorsements
rtl
gabriele
singularity
corinthian
anthems
legitimately
kuban
playful
arf
thakur
dn
toowoomba
stair
rewrote
whigs
vere
allusion
gayle
demetrius
hypothesized
bandai
gopal
durability
mahoney
mst
reinhold
makoto
offseason
mains
steak
validate
exited
dictate
boomer
duma
warped
bypassing
professed
sx
gimme
garza
tonic
rockin
pelvic
westland
mormons
disillusioned
velasco
screws
livorno
kimura
bennet
setback
geffen
knapp
cdr
joking
ath
realignment
densities
laptops
apis
coherence
brownsville
interviewer
bh
canyons
schuyler
wmc
appraisal
snowden
primal
penh
whisper
voronezh
reconstruct
hauptmann
shelved
moorish
rawalpindi
escalation
kar
portrayals
aida
bette
ares
litres
transcontinental
parlor
leblanc
lapse
forage
unjust
chameleon
evaporation
sandoval
brownlow
shocks
discharges
charlottesville
mathis
analyse
masse
mancini
cornice
sprung
bethune
pang
xvii
siamese
dubuque
chronicler
embroidered
sei
campground
alp
rajiv
haywood
thanked
cheerful
millet
ribeiro
vet
beckham
albedo
wylie
lemur
thug
henchmen
dis
delft
courthouses
calories
herodotus
tame
bribes
montfort
behaved
brecht
isnt
gatehouse
shoals
spiny
harbin
warmth
boleyn
predicts
fawcett
wicklow
tribunals
disgrace
casas
ashram
jiangxi
aroused
dekalb
celine
tiers
almaty
fff
sphinx
brigitte
inhabits
bhp
talkin
disgust
gambler
fluoride
fasting
lovell
creationism
obtains
meteorologist
greenway
welcomes
offend
admire
monologue
johnstown
weller
pandemic
romagna
kidding
pascual
lump
depots
architectures
bazar
magee
symbolizes
quarrel
aristocrat
defamatory
embroidery
kami
plentiful
flc
uyghur
cushing
decidedly
schalke
kriegsmarine
utmost
colder
tyranny
astrid
trooper
alienated
perrin
mondo
ballerina
castilian
etruscan
rook
jermaine
rightful
simms
gharbi
ronaldo
offenbach
hye
conclave
recess
shortening
chevy
¥
vandalize
shouted
mediaeval
culmination
palgrave
lotte
cloister
geiger
daredevil
pacifist
moser
mérida
purposely
sagan
misspelling
rooftop
tabriz
thrones
danilo
bobsleigh
loma
autobots
tendon
degenerate
franca
boil
templeton
corcoran
sighting
erratic
limbo
subscriber
fiercely
stanhope
arno
vigilante
acoustics
angled
delegated
coltrane
thunderbirds
maldonado
baer
henchman
mj
bantam
deus
ritz
makeshift
diligence
pouring
piety
suitability
leopards
twisting
rambling
tambourine
bicentennial
hsu
jug
undivided
grizzly
amphibian
commute
acknowledgement
adrienne
virtuoso
karol
youngsters
fantasies
usability
theses
rounding
refute
deploying
exempted
yell
cdu
justinian
pardoned
grandma
ouest
cornet
tompkins
brentwood
enlightened
ananda
memberships
coruña
amplified
stucco
annan
cathode
sentient
voiceless
playlist
hurting
empathy
wollongong
chapelle
mtr
sender
topo
weakest
conjugate
requisite
galactica
suffixes
fulfillment
acb
ze
suppressing
iec
hawley
conflicted
farnham
favoring
rahim
crewmen
monstrous
summarizing
alia
tapped
mervyn
terrific
orator
aggies
ligands
oviedo
koh
motte
caribou
wildcard
niall
intellectually
aisles
reassessment
accords
landscaping
bends
hurler
vb
disagreeing
schoolboy
faux
coulter
mersey
eberhard
chewing
mcrae
macfarlane
junkers
marburg
nathalie
resorted
fiancée
northumbria
lakota
subtitled
protons
burying
conde
bro
blends
yew
misplaced
flamengo
neurology
overlay
pence
utilised
drayton
notables
uneasy
illawarra
generosity
rin
fonseca
mocking
orchestration
bracelet
tramways
murad
acs
percussionist
watercolor
nasr
pahang
kiribati
homemade
extraordinarily
daegu
mich
gilberto
bnp
barbed
chaser
blueprint
alveolar
svalbard
barefoot
medford
bounced
kenton
daleks
gangsters
atc
fanfare
seam
caa
canine
geocities
mak
penetrated
enhancements
belles
himmler
goth
luthor
petri
chromium
tamaulipas
alluded
hereby
rajesh
ire
rockland
cabo
gresham
steen
tundra
bribe
kita
kwok
chimneys
indifferent
thinly
fink
harrier
elsevier
stapleton
coping
tiling
hardships
enid
westernmost
advantageous
exert
stanislaus
iupac
missoula
goo
erstwhile
oscillator
needham
friendships
piraeus
monorail
granny
arthropods
unorthodox
sumerian
fatherland
postdoctoral
miley
silverman
apprenticed
oratorio
dynamical
krause
alianza
somers
dijon
editorials
khulna
tossed
gilchrist
unbalanced
reestablished
nested
irvin
volcanism
bianchi
stead
duality
refinement
quid
cfr
dives
wander
prosecuting
townspeople
blofeld
amounting
diets
admirable
isi
sss
carpenters
suleiman
surmounted
complied
majored
hummingbird
kagoshima
cutters
bisons
rigging
marred
aia
scarcity
ishikawa
yahya
recognises
sprang
fil
consuls
pleading
assembling
nomads
plummer
undeleted
univision
rupture
goode
mullins
nineties
referral
existential
bullpen
atherton
sanger
aborted
visconti
harare
sane
gully
orpheus
spotting
assorted
belgaum
colgate
tardis
atatürk
negatives
replicated
stairway
numeral
bum
potentials
werder
magnets
sewell
lr
deficits
kira
bestselling
applause
treatises
renomination
anemia
lila
ornament
sidings
entails
wilhelmina
serenade
neustadt
diarrhea
pierced
llewellyn
vicariate
timmy
unfairly
unanswered
tokens
kazakhstani
whitewater
greedy
hammerstein
inactivity
bendigo
göteborg
gifu
infer
wrapping
abingdon
shackleton
dreamer
hajj
bushes
biscay
hertford
quarantine
exchanging
boldly
concensus
amen
dacia
dms
gol
klamath
fax
bondage
arable
pineapple
constituting
reconstituted
ypres
tsn
fen
twickenham
evanston
minesweepers
netting
utopian
redistricting
cui
hindustani
careless
stoppage
rachael
terengganu
scissors
synchronous
heywood
counsellor
sps
ribbons
cortez
melodrama
aleksander
matchday
seeker
invertebrate
belongings
businesswoman
wordsworth
nearing
pursues
steamers
nada
villarreal
rotunda
shakespearean
unwarranted
shiv
lombardi
alonzo
mulligan
qasim
melee
finns
castor
airframe
soaring
swam
interruption
zhi
polarized
whence
qazvin
goiás
holotype
linemen
regionally
succumbed
underwear
extremity
mahdi
eur
wittenberg
screwed
steroids
passer
igbo
ababa
heater
missy
chah
boulders
motocross
drifted
yehuda
estero
recursive
tasting
mundi
discomfort
spinner
hyatt
vélez
bumps
locator
manley
walsingham
passwords
vehicular
priya
trimming
buildup
mexicans
cried
barisal
sidelined
cohesion
lewiston
masterpieces
bottled
galen
transliterated
profiling
lind
hwy
pervasive
advertisers
hasbro
brahmins
cirque
shaken
bernadette
goin
rosh
apologizes
cpr
ivorian
firewall
bower
música
apse
simplistic
lupus
felice
schwarzenegger
jawaharlal
cough
mikey
amazonas
kennel
nakajima
timeslot
reappeared
verdun
occupancy
kayak
jos
lps
mille
throughput
briefing
antitrust
altman
lyrically
nui
neapolitan
undefined
ile
kashmiri
louder
retreats
buffaloes
winkler
urinary
ptc
johnnie
uniqueness
anglicans
michal
extravagant
substantiate
goliath
landon
iit
induces
blinded
upbeat
dispose
ibis
košice
jnr
sickle
ordinarily
unnatural
containment
mccabe
campaigner
misspelled
vivekananda
analysed
reiterated
possessive
ascertain
overcame
barkley
markov
bobo
petrie
swann
plead
avn
sandhurst
minami
aegis
condensation
picket
ventured
derailed
wifi
specialises
reminding
stork
amplification
vacancies
furlong
staudinger
emphasised
discredited
sarcasm
electra
tho
voss
jiří
milligan
kandahar
sevastopol
unintentionally
hostess
invitations
sloppy
oldfield
mekong
levied
swallows
stalingrad
bree
timur
southwards
complemented
parrish
heinemann
mong
hegemony
tabor
willingly
intolerance
imitate
honeycomb
dewan
lockout
rallying
nucleotide
xd
gaetano
puzzled
hacked
isidro
kart
babcock
guillotine
mentored
alluvial
mie
garnett
kitchens
immoral
westbrook
kor
dinah
earthly
overkill
maa
smythe
oss
rcaf
comprehensiveness
predictive
strives
nemo
kabir
jardine
caveat
celeste
apprehended
expel
sizeable
tenders
alicante
extracellular
paget
replicas
disambiguate
tamar
whitecaps
negeri
palsy
delaying
hens
leger
samar
euclid
jpn
ennis
vinnie
libertad
unconfirmed
omer
auctions
ofsted
attendants
disproportionate
assortment
drank
sco
linus
justifies
forfeited
armageddon
refresh
maury
portfolios
pinus
shutting
normalized
sheehan
optimism
pershing
oppressed
weavers
vagina
kieran
cajun
distressed
outsourcing
freiherr
seriousness
annihilation
cpa
lagrange
chic
supérieure
gens
stag
std
underage
formulate
escobar
discourses
parton
nikos
brutus
pooh
hyperion
hollyoaks
proclaiming
manors
seaboard
chechnya
winfrey
tully
schofield
ide
gh
rommel
pleasures
coupé
mosul
mbta
jérôme
flop
núñez
adhesion
vidya
uprisings
ezekiel
nkvd
psychiatrists
waite
merrick
tyrrell
mogadishu
jars
massacred
provo
precursors
muay
arp
irma
niño
benoît
ott
misconception
frenzy
decommissioning
handwriting
saad
easterly
stacking
knoll
aguirre
atelier
nascent
bernd
paler
profanity
imposition
intermittently
implant
kafka
stowe
heist
pasting
msa
congratulations
kwon
telephones
sungai
hydroxide
viscosity
scarcely
vents
qaleh
gaspar
codename
obligated
spicy
mainstay
shahr
whelan
pathetic
voor
thelma
trotsky
columba
ladd
upn
hickey
genevieve
recollections
hegel
juneau
revere
confessor
housewife
kitten
discriminate
estrella
amphitheatre
polymerase
batu
plugs
oppenheimer
mya
parting
tomáš
flavored
contingency
inaccuracies
fulfil
tennyson
silica
peirce
gorbachev
adil
goodyear
hrh
almighty
dread
lonesome
björk
rewarding
cheeses
cartesian
parapet
sora
bara
planters
segal
barclays
thrilling
seaport
stara
tutelage
boswell
joon
chronologically
leno
tennant
wis
preparedness
landis
khanna
lingua
gorges
fragrance
cider
awa
mehdi
netted
serotonin
stew
hofmann
eugenia
eros
flaherty
giulia
excursions
compounded
hardwood
wye
causa
emitting
caters
acharya
unbroken
tomasz
pesos
soapbox
intestine
dao
keyword
gent
lethbridge
tromsø
dedicate
indistinguishable
ams
harmed
bets
amish
corfu
admirer
rhinoceros
gpu
poke
coinciding
vigil
prosecutions
approving
brugge
hinckley
siècle
hades
dogma
zeal
raspberry
objecting
misused
nikolaus
fetch
leibniz
gibbon
spore
snare
acer
oswego
bahasa
guilford
depletion
smartphones
conduction
shree
yorktown
rentals
yamagata
acosta
smyrna
fullbacks
irons
eid
counterpoint
euthanasia
dunk
exposes
rsc
regia
tungsten
groeneveld
jános
greatness
easternmost
healed
subsidized
belvedere
unplugged
ultralight
wickham
cooley
bef
mastermind
episodic
bouncing
obispo
ravaged
planter
mesoamerican
quill
dislikes
assisi
fireman
civilized
jayne
bedouin
insofar
abbess
faire
handgun
cordon
decorate
alarmed
josie
shaikh
wavy
wikileaks
intimacy
homology
synthase
envisaged
jag
mio
knut
josip
seren
dolby
scrooge
pompey
lancelot
bennington
diode
slum
presbytery
solves
friuli
balances
seri
twists
sus
calibration
shutouts
harbors
louth
paralyzed
harrogate
outlining
vallejo
penthouse
lippe
afflicted
councilor
shines
vieira
thunderstorms
beginner
lech
ensues
pai
airstrip
olfactory
noticeboards
corbin
reinforcing
ammonium
allegorical
dowling
forging
gainsborough
eth
decider
coo
pigments
paco
curators
transitive
leaking
wagga
volts
participatory
hansa
mutated
schoolteacher
gays
symbolize
parishioners
coahuila
ecoregions
propagate
asterisk
undercarriage
crackdown
cunning
fringes
corrugated
shaughnessy
alexey
peaches
cohesive
wig
namur
distilled
solstice
adhesive
subscribe
dunham
tackling
nephews
cyanide
intertitles
downwards
polio
hitters
xian
estimating
mosaics
caspar
caricature
mep
firepower
cupola
friendlies
contour
criticizes
personalized
ayers
bodybuilding
tackled
equip
vienne
anderlecht
humiliation
usmc
funerals
hawking
provenance
decisively
withers
unjustified
jennie
swell
freemasons
sherbrooke
messed
imp
thebes
implants
conceding
wiggins
viennese
keegan
encode
dalhousie
slc
plated
cro
concave
fruition
skateboarding
diminishing
mawr
rabat
lorna
persepolis
koenig
enshrined
ralf
jumbo
slander
chaparral
reels
immanuel
perpetrated
cale
workman
anatoly
auctioned
healey
comrade
disparity
ufa
brides
littoral
kangaroos
awhile
mou
allusions
extratropical
imran
jenner
typos
seq
progeny
tilted
madsen
wont
schenectady
globo
barbour
kelantan
hochschule
stripping
mancha
necks
mesozoic
infect
jeffries
kiwi
uncanny
lutherans
disobedience
valery
aditya
exemptions
grammatically
dreamworks
filip
kandy
coincides
brittle
bangla
normative
recurrence
pn
windmills
vanish
xxxx
gunman
mixtures
camels
yosef
searchable
freedman
foss
usages
rancher
radiator
repelled
hoard
nizam
cline
imbalance
retake
blackbird
weldon
visibly
fonda
shelling
fab
puccini
rwy
bullied
atypical
rudimentary
hiro
bevan
quail
moseley
extortion
councilman
salam
xie
wil
remo
cameraman
oa
pcr
vio
snapped
nee
playa
leavitt
neath
integrates
barrios
blocker
casanova
yar
tatars
panhandle
coppola
nadine
repay
janis
furry
stallions
gehrels
dnipropetrovsk
rudi
allman
snapshot
terriers
adnan
shakedown
domed
hendrick
diptera
learner
latitudes
satanic
luckily
tajik
steamships
weary
smu
gated
organises
lindbergh
desserts
godavari
larval
elmo
howrah
handley
aceh
winifred
hypocrisy
crocodiles
roadways
ethyl
spector
pim
hindustan
acp
einer
organisational
chiropractic
reinhardt
fuego
deriving
yarra
num
dope
sabina
sumter
tubing
recognising
horribly
controversially
interoperability
severin
surrogate
villanueva
reformist
tampere
staunton
montes
bane
thon
dumps
foes
functionally
fleur
devonport
cortes
autosomal
cooperated
berne
frees
estudiantes
arya
essentials
pushkin
warred
pantomime
threaded
ans
surgeries
linton
unheard
cockburn
pavilions
asin
nominators
chatter
rumoured
storia
hossein
conduit
wheatley
kottayam
hamiltonian
weiner
romantically
strategist
planetarium
romana
tsv
lansdowne
kumamoto
persuades
banff
pemberton
ruskin
streisand
bolded
genesee
maloney
misinterpreted
misinformation
tbd
girona
arles
myra
fixation
suitably
errol
slew
vodafone
banat
overweight
incline
chamberlin
uniformity
hacienda
reviving
alienation
ionization
gretchen
beagle
biscuit
colossal
dum
nok
kinney
bruckner
expressionist
wolfram
incremental
bahraini
cords
ibiza
gulls
luminosity
needy
wsop
egerton
carex
ionian
lua
bleak
wielding
infante
awe
tray
vm
townhall
tur
intracellular
lar
ramadan
keyes
midpoint
manic
cemented
distillation
tulip
qatari
freeware
rinpoche
lms
aki
walmart
intermediary
attributable
felicity
petro
footpath
naxos
snacks
dues
hoosiers
henceforth
woolly
hutchison
pitman
irresponsible
sobre
formalized
improbable
defaults
dv
georgios
lengthened
conroy
miyagi
mcfadden
robo
stylus
empower
macaulay
unsatisfactory
bayonne
mischief
leila
annoyance
buds
tout
combating
nonexistent
chanel
ronan
burrow
fruitful
biscuits
microprocessor
disgusted
bottoms
spock
fer
habeas
speciality
cantatas
bradman
clovis
jeanette
tlc
refractive
isd
morecambe
kursk
johansen
monticello
concentrates
annunciation
alkali
facsimile
podgorica
comprehend
catalyzes
acadia
thanjavur
preclude
meetups
csx
estelle
embroiled
stearns
jie
kinder
stocked
epistemology
anyhow
sidaway
kamakura
encodes
recommissioned
goodnight
casing
carrot
escalate
summa
maia
zvezda
skelton
imagining
hertha
fairest
cronin
recollection
paley
leela
abilene
archiv
iranians
baiting
plover
kissed
parallax
primo
pretender
ramayana
minimalist
recessive
albatross
grandpa
frazer
footprints
rebuttal
stardom
ditches
forfeit
jeffery
seminoles
unusable
ust
biblioteca
sharjah
forbids
oo
vegan
canoes
fostered
zh
abridged
prayed
brenner
sticker
livre
interfered
skirmishes
esteemed
labelling
laurier
resettlement
comical
whittaker
redirection
menus
gilman
fiberglass
ashe
disputing
chilton
herod
rub
networked
indore
jainism
petrel
doric
josephus
fdp
spinoff
keywords
incest
bobsledders
kaufmann
propagated
intrigued
sakai
billionaires
saarbrücken
gendarmerie
experimentally
catechism
garda
avril
distraught
invoking
muppet
toolbox
genghis
adventurous
mandalay
shao
nordland
hippie
agostino
teri
bani
dynamically
downgraded
bnei
grandstand
sander
brahman
unrealistic
más
preventative
poisson
bock
ludicrous
osijek
volgograd
agility
ov
thoroughfare
tongan
calypso
activating
sinhala
shakira
watertown
conspire
nausea
mackintosh
gmail
litchfield
zhen
woodbury
mane
concerted
genomes
carts
mansur
olives
rushers
homelessness
apoptosis
cosmo
furnaces
anas
steroid
tama
illiterate
leakage
terrified
bos
polluted
aquaculture
azteca
tuscaloosa
ecoregion
informer
aerosmith
romain
excuses
vandalised
adhered
mahogany
chaco
populate
grotto
footbridge
bourke
bamberg
hmas
extracting
sandro
lured
comédie
astral
adirondack
secretion
layouts
dragonfly
druze
thorns
pollutants
afar
bsd
consorts
dowd
inquest
paratroopers
summertime
schuylkill
lata
herbarium
incense
agony
aptitude
sutter
coda
einar
grenville
dawes
birdlife
messing
henryk
fermi
descartes
figurines
reruns
thoracic
etching
tinged
jojo
organists
equinox
censuses
valea
urn
hitch
yeats
nicki
fief
tarot
deutsches
vipers
hug
higgs
seminaries
schwerin
carta
headland
erfurt
umayyad
iona
dooley
uncles
swore
ancona
guizhou
mönchengladbach
vogt
motogp
bmt
ohl
beggar
cunha
moog
whitworth
hanks
capacitors
tattoos
findlay
accrington
aung
herschel
merthyr
webcomics
testosterone
confrontational
ebony
skeptics
minster
waged
vos
fyodor
tangled
signifying
magi
foote
vázquez
larissa
samir
branco
choreographers
corrective
escuela
hoops
fortnight
alludes
ascend
soaked
hélène
tutorials
ironclad
sweeps
hillsboro
autistic
allergy
prout
niches
chinook
medusa
northland
iu
balboa
joss
praha
philologist
cyclo
shortcuts
overtly
ovation
goofy
heterogeneous
hotchkiss
thrice
royalists
vologda
toei
theatrically
decompression
blender
kimi
paleolithic
figuring
sed
josep
leprosy
suborder
porcupine
bugle
unintended
complimented
rioting
widnes
pliocene
latency
haag
coils
superbike
doña
catalogs
audible
tipton
ornamentation
empties
pooja
carolingian
ovarian
nvidia
garnering
gerd
indifference
ricans
maeda
sourceforge
khl
mindset
smuggled
payton
biggs
solicitors
bhushan
spectre
harem
khartoum
selectors
proudly
germ
childs
baruch
seconded
hsbc
interfaith
nijmegen
lackawanna
omitting
altai
constructively
ven
lifestyles
pyongyang
stoker
honoré
ripon
kiln
günter
viktoria
elst
longing
rake
lycoming
trax
dara
sita
amit
poseidon
axles
taranaki
¤
neuronal
teh
revolts
attica
épée
tuttle
blythe
castleford
salinity
anecdotal
horrific
fifties
puppeteer
mus
jma
oversized
hush
striving
peril
azam
halen
iyer
adriano
atalanta
cuenca
iba
fiancé
rumour
magpies
clarion
tonkin
metcalfe
ncr
classifying
transept
aya
embedding
idris
harwood
sledge
stirring
interpolation
newington
rein
humphries
fetish
starlight
wand
mocked
inglis
vending
fad
sparkling
bhai
mcallister
stationery
sirens
mormonism
dreamed
ziegler
dist
antônio
gila
sree
layman
computerized
ramona
desai
tillman
battered
curricula
tub
hardback
priestly
perseus
bombarded
jax
interchangeably
swinton
seawater
lrt
passaic
fao
márquez
fermented
environmentalist
hydrocarbons
lecturing
recount
verma
unifying
vader
nicol
emailed
tuscan
watergate
punishable
molar
adc
precautions
charisma
irregularly
nxt
homeopathy
grands
genomic
gentile
natchez
connelly
anglophone
cairn
sse
hyman
embraces
bowed
caffeine
patently
neu
unhealthy
sudamericana
darkest
hamish
metallurgy
amon
sore
mallet
burgeoning
ck
tra
moira
californian
texan
harz
chou
jk
elvira
overload
polity
homeowners
toi
behaviours
chairmanship
incubator
kip
festive
shivaji
dwell
condé
guilds
farr
chernobyl
pore
stefani
curl
asl
hipped
fictionalized
gunther
metaphors
polygamy
deem
microorganisms
entomology
lancet
academician
uranus
vaginal
npb
bayreuth
taman
orienteering
scribner
hamadan
malleus
meetup
carinthia
neale
ghats
charger
paulus
ridership
knit
animosity
sayyid
methodologies
frye
resilience
faraday
cafes
acrylic
dictates
retrospect
heh
wrigley
desks
sympathies
gauteng
clustered
drills
cristóbal
gaya
outback
mcneill
programmable
henrique
redman
projectiles
tainted
beresford
minimally
aye
hindenburg
sedgwick
austerity
centaur
michał
wiseman
adrien
repatriation
bora
piet
sia
deformed
á
famicom
playmate
verbally
grotesque
vaulted
delia
fernand
excommunicated
aif
mercia
arras
lonnie
medvedev
zambian
sandwiches
elegance
abortions
wallpaper
scottsdale
leeward
fuss
atwood
plywood
casket
lhasa
bulletins
kingship
toolkit
plenary
unreal
lieberman
fractal
sikorsky
grandes
bowel
kline
huguenot
luv
transistors
bibliographic
caitlin
uniformed
carlin
cupid
zorn
annotations
seeding
cognate
samaritan
narrowed
scrapping
emulator
sakha
krishnan
ironworks
drown
jed
groupe
pup
bunkers
harshly
wineries
prospered
outcry
catchers
beattie
beneficiaries
pom
sackville
oscillation
boron
tutoring
asbury
reappears
sensibility
harden
pfeiffer
rennie
collectible
chapin
desperation
caro
persistently
gauntlet
gob
woodford
bigelow
ucl
adversely
pedals
chappell
première
wordy
xt
voip
salty
slabs
overtaken
markazi
frau
directives
chien
cumulus
afloat
creepy
genitive
ceres
malian
clough
cambria
ard
htc
springboard
ayatollah
contradictions
rainier
questionnaire
ais
seduce
duckworth
sooners
catchphrase
unwin
rockhampton
universitario
whl
distrust
tramp
faked
grabbing
condominium
dundalk
racks
versed
cubes
expressionism
compagnie
torrent
hesitation
narendra
interprets
minaj
soma
gbr
phrased
quirky
shabbat
observational
apollon
delano
husbandry
disposable
outpatient
amis
maris
regaining
adoration
akhtar
jw
resonant
vasa
homophobia
mayflower
grundy
slovan
stv
cay
beneficiary
technicolor
minimizing
fifths
tariq
trawler
guildhall
branson
charleroi
trentino
fared
topographical
ormond
carew
amphitheater
standby
imax
predicate
impromptu
filtration
ovens
hemp
aspiration
conventionally
panned
demolish
vladivostok
hamad
savy
ruff
revitalization
thérèse
liebe
morten
zara
uxbridge
maronite
headlights
toed
ofc
reorganised
snippet
bernice
ashraf
montclair
dressage
asu
bering
issuance
cantonment
musketeers
agnew
midshipman
renée
métis
forked
henriette
attributing
artisan
opting
motorways
banerjee
shay
hitachi
yeo
nunatak
mailed
skeptic
importation
chekhov
sacrificing
multilateral
vassar
xe
aldridge
mendel
eyesight
vijaya
macedonians
tweak
haider
repel
fleece
livermore
typewriter
harlequin
deletionist
masons
duggan
feats
osnabrück
germania
longman
inge
intrigue
roasted
merriam
televisa
lanterns
sucker
environmentalists
disbanding
shingle
jamming
maddox
unveiling
aes
muhammed
armin
greet
aoc
breaching
osu
jn
newt
ultima
ebola
cereals
marist
dq
hertz
antlers
obi
aberystwyth
ayres
evoke
hopping
shiloh
embryos
attest
upkeep
hilal
fanning
tele
ccm
frith
skit
tidy
helpless
albemarle
deacons
cryptographic
lugo
odense
centrifugal
idiosyncratic
finned
subversive
behaves
governs
liabilities
argo
littleton
sieges
dojo
roaming
montessori
gita
bolivar
pitts
brice
kirill
slipping
thanh
pasquale
foundational
morphine
disallowed
booty
laing
chaucer
digger
misrepresented
platted
rocking
beni
doran
inert
garnet
wanderer
udp
attaching
ahn
zadar
protégé
assyrians
hardwick
moulton
walloon
autograph
interrogated
jeddah
hikers
williamstown
hawkes
janus
engels
thrilled
mcfarlane
hounds
basemen
serpentine
agua
grêmio
sliced
speculations
derwent
phipps
slices
artie
restores
nandi
stanislav
polski
preferential
xing
terracotta
silvery
klang
swahili
funerary
reinstate
hooded
navigating
aldermen
criss
vosges
penance
paperwork
kola
mandel
goldie
gourmet
ord
debian
purged
handmade
buckland
iirc
lucca
syntactic
profitability
budding
toolbar
colton
curzon
vincennes
interplay
landslides
anselm
culprit
haunt
dionysius
fluorescence
aman
roo
patil
photovoltaic
necropolis
lawler
wba
isps
michoacán
arcades
pertains
è
xy
banja
coordinators
medway
assures
westerns
malo
epiphany
slavonic
randomized
firefighter
colossus
urgency
sultans
assay
dickey
refreshing
kana
metamorphosis
martel
basra
serbo
chakra
theobald
bouchard
loudly
footing
sho
nag
kraus
solvents
overton
omni
iglesia
fathered
passover
detonation
portman
pap
knowingly
ema
sera
cci
parkes
compromising
footscray
eq
dutchman
intangible
recife
gazetted
schlesinger
addendum
primus
awakened
hysteria
plethora
hauling
klux
leveled
touted
greenpeace
housekeeping
uptake
sampdoria
gaon
cartilage
pratap
emblems
aztecs
presbyterians
bayard
roulette
roscommon
emergent
inequalities
chau
gloss
kubrick
dipole
kidnaps
raya
mccullough
holyoke
reparations
zapata
foray
enormously
keats
saltwater
ramsar
peregrine
niki
bundestag
centurion
insured
englewood
mulberry
chez
alamos
motorists
phra
tangential
electrostatic
steeply
halsey
paganism
rosalind
prudent
urs
apulia
llp
incubation
canis
marvelous
gpa
peckham
aac
parser
greats
maurizio
mink
urbanization
sled
widescreen
sonja
seok
camped
moderation
janine
salah
motherboard
zanjan
cowley
numerically
mediums
mackey
censors
thunderstorm
jalandhar
giordano
radiohead
discard
monet
religiously
usurp
triumphant
compositional
solaris
oppressive
fragmentary
bellingham
turbocharged
metalcore
multan
okayama
ipo
curving
kiefer
ultraman
optimize
ovid
tempe
refine
dung
hopewell
plo
clockwork
popped
wpa
matriculated
trailed
turkmen
justifying
bleu
versatility
perú
alleges
clapham
bonham
brunner
lê
andretti
eustace
punching
tma
buchan
gridiron
grounding
bielefeld
constellations
kirov
contractions
fugitives
unify
velocities
downstairs
arunachal
lynchburg
hamm
cannonball
coeur
southerly
gard
installments
instructs
procured
matador
plundered
deterministic
gerardo
kershaw
chrétien
improv
gaol
kata
retinal
abad
surveyors
rabin
hora
sulu
mashhad
barbary
fpc
wildfire
galbraith
popping
abram
clamp
laughed
dormitories
psychotic
remit
serif
conducive
empowering
heartbreak
innovator
chr
rotates
suv
wac
corvettes
luanda
tsr
signatory
encircled
rawat
compassionate
stirred
plough
castello
westmeath
foucault
sis
sephardic
wrongful
ulf
unneeded
skipping
lichtenstein
reeds
tanganyika
anguilla
millimeters
arjuna
doctrinal
macedon
teixeira
gophers
reaffirmed
guts
approves
dagestan
pringle
pere
kd
peptides
pimp
cinematographers
housekeeper
shakhtar
anthrax
centerpiece
pathologist
boland
interpreters
tum
forbidding
penrose
ute
rubinstein
dst
langer
garrisons
intimidate
lupin
eto
blaise
panasonic
happier
misunderstandings
debussy
capsules
gerrard
struts
burman
bainbridge
pseudonyms
jahan
wola
kozhikode
warlords
rapping
eos
rosas
györgy
equine
tia
dielectric
cagayan
portuguesa
toned
concentric
eaves
abbots
multiplex
precipitated
agha
hijacking
halton
frontage
millie
seduction
ararat
saraswati
dealership
mamluk
ridicule
townlands
valois
nicht
mime
hussey
jinnah
quotas
loomis
unwillingness
homs
prerequisite
bile
icy
freetown
confucius
kuo
iggy
reckoning
lk
dunno
wie
piacenza
weave
tenets
diplomas
uwe
disestablishment
deo
renegades
unborn
jurors
petrus
graceful
emo
slums
kiki
dividend
counselling
rescinded
vfa
blackwater
broadened
hdtv
devise
claimants
whyte
echelon
doubted
bom
gastric
robby
gruber
cultured
fermanagh
thirst
alfie
officio
trolleybus
hutchins
bayesian
connaught
airliners
ounce
allele
theorized
forewing
winnings
mera
fribourg
silt
rations
oft
serene
blossoms
curie
dansk
tph
dey
staggered
maarten
withheld
italicized
housemate
canaveral
conley
medicaid
beryl
hopelessly
dios
cellulose
hla
huntley
equate
bbs
latimer
creeping
trumps
lazar
vehemently
blogging
roast
ounces
jockeys
concubine
didnt
banca
guwahati
worldly
myron
potters
interscholastic
carrick
cymbals
microphones
yip
broome
sully
uphill
geometrical
marr
quixote
gippsland
tesco
conseil
annales
criminology
loot
edvard
selwyn
cleo
incompetence
caulfield
liberate
totalitarian
initiates
stylist
ugo
armbar
tahoe
yakima
standardised
therese
pino
downturn
mortgages
puy
bint
langdon
deir
elmira
binder
juries
sinus
abode
reword
apostrophe
moresby
appellation
mobster
parliamentarians
buggy
samuels
emptied
vane
cie
libelous
rosetta
recitals
reputedly
chaney
untold
spar
bayonet
provocation
tvs
misread
telephony
gauss
lesion
amritsar
fabrizio
brokerage
richness
secretive
faithfully
gillis
garvey
kon
seamus
ticino
subordinated
thesaurus
syllabus
snowboarding
zedong
hips
kaya
firefighting
meats
valour
flocks
pathfinder
phoenician
llanelli
exposures
scramble
altars
parrots
winton
naturalists
guanajuato
dissection
anesthesia
courtier
garrick
signings
pickups
poltava
workable
carnage
loader
jovan
fatality
goss
zimmermann
cranial
grosse
hydrocarbon
tse
gimmick
atf
abies
basingstoke
fahrenheit
medically
wearer
deactivated
summarily
suwon
beasley
magically
stoner
kaur
skid
wim
burgas
turnaround
ackerman
plutarch
sybil
dorm
bec
claiborne
rondo
margo
renfrewshire
qingdao
appended
eiffel
hakka
leif
fram
hallway
rescheduled
coatings
suzhou
marcin
cmt
confectionery
headaches
tot
subiaco
barb
permutation
quintus
semen
recast
scipio
mineiro
sycamore
bothering
proverbs
fibres
jilin
chee
lager
jørgen
pye
liddell
hymenoptera
frankfort
phenotype
malek
kinshasa
patria
redacted
wrecking
pernambuco
ascap
aron
pebbles
curtin
inventive
raffles
bund
oxides
typefaces
wali
castel
stalemate
bains
thirdly
naturalised
searle
safavid
projectors
horizonte
kidneys
mcnair
dieu
respiration
heaton
counselors
tempore
lf
jekyll
marginalized
poitiers
harms
clinically
affectionately
augment
suzy
krauss
chap
aussie
extinctions
mascots
environs
ncis
rota
hispanics
xtreme
extinguished
engravers
brookfield
remy
nite
aps
,,
whats
icarus
sugars
declan
edouard
almería
todo
unison
ayn
jocelyn
phosphorylation
ici
duplex
hails
qr
exporter
jackpot
compliment
kalmar
rephrase
rem
bachchan
incomprehensible
materialism
decoding
patras
neatly
warhammer
inflatable
holocene
masquerade
aire
astounding
leavenworth
pleasantries
promontory
valletta
mollusc
denys
palmyra
ferro
blooded
emd
oblong
synaptic
chieftains
collared
rho
kuznetsov
beecher
complication
goto
odis
girolamo
assyria
protruding
pons
benghazi
exquisite
fertilization
mcnulty
retailing
riffs
carrillo
canteen
rajan
opal
monochrome
parke
uplift
unbelievable
summarised
disdain
cheated
woking
stressful
whistling
sandpiper
isidore
judson
zeno
lili
tragedies
abdallah
paok
shenyang
cassius
airbase
postulated
bram
olsson
elevate
heracles
carmarthenshire
enigmatic
merkel
forman
hons
hobbes
mercado
belgrano
nmr
giuliani
wallachia
gott
presumptive
peta
obscenity
könig
traverses
guessed
arabi
kean
conveying
emulation
zaire
prophetic
jaroslav
outposts
karlsson
chowdhury
acquaintances
spilled
sina
coronado
reproductions
patten
aylesbury
saladin
molded
merced
pew
tcp
kmt
els
lecce
altercation
obnoxious
earthworks
teton
receptive
storyteller
mares
mobsters
pinter
everlasting
suitcase
ravel
subgenre
lofty
chute
donny
peek
assassinations
vue
lien
karst
restitution
ponies
refrigerator
loudoun
zine
doves
krasnoyarsk
attaché
parra
pixar
staffing
rotations
arafat
ancillary
druce
humility
undocumented
jelena
phonological
lh
thursdays
superstition
dordrecht
dismisses
raccoon
moods
heidegger
sasaki
authoring
pao
southside
fodder
kale
pebble
observatories
torsion
tweet
saratov
correa
animate
cnet
isthmian
caicos
sarkar
cruzeiro
tapered
lodi
stubby
wto
compilers
puente
shrike
authorize
folsom
saito
cyp
fsa
selkirk
cumbersome
ln
pahlavi
rafvr
filename
xyz
dysfunctional
milt
hadi
idealism
isaacs
cassini
rearranged
photosynthesis
glutamate
forester
chiu
humberto
fitzwilliam
calvados
relapse
marquee
swallowed
mainframe
susceptibility
maoist
sonya
adidas
dominions
mex
mazowiecki
olympians
connotation
broaden
typography
fiasco
habitable
croat
stiles
galley
peralta
jodi
agonist
gordy
zeros
temptations
adverts
jacobi
matured
edifice
kishore
hearn
tighter
jovi
oligocene
uda
carbine
genders
erick
elo
accumulating
dulwich
ralston
vedanta
alston
emigrate
larnaca
murdock
bachelors
hj
dravidian
señora
juana
goldfields
calmly
russel
yogyakarta
porous
selim
niccolò
mcdaniel
concorde
bogart
robles
epithelial
diagnose
battlestar
savvy
kiran
stl
everglades
zak
skype
feces
dagenham
phineas
pancho
comique
argos
ledge
moldavian
edwardian
straus
annulled
ryukyu
beached
setbacks
janitor
narrates
oru
bullies
pectoral
willows
criticising
marys
esque
hackers
ys
hogarth
sil
estrogen
abrasive
accomplice
atonement
dragan
hwan
bok
messianic
westwards
phantoms
earp
iceberg
backstory
sigurd
ornithologist
chronicled
westlake
deviations
revolve
adolphus
conductive
aalborg
benn
officiated
clowns
primordial
brabham
subscribed
quits
loreto
ibsen
dodger
headless
follies
cooperating
pinky
paleontologist
stipe
malays
judgements
reflector
jellyfish
infringe
worldview
swe
marxists
revising
sosa
pelicans
shortlist
instruct
groin
infested
intl
betts
rosalie
vocation
surabaya
blenheim
sacraments
varese
mihai
cabral
flamingo
remorse
phenomenal
archivist
capone
abner
minstrel
epistles
astley
constables
pouch
smithfield
riparian
kiryat
pyle
degrade
fashions
sae
resented
xiu
kofi
moored
bioinformatics
peroxide
predetermined
archbishopric
tilly
starving
reproducing
kelso
flor
nunn
aoki
colman
removals
extremism
rath
whorl
aland
rnas
resupply
devolved
pressurized
nuit
gramophone
kendra
usm
hemorrhage
vile
subtly
methuen
uplands
shaggy
burglary
wiser
biker
ivar
acronyms
bluish
elia
prohibitions
enthusiastically
messerschmitt
summation
extremists
pamplona
nik
pittsburg
willi
consecutively
tipping
armaments
unethical
yankovic
blankets
cautioned
rupees
frowned
taro
wager
pediatrics
susanne
contemporaneous
misfortune
recombination
convoluted
berlusconi
qa
vladislav
allende
rav
majid
bowes
undermining
primrose
aristocrats
jordi
azeri
putative
shrapnel
rump
bellows
campers
cantilever
formulae
obese
booths
carpathian
transsexual
aphrodite
tumour
raga
nantucket
intravenous
auditing
subclass
partitioned
torrance
deduced
notary
udinese
timbaland
whitehouse
hustle
kirchner
simón
tierney
toros
gironde
smarter
inhabiting
plugged
sengoku
luoyang
nous
ragtime
arrogance
certifying
capri
hayashi
raptor
veneration
dang
arcy
trainees
aq
jardin
handwritten
foy
berths
westmorland
rong
herbaceous
felicia
stride
joaquim
mes
pilipinas
tú
uncontrolled
primacy
magnetism
staring
vw
islets
cavern
taranto
kansai
protestors
brewed
shaved
ivanovich
shading
sandusky
aloysius
insomnia
algarve
consolidating
cmg
roddy
insensitive
lawmakers
wyman
ayer
rug
pendant
confines
stoughton
skewed
tessa
hazrat
neilson
parris
corning
rollover
yazd
saxophonists
xuan
husayn
farina
sexism
rede
windward
mchugh
onondaga
hotspot
hoo
airstrikes
allowances
monsignor
foxx
raquel
stickers
reins
jm
marlboro
irkutsk
distinguishable
esprit
assemblage
hornsby
weeping
yelling
eugenics
neutrino
frightening
irritation
fairey
timo
qt
tweaks
epping
chained
lolita
rishi
tora
vk
dreadnought
malfunction
coldplay
shlomo
gallows
strathclyde
enlarge
murderous
helga
breen
gervais
grote
józsef
mansell
gerber
assaulting
popcorn
oceanography
vermilion
doctorates
praia
overdrive
cortical
osage
wag
buffett
contemplated
haste
supergroup
fatah
anatolian
buddies
souvenirs
convergent
cern
stylish
eradicate
vitamins
arriva
recourse
grit
yolanda
ucf
rehabilitated
biologically
kyrgyz
bayan
eradication
ergo
embarking
maas
artistes
okanagan
moulin
sorcery
tuskegee
decepticons
janssen
yau
rcd
incoherent
tivoli
rowling
prophecies
takeda
vettel
bundy
intertwined
philly
abstracted
baha
mela
ethnography
ludhiana
ostrava
characterizes
perfected
dreamcast
attainment
davie
tagline
codenamed
germantown
remodeling
lakeshore
apron
vetoed
hodder
franciscans
lonsdale
watchdog
ignite
mostar
laois
ophthalmology
bookseller
woodlawn
fenwick
crtc
gypsies
brasília
cosmological
swollen
mcdonough
tyrosine
andalusian
classically
pussy
bitterly
drier
traditionalist
plebiscite
associative
modernisation
raton
butterworth
jat
howitzers
besar
cirt
streamline
pranks
compromises
marcellus
lichen
narnia
freaks
steeple
matriculation
eccles
arrowhead
blaster
ridings
shasta
exploratory
lawless
sangeet
noam
progenitor
furman
toon
nang
acapulco
disseminated
lavigne
mathilde
francia
raster
partick
jeju
repercussions
fluff
humanistic
paternity
burner
repressed
undecided
stamens
bursa
lis
pentathletes
bubba
vor
hover
beggars
parenthood
trotter
sparrows
crucible
forcefully
getaway
pyramidal
tougher
salas
kavanagh
göring
rococo
moriarty
musgrave
commodores
detour
basie
miloš
valeria
lisboa
afrika
hath
knott
facelift
decathlon
lingering
tommaso
miscarriage
fates
armitage
alle
aggregates
maxima
cracker
navratilova
felton
caruso
moratorium
hitman
unlicensed
emirati
envoys
griswold
othello
soe
waterline
luxembourgian
beatified
ailing
donahue
cray
dunlap
jt
luger
superboy
illini
susannah
pickens
hallucinations
psychoanalytic
superimposed
materiel
dato
triplets
kampung
hirst
≠
indispensable
puppies
newberry
marais
geographers
demeanor
dreadful
anvil
sturdy
biden
permutations
santi
therapists
centrist
atl
intruder
horseman
kauai
realistically
sludge
frasier
winona
mules
holbrook
schoolteachers
rayner
renting
stalks
rioja
laramie
furs
jetty
radiology
watercolour
doge
cadence
kumari
csu
basso
stabs
aides
recognitions
deli
agar
tsai
charms
wrapper
donizetti
cbi
molotov
proclaim
caballero
affirm
genomics
punter
interpol
avenida
avionics
mayfair
attrition
rimini
centauri
puberty
wraps
prompts
lymph
privatisation
industrialized
persson
insulated
lala
rok
dividends
stumps
stahl
kites
nymph
aurelius
whois
tikva
laterally
tarragona
rapes
capsized
nerd
tomsk
digestion
pesticide
bougainville
palisades
mardi
phys
mods
immunology
daffy
copley
teng
bab
midrash
powerless
sanctuaries
abruzzo
saracens
jamboree
surpass
glyn
flamboyant
bosworth
zola
sakamoto
ojibwe
loi
myung
basses
justus
dw
tsui
placid
prognosis
unbeknownst
videotape
heilongjiang
akademie
bhatt
igneous
lenders
vv
xue
substantiated
settles
mislead
cleary
supervillains
pies
esposito
grandiose
clustering
loftus
instincts
dilapidated
favours
kamehameha
reworking
kenney
satin
rafi
domínguez
nozzle
tilde
vantage
unam
hikaru
vaud
nowa
jeannie
pid
irritating
cordelia
silurian
broward
stringer
formulations
headlands
coyne
rigby
cpus
kee
detriment
smelting
arthurian
malnutrition
kodansha
tenuous
verandah
quitting
ephemeral
occitan
hei
quests
viejo
rockabilly
maki
alternated
curing
detractors
taxed
kirkwood
sakhalin
rab
viacom
acupuncture
kinross
punisher
troyes
tatum
subchannel
chopper
buyout
blooming
nicolai
inflict
quigley
leona
pio
lilian
dilute
supplanted
lobbyist
enrolment
arif
fades
haymarket
cropping
discern
apologised
homophobic
slobodan
quang
samaj
lighted
veracity
decaying
taoiseach
swain
tricycle
beatrix
sanatorium
pescara
guimarães
refineries
statistician
ive
papilio
gurion
blanks
feasts
biochemist
taoist
leaved
olt
haim
consultations
idiots
flak
simcoe
motivate
coercion
riksdag
husky
bailiff
infused
nmi
horsham
trumbull
omsk
conmebol
harming
sikhism
brunel
kde
competency
schaus
disapproved
caltech
sheath
jodie
velázquez
hypnosis
charlestown
nome
fulfills
herrmann
valenciennes
cauldron
proc
cushion
apg
neuchâtel
ece
khomeini
joni
capitulation
rosso
huan
shostakovich
activision
incursions
indre
pwi
crux
src
alphabets
constructor
invocation
blazon
corals
darjeeling
lynching
querétaro
anode
jour
gijón
interrupts
additives
colvin
humiliated
kindle
anticipating
normale
thumbs
sockets
undersea
insolvency
acadian
greeley
eukaryotic
dermot
correlate
chia
sprawling
lunatic
silverstone
vitality
anus
whittier
echoing
koo
acropolis
rockaway
plunder
koblenz
balearic
poincaré
hittite
theravada
vn
ittihad
antonov
audiovisual
ubc
contends
coldest
skew
presidio
chilling
kamikaze
valparaiso
cornered
turquoise
rectified
comté
freshly
watering
virtualization
waring
millers
strung
pitcairn
insufficiently
severus
videogame
dobbs
gritty
aesthetically
halliday
adversaries
camelot
bodyguards
surrealism
overarching
glamorous
gcse
aomori
kashima
zona
vandenberg
hispania
anaerobic
shellfish
finisher
lyttelton
pediment
jozef
qureshi
conceive
francine
majoring
clears
penzance
enclosing
booming
newsletters
snell
darth
kohl
backlogs
barbra
decentralized
effigy
determinant
armée
normandie
inbound
eskimo
pará
sapiens
milošević
hatchback
ukulele
bremer
harmonious
autoimmune
sncf
magenta
acl
neuter
sincerity
guzman
donner
perón
wallet
addams
bibliographies
pans
uterus
youngster
lieder
dwarves
knockouts
hinged
cardiology
foaled
aircrew
guyanese
abdication
metcalf
rout
waiver
revolved
pumpkins
parse
ntv
glitch
compulsive
♪
brixton
blazing
bartolomeo
collagen
allowable
campion
smallville
subic
curia
mano
combative
despatches
filament
deprive
lga
icd
corolla
liquidity
wits
nsf
kindred
dianne
lng
carmarthen
infrequently
deliberation
taos
regretted
stillwater
accountancy
elmore
ozzy
bj
dutta
degeneration
encampment
ipl
freehold
jokingly
hammers
loom
adamant
hobbit
haiku
racket
counteract
absorbs
intimidating
actin
mowbray
dusky
lleida
publius
solicited
cma
humiliating
haile
presidium
caw
pinoy
arlen
sch
chantal
blindly
petah
quack
longhorn
valais
gamba
deepak
slump
ryo
cavities
eusebius
attachments
goh
mythos
sewers
nav
hippo
lys
arco
physique
maserati
workflow
sutcliffe
denouncing
bonfire
palmeiras
tpb
guayaquil
lecturers
rangoon
marque
città
behold
cronulla
landscaped
gracilis
systematics
mechelen
roadrunner
multipurpose
nahuatl
antagonistic
genoese
choruses
shevchenko
wray
fairgrounds
mcenroe
mena
headphones
womb
reiterate
wildcards
executioner
shaolin
transnistria
méndez
autobot
honoris
speyer
storming
circuitry
curiously
sibley
tunbridge
václav
ridiculed
ginsberg
divas
livonia
fibrosis
voltages
fte
qom
mellow
hecht
poughkeepsie
iliad
xerox
soriano
stoney
schoenberg
lucio
incendiary
panchayats
hernando
surrendering
arr
bellini
tran
mystics
oscars
remediation
egon
deducted
eesti
obituaries
sem
weill
cur
quetta
compostela
handedly
renovate
kos
sayings
minions
mchale
glitter
stamina
mol
svensson
giuliano
taping
reece
uninterrupted
intrusive
slavonia
pretended
yevgeny
vestry
dismal
octavian
pardo
nix
virtuous
eclipsed
anecdote
watchers
pascoe
refrigeration
unambiguously
tonne
virginity
bianco
inconvenient
graft
locales
ricketts
noodle
blooms
vj
tack
gestation
elisha
laotian
aerobic
flax
rw
sneaks
disintegration
sacking
irreversible
healer
kierkegaard
methodological
mizoram
resorting
cotta
villainous
shoal
oats
sheena
outfielders
pell
renumbering
channing
twa
cameos
thurman
repainted
evert
revolted
naismith
freeport
hatching
diligent
bader
ejection
sceptre
irreducible
hanlon
oriel
revisionist
naylor
pleads
patagonia
privateers
laporte
odo
bankstown
mage
widower
popper
negev
oshawa
resuming
naïve
chino
segovia
theodosius
vassals
netanya
grainger
olmsted
parable
mogul
ichi
tutors
berks
implanted
pentium
dasht
svp
misconceptions
charterhouse
mian
mme
reared
blasphemy
curtains
dickie
espanyol
hauser
dodson
snl
opioid
octagon
memoriam
combinatorial
flooring
lothar
fleshy
forceful
euphrates
broderick
belfry
hulls
aan
freeways
naturalistic
zorro
shorty
radu
unrecognized
bruxelles
gwangju
softly
hilo
fanatic
nishi
corrigan
plunkett
wading
blackmore
bonanza
pandey
tigre
widen
jewell
algonquin
postman
optioned
prejudices
superficially
lombardo
dlc
sages
micah
guerilla
talladega
platonic
buoyancy
reverence
kinks
mah
exponentially
tsang
slugging
brilliance
doolittle
nanda
clam
vee
analyzes
martyred
phage
yogurt
sexist
woke
kimmel
lauder
darreh
bonaventure
warburton
langston
telemundo
swamy
tecumseh
ceases
nicotine
certify
arkham
pasteur
recessed
geronimo
nameless
punic
biosynthesis
nadir
abidjan
walcott
majorca
woodside
bohemians
ry
shinji
iraqis
parthian
executable
broadside
jalal
infiltrated
buckle
donoghue
unattached
adhering
discrepancies
principalities
osiris
daft
dashed
strut
suffused
beyer
sill
evils
baloch
ipcc
morelos
balmain
sind
disgruntled
xxii
recite
degrading
onslaught
gustavus
democracies
roam
gilligan
adhd
matty
wojciech
recognisable
murmansk
rud
reuniting
allocate
avex
kermit
amato
coburn
purified
terrifying
propel
waterfowl
kalimantan
ulsan
indecent
steaming
modesto
notebooks
católica
schreiber
precedes
gre
amazingly
crackers
dirac
leaps
footer
dine
potion
ventricular
tamils
banknote
arlene
rapture
sip
anew
veer
appalling
snippets
sibelius
neologisms
owain
carlow
swastika
pawtucket
sdp
bernal
mccallum
hinge
giancarlo
timpani
vultures
corse
ozark
eritrean
mujer
prerogative
delightful
hamza
ioan
tenors
overdue
curtailed
janusz
spectrometry
avila
conformation
jolla
matisse
ney
defiant
walid
scalia
trajan
justifiable
forgiven
rattlesnake
agm
oxley
fibrous
goaltenders
nawaz
hinds
asc
swifts
jörg
farmstead
forensics
pencils
televisions
cle
rockstar
woodville
botafogo
inglewood
solitaire
himalaya
marisa
kneeling
totem
sein
heartbroken
lovett
femmes
jester
pronunciations
javed
clegg
sportscar
revivals
transmembrane
gora
benedetto
dordogne
realty
sacrificial
unintentional
recitation
strikeout
shaky
tavares
tok
palate
spades
bohr
cages
yonkers
sssi
limelight
sparking
proprietors
anfield
oy
universalist
exp
sommer
dy
waltrip
aliases
voc
simplification
mutilation
kleine
centimetres
lx
stalinist
riverfront
westside
taker
hydrolysis
devotee
pointy
languedoc
dismantling
anomalous
twente
gaseous
ticker
dunstan
decorating
robberies
rel
crawling
rearrangement
waverly
unser
pocahontas
générale
baillie
loon
bladed
issa
toyama
lomax
toluca
meena
interlocking
obstetrics
bequest
schematic
ahli
caterpillars
newsworthy
stalked
prog
terraced
kilogram
konkani
rags
pheasant
applauded
aha
gees
polishing
inhibiting
hennessy
atchison
sushi
scarf
prudence
statesmen
neckar
kippur
goodrich
sunglasses
montreux
inductive
ballantine
yvette
chimes
transcriptions
cherbourg
diversification
brushes
evidences
nicobar
ria
wirral
vadim
decoy
blackman
downes
heian
kapurthala
historique
sorrows
agendas
onscreen
spreadsheet
espinosa
lorient
lindley
laine
tories
kfar
hendry
rakyat
marcy
feuds
georgie
perceptual
jukebox
nederland
makeover
prost
amending
airy
monika
inhibited
aoi
depressions
earthen
hallam
conjugation
ehf
spitzer
authorizing
rolex
crucifix
adaption
judea
expressways
fillies
consumes
aves
apiece
riba
plunge
golestan
mohr
camacho
flanker
lexus
blondie
hotter
manche
adventurers
maxime
milieu
valkyrie
tuff
cretan
parabolic
newfound
skunk
iván
attaches
pristina
wednesdays
vasquez
kabbalah
insecure
postscript
indeterminate
submachine
filaments
buffet
goths
zeeland
heralded
estuaries
jigsaw
oberst
occupant
kidderminster
sylvain
juicy
hough
chas
fooled
bakshi
inactivation
yisrael
unloading
strongholds
zoned
cocktails
rainforests
mahendra
taps
rerouted
staffs
brookings
uncontested
tarsus
gauges
yitzhak
ica
cre
instructing
cheeks
catalysts
asymmetrical
ironman
skillful
huffman
contraception
dachau
rivières
misunderstand
acme
refereed
ashoka
barristers
ncc
cuneiform
hypnotic
daisuke
entangled
rationality
typhoid
banter
proximal
trademarked
ranches
boyhood
homologous
reactionary
classifies
amine
poirot
grandeur
aortic
unduly
northfield
gist
transitioning
enclosures
mcclelland
leven
cocker
emmerdale
enrolling
paulina
embodiment
reiner
inquire
lender
yakuza
vail
sheryl
yugoslavian
goshen
westbury
briscoe
bate
favre
birthname
hurd
polska
affections
relinquish
torrens
poul
woodwork
lwów
meri
outweigh
spoil
drosophila
daw
heine
jervis
schmid
reap
nlm
treachery
imperialist
ora
elegy
pertain
plaid
klingon
barnabas
kurtz
footpaths
redoubt
beira
madan
sucked
cannibalism
goebbels
elaboration
amalia
devereux
curiam
celts
workington
lott
reuss
burnaby
becket
vitae
oeuvre
epp
workmen
roundhouse
lancer
sheik
astrological
stele
infrequent
valera
auxerre
resurfaced
aversion
pathogenic
hyphens
ainsworth
treacherous
cef
edn
resettled
stool
ranga
bofors
scorsese
olof
irritated
testimonies
wacker
nomad
modulus
filthy
haram
gorky
noblemen
kayaking
romanov
gmc
luhansk
infidelity
toughest
colonized
designating
snipe
ise
rosenborg
dismissive
haakon
collaborates
weightlifter
enclaves
startups
arca
addictive
oxidative
colonisation
ahmadiyya
taco
setter
upholding
buzzard
dvb
oakwood
maids
stanzas
listens
orf
sitar
jackal
conspiring
constitutionally
austronesian
imitated
pinocchio
yp
groundwork
tequila
orbitals
worthing
roxbury
hedwig
detects
powdered
cranston
cremona
lenox
aleutian
sst
lilac
eparchy
achille
cleft
landes
mises
toussaint
pastel
firestone
esteghlal
deseret
creamy
deerfield
toshiba
dillard
bmp
flagstaff
caverns
parishad
steered
portmanteau
sentry
sternberg
comm
omen
rts
symbolizing
premierships
seibu
secluded
vries
gaels
endocrine
mufti
aspire
stomp
starcraft
rarer
ces
appleby
chime
brazilians
mendez
ansari
cybertron
depopulated
compatriot
lanier
hemlock
sparingly
boils
edie
candice
nautilus
lemurs
reagent
redeemer
pura
choo
unharmed
ochoa
hainaut
bidder
ogre
squat
fowl
flares
magister
ifa
coop
pancras
blurry
rivas
disparaging
smokers
zz
lenient
lewin
evo
cottonwood
wilmot
vir
mosquitoes
prefixed
hydrophobic
consensual
condon
lewisham
rios
furthering
boosting
equivalently
ekaterina
denham
snaps
goldwater
métro
deteriorate
downside
aosta
cassava
bhubaneswar
grappling
confrontations
jah
watermark
cremation
placer
benefiting
forearm
bak
nonviolent
ashcroft
goodall
hpa
sidewalks
burley
phu
boi
aspirated
incl
arkhangelsk
slr
revd
gros
pushers
aries
shrek
wilma
foiled
diesels
colette
asw
bluebird
tien
rewording
guntur
pierson
technologically
revocation
abercrombie
bryson
contreras
miao
biathletes
erode
placeholder
fend
inspirations
esplanade
magda
danse
diverged
flammable
mcknight
chelyabinsk
ej
rhondda
molde
cadbury
tahir
calumet
canonized
salter
req
mcg
shyam
diabetic
tightened
timberwolves
starbucks
ronde
grained
relaunch
architecturally
alois
usurped
dosage
nostalgic
ghazi
eglinton
desist
estes
evaluates
renounce
farthest
mahon
facades
imams
fredericton
gilliam
demography
beowulf
stave
pubmed
veterinarian
frick
chf
irl
gallup
doodle
farnborough
hinterland
betray
cusack
alford
solano
sibiu
kamil
vinod
academical
sembilan
diagnostics
heinlein
proactive
niven
laurens
cannibal
ext
grilled
jolie
osmond
wielkopolski
gcc
nitric
democratically
spills
hrsg
merchandising
unhcr
condominiums
multiplying
fanbase
antidote
disabling
popeye
warranty
overcrowding
gradients
makati
lagoons
sokol
wasteland
debit
prabhu
methanol
tiered
jacky
zealanders
susana
outbuildings
tucumán
intrinsically
nez
impede
streetcars
gatineau
edda
chauhan
volatility
martens
schoolmaster
eo
jethro
chambered
adsl
nonpartisan
sanctioning
amends
hoa
searchlight
airflow
rockville
bebop
vibraphone
dekker
aclu
toms
opposers
unavoidable
chicano
clueless
tigres
confessional
hangars
rainwater
misdemeanor
kissinger
khabarovsk
mugabe
jamieson
bowyer
telekom
schaefer
touchscreen
erdoğan
jpl
resumption
sedge
franche
bode
ade
loanwords
nrc
gait
outcrops
instantaneous
noche
vieux
itc
enforcer
fh
batches
sag
ilan
laplace
hops
sachin
backhand
heed
koji
alternates
ladders
cures
placements
negation
confers
archdeacons
trivandrum
retelling
eminence
lada
umass
delusion
hindsight
bor
fanzine
callum
gazelle
towson
hallows
irrigated
thirties
thaddeus
dsp
bystrica
championnat
baikonur
veda
nin
remarking
cobham
marlow
timeframe
vesey
craze
varela
cuff
shanti
antebellum
startling
guardia
diagonally
maulana
haaretz
pizarro
pluralism
grouse
tripartite
moraine
barnum
regenerative
knuckles
bene
finney
toa
lifeboats
maturation
incumbents
plunged
prokofiev
soundcloud
ridiculously
corinne
parentage
shaftesbury
flatwater
fv
teamwork
conceivable
pq
gwent
spitfires
happenings
gaunt
nucleic
entitlement
busby
sputnik
universidade
bourgeoisie
mircea
ibadan
dalrymple
hessian
patrician
priestess
pease
corazón
oblivious
nationalized
silently
unknowingly
scorpio
recuse
unloaded
manifests
counterproductive
wiz
ati
discounts
swung
azhar
purgatory
reddit
vicarage
smugglers
centralised
sola
accommodating
dfc
descendents
carbohydrates
petrova
needlessly
abdicated
reciting
beatriz
contaminants
ramones
nth
baptismal
congressmen
nesbitt
enumerated
beitar
infusion
moma
screenwriting
procter
rattle
hailing
hashimoto
unicameral
bloodshed
tarrant
googled
intimately
epics
nailed
busts
cantabria
tui
deepwater
allgemeine
speer
katha
calicut
footy
tuesdays
muni
ahly
defenseman
discouraging
dickerson
dissolving
aster
disband
ripple
thracian
overtones
polarity
blasting
gannon
morin
rideau
ryerson
ambulances
specs
shaykh
sightseeing
paderborn
hradec
halley
kcb
rhododendron
unconnected
bracken
lino
seong
renaud
fenway
gardeners
cartographer
sepals
buckeye
drc
schrödinger
mchenry
scatter
ddt
sándor
archimedes
composites
tania
ucc
corral
bigfoot
poetics
shetty
seabirds
snuff
shinkansen
slovenes
capella
bibi
masculinity
chickasaw
reminiscences
debtor
adrenaline
fahey
inf
aram
dominguez
exaggeration
ahmadinejad
thane
yad
candid
communicates
horacio
pinot
phonemes
dominicans
dipping
exponents
lowers
sculls
swapping
assurances
crt
firsts
corsair
solubility
multiples
basu
valdés
franconia
supersport
grids
militarily
grimaldi
rawlings
nitra
musk
electrolyte
tread
applicability
omg
nightlife
elapsed
netanyahu
banbury
ringed
kerrang
tekken
pruning
hovering
manipulative
buckner
sith
zephyr
ifc
giraffe
hispaniola
mcmurray
mitre
kaz
ttc
hagan
turntable
philo
amused
mackie
unwritten
epidemics
bumblebee
ics
fedex
womack
barbosa
frisco
esper
blight
seeger
imitating
subtype
shaffer
gilt
thrived
malden
lorry
procure
pw
fujita
overthrew
thiele
limousine
casually
butts
gondola
amorphous
persuading
shriver
budgetary
nay
rolfe
burnet
displeasure
synergy
søren
riverdale
diss
drogheda
comstock
orthopedic
calculators
boardman
cadiz
agonists
braces
brendon
redefined
laced
smoother
comintern
chainsaw
pte
conceptions
holla
whirlwind
relayed
acetyl
bogs
hernán
freedmen
ishmael
ingenious
dyed
overhauled
straps
newtonian
primavera
liars
valet
soledad
bothers
hashim
evolves
esc
spiritualism
theophilus
participle
byway
strabo
categorise
podiums
translational
processions
possum
brando
capacitance
stupidity
disrespectful
dichotomy
leahy
munroe
evoked
romulus
vallée
rua
affaires
intoxicated
sprite
woodson
rocca
mercier
standoff
phobia
anz
zbigniew
payable
tas
gli
crowning
probabilistic
gul
amory
tupolev
robbing
stainton
convincingly
karla
consejo
jagged
stitches
mesopotamian
ghostly
metallurg
oberon
feynman
islamists
sow
iz
orally
hispano
contradicting
curlers
deterrent
lozano
credence
braid
squarepants
hervey
baa
wannabe
elasticity
visayas
checkers
unquestionably
ariane
waka
asad
fireball
quell
orne
bubbling
ganesha
whispers
adel
mediators
righteousness
retroactively
metabolites
reciprocity
limp
jv
shouts
kingpin
kongo
stately
kino
grooming
hordaland
decepticon
marv
carbohydrate
sequoia
unwieldy
anglian
gq
outwards
dit
grandsons
amps
hydropower
hessen
jama
república
segmented
renfrew
gwalior
muppets
brat
aau
backers
boggs
croce
dingle
denominator
quaid
uva
kickboxers
tigris
quantify
affirming
hy
kendal
chanson
abstinence
salome
memes
seigneur
excitation
kya
disrespect
worsening
bouchet
ponder
sucre
mahmood
rants
charlottetown
ewart
shomali
clem
lytton
rosy
kristiansand
torus
kerosene
cpl
pcc
archeology
fleischer
strangled
newsroom
inconvenience
davide
barbadian
punishing
prairies
schiff
phalanx
rhea
confucianism
rojo
ranjit
noi
ame
storied
lessen
lynda
lilith
hw
monti
carlsson
lira
mpa
intelligible
garuda
bengaluru
worshiped
jef
yea
tonbridge
stylistically
harriman
strongman
cowardly
psychosis
ism
prieto
dodo
sangha
sanctum
cárdenas
qantas
agni
rhyming
confidentiality
ramifications
lte
gopher
iterations
rectify
curfew
nurture
boulton
booby
condolences
snapper
karelia
hanseatic
mcguinness
slime
peso
odysseus
locomotion
bauhaus
catapult
smh
excalibur
dominik
gags
mackinnon
novello
reductase
alleles
byers
preoccupied
signaled
paraphrased
druid
scant
visakhapatnam
nenad
apoel
warburg
spoilers
oceanographic
lanark
scc
aliyah
habana
decency
pals
vanishes
etudes
decimated
eliezer
xun
appropriateness
mitochondria
sod
ccd
cunard
holst
donne
rajkumar
slapped
aligarh
kalyan
honky
tomography
apprentices
prinz
symbolically
asif
evgeny
gert
millimeter
grozny
hanau
warlock
sophistication
dentists
melayu
actuality
parson
irreplaceable
narrowing
gon
uber
sunda
booksellers
nie
qinghai
anh
catharines
tonk
dsl
artois
catharine
chem
williamsport
colouring
chp
circulate
scratching
overseer
aslan
hatched
ferret
athos
ubisoft
sinners
glynn
cristiano
songbook
expended
itinerant
vedas
separatists
shilling
doria
msp
moreira
eunice
supervillain
pores
affirmation
morristown
adonis
maritimes
watchtower
pristine
yomiuri
masterson
capitalisation
talon
rayo
fingerprints
maimonides
flipping
kuan
naidu
★
vick
accolade
nagaland
nx
sensei
tacitus
geneticist
burgers
serenity
shih
categorical
gypsum
stingray
drm
aragonese
antiochus
harvester
virgo
ewan
deceive
sneaky
caron
belknap
editorship
phelan
wingate
pashto
vistula
oliva
whipped
benetton
antigens
clooney
polymerization
loup
zeke
haldane
begged
zoltán
melzer
sturt
carmelo
photographing
sainsbury
bonneville
bastille
rollin
ratchet
oedipus
reprising
sowerby
rizzo
endlessly
overruled
matron
conclusively
walford
katarina
parvati
separator
oakes
skis
eulogy
schott
sich
aquifer
cate
esters
sugiyama
bataan
cuyahoga
braddock
cultivating
alhambra
upsetting
impulses
forwarding
fortran
robeson
cthulhu
shovel
ingestion
cambrai
urmia
legalized
docklands
scandalous
dalek
paleozoic
jodhpur
truthful
nuanced
tangier
allie
scania
aikido
sverdlovsk
commoners
tus
highgate
fairmont
ansar
restraining
dike
krupp
pompeii
omits
avraham
depaul
sinner
ragged
wtf
monarchies
utilitarian
horus
fournier
georgians
trad
mots
musket
hulme
runes
pledges
merrimack
cuomo
reinstatement
censure
dartford
trusting
underparts
aloha
kfc
hiroyuki
shanks
django
bessarabia
withholding
dacian
linearly
meltdown
thornhill
lj
renard
oka
plano
roshan
distrito
dougherty
catered
mysql
enrich
sinfonia
nederlands
parte
shams
digs
wallonia
alum
interracial
overtook
vela
lw
suture
cassell
carolinas
cumming
lamas
shakes
cabernet
palme
sarcophagus
woolley
brann
ancients
aff
athenians
matti
towel
bol
shielding
novara
iwata
magpie
suva
kiowa
associating
zig
permeability
cesena
shoppers
memoirists
jäger
shroud
poisons
gallatin
qajar
mites
turboprop
disturb
cuttings
cornhuskers
lachlan
monteiro
chorley
intifada
roos
deflection
heinkel
propensity
dejan
abolishing
mambo
aquarius
waned
howl
golds
eagerly
gulch
craiova
cns
vigilance
hydrographic
união
ilam
unlucky
manoj
connective
entail
restricts
pentecost
collaboratively
solidly
bead
dredging
bickering
saucer
soler
gusts
impressionism
entente
boreal
ascetic
fannie
nominative
harvick
rearing
disseminate
randi
sondheim
minimise
saarland
rolle
contemplation
dinners
fluminense
shinjuku
characterizing
subfamilies
olin
vinson
tomahawk
matte
seabed
kinsey
fsv
indica
nuke
drawer
psu
cuevas
sadat
firenze
carbide
webcam
shave
schilling
oceanian
conforming
rajah
gotland
overhears
synthesize
thong
mica
transmits
arran
bdo
auditors
disregarding
vivaldi
subdue
chakraborty
milburn
nutshell
perseverance
mesoamerica
popov
molding
westerners
prosthetic
astrakhan
proclaims
albury
srb
boycotted
nou
boast
sofa
seagull
brackish
overpass
leviathan
bagan
unter
gerrit
upheaval
mississippian
underdog
nanking
sarkozy
storting
hemoglobin
bake
modus
solidified
fagan
zis
corea
airway
renata
spooky
joyful
rarities
diurnal
svt
alte
hearth
enfant
appalachia
lisle
meticulous
apra
edmondson
radom
anya
sartre
chiesa
dessau
reconstructions
blasted
geary
totality
zonal
wir
lebron
roddick
berklee
ppv
tarantino
bild
agitated
adalbert
abiding
usaid
respondent
willed
dwindled
correlations
limoges
longfellow
triptych
espoused
pallas
mccracken
seamless
napoca
levinson
nava
masque
gai
hera
crankshaft
changsha
moles
amer
bonifacio
oscillations
coniferous
badajoz
stupa
ponsonby
copernicus
schindler
ihl
scalable
pawnee
attendances
vases
coffey
grips
maja
elie
battersea
manageable
artistry
dulles
oni
ellesmere
kohler
automaton
middlebury
cáceres
hae
waldron
furthest
invader
entailed
wulf
rolando
suri
liquidated
defection
chalice
bandcamp
rephrased
jørgensen
bharati
niko
schule
oxen
countermeasures
precarious
dyslexia
kudos
beltway
ferdinando
mcdougall
windshield
selina
prešov
appease
pontus
custodian
sanborn
esports
purplish
mmorpg
cfo
bitterness
existance
bloemfontein
briefs
inclusionist
inflicting
talkie
cheerleader
brine
shedding
kootenay
napoléon
advancements
humankind
businessweek
syphilis
hmcs
improvisational
yonge
microbes
marsha
coulson
sepulchre
allotment
valenzuela
♥
sxsw
transfusion
wares
munch
vyacheslav
sewn
jakub
definately
trending
heston
awaken
closeness
gruesome
bucky
calif
watcher
artiste
rambo
heng
madrigal
haw
maddie
brod
chrysalis
crossfire
petre
pedophilia
charlene
meister
intercultural
commissar
albin
balzac
asquith
latinos
overriding
mcphee
herding
lucid
muskegon
iban
eights
funimation
fruitless
rainey
idealized
guesses
marbles
loy
buller
domini
presse
rapist
wrongs
lancia
elongate
saliva
thanking
matchup
wooster
osteopathic
istat
krueger
soulful
sprinkled
sodomy
irresistible
giorgi
latif
emmet
inhibitory
meh
constitutionality
republicanism
kostas
chiral
slows
cartographers
whitechapel
ailments
sucking
carina
thaw
sula
bbb
baal
hildesheim
rookies
excepting
benning
designates
untimely
chum
complexities
hiller
contemplating
agro
cartman
swampy
brilliantly
ryazan
slugs
lucinda
strachan
cruiserweight
jewry
rehman
fsb
foyer
tilting
roost
gallic
annotation
noire
cta
disingenuous
desi
dared
wisely
botha
aroma
sabbatical
beaulieu
ioannis
genitalia
cookery
combe
intentioned
redford
appalled
pelvis
warblers
darul
dipped
®
isaf
pancake
nazir
imposes
resurrect
encarta
cartography
stripper
xc
gad
crooks
squires
musashi
mears
chairwoman
chromatography
asociación
infuriated
materialized
woodhouse
graubünden
compel
plasticity
chartres
supercar
emphasise
lutheranism
bridgend
theodora
semiconductors
venous
bookshop
aerodynamics
caetano
mattress
centrale
wodehouse
affective
gravitation
substation
vx
maggiore
sergeants
bhakti
commuting
oxidized
warheads
ethylene
muses
alibi
crucified
outscored
jak
impossibility
iau
shrubland
alarms
rebounded
biddle
stiffness
lahti
parkin
missa
krieger
finitely
directv
waterbury
funniest
bouquet
initiator
fez
shipley
storyboard
diluted
fdr
paes
neonatal
fim
chagrin
okada
niue
gaylord
coon
grist
leaping
alarming
vulgaris
padova
reconsidered
usf
redistributed
brookline
yost
futurama
cnc
naturalism
unlv
drago
ecac
ooh
inciting
disruptions
parlophone
peri
hewitson
mansour
coursework
modesty
babes
sensations
kinetics
hv
marathons
petter
sororities
endgame
marseilles
ligne
lise
stressing
asi
whore
mackinac
kalam
sharapova
bathrooms
osceola
repeater
harford
iff
snatch
sterilization
webpages
mcewen
™
elam
pickles
auerbach
apparition
seaton
orientalis
coyle
orientalist
coups
kintetsu
dobro
reflexive
oaths
discontinue
aude
ntsb
reload
mammalia
luang
ingolstadt
gunpoint
hatchet
pierrot
quatre
nola
hmv
midshipmen
roundtable
liters
yui
xinhua
stig
facet
mezzanine
riordan
nuances
batangas
pacing
ferrell
hilliard
cla
bundeswehr
crowther
lizzy
rif
papyri
bogie
leed
maricopa
enlists
clothed
girlfriends
roadster
suomi
diaphragm
philippa
unger
paddock
nha
inks
misfits
emilie
maneuvering
gurkha
etched
debrett
sie
slanted
olimpija
winch
ntsc
necrosis
endogenous
traversing
capitalists
shrunk
giang
étoile
skink
loring
murakami
whipple
lampoon
mckinnon
mujahideen
coronel
penchant
bridgehead
benevento
inhuman
floodplain
dystopian
juggling
canaria
slur
thwart
inject
jahre
kush
unwittingly
workout
clientele
sauvignon
thi
rubbing
preamble
twigs
secreted
antioquia
misrepresent
borealis
coalitions
redress
gábor
marquez
conforms
reusable
outfitted
eerie
freezes
margate
messier
soups
lettered
postpone
sportsperson
teo
paton
kp
auxiliaries
majlis
printmaking
cyr
nicks
affixed
gyula
weathering
relieving
supersonics
proposer
blackjack
depressive
stryker
hulu
regains
oe
alb
viewable
recon
falconer
alva
tremendously
sling
morte
dwindling
romsdal
ell
arbiter
talmudic
joked
braganza
recoveries
benzene
commences
renzo
sadr
queenstown
probate
jem
ginsburg
porky
ehrlich
kenosha
schengen
liguria
speckled
bulacan
peloponnese
pdl
quarried
industrialists
csm
dann
abandons
nymphs
strapped
complements
skates
disguises
onslow
sufism
goulburn
brevard
marissa
supermarine
jimenez
chl
roskilde
unionism
ervin
sadistic
dimitrov
radiating
nightfall
amalgam
nyse
palawan
fuelled
herne
checkpoints
stinson
hustler
helpers
bastia
amazons
limbaugh
presidente
bosniak
wh
debrecen
bowery
pampanga
hcl
routers
hydroxy
juanita
roch
precede
codec
mademoiselle
cit
roewer
noonan
insightful
lynette
orville
mortuary
ftc
cellars
régime
frying
roofing
megalithic
leopoldo
politico
caithness
formalism
gauthier
autobiographies
vitebsk
identifiers
condescending
nap
noone
carmelite
maharishi
sequenced
dodds
mcarthur
volodymyr
räikkönen
spoils
shaheed
sydenham
murong
europaea
servitude
equated
streptomyces
mordechai
cytochrome
santoro
rac
perjury
angoulême
kao
commercialization
youssef
brow
ranchers
pala
rosberg
kowalski
corby
lyne
tiled
chautauqua
kirsty
mischievous
ruining
cocoon
changchun
weser
mattel
esmeralda
turismo
azalea
smt
istria
cosimo
standardize
jacobus
cytoplasm
allege
polygons
playgrounds
spammy
centering
xenon
spraying
centimeter
¼
bowdoin
minimized
deems
waddell
crafting
malformed
inundated
monolithic
habitual
relaxing
sever
ats
saki
hostels
levee
occured
northwich
schweitzer
stoned
islami
excommunication
mendocino
lindy
orchestre
roxas
conning
zap
mockery
redone
tacklers
palakkad
ranching
bulky
detectable
unwelcome
medic
albacete
shad
countable
jaeger
auspicious
complimentary
seams
levante
zeiss
restorations
folders
modernize
telstra
independant
chimpanzees
bodywork
doubly
riva
fetched
nuncio
arcane
mombasa
kirche
srinagar
leamington
modulo
cca
ilocos
takers
sylvie
darin
alexios
retrieving
saxophones
ornithology
cusco
rejoining
castelo
mueang
indistinct
louvain
cobain
cev
pde
melchior
criticise
docs
toxicology
collegium
blas
pellets
printmaker
beaton
suarez
barros
dutchess
bfi
eldridge
childers
manon
glebe
frankfurter
motley
whispering
meera
innumerable
algernon
spectrometer
braced
grin
attains
recherche
misled
fenced
arrondissements
alluding
subversion
illuminate
outcast
genève
bellas
harrell
sidelines
rote
lambs
laughlin
boosters
crag
hud
circumstantial
seng
jh
rik
nist
iaea
kyu
branko
mutt
curses
sahel
montero
meningitis
gravesend
naturalization
corman
suharto
pane
showbiz
nittany
repentance
distortions
whitmore
adagio
gatwick
khazar
pur
fey
abdulaziz
restructure
coed
winn
odeon
metrolink
leeway
referential
latrobe
lapsed
newburgh
ruptured
garnier
towering
geller
reloaded
wandsworth
schröder
wattle
workstation
vole
hingis
leanne
ferocious
captors
pomeroy
cooney
dehydration
plainfield
satoshi
proquest
illuminating
destitute
gurus
razorbacks
cheque
farnsworth
adkins
groton
piloting
realigned
traffickers
matlock
shima
sportive
nepenthes
cobras
langton
obsidian
ibaraki
tryon
navarra
parametric
croke
eunuch
dill
stow
foreclosure
nevermind
lowery
caius
hoskins
dona
athanasius
taskforce
fright
polemic
motherhood
lefebvre
rampart
davids
lettuce
micky
baines
impurities
terrapins
aube
tem
mens
hassle
excellency
evangelicals
putney
methodism
oysters
dissenters
caesarea
fulda
identically
dalit
forza
archetype
foresters
cassettes
arousal
semesters
rugs
continuo
silenced
hotmail
bumpers
enacting
hendon
ias
tartan
deathbed
lowndes
ow
mishnah
countering
cantonal
uproar
insecurity
nimrod
droplets
marianas
untreated
crests
deptford
emits
warne
northwood
beacons
flyover
nationalisation
walrus
ibf
setlist
mutilated
redshift
glaring
chauncey
scriptwriter
klub
lefty
smack
testimonial
briton
pollack
symbolized
minot
mirroring
schoolchildren
nuneaton
buttresses
tawny
whim
conveyor
carrollton
typepad
deadlock
astonished
gonçalves
helical
hallelujah
lia
pkk
beret
vestibule
upsets
epitome
balinese
tenured
battlecruiser
gibberish
shockwave
winterthur
displeased
cassel
jacoby
distanced
traitors
angelic
manta
tabla
wield
fahd
lakeview
unforgettable
scilly
shabab
nsdap
macabre
uconn
moulded
dera
agglomeration
molloy
ciel
priori
hijo
wilberforce
painfully
scion
ubs
akkadian
episcopate
umbria
cristal
slid
beet
phan
progressives
biopic
fells
dalí
logarithm
tumultuous
inhalation
mindy
acquittal
centenarians
waved
talisman
rammed
kabuki
rosewood
frei
hamasaki
reverses
holger
incandescent
poignant
rcmp
ogle
diocletian
brasileira
spartanburg
chak
jackman
unfolding
aleksei
azur
waldemar
reckoned
goff
unaffiliated
wolfpack
hatcher
registrations
divination
solicit
admonished
alters
repatriated
disuse
ayumi
rodman
lindberg
hornby
roca
padang
soaps
sylhet
betis
sulfuric
cadmium
impart
molds
treehouse
subsystem
bottling
saar
lipstick
prudential
cayuga
jillian
vagrant
bipartisan
berlioz
jails
inexperience
blois
dukla
solis
ciara
eloquent
bakker
abt
parietal
punctuated
danbury
lows
splicing
impatient
dido
besançon
martine
whitlock
iaf
fondness
roxanne
thea
restraints
remedial
whedon
christos
nîmes
tenths
hepatic
foto
passionately
cucumber
waldeck
bicolor
icing
helios
guang
redeem
arboreal
redgrave
homeworld
badr
cda
salvia
lamborghini
zhuang
ration
hawes
shakur
fingerprint
izumi
jeune
restrain
vai
hin
geopolitical
morelia
ceredigion
revoke
geodetic
cloned
lula
comme
maniac
featherstone
parsing
lingerie
wald
miklós
belligerent
candace
stumble
altrincham
beige
ancien
dalmatian
nocturne
belcher
raining
iota
leds
baroda
afghans
inspires
hippocampus
immortals
almagro
lavinia
khalsa
rutledge
radiated
disproportionately
bionic
hafiz
earths
germán
huntsman
shing
spaniard
visser
inclusions
dol
sørensen
supercomputer
ludovico
illogical
retarded
pfa
pundits
mordecai
unproven
antisubmarine
leconte
remodelled
bdsm
cité
newry
camberwell
lightfoot
outburst
libertarians
precincts
flavia
snowboard
correlates
resolute
asynchronous
plastered
discriminated
ansi
kachin
zealous
penney
negotiator
haruka
neve
aretha
ince
algonquian
sandford
kingsbury
emigrant
mundial
repositories
silky
tribeca
disorganized
hotly
thessaly
legalization
ary
sprayed
humorist
skateboard
stains
rosette
bimonthly
dietz
mortensen
tiki
duan
injections
tron
mediacorp
negra
resistor
marimba
chimpanzee
capra
somatic
sardinian
masturbation
mandi
putt
antonin
synths
rasmus
deakin
corvus
confiscation
morel
wanganui
vardar
wec
bahamian
daybreak
peloton
crispin
undetermined
abv
dieppe
alberti
chauffeur
ssh
slo
joliet
pickle
regalia
schrader
radium
intimidated
nigra
steamboats
sailboat
brt
disprove
vives
pitbull
mgr
bhosle
udine
sedition
bachmann
bacchus
roussillon
ocampo
harwich
swabia
mindless
fukui
collusion
karlheinz
trofeo
hattie
wrecks
huerta
injecting
sade
rework
taoism
martinsville
redeemed
philosophie
unworthy
howling
osamu
brecon
duvall
utensils
infield
sensual
tyneside
despised
bhattacharya
taiping
manus
taras
bwf
pondicherry
linkin
highbury
abreu
gulag
pollination
sforza
karzai
garrisoned
unclassified
pave
haha
darnell
dionne
cris
clarissa
dulce
llywelyn
albino
osvaldo
ssl
inga
bighorn
feu
selassie
federalism
knudsen
brevity
intruders
lenape
doreen
oleksandr
perpetuate
ghostbusters
sangam
xr
nicer
touchstone
paramedic
buda
hoi
narasimha
summarise
soliciting
fora
newbery
singly
higashi
ugc
wg
dax
phenomenology
rudyard
admirers
bfa
radiocarbon
nunataks
hone
rebranding
pakistanis
converges
musharraf
tumblr
hawkeyes
nah
pears
officiating
valdemar
anglicanism
bjørn
homing
kol
ukip
noc
gretzky
jacopo
confuses
bhavan
lorca
solihull
adv
ornamented
ponta
etymological
torchwood
enlistment
thresholds
zaman
jahn
clarifies
rigor
firewood
marín
gorica
parenthetical
oriente
ambedkar
abbé
yourselves
sephardi
ennio
glaciation
papuan
dassault
occidentalis
aerobatic
counterculture
gdynia
rationing
divya
madhu
acrobatic
keita
decor
gambino
tic
nir
cordoba
individuality
squeak
chariots
barak
minoru
halting
olimpico
macho
hybridization
lun
brazzaville
acknowledgment
maghreb
pharrell
byrds
kirkus
edson
laila
tiananmen
lmp
tweaking
cin
secularism
taxable
defy
drawbacks
hsieh
pituitary
norwegians
bracelets
hoare
oxidase
quarks
ethnology
hamstring
plugins
positional
ker
poitou
capitalised
entrant
sayers
noh
posey
shoshone
frantic
gyeonggi
peebles
paloma
moll
neri
hamster
galena
deregulation
necessities
biopsy
spiritually
agassi
argentino
bullion
drawback
baring
mug
lash
addington
scanners
distort
christiansen
thorium
molière
reappear
gisborne
mayr
wallingford
analysing
cheetahs
refitted
wada
eldorado
gabby
faro
mirko
abuja
dravida
trimble
budge
chastity
cenozoic
tugs
holby
vandalising
livin
purview
asuka
intermodal
manawatu
chivalry
vendée
fedora
chişinău
hauptbahnhof
mortem
rabbitohs
kreuz
santorum
glenwood
arbs
fertilizers
exalted
poaching
copious
fortaleza
syfy
celje
supermodel
shastri
causation
indentured
karthik
ahern
miyamoto
granular
famagusta
coolant
baronies
kot
nicolaus
ghat
sauces
vaux
bernoulli
clutches
etchings
ripping
bligh
husain
breckinridge
gaby
xander
metallurgical
padded
rioters
incursion
seashore
cosmonaut
rhonda
coy
cowper
globular
summoning
mumford
hales
militiamen
menachem
uncensored
oud
parakeet
seaweed
predominately
starfish
paradigms
pacifica
lakh
mauritian
beebe
cyberspace
ingalls
whittington
jamison
taichung
salons
bran
ssc
conceptually
enlisting
monsanto
cheering
mcmurdo
cuisines
cmc
paragon
martí
oldsmobile
brookes
instrumentals
stipulation
charing
yeltsin
diario
dnipro
covalent
nutcracker
atr
pasco
schulze
bridal
ores
underdeveloped
headquarter
libro
sto
multiverse
clap
sqn
hermione
deduce
zayed
donbass
brunt
nong
chua
slashdot
bamako
bueno
fresnel
mottled
solace
berthold
sturgis
caravaggio
regularity
eon
selectivity
lifecycle
enthroned
spinners
swallowing
devolution
sdn
stonehenge
mountbatten
pinkish
renoir
rowed
cigars
situational
addicts
tweets
sds
oeste
kearny
beauregard
visualize
replays
radioactivity
littlefield
wanders
newsday
desegregation
tribesmen
gorillas
lotto
lovejoy
frida
apocryphal
innocents
indented
affectionate
itt
midwife
ringer
scr
argumentation
lse
athenaeum
abul
camilo
oakville
safeguards
microprocessors
extradited
africana
japonica
mismanagement
gulliver
warmly
cote
sitter
sextet
showroom
saville
turpin
trucking
zionists
rami
menstrual
dogmatic
pima
delicacy
napalm
torment
sawtooth
deceived
subduction
appomattox
tabasco
ghar
poised
contrived
methylation
squeezed
juventud
pagans
depressing
factbook
dac
hotline
creditor
craftsmanship
gtp
salina
galleys
guidebook
mimics
istituto
brothels
contours
standish
altoona
paras
yakovlev
razak
mangroves
rotting
thiago
cq
margherita
caucuses
rar
kiosk
repayment
capillary
christiane
kemper
dancehall
undisturbed
fundamentalism
gunter
erm
bushehr
rascal
feinstein
staining
notched
kamchatka
canoeist
sufficiency
alcatraz
bilal
stettin
simulators
approximated
bagley
lads
brantford
lubin
comprehensively
ueda
carthaginian
anchoring
feeders
instrumentalists
triomphe
galle
zim
penobscot
concurrence
hitchhiker
kula
modernised
myer
repressive
madge
bohol
norske
exclaim
goldsmiths
hummel
carlist
darlene
fateh
indiscriminately
ronny
jonson
liberec
marten
erotica
czechs
octane
leaflet
albatros
baz
pinkerton
heisenberg
bloodline
scrimmage
hufnagel
loo
fils
durch
sata
afa
conwy
ornithologists
conifer
gaiman
southgate
ikeda
raith
iwa
gorham
tensile
seti
railings
colonna
greaves
pate
regenerate
stuyvesant
bannister
northridge
monrovia
brun
munson
dummies
atlantique
mavis
seer
brough
bartók
dci
glaze
zahra
fourths
mailbox
silo
detonate
hula
patricio
kev
talib
achaemenid
dilip
imre
nostra
acidity
aec
plating
mahayana
chaudhry
mobilize
celta
andante
schooled
caterina
dresser
nene
hala
stigmata
tzu
railcars
superpowers
frampton
pardubice
defying
tweeted
underhill
coulthard
airman
heuristic
personification
masson
shangri
eyebrows
barque
gamblers
preakness
shabaab
mahan
iridescent
hervé
commoner
sceptical
davos
meteorites
banque
batten
saif
shaving
billiard
fleury
ryde
cordial
aimee
vasile
kelowna
salvageable
israelite
darn
giorgos
foggy
abstracting
autobahn
sprints
tms
detachable
accommodates
ilo
wildfires
mckean
gillies
afoul
sukarno
harkness
customize
borgia
reliever
pvc
fleeting
ephesus
proletarian
grasping
electrician
mineralogy
tallied
amstrad
stances
patio
meditations
mts
goodwood
türk
osce
worshippers
sars
thrillers
eukaryotes
hoff
ramparts
nontrivial
ferrero
baez
wacky
haddad
subjectivity
hellas
sauer
eicher
refund
windhoek
chaldean
retires
hossain
ratu
portraiture
evangelism
overuse
ronson
afield
manuela
wolfson
brompton
kama
demarcation
meghalaya
urea
elicit
funicular
gunma
risking
pravda
mouthpiece
caesars
sarmiento
pryce
imogen
duque
millimetres
metaphorical
thankfully
onus
grammars
juror
overgrown
causality
eurozone
ascoli
cation
chaves
isolating
attenborough
trampoline
waddington
scriptural
evangelists
numismatic
wrt
solomons
wop
wests
ploy
spooner
viru
crappy
abbotsford
sacha
hedley
tiraspol
aloe
frankel
cielo
preis
ratify
zoologists
leninist
pedantic
diagnoses
ostrich
germaine
outcrop
spleen
rigs
parrott
dismayed
rendell
panelist
reba
gleeson
measles
conceivably
carols
sá
wayside
shingles
bailout
refraction
sheva
stetson
unqualified
fiend
fürth
starved
autry
subsumed
revamp
occupiers
boasting
propagating
confessing
wingfield
spacex
ticketing
stringed
tiempo
rattlers
aisha
cif
clipped
florent
buoy
pikes
layla
kamloops
comforts
mandeville
psychologically
embellished
mull
paramedics
cetera
cotter
coker
ragnar
udaipur
polygonal
reshuffle
battlecruisers
kazhagam
sasanian
pharmacists
collie
ange
senhora
ghazal
trnava
particulars
manhunt
osa
livy
florets
iwate
fundación
urgently
fosse
jeb
homburg
vacations
tithe
salta
rossetti
polyhedron
laity
shelburne
tisch
fille
chul
tilden
anemone
judeo
pinochet
mee
loophole
intervenes
ump
deletionists
lassen
dossier
feuding
shearing
kutch
geddes
flirting
camillo
roda
pawns
evokes
hamill
lamented
evaded
cormorant
tremblay
barrows
arbroath
kagan
bbr
rickey
plumber
hoods
intersex
rah
rowdy
multiculturalism
sortie
retardation
canvases
astaire
cafés
savile
borja
tobruk
colorless
regimen
emptying
hagerstown
billboards
geophysics
trento
antiquaries
afs
ellipse
elwood
pancreas
sanz
régiment
hopi
opportunistic
hayek
amulet
ambivalent
npp
prek
belgians
tenement
ingested
maidens
pixie
platz
schaumburg
kaoru
perpignan
legume
sooty
dmc
lovelace
elon
gödel
tatra
favourably
cir
dunning
stewie
alvaro
ducts
scratches
lindgren
vern
powerplant
delineated
muerte
norah
mongoose
sik
pompidou
extraneous
nurturing
bursting
fastball
ocular
balconies
farnese
feathered
siddeley
bretagne
lichens
hydroxyl
conspired
hammered
healthier
bmc
logie
publicised
fieldwork
characteristically
pars
agustin
exec
overwhelm
reunites
prefects
kievan
studded
camus
nuno
harland
paralleling
ingham
eleonora
conscripted
slugger
stabilizing
brushed
triumphs
antiaircraft
disclosing
bentham
lobed
boilerplate
interplanetary
momentarily
kanawha
evocative
caricatures
shmuel
turkestan
yeoman
titleholder
snipers
dignified
meager
dou
embodies
automata
yarrow
fermat
capricorn
excesses
latterly
fieldhouse
hypocritical
inhabitant
bakers
lawns
speculating
clipping
oshkosh
copland
shibuya
haat
decapitated
rung
borrows
strikingly
fawn
grodno
nomen
nuova
flurry
aggregated
stott
cubans
fuzhou
rodin
doubs
suing
bearcats
haredi
afrique
fait
aso
skidmore
overran
disintegrated
yoke
reine
tossing
slider
haden
aleksandra
umeå
topps
spinoza
ota
capuchin
metamorphic
rapp
dornier
caldas
discernible
scalp
wilt
exclamation
freemason
courant
zoroastrian
verkhovna
scourge
haller
impacting
tioga
virginian
impressing
sayed
sèvres
hijackers
shou
oerlikon
juba
agreeable
shortwave
plantings
spitting
scrambled
harmonics
administrated
campsite
retraction
knack
pes
bathtub
militaire
chesterton
nra
jammed
izzy
shatner
qiu
alms
gallimard
hump
copyrightable
uribe
polytechnique
peres
generalizations
fragrant
barnaby
vl
pcb
mccarty
kunming
bookstores
congested
depository
pilkington
consulates
mew
oem
maranhão
requisitioned
rajeev
breastfeeding
overcrowded
vapour
bobbie
bhd
hebrews
recounting
oulu
brion
pundit
punts
jett
sloth
troubadour
holographic
resilient
jaan
impresario
altenburg
dorn
amigos
netscape
bossa
dui
veritas
physiologist
argon
complying
girder
volker
valli
commandment
nicaea
mite
photojournalist
dahlia
beeching
shimon
weighting
egalitarian
nagai
motherland
gilgit
figs
piedras
pathé
majorly
uzbekistani
woodwind
bebe
deference
parenthesis
delimitation
tutu
loudspeaker
orsini
wayback
marshy
kwong
geneviève
emanuele
vladikavkaz
mynetworktv
quattro
tangerine
gelderland
piss
advisable
winnebago
deceit
succinct
dissertations
curitiba
salih
monarchist
reappointed
quickest
affidavit
cordova
tek
rogues
nis
abound
amputation
elkins
hagar
heretics
crowell
ratna
palatal
stumbles
damm
melaleuca
hsc
lowther
adventists
melanoma
crump
linkages
censoring
bight
silvers
valdivia
potency
faisalabad
diddy
aragón
udo
dionysus
dello
followup
reflexes
eased
appel
afrikaner
discourages
plankton
mangeshkar
endothelial
paleocene
museu
mystique
lufthansa
impersonating
katharina
loudon
cnr
crabtree
mawson
preludes
povs
braintree
simulating
overtaking
mclachlan
melts
snes
comin
supercharged
nossa
stifle
feline
federalists
mythic
swordsman
messrs
spangled
ifd
acceptor
tuner
ayutthaya
hoe
steph
bunbury
ohrid
gillard
kohn
fluency
galina
führer
wy
palacios
dissociation
nadi
gotti
stocking
astrologer
dimitris
sarthe
nigger
adrenal
vivien
chipping
experiential
antivirus
aut
estudios
detainee
adheres
basset
bau
councilors
mariam
condoms
starry
abbeville
presuming
mot
warmest
onyx
symonds
neuro
calamity
globes
sunnyside
nabi
futurist
intensify
babble
amador
footballing
preposition
penalized
dinghy
bras
volk
microsystems
reagents
busted
librettist
envelopes
paf
ucd
orca
arian
chesney
drs
securely
humanists
flavius
holme
medea
fausto
ironi
philately
tampering
labourer
profane
herons
rapport
mayne
nationalised
barnstable
callaway
displacing
verdean
defied
wallabies
posen
bromide
kristine
gnostic
acetic
deflected
carrasco
clarinets
kz
bannon
viscous
unoccupied
airwaves
filth
korda
mailer
genovese
quilt
titian
trish
nothin
ligaments
keiko
bakhtiari
parlour
importer
pharaohs
pittsfield
monogram
preeminent
battlefields
supergirl
congratulated
partitioning
reorganize
crm
zwolle
lipids
vitale
rebuffed
predeceased
soar
urbanized
cygnus
columbine
seedlings
vientiane
reedy
birdie
rigidity
furnish
vostok
csc
ogilvie
pounders
downright
catalyzed
endo
pelletier
donato
relentlessly
monogatari
lineups
nabokov
lithography
daejeon
operationally
culminates
uterine
cortina
ela
betrothed
péter
hsv
assuring
cahn
infanta
idioms
enmity
rotc
customization
upriver
propane
woodcock
renditions
delusional
lajos
hirise
sadiq
rotax
argento
klondike
highlander
bovine
lbf
ramblers
autoroute
perera
topper
brandi
wiping
heerenveen
angelou
sangh
tanning
repechage
habsburgs
insure
staffers
rfid
trapper
sura
stagnation
strengthens
regio
puig
maier
pena
sandown
manoeuvre
obliquely
depeche
slaying
rethink
hilaire
dykes
rachmaninoff
daria
tainan
jafar
doubtless
fla
carmona
cached
slurs
cede
megadeth
rochefort
creationist
pussycat
interurban
allium
dilation
extremadura
kurosawa
flakes
argonne
entertainments
molybdenum
roleplaying
childress
hombre
telepathic
laughable
bourg
vinton
freitas
wields
ffc
vid
altona
guaranteeing
monkees
aleksey
partake
sharper
cardigan
trotskyist
injures
bia
datuk
vedder
shukla
amal
thatched
oriole
infinitive
bandstand
fontainebleau
episcopalian
mallard
busters
rubles
disbelief
goldfish
psc
cornwell
miramar
intoxication
jeter
botched
sérgio
regius
isobel
amara
wurlitzer
pärnu
junius
jab
deliberations
dyck
siri
hungerford
harmonium
differentiating
cerberus
spillway
mpc
rhoda
bons
mingus
seitz
derrida
dillinger
rediff
cmos
tactile
witherspoon
unattended
jac
backpack
ptolemaic
kincaid
wgn
goalkeeping
wink
hinges
ramesses
masking
surinamese
furthered
shocker
snapping
chipmunks
ect
outta
valparaíso
monologues
encroachment
amitabh
absurdity
smc
itinerary
risked
carsten
exoplanets
shahi
trilobites
desoto
werke
verifies
julián
pulau
kingman
dissimilar
lagrangian
incapacitated
bigotry
sills
slams
aliyev
intonation
stavropol
virgins
spiro
pretenders
paweł
tana
pissed
suspensions
kebangsaan
blueberry
converters
matsui
rogaland
blob
hodgkin
cpt
caches
gunfight
marymount
cleaners
proteus
spectacles
premiums
kestrel
tec
parva
blumenthal
redd
sandler
termini
lowly
turnovers
kwang
crompton
signified
stratigraphy
goulding
superfund
expanse
brito
heaps
homeowner
purchaser
drip
hildebrand
serialization
delusions
legge
prato
tibor
extraliga
shameless
qe
cheapest
shiga
tireless
triumphal
roadblock
annoy
bottleneck
compress
contagious
dinesh
emulated
agios
trombonist
ctrl
libertarianism
brd
believable
makhachkala
whistles
lumen
kalisz
feliciano
catalogued
guarani
nacht
angst
tsing
everytime
tff
restrooms
waning
precepts
asterix
telecasts
questioner
whittle
executes
bondi
sprites
puducherry
bicameral
musicology
pineda
ninh
mayotte
noblewoman
undulating
ccp
hyperbole
adp
pola
livejournal
rosales
complicity
perceives
safest
limpopo
convair
forgets
beal
prussians
baumann
jahangir
carrots
tandy
caddo
sewerage
bristow
athlone
stadia
nieto
sten
logarithmic
lotta
warringah
aymara
dela
jonsson
kaluga
innovators
flawless
tohoku
bunk
avondale
canvassed
reggaeton
emden
locates
nablus
mito
weakens
lithuanians
orgasm
predate
wintering
bumped
quicksilver
venetians
carbonyl
courteous
shp
angkor
harps
millionaires
betrays
exonerated
genitals
götaland
greig
krypton
untenable
tilburg
fuses
tinted
pangasinan
sagaing
forel
vibes
headingley
calibrated
inheriting
aeneas
cinder
morph
kenilworth
tête
fumes
stepson
prenatal
capel
aust
borland
tanager
dnieper
adrift
colm
hinting
diem
misérables
motoring
polygram
heroines
spaulding
bhagat
gautam
corso
whistleblower
intermedia
strikeforce
paulsen
beechcraft
conservationist
pups
workhouse
shikoku
grub
richey
lyra
leftover
graces
disconnect
trafficked
corsican
spezia
merck
tableau
knuckle
dandenong
disraeli
theoretic
balaji
bischoff
fuente
persecutions
sunbird
determinism
engulfed
pohl
chadian
colliding
yaw
penske
coaxial
koko
swabian
cystic
hanshin
toads
mandala
lasse
slapping
wielded
lian
trombones
medan
ellery
virtus
maura
endeavours
lookup
moa
milder
liberating
defencemen
alcalde
formality
marchand
shari
borisov
novellas
tripp
chevaliers
sevier
guglielmo
ueno
underlined
ustad
sweetness
orientations
gracious
pdb
hounding
pythagorean
tease
bosniaks
heyman
muda
hyphenated
insiders
hackensack
staggering
ophelia
coiled
carve
naa
manure
resists
georgy
adelphi
modulated
lanham
sportsnet
furtado
antiquary
austere
rucker
reactivity
huesca
ramiro
overheard
edm
grinnell
fronting
dugan
airships
vaz
wicca
waterproof
msx
urawa
coadjutor
cosgrove
jna
tunneling
lockdown
misaki
lillehammer
annuity
edema
nucleotides
jayhawks
wellcome
suspecting
demonstrator
tiff
mockingbird
aca
lugers
rasputin
gabriella
luise
overturning
neurologist
crs
nco
ael
jabal
vigilant
beauties
kwh
crusher
gerda
discord
approximations
sanremo
intuitively
realisation
nanoparticles
jacek
folkestone
bibles
segundo
celibacy
chivas
werewolves
slasher
roald
vulnerabilities
boredom
mário
fane
cel
thx
cadres
byrnes
bogies
flathead
nunes
kda
exegesis
slipper
formulating
widget
powerpc
mandible
clarksville
dodgy
tibetans
raping
herein
gutted
björkman
repost
splendor
bolted
sotheby
gazeta
accrediting
paypal
negroes
karina
skits
lohan
dominick
joystick
dispensary
tull
crickets
posits
shearwater
madero
ils
renewing
leclerc
mossad
gisela
octavia
hierarchies
ochre
polyphonic
miraculously
triathletes
lner
cellphone
apo
hayat
seva
individualism
provisionally
keighley
pampa
koper
suction
‑
metroid
darko
carrara
pocock
akers
sdk
ply
loci
gravely
carpentry
reprimanded
cheaply
timid
repaid
affinis
hetman
zacatecas
morgue
kwai
cheerleaders
bedding
conrail
tectonics
bossier
prods
elms
symbiotic
sagas
abortive
moustache
ethno
léger
wandered
environmentalism
cardoso
bhaskar
evers
mobil
jsc
rajputs
caprice
nestlé
cuny
peasantry
herder
sati
sexiest
florin
incite
fuzz
khimki
friary
malloy
carillon
xa
interment
equalled
barbershop
haber
tobu
dma
vodacom
signalled
totalled
watered
dewar
elba
kauffman
euphoria
crypto
styx
airspeed
wolfowitz
risc
sensibilities
acr
mfk
pogrom
alerting
ayub
cloverleaf
hazara
chromosomal
gingrich
arezzo
eri
hippolyte
feeble
smit
howland
iced
fielders
chemin
ovary
vd
orozco
northcote
sheaf
riddled
diogo
huguenots
stabilizer
blundell
embezzlement
mss
orenburg
hokies
kharkov
broadening
winless
wilfried
aurangzeb
paredes
devotes
babelfish
motorised
deflect
herron
consented
metering
tsu
coalfield
bypasses
assessor
teague
fluorine
merited
sportswriter
linnean
landers
southwell
mga
hofstra
schaeffer
computations
berkley
watersheds
niro
andalus
cru
verso
subtypes
ato
gris
fabrice
rife
hoisted
puritans
yaakov
isuzu
ldp
belfort
birdman
perish
freer
leek
willson
erhard
bridger
condom
kojima
grevillea
burglar
amenable
uvf
tibia
transferable
broth
executor
annihilated
tighten
northumbrian
carcass
westpac
midas
sangre
shotguns
groot
ochs
lateran
complicate
drysdale
cosa
spiff
vevo
duchamp
sympathizers
tiara
spiked
rsn
dewsbury
fishers
completions
borrowers
mlc
unconvinced
rodger
legation
gimp
buch
ohm
rawls
wilkie
backstreet
kikuchi
deafness
terr
ludvig
lilies
recharge
spc
eisenach
pct
minamoto
salix
syndromes
abbaye
maryborough
narragansett
bolstered
mtdna
morbid
kenseth
angelique
kernels
ouachita
buford
thun
hannes
burch
blazer
sausages
bolding
andrej
smoker
melodifestivalen
indemnity
urbanism
hargreaves
vitoria
nomura
keio
ruc
khel
nid
costas
ammo
tenn
mughals
petrograd
mpg
vga
galois
csp
allard
redhawks
gravestone
teleport
bih
gluten
numb
ashlar
southerners
interferes
omani
utilising
indulgence
ridgeway
schwab
saddened
directorship
calculates
theosophical
rosenbaum
polyethylene
finistère
metalurh
tet
decipher
isabela
abort
indebted
kinsman
pharmacological
jagannath
tps
ecu
zielona
hennepin
arduous
bayley
nonzero
ultrasonic
nemzeti
myocardial
facs
celle
aab
satu
pekka
lista
amo
scribes
hugues
tenacious
huns
infernal
journeyman
pall
ruddy
uncertainties
bonita
africanus
centrum
mistreatment
mysterio
immaterial
mdc
thaksin
ruud
injure
rena
shareholding
havel
anaconda
interns
muted
fleshed
trang
ivano
ismael
anjali
leonean
cordero
aspirin
ladakh
théodore
sunbury
kirkby
paranaense
odia
awami
slapstick
lapd
retaliatory
tipu
caching
nama
ruslan
amphetamine
cashier
smacks
calves
sequentially
palmetto
poetical
usp
mechanised
myotis
absentia
calvo
auger
tripled
droughts
saas
dataset
niklas
neto
boas
idealistic
gravy
tantric
analyzer
kalinga
euphemism
conspiracies
subplot
deen
crowdfunding
genealogies
iguana
wrench
disbandment
mcadams
integrative
gopi
uncovering
naka
gaba
navigators
scrubs
delinquent
eastside
kilpatrick
eisenberg
stefanie
grocer
tfa
oireachtas
campsites
iterative
mda
legrand
giza
satya
countrymen
spires
ramone
halliwell
hysterical
popularised
tragically
día
encyclical
herat
sathya
torrey
taming
ecologist
mais
brie
kittens
reservists
multiplicity
usefully
hillcrest
purposefully
ramachandran
ukr
satisfactorily
yekaterinburg
rove
farid
mirnyi
bedtime
pag
barcode
sylvan
starfleet
whitlam
driveway
brier
reckon
yulia
finnmark
vigor
hoist
hamer
cruised
mea
dolce
infestation
supremes
neutralize
karting
csr
specter
ninian
agassiz
overlaid
¿
cinta
noses
draped
vishal
houdini
appendages
suzerainty
jaffe
lemonade
histone
deirdre
reardon
levitt
hillel
neglecting
gwyn
godard
fudge
gustafsson
inspecting
moya
ajaccio
pittman
steinbeck
semicircular
livonian
chasers
protease
gauguin
theres
undiscovered
undp
eyck
examiners
chanted
barbican
urquhart
octaves
upa
albano
visceral
charan
femur
sio
eldon
tomlin
plinth
elis
abolitionists
thales
denbighshire
trekking
diverting
vegetative
alcalá
herrick
telemark
facie
amundsen
publicize
praxis
banach
subsurface
shielded
wootton
skanderbeg
mpaa
leonora
jeannette
capua
strenuous
biodiesel
vireo
catskill
cautions
narva
nab
heretical
collectives
lumley
itis
calcareous
momo
icebreaker
stimulates
deductions
bodleian
opengl
levelled
ebb
behar
emus
withhold
magnification
morningside
fosters
zvi
runic
clams
poli
fructose
lovato
país
csf
paschal
monopolies
avenged
diameters
segmentation
mila
voiceover
ecstatic
awaits
gianluca
ridgway
pews
angling
lowestoft
perched
inman
subregion
shigeru
disrupts
tochigi
kidnappers
bachman
recieved
gateways
hideyoshi
perforated
phenol
demonstrably
prunus
loyalties
bureaus
cytoplasmic
managua
eines
natwest
bacharach
aip
kostroma
nrk
worsley
erlangen
sentinels
rethinking
antonius
marathas
andi
waterworks
aguinaldo
goings
telescopic
kuznetsova
unprepared
ostend
chronicling
anglicised
suspending
shekhar
umbilical
reprisal
poking
cour
psd
yumi
carburetor
endpoint
gx
lem
fracturing
collars
unites
reintroduction
wafer
solon
rtv
coen
divisible
calloway
alkaloids
audits
kom
soybean
riverview
fatimid
plow
danforth
siddiqui
hokkien
yachting
anastasio
denounce
underprivileged
mulroney
victors
purports
mediating
tou
receptionist
idw
meer
ariadne
auld
iraklis
alcohols
slay
chipset
smelter
lingual
jus
kea
swarthmore
wonderfully
bataillon
walkways
shattering
shem
separatism
prickly
goody
normalization
retrograde
braithwaite
muffin
yokozuna
condemns
decayed
msv
linfield
loa
sponges
seduced
leste
ptsd
breakwater
devoting
rabindranath
iva
heretic
undertakes
tammany
cyan
fervent
doon
zamalek
cottbus
concurred
emissary
shameful
dali
polemical
foothold
firemen
sorceress
excretion
alyssa
audiobook
montpelier
tyrannosaurus
wsj
narcissus
piquet
rohit
lemieux
speculates
ainu
thankful
thunderbolts
epithelium
pfizer
atta
touré
deluge
maputo
tantra
ingenuity
handguns
brezhnev
disjoint
beckwith
turban
atrophy
qs
marauders
novices
ille
llano
uncharted
chuan
stosur
orthopaedic
negate
swimsuit
deuce
jeon
tasty
trappers
steuben
clinicians
akershus
khitan
disarmed
vinny
leng
leanings
kidman
rescuers
picky
griffins
inanimate
onshore
bx
heartfelt
handicrafts
handlers
nederlandse
goon
sportivo
favouring
marti
coffins
scathing
sulaiman
barnstaple
indulge
testifying
renton
messi
chrono
alphabetic
franconian
wildwood
hornbill
hideo
oliphant
allegany
sheboygan
flannery
artistically
playmates
cranbrook
sakurai
gac
jazzy
hillsdale
leveling
acceded
reprisals
hardie
gelder
tinto
planks
stipend
houseguests
bushy
lederer
fairbairn
putra
newcomb
looms
humbert
watermelon
ziggy
pinewood
megawatts
lenz
wikimania
responders
mussels
stela
sachsen
hibiscus
discounting
toba
graphically
chiltern
rehnquist
opinionated
mcs
insurers
burrowing
trabzon
sweater
cac
ticks
overlapped
generously
uttered
chaudhary
subotica
smuggler
blackrock
encased
earmarked
mimicry
wollaston
beckman
tyumen
piggy
jerez
gripping
yule
blackheath
corbet
stockbridge
grigory
fredrikstad
busta
plateaus
jani
hormonal
kwame
hohenlohe
withdraws
thrasher
stoll
pumas
ardenne
snider
michaela
ashfield
mainichi
tuxedo
flicker
troupes
plazas
stiftung
buckets
origen
wedgwood
biannual
sedentary
phonograph
reuter
montmartre
apathy
mohamad
flemming
satirist
concepcion
drags
carlsbad
transylvanian
unauthorised
mcclain
hibbert
fonds
replenishment
bahru
boarders
pdr
displace
tiebreaker
uf
nia
lusaka
lifeguard
configure
droit
artnet
welland
decorator
hur
paintball
gazprom
dic
avatars
paleo
xf
stalling
rhenish
spammers
minaret
gluck
intensification
denser
oba
macrae
sampras
ont
garages
huelva
ladislaus
iga
gucci
acf
morita
reinforces
taiga
moreland
wbo
freezer
extrasolar
fists
sebring
accusative
espresso
bodybuilder
sca
ceos
ranchi
reassignment
seddon
ivana
muskets
kodiak
borel
sorghum
elaborately
rima
merv
lambton
harpers
sefton
diwan
mixtapes
bandmate
petrochemical
studi
serine
riverine
lashes
tetsuya
preside
dnf
receivership
corrie
cec
yunus
equaliser
ceasing
plessis
disparities
sperry
flavio
burdens
jha
zahir
bourgogne
butchers
korsakov
replicating
bandmates
noyes
shelbourne
swayed
thoma
crores
❤
congas
invalidate
saale
leonidas
fiennes
nordiques
sinan
tumbling
grissom
mimicking
zealander
pickford
leong
rfcu
accented
ethereal
platoons
glenelg
structuring
relevent
investiture
imagines
homebuilt
ecliptic
pritzker
kragujevac
sloops
topless
timeout
rer
ariana
kearns
lundy
cormac
aptly
sideshow
bolshoi
hydrology
vitesse
instigation
pregnancies
fortescue
tracer
pepperdine
radicalism
fling
pint
rigorously
alessandria
heaters
circadian
gratuitous
sprawl
webmaster
rebate
chemnitz
folklorist
cripple
nayak
wayland
osgood
pesaro
append
gatherers
huntingdonshire
unfold
inept
redondo
quartered
transplanted
gaspard
crosstown
hommes
nyasaland
oriya
paralleled
gv
drinker
airdrie
exhausting
manatee
valeri
gta
shayne
barksdale
sopwith
waveform
mpp
mulholland
govinda
sergius
generational
capensis
zag
halloran
antietam
lupe
puzzling
ecozone
darien
mated
macomb
steyn
herz
galt
deliverance
psalter
upc
stags
malevolent
biloxi
barium
negligent
recycle
unwise
aran
suppressor
estimator
rencontres
hitomi
horta
protectors
nashua
pinging
ogilvy
darter
sinful
handover
unstoppable
fernanda
childcare
stinging
stillman
harlequins
diodes
tupac
tightening
individualized
twister
vacate
archetypal
tock
governorates
netherland
rak
stitching
gust
prada
pps
maluku
laments
kyo
renominate
emphatic
rafts
norrköping
morehead
distantly
digitised
superpower
plantagenet
subconscious
phish
presto
sabin
secessionist
omissions
spammed
psych
roadshow
hairstyle
wicks
courtiers
boathouse
undeniable
mako
gearing
faking
rafting
bikers
arbuthnot
asymmetry
informations
crematorium
poo
goggles
forking
leitrim
valhalla
corneal
cubism
goku
foresight
tgv
daewoo
probing
unfolded
golding
choppy
wealthier
dov
forfar
uncompromising
mitigating
equates
adrián
bahr
vitali
connery
hem
apparitions
cisneros
retaliated
scouted
hatches
retook
aristide
cee
henk
grenfell
brash
awfully
chimera
khaki
cataract
mindful
bryansk
barthélemy
edd
intertoto
newsreel
chor
fortitude
mucus
airstrike
laban
stalwart
grapevine
ecb
cleaver
aviva
guardianship
grahame
benítez
heliport
sportsmanship
tcm
diverge
lloyds
bozeman
pharmacies
deepening
governess
hein
fluke
grampus
manoeuvres
neotropical
offensives
ediciones
vertebral
tera
midwives
menéndez
goers
fueling
confidant
neurotransmitter
naik
contraband
oncoming
wap
discharging
yaroslav
abrahams
declassified
hairless
attractiveness
inshore
havelock
pontoon
malhotra
racketeering
gunned
mennonites
cabs
flung
exhumed
abundantly
increments
chatsworth
refrained
sutras
euston
nao
miura
towne
hideous
ehud
burgundian
lice
editore
zuma
mistral
sarnia
alamein
stanislas
looping
mouton
arellano
sunbeam
abteilung
watercolors
bihari
dissonance
breyer
tunnelling
damping
grieving
natured
eriksen
goalless
kapp
jesper
unsettled
bibliotheca
emc
qd
pinning
pocono
coos
outages
fibonacci
morrell
rabies
notations
discworld
greifswald
pearse
morgantown
ganguly
pepin
colville
baffled
macroscopic
infrastructures
shipwrecked
gogol
juarez
toho
buffers
krebs
kyi
apl
iodide
ethnologue
northside
vfr
telepathy
mayagüez
szabó
fastened
homers
monteverdi
reclaiming
palearctic
haase
brazos
viscounts
flowered
oboes
iis
agosto
pachuca
soria
scholz
costumed
archduchess
cps
mauser
stockwell
dilution
tatarstan
dalla
bennie
gimnasia
moyer
toma
yamashita
artemisia
mota
sourcebook
custard
insoluble
irishman
ehime
spearhead
ajit
mccord
selenium
tsonga
silicone
eiji
brits
perce
sharpness
krajicek
foodstuffs
suceava
carbondale
nauvoo
linguistically
alcoholics
metzger
descriptor
galapagos
calvinism
whitey
sonoran
masahiro
modifier
gory
lasker
whimsical
tapering
lapland
machiavelli
scofield
categorically
timetables
betterment
pcl
nws
fluffy
shank
bittorrent
paradis
isola
yasser
quinlan
extremities
tester
tendulkar
revolutionized
jurong
espírito
firebird
vermeer
proletariat
impala
paulson
textured
ende
alignments
overtake
stratification
renames
dürer
liston
gendered
tranquility
decoder
viborg
howarth
glorified
christa
nouvelles
transpired
borrower
hearsay
rafter
hounslow
ewell
avellino
cochabamba
ber
antichrist
pecos
automate
sasebo
ishii
raffaele
enclose
greyhawk
swaps
gaffney
jaén
elinor
lune
filho
neva
transcriptional
toddler
crate
zell
overloaded
outro
daman
roving
saban
bales
antonín
beatification
catalysis
xb
monolith
algorithmic
unconvincing
hooghly
xtra
delirium
meehan
dimorphism
escapees
murchison
simplifying
pharma
workstations
kasparov
cypriots
dormer
jesu
buttocks
semper
gurgaon
ando
tegucigalpa
coelho
wakayama
umno
brandywine
eunuchs
silverstein
electrochemical
delilah
luxemburg
misinterpretation
rhif
obscura
perthshire
pennies
bronzes
mistook
pilate
konin
nei
taper
nfpa
mahindra
forgives
leal
likud
misnomer
lyell
granby
blume
sandboxes
huggins
ofcom
sz
yum
bacillus
porno
joubert
hom
clp
outbound
opie
esl
magallanes
distinctively
natalya
necessitating
lollipop
mmm
muldoon
sistema
manassas
szeged
climactic
koala
kenan
rügen
paradoxes
posited
møller
bhutanese
scrabble
durrani
agricola
posterity
heathcote
conjectured
countryman
anglers
sorensen
kentish
ims
hau
draco
aurangabad
bárbara
annandale
hergé
rollo
aristotelian
donatello
quarrying
falsified
intensively
mend
apprehension
hells
cocos
undersecretary
bustamante
mycologist
tusk
hyung
puffery
analogues
nationalization
villers
natively
romo
burney
catenary
cedars
sauk
rcm
cleverly
simmonds
oryol
picardy
reciprocating
filings
mindoro
makin
gurdwara
statisticians
babbler
firestorm
circling
srinivasa
mauna
tropic
undermines
trs
hordes
rasa
prescriptions
blockers
sketched
harpoon
swede
tosca
tapestries
daemon
cybernetics
emin
stoddard
garret
patiala
herndon
unpaved
dystrophy
canadensis
runcorn
ccs
teamsters
gautier
vivek
iverson
groceries
allergies
tere
gillett
eglin
brews
livia
callers
woodworking
minotaur
kashiwa
sotomayor
stateless
underestimated
reiss
stakeholder
erecting
quieter
cluttered
outboard
birthdate
turan
cade
fathom
restaurateur
après
halim
suzie
coronet
leto
unfaithful
estoril
belgorod
eniwetok
impersonation
beersheba
hangover
decode
kj
katana
commandery
narcotic
pylon
agrippa
scotus
shunned
amicable
transduction
bronco
anglicized
datta
bolzano
heber
annular
bulletproof
sitka
terminally
slits
atwater
munnetra
minutemen
hydride
gn
diversify
rusk
aru
kazuo
giroux
slipknot
debugging
piled
ewald
beaked
blackie
loew
usha
wuppertal
flintshire
defectors
talons
gambier
ragusa
galápagos
peep
gwinnett
movers
kiwis
breads
accomplices
therapeutics
looming
kilns
upanishads
remission
sideman
griggs
affleck
gers
amity
amma
stepan
anatomist
lz
sefer
encirclement
intrepid
kuching
amassing
setúbal
typified
patrik
carlsen
backer
carrion
whitehaven
revolvers
supra
trot
rood
hawai
casale
masquerading
peake
crease
skåne
symmetries
marat
alkyl
carranza
caligula
dor
freda
flavours
asgard
dealerships
lifeline
scratched
communicative
intercession
grinder
spammer
jl
eeg
vesicles
moltke
phonetics
rebounding
endorses
locarno
antecedent
duplicating
perlman
unsuspecting
scarred
lae
pelt
mikko
endowments
nadh
gannett
bou
licking
yardley
frankston
cardozo
gartner
ptv
ruston
nicolson
flugelhorn
wallaby
alloa
siro
kandi
varga
baie
coleraine
woodman
frome
machining
dialysis
abdi
typhoons
mahony
seaforth
authorisation
anguish
voz
moussa
chiswick
minto
mudd
motivating
historicity
understandably
gosford
trina
cuvier
nagel
tenses
polyester
quarto
telly
sedans
unfolds
guetta
finkelstein
powering
calorie
aesop
camshaft
biotech
ewa
araneta
orientated
corazon
denim
dunmore
kanal
suitor
everyman
leonhard
paladin
valerius
colonials
covenants
maidenhead
brickwork
shredder
sidon
bangs
deviate
hyacinth
gamecocks
narita
deg
bambi
unconstructive
peripherals
schweizer
colorectal
git
commissary
legacies
slammed
grier
dohc
grc
murali
ullman
mitchel
synchronised
heathen
maximizing
deuteronomy
utilise
bnsf
bastards
ivanovic
theseus
subjunctive
kyoko
redfern
nubian
geyer
avert
soundly
seljuk
squatters
technische
dahomey
renters
hutch
studs
cul
familiarize
mizuki
mura
stockade
emporia
dissolves
coetzee
impediment
posner
satsuma
uns
enveloped
westcott
transplants
bah
salvo
jeffreys
bulawayo
amplify
briefcase
hx
sixes
ningbo
cephalopods
staircases
rightfully
caramel
dales
bribed
vmi
huracán
plexus
townhouse
jenson
pretense
yamuna
brechin
epo
keppel
nanak
nunnery
nus
cams
cbbc
ferrers
palladian
clemency
bloated
oto
stylised
meghan
cramped
magnates
dishonesty
benigno
precaution
sens
inseparable
jurisdictional
infecting
salting
elmwood
uplifting
frustrations
unaltered
ulcers
foggia
elvin
plz
mellor
dissipation
prr
isoforms
nasional
strawberries
geodesic
advaita
esperance
innocuous
ingeborg
konstantinos
kirtland
hijack
multivariate
camara
semifinalist
bashar
mayhew
redbridge
nevsky
wilshire
enron
​
havens
neutrinos
laborer
baikal
ganz
amarna
greyhounds
erasing
usman
lipton
unequivocally
fondly
tydfil
witte
pym
campeche
kenichi
freebsd
gallardo
isometric
abbeys
tezuka
phonemic
kbe
habilitation
characterisation
ayurveda
stuntman
trisha
hensley
subhash
carpathians
beehive
candida
regrouped
cram
bubblegum
conversational
curley
beheading
imperium
clausen
printmakers
shunting
jiao
ufos
rhett
ingersoll
ushered
amalie
contending
ena
skywalker
fluently
sundown
shadowy
remuneration
dubstep
ibarra
affords
antimicrobial
prospector
cyst
cilia
dispensation
unclean
mythologies
timers
hereafter
ove
ach
radon
peachtree
labelle
whitewash
hulled
didactic
biff
emigrating
flirt
baptista
chopping
aller
sokolov
novell
sialkot
watchman
syrians
desiring
cima
cyberpunk
gilroy
biel
aneurysm
xiamen
rafters
crick
sau
wirth
runescape
basse
tenet
piling
natura
droid
crumb
thoreau
salm
arie
geisha
whitefish
zandt
quirk
cinéma
altos
psoe
blige
torrential
devanagari
vetted
bashing
convective
mikado
murthy
raines
rumi
vestiges
marrakech
whiteman
assailant
correspondingly
wicke
multiplicative
gautama
wps
asperger
loos
hanford
beauvais
ceuta
clementine
rivero
milanese
ficus
catawba
feudalism
thule
philippi
debacle
qualcomm
recombinant
rcn
minoan
astm
potable
andros
mariposa
fenian
kasper
pca
swears
geospatial
asturian
purging
cashel
negara
rifled
odell
icu
cysteine
falstaff
embraer
espoo
boro
portia
teasing
supremacist
resent
pats
kangxi
ontological
anni
carboxylic
tsarist
wasserman
tokushima
harada
dhs
classicist
nfb
oca
headteacher
gutter
aligning
aquaman
generative
llb
narbonne
burnie
venu
cleansed
deadpool
vytautas
headstone
tormented
frisia
psychedelia
garra
tendered
eelam
rehabilitate
refuges
coercive
bpm
wicker
liberalization
finalised
payloads
corvallis
collegial
trigonometric
mussel
estefan
diggers
nhc
puns
keeler
nyg
discreet
côtes
takeo
koehler
baffin
edging
opossum
psl
wga
deepened
internships
spoleto
wranglers
hoyle
raghu
linares
bukhara
loveless
grt
conscripts
errant
glaser
eisteddfod
gabor
skytrain
cranberry
castaways
boers
mutton
meticulously
nlp
depp
rizzoli
sportswriters
cenotaph
wholeheartedly
frail
borac
biometric
cirrus
mooring
palos
stuffing
converging
megapixel
toppled
burghs
reverb
sanga
tempting
zwei
worshipful
azarenka
ncl
mamma
superfortress
pavlov
dictators
pyar
starkey
pda
coexistence
birla
patronymic
mhc
hares
whomever
syllabic
stumbling
fondation
dass
dispensed
unimpressed
gamboa
slaughterhouse
sheraton
dissuade
rewind
contre
tattooed
kontinental
scaffold
randomness
exxon
fanciful
disallow
steampunk
stretcher
polje
steadfast
lagging
templars
zeitgeist
gamal
hawkesbury
proverb
spicer
coldstream
zapotec
thirsty
synapse
conformal
simferopol
hurdlers
taunts
mismatch
placenames
getafe
margery
unfavourable
zoey
injunctions
spares
adaptable
robredo
paleontologists
pedophile
fiske
marinos
khamenei
nsb
ivanhoe
alcock
tunku
biofuels
augmentation
mummies
diphosphate
locos
tous
cherished
bernese
unilever
vignettes
thorp
girdle
vitaly
forsythe
benthic
payback
perpetuated
pepys
nla
hundredth
parapsychology
nella
hipparcos
minkowski
corrosive
lorena
greenbrier
precipitate
sukhoi
cuneo
geyser
superintendents
suraj
awkwardly
cornea
golem
velasquez
classicism
geocentric
seagulls
steppes
mosses
demille
adama
cwt
farber
joakim
sek
yaoundé
bpi
siglo
lamarck
português
lennie
taunting
wexler
gts
boavista
slocum
defoe
gonzo
vengeful
etta
purvis
scsi
montmorency
luminaries
elbert
eoin
czechoslovakian
thrift
borromeo
smirnov
lobbyists
fukuda
aunts
hairdresser
methamphetamine
ferrand
headmistress
stellenbosch
rocked
wraith
educates
longview
duels
stagnant
kalan
jinan
silencing
cliché
corroborated
brainwashed
xenophon
huai
moura
courting
agk
thrashers
banding
tomislav
dismantle
coerced
ewe
akiko
taxing
soles
communicator
bittersweet
insecta
glyphs
localised
klm
vasil
lengthening
majorities
cliffhanger
ieyasu
dryer
woodley
blotch
cano
sme
hyder
twitch
shimbun
keynesian
darpa
remi
tiago
whipping
cubist
gosling
semnan
railcar
invokes
crvena
apostasy
ettore
banked
rimsky
tabletop
oyo
fairmount
clippings
vilas
spokeswoman
tropes
noriega
hating
ashdod
meissen
honeywell
affinities
nance
derailment
contemplative
matos
sauron
sweetwater
ronin
puncture
lomas
midwifery
farman
prescribe
schiavone
akram
allier
refering
verity
redlands
malin
cancelling
congruent
glued
agnieszka
bbl
antecedents
tsuen
peacekeepers
eas
coining
waging
opry
otero
sela
lillie
arequipa
hoya
farquhar
pidgin
godmother
babylonia
kuiper
bartoli
wadham
principe
madoff
plasmodium
fema
beloit
safeguarding
sharkey
annabel
disgraced
woodcut
harsher
cézanne
beset
impartiality
gooch
acadians
gilgamesh
aloft
akp
crates
impostor
hieronymus
enquirer
layne
brinkley
ribbed
pres
luzerne
maginot
metra
tiresome
yer
kuh
trumpeters
maung
symptomatic
bucknell
frisch
schmitz
silvestre
limousin
tart
sandringham
mertens
dôme
reimbursement
weathered
multnomah
bashkortostan
fecal
xxiv
allocations
flake
reputations
chandos
corwin
boko
fiance
maître
pinpoint
tabloids
nowak
libri
wargames
collieries
netto
nudibranch
farrow
osi
inpatient
circled
amélie
schafer
proms
transformative
naturelle
priceless
predicament
unsurprisingly
gant
oxyrhynchus
abnormally
rediscovery
headgear
soapboxing
rakesh
northwestward
haircut
puerta
hellman
cuatro
rotors
blau
predated
kiri
hiss
whitehorse
elitserien
studebaker
slag
ranji
tryout
seeley
hausa
clarinetist
rarest
dah
egret
courted
volkov
unedited
bayside
purdy
ajmer
dardanelles
ginny
musicale
paws
irani
convene
enforceable
inez
fripp
dux
trudy
christiania
hippodrome
pounding
bijapur
absentee
gruppe
crumbling
gowns
godoy
intrigues
reissues
luster
dysplasia
changi
redox
crozier
aer
straddles
yann
infarction
grandchild
bhupathi
nakano
agora
brokered
midori
steed
menezes
mellotron
jacobean
tricolor
cfc
roundup
trajectories
impunity
tits
gretna
peder
iia
dupuis
sahrawi
postmodernism
mcclintock
xenia
balsam
embed
entomologists
kona
getz
gozo
massed
jonathon
yuriy
ominous
ruthenian
apu
crayon
washingtonpost
summing
nsc
señor
helmand
aeros
vonnegut
morissette
zygmunt
festa
dermatitis
manchukuo
macapagal
hérault
riddles
haddon
vella
keeffe
glens
flensburg
fiftieth
focussing
suvorov
dugout
attenuation
tenderness
disqualify
rehearsing
wellbeing
infertility
grenadiers
dismounted
inaccuracy
nanaimo
occam
perilous
aurelio
frontera
chirac
¾
léo
buri
manohar
desolate
orbis
shred
individualist
yama
seclusion
antagonism
aureus
cog
daunting
appointee
misha
pirelli
praetorian
euphorbia
halts
navel
inherits
uscg
kilimanjaro
franken
aker
shiite
scherzo
hurled
taggart
macintyre
pressings
revitalize
atrial
justifications
herpes
barks
nikolaos
empirically
geologically
hacks
shatter
caregivers
rehearsed
flamsteed
eucharistic
maldivian
simi
rafferty
cfs
substitutions
indent
marmara
blockading
maccabiah
fleck
grasshoppers
hobo
yoshino
lookouts
mobs
desertion
nast
lismore
sveriges
barricades
obenberger
mornington
grievance
motta
oren
dermatology
menlo
intramural
multimillion
oromo
confine
movin
emphatically
nilsen
bettina
gallus
heraclius
publ
accomplishing
outwardly
glock
bioethics
disarm
norden
newham
watchlists
randle
scrape
delights
camper
hashtag
meigs
septic
gloomy
haida
centenario
moores
valverde
wollstonecraft
plucked
outtakes
flared
elche
granollers
saha
grieg
eda
usurpation
emblematic
kinsella
consistory
wycliffe
stary
beaks
ticking
crossbow
wakeman
mohave
intestines
katanga
langue
thomsen
grubb
crystallography
wettest
briar
materialize
emanating
moderators
undetected
typhus
chievo
lassie
teleportation
wrists
rfu
yonne
roommates
inset
prevails
gandalf
uninterested
circumnavigation
unido
scheming
falcone
tavistock
hairpin
tfl
ascends
praeger
fearsome
pula
khao
exorcist
vero
ahmadi
voluminous
yannick
deportations
dammed
coolest
dijk
unset
ashworth
danvers
todos
rainbows
mang
kilmer
cura
harker
mukesh
burleigh
charly
divisor
nurseries
lovech
ducati
joyner
sauber
invercargill
bengt
emptiness
donkeys
catacombs
dalby
sbc
foyle
expandable
muchmusic
pilgrimages
ninjas
emiliano
approvals
aerosol
scarab
opposites
adjectival
killian
buzzer
alkmaar
pasig
asm
mozambican
samiti
wayward
pippin
pacemaker
neanderthal
hansard
thruway
stonework
boltzmann
concealing
mihail
pld
firehouse
jungles
wetter
homeport
sprout
júnior
laminated
stamping
grunt
config
neely
ender
queues
argentinos
peacemaker
shards
aberration
disarray
monson
reconstructing
maciej
chickamauga
hymnal
wcc
boxscore
consulship
kars
misusing
spectroscopic
quantified
kayla
valerio
bondarenko
nadezhda
rathbone
quiero
cherries
arming
yucca
olympiakos
erc
eloquence
irradiation
centurions
raritan
parley
isc
parachutes
dispense
nürburgring
amputated
keyboardists
francisca
arn
palencia
kaori
betraying
thrills
oper
clinching
licht
faunal
ascertained
athabasca
gallium
sfb
douala
hamel
axon
geordie
sisterhood
dz
hideki
avis
scares
giselle
nanyang
meeker
fairytale
haro
alasdair
aeg
skole
dene
amicus
audley
understudy
mathura
figurehead
scopus
congresswoman
entomological
saleem
boyne
vaslui
impeccable
joys
rbc
mitzvah
seductive
hunedoara
oration
benefactors
perlis
watchmen
jingles
toda
ibc
bivalve
provincia
anselmo
congratulate
kultur
carruthers
mohsen
moynihan
fossa
gallop
clergymen
rectors
lido
madhavan
amboy
fridge
psychoanalyst
sook
samaria
byng
mcghee
holley
shahin
lumped
lauper
beatings
incur
aime
trolleybuses
hitchens
revisionism
discontinuous
harte
bermudian
sundial
microfilm
rabaul
pulsar
honouring
angelus
labonte
fsu
hogs
cursive
dupree
carré
nepean
corollary
holborn
chantilly
compliments
coexist
vividly
brindisi
mnemonic
sinbad
pessimistic
lubrication
drei
futuna
alessandra
aoyama
coachella
leica
quinton
widths
vegetarianism
wolseley
repudiated
redrawn
surrenders
kilburn
ahvaz
monongahela
infinitesimal
panionios
olney
bins
inexplicably
ruthven
forgeries
nantwich
contraceptive
laver
petal
usac
pelagic
talia
rasheed
goodies
motörhead
axed
delany
bacolod
hieroglyphs
pastime
staley
crassus
lumière
teodoro
cusp
kerouac
fiancee
tacit
rebroadcast
abdur
embody
strom
flatter
tunic
mair
bde
leveraged
donohue
alles
epicenter
pardons
interrupting
meu
roddenberry
zug
manet
washes
uavs
hdi
egbert
mcelroy
dpp
moderna
alix
tuzla
parlance
jabbar
berhad
disembarked
preto
anak
vers
rou
dari
bahnhof
darkened
ugh
rina
redstone
confides
rationally
hoof
beastie
belgique
pella
minuteman
mistresses
mcloughlin
competencies
cccc
sufferers
perturbation
etat
circassian
authenticate
moraes
headley
peninsulas
evangeline
garrard
moonshine
rotorua
hikari
vijayawada
morehouse
astragalus
sheba
shoten
oficial
jamaat
zeng
tatyana
dinh
overheating
wenceslaus
dreyer
aguascalientes
trickster
invests
esbjerg
satish
rupee
samarkand
psg
armadillo
ashikaga
sherborne
grebe
durante
appointees
etobicoke
barter
fumbled
bergh
cysts
kath
rpgs
denning
xxviii
lenoir
gander
orsay
grandis
fortnightly
cdma
foals
glenda
blagoevgrad
astrophysicist
libero
hauls
ailerons
ashgate
metabolite
unsustainable
biophysics
benchmarks
typology
oberhausen
leafy
focke
pevsner
koto
nimbus
scaffolding
disclosures
sketchy
dennison
gangwon
advisories
trainings
chowk
unparalleled
megumi
dehradun
zafar
virgen
mads
palliative
yeong
misreading
terek
hangman
hardtop
vibrating
cassino
alisa
hyland
asante
ganesan
cappadocia
deterrence
cockney
infra
biarritz
torrington
bizet
rtf
bylaws
whalley
technicality
obtuse
blackfoot
trove
comprehensible
izak
unruly
tutored
ganglia
bandera
btr
urbino
matic
tabitha
dufferin
bedside
termites
glare
parsley
lawrie
macrophages
dap
tugboat
elevating
jains
laymen
meryl
manmohan
whit
blowout
fusing
drab
dior
chiara
tricia
blinding
gara
forde
vikas
recursion
babar
archaea
lymphocytes
macroeconomic
nostrils
swaminarayan
balan
caper
hyena
hibernation
amazonian
sloboda
flips
pylons
astute
weg
lessened
quadrilateral
benelux
spoons
inspectorate
andries
englishmen
asti
woolen
powerpoint
customizable
doorstep
flavoured
mathematica
particulate
krakow
causative
geeks
brookhaven
ebu
conifers
ktm
editable
amico
stinger
overpowered
woollen
madera
blackadder
sutures
kinases
maddy
dayan
pfalz
remedied
isley
homeopathic
nadph
yelena
farragut
gipsy
stoves
smuggle
edelman
gwp
tsubasa
wishbone
babbitt
thrombosis
militar
colfax
kurgan
witwatersrand
auden
napolitano
stallone
rawlinson
goby
gau
xhosa
tambov
crayfish
nieminen
bartolomé
psychoactive
anima
ciphers
refutation
cheadle
comité
saavedra
tripping
sovereigns
prefabricated
asker
interstitial
siobhan
vixen
dressings
tut
scaly
transfiguration
draughtsman
trivially
milos
upfront
darrow
savages
factsheet
claudette
deathly
orientales
yakov
perverse
geforce
druids
extrajudicial
göran
cocteau
uthman
elixir
enumeration
entanglement
kunsthalle
scoreline
steffen
yury
mourners
conjugated
mémoire
arcana
cana
eder
pmc
darwinism
flagging
alpert
streaked
penicillin
ssa
niu
arbuckle
binge
lukewarm
gagnon
mek
dimitar
dbe
agamemnon
vittoria
arslan
tint
humayun
uab
telethon
sped
energie
yash
motorbike
drenthe
deranged
septuagint
wea
aberdare
refrigerated
baguio
jute
falsehood
germination
arendt
frantz
kessel
inquisitor
pickled
corrado
stadler
huckabee
karam
rawson
trackage
ventricle
pari
rosenfeld
woodbine
tanjung
coughlin
yoshi
doughty
leiber
lumbar
shulman
shamanism
resonator
insurer
interruptions
dozier
mémoires
crippling
broodmare
leans
camarines
contemplate
fellini
aargau
dazzling
warmed
disinterested
govan
ako
eastleigh
determinants
proportionally
bibliographical
perceptible
abl
flagg
finchley
crib
pero
coupon
cluny
refractory
uncovers
subsidence
idiotic
hus
alina
retroactive
substandard
seamlessly
woes
klagenfurt
universitatea
raytheon
lathe
quesada
geostationary
handout
bexley
reo
tiwari
parganas
pixies
numa
msgr
lithograph
froze
isomer
polanski
flask
canard
belli
cauca
coogan
roan
lendl
ideologically
icty
obie
dafydd
sanctity
deadlines
tidewater
curler
exhibitors
bulbul
toleration
pseudonymous
criterium
caravans
scavenger
compounding
chargé
powerlifting
boson
freire
barrichello
demeter
¦
morro
courland
glyph
vicario
mariupol
masami
qadir
heraklion
mackerel
yl
ishida
unscrupulous
jonesboro
crespo
kravitz
juncture
iww
bodhisattva
degeneres
tosh
lysine
raps
marija
hajime
prendergast
coworkers
cochise
judi
kanyakumari
prodigal
gamaliel
crittenden
financiers
bakeries
durán
acevedo
kanda
meth
conic
insulating
orcs
porters
rolland
akon
glas
tilak
stepney
tofu
innuendo
marsalis
eurosport
sab
symposia
rmb
veneer
metroplex
iskandar
chuang
mitterrand
daugherty
montezuma
impairments
gadsden
skoda
labem
georgiana
carefree
chambre
pensioners
lyre
menard
emotive
gorton
tif
ambrosio
roebuck
hexagon
tupelo
hamar
pantera
discreetly
clunky
vagabond
montauk
giovanna
pentagonal
orbison
gillan
carnivores
anthropogenic
meerut
mcafee
corbusier
xxv
sopron
tru
silos
pog
placenta
wcha
dingo
barreto
brainer
étude
pennetta
aqsa
arti
tarmac
yahweh
chardonnay
lejeune
huckleberry
bundaberg
xxl
descendent
drugged
luft
donal
sharpened
kaleidoscope
bertolt
euripides
ribera
grayish
taiyuan
deconstruction
lures
heatseekers
impulsive
tokyopop
tva
reentry
popularize
kailash
preset
grau
yvelines
matlab
luncheon
barracuda
jamil
racecar
lifes
cockerell
postulate
windermere
hellfire
degenerated
landlocked
angell
velodrome
crenshaw
feeney
xxvii
scopes
ey
manna
spotify
lyricists
hurdler
cruces
hurtado
ewen
wenger
uso
zf
outlived
bergmann
chansons
rims
greys
deville
suitors
tetra
mustache
gorizia
rida
captaining
dendritic
uncontrollable
disappearances
santosh
teak
paleontological
televisión
doubting
juxtaposition
monoclonal
domestication
howells
macs
lunenburg
dendrobium
tulu
tirunelveli
sieve
facilitation
inkscape
iwi
dvorak
spinach
phuket
intelligentsia
hals
spirals
deloitte
harney
cocks
ged
cnrs
corte
andriy
gogo
fireplaces
infects
registries
bajo
yasmin
heredia
spilling
gj
kickboxer
alon
monies
riel
chit
saloons
ner
gaynor
allianz
naoki
purges
ungulates
nimitz
privatized
mcnabb
seu
trespass
peppermint
lombards
dweller
compatriots
juices
rimmer
consummated
silicate
hdmi
machete
tinsley
posh
brainiac
festschrift
fallujah
bushnell
viana
hausdorff
kbo
murex
dartmoor
nugget
unbreakable
alfalfa
mips
bystanders
authorizes
zdf
glandular
giri
disulfide
lousy
modestly
lodgings
uml
austral
glimpses
precluded
noteable
toki
analogies
jochen
udi
muammar
madigan
telemetry
macao
trabzonspor
eject
choudhury
bletchley
bagpipes
aap
mulhouse
cowen
vercelli
alejandra
aku
sassoon
catalans
creatively
ranchos
brookside
mishima
cto
snub
claudine
fundraisers
calico
caustic
mellitus
booklets
seamount
broadest
gakuen
burnout
scraping
kazuya
sneaking
vanier
lucía
subsided
radiological
bloodstream
greville
junkie
tsa
ilyushin
shakers
nicklaus
yuji
shouldered
fergusson
tedder
mainwaring
cosworth
clearances
beretta
intolerable
kenshin
assimilate
isolates
meandering
cleanliness
forties
especial
lcs
goswami
shoreham
picton
politique
dispensing
astrologers
boney
porches
kain
hasegawa
equating
inferences
theron
bluntly
mvc
bogota
emmanuelle
platelet
disorderly
tiller
isomers
mpumalanga
speciation
ludacris
northeastward
scuola
outlandish
lurking
minna
rushton
origami
tos
möller
erasure
herbivorous
payoff
kennett
köhler
jyoti
cyndi
westgate
stirlingshire
isotopic
accrued
newsreader
sono
huntly
fuze
kagawa
retrospectively
baudelaire
legia
fished
timbre
forfeiture
engelbert
kremer
morison
informants
callie
brochures
ulcer
pinellas
paroled
terrance
balustrade
sparkle
chaka
formaldehyde
interdiction
bannerman
foothill
subtract
snoopy
tippett
commendable
commemorations
whoops
longstreet
marwan
implantation
invicta
littered
tomo
ctc
pohang
hatchery
bnf
utes
ili
serfs
gilda
lachaise
pattison
noida
glazing
royston
neves
glycol
kareem
catholicos
ospreys
spitz
kjell
revisiting
kui
bowing
bonne
williston
grumpy
taganrog
natchitoches
bernhardt
saf
senseless
pradhan
boilermakers
bledsoe
educationist
portillo
raged
lessing
cryogenic
byproduct
quinta
hallett
catheter
famitsu
delisting
unskilled
ordinate
amigo
ponting
espinoza
smog
imageshack
stagg
mungo
zan
franciszek
jara
interregnum
masted
normand
affirms
tarantula
verdes
ssi
pitting
nuestro
enlighten
warfield
bata
mitte
texarkana
bowser
jacqui
dfl
decays
yalta
captioned
staines
electropop
usurper
pantry
fenders
qpr
saatchi
outermost
bodine
brainchild
maracaibo
powhatan
chests
cherie
atmospheres
amaral
tampico
ozzie
sedimentation
springtime
pogroms
elitist
maw
adela
datu
devo
porridge
geeta
wellman
tunstall
riflemen
nehemiah
pirie
paperbacks
journeyed
scams
bourges
etihad
puno
hagiography
vicars
mam
diogenes
cvp
fsc
curlew
pseudoscientific
rangel
chattahoochee
kafr
brontë
grandmasters
andrás
clef
taoyuan
insolvent
deming
greenleaf
hynes
dares
ní
obit
wma
brava
nussbaum
memoria
tasha
failings
impasse
petrovich
dialectical
unmasked
kendo
ano
pma
justo
stoic
marianna
damir
hues
cathcart
castell
kolhapur
scum
obstructed
taylors
blakely
flack
hindrance
balanchine
showings
bruton
cours
triumvirate
hanoverian
converged
randers
harrassment
ahan
gori
mcewan
circulatory
litt
coombs
sacra
zed
bento
churchmanship
audacity
unprofessional
larue
menacing
blackhawk
odes
idiomatic
maktoum
shirin
cvs
thetford
goblins
bespoke
ramsgate
chitty
teas
jeunesse
elbows
hyo
larisa
ornithological
mim
najib
theropod
iridium
residencies
grosjean
stitched
desolation
pervez
morricone
tamura
ironside
livres
vive
scheer
shutters
covariance
frisbee
enfants
etymologies
dagmar
haddock
enugu
epoxy
genk
yume
clouded
conferring
headliner
spaniel
kaspar
convents
gamefaqs
scrivener
forma
wik
merrie
fiore
hec
predictor
hospitalization
duh
hibs
byes
modifies
muhlenberg
nuance
bonny
inflection
kenyatta
loris
yarborough
polis
constantius
imtiaz
yukio
subsystems
foal
barricade
yiu
abhishek
mikoyan
discontinuation
reminders
veena
reread
cleves
steels
applaud
broadbent
batchelor
haugesund
evict
eff
cagney
anorexia
biscayne
ringside
dore
siret
bouncer
gagarin
rémy
dailies
cortland
rejoice
hunchback
previewed
yells
orme
easing
wardens
ascendancy
runaways
marika
bruised
scanlon
nrg
croton
colima
dictatorial
licentiate
moorhead
adjourned
gato
eilean
muncie
stipulates
masato
oup
aliabad
engl
thelonious
chrissie
prolong
ouse
marketers
muskogee
maryam
reproduces
celebratory
miró
zo
energia
boldface
shafi
konya
redshirt
payout
arnie
rabi
isr
elkhart
churchman
ecc
vilhelm
xiong
reconfigured
archway
carapace
dries
loughlin
gamespy
disassembled
nematodes
heide
smelling
rekha
rayleigh
hydrothermal
doorways
microfinance
gabi
marysville
dimaggio
imprints
bleachers
mujeres
uriah
decapitation
diemen
banfield
pires
autonomic
banishment
phony
kiyoshi
sascha
qazi
earhart
conjoined
conquerors
bugatti
universality
alleys
supercars
brouwer
unleash
taf
papadopoulos
kondo
tithes
railing
perils
egmont
suede
cargill
coldwater
falla
meagher
policymakers
udupi
pontypridd
malla
jee
carteret
simplifies
boden
tenacity
infatuated
welker
reenactment
okinawan
complainant
paphos
martians
repetitions
parsonage
bastian
multidimensional
startled
canonization
rambler
sambo
unaccompanied
carnaval
resins
mcmullen
imprecise
hartwell
besieging
endicott
silverware
benefice
chilled
darshan
waylon
sagra
neurosurgery
printable
zagora
chae
invalidated
tucked
suspiciously
sorel
loudspeakers
unfriendly
audited
tisserand
symbiosis
seon
verifiably
marrero
spat
seraphim
helsingborg
grime
representational
orphanages
equipping
acrobat
apprehend
autres
nicene
dearly
ibid
chalet
weep
pradeep
starz
penicillium
moretti
ait
thankyou
ocd
admires
cardona
granary
sebastiano
hani
seyyed
unreadable
sandstones
joyous
polymath
habitually
glycine
fjords
notting
whips
cromer
complicating
perpetuity
aspired
exorcism
lorain
hardening
triplet
oust
imelda
assailants
brownlee
separable
colley
ardèche
subs
spenser
sams
république
cladding
molson
dissipating
npl
beulah
chisel
rostam
carton
wushu
laguardia
mykola
arxiv
spyware
voigt
noxious
ayrton
strides
bled
wildstorm
uic
chancellery
petrified
jhelum
uppercase
adenosine
nighthawks
civilly
mundy
birthdays
tuam
fallacies
freund
geoffroy
deanna
latour
chikara
rona
relented
confusingly
tilbury
lunches
kilo
partying
sleigh
stratum
ivanovo
genet
narayanan
castellón
pape
foxy
disseminating
lcc
puddle
ug
pasir
alphanumeric
underpass
novelette
haskins
expositions
guardsmen
cheats
clays
businessperson
srebotnik
voorhees
mohanlal
encapsulated
levon
hemispheres
gloster
ppi
intermediates
dram
stratified
bastions
coven
cools
frontenac
bian
esophageal
mclennan
abn
vernal
tls
leones
kinabalu
renominated
transcendence
despot
daf
gwynne
jest
rosebud
bridgwater
sandberg
cautiously
ila
rhone
fishy
adf
miyuki
cmd
typographical
thighs
lesnar
dahlgren
sculpting
patrollers
bravely
torches
yana
solothurn
sported
glade
recurve
carmela
allotments
minefield
filipina
hammett
unbound
credo
greenhouses
kelp
pennine
inventories
dropout
salamanders
leyden
fakes
mower
espace
nematode
crossley
nansen
faddle
darkly
unchecked
ite
anbar
verdy
rinehart
pho
accuser
hateful
cuttack
eisenstein
trna
kovács
komi
shuts
indonesians
decentralization
collider
hauts
berdych
hylton
teleplay
coincidental
vander
swinburne
kamala
auch
abb
blm
axons
cruciate
ganglion
tuen
condo
octavio
conserving
sheltering
grzegorz
belleza
cession
elicited
marple
trestle
sandals
brodsky
purchasers
vpn
mithridates
withdrawals
conch
hollister
vigilantes
infringed
colouration
rewrites
duchesses
galvin
waziristan
robison
draftsman
framingham
alumnae
bla
irfan
olde
bbq
datum
pacquiao
pentax
kilowatt
gamelan
stiller
vilna
earrings
metropolitans
lucan
carradine
ansbach
timmins
stink
bhat
orthographic
asexual
toru
instructive
professorships
unchallenged
safi
verbose
cristobal
fih
lacroix
envision
arce
worsen
boynton
hurried
polymorphism
koran
mannerisms
deviant
prov
remanded
inwards
admissible
hourglass
leveson
oppenheim
renunciation
pusan
shogakukan
mccaffrey
kure
lundgren
keswick
shaheen
moorland
tz
transporters
scapegoat
jorgensen
lossless
pappas
nuys
materially
tottori
diseased
kirilenko
aeroflot
wormhole
revel
cracow
perihelion
plush
wsu
jinx
baltistan
khwaja
anachronistic
aiden
harvests
annabelle
talkies
weblog
luxembourgish
amasya
dysentery
pál
zn
buell
retrieves
mortals
whiteside
slowest
bunyan
uga
uninformed
undertakings
xo
soybeans
unstressed
auditioning
parlement
observant
evesham
fullest
avia
trill
puppetmaster
nogueira
terns
ordinator
pra
secunderabad
griffon
langs
rupp
salinger
numan
elan
stourbridge
armas
linde
incision
gauls
hammarby
lenovo
octavius
kerrigan
equalizer
rumsfeld
panicked
plat
blackfriars
garvin
fret
respite
janie
grosseto
delve
sark
literatures
buf
armchair
headway
installer
asshole
demobilized
bradfield
diderot
slovakian
recitative
baek
leviticus
oxfam
machinations
antimatter
einem
sacs
ntc
crossword
viv
nubia
melba
reiter
origine
roadmap
petrels
simba
eurodance
turntables
hunterdon
ogawa
afca
nuovo
yanukovych
krajina
haverford
raipur
ecologically
christiana
placings
disengage
maclaren
glenorchy
choking
nakhchivan
hsien
reston
cruciform
meine
achievable
hollander
strove
impeded
lionsgate
krista
iman
vented
dialectic
samos
biao
transcendent
ltc
macdougall
zohar
qm
invades
kish
unseated
enver
carnation
viterbo
vial
livelihoods
faiz
viswanathan
ppl
calton
bismuth
arguable
swenson
mfc
thad
narvik
codecs
serpents
herbivores
semple
hajji
cimarron
drammen
potawatomi
mullet
eso
tyndall
laserdisc
taran
endanger
throwback
geometries
counteroffensive
mores
commend
tengku
whack
lull
argumentative
ishaq
hildegard
eyewitnesses
bindings
daggers
cockatoo
céline
lighten
proust
mór
mephisto
superstitious
indentation
amaya
takao
defensible
nurtured
kuroda
norepinephrine
fouled
issf
jogging
belizean
blackboard
rifleman
akshay
novelization
sodom
tristram
brigid
diez
nutmeg
boing
schenker
heilbronn
guano
hologram
pnc
aas
crusoe
obliterated
illuminati
ludovic
enzymatic
wrangler
venkatesh
blower
ribosomal
figurine
correia
reportage
negotiable
amway
yeti
maidan
dependents
sauna
enforcers
lévy
cic
fryer
selva
retaliate
tenancy
autónoma
vashem
amun
dubs
mccauley
congregationalist
bullard
ribble
shillong
kartli
centerville
fossilized
cordell
macneil
diners
melkite
slideshow
rodionova
infighting
decomposed
culled
siddharth
cockroaches
awry
pretentious
lindisfarne
kennebec
heaney
discriminating
portability
passageway
storefront
erupts
customarily
tableaux
swordfish
poco
murky
hoosier
hime
hisar
outings
mayall
thematically
kalyani
koda
selves
sbk
nerds
minton
hiawatha
teck
shove
priyanka
bz
srebrenica
crain
wildflowers
subtraction
daedalus
aly
bukhari
hightower
prospectors
incarnate
vespers
caged
masaki
boll
strathcona
vladimirovich
gwynn
tristar
honorius
prospero
oxidizing
introspective
disruptively
lepage
viic
roja
perpetually
tsinghua
huw
pnp
capitan
inna
lubelski
teased
murfreesboro
olmec
motets
engle
enrolls
radii
cascading
blueprints
killarney
reeder
thackeray
kigali
tuareg
cerf
starbuck
nairn
janne
premiering
semicolon
adapters
asparagus
leduc
standstill
cayley
hyphae
babyface
channeled
faff
fangs
₤
miquelon
hallmarks
najaf
sacristy
unlinked
trinidadian
ramallah
undergrad
cmu
fanatical
manolo
pictish
eads
lipetsk
jalil
bolo
restroom
monocoque
emitter
lasso
applegate
candlelight
schell
schenck
stover
immersive
scraps
vdc
gosh
clumps
trusses
edgbaston
braden
miri
cypher
dedicating
antimony
tappan
cece
insistent
southall
keogh
condense
ruffin
thrusters
symbolist
compensatory
sensed
dipper
marg
sixtus
gestalt
weevil
gsa
vesta
whatnot
cricketing
haasan
paoli
tyrants
chula
huon
sdf
prescriptive
groovy
stasis
lounges
turkeys
dreamers
predating
armani
mommy
berbers
donuts
dcs
uneducated
overtures
gotthard
anatole
happiest
sancti
bartley
cyprian
understandings
reformatted
berio
nipple
balthasar
driest
sou
karpaty
ferrous
banshee
koizumi
manhood
bgc
moulding
eminently
edicts
slashed
taiko
linder
beaker
splice
alef
silliness
kanazawa
stillborn
ryszard
oregonian
chetniks
bojan
rickard
oradea
buckwheat
gourd
sawmills
zé
mindfulness
lockyer
airshow
irgun
fluctuating
howes
broccoli
ammonites
lieut
malawian
millicent
endpoints
metrical
elphinstone
accelerators
leu
magazin
kampf
yolk
villager
saddles
scca
firsthand
moderates
triangulation
pratchett
auntie
manoa
morena
checkmate
ionizing
minima
telluride
uid
uwa
debtors
cotabato
supercentenarians
rechargeable
landsberg
karbala
granules
yrs
whiskers
perumal
transcend
treasurers
admittance
natak
amerindian
fortis
triumphed
fawkes
idp
puffin
fraught
promos
scapa
colonize
binh
goan
vindication
prays
mankato
quebecers
coors
amicably
peale
thang
nether
fess
dhc
vlaanderen
marksman
homelands
shona
cts
siegen
soweto
oxnard
tycho
laissez
infractions
budweiser
paulie
seaway
quintessential
scorched
redefine
helmholtz
montagne
nonconformist
skated
upsilon
anorthosis
callao
bolu
greensburg
streamer
grenier
slovaks
sweating
interconnection
salamis
fls
hine
waseda
xn
multiracial
capote
keri
rowdies
ifl
comer
didi
accretion
canister
folkways
iro
yelled
seceded
sacco
sihanouk
shredded
platnick
stereoscopic
ichikawa
bigot
zambezi
stubbed
shanty
dynamos
ternopil
mexicano
sienna
renner
demolishing
mq
raiser
rosedale
cady
kwajalein
declarative
vsevolod
kuk
zora
seiji
twig
ticonderoga
mesolithic
gondwana
assertive
hippopotamus
ets
cohorts
hypertext
adversity
merovingian
votive
barrera
cacti
sti
zander
vomit
pago
escola
inflected
azusa
dimitrios
kidnappings
boardroom
halford
astray
benazir
años
rbs
takedown
puntland
tisdale
arguement
salyut
neared
astrophysical
horrid
neel
año
forges
chitra
chroniclers
conspirator
saya
submits
derided
nuri
alborz
ona
meara
swapo
lita
capes
aggie
overlying
bobcat
adderley
caveats
martindale
fouls
campinas
ignaz
federación
weald
ratcliffe
derail
encircling
memento
tulare
colosseum
oas
resubmit
airfoil
regionals
simulcasting
callan
jacksonian
alstom
fruiting
aether
dissected
mrc
eugenie
ahsan
reconquista
ioannina
acreage
hdd
julianne
hdr
forgiving
serrated
molyneux
kruse
buttress
urchin
courtyards
seuss
springbok
biome
azimuth
promiscuous
superconducting
gambian
askew
liberally
vellore
levers
laxmi
aubin
hermits
mashed
haslam
shipbuilders
raphaël
murillo
bradenton
unscathed
complexion
blasters
frescos
pimlico
bev
poblacion
christiaan
yuko
thickened
landlady
boleslav
coquitlam
extensible
retinue
suo
reichenbach
brevis
karlovy
lantz
ziegfeld
hollins
njcaa
texaco
openstreetmap
maur
madrasa
strang
acland
asn
alessio
renovating
acne
payouts
agri
signer
liege
tov
scorn
kazi
gamepro
longchamp
confidently
cartels
trowbridge
dostoyevsky
roseanne
dolomite
vergara
jabalpur
glottal
eliyahu
siem
dnb
lobsters
nakayama
skyway
carnot
sinensis
solute
gatorade
datasets
autocratic
sloped
undeniably
conferencing
muck
rtp
dislocation
soften
assembler
kaduna
cybernetic
alun
ribbentrop
alabaster
hemel
dps
janko
nakagawa
binaries
apocrypha
benzodiazepines
differentiates
stun
tiverton
safeties
jewett
rishon
faris
bouvier
vivienne
ronstadt
burgoyne
yorkers
waukesha
inbreeding
staffer
cem
mourn
episcopalians
iceman
ichihara
lackluster
paganini
dún
conchita
interconnect
douai
youngblood
román
ndtv
jal
aylmer
flimsy
unplaced
attentions
bridgeman
fedor
bassists
newlands
debutant
tinge
ido
unita
pigott
darya
knobs
bulleted
nox
sopa
rinaldo
competitively
physiotherapy
reclusive
karlsruher
kingswood
verdasco
trainor
schäfer
creationists
lichtenberg
chillum
harun
petitioner
apia
parasol
rashad
deliberative
twinning
teruel
sua
murugan
outage
osorio
dandelion
benji
intersected
msm
scarlets
atolls
blackface
emp
drunkenness
weis
fiu
energiya
hooke
yamazaki
witold
unsung
fliers
zar
borgo
wrongfully
genji
buccaneer
woodforde
donut
kiosks
gobind
unholy
cartographic
rosita
nikhil
asd
radiate
lactose
sor
pha
scented
winemaking
muzik
pipers
bonilla
handshake
ambulatory
chih
universelle
hesperian
aggressor
molasses
paradoxical
sorrento
piranha
snapshots
jamia
beveridge
haque
naam
sovetov
otc
pater
provincetown
pacifism
pinnacles
disclaimers
fatwa
léopold
hypoxia
acetylcholine
roadrunners
caxias
clinging
wittelsbach
orestes
groomed
propositional
barbet
lemons
xanadu
govind
broadsheet
radek
countywide
moy
mints
shiner
gazing
hunslet
skylark
mustapha
prosser
champa
orc
cnt
trapp
maynooth
brasenose
dharwad
distasteful
larus
intermarriage
stingrays
flagler
jarman
maida
whirlpool
collated
pragmatism
subcutaneous
incised
impersonator
chuuk
suis
sameer
rashtriya
rumba
hayman
bem
merwe
mistaking
rivadavia
distractions
manipulations
daze
chlorophyll
fabre
brooding
taverns
nani
diatonic
softened
canby
paducah
shortland
shales
uppermost
linköping
deportiva
worshipping
worley
indignation
expelling
cea
mayoralty
conciliation
retainers
bjorn
borderlands
intents
hussar
troms
weezer
hbc
fitzmaurice
piecemeal
kitson
stranraer
huygens
yazoo
croquet
adl
cutie
prodigious
venturing
patchwork
mcentire
ozawa
ossetian
segura
dlr
poulsen
duda
willful
ihr
humorously
bohème
uneventful
belém
delinquency
prioritize
draconian
intolerant
pz
knud
outsourced
aorta
nashik
budgeting
doordarshan
ravana
bishkek
erste
jaco
fredrick
fingal
rerun
woodard
krazy
finnegan
seafaring
swartz
wyeth
qmjhl
adeline
turbojet
impersonal
sherwin
mycenaean
tetrahedral
nalanda
masa
hendrickson
kumi
navi
dian
balloting
michèle
knowle
selene
trigonometry
mingled
harmonia
commedia
amaro
cinco
sdss
conyers
winded
ursa
youzhny
wilmer
icp
miho
sojourn
petrarch
ebro
deadman
sadar
incessant
breisgau
emporis
bitumen
zigzag
roseville
aquariums
kx
servo
cowdenbeath
wozniacki
contingents
plattsburgh
itch
kulkarni
sundar
odom
honeyeater
pausanias
keck
triennial
pulsed
yuwen
retainer
grievous
vimeo
richman
severance
kinnear
coconuts
pant
chanda
marquise
freya
nys
claro
scarface
mannered
ichiro
aam
despicable
bulldozer
ayurvedic
salish
tvn
garo
solder
usns
nefarious
humpback
menagerie
commenters
xmas
thermometer
drydock
geosynchronous
ietf
prs
eradicated
gera
beda
mestizo
mitigated
dative
aslam
ascents
resignations
tufted
limes
schopenhauer
whoa
vladimír
airtime
quicktime
siedlce
miliband
sufferings
valuables
ardmore
mobilised
bridle
zaidi
extravaganza
,and
propriety
heptathlon
hn
handedness
angra
kut
guaraní
beltrán
ife
corry
kafelnikov
forgo
beep
tarc
idi
speck
aileen
riverbank
godolphin
gehrig
karelian
steinway
cymbal
suman
vindictive
pairings
flywheel
catwoman
bracing
honoree
coulomb
gor
bada
pazar
vices
shallower
cathay
grierson
stal
nzl
modalities
purify
compiles
cwa
presides
ganja
chromatin
incriminating
giver
hillingdon
lanza
aaaa
defensor
hospitaller
pubic
averted
deserters
bridgetown
faw
gorkha
huck
rudeness
intertidal
silvestri
salto
incitement
cerezo
wairarapa
interwoven
epigenetic
adherent
medallions
defies
yuba
hansson
pronouncing
pelé
petroglyphs
rau
indestructible
keyser
caregiver
supercopa
stockbroker
haverhill
bogged
whitchurch
abra
prepaid
masala
bodley
miz
haganah
byung
berta
akan
newscaster
hazy
blandford
escherichia
chats
phenotypic
quayle
lyonnais
telegraphy
goblet
bedi
accesses
nightjar
alo
nascimento
mms
oglethorpe
yardage
edgewood
ruy
sherrill
preponderance
psychopathic
bitmap
occipital
eller
theban
responsibly
alight
tymoshenko
shueisha
travolta
pipit
warrick
ketchup
defuse
molars
visigothic
laziness
midge
methionine
clydebank
flagrant
industrialisation
tortoises
anse
drôme
squarely
mikhailovich
antofagasta
lehrer
cosine
makassar
azov
fatih
reebok
altruism
patiently
herbst
counterfeiting
mesquite
heim
embossed
vindicated
conklin
marcella
neuilly
gyeongsang
donn
lomond
composes
cathal
avellaneda
americanism
grieve
canvass
christo
castings
handbooks
newcombe
helge
reznor
rubus
mayweather
haru
schaffer
être
aranda
nanotubes
pdfs
pcm
tremor
widowers
miro
trott
lockers
accumulates
conjectures
ballymena
takagi
fujitsu
ellicott
bowe
nazaire
schooners
paradiso
biswas
proviso
krieg
unbearable
tantamount
kcmg
wharves
tumble
returner
photosynthetic
decadent
anadolu
heartache
degrassi
kkk
kublai
zindagi
sociedade
esophagus
aline
intelligencer
nodules
synchrotron
celt
malpractice
ashburton
excite
ankles
positron
emmons
sparring
foresee
damnation
vz
yc
amidships
dunbartonshire
penza
nida
woodpeckers
pulley
aar
popularizing
glitches
cécile
hefty
babol
badd
bristles
mayday
kono
boutiques
apricot
gomel
tiberias
socialization
statuary
ordinated
aldehyde
webbed
jf
toyo
evangelista
canaries
eamonn
hamada
thc
almonds
mccready
acuña
quarrels
indio
semiotics
cetaceans
coupons
kristy
progressions
dixit
hypothetically
seongnam
farkas
ailment
extinguish
dixieland
mamoru
oscillating
zr
stopover
sel
grammer
bcc
winkle
abstained
christendom
salmonella
togolese
conquistador
utterance
travelogue
buchenwald
aftermarket
belton
succulent
lucchese
abo
evasive
hanns
unravel
reunions
compulsion
param
shiro
morey
tokelau
amol
saprissa
changer
edberg
biafra
carioca
ncos
predominate
prospecting
antiquated
algal
diphthongs
eec
decorum
bearcat
pécs
solidify
preemptive
enlarging
gat
clijsters
leonor
overrule
slayers
covertly
anacostia
sidekicks
apaches
subvert
productively
surinam
foreseen
babak
controllable
adverb
hangout
badlands
recluse
danske
cinque
shalt
bentinck
zec
suu
lennart
anointed
pledging
uninteresting
jindal
deepen
graced
huo
yamanashi
kellie
lode
bracketed
ursus
anja
prewar
elland
subtlety
flashlight
kpa
aqueducts
donington
illiteracy
mirabilis
rafe
smallwood
drifter
ccr
kel
rowlands
cme
josefa
nominates
strode
simona
frosty
pergamon
repulse
wale
filipe
beryllium
petros
preferentially
genotype
checkered
whiteley
dama
eurogamer
garber
araújo
nuffield
biofuel
acker
gustafson
trebizond
phonetically
isham
subordination
woon
duchesne
shrewd
shaka
reb
tipp
cob
prospectus
dreaded
claes
trickle
wasteful
buchholz
streep
lacs
allude
clocked
venter
germane
brougham
camber
halved
salads
fut
gemstone
prejudiced
shunt
inked
remixing
amx
replenish
downriver
eamon
commonality
janaki
universiteit
holdsworth
beardsley
geri
stanfield
assays
exclusivity
baru
germs
ym
excused
unforeseen
gpo
spirou
quantification
wk
manipulates
dicks
yousef
arantxa
abrahamic
smock
watercolours
bombard
zoroastrianism
uscgc
provençal
sophocles
atsushi
kadokawa
tauranga
apologizing
voix
becuase
mithun
powerfully
pickard
kasai
qasr
bergeron
forcible
unsolicited
longwood
esch
synonymy
sparky
monro
tyrannical
kozlov
lauda
montparnasse
prizren
pzl
leiria
orquesta
dimethyl
uru
stasi
cushman
nevers
narcissistic
hilde
desalination
hollingsworth
famille
objectors
ree
rajasthani
immunization
prepositions
mariachi
dukedom
fenn
faraone
grating
chios
overijssel
blakey
levies
bernini
kilbride
ribeira
maliki
pontefract
samadhi
hariri
terme
dislocated
picardie
characterizations
facilitator
flue
sheeran
pettit
taka
qarah
minter
siti
hiroki
selfless
icbm
greenhill
togliatti
demotion
modems
amharic
marla
barometric
bonsai
fabius
torturing
conservationists
transposition
racked
greenwald
damning
yeager
shuster
ricard
magsaysay
pds
dilemmas
widgets
breuer
plagiarized
soden
cahiers
momentary
guilherme
jagiellonian
getter
zipper
slav
bolger
epithets
heralds
singling
norad
crazed
offa
bodmin
somalian
oakdale
osasuna
flattering
negri
restarting
wer
empoli
mastercard
optimizing
jig
divorcing
brereton
gielgud
alexandrian
snowstorm
clot
emphasising
galli
nar
nacho
franchi
chs
obstruct
esta
gliese
vukovar
blockhouse
prius
reuptake
scraped
preoccupation
feelin
hino
crewman
placekicker
liberté
woogie
gab
anatoli
roush
premios
depriving
steely
femininity
hexham
dura
marshalling
merino
concubines
hes
contravention
minesweeping
greener
keeling
gascoigne
scrutinized
subdistricts
generalize
choe
scholl
srinivas
crandall
evoking
dex
olivera
richfield
boz
sabotaged
leitch
barroso
llama
ruck
rudra
dif
enda
satie
cheong
graff
injustices
uyghurs
aalto
bahía
henriksen
abdulla
paseo
seabird
shura
cantos
zvereva
detracts
standardisation
ulterior
tso
toth
declension
pellet
donates
cupboard
excised
rectangles
gennaro
antonescu
lavery
factorial
scythian
quantico
jari
hock
rabid
preta
ibáñez
misgivings
capping
meher
blurring
kortrijk
maximise
marchant
libertas
kahne
fec
stolberg
burgesses
futility
fishman
randal
tartar
smurfs
salma
conspicuously
silverbacks
lifesaving
islamophobia
exporters
middleware
eifel
kalgoorlie
bothwell
bridged
keselowski
shazam
oneness
mabille
steiger
democratization
summerslam
drava
nuttall
jud
suffusion
morbihan
qiang
surgically
traoré
groth
leszek
befriend
decadence
moffett
paratrooper
conga
phasing
winehouse
tangentially
kees
ori
rmit
occ
parsi
detain
newsom
seaford
lumsden
rdf
redux
inversely
lum
academicians
taito
pastiche
chatting
utv
hing
kasey
mansard
cowardice
periscope
anabolic
sneakers
heckler
gosport
marquesses
dolph
diploid
woolworths
exif
sla
solanum
quintero
prat
feuded
tirelessly
dikes
kingsway
rationalism
honed
punks
aveyron
phong
starve
canfield
breathtaking
gorgon
modality
bayes
sweeper
kenora
spectacularly
obscuring
leake
eltham
unicorns
lucretia
✈
kakheti
geraldton
obeyed
ure
carling
basaltic
grader
rearguard
nimoy
sufis
emmys
kleiner
ibanez
epidemiological
marte
solent
sandwiched
henin
fissure
dualism
rips
shifter
castaway
carotid
kotor
disproved
broadleaf
sotto
pauls
degas
ferraris
stalag
methodius
nonviolence
camargo
downer
paraded
bestow
viagra
deuterium
srinivasan
gazi
bicycling
exclaimed
eternally
couplets
nutt
nevill
aro
trailhead
takeuchi
brownie
psychical
distorting
hovercraft
mitcham
puss
twofold
distaste
mutineers
nullified
newnham
amina
tamer
invents
clichés
succinctly
ij
megawatt
buddhas
dushanbe
chandelier
darwen
factional
faure
mercator
hyuk
chipmunk
patched
bioavailability
colne
zoot
authenticated
supercharger
koichi
diffused
unattractive
mattias
exchanger
alternation
jarring
vejle
debug
bathe
appreciative
loggia
inés
itchy
arai
extramarital
octet
adcock
yuk
galego
timorese
bhi
prune
generically
benedictines
oily
marrakesh
mizrahi
becca
tupper
irena
panics
lightest
chidambaram
maksim
arabella
ballistics
ocala
obstructing
csiro
inyo
lattices
overcomes
fca
intergalactic
begonia
fiduciary
watercourse
dempster
resounding
pericles
repute
aharon
femina
migraine
grohl
zhongshan
rheumatoid
toughness
soot
pruned
imbued
quibble
brea
severing
jaume
mami
colonist
narada
garb
mejía
irv
neuroscientist
discarding
hippies
branford
jarosław
unsatisfied
macaw
provident
carne
oic
opm
tooling
menominee
hillbilly
karoo
pyruvate
linwood
lld
cyclical
luleå
soa
gish
davydenko
ih
ula
waldman
siempre
ketone
deniers
accompanist
cariboo
hap
maradona
mccollum
carnarvon
braided
schlegel
galeria
magnitudes
sudha
etheridge
eloise
throwaway
vann
teapot
futbol
inlets
shard
almanack
adorn
hawaiians
yearning
haunts
rowman
campaigners
prefrontal
pauses
ruggles
actuarial
graphene
nichiren
honorably
oscillators
hives
danza
pacification
hering
bookings
kham
slotted
ilford
norge
villar
prescribing
adjoins
subprime
suborbital
escalators
bessemer
raine
kashi
disinformation
picts
leppard
metzinger
shim
personified
lahn
epistemological
xanthi
christen
booted
wildflower
boulevards
chilly
collectibles
dinar
steadman
sagebrush
maturing
geer
rochus
belenenses
reggina
vmware
steyr
legalize
casement
elizabethtown
ini
gregson
minimizes
gam
widens
oita
bola
hak
ttt
escher
nika
lacquer
beadle
roasting
mmp
dips
frenchmen
mestre
moveable
brisk
dementieva
wtc
modoc
credential
smoothing
schoolboys
postulates
nyman
alfaro
devising
yuka
philological
mendip
heffernan
cancels
ashkelon
kells
rika
outgrowth
orlov
debilitating
recep
kirwan
mci
rapporteur
faerie
anagram
firebirds
crowder
wilhelmshaven
mishap
jaber
hisham
abed
mtn
wook
naya
barranquilla
boulanger
tanja
phonographic
halstead
commercialized
ventspils
encephalitis
reichsbahn
willett
nameplate
cytokines
cotswold
exterminate
raisin
tremors
buffs
adder
tyndale
dangling
farsi
krusty
booms
pacifists
aest
pgs
limitless
humbly
cranmer
ghani
boe
childlike
ismaili
taunus
sochaux
aamir
ponderosa
serjeant
everard
hyacinthe
mbs
cottrell
coote
repubblica
surigao
tejano
sivan
firefight
makarova
tremont
replayed
depreciation
beecham
kumasi
bulkhead
preposterous
clann
dtv
scientologists
errani
bulger
charon
allocating
atacama
knuth
pais
repose
tolentino
lingo
protester
passau
monogamous
lora
elven
leash
cot
tyrell
longo
arthropod
thorny
sluice
mcauliffe
escalante
courtly
trespassing
bur
underscore
unlocking
willa
hitmen
filibuster
wawrinka
catharina
tasted
condenser
levitation
hermetic
diligently
lehi
symons
alanis
campgrounds
corleone
headset
diction
shabazz
pupa
topaz
gaillard
moron
mcdonalds
tutti
fallow
doin
goodison
ux
sani
shampoo
carnivals
horsley
shimla
evangelion
citigroup
oar
regroup
bayview
hindmarsh
rogan
verein
savant
pythagoras
gleaned
wedlock
yatra
pastoralist
keyhole
grimshaw
machinist
enforces
hanger
venkateswara
barbaric
gulfstream
gsp
eleni
masood
beavis
menzel
redcliffe
afm
openoffice
almirante
iffy
culpeper
cheeked
innermost
amedeo
gollancz
alania
theotokos
radiative
waterville
elstree
pathologists
reclining
riverboat
corky
valedictorian
makerere
amply
rawlins
denali
splendour
azevedo
schoolgirl
dpi
richthofen
pregame
sportswear
abdicate
seaplanes
radiance
leaguer
fluted
cri
uil
soared
leichhardt
wane
rube
thessalonica
nieces
windscreen
marbled
fogarty
discoverers
bungalows
arrangers
hobbyists
schnyder
babur
noe
alpina
impassable
dens
checkout
plumes
mobilisation
aubert
edina
evaporated
gretel
intermediaries
mehr
honshu
galindo
impenetrable
ionized
anyang
novorossiysk
makarov
prins
atholl
amanita
cichlid
marl
lumberjacks
karloff
ultras
ataxia
rothwell
enquiries
ivey
hazen
inaction
qutb
cdf
yellowknife
offside
bicarbonate
nordisk
hurwitz
trask
eben
pastries
vestfold
owings
betancourt
lackey
gianfranco
doane
gabonese
hondo
halal
greasy
skips
pauly
vallecano
mischa
feller
skimming
giraud
hazzard
skeet
plump
abellio
cutthroat
reinhart
ilona
chubby
dripping
erzurum
dyeing
sinestro
ocr
faceted
bards
devious
columella
langham
archeologist
chara
electromagnetism
orinoco
nll
persephone
modo
unconditionally
musicologists
cowles
barneveld
secrete
welling
whaler
camila
pancakes
mattie
dredge
emphasises
toothpaste
ucr
allenby
impregnated
budgeted
greets
understated
arvind
brunton
geist
furnishing
anesthetic
infraction
mahut
imitations
guin
foxe
plumb
frères
camaro
secretions
bolero
wnt
newborns
luk
fatale
cataloging
stavros
precession
requester
bream
inexplicable
machina
krone
dufour
outbursts
sofía
minolta
alcoa
interrelated
intermission
isaak
paparazzi
baht
matamoros
intercepting
lass
blitzkrieg
rebekah
pyaar
plundering
tabled
lauri
vadodara
meadowlands
lázaro
mannequin
fcl
nevins
unregulated
bana
angeli
trendy
gto
popescu
goffin
mcalpine
genova
quine
gynecology
ayesha
copacabana
amuse
archetypes
deadwood
leonards
plagues
viticulture
midler
percussionists
ranting
snide
cand
restituta
stilts
lansbury
villegas
brac
melodramatic
bewitched
hasse
clarifications
chasseurs
mollie
cogent
salo
loach
wilkerson
birgit
morphologically
châteaux
nkrumah
arminia
imus
bhg
starfire
eventing
crass
diverging
hydrochloric
roslyn
maleeva
hüseyin
hugs
halcyon
mardin
zoë
rationalist
novitiate
miramax
debunked
ulla
catalyses
ufl
fleurs
gympie
lassiter
nextel
tei
antidepressants
hesitated
feist
quintets
soir
wolcott
riverina
cornerbacks
harrowing
viña
awardee
coro
slaps
headdress
steinbach
hillsides
algoma
scissor
milly
macaque
vaucluse
unjustly
tala
lirr
opec
sayre
ould
stratosphere
pegs
clackamas
rosemont
goring
expiring
murrow
colonnade
corrientes
polyphony
transactional
pippa
hocking
ladislav
arora
hamper
granth
tralee
ilk
espiritu
closeup
freudian
patuxent
maxillary
pedagogue
mycobacterium
apec
nss
parallelism
effendi
mullah
minimalism
billington
sverre
nishimura
higginson
pomegranate
phosphatase
chaitanya
tilley
interludes
trouser
harju
scrambling
cwm
pompous
kohli
hutu
rushmore
comatose
specialise
montgomerie
laszlo
ingo
hetherington
equalization
solís
gelatin
onsen
ptt
chilli
sivaji
functionaries
trianon
watermill
assange
abou
shute
falsetto
gaeta
senecio
hatter
shamrocks
ferrol
matsuda
steuart
sparing
ved
nudes
alderson
krug
lippincott
sepia
aig
abreast
islas
varney
surry
morrill
outed
sé
goldschmidt
cadogan
cabell
snr
yon
wala
nissen
bahram
throats
gout
impounded
lpg
antisocial
lahiri
homeric
narrating
slattery
liqueur
spotsylvania
springboks
doubleheader
danmark
taira
edgy
sketching
spate
consequential
atticus
bally
commotion
aural
yay
eichmann
arnulf
portadown
lazare
fives
sweepstakes
matriarch
capitulated
gatsby
mullin
vermillion
millington
stralsund
juli
cleave
worrell
edgardo
islamism
rit
sdtv
elías
witten
fitzhugh
garrigues
ormonde
quin
unsettling
kolb
midseason
gamut
gawain
clades
devalue
gascony
bystander
stepdaughter
mocks
cleese
emeralds
metamorphoses
aurelia
sirjan
meteorologists
meltzer
repulsive
sewanee
alco
worden
mulcahy
swagger
binoculars
pillsbury
mamie
swallowtail
vel
ophthalmologist
lotteries
cistern
kell
alfons
handcuffs
brubeck
doped
articled
compaq
subtracted
hotspots
fondazione
wretched
homegrown
internees
ihre
upholstery
telstar
ysgol
dimensionless
stato
cpp
antananarivo
kemble
tuanku
matías
undamaged
bustling
conquistadors
hander
winans
secede
walthamstow
rauch
flatly
asb
catholique
crossbar
armen
valdosta
amrita
dunams
synapses
ringwood
lithgow
rostrum
cleanly
mordovia
flashpoint
horgan
bullseye
ipoh
mem
ketchum
disposing
marmaduke
blunder
seaward
halogen
douro
blinds
provisioning
clary
cancerous
chronically
freelancer
ifpi
nabil
disjointed
yutaka
uptempo
mtb
backups
legionnaires
hazlitt
sandefjord
caja
gog
percentile
omonia
rcs
clave
heriot
morello
barstow
precambrian
grammarian
cavanaugh
avenir
circumscribed
hernia
tocantins
mitford
courtesan
trawlers
aimé
clapboard
folkloric
thurn
blurbs
sorely
balmoral
unknowns
odette
pauling
craddock
hasta
wholesome
nasi
inflamed
bosley
qualms
apologist
caswell
kilgore
shingo
yn
peerages
marinus
writs
huta
dorje
predefined
mazar
brahmaputra
hospitalised
destruct
pon
expedient
breakdowns
sasuke
sorenson
magnusson
rubbed
superdraft
aic
jenks
greenbelt
lourenço
steffi
pankaj
bsp
sma
ssn
moms
zanu
vez
backdoor
toaster
jami
nva
cirrhosis
fucked
sleek
tappeh
ura
dearth
montebello
marton
encylopedia
stimulant
dauphiné
gilpin
transhumanist
wigmore
fajardo
malfunctioning
refreshed
overcast
fringed
gaiety
tendons
urbano
pogo
oha
cockroach
caerphilly
neurotic
blockage
tetrahedron
beatz
efficiencies
mapuche
gwyneth
semifinalists
philharmonia
affliction
oiler
gio
felled
waffle
ecclesia
violoncello
primeval
designator
mannerist
pheromones
transgression
chops
perceval
protectionist
herkimer
juggernaut
whalen
laverne
bodybuilders
orig
flightless
robespierre
meditative
pml
flashy
civ
uaa
crake
riddell
assoc
reticulum
incited
browed
lusitania
incensed
graze
emissaries
hércules
fitzalan
brackett
lolo
muirhead
cheddar
hellboy
ilie
krylia
meteors
asaph
bolivarian
kryptonite
harish
psy
casals
werk
entice
wenzel
berge
charcot
oddity
glossop
dory
onsite
penniless
raglan
susa
hamelin
albers
solver
wozniak
judaic
peloponnesian
lom
ghazni
bernadotte
luxor
fes
sidecar
bistro
raina
medica
articleid
mouvement
kronos
bahar
leroux
frankford
missal
earwig
sanitarium
sano
vanadium
fetishism
rushdie
gentrification
fj
catwalk
paymaster
moline
kingstown
mahadev
lawrenceville
withstood
probationary
imsa
bracts
korg
mayes
umi
galton
incompatibility
abbasi
coauthored
straddling
wicketkeeper
interferometer
meenakshi
sommers
wenn
coutts
edi
bremerhaven
strewn
candies
doraemon
liaisons
unannounced
hardwicke
taunt
technologist
androgen
storrs
helmuth
venturi
veritable
kaifeng
edgerton
massoud
drifts
refuelling
cav
faceless
canaanite
epidermis
clinician
superstitions
someplace
valerian
carpi
yf
dials
agarwal
baidu
conservatorium
terse
abril
snohomish
gruppo
tino
wolfman
fateful
mehra
xhtml
cerebellum
marshland
seema
stedman
visigoths
preexisting
ophthalmic
smk
unintelligible
horan
fiedler
bizjournals
hypothermia
snowboarders
headmasters
ermine
mccook
fila
trm
azadi
reconsideration
lymphatic
serfdom
scepticism
tamás
jours
dena
pandas
lisburn
arndt
vang
iri
rinaldi
kelli
sda
cllr
invisibility
hafez
tatsuya
csn
conant
impropriety
holton
mcm
llandaff
glazer
crediting
sici
dsi
sowing
wz
warlike
violas
emulsion
afterword
delirious
semarang
sabatini
tripped
contralto
gasquet
consumerism
strontium
kerber
overground
presque
reiko
attributions
lectionary
nebulae
airforce
coren
cosenza
emmerich
resistors
ilija
brees
orang
bülow
annika
hachette
fierro
polaroid
calderon
antti
bolling
kardzhali
gmo
obregón
dramatized
shamans
lipscomb
stjepan
savory
incubate
metalist
milli
rajinikanth
garbo
isl
nudge
phraya
idyllic
coxless
cannibals
counterintelligence
selectable
invariance
imparted
busway
ves
lalo
unprofitable
pil
principia
vihar
traviata
obedient
exclave
ablaze
athol
caesium
kennington
actes
safed
crüe
anka
katerina
quagmire
amerika
goons
taku
ilse
pantograph
castelli
taff
underlies
meru
holcomb
primitives
brockton
sandeep
steers
unresponsive
hillier
wold
mitrovica
klee
necessitate
watchlisted
contador
terje
ane
immortalized
playfair
weizmann
messe
mz
connolley
bonin
townsfolk
averse
redeeming
kenner
operettas
outweighs
munk
adapts
officiers
spatially
mtk
rustam
schaffhausen
retriever
incognito
sherpa
tailoring
ankh
alcorn
formby
taz
knvb
milliseconds
barham
erna
latch
corus
nestled
obstructive
consummate
denominated
assesses
ovaries
janson
velika
lutyens
mech
anupam
anubis
stroll
critiqued
guerin
elucidated
islay
venlo
wpt
zuid
woohoo
branca
donau
equus
pta
velde
cyrano
perplexed
tsushima
sweetest
tirol
campfire
visalia
percussive
adsorption
precautionary
malwa
hebert
demarco
sarsfield
stalinism
givens
mks
entablature
calcite
bayt
hst
agence
permeable
tyner
yancey
exposé
mucosa
stalybridge
flava
putsch
regulus
chulalongkorn
cylon
chinensis
thereupon
halides
bosh
dost
doj
rauf
barratt
keble
californica
clearfield
censured
morsi
csl
rhs
inferiority
viridis
worksop
messiaen
plas
holyrood
rapists
stoller
dulcimer
signers
fitchburg
fictions
goldfinger
toshio
sunnis
timbuktu
monomer
rayyan
calligrapher
ragas
rrna
lta
lightship
reinstating
raison
schur
sieradz
brushing
perrier
johnathan
citi
massie
hyeon
plugging
cacao
birkbeck
entwistle
foramen
sabadell
turki
aleksandrovich
mulan
perpetuating
hamdan
cuando
manitou
suffices
xena
zhukov
urology
trincomalee
nabisco
estuarine
warehousing
nineveh
dup
unreserved
evacuating
zemun
krüger
maule
dermal
chita
neanderthals
srs
fansites
mies
saran
drawbridge
bikaner
margie
tailplane
thrives
swivel
farouk
teenaged
dissipate
xerxes
famers
enos
curtail
riau
narrowest
washer
poon
jhansi
ramen
dependable
dupage
veitch
preliminaries
fredric
methyltransferase
atlante
bouches
collages
analgesic
cecily
carcasses
axl
shag
rundown
smurf
soi
galilei
messes
disfigured
dexterity
shafer
unsaturated
birger
bethnal
castleton
shortfall
ssd
apologetics
ovas
homesteads
dearest
accelerates
smuts
rothman
kesha
tahrir
panelists
sizing
quilmes
merciful
masterful
shopkeeper
vests
morgenstern
seger
pcp
amines
bamford
walworth
gauri
qianlong
vasili
practicable
oswestry
juma
leda
amu
tbc
disservice
beards
brooklands
marmalade
naft
tracery
operandi
saguenay
beaconsfield
randwick
consuelo
snelling
overdubs
armoury
antithesis
leesburg
batley
karo
matures
longueuil
realtime
pellegrini
fogg
improvise
mockumentary
wiccan
perverted
timbered
gatherer
crevices
sepp
everly
gaiden
eger
meow
mcwilliams
emacs
scotts
diwali
permissive
manilow
carman
ppd
selden
rickshaw
karsten
hardworking
tadpoles
shone
dawg
rijksmuseum
wort
discontinuity
bergerac
kashgar
lug
homemaker
anglais
purposeful
addictions
flintstones
handily
gorda
montego
cadastral
quinto
evie
vertebra
vendôme
bathhouse
gabba
bloor
hexadecimal
moulds
ddg
philologists
bretton
smithers
electives
fhm
glengarry
eleazar
internationalist
herzliya
jossi
gwh
reassembled
serhiy
globalisation
karna
igf
trobe
cisticola
kayseri
coagulation
lapses
mladen
noda
kamel
alten
dungannon
emcee
inger
dabs
cantrell
unrequited
oceanside
miu
facundo
crawler
trueman
paulette
usfl
installs
atleast
cfds
vestments
shanahan
matson
katakana
machi
payer
aeroplanes
powders
mathers
rigoletto
gmelin
reestablish
salcedo
melaka
boydell
skateboarder
morden
lilley
diallo
haan
hermeneutics
wahid
tmc
kivu
esso
duk
infallible
hermosa
stuttering
concurring
breakthroughs
bremerton
squaw
uncalled
thanos
marbella
winder
libra
bleaching
procedurally
kimble
freeview
indictments
clashing
ebrahim
marengo
luzern
durrell
skim
morgana
sucrose
elmhurst
elks
castellano
plenum
nami
mise
giannis
vaulting
trackers
gaurav
suzuka
finesse
conceptualized
livius
brooker
semaphore
faria
inhaled
perf
drucker
glan
codices
macbook
leaky
scooters
voce
pilsen
foix
reconnect
trapeze
hewn
booklist
swinhoe
lenore
conurbation
certiorari
disparage
glockenspiel
lactic
thrashing
forêt
glaucoma
scone
daydream
reyna
lorries
escalator
brahe
cava
yeshivas
passively
krugman
contemporain
higham
fairport
reus
infantile
stoltenberg
fiume
vespasian
xfinity
borghese
schenk
hansel
stenosis
alexia
symbian
orford
kul
frieda
merdeka
alene
repent
baca
clapp
lubricants
sluggish
vying
eckert
downton
sigrid
longitudinally
shibata
barca
lifeless
ldu
tutte
miserably
goetz
alexandros
litany
proverbial
laurentian
zvonareva
memorize
marvels
calming
redevelop
stash
demeaning
stilwell
profess
casio
dacre
negated
secures
bonanno
swims
lq
pounded
immunoglobulin
sapienza
bakar
underrepresented
fürstenberg
doggett
belgrave
congregate
bitola
millionth
lectureship
cargoes
gaulish
slumber
llodra
hsiao
docg
positivism
gingerbread
singha
sequestration
metalwork
bulbous
serna
sponsorships
né
wasatch
mut
gcb
pohnpei
chonburi
broz
mosby
vetting
castiglione
hydraulics
responder
bhagavad
masterchef
ruan
kum
gentiles
oars
madly
kamp
brownstone
septum
inadvertent
delos
cbr
milliken
imphal
neuropathy
sokoto
fitzgibbon
layering
ntt
barnacle
progesterone
uli
gullies
sutta
inflate
nafta
rhizomes
toungoo
decoded
verano
straightened
improvisations
femoral
marchetti
pellegrino
ghettos
pele
bharti
néstor
clerck
effingham
inconsequential
transponder
sys
podlaska
nikolayevich
categorising
lockport
altair
myrna
akiva
capacitive
samuelson
sympathize
keiji
annoys
acumen
sadd
rappahannock
damme
bisexuality
scuffle
loiret
saa
melina
chasse
untrained
pontiff
exemplify
compensating
inadequately
fso
edirne
jehan
kimchi
khun
gwendolyn
monasticism
csv
rialto
sweetened
trope
mistrust
mouthed
lusignan
clos
formulaic
calyx
whitefield
nesmith
spandau
orden
seb
gennady
zelaya
matchbox
emulating
worf
arnaldo
ambushes
mistreated
unep
trollope
joris
handset
untouchables
militarism
masterworks
asmara
plácido
marchioness
spliced
jarrod
enc
nitrous
carlsberg
attentive
pigmentation
ainslie
cofounder
tsg
nazar
urals
earthwork
steeped
dredged
artem
recreativo
sandia
cheered
baia
creoles
auk
spiritualist
laconia
potash
detergent
shrouded
léonard
shrews
whitbread
⅓
bodo
klf
lightness
boulez
leper
svoboda
munda
statuette
satyajit
zor
dived
mirna
cellos
noncommercial
denbigh
hermon
glycoprotein
fairbank
timon
plebeian
otsego
loam
haj
hoch
formalised
pediatrician
isolde
edoardo
roundel
likhovtseva
janelle
fol
elongation
satoru
chernivtsi
anda
iago
polychrome
responsiveness
ssb
sais
aurobindo
irishmen
repton
ferndale
anker
endangering
dueling
ronda
audacious
jerónimo
flautist
holtz
mercilessly
nevin
savona
vcs
chrysostom
politiques
tiring
ytv
estrellas
outweighed
hardman
synoptic
massenet
vowing
deleterious
instill
existentialism
magnificat
vitreous
analogs
kennesaw
pessoa
catedral
rels
homeostasis
vouchers
grp
syringe
loggins
beeston
raisins
succumbing
flushed
ilyich
elio
vimy
nda
gregarious
hsin
hillsong
ferrier
amethyst
antidepressant
alvar
dazed
oxo
roms
dewi
schizophrenic
motagua
supervises
nieves
arnett
tiësto
rephrasing
adore
mumtaz
licks
dru
grammophon
naha
scène
sabu
quasar
sadc
robusta
coughing
glycogen
comix
ashlee
abkhaz
admixture
hartland
deniz
witted
desportivo
virulent
ignace
caliente
bagpipe
depose
dateline
abnormality
lasky
connoisseur
strafford
bridgestone
safa
fuerza
tortures
kennedys
hager
seto
concealment
sila
prater
dobie
metalworking
olivet
underestimate
riker
wishart
sextus
tubercles
ernestine
zacharias
leaned
antioxidant
ridgewood
chancellorsville
meiosis
biju
tourette
nagle
ayodhya
orhan
commissariat
heals
olympiads
expounded
laud
haris
prentiss
delacroix
greenery
barisan
pollinated
munn
porgy
impair
bracknell
ecotourism
aaj
manama
bandicoot
arecibo
raye
snows
loopholes
helices
dengue
mpi
sopot
goebel
archon
reptilian
crotalus
xaver
bosque
hel
schirmer
copperfield
claret
raab
kolar
gy
galleon
enright
robustness
juridical
lint
kiko
ponzi
woodcuts
weatherman
dibiase
oam
alphonso
kirkcaldy
crossovers
rhizome
cognac
woe
moen
kolbe
tachibana
cmj
thurmond
bsi
badged
gargoyles
guantánamo
recreations
replete
malmesbury
oilfield
qiao
juniata
amstel
godley
marvellous
junkyard
diop
thier
redhead
zm
mexicali
boycotts
spiel
purporting
rincón
carlotta
tabular
pender
michiel
rhee
roslin
ohne
musicianship
millennial
peculiarities
annulment
wham
lunchtime
radeon
chrysanthemum
evangelization
compressors
curled
adversarial
durbin
doughnut
wav
unbounded
spitsbergen
tutsi
northeasterly
negativity
upstart
tani
distributive
lacan
casal
zaporizhia
cavanagh
groucho
toyotomi
ess
perrault
esk
arl
waterside
²
inasmuch
bendix
amhara
encamped
projekt
candelaria
meetinghouse
lakhs
wipo
offhand
noll
malachi
waxy
sarek
storks
splitter
bruni
paiute
neues
conglomerates
aruna
spars
bizarro
beachhead
alderney
omnium
nim
nebulous
bodhi
dizziness
amado
blush
sassari
badakhshan
wemyss
tiffin
holyhead
metaphorically
tuscarora
aerodromes
visor
zeballos
condensing
burwell
alcott
nankai
beata
multicast
niemeyer
tether
bhatti
yasin
glycerol
matsumura
sultana
islip
workspace
sandhya
biogeography
greenlandic
ypsilanti
goldstone
farooq
ober
trophée
refuting
defensively
foch
hallucination
storehouse
solvable
precocious
siphon
crags
gunsmoke
campbelltown
moni
tamaki
nutty
evaporate
auctioneer
belvoir
talley
sepsis
heusen
hvac
stormwater
workaround
highfield
quaint
caliphs
satori
ambiguities
uaw
shuttles
haney
alito
akash
roberson
junge
rained
ensue
jayson
mejor
impaled
aspersions
vali
vocally
méliès
nigh
uday
paywall
breckenridge
gracia
bollocks
refresher
ellwood
glentoran
minstrels
dpr
atrocious
garten
tickle
dedham
oban
sada
westfalen
heartless
tca
azmi
algeciras
ferrying
winwood
crystallization
basques
hiwish
saxton
desde
juxtaposed
encoder
lod
looped
unelected
bisected
clout
tournai
intractable
grapefruit
apothecary
sohn
salgado
hollows
denman
backfield
mesto
raghavan
aum
ucsd
garay
pabst
keng
vibrational
verner
walkin
fallacious
mok
rigidly
pelosi
bernier
leia
fanatics
jeu
exoplanet
unnumbered
denier
kankakee
americano
egyptologist
factoring
resonate
wilford
giffard
pacino
lucrezia
hogwarts
wenatchee
thruster
carbonated
yai
cze
cfg
underline
orgy
loathing
vasari
stockpile
inxs
caritas
tid
rutter
mucous
vandalise
sema
arashi
motionless
midline
sookie
sarai
sgi
mandibles
submersible
abington
chiles
cleve
compost
rundgren
soong
scuderia
usk
prisms
otoh
sevenoaks
bailed
imperator
threefold
parthenon
seafarers
gnomes
reykjavik
cherish
epi
doukas
sombra
idolatry
unload
retirees
brantley
frederica
professionnelle
michener
dumitru
embankments
aiba
reconciling
jinja
mariko
claxton
smashes
agave
pinckney
gerson
chica
uchida
sumatran
pietà
centipede
vistas
tzadik
ghoul
catlin
defaulted
rollercoaster
wahl
pgp
battaglia
woodblock
swells
haigh
vesnina
gelsenkirchen
tris
hoot
perceiving
floris
cheval
scimitar
setanta
soak
grapple
slicing
deere
nutritious
briefed
stéphanie
alexandrovich
techcrunch
annecy
sfc
summerfield
rashi
fête
polluting
xxvi
palisade
orientale
mermaids
acad
kovacs
bramall
cna
artis
sundanese
overzealous
dorman
kennet
macroeconomics
lemay
forerunners
thq
kherson
freeland
almshouses
dori
racquetball
tact
sustenance
takumi
ravines
jansson
beckley
veliko
kader
headliners
testifies
profesional
saka
moshav
songbird
playwriting
medics
mixers
shires
bobsledder
proportionality
paribas
tyra
kroll
transgressions
narrators
rosanna
kokomo
seguin
vecchio
enslavement
clementi
kubota
rushden
arin
grégoire
elko
ethically
hilt
zsa
mitosis
offsets
miskolc
ramanujan
hastened
kirkuk
docu
legislated
precludes
prelates
fens
comunale
camouflaged
roz
keele
aue
ikea
chetnik
lillywhite
thiel
levees
omi
woburn
appreciable
cheol
pandavas
dur
jourdan
anthracite
tremolo
spud
fundamentalists
barbieri
lifetimes
bingen
sanhedrin
prong
lillooet
racquet
sown
lorimer
picker
ranjan
malick
cheeky
banging
sian
aloof
plasmid
verden
vásquez
tabulated
erde
hjalmar
duress
edf
quirino
dogged
kempton
renate
wolsey
goalscoring
zuckerman
kharagpur
retaken
visualized
biweekly
orfeo
omnia
ariège
chippenham
acetone
chandran
halide
bonaire
ethers
mariya
testicles
rasht
couplet
chaff
bassey
okazaki
penning
dfa
minefields
snowflake
superposition
gatekeeper
rsl
clapping
urgell
dsb
pyrmont
autographs
codification
reincarnated
pkp
ján
kargil
babysitter
soissons
romford
unworkable
ignacy
osc
incisors
estrela
commensurate
redo
carcassonne
quiroga
busoni
periyar
manukau
unranked
philatelists
insectivorous
electrolysis
holed
unbuilt
domingue
dayal
marcela
bridegroom
hyperlink
ebooks
vought
mahjong
acrimonious
primogeniture
congruence
tritium
legible
bassano
sohc
effie
misbehavior
blücher
tilton
tetris
yvan
restorative
hikes
fouling
fylde
panthera
resourceful
irreverent
lucero
pathan
moroder
galloping
nemanja
cowie
pui
populism
hefner
adriaan
adria
liquefied
eschatology
taki
recherches
snarky
dien
tauziat
palustris
impeached
bap
aeolian
garson
flagpole
tejada
baehr
epileptic
katarzyna
delimited
disenfranchised
dep
epochs
bayeux
arion
misa
phobos
moir
subtracting
fauré
jeeves
gloom
disengagement
unaired
rta
instated
queenie
valse
lope
matias
varley
disordered
searchers
manaus
möbius
mita
tomcat
placename
hpv
colmar
msi
fiqh
bruises
lci
tlingit
slippers
mccloud
frente
epps
storch
webby
overjoyed
rih
technologists
yanks
caldecott
freaky
poppins
utep
aurea
aishwarya
tirpitz
roussel
gilead
forecourt
jeux
platypus
mbeki
soundscan
devotions
kempe
dba
woolworth
cogan
uprooted
blimp
kantor
mensa
veloso
testa
amyloid
lopsided
kau
subjugated
sls
merciless
vouch
homicides
thumbnails
sagittarius
webbing
bramble
newland
baboon
bivalves
firstborn
mukhtar
hailey
verna
joost
emplacements
rubik
cidade
barman
antigone
threading
specious
deeming
kloster
yank
liberators
programmatic
founds
citibank
rother
neutralized
suomen
auer
paradoxically
bromsgrove
knopfler
haydon
pimentel
unplanned
tawi
lietuvos
chocolates
émigré
belediyespor
circe
xiaoping
rusher
mino
ales
gerlach
adverbs
bloke
merlot
blok
gunning
garrido
recursively
mckeown
steeper
hitchin
madrigals
clearest
inflexible
smitten
apportionment
endocrinology
impure
ganj
nona
curiosities
wearable
diu
trovatore
fajr
diarist
newsreaders
immorality
boomers
perfumes
tân
etiology
expedite
bollinger
girders
sweeter
embarks
rebuked
mötley
suburbia
onlookers
kaine
ƒ
cabeza
microbiologist
nook
erupt
koe
ridgefield
eames
semyon
ort
virginie
laidlaw
prd
kazuki
collett
tewkesbury
amjad
avocado
shareware
exuberant
warangal
mccurdy
hasselt
swirling
crum
strathmore
ene
whining
graceland
mère
smartest
takayama
gst
strindberg
mobilizing
nazim
shaver
rigg
resale
bil
triads
autre
rapa
glencoe
creeper
ujjain
sunfish
xj
excreted
jenn
skillfully
shipbuilder
workmanship
saltire
thermonuclear
hep
goodreads
hearne
thundering
jenni
attenuated
moloney
berets
mur
willey
lek
torsten
willfully
charentes
babbage
vitis
misadventures
semblance
angelos
hardline
kroger
gawler
rundfunk
rectum
uz
girardeau
okamoto
dejean
dts
cng
tdp
alienate
distilleries
handicraft
anakin
legendre
khans
equalised
swelled
luttrell
implosion
minnelli
continuance
régional
faintly
issuer
swindle
broomfield
rubicon
molluscan
ths
intrusions
barrack
blockaded
deering
lamina
sustainment
abyssinia
excision
alda
insulator
selig
rascals
turdus
dashing
jolson
appellant
straighten
leniency
vinay
nrw
splc
maneuverability
subcultures
transjordan
saws
ftse
gálvez
staunchly
pleasantly
fromm
maes
gordo
mati
elen
airbags
shimmer
raccoons
avenging
lexicographer
aja
vuitton
izz
bataille
cling
stratovolcano
hatteras
ulaanbaatar
impassioned
infanticide
schweiz
fingered
pirin
kellner
cynicism
foreshore
cooperates
haveli
octahedral
cse
mckinsey
conflated
cueva
firebox
mmr
aspires
aboriginals
cozy
generalised
overbearing
manchurian
macros
bushido
interstates
industrie
munir
kavita
vangelis
maga
ruggiero
superannuation
prejudicial
chub
pontic
diehl
nain
rowell
refereeing
riddick
naca
euskara
spiker
vesper
overhanging
parabola
convolution
proportionate
equaled
barents
shashi
rensburg
yavapai
bhojpuri
pauper
mond
burdened
superfast
pseudomonas
verdicts
motet
savchenko
creamery
jas
citywide
idiopathic
durst
quashed
adorno
giessen
sicilia
abyssinian
sobieski
ablation
diverges
legumes
psychosocial
laswell
buenaventura
matterhorn
papi
hoffenheim
bassoons
manhunter
nogales
inhumane
cantus
cask
concordat
exemplar
essonne
tarlac
correspondences
jemima
sarcastically
icann
patterning
hydroelectricity
funnels
repulsion
abstentions
impressively
wied
bosphorus
putter
runnin
bailiwick
hypothalamus
darío
albee
taha
danielson
heike
lexi
graben
cheesy
magus
crewed
embolism
jackals
laker
deeded
bittern
tubers
blom
monochromatic
awoke
abbottabad
buoyant
watertight
noguchi
lipa
proclamations
stour
tlaxcala
zucker
libération
pathos
tempera
motu
stockman
hants
overthrowing
vcu
suffragette
rockport
pica
bounding
baile
enlightening
pennsylvanian
jón
electrolytic
cowling
imaged
jebel
metros
grayling
souter
freighters
gallica
tyr
kossuth
pathogenesis
pettigrew
daugava
staphylococcus
rcc
warts
factored
mitsui
casco
levan
mahadevan
labours
fairing
savoia
calmed
pilbara
sickly
sequencer
dupri
reachable
imaginable
kaneko
rousing
safina
inefficiency
ulmer
frederiksberg
zavala
maldon
vico
lookin
bayonets
cumbia
dhawan
musculoskeletal
unlockable
ishq
barat
niculescu
eventful
politeness
debunking
ayacucho
geneticists
kavala
procurator
capoeira
afon
piney
parables
whitcomb
turbocharger
audax
magog
meander
ancash
aaliyah
superlative
valens
fixable
wertheim
shaquille
raz
domitian
plummeted
heydrich
flatbush
hannan
emporium
johnsen
prichard
watling
grasse
utada
jobim
pattaya
hab
natale
qwerty
pueblos
doré
nsl
illyria
craving
mikel
ecologists
lurie
wheelock
fop
corrects
bmo
fae
intensifying
⁄
chasm
holbein
gordie
antonis
revitalized
poulton
subpoena
harbinger
aldous
edgewater
carthaginians
komatsu
edgeworth
anuradhapura
sassy
tinian
computable
attlee
cluttering
yvon
minibus
palembang
batgirl
condone
labial
underdogs
flirts
ecija
toccata
autopilot
,the
mulk
kluwer
mahathir
scythians
uddin
gyrus
noa
jackass
unlawfully
rüdiger
larne
rickenbacker
aryans
haye
nighthawk
kabaddi
modernizing
akhenaten
collides
counterterrorism
meriden
rejoins
resentful
abell
abbie
yoda
floodlights
cliche
chillicothe
veterinarians
mame
lidia
metastasis
redbirds
batang
imperatore
mobley
watchmaker
mey
gayatri
blouse
volumetric
etna
skids
abbe
sylar
taiji
rickman
adjudication
stormont
unflattering
seduces
citizenry
gottlob
aphasia
lire
hag
postcolonial
interrogations
lye
disaffected
asteras
arthurs
duffield
solicitation
mcauley
exerts
negotiators
nervosa
cyclonic
veronika
marga
aleph
ferried
taboos
coastlines
predicated
francophonie
theremin
xenophobia
belge
rha
sra
tbm
gargoyle
determinations
unp
empresses
gonville
fergie
gnosticism
jla
shijiazhuang
dwells
susumu
voldemort
selfridge
frse
sundry
wiggle
belated
redeployed
sump
contemplates
pollinators
gbe
defaulting
stoneham
flyby
alsatian
landless
hesketh
hindering
mappings
mikkelsen
lithographs
proscribed
wiles
ferraro
cosmonauts
thinning
ginn
sanjeev
flipper
qua
seizes
retold
deviated
crisco
paix
franjo
bauman
tvnz
monckton
kyrie
fuad
socialites
pictou
evacuations
sayer
roethlisberger
toggle
unmodified
ubiquitin
ther
hythe
stockdale
vuk
gujrat
depauw
sukumaran
minos
bankhead
trotting
akane
sinfonietta
aardvark
methodical
anis
emt
roa
dilated
wabc
tethered
hoyas
mónica
lalit
oxbow
alexandr
marksmanship
brunette
déjà
mariusz
dormers
heyward
stingers
teardrop
sew
fenner
dailey
ridder
karolina
carbonell
holmenkollen
akiyama
oftentimes
leh
freestanding
esau
epidermal
humanoids
eac
ascribe
messer
warr
holi
fertilized
symantec
kuru
grinstead
jeet
reassure
csb
loveland
fain
fittipaldi
manitowoc
gharb
diaper
narain
dimer
theosophy
sveti
candidature
rehash
dss
honorees
ung
caernarfon
veronese
chandrasekhar
coritiba
distracts
kress
scholes
konkan
iam
foregoing
watkin
germanium
finches
wessel
astronautics
anza
reprises
guillén
sharpening
optically
morgen
kirkman
abomination
rectal
gruffydd
royle
econometrics
crowding
immobile
ripening
ulyanovsk
repackaged
nursed
stax
feliz
zinn
cowes
misspellings
tapia
outcasts
handkerchief
laughton
eilat
brm
melancholic
transiting
chaffee
miko
traumatized
benefitted
rearrange
hoses
hezekiah
gums
alaric
pth
gasol
sacramental
gyro
relativism
nts
sandinista
queried
tizzle
mountjoy
aeneid
candlestick
tuan
romer
bucs
veal
thapa
nitin
wilber
ahh
vitus
dazzle
acoustical
albi
permafrost
truk
srt
cursing
keir
jujuy
maugham
aristophanes
mineralogist
blackmailed
emphysema
entombed
roughness
radiotherapy
egil
conformational
bunko
ttn
considerate
swath
montt
ivanova
tiber
hectic
ruano
skilful
ries
pix
henriques
rtc
headlamps
chuo
bootlegs
clerestory
neurotransmitters
surged
awakes
manzano
blacksmiths
tirupati
nota
aronson
olden
quartzite
malkin
willingham
uit
backfired
lemuel
batty
elly
schiavo
constitutive
ekman
kushner
backcountry
dominus
stockholders
undertones
stephane
typesetting
hitoshi
arakan
earthenware
ywca
seifert
lett
danica
coughlan
nour
kabhi
neff
monarchists
dragonflies
despise
showy
eluded
pronged
hummingbirds
iaa
quintanilla
flamingos
hamed
andrena
satyr
obstructions
seria
santee
atrocity
dodging
solberg
indium
fujimori
liceo
eakins
netherlandish
prawn
roemer
pallida
luxe
diminishes
shapur
rix
scifi
tol
ack
suffragist
sankara
ethnographer
gigabit
devaluation
pearly
exacting
rothstein
michell
radley
bba
transformational
vagueness
jihadist
forecastle
leaderboard
westview
accomplishes
bebo
patchy
sundaram
prototyping
platter
weibo
abstractions
jessup
melilla
procuring
abergavenny
manos
bushfires
pare
reals
laure
consternation
untouchable
hoxha
violetta
hutchings
murs
ulu
raiden
virtuosity
remand
khash
choked
undercut
zhan
jussi
surfacing
voucher
bushranger
boku
monahan
thanet
nines
robocop
kellerman
corroborate
wsl
stine
snape
cyanobacteria
hors
uu
bedlam
stereotyping
astonishment
ede
grose
sacral
masthead
abraxas
skylight
bagration
prohibitive
hunch
safin
fluctuated
definable
submissive
pillaged
pontevedra
vasconcelos
subgenres
evita
criminally
weidenfeld
soca
ache
dmk
gord
bloods
tvr
cunt
bornholm
fifi
insufficiency
gasp
ruf
bragging
batt
greenaway
squamish
subliminal
primorsky
princesa
tdi
capitalise
lindner
marshfield
kosovan
personas
morbidity
purest
acura
trickery
aveiro
orel
inquired
catanzaro
hodson
gounod
patriarchy
totnes
pitfalls
blondes
wigs
renwick
kora
parodying
huy
impersonate
dreamland
kirkham
bolan
stilt
sprouts
sturges
cholas
predictably
insemination
haringey
linger
opendocument
ashdown
rann
plantain
libellous
slurry
somethin
tuft
bestsellers
moti
galería
aníbal
berea
subchannels
bernardi
salar
mandating
masterton
sherri
embattled
fella
gratification
computationally
paraphernalia
franziska
cantwell
unexplored
disrepute
multiplexed
jarrah
tema
finsbury
mose
indisputable
enriching
inv
tidying
lamia
heredity
yt
directx
rooting
breezy
landshut
woodwinds
darrin
aotearoa
alligators
jacobites
rehoboth
itching
woefully
sebastião
rayne
anschluss
tombstones
sterne
cerebellar
fluctuation
testers
corvinus
agate
patrimony
insecticide
sundsvall
dissented
synods
défense
kleist
hosni
traceable
uttara
eurocopter
pita
lyase
ovw
clarice
beauvoir
modifiers
mcveigh
anderton
shamir
tes
gur
siskiyou
cpm
reposted
tseng
gaziantep
coopers
callisto
sandys
linga
inclement
hejaz
nodal
showrunner
tribulations
yazid
daigo
angler
testicular
pours
fara
emmylou
signet
priming
panes
rimbaud
reprimand
valente
apologetic
ricochet
leib
inst
motels
virgilio
kiva
darley
annuals
kook
neverland
elsinore
fervor
garhwal
mattered
derecho
baritones
cloisters
cadena
jomo
skynyrd
cirencester
gata
dasa
fallin
intelsat
aeronautica
roxburgh
arica
donned
bohdan
pacer
exterminated
prismatic
dollhouse
infertile
blenny
faraway
margareta
mingo
emf
asymptomatic
cunliffe
radhakrishnan
clc
marlo
più
allround
intercepts
franke
shirvan
scribd
supposition
ashleigh
schuman
noticias
triathlete
salesian
concours
banyan
supernovae
piaget
redfield
meaningfully
pge
chamorro
cannery
misiones
zain
reorganizing
ackermann
osha
carronades
mandrake
nigerien
jezebel
raitt
durbar
eis
forays
innkeeper
rnzaf
spokes
ferb
jor
overwritten
rpi
menstruation
unabridged
witham
wipeout
hippocrates
texte
pareto
blindfolded
playlists
bharathi
welle
ulan
frauen
cyde
plotters
predominance
passable
powertrain
neruda
oligarchy
amenhotep
kettler
reps
oj
mahoning
wallach
shipp
damper
conquers
smithson
validly
hsp
zootaxa
interrogate
plein
resistivity
synchronize
svein
barometer
fleas
mitchum
squatting
chantry
occlusion
legitimize
strasburg
belmonte
prema
―
bcl
atletico
copulation
pakenham
timişoara
ccl
palladio
cancellations
evacuees
prebendary
polyurethane
scarring
darwinian
landwehr
ruta
nand
grillo
excavating
dedicates
ronne
birding
riser
olly
grassi
mansoor
zirconium
touristic
androids
tanglewood
usps
oakleigh
winningest
mulatto
geriatric
tangle
crammed
pata
fredericks
komodo
orangutan
brosnan
ciro
ansel
sikorski
blister
deductive
instituting
frémont
chitral
interferon
bigg
satires
resuscitation
kenmore
rochford
natures
newbridge
juha
crescendo
cloths
barthel
diversions
columbo
pennell
heo
cobbled
carle
transposed
freemen
papp
hvdc
osh
gba
bookmarks
scherzinger
iwan
macdowell
obtainable
thurrock
offbeat
wordplay
chagall
inverter
igg
fürst
poulenc
daggett
dispel
bca
lawfully
galatea
arta
serres
woolsey
iep
bounces
morelli
blackened
andrée
niebla
classifier
conservapedia
quays
lashed
geraint
infact
platelets
lyricism
goaltending
tarleton
booed
pollutant
emulators
inaccurately
trevelyan
frodo
fob
flocked
krosno
bua
kuril
creme
morea
brenton
wdr
henschel
henman
forsaken
dorms
bibb
amba
sulpice
cen
leftists
allyn
bein
transcends
ladysmith
totalitarianism
captivating
practicality
pashtuns
kenai
humerus
panay
spacey
divested
tonality
worthiness
mercian
amputees
ballon
satanism
stanislaw
goldeneye
grandin
kurukshetra
prabhakar
axiomatic
dmz
chatto
tbn
exon
rubra
skipton
backside
buckethead
morphed
neuromuscular
gascoyne
colle
freshness
overrides
armorial
brownian
heil
pug
glut
pallet
agincourt
chamois
seder
reprieve
tio
dicaprio
digitization
conveyance
igniting
sculptured
pcf
josephson
erlich
punto
streamlining
bombshell
stolle
garfunkel
acuity
posada
radnor
lard
cert
salve
manas
lumps
tovar
metastatic
eliminations
fiddlers
dha
ahem
seco
misdeeds
krohn
gyan
galactus
futura
bartow
showman
edin
mizuno
isma
pretorius
brockman
briarcliff
soros
haka
misidentified
fro
tablelands
tailings
jiro
bauxite
omnipotent
leeuwarden
annenberg
iti
daisies
maccarthy
sobriety
zhuo
swp
neuroscientists
fabien
farhad
whitten
frauds
jarl
incurable
furiously
arapaho
coromandel
battambang
amygdala
magnetization
enterprising
companionship
ouagadougou
fon
nikolas
venables
opa
surtees
lafontaine
thera
ramana
stung
friedlander
delphine
suter
sanam
booting
merriman
veera
cra
brauer
farris
watchful
generalitat
umpiring
skimmed
refinements
abramoff
hashem
carat
marsupials
gemstones
horrendous
atco
cassia
ledesma
fricke
adj
hydrologic
streatham
paused
nanchang
lak
brainwashing
tage
pegg
flourishes
siouxsie
storytellers
ratchasima
memos
hakeem
obra
neko
hapless
ballade
horseracing
translocation
kuti
nutter
sont
criminality
remus
sanctus
onassis
qat
incubated
blacklisting
sunnyvale
viggo
bumping
breeches
lintel
franky
wily
efendi
papaya
dispossessed
maar
mui
gooding
demonstrative
domingos
potro
nonesuch
hirohito
weeknights
duomo
waitangi
hayakawa
exoskeleton
jost
legislate
tcr
concertmaster
lupo
pavlovich
gaborone
cortlandt
alana
apnea
dprk
asda
teramo
pickwick
sleepless
mauritanian
adjudged
fantastical
caddy
saxena
rupaul
navan
concordance
newby
remodelling
peeters
axelrod
newsgroup
dispatching
tetrapods
bina
dougie
banquets
politehnica
nadeem
arginine
exxonmobil
grazed
shuffled
ibero
phenomenological
nhra
holon
armidale
quranic
mayberry
urns
sophomores
termite
drifters
sona
perks
alienating
legio
dargah
sprays
zuni
juke
cae
tiga
unequivocal
bidirectional
dutra
hattori
dasgupta
luciana
dunstable
alumna
hema
wuxi
lapwing
phenotypes
pottsville
semantically
draughts
generality
rajapaksa
boni
reaves
overridden
erred
balthazar
sorin
coronal
tonto
grainy
kashyap
havok
diagnosing
carmina
escondido
celluloid
mallow
lain
haarhuis
poaceae
strada
apuestas
hina
associazione
wallenberg
martinus
goodson
sheldrake
varnish
scaring
rehearse
noose
safeway
hemphill
seidel
soni
flogging
tokyu
contributory
farrington
ennobled
aquaria
sieur
deformity
wgbh
burdon
hoodoo
lyudmila
gettin
scat
coxed
gelding
hayworth
traci
aleister
yb
abstaining
macleay
barone
girth
unmatched
burj
sparc
mahone
infantryman
vizcaya
castellanos
crustacean
hrt
saintes
capitalizing
gravestones
vets
pepsico
sarcoma
badgering
forefathers
cyrenaica
hollande
bilingualism
wmd
mondadori
gunnison
thf
fiefs
brom
retrial
daud
sandal
hornblower
schütz
waratahs
snowdon
sixers
pathak
autoblock
supranational
milking
foxtel
spb
curragh
aiko
cull
ppc
championing
serviceable
poop
ellipsis
vorarlberg
blot
killa
venizelos
debs
wf
creeps
kristofferson
carcinogenic
lobbies
italiani
wein
straws
fulani
miyako
lamy
gente
suffragists
magnified
mandibular
cropper
creuse
adrianople
quai
canopies
karpov
christus
ibrox
prodding
ostia
ça
cosplay
atms
amiable
reliquary
rayburn
benet
raving
dispositions
flange
pentateuch
ese
cooperstown
zakaria
walleye
kinky
ischemic
econometric
oude
unaccredited
gaudio
matsuyama
tranquil
osteoporosis
versace
shenhua
embarrass
sreekumar
sappers
hardee
wazir
soaking
maxie
modulator
recused
tsuyoshi
vesuvius
robben
tunney
stackpole
visayan
aggregating
treadwell
deon
volpe
fart
tubman
auster
khon
hillock
rawa
fabled
overseers
heft
inlaid
spina
apportioned
emptive
imperfections
lubricant
arundell
welwyn
insertions
unmistakable
utley
golgi
buganda
coq
carswell
recruiter
infiltrating
geniuses
gow
freeholders
adenauer
mander
thyme
canute
jeroen
porfirio
thucydides
nolte
printings
kensal
levantine
cleanse
inquiring
petitioners
killeen
tallies
là
leveraging
defaced
redditch
marigold
nonstandard
oromia
noddy
blotches
jefferies
agong
risa
abscess
antal
daycare
kavi
acclamation
handcuffed
hydrological
saussure
strawman
hasten
perelman
punting
behan
plunging
zetian
hark
pequot
biffle
villars
slingshot
thalia
pec
unstructured
aaas
electromechanical
mashup
birthright
martell
bitches
nip
ramu
iana
quirks
absinthe
royer
rangefinder
watery
heung
vaduz
dfs
rind
pbl
custodial
radiators
troublemaker
conformed
levett
cann
stretton
hindley
lezion
lindemann
konstanz
lawlor
culex
conductance
canes
enthalpy
panoramas
flops
looser
hydrostatic
cybermen
plos
hirschmann
thefts
halberstadt
msdn
handyman
absolution
videogames
remastering
grafting
lavishly
armee
lilo
lytle
iida
osten
giurgiu
vik
conciliatory
groening
fátima
sidewinder
wendel
hattiesburg
baran
kidder
bellman
camellia
nlcs
phoned
muskingum
rawhide
carpentier
consults
maddalena
dottie
caster
waveguide
cayetano
fritsch
pakhtakor
ffe
nusa
gangnam
latakia
meanders
shopper
belén
nita
prez
waive
ashish
notional
yuichi
yerba
unscientific
masaryk
wilco
ciaran
connexion
hertogenbosch
zuo
bouzouki
irregulars
rackets
chania
vasu
pina
nightwing
wausau
mcshane
szabolcs
militari
velez
corbyn
lanao
caserta
detritus
eea
kani
dudes
tdt
poodle
concisely
castlevania
flume
digested
kemerovo
polemics
gana
pagasa
hoshi
enniskillen
misfortunes
kimono
caras
swamped
cosmodrome
recoverable
cormier
knickerbocker
cofactor
tradesmen
ousting
zahn
inker
travail
bottomed
chela
biodegradable
soundwave
cytokine
bava
inductance
bramley
nagarjuna
vibrato
hammering
dili
cgs
categorizes
frantically
heathland
springdale
watercourses
aroostook
artefact
nieuwe
slp
demonstrable
swazi
onna
butters
geraldo
clocking
divorces
krystal
ember
thoroughfares
punctured
yuna
devas
winterbottom
norra
dwelt
stuttgarter
interleukin
glades
theistic
superscript
saheb
townes
azar
placental
revels
earners
aarau
pontius
moshi
offensively
enchantment
gymnasiums
mists
intakes
tubby
chucky
lyall
gunung
rapides
naps
introspection
fta
grattan
cayenne
mohandas
balázs
halfbacks
clawed
bipartite
cramps
arkady
delisle
disenchanted
zhe
archeologists
easement
joule
ventilated
kimber
possessor
homeward
kura
bidders
goc
gliwice
dakshina
naan
bren
storybook
planeta
rosina
mmc
bloodlines
sepahan
yusuke
harburg
vickery
maisie
beholder
degenerative
polybius
hoh
tarawa
cresswell
fillings
quds
mush
yousuf
nayarit
arusha
telus
whiz
coulibaly
fata
meriwether
swanton
pomp
goble
troup
desecration
illusory
hopkinson
rugrats
kongsberg
caribe
paramaribo
allure
udall
rectifier
gruffudd
ballerinas
milligrams
kraftwerk
sibir
machida
backfires
lieber
nichol
marauder
chaste
narcissism
nunez
simulates
imc
gormley
ruch
screech
danko
devito
wilks
lorde
starke
electrodynamics
testaments
gainsbourg
bolder
pipa
hewson
beazley
treblinka
zou
futurism
drinkers
optometry
awardees
repurposed
schiffer
falsification
bexar
popularization
meza
arent
korps
justly
timescale
belafonte
moh
instar
cort
thickening
larch
machen
aws
hor
disapprove
combos
cleland
gomer
koster
fondo
chipped
ruckus
suetonius
labuan
kanto
dismembered
floss
sudetenland
megafauna
moksha
wanton
impressionists
caplan
recites
statham
tailors
samsun
marduk
jülich
selznick
fuseaction
halil
linlithgow
effluent
khas
golfing
wendt
kwun
letitia
wendover
blaney
arrington
naim
deja
handouts
segmental
valles
reinserted
watters
irrawaddy
frets
celso
unspoken
downie
bruin
backlogged
chittenden
rameau
samad
kas
hameed
valero
overpopulation
yusef
bipedal
disgraceful
cinq
miramichi
bajaj
tench
banal
gid
viewfinder
rive
genocidal
hooligans
evangelistic
interdependence
boutros
emblazoned
jaques
klezmer
avalanches
overflowing
fulcrum
nya
sofie
ulama
erudite
cautionary
jenin
sauvage
wilds
foundries
hammerhead
martinelli
mensch
slut
luring
mourned
noli
glarus
lingayen
pontificate
eliminator
argüello
székely
dca
√
hackman
flounder
fairground
juraj
ambrosia
nifty
asaf
menial
martineau
contraceptives
snowboarder
polypeptide
tiebreakers
diddley
shrank
stereotyped
greening
pegged
unhappiness
abusers
domaine
flotation
efes
importers
burdett
morais
devizes
firework
bhima
quinnipiac
underwriting
hijab
smoothed
cavalli
anare
caterham
mbbs
burners
umpired
reductive
merengue
landfills
chawla
adored
metternich
cooperatively
frontispiece
margarete
tosses
townhouses
stora
populus
hiphop
cañas
kindergartens
energized
isner
mccool
tannery
darla
regenerated
umbilicus
indecisive
jog
bromine
muscovy
cetinje
cavalcade
memorized
populaire
walkover
scrapbook
larvik
kaj
herbicides
understory
bicol
grunwald
mcevoy
mcginnis
clwyd
resnick
biathlete
fujii
binocular
buoys
shimane
aggravating
meo
porting
clubhouses
hick
smederevo
soured
humming
tarnished
cates
necaxa
metrorail
victoire
encircle
otway
elfsborg
mccay
easley
cgt
predrag
pyroclastic
brasilia
strident
cff
watercraft
airdrieonians
bixby
decompose
southpaw
flutter
cookbooks
ganymede
yarns
fci
blantyre
derivations
hôpital
pentagram
brianna
objector
kou
alcs
elizondo
dak
echr
joiner
brecker
fundy
renderings
isro
soundgarden
azerbaijanis
laborious
choate
desjardins
widener
mistletoe
straub
profited
ilp
chahar
stansted
acutely
chine
craigslist
harewood
grammarians
procopius
vento
freie
condensate
transom
nootka
broussard
roamed
ooty
kronstadt
berkman
redefining
cholmondeley
questionnaires
brienne
pori
hydration
propelling
acorns
quicken
dugdale
maduro
oddities
fluvial
tropicana
ucsb
osmosis
azuma
exogenous
diversionary
podlaski
pichilemu
voided
stc
berners
motorist
underscores
ringling
impurity
grandview
relish
layoffs
postponement
lewisburg
iulia
breathless
ivica
puffy
screamed
remarry
militaries
allis
ildefonso
spor
archivists
underbelly
hyenas
ineffectual
expendable
abstention
incurring
truckers
trays
redneck
retrofitted
hubbell
connick
emailing
yasuda
fábio
popups
actuators
herzl
servette
bega
hamsters
tracklist
ercole
academie
effortlessly
insidious
coc
andreev
constructivist
markt
erectus
fandango
metropolitana
margrethe
assembles
adjusts
belorussian
bimbo
dennett
ovoid
hann
cpsu
wile
ethnologist
molestation
infatuation
svend
maxx
reade
stateside
wept
mauresmo
supplementation
carola
celestine
kohen
xylophone
pham
fronds
perla
petitioning
etude
berserk
été
wieland
blogosphere
gurevich
júlio
bcr
tempus
rerum
moda
meissner
stp
transited
feria
trillium
antiquarians
licensees
spender
reimer
larynx
magalhães
choy
thr
hss
tegan
passchendaele
discos
decried
katja
reconquest
tumours
snipes
couplers
parklands
cañada
healers
autodesk
innis
haight
hewett
musings
experimenter
loudness
const
remodel
ourense
teodor
quizzes
ionia
boccaccio
configurable
viareggio
imprison
cuesta
salivary
toscanini
melvyn
flycatchers
ejaculation
bountiful
mcp
caveman
axa
winchell
dob
eyeball
chemins
iyengar
electrophoresis
symphonie
agribusiness
tolerable
inr
whaley
singularly
andrus
marsupial
spooks
perdue
weiser
ines
hasharon
curacy
waal
dench
ert
dvr
oli
referrals
galvanized
gec
graaf
steelworks
licensure
berryman
descriptors
bayliss
noto
reparation
potocki
anheuser
elegantly
coldfield
rie
nysa
ozarks
brophy
resized
grahamstown
corrèze
pka
slowdown
estonians
bagwell
máximo
devour
mote
wouldnt
maples
vasas
whiplash
centrepiece
therefor
moeller
noord
araki
occasioned
pullen
recur
horrifying
tricking
wahlberg
prophesied
baronial
tympanum
thorsten
shadowed
baluchistan
gaskell
regrettable
mirai
turbofan
chroma
enderby
maroc
cynon
charred
hooves
glittering
stratocaster
côté
nilo
hyperactivity
lucasfilm
pasay
bax
matinee
trt
tondo
organically
peary
pouches
jeevan
sheung
qp
neha
perot
washoe
universiti
bourget
mishaps
germinate
stopper
paddling
deviantart
oso
veria
arse
blossomed
mangled
equilateral
nettles
instantaneously
cutlery
wagoner
parkersburg
surpasses
barbaro
monfils
drillers
famines
rampur
tambo
dano
wahab
kher
stitt
despatched
kotaku
needn
shl
fortunato
counterclockwise
brimstone
giga
joly
desh
antigonus
scythe
reentered
phoebus
preyed
kerk
pendragon
badu
caso
leer
lucasarts
hortense
malang
minting
southey
curatorial
gali
arak
flammarion
towels
chui
firs
dmca
warnock
munoz
ligurian
haulage
hrw
anachronism
elaborating
agnès
udf
snowmobile
lapointe
sheathed
novikov
benevolence
autofocus
lindwall
shir
frege
arsène
janez
civics
offsite
mataram
salmond
gush
diller
corregidor
rime
maquis
retort
msl
vazquez
böhm
smalley
addie
brel
gyroscope
brazen
coals
skagen
educationalist
revives
behemoth
montgomeryshire
arrears
garmisch
latinized
demosthenes
pygmalion
basildon
coombe
actuated
grower
husserl
potok
sebastien
prt
wareham
avars
slaughtering
montauban
gulden
canines
conservator
newsprint
driftwood
hamptons
videotaped
lindström
inconspicuous
crucially
adrenergic
mada
wishful
capstone
tarim
cartooning
dunhill
resold
dreadnoughts
felon
mayu
burgenland
majapahit
trefoil
hakan
bolshevism
kanu
baka
shuttleworth
melanesia
disintegrate
mannix
hedda
barberini
peshwa
britpop
brasiliensis
peterhouse
vigour
imi
kenta
miasto
cajon
cdn
treadmill
bandra
absorbers
virology
moroni
psychotherapist
squall
heikki
hhs
jointed
odour
thurgood
twp
jasmin
philosophically
singularities
blavatsky
dannebrog
zebras
tanjong
mostafa
talavera
bossy
pandya
intensities
kinston
hirschfeld
georgette
fano
rfk
dany
kardashian
urbe
kiernan
swarms
bse
masashi
steppenwolf
ayu
bruford
sop
historie
tua
aeon
generale
ransacked
avicenna
soule
hollies
kido
tunings
raynor
jagdgeschwader
hammerheads
dogwood
candela
centerline
counterattacks
geun
amok
tweety
refuel
plante
universalism
infantrymen
regrettably
pharyngeal
maan
moguls
cortisol
surmised
worshipers
strobe
timings
antiviral
misinterpreting
ephemera
bideford
plausibly
energetically
interlaced
sealand
unsold
keung
imprinted
lucena
comforting
jeolla
petaling
bookmark
gks
banc
notte
filer
arman
laminar
eck
fixer
cutscenes
phrygian
tero
juku
goettingen
ied
callas
batak
elfman
mastodon
rnli
tikal
translink
machinima
lugar
basile
warping
renewables
logics
wyo
gaither
frideric
langlois
nsu
unattributed
cashew
bagged
biltmore
wotton
corgan
paucity
não
moyne
excels
daya
saarinen
benitez
sheesh
metheny
alternator
hurriedly
madhav
wormwood
sieg
tryptophan
rien
jaap
darken
naas
gans
frawley
gertrud
hywel
mohs
pillows
cranks
obeying
haruna
forsberg
deauville
grogan
berra
appendices
shuai
troopship
unitarians
exmouth
manoel
steinman
seki
dislodge
takht
nandini
haphazard
rogelio
mizuho
tabula
grosser
rubric
motility
salaried
macht
hindemith
metropolitano
hammock
hanukkah
geert
vermin
wsi
scherer
imparting
ident
functionary
memorialized
sissy
monotonous
arouse
mendelsohn
jager
wilhelmine
reigate
uo
loner
kalle
aor
adiabatic
javan
teaneck
whangarei
ciao
mdt
mallett
gash
harpist
jessop
materia
ies
rza
classy
xls
complementing
fatboy
seasoning
crackpot
metabolized
implicate
detach
bungie
soli
absences
swf
bewildered
imposter
moiety
ricoh
krupa
complicates
steelhead
andalucía
dnc
miya
lamenting
kapil
tcs
adipose
limo
repeaters
blistering
mobutu
haughton
kalahari
abeyance
iskra
lionheart
humphry
philibert
sessile
rilke
dialectal
toddlers
buono
motile
conlon
jeffers
idealist
preah
crabb
putters
wiggles
ashtabula
aeschylus
conger
morozov
grigor
benneteau
supplementing
bongos
notepad
rothbard
phoenicians
omicron
gbs
endearing
tir
signa
mantras
blip
malton
arnott
dhl
timmons
waza
thirtieth
shoddy
ultimo
repelling
tarek
barda
takuya
bouillon
tanah
psr
hyperlinks
feldspar
chilliwack
sapper
edsa
admirably
weigel
ambience
rebuke
reprinting
fincher
collegiately
bissell
denzel
kiddie
accumulator
meeks
villard
francesc
zwickau
aggressiveness
wylde
inroads
speculators
bahawalpur
admiring
heinous
aquifers
zong
hersh
petticoat
gales
rossa
caxton
intransitive
knowlton
chested
sadhu
rooftops
perturbations
serre
refrigerators
seca
saracen
offends
booze
marnie
mendenhall
zverev
burghers
floodlit
janeway
acceptability
sharman
rearranging
domus
spout
zs
arbitron
arbitrage
tahitian
arianna
nena
koninklijke
bulwark
audiencia
deduces
skylab
lorelei
ironclads
promulgation
polyps
moskva
matta
madhouse
knowledgable
mccreary
scudder
contrabass
sporty
suckers
medio
amide
betula
gault
synthesizing
piz
ledges
follicles
argentines
marcelino
vibrate
anuradha
chicane
gulshan
berat
scm
umbrellas
prithvi
voids
celery
softcover
stent
reconstructive
waxman
thais
wartenberg
deschamps
shimada
unidos
lexisnexis
aza
maggio
joes
evangelicalism
kirin
habs
christer
latimes
coxswain
insurmountable
intracranial
krantz
kreuznach
eyelids
fords
tusks
alnwick
cdi
kingsford
pheasants
ripples
gyms
bareilly
pssa
principled
artful
hsinchu
mckim
zina
porpoise
notated
gerhardt
griff
nathanael
larceny
oceana
meijer
gaeltacht
soldering
mvs
besson
orrin
outnumber
irvington
nepotism
veit
aud
debby
wilno
jonestown
scioto
bloodless
shana
yano
weaves
katya
shusha
westlife
mickiewicz
zeroes
instilled
angevin
flemington
mccloskey
deum
provokes
compacted
kalinin
tippecanoe
docket
neptunes
bleeker
trembling
remaster
lineal
distilling
minced
poniatowski
dmx
andreu
privatised
guingamp
skaggs
creams
dervish
underserved
judicious
asghar
liliana
serialised
tyrolean
sengupta
kalpana
bure
chol
supersede
warrnambool
labors
daltrey
bade
beckford
timaru
pruitt
farmlands
fogerty
tantalum
yek
extrinsic
modulate
seafloor
emmaus
corneille
neretva
hassett
roughnecks
improvising
ionescu
rimes
tiflis
velma
tomboy
voyageurs
cleans
milepost
gilmer
searcy
noriko
omagh
♣
armes
encephalopathy
hookers
disillusionment
ratiwatana
bord
muscogee
slumped
wyn
eia
ingmar
leathery
equities
landline
camaraderie
boehm
paralysed
stratus
sociales
sabri
subsidised
enríquez
monotone
zala
poi
zhengzhou
truex
riki
unsound
charlottenburg
knightley
mha
olcott
shel
thatch
comers
unchained
disapproves
cbo
pisces
habermas
spengler
hieroglyphic
constructivism
gump
daniell
forelimbs
crh
reintroduce
ghq
collette
exteriors
mds
vegetarians
centralization
singaporeans
photogenic
momento
hayter
matrimonial
baroni
micheal
baur
clog
alighieri
summarises
dbs
riverton
lebedev
encroaching
rescind
transcribe
clarita
burslem
baraka
borrowings
taupo
fairweather
taxicab
urchins
olli
lemaire
atman
nightshade
roadhouse
dezful
backus
ü
ooo
ambon
delfino
attesting
morphs
lottie
masur
mirpur
cutbacks
unreliability
sappho
deathmatch
quimby
quartier
succumb
maritima
catalyze
shoved
awd
corkscrew
ensenada
rancid
burge
destino
frobenius
tenements
stunted
stari
morpheme
filial
gracefully
subsonic
seuil
megami
riverbed
hargrove
propagandist
gabala
ivf
roden
barrens
antitank
scheldt
ingres
eusebio
snodgrass
alemannia
mna
danton
suiza
wilding
sugarloaf
waals
shoreditch
byline
deceleration
navarrese
courbet
chimp
subjectively
stoop
diplo
mccomb
zabrze
dancin
bateson
magnifying
palestrina
unsuited
felis
dreamy
pavarotti
timeshift
pikachu
boatswain
glick
ramble
ogun
immer
flore
petronas
carrey
hugging
popa
bellow
wofford
trinitarian
spink
violators
reloading
sandalwood
deshpande
warder
edsel
vestibular
antares
bernabéu
westover
wanamaker
mers
baptisms
laxman
moodie
oued
prithviraj
canarian
allosteric
cundinamarca
haughey
maybach
chesham
radhika
smalltalk
eddington
abubakar
ratan
sqm
coutinho
grok
boughton
bunnies
slashing
whitelist
eccleston
shears
savio
bayamón
aryeh
maréchal
hatta
egyptologists
houser
vamp
alyson
colburn
menopause
vorbis
malé
otaku
lexis
samajwadi
payson
beatle
gauche
kanon
infill
besiege
flèche
parco
nau
nonprofits
kenwood
banaras
logano
fisa
agnostics
dispatcher
receptacle
carnal
wunderlich
afzal
tenorio
nouri
cuddy
smalls
kapur
lgbtq
okhotsk
deshmukh
civitas
arborea
screwdriver
tazewell
insecticides
engendered
brassey
headlight
cuffs
shonen
fodor
minigames
fairway
titania
horrocks
greenstone
equidistant
alchemical
npcs
molineux
calatrava
louisbourg
fairlie
aircrews
cullum
asen
fal
udon
symmetrically
whitewashing
keenly
tsc
shep
niña
egos
helder
bandwagon
icrc
refreshment
laut
pelle
zilla
howlett
ills
hemi
bloomingdale
rti
jeweler
muddled
binns
mckellar
strayed
nalbandian
krefeld
somber
frosinone
thicke
mondale
chabahar
airtel
jsa
lapentti
instigator
halmstad
erebus
pooled
eason
fmri
marchers
cienfuegos
cowl
validating
hingham
tsukuba
seabrook
reappearance
piezoelectric
fleischmann
bidwell
annapurna
mahi
greenspan
lathrop
volusia
fawr
nmda
hodgkinson
criticises
thalamus
lynyrd
sensationalist
bodie
leonel
volition
korolev
behr
foyt
constipation
tallying
briefings
scepter
exaggerating
lupton
tojo
cep
brightman
clickable
venting
kyaw
playmaker
staked
hazing
talal
yoshio
scheduler
trapani
snatched
devoured
gobi
llewelyn
ramadi
galore
pastoralists
revues
calliope
angelis
backfire
stonemason
abate
adic
midterm
aldrin
gaultier
landa
brera
pgm
michelson
hjk
plesetsk
berthed
keypad
mazur
bluebell
fgm
hamburgers
newness
crohn
grueling
tayyip
boac
nahin
dagbladet
poznan
geranium
punjabis
minelayer
sheeting
vehement
ahoy
bds
eckhart
throckmorton
pétain
pheromone
eishockey
memel
strelitz
bülent
histology
coincidences
skagit
ischemia
softening
laye
janes
refreshments
murshidabad
sedative
dismissals
albarn
kamara
kinsmen
sociale
rpt
ibo
kinect
scotti
breakage
shortstops
mooted
katia
vivendi
châteauroux
avram
humiliate
arvid
urania
intricately
conceição
chur
cardiologist
toner
naughton
dkk
guida
surges
fujimoto
kingsport
cellists
barnwell
egrets
posit
disappoint
pianoforte
counsels
eyelid
keeley
grudges
baumgartner
mainpage
nación
beleive
toh
stings
oxy
zenica
yemenite
bullen
farhan
windsurfing
espy
lado
aquileia
zulia
sumptuous
revolting
alu
degrades
automaker
despatch
craton
fabiano
duhamel
dusted
felling
coot
molesworth
massillon
activator
doorman
limón
receded
tunica
thickets
formalities
leaguers
tuileries
terceira
topple
trenchard
eustis
orchestrations
loewe
hoppe
culling
piedmontese
hazelwood
aspergillus
hsi
scap
younis
vara
venn
bacteriology
aftab
alumina
arima
castration
rajkot
drexler
sheerness
wogan
cantal
denials
strachey
paterno
zaki
bomba
fitzsimmons
winsor
scurvy
drawers
tomoko
cheques
rheinland
undeclared
yucatan
interagency
oma
maire
adaptor
breathed
lyndhurst
embittered
tomkins
homogeneity
hummer
welby
kampuchea
pairwise
murrayfield
sorcerers
ondo
helsing
natsume
henrico
deterred
sledgehammer
asr
kotoko
romantics
brainerd
marqués
beaverton
kamran
teja
airmail
junko
otello
hagia
gimeno
ojeda
rowena
taal
bhagwan
newgate
cranfield
horvath
globus
immunodeficiency
beets
tsung
largemouth
sabercats
hams
bifurcation
restatement
sawai
recuperate
fulk
appalachians
backwater
yassin
unreasonably
mab
permanence
erikson
mireille
captivated
elkhorn
moctezuma
albini
trung
overstated
congestive
sibyl
blackness
anzio
wwc
stevan
trolleys
applet
struve
donning
breguet
downtime
gpp
solemnly
assemblages
varietal
outliers
wvu
montefiore
bonney
calum
blackmails
poirier
lawman
christening
traditionalists
stumped
bookkeeper
kesteven
unisex
indulged
dictatorships
ramming
lighthearted
morpeth
divina
sustains
norodom
sarath
mib
auteur
phg
drg
viaducts
neurobiology
kitsap
abdulrahman
rulebook
poppies
rapier
chairing
funchal
colson
sekai
goole
castletown
scalability
alga
matrilineal
altruistic
seles
bhavani
spoofing
precept
swingin
impervious
technicalities
codon
oksana
coulee
zakir
wasnt
allerton
momma
disposals
busting
marshalls
fidesz
stencil
aforesaid
menuhin
oporto
aphids
specializations
ezio
capriccio
manger
ehc
taut
pietersen
oshima
origination
wedges
persisting
industrially
rosecrans
carreño
bilbo
coria
extirpated
centrality
depositing
gyu
outstretched
jarre
brats
catamaran
canisius
superdome
desktops
artvin
spiced
grooved
wheelers
beto
anamorphic
utterances
grindcore
fortify
whitelaw
fayard
respectability
shawl
wru
hallways
rollout
urbanisation
kidnapper
theodoros
bul
vickie
treasured
lansdale
posten
thomond
mashonaland
abernethy
centralia
milhaud
ammar
overpower
hadfield
extrapolation
bruising
homebrew
greenlee
kladno
ionosphere
gastronomy
crofton
lindau
crewmembers
burwood
coverdale
wingman
deplorable
fluxus
bermondsey
chunichi
tsukasa
snark
litex
libreville
jeweller
immediacy
stoddart
vesicle
abernathy
hannon
amparo
gatling
paediatric
werden
bole
nepomuk
lascelles
haar
vod
kaul
maurier
brickell
ouster
zsolt
hilversum
millstone
stabilised
facies
vanquished
grambling
pleural
scruggs
fulfilment
charpentier
barba
terran
niobium
cutaway
squier
municipally
chumash
phytoplankton
soas
autobiographers
quackery
confided
kabylie
bronchitis
lipsky
gah
mediates
caracol
socratic
subalpine
lorenzi
kunal
pré
cantabrian
hedmark
moribund
giveaway
loggers
witton
convening
moffatt
hoarding
adda
grills
nisha
subways
presumes
théophile
storer
spyder
gatos
montañés
manorial
cyclopedia
technion
kilmore
blanton
porphyry
cédric
berenice
narciso
shipman
subservient
complutense
homozygous
mccrea
carvajal
mcalister
reynaldo
babs
tonawanda
solway
pettersson
gorski
pierluigi
moorings
burne
unlocks
philemon
ludwigsburg
meads
professorial
tabu
berisha
maintainer
agata
grisham
conakry
albay
incestuous
sprinkler
charade
nellore
unmanageable
springing
franklyn
weiland
sunnah
antiwar
valérie
damodar
tulsi
faithless
klara
grasped
weeknight
reusing
obligate
brower
bemidji
fiorentino
damiano
counterbalance
turco
thuringian
martinsburg
impeach
hotham
michaël
hennig
secretarial
peterhead
whittingham
slava
aami
delaunay
broun
eagerness
macklin
kadir
chappelle
parachuting
canción
rivington
provable
ashmore
rincon
subarctic
stoning
petits
phishing
wyandotte
maurya
metrology
lindholm
monomers
michaud
etymologically
dissociative
junaid
culminate
gottschalk
wataru
angelika
tacked
neurosurgeon
okeechobee
alkaloid
corio
wavelet
titling
iom
lumia
wheelchairs
durations
marblehead
seahorse
asim
cyclotron
declination
shrinkage
footwork
rtd
mclellan
piatra
aguayo
caving
annihilate
assis
grammys
marketer
boxcar
phat
whistleblowers
mcgann
feinberg
lochs
olivares
tolerances
mowat
sulayman
pressuring
stena
beery
wheelwright
teplice
gini
voa
testes
sportscenter
murata
prostrate
rukh
brittain
perryville
ringgold
cubana
toit
silvana
boyfriends
henshaw
transducer
milla
ags
frighten
hondt
airlifted
yim
wholesalers
suncorp
carreras
caretakers
kawai
ethiopians
sauropod
channeling
raat
haldeman
tudo
ruthlessly
bynum
rewa
swainson
intensifies
frere
ungrammatical
deleuze
macdonnell
sniping
geometrically
flowery
eloy
yoshihiro
panagiotis
squeezing
bootstrap
behrens
gantry
bakewell
allo
boldness
pared
hoary
itm
downplayed
bling
querrey
ravindra
cohan
sayaka
pistoia
concertante
erling
parkside
ugg
crabbe
pooling
sleazy
uploaders
gyatso
halpern
zetas
rooks
biol
cognates
keisuke
middelburg
apus
saurashtra
personages
toshi
naser
cousteau
veers
stieglitz
bunt
dower
glassware
silhouettes
wobble
franchised
nlrb
hiragana
sitio
moab
composting
norges
replenished
overruns
absurdly
asus
northgate
letts
drugstore
refectory
rimouski
nen
narmada
tuc
realtor
tupou
assr
saver
nozzles
nca
yisroel
taub
gallegos
characterise
josefina
appeasement
infamy
apostate
burges
ciencias
bamber
knowsley
abominable
maryville
maltin
hieroglyph
ammon
nagging
mehmood
izu
indigent
lucent
disowned
dampers
seamstress
weisman
marshmallow
ovulation
clump
arana
breve
pms
kalman
ozaki
downy
benedetti
aurelian
craigie
bushfire
federative
budokan
bumpy
ballantyne
ino
maasai
ncs
dhar
millbrook
zonguldak
luch
fertilisation
dobrich
lemoine
schoharie
mutter
gymnastic
holiest
molla
meddling
sugababes
doak
alk
epc
matra
resistive
rasul
keren
sleuth
meldrum
gtr
interceptors
intrastate
braşov
contes
streptococcus
harmattan
fooling
skydiving
crotone
faustino
uspto
giraldo
koro
naveen
materialistic
smothers
picayune
eca
ris
vestal
tsunamis
sharps
gamla
landmass
cortinarius
silkeborg
robinsons
shipboard
dunkerque
baen
wildest
atul
blackford
dfw
monotheistic
subjecting
ecclesiastic
headphone
polystyrene
moyen
rhp
pierpont
gatti
variegated
instinctive
viceroys
cashmere
squamous
networth
ignazio
brumbies
mildura
expropriated
amersfoort
alor
undercard
daz
szabo
baudouin
farmhouses
caviar
ghraib
whitewashed
antipathy
computes
delray
reassured
seppi
marini
bragança
bingley
duchovny
ginga
jaro
seale
gristmill
gleb
qué
apprehensive
anaesthesia
shinichi
tuva
jerky
veliki
kampen
marionette
motivates
stelae
schwarzenberg
koller
hok
paging
navigated
duplicative
effecting
buttercup
harrah
ltu
milled
sorta
bundling
concomitant
sprinting
imola
gazebo
suzi
colloquium
revell
wiesel
coetzer
ghee
keyed
varun
fizz
pacify
¹
iib
bottlenose
gastón
pyrotechnics
matsuo
culprits
russula
kho
msf
multicellular
wagtail
hiromi
ribosome
cerrado
keirin
eibar
mcnaughton
hier
corset
folktales
scs
strafing
pfg
clitoris
quilts
jukka
karp
militancy
freikorps
ayyubid
effeminate
dávila
mcgarry
vino
nouméa
indulging
prolog
laon
akari
tomar
trachea
baskerville
marciano
airframes
ploughing
southfield
circuses
nyj
bunn
nott
shattuck
abia
tillamook
cancún
carmelites
chandan
stortford
gcses
ruble
zidane
furies
türkiye
rebus
addy
caledon
necklaces
sinéad
vaca
evanescence
mountings
glanville
ifs
shinoda
agen
fier
bessel
siddhartha
shortness
gobble
americanus
suda
antoninus
oneonta
bolden
surcharge
pbk
foxtrot
harv
rothesay
acheson
tupi
coverts
bayerische
arpa
globetrotters
mayonnaise
isha
reforestation
tasker
thrusting
salvator
schleicher
icj
mukti
idents
platformer
exaggerate
vena
cobbler
hemmings
wargame
vasyl
unconsciously
sixpence
marinas
violist
anathema
leni
mages
koren
pawan
ament
grendel
tamarind
pyne
homesick
dropbox
skippers
giménez
transitory
nambiar
underwing
paces
pinafore
undescribed
courtois
fibreglass
gbp
pox
silverton
pickers
werribee
reise
indios
ase
trixie
banshees
sfa
albertine
embarkation
antibacterial
rafsanjan
matanzas
smedley
pinheiro
benno
brak
hypocrite
bosons
basilicata
haeckel
sanctorum
matin
anabaptist
coupler
muhajir
bogdanov
indepth
hesiod
deceiving
capp
nadar
plana
ius
trimester
balu
drowns
hdl
ojo
horthy
heeled
aber
atlases
sammarinese
spyro
arquette
michaelis
delmar
islamia
colitis
antillean
toomey
jarrow
nazrul
infringements
onerous
tecmo
burris
knitted
bly
zila
aco
arriba
weintraub
hooray
differentials
ogc
fightin
scharnhorst
cawley
abutments
opioids
nci
mayumi
victimization
fireflies
freeform
jerks
encloses
ashwin
veranda
gakuin
shenanigans
welk
rhyl
dethroned
reburied
wad
yonder
saks
seditious
bridlington
neglects
rogério
mccluskey
rebutted
clavinet
nihilism
nagata
kwa
bertin
hooking
subjugation
grigori
cbm
pagodas
cayo
corliss
bma
cabana
brooch
nonverbal
cobblestone
prerequisites
beveren
danby
irises
omnivorous
rian
carnivore
summarising
amputee
nalchik
revolución
palatable
cti
scrope
damming
tolima
ketch
agung
rader
voetbal
eclipsing
monck
tandon
rowntree
bootle
luba
winemaker
polisario
boaz
ral
quatermass
expropriation
slanderous
tubbs
naoko
neuf
corsairs
pinker
medulla
oblige
deviates
geldof
gilad
chalon
amorous
incubus
ferruginous
dede
whisperer
lovable
buss
culp
snowdonia
assiniboine
raimi
dreamt
crushes
tuners
empiricism
germinal
faustus
manorama
seedling
andorran
banka
paulin
creeds
hannity
countesses
plurals
ouch
pronouncements
fides
pcbs
barreled
walling
crafty
allstars
grâce
citrate
repo
seaworld
intermezzo
shrinks
videography
telecaster
flapping
secularization
hela
bratton
ntfs
delius
camry
iver
jeonju
vanda
enschede
fairtrade
dads
bambino
blackbeard
laoghaire
edel
sbb
gotcha
rereleased
vollmer
microcontroller
felixstowe
speight
yeni
billingsley
bigoted
bettered
cliques
faulted
marikina
piedra
thereto
reeling
botticelli
apologists
chutes
honeysuckle
distancing
hobbyist
lombok
goshawk
cmp
escalates
vignette
gandy
maersk
racking
stubbornly
trampled
aurel
goldsboro
hargrave
minorca
ravenswood
koen
bielsk
syr
melanin
pou
benzodiazepine
dukakis
industrious
francoist
augustana
bustard
yolande
paramilitaries
timm
kjv
tokio
brin
featurette
gouda
romaine
pah
lumberjack
uan
mauve
iquique
subramaniam
gte
breslin
faulting
dota
lage
rallycross
fahad
mesnil
ustinov
movimiento
muang
coverings
yushchenko
hashing
burford
molokai
coster
nacionalista
zalman
hast
homan
eggplant
kati
beall
reserving
pineville
ratnam
distillers
zanesville
heenan
guilders
ohv
minimization
lusk
maliciously
grocers
hercule
asic
hermanos
asch
shinty
pelton
schwyz
horny
stimson
michels
overlords
equalling
lugosi
hoshino
ekiti
vantaa
kanaan
unthinkable
unanimity
avowed
sniff
bache
pitkin
venetia
unt
nbr
candide
koga
vier
waders
ried
cavalrymen
attests
kubo
earring
minecraft
nanette
hearty
foresaw
kirsch
azadegan
dispersing
patrilineal
constanţa
virulence
limpets
nickerson
deckers
ullyett
lofts
massing
laryngeal
thiers
baños
sequitur
beaded
tove
gondor
gyeongju
maniacs
ritualistic
tenochtitlan
mago
paleobiology
tallaght
spillane
pleases
howdy
alm
jayaram
involuntarily
starfighter
dwarfism
exasperated
leoni
machined
buckles
wester
carrefour
dolmen
waltzes
distressing
cooker
magellanic
prashant
caboose
greyfriars
andra
moorcock
empresa
uffizi
coriolis
barbier
magik
aral
valentinian
typographic
asano
tobey
tuco
sethi
iambic
severino
boosts
lesh
ruckman
nettle
pursuers
interlinked
hopman
shvedova
panelling
lashkar
buns
cryer
trawling
boletus
abacus
lapis
fba
maio
brim
amiss
belushi
landmine
rosslyn
newhouse
suave
underlie
shelbyville
longley
lancastrian
yaya
psychopath
platense
infiltrates
rheims
geopolitics
hob
majumdar
superconductivity
mignon
floatplane
neeson
béziers
deflections
quint
dispenser
puna
mussorgsky
adachi
stansfield
khor
hickok
dumplings
vinh
loyd
eyebrow
bombardments
gib
vinatieri
inclinations
tosa
bushman
sastri
mariinsky
clowes
eris
genealogist
blackouts
oxymoron
bested
mushtaq
burl
shruti
ilm
everson
samford
wl
capillaries
pyrrhus
vaulter
bullfighting
smoot
alexandrov
corrupting
heterogeneity
cbf
grazia
saco
ilia
somoza
mythbusters
restate
gruff
birr
cumann
honeycombs
sica
hughie
wegener
heartbreaking
beanie
glutathione
rhoads
bogor
catalana
mahinda
kristi
armadale
vig
incas
kristensen
geckos
thicket
reformists
uninvited
novosti
wich
baldy
fullness
reinvented
stradivarius
reinventing
peart
birkenfeld
independencia
hooligan
asami
wimpy
athlon
baghdadi
showa
michiko
ens
violets
unrecorded
orly
sudhir
campana
egyptology
gunnarsson
lubavitch
sive
kanata
preload
epr
gorakhpur
ningxia
sandnes
catamarca
subatomic
kowalczyk
mothership
syndicates
clearinghouse
embryology
faithfull
qaida
neoplasms
pirated
sportif
yummy
waitakere
erbil
relaying
motorboat
allama
tmz
jago
rut
chacón
abid
hypersensitivity
spg
speculum
mikkel
vilma
multifaceted
ipsum
jeanie
sylvania
malfunctions
baldur
excavate
endzone
sadhana
edgware
sugden
sliders
diable
dalles
nahyan
lauro
saskia
coloma
bartram
solna
iim
lengua
merion
zlatko
novaya
ingraham
goiânia
jeezy
mamba
tdc
purpurea
piggott
prescribes
geetha
nostrand
viseu
cardwell
mckeon
kink
lacombe
révolution
ameer
shinobu
steubenville
gulfport
hiroko
prim
amalfi
transgenic
abduct
wroclaw
ericson
culpa
bukowski
starks
indefatigable
confraternity
pynchon
bedchamber
heligoland
thibault
lilla
demobilization
vacationing
paddles
paraffin
colloidal
wightman
sandiego
articular
pondering
arshad
bonheur
striated
themself
bobbi
desam
subcommittees
hol
hartwig
multitasking
asme
babington
dmitriy
abramson
cochlear
rankine
marthe
reinvestment
burlingame
shonan
khoury
copyleft
keyserling
masao
buin
tallulah
mizo
brogan
nerdy
bek
mountainside
discerning
thickly
eurydice
opining
samaritans
palaeolithic
friel
matsuri
discerned
violeta
underrated
utama
gru
cucumbers
ojibwa
ebbw
reaffirming
culvert
dancefloor
recruiters
fukuyama
citric
whine
tye
ragnarok
courtauld
villalobos
kelleher
deflation
condorcet
besa
florio
aho
misogyny
absorber
mauretania
demilitarized
goro
skirting
balling
zúñiga
finkel
paulinho
cws
buffon
bloomer
demir
pavle
grigore
striata
lop
itza
holyfield
corriere
rossiter
hialeah
predicates
dumbledore
ambivalence
masaaki
bellinzona
locket
ibex
princesse
khin
wronged
intendant
eldar
couplings
iterated
relatedness
ulrike
geylang
familie
phylloscopus
populating
cvo
unione
philbin
curd
kurtzman
gurdjieff
manipuri
profusely
pearlman
hohenstaufen
tripathi
fem
starlet
eko
sawa
sze
tasso
shephard
prensa
ngai
kitsch
hsing
janette
picnics
protour
pamir
rijn
skene
quand
cattaraugus
bez
insanely
reinvent
verney
yearling
mamadou
brawn
ferruccio
daugavpils
dupuy
upmarket
lehr
sires
uruk
tuolumne
progreso
dury
fellowes
hadron
bremner
mraz
wesson
swirl
ligeti
schulte
hyogo
secondhand
concedes
unsupervised
farren
kmart
headstrong
accipiter
iac
sampaio
crystallized
altay
endeavoured
cuckoos
ket
brome
artigas
commissars
aragua
nourishment
longworth
hos
audiobooks
hébert
nieuw
woodworth
incomparable
batumi
graciously
kutuzov
thermally
corr
ichigo
scanlan
brickyard
tamper
decomposing
prokaryotes
bim
calabar
smtp
mosher
rebrand
wenlock
ishtar
ilex
tightrope
pki
glancing
friedrichshafen
aphorisms
mellencamp
chaplaincy
kyrgyzstani
rallye
ruger
myriam
hopf
cyborgs
teesside
skien
nellis
umd
talkback
mixtec
giggs
horten
raonic
tachycardia
faqs
savitri
fyfe
erectile
avar
skeeter
tonya
titi
tomi
murry
cuticle
ransome
diospyros
dtm
sasso
qed
mattek
kurz
atheneum
augie
britta
bathed
buñuel
schwarzschild
tubb
redwoods
pforzheim
burnette
jools
buzzing
jpmorgan
otra
edwardsville
ratner
gioia
incontinence
schatz
mccutcheon
bangles
foretold
partenkirchen
sanada
wondrous
ripken
rosebery
ahab
frontrunner
eugénie
anointing
novus
wuthering
folger
diggs
blinking
kagyu
hod
ttl
copycat
gershon
chars
wilf
gyllenhaal
cavallo
sockers
hawkwind
chaperone
clitheroe
icl
harrelson
reflux
cleaved
irate
feisty
crumble
retd
headstones
tengo
gce
mcnamee
samut
krs
hommage
porosity
chiyoda
inalienable
sarge
calaveras
keokuk
bahama
townley
fap
misrepresents
matilde
lhc
augustan
sueño
sunburst
robotech
bre
roseland
bagram
haggerty
shizuka
alireza
bridgnorth
coauthor
juventude
interuniversity
fergal
havering
jbl
licenced
janesville
lene
giraffes
thine
teater
gung
colwyn
maior
jis
tisza
feld
autonomously
stichting
enamored
friendliness
lani
panning
howler
sisko
rosset
maxilla
charmaine
renshaw
haddington
dua
brunet
cond
lydon
renominating
wor
elina
apolitical
tapings
bester
jaye
haplogroups
arkwright
filmworks
aviary
rattus
dhillon
uar
antena
khawaja
wailing
rahal
georgiev
durkin
svn
sarin
carell
strasser
bande
kumara
maitreya
kalat
fairer
delves
ricks
tooting
zoltan
dicky
conformance
enna
nala
wls
godot
isherwood
pettis
wissenschaft
incompleteness
merah
fath
blackwall
meagre
figuratively
unbecoming
fannin
affordability
sailplanes
vidin
mmx
proportionately
howards
cdk
etoile
proline
exons
diphtheria
intricacies
disguising
spook
concha
kuma
fonte
eastlake
junkies
conformist
weems
scavenging
shuffling
liana
belive
participations
inbred
subheading
tauris
owensboro
romanos
deepa
howden
cityscape
wcbs
bloat
latins
obadiah
leticia
nigam
lika
pawar
officier
plums
pontypool
benchmarking
redcar
raff
oryx
melendez
krill
defector
taunted
illingworth
unpopularity
silveira
pageid
freiberg
milltown
stonehouse
rebelling
radix
arma
monotheism
humus
bishoprics
appraised
haters
mcguinn
executors
kevlar
denby
pampas
kurtis
shania
gurudwara
shimmy
septimius
janvier
dnr
uplifted
lundberg
braniff
keg
lecco
havant
spindles
mittal
quadrangular
scribble
aschaffenburg
kraken
smes
transparently
dalziel
boc
microscopes
mamet
kluge
phenyl
motorcycling
poked
deschutes
reale
pierrepont
merckx
elihu
mcfly
bodega
daviess
meyerbeer
greuther
vyborg
bellucci
carers
ashbourne
enactments
luan
garratt
capistrano
fibt
strummer
instalment
austerlitz
fantail
seagoing
araujo
galilean
chay
fragility
bazooka
eazy
perceptive
bts
knockdown
ouro
bursaspor
smetana
cleburne
doktor
squatter
houseguest
refounded
aligns
dinos
compressing
bhumibol
klaasen
dally
masri
pilipino
ecs
legg
dour
cephalopod
nisa
sanya
gua
daa
toshiko
rudge
windfall
burundian
locusts
piscataway
naperville
nombre
hyaline
siebert
snopes
dft
hayato
campanile
ghostface
coastguard
vanden
datsun
›
apologises
faltered
mclain
specialities
clemons
bjarne
mazhar
verandahs
charlevoix
ramachandra
resonances
cromarty
selatan
neurodegenerative
krishnamurti
frostbite
albus
shirazi
northerners
chaz
privates
bedingfield
sorkin
canceling
currier
odors
fairleigh
lokeren
käthe
ambrosius
librarianship
jawad
dalida
normality
citizendium
strangle
menswear
gaslight
racists
physiotherapist
aéronautique
bohm
joh
nostradamus
testimonials
radisson
cherno
saros
caproni
slims
eurostar
carib
lida
phitsanulok
bogan
cheever
minangkabau
wysiwyg
dnepr
tights
eretz
slates
deventer
cardenas
vani
roni
lovat
uninsured
bandages
tks
jct
ccn
kis
ferré
rte
sandi
bru
wyk
juni
lecter
forst
undergrowth
decomposes
veered
arup
pasternak
pfister
aml
vcr
quipped
ghalib
monteith
lewd
switzer
normanton
huawei
comically
ssg
cilla
errand
psm
wedderburn
uanl
intelligently
minion
universo
zest
nri
groza
torbay
bolognese
turek
imd
trophic
colquhoun
hedgehogs
borer
ballou
joffre
galante
kamo
glorify
rinse
crist
troubadours
copyeditor
kaw
geoscience
lustre
dentition
forex
tote
beja
scorecards
amano
bribing
foreshadowing
heme
septimus
flamethrower
eif
montaigne
oxon
berner
benavides
despises
acrobatics
zob
shorta
socializing
sleaford
turmeric
ferri
fistula
sify
moonstone
ludmila
petru
misrepresentations
minarets
mims
topsy
asistencia
tve
multiplexing
sharan
rizvi
medico
approximating
komsomol
orienteers
gremlin
denunciation
previn
chapbook
todor
camberley
unitas
ccg
subtribe
relinquishing
pankhurst
palumbo
uriel
fou
exacerbate
diouf
chabot
divan
kishan
stenhousemuir
hanafi
condoleezza
yangzhou
havent
gade
tosu
orifice
instinctively
beninese
homewood
scottie
yardbirds
structuralism
tempos
gunston
starwood
ghibli
composure
taser
hannigan
galland
rothenberg
cronkite
scrappy
mordaunt
vijayan
rabbah
rejuvenation
rwd
guernica
muzaffar
bharata
rubenstein
pce
adarsh
wafers
katrin
gametes
vaillant
roku
adac
weatherford
amf
dunkeld
moyes
chiefdom
trilobite
hofer
abitibi
malformations
bromberg
fiddling
emplacement
nal
finlayson
nephi
rikki
clydesdale
sulcus
goma
gudrun
panhellenic
albertus
westphalian
kolding
torii
abrasion
ingushetia
naresh
lengthen
outgrown
ersatz
montcalm
bogle
waterboarding
alka
mattei
echelons
chivalric
adhesives
dabney
menorah
pyrite
maccoll
oostende
counterweight
moderating
corzine
tali
lavalle
philosophic
deformities
organelles
bioshock
transceiver
granddaughters
milorad
hinde
roubaix
expulsions
hiratsuka
unapproved
belatedly
hepworth
melanesian
northam
tamayo
spaceships
novela
divider
gpus
lignite
matchmaker
crept
sth
cornel
chiron
powderfinger
macedo
mangal
meisner
janse
ambler
gunslinger
jónsson
aguiar
tik
womanizer
hata
prototypical
pernicious
ryanair
hurtful
overused
gallego
appreciates
sindhu
sopra
malini
realtors
pelota
sinica
taffy
saipa
conditioner
gneiss
franchising
freaking
dauntless
dissension
vax
nipples
lazer
seh
julieta
unni
elin
cicada
duca
billet
screwing
centrifuge
upheavals
dcu
davila
indicus
bubonic
bumbling
jenkinson
bolingbroke
wigner
chandeliers
gaumont
fanclub
linker
marca
agee
cycliste
cytotoxic
kannan
politic
axillary
misra
prue
nimble
pulsating
perfecting
cont
mulla
bombarding
monegasque
fantagraphics
lynched
opine
marbury
zhuhai
shiba
northwesterly
visage
landmines
oost
gravesite
secs
marisol
changeling
tine
begley
pirata
eponym
targ
gigas
moller
prides
skiff
constrain
hinders
isan
lombardia
gawker
yle
androgynous
roldán
melanogaster
akai
dmytro
natsu
mohegan
kennard
guinevere
spontaneity
coffeehouse
weinberger
authoritarianism
opulent
othman
perrot
reconstructionist
fusarium
premièred
reredos
salvaging
thoroughbreds
etv
minutiae
ergonomics
pinion
sprocket
falsehoods
hbf
adequacy
receding
potenza
socialize
crawled
slaven
troughs
parham
millen
nowell
chemie
thutmose
fogo
aad
deol
frobisher
debugger
mcnab
amadou
roby
tule
ayaka
pinchot
tachi
attestation
farthing
manipulator
magruder
voight
saulnier
escudero
advertises
wreak
beano
schwarzkopf
thomasville
wayans
quakes
nikolaj
adoptees
sadi
nris
succumbs
papillon
brainstorming
blabbermouth
maccabees
fanfiction
gasses
lesbos
alfreton
vapors
yokoyama
ramsden
herbicide
primers
smokes
tollway
prasanna
chenango
uncooperative
viciously
footnoted
fsf
peduncle
moreschi
neuen
fms
doty
fabricating
truffaut
sagrada
jeeps
smithy
mourinho
drapery
bq
carnes
khoi
myst
rete
africain
bankura
unsecured
waynesboro
wsc
pitta
guha
methinks
redundancies
ditched
epworth
maron
façades
baffling
fsn
marquesas
lingam
bilaspur
biochemists
piel
mcdonagh
bunge
trims
seiko
perestroika
folie
parle
poona
chloroform
neuquén
serviceman
npo
hristo
brookwood
jule
ladoga
maithili
erb
viticultural
retrofit
moisés
pedimented
shwe
grice
lnb
myeloma
solenoid
basilio
ferocity
answerable
phyla
unveils
artic
wither
udr
doble
yossi
akc
batson
asses
stipulations
polices
summerhill
praga
reggiana
sheela
biggar
interdependent
ajith
penarth
molested
restorer
sistine
paulding
playbill
klug
gowda
chepstow
belen
bcd
propellants
stowed
marginata
alesi
rafaela
messner
marketable
internationalization
andronicus
wyre
telegrams
fluctuate
assunta
purnell
kpmg
renzi
görlitz
ofi
bathgate
detonating
rattan
leyla
gebhard
frusciante
bley
ocs
litton
unfulfilled
traynor
appreciating
prosthesis
cardamom
legionary
boomtown
openers
foreshadowed
massimiliano
purebred
trivedi
sandrine
balikpapan
oruro
blocs
annexing
blasphemous
stomping
annexes
allyson
sager
lysander
basements
wrasse
hirano
immigrate
ilkeston
ksenia
yul
countrywide
letras
mende
breitbart
lipman
）
sheehy
pugin
cybersecurity
ursuline
bolin
khe
viki
bushland
patroness
codrington
irenaeus
deejay
lettre
bharatpur
pecking
aftershock
reka
stoneman
referent
neufeld
yolo
akins
slamming
regionalism
inadequacy
saldanha
counterexample
pentland
polskie
audie
morgenthau
matsushita
agrippina
llyn
almada
cerritos
jaded
acu
newburyport
dez
sohail
photojournalism
musser
stabilisation
petioles
tng
forgave
ssm
heuristics
backend
diatribe
charteris
cooch
follett
bituminous
librairie
threshing
stromberg
victimized
jib
cruikshank
patroller
klinger
bluewings
seacrest
cim
pembina
lockerbie
attainable
bfc
alentejo
wip
khanty
bss
arjona
southwick
evian
untrustworthy
redhill
créteil
hazleton
ilyas
mcbain
anca
basilan
rorschach
ventriloquist
peschke
dampier
masayuki
ilves
willcox
brickworks
irreconcilable
reflectors
supposing
eliade
amass
dickenson
reversals
hanbury
infielders
bhatia
vicomte
arabesque
kosi
docudrama
clr
rappaport
leggett
appendicitis
jardim
danson
interactivity
concocted
mtu
arda
phenolic
violator
shrestha
mtc
bier
vannes
phylogenetics
lossy
alagoas
cmi
shinde
cultivators
takada
lalonde
wynton
tagus
mccrae
mccourt
fulford
hardiness
toasted
masada
coenzyme
pessimism
ponca
mors
gravelly
civile
czesław
matej
polities
dropkick
elucidate
ecommerce
epistemic
humberside
sallie
goulet
imai
croc
lumbini
panini
pervert
eglise
jurek
palomino
aggrieved
goodfellow
nae
demesne
sift
dubh
trichy
lanzhou
lordships
magritte
gigantea
chaumont
reay
truscott
trapezoidal
newsmagazine
remoteness
rabe
wynter
identifications
seminarians
macaroni
frills
mayen
awan
elaborates
medeiros
shahrak
innovate
huss
kutaisi
handbag
vrt
milena
dothan
concertino
munger
mstislav
momentous
sitara
berk
silliman
hartnett
sweethearts
photometric
youngs
waiters
actuator
dandridge
probst
scavengers
restated
oldman
menschen
penetrates
chertsey
tsim
reminiscence
acha
harrisonburg
cardiomyopathy
franciscus
tipo
justicia
appendage
ides
picnicking
spinelli
roofline
bullitt
faulk
barnstormers
loder
abwehr
pattinson
shinawatra
verband
skylar
reforma
pompeo
annesley
emg
montreuil
gallaudet
loka
spinosa
redistribute
maturin
ketones
hairstyles
charleville
presser
kissimmee
akio
gz
bartel
bulkheads
grandparent
giotto
desertification
guested
fouad
campanella
rupa
cementing
smoothbore
questia
romy
nirmala
excerpted
presbyterianism
gigolo
hutson
ratnagiri
jip
moderato
inane
patronizing
buie
wetherby
saxo
accentuated
jur
sonne
yoder
xperia
expend
penticton
disembodied
astroturf
colonised
ghaziabad
deactivation
táchira
talmadge
pterosaurs
tokai
mcluhan
troilus
childe
baghdatis
raucous
minardi
weathers
loaders
hellcat
luxuries
pontchartrain
landskrona
fea
dunaway
lamarr
leaching
himmel
cribb
dribbling
etruscans
cabriolet
canna
potemkin
kranj
kilowatts
civilisations
drina
fitter
phobias
poussin
wetzel
foundling
scull
resisters
ossie
shrublands
imparts
sharpshooter
sumitomo
barbe
nightcrawler
xxix
griffey
broglie
valenti
brigantine
injects
ypg
mec
thaler
vernet
unlisted
arcos
follicle
donaghy
mannerheim
osmotic
tenney
fragrances
jayanti
sankey
mdr
peppered
ostrovsky
nowitzki
tlk
sandor
anhydrous
mansa
koma
consigned
specialisation
pyre
synapsids
columnar
trigg
lefevre
exploitative
smokeless
reprocessing
plantarum
mire
comercial
forestall
autographed
sisi
ellipses
domineering
toots
winemakers
jalalabad
speechless
almere
gass
foie
usta
geotechnical
prob
erg
saltillo
alister
zaza
horváth
nakata
aberrant
kuch
yasuhiro
deceptively
pvp
showgrounds
lacoste
nape
cormorants
godman
qassam
sunita
oden
castlemaine
dhamma
ardea
deport
selfishness
politicized
sivas
jacobo
pigmented
maplewood
talleres
askari
endoplasmic
constriction
helmed
sez
eroding
carty
lumumba
shahar
bookmakers
nitride
muon
gerwen
proximate
lakefront
zofia
matsu
kilt
substantively
uy
valuing
zaid
maddison
gabs
hankyu
maharani
makkah
fanshawe
rocketry
trumped
uppland
jassim
sasquatch
volos
wooten
koirala
hearse
vasiliev
smp
stannard
poroshenko
bpa
schuler
nobilis
coatbridge
rovere
quidditch
educación
terrors
aintree
tutankhamun
seligman
magyars
cipriano
discrediting
stich
duterte
leapt
paycheck
conjure
iversen
canter
orla
amenity
crosbie
humpty
somalis
worshiping
natur
psoriasis
rangpur
chunky
dysart
inigo
heliopolis
sunder
castlereagh
mustaine
kahan
ramla
olivine
virtua
shinya
asta
falaise
reformatory
bozo
vachon
dlp
gibraltarian
vrs
morpheus
tirade
bonaventura
sativa
dione
joël
legco
satyagraha
hakodate
guppy
docile
antwerpen
churn
papier
baugh
uomo
overwrite
assistive
etf
bungee
phosphates
amplitudes
optimist
langhorne
lippi
solapur
oracles
bij
cudi
kopp
gehry
janner
grandmothers
cataracts
sergeyevich
maslin
pesky
popham
ambroise
notches
soman
philpott
confiscate
palliser
thereon
beautification
derg
antler
tagger
lally
munition
kenrick
ohno
automakers
cerevisiae
amami
verbiage
schechter
emancipated
clanton
townsquare
voles
wct
nicodemus
farrer
unconsciousness
cashman
salonika
shambhala
maruti
grandfathers
tarnish
lactate
ohms
haridwar
kass
dalits
poachers
esher
lbc
bentonville
potions
panelled
lunn
prioritized
qadri
polarizing
finno
bonito
digi
emirs
lluís
microgravity
recuperating
qh
dorking
kudryavtseva
alekhine
loca
patric
fpo
boxset
nothingness
wyler
thabo
wholesaler
coerce
manchus
toot
matsuura
dragnet
capaldi
singin
coghlan
sinkhole
boban
hoang
margret
bohlen
tampered
zita
zhongshu
shubert
lamm
televoting
fibula
chills
utilitarianism
holderness
sheringham
ethnological
leominster
contagion
mastiff
savannas
lauer
linke
stocky
desirability
furrow
luminance
westin
faints
fabergé
trulli
grazer
protectorates
kur
mitral
hydrated
pudong
sheri
upendra
hallucinogenic
rosetti
loosen
stoppard
naito
commandeered
cablevision
culloden
unchanging
metis
squirt
abney
countenance
portobello
stebbins
damsel
mammary
kerremans
riveted
isao
marit
renan
gere
barrientos
plath
hollowed
omniscient
maltby
patronized
xinglong
ruthenia
militaristic
rachid
aizu
beckenham
uncompleted
taint
pentathlete
leakey
takahiro
findley
pretensions
oki
mcginn
tancred
herts
sharad
igloo
spits
gtg
madama
campanian
miquel
palatial
motorcyclist
crumbled
tethys
dragomir
mantel
indias
malling
midtjylland
fdic
limehouse
shilpa
macinnis
pastels
alen
marooned
elysium
sakharov
traian
ricordi
reformat
wsa
whirl
malaga
nicolò
judean
approachable
daoud
cadaver
catapulted
qumran
soothing
gahan
germano
papandreou
whitford
thrall
lanchester
preservative
tryouts
pecan
simian
pantheism
molise
avispa
imbalances
redline
overlordship
blackett
cpe
llandudno
ojos
kebab
hadhramaut
foz
overhang
oglala
whoopi
freising
painless
photonics
trinamool
incapacity
elwes
finalize
loudest
compressive
pani
bocelli
bcci
fabricate
putty
conservatively
sheaves
numerator
umts
zambrano
secunda
poggio
geochemistry
logotype
earthy
frameless
ezequiel
pinal
solheim
aniston
conversing
bartsch
hillbillies
matchups
bellary
carmody
fazal
rudman
anthers
scotrail
pascagoula
nep
hakoah
fraunhofer
wert
pandemonium
aikman
sachsenhausen
apeldoorn
schwa
nordstrom
hydrodynamic
clutching
gunshots
kock
afire
kosta
avanti
drenched
odile
schreiner
quadrature
nyssa
anabel
moultrie
envious
faversham
girish
rohr
publique
riaz
llamas
sawdust
aldeburgh
kandinsky
willesden
lemmy
albanese
berkowitz
stairwell
lsi
arlo
amaury
undaunted
embers
rededicated
nitpicking
reeks
bintang
subhas
cheyne
comyn
xenophobic
orangeburg
overblown
auc
alcazar
irradiated
mhp
geir
begg
nagi
zechariah
adem
hadassah
skyler
cushions
schüttler
capitaine
grama
extrusion
conditionally
tadashi
jokers
zooplankton
psilocybin
mola
mcd
altamira
syme
oat
pfaff
allegiances
crossroad
cuauhtémoc
fads
tourer
foils
pinar
seaver
masha
prishtina
boars
apm
pleiades
trawl
detonator
indoctrination
broch
exertion
understatement
anticipates
fdi
lindell
judaica
duniya
manifestly
publicizing
impresses
brenneman
crouching
buttes
cabling
spurrier
accrue
tevfik
salami
karr
menken
empowers
chard
grete
hauge
belted
abenaki
fareham
sámi
sapp
isoform
sedaka
mmo
awad
ldl
sargsyan
mosman
rabinowitz
waregem
schelling
newsnight
combustible
downsized
carles
pedophiles
gujranwala
plympton
bandini
turandot
dougal
darnley
petrescu
voy
curative
steinmetz
sennett
pendle
cadmus
vestigial
mintz
blacked
brive
chipsets
limestones
riveting
passo
glyndebourne
stallings
animax
neuville
hennessey
amb
bulgars
rmc
lacma
ramey
kling
leatherhead
rebadged
gillen
astrodome
afridi
écoles
malatesta
fennell
vlado
maguindanao
illuminations
thwaites
amateurish
poder
ubon
aerobatics
postures
hanno
beeblebrox
warms
higginbotham
dvina
capers
flattered
typewriters
rhesus
ecg
achim
taguig
norcross
alleviated
yah
doh
blossoming
khatami
neiman
latifah
blakeney
overfishing
logarithms
glo
docent
crustal
pik
gynaecology
legazpi
wentz
tourney
duller
icebreakers
reactivation
homogenous
schönberg
spacewalk
episcopacy
northstar
petaluma
tenures
litters
thymus
straining
sussman
sobriquet
autos
sande
altarpieces
bluestone
châtelet
effigies
scotiabank
swoop
hyperinflation
rabindra
bloodhound
religiosity
hyperspace
necessitates
keil
cls
inclusionists
swipe
khimik
droplet
kristoffer
shakin
nev
buuren
gygax
sensibly
proteases
poltergeist
zambales
mcginley
bruner
fireproof
hemming
mearns
communis
ruinous
minuet
gandhara
iosif
lewinsky
raked
shouldnt
hoke
cervix
orientalists
underpinnings
counterinsurgency
pietermaritzburg
viscountess
karadžić
petén
jigme
aqaba
pennants
jugend
benham
matthieu
suki
győri
lances
sqft
battlements
rq
bayne
izhevsk
sargon
cuttlefish
freese
pika
ragan
beamish
emmen
blackmailing
anatomically
consenting
womens
insubordination
mildew
toga
¬
wirt
palladino
subtext
inboard
subdivide
pato
bassline
chestnuts
molinari
suppresses
cassa
regattas
culpable
prick
slavko
farcical
varghese
lodger
pallava
hemorrhagic
daoist
nanoscale
ismet
scotties
chuen
cbt
rolla
libor
zeb
suvs
necromancer
novae
pataki
nacl
riesling
merman
inlay
tuber
ballo
khazars
kiha
murmur
festus
knighton
quatro
eee
outtake
svay
cynic
oswaldo
fidelis
trappings
nando
pgc
kop
nullify
pölten
thanx
foia
unexploded
bevin
fraudulently
vllaznia
asio
ludwik
veils
haystack
sonu
precipitating
pomo
scarp
upfa
glaad
skonto
limpet
lomé
gti
nizami
dangerfield
vivre
urb
mailman
gaudens
quintin
purport
changeover
plunger
millenium
italicised
hopwood
amulets
caballeros
håkan
cocky
nielson
dimorphic
proteasome
zb
windings
rebut
torr
endures
antonino
arakawa
spad
grampian
nachrichten
pby
marchese
displacements
virtuti
piledriver
tams
insinuations
burleson
roode
dedications
brca
wegner
xfm
garros
masi
miletus
seul
visionaries
hooters
turpentine
politik
ere
unreported
llobregat
awm
serling
partie
idolator
donati
storyboards
inoperable
sphagnum
amata
albertville
nono
ceiba
tsutomu
silken
guccione
semite
sumer
transponders
howlin
puppeteers
lenticular
pickets
kaos
ramblin
heep
popup
poa
taube
grate
capps
scolaire
simulcasts
aficionados
coryell
sfd
siddique
quod
shawinigan
indulgent
woodhead
ffs
golub
fangio
rosettes
margit
kerri
harmonized
mired
hashmi
resents
genealogists
decently
diabolical
lca
wuxia
rawal
tft
langan
quadra
beerschot
zito
bermúdez
ulrika
interpolated
artforum
sausalito
chubb
supp
toil
deploys
buk
doro
kawaguchi
fissile
faithfulness
transference
potteries
thorndike
folios
repellent
disengaged
lemberg
stb
wardle
yayoi
blaenau
consortia
deductible
clontarf
thar
bikram
población
schnabel
cicely
pers
microchip
lakshman
obsolescence
perishable
ponty
nestle
cbl
walken
tribunes
bagot
desiree
mowgli
beltran
comilla
yeomen
parachuted
prahran
schola
agena
thump
orchestrator
animatronic
rodham
borthwick
mne
nesbit
wyclef
microwaves
thurgau
huntress
womanhood
bascom
krka
anatoliy
saintly
approximates
samman
manse
infiniti
beng
bergson
cobden
amphetamines
coliseo
bsn
macartney
deprecating
etruria
sanda
manju
efe
ashgabat
fingertips
furukawa
agus
pungent
arcot
crédit
passe
firmin
vocalizations
paar
siting
victorians
linh
besant
dslr
kehoe
aguas
aef
ble
candi
almelo
hamamatsu
opc
candler
thameslink
geodesy
professes
minnow
bandura
inositol
mansi
hotelier
schiphol
elway
loh
estella
synthetase
mechanistic
supérieur
recreates
wako
bookkeeping
hosea
tensei
thurles
takara
ludwigshafen
inconsistently
rosser
setups
herders
setae
rammstein
mdma
triage
sarab
decryption
kratos
cdm
recto
mazarin
sensuality
dolor
brut
nobuo
chakravarthy
saugus
lomonosov
simonsen
nabc
chechens
haupt
gilly
arsenio
docid
warrenton
grantee
armagnac
flexor
brockville
behrend
manda
jona
bengtsson
anant
vms
shar
sammlung
glorification
bonnier
lilli
donskoy
nahum
observatoire
malheur
fdl
risker
telegraphs
fue
aage
navas
batticaloa
tractatus
scientifique
fermions
lockett
strasberg
dunphy
bandleaders
sukhothai
favoritism
eic
bellaire
venona
moluccas
ett
aykroyd
stiffer
saadi
nitrite
onn
paralympian
zephyrs
¶
tsugaru
bastrop
giffen
boylan
playford
godly
nilgiri
bridport
rnc
cherokees
haa
inconceivable
coley
brdo
gigante
sarabande
tarr
priors
maharajah
fetuses
hammam
lagged
boundless
lagoa
ghastly
noelle
estradiol
brainstem
acte
farrah
imperials
ratliff
defensemen
condors
olympiastadion
splashed
haitians
kalb
alanine
saransk
fingernails
luthier
tricolour
birnbaum
fanzines
tidied
cacique
chairlift
utilisation
maz
exerting
ctu
pti
regine
gediminas
hosokawa
mountaintop
hinsdale
daan
snappy
dingwall
hoyer
esi
chaudhuri
ubu
naranjo
blacksburg
inaba
boateng
lemos
anambra
canmore
manish
daredevils
gerbil
pljevlja
bogey
mindedness
deka
jaworski
curation
américas
gumi
kolmogorov
delong
nore
softbank
kann
sania
sesto
optimally
printemps
corelli
clerc
supercomputers
colons
etcetera
unsympathetic
nicolau
tarun
schumer
complicit
dima
liste
andria
zant
follette
birdwatching
provincially
excitatory
mauthausen
dorf
slashes
telecoms
numismatics
celibate
política
oconee
kincardine
mags
buen
staccato
donegan
fretless
frustrate
mantilla
daventry
cus
spool
transcribing
nuff
fiorello
leszno
systema
malan
noronha
naves
kyun
euphemia
circumventing
lilia
kha
talker
boomed
blackest
kuantan
fivefold
jacobin
simca
thomason
peretz
harboring
laz
kash
telefilm
alpini
glosses
utm
hecker
gaur
commutation
braque
penner
individualistic
musso
telephoned
maus
lansky
meniscus
matthäus
xk
ledoux
inari
burnell
linearity
inapplicable
guarda
levelling
bamba
haploid
tass
indelible
silverberg
hayate
nightwish
ferment
concolor
ngs
agia
maeve
diapers
bluefield
smelt
cuzco
couriers
becher
abkhazian
renouncing
sinusoidal
enduro
sba
mariani
condiments
augustów
wiesenthal
ordinating
kes
rosenblatt
scud
feigned
hermosillo
oberoi
postmasters
shapeshifting
giannina
mcmichael
spinola
remixer
stipulate
bullhead
oui
reassuring
razi
apd
bron
gabler
angiogenesis
chambersburg
vandross
fognini
clavier
colter
clogging
camperdown
anstey
amusements
masterplan
bizkit
ultramarine
hemings
geomagnetic
ruda
hipper
bloomsburg
knin
incrementally
montour
seiner
rudely
golubev
modulators
harnesses
dmt
pinged
phosphorylated
piaf
schweinfurt
cobol
ocarina
maggot
tatsumi
affixes
aspinall
bernburg
neuve
merlo
sparrowhawk
hetty
republik
clapper
palmdale
puckett
ceding
dav
alsop
angering
blackbirds
meld
jahr
sombre
boggy
shirakawa
umlaut
hypo
objectivism
leftovers
oke
surakarta
chal
gnr
dijkstra
salahuddin
kamaz
millville
dimes
concierge
blowers
fibrillation
hairspray
keiichi
stengel
yug
enlai
wetmore
wtcc
lunacy
flutist
leite
utilises
jalpaiguri
microcomputer
clotting
ullah
koei
biggie
outlier
perro
rhubarb
pocatello
rata
rydberg
elster
kripke
malus
specie
bonar
lampard
dubbo
mankiewicz
lebrun
toynbee
painstaking
blyton
laycock
véronique
guerillas
federals
hydrate
szolnok
leucine
vellum
inflame
norberto
telekinesis
gerontology
umatilla
carabinieri
horia
halland
prevost
meneses
naver
swimwear
guanine
hoppers
orpheum
fingerprinting
wry
vac
mette
bosom
unsw
lambast
babi
kym
bookshelf
klitschko
winstanley
thetis
sperling
myeloid
donelson
vitriol
chewed
muscovite
fogel
eklund
superbly
haugen
dini
cepeda
resubmitted
dniester
dawa
francais
visualizing
hrvatska
mandelbrot
styne
taxiway
sailboats
pvda
transliterations
unirea
denizens
regt
processional
indenting
abidin
riedel
mcilroy
joséphine
schultze
peeled
foreland
ryoko
noize
cassation
sawan
preble
alcatel
sfl
navin
zaporizhya
nexstar
artemio
tursunov
masuda
ormsby
sevenfold
sabotaging
spoiling
albertina
uniontown
effector
ballinger
coconino
wielkopolska
beltrami
jps
nikkatsu
morioka
mcu
southbank
doucet
intermarried
edwina
browder
impersonated
loran
etty
centreville
rapunzel
palpable
fictionalised
seaports
brainstorm
fortin
paraíso
bahu
soraya
pressman
seafront
synthesiser
handsworth
brazier
kiley
filamentous
badalona
qigong
tabby
seducing
terrorized
blagojevich
ayman
gastroenterology
krishan
dewhurst
schramm
reger
endemol
dogfight
rathore
copula
bellona
zayd
cso
chelan
kimiko
rotem
cornices
truncation
namor
huddleston
lintels
mcsweeney
mauri
grouper
matabeleland
solange
dakotas
carelessly
beleaguered
immigrating
madina
beekeeping
wisbech
schorr
kyra
chama
usan
celina
armavir
sawn
officiate
expedited
bumble
undesired
satomi
incidences
faustina
valjean
nimh
ehlers
sharpen
longoria
tovey
karnak
eyeballs
rutan
immovable
harmer
accusers
dorr
surety
pariah
evergrande
chilena
despondent
klaas
boothby
feasting
maelstrom
naumann
blomberg
tirumala
overdubbed
eastland
littérature
selman
alcantara
engelmann
bombus
baskin
cosmas
yukari
expunged
pompeius
magno
theism
originators
fadl
wachovia
protege
boyars
mcnaught
fanboy
halmstads
straneo
berle
divinely
chavo
opacity
bod
irreparable
sanyo
scape
sabc
floodplains
canale
pera
mow
excepted
bic
mladenovic
hasidim
panova
hebdo
lando
wiese
burghley
wtt
enz
hydrogenation
aphelion
toul
ulrik
bronisław
krishnamurthy
cremonese
neoliberal
solana
panties
heyer
proscenium
s,
shuttered
camarillo
downplay
swizz
kap
bloodiest
beatport
makai
debutante
crofts
admonition
diagonals
dwyane
saverio
sridhar
ilhan
prévost
updike
jyp
schön
gcmg
rudders
knorr
résistance
predestination
samsara
manfredi
vlc
hadid
abitur
nishikori
glossaries
counsellors
santini
resurrecting
rapprochement
ludington
parsed
sordid
hater
chrissy
cressida
cmdr
captcha
soka
branagh
paa
monaro
steinitz
whidbey
nuthatch
jans
dalarna
praveen
chukchi
espen
shrug
champlin
rundle
fatimah
aigle
benedikt
fuqua
transpose
duckling
stabilise
drunkard
sdi
lurgan
cataloguing
darke
harkin
outkast
fanaticism
halakha
scarlatti
cranium
varvara
mathur
harnessing
polonium
newhall
remedios
likeable
indisputably
shankill
grandi
foursquare
fieldstone
thacker
electrics
whatcom
elim
grandfathered
reiterating
eisen
downwind
espada
popolare
irritable
bulloch
kroner
naf
sequoyah
boop
balakrishnan
reshaped
taw
deland
où
agostini
reshaping
tannins
shadowing
defacto
syriza
reinsurance
muralist
kabbalistic
patagonian
annealing
petipa
carlito
rothko
thana
ducky
jaish
symbolise
cvt
combats
bourque
burrito
zweig
michie
secondarily
sáenz
bergstrom
retour
hijacker
zenon
gullible
kravchuk
sprained
atherosclerosis
dreyfuss
yitzchak
wisniewski
grenadian
anopheles
peerless
plotline
jinping
classe
doth
lawley
gct
reconciles
logue
hungama
banga
natick
climatology
rockdale
kuna
cogs
cisterns
ramin
flier
verónica
sellout
impeding
luria
wellingborough
annemarie
hittites
dunblane
astride
mechanization
wrangel
skateboards
blanchett
myosin
ravages
khatib
hospitable
muskoka
hounded
mattingly
souk
peya
mwc
terrebonne
wheatland
pedra
hth
oye
cve
prefaced
rspb
djurgården
columbiana
balch
seltzer
strayhorn
embodying
wnbc
bickford
bough
ryuji
vad
shorebirds
bosse
weddell
cartilaginous
collard
modell
vsc
untested
pillaging
wilbert
moana
thurber
ncp
helsingborgs
germplasm
buckshot
toney
adenine
chattopadhyay
permanente
wallsend
mineralogical
southeasterly
mclane
giannini
jvc
londoners
noailles
wahoo
bergisch
melodi
hirth
betrothal
mra
smeared
utara
prokaryotic
scg
tring
ossian
chiayi
earthworms
vacek
agronomy
bactrian
castiel
kilobytes
cecile
discloses
comiskey
verlaine
mumps
popolo
darkening
pascale
tami
groundless
calmer
arenabowl
sewall
libera
markey
ortho
badi
targa
kaitlyn
pluck
camillus
kurzweil
steepest
southworth
ventilator
ebbsfleet
kosovar
duped
anniston
parkdale
retiro
avoca
sabino
straddle
loretto
fissures
chevrons
adt
psychopathology
angiosperms
aftershocks
raskin
blane
sniffing
chatterton
rosenwald
laroche
centaurus
evgeni
embry
hasselbeck
aai
così
fcm
clove
shimazu
shai
invoice
costanzo
trelawny
carelessness
obeys
ogasawara
jeunes
zil
rathaus
connally
mohawks
absolutism
stairways
wickes
htv
miku
svendsen
angulo
buzzcocks
hanrahan
malachy
sverdrup
helmer
starstruck
drosera
shabby
hass
psychics
heures
kerch
inadmissible
weingarten
kingfishers
durkheim
taqi
comunidad
tortilla
ginzburg
lurid
tumbler
bangui
plaine
margaretha
avp
cycled
÷
affording
trustworthiness
predisposition
borda
nigerians
bufo
millonarios
françaises
tpa
slats
macular
rockwood
starace
caer
broadmoor
holywell
subscript
resurgent
ganda
jintao
mitzi
duster
kah
restyled
kuopio
lennard
screwball
choirmaster
crowdsourcing
amadeo
motherboards
mateus
flashed
zanetti
menendez
globemaster
racewalkers
gladbach
euboea
hankey
olivo
esmond
engler
erroll
illusionist
portnoy
brubaker
saori
ugric
seely
mitotic
miyan
chaux
torrents
repin
erupting
belsen
robustus
woof
eero
crotch
socon
spotter
pran
berwickshire
vanni
texting
instigating
parthians
muar
decibel
indomitable
sharpshooters
cohabitation
montano
balkh
perron
wrinkled
goad
maclaine
dott
jel
iai
unleashes
gero
adorable
ironwood
rainiers
sfx
solzhenitsyn
nightline
immolation
madura
flexion
costner
barwick
whitesnake
einen
wynyard
metrobus
underpinning
encrypt
suckling
bille
urethra
aberrations
proboscis
manrique
tahu
malleable
melodious
zainab
dariusz
alachua
imereti
hizb
spirituals
superspeedway
sealift
futuro
estaing
remittances
marcelle
bowker
exmoor
anaesthetic
johar
lumbering
leaded
candido
geno
portishead
goren
pips
swank
petrovsky
rakhine
girondins
derbies
decked
badawi
dhcp
venkat
shuja
dravid
nobile
trucker
iconoclasm
forger
magne
khaimah
lactation
cli
spreadsheets
seimas
hippy
tnf
cascadia
loin
bonkers
wolverton
cruickshank
mpla
jorma
wizardry
adieu
grander
geum
articulating
duplessis
bhardwaj
steakhouse
rodolphe
biotic
mccrory
rps
hori
cga
lifter
chakri
thorburn
burra
ironstone
sealy
mcca
ritually
helier
magick
arba
hamdi
mariel
comforted
subtopic
convener
jacksons
collis
sleeved
nebo
magnetosphere
walnuts
letterkenny
prosody
peacocks
taleb
refered
swc
weekes
waldegrave
tweedy
banish
résumé
kayaks
maneuvered
ibs
bosna
aed
sak
surendra
strenuously
viol
carrickfergus
liffey
nouvel
demiurge
scarsdale
mcdowall
mohammedan
macgyver
rightist
leckie
deadlocked
vil
fireside
nortel
convenor
novembre
tanka
mercurio
vortices
seismology
pausini
juried
hydroxylase
amazement
materialist
orbs
decolonization
riddler
clarksburg
gendarmes
gaël
dimas
smoothness
wayanad
airings
nisbet
animus
hecate
hoge
sympathizer
partisanship
zachariah
eaux
rumford
vyas
nii
bsg
aarti
tcl
funkadelic
dink
relevancy
caudron
parsifal
gracias
balding
asylums
obsessions
neutralizing
kapustin
mews
epigrams
rastafari
kaizer
vallarta
corel
ghoshal
entrapment
dalglish
creswell
maglev
quince
vasant
firewalls
oxidizer
sonics
helin
ntr
udc
saaf
dworkin
mascara
carabobo
plunges
polymorphisms
camas
maceo
brussel
africaine
boyar
speculator
juju
gorillaz
ashburn
kidz
airasia
mccandless
tite
infrastructural
kangra
sues
akuma
grout
deciphered
desdemona
deceitful
bpl
amat
incisive
teddington
ikki
hartmut
symbolises
sudoku
likable
jtc
docherty
carpark
introverted
csaba
sansom
adlai
houma
customised
rivet
streamers
bks
corrales
yasushi
comeau
witney
shabana
silber
peet
confidante
chastised
salis
lucile
nagisa
furtherance
joana
godin
narrate
unrwa
schlager
preventable
callous
novas
registrars
tempers
baranja
makino
insatiable
blériot
demolitions
bagdad
joao
themis
naturalis
srivastava
lazaro
maul
götz
maffei
cip
kare
redaction
saalfeld
candidacies
railhawks
katayama
disables
absalom
mantegna
purser
barris
dongguan
clift
dunkin
conundrum
upham
domicile
lovingly
nigga
ginza
potted
bes
harnessed
nakai
talat
heresies
adoptions
inoculation
kats
cartan
rpc
dhe
aventura
riana
sgs
paramore
buffering
metropole
wn
gurung
ubiquity
gels
wrinkles
hucknall
gla
sobota
goodhue
bulla
fakir
xlr
snead
lanning
nicolay
eschewed
jellicoe
butyl
cuza
morell
strangler
gheorghiu
otu
mantova
recumbent
pommel
bodil
spivey
mónaco
växjö
interviewees
carril
inversions
berrien
joann
pavements
storefronts
lanius
compañía
csis
raeburn
poindexter
shani
caton
tumulus
creditable
pimpernel
oram
deified
annibale
buries
rosemarie
nahe
recoup
elisabetta
qamar
briain
hollandia
mobiles
teflon
encapsulation
kirklees
alwyn
myo
existant
filles
vajpayee
sinuous
janakpur
triceratops
bellerive
churning
juniperus
shaukat
rosewall
airtight
abydos
sensitivities
aks
materialised
orazio
gresley
greiner
kingsland
bastien
nassar
socialized
lampeter
eggers
medicina
coquimbo
defenseless
spacer
vy
kaushik
shinn
carbines
feedstock
liberalisation
shockingly
dik
warplanes
repealing
krasny
venegas
scylla
augmenting
sass
concertgebouw
ahmadabad
rives
convalescent
harbhajan
jarry
hase
peinture
sarma
raimundo
wahda
canova
asceticism
sargodha
polygraph
convulsions
barletta
rajat
lorem
mlp
uzi
wortley
devises
gascon
theropods
korakuen
pederson
viasat
surreptitiously
fipresci
harel
ilić
jaa
kerner
malisse
saakashvili
progenitors
runciman
bibby
flavoring
expansionist
pasa
eidos
vide
hairston
ingush
opts
depuis
wageningen
mullan
otl
deputation
amici
strathearn
banderas
wiiware
claud
lumpkin
helle
aswan
sitters
parler
harmonize
keelung
moorhouse
tudors
doable
emre
langkawi
industrialised
shoji
spoofed
dictating
leiter
tadpole
córdova
backpacking
calvi
khosrow
eleonore
caricom
petey
quicksand
circassians
brega
tactically
aragorn
abhay
premeditated
russellville
eldred
ussf
demarcated
valéry
smokin
strictest
kuerten
kwak
grönefeld
surnamed
sauter
delanoy
hayne
adige
veolia
annuities
subnational
hypoglycemia
rocher
hydrophilic
junhui
signalman
sgp
linea
balto
dietmar
pottinger
cissé
quibbles
brattleboro
ppt
brotherly
anxieties
galahad
salafi
castres
odnb
caltrans
temecula
antonina
demerara
occultist
inhospitable
clasps
neuroimaging
schwartzman
tauber
emb
rez
manton
compiègne
subscribing
immutable
superiore
santurce
fz
fiori
remarriage
ingesting
groff
glabra
hutcherson
loong
wallin
labored
karakoram
reinterpretation
lessee
mimicked
tolbert
cherub
misinformed
broadhurst
plunket
mewar
disarming
derision
adriaen
sittard
allin
trond
checksum
sublimation
cyrene
interdict
rinks
kogan
undifferentiated
mccoll
mcclatchy
punted
chastain
humana
grafted
lissa
llanos
mcallen
paneled
byproducts
bardo
lollapalooza
euphonium
cowdrey
perversion
apollinaire
hiker
ramanathan
masako
replicates
hisashi
litvinenko
kabila
ciphertext
debunk
payers
ballesteros
silks
primed
magnetite
savar
jumpin
cutlass
stratos
retirements
lifelike
pbc
halesowen
timeform
tigray
neuer
sarandon
yehudi
mutinied
maac
gnp
bahman
esterházy
cloutier
albumin
bulwer
hominid
abdus
tcg
salutes
peuple
naoto
unseemly
dynamism
rakuten
crispy
alarcón
tatjana
orem
hegarty
stroudsburg
breivik
caracal
foi
sirte
althea
camorra
inextricably
snares
paavo
ansett
fusco
repress
regionalist
dirge
hierro
mcduffie
crutches
leonie
coots
mwe
histamine
ecr
bloodbath
gdansk
sasa
ishihara
dodig
clostridium
jair
carcinogens
mädchen
vajrayana
dispassionate
tulips
enders
ihor
burchard
neuburg
lindstedt
coriander
modigliani
occuring
okanogan
silversmith
palaeontology
photonic
godunov
haileybury
uninjured
tropez
eka
emmeline
reitz
lewandowski
avilés
infidels
voicemail
sassuolo
generalist
danubian
simo
waleed
guesthouse
matkowski
endymion
telltale
colourless
tullio
arendal
tamed
németh
iras
kuni
shand
hydrochloride
alleviation
ghazali
merz
yonsei
ogata
stol
detergents
mav
egregiously
chacarita
camphor
ghul
siddha
geni
rawdon
flattening
sailplane
hornchurch
madrasah
dionisio
canisters
vexatious
dla
scapula
jono
sereno
misawa
microcosm
gabashvili
lully
falsifying
rous
supplant
carthusian
plaines
cannock
sepultura
alençon
hoffa
ratzinger
sedation
nemours
irregularity
barbora
orangeville
biogas
whey
dreux
overstreet
finders
nears
niv
helmsman
raghavendra
fluidity
koon
jez
clogged
coolness
regnant
roti
highwayman
goldfield
manifesting
socotra
leeson
daum
freiheit
truthfully
closers
chishti
khamis
eraser
nullification
stews
stillwell
heli
saucers
ullrich
cascada
mangan
cuernavaca
lioness
stabilizers
alpi
parkman
austell
lithographer
séance
unrelenting
zanuck
levu
tantrum
sandhu
rufino
depopulation
shum
laurels
disintegrating
rifling
romp
dulko
midwinter
glc
fordyce
marceau
helipad
glassy
braunfels
haliburton
biogeographic
ruislip
glauca
mitsuru
taliaferro
ril
angara
monts
nukem
egy
tornados
phool
yehoshua
conceit
matera
preselection
napster
berezovsky
baltika
dawood
thyssen
amistad
tuskers
lleyton
immemorial
entitle
germanicus
bayezid
enchanting
curren
templating
apennines
rekindle
cooder
kitano
babylonians
indecency
diorama
ept
chloroplasts
cbgb
mattresses
friedl
astern
immunological
podcasting
shamil
domine
klose
thermo
trialled
carneiro
escorial
wooing
ffestiniog
giambattista
neoclassicism
fco
patan
peltier
lehtinen
onegin
rusticated
bohun
qwest
embellishments
rendez
boarder
bouncy
cotto
bhangra
gallifrey
radially
cripps
fisichella
paprika
rossellini
hawkman
rutherglen
recklessly
maisons
karlstad
lelouch
alcibiades
cradock
helpfully
sequestered
monogamy
simla
rhiannon
strays
sliema
chaining
lavin
midden
pasok
spelman
sado
rosamund
lingered
fama
madhava
wessels
gusto
unclaimed
quelques
kaka
huis
epcot
adar
reproducible
westphal
phallus
orman
centaurs
forthright
tsi
kneel
barua
coves
idc
trailblazer
fpr
mckellen
colston
panjang
workgroup
rodez
halliburton
bedridden
whiteness
nirmal
draftees
libido
polymeric
dubliners
bemis
omens
spinks
encores
threes
impossibly
diablos
caux
monotype
durie
darden
nem
placate
rocko
flach
ullmann
drax
rationalization
fett
hirose
wallachian
cfu
marginalised
sachiko
painstakingly
rooke
congleton
schon
bronfman
ilha
bailly
parachutist
rounders
gpr
moca
hallowed
lithosphere
croker
abalone
sauveur
ganassi
ombre
polyvinyl
ciarán
bacall
caracalla
pritam
memoirist
pilatus
archipelagoes
lacuna
jawahar
lpc
hofstadter
toscana
conjunctions
tanasugarn
meltwater
wreaths
geng
atg
dmu
carolus
switchboard
outhouse
appenzell
subsist
ligature
appointer
overloading
wether
logician
nera
bourassa
statuses
nhu
haemorrhage
stimulants
sejong
bassi
dostoevsky
russes
turnhout
promiscuity
boult
strangest
obstetrician
zentrum
hurrah
iberville
egger
musee
dvi
escapist
pewter
nissim
civet
franko
crampton
rais
jammer
phallic
biomarkers
abuser
ephemeris
souths
unprovoked
disproven
mattia
pavillon
plausibility
honorifics
katsura
daihatsu
submitter
delbert
hato
mcinnes
fuscus
spadina
icosahedron
cadell
icm
nanna
purbeck
misnamed
periodontal
ghg
numancia
gove
marmion
apostrophes
aldehydes
putrajaya
retracting
dockers
loosened
arango
purim
asakura
sectoral
chieti
befriending
bago
impolite
minsky
signe
unreviewed
yaqui
vocations
beanstalk
pyke
dianetics
lanzarote
molto
humilis
aleutians
newsome
notaries
calibers
subroutine
shockley
halperin
huda
bung
oleh
fuca
alli
sorrel
leukaemia
monoamine
unimproved
recoilless
kickapoo
officership
counseled
earnshaw
inflationary
carbons
outrun
pentecostals
simonson
nevil
jepsen
shopkeepers
munshi
canales
sweeting
toya
capricious
garin
grg
talleyrand
photovoltaics
tpe
vvd
weta
bysshe
idlewild
altamont
clannad
pennines
bdp
taliesin
kunz
lodhi
ched
lra
dissuaded
bartolo
nema
wipes
blunders
staves
heavyweights
sabo
reiser
chartreuse
synthesised
borderland
amityville
cramp
sores
calles
rebuttals
ictv
etcher
usnr
lsa
appian
protozoa
cocked
itasca
offshoots
divulge
citron
doron
macmahon
bato
moslem
littlejohn
wildman
rivoli
folklorists
hippos
bilge
ingrained
sketchbook
burdick
pensioner
bandon
escapement
ambala
eroticism
gopalakrishnan
beria
wrinkle
enticed
palpatine
kqed
tynan
jumble
noboru
nickelback
samo
eap
lavoisier
petrucci
atn
vesna
hiatt
honiara
leeuw
tejas
aalen
infallibility
bursary
hgtv
cmb
pnl
niamey
lynott
caruana
jap
museveni
santino
periodicity
ambika
kushiro
acetylene
casings
roped
capa
joffrey
bullwinkle
ecumenism
gymkhana
liturgies
sofer
carden
pinder
kudo
modulating
unrepentant
banton
jeri
metronome
guia
palmar
arkin
grooms
mangoes
gulbis
nyon
refutes
cccp
hyperplasia
peeling
hombres
lofton
bereg
detonates
firpo
cheri
atsc
resection
bevel
felonies
prion
adaptability
pasolini
ssw
severo
zines
seawolves
voisin
asx
ouen
generis
coffman
bcm
littlewood
uncountable
marden
interventional
liangshan
polypropylene
underboss
swarming
andika
söderling
claypool
oatmeal
survivability
patricians
armistead
wallop
biak
moult
kcvo
meditate
takamatsu
pinkie
lawlessness
jawa
googles
darkroom
testis
weirs
skyhawk
findlaw
sprinkle
supercentenarian
nwo
forrestal
affluence
bmj
sandwell
arnaz
tvp
watermarked
fealty
tailgate
norval
sardis
pestilence
moncada
areal
birkin
abdullahi
désiré
imitates
velazquez
newlyweds
mov
wels
bayswater
varèse
reticulata
enhancer
oratorios
kil
hrm
heute
loess
rectification
orchester
juste
overprinted
pel
crocus
paulinus
cydia
puisne
pappy
enosis
cliffe
vaccinium
cala
savers
androscoggin
showgirl
pna
airbag
pisani
janina
landform
goof
rogier
stille
está
theresia
hew
bopanna
akhil
caguas
testbed
electrolytes
kaga
camcorder
kalev
ruthie
andere
hoey
waalwijk
congreve
wart
msps
tomic
venereal
choristers
magill
rafa
afterthought
merkey
teda
interviewee
veli
lashley
documenta
tasteless
parfitt
californians
encyclopedically
monarchical
twelver
plainview
alchemists
nett
truckee
tinamou
immaturity
naturel
interrogative
calvinists
kahane
iturbide
montesquieu
troon
indebtedness
cnut
shivers
parlors
gleaner
paulet
growths
crave
kutcher
machinegun
powdery
rafah
boatman
troika
armature
dairies
sali
whispered
mercure
similiar
upholds
collectivity
coalfields
proofreading
astrolabe
shoplifting
sigel
gulzar
woodburn
holler
jaramillo
electrocuted
nicoll
kaneda
harlingen
ramjet
fpga
manmade
saranac
egm
bib
oculus
olle
invictus
eres
mcdiarmid
backroom
hippolytus
fujifilm
prankster
yoshiko
verte
tenuis
tfw
patella
secretory
cranford
fuerte
cataloged
khai
avangard
querying
deserter
vaishali
taíno
inedible
ejecta
changeable
standup
yy
agnus
cranky
perforation
imago
meralco
neurosis
mudaliar
zod
barbers
misquoted
oases
soren
dra
tilghman
cucamonga
strangulation
entebbe
gilson
machu
flamurtari
propagates
harborough
blatt
escambia
habe
mickelson
pujol
margarine
wyvern
schemas
husk
spaceport
melodica
latte
lindo
kak
temasek
reapers
smrt
elegies
clemenceau
gramercy
fet
undesignated
egmond
thermostat
shauna
ziff
kohlschreiber
willys
mems
afshar
endoscopic
arapahoe
vaudreuil
olmos
kio
kupa
patnaik
ashura
bulkeley
telos
assassinating
smokehouse
commendations
crone
bustos
dominicana
hoss
vitor
cupertino
pollux
sanctified
dud
herning
dialectics
pallets
effortless
henkel
veen
ileana
rimmed
damián
resetting
fittest
spassky
allee
untranslated
chemung
xpress
atheistic
kunstverein
cusps
petkovic
wonka
nuñez
prakashan
studer
bomberman
morant
graaff
ulnar
publican
reintegration
waheed
hashemite
chc
kyustendil
ehsan
bidar
temperamental
interments
stinky
valeriy
nebuchadnezzar
ajc
keitel
orth
mathieson
grigg
tormé
postsecondary
wef
banten
webzine
librettists
heritable
rohe
bachata
sinead
alleviating
lafleur
hambleton
senile
purba
cabildo
foxnews
histogram
yoyo
crusading
midweek
ceti
verges
agüero
circumcised
klerk
fauquier
fein
yakult
prosthetics
mineola
sharada
takeaway
hipster
sisak
abject
mickie
incipient
deepika
figueiredo
couto
hira
smokescreen
simmering
insinuating
conflating
hemispherical
snapdragon
hah
edelstein
killen
oop
kuang
mapper
kif
vallis
hindman
bublé
chambéry
ferretti
teotihuacan
rostislav
melfi
sunspot
skinhead
miser
radiography
hypertrophy
vta
burks
hillfort
jcpenney
antibes
ferrocarril
tetanus
popp
rosol
doan
bilecik
otranto
amro
judicature
stanwyck
adress
artaxerxes
leninism
agama
gropius
oddball
worsens
evp
nanning
bardot
ahar
integrator
dois
withering
vps
quadrants
extraterrestrials
takuma
fruity
nant
moffitt
rox
spanky
caan
gotra
extinguishing
polymorphic
albertson
mucho
separations
thallium
damselfly
ringtone
foro
homilies
branislav
blockbusters
extravagance
refurbishing
falkenberg
jadwiga
asan
hehe
forgettable
hardcastle
egremont
burdwan
tsb
glimmer
carb
braham
notifies
ingle
aalst
jerkins
eod
cgr
adamantly
heiner
cracovia
heitor
rda
très
allred
eaa
universitaria
lawyering
antigonish
janney
unpopulated
sonatina
buna
galeazzo
polygamous
excavator
goalkicker
mula
decals
smeaton
vityaz
thornycroft
eustache
daniil
spellman
kir
birdsong
constantino
usatoday
waratah
exchangers
imac
dels
heirloom
bleached
wallenstein
scola
rosenheim
comtesse
felons
befitting
wokingham
ehrenberg
obscures
sohrab
parka
dahmer
gonçalo
canandaigua
rcp
buno
duns
lavra
abysmal
exclusionary
lafferty
syndicalism
philatelist
rytas
interferometry
bramwell
choudhary
fluoridation
utd
sobel
elston
girton
abductions
reinaldo
crankcase
kazuma
merchantmen
simcha
instigate
hearths
tgf
chhatrapati
sif
elitism
alif
disembark
aci
taguchi
glial
zahid
yuliya
yaoi
margolis
cera
microtubules
reynard
turnip
shiver
aback
horsens
killah
avc
lier
sns
branning
enya
thoth
goldsworthy
sverige
leyva
bruch
godalming
geolocation
ccha
urbain
dado
geophysicist
barret
scouring
lawford
reaffirm
assen
andino
mausoleums
frits
endeavored
lavas
immunities
eba
jaundice
norrbotten
plimpton
spangler
coalesce
kahlo
suffocation
methylated
samarra
shimoga
overlays
vikramaditya
scallop
brabazon
polyglot
belvidere
onda
histórico
maumee
carlota
quart
undistinguished
snellen
satirized
kokoro
midsomer
rapoport
eubanks
tredegar
lutea
rauschenberg
rheumatic
fabián
mendy
ryuichi
cipolla
gns
grated
oulton
autosport
hagley
yamanaka
follicular
davina
hanne
injector
toyline
hoorn
reino
sok
xeon
pillage
godiva
orientalism
reprehensible
domo
existentialist
roane
pana
jessy
knotted
sram
smits
clotilde
unseeded
inductor
borderers
hermaphrodite
sternum
russe
ambrogio
critera
sajid
bauder
conestoga
naqvi
kieron
tfs
colonizing
mccarthyism
nacogdoches
rockne
sukhumi
exportation
urination
allendale
hardwoods
eateries
yas
flavin
halfpenny
valentín
idem
optimised
kuki
bangladeshis
malformation
ngati
mell
dauphine
roadie
pegu
microelectronics
yoshiaki
rheumatism
quimper
casson
kucinich
intelligibility
suspends
alberts
sunnydale
phrygia
ellice
sete
haplotype
noirs
kaas
sapir
jeffersonville
confining
goldin
rohde
badal
chicoutimi
scriabin
vaccinations
poste
agathe
harbaugh
jagdish
duro
hartnell
niosh
coax
gira
mortis
grado
compels
pritchett
cleaves
schaub
bettis
sriram
suggs
ico
dissonant
buckler
coachman
saïd
miron
lohengrin
pimps
macauley
antioxidants
guoan
esse
huddle
discursive
rivne
borehole
mohini
premonition
insulators
tpc
travesty
kryptonian
confounding
betel
civilised
petula
aei
corot
canciones
neuhaus
leche
sylva
lorre
takayuki
languished
visby
denounces
bulgakov
yamhill
pavlova
tallis
phs
gari
chenoweth
complimenting
chapultepec
diethyl
blowback
pepa
cookson
chileans
quelle
menotti
elmendorf
gestational
btcc
pissing
esme
zips
invert
greenback
zagros
vodou
moros
ludendorff
djinn
mcb
theologically
montalvo
unassuming
agp
twitty
otome
zai
agusta
frat
yankton
orchestrating
bulmer
castañeda
illuminates
macchi
scientologist
refill
classifiers
tattooing
artisanal
misfit
scrapes
crombie
jokerit
subtleties
symbolising
rechristened
balaban
lukashenko
suntory
zoya
weighty
lavelle
phoenicia
affront
wilander
javad
inflating
hyperactive
enric
lumpy
otsuka
callus
robey
dhoni
circumvented
piaa
boles
fujairah
chinchilla
koshi
agricole
excrement
amrit
sergipe
harington
reinterpreted
uesugi
broods
hislop
fingerboard
pinacoteca
mccray
smokies
undeterred
boaters
wilkin
madhuri
leisurely
nicolaas
polotsk
páez
stauffer
reassures
nyack
stratofortress
gagged
inti
tarik
sulzer
batons
aerials
zedd
décor
tura
montenegrins
maronites
argosy
mcginty
schulman
bukidnon
khadr
axils
kerstin
gsk
buddleja
portraitist
scorned
wraparound
yala
quevedo
mongering
mannar
chungcheong
seacoast
emlyn
smriti
elson
concerti
biplanes
siete
dunster
heparin
donat
perturbed
zahedan
fido
unrecognised
fid
overcoat
surfactant
hefei
roza
orix
foursome
spearman
eni
parnassus
deployable
angina
brownfield
initialization
sangster
ointment
berenguer
dresdner
galactose
attendee
publicise
torts
luneng
zardari
battenberg
snag
hfc
sandburg
iw
eze
nouveaux
schist
outperformed
crudely
feroz
tati
sdram
istomin
vérité
restrepo
cath
monge
shamokin
boylston
elope
evolutionarily
bakunin
llagostera
bgm
spellbound
brentano
regretfully
rivets
zwingli
equatoguinean
aphid
laetitia
virginians
vulcans
fsm
alexius
literati
miyake
anthologized
biceps
bandaranaike
denzil
prestwich
osgoode
tutt
vira
zebedee
peculiarity
zeman
kinski
rhodri
gaucho
fijians
marmot
nima
marinette
florist
santarém
maxims
konica
unimpressive
stuccoed
pitiful
ulema
umbra
eötvös
uup
schaller
precondition
rosea
rossendale
tearful
alhaji
majoris
ryman
keke
nob
belk
goodrem
trg
verhoeven
occultism
raimondo
berkhamsted
fevers
claridge
cather
anic
nitty
cerebrospinal
perches
estrangement
olbermann
ahram
minho
insensitivity
doğan
microrna
rouvas
navidad
heartbreaker
vall
bellum
outland
weyden
pati
laa
byways
hickson
vios
ndr
gavia
kla
efi
ammonite
powerpuff
arachnids
padmini
mediocrity
petraeus
vizianagaram
stifling
allegra
facile
shoaib
spotty
kushan
amédée
firma
gervase
negating
thornbury
atkin
sikandar
stn
ulloa
overlain
delineation
bottomley
pada
lauzon
gossard
ricki
kinsale
plantar
flattery
mechanicsburg
medlineplus
valdes
ipso
wimax
yasir
akagi
quartermaine
feodor
rekindled
canadair
silvester
hatchlings
bast
khali
tubs
tico
minimus
aljazeera
benched
dormancy
karlskrona
ecco
borno
hite
uttering
harbored
timurid
cno
lashing
cheatham
chamonix
nds
mymensingh
tiebreak
bsb
uist
stonington
curls
newhaven
sdl
variances
messines
lamprey
vauban
conjunto
pizzeria
gadd
subantarctic
awadh
angustifolia
perkin
fasteners
skupski
gunship
distrusted
scalloped
smiled
payday
supercoppa
usefull
nukes
frankland
mimosa
dernier
cantopop
dominoes
bookmaker
plying
archimedean
hyped
aerobics
chert
kisan
lisieux
heelers
kubica
ethnomusicology
specks
gossett
purifying
velho
caesarean
deltas
obasanjo
bentheim
swr
allt
puyo
alisha
miscegenation
hibernate
drusilla
figueres
boudreau
magica
unforgiven
shreya
verdon
babos
musselburgh
outgrew
benoist
legalizing
theresienstadt
ebdon
protrude
herbivore
tucci
accede
fortieth
natan
pacs
hooky
wilbraham
sprinkling
oya
handsets
gell
mardan
macdill
antonelli
gamasutra
kilian
waveforms
heartily
mancuso
plumas
standardizing
firings
meson
cheb
kufa
interlingua
dilla
epinephrine
nastro
theroux
keira
incessantly
rahi
mensah
methylene
brodeur
hibernia
doren
beckmann
stares
vidarbha
woot
riis
palmieri
cumbernauld
babysitting
oyama
phosphor
guava
withstanding
xenu
defraud
equipments
lepidus
affaire
muséum
hydraulically
grb
abadan
whitburn
justifiably
cres
trinita
rkc
culpepper
pansy
biographic
argonaut
bygone
macduff
jinn
hideaki
gentler
satchel
considine
meaux
undersides
perri
mamas
mehran
rct
hualien
scarves
sarita
profusion
lovelock
chace
downsizing
extrapolated
ager
lich
hazelton
fag
kilroy
cupa
barenaked
drivetrain
gutman
maggots
bct
cauliflower
nbn
flirtatious
haft
afton
resp
martyrology
sleight
welshman
equilibria
borodin
barricaded
castrum
congregationalists
omb
eloquently
kellen
ackroyd
beith
issuers
supercomputing
hanes
wmv
vvv
medellin
gtv
uninhabitable
ilias
clarinetists
heliocentric
redvers
mairie
mosca
refitting
calderwood
hauer
rieti
hangings
dinara
alemán
uloom
prodi
centerfold
plowing
nbs
unwavering
beechwood
tyldesley
theoretician
aravind
duxbury
naturae
eloped
rigaud
guttenberg
achievers
dalmatians
personalised
ordain
dayak
andrija
stg
clonmel
vied
heretofore
thrusts
divest
shepperton
akwa
dumpster
naberezhnye
quiñones
manatees
someones
slimy
delineate
mossy
tongued
ruggero
paroles
palenque
handbags
manisa
shader
triestina
rjd
mildenhall
unleashing
defections
maslow
anchorman
pluralistic
slipstream
reddick
sigint
suga
vieja
harpo
krogh
syncopated
cockpits
condos
slovo
manasseh
ribes
paneling
elma
biella
danner
helpline
garlands
wulff
garbled
cotonou
rifts
spr
émigrés
rara
aguila
baggio
rog
mii
tepe
encino
carpio
lifeguards
harrigan
chloroplast
foolishness
bonelli
erving
nobis
mandan
anisotropy
nitpick
valiente
hardinge
annam
swale
scops
cerda
apoptotic
sastry
webkit
leur
loaves
hanan
fws
biagio
uncompressed
showrooms
emerita
bna
diadem
extruded
kristiansen
sayid
batik
fremantlemedia
ded
beinn
zuckerberg
kitakyushu
eoghan
angiotensin
dort
christology
multichannel
mottos
debarge
firecracker
toscano
sociable
qualitatively
supermajority
ussher
passant
kubot
popstars
devouring
finlandia
fwd
hdz
carbone
curveball
durian
phantasy
viernes
admira
haughty
newswire
así
untamed
mbp
undying
demetrios
chukotka
kofu
swedenborg
transits
yanni
giorno
kurnool
topanga
netaji
knightsbridge
paltz
heatwave
schuller
yawn
kreuzberg
ingvar
sintra
rivaled
galesburg
teuta
loosening
millais
corporeal
lundin
antiseptic
battering
simcity
stallman
altus
shipton
straying
bourdon
alpena
cpn
costanza
ipsc
goodell
inuktitut
ballparks
berton
manne
ministered
semicircle
mcneese
tereza
maisonneuve
scharf
dumbo
darbhanga
mosquera
lexie
hashed
lerma
wsb
passos
sakata
libris
pushcart
kenzo
sella
birkenau
scanty
halep
codeine
mohsin
lessening
delinking
iconographic
comunicaciones
macworld
deciphering
jewellers
zayn
quantitatively
cased
fau
jamshedpur
atlus
teleported
surfboard
josepha
atromitos
solvers
baig
earthworm
prelature
hatun
gerasimov
alai
delinquents
assizes
centraal
honing
emboldened
misappropriation
tulloch
clarks
mazatlán
headroom
publics
debi
muti
démocratique
lunatics
araya
wombat
abcd
tengah
svd
rainstorm
brachiopods
baddeley
bautzen
stalwarts
extort
friesen
laney
ashur
camerata
blighted
kii
manzanillo
hippocampal
molino
visitations
namgyal
territoire
daiichi
viti
darussalam
gobierno
skyhawks
cornering
saccharomyces
halsted
refocus
macerata
marlies
bascule
professing
boos
jablonec
schnell
wetting
mousavi
scp
gakkai
flacco
lorin
amores
canseco
spinster
stang
conceals
bunches
sauropods
huth
darwish
joinery
rattling
carlile
ziva
whitestone
nfa
evi
nrj
stanislavski
winslet
delgada
dunhuang
corinna
olmedo
joannes
sakya
inkjet
floored
otley
varner
dehydrated
mouldings
fedorov
jutta
mcavoy
newquay
uplink
maharana
saman
passivity
workouts
rasta
horwitz
humorists
faizabad
harries
prestwick
olmert
kaun
starman
functionalism
woomera
bms
eruptive
spousal
funen
prakasam
scolds
waikiki
enriquez
xxxi
fbc
brzeg
factoid
twh
motorcade
thirsk
suro
caixa
satyam
turners
outlawing
corpo
tommie
lonergan
quisling
patrese
kumaon
demented
benediction
seedy
howson
tacky
gryphon
tanabe
gola
paragliding
volendam
graciela
earthbound
keely
monferrato
openbsd
sng
athabaskan
cammell
allston
rueda
carbonic
jacquet
wsm
sanna
biddulph
suh
perret
batsford
hulbert
raves
leann
friederike
ndc
moyle
indecision
touchy
gaudí
angélique
roker
unscheduled
nocs
gann
linings
prescot
bores
zakharov
protectionism
aktobe
splintered
prolonging
juden
lufkin
reassurance
hashanah
reni
bleacher
evaporates
simonds
devries
hort
sholom
warners
pollak
kasi
cajamarca
iftikhar
photoshoot
cruyff
sagarmatha
jeopardize
mammadov
menem
sisu
syncretic
reposting
alwar
revoking
hebe
splinters
thiessen
covina
jarkko
seldon
strutt
exclusions
ambiance
antipsychotic
underlines
cassin
azim
lantana
sib
poehler
roop
fln
doohan
wmo
gouverneur
aitchison
sime
conceiving
wildebeest
chelny
acolytes
domer
shipowner
frictional
solms
senza
mondiale
rfi
maven
scour
maruyama
asf
stocker
wisteria
georgescu
impatience
groundhog
vissel
siemiatycze
tighe
waterpark
ruddock
annexe
bodin
eyeglasses
bayfield
neh
marginalia
industria
minigame
benjamín
goncourt
pirot
refurbish
gemeinde
hirata
quesnel
caricaturist
faisaly
orcas
strahan
mele
mañana
vijayanagar
atria
queanbeyan
lieb
aras
reconfirmation
carburettor
cardio
orellana
vanbrugh
fractals
reminisce
mostyn
ucsf
paraphrases
prine
lozère
pisano
prowl
monastir
implicating
roraima
firmer
acqua
jaina
bernt
rpf
joop
abercorn
pubescent
modernise
graziano
walkie
bosman
withered
turnstiles
basti
thrashed
pozo
trac
dunwich
undemocratic
shota
warminster
wedded
hilario
castrated
soundscape
megapixels
iap
erinsborough
niel
ammons
husseini
shackles
dashwood
lafitte
lautrec
repudiation
pekin
kamrup
geraghty
mohinder
mummified
kalki
maclachlan
prerogatives
tautology
stefania
langevin
kurd
jørn
owsley
ramaswamy
truthfulness
rovigo
photojournalists
sarkis
harshness
apostolos
montag
bls
bulldozers
repugnant
nailing
weng
lipped
somos
contras
aubry
dsv
defensed
undersized
mugs
cloche
henze
evictions
dabbled
cvg
leys
vma
oku
irrigate
kanna
ferrante
chitose
frunze
murrumbidgee
tinea
tilapia
kodama
npd
hakone
jace
polideportivo
patrimoine
radovan
symington
yoshikawa
universitas
bbva
avoidable
nahal
wailers
sterilized
worrisome
amoeba
crj
sabor
norristown
kindersley
tearfully
blumberg
charlesworth
homicidal
myelin
vvs
matsuoka
foci
salesmen
tenzin
nadja
echidna
morphing
gratis
octahedron
‹
spotless
burhan
sov
ottavio
linklater
srl
stirrup
contingencies
rance
clásico
playbook
loathe
isf
baynes
constricted
extinguisher
golson
orff
rauma
materialise
ppa
kiedis
gremlins
chanter
odawara
ovals
miyoshi
devan
phaidon
mcintire
⅔
topologies
sandow
grappa
glace
msk
hilson
superconductors
bohn
makepeace
stuckey
formalize
elude
marginalization
keratin
shambles
berates
roxana
mariscal
bolland
apollonia
manalo
carpal
wetpaint
smb
sardines
collinson
bie
heidfeld
collings
saudis
komm
suan
reproach
gulbenkian
rotator
gauchos
impotent
trl
tribulation
londres
unaided
blazed
tetsuo
longed
obata
madoka
jamuna
eland
destabilize
hawick
mahé
malina
speakeasy
synge
beekman
mcmurtry
newhart
weasels
migrates
kwara
transversely
shadwell
shaul
pieve
orta
futurity
manlius
officinalis
internationales
infidel
inchon
irrevocably
gramm
generalizes
transferase
michaelmas
exupéry
chetan
glycolysis
hatay
texel
electrocution
pelts
immobilized
hokuto
betjeman
apprenticeships
wheatear
cosme
coalesced
sacrum
amoral
keweenaw
whims
hypersonic
indivisible
smg
corny
tseung
scindia
naar
mcinerney
floriana
fontenay
lgm
eskilstuna
childrens
hinkle
paulino
branden
penfield
decontamination
glories
wimborne
dishwasher
muscatine
foreskin
supergiant
wario
ibu
dovecote
watauga
kateryna
valuations
était
agusan
fixated
queensway
exhilarating
bootcamp
brocade
appa
readmitted
renfe
devereaux
mvd
pns
fainting
alauddin
plowman
caliban
summerville
recapitulation
sattar
catullus
mikheil
compactness
jaxx
ghouls
mucosal
apathetic
enquire
macromedia
proofing
mle
attar
rino
cruelly
paraplegic
avm
marché
excelling
chimps
frontières
keying
zulfiqar
aji
fiefdom
tajikistani
zayas
shrugged
zin
vliet
tonopah
marvell
aditi
keels
corticosteroids
granados
pharynx
nucleation
demoralized
surah
haya
rêve
overpowering
extractive
hubertus
ophir
suwannee
minella
zemo
workbench
liane
hashemi
izumo
marsa
haggis
amesbury
plautus
metin
predictability
geosciences
eyewear
bartlet
fé
bourguiba
haswell
springvale
sweaters
vaulters
arty
macadam
worships
carnivora
schick
coincident
postpartum
bereavement
sentosa
sede
balaenoptera
juillet
dressler
pse
venera
berchtesgaden
welton
concierto
cheerfully
mdot
nods
punters
berthe
maracanã
spt
chubut
stabler
grecian
communicators
maint
vala
arousing
brackley
motilal
celebes
goyang
kalamata
exhibitor
transfusions
pisgah
bbwaa
enqvist
maximilien
soundscapes
brunson
kore
litigants
semis
vell
recanted
belasco
kashubian
galls
custis
capablanca
bregenz
mammoths
schutz
mcglynn
worrall
cavour
lob
malkovich
pompano
aspartate
substations
grudgingly
steaks
trikala
drover
sufferer
pompadour
yevgeni
vojislav
yarbrough
cordless
lycia
pospisil
technica
newline
rabbinate
thingy
revivalist
delicacies
steadfastly
shenton
rousseff
doings
downbeat
riche
lauryn
shaan
kinematic
airlife
qq
chequered
collectivization
proliferate
bedded
telegraphic
ecclesiae
salles
squawk
xxxiii
drivel
dinajpur
lakas
gangrene
integra
piercings
irfu
bama
hotbed
zither
amiri
dimmer
zooming
ostracized
lightened
extensor
guaira
regurgitation
llm
shahrukh
bertil
kunstmuseum
irritate
accordionist
overviews
forthwith
pissarro
kms
resize
pertwee
aprilia
mogi
cyl
rosenblum
sunray
culpability
juliane
bankrupted
perdido
gchq
frg
prawns
mycology
bijar
myrrh
torreón
bormann
naja
asphyxiation
supa
trunkline
neuman
larks
quackwatch
khoy
kircher
taney
ravan
randell
wordless
wyandot
nachman
heroics
flintstone
roundly
spiteful
alexandrine
posturing
sanctification
basta
spm
lakeville
isamu
naz
tvxq
ponts
rubies
janos
murphey
boardings
hopefuls
vq
caputo
decoys
dyna
pujols
gazzetta
radoslav
rew
graafschap
arbitrate
solvay
inp
jungian
istana
moulins
humes
scrubbing
mudge
jolley
bocage
entitlements
bandage
floodwaters
broadwater
cpo
koyama
samplers
foodservice
penile
sabretooth
sunt
tebow
nazca
redshirted
helsingør
condiment
edifices
caloric
headman
arrowsmith
akt
pupae
manufactory
ramapo
hpc
profiler
banger
alkan
vespa
cowbell
pavlo
disobeying
novartis
broadsides
hongkong
exaltation
vitruvius
stanmore
omnipresent
nayaka
ihs
eurosceptic
deftones
souris
hohe
myint
lesage
grandiflora
rôle
stedelijk
pleasurable
colonizers
unabated
cvu
asuncion
yoshiki
philistines
gonsalves
malaysians
beluga
marly
corgi
swag
biophysical
pordenone
vegetated
diya
telepathically
downgrade
satiric
cheikh
billingham
ostensible
sociopolitical
uhuru
luque
creosote
punishes
dreary
cubitt
bioscience
rectilinear
lamentation
ozu
amours
stad
reverie
hanssen
chota
baldock
connemara
agitator
swa
fiends
brokaw
freyer
chore
accumulations
rackham
spee
hoar
totten
fela
unveil
tfc
buccleuch
regionale
mazes
treme
ameliorate
murrell
samaras
lighters
leadoff
impotence
farwell
charmer
npg
famas
nisan
lanny
attainder
bickel
thermopylae
dinwiddie
wouter
carrow
duiker
ubi
hammadi
gelatinous
manresa
evolutions
exhaustively
desperado
alejo
sugita
buttermilk
carel
inferring
katyn
trp
upturned
cronenberg
agrawal
dagon
pythons
lawmaker
panhard
fillers
concertina
leprechaun
presumptively
contactless
reassess
longa
agaricus
hilarion
wampanoag
mcclung
oppositional
juve
hermine
borax
impressionistic
lymington
trapezoid
housatonic
hinkley
elspeth
eisler
italie
stauffenberg
lids
distorts
brunch
bila
underscored
crosley
oko
andesite
dislodged
roméo
lajoie
romanus
reuven
tentacle
sergiy
pyridine
adjudicated
cellini
attilio
guardsman
lande
adamawa
unwell
proconsul
rockman
wain
shalit
megiddo
mef
vitaliy
ojai
zemlya
sangamon
homenaje
editorially
christoffer
soko
wrappers
placeholders
yrc
indulgences
swordsmanship
câmara
trop
belew
pathet
bisects
secker
mendelson
bhawan
boycotting
directorates
fula
goalball
detmold
baited
salieri
aerosols
shroff
marita
balad
zt
poh
selhurst
koufax
republica
granitic
talus
partita
infestations
godhead
paik
kirchhoff
disheartened
parasitology
communicable
gopinath
digg
marxian
layperson
ballooning
subsidize
strat
universals
microarchitecture
reminiscing
fritillary
kosh
advertisment
frederiksen
enchantress
gev
ryland
dds
mielec
pidgeon
brus
pentecostalism
tillie
pederasty
nisi
usafe
catchphrases
toodyay
aldwych
cataclysm
strolling
vivir
kohlberg
cpd
vanya
zaria
dioecious
unstaffed
ksu
crestwood
teases
tenggara
handicaps
moz
masayoshi
manistee
zakynthos
ajmal
funhouse
agronomist
diaghilev
âge
jiri
transmutation
tirtha
greendale
helpmann
chuckle
pharoah
mhs
imperia
harri
bastos
lcms
nodding
yearwood
drogba
pauley
eckstein
dupré
frederico
animas
karimi
wargaming
optimizations
shrill
afresh
libertine
lgpl
lemoore
heterozygous
lalor
criticality
deism
weatherboard
mosquitos
haydock
casks
provosts
troubleshooting
bellarmine
tripe
sanh
zaheer
ditton
typ
hersey
northport
bresson
liezel
blohm
fabrications
alberni
spoonful
relents
bruegel
insectivores
contaminant
steinbrenner
muna
aadmi
amri
espouse
atb
anatol
cantilevered
thunders
micron
lbw
sankar
latium
canopus
friis
fingering
maddux
aznar
tolled
gnat
spectacled
ventricles
deviance
doobie
delving
cerrito
staffel
panthéon
consents
bix
precipitous
otus
roig
stakhovsky
benares
usu
jabez
birthing
jesenice
minab
kray
azs
beneš
cornhill
hammill
satirists
merchantman
purposed
levis
lanois
showground
agadir
flaubert
polonaise
bugsy
foghorn
krumm
banteay
estación
hov
reflectivity
sittings
amand
alexi
karenina
publica
tigran
stiletto
meatballs
ime
collet
zev
terai
vaccinated
perfectionist
shimonoseki
sacro
otho
trompe
whopping
hibiki
agnosticism
statler
riverhead
moskowitz
chinn
rejuvenated
swaying
dhanbad
bergin
edwardes
colborne
crowbar
roselle
parisians
executioners
univac
disliking
lesbianism
aswell
keeled
finisterre
interlaken
humanitas
akali
ards
unwitting
harmonix
karn
heterodox
niebuhr
myrick
roadblocks
twas
savo
lodz
critters
fce
intracoastal
iworld
wiedemann
inverting
bognor
srikanth
moats
sultry
wdm
junagadh
doa
koontz
apostolate
changwon
saarc
jyllands
modernists
rahm
infinitum
mando
railhead
resurfacing
kikuyu
dictum
gaff
zarzuela
anisotropic
balm
inevitability
celled
louse
bootsy
anurag
daniilidou
sansa
moyá
ecce
objectivist
shyness
crumlin
perle
marlena
jolt
enticing
kölner
fft
haring
buzzards
upsurge
redland
kooning
speke
setzer
gawad
ctesiphon
lares
volandri
rusted
endre
seaborne
tapa
dinky
simran
regressive
juin
ferranti
brownell
lochaber
droppings
takings
shias
calabrese
lox
typist
nassr
absolved
moraines
hemolytic
rickshaws
aah
sanat
troicki
wpc
barco
mudstone
combing
herpetologist
backseat
kaminski
hati
companhia
parmentier
abatement
perfusion
gautham
nozomi
zenobia
chippewas
dowell
chetwynd
advani
minutus
mcas
haviland
overworked
pleasanton
phenylalanine
stupas
joie
broadview
millot
cleethorpes
towpath
corley
cylons
whanganui
wau
indignant
perching
atwell
metamorphosed
caio
jerrold
cistercians
parisienne
lebel
flopped
svensk
jenkin
hallowell
herero
rheingold
bioengineering
searchlights
sectioned
vieques
bronte
eider
pappu
blindfold
proxima
polyhedral
indianola
masud
pounce
proudhon
dramatics
milian
rize
whetstone
vcd
inácio
taxiing
poynter
golda
micronesian
plasmas
goodricke
dummer
btc
strider
pmo
gela
hardaway
decrypt
triphosphate
disconcerting
faison
bowditch
bambang
chaps
moyers
madi
fortier
wernher
dubey
bindu
priam
glencairn
aep
glutinous
uhl
liger
evander
slane
orangutans
hentai
mugen
magnesia
sizemore
salud
ksc
morty
graydon
icelanders
queuing
disobeyed
illegible
parisi
unenforceable
rmi
cholet
farmville
tanna
bickerton
regenerating
strabane
whitton
sprouting
supercritical
druk
síochána
melchor
miodrag
´
sontag
stinks
ahi
wolof
farben
multimodal
otro
gaping
ika
cronus
kieffer
conjuring
rondeau
mingle
conflation
ahs
pretence
bombastic
hidaka
prager
harwell
tinkering
pajama
santé
sphincter
kazakhs
takei
arévalo
saltzman
sarum
kodi
pindar
standardise
hypotension
congratulatory
frosted
ndebele
nicktoons
phaeton
nol
ranieri
amberley
kokoda
creston
foligno
hemsworth
damas
époque
darna
salton
theocracy
lma
adage
pus
brumby
bezalel
toshiyuki
pinkney
conyngham
masoud
penalised
mbt
lisicki
ruprecht
qara
morn
lvov
frisk
mersenne
taber
escutcheon
collyer
jettisoned
badass
earnestly
actuaries
curiae
safar
mll
dershowitz
novelisation
arrowheads
statehouse
kiwanis
wuz
hoban
antes
saro
incinerator
basf
soames
polina
pbx
disagreeable
cobo
potchefstroom
katalin
attache
klass
băsescu
monkton
mudflats
archdiocesan
rohini
sergi
clockmaker
tpp
practises
funnier
funafuti
kull
miwok
macha
spectabilis
epistolary
wot
anstruther
squeaky
reimbursed
liebig
grupa
rosé
hoffer
cpusa
catskills
guidebooks
diamante
birdy
incantation
macaques
rages
jann
biomechanics
obsessively
superlatives
cys
tubercle
unionville
halden
burgher
tridentine
fessenden
snc
dilbert
laterite
recesses
tiro
ﬁrst
palanka
wenzhou
lugnuts
bsu
macphail
semiotic
cuz
rlp
paltrow
copts
queensberry
kapellmeister
gro
dramatised
preying
calligraphic
nambu
rajas
mencken
showalter
furlough
loja
prowse
marillion
terrains
lentils
hella
lory
hoisting
picchu
shelving
guaranty
veined
fuga
glossed
ratifying
eminescu
initiators
retaking
boba
fuchsia
metrodome
nees
lah
glorifying
sheaths
telemann
mammy
massú
tacitly
hornsey
ebel
desa
crawls
repressions
gilder
barnaul
taishan
firebrand
aurore
ors
eigenmann
pique
yellows
clipboard
glinka
yoru
picketing
ramage
vocoder
bharath
lemmings
cloves
unquestionable
piglet
halleck
touche
ailsa
bist
deforest
deke
cefn
delicately
acidosis
fertiliser
henna
seamounts
mouscron
anaya
astonishingly
birkett
mondrian
necro
junichi
dosing
mondial
subba
subfield
blisters
nari
arbour
abusively
spanking
tailback
juvenal
ced
dramatization
diarra
evocation
umpqua
tecnológico
earache
kinoshita
régis
mathewson
wreckers
barnyard
storekeeper
beyonce
milagros
apalachicola
speechwriter
foxborough
brierley
allegories
wawa
horwood
altan
resettle
cineplex
olave
frugal
jamey
bott
lullabies
recuperation
kellett
razorback
sculpt
landi
hibernians
subheadings
masaru
utero
granta
timestamps
clyne
bubo
bartels
azaria
pulchra
novum
xxxv
macbride
foshan
aten
norberg
dinas
ginkgo
tzvi
ueber
suppl
lhs
hedrick
bencher
didsbury
tincture
elke
prabang
jurgen
proliferated
rade
discards
tilson
bunsen
barbs
subtracts
letchworth
shamanic
buzzfeed
halberstam
infomercials
bannered
rufc
armband
gatefold
jewelers
wittman
kawamura
pingtung
biliary
dallara
lali
syunik
pleasantville
pogue
lefèvre
basking
tasking
segregate
pulido
basilisk
underlining
titusville
ionesco
quantifying
tarrytown
baltasar
critiquing
elfin
blain
nse
conny
cervera
ketchikan
interpretative
subang
martius
vermeulen
maracas
vivi
cashed
báez
bubbly
chatty
redknapp
piaggio
hannaford
mogens
prilep
weitz
dubin
tss
nuku
reams
thalberg
amkar
vocalization
pincer
purses
drdo
weise
paxson
mourns
shahbaz
collation
pastorate
belladonna
triste
envisions
disturbs
ginseng
fetching
newsgroups
unassisted
grazier
yori
pajamas
cavaliere
chagos
welder
azeris
même
sharples
sainz
friese
gumbel
belluno
pyro
faubourg
publically
tualatin
magni
songbirds
custodians
années
eustatius
logicians
lav
nanterre
colditz
simson
quasimodo
exorbitant
acoustically
mance
houthi
frontpage
friedland
jabba
fyrstenberg
clearings
dramatica
aegon
albinism
gorse
latreille
gratefully
mdm
zastava
gano
roundabouts
mullingar
cliches
ingen
gosforth
aizawl
evaluative
salk
ringers
destinies
tinnitus
niobe
logger
heeded
waxing
tash
prelims
quiver
bourbons
purulia
excavators
icarly
frontale
suma
airworthiness
kops
surpluses
trashed
catapults
seagrass
pienaar
fatma
tillis
bolelli
klas
hordern
masoretic
eavesdropping
brisco
monnaie
planktonic
microtubule
mcquillan
cassavetes
efron
flipside
mentalist
erzgebirge
abercromby
myopia
bromfield
birney
twining
wrest
kishi
fibroblasts
silverado
cofe
verging
ramya
dé
sfgate
stator
abbasids
imad
anhydride
indi
troglodytes
skokie
bork
gbc
cranberries
dudu
pozzo
baryon
montez
sesquicentennial
throbbing
wanderings
ambiguously
herrings
hse
trost
seeps
masterminded
prideaux
underpants
anemones
violette
gripped
calamities
pharisees
feds
titleholders
hortons
historica
wallington
eighths
imaginations
macneill
lindenwood
llosa
illegality
swayze
hundley
streeter
programing
phds
reitman
unpowered
jeg
pastrana
rebooted
contessa
nunca
strippers
ably
aare
wheldon
sago
sangli
chiaroscuro
defensa
kuro
juncus
holograms
chernykh
hemant
imitators
dormouse
needlework
wbz
goslar
photoshopped
issei
sentimentality
xxxiv
shinobi
berri
neda
landy
uzbeks
larimer
takoma
harsha
woodall
apertures
sebastopol
sviatoslav
bskyb
omo
platters
trebinje
invulnerable
koppel
seagate
handlebars
separators
narcisse
ligier
bostock
tachikawa
arvidsson
trevino
tbi
halevi
oceanographer
oxytocin
kitzbühel
aurich
cofounded
hoxton
whampoa
pis
gals
surmise
saddleback
inundation
flagbearer
ingest
cherwell
oblate
curvilinear
rémi
obfuscation
headship
nefertiti
carstairs
obstinate
frenetic
turgenev
esm
alpheus
extender
tomy
sabra
sira
sayonara
barna
quik
soya
choreographic
tacna
defiantly
berlocq
misogynistic
bayly
warhawks
nek
sneaked
mossley
artesian
isdn
gandaki
mylène
harkins
chera
korail
mimo
puller
rusedski
mirim
mung
mellen
confounded
khasi
boisterous
kamui
covenanters
pmi
holodomor
unia
florencio
brae
rothmans
jaspers
tadić
westmount
carlingford
likens
glaxosmithkline
caucasians
lakshadweep
tch
roches
bitching
sergiu
shorelines
denialism
collaborationist
clairvaux
treacy
beamer
wadia
septal
paxman
anise
animating
megabytes
farrelly
corretja
bartleby
glück
springville
lightspeed
marksmen
ruthenium
tham
rufa
hmc
preece
golders
hideaway
sellars
muthu
johanne
daughtry
soderbergh
tallon
megamix
defecting
hyperbola
pfl
carburetors
ksl
redpath
cathédrale
internationalism
jello
odenwald
benoni
joong
supercontinent
inbox
lycos
prüm
parbat
daffodil
smr
hypothyroidism
thein
ribot
westley
helgi
nettie
pooley
kartik
halla
doughnuts
atahualpa
overflowed
vulgarity
matrimony
tetsu
nicklas
hogue
splatter
infirm
lermontov
paedophile
ternate
baywatch
exclaims
flatts
backhouse
unassigned
sneeze
yoshiyuki
melgar
kenna
hygienic
yearbooks
dalia
teletext
roark
ditko
washtenaw
maximized
synchro
takarazuka
jit
agi
bogue
noob
siyuan
kroq
millett
baltar
restarts
lillard
swee
ramses
holmberg
meditating
benue
nef
mochizuki
penna
swire
gerhart
quarterfinalist
koa
huánuco
ternana
kristallnacht
cfp
cloaked
goldfrapp
spaceman
courtland
filipinas
ucs
bristle
gunfighter
istra
tsingtao
espousing
petersham
sherif
blacker
trimmer
southwesterly
steinhardt
reinach
buckman
callender
tarp
oktoberfest
reshape
enplanements
weeklies
vater
spiller
idiocy
azzam
lifeforms
arabidopsis
pok
cette
bicester
orm
capitale
wenceslas
waca
marfa
vit
vmf
durrant
misconstrued
edmundo
henricus
majuro
shimmering
cadenza
chibi
oakenfold
eradicating
robs
unsc
chicagoland
takeru
shahab
courcelles
plucking
purists
mapa
villon
attock
aznavour
shakir
cutts
unending
lapel
retard
nikkei
iredale
disloyalty
defame
imperio
latifolia
mortgaged
yomi
crunchy
bequests
birjand
gosse
esquivel
rebelde
dees
oun
semana
lepanto
wiesner
crumbs
hiccup
gort
jardins
kahani
gauleiter
rathbun
fowey
dismemberment
sociocultural
bizarrely
quanta
ose
ansell
kuno
aub
mahabad
rsi
streetscape
pardee
irritant
cupcake
sone
banksy
cormack
fdc
hoek
helsingin
eyal
stoyanov
boreham
chek
turley
vinaya
olam
lawes
minn
uiuc
jailhouse
chiloé
sandstorm
vishwanath
campagna
northrup
tyrrhenian
proselytizing
antunes
lynton
adulterous
hooton
stilted
muri
hamann
melatonin
millman
rendsburg
migs
consignment
meléndez
diasporas
takin
iic
superstore
centurylink
bagong
impregnable
joules
baboons
bruguera
consiglio
ippolito
madang
unidad
faruk
campagne
rabelais
cabinda
landholders
plumbers
ashmolean
dillingham
yorkton
pointedly
xli
shultz
pacified
squashed
ohr
sctv
bouton
derulo
nijinsky
morumbi
khoo
abm
jusqu
changers
chater
naught
whitmer
fitton
sain
condit
covey
gallbladder
stephanus
treviño
hangin
catastrophes
faeces
sunflowers
latvians
dined
eurofighter
lymphocyte
capitoline
llangollen
alouette
dhivehi
enceladus
deft
vrije
prabhat
olivetti
brownies
hardball
karajan
wardell
negus
iles
bechtel
bhadra
flt
haidar
scrotum
bitburg
hoyos
colliers
brühl
kundalini
metairie
loxton
iommi
navigates
dellacqua
stranglers
decile
sliver
olmstead
pano
miers
theodorus
saku
bosanski
mattson
clerkenwell
cinemascope
oxidant
gavan
scn
toothbrush
terrorizing
puc
lammermoor
gacy
wagstaff
brayton
irrelevent
chandu
wrangling
srp
demetrio
prus
bijou
waukegan
unbelievably
shirai
eastham
bicknell
fugues
vivace
plebs
breathes
giddens
wyss
menelik
stutter
sodor
coxon
paynter
richardsonian
maggi
paracetamol
lucida
wrangell
hedging
baggy
whirling
runyon
softness
doppelganger
webley
assessors
jurado
snoqualmie
milam
voracious
parkview
nomi
trig
counterattacked
extricate
weatherly
abp
digress
reconfiguration
bucher
fabolous
bulging
abrogated
hamming
stunting
misdemeanors
qala
candor
theorizing
yachtsman
mobb
oppositions
perigee
bakri
vanna
penitent
competences
satrap
nib
rickie
morvan
mantell
hypnotism
mazurka
bialik
muskrat
hanyu
smelly
skynet
stabia
kasem
slowness
scolded
florencia
bayi
florins
dunwoody
hematopoietic
carracci
alcázar
argentinean
misrata
duchesse
ige
nusrat
traill
challis
elli
reciprocated
berthier
mgs
analytically
gurkhas
uq
fragilis
expertly
tolerates
zam
bruise
beate
sutjeska
balogh
tadhg
masjed
kerwin
brigs
fredonia
woken
witter
insinuation
moc
jovian
sidwell
acidification
natsuki
michalis
airey
arial
elysian
teu
malinowski
doublet
anta
gringo
lavatory
beals
gmp
mcl
staatliche
bugging
tassel
landsat
avinash
ferromagnetic
duvalier
suffield
heighten
kimbrough
zarathustra
dubose
gandhian
superheated
hamzah
clinker
reoccupied
mutate
mirabeau
fatherhood
dwarka
goce
fracking
bushwick
ivie
chiranjeevi
dsa
kasim
roosting
eberle
inka
assi
variegata
cannabinoid
santorini
haverfordwest
pyrotechnic
zuzana
veng
extractor
cliftonville
kickoffs
ploughs
versicolor
tapers
queiroz
extrapolate
kenn
belittling
markos
amazonia
maywood
doig
schneerson
twine
lenihan
wimmer
antanas
holroyd
istres
frisians
karts
trusty
tavernier
amora
trusteeship
crossrail
nargis
affix
bluebeard
carotene
kweli
mallon
munna
generali
tribals
muro
pádraig
nizamuddin
conscript
alde
cytoskeleton
beamed
djing
muñiz
uncritical
panellist
serf
hélio
earley
uninformative
profiting
unmask
fiercest
siraj
addressable
gregoire
grasset
conover
magnuson
maro
manningham
isidor
repeatable
bdd
russa
kunkel
ecclestone
judgeship
malaise
zamoyski
remaking
plm
yikes
tapioca
barthes
gmtv
psychopathy
brutalist
halfdan
albani
offical
sbt
chiaki
bourse
onside
eczema
artyom
barbarism
oakey
syntheses
naso
einaudi
tedeschi
jaz
semitone
gizmo
chinatowns
castaño
lazlo
akashi
looters
huskers
heartbreakers
kosciuszko
dermatologist
ascendant
weirdness
usury
stowell
allegretto
reinstall
rangi
martello
swamiji
dmus
coppi
delmas
karak
xxxii
piso
wasser
charbonneau
ozma
wrens
cixi
bactria
vips
elegiac
taconic
kintyre
npt
kairat
collegio
synergistic
lightsaber
bhatnagar
oiseau
authorising
zealot
encapsulates
committeeman
spiraling
angeline
halfpipe
beek
straightening
alvis
topmost
osorno
ripen
matins
briand
behn
vasudevan
ridgely
unexpired
cdo
stornoway
zoroastrians
spied
dau
skateboarders
ermita
intoxicating
philipps
flynt
machin
salut
kaman
pressburg
fantaisie
homeschooling
apcs
guilin
wolters
marcher
enslave
bauhinia
perros
hornung
sture
carli
mlk
herold
calderdale
bair
micrometers
ravage
immaculata
peering
amery
moan
dern
pretzel
libs
rsp
mexicanus
yorkville
cortese
sorbian
ment
erastus
gornja
quanzhou
poitier
sayles
roseau
animism
gurley
reichert
rudo
sinuses
altes
photoelectric
pacha
sukkot
toothless
secretariats
altimeter
trappist
ediacaran
debreceni
lilium
atlantica
dockyards
cota
vineland
azlan
panna
malaspina
anarchic
dda
otherworldly
rustlers
lic
xfl
inwardly
fainted
crean
kuntz
scents
tuvaluan
offstage
ayuntamiento
tuo
albinus
bechuanaland
inverclyde
stalkers
helden
mykhailo
blomfield
bedouins
hku
loughton
kozak
toons
interrogators
tripolitania
bloodthirsty
impediments
schoolcraft
tots
bicentenary
neuberger
contented
rearward
anser
tavriya
kone
pitney
satriani
liman
birley
mdy
clamps
gongs
mamelodi
blainville
desecrated
padgett
astounded
meshgin
euan
hulst
collectable
refunds
preempted
juanes
zomba
reformatting
boyband
supermassive
gedo
kitamura
adjoined
seg
safran
rondônia
birt
lamentations
shimin
qx
carn
bailiffs
sickles
flon
deplored
unease
erol
otomi
fangoria
undirected
spratly
yael
ornette
peddler
pratique
waldheim
feltham
supercell
obstetric
generalizing
evgenia
flippers
nikko
computerised
ballina
zang
uerdingen
bresse
dhofar
legalistic
wark
doce
stuka
européenne
wagram
buhl
franchitti
ponders
superseding
savin
deportees
fada
airsoft
soulja
indiewire
yamauchi
hadji
dechy
hobgoblin
wenham
hrc
shahzad
stoops
hurl
luzhniki
jowett
pterosaur
guianas
sundari
hubris
mejia
ory
psychotropic
tanana
dwt
kanya
bernat
binnie
bps
laparoscopic
dala
depositors
baud
rustica
mikan
larousse
tellurium
appleseed
unblocks
dibrugarh
microsd
hoddle
halvorsen
celaya
béatrice
bonjour
barcelos
housework
polak
mauer
tamale
altstadt
prieta
chardy
gwilym
parkhurst
kramnik
econ
soane
tolman
clandestinely
glu
yagi
archana
accelerometer
deca
peacebuilding
funnies
avesta
reutlingen
interbank
bci
iveco
duopoly
mcgwire
levesque
friedkin
malthus
turbidity
tsuchiya
daniella
yuvraj
gopala
quadrennial
reales
berwyn
choc
gastonia
hooliganism
redoubts
habla
thimphu
herren
awol
aparicio
sakaguchi
patching
divestment
greif
sheedy
triglav
presidencies
gooseberry
commonalities
caryl
teleports
fancies
djamena
leibowitz
tangents
sasi
patera
unsurprising
selmer
eggleston
zainal
zobel
unwillingly
waggoner
seabury
kaito
eckersley
prohibitively
mocha
gilani
hutcheson
awash
unspeakable
bertone
surging
paro
siang
mmorpgs
dobbins
insomniac
nta
tolling
spoofs
carshalton
culbertson
wawel
atal
keto
strangelove
lato
ranjith
biron
heathfield
adèle
subduing
maximo
nats
viera
capilla
newshour
crutchfield
atone
saeki
squaring
erle
kiril
sylvanus
playfully
miraflores
palaeontologist
hcp
wnyc
cadfael
interrogating
atar
toten
feigning
yokota
haruki
tert
jenifer
tsk
dewa
unsalvageable
loe
constantia
tapas
treen
upped
leadville
westerner
tableware
sarasvati
starships
reacquired
wwa
amboise
stench
cheetham
stymied
bleeds
chutney
belarusians
hartigan
holcombe
divertimento
ryle
transcending
vim
trejo
shakey
xerez
monger
shovels
homa
tsuji
yaqub
florianópolis
fluxes
ebba
bridgman
johny
stubbing
adulyadej
factionalism
perusal
stockholder
quarterdeck
bown
copywriter
psers
garcés
ramstein
gelfand
sujatha
syncretism
perfecto
backgammon
pmr
borbón
sharm
heydar
wfc
partway
spoilt
mendonça
tameside
kournikova
kashan
placentia
uusimaa
wael
osuna
seaview
cavalieri
ashy
breezes
hubli
bambara
nonpublic
newspaperman
kru
varian
lilienthal
miyazawa
coworker
tussock
você
manoir
cristoforo
technik
busing
roxie
hematology
bilder
libyans
koya
atty
gynecologist
clevedon
majeed
tanto
trammell
hoppus
belov
diamant
harnett
ashot
divo
diffusing
márcio
perinatal
zulfikar
redonda
minivan
estudio
teletoon
sante
perc
patronising
folktale
metellus
icehouse
jérémy
spurgeon
bluth
gertie
berthing
milf
edulis
schifrin
anoka
annunzio
paule
yuva
ledbetter
gatt
lsc
velha
hyperbaric
fairhaven
ansgar
subplots
ucb
duce
eastgate
overflows
gunderson
borys
mbh
aib
dga
vetter
wyckoff
camborne
paston
vakhtang
martti
weeps
osler
benchley
mohanty
kastner
burgin
redruth
learjet
iguanas
inelastic
villalba
revs
wolds
mangum
basalts
henares
tenebrae
grayscale
decrepit
halter
muzaffarpur
zk
hazlewood
denizli
rhymed
rancheria
stewardess
transversal
predictors
reservist
optus
raheem
capello
brinkman
cumbrian
newstead
moldings
elissa
circulates
rojer
pints
salomé
sfu
telenor
frenkel
milutin
fader
szymon
herpesvirus
ravenscroft
jue
duras
gasification
horvat
temporally
summerside
jeffersonian
barzani
magadan
vulva
carcinogen
bogdanovich
anastacia
pollinator
punchline
bruck
chhota
lambeau
stokowski
sedona
blinky
chirico
kobo
ream
foetus
halas
deactivate
hatshepsut
hazmat
nxe
reciprocate
arnim
rml
stratotanker
fado
jrotc
hannay
eef
raghunath
friedberg
carbonates
bridgeton
wavell
stovall
helvetica
rafiq
artical
givenchy
geysers
fusilier
baleen
lorenzen
gretsch
bolkiah
alts
mcdougal
estevan
pedigrees
ayo
lakshmana
icap
thigpen
tarja
vallance
neutrophils
songz
edc
hildburghausen
heald
ducats
endemism
bsf
playboys
wipers
trapdoor
mohit
milkman
gilding
uca
loesser
salleh
lorrain
obstetricians
peekskill
abolishment
morland
wałęsa
lardner
trygve
systolic
pallidus
stourton
gandhinagar
historicist
mcminnville
scallops
waterlogged
moradabad
crataegus
socialista
amplifying
venstre
paralimni
unaccounted
xalapa
ferran
communiqué
ignites
darab
nori
aleut
mccune
gratia
annulus
pipistrelle
createspace
sherrod
sainthood
kirchberg
keun
moaning
intercom
compositing
vetch
overlong
surbiton
vise
tudela
carmelita
massy
mycologists
powerbook
watermills
salvio
nmc
trutv
scraper
lasseter
hickam
jalgaon
cullman
turkana
jupp
gleaming
gripen
grandnephew
salé
oort
interleague
wärtsilä
osmania
camrose
janitorial
arcturus
creamer
rhythmically
duclos
callow
bitrate
tanga
cassano
memorably
forlorn
ouverture
peto
lavey
mulgrave
tsp
tallow
risto
gentian
eliya
hershel
cyrille
weeding
godson
switchover
bayshore
bpo
mannerism
freckles
lbj
archuleta
buckskin
delorme
crouse
heinze
buckminster
granularity
thermoelectric
nanga
hosseini
turton
bizerte
sunsets
transkei
deformations
unrated
arguements
oakham
linotype
centralize
stiglitz
isidoro
pumice
citra
adelheid
abetting
labatt
entitles
jayapura
fushimi
sorrowful
sso
southeastward
endoscopy
zapatero
vfx
raking
keanu
parvathi
solveig
embarcadero
copepods
guidlines
lidar
femi
rutger
pantomimes
flutie
dendrites
superconductor
myr
ibises
monolingual
outram
landholdings
fujisawa
leet
doppelgänger
skillet
antonine
tigger
redact
geomorphology
pressurised
fira
delorean
captioning
rasch
undetectable
amira
stools
doss
yellowhead
circuito
sponheim
sinologist
tollywood
aquí
gimmicks
développement
growling
tain
flashman
vtb
jarred
baptistery
sle
muy
mcduck
glyndŵr
safekeeping
ordóñez
seibert
dempo
unfairness
dorcas
meridians
secaucus
benidorm
takako
snowshoe
pomfret
oranje
wildland
videotapes
backbench
interrogator
coniston
gundersen
kathakali
nakanishi
ligation
headwater
stam
botnet
silencio
brolin
homonymous
torched
mish
maigret
wythe
sardonic
oscilloscope
taher
mcardle
ibge
downe
tabulation
elohim
melling
excellently
gostkowski
tempt
snps
ipp
pitti
spasms
cutty
fatehpur
wcl
sreenivasan
thrushes
harmonization
biosciences
exhortation
hereinafter
aonb
exuberance
tete
helmond
gianna
tarski
jatin
akihito
marriner
dodecanese
convenes
caries
songkhla
tanjore
penafiel
burmeister
ayako
zealots
manik
lunt
arrayed
interviewers
bancorp
bereft
guiyang
brydges
skerries
neymar
lor
mestizos
msw
echolocation
manipal
johnsons
balaguer
malachite
horsemanship
accusatory
esslingen
saff
genotypes
toft
khatun
horry
imprisoning
parapets
spokesmen
nawa
howth
clawson
bere
ejido
akiyoshi
towner
wgc
zarqawi
sensuous
riera
sombrero
lumping
stomachs
toriyama
proportioned
purnima
bishan
arabica
playfield
looe
vartan
quarantined
disregards
superceded
calabrian
fabiola
maxillofacial
wedged
negotiates
foerster
thao
geisel
scoops
pumila
tremaine
developmentally
smet
bloodied
risso
marquardt
burk
bollegraf
unapologetic
spivak
unbeatable
battlefront
aoa
disputation
govortsova
bisbee
sandbar
polanco
spiritus
dictation
luján
hellmuth
grata
tangail
mossman
zelazny
bhanu
interoperable
minuit
haqqani
hegemonic
bromeliad
erudition
becerra
unfettered
bottlenecks
burdensome
bolus
trite
almodóvar
bereaved
clubbing
arvo
piran
rosyth
gridley
inwood
ectopic
biograph
pkr
pinhole
upenn
liebman
franchisees
mudslides
pulleys
canonically
cragg
imamate
carrière
retford
bhandari
romsey
edp
synopses
gtx
alomar
tapper
adopter
bellflower
cresson
iea
cobbett
libretti
alleyn
nilgiris
adjudicate
stuffs
marri
reconquered
kaki
datong
yarmouk
massaro
marchenko
lalita
sibylla
yakutsk
dagobert
hovhannes
michi
venta
arcadian
crochet
sako
longterm
pancreatitis
innovated
graphing
bahujan
circumpolar
aldi
kemi
wladimir
contentions
caldron
sheepdog
southsea
esr
shinee
hopton
cots
bradstreet
prowler
babelsberg
miwa
mauled
wohl
bricklayer
ivry
semites
experimenters
juninho
amersham
magoo
powis
liveries
canta
lessig
federalized
srf
predominated
docteur
dorfman
rego
amadora
karmapa
chante
kinematics
spratt
fancied
mckeever
tpm
humaine
tocqueville
malory
whores
athleticism
diametrically
ordinariate
chadderton
plame
qadi
kovalainen
lensing
thornley
saanich
fcw
salado
pininfarina
temperaments
molnar
hamlyn
chagas
lagan
grandest
ingots
keener
amritraj
reinsert
graziani
pando
almanacs
herford
betz
kavya
wartburg
lahr
sigourney
exquisitely
zdnet
nucleoside
bustle
cabrillo
thurlow
tarbes
charest
contemptuous
yadkin
pavlodar
hendra
bucaramanga
corpora
jaxa
salonga
selectmen
seawall
betti
flippant
marigny
radiates
lydian
mfr
heiberg
paladins
giddings
immerse
nanometers
orvieto
saboteurs
hermano
tasteful
helter
yk
straubing
solti
sidereal
geolocate
tolliver
brio
pacelli
cays
asma
ravichandran
bonfires
legnano
vilar
proteomics
gdf
airbases
esparza
manheim
bagel
disjunct
grafts
kaa
bedell
tren
espnu
nozze
chengde
cella
flankers
rylands
kord
outclassed
gubbio
shoah
risparmio
colonie
permeated
bassa
sibu
escrow
leacock
coldly
brandes
abolitionism
weft
tripos
frc
smearing
sectarianism
yalu
absolutist
isadora
trevi
mcvey
hosiery
khalidi
kudryavtsev
norrie
privateering
yannis
hawa
predominates
metrostars
cumberbatch
fiestas
varman
samhita
knightly
kanan
bryden
dislocations
kathak
sargeant
subsea
belair
beardmore
expanses
rampal
bardon
bestows
elkin
oxidize
stationers
heterosexuality
sanguinetti
weybridge
papas
ugliness
sharepoint
carbonaceous
reconsidering
chiricahua
werft
universitaire
abdoulaye
mauricie
wickmayer
rationalize
echinoderms
kilwinning
heiden
subterfuge
refrains
cataclysmic
odie
potosi
hippolyta
chairpersons
technetium
ajman
commonsense
ringleader
fof
pillbox
yoshimura
panton
misinterpret
weevils
volterra
twc
banister
haiyan
sarada
yeasts
bunton
shahnameh
ungar
viziers
alleluia
peronist
endometrial
sotho
frescoed
rediscovering
fouts
savoury
swart
directeur
romulan
inlays
congeniality
shirk
ablative
vtol
leitner
discontinuing
menai
nak
förster
strategists
sunburn
gumball
novelties
prolifically
supernumerary
kotaro
ashman
oamaru
stonebridge
vei
proactively
stoneware
jaclyn
pagerank
underpopulated
belousov
coupes
pratapgarh
cartouche
spastic
kohat
staatsoper
harstad
socialiste
meshes
gauging
goulart
sundowns
corsa
goldenrod
brockway
trombonists
alena
nata
hairdressers
cornette
madinah
ulbricht
cyclades
yttrium
maclennan
gagne
barnhart
nenets
melas
charlatans
mukhopadhyay
tamim
jetblue
liquidate
laudable
bestowing
avni
universitat
localize
recapturing
yara
kelis
dudgeon
cobble
twomey
crucis
personage
gentoo
thumping
slidell
prp
retargeted
masaya
souther
morts
markku
wilders
marjory
meanchey
ogier
unis
studium
kummer
covalently
attu
howitt
barrages
resolutely
ergonomic
growl
riza
chacha
henny
autonomist
linley
juggler
kross
degeneracy
bharu
yoshinori
suchet
traver
letzte
hildebrandt
geoghegan
cortona
lymphoid
cajuns
intranet
rostral
auth
kurstin
blackmon
dibble
lithographic
cvc
bainimarama
parada
phenom
bux
synesthesia
ludo
fitr
whiff
chalky
diverts
sawing
informers
switcher
newsstand
wedel
langa
penda
harrods
capriati
suse
blatter
predisposed
bunce
staveley
scutari
kisumu
adan
hypericum
mindaugas
nanotube
mcadoo
hanif
airlock
apostol
matchplay
mlle
tomita
countermeasure
fos
davi
rhodium
wooldridge
uhm
conceptualization
pref
pinakothek
cooktown
piave
mercosur
lagi
linehan
frenzied
questionably
ebden
muddle
herc
seepage
fulfils
encircles
frosts
dimensionality
supersedes
werther
nelle
allying
holston
leeches
copyist
victorias
brinsley
rabble
sueur
calvinistic
madawaska
vicarious
aprile
awkwardness
tedx
beater
karoline
indah
nicanor
englund
yasuo
cottle
instalments
goldblatt
geopark
bonk
podkarpackie
eurythmics
devitt
kilos
conduits
illuminator
sagara
counterintuitive
analyte
quelled
pastimes
coloratura
etsu
pincus
susheela
repaint
ferber
desarrollo
cialis
sneha
nyingma
uchicago
cambio
darkside
tennent
chitchat
lbl
spearheading
taniguchi
grampa
redwall
barracudas
shihab
parkour
bravest
cowichan
aetna
ventimiglia
smallholders
sajjad
flicks
netbook
sonam
testud
expansionism
permittivity
protestations
scorpius
soulcalibur
corina
roeselare
speedometer
fitzherbert
ardrossan
lozenge
striptease
athanasios
barbu
huila
palmes
risqué
chatfield
spadea
xuzhou
uktv
tribus
fathoms
valmiki
vlsi
dominika
gelman
waterwheel
yilan
cpg
brule
lxi
kaaba
involution
tradeoff
absorbent
belmore
tilman
malia
rougher
rpo
galatians
tá
creedence
invective
mateusz
folate
morningstar
pizzas
wasl
brigands
rfl
reapportionment
helton
whiter
iol
successions
pleadings
destructoid
fallback
piri
penstemon
bagmati
xlii
auberge
nscaa
alderley
sepulveda
kagame
ergotelis
caloocan
corydon
dro
sempre
thicknesses
caprica
ionospheric
smallmouth
maximiliano
ishak
frigid
lysis
disapproving
zevon
rosamond
lill
soliloquy
conquista
rtr
metlife
profitably
keystones
rlc
barmaid
doosan
arka
dissipates
valls
arrhythmia
greenblatt
faenza
rarotonga
calamba
pobeda
pallid
piedad
kawakami
latta
pathophysiology
silmarillion
tanis
surfactants
halos
straczynski
srivijaya
menahem
distro
bacteriologist
tah
varuna
dependant
mottram
telefónica
duong
iphigenia
akimoto
xxxxx
unfamiliarity
arash
reinvention
guesswork
barnegat
pulteney
inclines
fenimore
vesa
stile
transcended
athist
preemptively
signor
slung
fraudsters
yani
shrimps
hanyang
congratulating
cheech
novy
todorov
enjoined
unseat
pila
useable
stef
metamorphism
oroville
steppin
toujours
ferrylodge
bessarabian
concatenation
surrealistic
asgardian
lowrie
lynsey
katsina
darlin
efl
peckinpah
curtius
smylie
afterglow
faun
hadar
hurdy
pandan
tomorrowland
glasser
diatoms
collarbone
valenciana
ophthalmologists
entwined
rotenburg
alphonsus
hassell
veblen
rsm
atsuko
menander
ots
wass
entertains
hasina
antipsychotics
paya
pastorale
ake
offutt
devendra
redness
autocracy
checkerboard
nuits
zondervan
assize
angélica
dok
jaunpur
gardenia
sensationalism
clamped
blass
demobilisation
supremely
taxonomists
nagas
rigour
dharmendra
positivity
pwg
lexicography
bellagio
merridew
cristiana
rhyolite
ornata
burrard
ecma
craybas
crocodilians
aos
tromp
nettwerk
imager
rajab
sorsogon
lougheed
agr
unacceptably
carpentaria
binney
reasonableness
conditioners
langues
homeschooled
colla
chandni
recht
udea
kac
archiepiscopal
outpouring
darrel
bpd
patrolman
nhtsa
gsi
meso
drb
opeth
orthopedics
arabe
broca
restaurateurs
paralytic
thampi
manifestos
kaen
corroboration
pram
realists
lafarge
racy
jugs
lapierre
piemonte
brag
shoestring
dein
griselda
janissaries
patronised
vari
pekan
oscillate
estadi
ftl
neverending
mcclendon
hazell
nels
iupui
disloyal
ovidiu
scuttle
plage
fistful
xbiz
stretford
amnesiac
innocently
boehner
hensel
multicolored
antifungal
provocations
afferent
pucci
ceremonially
orators
joanie
depleting
persie
cowgirl
garman
snicket
baltazar
otoko
lexicographers
cbp
tryst
daystar
epiphytic
alimony
moniz
bruna
frascati
hollyhock
modesta
nickels
zika
peruse
herta
conservatories
membranous
kennan
badri
occultation
uas
marach
leffler
nagesh
fussball
sorrell
eagan
fru
kaba
neutralization
disinfection
tsvangirai
podolsk
castroneves
swaraj
diarmuid
mounties
hodgins
conflagration
sifting
undertow
margarethe
niamh
confetti
ppr
backdrops
decathlete
brookville
gyumri
diomedes
kaminsky
peculiarly
raa
portas
vinayak
iot
fuckin
comox
samus
sayyaf
hage
rocksteady
higuchi
beefheart
caversham
compaction
morphin
sno
reinstalled
transdev
rennais
denigrate
blamey
nourished
kipp
serrata
ihf
wos
konishi
squandered
parasitism
resurface
commercialize
compasses
revenant
marjan
wynette
traversal
romberg
historiographical
caird
tactician
glazes
sharpton
despotism
shudder
raindrops
reappointment
circumflex
gammon
antiretroviral
unionized
ouellette
miscommunication
mynydd
ddos
fattah
brannon
victorino
nygaard
subclasses
inordinate
birdland
kad
transducers
hoyland
quba
hamden
cloyne
pokes
comercio
osteoarthritis
carthy
skua
funders
antun
malatya
sacré
strapping
gumbo
tehreek
chauvel
fazio
rabbani
dystopia
commenter
tangerang
frère
mí
urquiza
uap
sloths
isleworth
rane
glinda
laurentiis
oberland
poesia
anju
bartle
bashi
spry
dando
yami
yz
concussions
closeted
uto
runge
xxxvi
kayo
underpowered
sbi
bluffton
carbuncle
efa
lushan
shacks
ashbrook
molnár
schoen
dube
chesnutt
herediano
rayman
injurious
jewishness
romina
judgmental
beslan
parkways
cosgrave
carron
kropotkin
hed
articulates
menelaus
canwest
liberman
merlyn
josephs
shor
murphys
sliven
ofer
deviating
retargeting
policed
redesigning
romandie
distinctiveness
alytus
istrian
zapatista
abbreviate
roehampton
iza
bullfrog
aspelin
parekh
venable
haug
ruffo
chock
staton
wangaratta
rudin
fenice
gardaí
cherian
propagandists
badia
helvetic
mestalla
spoonbills
magnitogorsk
humperdinck
sarwar
halligan
rehabilitating
nila
dungeness
quam
upwelling
spotlights
harpsichordist
drinkwater
brindley
inescapable
bionicle
propounded
glenville
lalla
tedesco
khoikhoi
slingsby
nahr
falwell
quench
eagleton
appending
cuxhaven
demarcus
rfm
kirti
imap
benda
unctad
neosho
talkative
arjan
dominator
tesseract
rodina
guk
preconceived
mendota
pozzi
czartoryski
rippon
toppings
abundances
tiburon
wildside
looper
erne
kalevala
lisi
lysenko
dreamin
vada
hypnotist
kastoria
amphora
ampere
arching
oeil
congrats
svu
stenton
discernment
contentment
troughton
hospitallers
murr
elmar
nagendra
josias
misleadingly
quarles
lwt
monumenta
lix
rajon
echols
dentate
classique
prolongation
blizzards
retouched
erakovic
gassed
sosnowiec
symon
siliguri
inflight
disembarking
aksu
botolph
arminius
devalued
lyubov
oswalt
switchblade
toddy
venda
refugio
capably
beerbohm
calc
pastes
microarray
pocketed
adu
chewbacca
stagger
galba
nathanson
donde
ottumwa
newlyn
shtetl
gagliardi
khana
kasich
kanta
commending
naqshbandi
nosferatu
blaxploitation
styrene
canarias
patrie
dwarfed
revitalizing
pulver
ayre
mcguigan
wilsons
balk
compensates
quinine
monotonic
cringe
jelgava
daula
spurned
dcm
absurdist
kiu
spinnin
breakbeat
wisp
stelios
willems
gaede
hore
bornean
schoolmates
bhiwani
equipe
hunza
seurat
unadorned
faraj
rossum
servicio
electrifying
wj
adelbert
feldmann
pso
hesitates
fothergill
fletch
wankel
isambard
basho
khayyam
benjamins
hedvig
gcvo
bera
equalized
verdugo
ramasamy
connoisseurs
bannock
bolaños
carrer
negras
ullevi
gnosis
aurum
commodus
everquest
kron
nyquist
libert
paresh
réseau
tvt
scoured
ewood
jagan
magnificence
neem
pepi
extradite
yuya
mallika
transcaucasian
cordes
gallacher
soldat
pinyon
veneta
walkthrough
transportable
ladin
coffers
kantian
sklar
encroached
disavowed
phill
dolittle
senanayake
holladay
fairlight
backwaters
metrication
zubin
gangland
crème
razzano
twi
cements
gharafa
negi
westville
init
allocates
capitols
phlox
wey
agu
einhorn
tarango
comanches
salva
uncontroversially
embellishment
maximally
archangels
destro
bathwater
berti
uptight
sadf
striae
bushels
jetta
skelter
gardena
hac
tristis
swope
catena
fock
wenders
haga
rohmer
urinate
bawdy
capitán
lateef
talas
tbl
perforations
gordian
ctx
mateu
walser
colobus
wickedness
krystyna
skyrock
shortcoming
glennon
falcón
memoranda
bunga
rescuer
nishan
farquharson
defused
pasión
shrubby
claymore
heathens
koti
blekinge
thal
arik
scrupulous
drooping
commutes
hawthorns
spasm
tholen
securitate
spinoffs
tro
spanner
kunio
fassbinder
ejecting
displaces
paintbrush
schwarze
bibliographer
petting
upjohn
lacus
overreaction
scipione
midgley
fidelio
laurentius
vlach
audra
pediments
petersfield
ulam
disappointments
ascribes
springhill
chorlton
gameday
cardiothoracic
proffered
qaboos
nisei
flannel
kesh
insubstantial
smug
balmer
dalkeith
salama
bengalis
romany
barbell
limon
darbar
maly
bohuslav
soga
utters
matija
ndrangheta
ductile
vals
alok
planing
troma
jagadish
oddparents
hambledon
jolene
lachine
deadpan
unobtrusive
dennehy
merian
tellers
tubules
marcie
allstate
drudge
valjevo
talktalk
gooden
apotheosis
cag
peacekeeper
tcf
crookes
duodenum
subconsciously
jogaila
sitwell
sanandaj
budden
orto
paramus
bugis
cocke
iki
pavlyuchenkova
tmp
henke
trotskyists
pallavi
grunts
millimetre
cardinale
gül
aujourd
fredy
functionalities
intruded
titration
recidivism
encrusted
johanson
strictures
mytilene
hapkido
moped
liskeard
hirshhorn
vogler
tpg
deleon
reva
detoxification
beachy
temuco
massless
dyfed
oxygenated
culebra
genial
louw
levallois
akatsuki
ansaldo
wands
intermolecular
achaean
turgut
akrotiri
marja
oneworld
clementina
rosalyn
stillness
kosova
rattlesnakes
papillae
bushell
transworld
novokuznetsk
bahri
lito
bastogne
hysteresis
sommerfeld
rosehill
gulab
butchered
stifled
rupiah
falkner
nog
thankless
southbridge
battleford
inu
deimos
falkenstein
milhouse
dropdown
francolin
sncc
ashbury
pentatonic
rosalia
scrupulously
copperhead
haygarth
alans
gprs
milkweed
crappie
recharging
luongo
clonal
jairo
coty
hendrie
roerich
campbells
ranbir
normalize
klimt
webern
screeching
menton
unterberger
spla
catt
corday
peay
gioachino
quonset
strangling
kurumi
hypnotized
diplomatically
pontifex
glenview
amniotic
chuckie
ladybird
multiparty
outrigger
insel
tiburcio
gargano
mpr
acción
vats
firestar
unforgiving
mollis
ovc
makuuchi
geez
wordings
melcher
aftenposten
copse
ciliary
prepositional
shimano
niwa
ofm
eulalia
rendezvoused
foreclosed
farida
showered
korona
zand
amalgamate
breck
zabaleta
marzo
intermixed
mutagenesis
franchisee
shitty
nfs
corroborating
fip
poacher
sportswoman
rtm
kalu
jardín
ashokan
rohtak
hinson
eggman
vong
ecowas
hednesford
jada
telnet
transvestite
grus
xuanwu
wic
infomercial
kohlmann
flagella
weisz
interprovincial
sarcophagi
endow
keltner
aktion
affidavits
minehead
ummm
dragutin
odra
brundage
eclecticism
irrevocable
iwasaki
xkcd
hannu
altiplano
manstein
molitor
mafiosi
lepers
shiina
sof
gsc
campestris
pastore
cotswolds
negates
guna
smethwick
creel
volleys
baar
canuck
islamization
comms
vegeta
estimations
jorhat
recieve
rearmament
mcreynolds
didcot
ridiculing
certifies
passageways
copd
moated
archaeopteryx
preeti
coffs
multifunctional
jörgen
morcha
indymedia
mendis
msr
taxiways
khar
wielder
gearboxes
schoolers
delved
expediency
courcy
toucan
chaturvedi
yugoslavs
sylvestris
petzschner
colgan
nitrates
baki
orillia
alvear
satsuki
breakfasts
hematite
motorcyclists
indefensible
formica
tiene
briefe
sahu
issac
kayes
tulagi
nsaids
matara
kutty
farian
penrhyn
dakin
spira
boren
lindstrom
collison
centimetre
overpowers
calms
pavlos
maatschappij
giallo
chipman
bienville
relict
mujibur
heb
trioxide
mpv
arto
danielsson
mobius
transpires
hagi
ripa
bunter
helmsley
newts
kauri
sahni
lodovico
aspartame
forebears
alcántara
quire
iru
testable
weiler
maintainers
mudra
nias
tolland
harvesters
nicholl
recyclable
camilleri
vlaams
toppers
dred
abhimanyu
medias
toyland
deluca
riddim
raney
refraining
ministerio
eshkol
ablett
substratum
ghazals
bhagalpur
tamarin
einsatzgruppen
rehashing
aunty
quien
morne
faridabad
kasuga
ammann
godless
cancellara
truffle
procreation
sancta
pokhara
mvv
equaling
stenographer
congregants
muharram
schomberg
antelopes
appleyard
cueto
vfds
forklift
gilliland
coarser
disorientation
modularity
agape
schoolyard
underpinned
vill
butorac
authorise
bagnall
detested
ploughed
irwell
blt
diptych
thespian
battersby
radioisotope
vitally
athenry
maree
jessen
strozzi
guardiola
mandurah
politica
inquisitive
lupino
nandan
moncrieff
pryde
ferric
grabowski
yoshimoto
igreja
froome
clots
ticked
señorita
budva
phraseology
lazarev
kalashnikov
partaking
deprecate
jy
demarest
manzoni
nieman
erdmann
nicolette
passat
disoriented
cristatus
erlanger
genki
nocturnes
condoned
takasaki
tikhonov
copra
trine
wolfenstein
priestesses
costliest
mahila
sdr
dolomites
walkley
jakobsen
minder
shariah
cancun
ulvaeus
benaud
skinheads
kunwar
tartarus
unrestrained
aeromedical
preminger
porpoises
cheema
slipway
prut
riata
mardy
isar
gowrie
epidemiologist
kien
schoolmate
gcr
soeda
permaculture
boileau
yeates
eyewall
handclaps
kem
caerulea
besser
karun
ising
signora
rinzai
kava
euthanized
multiplexes
anabaptists
amaranth
muralitharan
meadville
infocom
shoo
unavailability
exothermic
alexandrina
seneschal
ribose
inhaling
raffaello
saluting
hindutva
soundness
deyoung
musial
friezes
goodbyes
igarashi
schellenberg
stressors
garrity
diversifying
cellphones
porcaro
scholten
gra
desha
gurdy
headstock
vam
horoscope
soper
plovers
adverbial
orin
glbt
helio
mangala
lyga
sassafras
baluch
atyrau
asada
philippoussis
holtzman
lancing
elca
telfer
congreso
aang
dhani
leena
sherburne
parmar
mamiya
lda
lewistown
lts
mercurial
galleons
boscawen
impaling
ragin
jadhav
ackland
basilicas
joya
hillis
potentilla
uttam
raden
buckling
muralists
hydrazine
underhanded
cvn
fie
ratha
brite
astin
probert
ivrea
mulch
toppling
earner
delon
philby
hypermarket
malika
overheated
wral
stockyards
trigeminal
carbo
sdlp
sultanates
nelvana
khair
clerking
adal
lamoureux
reconvened
essa
yams
taek
detlef
rumsey
korfball
bostwick
docker
airworthy
lauter
misia
jovial
kanepi
solemnity
francaise
horticulturist
cranwell
ronaldinho
saybrook
recusal
crispus
myna
starlings
millipedes
cogswell
banzai
contemporanea
ltg
diamondback
agitators
ilbo
photographie
ciclista
belittle
wisc
esb
padlock
quando
cronies
shalimar
harith
indu
zog
ped
burkett
kaji
malai
abutting
hitchhiking
coram
psh
cua
sentance
tryin
sinnott
peyote
furred
unearth
ero
ultramarathon
forecasters
buryatia
myitkyina
outfitters
prospekt
velvety
langmuir
sah
belden
domhnall
candia
tuf
heise
scl
adina
launchpad
urrutia
ude
tepper
nudist
tutelary
thistles
plasmids
nonhuman
uninspired
agnelli
guises
tannhäuser
peniston
pande
homily
disparaged
galea
maneuverable
positivist
noreen
hempel
kapitan
rocío
staats
grameen
julianna
plp
moreso
haikou
elevates
stopford
monadnock
ancaster
riz
inking
caricatured
bluesy
poèmes
hayao
sidhu
insides
maximizes
chandi
expository
handa
mrsa
nephrology
alberton
aegina
aina
kroon
underwriters
hir
gasket
fagin
acd
woolston
ghali
bolsover
bogomolov
ushering
westbourne
rearrangements
trw
vocabularies
entrust
serdar
tangut
oreste
retorts
kawashima
frehley
depardieu
ruisseau
umaga
blubber
strother
tretyakov
tesoro
presupposes
slaveholders
thwarting
xxxviii
nagashima
marcantonio
gunships
disreputable
nyan
voyageur
sighs
cornfield
uts
timesonline
banjul
highsmith
dreamtime
batesville
markowitz
songhai
bunyodkor
scarpa
reynaud
grinch
holloman
maleficent
morganatic
aramco
tranquillity
edelweiss
dolenz
buoyed
niedersachsen
gowen
alc
karlin
adopters
wednesbury
inventoried
lukács
somerton
coorg
serially
caceres
rejoicing
fionn
iconoclastic
orp
tikrit
bathers
kurihara
ouyang
hake
gorchymyn
einarsson
reactivate
septembre
quadrupled
dramatizations
karunanidhi
bude
smears
maza
toponymy
usurping
arithmetical
vocalion
operable
ostrov
khammam
cetacean
unfavorably
odious
spv
cotten
beaumaris
abdelaziz
schalken
carus
cutout
constancy
schwimmer
dushevina
nuuk
alamance
solingen
musculature
saumur
slimmer
qvc
ntu
tusculum
heseltine
dpa
flávio
abelard
mirador
shrunken
bemelmans
citroen
raekwon
arsen
apna
budgie
unidirectional
lemke
yogic
dromore
workflows
ktla
greve
deteriorates
karimov
cer
shoving
tfg
fergana
skirmishing
memorization
jigs
deaconess
strangeness
hormuz
trenitalia
lade
zing
glanced
peleliu
vissi
hammocks
otherworld
revamping
stoica
larrabee
junker
preaches
bringer
undiscussed
freefall
barakat
cavell
baily
ebi
hoagy
boorman
moebius
psn
babbar
macromolecules
hallie
maintainable
hainault
vernier
kaisha
mahajan
vibhushan
hesperia
romancing
singlet
wingtip
claps
discontinuities
unobstructed
seta
rachelle
chiefdoms
kiska
bussy
supersymmetry
hairstreak
ocho
kyla
meguro
haynie
newsreels
browner
maka
bleecker
brazoria
michio
barossa
tetley
echizen
mosh
dutchmen
burnin
pini
anastasios
hasidism
ritmo
officeholders
robards
marsan
kumaran
giardino
ewes
arrigo
solanki
siegmund
caraway
steroidal
dovid
ashurst
wallops
vlissingen
stane
rodeos
afoot
iliac
rpa
kingsmill
turtledove
archpriest
optimisation
reflectance
enumerate
thinned
hockley
ahrens
limbic
triviality
glutamine
cornett
southwestward
esd
gambill
reformulated
sna
condolence
unsurpassed
naumburg
ean
derringer
glaucous
raffi
churchmen
brijeg
localisation
reapply
erlang
olusegun
motorbikes
paiva
rebates
padstow
xuanzang
gatto
nostril
bicyclists
kitsune
ohara
snook
hermite
honeydew
maysville
miscarriages
cobourg
carnahan
stylings
grisly
wille
infotainment
fma
schlumberger
muds
kumiko
jolo
soest
jumbled
sapphires
spinout
steller
furrows
ault
reinserting
flac
slacker
jornal
anoop
timekeeping
saharanpur
spurt
platon
mlg
savagely
depravity
yousaf
mella
checkbox
tomes
dungan
vinicius
doorn
thien
russet
wesel
turenne
tgs
bails
repositioned
urinating
veering
refrigerant
sdsu
grits
kazuhiko
bellefontaine
weedon
dib
herzberg
labia
ingress
parallelogram
ynez
talkers
staub
wieder
musicological
histological
quenching
crowes
kix
doma
benguela
newbold
unaddressed
toyah
jvm
copernican
gynt
naturist
castelnuovo
couturier
alanya
mehboob
pinchas
ossuary
ihsan
stabilizes
clothier
foolishly
cassiopeia
osx
daichi
intercut
clogher
aristides
twiggy
stockpiles
icebergs
willpower
despina
reith
lofoten
moriah
imps
gradation
guderian
croom
mgb
devant
unevenly
miike
tagle
smelters
skippy
lytham
kaczyński
tetrahedra
kamiya
lazuli
perchlorate
nebel
tidbits
muhtar
seychellois
celesta
nakuru
sousse
discoloration
minnehaha
beeson
geneseo
yuval
público
deposing
berserker
multipliers
naina
kinnaird
mainframes
auchinleck
renewals
slush
rykodisc
sidonia
ente
jabs
alamogordo
rereading
terrassa
tomaso
bachelet
protea
osun
physiologists
hayabusa
pausing
shs
woong
bachelorette
lavrov
beaverbrook
maran
conservators
mirjana
troublemakers
subscribes
facetious
tcc
lik
kibaki
perea
vadis
khoja
sounder
patapsco
surmounting
chorister
alsos
dayne
hatted
malvinas
mcphail
scheveningen
juliusz
ead
tribhuvan
jehangir
mtl
dupe
bryne
venezuelans
adélaïde
fois
ashburnham
mastroianni
ogdensburg
talbott
autun
collina
decking
admonishment
mohammadi
melli
xamax
lube
reducible
aspera
upto
leaped
southam
octubre
chon
moylan
reit
newmark
galvanic
lupa
firewire
bushrangers
contractually
comenius
chronos
deluded
nusantara
zoomed
purley
moorehead
longacre
syrah
easel
critter
caligari
kunlun
dryness
sandpipers
polansky
guile
hornaday
undress
warhawk
pern
harum
malate
klassen
bravado
abdurrahman
veg
insecurities
montigny
khurasan
nordin
kristof
quilting
wilts
dwi
deandre
amalgamations
noosa
taubman
dinu
skyrocketed
malfunctioned
porterfield
knighthoods
yoshimi
annes
mvo
algerians
bellerophon
betta
stinking
berglund
kiya
sackler
annunziata
despenser
culiacán
eccentricities
zdravko
delmarva
enrollments
comisión
belkin
pith
nuked
stabat
quitman
grégory
kerguelen
mideast
lemond
bakke
purine
conjectural
rummel
cobos
thynne
acquiesced
yudhoyono
milnes
macmurray
schurz
daiei
haman
passy
falconi
plebeians
pss
giulietta
swati
pawel
wahhabi
stickler
panos
jugular
mulvey
ikebukuro
danubio
alcan
stampa
harmonie
flautists
leghorn
woonsocket
gargan
hagman
morphy
weinstock
volunteerism
velásquez
francorchamps
talcott
bishopsgate
prunella
portes
lustig
renuka
ravensbrück
nongovernmental
gbagbo
mouthparts
eula
ween
incineration
allardyce
sudirman
swarup
largs
mayakovsky
melanchthon
fnc
belloc
ratcliff
oif
ehrman
reimburse
mantled
angiosperm
foran
etr
alawite
fareed
westboro
penderecki
unoriginal
chirality
tailless
monsoons
shmona
pelayo
babson
foulkes
emulates
downtrodden
berenson
grosses
krall
knute
regicide
nikitin
minke
birgitta
bpp
mowing
credibly
heathcliff
zulus
boothe
tib
niemann
invoices
mael
radish
bioware
initio
sealers
chaoyang
isolationist
szent
canongate
vagus
modicum
matar
pendants
coder
nacelle
brueghel
wangchuck
bonde
flaccus
popish
gannet
mse
ironwork
devolve
ironing
ebs
beachfront
gokhale
pwd
colombes
blr
hird
segue
tarpon
stagnated
ashutosh
yh
mucha
malet
lightening
bannockburn
reneged
expat
sportiva
peruvians
congenial
unhurt
vanes
barrowman
organics
atra
ateliers
hernan
frater
duero
riptide
fons
shingled
laborde
necrotic
morpho
godparents
platforming
terni
ilhwa
melford
libros
goodenough
kersey
mbr
cyberbullying
kalpa
druzhba
roel
metzler
pereyra
lenten
tollbooth
danziger
treading
langur
stonyhurst
leganés
micrometres
mcvie
opiate
baalbek
teluk
kawada
balin
gütersloh
crb
tabbed
aoba
pitre
aqa
garifuna
freeborn
futa
lari
kedar
duryea
lowenstein
vaqueros
pdas
shreds
bootlegging
coterminous
bertelsmann
calligraphers
antonella
carrizo
ampang
sackett
tapir
freeholder
moonraker
karaj
boothroyd
engadget
mochi
epub
toluene
bowell
dinosauria
sierras
colic
starkville
archenemy
harpe
bewick
flintoff
nationhood
winterton
abarth
regimens
maciel
paice
nureyev
santamaria
inbetween
coriolanus
ferme
agitating
lorber
denson
nff
spokespersons
enthralled
personae
cherepovets
reminisces
demotic
fervently
farmingdale
gir
pattani
cori
byblos
upshur
eltingh
grubbs
friedel
poule
psychosomatic
eves
nots
encampments
readied
zw
schreyer
tayler
tecos
simão
fatimids
días
microcontrollers
deepdale
iterate
statoil
skeena
vitaphone
heyerdahl
aat
gushue
almanach
seaters
alluring
périgord
unsightly
gohar
geste
societe
akiba
razors
oxalate
nihil
bashkir
polikarpov
marauding
bacteriophage
yongle
broady
carrión
khanum
handcrafted
amalric
voskhod
suzette
goud
dolphy
majora
furor
mordred
hariharan
nawabs
vivant
moja
rsd
mro
friedmann
orland
bethpage
ayckbourn
newar
yoshitaka
lubomirski
meacham
sakarya
lectern
harpy
esf
kittery
frew
dumpling
penne
giganteus
yair
quedlinburg
spahn
droylsden
painkillers
muharraq
brusque
folies
uncharacteristic
cunninghame
harvie
puyallup
brinton
moret
brazo
terracing
fudan
traditionalism
deena
maryknoll
barretto
cookware
harshest
discolor
körner
fico
pulsars
yuriko
khlong
kudu
kmfdm
rorke
encarnación
disconnection
khotan
vishwa
bicker
lagarde
scoundrel
flirtation
anatomic
cornejo
masahiko
carinae
polder
assemblymen
patricks
tig
convento
ubaldo
sandgate
alki
burglars
antipodes
iguaçu
snuck
hildreth
mcgrady
shenzhou
tashi
matz
kavanaugh
atonal
saur
raisonné
loons
kinnock
zavod
tepco
earlham
pivoting
gaughan
jumeirah
hissing
fabbri
hirsh
joi
rogerson
porteous
hyang
feliks
gatlin
injectors
wasim
latrines
devos
jubal
kismet
forceps
katholieke
accenture
trafficker
visualisation
sittingbourne
hatorah
escapades
pflp
itzhak
zürcher
sorbus
priddy
karat
sinop
studious
xenakis
electricians
pgk
gsn
enacts
maile
vassilis
medes
khalaf
haroon
harty
président
codify
bangabandhu
incan
secularized
sattler
iditarod
spouting
wq
timms
mertz
villafranca
qubit
buber
anura
destin
tigress
rectus
beesley
laminate
martialed
seashells
alitalia
kapital
laika
mtt
fenix
jermyn
rohrbach
montiel
jakes
tenma
setters
riegel
promissory
knotts
interrogates
greenough
panicum
nicotinic
olean
balaclava
quincey
langdale
folha
infamously
rattray
platini
nariño
junqueira
klotz
alcala
relegating
sachem
polarised
harlech
weyburn
atma
artaud
mascagni
velella
juri
noma
bugger
batra
vidar
maximization
aklan
theocratic
resonated
tarkovsky
headshot
hurries
buccal
lindqvist
jaune
butlers
aquabats
pacts
giray
saic
mrg
vasilis
teasers
hamblin
simpkins
fitzsimons
egotistical
tork
arpanet
sakuraba
confection
furtwängler
chyna
ephesians
cipriani
bananarama
pathologies
coddington
gesù
kdka
penryn
jes
multilevel
typeset
lantau
corden
azazel
thresher
mogensen
khattak
aag
blasco
flickering
preys
regularization
olena
yoav
endearment
echos
isak
villette
thusly
winstone
pyrolysis
hibbard
delores
seep
younghusband
kabardino
wasson
habitability
shaven
flatiron
voest
canticle
bronchial
workday
jao
kehl
educations
lenka
reassert
arteaga
yavuz
samy
murnau
sapna
fastener
morag
spf
seagal
jamais
mabry
peony
pwa
stereophonic
speedwell
stubbornness
musics
stamper
tranche
aficionado
gais
logroño
postponing
kael
polycarbonate
polycyclic
scituate
macready
grob
kroeger
vidor
staal
phaedra
zyl
faience
langlands
calthorpe
bowels
swathes
lynwood
amyotrophic
toca
rifkin
osei
faune
compresses
napo
kreutzmann
conflate
yukiko
lentz
diecast
garwood
disputants
nypl
agulhas
cii
giscard
iuris
computability
mothballed
persimmon
legionaries
aedes
heflin
nasrallah
uppingham
tradesman
hephaestus
buru
heerlen
xtc
napkin
conjugal
baoding
zakat
cloaks
oudin
midair
introns
giglio
interventionist
arbiters
mycorrhizal
outperform
colfer
bahini
sepinwall
robie
middling
backstretch
petworth
hulton
hustlers
aspirant
centar
raffle
coquille
sct
tada
nazarbayev
lipson
jeanine
bhs
raichur
aiello
sepoys
braids
incompressible
suwa
normanby
uilleann
unlink
hetfield
errata
zydeco
statically
mciver
erez
inguinal
heatley
maren
troyan
squibb
delamere
utsunomiya
proofread
recharged
wigram
angora
enveloping
handan
dashiell
kelton
fireballs
pruett
geodesics
zoroaster
ordway
castaneda
corrêa
tld
pav
zakopane
btn
hkg
dannii
vamos
neurologists
impermeable
pusey
trivium
billericay
roeder
chieh
doghouse
criollo
dammit
shiromani
extents
thome
rtb
attractor
sardine
tellier
cruze
taupin
gdi
sli
raspberries
clairvoyant
contravene
irn
harringay
haciendas
killaloe
ffb
nainital
bip
sportscasters
eisenstadt
bask
guiseley
lrc
graveyards
choon
dogmas
seiu
csg
gellar
pillboxes
doctorow
iwc
plied
molen
maladies
snorkel
alleyne
ligatures
pez
osb
mso
kinesiology
dapper
gambrel
vivisection
brücke
naguib
kcal
divx
uric
baumgarten
precipice
eman
ltv
internalized
naro
defecation
lookalike
phosphoric
mili
bethe
cowgirls
caisson
scratchy
hinrich
douce
schuckert
snorkeling
heterocyclic
rennae
shellac
incongruous
padukone
transportes
ostrowiec
danie
spiegelman
moroccans
ltp
fotos
intercede
kloss
regress
haneda
krull
bivouac
mrp
populi
lampung
atcc
nadie
fleshing
bellshill
sowell
ambit
whooping
canadien
unraveling
comba
retardant
carolla
kunduz
fino
fmc
kathie
ibom
shide
lapham
puffed
cavalleria
tanz
tmf
convertibles
hovey
gérald
skunks
algirdas
hiroaki
adenocarcinoma
bathory
pasar
yx
eirik
tlr
tongs
martijn
penises
bda
congrès
jaafar
kohima
gaddis
mcguinty
skegness
bcp
macinnes
lacko
hardt
interfacing
nicest
cromwellian
statuettes
doggy
agustawestland
daun
elucidation
sheppey
potrero
gutta
naturals
skala
epl
ashington
abutment
imperatives
caerleon
recommenced
steelworkers
santschi
duncombe
blida
renderer
plotinus
villosa
falter
dhu
poprad
igm
tooele
concoction
polytheism
komen
bff
untill
raha
skimmer
oiseaux
banna
balsa
minimalistic
middens
unproduced
underclass
mohali
franzen
uther
bmd
irie
parathyroid
megalopolis
gtc
puffs
tola
facilitators
dreamworld
poston
ramakrishnan
maryhill
daren
purpura
aloys
dhanush
ascetics
cowlitz
pob
ogaden
icbms
malignancy
targum
quads
senghor
nondenominational
artsakh
beste
epigraph
debaters
musketeer
amberg
unguided
alaa
likening
jahren
raycom
maler
vestige
changzhou
pfeffer
talbert
moccasin
wrights
kadena
dianna
alg
bettencourt
randa
rasool
seele
kuipers
mainstays
jaden
ramus
helplessness
avaya
bulaga
arata
changhua
gowdy
boykin
discontented
blick
bhakta
manhole
capsid
unguarded
snobbish
untied
castlebar
teeny
prats
meares
tacos
egoism
aimée
gms
neuropsychology
comando
eme
liquidator
claudian
werth
megara
lightnin
striping
olathe
agostinho
shatters
kingsville
preveza
abducting
panoramio
icr
buncombe
valkenburg
bakken
authorhouse
hmp
praja
propylene
clo
frightful
kaku
espejo
pagliacci
taillights
abramovich
jobson
jeffersons
shobha
sandhills
dalle
swisher
touraine
pinched
dundonald
michail
arbeit
olhanense
minis
cribs
baobab
peripherally
houseman
sidetracked
anatomists
sandakan
karolinska
bendis
warranting
naively
tortuga
ranfurly
meteoric
sabato
cabrini
voir
priories
subfields
spyros
topsoil
gani
eek
clichy
sassou
wobbly
incantations
pronounces
cámara
ulus
emplaced
bric
declarer
livesey
petrochemicals
advertorial
aimer
freestone
jessore
cleeve
maligned
froggy
antica
shehu
ctr
mcgehee
strix
tufa
sejny
harbison
civita
mdp
thunderball
departamento
intron
chika
jousting
mediaworks
chardin
sienkiewicz
usace
fts
prabha
nikolaev
keizer
zany
shimer
hoards
adat
prefaces
aliso
nuh
swordsmen
liens
irapuato
ditty
osho
theorizes
cimino
relieves
bertolucci
oncogene
comparably
annis
jailer
fanned
mournful
pendergast
sambar
asec
mikuláš
descents
painterly
benford
sulphate
rops
emitters
orbited
preservatives
manzanita
maoists
guanacaste
cama
lapeer
tyburn
coccinea
trampling
shapeshifter
cryin
acuff
rhymney
costco
canarsie
euronext
tomoe
cleanest
sgr
misbehaving
ramsbottom
levene
elric
resizing
troposphere
grandee
homonym
neela
iheartradio
bampton
susi
elva
calipers
stepbrother
friendlier
marilyns
constrictor
willibald
kannon
palmach
delancey
niazi
hopelessness
ifr
dems
curbing
mortlake
ater
choline
zvonimir
nuria
fending
coons
melons
withington
janssens
stranding
phenix
freezers
micheline
golfo
lampooned
jisr
nazario
yarder
sfs
elphaba
wearers
preempt
payette
mccombs
yamasaki
burgundians
imamura
marketplaces
cerium
alanna
glaciated
tita
abuts
broadcom
numbness
babb
arduino
misbehaviour
brp
bazin
dehn
muertos
misunderstands
chloë
imt
microstructure
arabiya
acca
trix
bottomless
intergroup
totti
heimat
mccarter
tanger
finial
namath
etch
rootes
spiky
skylights
asterisks
previewing
probus
renaldo
bookshops
pasi
beaune
crowne
jukes
keisha
upp
elkton
dainty
cimetière
kandal
benedictus
feely
milliner
rerecorded
binet
malam
sango
mülheim
sines
tuy
pertained
moxon
eliciting
welbeck
bushmaster
maupin
rydell
segregationist
makan
duckett
motti
bowland
substructure
drang
morimoto
seda
accentuate
pto
broads
norbury
gangneung
cassady
pellew
shearman
nutritionist
nordhausen
lucullus
verstappen
dcf
rolleston
rakes
sonate
yoichi
kariya
colli
wyck
billiton
raby
beli
variably
dokken
shippensburg
uncommonly
perso
travertine
díez
herrington
disputable
sida
saddled
pinnate
romblon
dykstra
parcells
hedy
cabello
saffir
amani
vaticanus
starburst
piceno
toning
marinelli
aldred
klingons
reincorporated
hebden
ipr
nibelungen
wayang
wapping
eup
agartala
piura
bobs
bonet
fickle
secondo
moderns
gollob
gentil
cheka
gook
tachyon
jobless
parading
isbell
yongsan
radioed
skool
microbe
alcide
klemperer
gorka
serendipity
cybele
askin
sanu
breakin
grewal
noyce
beuys
deform
bisson
symbolised
hauraki
infosys
mcchord
dolgopolov
lindquist
snob
unopened
northbrook
svetozar
gemayel
abducts
welshpool
faribault
kusanagi
coombes
mòr
washers
canvey
pln
carding
blalock
whiskered
jetstream
forehand
carinthian
sympathized
bfd
skatepark
sonali
sieben
drummondville
metatarsal
inefficiencies
ghia
catterick
japonicus
choppers
mackaye
santangelo
ismay
rosanne
alegría
galil
ske
pemba
boxrec
pontoons
conveniences
wallpapers
amie
goldene
artichoke
legalised
ovi
jirga
burkhard
dhoom
vociferous
statens
mckechnie
colombians
swum
butting
inayat
groen
mulgrew
frock
magar
lumières
offsetting
wintered
andie
saumarez
dhruv
sushil
fermo
istvan
boonville
cornerstones
romanticized
newseum
howick
florentino
xliii
sigur
ł
cichlids
trabajo
zea
poulin
spada
bamyan
liquors
overtone
analyzers
mascarenhas
postmortem
borrego
suitcases
raad
wantage
engelhardt
gonads
gainer
polytheistic
polizei
firman
quarreled
melle
huet
decadal
headlong
zeebrugge
stx
frink
clacton
pogues
threonine
devore
lebaron
tancredi
secularist
currant
protrusion
fmp
zune
castellani
clarkston
tarquini
veni
ellerslie
nasri
pavol
villena
sashes
kadri
blanch
perennials
iconoclast
lynde
ecp
despotic
kazuhiro
twyford
captor
exemplars
rsvp
dialed
blais
trice
waid
aliya
acrimony
feint
cotes
gicquel
vana
monnet
evaporative
grandstands
juhl
cryonics
wilk
egalitarianism
generalissimo
campy
somersault
almas
poppe
tolosa
lommel
runnymede
yverdon
grilling
nuptial
wix
sikes
touro
outre
kinmen
methadone
holmgren
kuri
anushka
luso
zutphen
tyrion
craighead
moka
relive
flintlock
unbridled
kapadia
engrossing
terrorize
dizon
auditoriums
shintaro
marseillaise
fbo
mismatched
chartist
bayless
carvers
mukerji
nyerere
shira
zemin
studley
dawning
hastert
lewisville
looie
tuscarawas
svante
stirs
nsr
sudo
achebe
bifida
ouellet
didgeridoo
aea
culverts
sturluson
bearden
exhausts
feira
personhood
garofalo
keres
quietus
lcm
bruns
aleksandrov
hirsuta
counterfactual
sonnenberg
shula
pocklington
scooped
flatbed
plowed
lif
estimators
merida
warez
zawahiri
huq
corda
spurring
siv
monumento
kuba
pachelbel
indochinese
ushl
tozer
gramsci
rosenzweig
cne
lcl
khufu
musicality
hirai
liss
optima
ttp
whist
chicana
moriya
telecinco
pcie
uman
mcgillivray
strived
asper
risley
alphabetized
blaylock
noranda
feilding
hawn
talmudist
adare
explaination
typecast
probed
clerked
stipulating
chameleons
usga
nlt
tnn
catatonic
sobers
llanview
burkhardt
hakata
nidal
feuerbach
lechner
rocchi
metamaterials
tatsuo
closets
chana
gura
bitey
supple
coton
mandaluyong
clas
lyte
deutz
channelled
doboj
valance
dimple
nivea
detaining
landward
nutley
gemmell
vandeweghe
kadam
coroners
pangaea
nilly
edinburg
subtopics
uncomplicated
gowan
waistcoat
nld
racketeer
bnl
jelle
figueira
ré
npf
hijos
glyphosate
wmata
kahl
anshan
tabid
saveh
categorizations
promulgate
chartier
pharm
shue
neoliberalism
shreve
aird
risorgimento
tractive
cheonan
letcher
deprecation
deftly
maurus
halibut
comas
yona
oriol
roshi
dyslexic
jacobins
gci
dusting
func
iranshahr
magnussen
chowder
fratton
rheinmetall
antero
koestler
frelinghuysen
reanimated
sharaf
firmness
neurologic
gigabytes
biomarker
unconcerned
shoko
retorted
mccrary
lalitpur
bobsled
pequeño
jovanovski
uncategorized
phillipe
bcg
mgmt
mugshot
soundings
oxbridge
abdu
yerkes
cranbourne
reiterates
wanderlust
macdonell
overworld
ikon
citv
intelligences
sagamore
glennie
talca
chiao
saboteur
issy
thiamine
jz
flatten
defensores
bardic
hallo
aon
alin
furlan
refocused
mokhtar
teacup
sedges
recklessness
assn
shankly
stawell
occident
commentated
horseshoes
rado
frode
infierno
peron
neave
flin
mystified
matchmaking
bardstown
cheesecake
liquefaction
sedalia
forme
grable
exacted
waddy
davros
stepchildren
sfio
waldhof
moberly
manes
fluctuates
yazidi
corti
retraining
aby
cromartie
saison
tyga
antara
melamine
goalposts
bolsa
fick
hatters
confusions
taproot
syco
sicard
capiz
romuald
absa
clothesline
gripe
paus
shopped
hobbits
cornucopia
quartering
toland
gena
guyon
stiffened
freelancers
celebrant
bratz
raisers
fermenting
peritonitis
iola
radomir
gajah
hampering
stobart
hcm
contributer
afg
hockney
niños
consecrate
holzman
mistry
federica
willkie
redstart
tauri
hund
pvr
dolina
erratically
papeete
marischal
petes
strafed
actuary
supposes
brune
unstated
lupita
numero
radi
houlihan
breaux
referenda
festivity
wab
prioress
sveta
moonlighting
nicollet
amager
sprain
punky
providencia
sefid
srm
universalists
igcse
cinematographic
lanz
morrisons
chicopee
celiac
hirt
baytown
cosas
sète
jere
sympathisers
melvins
terrapin
pinner
xps
deferring
nofx
peddling
flatulence
gide
supremo
kusatsu
placards
lorelai
suleyman
scuttling
heure
hasselhoff
paramagnetic
rapide
pharos
gauze
hosmer
omid
sailer
minx
reclassification
mcdaniels
excitable
korman
vaidya
switchfoot
hafeez
reseller
mycelium
vania
psychometric
nonspecific
ultimates
barden
meles
pagar
reviled
dorsett
lags
wam
ignatieff
morrisville
liberace
tempelhof
poise
lacrimal
aruban
appropriating
villefranche
risdon
chalcolithic
watermarks
orde
cobbs
tolerating
goldenberg
ameen
pneumoniae
mathilda
resonators
schenkel
wertheimer
imro
overestimated
structuralist
kettles
woodroffe
knaresborough
savigny
diarrhoea
bunyoro
bitterns
pieced
trimet
turnstile
boyden
frisell
damen
ridin
mima
lez
fete
trolled
lotion
pinhead
dereham
gille
kraven
chok
bushey
megaphone
tanahashi
cardin
nondescript
boli
draupadi
spey
ladino
finbarr
magny
cartons
barty
memorizing
muta
autódromo
prophylaxis
glazunov
pendlebury
roosendaal
shabnam
brocken
uncontrollably
tynemouth
niort
fontes
droids
skinks
tabas
lipoprotein
madhavi
ulcerative
repetitious
dudek
deron
asides
microlight
congratulates
anantapur
bhartiya
geographies
threesome
barras
wethersfield
darkstar
confederated
metered
meunier
angelico
ilkley
struthers
crutch
fichte
idyll
baylis
furthers
mutates
chlorinated
posadas
closings
aromas
nagaoka
trak
brose
wok
chavan
usat
comptes
breccia
fédérale
handpicked
locksmith
gouache
delphic
idealised
tizi
concretely
sweetener
enka
ponti
progressivism
nestlings
tibbs
lutes
enthronement
picturing
fennel
stewarts
scandinavians
oboist
branigan
zawisza
mutya
khuda
koprivnica
mcw
jarrell
xuxa
arsonist
dzogchen
neptunian
elroy
residuals
quackenbush
dtt
rudiments
preemption
gwendoline
jsr
rubi
uncharacteristically
extraterritorial
poppa
mateos
gottwald
demirel
schönborn
livio
atti
icse
awlaki
stragglers
telkom
voltron
rethymno
mopping
ecclesiastes
starrett
contrapuntal
rollergirls
dharam
ouija
bronwyn
abhi
eiko
triumphantly
suze
deschanel
hovered
walz
liqueurs
watterson
barro
moga
cdna
dishonor
csds
echeverría
formosan
claudel
sandhill
rozelle
yeshivat
inder
slumdog
incompletely
bourdieu
palettes
lannoy
stoudemire
ferdowsi
harappan
tuomas
longueville
smattering
headband
voda
bhaskaran
barrois
larose
nuncios
botham
etowah
saal
macgillivray
entercom
zululand
tyrannus
moche
schwarzer
vitriolic
silvano
olongapo
adelina
ziad
hither
eran
khosla
rebuilds
seamanship
curio
basuki
disinherited
conservatorio
yaga
cumin
jaffrey
graça
geochemical
homem
buhari
blevins
storyid
dais
doer
wasco
noland
pelotas
medline
shuang
hemispheric
tetovo
gaspare
rationalisation
equalize
peeping
vítor
masterclasses
ittf
fledging
grund
pail
eridani
kakinada
disallowing
abdou
interconnecting
romita
surrogacy
glycosides
interfax
pbr
leverett
crisler
mcadam
meac
expletive
murasaki
binders
philomena
kairouan
painkiller
purves
mtm
stratospheric
thermoplastic
khaldun
caput
hasmonean
chirp
minnows
taksim
remittance
disobey
oystercatcher
kennon
polytechnical
newkirk
bellefonte
gomorrah
generalisation
backwoods
winer
lubricating
interreligious
radiometric
mortier
sob
dafoe
bulimia
lingus
silkworm
shakib
olinda
kanako
berrer
rspca
gating
devaney
smithville
nguesso
gorgan
caiman
erdman
focusses
aniline
quacking
keiser
modelo
rousse
jordin
awning
nwt
anole
nikolov
anke
genting
antipolo
advices
cinemax
saaremaa
rukmini
suffragettes
aphex
sothern
lunsford
muncy
nollywood
correlating
moorlands
gou
selle
merseyrail
bijeljina
steptoe
reynosa
entendre
cowries
amisom
lewy
persecuting
pettersen
sturtevant
transcaucasia
mendicant
erigeron
bronstein
netherworld
pygmies
younes
ksa
kocaeli
ballin
degenerates
rediscover
bann
mfg
sulfides
tiptree
concurs
masanori
runt
saylor
naomh
glonass
overexpression
wpix
belgica
agron
kellman
fowles
kadima
epigram
wisin
falconry
mapleton
setia
okafor
kazumi
personalization
koli
spetsnaz
crawlers
malabo
gusty
contaminate
mckeesport
dugouts
valentia
neuss
chouteau
molik
bruiser
accompli
holzer
tattersall
softcore
cowards
taxman
cameroons
libres
erdem
wusa
chihiro
dlamini
sprigg
darragh
velo
spirito
sorge
regretful
kallis
plantes
sumac
reopens
cockermouth
paralegal
disassembly
kadyrov
subdividing
sangakkara
rádio
biscoe
belltower
psf
flaky
shere
impoundment
scrapers
bierce
lightbulb
powerline
truffles
debater
patter
virginis
casein
carpe
dingley
neuropsychological
vagrants
wolter
bldg
vanua
njt
estados
bucking
hieroglyphics
simoni
mctaggart
sga
clampett
fussy
swatch
mader
dribble
usma
sandon
stenhouse
orchestrate
witcher
symes
kotla
neelam
stigmatized
showmanship
smi
agrigento
schatten
vanzetti
councilmember
aiadmk
dita
cittadella
murtaza
revilla
senussi
gcl
otros
rotherhithe
cadman
oncologist
soham
escudo
barbarous
colorist
poc
mafioso
hsl
fernie
kurunegala
recurved
gsfc
minar
capon
catesby
propos
ismailis
schutzstaffel
wyse
dejected
mercyhurst
matei
idm
kayser
passerby
fairclough
geauga
sublette
finnigan
asana
majin
kretschmer
idate
uehara
ringmaster
comandante
jocks
nordrhein
lfp
shikai
tehachapi
afghani
greencastle
mundus
virginiana
dioxin
sunspots
carley
saari
zc
errands
ega
sanyal
obelisks
privat
sylvian
waris
mujahid
rsv
notching
meghna
ahrar
krum
chiropractors
drainages
marky
tenfold
trentham
kootenai
raimondi
sybille
greenlight
bugged
fakhr
schaal
ddp
cranked
iifa
universalis
sehgal
kage
gohan
gosselin
konyaspor
eupen
zinoviev
puffer
leasehold
attuned
raghav
pirandello
raigad
attentional
rosey
annabella
longlisted
morang
weehawken
savchuk
stuarts
overwinter
pronouncement
sixx
sanaa
jls
profil
hele
khatri
smarts
dizzee
gullikson
distrustful
petronius
tarred
detox
undercurrent
forney
shuman
turlough
blondel
luminescence
halmahera
dzong
limosa
philander
arrieta
rola
yoshinobu
starrer
rubidium
weaned
telekinetic
desiderius
anteater
liban
inhale
lemoyne
trilateral
snatches
downforce
kalmyk
plex
nimmo
mackillop
vaal
aldgate
lans
kessinger
unhappily
joventut
jeffs
gigantes
shama
alim
partei
amputations
balla
factoids
drywall
pyjamas
gibney
millstones
planking
ratel
tuberous
unpaired
pry
yekaterina
westermann
cadw
hovers
goddamn
mkd
inborn
barnacles
bauchi
haras
yugo
precast
cuéllar
maman
testator
urdaneta
hobhouse
solidifying
summerland
sprouted
rodwell
idar
evades
litmus
senescence
tmd
middleman
aint
lunda
manis
teasdale
mortification
zebulon
archivo
fréjus
swarbrick
docomo
sekhar
lynden
mccreery
chouinard
crayons
signum
yevhen
peele
apropos
opines
kovalenko
falsify
billups
tustin
larkspur
mightiest
krautrock
silencer
seigenthaler
toyoda
chiming
hsa
lapped
specialism
dammam
adornment
burly
exhumation
bevis
dalkey
valentines
sunan
spann
hangers
expound
soldered
hito
cinerama
intramuros
izquierdo
maltreatment
demersal
doyen
benbow
bub
unm
banqueting
puntarenas
ananta
bluegill
ghostwriter
arius
montaña
tej
gse
omdurman
frown
courtesans
zico
tianhe
rubs
mien
tendrils
wisest
anette
sikander
farscape
historicism
envisioning
gren
stackhouse
oblates
flexing
belligerents
rossington
lep
glean
aparna
repairman
henge
ulises
odakyu
bestiality
ensigns
reconnected
smothered
kfor
kostka
cotterill
subjection
harrold
jad
matagorda
bakugan
hasakah
eitan
kosciusko
uti
arenberg
perse
fornication
doctored
geld
panicles
gaskin
kibbutzim
nesta
spinnaker
nares
bobruisk
jagat
christof
aérospatiale
bargains
wirtz
zygote
erykah
lawry
sneaker
iiia
decimus
savi
nadeau
loney
putouts
unfeasible
cometh
knievel
valkyries
shippers
transsexuals
centripetal
erkki
pectoralis
inflections
imperialists
motivator
misato
protrudes
garmin
masques
lukoil
malle
schlitz
ferrets
malda
marj
ooze
fayed
uris
walmsley
crushers
kipper
combed
sinjar
intruding
newfield
gibran
xxxix
muniz
utsa
wilkesboro
penciled
intravenously
tehuantepec
twang
rapidity
argentinas
pankow
guyot
gratiot
diuretic
jankowski
marray
validates
harmsworth
dugald
disa
irrelevance
landover
theorize
fractionation
alessi
midnapore
woden
gergely
miran
reisman
brigada
gourlay
caracciolo
alcove
triana
steno
coupland
edmonson
kayserispor
bexhill
paltry
joystiq
faltering
kurri
bisset
padi
esotericism
hilltoppers
homoerotic
senta
vampiric
cantabile
hadad
silences
stylists
pelli
jadavpur
hbs
pisco
dihydrogen
complainants
pwc
isosceles
ruffed
rigdon
slays
amarnath
renmin
rehired
oncorhynchus
commends
rehovot
arsenals
baudin
ratko
stroma
soundsystem
shylock
broyles
ilham
pioline
nahi
southerner
cirebon
cibber
reefer
featureless
sundara
salvi
fillet
straightaway
pronto
sfo
unhindered
youtuber
rubella
openweight
jx
soba
carmo
unga
houseboat
chordal
cloistered
uncomfortably
janey
jobe
depraved
iarc
nishikawa
straights
elwyn
droop
unquestioned
teemu
vina
trialist
jayasuriya
ungur
reaping
sentries
trike
metallurgist
rampaging
unpainted
reynold
shallows
traum
arcata
nigar
groban
tankard
herpetology
garners
iscariot
xavi
rias
gunawan
couric
gravis
borah
sables
cdad
citta
gasworks
tessier
pizzicato
ambiente
kolomna
tilda
algo
rosin
colas
cardenal
pagano
khedive
floodgates
tissot
nudibranchs
todt
pragmatics
cattolica
shortfalls
westfall
ahora
guanaco
borodino
peritoneal
oxygenation
thame
icac
brandishing
reappraisal
winsford
swaths
marae
bruun
wack
chopsticks
heritability
reda
hsun
giacometti
dever
gollum
yeshe
downlink
dinars
fontane
mazzini
emsworth
velenje
nalini
vassallo
cloverdale
saini
epiphone
theatrics
karine
eponyms
sagittal
carmack
sclc
transpersonal
tollemache
coolers
delonge
gromov
pyrenean
blancpain
kumble
petersson
tugboats
marz
rebuffs
indenture
cte
melos
dawe
jésus
kennels
bawa
lengthwise
carin
haussmann
northolt
dawid
jezreel
cros
zhukovsky
mafic
bme
yelp
murano
zárate
malbork
muto
pettibone
pageviews
commandeurs
gekko
hiroshige
belichick
velu
wallasey
ceballos
forecaster
ljungberg
abounds
chiari
merci
redecorated
girardot
caisse
dunder
kennewick
jönsson
marquand
bausch
ridding
marchi
mannion
michaelson
ministering
lamarche
watan
tonks
iie
camelia
lemony
vergil
domodedovo
janitors
utr
hoyo
jonze
marios
thapar
shilton
ponytail
rogen
lumiere
lunga
chron
densest
bova
transnistrian
retrospectives
mapai
jameel
fauntleroy
bunin
sibilant
longshot
mook
menudo
neunkirchen
baltics
feelgood
alamitos
lippmann
chanute
vandalia
aranjuez
moyo
fehr
dells
adrianne
attenuata
hedonistic
suisun
magnetometer
moria
tingling
phe
dentata
movistar
snug
jutting
scalpel
varia
chakwal
damiani
bibliophile
scd
pieris
brokeback
dacca
corrine
ordos
lamport
depositions
craniofacial
edrich
cygni
dilworth
catharsis
circuitous
lalu
mln
rya
quasars
eggert
offload
shapeshifters
hortus
sapo
volpi
eschatological
munday
insula
kajal
gruen
vélodrome
minutely
boman
cobh
assuredly
noy
roadsides
nido
blobs
krona
welter
mayaguez
iuniverse
splashing
hern
sweetly
kingsbridge
brundle
holography
hashish
puglia
polley
tyree
naum
dieudonné
woy
ichabod
deighton
fratelli
whelks
armida
bladen
bcb
rafinesque
befell
ajahn
demeanour
kremenchuk
longbow
marinated
roused
sinden
nivalis
evaristo
spectrograph
goch
middleburg
geral
flanges
rgs
postgrad
flory
milstein
epigraphic
sharda
lector
thenceforth
memmingen
legitimized
mccaughey
prata
corvo
continuations
shush
nahar
givat
neurosciences
flann
humbled
arvin
mosel
alby
carberry
emr
crisps
stormers
weizsäcker
cien
ransomed
tancredo
cavernous
alcorcón
killigrew
guyed
talwar
élan
petropavlovsk
xang
animaniacs
phillipsburg
dodoma
evidentiary
schembechler
facelifted
nyx
shoup
greentown
taxila
adirondacks
sait
eachother
whincup
sadism
tuxtla
goldfinch
pretrial
pistorius
kunda
triunfo
ridgeline
dogfish
langen
foretz
stoltz
sterility
krasner
cubicle
youtubers
tamarack
thorold
chamba
sinfield
cosette
cockle
ansan
jumpstart
whitson
jihadists
yancy
kaskaskia
wrestles
touting
kanak
overshot
ramenskoye
lazzaro
tortosa
downers
maschera
comunista
laff
frentzen
bardi
aksel
delinked
iwf
dufresne
rdp
schacht
configuring
■
enviro
wolstenholme
ntl
aggarwal
tindall
tempel
gameshow
impartially
mendelian
jadid
sixtieth
botvinnik
stritch
untapped
hominids
hooley
junoon
trabajadores
ninomiya
macrophage
religio
blueberries
prequels
busking
dopaminergic
rbd
circumvention
gambrinus
allay
seagram
boscombe
krim
encapsulate
mirada
cinemagic
pratibha
chevelle
palanca
tidings
iwaki
housman
momoko
pharmacologist
angiography
basha
dissociate
corbeil
crayola
lambie
nishida
pontificia
namie
mowry
hellraiser
chuy
jourdain
doki
persecute
fiz
clausewitz
tableland
salalah
marcuse
siskel
chandrika
aggregations
pathfinders
exclaiming
unreachable
endangerment
bouvet
argh
jetstar
mobo
polyp
omri
cassatt
transitway
uinta
bums
personifications
pent
golconda
kimmy
hinman
byd
unifil
clung
stander
surin
scheepers
biko
poliomyelitis
nua
krafft
wingless
littlehampton
flagellum
irène
janta
midsection
minimising
righting
fairyland
medill
troppo
glows
pili
gillman
vevey
pyo
shootouts
liddle
praça
exd
datasheet
pergola
sleepwalking
precipitates
ices
undefended
seibel
selsey
gardel
willemstad
adjudicator
longish
panchen
dumpty
relatable
accreditations
unfunny
caitlyn
disciplinarian
rajaram
passé
aeruginosa
casebook
agc
cytosine
xlv
decal
smelled
nurul
livable
ossining
pantanal
salesforce
nbb
daws
qos
galliano
afrikaners
chronometer
viljoen
wiper
bimal
lamberto
kenobi
zork
gourds
glycoproteins
enescu
drawdown
traktor
nyborg
shweta
tobe
proterozoic
verrill
ynys
autogyro
gangetic
jailbreak
complacency
rejections
revisionists
redeployment
discours
castille
ridged
tisha
streaking
pylori
viento
crevice
jedediah
evermore
genteel
warpath
eutelsat
artesia
konstantinovich
pedrosa
cbeebies
rial
tutto
revolutionize
adios
tajiks
mcduff
toshack
ryong
attwood
dbl
ragnhild
decorators
hadera
sabi
flav
mutinies
implacable
antithetical
bowerman
sikasso
terranova
airmobile
ondrej
watashi
dahan
predilection
kob
placido
blockades
mccullum
mobilise
meerkat
doux
boosey
ungainly
plagiarised
tantrums
charney
andris
necker
orpington
fourfold
proofreader
obscenities
smee
albina
narváez
ceauşescu
irrationality
telescoping
duvivier
entrée
dogra
wiegand
honegger
saru
mockup
unfocused
pian
alida
wittig
refueled
lortel
automating
attainted
airpark
diggle
minchin
herzen
josquin
sobral
tetralogy
balochi
dop
anesthesiology
roosts
caney
texcoco
bethell
screed
jujutsu
braidwood
rov
leadbeater
mero
tivo
mystère
mousetrap
otwock
ramzi
furore
piazzolla
lozada
valiantly
griese
ucsc
mirth
americus
oeuvres
gavel
guerreros
lampe
pasts
letterpress
dogger
rumah
thoroughness
sinkholes
siltstone
tni
redefinition
chapple
rava
embalming
hunk
thacher
peshmerga
commentating
koç
skelly
kitagawa
poore
khoda
hoole
gamera
fassbender
kerb
batalla
spano
freyja
ube
lunde
inflows
cair
unnikrishnan
dodgeball
lesueur
guarantor
sentience
mcanally
labouring
kapiti
eurohockey
karaganda
départements
okey
dingoes
impreza
lela
alibaba
pua
rollie
malfeasance
kovalev
alpe
cilento
fundação
abyssal
sehwag
tibial
chambliss
bojana
bonhoeffer
levying
rafale
hsiang
tunable
featherston
twee
markstein
foulis
intermountain
chunma
loony
disjunction
casillas
convexity
tubal
ditching
dialling
filippi
tauern
webisodes
kocher
jagir
lfc
groote
bluefish
feingold
sumi
scoundrels
trc
forearms
nauseum
cognizant
opportunist
mannarino
habitations
dth
bloodshot
lloydminster
mitsuo
diffuser
rochambeau
nock
knave
jory
estevez
urmila
saucy
heartedly
tsx
bycatch
yoghurt
warthog
lakhimpur
swv
ligure
buryat
analogously
panch
dewayne
cathedra
surrealists
gravina
epigraphy
poonam
mcnutt
yeoh
gilboa
unattainable
codd
iberoamericana
toews
quantifiable
girardi
cosima
mostra
wristwatch
merr
marwari
baw
decibels
giovani
akbari
burbage
pandava
fourie
dissenter
ormiston
annotate
phoney
busses
northallerton
glans
yatai
konitz
methodically
shinagawa
golovin
gair
xlvii
abet
bélanger
varennes
hathor
supercross
imprimatur
nsp
parsecs
bluefin
callback
twisters
seaquest
lugger
tawa
rmx
serengeti
tonnerre
quem
crippen
taa
lethality
avr
traylor
bossi
hdp
tannin
scheckter
arjen
hic
ursinus
pinsk
tunguska
herculaneum
walvis
cnd
ribas
inaugurate
stormfront
eudora
cultic
biomedicine
bermudez
gossamer
namm
nass
falange
meaty
cil
granit
genocides
gstaad
ssu
feuer
contraindicated
gombe
lupine
madani
humanitarians
kunar
mutualism
chafee
hafner
clamping
méditerranée
owusu
babette
pinehurst
diener
clattenburg
dissociated
seixas
hallé
insead
bellew
intrepidity
imagen
runa
tripadvisor
thinkpad
rookery
clubbed
memon
souness
hoag
sayyed
ormerod
shuler
dcp
timah
ocelot
walkman
walgreens
ovum
muhsin
thevar
squids
samu
musashino
viaggio
folketing
hosking
vigna
triplane
jasenovac
babangida
postgresql
gracile
endocytosis
roadkill
murine
celica
ljubomir
arcangelo
mcgurk
bungle
rapped
kwiatkowski
totems
raum
kiruna
isin
uke
incisions
cantt
biya
cambodians
reintroducing
tins
denney
picador
spe
avs
patois
criminologist
dace
uol
suriya
perplexing
jamiat
laces
piya
emea
herlihy
hmrc
flatness
friz
studd
alopecia
narrations
kph
miso
tullius
cocking
marins
afterschool
capper
sev
piranhas
auteurs
niet
freitag
coosa
toffee
beauclerk
neverwinter
footloose
cookman
morons
pook
unpunished
stimpson
gutters
faggot
foregone
hyannis
tynecastle
modulates
gabel
andalucia
osipov
puli
eguchi
farnworth
justina
crips
arche
buenavista
microbiological
inventiveness
francophones
solly
machias
caliper
searcher
urvashi
brandão
invulnerability
quash
eurostat
caravelle
roping
dinka
unplayable
equivalency
fwa
connotes
aveling
instilling
lso
schwann
lumet
begotten
grouchy
tamu
infective
sepik
xxxvii
izmit
kevan
giddy
contigo
upr
rfp
sabir
mayport
wss
kinki
longitudes
bbfc
derisive
leva
tello
psychedelics
havasu
sergeyev
makar
workbook
participles
coit
arago
lynchings
keo
bice
bano
quantifier
stooge
nueces
landrieu
miraj
wallowa
leonis
nobuyuki
striver
insinuate
preclinical
magmatic
caillat
caccia
deflecting
empted
dunston
kabaka
polyphemus
drakensberg
bando
pema
riverbanks
interlock
kesselring
belper
interdenominational
eldredge
maku
trias
kriek
keef
crosscountry
moloch
agriculturist
vandalistic
puyi
sheared
fisch
scrubbed
morbius
ahmadu
achiever
patanjali
uninstall
merrily
davor
telegraaf
orono
frizzell
shimoda
naturalisation
chingford
cristea
petrobras
radiologist
bowlby
godwit
tatishvili
yorkist
ratt
nls
sawed
lebens
torricelli
iqaluit
abacha
recasting
schouten
situationist
efficacious
percolation
carpathia
trotters
mannequins
particulates
lankans
colorized
srna
salat
bercy
tarver
immunologist
boogaloo
leman
meistersinger
duncker
porth
subaltern
carballo
spotswood
easa
prostaglandin
tew
hcc
birbhum
haplotypes
neubauer
millidge
tice
movimento
kiro
christodoulou
antonello
otomo
flexed
baits
fortuitous
brotherhoods
dissecting
joc
arum
pcd
dwan
buscemi
tanned
electronegativity
simha
camshafts
cottontail
weo
dabrowski
granule
debauchery
chiller
fantasma
lisette
timbales
gulu
sce
frelimo
irby
rsk
meadowbrook
dioscorea
baldi
gaithersburg
fennelly
burdette
funes
fledermaus
mahdist
pikeville
beiderbecke
quetzal
mccarron
sutil
tevez
goossens
cva
paszek
jamshid
uprated
unhinged
oms
cartoonish
mahfouz
deok
mckernan
retaliates
glynne
casado
horwich
fresher
glasnost
campeones
junco
unsworth
finality
ejector
dimly
boyes
unearned
subjugate
kirkwall
tramcars
clea
ekaterinburg
chari
engelbrecht
chom
saplings
quickie
mog
scab
megaton
allosaurus
jyothi
blodgett
goliad
albatrosses
lerman
cheshunt
councilwoman
melancholia
sisley
appraiser
guajira
aisin
williamite
iplayer
buxtehude
tics
faut
scheffer
puritanism
pearsall
whittlesey
intermingled
noguera
rhona
oberstdorf
clin
anan
bulgar
stegall
petrosyan
alexandrovna
plosive
ghc
fryderyk
areva
cephalic
mishandling
weidman
kein
immunotherapy
burckhardt
bujumbura
anvers
frcp
annul
pless
plies
yount
pastureland
stolz
sceptics
telex
tyme
lattimore
synch
niclas
cress
brannan
epson
fortean
appraisals
stepsister
antonie
lactobacillus
routh
weightless
breaths
crunk
bns
sojourner
rookwood
circumnavigate
recaps
ipanema
blis
masterclass
bluewater
cento
epicentre
languishing
grilles
senkaku
cationic
headhunters
bhargava
chim
joslyn
bourdais
monopolistic
divinities
bayh
jansz
chlorides
heyden
rostropovich
shandy
wisner
arema
nrt
anam
aldabra
winnetka
buffered
changeup
mcconaughey
bozorg
jonker
approbation
andresen
kareena
ravenhill
tolna
kickin
vaporization
kaif
reconstitution
zarya
bandara
keillor
beim
flavian
mitsuko
dror
anthemic
lugs
honk
linc
ume
mccown
holgate
murtagh
newsstands
rustin
schöne
adamo
matabele
vardon
overhangs
innisfail
levites
regalis
jamaal
skirted
adorning
meilleur
mirchi
kristoff
hsiung
ayan
eatery
plon
physiologically
newsarama
thracians
tauro
urartu
wuchang
env
ayla
protists
decimation
ziv
subtilis
saddest
spyridon
ohlone
musick
emerick
partridges
joof
jarno
callander
tomomi
logitech
ruhollah
daddies
prunes
abri
abbesses
najran
ghaznavid
doodles
samrat
kathrin
nightmarish
badfinger
ostentatious
hoople
deduct
hsm
grotius
gaijin
gusev
uj
baila
thiru
borman
spaceflights
bermingham
patina
unrecognizable
tittle
conversant
contaminating
aching
resell
froth
erythema
biometrics
urso
madejski
artifice
whittemore
serafin
loraine
quills
treads
tootsie
juntas
paracelsus
lacunae
revisits
nutting
stealthy
invocations
hoshiarpur
transgressive
slings
gfc
liftoff
comelec
tci
unitarianism
sawant
waltzing
erato
espoir
tonge
fanfic
chessboard
sneezing
oiled
jughead
magan
galant
bodkin
kdp
teatr
ormskirk
obp
bhabha
setar
pillay
glushko
tanith
arriaga
cyclase
kirksville
itanium
mandell
soling
septentrionalis
octopuses
dryas
liew
subglacial
wallkill
corunna
truong
couldnt
nothings
normalised
orgs
mcmillen
superoxide
shifters
slg
complementarity
leukocyte
rainmaker
loomed
antisense
goel
calkins
kermode
ginepri
eci
bilirubin
edgard
gonda
nexis
khalistan
mowers
tena
mazembe
benguet
reestablishment
slayton
derick
mycena
nwr
folksongs
zarand
gimli
dobra
kori
stringfellow
writeup
aisa
ldr
mossi
wheelbarrow
anp
astrobiology
windus
elista
finales
legionnaire
swinger
weatherby
dreamgirls
deuxième
hijra
pippen
rifting
northup
chasseur
macalister
cranach
cavitation
jambi
brathwaite
janikowski
microbiologists
impedes
averell
bhola
dox
adena
karuna
mujib
virginal
soundstage
henne
hammurabi
waterston
pyrophosphate
rathod
bethanie
facings
mousse
almonte
greenwell
kraj
applebaum
vinogradov
pager
bowring
overexposed
americanization
glycosylation
katniss
adcc
hitfix
dowding
crespi
marshlands
benatar
hoad
gauged
gholam
ostrander
ashi
thieme
wallflower
rso
whitehurst
anaïs
ardagh
wail
construe
bobbin
muff
gavaskar
safra
brachial
burghausen
loggerhead
unsteady
illa
unfiltered
soapy
pineapples
hadiths
saiga
digitisation
farrukh
horna
facsimiles
wrekin
whitcombe
tremble
meridionalis
slopestyle
ddd
schock
hashes
imprinting
norrland
midhurst
hammad
koop
shamsher
amerindians
btv
empathic
ussuri
kolberg
mctavish
merkin
beehives
adak
syne
hypnotize
geel
multitudes
railroading
neg
mapp
trilingual
supergrass
westmont
vietcong
exacerbating
petrozavodsk
psb
tari
menin
pompeu
umayyads
hartog
countertenor
carrom
rambert
rossignol
gigabyte
rabia
hsr
iveta
overestimate
troutman
uka
overalls
jvp
frontiersman
iquitos
manabu
daphnis
almeria
collegians
angelfish
pramod
pallium
iha
transhumanism
americanized
yeux
samizdat
dunton
kneale
tante
jaane
toledano
pasto
thermometers
cstv
fmr
capell
corsi
sayuri
hansi
volcanology
navarrete
alge
kuen
turnkey
skra
psychoanalysts
wallonne
flirted
matriarchal
equalising
cerretani
excruciating
eugenius
collegian
terraforming
bludgeon
loko
diyala
paternoster
cattell
kuzmin
phonics
godsmack
riverwalk
waterbirds
ahimsa
deliberated
ahlen
assuage
siddons
allon
peut
tomasi
uba
bruijn
clackmannanshire
nanticoke
rosner
cingulate
barbatus
keshav
inquires
oreo
sissel
poeta
merwin
ffg
wyllie
stegosaurus
moulay
keystrokes
fowley
plaça
rahn
smirnoff
fester
hadden
thrifty
siddur
naish
rivaling
aozora
mccaw
tumuli
wilful
outrageously
musculus
slezak
fellatio
destabilizing
lenton
dake
butane
libertador
eschew
stradivari
tiepolo
pinches
nyanza
helplessly
loro
competently
fume
constitutionalist
brainard
anastasiya
interjection
shoegazing
palio
malloch
resets
borgnine
lignin
knockin
weasley
neodymium
dierks
berankis
eadie
fishguard
dismas
glances
templeman
merc
ilana
phillippe
gaal
basle
parasympathetic
stereotypically
titov
verily
bonnaroo
polytechnics
koning
macri
dogon
wigwam
tamir
walküre
soltan
khattab
persevered
castellana
maunsell
reticulated
huac
podge
ummah
guanabara
cissy
perp
arnoux
repopulated
photocopy
mogwai
brita
glenbrook
balder
margulies
udaya
verdad
guðmundsson
swac
nq
guadeloupean
fantastique
pushrod
calleri
folksong
almshouse
titicaca
infuriating
blowin
mineralization
interbreeding
agitate
yoni
tortuous
rivermen
preconditions
cwc
exton
downpatrick
spruance
dancevic
policewoman
stenson
marceline
divino
eschewing
hola
tosi
plauen
carpeting
realy
chinaman
droits
riefenstahl
mifune
panavision
haben
woodham
wilfredo
mcleish
repeals
heady
romulo
creatine
headsets
taitung
bursaries
unicellular
sudeten
canola
ringtones
safford
pecuniary
prabhakaran
ledyard
cpb
menno
verneuil
barbeau
supremacists
edom
verus
walkout
ninjutsu
krško
arcseconds
eren
fenech
cockrell
australopithecus
ondine
henn
dwindle
collinsville
heiko
subcontractors
dejohnette
holster
greeneville
houthis
railed
paisa
shepp
perusing
bluebirds
kaunda
beltline
dmg
chauvinism
aiga
thibodaux
gloire
milgram
orations
hmmmm
planina
columb
nucky
basa
bence
breese
ushers
infeasible
lambe
journeying
sugawara
egress
rajendran
engrossed
folklife
mops
rctv
shur
medien
searing
jgr
presumptuous
mingling
sneed
codeshare
khodabandeh
gillani
chiasso
villes
amana
broach
chifley
aileron
altamirano
tuk
sunscreen
windom
bri
teg
habu
clarín
desiccation
warri
raymundo
burgdorf
abovementioned
flogged
aurochs
substantiation
okamura
taaffe
layoff
paarl
wilke
bazaars
zapp
kabyle
bloodaxe
allingham
macalester
khost
koubek
creech
milkshake
rayong
karas
melvill
adrianna
crp
macrophylla
csk
heifetz
banting
autochthonous
brooms
blackburne
papin
cadillacs
nihal
chea
teardrops
sidmouth
sobbing
biwa
downpour
sdc
kobi
melia
olajuwon
peal
elizaveta
kanchi
bapu
mesenchymal
marella
bish
opportune
feder
desirous
tatler
shalini
ordine
hav
trifle
neumarkt
cct
helston
trani
prk
thelema
duleep
marland
matchless
balustrades
formers
garnish
owari
dadu
sundberg
ldap
krylov
avianca
emmerson
granaries
sahar
quacks
criminalized
neary
youn
elyria
ardenn
goldilocks
pdm
metabolize
minty
waxed
stendhal
faroes
epg
longwell
bánh
kamke
likenesses
girly
neoplasia
akkad
occupier
cochlea
velcro
aggressors
editorializing
aske
howser
neos
preservationists
vials
abajo
galvanised
hendersonville
penghu
thanatos
wfl
sarno
melun
quakerism
ménard
torben
hopkinton
afterburner
eke
wetzlar
polarisation
relocates
enola
athene
bolyai
striatum
dagens
gunsmith
showering
pias
hahnemann
arazi
lowes
wbbm
vahid
syncing
cada
anthea
kaitlin
tranter
radnorshire
antagonize
dinger
dropouts
shunsuke
martingale
santhosh
berated
balti
komarov
tuke
ratepayers
estela
cleverness
bantry
combinator
vianney
bodden
cloaking
∧
wrenching
blumenfeld
cmr
melding
quis
kunsthistorisches
sigler
chloé
chlamydia
farmsteads
acrobats
lethargy
dileep
dpj
maf
irian
easts
allergens
fraudster
intergenerational
elated
scudamore
rumpole
sampath
tarquin
tuvan
letty
monod
gwang
ineffectiveness
hypoplasia
loni
nekrasov
vilified
pott
assimilating
harar
bgsu
flaring
outfitting
yazidis
hedlund
haydar
boethius
aacsb
triglycerides
hyères
rosi
ulna
deterring
millikan
viswanath
downham
aprons
leaden
palle
samhain
reappearing
epitomized
butuan
escobedo
congdon
illit
quli
gcm
silicates
venango
saps
cadavers
wideband
snitch
svr
atzmon
nayyar
anjouan
efta
vishnuvardhan
salmson
fringing
codemasters
teheran
oled
flp
microseconds
fini
bratt
opiates
comorian
limiter
auteuil
bemused
valderrama
hasler
kaneohe
inebriated
ampex
bonzo
mashable
radioisotopes
meda
peixoto
airbender
jhang
unkind
scholasticism
wealden
guanzhong
fili
unpalatable
stewed
odissi
ous
poppin
convalescence
cfi
wintour
rhinebeck
omnipotence
goosebumps
silverlight
bourgoin
criollos
brc
mangano
pollute
emam
kukushkin
cheviot
contravenes
tendai
tarom
belford
lampang
pushpa
benefitting
nahda
wilkens
bewildering
swingers
zynga
pareja
veselin
arrhythmias
takano
physic
commandants
erythematosus
busey
deuces
centennials
linnet
newlywed
benassi
resemblances
addo
idling
mahwah
catton
styrofoam
kelson
schlosser
muskie
maundy
overbrook
borobudur
jot
apothecaries
matthau
shaki
clogs
hausen
rizzuto
ckd
willington
hopped
alleghany
dmv
shifty
fdd
dioguardi
spennymoor
celestino
coursing
northfleet
gidget
condenses
patrizia
vagabonds
brunetti
grinning
parviz
unscripted
frcs
benvenuto
skymaster
umesh
discipleship
maclay
strasse
khushi
vijayakumar
carrol
bamberger
gardermoen
kilby
gladwell
naeem
reena
scrutinize
dolmens
salonen
ngoc
ilium
proyecto
rurik
kudla
peretti
samoans
duplicity
beatnik
biennially
landowning
schisms
arild
threepenny
kristianstad
upson
gillet
leeuwen
hinojosa
woodhull
schematics
keeneland
geezer
overheat
ziya
hullaballoo
quintessence
cletus
insignificance
cuervo
nerva
aberavon
jove
fraga
rittenhouse
monotony
steerable
nervousness
nasp
horsey
philistine
prescient
daron
vfw
straddled
parisien
petrovna
clarksdale
ascari
chamfered
nobs
stickney
bleue
lifan
battuta
ambi
bulges
javi
barfield
mesothelioma
repositioning
cannonballs
recurs
compuserve
sokal
brigg
fàbregas
hudgens
ahr
forages
prés
breakups
unimaginable
gethsemane
abeokuta
slieve
aggravate
cannibalistic
rockfish
blissful
spirally
leftmost
rebreather
galo
declaratory
pokey
cei
shoemakers
kopf
lapin
pietra
cecilie
derails
bep
palindromic
elocution
dorada
propulsive
shida
solari
kranti
highrise
arron
wiggly
avigdor
pleated
baldini
gmos
rois
bullfighter
loosing
efren
drongo
seram
soundboard
penta
dowie
lazier
waimea
lief
oocyte
perris
sunroof
finalizing
obc
throng
sakis
gdc
curating
heya
boole
msh
sune
kerem
melodramas
wavering
hynde
garnished
cuiabá
lanthanum
badan
liaquat
mukden
hypercube
kenyans
temminck
imitative
rell
accosted
todi
panis
pandering
kourou
panacea
gleefully
pathe
cahokia
aino
unum
kotte
prokop
cherubini
caridad
erk
frolic
mul
idx
sugary
tze
browsed
sakhnin
tambor
jitter
michaux
malignancies
tharp
selleck
jaisalmer
edl
slighted
phares
gws
alpen
wx
impairing
métiers
confide
razzaq
akihiro
mirzapur
sympathetically
possums
moriyama
charlatan
bocas
bandyopadhyay
apoplexy
fretted
waqf
luckett
molchanov
pandu
bassin
dri
meisel
cordially
metastable
turboprops
skanda
authentically
superfly
salutation
harrassing
prophylactic
galla
pagani
topi
etsi
pira
kelman
easements
delicatessen
dodgson
sinaiticus
rimfire
ketamine
kcrw
snatchers
rahat
mcquaid
eggshell
naseem
galiano
poached
anionic
ancestries
formalization
spier
boozer
silvestro
heroically
ruthin
boal
seahorses
ramada
sakshi
conolly
karnal
micrometer
tritt
yaar
dunst
krasnaya
steffens
harpur
nio
rione
capsicum
moos
ovo
mackworth
chakvetadze
lucretius
rusticana
batcave
stix
baldry
henty
hassler
warsi
homesteaders
raze
reval
waitresses
patra
arnoldo
onetime
realschule
filibustering
gifs
yuchi
pusha
raji
grylls
benicia
kalidas
parasitoid
chanakya
transshipment
hartung
wishaw
trillo
prettiest
galvani
synonymously
wfm
dischord
chern
appl
decentralised
unfunded
hairdressing
naac
norsemen
spero
lue
spyker
naugatuck
longshore
freescale
cheam
airpower
drivetime
goldmark
montel
orthodontic
microcomputers
fastening
hellish
uninitiated
oldbury
crothers
fearnley
chalfont
santonja
oster
shami
charette
jcp
facultative
supergiants
milland
cashbox
somnath
torments
nove
maddock
distiller
donmar
goldoni
campi
tress
tswana
chacon
penitential
clearcut
prudhoe
siloam
kugel
aitkin
psychopaths
drewry
chisnall
adorns
fook
garneau
shas
kasay
ironton
unsavory
ruminants
locsin
reuther
neutrals
goodridge
cuellar
hanwell
underwhelming
jurgens
wrightson
erith
masai
tynes
orizaba
diageo
figo
cupcakes
coexisted
stumpf
kendricks
veiga
gondar
sidcup
recalcitrant
laudatory
dazzler
canty
cotoneaster
muenchen
hardeman
coolie
wfan
resonates
fernald
kunze
naypyidaw
quoins
schalk
tecnico
sav
remover
lasik
eliseo
disinterest
fok
mendeleev
pintail
garbin
clichéd
liliane
gianluigi
kebir
yogesh
codebase
fecundity
teignmouth
pliers
patchogue
adjournment
masaharu
radin
ffu
fawlty
boule
intrude
sí
posttraumatic
bokaro
rhinelander
montaner
marshallese
newstalk
krabi
anticline
sympatric
plotter
gagné
amica
ture
potty
palmers
conkling
volo
bickley
timespan
fios
fluttering
glidden
bestiary
unpredictability
sisson
katsu
brannigan
besiegers
metathesis
brassica
giacinto
cinematheque
tux
synthesisers
cantina
khirbet
newcomen
mysims
boma
reverberation
swerve
varda
mcclean
glasnevin
merk
coste
mauch
peripatetic
corrs
sijsling
manado
tsimshian
bandmaster
jorn
constitución
stepanov
brighouse
puran
menashe
moraga
dispensers
southwold
lavoie
rudan
chosun
oddie
tarsal
hatem
masterly
phospholipids
vincente
subregions
grinders
zena
exasperation
jawed
malakand
itineraries
unsanitary
yogananda
umc
machel
nakazawa
altmann
rhoades
plesiosaur
flatland
sibi
wku
kral
frequenting
caffè
maxis
curtiz
abated
farleigh
vtv
plater
wintertime
braulio
pilcher
élite
teniente
alyn
sori
goldmann
elbridge
xylem
realizations
lifespans
corin
emerton
tramps
jayawardene
keough
schoolgirls
retailed
totò
yiannis
shakuhachi
whines
kcr
moises
gestion
wonsan
unisys
venison
seite
bdg
simard
unissued
passersby
brudenell
siew
scoliosis
araucaria
meadowbank
aéreas
gisele
agon
ezrin
kult
soley
taschen
covenanter
regrouping
sahab
carburettors
ravensburg
internals
brendel
secreting
besting
bartolomeu
lederman
folic
groundnut
montecito
sorokin
justicialist
auctioneers
jinhua
acquiescence
peruana
antic
darcis
flotillas
boobies
subcontractor
orthodontics
brushy
miler
selçuk
bracciali
folke
berrigan
fastnet
rosella
umma
nitpicks
skyblue
willmott
pillared
oilfields
fulgencio
famiglia
fagen
binyamin
motherly
nyberg
ehr
bigamy
shuki
neb
subside
bodyline
minibuses
runabout
hollinger
shapira
vtec
burglaries
airlie
sayin
nian
damped
hurston
sph
pollster
camões
eucalypt
lfa
dockery
traumas
adh
pws
maribel
littlest
merganser
twos
gatton
subpoenaed
millwood
ibuki
ahad
postlethwaite
jawbone
andal
teena
rolston
dyadic
evatt
biotin
coking
kalina
buehler
counterpunch
howley
sutlej
roud
tendering
omnidirectional
wilmette
coho
philp
karyn
davenant
bharatanatyam
malraux
barbette
cavalcanti
carbery
pavan
loïc
asplenium
fervour
tskhinvali
okehampton
lgv
bioko
grasso
resellers
mustela
khajuraho
inoki
holles
radula
agt
pastorius
orangery
shatabdi
banstead
keneally
stringing
patrika
lisbeth
knutsford
coshocton
sydow
vitalis
roxette
stargazer
whitehill
patenting
pourquoi
sleds
physiologic
unknowable
powerball
ione
massena
nunc
asis
nosy
flexi
recessions
euphoric
numerology
csir
anscombe
wolk
suffocated
lota
kase
nro
sbu
comparator
boleslaw
rieng
cmm
kyw
zooms
neurophysiology
ramalho
cky
misr
adn
bathsheba
yohannes
unanticipated
sacher
kirshner
beas
pupal
hulse
orkut
jiangnan
vande
kkr
médecins
malahide
idiosyncrasies
berland
lineker
legato
lotz
vidmar
nif
reinterred
piqued
fulvio
tewksbury
toolset
ploughshares
nrp
shevardnadze
egidio
torrence
ramblings
cantacuzino
midleton
deegan
madea
rajahmundry
chesley
lympne
diggings
outmoded
reyne
madox
izzard
virtuosic
manville
marder
gents
alamosa
suprised
kilsyth
gorey
goltz
olimpo
klemens
cryptids
demers
mainstage
electroshock
herm
quantrill
oligarchs
longhouse
tumult
ischia
blackmun
icicle
tammi
ravaging
philharmonie
backpackers
neisse
succubus
saruman
liwa
sanguine
functionalist
demobilised
toulmin
greenhalgh
subcompact
rokeby
nauseam
anticancer
pondered
rst
batna
micaela
retooled
levadia
paediatrics
vaporized
italicize
swanage
inexcusable
carretera
coxe
binti
saudade
tingle
sproul
velarde
lodewijk
sonically
dryers
gladwin
pelly
judaea
randomization
overwinters
afr
boatmen
leonov
novato
outsource
spank
perrine
shakuntala
jamiroquai
crema
endeared
scantily
superset
mto
azrael
varden
frist
bui
lundqvist
misjudged
cust
weds
throes
zabel
heino
viscera
pricewaterhousecoopers
acetaldehyde
racemic
alertness
drescher
anomie
meted
basheer
euphemisms
nikolaidis
botulinum
grossi
beetham
henrich
maryann
heaped
cicadas
polysaccharide
siggraph
vacuous
netware
dumfriesshire
kawabata
raia
valur
palatka
palimpsest
bitters
lalande
invalided
waxes
gsx
projet
asiatica
orsi
shaming
exhale
gilford
spiderman
tamm
nyb
otte
kanaka
khalili
métropole
powwow
kugler
velociraptor
corbels
copyeditors
wassily
yakubu
fridtjof
hhc
rocque
wimp
eir
depredations
griqualand
gneisenau
avonmouth
recompense
unsealed
darlinghurst
mccafferty
matica
steamroller
preterm
annick
playtime
finials
sexology
wealdstone
polysaccharides
kubert
astorga
coz
coherently
casuals
jsf
oeiras
idler
chedi
thurso
ptarmigan
herodian
brahim
gerund
pipistrellus
tiran
dzungar
halpin
pancha
waltons
abcs
maimed
voskoboeva
nul
hopkin
kina
shemesh
edenton
parvez
contrôlée
mnf
abdelkader
panjab
avebury
imperfection
billets
hideyuki
wimsey
sarojini
viale
metaxas
vieille
wrede
stonemasons
allegan
nihilistic
polecat
calcification
biologic
diluting
borge
rhind
apsley
shivering
hailsham
basham
immortalised
pooch
haberdashers
ahmadis
casuarina
dualistic
fuk
sastre
stronach
guillen
prolactin
ikarus
dorji
milpitas
haran
colombier
brodmann
lamu
occidente
roblin
cusick
braine
mycroft
âme
volcanics
haver
chehalis
byram
susitna
peppercorn
wace
unité
vse
goleta
guaynabo
warde
reiki
mistrial
haiphong
herrero
chichen
entanglements
jessi
overwriting
maiko
nex
immingham
nacelles
liviu
rinne
rpp
matsunaga
suisham
storico
mesurier
tirso
marling
mousa
bosporus
sylt
libertines
footman
delage
duxford
theodosia
puk
kesari
kasabian
mineworkers
gnrh
bednarek
bogra
ockham
colloquialism
ambassadorial
prosaic
stowage
urie
dalston
nrm
roly
ciconia
guenther
sorocaba
homeroom
tiernan
fiscally
ancre
owerri
diacritical
toshiro
anup
reorganise
homie
barasat
sandbach
foraker
hydrodynamics
natter
simm
pwn
comitatus
crescents
oau
unappealing
mouthful
hadj
rthk
forename
depreciated
confiscating
orthopaedics
mariage
murayama
wenner
hitz
takeovers
dtc
galas
northbridge
dphil
cavett
tmnt
weatherfield
akim
warty
whisker
revelstoke
voyagers
unbanned
borislav
summerall
costin
ousmane
marois
pollok
komorowski
shanthi
bokassa
jeppe
dienst
highwaymen
rabobank
apf
sievers
musgrove
signposted
syntactically
wicomico
servile
megabyte
brawley
vrindavan
villano
avogadro
kellaway
graeco
mogollon
mallards
juvenil
everitt
nne
piha
brighten
mâcon
mahidol
bruguière
overgrazing
polygyny
royalton
brindle
sukkur
detours
bradwell
khader
basquiat
sorkh
doorbell
crus
karno
seni
maqam
mik
bylaw
zhivago
cecelia
polanyi
setagaya
permeate
sot
cron
milledgeville
domiciled
chortle
foxton
donnybrook
syncope
oded
iiis
unsophisticated
iy
eliane
gimelstob
sittin
meno
subverted
trotskyism
calera
spillover
zouk
brzezinski
albicans
lessor
sethu
busse
semitones
blogged
lawnmower
yusof
durocher
zircon
snafu
boudin
nommed
koné
silverchair
bagging
kuchma
knolls
kander
stanko
maupassant
mcbeal
basutoland
benzema
tippu
sibylle
recollect
kajang
mlm
tecla
kaia
christopherson
untagged
cumnock
attleboro
idb
nowra
haywire
trakai
helvetia
coden
geeky
barça
amarilla
filton
bartenders
ceri
sgc
tla
rheingau
headington
hypothesize
maryse
deserting
placard
boigny
yasmine
outbuilding
topher
victorville
jayant
surly
sayeed
ksk
janowicz
mpl
bhim
menke
gand
antislavery
fraggle
puffing
cephalonia
ramchandra
incredibles
ossett
fmt
whelen
cryptographers
unjustifiable
asner
kazim
awdry
perche
testicle
dismount
goer
biondi
twista
optimality
delco
typographer
aldine
keine
fumio
darters
irigoyen
cetus
ratchaburi
bhuj
pama
quant
grantland
pert
wierd
veb
devdas
puka
sadako
mogg
assemblée
webcams
sylvestre
sandie
sweeteners
bellinger
shanley
belz
timescales
feodorovna
latha
magisterial
anaemia
backbeat
figment
demain
globalized
megaman
proudest
racemes
hamtramck
navia
irakli
jammers
verena
obsolescent
juho
daulat
flapper
jhu
ghs
incubators
karmic
eez
collum
amorim
monophonic
colusa
perpetuates
salesians
renfro
whoop
aglaia
trellis
nephilim
isaacson
kjetil
zvornik
shamed
malvina
rovaniemi
demidov
goenka
cibc
canavan
smyczek
arv
ahmose
malign
medcom
kickbacks
photochemical
grebes
wcs
pressley
goldsborough
rouges
harmoniously
sabana
wasabi
katt
schnitzler
seesaw
koons
etzion
saqqara
haywards
mitchells
chokes
anoxic
tempah
yohannan
gyi
berndt
bissett
sjöberg
pedagogic
collectivism
archdioceses
jaakko
klimov
liggett
socal
tomko
britannic
elahi
quebrada
mansel
nuba
sobolev
quip
siles
balked
correggio
lifehouse
rahmat
huancavelica
lordi
kogi
poti
tamagotchi
diniz
juhi
acculturation
cinnabar
arbutus
kerns
weblogs
michelsen
vidas
campanula
granites
vulgare
explosively
aldosterone
cervus
splashes
narco
repressor
tonio
overwintering
cbsnews
aphrodisiac
warfarin
kanter
crier
diamantina
twit
laga
lackland
whitty
accomodate
catriona
sayf
mesons
abides
genaro
smpte
accel
clank
ragsdale
gadfly
aramis
benelli
tantalus
formless
cova
necronomicon
millisecond
tomoyuki
sickening
catalano
adamski
chaman
fishburne
chander
birrell
madder
areca
voi
susilo
mazza
iya
unfccc
minnetonka
constantinescu
khat
phenols
mayans
ruppert
ninoy
modernising
synovial
skeletor
whiston
fretboard
luma
kri
chiudinelli
loath
smitty
holdout
programa
fairley
samanid
chaminade
internazionali
dhule
llandovery
conran
skied
kawase
lawmen
netherton
chocolat
arna
carriageways
jumpsuit
infomation
hidayat
malcom
getúlio
curries
crackle
potholes
bagge
shipwright
nitida
pares
militarized
homages
noé
bowness
squabbles
cheboygan
ejects
casares
hirasawa
keno
majoli
plumer
naji
taryn
minimis
karg
spithead
stranglehold
hca
iht
airstrips
bilson
thora
wijk
scariest
eaff
lysosomal
cosy
montford
vandalizes
kyrenia
laffont
gosei
vasilyev
drapeau
tenancies
inniskilling
postoperative
manisha
entrenchments
aal
buran
interchanged
analgesia
intrauterine
redraw
vixens
heros
shabbir
gompers
kye
bsl
ddb
anjana
amari
seljuks
nardi
ferrando
couched
medrano
geht
belgravia
yantai
zebrafish
mysterons
shahu
wiretapping
piccolomini
auber
seaworthy
entryway
sandbags
trevally
yesterdays
risi
galaxie
stricker
biro
ligonier
keat
caven
effusion
fittingly
suba
shelford
dosages
esh
linsley
pirro
decimals
jamnagar
walcheren
künstler
montpensier
aylward
consistant
mobi
musing
arsenault
sceptic
fieldturf
draconis
vingt
firmus
malherbe
tano
reallocated
mateen
wiggin
hatim
odum
chaffey
stuffy
tenby
subsoil
zabriskie
nobby
gräfin
fuelling
jelinek
azzurri
yoakam
shins
cukor
sandilands
milind
ragga
sanomat
supplication
kinloch
lolly
penfold
rond
eriko
marionettes
stumping
nakashima
pedley
yardstick
darian
rubbers
minelaying
hoes
carcinomas
rotted
millikin
gravitated
prag
giggle
valentini
deplete
sidgwick
cantona
barré
treves
icici
blakemore
yesteryear
stabling
hoverfly
acerbic
phc
rosaceae
devarajan
mcv
antonioni
dahn
seraph
godlike
synchronizing
clarin
clumsily
seahawk
tant
potlatch
mork
vacating
whatley
hertzog
gruden
donohoe
tannenbaum
couperin
actualy
defra
glides
ander
duplications
quatuor
chettiar
quatrains
mozarteum
comebacks
ngan
sluiter
muruga
essequibo
theodicy
jurij
proulx
borat
dartington
acknowledgements
gowers
bloodletting
ppe
ctf
zhemchuzhina
chincoteague
toltec
dct
poche
croghan
galaţi
passerines
ostriches
biofeedback
tans
carlebach
mildmay
mwanza
infuse
scheuer
warton
prekmurje
bellegarde
conjures
karamazov
burak
fetzer
outsold
kabc
aventures
bextor
kike
tattered
disque
eberhardt
madtv
liddy
floribunda
nhat
urszula
defaming
suruga
milosevic
wheelhouse
honeybee
pbb
grottoes
ankit
flail
meola
wallets
kritik
sudhakar
goldblum
warmia
frequents
waqar
joven
aven
gea
clamshell
crossbones
frostburg
oji
bicyclus
enel
wagnerian
middlemen
houtman
halladay
intemperate
ganapati
llorente
sepoy
capuchins
tann
cottesloe
telarc
beqaa
chenab
vegf
shamelessly
morandi
zaw
burnings
okuda
gundagai
kaká
kothari
enea
carnell
bech
autoharp
youghal
synchronicity
exhorted
carrack
cavalera
foams
diarists
magnify
bhaktivedanta
podestà
munhwa
baffle
ssris
armadillos
jamshed
paglia
rusting
tissa
seay
intuit
thuy
cottingham
aelius
eloi
paquito
engen
levey
conall
preconceptions
fst
cozumel
natacha
mendon
clack
pedestals
hampel
nakahara
quickness
formalist
rdx
linens
apolo
currey
raziel
brahm
monta
rrr
laboring
helensburgh
raunchy
hambro
interlibrary
ampersand
operant
rmt
kolo
obafemi
milani
josette
riquelme
krems
behaviorism
pul
disbarred
huma
jebb
farringdon
antar
heave
telemedicine
kreutzer
presets
equerry
wirelessly
dabbling
mcgriff
mullally
stil
rutten
hamon
goldmine
physicality
wlan
gelb
saúl
konak
spektor
icefield
checklists
nangarhar
patiño
habibi
reunified
ascribing
muslin
endor
eide
jeez
ribosomes
barenboim
karamanlis
sert
kolchak
malak
heterosexuals
legoland
neuroendocrine
allaire
plateaux
rediffusion
jongno
zermatt
batmobile
enteric
coakley
smut
coaling
anania
yoh
biomolecules
moralistic
fuerteventura
awl
necktie
rila
catlett
boccia
whammy
shuri
petites
colquitt
macgowan
diehard
thema
madeley
householders
kirkintilloch
thoughtless
tessellation
sandbank
charli
tumi
karaiskakis
tvc
jaar
adjuvant
outlay
eratosthenes
outlander
emad
hypochlorite
gendarme
conejo
averill
sidenote
tbh
oems
astara
plotkin
catagory
sedgemoor
alawi
kamla
spiers
cuyo
sacd
carded
skimpy
neoconservative
calderas
stalactites
kgl
ruthlessness
molester
boobs
masbate
puke
sourcewatch
hydrolyzed
hts
farage
stimpy
juggle
lio
topp
betelgeuse
aransas
sondra
menorca
broil
engender
gaudy
tco
invincibles
bioactive
atherstone
kairos
dorados
mphil
copán
charlotta
feliciana
satyricon
jonge
bannatyne
alternativa
adduced
ivs
stephenville
lockjaw
msds
saluted
bellanca
camagüey
korte
provocateur
liceu
système
denigrating
wana
monocle
creepers
arethusa
roebling
abf
grassed
spawns
chatterley
perk
flournoy
yagnik
bunnell
ahluwalia
ciutat
sternwheeler
kurland
buzzword
simile
kittel
moise
tempering
osmium
riggins
cabals
licorice
scree
sear
storeroom
daub
rulership
cerebus
cmo
reticular
quails
huppert
lambasted
cassar
theophrastus
loverboy
coursed
florey
skrulls
engr
flers
nayar
pärt
nikolayev
mariette
isation
hardanger
klemm
ekberg
baader
kako
comrie
mvps
aarp
tiziano
shindo
malka
sarabhai
deadweight
indict
unicaja
eyepiece
clairvoyance
florid
teleological
koopman
ipomoea
borger
ston
colombiana
fettes
overrated
plotlines
insulate
exemplifying
provenzano
sabr
thomaston
caltrain
amram
whitgift
clymer
commandeur
joslin
smoothie
cattlemen
begawan
teitelbaum
africanist
communally
subpoenas
contemporaneously
proserpine
followings
mctell
ugarte
sephardim
arcuate
nobili
luci
taichi
integrin
geonet
leftwich
gss
menderes
eisai
vivacious
billabong
sleet
passmore
shanties
jamaicans
sika
fungicides
rdc
kier
scrapyard
unabashed
shortcake
reimagined
stefanos
munsters
schroder
hopetoun
pbm
gpc
grosz
pinson
veganism
godaddy
sog
mince
agb
shuichi
tars
sagging
gwern
indeterminacy
ptuj
ishi
paignton
sidearm
thammasat
duy
dbms
seguros
solvency
zinta
mengele
aponte
impersonations
yhwh
monongalia
glassworks
raisa
betsey
petrology
balaam
sympathise
vaio
upmc
thrips
coppin
dustbin
embezzling
quetzalcoatl
circulations
stf
corfe
battler
eniac
surfin
picaresque
implore
família
vanquish
assaf
shackled
nouakchott
cullinan
castrol
evens
castries
hijri
suboptimal
russification
mpls
juhani
vesting
brieuc
lefroy
gatekeepers
decoders
arbil
misapplied
pnr
collectables
gentlemanly
wechsler
kuhl
envisage
olbia
demme
microeconomics
cedarville
lazo
baume
dva
turunen
waterson
warley
khurshid
biosynthetic
cryo
thickest
fel
divorcee
vreeland
charnwood
margolin
icahn
cowry
gast
guri
mnemonics
otani
glassman
mcdevitt
byrom
totenkopf
cundiff
allergen
ramdas
vss
magnetically
pippi
catan
beeches
retief
elvey
macaws
furst
biohazard
gridlock
turney
golly
fugazi
parveen
klum
warrantless
wanstead
glenmore
babysit
soh
scullin
derbent
toxteth
absolve
sacre
seguso
lites
capdeville
demigod
dvor
oldcastle
dti
tokat
llanfair
roderic
yeol
sagrado
kwami
truncate
eddies
paquin
kiely
blimps
dundrum
mopti
ascertaining
nontraditional
timex
monstrosity
mustering
spurlock
adductor
zeki
wavefront
brashear
amaru
xliv
natatorium
muhammadu
chateaubriand
aiaw
cabarets
jacque
momenta
ptah
quilon
consett
mycological
allahu
mackinaw
thos
loveday
profumo
macaca
vereniging
deeside
harriett
fibiger
casi
battisti
fiorina
dishonorable
erases
tcdd
karras
kooper
frederikshavn
leukocytes
colombe
snatching
dirigible
koroma
eartha
kke
luchador
difranco
boyzone
turi
nilotic
septet
lond
eerily
brawls
encyclopaedias
rotorcraft
primitivism
enamelled
umbc
kbps
fersen
bordj
selwood
levert
kpc
apostates
chaloner
affable
bais
popovich
kujawski
bookie
wittmann
saluzzo
befall
arraigned
claustrophobic
jetties
minyan
asoka
adha
deceptions
kediri
articulations
popstar
offal
dotson
sarnoff
goodale
raimund
roache
republication
calibrate
clemence
tegel
baggins
miskito
sacchi
namen
iizuka
marci
chimaera
chirag
jablonski
livid
inductors
prioritizing
waterborne
templer
driffield
sugimoto
suazo
sota
heydon
appellations
democracia
performative
wijaya
greystone
interleaved
isiah
lemming
sladen
vigan
outflank
selly
amstelveen
voort
killswitch
fss
dupin
stelle
snot
exceptionalism
reserva
juanito
hava
rittner
fugger
drumcondra
padraig
mosfet
crim
ster
fpl
dutiful
hawkers
punchy
dystonia
demetri
mineo
crediton
phylloxera
sumba
tidwell
jordon
gesualdo
bridesmaid
issaquah
mérimée
ratp
kazarian
hedland
baffles
deryck
kamar
fescue
sintering
cfe
matriculating
ouattara
lindh
workin
melita
jessel
vlasov
meester
oundle
chihuahuan
dorney
inouye
gurdon
coretta
mcneely
denisov
dongfeng
mastectomy
sienese
thibaut
piña
tardy
mirabel
impeccably
evaluator
sitges
alfieri
shoop
yuta
conjured
ariz
ebbe
manzil
xiangyang
mmhg
kunitsyn
naturale
nanometer
halder
wnd
cardiopulmonary
suffocating
ostrowski
ybor
asser
strassburg
sainty
korner
repopulate
nizar
incurs
eyesore
hegelian
scabbard
tamsin
ultravox
temperley
scaggs
framers
communique
congregated
bloodstock
matanuska
relegate
jayan
syringes
samia
coldwell
ospina
drupal
empathetic
jubilant
fungicide
gantz
coterie
sylvatica
morlocks
reconfirmed
dimming
maroney
incinerated
danni
seeman
schapiro
sivakumar
tarde
uninhibited
hmg
seminario
kingfish
anteaters
fengxiang
chitwan
neagle
sanofi
guitarra
reevaluate
lapping
thambi
francie
danaus
recoleta
dignitary
pillory
fusiform
fala
frieden
cartago
spewing
contraption
easington
hyams
restlessness
sotelo
laboured
yupik
continua
stonewalling
karman
macias
jadakiss
greenpoint
jwp
monasterio
sealdah
kalla
fsh
queenborough
mfs
bjd
molteni
cardiganshire
antequera
assailed
dhruva
scherzer
ritu
katsumi
eleftherios
marmont
mothra
seperately
zd
fasten
tullamore
behring
coubertin
baloo
odinga
seascapes
kengo
partington
urbanised
oscillatory
washboard
crypts
khani
todmorden
zolder
provably
tackler
myopathy
turrentine
superstock
homologation
wrecker
frança
scorching
shastra
marwar
codons
akmal
chancellorship
instabilities
foragers
brawlers
chilensis
dumoulin
sahl
oversize
explication
osment
contrition
homophones
gimbel
kanno
altdorf
completly
leonhardt
pantone
esv
spal
oum
appellants
moyles
sancha
ouray
smasher
rizwan
forti
diabetics
acanthus
avion
sullen
penshurst
magnetron
bookman
fastback
nulla
moonee
sieber
faz
gibbes
pattie
mahakali
lanyon
masatoshi
familiarise
hooft
ugliest
numeracy
formic
slaton
plaguing
anquetil
arra
storyville
escapade
goosen
madoc
wetton
mckelvey
tourisme
grandad
lameness
schum
amoroso
esrb
glasshouse
giffords
chianti
rustenburg
toshiaki
technocracy
feuerstein
bartlesville
selborne
ahuja
cameramen
cmf
bioluminescent
resurfaces
lateralis
transients
triffids
luhrmann
culkin
bhagavan
calvino
lapa
kensuke
pantai
rost
gim
plummet
millan
zaccaria
intercommunal
tobi
brage
lorrie
samira
stereophonics
pco
lenard
neyland
ethology
homestar
albie
potsdamer
nue
mironov
bangka
conlan
weekender
moxy
cecchini
argumentum
yaroslava
okumura
pinsky
salmo
walkinshaw
kermadec
jayden
amelie
neots
jedlicka
baltica
ordinaries
pae
urethral
lampert
grigsby
nahua
sieger
haphazardly
celeb
iredell
tanu
whiteboard
khachaturian
kaspersky
myskina
trumbo
waterton
legalisation
lamellar
deangelo
kirton
undiagnosed
jcr
atk
quarrelled
kotzebue
saddleworth
shockers
shantou
neoplasm
lento
itemid
barringer
rist
teletype
cambuslang
bricked
externalities
dilma
deira
eyelashes
smokestack
unleavened
botero
cabezas
acv
gaster
masta
stunningly
saenz
viator
bondholders
fleisher
mande
nasiruddin
campbeltown
connectedness
caballo
bramhall
reubens
seabees
mannes
goodspeed
greymouth
forego
verdant
ballance
lundquist
reinvigorated
maoism
sawada
colloid
btec
plainmoor
siad
equivocal
ssk
finucane
wannsee
abdullayev
pilates
samling
idriss
kashif
hundredths
ethane
ashwini
gershom
bodhisattvas
vespucci
transcriptase
sabotages
iow
crosstalk
mbeya
zapped
itamar
mephistopheles
woolmer
runnels
haroun
minding
characterises
anat
ynetnews
reprogrammed
holyoake
implored
emeryville
dziennik
canny
osada
kawanishi
nourse
judit
lareau
strathspey
spo
skerry
moj
konstanty
nica
dispelled
reiche
memri
hazelnut
kliment
bunty
halachic
calabasas
preservationist
shiho
ohs
awoken
grosset
akureyri
multiplies
aew
statesboro
enfranchised
shemp
scolding
newley
savagery
tillage
expander
nbd
shriek
shimura
voll
kawa
disorganised
illya
naftali
homi
millfield
turkistan
petroglyph
londo
andranik
hamline
coders
pangolin
extendable
bundelkhand
kanga
equatoria
dereliction
denholm
leurs
crosswalk
electrochemistry
alenia
interpolations
gioacchino
valen
asexually
dialup
restating
chappuis
budi
dimanche
grantees
tightness
matheus
anesthetics
jupiterimages
stupor
ipn
pieters
magia
kuku
hiked
infoworld
stange
asce
hypoxic
faring
theon
novodevichy
racetracks
leventhal
encroachments
mourad
santamaría
baskervilles
easyjet
cantu
braemar
visualizations
pithy
delias
butkus
stumpings
forgetful
nikkor
hiccups
horatius
derisively
wurm
suzaku
piro
institutionalization
sugarland
grandison
carstens
autumnal
steger
valenciano
jinzhou
cornelio
regen
meller
costigan
sanctionable
conv
idleness
bombo
annaba
fastpitch
ifrs
reauthorization
gallaher
iñigo
lachapelle
magen
perdition
hauschka
bleep
garr
silene
madlib
burstyn
apses
kania
neige
sueños
pulsing
torrijos
ady
paradoxa
tapeworm
acehnese
chipper
solomonic
nasik
avelino
holl
huffingtonpost
boucle
halicarnassus
migraines
emas
esu
archbold
driller
mayorga
unh
spiking
phillipps
poesie
motherfucker
wattled
emmitt
sood
maillard
roto
wrested
berke
mycenae
eichstätt
sherds
paraglider
bulfinch
ods
ptp
shiel
rahway
glória
thievery
gratz
jajce
ouzou
taxidermy
machinists
lumina
cerna
pombal
youmans
hyndman
glickman
cleon
vls
hyrum
jer
arend
gravels
adab
andersonville
owyhee
wingtips
powerplants
kuei
strychnine
ascanio
daguerreotype
rennell
lymphocytic
basara
yura
vandalisms
hota
bayhawks
casimiro
irritability
tamang
benalla
shredding
thorson
marra
samper
jansch
mikhailov
savoir
killam
mitton
ickx
impairs
rejuvenate
aira
spitalfields
singletary
treks
goga
unwed
pilla
segrave
knollys
coauthors
cdh
mez
biomolecular
crustacea
odm
allmendinger
stratagem
paraiso
fernandina
burro
disintegrates
sapa
fdny
anges
linseed
slv
pluvialis
allu
jaki
futurists
epm
safeguarded
preez
corson
kiara
unsportsmanlike
harmondsworth
moravec
sarat
veeck
hentschel
bernays
vincentian
gamed
syllogism
ivanovna
naw
goli
lautner
astrocytes
makarios
saragossa
lightyear
rationalized
gls
blameless
harpenden
willingdon
drews
xmpp
demote
apricots
barrhead
slinky
kamov
shikari
bounties
northcott
tul
burchell
bootable
santis
konjic
unsanctioned
sevan
droitwich
corrector
sull
sivasspor
drovers
pierces
spotters
perpetuation
nistelrooy
leyenda
lenawee
biofilm
zadeh
hamedan
takis
ligo
ahmar
orthographies
wardha
ddc
vigils
niklaus
kresge
attics
ivins
customizing
abhorrent
unf
polynesians
lancs
lubricated
shik
kgo
zlatan
hollingworth
riffa
calmness
kronor
ziggurat
sandino
siskin
dysphoria
orloff
upwind
madcap
huainan
repudiate
mongkut
hokusai
trastevere
neptun
siebel
upu
moorer
lieutenancy
grech
tilts
prestel
karta
asakusa
mimosas
swarovski
roadies
conspiratorial
mrm
floridians
complacent
honka
nazarov
gluing
slavers
longwave
neurath
phau
streit
rma
storie
refilled
disappointingly
papel
asiago
ranelagh
codeword
kab
wagering
mitsuki
vaccaro
payoffs
obrero
mangold
orrell
camra
sauvé
inoperative
redmen
scavenge
sexualized
downplaying
grazie
backbencher
outcropping
pathans
okra
niang
stablemate
freightliner
bienal
bartosz
naiad
stroh
voci
bruning
pretzels
wiebe
lonestar
amplifies
harb
audiophile
walkabout
wari
intramuscular
dagupan
riad
droopy
dzongkha
delightfully
matosevic
perpetua
cemal
appreciably
warranties
thoughtfully
omnivores
steinfeld
eufaula
ukiah
nichole
nucleosynthesis
clench
ballgame
determinate
backpacks
osbert
iagainst
magnificently
nagato
inadequacies
herded
byelection
oppressors
hartwick
braveheart
queueing
snowed
marcha
wretch
galán
armpit
dastardly
ibni
offloading
apatow
cowdery
schoeman
irrefutable
farhat
compleat
sidra
brasiliense
fhwa
vian
butthole
heilman
pardes
rockbridge
mallarmé
unrepresented
consummation
penalize
corduroy
prot
zgorzelec
ferox
nyer
oreille
anciens
nappy
acbl
mrf
raed
reichel
jacq
cordia
tikhon
jérémie
bonnard
kaddish
smollett
wrongdoings
blankenship
hedonism
pangea
neccessary
galata
glenavon
objectification
sisto
nickell
becasue
murree
brinker
electromagnet
rhodope
histoires
pacifico
dabo
jalen
favreau
mackerras
silting
geest
kirkbride
hypocrites
guttural
mdf
kooks
euridice
coulsdon
liebknecht
iwai
downgrading
fib
isopropyl
saluda
cecafa
legibility
ogura
henshall
mnc
cott
antin
backlot
rebeca
scotian
haircolor
carignan
swissair
yumiko
coalville
milch
voluptuous
normalizing
maranatha
prijedor
devvarman
sweetie
drapers
oac
binky
choa
whs
vinifera
synchrony
bachir
bayerischer
arish
jammin
leilani
archaeologically
aeolus
scottsboro
carters
akerman
konta
nevermore
legaspi
pud
sach
laypeople
aggregators
victoriano
rambles
ury
gerland
molesting
virchow
flamborough
hoodlum
binay
squabbling
gushing
pelted
crematoria
eru
unbalance
sustainably
perky
prettier
aguadilla
farces
izard
almora
thiam
khaleej
beekeepers
ovechkin
kurupt
savard
hilmar
martelli
pend
wonky
benita
deller
balaton
websphere
leitao
tumbled
meira
milroy
unu
jamali
westernized
namesakes
harrod
obelix
shafiq
aeration
iryna
chattel
gametrailers
allister
zaleski
luscious
waa
digression
eiger
negroponte
ongar
ifn
overdone
asclepius
saadat
sociolinguistics
solidity
marcellin
wooly
rheumatology
fainter
pejoratively
qual
â
asam
ayyub
doggie
longue
opelousas
taxicabs
siracusa
naseeruddin
stowaway
lancasters
listowel
berns
investigaciones
cornus
dodged
rhetorically
disbursed
montepaschi
arthouse
gudgeon
gantt
suspenseful
beppe
mahe
submerge
alkanes
mercalli
nogent
livers
botetourt
moulder
marand
lakehurst
deflated
rectifying
handlebar
roadbed
obliges
maroni
autocar
dressmaker
hitless
guayana
henríquez
wooley
ayat
pnas
trashy
nantou
beaufighter
truely
soothe
tejon
schaff
ilfracombe
tda
headcount
tapp
spikelets
technics
sledding
pontoise
sewa
newall
rhd
saltash
mauldin
weyland
schall
almoravid
mico
jeopardized
gerrymandering
charité
overpasses
iveagh
loathed
wehen
tamla
seppuku
tayo
blackthorn
coronations
lha
emms
selke
sweatshop
sudarshan
balakirev
auriga
vex
knell
chippendale
bifurcated
seiichi
falsifiable
overdub
falck
alamgir
lior
pandanus
tuque
iter
presbyteries
adjacency
lamine
mothersbaugh
büchner
coghill
arita
araucanía
singlehandedly
vallee
salacious
colonialist
viviane
kojak
fibak
alevi
delink
ekta
homely
diosdado
wistful
mirv
uff
buckmaster
horvitz
disinfectant
burana
yoshioka
hely
herre
ege
schwank
lawal
cantorum
zao
cruden
wenbo
memorialize
kian
womanizing
notionally
hagel
baro
helpdesk
inclusiveness
laffitte
pinang
chaya
wfp
receivable
achtung
ayaz
virago
atriplex
accelerations
featurettes
quadrupole
godspeed
veliky
sahil
kottke
prolapse
tarantulas
strath
petermann
webspace
legates
masturbating
tomoya
isb
aquarian
midgets
autocorrelation
augur
intraocular
somatosensory
leachman
heras
overman
arliss
akhbar
analgesics
evaporating
kirtan
discordant
hassanal
zellweger
ramah
shara
shanna
bushel
jedburgh
parken
murti
unsatisfying
poème
overhauling
tsars
khiva
underfunded
herbalist
ojibway
sdt
andropov
cellmate
scotstoun
investigatory
cols
‬
barajas
ismaily
addressee
trailways
lachin
sugi
slippage
normed
neeraj
mandrell
fanon
riven
disastrously
lingers
recant
plasterwork
putri
teeming
casus
halina
gairdner
acasuso
interconnections
quoc
badging
phrenology
exclusives
koan
macfadyen
moulana
batis
hollands
timeouts
yorba
hardiman
tvi
sheikhs
sammo
andreotti
pickerel
irreversibly
dieting
insinuated
clares
equinoxes
bukharin
ottaviano
tarnishing
tamalpais
jordy
holford
butterflyfish
gsu
avocets
arouca
kampot
pdpa
roseburg
rafik
peavey
pomegranates
laurin
ulica
simeone
platts
mics
druitt
thousandth
twitchell
spessart
scoble
smad
kwik
computerworld
tigh
doorkeeper
goldfarb
stryder
retells
castellaneta
autopsies
cavemen
xen
furioso
calcasieu
argentia
cannabinoids
grumbach
domenica
svff
chiropractor
tennessean
rosse
telco
alleyway
thyself
hawarden
tzara
belleau
everhart
kamm
brittan
sleepover
jimena
inverters
mois
tithing
shiraishi
kesey
makara
hašek
evaporator
uchiyama
uttoxeter
luminescent
resistances
mordor
quayside
bettie
gaekwad
junie
souci
farias
miniscule
waffles
ipods
benzyl
disenfranchisement
wardlaw
tiong
noroeste
tubulin
piter
mumbles
cpgb
conmigo
ahd
barc
repayments
inspects
deion
prehensile
invalides
hijackings
insaf
himes
grrrl
animatronics
berthelot
podemos
abruzzi
peu
quist
antão
pudsey
refractor
esteves
depositional
hanzhong
taung
ghose
penns
pedantry
gizmodo
rós
streetwise
elrod
shite
dewolf
keflavík
tolan
moosa
natty
latinoamérica
blackfeet
unripe
notley
seance
shaoxing
milgrom
ryback
dalzell
pacemakers
annabeth
ximena
sru
acapella
lachman
shigeo
bopp
hathi
muggs
parsa
mechatronics
losey
muffins
cattaneo
quatrain
regretting
collectivist
vajda
ruel
yanina
gribble
roco
numismatist
medullary
seasonings
aarons
lucienne
sécurité
bullocks
ostinato
mcr
villani
wernicke
sarbanes
hamra
soter
berto
lompoc
ituri
littering
worthies
mfi
inte
extensional
domestica
comber
lubumbashi
telematics
housemaid
unicycle
antihero
agriculturally
valiants
tns
pentameter
gruppen
dramatisation
turok
jala
oceanus
sele
aleem
koerner
bannu
budo
granz
bristly
ambani
compressible
moxie
dii
gouveia
beaming
curacies
jwh
molo
colds
kiarostami
enthused
cayce
flatt
maistre
arundhati
seis
nunciature
miniaturized
depress
rushen
bilderberg
deitch
frio
gamepad
deu
kassa
inexhaustible
adsorbed
lanyard
alkylation
mpd
camanachd
gülen
sycamores
nampa
spats
tintoretto
sfax
tisa
labat
fishkill
pittston
opitz
balbo
jemma
noakes
pinnock
adaptions
tidbit
felicitas
hippocratic
caterer
dedicatory
apologia
stared
accursed
vagrancy
tracheal
rhb
sde
mutoh
sweatshirt
onitsha
kraemer
travnik
ramis
christoffel
parenti
mahanoy
confirmations
arbeiter
uea
sturgess
lyles
glutamic
overhearing
sisaket
caning
asir
broder
drumheller
politika
downfield
defile
krakatoa
biffy
freakin
manakin
internecine
rmp
aquitania
tablature
borrell
inoffensive
lins
amartya
mtg
khong
irritates
cuerpo
fisker
capilano
dods
papillary
gambetta
propped
pasqual
dacey
pwr
cordier
ruíz
walhalla
menthol
decorah
seers
dodges
quintuple
samuelsson
exocet
fania
medaille
ges
axum
jayaprakash
wriothesley
pensée
reveille
knossos
vca
oga
leatherface
huangpu
shinohara
zawinul
idolized
natarajan
malarial
suet
zama
daimon
neater
komo
kittredge
digne
pongal
niners
paratroops
overprotective
whitstable
eutrophication
brushstrokes
atterbury
rodrigue
abounded
yulin
paratransit
makedonija
mending
matadors
nicolle
gazelles
sharmila
flocking
ricker
commercialism
tractable
heider
hedberg
circumferential
myc
christel
feathering
topos
searles
hydrocephalus
cvr
ermisch
mcalester
skagway
enunciated
barbizon
nits
millward
sjögren
gutmann
arauco
terrill
comport
thalidomide
sauerkraut
socceroos
siaa
furler
atmos
symphonia
sukumar
stift
yuuki
subverting
philmont
scoutmaster
blurs
usatf
amaze
sparkman
fallowfield
cason
seit
tilling
lowenthal
mapk
videoconferencing
tormenting
haggadah
undine
fatigued
bergkamp
rosenkavalier
foner
amnon
confédération
arabism
fichman
egged
uighur
hoagland
whitwell
handsomely
altenberg
baldness
scotto
incumbency
drona
chadbourne
straighter
tortillas
tish
geisler
ismailia
rattigan
nanni
bairro
overshadow
debord
enameled
compagnia
beato
baptistry
emeric
halsbury
undoes
jornada
sva
ratchathani
ifor
tillich
saris
palfrey
permeates
blatchford
lescaut
charioteer
meany
cheuk
ceyhan
flodden
vink
supercouple
brittanica
perennially
takács
valorous
shortens
trini
voronin
wnew
buggies
zillion
combatting
elicits
shure
kosygin
hilltops
copp
squabble
seaborg
sbd
hammerfest
sahiwal
plesiosaurs
screamer
paquette
modeler
pmp
harmonisation
cally
optician
afterall
beaty
redden
iphones
overwhelms
apologising
hechingen
daydreams
aphorism
yenisei
italiane
durazzo
brawling
kony
tomohiro
edgecombe
bondy
breather
conroe
undoubted
sigman
opportunism
ulric
sharecroppers
poulter
gema
irr
anelka
pilote
skookum
archi
greensand
collinsworth
backlit
kazuko
bodrum
dreamlike
vicissitudes
seghers
sunn
tynwald
conclaves
pneumothorax
nsd
ozzfest
gwadar
chartering
tolga
userbase
choong
telfair
hegde
embezzled
dzerzhinsky
reticent
statistique
semyonov
generalities
falsity
kaukonen
exempts
piñera
wormholes
usbwa
passacaglia
mongoloid
masterwork
strangles
repainting
involvements
interglacial
rigel
kishimoto
dimitrije
manoeuvring
monroeville
keach
suffocate
weaning
bocca
schimmel
upshaw
kalispell
cero
maseru
mcaleese
dorff
oversimplification
boru
zurab
quotients
états
ebner
althing
rony
rattles
nessa
prather
hellenes
enumerates
crf
scholastica
corroded
conwell
bretons
delacorte
afn
sammi
unmoved
aish
siddiqi
bolstering
oleander
kodály
rebroadcasts
scrolled
recitations
carrigan
estrange
stroheim
proliferating
tepid
webm
evarts
pairc
courtrooms
sobotka
tourneur
verbena
emanate
bersaglieri
homocysteine
bolognesi
onerepublic
laundromat
lochner
enfer
zink
subpopulations
orthoptera
fünf
tetrapod
fouché
metastases
pitfall
joffe
dispensaries
zhai
guerrera
smallish
wasilla
sweepers
expels
pron
coady
ryn
vado
cezar
viscountcy
swastikas
bernanke
vibrator
gullah
shaban
bilston
winkel
qw
fach
yenisey
cees
passim
guidry
kirchheim
hotaru
gjakova
critica
ipsos
screamin
indicia
akasaka
alona
melodie
leiva
movietone
baas
blazes
schilder
kaho
canids
bodmer
greenbush
mélanie
crawfordsville
tabi
dermis
evidential
gaj
línea
savonarola
lusty
masry
stationing
melmac
brockhaus
zilch
inla
roars
neuwirth
esri
tawhid
manali
kacey
honeycutt
zrinjski
reema
agaric
spottiswoode
spacewalks
chislehurst
winfried
bisley
malnourished
spiros
bastide
kranz
inaugurating
touchpad
photoreceptor
turnpikes
cde
levitate
carolinian
dinkins
lacson
mallets
skara
reordering
vasectomy
preprint
rogge
hickox
secours
oppositely
zilog
glycemic
thoms
mykonos
lathes
levent
vasopressin
griot
harlot
extolled
muara
tulku
campobasso
haine
brydon
casbah
alii
weissman
goreng
rumbling
efraim
arbus
rollbacks
leptin
arx
wreaking
ostracism
levite
overwatch
riviere
voith
nishioka
lilydale
kristie
millán
bohai
stipends
busier
frond
tenderloin
fermín
oleksiy
starkly
fibroblast
ethnographers
kazemi
nlf
cafeterias
fazl
schnitzer
vergine
cobbles
isothermal
bink
refreshingly
joko
camcorders
sólo
shareef
wilmore
shivpuri
meli
laverton
steinar
makings
blankenburg
thunderous
androgens
catchments
aftonbladet
inegi
precentor
merrion
dreher
wep
estoppel
rootstock
balasore
epidural
adelson
skiffle
adhikari
leverhulme
nurmi
lindon
maughan
barquisimeto
jinks
repatriate
bandaged
siebold
serenata
airco
rishikesh
heckling
columbarium
cauvery
yass
timeslots
amet
talc
treader
archambault
theorised
phang
soulless
ostrom
mcminn
watchtowers
komal
idps
hothouse
populists
quel
downtempo
electroweak
sarala
atx
mullioned
normalcy
premarital
novena
hagerman
schillinger
whc
cryptozoology
clinicaltrials
aussies
rafsanjani
brix
sangeeta
rivkin
gemmill
heppner
alliteration
bakhtiar
danceable
charlemont
byfield
accidently
conglomeration
dni
curates
abierto
akademik
capetian
secord
peden
ultrasonography
gude
denpasar
dumitrescu
giller
faryab
seyed
baranov
antico
sequeira
leverages
slinger
stolp
burtt
persico
hooch
occidentale
monaural
hockenheim
altaf
oer
fantastically
hashi
petrosian
funke
yoshitsune
capron
bassam
newsboys
slinging
uncultivated
robina
brack
albinoleffe
vaporware
moravians
differentially
hassel
filo
airi
itchen
uneconomic
charron
burchill
habibullah
rustavi
shee
hollie
balestier
hrithik
bordello
vicepresident
ballymore
saddler
wgs
koryo
giamatti
lessard
bardem
naushad
edinger
middlebrook
girvan
therefrom
skeffington
fingleton
schreck
phy
contrarian
overhand
erina
fuld
chabrol
piercy
fynbos
cookers
auge
bracewell
muerto
voivodship
sef
farrakhan
wickliffe
mmu
détente
goodluck
sacrilege
counterexamples
fana
buttery
retiree
eigen
yordan
peterman
bookworm
smithii
supertramp
saree
saburo
treatable
kearsarge
iat
adressed
pål
barreiro
adsense
batum
exegetical
omari
multiethnic
classifieds
meatball
sinned
gebhardt
corran
ungrateful
hallen
shriners
edgefield
blaue
lpo
epicurus
scaife
reasserted
viviani
siler
muntz
intercooler
skydome
mulliner
deceiver
microtonal
waupaca
subramanian
joos
haffner
abaco
mbi
persevere
dhx
dragster
aboveground
guillotined
mcneal
winterbourne
crystallize
groupie
toler
pär
deregulated
blazoned
honecker
dampen
diminution
enema
storehouses
acetylation
telephoto
faired
medevac
westerberg
kallio
stonehaven
jamar
huard
mti
baldomero
brixen
emanated
pironkova
waff
domesticus
sulfates
anika
ailes
procol
savarkar
kasur
seaham
smita
pizzo
artificer
perdu
belhaven
manti
radicalized
mycoplasma
armories
favela
hamersley
sociopathic
fels
eckart
sufficed
blockquote
moorman
burstein
vorster
cropland
andong
misdiagnosed
snubbed
devastator
farnum
pickpocket
beekeeper
hohhot
roared
phytophthora
unmistakably
estienne
genista
rohingya
nieder
rayna
nemechek
polignac
cheeseburger
padmanabhan
tripling
kasa
goucher
weirdly
sujata
kalmykia
marcial
heralding
cwgc
spacesuit
alveoli
wakamatsu
schulenburg
annamalai
sogdian
rieger
oettingen
arkell
alkalinity
abelardo
loge
harbourfront
spedding
fcb
jacinta
filbert
milkins
filius
carom
kyd
aop
taskbar
schoolroom
dunya
birdcage
ynet
hymnals
iwm
yegor
battlegrounds
hilfiger
hif
reordered
mgh
pervades
subpar
tablas
navid
outrages
tiv
suzan
magnolias
romanovs
reisen
manami
bonobo
oriana
ansonia
tunics
spongy
tonite
tinder
sharpest
milsap
adoring
majewski
medved
koreatown
hagood
honeybees
exaggerations
kookaburra
fiendish
subroutines
corporals
devgan
loddon
antifreeze
hci
maritsa
cathleen
sderot
licensor
boner
determiner
obamacare
kreider
karditsa
pivots
arpeggios
hercog
maurits
boxwood
calving
anouk
prognostic
futterman
liebermann
electrotechnical
metuchen
urbina
poiana
resende
avifauna
ethnocentrism
amendola
leathers
flawlessly
vandana
saurabh
choco
yamashiro
calne
kalahandi
racewalking
thung
sturrock
winchelsea
magnetized
lazard
carolinensis
suzanna
macnamara
brittney
untidy
aiims
noblesse
spie
flinn
pranab
babbling
onomatopoeia
taiyo
dadi
trashing
backlund
vini
elles
badenoch
bartok
deepens
scintillation
upm
goolagong
askar
yandex
nsk
waynesville
footballs
vorontsov
elses
molt
brasserie
bushmen
macarena
travelogues
thet
barea
cuthbertson
waddle
irbid
dividers
flotsam
jame
arby
vanguardia
civico
phanom
doylestown
grue
gaea
loit
rajdhani
pvs
kiama
espino
exertions
barrick
anjos
robichaud
stationmaster
kalimpong
meech
verger
clatsop
rosado
hopkinsville
adminstrator
zippy
blore
nvc
alliterative
superheroine
cabra
hydrides
sylla
murrah
hagedorn
jupitermedia
volante
silvanus
galan
kandel
undeserved
schaffner
manco
beeswax
sro
soliman
agressive
danae
ohana
gratuitously
buntings
alcaraz
katona
walzer
shirdi
garish
boonton
errico
cachet
toute
toboggan
yigal
exterminator
monophosphate
corsets
whatcha
exotica
stapp
nettleton
maxfield
rustling
keeble
brics
lerwick
sardi
kroeber
intaglio
gravitationally
crossman
coby
flay
rrc
amanullah
sinker
litigant
regni
panj
kühne
galvez
bacsinszky
repentant
majdanek
amini
craggy
recurred
decoratively
casserole
appaloosa
anxiously
nfp
aldean
overgrowth
huertas
mongrel
crosland
reductio
wintergreen
bavarians
baire
pdi
silversmiths
amanmuradova
mcglashan
giovan
guedes
ferengi
reenactments
dowson
licencing
cessnock
regrowth
schoolhouses
abductor
anxiolytic
maritza
crosswords
aikawa
neurogenesis
daimlerchrysler
bayle
gso
arina
bfe
buttressed
peintre
flipkens
vilanova
bosko
rochefoucauld
limbu
haptic
ivens
fredriksson
doused
alfonsín
piller
cmh
contrabassoon
yurt
pageantry
therion
gorno
korchnoi
obote
odonata
baeza
ajanta
brickman
crl
welford
yorks
palouse
sufyan
gyeong
kennelly
misdirection
latrine
kingdome
carer
mohicans
gethin
dentin
swayne
garou
porthmadog
brandis
estrogens
erichson
redeemable
morillo
curnow
fanta
hadleigh
länder
onega
defibrillator
giger
riotous
tinie
unleaded
magnifica
embrasures
tsm
auriol
fortifying
gavrilova
cheater
monopole
sisyphus
tamerlane
biberach
nagin
laboratoire
arcola
hibbs
mabini
jcb
bacardi
northamerica
gili
unimpeded
lingen
invalidates
tallmadge
foa
tussle
qarase
ventre
agglomerations
sirhan
pilsner
dairying
bigby
maclellan
massawa
uncritically
tommi
exempting
dilly
salesperson
purdie
guillem
sheathing
rhododendrons
elit
trollish
torvalds
ngawang
impulsively
barberton
pipestone
torp
catalonian
jaggery
trekkers
silverdale
footers
lamartine
manto
farul
opossums
qassim
dinghies
druga
adami
infest
diced
fermilab
workhorse
quelling
dissect
channa
plaisance
hybridized
acropora
middlewich
breathy
traugott
dioramas
cobi
daemons
fairings
cordeiro
udit
stacie
bitty
inglot
kirkcudbright
duguay
milovan
wykeham
wallerstein
buffa
turbans
gladden
reprogramming
aconcagua
shunning
talaba
arsenide
danna
jesters
rnr
kwinana
heinie
busk
kickback
mansehra
zemeckis
utu
leavers
spanglish
woodgate
petunia
cockayne
zaha
snowflakes
extinguishers
quickening
szymanowski
methuselah
bso
sats
turcotte
cassegrain
saura
newsagent
shiels
rodd
solidago
shackelford
triangulum
burnsville
virologist
fgc
gravitas
prefontaine
shaab
simbad
growls
betamax
accompaniments
aiguille
evelina
cuarto
gromit
ingles
vem
aswad
emaciated
trackways
kirishima
berlinale
qn
imperiale
baldassare
espacio
dobbin
moronic
imprisons
steeles
unmasking
bregman
kurata
reichardt
hatt
clubman
espanola
detentions
ymmv
doheny
wid
fava
stubblefield
berlet
appiah
tropospheric
rathmines
helmeted
birchall
whoopee
anticonvulsant
ceramists
fagus
nadim
lauria
officeholder
obfuscate
kcbs
teresita
sarang
scruffy
decoupling
wiggum
oculi
menor
puddles
nasu
mckagan
ranunculus
bunning
wasa
moonbase
handloom
lipinski
spoilage
eivind
zer
freethought
celebs
cheon
fekete
bullough
dorval
buchwald
aches
leclair
tensas
queensbury
bronxville
mnr
epfl
outpaced
swayamsevak
mudie
drouin
peaceable
fermoy
jarmila
czerny
privatize
drowsiness
minamata
terrestris
flints
zakariya
emphases
ruptures
piezo
garamond
barbiturates
recede
ghazipur
cadherin
manioc
mkii
hanke
anupama
niacin
bagger
estoy
metastasio
sissoko
leapfrog
voorhis
keaggy
froude
guanyin
inheritors
beath
punctatus
qabala
zhongguo
avco
birks
cassis
marth
apprehending
hornish
artista
factiva
poncho
dtl
silted
apennine
bywater
aucoin
sailfish
vitagraph
eugeniusz
jol
crewmember
tappara
wombats
frankl
esser
videoclip
gilley
payam
ladbroke
powerhouses
gaveston
reacher
khurram
expats
huish
unintelligent
uracil
lanegan
sardou
okun
reaffirms
tsh
sedgley
schlieffen
sssis
saath
mckie
vainly
ineptitude
iop
pubis
pythias
podolski
nodule
heaths
guez
ipads
nello
intellivision
sette
porcupines
biomes
doody
barceló
petrich
damselflies
gangtok
hatoyama
ramzan
grog
schleck
création
galop
micha
avantgarde
partha
torfaen
kazoo
matric
crock
greely
panola
avedon
bargained
roomed
zandvoort
valvetrain
meddle
argenteuil
unafraid
tkachenko
elman
sealer
maxentius
omori
relegations
sadri
christabel
swanwick
lamarcus
ceu
‐
umeda
taisha
bellville
hotdogs
parodi
engin
rosewater
kinkaid
kaczynski
blakeley
dildo
calvet
tumba
pingree
vögele
despairing
supine
tarts
relaxes
kosuke
basc
natt
ruspoli
karvan
inde
suppressive
macey
videographer
teufel
slimane
cradley
rop
awf
cruse
boxcars
krom
coch
endothelium
silke
saraswat
scheel
foremast
anar
signy
birgitte
manji
emanation
morteza
flamethrowers
bourn
ghatak
fossilised
matsushima
windisch
osan
drummed
hasnt
patties
bacher
gladiatorial
determinative
revulsion
cagle
kolyma
educationally
erma
frag
amirabad
diao
mislabeled
basler
multibillion
hayton
kanwar
undressed
cml
calero
amaterasu
monopolize
clings
interregional
barrelled
mottola
widmer
abaza
powershot
yahia
eichhorn
tpi
hinchcliffe
nibali
doby
sayan
collen
tweedie
tonka
lias
balfe
weeden
adenoma
pushkar
greenslade
bramlett
elyse
pieds
surratt
megaliths
bugge
concho
exonerate
amoy
berggren
overdoses
annalise
burridge
ecl
vroom
hsn
renu
bostic
biggin
telemarketing
hammarskjöld
cavani
bedser
whorf
piana
redesignation
leonardi
upsala
hout
devoe
niterói
underarm
lafourche
triune
loeffler
komuro
phospholipid
lilliput
rabb
hausmann
boros
overshoot
mumm
pitzer
lothrop
haruko
dougall
bertens
saltonstall
goffman
vamps
trackway
topalov
powerplay
miloslav
repented
incubating
kardinal
macías
snep
pva
msd
hypervisor
bahauddin
hedin
hashomer
bevilacqua
radicalization
braschi
disqualifying
madisonville
gwan
imbruglia
hatley
biafran
johore
katra
ballpoint
gelora
abbès
wani
syl
khurana
highpoint
anchovies
publicists
rioted
taino
semmes
falciparum
larcher
tautological
mccarran
uscis
brooksville
pcw
jizya
lupi
taza
schottky
lcr
ankeny
lingnan
milagro
renews
micrograms
swaminathan
hammersley
clarisse
fabri
paderewski
cappuccino
holter
cch
sagi
grisman
lupu
batiste
grodd
clute
uac
vibrio
impetuous
gallantly
segway
coolgardie
hued
karsh
ghb
shanklin
scriven
crusty
drumline
hydrogenated
oid
metacarpal
zeitz
arruabarrena
goldbach
tannen
prosecutorial
hayle
futhark
steinem
wotan
wattles
covarrubias
anticoagulant
morice
chavannes
tomei
bhairavi
crn
upshot
riina
droste
lgus
juozas
rigors
grossmith
carats
defrauding
páramo
ventnor
cursus
sprinklers
ershad
extolling
orgel
traceability
optimise
amel
adores
wunder
wishbones
protégés
macdiarmid
dramedy
treeless
dolla
scrutinised
sanson
skt
verapaz
théry
pollo
dawned
untoward
poornima
engelberg
mannan
initialize
bachinger
incumbencies
clavicle
leelanau
groan
gerardus
conniving
chiquita
manso
penumbra
quiberon
natsir
hugged
hvar
jaromír
vagnozzi
dahlberg
lodgepole
escapee
tetrachloride
tsuru
remitted
arsonists
admonish
rhizomatous
guus
neurone
perfunctory
freehand
morrie
gristle
leipziger
fahy
esg
drogo
nuer
kbc
replanted
lariat
saporta
raiffeisen
alper
fha
kerensky
tassels
flashlights
ashfaq
pilings
alessandri
monachus
psychobilly
capito
hallyday
shuriken
stürmer
admissibility
lhp
unpretentious
sanjana
bandana
crooner
sidestep
nogai
llull
levenson
issyk
emoji
chaffin
velikovsky
boyish
conca
evangelize
ital
broeck
hematoma
razan
lii
eldritch
secreto
rushworth
barbecues
kalin
viburnum
niranjan
schizoid
zuko
rangan
angewandte
obtrusive
rapti
martinson
longhair
catheters
strumica
morass
schmit
alawites
schlafly
plaudits
niobrara
adelman
westman
andijan
tayeb
quinault
kilgour
troi
langridge
bovina
fanlight
ieds
conformations
heneage
bueller
hurlburt
nevado
ald
lamond
bruneau
penwith
irshad
mesmerizing
seiler
yoel
homered
zookeeper
milken
mortified
margined
bosanquet
sombor
labrum
saadiq
rausch
cornaro
carcetti
dyess
condescension
quetzaltenango
châteauneuf
synchronisation
zubayr
annacone
glucagon
rosendale
kaisa
scrip
fisheye
richa
torrid
yamin
kazmi
gannets
pomorski
pandian
betweens
linoleum
aedt
loopy
punchbowl
plantains
rotana
linnaean
aversa
shiites
undervalued
choses
mabuse
shrubbery
mns
grimston
hayami
mqm
hyphenation
omak
trimmings
tyger
spica
executables
natividad
maksimir
dinsmore
niemi
insuring
lahaina
ofelia
bloating
vikki
geta
firecrackers
dinero
hoopoe
nimr
sourav
bindi
dibley
defamed
mcbee
oggi
banbridge
wheatfield
derechos
luitpold
vegans
semiregular
diavolo
epicurean
chakras
corozal
shimomura
stromal
commercialisation
ducted
mesas
cheery
namba
gosnell
invalidating
csar
fawzi
talos
fbk
koike
srv
hydrangea
hopson
intervarsity
subramanya
proudfoot
lippert
wachtel
ketcham
villainess
matsubara
shannan
turturro
swearingen
trainwreck
sitta
netizens
sabers
fouquet
hillyer
topol
overhauls
glucocorticoid
ivb
damietta
hekmatyar
offensiveness
arnot
severs
lipschitz
uys
finke
partook
psk
jiabao
sherbet
lohner
saroyan
mahabharat
stereogum
wmu
rearmed
damavand
pinching
jesup
zahab
endorser
refunded
reynoso
selah
pinna
coppice
alternations
peels
maddow
finnegans
protrusions
jeou
ukhl
bildt
fpu
buckie
razer
trisomy
ettinger
airdrop
rytter
nannini
banke
espouses
lach
aiff
inducement
traub
beecroft
risque
headlamp
gute
frickley
asthmatic
peeps
lud
camerlengo
delroy
cosmetology
zorrilla
rashed
ziyad
harbouring
lucina
phillimore
misquoting
laboral
viner
airedale
lmg
kalas
petoskey
millburn
sayle
horsfield
freelancing
uji
intellectualism
senor
fineness
bagby
demonology
difficile
dcl
puerile
lubitsch
merrell
vittore
mip
zarina
heilmann
tourniquet
gyaltsen
taormina
telmo
colum
callosum
misdirected
niehaus
expounding
bilaterally
histones
mcferrin
milkmen
belfield
blotched
croatians
auditorio
fermor
butadiene
derailing
deval
,i
laudrup
luscombe
propels
colorcode
madureira
messick
polyakov
pacey
sremska
emigre
gocomics
jogi
symposiums
frac
collegiality
erzincan
fusions
kailua
beed
fracas
heysel
girija
myeong
soloveitchik
ginette
rottentomatoes
postmodernist
razing
zamenhof
mukul
pediatricians
tunkhannock
perreault
eschenbach
mianwali
accross
bolshoy
izzo
handcrafts
alem
sleeveless
jerboa
extrapolating
kek
prefered
percutaneous
iliev
morven
snowbird
ferrera
meaney
intubation
banishing
vanities
scheherazade
alami
abele
kalika
expressiveness
distill
orihuela
oflag
fujiko
ethnomusicologist
dougan
thrombin
caravanserai
fabia
dundurn
roxane
pochard
kcc
biddeford
layup
arboreta
macqueen
ainsley
goi
dossiers
tga
bandeira
joli
oppress
snoring
uncompetitive
claymores
terada
overstock
devilish
silviu
bator
portola
postural
misogynist
zealously
yaron
strikebreakers
asadabad
simenon
xinhai
lawndale
tillotson
kühn
baksh
clun
srinath
ailey
hermansen
tween
deneuve
bitlis
rup
malmaison
laxative
disheartening
infringes
nuwara
atlantics
flasks
fredericia
tumbleweed
applauding
microclimate
yoji
usi
pesce
heusden
ferland
bhoomi
nicolaes
alou
jabir
brearley
blunted
venosa
mugger
zouch
pfeifer
agl
santuario
comprehending
blanchet
lepchenko
ideation
icw
coningsby
whither
shafter
fabiana
cloture
sandžak
eardley
appetizer
loafing
daoism
clu
helicon
mauger
fontaines
impersonates
jaguares
speedboat
tsw
fabienne
menino
renounces
barahona
commerzbank
wheelie
heineman
arleigh
shawls
throttling
deconstructed
throwdown
aguri
lmfao
palabra
defa
sld
amerigo
suna
trautmann
monocytes
ilchester
tongarewa
munchausen
lindow
jehu
quorn
ambrosian
vjs
compositor
munros
servicios
dotty
accomack
najibullah
basanti
torsional
tayloe
billeted
rpr
diuretics
spoonbill
misericordia
detlev
reaped
connellsville
berating
westenra
prudente
corporatism
inaccessibility
hoppy
scw
homepages
hybridisation
konan
comecon
hedgerows
ints
hmt
assemblywoman
taba
cesaro
ringleaders
speranza
cavers
capitulate
kober
reivers
linford
odsal
cashing
speedwagon
hikmet
disqualifies
georgiou
jase
disposes
equalizing
stoppages
iskander
postnatal
estrus
bayelsa
blanding
ete
tibi
jugglers
exner
tympanic
aldermaston
syros
gigli
sohr
electrophysiology
hulked
menteith
vegetal
striations
antiochian
gnc
usain
riessen
dyffryn
astbury
miah
monocacy
papery
mitzvot
navegantes
ichthyosaurs
gratifying
fairlane
anaphylaxis
lebeau
deewana
bex
viel
avarice
snowfalls
potgieter
pesa
monosyllabic
moonlit
valmet
hamsun
yow
bnr
vacationers
wyche
mismo
desu
deren
insp
yanagi
bookmarking
meatpacking
mengistu
nitschke
itami
parrett
vpro
faia
levittown
subarachnoid
trelleborg
nilson
diastolic
primula
gonadotropin
lcds
chigi
dbr
kryten
pander
consigliere
eston
figc
anschutz
wiretap
pichon
perdita
harts
takaoka
aldebaran
ironed
cordite
baps
montville
premi
valeriu
landor
eam
merkle
olivieri
daro
leonore
hocevar
uppercut
usui
palas
birendra
bothersome
lestat
kickstart
silja
doolan
slu
aled
nativist
klia
hertzberg
andreyev
hete
dht
gassing
fengtian
sonning
itil
olaru
angulated
ahu
giudice
groupthink
dard
tsurumi
gregori
strana
cogeneration
unruh
kingwood
qam
numbing
manan
iapetus
paré
buc
leery
abhinav
tokaido
burberry
skewing
wadden
rorty
wagers
jil
munteanu
ogilby
dorp
wiig
seedings
resta
gidley
underlain
casta
shuttled
jealously
canting
lincolns
retrofitting
daeng
misao
sorrentino
sall
colwell
thawing
californicus
hornbeam
banditry
yacoub
katori
lymphomas
amaryllis
panicking
chillies
bobbio
schomburg
alds
queers
dork
worshipper
icky
schismatic
ppb
archean
goverment
walliams
scb
nederlandsche
saami
unordered
chinned
litigated
shushtar
bansal
arredondo
teleporting
frankenberg
junji
mignola
zadok
atif
wadowice
geico
pichincha
coto
furneaux
unknowing
ilsa
confusa
tagbilaran
pion
krush
agouti
towered
simultaneity
andor
jordaan
kishida
egf
ctp
khoisan
yaman
rugosa
unpleasantness
pinsent
joensuu
aztex
antifascist
watteau
millipede
personalize
stonehill
republish
prongs
grünberg
bnc
fertilize
rietveld
joust
schenley
navarino
olefin
leka
akutagawa
metohija
sequencers
scoping
florissant
bisexuals
stupidly
crawfish
zeppelins
scabies
abadi
bdc
mpt
ephrem
tichborne
willian
eichler
commonest
parthenogenesis
turvey
chronologies
squalor
bvi
arawak
bitstream
mabbett
somesuch
kole
wayfarer
tortola
laan
leeb
tamiya
happenstance
horford
enchanter
pilon
coir
braz
snip
waycross
sadan
digests
zab
garib
ideologue
irreligion
snaefell
yoram
amphion
embree
annoyingly
chuquisaca
ipt
maser
clallam
homebush
nadler
mustn
hayride
quadri
guid
prow
gpi
vyvyan
signficant
fidler
hirayama
sains
staubach
lammers
duffie
jjb
wollheim
forecasted
karnes
erlbaum
insurgencies
milazzo
stockley
viktorovich
cappadocian
weisberg
fallopian
lautoka
belies
rosenkrantz
iwakuni
krk
kataoka
roye
aspinwall
detaching
kultura
fahim
obliterate
bischof
reseda
hasdrubal
iko
javelins
tyco
innervated
manhasset
rancagua
musées
maeterlinck
adware
tonelli
belmondo
latinised
ordinations
kattegat
martz
marotta
arcing
liquorice
calzada
matted
mizoguchi
sandoz
kalutara
annihilator
icke
elizalde
interminable
zte
chemokine
takoradi
margera
tipper
paedophilia
malar
shearwaters
hulks
hammel
flagellation
nasheed
pardus
iolanthe
pyrimidine
sweaty
greenbaum
teodora
shotaro
starsky
yasuko
sumpter
baptize
dolorosa
kakavand
whistled
bett
kheri
fonseka
vetoes
cloaca
subedar
biber
sergent
arianespace
seager
watsonville
biomaterials
cutscene
handbrake
shola
coffeyville
kashmiris
maillot
cronk
letizia
paralyze
twigg
aggravation
nisqually
mimetic
piccard
lebowski
freyberg
ahura
quinte
joyride
gennadi
neta
dungarvan
schram
adua
chesnokov
halve
staking
toivonen
parkers
eckhardt
tdm
equalise
readout
gaudet
dta
homeported
speu
baeck
semenov
reve
lillo
toribio
contarini
palani
loman
brilliancy
lnp
pisan
hajar
reasoner
agonizing
katara
intransigence
hymen
paralyzing
afra
dorrit
toombs
breastplate
bulldozed
dreier
timekeeper
hirta
abr
glucosamine
arrhenius
backlight
americium
meikle
saifuddin
landholder
terese
pyrgos
evangelos
formulates
pipefish
verdier
téa
klinsmann
pushy
boyko
hammon
szd
brossard
xenomania
blanda
decentralisation
interchanging
pappa
contextually
undisciplined
disadvantageous
bodom
mutagenic
pistachio
wilsonville
debora
indentations
dhahran
hodgepodge
davit
rulemaking
sambal
tozzi
willet
fiordland
souto
mdi
kenmare
bailando
uncollected
pinos
iligan
yantra
shunted
heifer
triennale
nadel
darlings
biostatistics
pretenses
onizuka
bcn
sepúlveda
nini
konno
distaff
steelheads
asli
skyrocket
barta
awolowo
morphogenesis
flett
wpsl
shochiku
mujica
subhadra
mané
dohrn
trotta
kerkrade
jeremias
camerin
imperialis
leoben
ffi
masochism
musta
lauterbach
saxifrage
fus
beefcake
pestalozzi
bion
frid
floristic
theophilos
conceited
singe
macdonalds
khanates
turpan
bartered
coldness
unrealized
yadda
synergies
roanne
rebar
butchering
lohmann
piñata
delaval
contributers
aventure
materiality
veniamin
llwyd
palance
birk
süddeutsche
ebc
cliburn
sartorius
cahuilla
imperfectly
ornstein
imputed
panne
haymes
siriusxm
shawano
upminster
ecf
adalberto
borowski
upperclassmen
autocad
sampo
pulis
solas
louisburg
bouncers
radboud
bogen
incapacitate
porras
timberline
repurchase
danja
briefer
lydda
qawwali
interscience
affirmations
shiksha
whopper
tallman
criminalization
nitish
epica
remsen
fisted
longhurst
lavie
sheree
malthouse
retrievers
swashbuckler
shill
pronghorn
beaudoin
aydin
zzz
glib
guilder
ginetta
chito
lolland
montjuïc
underweight
delimit
andersons
bwa
diedrich
divines
downsides
paas
stor
generales
kune
dorota
aalesund
medi
modding
hanmer
windowing
forestville
foment
superhumans
dkw
villach
hdb
glendon
lovesick
zukunft
nanoparticle
whitacre
herzfeld
dosen
stabilising
ayam
rickards
wladyslaw
zal
eval
historias
almaz
longmont
midrange
bilby
unmolested
pownall
honeymooners
ampleforth
tatham
registrants
helgoland
manzoor
sumit
eubank
kington
stubhub
provisioned
tweeting
gambled
scrambler
preloaded
landsman
shawcross
pavlovna
malady
kade
sakae
otakar
deni
wracked
abbado
tulle
glenfield
pottstown
orbán
lustrous
bridesmaids
saviors
guttman
langtry
tamas
pensive
lumberton
immersing
tmi
maharajas
watercolourists
akihiko
mitten
giuliana
céspedes
valter
sonet
plows
rerelease
arar
vezina
jerald
utagawa
detonators
suleman
gentleness
keothavong
holberg
zephaniah
triplett
kazak
sectioning
sachdev
fetishes
charnock
dtp
hcg
partito
albertini
wel
guinn
jarmusch
bleszynski
omm
nemeth
underpin
ascorbic
hunnicutt
bloodstone
svea
ging
santhanam
bellis
strum
salvagable
meagan
coheed
televangelist
gundy
watermelons
emeka
vaught
creatinine
lukin
capelli
jefe
cabarrus
marano
limping
ladybug
qais
solow
amalgamating
hebridean
pratensis
moin
surinder
newsradio
epigenetics
heymann
yari
lobato
hereward
kulik
appelmans
mandelson
ministership
odometer
cookstown
eakin
relent
nipper
cacophony
liye
infers
amante
nourish
katrine
compacts
headbutt
bandeirantes
johnsson
backman
immanent
bumblebees
enamoured
beane
newtons
emptively
aimlessly
faf
putih
arachidonic
chennault
poss
ndi
castleman
fintan
schiele
lehane
standardbred
bisping
uvalde
mohamud
inimitable
corned
brainwash
gha
masan
darr
guidi
khanh
nct
toolkits
purist
styrian
masvingo
solanaceae
lurk
generics
colenso
boonen
eber
acceptably
wellsville
densmore
mamta
chipperfield
irreligious
yoshikazu
bettering
swabi
oliveri
stradbroke
uprights
armatrading
golgotha
woodie
cascais
shotton
himesh
luff
ift
karachay
rito
bruyn
cocoons
rattler
tints
museet
tenison
vampirism
raikes
larocque
halsall
salability
waterproofing
fingerstyle
porcine
voces
matale
cleghorn
ickes
cibola
fentanyl
topsail
hodgkins
typescript
brisson
rast
gusti
morio
pestle
tsarevich
melange
stepwise
piccoli
stipendiary
europaeus
feo
strobl
purveyor
phaser
xzibit
fazil
arians
protozoan
physio
albacore
decipherment
cahir
environnement
dorling
combi
ticketmaster
baga
hermaphroditic
arnau
unlearned
iniesta
manservant
ruber
mourne
primero
siciliano
obed
supriya
bathinda
bruder
vitas
goodfellas
lynd
teva
concourses
scobee
colmcille
willowbrook
blasio
harriott
rabbitt
tussauds
makeba
bizzare
dumber
giv
zorba
arenal
leeming
kageyama
skog
reutemann
intuitions
porvoo
bonavista
peppino
cardus
coiling
scurry
rajamangala
sunna
alwin
macrocarpa
pittsford
sevendust
boldt
sonntag
frenchtown
tokamak
canaanites
pyrrhic
hexafluoride
militarization
intros
brenna
phantasm
oconto
nhp
abaddon
massinger
malaita
zoologica
hdac
magisterium
blemish
baryshnikov
loya
gorm
requisition
ipm
merano
martinho
fudd
upholstered
lumberman
illegitimacy
cowherd
castellated
workloads
takaya
garside
wove
gaddi
hesperus
herriot
pesach
asci
dawud
overrunning
urich
carotenoids
surrogates
spittal
resourcefulness
themistocles
manilla
steichen
decrying
swarmed
interlochen
dacha
seascape
maypole
funder
coarsely
jackrabbit
colonialists
margaretta
iduna
braff
brassy
wrx
cois
eckstine
morán
nandu
jelen
anthropic
pinilla
bolles
riske
nabucco
goold
waynesburg
rhus
showboat
oilman
kamini
vinland
leggings
osmani
kajol
hadrons
politecnico
fairytales
toomer
returnees
peptidase
pièce
slanting
vining
swadesh
gos
się
pinkett
sextant
lamonica
sifton
lofgren
sambora
oktay
vagaries
liancourt
lausd
allington
rajagopal
twombly
ivi
speedster
juche
saroj
palmolive
soloing
urticaria
bellies
apalachee
clg
extraordinaire
unearthing
lowrey
legolas
succinate
kojo
taib
alphabetize
retinopathy
juggalo
copilot
ribe
ratifications
fledge
sedate
erdogan
dugong
shammi
estanislao
bedworth
salal
cornu
heisler
rossiya
ménage
ammi
holguín
bamboos
videocassette
condoning
kanzaki
wafa
macromolecular
elas
yasukuni
aimless
harrisville
fountainhead
sakuma
ifp
evaluators
monteverde
temnospondyls
librettos
zeffirelli
baqir
chieko
toko
segre
premadasa
zinedine
mockingly
katzman
nairne
kriss
bouguereau
parmenides
kobus
cellophane
arabians
kumano
nocera
mnt
gelug
invisibles
tulio
manolis
chuk
costuming
abattoir
yelverton
ranson
jihadi
crossbows
insa
marron
okara
tadao
scorchers
naoya
koval
anciently
hazaribagh
beaujolais
shon
colina
meydan
charlize
prerecorded
trindade
alternators
mexica
acquaviva
bobbing
estancia
magoffin
hami
vayu
broadwood
fitna
ambo
winches
rodan
mke
irredentist
rania
serio
worsted
erland
padmasambhava
straitjacket
jati
nevi
taverner
nyheter
cerulean
inanna
disproving
grenland
russias
tellus
takashima
tare
discouragement
sarda
bunks
azkaban
vite
resuscitate
turbochargers
reemerged
floridian
aquatica
veronicas
operación
jodorowsky
corsini
ichthyologist
auglaize
araz
rebounder
poynton
otterbein
cyclamen
tiu
phospholipase
disquiet
irmgard
naruse
reconquer
brownwood
chantelle
anglicisation
bouldering
spall
aashto
satake
kaan
elizabethton
foraminifera
bernheim
hadoop
piasecki
nolasco
mendiola
kevorkian
plumstead
jayavarman
rigger
bjork
leafed
vilmos
headteachers
panamericana
havas
tajima
cugat
decriminalization
liotta
abhisit
pyu
digitalis
catford
pfo
stoked
unbundling
mullets
terrazzo
osm
rainham
pyunik
albornoz
zygomatic
smartly
frommer
riccarton
dobell
kollar
greggs
basinger
hodgman
observables
cny
tertius
cully
téllez
boggling
higuaín
jcs
graetz
cholinergic
jkr
sundae
enrol
knutson
popova
robi
winstead
sodas
paddies
janna
nigrescens
chandrapur
seidler
demigods
farmyard
qualia
underwritten
sheard
unyielding
penistone
jetsons
gaeilge
wojciechowski
maclaurin
bhonsle
brengle
barthelemy
ducking
kronberg
portlaoise
taizhou
remparts
gaudin
trachtenberg
jboss
zimbalist
lunette
tilia
bankruptcies
fabricio
boreas
osp
daffodils
funaki
endocarditis
victoriaville
iczn
quaestor
alvord
hypertensive
curvy
tigrinya
insurrections
chastise
zeeman
deconstructing
wxyz
porterville
méthode
aumont
satirizing
radian
neer
medievalists
thimble
pranksters
effusive
aiaa
chewy
proposers
backtrack
athelstan
whereof
lemuria
orienting
debased
brynner
cheyney
shapley
backlinks
photobook
topham
finck
davitt
skinning
allam
guzzi
glenrothes
jalapa
buz
theoreticians
easygoing
neame
jacksonians
akihabara
lamothe
yasutaka
turgeon
hisd
sulphide
yog
nsync
willan
gaas
ponferradina
jaycees
findable
embellish
portela
rhos
hobie
retry
spectrometers
lightbody
fraenkel
paperclip
vicuña
brehm
principes
intercostal
chaldeans
sharpless
candidly
strabismus
portales
emmer
kakadu
cockatoos
waley
reber
saxby
danone
yakub
turnips
burhanuddin
cuda
nantz
johnsonville
catflap
canonised
bousquet
revlon
conrado
oooh
althusser
unmentioned
authorial
unobserved
adiós
brundtland
swanston
mamaroneck
entranced
lessens
vce
corker
pluripotent
absurdum
miklos
kohistan
imhoff
donnellan
gouge
allsopp
durance
acor
sifakis
humanly
pgi
iñaki
chambord
ascher
mihara
masquerades
crystallizes
wildfowl
elision
nasim
bfs
ccw
unbelievers
setiawan
eons
minimised
robitaille
estar
wambach
examen
liguori
jell
dsk
yakut
harrach
kadi
bompard
moorestown
feign
sundarbans
mahavishnu
chudleigh
winamp
velayat
celeron
opulence
bori
schönbrunn
pstn
marmoset
slavoj
miglia
catamounts
étoiles
okapi
homophone
contravened
fossae
abha
geshe
bernina
mesenteric
mccaskill
polonnaruwa
outlasted
vectoring
herriman
oona
lazenby
cyclocross
orl
camerino
multitrack
defrauded
flam
brownback
yate
incrimination
duesseldorf
lmc
usedom
apte
peppy
norseman
campbellton
watney
lujan
meic
noni
daigle
nervously
guichard
lpa
etherington
hader
grouch
bracketing
hosei
teamster
marmon
vef
siue
radiographic
grindhouse
marck
satirizes
jemison
kommer
jeweled
roundels
linney
shh
wilby
contemporáneo
morand
verticals
garrow
shorthanded
hammonds
regno
pirlo
warnes
warrier
miaoli
afanasieff
nazran
dunin
uecker
orangemen
lohr
uzun
hohenstein
harmonizing
stoa
wharfedale
kunsthaus
junks
mutu
ssbn
krog
dalila
wariner
peo
contorted
maldini
micrograph
hermeneutic
icar
halacha
apolonia
raghava
washingtonian
shrikes
narcolepsy
workingmen
gouging
cari
jtf
quips
fazenda
photometry
pugliese
spined
shelduck
pamuk
baccarat
bekker
picot
dispatchers
nariman
amla
elc
laursen
odetta
economia
nicolo
también
expectant
jadeja
aqui
uch
beggs
trickier
aref
pleyel
bobble
sculling
yawning
séances
pennock
khadi
tehama
tts
alissa
delfin
strumming
purveyors
ruma
pdo
anchovy
tahar
dreamcoat
aoife
timbres
cudahy
augsburger
tepui
sdb
irgc
crotty
delillo
wilfully
lct
aspirants
walsham
mopeds
mountford
gastroenteritis
cutouts
emarcy
alcides
gaffe
colonise
natalee
malevich
giovane
abyan
fibrin
launay
implicates
swoon
iet
knebworth
ippon
mongolians
virat
idly
ashour
expending
devika
cist
malpas
radclyffe
togoland
diazepam
foc
gadolinium
transceivers
bigorre
scrubber
circo
makkal
davout
meinhof
toutes
shilo
miniaturist
bateau
gerlache
tangshan
vorticity
ullevaal
lacklustre
worldcon
meds
popularise
lesko
jayanthi
barnstorming
holmen
wajda
jimma
hafnium
adenovirus
horie
mbarara
animist
scannell
brenham
evinced
specht
figuration
enumerating
joists
opposer
parquet
pontianak
baf
zubair
encuentro
mckeen
comodoro
fischbach
sakon
autologous
sternly
shaper
awacs
azo
kozlowski
parishioner
implores
buddah
enviable
incredulous
balli
mugu
degen
bernardin
allegiant
coolio
visualise
decameron
ballston
lsts
watership
rublev
shepton
pfi
cazenovia
wayman
mirjam
soiled
chaozhou
jorgenson
amala
kawachi
beswick
apolis
corpsman
benko
detainment
palabras
perpendicularly
rund
harran
grieved
beetlejuice
thaliana
mahasabha
digitizing
credentialing
tuscumbia
isadore
jou
cybill
qub
kenworthy
porres
chlorination
aronofsky
driggs
polen
micallef
hierarchically
yarkand
renn
kasha
shul
ventana
calabash
hokey
ieuan
dryad
brax
miskin
nastiness
eisley
collate
daye
evol
stallworth
fairman
daylights
heraclitus
sperber
blockhouses
anish
googly
sarangi
acw
thrombocytopenia
drips
replicator
yoma
watersports
wivenhoe
underwriter
albian
antigenic
foliot
tog
shala
huybrechts
manizales
mccusker
accruing
multiprocessor
dotcom
rundschau
waterboys
tyrwhitt
cph
hasselblad
augments
sixto
loke
mapes
aun
temeraire
cremer
awn
rnb
superba
strecker
rabban
wsh
shuji
semicolons
constraining
nonsuch
paolini
benner
gpm
acolyte
usurpers
asg
controversialist
beatbox
centralism
blaisdell
daad
hyslop
frill
dufek
deaneries
arlanda
squidward
thant
acg
klavier
upchurch
fulling
vidyarthi
poyntz
frogmen
giolla
moomin
ongc
prefab
maybank
desperados
cymbeline
nasution
imperious
jevons
encumbered
avan
gadi
meritocracy
perugino
alphen
didot
pizzarelli
martes
tela
cabañas
potvin
mescalero
krewe
pech
tatsunoko
vasey
craigavon
microcredit
rajko
beis
reignited
lieven
laski
veljko
bitterroot
basketry
sprat
karlsen
panza
azide
xinyi
sapling
meffert
mediaset
cesium
shiu
spandrel
contractile
kertész
offloaded
airtran
wickersham
wythenshawe
ukraina
girlie
molinaro
fsi
polycarp
paddlers
barka
devens
whipp
silvan
nationales
multilayer
fetter
yakutia
mélisande
warman
dunford
balam
omelette
saliba
biddy
vba
bophuthatswana
deasy
roko
sedbergh
kahin
allemand
nisar
gape
lamour
amiel
srichaphan
kreuger
ogres
altercations
maji
durden
showpiece
sutro
rashidi
yiwu
welds
blithe
paulist
carmella
preoccupations
overwork
comerica
kamat
valla
overburden
cian
woah
tayside
aynaoui
fräulein
acquaint
iab
maximin
positing
englishwoman
moye
thessalonians
nlds
stroop
fonzie
ncb
cartes
laf
cityscapes
rhum
costar
carissa
joelle
birdwood
decrypted
worldviews
salic
kempten
truant
softpedia
sleaze
swindler
cygnet
activex
prester
indrani
tamblyn
cvd
paysage
brodhead
nnw
mapquest
northway
ramy
imagineering
bureaucracies
blackstock
grube
razzie
backtracking
fadden
rohrer
essences
edw
oliveros
balkenende
interconnects
grindstone
butz
recklinghausen
agriculturalists
consonance
reimann
lindahl
beelzebub
ror
gleneagles
wholeness
dray
elec
rehabilitative
sfaxien
windstorm
cañete
kleybanova
jind
basant
soll
bidvest
getters
tombeau
roselli
blesses
imogene
pestering
alceste
staat
nepa
royton
ibrahima
memnon
odb
wilmslow
bidens
metalurgs
britto
bsnl
omnes
limousines
ledbury
slessor
magie
cathars
cephalon
yunlin
sociopath
egyptair
crom
aneurysms
trilogies
kriya
anga
greenbank
dhan
bonhomme
fluorite
harboured
catacomb
thomaz
yeading
belanger
dirck
norreys
ricin
usar
––
maesteg
bullring
delo
mahmut
guyton
tikka
emanates
taussig
sison
denunciations
appia
levasseur
goaded
leavin
mavor
moai
disunity
pathum
definitional
persuasions
mva
omnivore
franny
izzie
praetorius
gargantuan
hunky
glazier
videla
holz
lentil
dattatreya
baza
patroclus
penman
toray
paladino
monnier
catalase
urbane
oron
canelones
coffeehouses
braverman
boso
lobotomy
dantas
elfriede
cantaloupe
pikmin
coronavirus
atomics
keydets
alondra
sadao
bocconi
heiss
menger
horncastle
kinsley
inactivate
ganging
herdman
suvarnabhumi
figural
kommando
sese
hairline
driverless
wyke
fuenlabrada
anticipatory
mielke
citrix
skowhegan
mizan
squats
stoicism
piste
foiling
prostatic
bairstow
paca
invariable
agnihotri
zawiya
greenham
djembe
dirksen
graver
shamisen
romilly
kingmaker
dania
nahl
zbyszko
wfa
stapleford
tst
dnd
morgans
rómulo
eckerd
skippered
phinney
marinella
snobbery
pacini
quiescent
rajneesh
snowmobiles
lwin
callis
divulged
bédard
eveline
kravis
anouilh
drivin
akka
berntsen
crossbench
cryptographer
inis
spreckels
chavis
ontonagon
jubilees
powerboat
redon
arw
messager
manjula
devours
zelenay
bootham
gnk
syndicat
motian
hassall
cigs
evin
ranariddh
cais
devane
gfp
brummer
feni
transubstantiation
tekle
societas
centenarian
stoppers
koalas
philco
enns
maries
cluniac
broadens
lipase
jeannine
sedated
kcl
groovin
concessionaire
kauffmann
salvadori
hyperthermia
glantz
savai
imitator
shavers
thiepval
hre
zhili
bima
addu
buka
mitty
bechet
nachi
kaepernick
trion
mcentee
recs
venezolana
marsala
presumptions
dbu
bowdon
ishigaki
panmure
zinaida
sanabria
falcao
magnússon
ulcinj
irreparably
shadrach
planer
hopkirk
yayo
wendi
skal
buskers
ihc
presences
eastport
krakowski
currituck
usg
ntra
bluejays
tenderly
osawa
installers
clemmons
microscopically
cassock
nyy
hallman
bga
plucky
lenta
elbrus
altadena
barabanki
lamontagne
quadriceps
villareal
rousey
grampians
telemachus
bjerre
leen
pevensey
amati
busca
squint
regicides
kistler
farallon
gilbey
gauntlets
couleur
extragalactic
recollects
bummer
hartshorne
porvenir
schöneberg
launder
qemal
topically
contralateral
chora
bayda
roughing
speedo
schlossberg
merriweather
jiva
taya
benes
wildrose
crit
tabrizi
nonconformity
speckle
sunbathing
rvs
mulhall
hspa
couches
schoolwork
gunilla
ageless
preuss
qila
kercher
deputed
ertegun
masih
utsav
ethelbert
naing
désirée
ajs
faeries
forsake
zamir
intentionality
keefer
myrmica
fanboys
dimples
siachen
hoste
skilling
khairul
perma
chocó
keïta
daza
beevor
déby
debenhams
palmares
heartbeats
claydon
helo
feigns
bhavana
junpei
whimsy
sorvino
acis
envelop
floatplanes
maggs
gravedigger
specular
maines
monopolized
flug
rationed
mugging
manduca
paria
parter
elvish
lecherous
carbajal
nanomaterials
shippen
subsequence
harbord
hilariously
binks
jacaranda
incarnated
francke
reoccurring
charis
mudstones
solons
dilshan
revolutionised
simion
winnemucca
mumbo
raje
tamba
demoiselle
kirstie
silberman
dendy
kildonan
rosencrantz
volcán
neutrophil
bareilles
secretes
mortlock
droves
propagandistic
scorch
tessin
habra
flicking
tullahoma
sibirica
philpot
escala
rall
gehenna
kaushal
housings
dicussion
hox
salvini
hartz
norn
shorn
sym
unspoiled
ebm
pawson
clouseau
haripur
belmiro
crécy
uncooked
kordofan
psion
longleaf
abella
nevus
mailto
mees
schnee
gagra
troubridge
welty
paternalistic
hyposmocoma
wilburn
lyttle
empt
loamy
higdon
ricco
anticlockwise
tev
newsagents
slandering
cgm
vires
hsh
baux
noite
afflictions
synced
transboundary
taylorsville
olten
mettle
logansport
pottawatomie
johannessen
arsinoe
techtv
clematis
costilla
kurdi
teke
zon
golani
quagga
hyperlinked
beatifications
sedley
bulat
afula
swampland
gopalan
rhetorician
abdali
wardak
horticulturists
boulter
hypothalamic
greengrass
wigtown
cojuangco
buffington
teru
erdos
opcw
leidy
tallapoosa
mafra
dunia
devers
grotte
hristov
vignola
impostors
caryn
telepath
abscesses
critchley
mckeithen
tribuna
piven
dregs
multinationals
olmo
implantable
schmeling
jitendra
plummeting
ogee
polito
dissolute
orginal
tololo
stingy
rlm
fêtes
armbands
unforgivable
harajuku
mastic
holsworthy
hypertrophic
españoles
dicey
stennis
confound
clow
ntp
doral
coeliac
moree
classen
waterhole
superstation
pennywise
hypotenuse
radionuclides
fishbone
hypatia
batlle
dominatrix
ail
rerouting
clerkship
purer
postcodes
loras
javid
flatlands
berm
correll
overrode
supergroups
webmasters
ronk
fasts
friedländer
tempestuous
trillions
kroc
capelle
shema
certificated
playfulness
ystrad
michela
kalka
zeid
zook
tindal
transposing
aisling
peacemakers
mahanadi
pitchford
floresta
londoner
rashtra
mure
challoner
crimp
northants
lohse
walley
tuscola
duddy
interzonal
basseterre
sks
neurotoxin
lapwings
shorted
longshoremen
heffron
lustful
adonai
edgcumbe
mcconville
saadia
asie
commissaire
sharpsburg
marlboros
turreted
panagia
segall
krøyer
cregan
bathhouses
mendieta
fantasie
palast
hypothesised
finzi
lele
magistracy
décoratifs
tickell
buckled
hussaini
seba
dimitrovgrad
blk
deeley
bellwether
precis
celsus
midlife
huachuca
sund
weightlessness
ghassan
houck
attard
dainik
emelianenko
millau
selectman
nameplates
graecia
arbitrariness
pathologic
drizzle
hps
mete
mpb
bigots
wolin
lungfish
pegula
kirn
escudé
biomechanical
metromedia
overdubbing
mathisen
remainders
maitre
latvala
stutz
residenz
burkhart
inequities
laconic
machineguns
tiana
tapio
kosice
gasser
dragonball
schild
palaestina
relocations
kamath
hullabaloo
hulman
seedorf
godart
aerodynamically
penitence
ipecac
mcclanahan
tailing
garam
frito
crevasse
venerate
jakobsson
molton
blackley
expendables
relient
papermaking
ensor
payal
sempill
zana
akal
simulcasted
kirkenes
chapbooks
hotshots
moscone
vlora
clk
mikio
dalibor
insolation
drakes
savino
breedlove
raksha
beier
hocus
aurillac
statics
miele
wakulla
habbo
mexborough
carnoustie
shamus
stamm
tbf
tragédie
stojan
ruffled
allot
gula
diddle
cumulatively
babaji
burbridge
alvalade
raymonde
petered
bookbinder
janowski
campbellsville
birdwatchers
dialogs
solovyov
hexagons
tourcoing
mno
footlights
briers
americorps
shakeel
kármán
neophyte
curbed
chessington
committal
stunner
ingleby
duper
mozzarella
brainy
petkov
applets
ziaur
ushakov
visite
poma
intransigent
phair
inexact
bursar
eusocial
shamim
jetix
ocp
shraddha
bardsley
olofsson
ijaz
garzón
silty
mangere
crenellated
hma
kulwicki
beckinsale
efrem
bikinis
boulay
sbn
anecdotally
tatras
qubits
suffern
tweede
strongbow
godden
supplanting
thorstein
murrieta
ariosto
kates
stepper
bozoljac
dadar
wess
trayvon
transpennine
alloway
misidentification
thunderdome
varanus
brockport
rhinitis
ceb
monagas
evonne
ier
stanwix
groats
ltda
tracie
uckfield
tatchell
larkana
yola
ossa
disinclined
raynham
coasting
tinctures
brigand
kassim
mclemore
parvin
boda
fgs
rosenbloom
greenup
sadist
lrp
kahuna
fuc
bogarde
craik
bighorns
gund
vaclav
greenlit
starhub
uah
raper
botkin
archrival
goda
dors
copperplate
archiver
stealer
skipjack
easterners
fathi
splint
hoists
salience
trinkets
gbu
remick
telegraphed
nde
marussia
fyodorovich
beaudesert
walston
edta
dicker
vecchia
cornmeal
chery
astroturfing
esto
vanuatuan
tsao
spilt
depew
eventuality
busto
chatwin
dancesport
graduations
coaxed
enso
enhancers
guarneri
realign
pharmacopoeia
clovelly
aspasia
jocasta
kravchenko
kort
lillee
strangford
esmeraldas
whos
tacking
trevithick
natl
chitin
escanaba
impermissible
tarija
highlife
harnack
fenella
rodale
mesic
winks
igo
improvable
thumper
wahlgren
sirleaf
sandinistas
anticommunist
matas
licata
pineal
sulphuric
trinh
roleplay
jiangling
culberson
ephron
brabourne
javits
overlaying
kotler
sidelights
daydreaming
bonnets
snappers
pales
delmer
leytonstone
punctures
ghislain
tulkarm
cardew
heres
feldstein
gada
infineon
colonizer
spikers
trintignant
eol
shootdown
varius
weiwei
meni
oatlands
hayford
baiano
ewtn
pampered
bouverie
khadija
gyre
cherubim
bwh
butlin
mended
inheritor
knez
ripened
tenable
waw
injectable
mccoys
acoma
sensitization
bardwell
cami
rogowska
chakrabarty
precognition
talismans
scolari
alh
wolde
ects
otr
presentable
mainstreaming
silverback
beeman
tylor
rattled
pekanbaru
incognita
mert
bricker
outgunned
conradi
taskmaster
cathartic
imhotep
dobre
bohannon
dorgan
bemba
praiseworthy
yasunori
karlo
bunched
beaumarchais
lanigan
steles
craps
moscoso
ibi
toller
holcroft
hysterectomy
kashrut
shamal
uct
debre
unfavourably
cassian
nyi
blackall
kyocera
niccolo
mup
straker
zinfandel
europeana
fsl
sullivans
barcodes
svc
conquistadores
datetime
tartans
isra
proscription
skelmersdale
factuality
actuation
reinvigorate
kerrville
oocytes
divergences
drunkenly
ipu
mochis
cordata
conjunctivitis
brasseur
twisty
grangemouth
carolan
medlock
phalke
biracial
kuroki
grell
ulceration
lely
chalus
edens
ruisdael
preschoolers
brotherton
lauritzen
huxtable
baccalauréat
whacked
northcliffe
parapan
luas
kokand
blandings
doddridge
pornographers
unadulterated
taddeo
allport
cellier
tinsel
rhenium
levellers
monoliths
irf
agdam
refracted
conceives
coppell
grito
¡
medicated
fernsehen
chie
logbook
zhonghua
vergne
cantors
salavat
caribs
deporting
bipod
bootlegged
binaural
hallucinatory
jcpa
beirne
jaffer
mcinnis
btu
serous
sanguinea
dirks
andreae
causey
wheal
felidae
delineating
asiana
arlette
airstream
byun
infiltrators
jaynes
frohman
intestate
mahia
jellies
naphtha
torry
vey
wagener
glamorganshire
lampson
insincere
tikkun
hunley
cowra
tremayne
venoms
pareles
quoi
pazz
centavos
lumix
randomised
livings
pilger
psychoanalytical
cohesiveness
latched
afrobeat
angelini
apurímac
edvin
tmt
purushottam
incites
impracticable
candelabra
acro
roubles
rózsa
cmas
basten
telomerase
zuber
reposition
lgb
tracers
epaulettes
shays
lykke
towa
carruth
pustaka
mallick
conserves
shulgin
sloat
blaikie
kneller
haren
characterising
paquet
ramo
herwig
gusmão
mulia
wissen
vindicate
smirnova
sanitized
peafowl
roose
hyrule
brdc
ecclesial
jemez
mineralized
rothes
killin
chams
rothenburg
deafening
rcw
arielle
waterlooville
elling
decima
périgueux
sione
khairpur
burry
ludger
cinders
homological
reprimands
mementos
glitz
tav
lanna
tsuda
janke
galung
disassociate
shinnosuke
skyward
potting
guth
maxey
evra
semiarid
brinkmann
ptr
toothache
machar
retracts
causeways
hemet
ephrata
liefeld
angioplasty
kernan
martinet
beatus
jazztimes
unionization
meriting
efraín
tobol
racquets
dominantly
umbrian
boondocks
villahermosa
volver
clocktower
keon
rudel
nightingales
beshear
sepang
broxbourne
dervishes
senlis
mesquita
autodromo
amant
faridpur
sippy
mugged
farrier
mone
cib
fulmer
jürgens
kanin
pournelle
meditated
minimax
goethals
pahs
cudworth
sheepshead
backline
decennial
polokwane
hagiographic
shyamalan
frum
thionville
marrs
boltz
kazem
corts
rashes
strobel
chickpeas
hemsley
tooley
woodcraft
booing
fark
malmedy
haddonfield
johnsbury
linville
flatley
pittsylvania
tehrik
skydive
chariton
jeh
unna
averroes
endometriosis
easterbrook
scoville
alcântara
baan
gavriil
hoekstra
nájera
tubas
houlton
machetes
hendy
kegs
directness
senthil
krav
kuniyoshi
machiavellian
garçon
gummer
manzanares
mand
deidre
courrier
paniculata
fulvia
oti
bluey
joyfully
heidelberger
foaming
pressburger
rhimes
whdh
tahlequah
wia
phillis
twirling
naturism
landauer
antaeus
regula
froese
koot
desimone
filmer
epifanio
morges
uncasville
muss
staffan
saida
biblia
miceli
bloxham
ridicules
uridine
underhand
maryport
streicher
khatam
galvão
chivalrous
schein
guadalquivir
teething
lutte
feverish
blewett
disequilibrium
bonnyrigg
mahala
ipf
agnetha
slavin
nickleby
opensource
zafer
sool
olympiques
bagels
avidly
marias
baike
nostrum
villainy
buckhead
interlocutor
gyroscopic
shipmates
ttu
vapid
orridge
tricyclic
isolationism
aborigine
courteously
vashon
nayef
halving
antagonizing
netley
avalos
mikuni
kutztown
avner
tibbets
mamo
cesari
pittodrie
tsarina
rcr
savas
mensheviks
allstar
claris
virtanen
peranakan
yamal
cracknell
judiciously
tanners
rothe
nso
koz
vaa
railgun
bootlegger
canio
tarbell
frankenheimer
riddance
labeouf
smudge
tessie
panetta
muscled
msrp
blackberries
pleasants
acn
nearness
vint
microbiota
hardcopy
foreboding
negombo
dme
yavne
tvm
bornstein
kleve
heckel
nowicki
meio
garg
emulsions
mondeo
mifsud
juncker
matheny
headbangers
fashioning
methil
anthropologie
ncap
fui
unidentifiable
stansbury
riband
somethings
spéciale
bli
copps
interlocked
oakwell
sdo
arraignment
arseny
clyro
climaxes
moyet
shaq
inkster
gondolas
cubit
robillard
cryptologic
montacute
libertà
piscium
hagler
kaeo
lochte
someren
hoteliers
naipaul
petry
butchery
slrs
rooming
springwood
loewen
montalto
mhl
zoster
tubridy
procyon
swail
fricker
sagamihara
bluebonnet
abut
tono
hattersley
pursat
wende
ferryman
vav
nms
consol
thw
galling
kaguya
fomenting
gaan
yanbian
shoppe
yunis
bentsen
dosa
otp
transamerica
funfair
puppis
misión
quahog
availed
shanker
diatribes
meudon
serravalle
cotentin
mourner
platen
berbatov
dagblad
kmc
centralizing
hinault
upliftment
paratroop
bledisloe
haruo
californication
senter
hopp
camelopardalis
jts
rashmi
digesting
acceptors
waren
euskal
unencrypted
spiracles
espana
tangipahoa
yaa
triannual
gaviria
hatcheries
tercentenary
projectionist
mcbean
thereabouts
stylistics
mozarabic
schooler
ecclesiology
coverages
metalurg
boreholes
osterman
catheterization
medievalist
persica
revie
curtailing
vtr
limped
gruner
shantaram
councilmen
paddocks
howstuffworks
gudang
borlase
expressionists
basford
aric
greensleeves
giggles
montagnards
mckenney
lilongwe
cataldo
hülkenberg
yichang
parkville
soars
gerakan
midbrain
repetitively
rufina
tellico
brokering
subdues
dement
influencers
westernization
dagan
deist
bogard
kernow
ishibashi
webmail
brane
monkhouse
bouquets
resonating
plainclothes
rustaveli
widowhood
ruts
profiteering
hil
kittanning
ranald
gurdwaras
manuka
purna
spigot
wath
incunabula
jilani
neidhart
centipedes
fibromyalgia
subwoofer
playland
mervin
dahlem
nitroglycerin
hozumi
outflanked
gioconda
hirota
rhi
egor
afer
switchback
kth
albéniz
coupee
hydrofoil
pslv
landen
bakelite
boivin
caprices
orrery
tocco
annas
matting
scatters
iorio
silang
util
terroir
gastro
rubel
occurence
uncivilized
whately
floorboards
cheaters
frailty
phonons
yorick
ramil
nescopeck
alsina
apollinaris
dreadlocks
harbach
corneliu
bazan
risers
coextensive
gailey
compactly
calogero
saye
flavonoids
invalidity
goalies
canaletto
pasqua
lá
junto
anibal
actus
earshot
marshalltown
unrivalled
natascha
chretien
duelling
profuse
partiality
overproduction
joppa
scarbrough
tago
suso
annihilating
upf
robinhood
westerville
wlw
barbel
obits
dugmore
stroked
songshan
unwind
sweetman
wege
billikens
uks
chickadees
towanda
spanos
telefunken
utzon
fairplay
bucolic
giorgia
fozzie
calarts
palauan
milenko
sudeep
chemo
unconquered
fyodorov
misuses
miserables
quandary
klink
inbev
tamra
badshah
cultivates
diário
sundin
gardes
costantino
benzoate
serban
mcmanaman
clitoral
instigators
eragon
bilan
sherriff
ivc
ditka
crumbles
bermejo
pivoted
mahlon
essie
bestia
bedminster
dowden
irib
abang
rybak
hillclimb
toye
zovko
beh
erring
diebold
scher
todman
milica
peau
banishes
eti
lyla
hostetler
lepton
rowboat
whittled
whitening
ghanem
internationalists
sedaris
palácio
budde
staind
semele
ludolf
llanfihangel
vexed
mannie
unprocessed
foreknowledge
cresta
bosc
avocet
nbi
dismember
aleks
robustly
turm
diaconate
respirator
fulci
crepuscular
diz
intertwining
munsee
wahdat
pacaembu
thorney
traitorous
etisalat
wiccans
polychaete
parthasarathy
reidar
eightfold
pinatubo
inoculated
pasty
swales
hlinka
namib
atiyah
soapstone
arap
taipa
delmark
orbach
bureaux
meninga
hermès
derosa
sequester
bhaktapur
chalkboard
suhr
khama
munsell
robinho
arirang
cortázar
khang
backwardness
chirk
perryman
hachinohe
incisor
queensferry
cremorne
barragán
falke
kempf
barys
gruelling
atu
belting
westen
tallgrass
joginder
trece
apparantly
peacemaking
employability
velox
anolis
sammartino
fera
mase
rosier
wilkison
dawei
caprivi
backbreaker
lyster
fawkner
havn
littlebigplanet
mamata
prostheses
cafepress
gtk
metafictional
grundman
wiktor
pedagogues
lagerfeld
slonim
naxalite
beckenbauer
juliano
mécanique
constanza
confessors
zippo
rainn
klaw
abedin
inkerman
lindeman
boye
paternalism
nenagh
winglets
hillyard
chelating
catanduanes
cagefighting
unflinching
doberman
miguelito
downstate
adjara
mechanicsville
tatton
ruka
miserere
willits
bayfront
respighi
baniyas
ampa
helgeland
irvan
shipowners
headscarf
ceramist
inducible
textural
sucha
poonch
aylesford
conned
speirs
hernych
dharamsala
offscreen
taner
peris
vung
fabrica
sleepiness
poplars
sheffer
nish
caminos
samithi
tunas
meccan
denn
substantia
negrete
gulfs
davidovich
sandpaper
quintessentially
bjarni
yongin
flyway
elmina
roces
corsicana
bangoura
susskind
embarrassingly
royaume
pokrovsky
chivers
maubeuge
maddin
scriptwriters
vindicator
circularly
entrails
monteux
candlesticks
marechal
humongous
gav
kiryu
yod
lamson
tamales
mcgavin
sabot
legalise
burscough
nyro
inauspicious
bootstrapping
kadett
yongzheng
afshin
homefront
dailymotion
boxoffice
ludmilla
capuano
manzarek
cabuyao
treacle
kirchen
retransmission
dutifully
nikopol
panegyric
torquato
cosma
narrowness
chandpur
refinancing
thorogood
repossessed
capsize
boliviano
montaño
geier
geriatrics
lavington
digitize
exley
wingham
academicals
désert
oxegen
finnair
moldovans
anneke
crestview
ironmaster
fortuyn
longbridge
keshet
manni
söderberg
utz
mahaffey
maudsley
hawkshaw
avgas
tatort
upadhyay
fiesole
blixen
encapsulating
knelt
kae
chatman
munchkin
callbacks
camm
puro
supt
balta
cycleway
buyouts
bertini
venustiano
jubilation
dwivedi
cookin
badrinath
fulcher
dingy
sansone
serangoon
bryozoans
nidhi
frankincense
llanelly
bariloche
uintah
davydov
mcburney
odon
charlot
mfp
ferriby
eases
prise
daphnia
smacked
amusingly
madley
andreea
ingleside
barnhill
devastate
katan
bira
josey
demetriou
marinko
marler
gildersleeve
gondoliers
periwinkle
tannadice
lamotte
mergea
wister
storr
tripper
menaced
omniscience
neutralise
truax
juego
lahey
haliday
seeb
frenchy
dawley
barri
margam
natali
hynek
surman
maho
schauer
joysticks
mijares
anim
callen
bunka
meadowlark
gaf
cau
irradiance
zildjian
jovovich
peromyscus
eskdale
ddu
hartsfield
dineen
usfs
yearns
andreassen
sagal
streaky
talpa
roskill
melnyk
investigational
spandrels
dunker
bakerloo
elazar
entrusts
granja
polson
studiocanal
freaked
muswell
fns
beynon
tushar
bena
gane
reisinger
belding
wattage
margaux
meshuggah
resiliency
noblest
tichenor
ucp
newsman
chisinau
overpopulated
flieger
blushing
kassai
disaffection
léa
kariba
heth
buloh
hrd
zarah
nupedia
humbucker
longhi
ecn
scrubland
wizz
prestatyn
gynaecologists
darned
catron
paperless
felines
harter
outcroppings
slacks
tange
hathorn
pem
usoc
ziarat
amortization
madalena
soldado
jascha
tanenbaum
sabena
shedd
porphyria
caray
kamenev
rehn
arguello
silvestris
axton
hilmi
haqq
lexa
nmb
boric
dampened
popo
hada
artillerymen
mundell
gobies
stina
elwell
wikepedia
muc
schley
ventilating
ciano
isms
lunge
losada
secondaries
dint
rummy
cadel
mattoon
sproule
candyman
dorin
safes
dbm
uluru
reassemble
bugler
holdover
cleef
keilor
gaits
tarantella
serov
willenhall
blackmailer
godefroy
shadi
dollard
fenchurch
yeahs
optare
alr
poema
qays
eschews
tavener
fermin
namsos
girault
simmer
orontes
valette
ergenekon
sleepwalker
unacknowledged
deeb
faucet
sabatier
yoshihiko
polyandry
gehringer
omura
landrace
razia
hille
franquin
canards
mua
johannis
hutchence
alek
blayney
krish
oom
effy
tapu
talmage
kudzu
pelléas
deluise
krumlov
simplifications
lamba
sangiovese
hasbrouck
segues
wimmera
bioenergy
pinkham
hungaria
mattsson
ceci
hayyim
preben
bonobos
korngold
marinho
consanguinity
prana
eliana
ancora
tira
chhetri
mccorkle
conder
kiang
lewisohn
courageously
milliyet
neumark
nitpicky
highline
triplex
gluttony
shamanistic
hendriks
ribéry
shg
gertler
entrenchment
nonproliferation
ahearn
domenech
leroi
carvel
fenians
cabinetmaker
siltation
acetaminophen
organelle
karyotype
agnese
vierge
thomasson
oujda
tradename
haddam
tvo
andalusians
underlings
vollenhoven
bernarda
cardiologists
cherenkov
bankes
gasparyan
neocortex
buzzy
miel
emotionless
maat
cftc
muscarinic
zammit
ptfe
excites
busia
defusing
bujold
lacerda
srbije
hmnzs
pauwels
furth
hoodie
irredentism
mishmash
mdl
ochi
unmaintained
masterminds
laddie
tankian
imlay
sws
barend
broek
retz
curiel
satmar
colonnades
extremis
montalban
urinal
hagerty
germanica
overusing
merkur
danda
pandits
yashin
besse
desantis
edgeley
moviegoers
crivelli
gotch
gunnell
tonsure
akito
daiki
ioana
snowmobiling
carondelet
gcn
spacek
southard
abrogation
röntgen
birtwistle
coutances
abnett
heon
otolaryngology
enlivened
anthropomorphism
myerson
salai
hastie
spong
lanner
atanas
almaden
jakov
deadlift
southwood
bebek
rendel
minicomputer
waterfield
pliable
travelcard
pawlenty
breakneck
mclendon
microorganism
hellcats
waites
detest
gagliano
bakhtin
rosendo
tolworth
skilfully
brooches
logar
nowshera
jakobson
mouser
nivelles
cuddesdon
accordions
abaya
mishandled
apsara
lachey
vedette
purkinje
brière
ebbets
slatkin
viib
biotite
nivedita
cadences
pwm
dowse
vociferously
karthikeyan
revueltas
guria
tachometer
trocadero
ashwell
beautify
ruadh
marcio
pfp
paupers
sickened
perlmutter
dunc
ccb
skidded
swash
ralphs
weidner
snooping
botti
anghel
shuo
tetracycline
sulley
ucce
smithtown
biopsies
tramcar
postdoc
diab
mindlessly
flushes
criswell
baileys
scrums
mdina
indents
citybus
sahir
slammers
evensong
guideway
puffins
oosthuizen
calexico
haridas
thune
yw
freemium
ssf
cagliostro
unfashionable
sarika
balham
chimed
mclay
rosing
keystroke
homesteading
dulgheru
wigtownshire
hartebeest
goggle
webbe
chthonic
mattos
gsd
sybase
sankaran
beeton
tika
constitutionalism
homesickness
yusupov
entrusting
remigius
imperiled
reassessments
infantino
hardens
hartt
amn
amita
martorell
ecker
skywest
xiaolin
homey
mazari
romanum
lfo
trilling
voiceovers
scholtz
bridgette
hurls
laureano
rakim
masanobu
frolov
mgp
tongva
ysidro
drapes
russert
sulfite
petrarca
maestri
wheatstone
griffen
jonesy
minimums
diene
dalal
castrato
replenishing
grauman
televise
segel
yok
satterfield
navigations
latches
youngman
plagiarizing
headmistresses
baath
impounds
autopista
kingsmen
goffredo
anr
ilagan
artistas
crooke
scf
suncoast
blowfish
ramazan
stockpiled
odorata
yanagisawa
miroir
cerrone
michelet
solem
actualization
palindrome
corder
prodigies
fetters
tarquinius
mmt
vitti
jabari
laguerre
brawler
tír
stencils
stephani
nonce
shumway
alvan
gein
glamis
uneconomical
giani
donostia
tey
lurch
ank
crud
muybridge
begining
squalid
handers
bula
interbreed
dameron
lindelof
braddon
katzenberg
innovating
basco
danio
forefather
flaunt
skittles
pictograms
sfi
washroom
fovea
orions
herrin
spheroidal
encyclicals
boombox
moorthy
townend
peddle
adas
garbutt
incorrigible
whitmire
lider
semmelweis
dimmed
hoogstraten
canet
facedown
pyatigorsk
aventine
unselfish
medians
jayalalithaa
hunte
bromo
synthesizes
smite
imu
quapaw
chik
gouldman
hellenism
spurge
yuli
dere
wdf
mimes
engulf
jamahiriya
mistranslation
tamimi
guidon
henneberg
cucuteni
credentialed
neuropathic
koni
mounir
nage
cromford
asiaticus
unico
pff
encrypting
bradlee
pch
smugmug
kamera
gossage
annexations
melua
embalmed
igp
aakash
reasonings
endothermic
sassi
paned
retitle
oolong
noggin
exudes
démocratie
altieri
radiometer
beri
entrainment
tuckahoe
segni
purépecha
holo
staggers
companys
zillions
oñate
nst
shotts
jardines
histon
pugwash
rudnik
mastership
democritus
precipitously
leese
bsh
osteopathy
ducasse
fabry
okaloosa
hiei
bazhenov
danuta
cleaving
frisky
bha
giusto
byam
nerf
criminalize
antonym
macewan
mosconi
telomere
grizzled
ottley
axn
abstractly
immunosuppressive
dramaturgy
gnats
ayame
davin
ballack
evergreens
palus
ivaylo
dramma
northwoods
reestablishing
heavies
sate
mansouri
hajjaj
matsuzaka
taster
stalowa
deas
sanjaya
slop
bolshaya
rachis
sidebars
zehlendorf
thay
scoresby
jetfire
pontecorvo
heliports
veggie
watercress
tce
resourced
hershiser
gussie
ellerbe
bertel
shahpur
bestial
electroplating
reshuffled
meas
pulliam
kalani
comus
ripcord
cliffside
aipac
jangle
zaynab
sangue
ratatouille
christianson
kewell
godine
wassen
prioritization
bonnett
playas
epitaphs
lossiemouth
djiboutian
sikora
archosaurs
gramont
bovary
jacco
citadels
substantiating
aton
lutenist
sufjan
schweizerische
cricklewood
citys
orch
fairford
grenache
anova
macgill
robledo
fasano
cubano
whizzer
telenet
axelsson
grignard
clearview
varsha
sewed
fractious
janya
pokerstars
sexo
hilarity
teres
wallen
giornale
chakrabarti
hominin
cubesat
udomchoke
nsukka
integrations
indiscretions
mizutani
forelegs
husein
fabi
aae
olm
moulting
scalped
gummi
oaklands
conciseness
lifters
naturopathic
perdomo
crayford
soltau
conceptualize
porbandar
truckin
asiad
watchmaking
bhave
consumables
transonic
benois
digambar
sissi
psychotherapists
posteriori
gomera
wole
lhd
farra
geppetto
hyderabadi
plaistow
goading
groh
limavady
mohapatra
lmu
pullback
elswick
packwood
taishi
humacao
infliction
mccarver
unconformity
fuzes
patmos
doda
tripoint
paull
rrs
rachmaninov
guarini
jaswant
resnais
mosse
bookbinding
disallows
jayhawk
ephedra
fante
chums
eastmain
distributional
zippers
beechey
kampmann
maravich
abir
repurpose
bgp
lanthanide
freesat
arbogast
majd
casella
pester
bugg
profs
salamat
cupido
pentti
cela
ferrar
infusing
showgirls
lecomte
albertsons
naved
appropiate
leaderless
rahilly
waccamaw
koki
baier
telomeres
arnon
moratuwa
trew
gries
hoceima
absurdities
jasna
dibdin
salzgitter
dardenne
cyprien
jund
matai
laemmle
cloris
luczak
travails
amoco
ruthenians
fancier
flv
dibbs
pris
beefed
obliging
lindenberg
borgir
kuzbass
motala
esopus
loggerheads
shahriar
alita
pompei
tanda
outdo
serpa
haycock
agitations
uwc
plamen
hvidovre
melodia
kuzma
nullarbor
hmo
vinokourov
asnières
flowerpecker
liisa
patt
cravings
jalaluddin
uncorrected
signification
pastora
sunway
eph
purl
copier
santillana
steere
overstepped
voila
akhmatova
colan
westheimer
jarama
linzer
phalarope
thunderer
pugsley
asil
postgraduates
seducer
scienza
manav
powerade
institutionalised
kors
broadsword
villavicencio
ludgate
moel
champollion
lorsch
postgame
qasimi
faraz
diplomatique
mutinous
stearman
yager
groveland
shinsuke
deviants
evs
moje
delibes
mangle
interspace
leoncavallo
trainspotting
cannell
dentures
drunks
fiera
hoaxer
ows
sneddon
bagatelle
mondes
goodie
andree
referrer
garba
indiscretion
jorgen
scad
hallucinating
outstandingly
printz
olea
canticles
wolfie
gallico
cobweb
winther
monarchism
kory
crna
planta
ler
kuttner
lantis
ponderous
tsuburaya
breslov
splendidly
impersonators
círculo
tenedos
kantner
pupate
dunvegan
brar
luchino
gravitate
piu
krasinski
mrinal
kirschner
serotype
beinecke
offerman
bloopers
bernardine
sciurus
lorie
aoun
weitzman
kwaito
koos
moiré
missteps
nnn
iniquity
shaba
boingo
oakridge
navale
anasazi
glynis
symphonique
esterhazy
hpd
sourdough
bodhidharma
melfort
allgood
banham
elden
mcmullan
gatiss
exchangeable
heiresses
highmark
maestra
wingback
commment
scudetto
scobie
bako
iovine
warriner
faldo
bewdley
nuclide
monier
anansi
hosanna
tippy
boccanegra
muk
overwrought
laaksonen
domenic
bruckheimer
chucho
heuer
feroze
lonicera
spermatozoa
solidifies
ansley
sangram
posta
sather
bailing
sanin
eliott
ibuprofen
fearon
reimagining
shariff
winegrowing
misalignment
masterfully
hsuan
trimaran
dmb
marot
condemnations
bernalillo
cellulosic
antone
cashes
reconstitute
udt
gravano
serbians
oskaloosa
watchable
gangway
ayia
rivlin
vitter
stroking
orci
impelled
etchers
abdal
scold
cochet
kalo
innocenti
comore
birdsall
angostura
carse
tassie
defacing
bozen
carnevale
crepe
ludi
aragonite
crowsnest
blocky
hacettepe
shau
norths
soyinka
seismological
kristal
gonorrhea
mcclaren
itoh
fcr
subcontracted
dreiser
muffler
vlada
molle
neka
bouche
coogee
berardinelli
summarization
kimmo
softens
excitedly
neuralgia
namangan
inhibitions
lucidity
amasa
fmln
spoked
andheri
splitters
kanchanaburi
macdermot
noori
softwood
amphorae
puchong
kentaro
ayahuasca
horsfall
soundcheck
ifv
anglin
olé
unfilled
abas
scruton
erasmo
gatewood
gagauz
pertussis
herut
maazel
iarnród
xingu
gulp
kempen
saunderson
pressurization
froissart
hazarika
pharmacie
glassmaking
baena
albo
cuddly
mizrachi
spokespeople
incriminate
mcclinton
desilu
producciones
adhesions
eyring
galeries
hbv
macc
socioambiental
kitching
doucette
kreuzer
dumbbell
chetty
moroka
dichter
amperes
prenton
frogman
grieves
nawal
cytology
coweta
calakmul
lingfield
gouvernement
pursuer
dcr
yag
asai
beka
spca
bourbonnais
waas
klickitat
soit
intimated
aquamarine
cyd
fortes
futher
raincoat
floro
midges
horak
unearthly
sonoda
phloem
givers
bhansali
inzaghi
mullaney
ttv
lotos
cheriton
wiry
lokesh
chaconne
nacionales
okubo
estás
bothnia
zaitsev
regularized
megrahi
greul
dinant
escalera
biannually
kusturica
leko
jolliffe
rivette
vermeil
linderman
cattleman
appledore
morrigan
harari
caicedo
wardour
haemoglobin
potus
cuffe
toying
millia
gueugnon
anthon
katyusha
paki
islamophobic
inmarsat
ushuaia
humps
poissy
disjunctive
macphie
mapo
orval
leopardstown
zoophilia
tricycles
statecraft
cide
rostrata
waver
mcnary
impeller
layard
seeta
latorre
ayton
bryon
cnes
pollinate
balusters
selous
underestimating
hatha
foxboro
megawati
macnab
calleja
weare
idt
corinthia
ribbing
casagrande
bridgeview
burp
gastone
disenchantment
khiladi
holbrooke
surg
streamliner
grijalva
peregrina
saltmarsh
busker
edler
instrumented
fani
jazzland
atlant
carcinogenesis
jacuzzi
nanoscience
berenger
prl
gastronomic
tragicomedy
herakles
junot
dowsing
ccaa
mhd
infilled
circumspect
fältskog
coeli
chamillionaire
tanguy
pickin
bacha
duralumin
pieria
cpbl
pesci
markie
ginkel
tagesspiegel
renier
snowing
carnitine
sakari
droll
mene
essaouira
preet
mooresville
pli
ech
shanta
sipe
sebelius
frazee
kras
reformulation
halogens
outscoring
fomin
naphtali
gyr
sitaram
dumbest
nnamdi
honiton
wajid
rotaru
vester
realplayer
zomg
josaphat
brockley
kele
uncoordinated
peroxidase
maniacal
sedna
lenk
pongo
meese
humvee
amylase
marah
polow
drowsy
saz
mange
speller
molenbeek
prouty
bundesrat
landesmuseum
canaris
waitin
sogno
linas
ocu
labyrinthine
maurienne
vuh
zaibatsu
lpr
rnai
hyeong
schweppes
santoso
orantes
pinney
nibley
lotti
kronk
huyton
redemptive
naturalness
unmade
freudenberg
pascha
spaniels
irked
manston
kostenko
tencent
indigenously
saira
sobibor
mtp
kneels
gud
mogo
generoso
pef
sushma
zaphod
rizzi
latam
augen
bulle
wfaa
kingstonian
novelli
habré
andreozzi
civitavecchia
speciale
ingemar
adour
biot
pvi
ocat
lynnwood
attenuate
bardia
regazzoni
bourbaki
carville
integrators
macula
csun
parus
watervliet
dupre
queued
leninsky
bartali
josée
softworks
hampi
freytag
montalbán
hasa
bragged
smaug
pippo
geophysicists
delahaye
kazmaier
verl
ehrhardt
ruffian
feathery
concentrator
eponymously
autor
macario
optometrist
grider
schiaparelli
encroach
trescothick
chairlifts
rassemblement
fedayeen
riversdale
ameche
lesseps
dpm
tipsy
naissance
sunningdale
tcb
enberg
enomoto
telefilms
gendron
arnauld
golitsyn
willman
egghead
grassley
guay
sista
atoka
laki
preschools
pollyanna
peiris
faial
forgetfulness
mazzola
fida
hellmut
surmises
panoz
victoires
cannae
overprint
averting
naco
garand
todays
svet
smoldering
buckthorn
samcro
mentone
audiology
fliegende
muldaur
pygmaea
bête
flambeau
esma
perpetrating
gigging
monbiot
kellermann
burrough
nysdot
haircuts
spacelab
wwl
elisabet
marschall
higa
suffren
nakhichevan
pickaway
collating
kempinski
sizwe
butternut
ferc
balwant
channon
minha
hrv
truer
radner
suren
rumpelstiltskin
egfr
dantes
renegotiate
masumi
bubka
atos
carre
interamerican
jq
boomerangs
stoyan
occultists
timberland
agnello
samarium
receptacles
marsch
offenburg
delegating
crimen
versioning
cachar
rebukes
paraplegia
rur
makhdoom
forfeiting
futon
ariza
hydrant
skarsgård
freeboard
ksar
carousels
goyal
rickmansworth
persistant
sids
absorbance
bungo
uaf
hwanghae
ketil
aylwin
tanuki
dyne
boddy
biasing
jessamine
hirsi
sheepskin
remagen
rejoinder
milles
ginevra
framlingham
rnvr
rodion
tapi
hamlisch
tangos
ghirlandaio
ananya
reddi
javert
bourdain
weedy
virender
usama
jpy
serpens
ravenous
katina
hoberman
kenley
curbs
predominating
robed
montijo
verbeek
forgoing
mce
kefauver
idl
promulgating
semiautomatic
terahertz
placerville
spandex
saray
noster
rossii
cubed
chaykin
mudcats
reorganizations
giroud
shofar
bostrom
grumble
spiralling
umaru
levitan
vrbas
jarod
nahariya
risborough
olver
humourous
plaxton
thieving
woodsman
lahm
unida
ook
rochon
ambushing
multilingualism
egham
photosensitive
sirocco
catalysed
mirnas
liliuokalani
spohr
democrática
tarasov
ballrooms
naima
shanghainese
mandarins
valets
moquegua
surfboards
haditha
godrej
kaja
hali
acth
zosterops
seif
heuvel
benders
herrschaft
kohala
passi
giusti
svenskt
petrovic
yellin
fielden
patentable
claflin
comerford
rubrics
akebono
mulloy
hewes
dearne
lehrman
croesus
lavon
peeve
lec
neiva
nrf
regnery
maximillian
coverup
petron
bixler
ambridge
devoto
lifeson
tchad
milkvetch
eulogized
oxalis
albigensian
cosmologist
alliant
petrelli
junio
marka
ith
artspace
bandhan
biola
cybercrime
ttm
vaart
bladensburg
kazuyuki
carolin
overbeck
bookcase
ironsides
marquinhos
barger
cill
newsmax
microcode
boosh
kame
nodular
datura
loadings
taxonomist
astronautical
wehrli
midsize
bont
sloman
watchlisting
harron
seventieth
nishiyama
origo
esfahan
romanoff
uhh
tep
microphylla
quirke
oratorical
fulwood
reisner
hauntings
aardman
treloar
transplanting
formalizing
jewelled
balke
duguid
scottsville
dufay
ambidextrous
woodhaven
locklear
marbach
carmilla
iid
sesotho
mammography
chanticleer
wherefore
bespectacled
astringent
salminen
saker
proprietorship
rostand
meego
alfano
pollitt
mithras
ebe
teeter
thes
dozer
urumqi
renville
masakazu
cuming
thalassemia
lyndsey
icts
mdb
friedan
maintainability
taisho
elint
schack
dimmu
rowohlt
kirillov
sandu
shkodra
jdc
khodorkovsky
glendora
splat
familysearch
remorseful
choreograph
knysna
benedek
snouts
sarre
kleinman
spinifex
swb
francoise
quarreling
millersburg
redeye
oef
hida
blaxland
huysmans
ums
trogir
cussler
rackers
precluding
cartwheel
astarte
corporación
berenstain
compte
etfs
hecla
drame
belisario
sprouse
má
dines
disfigurement
tinchy
larned
sanrio
zucchini
draga
asdic
addai
calpurnius
dyce
alavi
caffrey
croy
encoders
lechmere
willam
setu
milbank
steamy
parana
takemitsu
sumbawa
eastwick
slott
scheider
entitling
sirena
nebulosa
lassalle
synthetically
seatbelt
verbosity
skp
cmyk
straightedge
baghlan
okie
ifugao
margarida
schumpeter
trinian
pillman
bryanston
kynaston
wfmu
falcão
piotrowski
rabun
macmanus
invitees
sendak
seething
rodas
biles
colchis
utensil
barwell
neena
wakefulness
pantoja
flemings
perrone
coquette
targetted
haïtien
ocasek
kurita
henery
yrigoyen
taita
familiarly
directorships
breadwinner
diapason
kindling
maryna
policía
quenched
kestrels
cooma
nephritis
manar
minstrelsy
pnm
mcphillips
decemberists
reddington
weider
conners
vardy
bengkulu
angleton
saltbush
montefeltro
bhambri
ctl
magnanimous
eckersberg
vlan
reily
mcguirk
rotavirus
millsap
kuwata
campden
valk
turlock
itagaki
kohanim
dished
apb
satta
sicilians
copil
ibbotson
aragona
steeleye
miscarried
memling
mier
santer
restful
supersymmetric
sre
jilted
kiis
policymaking
mutagen
seashell
capsaicin
systemically
assa
cabanatuan
slog
dpd
thymine
kingdon
dagar
enslaving
ephedrine
unanswerable
killebrew
taxus
foodstuff
nyk
fended
brecknock
penetrator
glp
lemme
odesnik
praecox
sows
oxus
haytham
piqué
caltex
blindingly
autocrat
dujardin
prélude
moriscos
karakorum
mucking
rovira
leclercq
vidi
celan
situate
topix
holism
escolar
triforce
leesville
munthe
jafari
droite
innova
nonthaburi
gerulaitis
herter
arbuthnott
vimal
tows
milius
alertnet
ishiguro
phosphine
satyrs
canteens
ful
dashi
appetites
loyally
ustream
pehr
superbus
combes
pdg
yosuke
aflame
malinga
delph
canavese
gsr
centrifuges
bethea
lebor
hochberg
minerve
elek
jci
catonsville
bazán
rambouillet
torelli
piatt
hellions
mercantilism
brattle
mankad
empresas
erdal
repels
uche
gallerie
netbooks
winterberg
lukes
kinga
folksy
macanese
aros
acquiesce
matura
opperman
wmaq
olla
megahertz
lov
furnas
switchbacks
wattana
degarmo
gessner
subbed
lairds
vivas
seafarer
sacc
governorships
bayani
vallo
essington
fertilisers
mandelbaum
muzio
aït
coolly
cruella
professeur
folles
reanalysis
tors
tripods
hyperthyroidism
creatives
regurgitated
kunstler
tiergarten
pedant
briana
gerstein
podiatric
phono
subdomain
wordperfect
sturridge
kikinda
vettori
calla
wasc
grimsley
coped
writen
podcasters
publico
solicits
nephropathy
lamoille
janissary
pollinating
cosell
atropine
diawara
mescaline
republishing
scoped
overthrows
andras
barts
outlooks
ksi
charu
bodensee
invincibility
dedalus
bailar
dallaire
hemodialysis
wigeon
rezaï
violante
esthetic
emet
fictive
spearing
scriptorium
roeper
ableton
carcanet
dealey
paywalls
liverworts
unani
incapacitating
mccue
felino
schachter
tangles
riya
unfpa
peeler
rdr
elopement
columned
garhi
ralphie
wcf
asaka
rocor
garton
melis
languish
shab
chatroom
englehart
hatting
vulkan
bootleggers
ramshackle
vec
bpc
rozier
ryton
juts
sisal
seol
prizefighter
ooops
unkempt
ouimet
disfavor
neubauten
inhaler
transaxle
putti
pinta
jayme
redrawing
dissanayake
molin
treasuries
regenerates
holywood
tope
lorises
széchenyi
grandees
headhunter
tartuffe
sechs
hygroscopic
adivasi
livno
qpm
ksh
benavente
modeste
herodias
najd
pwp
elt
jordison
sunde
hydrates
ryosuke
frogger
madmen
senders
gratings
hecke
inaugurates
physiognomy
prf
nsfw
scruples
christenson
enfranchisement
brigadiers
ome
sweeny
nitrocellulose
espirito
antipas
caress
avrohom
corrupts
steeler
socialistic
pous
weirdo
krol
naboo
stockbrokers
lhota
panache
caire
underutilized
sabbat
pipits
paros
clenched
mahar
tarkan
tolly
cedeño
millis
escapism
pentre
dimbleby
brightened
soulmate
relapsed
handrails
wynonna
screamo
kolev
ritorno
phl
speedball
gfa
presaged
ofdm
torquemada
loews
rudos
polyunsaturated
ulverston
jaka
haslemere
mutable
xanthine
pascua
yousafzai
cometary
siesta
anantnag
dueñas
pyroxene
booke
teletubbies
kulm
socom
batchelder
scurrilous
nuevas
conjugates
undertone
wirz
unicredit
inah
ghaffar
vbs
apel
grates
gamestop
cienciano
plamondon
copes
dupnitsa
transmigration
solidification
heylin
degraw
musselman
snowmelt
jask
mimar
mudéjar
hubbs
ortigas
werburgh
hulda
grolier
rampling
dovetail
lummis
markle
palmeri
photoplay
reprieved
betances
efflux
kasia
bucci
voto
umg
garn
pleasence
peptic
nelsen
arroyos
eardrum
royall
unimaginative
thud
carder
garh
whiteface
dewdney
yester
uttaradit
alarmingly
horticulturalist
lacquered
coleen
mkt
skaneateles
volutes
popol
maley
blut
aghast
kittybrewster
palazzolo
envisages
machynlleth
orpen
needful
dashti
grisons
butlins
rnk
outfall
asics
lowdown
schering
lupone
cinna
hksar
siviglia
olen
horovitz
shriner
argyllshire
gillibrand
railfan
infogrames
forfeits
kst
fiddles
forestier
eccentrics
trude
keb
redshank
consumable
jcw
wib
calandra
saguaro
wader
sumerians
keihin
lpd
gutsy
psni
helland
greenstein
staid
scandium
pex
promenades
brocket
jamila
gaucher
exes
unassailable
roppongi
millersville
balotelli
gimenez
muttiah
kamilla
catscan
avital
pangilinan
zabala
larkhall
refusals
tanvir
truesdale
romualdo
ote
abramowitz
drees
urziceni
bares
verre
melnikov
datos
bartolome
hojo
europea
haifeng
noncommissioned
lyng
corniche
imperforate
iits
englanders
formate
dse
samarinda
rolph
kesler
gimbal
shepherdess
eurobeat
drape
schinkel
localism
ceaseless
cobbold
snuka
sparkes
rosati
legalism
inequity
virendra
umbro
volant
briones
ampthill
astralwerks
salil
preorder
apatosaurus
whitt
tako
sarrazin
kristo
ethnocentric
wallander
isabeau
mirages
autoclave
restorers
hohokam
wak
merrifield
nicotinamide
hauck
ayling
scalps
adur
dolny
chronograph
vectra
chastises
baraga
hawtrey
anglophones
senn
syntactical
reproducibility
tuli
dü
stopwatch
manzanar
eparch
murshid
mâché
microstates
automates
gibbins
outpatients
diffuses
pietism
harnoncourt
nabu
carrere
ondaatje
gynecological
lutterworth
renny
trailheads
antropología
kulin
neston
frias
nym
bettendorf
shepley
nira
spheroid
ciriaco
blindside
pegasi
prions
stubble
spriggs
credulity
rtbf
syncopation
preiss
caved
secor
digges
warrent
fug
institutet
buzzwords
brauner
vereinigung
ovis
abhijit
plait
zeigler
balak
turvy
behari
stamos
clavering
rheinische
kohei
entr
enyimba
woodberry
jeronimo
sevigny
uncorrelated
biosafety
maccallum
karunakaran
ayler
atena
prafulla
meron
infusions
berberis
malu
eyton
foscari
raag
manumission
yoweri
pacis
chichi
pieper
dovre
siqueiros
rosemount
olszewski
pawley
hughenden
roams
swi
jbeil
lcp
anny
gauhati
banged
rapallo
particularity
underused
tomer
roya
tacks
beren
malabon
saldana
trevisan
refundable
gibby
thorsen
permalink
profanities
synching
askey
prakan
nihilist
cholo
hambantota
canker
beholden
tactful
celli
chuncheon
vibraphonist
gnostics
ampas
odu
karnali
wpp
limber
astraea
hürriyet
wms
morella
peddlers
jettison
stormtroopers
tohono
tongatapu
summery
parisiens
usbl
outnumbering
celis
karatantcheva
shibe
bittner
craziness
swathe
chessman
centrism
kreisler
manahan
fizzled
actionscript
boxy
abdollah
pedestrianised
ock
erythrocyte
samudra
hoseyn
uncreated
piczo
torc
landrum
sprinted
mcandrew
mux
survivalist
infantilism
badman
pilton
postel
bluelink
sandifer
tempi
nishinomiya
disconnecting
absolut
churned
entombment
tresor
skyways
kalimba
nasmyth
picketed
kamin
mesut
shanshan
fogh
unceremoniously
wapiti
paeonia
jamalpur
ramani
mailboxes
bashed
ossification
imani
brags
deveraux
raam
anesthesiologist
langar
novick
wroth
linnell
cathar
izod
zeo
leis
stenberg
bahai
hofman
mgc
takase
paddled
brunelleschi
breadfruit
tyagi
disorganization
yukawa
balloonist
coughs
redeveloping
scrambles
orphée
commited
pluralist
woodhall
pimple
tychy
greenford
livni
trinité
harasses
bogusław
cathrine
friedemann
krabs
goudie
mente
denialist
speared
slac
wishers
fotolog
crumpled
shirtless
sideswipe
benjy
cañon
ues
dermott
protruded
millbank
babri
puddings
faragher
sirisena
overstating
dih
pocus
bewilderment
keeton
butanol
kilkis
pharmacokinetics
masterman
ecosoc
townscape
lsb
dunleavy
watchung
quadrate
cattleya
credulous
friulian
auric
hutter
rasp
florinda
kayan
applewhite
footings
gerold
watley
shukri
kapu
boog
armors
schoenfeld
superintending
poling
halles
zatoichi
madog
uncapped
vavuniya
gujjar
maulvi
iliescu
pawned
clumsiness
journeymen
elided
blackshirts
wyverns
hillenburg
dyn
osf
caudillo
hermans
hatillo
kmb
registrant
aizawa
misinterpretations
rivalled
agriculturalist
fuhrman
frp
omd
perfectionism
interjections
imagin
excrete
shogo
nonconformists
ayano
cockerill
odp
vayalar
actioned
olvera
delicias
transphobic
letourneau
umkhonto
creon
silvera
raindance
nursultan
unmet
talky
sheth
norrington
satyendra
funcinpec
deflate
moorefield
madhubani
gullit
sensitized
juxtaposing
afu
kulick
koka
vasek
marwick
indochine
outagamie
chibnall
awc
arroz
rangeland
guibert
inh
raylan
guadiana
alegria
veendam
bluetongue
morobe
aao
prn
roseus
disseminates
mainmast
ageism
iscsi
sanderling
wych
quadrille
walshe
snatcher
palafox
stretchers
transmedia
gentium
tiniest
hankou
gion
repaying
trf
trewavas
mastaba
paraskevi
cina
listserv
shermer
fallible
alphaville
micrornas
sherrie
streator
jui
kempner
watkinson
kaolin
greased
anning
ecj
marsham
euskadi
tebbutt
muffled
grossmann
firmament
unas
avx
nowruz
dele
radioshack
lorentzen
safir
honeyeaters
atia
itb
shamsuddin
refracting
orwellian
toomas
spherically
zamani
fpf
jespersen
inactivating
helmstedt
erbe
nought
fragonard
hisako
panopticon
reenter
pastiches
titillating
egs
hetero
megha
okmulgee
puritanical
bussey
biog
shc
kathi
glendalough
southington
dowland
fatman
zarqa
garston
phar
nlc
manasa
velutina
tadcaster
nanshan
drumsticks
hydrographer
pastoralism
rabah
wolfmother
rosolska
lunchbox
sindhis
lakehead
dettori
nsg
oig
mckendrick
marwat
garrone
broadus
rogersville
iguanodon
prudhomme
ticketed
yoyogi
munsey
trunked
adrar
paks
hanneman
mikasa
horoscopes
barbirolli
ird
jeane
gwu
hirosaki
expresso
drennan
anjan
yeomans
zepp
apx
ascania
salus
zhenjiang
nop
glimpsed
precolonial
surfside
einsiedeln
rivendell
armouries
circumstellar
plaintive
carmi
charoen
shavuot
mazer
tlatelolco
mulock
jugal
bhaskara
wesker
mileposts
feldkirch
aurélio
tweeter
deluna
tolo
hauteville
triceps
higson
mcclurg
arcand
teknologi
gisèle
bungay
artsy
corigliano
phonogram
rask
kunis
swatantra
moonrise
rollei
almira
repertoires
aldiss
quilter
chuckles
megalodon
morra
sluices
warhols
knapsack
antje
armm
lincei
doni
grom
quadrupedal
olivaceus
bridie
neuropsychiatric
preserver
ishiyama
unrepresentative
quijano
equips
lovano
teale
figgis
sinh
widmark
witmer
hnoms
lengthier
rehder
macos
therewith
yago
macadamia
rtos
spew
nguema
flagrantly
etter
bendel
elisir
visita
outwit
mainsail
kowal
indigestion
eurocentric
ornately
tunde
giorgione
abbotts
frauenfeld
uninspiring
dusan
yokai
paradigmatic
sartori
creagh
slandered
determiners
ringtail
trestles
kristofer
coteau
steffan
apophis
boldon
westend
drewett
ordaz
mcquade
blix
hcv
khari
millay
dreamscape
okc
kuchipudi
mmmm
funnily
overreacting
rajoy
nightjars
pacífico
coachbuilders
zouave
tonsils
shailendra
littler
waveney
rivka
itp
iguodala
ranchera
etten
ftv
upturn
slingers
folia
nantong
virtualized
quneitra
gwydir
puranic
mohican
resistencia
americo
puya
cdl
oristano
ziyang
digestible
lochalsh
tekke
winship
gummy
shipper
chabert
waronker
peeples
sclater
anomala
khorramshahr
maladaptive
horseradish
stagecraft
kishinev
khera
rindge
upswing
redwing
husks
unecessary
sulzberger
godfathers
roldan
scritti
plainer
filigree
saloum
bby
koskinen
subodh
flavouring
bellsouth
gwaii
chided
cajal
wastelands
lochhead
northey
cabbages
swithun
duplexes
hypodermic
bilton
unnerved
tokarev
recanati
earthlink
killala
sondrio
barbiere
rochette
mithra
seabiscuit
wilmar
vasoconstriction
warbeck
listenership
zumwalt
tagg
lijiang
billah
akm
espanol
biophysicist
siwan
trippin
ilc
devoir
hardys
kisangani
guangxu
terminations
retrace
rossall
starkweather
betar
vanga
fadi
dominici
tarifa
comal
pulverized
frontend
clunes
judaeo
baryons
energetics
blooper
akemi
ilkka
kue
stomata
torana
erhu
trikes
hanko
buh
lamington
bdb
capitalistic
vasiliy
burin
branimir
collazo
hawken
sophist
beare
pilani
wnw
workarounds
weatherspoon
gelli
egorov
trower
auroral
federici
mightily
biche
ministre
suchitra
ugt
amm
fujimi
dobrev
causally
bladders
loewy
talksport
wilhelms
neiges
emoticons
cengage
grimms
kayode
kine
audioslave
botta
unfurled
confed
convoked
amanat
mutating
chaytor
picatinny
micheli
sickert
parkwood
simplon
technocrats
eley
góngora
jeffry
uncaring
templo
baler
verifier
hagelin
endangers
hepatocellular
durning
graig
beheshti
mammon
infinitives
hawksbill
alpaca
redemptorist
submerging
bioluminescence
ayhan
millwright
portmore
panday
drinkable
symmes
undergarments
dgc
sonnen
masefield
gessle
eisenman
jianye
awang
jocko
ganapathi
wavelets
gef
californias
mudslide
breanna
poinsettia
cgmp
haraldsson
unlabeled
tritons
frightens
peachy
wanker
cun
rehan
sarod
gwa
envigado
bacup
doniphan
rocs
sdm
ubangi
batts
serco
phalanges
toivo
popoff
ites
zay
hurlbut
rezoning
schipper
arbre
grappelli
mathai
growler
portsea
hengelo
cancri
ulisse
tusker
broiler
frakes
edmundsbury
toccoa
hellblazer
multipath
sunlit
secularists
balasubramaniam
yuck
crysis
rashard
unrivaled
cradles
sbv
gentofte
tofiq
maye
stigmas
cued
wreaked
michibata
fmv
berkut
sloot
aldermanic
wiggs
pavo
clavell
ugarit
farmiga
ponto
miyagawa
kayakers
buckaroo
warkworth
sasson
smillie
jsm
trilby
howls
gelber
saml
buon
taegu
eisa
lobel
noticable
tangara
gaffer
plaquemines
blasius
trackside
knecht
atsugi
homolka
rto
satoko
zedillo
faringdon
gedeon
shravan
goellner
gazes
bronner
blunden
magnani
overturns
bryk
azra
consolidates
erhardt
adweek
pepo
claver
disobedient
tonino
ews
bludgeoned
adit
supermen
glucocorticoids
cowbridge
mongooses
premised
annelids
assholes
lucey
hobey
sido
avala
drinkin
steamtown
atlee
pannier
boh
khobar
kurokawa
lurks
jabiru
resurrects
anjelica
helbig
benga
mitu
fetches
locum
ambrosiana
ndash
moustafa
lovering
brodrick
wpi
extramural
dawns
sirna
vastness
brackenridge
ingelheim
lepcha
asphyxia
brixham
squeaks
lath
qahtani
balsamo
kinen
charl
chimeric
compressibility
heikkinen
branham
burqa
borgen
niva
subrata
symbology
tpo
gomis
subducted
hixon
kasteel
stds
demar
sharrock
brillante
kristopher
teilhard
nitrile
melara
heerden
skinners
leeuwin
polemicist
ipi
vecchi
groundskeeper
trifecta
jarreau
brevirostris
radiologists
lakin
beygelzimer
bernkastel
klement
attwell
lorette
picabia
buckhorn
digicel
pesetas
pteropus
sheahan
gökhan
flukes
pressler
sneijder
jiji
witching
fizzy
maariv
damo
iturralde
rowhouses
skov
hermaphrodites
sulaymaniyah
baseband
samui
interventionism
fatherly
warps
dowdy
kumbh
jacko
bajwa
punahou
gabrieli
hba
kannapolis
submergence
sheikha
uggla
vipassana
virion
asinine
submariners
inta
powershell
ghor
cressy
bjørnson
sinologists
malm
spasticity
manin
hayfield
kalia
outbid
naphthalene
confectioner
chowan
tabata
bericht
brimmed
stafa
beroun
bektashi
savoyard
townsmen
avtar
lehner
cov
sebald
jds
kreutz
cavalryman
kerslake
farlow
hispanica
edgars
smothering
lisowski
fales
latinoamericana
tihomir
biryani
infuriates
lydney
starosta
exterminating
interceded
musi
ampara
bandgap
wangen
scavenged
niggas
unabashedly
plasencia
pinks
amnesic
blouses
seko
anji
wjz
jabberwocky
trud
centralist
dhol
adoptee
boracay
fingernail
eft
gertz
politicization
dragão
petras
kinema
debriefing
shamshad
eom
ciccone
coláiste
galván
colonic
tsutsumi
quezada
opelika
squeezes
tortoiseshell
positrons
sarhadi
ayoub
reconfigure
atienza
corsten
acht
rosenstock
cincinnatus
boshoff
renin
polsat
blomqvist
methven
jalali
chinon
bovis
absconded
glaciologist
cacho
granma
mantovani
orst
watrous
comhairle
mafalda
imbalanced
stanier
suen
iwerks
metta
sumgayit
patwardhan
chough
wyld
intesa
cycad
danièle
badalamenti
meitner
okabe
vesalius
mimesis
chalets
holodeck
trills
chartrand
bertolini
enrile
uprooting
queenscliff
tytler
judicially
accc
falsifies
emer
incoherence
cbb
kraepelin
zodiacal
mactavish
thesiger
martigny
bennetts
disturbingly
refutations
boneless
danang
debar
lascivious
naba
woodfull
rhyne
imm
conjugations
bbm
wherry
kewaunee
pontotoc
prosieben
plastering
professionalization
signore
arnesen
saturnino
morera
numéro
athan
reburial
dte
kurier
musab
monetization
tohru
kaczmarek
izaak
mullens
attributive
lvi
truda
mihir
byard
lawrenceburg
hubbert
seppo
cce
waded
fmf
januar
taio
correspondance
yellowjackets
swordplay
mangini
coalescence
giustiniani
mandolins
mondragón
piecing
rcti
curwen
oligonucleotide
edsall
thijs
mcmillin
seether
adventuring
spinney
strood
mapplethorpe
proprio
takeoffs
stalagmites
rovereto
guinan
dural
blakeslee
kütahya
cranleigh
mccully
haglund
lindros
durfee
grasps
chintamani
obeid
sharyo
millinery
belmopan
rungs
bravura
lethargic
migdal
wilden
ophiuchus
dida
guillory
macomber
gak
barmer
hicksville
molesey
tink
nichts
rottnest
dayna
ladle
ojha
mclaglen
fulmar
linx
mesmerized
escarpments
schooldays
weirdest
ushant
kindled
wagtails
elp
zeon
blondell
mccleary
litvinov
icom
arthroscopic
godspell
kika
incana
queso
nlb
fujioka
paged
mch
battelle
heba
trow
broadmeadows
demento
frsc
faultless
ganley
dimwitted
tiscali
bulan
janzen
hookah
susy
geol
cupolas
pitot
dus
coolies
dut
castigated
cytometry
prodrug
stringers
tomoki
lotbinière
herringbone
usna
monkstown
chardon
paigc
basterds
telles
belconnen
morné
wnet
autophagy
sonderkommando
hallow
lambayeque
bucanero
beachside
monastics
spunky
ladner
pantano
birthmark
mizzou
ardwick
tranches
inerrancy
salespeople
snakebite
charlier
polymerases
économie
pegram
hannelore
lucilla
unconstrained
newberg
leber
chungju
misspelt
theis
conocophillips
putman
bajpai
supportable
mackin
logon
nereus
contravening
fedexforum
jazmine
resch
dsiware
dieterich
hijaz
refoundation
bakeri
binkley
fleischman
upstanding
seversky
jusuf
masochistic
deliberating
niland
begat
assertiveness
imma
mudhoney
dallin
ocaña
wol
malte
morad
condell
badham
toshihiko
rickover
ponomarev
songo
idg
stiffening
baloney
conlin
berate
serail
doering
franch
fistfight
cah
enrolments
communalism
burnished
signifier
montejo
burdekin
paramour
hüsker
granuloma
mansdorf
repressing
neutra
emmitsburg
ulc
saleen
zanardi
esko
fritillaria
hallamshire
kimo
solus
parkins
tintagel
cawdor
roseate
chimie
beccles
murilo
rupprecht
olesen
dillwyn
tuya
keelboat
solidus
tamagawa
mathison
alang
ndola
altavista
palu
scheffler
workhouses
atrash
witchblade
earthsea
roberti
gresik
shaar
bintulu
rivulet
ardently
dispenses
aherne
landseer
hacksaw
beacham
sofala
gyroscopes
göteborgs
nauman
horkheimer
timoshenko
decathletes
lindblad
tamarisk
baye
evas
mdg
lilburn
statesville
poro
maina
astigmatism
shuck
banjos
kindest
wsw
nuevos
donnington
oddfellows
backdated
westmore
cbrn
botanica
straightforwardly
foreheads
treasonous
imploded
bentivoglio
friendless
railton
veuve
bfg
broomstick
konstantinov
microsatellite
perales
tfr
tizard
legless
satirised
gratton
dachshund
burren
barinas
crafton
peritoneum
lakatos
cobblers
belzer
kvm
duccio
henninger
gaitán
snowshoeing
cecilio
neurosurgical
herdsman
fcf
theil
cheeseman
pennsylvanians
cush
strzelecki
booz
cloncurry
torri
eire
slitting
scleroderma
deeping
celadon
breathable
mornin
tota
mixon
fillion
peiper
tukaram
revolucionario
duwamish
cayuse
mils
virions
gona
hoth
caddie
lasallian
patriarca
manno
tracklisting
andreasen
lepore
josei
freamon
crowdsourced
chiffon
mealy
evanescent
stephenie
neugebauer
emm
puch
atchafalaya
passamaquoddy
rtg
hillhouse
oscillates
emporio
propellor
writhing
belyaev
garriott
awadhi
danijel
karai
kontinen
dhi
accuweather
habanera
nealon
nominum
balachandran
killick
kaboom
brenta
penton
hirvonen
heritages
wkrp
gsl
roxburghshire
mmi
kantha
ashtray
olwen
vosper
yohan
vies
priyadarshan
reck
naumov
matteson
pincher
aramean
loxley
adjuster
sergej
maharashtrian
guptill
peremptory
detonations
tufnell
streamflow
datacenter
hubley
redan
mridangam
empathize
honma
safaris
eppes
persei
byeong
rambam
heaviness
jermain
recce
königstein
medora
kawaii
pandion
schmeichel
conneaut
pripyat
halli
pizjuán
nhm
woodcarving
midnite
obiang
nelli
ryden
platteville
kabel
qaumi
raffaella
achaeans
götterdämmerung
honus
lobelia
epaminondas
funneled
huger
baixa
domesticity
houk
yonezawa
evros
harishchandra
abdol
excitability
nystagmus
nanostructures
siddiq
aulnay
xining
mirra
masseria
theda
hindle
livewire
rentería
machismo
indulges
nobu
hotshot
eurocity
oklahoman
gynaecologist
fulke
plessy
osterloh
strictness
secco
riko
ishwar
elvas
solan
nystrom
huynh
emrys
giove
gloriously
regie
pendulous
cwb
nve
saiful
claves
frontiersmen
dmitriev
ofcourse
sanssouci
misapplication
homerton
gaullist
hipsters
stethoscope
bergère
varietals
hospices
eld
bricklayers
maracay
dmd
rebooting
umber
nyo
ogmore
zakk
pinpointed
chickadee
fave
cics
escrivá
soro
dasher
dudi
elsner
viacheslav
howey
longboat
yuh
mahomet
speckles
chromodynamics
norilsk
carothers
lythgoe
modulations
malena
unconstitutionally
sabines
rothrock
hydroplane
reding
dunoon
wlr
karamchand
higashiyama
chaise
angelov
balalaika
verbum
loubet
destefano
fpas
resplendent
oses
guerrier
pally
geza
viviana
tramping
ormoc
elementals
rainha
bini
blench
crookston
emeli
corticosteroid
dominos
kiani
kufstein
burda
oversimplified
wdc
norio
dicta
orso
tanegashima
simsbury
eichelberger
parakeets
inkling
leflore
voyaging
maximising
millennials
cristi
dmp
yazdi
heffer
saada
macleish
perps
rigo
cloete
laue
astrophysicists
hln
immonen
towneley
crosswind
gumby
katsuhiko
idina
thinnest
kosrae
lauds
carpeted
layfield
hushed
btecs
emissivity
randel
redbook
roberge
altena
rozen
rwe
unk
allspark
carted
walthall
cenci
ood
harling
bracey
balzan
swimsuits
saucedo
célèbre
newnes
graber
joji
svm
anadromous
rideout
scamp
mpm
borelli
dsg
univariate
holkham
expounds
arvada
chamoun
ccny
staden
quijote
sailings
defcon
clifden
xianyang
mcwhorter
ostwald
betws
gunz
dressers
edisto
arklow
ranganathan
nitrogenous
ronge
pavlovsky
gemäldegalerie
durr
tyrconnell
pendula
masham
slf
waiving
hibino
assented
mellifera
cita
bodice
cmv
hampers
yerushalayim
pkc
sagesse
samael
clb
amdo
truitt
hoel
theodorakis
hjort
luray
koivisto
militarist
politicised
planum
cousy
brunn
katsuya
ipsa
psychodynamic
banta
aragonés
energize
contro
crea
prestressed
huseynov
halberd
borate
sandisk
improviser
hofburg
salzman
bensonhurst
sourcebooks
levity
histadrut
corsham
smother
astle
hplc
canossa
dernière
firebombing
kovel
litvak
taxonomies
retaliating
pecci
darvish
pimenta
childbearing
miter
jellybean
academe
twiggs
salween
eline
canted
barbarella
waitrose
chali
calorimeter
izquierda
consenus
goldberger
gilli
ayeyarwady
kawagoe
loitering
cognitively
sowed
warlocks
ashkenazic
pervading
ataman
fria
mahou
dearden
mox
acadiana
boastful
mishawaka
heeding
cino
gondry
chambon
enought
meco
continuities
caymanian
westhuizen
bluffing
rawle
elucidating
aristolochia
kommersant
mutharika
ridgecrest
tolson
nadya
kuta
onur
interactively
petulant
farsley
bheri
izzat
desertions
trebek
ferreiro
decoupled
unwound
zooey
tion
problematical
netzer
delphinus
beastly
wordsmith
monotremes
hecuba
tapan
murton
kahler
secant
dietetics
inhumanity
edification
undertakers
iancu
zenda
rooker
antonios
gharib
marchesi
tamers
kerim
cco
electrify
lumbee
hardboiled
bellson
grundtvig
dawlish
premières
mcdade
snowmaking
josse
kissin
guillot
obando
moondog
dionysian
frigatebird
lebensraum
luminary
iñárritu
poggi
floater
portus
bcf
orban
mussoorie
broached
imbert
enki
forewarned
romeu
gammel
dlitt
dents
idan
idn
bowsprit
montesinos
pushmataha
saturate
hamam
cardo
misterio
korey
overheads
efc
spreader
nery
ndsu
bellwood
mullick
tourbillon
rapidan
thoresen
pursuance
chantiers
goodhart
plantinga
meiko
unnaturally
lank
germar
branta
technologie
kameda
langella
grahn
gorp
affiliating
grinds
bayat
stanly
brij
tripolis
seismicity
sidebottom
devoutly
seyfried
relievers
anachronisms
raho
schreier
chimborazo
chauvin
leas
mahalia
pouvoir
melded
nabis
skyhook
limmat
zamboni
malted
corvair
moorhen
transmissible
snotty
bosnians
lengeh
humiliates
imjin
dlm
unserer
harmonically
ramped
sexologist
guardhouse
usfws
oquendo
ruhlmann
bloodlust
réal
serp
amwell
rikishi
nehalem
será
osip
exhaled
bitton
caskets
untangle
chipotle
stip
narwhal
salafist
piombino
rademacher
lulz
turbodiesel
mizell
icmp
esteve
christman
mencius
spymaster
gorgias
uran
eliasson
opensolaris
heartlands
tanneries
ceanothus
merril
lavoro
decedent
umbrage
dinoflagellates
repsol
tintern
moine
archipelagos
tezpur
mikka
stebbing
ajo
darién
whydah
feydeau
photocopies
jmt
busways
scarfo
awal
wigglesworth
pamlico
adroit
eit
crocodylomorphs
loti
epiphytes
samberg
perfumed
fukuhara
fuori
stassen
lettice
vla
sebi
gye
shoegaze
palahniuk
fairburn
authoritatively
marya
bochner
enlarges
pavlovic
fullarton
chabon
swab
antiguo
porque
hosur
adamantium
plosives
retouching
travelin
ibb
eludes
gerbils
cockerel
helicobacter
texters
superbowl
larga
hoche
afterwords
aldus
arul
multirole
sessional
alencar
frustratingly
nrs
caissons
sheetal
belin
peenemünde
spotnitz
sakhi
natrona
colluding
herdsmen
shivani
caradoc
insipid
trager
santayana
skincare
brda
heinrichs
portis
eroica
cupboards
geena
mylo
dain
calouste
kazuyoshi
ayyappa
dignities
ades
shino
coble
rimington
iir
destabilization
shinzo
gok
refinance
gaap
copperbelt
ayyappan
invalids
bicentenario
qal
midwood
kippax
giampaolo
etawah
matchdays
mathurin
fourcade
cloches
duals
hersham
parterre
rehm
asmussen
mamá
brassard
bialystok
kassam
ravalomanana
sito
superfluid
tremonti
verlander
shoves
truest
nourishing
meccano
teutoburg
ariake
unenthusiastic
spearheads
gmac
homberg
gamini
colebrook
misattributed
spes
gost
pelo
hoarse
buttoned
maritzburg
marad
demographically
momsen
wavered
gillig
aldea
yefim
ejaculate
prophetess
mauritz
nuyorican
cintra
taufik
mckendree
buchman
violencia
kersten
handiwork
necromancy
odio
emba
immeasurable
turbid
bambaataa
calman
rosalinda
ily
blarney
candlemas
fishel
mbb
picus
yearlong
gorin
stents
dja
eventuate
apparatuses
lipophilic
canariensis
variegatus
sufficent
tartars
venner
choctaws
iannis
moorgate
aminu
brutish
arve
conté
hellion
pylos
lofthouse
salters
spaceplane
sidewinders
fuoco
bookmaking
maunder
kubiak
hypermarkets
adepts
hertel
fifo
sepak
milad
babin
nycb
locatelli
daba
ahold
holofernes
cillian
goodly
faqir
ffsa
shatin
lepa
breuil
mys
abay
regolith
garai
neurotoxic
sabian
winnfield
hypnotherapy
brb
feluda
aeronca
baber
mcgoohan
zenaida
furia
wirtschaft
locomotor
chambal
ettrick
afforestation
faqih
gilbertson
leopardi
fourthly
cabled
zhdanov
kurth
etz
remembrances
haemophilia
montages
thew
cynan
gladio
penélope
oregano
laurentia
transfered
txdot
longden
atilla
yevgeniy
sonorous
mcgarrett
hayles
broadstairs
publicising
zend
fishbowl
jewitt
whiley
kunsten
irae
ziauddin
qibla
jetpack
hansell
katoomba
feste
enn
zsigmond
lúcia
binion
gestural
dented
chink
gado
freberg
albizu
herschell
lapp
carolines
ponyo
gabrielli
rupe
alpin
frogner
lacing
togetherness
outwith
ragnall
unappreciated
rakhi
hendel
astrazeneca
duell
legwork
zapopan
hiramatsu
flevoland
suria
norteño
martialled
cheapside
rigida
oglesby
kidron
revolutionizing
gillick
arachne
pictographs
daresay
radetzky
duna
overrepresented
textron
rockall
gutting
wara
overstatement
uselessness
balraj
cou
marts
keynsham
agfa
edgehill
nonsectarian
weei
myrdal
pwllheli
maestre
arcaded
outremont
shildon
ihp
amyloidosis
pygmaeus
relished
expletives
tendinitis
junger
hillforts
zun
buisson
chromate
canoga
mtf
ilir
arachnid
bolen
dreger
sebaceous
ramazzotti
ambar
imbedded
helles
zameen
appeased
hicham
satanist
corrente
heigl
workaholic
vch
heschel
dubna
reichs
hofstede
bared
aspley
preselected
tarentum
castling
lowy
chigwell
superwoman
aversive
plaisir
mysterium
primum
werfel
abridgement
rhema
steinberger
naivety
medvedeva
oligonucleotides
ritchey
fak
expunge
camelback
caren
suco
hazlehurst
bobtail
prozac
dorsetshire
lowden
unifies
directionality
watchdogs
horsa
upregulated
forfarshire
haz
quadriplegic
iridescence
niaz
erdington
hermanus
subsides
vladan
absalon
linette
buddytv
sungei
liberalized
baumeister
kuip
bothell
edy
defranco
emmrich
hardliners
nicu
junichiro
suspenders
urey
topiary
karya
corella
mizushima
outlive
grantley
conscientiousness
genest
baixo
sovetsky
obfuscated
laffite
jackdaw
torrente
monseigneur
venugopal
dornoch
laidback
piccola
singen
adk
shedden
skulduggery
shrugs
dasha
anhinga
chequers
bieler
filene
vigilantism
caroli
schwinn
shefford
raver
underpaid
hickenlooper
harpists
boeck
cavaco
altera
reincarnations
anselme
chumphon
aflaq
mechanisation
supermodels
invigorating
temescal
wolfenden
vrba
clemenza
aeronautic
davidoff
wizarding
tympani
lurker
shepherding
ngau
wun
deltoid
camo
baya
watton
kumba
rossana
zelenka
fillol
xochimilco
yojimbo
elif
lisandro
thruxton
oligarch
interrelationships
marsters
sants
jongh
atvs
disciplining
firhill
gambhir
mccloy
voyeur
grigorovich
tatsu
datable
oberstein
inky
westerman
wensleydale
keewatin
kucha
weblink
kenelm
appendectomy
prioritise
mendler
bawden
uncommitted
endemics
sandlin
nawaf
eightieth
thermoregulation
herbivory
goudy
jodrell
hornell
untiring
ordzhonikidze
gatchaman
miscavige
legros
lowcountry
marivan
oyez
polenta
kilmainham
caduceus
marcotte
forres
cameronians
duilio
outrighted
knudson
markaz
groulx
savary
fontenoy
landsborough
stepanovich
borek
stegner
fanu
evel
nashi
krickstein
carro
oughta
bimini
crozet
cheongju
iijima
treno
calabro
smalltown
macv
layover
canonry
niggers
tuffy
bulkley
jerri
concrète
furio
ork
fossey
daming
dockside
recuperated
jeni
organum
gmr
heartaches
largent
knowland
bost
sherk
voltaic
harmlessly
agathon
vlf
trailblazers
nullity
arz
chadha
hemmed
sbp
sogo
serval
stereolab
starnberg
kitto
shadab
kamata
pasko
shackleford
vanishingly
atocha
chaillot
zoetrope
nergal
brockie
dawgs
ichthyology
ellyn
regressed
vysotsky
kinescope
azorean
frontlines
desiderio
wildenstein
emine
artha
isca
sennen
sakic
veld
tfm
guimard
naidoo
yoshizawa
snouted
palazzi
catchings
hydrogens
mothering
rpl
heslop
ballymoney
kongens
administrating
fortuny
biman
pickler
christiansborg
eulogies
pett
armon
wipro
borba
hinterlands
hopalong
rameez
prinsloo
portcullis
radiofrequency
tila
malad
aretz
immunocompromised
anguished
dieng
dimensionally
titel
bonnell
friendster
nonresident
loria
kamensky
konda
tarkenton
tabb
marginalize
pickling
goyer
denigrated
egocentric
unmitigated
kudarat
caley
regurgitate
chapeau
ogi
calderone
menen
extenuating
ludlam
myopic
misanthrope
kogarah
mészáros
laysan
ascona
sprig
eosinophilic
slavish
carnac
logins
violists
whodunit
sestak
schade
acma
icn
gobbi
mattar
kose
mobbing
pigmentosa
mutawakkil
radm
dinaric
unready
wnit
extorted
wiking
quién
requisites
understorey
desensitization
impaler
menninger
lilacs
casablancas
baiju
naropa
chala
brimming
cordoned
maheshwari
mokpo
emap
jordanians
sawyers
iseult
legitimizing
guildenstern
vdm
peddie
thawed
improprieties
jadran
ideologues
woeful
arouses
pervaded
lannes
nyala
entrapped
flavorings
mudgee
shunga
fullscreen
fofana
mazumdar
lapidus
spacemen
venise
olavi
retinitis
dalry
ppf
kilter
cbu
belay
mcpartland
serj
leth
grimstad
euwe
doman
corded
zagato
barbee
batanes
keyless
menopausal
vpp
yore
ocha
hennie
uat
truckload
assemblee
lpi
gargantua
villupuram
gest
serkan
mpas
riboflavin
petrenko
rambla
nikolsky
gex
stupendous
neuroticism
icecaps
caenorhabditis
cassio
chillin
artcile
mondal
ardal
hte
putumayo
nehwal
planète
davinci
dadasaheb
dalley
baselines
satraps
panty
durack
roxburghe
filippini
rossville
lippman
sedum
knocker
ducklings
abdo
nextgen
underdevelopment
unaccustomed
juda
swashbuckling
knatchbull
okla
bridwell
cloverfield
poetically
fsln
dorrance
anto
georgiy
fluidized
microarrays
bridewell
connah
bobadilla
crandon
wiaa
hsd
njit
adjuncts
mazzoni
fraserburgh
kapila
stockpiling
trifling
theologies
fikret
antonini
ols
oversea
duhok
carvin
drogue
statins
senapati
belcourt
selfridges
batwoman
yodel
lary
panigrahi
latvijas
guyer
bickle
niekerk
chemotherapeutic
musto
tufton
moneys
vardhan
raimo
wanli
oishi
olindo
prodrive
ndf
ignis
brightside
haverstraw
ludlum
astrup
noches
coexisting
hexagram
unspecific
confiscations
anthroposophy
vnc
hie
beuningen
kebbi
kinu
fathering
dhyana
amable
asamoah
berit
leaner
biggers
teaspoon
consoled
epperson
revivalism
brahmachari
alvise
mtd
synchromesh
trespassers
burgtheater
moolah
kaidan
kamma
susanto
pteranodon
tortugas
pek
gcs
roarke
rater
ryall
berasategui
wendelin
palestra
chengguan
poof
viollet
xbmc
tolley
ryders
terminators
vinko
houle
reabsorption
templi
topgun
vintages
abrantes
rougeau
valleyfield
symoné
lemire
peewee
mutua
harmonised
moskovsky
equites
balrog
mercies
defecate
tackett
behrendt
risings
icg
lúcio
cautioning
rowles
deel
yuto
vandy
quixotic
feige
chasma
sniffer
aerojet
agam
deducing
jabotinsky
recheck
panter
hofmeister
eddystone
wcm
ringway
disch
samanta
coffea
rumbold
ironmen
chlorate
rwa
gores
confections
bry
gazan
razzak
ragging
startrek
worldnetdaily
conon
gamarra
vahl
warrensburg
alys
buddhi
soreness
selecta
otay
dcd
vesuvio
rivaldo
kuka
disassociated
dalberg
balaklava
flanged
gashi
millenia
parfait
assunção
arakanese
korat
gloriana
jacquard
takahata
sturmabteilung
donets
sumitra
azikiwe
binoche
inculcate
rossen
neagh
oberkommando
corporates
whitsunday
lait
vaw
shalhoub
ranh
tenniel
naturales
zoé
ushio
pellegrin
colloquy
retable
philanthropies
flammability
fosdick
kohan
birches
definetly
denouement
rickets
portrush
artus
suzana
swindell
murch
dabble
wbs
gardin
patos
ananias
riverbend
embarassing
echternach
braiding
vso
savimbi
xray
lvf
eugenic
conemaugh
indescribable
perspiration
kurta
westborough
conflates
hns
kanchana
hetch
durai
moten
salento
homophonic
disruptor
oberheim
disparagement
deferral
thatta
chori
castilians
resorption
disbursement
ftr
vansittart
inquisitors
venusian
tejo
chauvinist
antiguan
syon
baudrillard
misquote
athletically
finis
majhi
borna
discothèque
uzb
ellensburg
quadrangles
romulans
mladić
rsaf
kovalchuk
thakkar
menil
sarcoidosis
meknes
fiala
coffered
leavened
tutin
aldenham
mayra
merl
plumed
scintillating
addenda
glomerular
couleurs
rougemont
essien
reconfigurable
diplodocus
pakatan
benegal
chakma
rotundifolia
kartal
moko
backpacker
corlett
économique
spatula
sparkassen
necrophilia
tilney
ramji
nagurski
moieties
derose
autobahns
cranborne
borstal
aventis
mcmahan
yatsenyuk
steyning
soothsayer
terris
artfully
raphoe
torv
aguada
inaugurations
crowfoot
servir
pickersgill
fayre
angleterre
hornsea
mansiysk
minicomputers
destitution
xan
lobb
scv
vidalia
koror
steenbergen
caracara
prebble
sempervirens
revivalists
sommelier
kunieda
poudre
waga
finian
sacagawea
eagleson
tweezers
mayville
kist
proby
helfer
knuckleball
poser
thinkin
dailykos
perrett
kismayo
pcu
meran
mottling
madhur
ehl
mof
libéré
westford
ichiban
sarto
tolyatti
generalisations
reconnecting
navajos
newson
pings
immobilization
nnc
finnie
lamberton
bridgeheads
catatonia
tinplate
komitas
birchwood
gibbet
ayalon
ventilators
assiduously
ryota
ahwaz
stolt
pavonia
polsce
melds
snakehead
nandita
pobre
luann
olympias
blackstar
eusébio
amoruso
gnss
polygynous
malthusian
giftedness
disclaimed
najjar
primorye
oatley
tcd
tricksters
toure
mohyla
wissahickon
sargassum
cria
antonietta
vineet
chinois
britz
mcgoldrick
khara
jaffee
articel
dalen
heartwood
bobbitt
saldívar
offroad
goggin
lavochkin
fenestra
grammofon
maestros
kingscote
hayling
kindi
trounced
schermerhorn
dimapur
supercluster
baldrick
conspires
drukpa
quadro
oulun
olim
sharky
crunching
bluebook
shrigley
terhune
imaginaire
billerica
kalou
natta
hueneme
krakauer
lambourn
shuffles
muntjac
clamor
bedoya
slaughterhouses
wiretaps
salzwedel
swara
renminbi
capstan
popularising
milas
zuniga
voiding
fiumicino
brande
halpert
dioxins
blinn
latinoamericano
antivenom
alben
fnb
debartolo
parasitoids
okavango
documentarian
silkworms
anirudh
horas
ilmari
eptesicus
newnan
wigley
mcivor
moneta
aliquot
holub
handoff
granholm
literaria
gode
anjo
stratemeyer
fundo
orgasms
subcamp
domani
qst
cumhuriyet
mcgrew
shamsul
revelry
vonn
tria
gingival
gamsakhurdia
constrains
nma
corbelled
dsr
mego
ghai
slumping
divestiture
circumnavigated
gordons
taq
thickens
guisborough
anagrams
bridgford
magdala
overconfident
queda
jackfruit
raspy
velimir
harriot
geils
amex
tinkler
cig
muthuraman
banffshire
deflects
repurchased
diomede
topographically
tinley
cahors
isler
alea
educative
blish
sejanus
pronounciation
salviati
agde
ciprian
redbird
sanae
guayama
tangiers
resto
dhekelia
nelspruit
edmundson
stainer
ceann
farmhand
rotter
counterfeiters
panamerican
hmi
prashanth
bordentown
holte
railtrack
behaviorally
hartington
mediterranea
fastidious
highveld
neudorf
willes
ramillies
hersch
carrousel
openid
thulin
egede
engelbart
alesha
newtownards
colline
boddington
brahimi
mayers
europarl
sifaka
ashbery
typographers
toasting
duesenberg
adélie
databank
marroquin
ghedin
overeem
matto
upfield
rindt
burkholderia
fumiko
ccdc
eielson
rfe
akil
kbr
kapa
snarl
lautaro
pastored
sensitively
cuyler
tharsis
rtve
amphipods
cassels
safire
matchstick
divi
larchmont
cesi
cepheid
transaminase
bareback
ritenour
zellers
tabak
wahhab
meakin
skywalk
parivar
epos
slaughters
safehouse
chandy
brander
anata
timaeus
outplayed
reymond
berceuse
dobkin
kavli
bhandara
botelho
uncouth
handkerchiefs
glavine
hakam
schwitters
paring
avocados
ageha
radziwill
morganton
gleam
lycian
pussycats
crd
yogis
gérôme
arbitrations
norwell
carnaby
matruh
attired
hamsa
gunns
dearing
enge
centimes
wever
andriessen
etsuko
abiola
historisches
tatami
fito
decorates
skydiver
ablution
nuala
etre
ssds
unloved
ashkenazy
imr
magnes
wcvb
donen
moorpark
scriptwriting
posco
encyclopedist
diw
skuas
labo
enquired
hirokazu
barrière
tremblant
chestertown
sleuthing
marienburg
hackberry
rapeseed
defiled
outerbridge
melania
gavrilov
groupement
dishonored
magnon
rega
nomos
patta
soggy
lvt
nabeel
lorimar
eep
hindquarters
turkmens
cheboksary
katsuragi
mbo
timeliness
unkle
asymmetrically
halonen
curlews
hagiwara
cahoon
hace
olkusz
kabi
anglosphere
peniche
andreessen
devlet
francesa
rastafarian
oci
cathie
lemass
peniel
parenteral
margarito
wingo
randhawa
salley
jaque
galsworthy
gouin
subsidizing
argenta
washy
handrail
camerawork
nolen
sprinkles
politkovskaya
shawty
maol
connote
zebu
después
khabar
piyush
tenzing
kuyper
grammaticus
corbel
pacem
libourne
recitatives
partch
goncharov
yaba
simek
knutsen
selflessness
driveshaft
renegotiated
anais
edwyn
yellowcard
hkfa
philipsburg
enamels
generalship
ramm
kbp
veneziano
scheele
xichuan
hybridize
washout
escalade
hedman
hongo
prespa
phuoc
proliferative
merce
moldy
aerated
vpc
baller
heyford
ntpc
gampaha
intan
bergonzi
cheesman
lindblom
basswood
jenssen
pcia
yellen
gifting
shes
jocular
tiwi
newyork
gbm
adin
spaatz
schnittke
scrutinizing
lavaca
quizzing
funakoshi
sandbanks
gatlinburg
connex
guaymas
elearning
tummy
unneccessary
cherubs
hamburgo
intercellular
vario
yifan
murtha
nsx
chattering
novecento
stoics
virtuosi
badar
unsere
womersley
kuroshio
lmi
cuteness
sinusitis
reductionism
magali
myong
aspirational
titchmarsh
wairoa
extorting
odot
cryptology
dulcie
lorn
markel
delhomme
mashpee
fossum
rosselli
grudging
bezel
democratisation
quarrymen
unreservedly
massapequa
baltacha
conman
pescarolo
malvasia
broadwell
upolu
mcmath
pocomoke
cultists
muzong
pinheiros
anapa
deconstruct
tecnologia
rigoberto
gronkowski
dionysos
cambyses
prb
guanosine
ermanno
ulsrud
teledyne
hashimi
rend
kinderhook
williamsville
gorna
pallett
demeo
columbanus
wiel
einsatzgruppe
rolo
discriminates
laren
broadcasted
sorter
dagen
fuera
fcp
problema
wellspring
bitar
galashiels
endosperm
skagerrak
grapevines
plowright
maloof
pmd
hitam
foreclosures
charam
sugiura
koraput
flashdance
desegregated
rollings
revaluation
trenchant
laurance
unfathomable
jussieu
sfm
cud
claustrophobia
hsia
kerby
harriette
pontes
promethean
leck
frayed
truecrypt
bartending
stong
permanganate
shaye
kaza
peste
turbos
bahmani
knol
sorabji
beatdown
internets
khokhar
rickles
bluster
nordahl
eparchies
mukhrani
ironbridge
unrefined
ryedale
merchiston
heartwarming
laxton
vanderburgh
hipp
litho
haemophilus
tich
amarante
jocelyne
dmi
laoag
ivanka
wehr
balle
phoenixville
struan
longreach
boneyard
disko
volz
tichy
internalization
gmd
nador
hoda
stopgap
rajaratnam
romane
uruguayans
pedder
tiso
fef
fearlessly
jone
abg
kimberlite
kilner
forebrain
huzur
compaoré
toyed
haza
bilkent
buffoon
faze
ellsberg
creasy
serenades
cochineal
mct
chickpea
gonadal
pointlessly
signalized
granduncle
folgore
denatured
wyle
potpourri
sanath
choix
agneta
abax
twink
raynaud
majesté
esdras
atri
huizhou
moule
michon
tsetse
tta
popovic
isac
rubbery
cwmbran
jubei
overruling
sauerland
dubiously
dogtown
radke
felsic
sizzle
karger
sprockets
madhopur
smyslov
tripwire
vion
furey
showband
panamax
engulfing
wallflowers
tyszkiewicz
sofas
guba
kcs
couper
yellowtail
toasters
mathiesen
meighan
reorientation
veggies
ranvir
macdonough
pamphleteer
yaroslavsky
vihear
qiqihar
imperceptible
pothole
earthlings
beenie
enim
jms
dampening
ciampi
moveon
personne
mizzen
provincials
anthologist
nunzio
czink
cubby
bykov
velebit
taxonomically
withrow
pagnell
hoxie
waghorn
cawnpore
cgc
heckman
hetchy
artcle
morpurgo
glanford
etonians
shined
navistar
elbaradei
donegall
servicemembers
jackrabbits
kocsis
ramayan
whereafter
greb
bhagwat
wheatcroft
mody
openvms
lall
prajapati
ndiaye
agoura
berridge
secombe
parang
fulco
scoresheet
strunk
torturer
vache
kss
omx
dunkley
dannie
maija
pleurisy
dinsdale
pericardial
photoreceptors
staël
mylar
cjr
misleads
devry
giblin
ricerche
ssri
kanchan
garrigue
lasius
casse
zaandam
zauberflöte
harlock
carrel
scaffolds
tigra
hrant
emanations
whitsun
abebooks
inaudible
kupang
pomme
fasciculus
seve
boudoir
gervasio
takeshita
errázuriz
paraguayans
ison
happel
lll
keigo
nosewheel
ellingham
storace
nozick
arpad
unavoidably
quiroz
aln
aubyn
faunas
riffing
strayer
fais
raggedy
arbitral
brianti
bühler
mulling
titre
duodenal
rugg
pernod
chano
wags
tuohy
stuckists
wcau
pechora
iraj
sanding
jwala
prodder
nishizawa
unibody
ethnologists
bojangles
amants
sehr
everts
destructor
redmayne
nanako
overhear
vaillancourt
screensaver
khiri
bangsamoro
whisk
cadore
tmb
rectifiers
warrack
loughran
memoires
irresponsibility
pii
mlt
seldes
kilbirnie
hotta
mcchesney
chicha
ponton
strikethrough
schneier
marinha
grafen
artnews
lih
tickling
pict
apac
lenora
bhutia
gaim
severstal
oberg
spaak
ferraz
strangeways
deaton
letterhead
guattari
otsu
scarfe
druggist
kumaris
sputtering
syrinx
nagasawa
fromberg
wom
catamarans
stian
kuybyshev
robarts
crocodilian
hyattsville
arnor
infotech
amia
tuneful
instate
hansom
babatunde
steins
whitecross
campesinos
beos
sangro
strydom
culm
parsippany
pdx
insolent
yoshihisa
segarra
tomahawks
encinitas
victorio
mehndi
bioterrorism
cronyism
segar
cravens
deven
dishpan
burd
ackles
windowless
orlovsky
montagna
crossbreeding
qeshm
merrimac
kaveh
oon
slurred
gmg
blacksmithing
fyffe
liebling
lanky
sagen
beel
invitee
herve
unodc
koeman
mechanicals
ranulph
emedicine
chazz
antisemite
novotny
gassman
sparkles
sidorova
bik
mcaleer
cătălin
kádár
abolishes
adventureland
lorton
goering
colloids
giuseppina
masturbate
accelerometers
bhattacharjee
nuove
arabist
cpv
polycystic
degenerating
heightening
gravatt
tartus
harvin
bleek
steepness
peopled
anticholinergic
hamble
bartz
stopes
efs
mbm
voie
goyder
pamunkey
counterargument
simony
aksum
lentini
pixelated
eoka
jolyon
tabora
taxidermist
inscrutable
hessel
foxhound
freycinet
ballplayers
symbionts
rotman
cordy
panellists
twitching
nisga
buescher
hanako
kommissar
pinnipeds
mable
fount
ick
mcnish
fibrinogen
alghero
dfm
avonlea
essar
snmp
botulism
taga
phosphodiesterase
frustrates
dmso
elliston
alconbury
mittag
dethklok
marijan
baley
lucious
pelargonium
spiteri
globulin
bangers
shelagh
monclova
flexibly
pernambucano
capela
mangalam
obstetrical
deprives
optimizer
millsaps
angello
pintor
dpt
monuc
mehldau
blancas
flaminia
kanai
bollard
nobuhiro
trong
benioff
aee
arruda
plads
ings
sundials
boeuf
afan
gres
nottoway
médico
calcified
rfs
grabar
armonk
grappler
manfredini
swd
barcroft
djebel
landranger
windowed
jdm
absenteeism
compacta
poutchek
chugach
jokinen
zeeshan
plowden
stazione
innately
ncdc
hauptschule
,or
agius
myocardium
gars
bonomi
danco
dowler
geometers
unconscionable
bourton
gulistan
tweeddale
ackley
heckscher
nunnally
sohar
pinwheel
coldfusion
grubs
xiaoming
laface
acrylics
asolo
contraindications
summerlin
starchy
ledo
wjc
duz
cervi
portneuf
yukos
presa
fom
blest
balderdash
thicken
scènes
ving
mikveh
izzet
birkdale
rano
starches
peptidoglycan
lavi
aamer
digressions
phantasmagoria
attainments
havelange
damone
fant
jawan
bedlington
kakamega
hefti
jilly
rinchen
gluons
filiberto
fábrica
pum
stross
kohut
lulin
timepieces
kothi
calamagrostis
pdk
poer
accumbens
rahe
eastwest
kurniawan
frodsham
ottery
bangura
kilkelly
vash
jaurès
monteagle
moroz
sumiyoshi
methotrexate
bustin
hopscotch
pfs
preeminence
medicinally
dithering
smartcard
tarka
biopharmaceutical
hbr
vau
hellespont
sixths
lowbrow
hartsville
timepiece
cbsa
housemaster
dispossession
sarnath
smf
järvi
oriskany
heloise
foibles
spyglass
laterza
expellees
linoleic
balcon
wbf
myhre
daudet
oncologists
tenenbaum
kheda
speidel
vltava
mpo
kraut
gelato
conceptualizing
barrell
phosphorylase
frankness
recordist
désir
beachwood
cosi
temur
sedgefield
nannies
pulps
helianthus
delbrück
whr
akr
presstv
seaweeds
koolhaas
sahibzada
rozas
cryptically
gracey
singlish
genii
babul
shoguns
thirlwell
farmworkers
kirtley
nique
amazes
welkom
pontine
purina
ossetians
shanachie
melek
damask
unwatched
tolhurst
parsis
kokkola
sophistry
bunkhouse
yyy
rusalka
teesdale
mista
sabiha
pgr
retinol
akl
airwolf
usmani
hyphenate
parodic
trigon
kaisar
toxicological
krist
putintseva
ilene
trickett
gca
bohumil
mbk
arnheim
chappaqua
crary
colorblind
cmx
revoir
oakeshott
ahafo
honeyman
realignments
nestling
cerebrovascular
expressionistic
blotting
dizzying
agilis
sanja
curium
hazaras
kilinochchi
delineates
benzoic
duelist
bechstein
onishi
appling
flinging
luiza
talang
pasteurized
blatent
artest
quiapo
lonny
kyon
evangelizing
troyer
sarr
zeenat
edgemont
soirée
fratello
rion
odorless
doshi
citylink
braincase
impactful
pichler
oarsmen
stanstead
stryper
jawbreaker
mcquarrie
beeps
rusi
kik
conchords
otaki
glenister
dusters
hysterically
unsinkable
enthralling
telepresence
dannenberg
barral
dombrowski
geochronology
chaturthi
husqvarna
homespun
conaway
demian
mimas
netz
fabra
thorin
lahar
vikernes
kilotons
poisonings
cowden
winnsboro
dopey
overpressure
stagings
boned
fogle
nieuwsblad
talaat
guianan
downland
starck
postmark
vasculitis
dsd
odhiambo
dinan
hijazi
anemic
anzeiger
bbn
jiaotong
forelimb
webmd
seperated
cred
wpf
greenhills
drumm
exil
cardholders
overstate
securitization
maroubra
larkins
wcco
bch
suppressors
hollings
petteri
hansford
flg
menshevik
arafura
scena
bgs
aww
waldrop
baskakov
armi
parkhead
nickolas
thibaud
motored
magers
dalhart
daguerre
morlaix
teknik
peur
crossland
incontrovertible
pachinko
pmt
khenpo
cholula
pinzón
timi
southborough
proceso
gourock
topside
geto
mummification
melik
adham
asunder
listeria
halbert
tensioned
biermann
félicien
wooed
kalyanpur
vergeer
mirabella
imperialistic
belmar
analects
knotweed
lobsang
sexsmith
sargis
cadwalader
zad
snowmen
pombo
granulated
sprees
defray
multipoint
airdates
regrow
chanced
pcpro
huemer
spacers
tane
shakya
teme
dobbie
chargeable
yozgat
pantin
awaji
lgbti
panjabi
austereo
nullifying
craziest
môn
mouche
jetliners
bdf
profesor
churubusco
vasto
iae
brzezicki
disparagingly
joomla
spendthrift
serafini
pif
exhortations
chunnam
jeopardizing
poy
interferences
bentonite
coletti
arkadelphia
deodorant
montoneros
classico
lamon
kurtosis
trodden
vischer
flexner
zapruder
rato
baiji
criminalizing
lapid
scrapbooks
whorehouse
baguette
depute
rymer
borglum
procrastination
platonov
sellar
cassowary
pcworld
cantonments
aggro
slamdance
pasley
willimantic
palmira
swiftsure
giveaways
golkar
minette
compulsorily
uninfected
harryhausen
microeconomic
ush
mccomas
waypoint
cni
enabler
monsoonal
mexicanos
polycrystalline
spaceshipone
stk
clutha
imboden
cogito
motherless
yivo
dorrien
mwh
koman
catoctin
mulford
oldřich
wachusett
ninfa
bary
neurotrophic
bostonian
macquarrie
grasmere
triathlons
payot
verdens
jesmond
youre
gordan
galera
awesomeness
gabin
burgon
banjar
rahmani
vladimirovna
bedard
boogeyman
preity
inverurie
exoneration
dinklage
labyrinths
dls
finnerty
zari
basson
dfid
flapjack
dacron
hawksmoor
oyu
vasodilation
dvs
piva
fumbling
schrank
zimbabweans
eira
hause
celebrants
pab
ghazan
crossbill
hydrotherapy
wergeland
taurasi
gwb
maybury
randazzo
diabaté
paean
quadraphonic
masekela
siwa
adrianus
potentiation
médicis
morikawa
lestrade
lycosa
isk
espagnole
lexmark
shortline
ovations
romi
sohu
reek
vahan
runnings
hijikata
zetland
wymondham
millican
puncher
seasiders
systematized
meb
bunyip
kfi
vins
mostert
gfs
agis
bowral
syllabi
woolton
ranko
svetoslav
gigg
washbrook
deforested
ergot
zirconia
ashen
fantine
damodaran
wca
phonebook
toreros
tdrs
weyer
humanely
hemophilia
rechts
chrisman
leighlin
stompin
larwood
rediculous
tando
idealists
gagan
urbis
jcc
bahir
keepsake
neoprene
sahyadri
tiesto
gamekeeper
hormel
bhatta
heins
zapf
rir
revelers
plena
murkowski
roasters
coherency
urological
lancastrians
noontime
shekel
muma
chalked
agawam
ovale
rupturing
wickramasinghe
coggeshall
sku
tsipras
buttered
axs
flaminio
lieve
volcker
unvoiced
nyeri
bandi
oirats
keepin
linksys
baramulla
papillomavirus
gleiberman
selvam
prophesy
paez
indoctrinated
amitabha
agw
matrons
wickens
confocal
veux
oprandi
sule
cesspool
paka
zaïre
stakeout
slt
amitié
topsham
alfonzo
prosopis
serapis
boötes
hangal
escovedo
tilbrook
midtempo
terres
dongle
inbetweeners
lacerta
dromedary
gissing
corum
sukanya
langerhans
jatropha
rubrum
duf
madrassa
gallas
cauty
sukma
canvasses
ssx
bloggs
rothery
backyards
bonington
tawfiq
spader
foreshadows
filet
autostrada
reciprocates
refurbishments
yorck
lafont
faxes
uemura
goodwyn
foolproof
pallbearers
jellinek
smale
tshwane
calista
historien
queene
hibernating
pulo
kappel
coruscant
chimay
steerage
unprovable
marinetti
stompers
bening
cannan
zeev
hallucinogens
malting
provis
gregorius
blackhall
trezeguet
shermans
mccallister
nanyue
crusts
mre
wireline
castella
sandip
leptons
inexorably
mcfall
azizi
kamau
bamburgh
tembo
dankworth
miscalculation
headmen
cermak
stanger
sandlot
bengtson
desegregate
malá
desborough
pyrus
feedbacks
akeem
cynically
cdb
corroborates
schwabe
sabarimala
hls
nailers
phils
calbi
remapped
kompas
boadicea
dermatologists
dlf
insignias
guillemot
spens
lordly
bammer
spss
clémentine
ibd
skive
parmesan
mirkin
bacteriophages
shakespearian
mookie
ragland
plexiglas
mutare
wonju
wpm
colden
tunisians
torun
mahela
vasculature
abridgment
monette
wishy
waggon
looker
pks
badin
driveways
trepidation
mahabalipuram
grads
fatu
tipi
crossfit
manoeuvrability
righthanded
satanists
ansa
stanwell
imputation
mercyme
shearers
giese
hosono
lohman
ctw
hankinson
doba
byington
retroviral
briançon
wojtek
pru
lukens
parachutists
hymne
lenr
patinkin
xylene
dryland
chelo
nardo
taiba
morahan
sett
curbside
edita
anonyme
tody
wolpe
lvmh
mosfilm
hibbing
lema
berni
nakba
auctioning
wauwatosa
tempesta
ballinasloe
unseating
baraboo
fossett
andromache
prm
kwesi
nadiya
brasses
jiaji
reconnects
newpage
stoat
hyoid
rousset
chemotaxis
threshhold
savalas
unnerving
hohmann
tschudi
harpsichords
thermals
simonetti
beagles
steffy
hunterian
priaulx
reto
ravin
clee
abetted
graha
disley
prothero
adulation
gilling
cleanups
kring
haining
madhusudan
ehs
gatica
florea
stith
unraced
botrytis
demoed
levar
horrorcore
jarlsberg
granton
marginatus
asclepias
jibe
bastar
ineffable
kahana
kenzie
granges
mazama
klint
fredo
jaron
qand
ees
spect
evelyne
mahratta
delmore
hyrax
magmas
coggins
wraiths
lowman
impériale
ankur
cahoots
brl
lussier
lorillard
kva
melly
granda
notas
bisecting
stijl
wabasha
ruabon
saunier
melksham
murai
frimley
bayerischen
mckittrick
anadarko
justa
dorris
kashiwagi
hoeven
omon
brum
pitino
inveterate
everclear
boito
reapplied
norbu
pietri
bracco
bemoaned
brückner
dek
teltow
tella
kocaelispor
giuffre
hillerman
googie
marketwatch
preferment
silencers
ucas
underperforming
campsie
mellin
fnac
messinger
susman
clopton
energysolutions
fullerene
mobilising
shishi
tippin
borch
hornpipe
ponsana
nanci
carlene
nieve
cercopithecus
halverson
camargue
binning
sammamish
momin
abdalla
esmé
seatbelts
elita
choreographing
yeshua
nassim
cosmogony
spooked
cliched
saramago
redesigns
obviate
sexploitation
chetham
optometrists
chlorite
meshed
borchardt
ranke
faron
adio
salida
okoye
priestfield
bonville
donley
unpolished
wegman
hyer
sasser
honan
voyeurism
narathiwat
trib
conniff
vereeniging
victrix
maldive
gri
mukai
polonsky
beys
phaedrus
segregating
herpetological
athy
gujar
conversed
kaikan
earphones
pallidum
stelmach
thwarts
rumen
tyke
gambon
mattersburg
brakeman
mayfly
legionella
borde
xcel
vallenato
parsimony
setlists
commun
sorley
energi
resuscitated
cnidarians
profundis
ardeshir
otitis
maghrebi
chêne
bariatric
catbird
bris
madd
khote
perley
pldt
iphigénie
ubiquitously
haie
tenon
alicja
obliteration
nicotiana
timeshare
uam
barboza
armaan
mylne
ptb
chakravarty
bugti
dease
sophists
böll
rvr
belyayev
mtor
iberians
homesteaded
reister
liberdade
rhinoceroses
testudo
papworth
romm
mrd
olfaction
chalke
mcclane
beausejour
jibril
maicon
paperboy
stepford
gabbard
admonishing
amargosa
hohner
emmental
fic
hextall
mummers
atmel
iginla
misinterprets
wolpert
radioman
retroviruses
holidaymakers
decarlo
embu
bracebridge
octopussy
arends
townsman
vereen
bleus
inconsiderate
jamin
bosio
politiken
plass
oboists
bipasha
solanas
gephardt
sabbagh
monis
anther
beachcomber
ritson
maskelyne
inimical
activators
elrond
habibie
splashdown
anneliese
mccausland
meiotic
mtskheta
tursiops
weal
bournville
porco
hance
nighy
paradies
humanized
kaikoura
armentières
skimmers
trucked
guilbert
bretherton
mcsorley
nalin
balcombe
malindi
perles
sukkah
coulton
semaine
bianchini
retrovirus
ganapathy
gourley
tsl
ritualized
ductility
chacko
shackle
orcutt
whiskies
nitti
pumper
plaskett
nobbs
tarbert
esen
renda
reavis
heckled
harsin
ciba
shakespear
littell
ripoll
fushun
cockfighting
pengfei
cinquefoil
highbrow
scribbled
borsa
glob
siward
daei
lti
csic
lfs
tard
lubbers
spital
anahuac
giridih
schlick
admonishes
recordable
humala
chauncy
ruh
entrained
ffl
shawna
simorgh
waaf
botley
paraphilia
andolan
antagonized
spouts
tienen
cryptid
dungy
ciu
belittled
enes
nfu
swingle
tsering
piqua
hoadley
klugman
reevaluation
cdg
pining
saldaña
graveney
provan
veneno
capitata
tharu
dumbing
voormann
starrs
hellen
majesties
honam
throop
peeter
minya
requestor
mansoura
smoltz
silverwood
berardi
pershore
crts
tukwila
revitalised
rayford
rifkind
atopic
faustian
aussi
alessia
rocketdyne
banai
nypost
torgau
vacuole
margulis
freelanced
hea
asanas
satterthwaite
cornuta
whaddon
prisma
jolin
cienega
gavotte
varamin
flashforward
heysham
fedele
roon
iiib
vía
eiland
tansy
tiredness
roxborough
raynal
cuarón
ineligibility
cloe
osmanabad
cannula
riehl
sylmar
poppea
jefford
mceachern
inglourious
secessionists
sut
arunachalam
ianto
kosa
renhe
nwfp
otoe
burlap
jatiya
equanimity
serai
polyana
sulochana
novopolotsk
chirping
najafi
armadas
radiologic
parroting
armillary
telecasting
stashed
neapolis
daumier
reeled
gier
atresia
grs
menier
individuation
dighton
mcbrayer
robat
bewley
skolnick
valine
abominations
doumbia
rutte
ouroboros
nati
perineal
werle
irretrievably
tiao
fellner
mutha
subdomains
cycads
kisco
vaunted
kalish
marchal
pragmatist
transphobia
bif
recolored
pyeongchang
dependants
cómo
coriacea
seeping
porritt
lff
simmered
brideshead
trumper
fishponds
vernor
rossano
crocodylus
dimm
ohashi
tepic
waf
zeitlin
egoist
kosaka
madrasas
mckinlay
poel
tpf
urbanus
tycoons
stereographic
embeds
octa
delafield
inagaki
zapatistas
whakatane
newdigate
derren
unmik
borehamwood
chhaya
inflicts
sadik
shuttling
metropolia
rapaport
polyptych
kitaro
djibril
trialed
microbrewery
miseries
namu
echocardiography
terabytes
cochem
agudath
wanless
shymkent
rans
hyla
rol
sabella
riskier
piton
armisen
rajinder
mache
zittau
hornbills
tibialis
moluccan
hironobu
copsey
milb
logis
carthusians
transact
benjie
krasniqi
turnstone
jeepney
gulick
alighting
upwardly
shae
eitel
ifad
highfields
piedade
schuh
kaesong
sfar
jaak
ohta
olu
lenni
hanwha
sterol
assemblers
puede
dhara
noumea
wsf
herodes
wandel
nimba
stéfano
cooksey
naseer
razzle
immobilize
oja
casterman
lipp
habersham
snetterton
vancomycin
quadrophenia
dmr
hildegarde
egr
lefkowitz
scenography
waveguides
segesta
guignol
wouters
boricua
slat
cosmologies
microfiche
constructionism
shrouds
caleta
corporis
flecks
unwary
tulun
koy
melamed
taciturn
pawnbroker
beaters
thys
cosmopolitanism
cherkasov
speleological
lba
kahnawake
sandwiching
baxendale
drafters
vales
accordionists
adib
renta
ferociously
kozlova
aval
kodachrome
berrios
kelsen
dilys
reabsorbed
murmurs
damion
cakobau
unaids
theos
suffragans
leguizamo
savery
giolitti
latymer
tarsier
bienne
neumünster
fezzan
brookmeyer
lnc
fourche
kingbird
woodcarver
teleporter
replicators
foresees
crieff
queensboro
dde
arlon
ddm
guilfoyle
polan
peete
aua
umwa
paten
thumbelina
persis
turd
counteracting
musil
patrón
selinsgrove
loaning
bhumi
legislations
bertuzzi
dalma
mize
kosar
pitviper
tijuca
rashida
quinones
macnee
capybara
descriptively
cornbread
doonesbury
oumar
watchmakers
medhurst
nsi
neorealism
eua
ervine
affectation
xiaoyu
fenland
betawi
ambleside
linacre
cazorla
leibovitz
ingot
icknield
erlend
feltrinelli
deservedly
shoshana
inglés
updraft
seefeld
gabaldon
demy
lavine
bellavista
bidi
kundan
flouting
naman
ferch
taht
nordmann
polyclinic
backslash
goatee
blacklock
kyong
alaskans
sary
transcendentalism
ayumu
bacilli
sugiarto
cultivable
motoko
kitna
lampedusa
suchlike
plectrum
blanka
ipek
valmont
eladio
repos
hobsbawm
beno
singtel
misappropriated
chungking
ankole
dahi
ducked
ormandy
spanoulis
chale
naturopathy
footnoting
dalmau
bachrach
lifeform
fording
santiniketan
whacking
muggle
sexualities
biomed
rejoiced
yukimura
varkey
basayev
najafabad
porat
shehzad
antiquus
rehydration
monikers
photosphere
robbe
bravos
nyong
arbitary
underpasses
fah
poulet
playin
tysons
bruhn
gauci
bpg
freewebs
yellowing
trotman
sordo
chronometers
otaru
telecine
bayram
coevolution
lamentable
herbaria
catalpa
zohra
kindler
reinert
baxley
sockeye
evened
nhn
helme
suerte
dân
jaycee
wilayat
andina
ceremoniously
baio
bennion
larrea
guillou
ambrosini
schippers
carvey
neenah
divertissement
ebr
isomerization
morwell
neuengamme
pinecrest
abidine
dunsmuir
mcelderry
bhagavathy
azarov
hollingshead
squatted
rivard
kalmyks
yukihiro
mcgarrigle
marien
verilog
bhandarkar
bramante
storck
daylesford
salmons
bailie
milwaukie
chabrier
jaffar
esmonde
tillinghast
discusion
deutch
sarees
czarist
marcell
crosslinking
penedès
viorel
stromboli
mikaela
scalise
birlinn
muenster
veres
theyworkforyou
yepes
bembridge
kenedy
smd
anchorages
heidemann
donatella
lippo
fatwas
pedraza
mihaela
prokom
gago
toting
strato
wasserstein
christmastime
tomba
killzone
odawa
cartersville
parented
kharitonov
pellucida
tomboyish
cang
chorionic
goodchild
mld
playability
culturing
iselin
abend
garforth
orn
dovey
tzedek
pained
wittering
enderlein
infinities
slaver
jetliner
goggins
indurain
metaphoric
hooda
tranquilizer
stumpy
nihat
azzopardi
oxycodone
housewares
grecia
yali
krystle
pvv
restrictor
hrvoje
sarina
coire
seimei
ozaukee
unalaska
overbridge
aviano
flanigan
sawamura
universum
sandton
denard
sideburns
fanciers
manavgat
aav
sedin
inuvik
kinnick
matsue
moke
cutolo
irp
warthogs
badulla
charybdis
standouts
kusum
moonbeam
microsite
danielsen
petridis
podolsky
cristián
blissfully
genies
jobber
cimon
fertilised
rsr
starwars
veidt
recuerdos
treecreeper
numismatists
birthed
ukc
jadida
tutsis
radome
hometowns
arosa
peadar
putatively
pilotage
vacaville
tradeoffs
survivals
lauding
opensuse
flesch
katamari
mclagan
unescorted
nomenclatural
cirillo
mikulski
burntisland
havard
cantú
boilermaker
rucka
pawnshop
neuroanatomy
rhian
minivans
cotillard
michaelsen
disincorporated
brot
hih
bironas
hapgood
clairmont
colleton
shawangunk
coloureds
verisign
barmouth
swett
burling
capsizing
poway
ricotta
sofi
wastage
scalzi
aparecida
wast
tuckerman
kulthum
schlatter
waltman
tetrad
haugland
trefusis
saturnalia
weathervane
frears
quarrelsome
hashtags
ibe
hexane
belshazzar
temptress
kodaira
arpeggio
dering
pinnata
armiger
zopp
butted
koussevitzky
headlam
underscoring
fearn
skardu
vieri
liaise
craw
norvell
hwaseong
fron
ainge
marketability
rlds
rosenstein
petco
magasin
baldev
joad
neoclassic
derr
solidaridad
earnestness
laurette
otherness
ireton
evangel
subducting
knickerbockers
mirsky
teymuraz
buarque
pepito
hacer
dlg
teennick
incorruptible
respirators
interlace
phaethon
trollhättan
raggett
moccasins
alpharetta
barfleur
urmston
mambazo
manichaean
feigenbaum
einstürzende
fluviatilis
vibrators
bessborough
hdfc
alaina
calpe
catherwood
counterbalanced
vaporize
worklist
hafen
yáñez
vizcaíno
embarrasses
levinas
maltings
sreedharan
kayah
socha
ystad
félicité
kaziranga
gladius
graven
piggyback
gaffes
populates
barraclough
ocoee
ruffini
hoffnung
anderen
chicot
monroy
financials
diyarbakir
waterpolo
cinemascore
pinelands
nsn
bogeyman
distemper
ohel
signup
reorganising
cantankerous
huskey
findhorn
outlast
unfazed
hypothesizes
transpiration
plodding
shull
arri
limeira
maxton
zimmern
larionov
blaenavon
apop
ungulate
rhythmical
jango
rusch
cardigans
gota
sherrington
whiteford
ruffians
marson
mahim
reclaims
ayad
crashers
koruna
archways
zumba
prostaglandins
demean
intercounty
authorizations
rippling
denikin
intimates
adelante
unsaid
vap
sixfields
brushwork
rumination
makem
fumi
ctm
olefins
pbp
pinerolo
lascaris
pawling
agosta
maite
timotheus
veikko
adjourn
tunny
instrumentality
yamani
geomorphological
wrightsville
comayagua
matane
panofsky
joiners
interferometric
gazala
havers
narcissist
eich
cytotoxicity
invierno
catabolism
giunta
lantos
balinsky
bijan
alexanderplatz
notational
newspeak
lethem
temne
qanat
etchmiadzin
cozens
trnc
pineau
rendall
brambilla
reinfeldt
essam
namo
eeoc
shada
abramov
hany
phiri
matinée
nixie
bomar
tazz
casiraghi
cozi
proteinase
decolonisation
ritt
aloisi
jhapa
windle
coman
finalization
phosgene
geneve
fabrik
hillard
seceding
sunoco
hahaha
vika
strega
wolfsbane
nanao
kishor
morkel
decommission
behead
kushida
unica
dissections
aspa
credito
raimon
cupar
euronews
shila
reconstructs
rottweil
minhas
nkt
teimuraz
commodification
oberhof
alcaide
armfield
lectin
wgp
socs
tinashe
melk
stacker
adducts
guesting
unamuno
langone
sekou
valby
hamner
enriches
dbc
agger
legalese
weisse
perfumery
chumbawamba
lukasz
seidman
horiuchi
monoplanes
biosecurity
cabbie
kabataan
tranz
lipkin
wicketless
micrometre
transparencies
britannicus
zoller
ypf
catkins
asea
engstrom
meringue
utopias
seshadri
longton
vibrates
cadwallader
bierstadt
extensibility
extraversion
chaperones
mortenson
bournonville
behe
mailings
selon
schlock
guarnieri
stouffer
kakar
pitstop
codenames
tty
reem
sheh
hoedown
furr
simpleton
armstead
ronn
deckhouse
majeure
gossiping
yamane
autodidact
tarry
powerlifter
reining
dhaliwal
tattnall
maytag
whisked
flyovers
perspex
technocratic
classica
medscape
vilification
charisse
kyzyl
battiato
towle
rexroth
abeille
falta
reutimann
rajani
azaleas
keady
loathsome
scrubbers
sekondi
fenster
tannehill
phoning
melora
nubians
sansthan
bernardini
rtu
meretz
nisou
sagittarii
startle
farish
mesta
denon
letterbox
apostolo
hipaa
zhuk
maclagan
matthiessen
cheah
marosi
xchange
liniers
penetrative
honourably
ramsbury
shavings
dickin
sisler
sabe
bharadwaj
estévez
necrotizing
aigner
beag
sputum
klinghoffer
boutin
senhor
yobe
kjeld
candidiasis
canzoni
escalona
sigmar
dangle
amgen
amasis
oland
lipka
américaine
armendáriz
cuca
schuck
ebd
bdnf
curtea
plumaged
laeta
daur
najm
pluribus
reorder
washrooms
wescott
mismanaged
bridled
smithereens
mackellar
siraha
waddesdon
noncompliance
kitchenware
sibel
kuskokwim
milivoje
europop
acknowledgments
disassemble
stigler
caecilia
harumi
fawley
somare
romanova
tyreese
parotid
rsf
drafter
vrai
nylander
dewas
craftspeople
stephanopoulos
blemishes
goscinny
singspiel
daxter
atan
wwor
burson
wragg
hitt
healthful
smirke
musei
vaquero
canidae
bouteflika
farfisa
tcas
greyson
yuasa
greenest
trunking
huambo
mft
godina
keatley
tnc
chekov
multicolor
mandara
overnights
dimond
landmasses
ernani
ccu
opcode
basslines
geocaching
thale
fundus
yinchuan
diatom
salicylic
tefillin
insufficent
electroconvulsive
mischaracterization
trouncing
stepanakert
ndlovu
palaeontological
grimmett
maramureş
imploring
hamblen
rham
radwan
cleator
ramli
guardrail
lucrecia
singstar
zamorano
kanawa
wami
blancos
castellammare
ufology
jws
bernasconi
helichrysum
billard
fernwood
conegliano
pileggi
tpl
nig
sopoaga
salvos
westerlies
schoolnet
roopa
frogmore
sloss
handmaid
equalities
caye
gourmand
cohoes
verticillata
igs
commensal
hks
herstal
garbarek
hellyer
borchers
faintest
nabataean
groupon
dolerite
sithole
ognjen
suribachi
unsupportable
nakhodka
elda
steelmaking
chittaranjan
wildness
pfitzner
choroid
tétouan
compensator
bernstorff
afore
dowdeswell
overdraft
vacances
pullin
catiline
freebie
marmots
azeem
vanu
dolman
russie
takanori
relapsing
wowow
expedia
seismologist
erl
korolyov
underpins
dually
kralj
gaita
hitching
morphologies
poppet
kidsgrove
enrollees
certosa
goochland
shoulda
inadvisable
randstad
althorp
corben
walder
waker
saj
unl
kamila
epe
salusbury
gretton
afe
maillet
reckitt
mandola
stickleback
medalla
fadil
asse
lozenges
baul
blackthorne
yellowfin
japonicum
korb
ahalya
hezb
glaringly
herries
sobek
nismo
pleasance
burress
catechetical
gilardino
vcrs
sfn
maqsood
royster
kabbalist
discernable
gandolfini
inmaculada
obscurely
mobbed
rusby
borlänge
oxidizes
moïse
gruenwald
comoro
impatiens
neonates
tambora
glassboro
gulati
signorelli
gsh
genis
lucarelli
santin
thither
usair
pharaonic
corris
toke
vcc
cloven
ballester
photocopied
ebury
tarzana
hijacks
fearlessness
schemata
underwrite
elza
phew
biggleswade
wallaroo
corruptions
breakwaters
rahimi
shenoy
konietzko
westlaw
conveyors
jss
underfloor
sceaux
bashful
byars
prefixing
beddington
undercroft
opteron
amada
ximenes
karori
sgx
groat
teb
corbis
vpi
annus
dross
touché
payphone
appin
louboutin
ferrie
shekar
shaler
spineless
impressionable
okano
otten
madagascan
remiss
mehsud
boliviana
teco
odda
handicapping
electromagnets
icefall
enol
leme
sarfraz
shatt
löw
ircam
uia
mcgeorge
hory
slaney
legault
kilbey
toymaker
langlade
spectres
stroessner
meon
toraja
bucyrus
clydach
dys
weh
zein
toastmasters
lynam
beauport
ordem
jinju
luckman
shohei
nitze
sirk
berenberg
flossie
gondi
endocrinologist
woodring
choro
muhammadiyah
chanukah
quotable
splayed
lafrance
täby
mff
paty
mpf
athletica
shoveler
birtles
doolin
leishmaniasis
guitare
blitzer
saraiva
qayyum
andress
denarius
pacman
dazzled
balkar
sanpaolo
arens
ldc
darge
quatrefoil
parigi
ejercito
moxley
retakes
finery
nhi
modibo
stockland
buechner
rigi
rba
kitazawa
whittlesea
ancher
tedium
shorbagy
mondragon
nambour
baldacci
karbi
flamed
jarmo
demoralizing
iasi
milson
finca
pegging
skidoo
guliyev
accoutrements
amoled
wends
danai
shm
metallo
denki
ufm
saigal
jiminy
nup
paal
vasundhara
abelson
talkshow
caregiving
juglans
ollivier
arcgis
unb
mickaël
gyo
coleshill
campese
lakenheath
mercians
apopka
leatherback
dosanjh
westervelt
brauchitsch
harried
sweetland
ebanks
privet
rova
kiir
integrand
expositor
marzipan
louverture
engadine
pashas
strugglers
mugi
kloten
doga
benfield
styron
ajayi
srg
jeppesen
winxp
tewodros
walmer
masti
tinbergen
cooter
leishman
microkernel
diadema
aska
mezzogiorno
boka
erodes
mullion
walkable
draftee
aaaaa
civilizing
entropic
impac
zombified
reproached
anastomosis
boycie
channelling
entreprise
shiri
throughs
ruxton
lochiel
rotonda
pugachev
rugova
pilling
montemayor
tindale
dagestani
nobodies
barford
gassen
tafoya
ifas
susu
androgenic
llandeilo
hazelnuts
nazareno
penicuik
eckhard
seletar
contraltos
nefesh
lucci
brasher
hadžić
elefant
molinos
cartmel
jeph
crofting
industri
cannavaro
resnik
biogenic
nkomo
stabled
ordaining
beaudry
tarkington
stabile
pembrey
isenberg
wurst
chaban
forlán
trebuchet
turko
backflip
aaltonen
advisement
cuna
boheme
umarov
orquestra
zale
demuth
inputting
ethicists
beausoleil
adrenalin
smilodon
cira
mccurry
chiseled
dariush
examinees
keyshia
wauchope
albu
backplane
børge
zepeda
volkmann
coppinger
tord
danks
gadsby
algar
pagers
erythropoietin
hgh
bwr
barista
shashank
hillocks
aog
dreamcatcher
kasher
hfs
almaviva
biran
ganged
pietsch
erector
bertoni
buzzers
mtvu
blancs
sooke
intently
becks
kumbia
yokkaichi
interfraternity
aurigae
deliciously
mylene
kow
kavitha
humanitarianism
gesner
neuner
raas
clansmen
survivable
vandersloot
djamel
ldf
blackheart
misfire
marleen
hsdpa
retellings
mardle
justiciary
ilyin
atle
initiations
radiophonic
gins
sgh
mêlée
elka
inductively
getxo
sociobiology
µg
rems
gustavsson
jacketed
adornments
perun
caudex
pentonville
jailing
maks
cabe
desierto
munck
serrations
klos
iwamoto
okean
farrand
wolfhound
kundera
subtropics
latus
trd
hohenheim
gibbard
cushioned
scada
lovitz
yakutat
ngozi
karu
centralisation
rúa
aocs
antinous
caricaturists
warzone
carves
stortinget
sandf
kuyt
wem
zima
tamazight
unread
merlino
nagao
hospitalizations
mandelstam
sihanoukville
hilux
ruz
bekele
deakins
ligon
stapf
saqr
friesian
randhir
rdi
gilmartin
banality
weissmuller
brise
dyers
unravels
wainscoting
krzyzewski
masini
weblinks
zwart
vendettas
basanta
maebashi
opl
tristán
middleborough
basij
roundtree
retinoic
atli
miyata
marshalsea
koss
thylacine
racha
darkens
weiden
goy
beatlemania
alija
cakewalk
lacerations
borghi
garvan
longmeadow
anticlerical
whistleblowing
spurts
yakumo
steck
suburbanization
iznik
stabiliser
chandrakant
lagardère
analyser
mafeking
chorales
juul
yenching
codifying
cayey
dinis
maiduguri
djerba
feasibly
molest
sumida
deface
chalker
bolte
bramber
pensiero
revitalise
dzerzhinsk
maudlin
nichi
penkridge
perverts
secretaria
blakeman
jacquie
haack
shabbos
boldin
radcliff
jumblatt
odbc
carbonite
condensers
fq
moneda
renouf
hiva
disrespected
mandler
mesmer
shubha
wagging
shechem
prechter
overpriced
landholding
trottier
baseballs
magid
asrani
taoists
distros
mamun
naryn
clytemnestra
sabata
kaski
zawraa
jabber
excusing
clowning
momoyama
monographic
mismatches
dinefwr
thornburgh
chachi
tiresias
altham
juiz
techs
florrie
rga
pennyworth
sanju
palla
bizzy
augustyn
flustered
liuzzi
immelmann
sanilac
shindig
efx
impinge
malouf
panchami
mordant
gomi
zampa
vassily
rowen
fallows
influenzae
kamenica
grandy
mossberg
karthika
banneker
gcsi
mairead
timpanogos
annadurai
majed
torrie
teniers
diyas
mediolanum
gonabad
spiritualists
mumia
vattenfall
sien
henie
wedekind
alcester
disbelieve
gloating
luxuriant
bothe
slackers
drewe
hagfish
lefferts
okita
varenne
colorings
passione
deoband
seabee
northville
yamanote
groping
metamaterial
satirize
weyerhaeuser
ondes
sutch
sahi
ikaros
norco
quintilian
dolley
aubervilliers
riise
formwork
megson
armbrust
bakula
markley
dryburgh
theologica
kus
dwa
politifact
hpi
unfortunatly
anemonefish
huddinge
wheeldon
radiations
bolas
citic
orgies
blitzen
immobility
snapple
barraza
vintners
hiddink
eterno
feehan
mahir
incubates
varig
silhouetted
davar
communitarian
nub
miasma
aladin
tulin
yoho
aaja
kitimat
sculthorpe
lilburne
prati
nims
skat
kuchar
almqvist
viewings
gamete
uncropped
resinous
pincers
camellias
koha
khumalo
sidibé
gerontius
summerfest
acholi
trésor
thermionic
hongqiao
klausner
isg
oreal
gavrilo
swingarm
sealant
vardan
foulds
harrop
histocompatibility
utp
skorzeny
faller
pauw
turnouts
imber
speeded
hailes
bsm
holdridge
fallbrook
denominators
pittwater
trudi
hova
odf
willmar
pistone
simonides
rafique
helwan
condamine
recycles
oor
electrolux
somone
galadriel
ossified
germline
halili
guruji
kitchin
snowstorms
advantaged
tamoxifen
weariness
hamgyong
ashkenazim
maritz
nega
dieback
ziyi
jingwei
hoggard
aktiengesellschaft
fillets
grif
ncf
olyphant
mugello
jako
seamer
mocs
papilla
leclaire
comparability
schmuck
fomento
gza
biodynamic
vence
beso
departement
kinh
imrie
vibrancy
regalado
shizuku
blaydon
manzo
vaya
groupware
pelion
downsize
farías
terpsichore
sören
selenide
swedbank
compo
wadley
niedermayer
threshers
listenable
squeal
oras
alberic
altas
tyrian
zet
merapi
eberhart
vallon
animistic
torchlight
kibble
opals
emmis
mcaulay
rightness
corrib
nedd
bunche
roitman
salak
oldroyd
erkan
spotland
stokely
cohabiting
funston
whec
neukölln
ananth
dostum
ridiculousness
spurn
untruth
organizationally
reais
chm
rexford
mandl
methylphenidate
lunny
interconnectedness
lemhi
ftn
ruffalo
elwin
didymus
deezer
adra
plaquemine
lalanne
conditionals
antihistamines
coffees
yousif
ambato
monáe
fellers
clincher
particulary
frantisek
daschle
fenestration
nordstrand
saunas
yulaev
mandla
gesell
perineum
yuppie
ratzenberger
hkt
nataliya
kluger
turnabout
bussell
subservience
gandhiji
belaz
headshots
ezine
familiaris
nobuhiko
internationalisation
peccary
knokke
flatbread
eml
lankford
ocbc
sleater
maghrib
massino
ponsford
plaaf
voltigeurs
morose
lerche
grindelwald
hannam
eraserhead
chromite
remount
formulary
chiarelli
gehring
rolt
rutile
musha
unclosed
collines
brych
labile
hurdling
journée
polícia
enno
efate
laus
gies
spotlighted
tawau
hoylake
posible
barlas
swardt
tombigbee
aute
almon
tato
virk
stamen
brede
riyadi
carrboro
oceanfront
assed
ahtisaari
hamdani
dichotomous
albon
untv
lipan
erisa
collado
bratty
cimmerian
blagoveshchensk
greenmount
colosimo
jemaah
prospering
anticyclone
berkshires
ncsa
dessie
leitmotif
anholt
stradling
mclachlin
ballan
chehel
debo
skank
somma
slonimsky
binational
prest
chima
amite
parise
heera
bromeliads
infirmity
rayment
antechamber
judgemental
honeypot
sellafield
uscf
cerati
manicured
benvenuti
falafel
colluded
granata
harveys
igwe
sabonis
morenz
toolbars
riccio
hakkari
rashly
mandamus
luby
doldrums
lassus
trente
sahm
monfort
betrayer
valueless
loughs
thabit
continentals
itl
padrón
ebon
tdk
krab
vignes
habakkuk
byford
macphee
kiser
bellotti
salukis
perls
hillage
robbin
gwich
flocka
fale
enniscorthy
scaup
comminges
zeolite
furstenberg
tongji
margarets
esmail
mangin
brielle
snyman
akif
embrun
mickle
revelatory
medusae
jmp
brewhouse
ikari
henner
cerne
miharu
hersheypark
misadventure
oakhurst
iwao
kiddy
cfcs
kaf
holten
objets
jyrki
belang
oberman
hdnet
seidl
pigeonhole
tooke
habyarimana
infantil
unpledged
risd
srbija
inbuilt
penge
agronomists
wewak
ellora
zahle
fuster
squelch
chattels
santoshi
ciego
liposuction
mavi
friedrichs
kayaker
yello
mutuals
kyeong
noninvasive
trane
carolinians
beary
remotes
zhangzhou
universite
climaxed
zurita
eprint
venis
lightner
moberg
mohe
ashtar
selway
cerebrum
stoneleigh
bracho
berryville
benedicto
chicory
mcgarvey
limca
statist
oloron
sunstone
sabal
tsukamoto
sargsian
damrosch
planescape
stomps
volkan
agito
serkis
hargitay
uproot
jabloteh
crosser
larter
navratri
tsarskoye
fennec
concoct
photocopying
turhan
lolicon
elida
lingle
kaibab
mactan
fatter
fentress
quiller
degrasse
bld
macnaughton
armstrongs
pavese
zarif
planitia
aul
portrayer
seguro
appelbaum
doesburg
kuykendall
rayed
kitwe
trembles
calea
deflector
kempthorne
gainey
debutants
laubach
stellate
podcaster
vorenus
dargis
ertl
hirschberg
wurmser
coan
understaffed
stilton
sidorenko
viktoriya
tomokazu
juxtaposes
mediations
widerstand
sjöström
labrie
derailleur
karishma
gerónimo
pinedo
atto
generación
quilted
sugihara
fiorenzo
haire
valladares
gokul
oxidised
enlistments
populaires
torrejón
clasped
agapi
kera
yzerman
mith
chanteuse
nobre
tinfoil
kautsky
galston
coomaraswamy
adebayo
gnawing
cronica
ayaan
nuking
rohl
inkwell
biblically
decalogue
stocksbridge
jacki
iup
huddled
leftwing
iroc
drugging
veyron
cotillion
pathogenicity
ajami
rottweiler
endnotes
bafana
charsadda
fireboat
rømer
impales
lehto
hideouts
theatricality
ornella
afv
leaphorn
beppu
mediabase
arfa
woodcutter
julienne
englander
kjartan
kusa
pintado
jonatan
muttahida
milka
awakenings
jewison
gesang
civitella
ciccio
hulking
ernests
muzaffarabad
pascack
timecode
physiotherapists
yusa
shags
luyendyk
plainsboro
scherrer
spidey
grumbling
tagil
calverton
wintry
bebel
manteca
fathead
wyd
mony
bygones
satay
canela
beckton
vechten
casady
schistosomiasis
denigration
strathallan
joneses
bugles
capulet
shorting
nessie
benicio
nigella
bailouts
undercutting
ells
kalenjin
geva
paju
shawkat
sifu
casted
fatigues
stealthily
gunton
jutra
costata
progresso
manders
inan
panettiere
pallavicini
schoonmaker
correo
hardesty
lampreys
monkland
institutionally
siegler
caldeira
somthing
urogenital
ludus
funabashi
calley
multifunction
clothe
iad
hepatocytes
krung
eik
shorewood
scammell
hydrofluoric
sirs
avoir
occurance
chinle
uneasiness
buggery
zahavi
foodborne
normals
leasable
correlative
moondance
geddy
maksym
pekar
foursomes
diamandis
valmy
behrman
muons
lfl
chadwell
gallimore
sigmoid
estill
captivate
emanu
safeco
outdoorsman
stirrups
savelli
ichthyosaur
waxahachie
sithu
mutualistic
quique
hanky
comprehended
huntelaar
imperii
uvb
latouche
knighthawks
kinne
robidoux
hebburn
assheton
donoso
kake
sarris
kaeding
meck
giada
fortepiano
renter
smathers
clubland
caernarvonshire
deters
keepmoat
krait
fabless
thain
psychogenic
umkc
exhorting
resubmitting
ivers
galvan
traumatised
spinel
typifies
roney
kristinn
notarized
swabians
osl
macneice
rosslare
irem
météo
lavished
superhighway
zaremba
irredeemable
gorshkov
peruviana
newsted
definitly
spearfish
messa
repurposing
bluecoat
malpensa
coushatta
király
nflpa
panikkar
rabinovich
llnl
vectored
caetani
fdot
mwai
meghann
avast
quibbling
chanderpaul
glauber
parshall
barest
hideously
unbeknown
farge
canonsburg
mbira
enlargements
peavy
wortham
teleprinter
obo
mendham
dalli
aqap
dsw
dupes
burston
sunkist
draganja
erben
vincenti
shahram
jaunty
suitland
lamer
raygun
tyron
kiplinger
ecd
shod
filatov
ratnapura
douma
westerfield
clintons
purohit
takizawa
portimão
masseur
médoc
wadebridge
sherrard
othmar
dogfights
saurischia
ttf
tanta
peja
pharr
oria
lanvin
hotz
myasthenia
curable
wiest
barnesville
brigadoon
frise
bullfinch
sahay
submersion
guodong
sharat
leafless
ehrenreich
moghul
vsb
capos
stairwells
cppcc
escott
fraley
bostick
odets
replanting
balenciaga
botts
westerlund
floundering
constantijn
swanepoel
pof
aelia
nakedness
treetops
maqbool
downsview
poliovirus
margravine
komar
mandar
alums
scoot
ucayali
virden
copolymers
slapp
gevork
rameshwar
telefe
aban
metrically
hagrid
chapleau
overbroad
ochiai
effectors
cotterell
lifford
idealization
cuadra
folksinger
vanitas
beatson
yaquina
maybelle
rylance
mccaffery
harikrishna
hanworth
boucicault
itar
meynell
introducer
liquidators
marzio
cultivator
muzak
miedo
sclera
brownstein
shaiman
sunland
independentist
mogren
kiffin
cpf
contemptible
rockwall
disqualifications
yeadon
favouritism
volatiles
sensationalized
queenslander
kästner
cengiz
niyazov
bartman
hima
throb
stace
lewington
jürg
courser
harvestmen
mashups
nenê
oakmont
reinvested
pursuivant
intertribal
lotfi
jungfrau
turia
furniss
mulino
pluperfect
clairefontaine
languid
rcahms
disheveled
greylock
cookham
anderssen
rendus
truncatus
tcn
borlaug
crucifixes
aaronson
furuya
braised
meanest
belaya
demoralised
kieren
ulfa
foreseeing
valium
brookvale
mattis
reinstalling
chelation
buona
lale
nuthin
ungraded
doughboy
voinovich
highbridge
goodland
austerities
pemex
mkb
rescinding
mbps
sorana
schaaf
souq
colonoscopy
pagination
kydd
kisa
banes
toothpick
elastica
onofre
jcm
cruisin
englefield
lanai
lewitt
baedeker
gyalpo
cornyn
ncu
ridha
svs
pavane
frou
fairtax
readmission
adulterated
makki
terminological
stateline
martynov
sprott
woodridge
triglyceride
leuchars
barghouti
odlum
lockley
entailing
agh
corbijn
automatics
tajuddin
pavlovsk
doulton
neumeier
firstgroup
bildungsroman
gaudeamus
perchance
traore
lagunes
stempel
cordons
eluding
flings
dyad
ariston
ashurbanipal
südtirol
goldner
schicchi
overy
capitalising
aspatria
pettus
revo
patani
dunant
matthes
unheeded
metalworkers
thomastown
phages
wormald
omc
kreviazuk
villi
hanbali
lahoud
awnings
dukagjini
dbt
misbehave
brianne
basotho
panto
interneurons
romário
bolle
januarius
iker
alvi
bondar
loblaw
pasch
mercato
photogrammetry
prévert
golightly
jamel
dimona
azer
megachurch
alcamo
napkins
glenside
eterna
purifiers
micheál
aneurin
birnie
pasteurization
backspace
copulate
umpteen
nofollow
mornay
fugees
unwisely
janam
mcgough
huangdi
transliterate
acls
subtler
liri
maxon
avl
cobre
guimaras
daco
middelkoop
cordis
lateness
angier
messianism
roscrea
gayton
doerr
garçons
grimani
séguin
wilko
fruitland
comden
llorona
rame
pelikan
darnall
mutes
familiars
sherpas
trample
loredana
tiomkin
kott
paan
panavia
splints
gogebic
terminologies
kittitas
pedlar
deckard
samer
zfs
bottler
airbrush
personalizing
tenterfield
powerlessness
rickert
rossdale
compere
kaila
roncalli
thermite
manzanera
susquehannock
pendergrass
lwow
olot
govindan
hossam
margao
transsexualism
rosato
userid
sonority
arcus
instinctual
sidelight
cleats
swinney
destabilized
maidservant
galifianakis
modeller
bowne
dashboards
nerdcore
griscom
goldhagen
mcgrory
remonstrance
josselin
rydal
smelted
nbp
mohn
boudinot
vargo
thwaite
cycladic
hardwired
calibres
dongen
jinshan
indes
saraf
herniation
sarm
crosshairs
novia
matarazzo
paperboard
talysh
excelsis
roks
anandpur
franceschini
anadyr
skillz
iska
hadlee
ifill
brushless
saotome
lanata
kataeb
kudrow
motegi
echostar
rcf
kester
ohka
kinloss
sanjar
postulating
displayport
rhc
yechiel
kanada
boller
triadic
barnato
boob
magnavox
utca
routt
acetylcholinesterase
freethinkers
extrême
geos
fuso
tektronix
stis
outstripped
fibrils
hoechst
venturini
pulsation
stingaree
squeamish
quirós
jupiters
dfx
slither
dhr
gwendolen
mayflies
adulteration
minard
diaw
familiarized
clarrie
legende
doting
gtd
unr
gallinules
shied
michalak
thermae
sidemen
surma
sezer
hillhead
syms
elkington
terming
hebb
hever
bonnar
punning
komaki
agoraphobia
lithia
gottesman
icv
ramaswami
sower
marea
lansdown
prisca
gae
prahlad
hemolysis
omn
ethicist
turncoat
drouot
abrar
incredulity
valency
bagshaw
brambles
lais
schemer
kozan
msb
turbomeca
bucuresti
justyna
whosoever
folle
biggles
boorish
kulturkampf
wigston
pyramus
woodend
sedat
firoz
donghae
coolangatta
shrieking
popularizer
alderton
schweiger
ranjitsinhji
sjc
logistically
saram
switchgear
yasui
botcon
smap
salmi
hibiya
tuatara
mullane
curfews
soyer
namaste
preamp
gavras
christin
simmel
nemes
tumen
vergennes
railwaymen
kawahara
lnr
medel
noo
dubz
percieved
wint
arrogantly
ginuwine
glendower
gradations
nanoseconds
massacring
juels
beaudine
qadeer
wuhu
copolymer
corbucci
killingworth
pusa
adlington
intravascular
runoffs
bongaigaon
moodle
bloodstained
colbie
rootkit
elance
chillout
athénée
teles
kankan
fud
tmg
trichloride
misapprehension
michalski
spurr
aziza
balewa
micrantha
cashmore
multiuse
knitwear
marfan
finster
macdougal
unencumbered
gerritsen
renz
lorenza
succulents
greenport
tush
sarastro
ebla
kissel
mgo
francisci
sikeston
adie
lopburi
churchtown
mcf
beddoes
boulding
masaka
landreth
awg
fechner
resveratrol
insinuates
schönefeld
dolf
rhodesians
siphons
matériel
lemonde
clericalism
jonesville
walkerville
wilczek
fussed
soudan
koopa
shapers
hra
bhalla
fasted
colma
mattox
lecoq
giacomelli
encaustic
racewalker
humours
kfa
bote
mercersburg
ionisation
gitano
eln
utne
pencilled
alick
mdgs
piru
tanisha
endeavouring
vialli
tonsured
lightsource
tattersalls
wyden
awk
clouding
idée
castells
griffons
bienvenido
cootamundra
scoreboards
dishonestly
sokoloff
partir
aventuras
schweinsteiger
webserver
madara
witzel
wieman
ravinia
prided
violino
rcb
sugarman
phosphors
katwijk
naftan
derain
confirmatory
mispronounced
deshayes
corton
lebar
everingham
scaccia
odenkirk
poms
steinhoff
ronsard
beaverhead
anchieta
arteta
smallholder
terrae
trono
goldy
welders
thorfinn
fpgas
westall
warrender
milito
donaueschingen
dfe
gajendra
pring
atanasio
adebayor
georgen
belied
tendentiousness
dunsmore
kiloton
cabela
flan
joby
underoath
doen
pangborn
opto
mrx
busquets
injunctive
ledgers
bikeway
prophesies
extol
erj
bhagwati
dslrs
edinboro
lamberti
alis
croll
callejón
kila
iqra
earmarks
baruah
gwinn
wais
ariki
autoroutes
kerfuffle
tunja
hanifa
senato
cliveden
spindler
mellish
pelley
rudis
punchestown
succop
windswept
krio
dugas
sewickley
szasz
plumpton
mariska
riedl
brushwood
lemar
meizhou
herculean
twr
isaias
headdresses
runaround
recouped
homogenization
disinterred
visio
rushville
zagat
crista
pinscher
slm
lovebird
organo
vashti
ryker
chutzpah
yonatan
neopets
wern
wellingtons
brembo
patentability
lagu
walbrook
brookdale
targetting
hantavirus
mipt
tobymac
teagarden
komnene
reisz
sobering
erlandson
biskra
nanotech
guarino
untouchability
josué
uncontentious
zapu
koba
nunneries
scheiner
boettcher
logica
lookups
jass
rowbotham
ehret
discontinuance
inattention
busselton
bogut
paroxysmal
kary
fragmenting
swinford
harnell
seid
jünger
ayin
covadonga
malts
gebel
decelerate
theist
ipsec
baftas
brickley
mochrie
denazification
yungas
klansmen
kruk
kentwood
benzoyl
tinh
wiman
reiff
knickers
ashgrove
kleiber
depowered
hanscom
meerkats
ellroy
countercultural
miming
dhaulagiri
hasani
craves
takenaka
wbca
yueh
exhalation
lycett
atglen
milked
linnea
grana
mononucleosis
hirschhorn
clémence
wendland
amelioration
signac
farts
sargasso
reboots
pieta
bisbal
bundi
barbeque
beena
astonish
veronique
brás
vmc
messel
nguni
joerg
mondi
xaverian
fowls
nymphaea
thornberry
cuero
clavichord
lightman
cackle
potentiality
grigoryan
pnv
perthes
kasten
reselling
chupacabra
cardle
splm
egoyan
groene
rufin
coatesville
manzini
coppersmith
hargeisa
duta
yersinia
faceoff
vacca
topoisomerase
flir
oribe
spafford
serin
poniewozik
polyamory
cacapon
shotwell
tramlink
macdonagh
underling
maal
crise
dopo
shantanu
gordimer
alkylating
shanmugam
torpor
hasson
krenek
sacrosanct
balmy
bardeen
questa
buta
coppery
etcheverry
ticknor
aromatase
sedatives
sey
demba
gerrie
amagasaki
eyvind
allawi
aquatint
lorch
noriyuki
lostwithiel
fortissimo
helgason
grantor
untie
kilbourne
creo
chosin
cobby
vectis
ugs
ciskei
ohlsson
addon
bolivians
cockcroft
camil
eddi
geithner
salò
jesolo
exh
fonteyn
sasol
gmu
heytesbury
malabsorption
cisc
medicago
heriberto
jumpy
barres
corvids
farnell
smarty
lorenzetti
sadeq
lubavitcher
bovey
shimshon
shew
romande
tremain
matthijs
lassa
izvestia
reticulate
admonitions
safdar
aleksa
swirls
ipsilateral
indraprastha
zick
aromatics
rosalba
busty
chey
exorcise
harping
hinz
gaudi
peine
iyad
counterparty
steeples
maret
baughman
keli
bricusse
morogoro
wlm
havergal
taji
quantifies
nuthatches
nandy
fiefdoms
malvo
overexposure
ortolani
carm
ancelotti
vix
approvingly
shaadi
declamation
terkel
crkva
mrr
diarmaid
dancy
sead
boatyard
dgs
secularisation
iste
aiton
carnet
holmdel
radiographs
polizia
catalyzing
angèle
mabinogion
rainfalls
quintal
sûreté
swish
ferroelectric
adhemar
weeded
peveril
glioblastoma
jesi
rastislav
shuhei
anacortes
mashing
stoudamire
treacher
bente
dodie
laclede
palanquin
laundries
distillate
biggins
mbl
hudak
johannsen
salehi
thb
prothom
brontosaurus
luga
anl
nuclease
eldoret
quillen
mojica
earthling
risch
sylph
laberge
muth
clarinda
tidally
deification
smo
melnick
sterner
scoter
tatooine
korczak
jux
philipse
winchesters
cachoeira
caminho
tuchman
toadfish
bandopadhyay
garstang
ladytron
nabors
mollison
irrepressible
chenier
adw
gaîté
polychlorinated
theatricals
mannering
crematory
thyagaraja
paniagua
weenie
murdo
fenrir
dorner
glacialis
frothy
gentileschi
adblock
hifi
kiddies
rasen
legislating
rybinsk
saith
furnishes
essentialism
flds
vaas
kanti
taobao
aynsley
neuromancer
hearns
vibert
hassam
triggerfish
allain
tourmaline
anuj
donkin
tempio
zugdidi
barmen
pataudi
derm
dilettante
innapropriate
varberg
underperformed
brama
hamp
hedonic
sheikhupura
moly
fite
maîtres
ibra
tater
askia
twu
economica
trippy
terras
mutombo
yefremov
toohey
murtaugh
bargh
kobalt
glitters
gmm
pixley
bahrami
nujoma
fetterman
kirana
zahorchak
edythe
playgroup
holdup
enquiring
（
maryanne
edyta
eldad
mohmand
takhti
actaeon
reintegrated
wcg
chambermaid
klaatu
piniella
wetherill
wabi
gagosian
stuns
gml
trooping
broadley
batam
rationalise
waitemata
lingala
spunk
treadway
pignatelli
edington
wigton
waveland
hazan
mcdaid
dongdaemun
tabulating
witkowski
betcha
uden
duthie
forté
vpa
taiwo
vianna
ghamdi
hight
recombine
brogden
achi
tdr
bratsk
caumont
groundswell
noman
dispelling
furse
airventure
faircloth
rothschilds
moralist
dragoljub
bely
tarnowski
tristano
fuat
timmerman
teichmann
starchild
strickler
maimon
disperses
ecevit
rubini
civica
yandel
andar
ehsaan
hadrosaur
letta
hatice
raconteur
pounced
noemi
mulhern
zedek
courteney
outmatched
savants
muli
chickenpox
montagnard
jette
garver
armalite
manhua
sholem
counselled
grinham
webisode
tarin
siong
quon
sigerson
dá
arona
delvin
outflows
sandre
binz
bearkats
circleville
chb
weathermen
nitzsche
publix
eysseric
dassin
transpacific
rft
newtownabbey
codigo
musicum
erythromycin
ortona
kanika
genentech
alfredsson
reallocation
melnik
hibi
oomph
pavone
sff
apraxia
canley
serafino
aera
subsisted
fitts
foreplay
bethlem
fenson
reidel
gewandhaus
rocketeer
quaking
symbiont
decry
sitapur
jerking
consultancies
healthiest
gallovits
prurient
multicore
wassenaar
brecknockshire
reshammiya
kristol
sharona
clink
aminotransferase
azumi
ayana
teleki
lockup
balmaceda
socialising
redbacks
ayatollahs
cubicles
jujitsu
labadie
buchner
wann
sensorimotor
cricklade
volkskammer
naloxone
saikia
rowlett
scammers
karasev
cjc
laboriously
altea
romantica
exacerbates
recusing
bomp
elgon
brenan
exaggerates
stooks
amblin
exude
killifish
takestan
bearskin
khouri
tecate
bettini
hansraj
beguiling
ruanda
ajab
manang
waqt
sigurdsson
sextets
kalashnikova
balian
elephantine
túpac
neustrelitz
deadbeat
fariña
tose
freudenthal
bluestar
fining
dugard
slaf
montepulciano
dimarco
ioane
sanibel
syst
kazimir
htt
cannington
cereus
hbk
giardini
nemec
stewarton
patios
alburquerque
cookeville
herreshoff
darkchild
beniamino
séverine
liminal
gnutella
lassies
polychaetes
dharamshala
ungodly
trucial
andrius
semolina
heribert
clase
booneville
moans
kingsnake
neoconservatives
secondment
rezaei
boran
szekeres
hilla
blacklists
candu
yakama
abhijeet
aliquippa
sonnambula
bookish
gergiev
glia
mcgonagall
lincs
newsboy
patrizio
dunking
jordana
sinecure
rescheduling
playroom
inappropriateness
grantville
weert
wistar
megalith
jernigan
arnsberg
vare
cqc
shockwaves
deaver
baoji
jhajjar
refits
dodi
lotr
salling
ranasinghe
icq
milnor
pantages
reeducation
cheo
ibt
waif
deseo
vasantha
submersibles
mundelein
edutainment
chakravarti
ferragamo
bednarik
noth
deobandi
bivins
gelnhausen
ballooned
azucena
etl
snort
bonsall
estcourt
merson
wiens
hiroyoshi
entrenching
gossipy
yambol
bonetti
peláez
mckillop
hushovd
abashidze
montell
tigard
colomb
khanda
zubov
cabos
climatologist
okura
reculver
edgier
volcanos
kearsley
speedskating
constanze
cronquist
tasos
menhir
mazeppa
pasado
pulpits
counterrevolutionary
cofer
vili
lasagna
ctd
oppressor
puny
ymer
ambergris
belinsky
frutti
nabonidus
tybee
matriarchy
gyanendra
kents
basia
lengyel
tethers
ominously
meriam
kpk
liberación
shifu
wenden
sabarmati
maira
goans
mrap
janša
harbottle
atac
cloquet
tomkinson
samal
hanauer
wahpeton
komori
ratti
kays
intraspecific
verwoerd
idolizes
symbolics
subpopulation
settimana
palomares
scapegoats
biloba
kpn
rarefied
combate
adamstown
nuptials
hanazono
agana
meiner
altach
gildemeister
reductionist
taieri
marshawn
lycanthropy
erman
entreaties
hummus
kourtney
taipans
gimpel
lindi
monolayer
neutered
senso
rikard
tuguegarao
aacs
gaskets
classless
motorcoach
tardieu
dysfunctions
academica
yeng
canyonlands
wagered
adjudicating
barun
bitsy
trippers
apocalyptica
iwanami
tribalism
nickson
leskov
mcclary
provocatively
arasu
chauvinistic
orndorff
hollweg
obbligato
seagrave
caravel
dhimmi
iiic
derailments
sympathiser
lifelines
homonyms
dadaist
sherard
qari
sry
kufuor
diggins
salif
fantasyland
gymnosperms
stupidest
puzo
sapient
acqui
chisato
casters
everlast
sumatera
seasonality
pettiford
keppler
gertrudis
oxenford
cardington
pollstar
contoured
steidl
nibs
skewer
vannevar
drumbeat
thecla
evernden
bronko
asom
meridional
kirkyard
gowanus
dagny
eby
staro
deforming
cowbird
buckhurst
liken
inspiron
pendulums
ravn
virile
eady
transilvania
musette
annett
lhb
moench
chell
charenton
zainuddin
schütte
igc
ronkonkoma
kiku
duquette
frm
daladier
bernabé
petticoats
kahneman
longridge
jubilate
multiprocessing
depositor
tribesman
cavil
bfm
bettelheim
charleson
centeno
rears
trackless
defintion
hehir
solstices
suhail
doges
roundhead
corrode
metropol
pleats
florenz
phila
frisby
aitor
tanzimat
sini
strollers
moonwalk
shivraj
karns
gosden
bandeirante
tulisa
regi
bida
tamid
urmas
durell
rienzi
sols
turnitin
elation
montfaucon
superpowered
contusion
wallow
lysergic
panatta
matan
wampum
ffr
gabo
abebe
eutaw
figura
sunfire
yoshiharu
tabuk
richler
lundell
fluoxetine
crowninshield
eboli
holness
misbah
stas
caecilian
khal
vaudevillian
kro
pornstar
caltanissetta
modernistic
panta
higo
shellfire
anarkali
nunchaku
pacifying
bouffes
sokolniki
ombudsmen
fatherless
leipheimer
miscellanies
hanga
qasem
scharff
embo
gately
najdorf
korneev
giron
kerley
marquard
disengaging
mahalo
shorrock
nikoloz
climatological
ferlinghetti
croaker
reverent
carpool
uvm
unami
sisulu
recreationally
backstop
kube
psychomotor
tongariro
hawksworth
karadeniz
parashar
neches
salette
ferrigno
bridegrooms
fugard
foreleg
wahhabism
solidarité
kingly
denture
dissidence
redoing
corba
bungled
paquita
ainsi
napp
senora
frivolity
ikhwan
nadira
polyphenols
ruscha
takraw
salander
turris
cynics
walia
euromoney
chipewyan
khushab
raka
faker
bessa
integrally
kittiwake
capercaillie
entendres
desplat
philadelphus
lindfors
wctu
brownson
smithies
ravenel
pikas
radikal
garageband
entrada
pogba
poitevin
heartthrob
pentangle
marni
pora
jiujiang
blg
emusic
bradycardia
jano
snowballs
mounsey
qaem
ferdi
maned
kansan
flasher
truancy
gizzard
guiness
neutralised
shiller
rosell
predispose
marunouchi
microcephaly
barkat
eclair
rozhdestvensky
breeden
betfair
varnished
visigoth
brouhaha
ssangyong
padam
trowel
embossing
chisels
clune
snickers
kimba
marwood
phare
patellar
spanier
dayr
inferential
hennings
himiko
quade
bahari
hardison
unsolvable
petrushka
mapinduzi
nationalize
bocanegra
understrength
lovinescu
mattioli
hibberd
barada
curci
manichaeism
alimentary
sajan
tyrese
interweaving
xul
detrick
lancets
gaitskell
irascible
chinua
pnn
anabasis
hanka
frugality
monsey
darton
louvered
sachar
lambrecht
mouthpieces
bloodgood
khirbat
abela
moishe
lumb
tetsuro
messmer
cobwebs
rzeczpospolita
spagna
cathodes
moyà
athar
bookshelves
snuffy
grauer
musikhochschule
timp
beckerman
lexicographical
handspring
virunga
charnley
oceanographers
tewa
visscher
paden
nxp
satria
vilain
mälaren
ranchero
wolman
abracadabra
schill
changping
riccardi
westjet
gtl
assia
blanford
esha
wirksworth
contextualized
fostoria
banville
kangana
anaesthetics
fairbrother
jazze
vandellas
zorzi
riyad
muhamad
yermak
papoose
ayew
elana
egovernment
sistani
othon
vermicelli
demerit
nodaway
reinterpret
niggaz
medeski
begrudgingly
evicting
pelz
outwood
liquidating
arnell
apatite
clumping
reflexively
unrevealed
kub
hardouin
lule
histrionic
zy
dabul
governador
sarpy
acquit
bruny
horlick
croup
ossi
wordnet
lazily
imslp
performa
nonna
obliterating
condensates
vertu
switchable
crackling
autoimmunity
phlogiston
tuamotu
hippel
recombined
umag
cdot
tressel
gleaning
moorfields
englert
loincloth
télécom
rws
balloch
raabe
fyrom
syngman
reines
dnevnik
wavre
culshaw
lumezzane
epf
metformin
debonair
converses
nkosi
manion
loewenstein
untraceable
kouros
lansford
razr
omnisports
rothchild
borchert
otar
quickfire
vsi
beutler
kaposi
banas
toshiya
timea
franchot
chaleur
impulsivity
barbossa
decompositions
liem
antrobus
subhuman
arley
redback
rumney
dongguk
overreacted
kammerer
novalis
golder
leadbetter
hanratty
tetroxide
ablest
seedless
burkitt
geiss
melin
cripples
montevallo
trudel
whew
hawksley
schliemann
disown
buenas
gompa
alava
shekhawat
sarva
relavent
tatu
mariella
lemonheads
peltz
outdone
gjergj
melichar
aphis
pone
tokiwa
clasping
woollahra
fullerenes
chaliapin
retrenchment
gimhae
dellinger
nanping
silico
faysal
phuong
ricciardi
bamenda
mikes
albufeira
germani
resupplied
megalosaurus
phenomenally
cladistics
crosthwaite
disaffiliated
suid
torin
galang
strouse
granulation
ielts
flattens
lawrance
nicoletta
laleh
bonis
motueka
nwp
dila
tevis
merrett
maximilians
gour
scop
tidende
chamberlayne
rockfield
kaylee
tanizaki
tweedsmuir
baldo
ndt
taurine
seminarian
fioravanti
ombra
klatt
hungate
kups
ferus
youssou
eidetic
prearranged
wachter
beneficent
youl
axemen
trempealeau
isas
jeremih
rwc
kbl
ventus
kormoran
latinate
actium
jacked
marichal
nexon
quitte
wrenches
agustina
faecal
mannheimer
dipietro
reelin
imbue
mcgreevey
taksin
soekarno
softwares
rohtas
oficina
musburger
horden
vlaicu
galicians
finepix
camembert
hanlin
césaire
kartika
guion
stigwood
belas
hatten
cronos
excommunicate
nicklin
kristjan
ncsu
ebsen
yesh
dati
cesc
edificio
premolar
lampooning
inexorable
borris
roaches
mallah
jorden
polygamist
posthumus
feminization
boudreaux
mcroberts
lulworth
ooi
snooty
enshrine
kiera
snippy
brazing
transgendered
saget
ojsc
aermacchi
fuming
preformed
kameyama
sturge
barnette
jauch
damocles
bhl
nonpoint
massages
yaki
jujube
whiptail
martlets
tpr
bdu
jaci
fann
schützen
stroman
mailly
eskridge
sachi
ramping
blunkett
peachey
briefest
stelling
toque
plaut
jato
jantar
wafd
capriles
grifter
bondo
peraza
ibsa
nandigram
speedup
leros
newey
saidi
familar
demarcate
petacchi
sifted
thanasis
interpolating
cranbury
melnikova
soubise
hirofumi
kilts
rainsford
keifer
walney
screamers
harlin
roshni
insectoid
favelas
stubb
fairways
swaffham
behl
orlandi
aleksandrs
parses
pterodactyl
anjum
fdf
mastin
stevo
petőfi
blueshirts
gds
hev
pocketing
multiplatform
kollywood
sterilize
fenerbahce
netta
rackard
mercadante
hochschild
testaverde
sexing
froom
penry
scapes
gipps
brines
weatherhead
lugard
hilldale
pkm
speicher
macisaac
breakdancing
binda
headwear
cristero
duchenne
mandali
caffe
athanase
humfrey
nipa
scrubby
taif
milman
albumen
isparta
ngong
barnardo
piron
morini
khanom
kozo
overactive
omarion
barbey
leeroy
inui
triremes
libertyville
depressant
ballplayer
rannoch
yeboah
ferny
excreta
kosugi
fujio
entrench
deniliquin
andamanese
masaccio
stowers
korine
cpj
meaghan
fede
edney
colonnaded
qatif
newlin
overclocking
fundació
merri
mesilla
costed
obviousness
rumpus
advocaat
accrual
abrahamson
kaal
klahn
luana
litigious
yuncheng
pergolesi
bryans
faxon
lwb
weaved
bagua
pimping
mdx
heliotrope
karnad
tego
historico
bordes
vehemence
yagan
muggleton
sparkasse
macleans
flockhart
durnford
mossel
mcmurry
silverio
goldstar
fatso
pantyhose
pearling
sameness
refocusing
potito
urinals
jebediah
nomo
shachtman
requena
uniondale
rhamnus
touts
latchford
fleurus
vasan
tricolore
dibba
extroverted
chincha
wnyw
gtb
unraveled
camerons
colonias
rci
arimathea
samburu
hottie
hitches
darlan
prg
dunkel
sanpete
wickenburg
macfarland
klay
eulogio
piglets
wigginton
nason
adentro
sandgren
liselotte
rainbands
heartburn
swerved
fogelberg
aureliano
motorboats
neoproterozoic
warbling
shawshank
saviano
abrogate
parikh
pallo
vakil
huskisson
caillou
jemaine
mushy
molk
lifton
dittrich
coonan
malbec
renamo
alesia
pentagons
guadarrama
curacao
laptev
lbt
ekkehard
stukeley
communicat
liberates
pirs
verón
aristarchus
mathes
giggling
gascogne
herbig
elephas
overage
owego
sreten
proteome
leat
hennigan
pradip
sere
anastas
hedgerow
doers
nicolls
millner
tyers
lusitano
alphas
gaskill
telefon
penduline
slicks
suiting
allez
oldie
noces
greenbacks
milward
doritos
physiol
pizzi
iqs
onder
bronzy
tawang
reassign
jrc
addiscombe
amortized
kodo
lamin
ariadna
gipson
feldberg
aseptic
raut
deepavali
witley
lostock
roxboro
bantams
vampyre
shera
steine
dnepropetrovsk
freshers
empson
meloy
kokusai
msas
rosebuds
vass
mumba
portugese
pettitte
vvt
douche
unaffordable
darabont
swaroop
colonsay
buemi
hankins
coitus
kof
hillery
turrell
apprised
rahu
owatonna
reliving
guillard
gwin
cuc
tingley
arutz
oakfield
exploitable
bove
inelegant
diba
bomberg
nagase
bergey
dookie
tented
haystacks
mahapatra
boleros
kostova
harrassed
whitehawk
implanting
mastoid
internationalized
holopainen
lemont
kno
athina
aminoacyl
marcano
bluebells
ginóbili
volturno
bentworth
divadlo
hieratic
defenceless
hasanov
barbero
sakes
cuddle
lamjung
rishta
baze
yimou
foxwoods
mki
borscht
cableway
homeostatic
tantalizing
blacklight
smm
crudup
lobdell
multisport
neveu
covell
paracel
leached
rollicking
postgate
radstock
frandsen
gordini
arkadiusz
hudsons
alibis
hyon
peluso
hansberry
oingo
samak
stealers
précis
moçambique
superchargers
footbridges
cose
buendia
oleo
pigalle
naniwa
pled
mainspring
niekro
rancor
tribble
electioneering
bures
pohjola
meknès
breadalbane
hanse
virility
hootenanny
rebutting
deferens
rubia
perpetuum
prying
uhlmann
warmblood
tof
fpm
mollet
ceylan
berens
ginastera
musikverein
jovanovich
banya
cuse
sigi
jagjit
oleic
apulian
issara
mullaitivu
proserpina
monnow
tutuila
trabant
oxalic
cuboid
rema
moba
asselin
vintner
chernov
winky
bowhead
kowsar
moneypenny
skyteam
elysées
cisplatin
manchin
culross
sapien
planus
nasrullah
linkletter
febrile
scornful
epik
falkenburg
krisztina
ibar
mayawati
pahor
teide
demining
baserunner
ruge
gath
bogo
deliciousness
burkes
filomena
akhdar
coolmore
ania
dualshock
worthen
castanets
pilibhit
zabul
igi
galeano
helmi
honfleur
exorcisms
spazio
solicitations
scu
norland
neuroses
dredger
quadriga
swakopmund
bta
bhawani
matrimonio
manzikert
cosponsored
giacchino
halse
cnh
royally
buzzed
morozova
akha
quaife
schukin
tolle
extractions
blackfish
lapels
crossan
stans
phon
onu
lollipops
sikri
danilov
anglicization
anamika
jayalalitha
middlefield
chandrashekhar
reticence
divinorum
eylau
parrotfish
asche
sabang
kaskade
lefranc
huit
dmm
awi
noncontroversial
colibri
waldstein
thongs
rhinoplasty
vinciguerra
woodfield
fibronectin
rehashed
shenouda
asal
freiburger
magness
gsg
perini
jtag
alleyways
brunell
koke
euphemistic
streetlights
tirupathi
ontogeny
ludicrously
chinmoy
lathlain
emoticon
camarena
corfield
grandniece
sacheverell
lakoff
elul
transcranial
terwilliger
doofenshmirtz
nasarawa
eskil
matisyahu
savonlinna
cronyn
headspace
immunologists
hootie
nonchalant
gouged
lehtonen
roble
spoor
bobb
heskey
irak
recitalist
qinhuangdao
janaka
schwedt
sordi
kamba
pulkovo
gopalpur
musters
huangshan
dodwell
adaptors
pompton
lachmann
mushi
eventuated
gamow
letterer
interstices
eosinophils
geoid
weale
minkus
girlz
chanelle
pitchforks
raimond
allens
eckford
godber
chongming
ezer
schecter
muhamed
burrill
ducharme
vliegen
cheraw
bordighera
rubino
blasi
whippet
lakeman
bonino
ingels
atque
plantin
canti
microglia
misanthropic
kazakov
milon
giteau
bhaduri
homeboy
medius
tarring
naveed
tift
slammer
emanuelle
leeuwenhoek
cardholder
gmb
coppens
tsereteli
mahood
apn
livescience
goblets
plutonic
alleghenies
nazarian
levitating
complementation
babai
circulars
potier
brandreth
burnand
kurama
rathfarnham
vedra
samothrace
magomed
barne
orbiters
maneater
deathless
szombathelyi
cottam
suci
sinton
casca
legitimation
pseudoephedrine
carboni
ungern
montalbano
beller
baysox
eucalypts
paulownia
medfield
jitterbug
conjunctiva
unwinding
grosmont
iger
watsons
litigator
egotism
rjr
vsa
neigh
kvapil
rodes
bislett
linie
iterating
limewire
lostprophets
forshaw
aridity
pappenheim
mbti
gloriosa
lobkowicz
roskosmos
hallucinogen
popjustice
laverty
pesch
letov
baccara
haugh
villeurbanne
plasmon
conboy
gallina
cascaded
fertig
garfinkel
candied
internist
jamz
stellan
snowboards
dooku
geon
subah
tillakaratne
hien
yallop
lambertville
lappin
porro
tibbetts
teed
negron
avante
maspero
obrist
mitani
samen
fyn
tapirs
darkling
orlova
bete
terma
arnstein
northerner
johannesen
servings
pegler
ild
aaya
anodized
mortician
spck
wsp
rigney
kwaku
sircar
longstaff
pcg
colebrooke
snobby
lidocaine
becquerel
lexicographic
foxtail
ekström
saugerties
dyskinesia
oligarchic
górecki
popery
ktvu
agadez
fuerzas
erath
weinman
putera
quero
biologics
trillian
nansemond
corne
tobermory
crockford
ossipee
sharron
flaunting
thinktank
phlegm
miniaturization
weserstadion
joonas
verba
molehill
antiope
tánaiste
sdhc
ilr
kero
nyasa
prioritised
garlick
remotest
gjon
marriageable
rumyantsev
codewords
glaswegian
perna
sibilants
tsavo
yaddo
dolgellau
rabih
borj
caciques
barranco
jaishankar
servetus
lysistrata
malfoy
expressen
endgames
samra
pigtails
longerons
kallen
malaika
cema
bakht
anik
rockett
orderings
scullion
winking
siân
fishin
interlink
nitrox
spongiform
triskelion
alresford
quinoa
mothe
religionists
croissant
warrens
tetraploid
redcoats
crawshay
inductions
bise
clang
zevi
toasts
lianne
laserjet
nutritionists
baddeck
legitimise
ductus
minka
varicose
fabiani
tekin
codas
hellmann
impactor
douches
freckled
soichiro
rosenberger
sekar
boettger
heptathlete
sfp
hallock
mosely
perilously
pincushion
recency
princip
nishino
reevaluated
miserly
bankroll
hierarchs
atef
gewehr
schober
harmonise
emes
krispy
afterburning
schulich
hartzell
mums
festering
punctuate
burgeon
mikal
icebreaking
pacifier
centra
mixmag
grishin
islamiyah
froman
kamiński
bottlers
overburdened
palomo
lvo
synthesise
chromed
saina
minera
penmanship
markland
protoplanetary
comorbid
yastrzemski
bawang
sheiks
rale
hardenne
glues
cammy
grethe
libation
interchangeability
lunettes
kovac
willemse
lpfp
pendolino
rajpal
gondal
lifeforce
menos
reroute
smethurst
heuss
votre
freakshow
viso
rutherfurd
messaged
sarsaparilla
catahoula
caz
meanie
ampat
machan
ripens
nakasone
zhoushan
heu
libeskind
fengshen
picos
masint
boubacar
erat
keong
superintendence
affirmatively
harth
toiletries
kedleston
lyford
warfighting
callanan
cida
lothringen
enterbrain
yaris
binford
redtube
tronic
stowmarket
pucca
miyajima
rossby
pfm
tartaglia
ruo
toulousain
sipa
poconos
salla
hydrants
counteracted
mutch
multiflora
clutterbuck
appetizers
endear
nonparametric
henriksson
ibrd
stymie
bezerra
geb
overstepping
flatwoods
ashwood
fassi
kirat
kuli
funès
shrove
crony
transcoding
vomited
tujia
standen
lente
auvers
foles
freenet
sigue
appraisers
mohiuddin
geen
reuses
veneers
shugden
bundesrepublik
tahoma
rukia
brillant
spambot
ndu
lifeblood
guaraldi
sociability
accredit
gortat
flinch
luss
tbr
nakao
headon
eveleigh
waseem
carillo
sallis
recollected
batistuta
louvin
jmc
vanunu
berga
hammons
lettermen
insolence
pervak
locational
gowran
kaleb
borst
mccaul
olpc
obwalden
praed
woolfolk
cosentino
ranade
archdale
sketchbooks
translocated
monoculture
hisao
uft
mused
koski
mido
loukas
shuvalov
koja
rolly
westeros
griffioen
ritwik
devolving
thompkins
anjar
parente
tinta
spk
devastates
acworth
castelló
saltcoats
svyatoslav
wellsboro
bejarano
yngve
unplug
luu
kaha
kunta
adlon
jannat
mavs
colinas
kinji
hewitts
crockery
bootloader
comforter
grimme
palpita
horney
shekels
transpositions
vizzini
impermanence
médard
yoshimasa
pangs
dunks
goonies
tuol
tagawa
abcnews
modlin
laurenti
vnukovo
sinti
atsumi
dilley
greenways
stradlin
vinge
talabani
amateurism
meatwad
fazekas
hoby
borgias
shinnecock
flatfish
tuas
astore
shorey
flowchart
khyentse
pondweed
wimpole
dieterle
jambo
debenham
whitlow
aymar
puneet
tenis
drumhead
bouygues
elford
harrap
uhura
unabomber
vermes
opencl
kalayaan
hoodwinked
signposts
boroughbridge
raba
migliore
ironpigs
opticians
ospedale
ashrae
corrida
makah
masc
feted
vallier
mcwilliam
paktia
copulatory
socialise
khu
zvonko
remscheid
botox
twiztid
localizes
fuer
antineoplastic
molting
uncool
valeriano
corporatist
mangione
grabby
anselmi
swiping
arditi
paramedical
marcelinho
glossing
pearland
southlake
mononoke
songstress
obrador
miccosukee
widjaja
liebherr
sunnybrook
dorham
seel
arrernte
inviolable
calwell
jiaxing
faubus
shap
telic
higbee
parroquia
situs
riken
recursos
monserrate
starlin
aila
stapledon
mogador
snags
blessington
andrejs
pé
burgo
bullfight
furze
evren
optik
hesitantly
warrego
bekaa
serifs
cuccinelli
grimly
pieterse
sholay
rubaiyat
timbering
longlist
jassy
matia
scrawled
preproduction
blithely
papadakis
gorki
scram
taggers
firebug
adora
scavo
axially
racialist
cundy
lovro
carlaw
aleksi
tormentor
letelier
rumania
nkb
minghella
nowotny
niantic
meusburger
kananaskis
voelker
maltose
rencontre
neuroblastoma
hissar
helpfulness
mattison
barthélémy
benard
creases
disillusion
laburnum
ebv
orenstein
natanz
topspin
sadra
croxton
eschweiler
bassoonist
serialism
stomped
cante
mlcs
pellerin
stockard
caret
ringwald
jowell
arete
recco
sandeman
ohg
lauridsen
airside
animalistic
ghadar
harlington
boga
weeklong
bowstring
holtby
sinise
dobb
usada
caroll
rael
bellboy
agitprop
weifang
krater
discourteous
chillán
nokomis
yasna
ribonuclease
mahdia
showbusiness
katey
frenchie
milenio
fluffernutter
pictet
frederiksborg
iolo
psychokinesis
hanina
vdl
rantzau
ferghana
priština
varnum
resonable
cortisone
kember
webcasts
bellay
leonesa
vitra
unwrapped
tafel
atala
convicting
arpaio
esculenta
zeynep
policyholders
comings
zdenek
carducci
moncrief
mudslinging
hadlow
cemil
esiason
gorrie
wreaks
conceptualised
neven
calixto
reiteration
wildhearts
tenterden
titu
greenhorn
mostel
zamudio
turonian
tno
ahc
stephanos
nacer
flyable
castillon
kamer
amphitheatres
deleo
sholto
eulenspiegel
mouthing
grüner
aquae
rosco
gaja
thins
ottilie
mdd
reflexivity
peacocke
lamarque
frosting
ullswater
disinfectants
ector
dendrochronology
apollonian
monoecious
fidalgo
lakeport
rhetoricians
avtomobilist
deconsecrated
everyones
cepa
availible
dimetrodon
poza
mercers
dixons
brenden
copperheads
zeca
airlink
chaika
spielmann
kyriakos
mottoes
crerar
tirupur
versfeld
herpetologists
stormtrooper
henriquez
bbd
alexandrescu
raluca
kumaratunga
unproblematic
technet
stuckist
eggen
yemenis
strudwick
lindfield
markgraf
weirder
nagaraj
dethrone
hanh
moncur
fausta
bloomers
qutub
verrier
collor
greeves
elem
aryabhata
guybrush
localizing
laufer
tickled
larder
qamishli
sartain
mortes
panky
thornburg
evgeniy
regin
sabbatini
wds
aristotelis
poulson
inconveniences
metu
tellingly
wank
juergen
sider
marielle
ryun
nesterov
prinzessin
berlitz
casso
kakutani
rossy
felicitated
foel
wellens
dornan
brasov
aftra
tugela
allemande
pectin
slickers
intrudes
clif
abbate
gentili
stillbirth
caccini
nof
bleasdale
huehuetenango
calverley
beheld
svaneti
gillon
pulsations
mastodons
raval
jeannot
gfk
tucano
raila
mudvayne
chisago
braehead
joensen
rahsaan
fazer
comicon
gagik
comodo
hdt
subclavian
boursin
estabrook
charnel
backscatter
dinucleotide
pdvsa
poolside
llantrisant
dpc
bondarchuk
orest
prievidza
ibañez
incirlik
chiclayo
buttonquail
maun
antimalarial
lavell
untruths
soja
ronen
lith
krapp
bnet
spielman
tishomingo
phou
overshadowing
ecoboost
irredeemably
officialdom
gynecologists
ozon
embargoes
dwp
nagra
lamo
musar
laughably
nags
yegorov
detuned
dessalines
dumbfounded
eneco
tantum
fruitvale
recut
diviner
unaccountable
arnaiz
igorevich
panchali
campa
razon
fard
sanctis
botch
jaggard
rady
harefield
joggers
laskaris
popsicle
sokolsky
savini
finalise
barbiturate
cobbe
palanga
frons
wysocki
morrowind
timlin
royan
madhavrao
niceties
laub
fedotov
asymptote
debney
fredi
stijn
loureiro
brujo
morna
alcor
panadura
resoundingly
veggietales
vlatko
lychee
rushdi
mayur
ritesh
karri
gazpacho
vajiravudh
tijd
kuda
ashta
servilia
kcet
fpp
hyperglycemia
amra
suey
pancasila
adjoin
rowlandson
hodes
iskenderun
rewilding
sfg
blumenau
epoque
ftd
pragmatically
wetted
kelvins
shhh
intercalary
abin
abdomens
sarria
kach
petronilla
carrère
torg
wytheville
soccernet
stromness
sadomasochism
remasters
missourians
stefansson
rdm
washakie
dhanmondi
decisis
governo
krak
prynne
bexleyheath
gymnasia
gouraud
troost
micheaux
bridleway
benyon
inexpensively
medicus
bwi
batz
ifan
maric
winterland
recognizably
caernarvon
nerc
thiemo
reexamination
churchwardens
hiran
resettling
brightening
rashleigh
jeremie
beilstein
griesbach
landman
samaná
marmorata
wavertree
laurinaitis
longyan
changin
alzira
bakhramov
hogar
personable
mccowan
hackworth
pepita
husum
natureserve
rnd
donnas
buhler
fasces
hastening
pechersk
goldwasser
marcellino
oligomers
hammonton
foxcroft
gesu
kosovska
wuc
valledupar
huhne
insurrectionary
guyane
heche
eum
carotenoid
liya
tetouan
anf
uren
malamud
chiese
drome
lilting
dursley
medgar
demoting
demoiselles
baqi
jencks
maddening
longleat
tarahumara
hartenstein
ardsley
dedicatee
buckaroos
nien
republicana
headedness
frédérique
sisco
bofill
heterodyne
casamance
rabari
unravelling
fertilizing
biharis
gobel
satirically
ashrams
zwick
sadowski
tommasi
sprengel
menomonee
marshmallows
sheil
osteen
truthiness
karasu
dispassionately
psycholinguistics
buro
staterooms
prefatory
avakian
baddest
perpetrate
sias
fearne
gluteal
proclivity
hecho
doone
gurr
cowart
sawhney
lunas
combermere
punctuality
moras
muttalib
zwicky
wombwell
akamai
drood
architectonic
oesophagus
ahmanson
midvale
retitling
zerbe
uki
philbrick
dahlonega
gujral
sorell
detracting
machiko
brach
awu
neira
impound
opd
wellstone
internalize
parched
dobbyn
eupatorium
heintz
continente
kidston
festina
balogun
ferrata
ilaria
rednecks
flagships
philandering
monona
cardoza
wellfleet
bjelica
baillieu
estas
valores
cosford
laviolette
klima
delp
yolks
inverts
dinny
undeserving
melodically
hasso
sesia
wessely
smitha
trelawney
davidian
xilinx
aycock
mwa
yupanqui
iden
ganna
boggles
malala
ande
alfio
fage
mundine
excretory
factfinder
kamei
aleichem
achill
hudspeth
vikrant
laxity
delgadillo
jacobsson
rtgs
shergill
mccalla
paume
conker
swoboda
brs
epitope
shiawassee
accredits
urologist
foldable
horsetail
genna
pollan
wadis
bipin
audran
derricks
takami
khj
winnowing
ellin
faruqi
startlingly
moslems
unforced
psychometrics
adq
bielski
ploughman
campobello
oikawa
gunfighters
mowatt
auburndale
bozizé
kilmaurs
glaciology
dilorenzo
coppermine
acceding
ruiter
jacquelyn
tunisie
inflaming
hagiographies
livestream
caseload
xcode
aniello
keratosis
rhomboid
linge
sadhus
marrufo
aesa
fellas
graveside
caul
contrivance
backhoe
apodaca
zelman
rossel
villaverde
crato
kanab
holdfast
wrightwood
chev
humain
knittel
grinspoon
pfk
baram
sprach
chk
systèmes
nihilo
mitochondrion
interlacing
tinned
relf
egp
biofilms
trypanosoma
forbearance
clovers
endometrium
lyda
skan
braver
upping
transporte
multilayered
apollos
howse
bnd
gede
megatons
aurélien
yanked
pullout
pnb
gemina
habitants
unconsolidated
huebner
quebecois
caudate
parlament
fahm
schwalm
stann
kaffir
thaman
seafield
homero
dissents
talyllyn
hackathon
ovadia
willcocks
engrave
wahine
remounted
gaber
rossland
sluts
idrisi
sympathizing
abdellah
quandt
chews
lacour
rainhill
bagdasarian
maritimus
redshirting
romolo
kurram
foxworthy
shipka
yoshie
negligently
arctos
cgd
radice
condado
burrage
knick
swooping
eyepatch
pamphili
aspin
afif
jannie
bramham
hiraki
fatou
stipa
imperfecta
godrich
vasilevsky
clastic
chd
centring
manara
konaté
lovey
flashbulb
energizing
onstad
prang
jandek
charvet
mantri
maskell
optimizes
landgraf
disdained
sohan
disorienting
greenspace
leeman
wyborcza
bluesman
jaromir
achard
schama
arbors
karni
devaraj
shahada
mittens
comv
battleaxe
npv
kalem
swatara
malou
discontents
fredrickson
stepsons
visualised
gluteus
cedaw
coyoacán
legarda
tramline
elbegdorj
sappy
zerubbabel
longfield
petrarchan
manliness
castors
jingzhou
readjustment
harborne
woolpack
hovhaness
vedran
collegeville
hiya
frings
gasparini
fotheringham
freude
dmf
crossmaglen
rainie
plessey
shirer
dummett
thika
neisseria
battens
rehearing
ingolf
zelig
fds
buttle
pyongan
burrowes
prin
churchville
aleman
sjs
trae
battlegroup
humanos
goretti
lussac
countervailing
gaw
sweeten
wetherell
baronesses
ader
lenski
dusit
tavis
ngf
schüler
balint
breau
fifer
catchpole
bullfrogs
disharmony
optimistically
shonda
kaler
seck
nelligan
tenafly
ascott
amble
severa
assignee
kater
leyen
tadd
sorte
hoodlums
cibao
macken
indictable
impingement
humberstone
blondin
sushant
mcgruder
asas
reversibly
picasa
monopoles
carme
envied
worldcom
aqeel
snooky
vlasta
lilja
afflict
lipari
cramming
berean
keay
serotonergic
psychopharmacology
keim
morphos
ngata
yamakawa
copping
dharan
transcultural
throttled
sloughs
eberswalde
keightley
pandolfo
unrestored
lnk
roeg
fragrans
massine
kluivert
hesitating
segev
dehra
musiq
gilkey
antihistamine
hetzel
tambour
jonsin
pennsauken
celi
solen
disrespecting
heirlooms
kufra
laureus
camelford
feijenoord
fredman
nikah
vlastimil
submariner
toop
crackdowns
toman
kipke
bansko
karanth
fuhrer
vishisht
accreted
servite
womans
gehlen
woodvale
deists
edlund
khaleda
vigne
sirota
manjrekar
latching
tomson
henniker
longhouses
retinoblastoma
airfoils
paychecks
pinelli
brewpub
agroforestry
brahmos
amelio
solin
biga
brust
riled
saputo
atrioventricular
recedes
kafue
tuma
heidenreich
kody
kennerley
volans
hds
dermatologic
schembri
trousdale
probiotics
transnet
tsien
hamman
chieftainship
webos
fillon
tjm
irrationally
balducci
uniate
bushings
cantref
mandrill
abida
catley
elata
cahan
alcoyano
hearers
seaways
nordby
peleg
celentano
tjader
bogert
harav
gandia
meloni
sobol
flaccid
vda
lattre
enum
aivazovsky
bto
picardo
fivethirtyeight
oglio
olivos
upregulation
windlass
overreaching
narc
drummoyne
disconnects
briarwood
porthcawl
apj
kada
columban
tevita
keohane
kurchatov
intensifier
gambles
rebelo
attis
consensuses
xiaomi
overclocked
sepa
yarnell
wilms
phichit
asmat
pouilly
hailstorm
kastles
dtu
yukie
pupfish
carpentras
dickman
lynley
broadfield
hurghada
stableford
moustached
nayan
jovanovic
macky
egnatia
wickman
entravision
willibrord
lto
subramanyam
nipped
vinu
boughs
verga
zicheng
preceeding
caramanica
lemelson
ebo
égalité
contrôle
kstp
sauli
rationalised
knockdowns
rastriya
craxi
crouched
boyds
namah
connexions
hyuga
steeds
atreyu
betwixt
santacruz
lankester
writting
louden
clutters
jacknife
coraline
magomedov
edzard
uncheck
cusa
archetypical
marquet
ictr
mixe
chatterbox
ruslana
nozomu
crissy
opencast
obd
pretension
mohana
dongan
coif
accumulators
saxes
enero
futur
babblers
chappel
paran
monorails
chesnut
rivett
lannan
manship
narasimhan
interleaving
blin
longstocking
número
fiord
brooking
acrostic
tawney
ashmole
herath
laffer
katsushika
uku
cureton
danis
rivalling
mazo
verum
beechworth
bekasi
gallants
kalamarias
leeks
foden
portici
renfield
boudewijn
swellings
whithorn
eridanus
powerlifters
playgirl
axing
dumond
tramadol
czernowitz
lamongan
bogdanovic
holliston
spofforth
cfls
gpcr
bijoy
spiraled
toral
mayon
ferreri
dail
kilmister
memorializing
braudel
gunsight
florists
lilas
tamiami
amazigh
hirono
slpp
varicella
normalisation
routemaster
sangat
irmatov
reminisced
suara
tomoyo
gazza
gembloux
bettany
schutte
virtualbox
feliu
humint
mouthwash
incorporeal
retried
ofra
kintore
rvm
dolo
repulsing
kotov
forsey
tareq
gnassingbé
alcon
dhow
jayewardene
gasnier
anodes
instrumentally
aspern
prohibitionist
wattenberg
whelk
fairuz
indiscreet
calusa
taipan
excimer
runyan
lasorda
buttler
hdv
volya
scow
nuristan
cribbs
drubbing
iisc
jarnac
birkeland
ybp
resa
briant
bodh
whewell
louvres
kneecap
kalama
kodesh
tras
wereld
floundered
poorhouse
mahr
rainford
bilodeau
peterburg
totton
puebloan
fifield
auras
vernadsky
puncturing
retraced
yuxiang
petrolia
roloff
ncse
rangefinders
rêves
chondrites
weer
calvillo
juelz
thompsons
automobili
tamia
mantles
viney
detections
altagracia
kaloyan
mediterraneo
roding
industrielle
mhk
cleantech
enormity
arnoldi
craigieburn
naseby
strafe
kuper
escoffier
vlt
eeprom
huggett
vocs
inheritances
nahuel
checkerspot
antenatal
peppard
draughtsmen
wakabayashi
anglaise
terminalis
tonle
driss
baserunners
ingeniously
hustling
ikf
rls
dukat
psychosexual
genio
graffin
ngu
ferron
parola
unchangeable
fuhr
journ
typologies
jubilo
aspca
homoeopathic
pupin
unspectacular
grunewald
colangelo
kafir
attell
neurotoxicity
simonov
balram
pneuma
simonton
baptizing
meiklejohn
tarpley
shakopee
damayanti
ottone
stiefel
pieri
marryat
santaolalla
ikegami
eul
sirikit
chadron
maces
damar
eiichi
vaticana
burrus
noisily
smolin
garretson
riverboats
stomper
rashomon
brigitta
hessians
osteomyelitis
floodlight
ganon
coad
oscilloscopes
dogme
puffball
degerfors
purr
salin
emax
phukan
bayar
barreiros
biondo
speakership
gráinne
efrain
gurgel
sharq
zha
collectif
subcortical
rockcliffe
errington
festspiele
wasi
bpf
multiplexer
handhelds
ulysse
inukai
laertes
vinita
kca
trajkovski
tunceli
milady
capehart
cohasset
windscreens
abstruse
hiscock
incapacitation
ifield
tresham
cheerios
belitung
superclass
sadeghi
feeley
ammanford
hames
clete
gwydion
schwantz
drooling
gigantism
zoonotic
guillemin
jaysh
branston
ferreyra
fuselages
takamine
mcmartin
cerny
ruhuna
kooky
lahaye
bechdel
duffel
diorite
shipwrights
irk
isai
biogeographical
enbridge
glistening
twill
incompatibilities
nclb
shireen
torturers
galleri
nyanga
manby
hebner
vaidyanathan
interbedded
germiston
liuzhou
reshma
koresh
knepper
blackcap
carisbrook
criminologists
thia
diminutives
undiano
wichmann
ottorino
oropesa
saket
danaher
waymarked
phagwara
schleyer
vibratory
hindrances
mxr
jarrad
hachi
bovell
medleys
clownfish
lya
ubayd
mustin
tarento
catastrophically
ducale
bagheera
ripoff
tykes
gance
maecenas
bionics
tedford
karmal
lactating
nhlpa
tobar
relapses
morong
maalouf
freemantle
mhic
ryotaro
neckline
voorst
moorcroft
sluggers
eldora
cathodic
castine
ramgoolam
popplewell
pyrene
birdies
aldama
massad
fabrizi
sayadaw
wpvi
plac
malakoff
tinkle
igy
barelvi
deplore
goodes
weygand
botsford
microbiome
roques
canham
wefaq
upnp
linné
plt
nemtsov
transocean
zulueta
doxorubicin
metonic
fleuve
syrups
revises
yayasan
terrazas
carlinhos
ardor
parral
effervescent
entführung
ayah
harrass
woodchuck
daqing
beaubien
holies
baluchi
idee
ordeals
seah
havemeyer
haast
myall
zarate
vith
coleco
hhv
eircom
cenerentola
slsa
reflagged
tarpaulin
ouanna
vitiligo
northshore
snarling
trappe
shelve
redick
bracks
mcmillian
markell
vuuren
ornl
mth
castiglioni
sveinn
baston
villagra
hcn
coolpix
winograd
roofless
wachtmeister
preppy
basrah
alar
montilla
larraín
kanae
chapa
jezero
kahar
depositary
felsted
wrenn
lindsley
fors
djk
oranienburg
sii
bequeath
panin
sarong
michaeli
ranney
rappelling
interject
parsimonious
csonka
subtitling
mager
flaking
questlove
enon
demjanjuk
lefkada
shaik
lightwave
fawning
unsavoury
walkden
bulnes
chieftaincy
norby
batas
pitjantjatjara
repechages
villaraigosa
rouses
drabble
sarazen
gpb
tongass
voskresensk
kadish
tankersley
wozzeck
lemans
techie
sipowicz
crofters
moussaoui
nondestructive
loveridge
franklinton
mitigates
zonda
hominins
hallberg
tölz
kashin
fridolin
trekked
deacetylase
licked
bisphenol
webinars
braasch
muttering
boag
gnomish
subrahmanyam
veep
typesetter
painesville
shandling
eurasians
engström
celal
khola
kieft
kleinschmidt
khaliq
petrillo
perkiomen
meulen
toti
ditzy
gullickson
leaderboards
bayrak
abductors
satcom
kabale
refilling
manns
saeko
bahl
roundhay
turbot
demilitarization
azeglio
ewbank
littles
ralli
meirelles
activations
sanches
llanes
cotopaxi
sekiguchi
caddies
ohn
poignancy
boruch
oddest
conveyancing
kasahara
rapido
fomenko
lacordaire
trai
pernell
trolle
norell
vavilov
swaine
scalping
haun
bookable
nungesser
strongpoint
kempsey
dyan
erewash
friso
nadeshiko
skee
sterns
luckey
nack
mixteca
grega
sarl
gisbergen
delvecchio
sonnenfeld
franti
rinconada
balart
mishna
rubescens
raphe
sakti
zonder
alltel
massi
breaded
conshohocken
unreality
blytheville
estrin
rhuddlan
lepsius
carbs
marples
iud
lautenberg
kundu
pirès
rege
smilin
nitya
beaman
primorac
nobuko
fenny
butterley
banquo
rosine
anfa
timmer
roter
smallbones
hoti
armijo
delerue
delphinium
bassetlaw
harmison
bonfim
oudtshoorn
collocation
fretting
aase
calipari
sakakibara
aerolíneas
absolutes
catz
essenes
yerushalmi
winick
philharmoniker
plouffe
humani
asphodel
toiling
garibay
duplicitous
pravin
concessionary
sennar
molony
harmar
rax
naira
panam
cpac
gratified
villepin
jugendstil
affray
blasko
ahp
encomium
charades
walch
shampoos
coachbuilder
jospin
fabrique
fests
philadelphians
seguridad
colyer
drancy
osco
cuadrado
oculomotor
atenas
respawn
spacesuits
zampieri
boras
houde
penso
cepheus
neoconservatism
tafari
cinematograph
mwm
idomeneo
addled
cruachan
petruchio
cognizance
kriti
unheated
succesful
eyeing
avca
arnos
harting
ionize
somdev
putrid
rummenigge
rearwards
feild
zazen
horley
laminates
neutropenia
striding
salzburger
foord
olazábal
piscator
yodeling
glendinning
machon
miletić
gojira
ceva
bushmeat
ppo
villaret
breakpoint
kalk
ifaf
vasudev
sodus
bothy
ferryboat
patrolmen
romanovich
introversion
felts
duvernay
redeems
obaid
mollo
borrowdale
poch
oskarshamn
serville
exocytosis
sanatan
bewegung
fujikawa
instantiation
lounging
berryhill
fenty
huisman
duggar
copter
bayles
reidy
throwbacks
hairstylist
hurrell
islamofascism
�
ioannes
humoral
lifesaver
sungkyunkwan
axtell
madhyamaka
ariete
mosport
mikhalkov
moyse
mpu
gromyko
ottmar
uag
vasilievich
hettie
pufferfish
smallman
heanor
cung
ozan
fridman
patry
ostapenko
virginica
rumer
crowborough
woodworkers
staatskapelle
haret
precariously
foxhounds
kizuna
vaseline
godfried
ents
centum
shigatse
lahontan
rexall
hickling
billund
nibble
erkin
phalaenopsis
blam
haumea
unmixed
myatt
desmarais
bowdler
usra
sukh
disfranchised
manipulators
ovenden
mulcair
retails
orry
undercurrents
burbidge
schneiderman
aroha
survivorship
statment
sterilisation
disarms
afrc
okolie
kitkat
prefs
crufts
lathyrus
warmup
mégane
reconditioned
tapscott
hpp
dearie
williamston
arnulfo
myoglobin
norwest
longe
squib
olap
derwin
jarret
bulut
nspcc
hatchling
snagged
wristbands
janka
propagator
gualberto
expeditious
verdigris
trapattoni
envies
askmen
festiva
bhattarai
guerreiro
relaunching
nesn
ljung
bukharan
retentive
jemmy
recirculation
gavril
mcvay
observateur
sectionals
aldington
irelands
filmic
souda
vandiver
dubuc
schwager
poisoner
hmb
elaborations
rusconi
beatie
knotty
tdd
edifying
madelyn
hix
guar
hossa
payee
nanobots
chorizo
surcharges
panteleimon
pouliot
suds
olympe
cahuenga
archies
mclintock
nasha
giordani
leinart
lote
matanza
kalash
informality
iiit
ewert
nonfunctional
verulam
bowmanville
billon
pimpin
philoctetes
berdan
donte
washburne
biba
snowshoes
handelsblad
dillman
banki
creusot
fatalism
dermatological
glatch
cantinflas
tibesti
calzaghe
multimillionaire
aauw
arneson
meinrad
mujahedin
poodles
kima
hamo
sacrilegious
iem
salvadorans
goldring
knbc
schillaci
sarna
sitamarhi
unspoilt
falken
fauvism
celestina
iannucci
mikis
shivam
proffer
linhas
rifai
syphon
ists
ivr
scions
kadhim
fundacion
saccharine
mistranslated
boudica
campina
gnarls
trimotor
rinsed
kef
leotard
haudenosaunee
charlies
marrone
canneries
gallstones
magherafelt
vln
holmwood
cabane
believin
burka
gibraltarians
gallet
sinc
micki
lefort
astrometry
presuppose
sleat
simples
parla
wildes
nonexistence
vouched
necdet
denker
oosterbaan
aristo
luzia
alamodome
transbaikal
zulkifli
chausson
wsr
yelawolf
transbay
toliver
tnr
saen
shepherdstown
hassani
glaciations
brained
frescobaldi
unipolar
uel
gamecock
rothermere
pithoragarh
delcourt
mohler
dysregulation
caza
swinnerton
bevy
colorados
bekir
siris
mits
scrofa
duceppe
tdf
gruel
deloach
vardi
kear
ledley
goldcrest
meireles
raso
barebones
sart
angharad
veerappan
reti
frazetta
bitterfeld
kokoschka
robbo
holdouts
conciliar
aronoff
vandegrift
dilutes
agno
variorum
sangar
palaeontologists
overabundance
smeltz
bundang
corre
olpe
bigsby
kelkar
dogo
sonatine
macalpine
maintenon
zep
campfires
lactamase
winnipesaukee
exelon
bryde
bajrang
racal
winsome
mnemosyne
querido
samedi
bogalusa
bucknor
roadsters
fireboats
sidoarjo
trod
arbitrated
fost
nummer
tapani
dws
outselling
loots
melanocytes
socials
modica
shalott
steinhaus
tubize
cairngorms
radionuclide
elbit
strawbridge
paediatrician
blanketed
myat
lamacq
litani
daintree
kampi
adminstrators
clayey
simonetta
friberg
supermax
xna
jmu
dohuk
kolozsvár
kiener
jasmina
fct
libo
jennison
uptime
snd
tirah
volte
electronegative
valdis
bowley
conscientiously
oberto
druck
ranaldo
stoute
telia
finmeccanica
mckell
shively
deliverables
wapa
micheletti
shinedown
momen
laufen
bibiana
attestations
tegra
nber
maurin
bruyne
ipe
ventoux
gorgas
ayyad
rayalaseema
glutton
chickasha
inria
luciani
rollason
trie
apure
ofa
yoshinaga
bedelia
elodie
gorsky
mcparland
luqa
henrici
woodmen
garryowen
sukhdev
greystones
supermini
lsat
numata
moralizing
apostoli
latécoère
annelid
ventriloquism
duffey
superstring
crace
mentha
tht
shaina
briley
healdsburg
idylls
drudgery
bogner
noro
riss
tamika
galassi
tischendorf
zuiderzee
bacteriological
dropsy
lyngstad
ebrd
homebase
chevallier
dhyan
ahe
hackneyed
bommel
barkin
slumberland
transferrin
karami
bory
conewago
kubler
bandied
lauriston
flq
vts
doroshenko
tobit
chato
humanae
breitenbach
almeda
snowe
jerrod
mendrisio
pokorny
claros
mosso
yauco
honore
beady
schoch
cdd
bilgi
abravanel
protestor
automatism
wexner
lafon
byker
giugiaro
cortot
signoria
dhul
huberman
thyroiditis
atd
beeing
abelia
kautz
pbi
puedo
ulli
ramrod
morphogenetic
orito
crawfords
schaap
iorwerth
taheri
malaka
herniated
kabak
ichkeria
gualtieri
beatboxing
palaeozoic
dunraven
jungen
eastview
merrow
ironmonger
cruelties
massaging
tabard
monocular
labore
ellenborough
limnology
hassi
subdeacon
theismann
gernot
littman
rowse
incongruity
haarmann
lytic
drachma
elkie
holbeach
ull
newsmakers
roncesvalles
unpersuasive
samora
czarny
izawa
bhattacharyya
angelopoulos
savill
hinshaw
magnifico
contextualize
slimmed
tandja
luzzatto
sneyd
wagr
metamorphose
bmr
chioggia
liedtke
shean
teichert
manhattanville
retardants
wjb
orthostatic
inconstant
rushall
corollas
komet
mard
backboard
medibank
iacocca
depalma
skalica
zyklon
braunau
sancto
oceangoing
oana
punctual
pandemics
colom
signori
mrn
belsky
sweatshops
misprint
cator
misbehavin
caffery
laporta
adders
meyerhold
mclarty
engberg
zhuzhou
hydroponics
roelof
silkscreen
rabo
mayorship
mnlf
bludgeoning
shriram
domachowska
evidencing
etcs
salwa
pela
trifluoride
hoochie
digha
pipette
objet
quaresma
keerthi
sédar
sadegh
horikawa
defalco
lluvia
konzerthaus
reexamine
brimley
campen
jeffords
preteen
awwal
berkoff
wtmj
lovisa
flatworms
duren
chian
fashionistas
hilty
aphra
touchline
bawn
decanter
asaro
kenning
wynberg
torrio
kentigern
cels
lapine
fortuño
katonah
hunsdon
reaffirmation
kansa
elmham
npn
célestin
gandolfo
salop
floorplan
söhne
marana
sócrates
ahhh
stossel
régine
lucho
volkova
nedlands
ebadi
thermophilic
deneb
playout
tashan
irondequoit
ataturk
europium
lazcano
harpoons
tirico
dirtiest
troopships
polgar
magnin
galpin
aulis
meleagris
vibeke
tipler
comptrollers
coalescing
dependability
santha
nagumo
greenacre
hessler
blad
claussen
fragaria
ballycastle
crosier
artilce
janco
trigo
goldenthal
laurinburg
nmt
jarryd
dunshaughlin
tdma
kittyhawk
saltoun
girlhood
overdosed
incidently
rhinestone
leonida
sinsheim
laypersons
climategate
mcwhirter
nerses
whitelock
uckermark
terman
glaxo
drl
tebbit
aure
shihan
chobham
tines
disavow
advertizing
erba
morneau
stroller
refn
beetroot
azor
belsize
threadgill
milia
hongzhi
hegemon
nappa
faery
dimitrova
siodmak
doogie
keychain
sanader
monicelli
auks
sasse
mundt
xpt
odilon
earless
emley
perrins
crossbred
gpx
waifs
terzo
unexceptional
mdna
londinium
prydz
repackaging
sanne
parys
grünewald
gitmo
laraine
keni
latinas
rett
goldsmid
shoeless
kubitschek
flavonoid
grievously
fasciitis
bopper
unpack
impregnate
basri
arborescens
fassa
mlrs
blinder
lundström
westmead
retracing
jfs
oozing
edb
bébé
silage
lukacs
kazu
tournoi
gyulai
sence
clinique
pulman
stagecoaches
hunziker
sipping
scammer
cashiers
donk
ferne
barnstone
consciences
gry
cilantro
parkhouse
erden
mede
nili
alannah
zetterberg
partai
nightcap
unviable
scallions
thornaby
proselytize
curds
greyscale
joëlle
downgrades
stomatitis
nextstep
gelling
heemskerk
canid
antoon
jda
megaptera
shoring
narai
morceaux
politicking
bettman
entree
intérieur
pinkston
croak
rationalists
rajoelina
catto
guster
hostesses
kitesurfing
reconnection
bitchy
viotti
ewha
thum
gei
hollar
torta
ubiquitination
kafi
niarchos
deferential
mortara
cernan
scargill
bolting
longyearbyen
flaxman
guttmann
ucm
emanuela
radia
torlonia
frankenthal
dsn
riu
sibanda
chondrite
wiwa
morano
gorée
porthos
ozeki
palatino
uppal
rjukan
domon
lewontin
mirvish
bassman
mandragora
thich
izabela
fabricator
dressel
mutlaq
biletnikoff
rigas
gdb
ronettes
chilwell
printout
lydekker
lgd
deformable
peronism
ilka
pewsey
ifi
lovegrove
lingga
inquisitorial
illuminatus
schwetzingen
fotis
evapotranspiration
kejriwal
ferhat
rijkaard
pgd
buehrle
mehrabad
kili
bookmarked
warspite
vaticano
sundquist
kgf
offishall
siphoning
christianshavn
calisto
hitched
wata
laceration
bookselling
lomakin
choudhry
olan
chilkoot
ramiz
tujunga
caymmi
tethering
chilies
freethinker
dreamz
groombridge
saxatilis
streetlight
garcetti
fancher
madrazo
pettitt
annalisa
springbrook
elrington
kingsmead
koura
toenails
koffler
tavira
conjoint
chiro
sturdier
schwinger
lebreton
matignon
quickstep
lpl
afflicting
kisii
mieux
cnmi
yoakum
nti
openshaw
redi
castilleja
russi
phifer
romper
ragdoll
braked
illumina
amas
berenices
burchfield
backbenchers
paku
christe
osd
hosein
fougères
mandu
ggg
keown
mulde
findon
bhel
lionesses
tardif
vivica
septiembre
dukas
mcclusky
belligerence
pythia
paye
unfailing
windsurfer
aeons
brocard
lushington
unobtainable
patens
tbk
alekseyev
peleus
phelim
gojko
psas
nutini
umami
nics
craved
undersigned
mayhill
ttr
foon
sonik
epidemiologic
filippov
gershman
mystikal
shorne
cinémathèque
chorea
kazushi
loughnane
cranking
teltscher
evgeniya
honking
aspic
gillam
plataea
midstream
filey
chg
barked
maass
participative
hendler
kleberg
begbie
bechara
ceux
qingyuan
unbelief
dalip
dogmatism
leaver
gameloft
panesar
lycee
hindlimbs
ameliorated
altmark
cumulonimbus
royds
sabz
modis
gimpo
lajpat
goe
aggrandizing
yokoi
kilborn
coops
subsisting
ginter
twat
roget
oaklawn
socked
clayoquot
cevdet
pequeno
elapse
pauri
vocalized
nambla
börse
vif
oppressing
refile
zapper
filaret
mortaza
coralie
kua
sdh
whibley
ncw
jacmel
kellys
dunnigan
demographer
fussell
twinkling
forgers
grigol
oxted
calamari
gaspari
tamera
fonthill
reexamined
eius
hoffmeister
rangarajan
cobblestones
vanja
panji
weinrich
westcliff
draperies
barackobama
taglioni
sandcastle
ebeling
dexys
yilmaz
aplomb
chatted
berghof
ensconced
electromagnetics
atavistic
hauler
fatales
halkett
bombe
bottomland
modifiable
sugata
ellerman
warbird
alvey
lysosomes
mortise
kreme
ior
riffles
labelmates
panaji
tatsuma
ophiuchi
caernarfonshire
julliard
tournon
bourguignon
halvor
droning
cordage
hazem
climatologists
yawkey
forbin
frannie
róisín
eeyore
mirny
varas
effluents
occluded
chamomile
biter
boccherini
awed
caiaphas
summited
ronit
bajan
mearsheimer
haldia
fée
baiyun
spangenberg
exf
caden
vhp
flagstone
customise
girlguiding
lamination
anatolyevich
bati
wgi
ngam
cisl
jinshi
sinar
matzo
transmilenio
retrained
giganteum
ekiden
notarial
macmaster
dihydrate
majo
interning
disfavored
yixing
cayes
bispham
broonzy
fass
blastocyst
mirella
balbir
siirt
keshavarz
asplund
verisimilitude
pfeil
mitta
wombles
circ
dürrenmatt
tradable
televising
bouffe
fiel
contini
graciosa
burchett
spiridon
manicaland
riveter
debarked
hypochondriac
wingert
majidi
glorifies
majorcan
pasquier
zschech
surreptitious
yixian
blogpost
innards
rosson
transdisciplinary
fécamp
brahmans
steenkamp
pavlik
slocombe
laffey
pernilla
ibbetson
doli
barby
talcahuano
usj
ganguli
coase
kantele
fonovisa
lymphedema
skewered
miyabi
despots
giora
kunigunde
cockade
sedgewick
yuzu
polkas
pinel
ueki
cooperativa
becke
reedited
oink
peligro
lavenham
mé
nørrebro
castellane
vestas
ostracised
deputized
godric
hej
proximus
cnts
misrule
lafollette
reicher
golaghat
amazônia
waterworld
jbs
hurn
holmfirth
devolves
equitably
driftless
rbm
islah
palps
veruca
borrelia
specialisms
morriston
pereda
oben
prokhorov
walpurgis
restive
cadwaladr
ewi
enke
graney
slipcase
lothians
jaja
fulkerson
atas
lyttleton
cinchona
oystercatchers
cubists
sbf
andantino
menara
serafina
fahmy
sahelian
ahamed
suhl
kilauea
adamu
deuel
siberry
creswick
nall
flattop
reconfiguring
blackledge
aplin
intertropical
globalstar
mangling
aste
narconon
tonner
guidestar
gramps
jamba
clathrate
unwashed
sanusi
marías
disproves
residentiary
sisk
kefalonia
kmp
curti
angustifolium
janeane
homeschool
illusive
cadastre
grosseteste
outerwear
extravagantly
letdown
sailcloth
rosenfield
capitalizes
sampa
bolsters
skm
lensman
manya
bartercard
coulon
smoove
opatija
armillaria
sushmita
inhofe
conagra
bearsden
stationer
braes
munford
ljudski
cata
authenticating
aircrash
ewca
wwdc
qrs
britomart
snorting
scheff
arête
invigorated
yemi
revis
ethnographical
independente
shallots
jawless
henstridge
milngavie
mukherji
kupfer
dexia
vien
quartermain
yasuhiko
newswatch
murari
noblesville
farran
waterskiing
slicker
nmu
curson
usia
sizzling
noncompliant
lifshitz
depressants
improvisers
oller
clinches
orphic
macculloch
innuendos
zoomorphic
fanling
egill
emme
kiner
jerash
rededication
depopulate
balbi
equiano
muramatsu
danka
taca
bobblehead
cheveley
territorially
lygon
garan
mayi
alfonse
industriales
lusts
ptcl
hearer
ijtihad
mashantucket
impoverishment
charo
ledford
teigen
auricular
vampirella
bebb
hadamar
rostelecom
honora
rosendahl
polona
jeffro
grez
hayim
armpits
korff
koshien
seato
fcd
rejewski
constantinos
kollwitz
bisi
biodegradation
ketu
anaesthetist
dacosta
pruritus
typewritten
caniff
meritless
willstrop
randleman
caledonians
elmont
ballmer
poshteh
friedrichshain
odai
megabus
idwal
majella
tasers
folkwang
greil
bordon
trippe
starlets
recoletos
womenswear
wickford
hispana
odiham
michoacan
macaskill
oracular
carlitos
vincents
soukous
unlit
cazares
schnitzel
bezos
guillemots
riposte
singhbhum
kincheloe
veltman
photochemistry
saez
beriev
torinese
hoarded
guarulhos
vilhena
sharpay
boley
moston
custodes
winnington
coupés
crowhurst
psychotherapeutic
lof
kurian
talha
kubera
jabot
mgl
loups
taverna
gentilly
sledging
headfirst
jordanhill
padraic
fashionably
rinsing
cortege
fav
fakhruddin
blinders
theiss
fontanne
sevnica
vermouth
dhubri
coola
ruffles
zichy
grappled
shenstone
dirhams
uwm
someway
kyles
vgs
erf
rebuff
definiteness
primm
nullius
kadar
enjoin
cascavel
utan
eleanora
principi
sneek
tracings
poros
spermatogenesis
naze
recrystallization
foxworth
nazarenes
almir
fiamme
wraxall
ammunitions
cistus
dissociates
sorters
kesavan
checa
adelle
liebert
neuroprotective
pietersburg
bampfylde
guse
whiny
wapo
minimi
bimodal
hobe
ntm
vpt
teleprompter
nakamoto
coello
morbidly
cashin
obrera
vinje
pujas
chanler
réserve
leeper
silves
pardew
steedman
extemporaneous
krylya
fna
hsus
freewheeling
grigson
zaytsev
reactionaries
stimulator
attractively
halfa
koc
exiling
postmenopausal
flopping
smoothies
hedged
occhi
hartfield
beighton
asya
weatherall
daltons
euroseries
almirola
zoi
caballé
empiricist
autonoma
newydd
sequent
rashtrapati
baucus
kaldor
hult
menaces
libbey
delicata
mif
levada
copleston
switchers
ltm
fazlul
orthodontist
jaleel
lertcheewakarn
humphrys
fantom
mallenco
merlini
kapo
komotini
capodimonte
slicer
kouyaté
subtribes
philae
tanfield
dotting
derham
cornhusker
vrain
esparta
flohr
snh
usba
chacabuco
mcnichols
gràcia
wasit
mladenov
flabbergasted
oncle
kajaani
ebrahimi
fsp
homologated
qinetiq
jabar
enshrines
ihi
peruzzi
aneta
unfailingly
sprenger
phx
bulkier
faizal
detracted
tomasson
prinze
clefts
frontwoman
graziella
creat
enrages
liebenberg
benveniste
rubalcaba
catalinas
aurélie
ctvglobemedia
altra
anstalt
meyerson
lagash
woolrich
josy
tle
blitzstein
fpi
braunton
tainter
pergamum
stac
wallenda
hämäläinen
hidatsa
konar
jdk
timba
jamaluddin
kopi
pense
liesl
pipework
gopichand
takeout
naoyuki
crybaby
karunaratne
gde
gentlewoman
grapeshot
flirty
longmire
heaving
fuge
koppelman
ihara
disfiguring
attleborough
epitomised
avma
zaka
hfcs
llantwit
dwelled
namier
majerus
overextended
nôtre
pagnol
purnia
unbundled
voit
plaxico
serrate
saltbox
pedroso
mcfee
foxhole
guppies
altyn
speiser
shchedrin
seawolf
slovenians
maund
phua
ijaw
pasturage
hib
tamarine
runny
kasimir
eyles
harnden
penniman
torbert
franchetti
chondroitin
agard
denotation
rezoned
faslane
garrod
neuropeptide
boke
sledges
speier
yurchenko
olc
mayas
yonah
igen
jta
lrs
rekindling
elefante
vmm
dul
wco
mrl
busily
appallingly
loes
crome
chainsaws
bullinger
braunstein
callington
swindled
tubercular
wanger
ilario
dayz
gerona
gormanston
enfilade
bour
endowing
kawamoto
acheulean
carrero
nangang
jfl
katsuhiro
sarangani
yik
meckel
levering
hestia
widianto
nuisances
finbar
falmer
collinwood
gni
crossbreed
killington
xizang
roundish
meucci
glascock
validator
umb
koevermans
ceramicist
sericulture
unexplainable
shearsmith
acappella
ghadir
patmore
spiner
dalmas
cosmologists
arreola
bajau
golfe
xiaowen
tokunaga
cootie
treanor
witkin
parcs
desorption
godse
hincks
bergner
dillen
warfighter
serginho
mineralogists
crazies
duplin
pesca
whitetail
dayang
skein
kelo
kunihiko
galusha
diam
sawgrass
carnations
basileus
scooping
rafer
nuchal
viereck
irun
binger
boinc
bosa
doubters
stokers
tastings
geographia
hmh
arikara
assai
guilhem
mycobacteria
baidoa
tacuba
rayen
aldersgate
jayanta
watermarking
littlemore
gastritis
shigella
waske
blowhole
sonal
provincias
narcosis
ceda
liebmann
cifuentes
rampton
loudonville
fadel
mesmerism
aisi
rentschler
emeterio
markova
kardashians
merville
samina
sherburn
solipsism
cavill
dramatizes
groans
napping
cisterna
humors
lomborg
polkinghorne
deejays
macbrayne
technopark
ecsc
unheralded
inflates
selin
lonelygirl
honorarium
overtakes
fuku
renfroe
houllier
battin
seconding
patrouille
tenax
tashiro
gillmore
lingard
verismo
susann
medicis
sbr
foxglove
prewitt
lavatories
ekrem
soupy
burmah
musca
kivi
pathi
volcan
kocharyan
orava
peterloo
epb
kittson
kultury
floriano
jareth
antiparticle
kep
maag
chinooks
quinze
gom
gharial
ois
rosebank
morarji
beeb
chiptune
waxwing
flim
lenglen
lecithin
untruthful
anneli
brockwell
tiltrotor
twentynine
frictions
vallely
amarte
matosinhos
panyu
dfds
mcnicoll
vij
shahabuddin
besi
baykal
bilinguals
prothonotary
modine
revelator
sudeley
cronje
vivax
blumer
equivalences
perfumer
alster
syngas
buswell
motorcar
katsuura
westfalia
whiteboards
caeruleus
branly
tandberg
pils
alcyone
margalit
rationalizing
normann
hiu
flytrap
havlicek
argerich
cimarosa
orcinus
tungabhadra
channelized
serotypes
bushveld
hapag
pringles
kett
disodium
lorien
choreographies
calmodulin
bismillah
ballew
yasha
tamarins
eliel
modelers
rylan
smokie
nativism
oggy
ionel
clipperton
papaver
jonesborough
nasher
regnault
norum
kasdan
czk
bellicose
cuno
fortas
collateralized
caufield
hercegovina
jalapeño
messaoud
carob
shakyamuni
miralles
mumble
shapeshift
chicagoans
riskin
suresnes
moustapha
culottes
sansovino
libi
pigman
riverdance
enga
floe
inauthentic
magrath
amey
sonnenschein
lacandon
jiuquan
aukerman
duhalde
cointelpro
kimbell
toefl
landfalls
pouille
zeist
teti
kumagai
parkgate
zanni
spicules
mágica
taron
adán
epitomizes
tisbury
helion
tornadic
vidkun
ishpeming
benét
rugeley
alienates
talgo
dubinsky
poiret
ebersole
cynwyd
kogyo
duende
zanskar
colombiano
kennecott
monforte
oakie
flayed
geis
boutwell
dito
undecorated
dendrite
oris
latoya
chertoff
beveled
pseudonymity
henric
novoa
dataflow
photorealistic
drin
transdermal
sanstha
nikolic
froggatt
irizarry
striven
ronni
unsporting
hcs
tawil
grabber
axelsen
undercoat
liverwort
lincecum
wentzel
repossession
marshman
aldborough
theming
monteleone
unsorted
impalement
mandingo
flyin
recusant
upstage
yashwant
kuhlmann
destabilise
truckloads
goalkicking
twiss
tridents
chemokines
noradrenaline
brw
presuppositions
pene
quaye
kirkstall
bismark
deboer
plettenberg
dilmun
sommerville
borut
halling
giancana
dingell
taine
leavell
menaka
groupies
tumbles
waseca
splendida
esat
saleswoman
emiko
broadsheets
uis
jey
jolivet
plainville
shadowland
aubusson
leos
fpa
crikey
flannigan
rulon
kroes
nasirabad
deepti
gryffindor
candlewick
lathan
bersih
taytay
puis
southfork
romanelli
lagomorphs
obv
behnam
rigsby
haidian
grimmer
jaruzelski
housecat
gilby
jra
chives
legalities
allegri
jastrow
sashimi
sprains
regimented
serialisation
feuchtwanger
rappel
tideway
bucknall
intermarry
introvert
montand
litanies
microbreweries
boardgame
circulator
matjaž
cetra
layden
vreme
kupka
ballyshannon
apted
matawan
beatitudes
ansons
steelman
tramroad
rocketed
jut
palpitations
cassirer
osório
demurred
fethiye
figgins
nmp
ballymun
kinley
encasing
hermoso
cowpens
maceration
shikha
bewitching
ercolano
reggaetón
volare
fibber
zile
sigsworth
salvesen
dkp
harrodsburg
denuded
sesi
banos
mutantes
landlines
perimeters
covergirl
clouzot
bolinas
mcnichol
mirabal
packhorse
thorndon
sturla
olympio
joye
perico
golos
domina
riverland
clutton
recuerdo
mamou
behaviorist
bessy
sepulchral
achmed
serpentis
cypresses
hanser
rosewell
bulleid
kurobe
orgone
gsb
disher
bandidos
shebib
suntrust
gorn
shamash
libels
galarraga
prien
seip
voe
unarmored
blacc
khm
mountie
dieux
wrathful
tomtom
motorhome
cretans
qaradawi
goor
jovem
cascio
ravishing
lenzi
attractors
dragana
skydivers
sinding
hendren
mahama
thyroxine
demerits
honig
studbook
teetering
doxycycline
widgeon
dishonour
tatters
hairstyling
creekside
poum
sammons
wiss
ratanakiri
firdaus
deedee
hita
kibi
gic
takedowns
malmo
hütte
lapidary
mercat
majus
dematteis
curonian
karmas
athenæum
morar
shaves
uec
boldest
pernille
endorsers
chalabi
groundlings
antiphons
addio
mannitol
mistreating
silvertone
yixin
malathi
greenidge
bychkova
sciatic
broadford
arlequin
biedermann
pahl
practicum
sleeman
streatfeild
myler
melson
yor
burlesques
ifm
levitsky
weinan
contrarily
stellaris
pasos
doctoring
jerwood
righetti
vermelho
junctional
piat
puls
bogside
mattawa
spamalot
ardrey
dyspnea
arni
yanomami
narang
huahine
ibragimov
rettig
trautman
scholastics
jameer
raftery
janek
fauvel
sugg
sharpie
rockhill
kwekwe
organon
hizbullah
fokine
didion
inheritable
woodmere
outputting
tempts
mizzi
bonhams
reste
congeners
allerdale
zhivkov
managements
borderless
lel
rahbani
crocs
comiket
implode
ptosis
wristband
kirch
jaxon
martire
sulmona
dignitas
bubi
nanosecond
readying
prorogued
alvares
grm
braved
consolidations
tkachuk
meres
timoteo
caraga
redder
sabatino
roasts
gastein
koffi
aniruddha
hofmeyr
knutsson
pseudogenes
rakoff
equestrianism
magnetometers
obolensky
aqil
arnica
angraecum
rolin
zacarias
kttv
tfp
fugate
lawmaking
barthold
novelas
vlog
pindi
spu
ulitsa
khanpur
couscous
workspaces
tartary
umphrey
kisser
kempston
tanveer
lipsius
mbas
ipfw
maclehose
flyleaf
kampar
gurewitz
sifre
levitin
milby
servatius
aley
likelyhood
jammeh
stressor
bagapsh
audun
esn
nort
coercing
granodiorite
schwarzenbach
honoria
neurotransmission
scilla
misaligned
granderson
goree
spiritualized
steves
kashani
okan
richmondshire
lohia
gundlach
schirra
melvoin
anthocyanins
copepod
masakatsu
saqib
glycerin
koza
ncd
peled
theists
gilberts
susanville
mosasaurs
tonton
takaaki
lanford
wikström
penitents
daran
secada
arbeitsgemeinschaft
chromis
tihar
diatchenko
courtaulds
anssi
gazillion
beales
unpatrolled
alarmist
moriori
militaris
korhonen
tsiolkovsky
eyespots
navami
berliners
dehli
dilatation
gentis
mercutio
mikhaylov
kingz
naral
inattentive
hessle
benavidez
alewife
bagchi
reloads
expeditiously
seac
pagel
simulans
automatons
bloodworth
menges
conseco
foreshadow
jugnauth
eski
treo
transat
suprema
gessler
bejeweled
intermingling
kilcullen
trawls
ldh
loins
botton
rightwing
lilienfeld
onomatopoeic
oaktree
dusen
yadin
coutu
ehrmann
lydiard
ravelo
veron
bookies
chiharu
rbl
tororo
daikon
jacka
ragwort
tuamotus
avena
macero
chav
recta
merrit
shefa
dorsolateral
mascis
finglas
willowdale
crutcher
longfin
zahedi
mahalla
loing
associés
gangotri
scruff
tenge
garrincha
kaleidoscopic
sorum
curle
beartooth
rosacea
sphenoid
standardising
ryne
aikin
espina
roskam
pedicle
puttalam
sumathi
shepherded
workweek
wonga
chilterns
leixlip
merriment
abhorred
parcours
dosso
xlp
preux
corrina
gavriel
gyles
tivat
cosmetically
counterstrike
yeley
buzzi
dumbed
interpolate
gie
perren
episcopi
conspecifics
eirene
udd
songlines
brizzi
kalwaria
prepped
efrat
leve
fraktur
tabo
garon
grigoriev
arumugam
quia
marcum
lepper
finneran
telavi
barabbas
noonday
mckibben
ansah
grundig
nigro
foxfire
dbi
carterton
benigni
genbank
bushwhackers
ashraful
zonta
shamsi
roxx
samsonov
spake
airbnb
dreamliner
glioma
festivus
floodwater
gbl
cecchi
kornheiser
silverdome
fehmarn
stepanova
ftth
treptow
disruptors
stably
lalique
hayama
farrugia
megs
kartel
cramping
drac
skf
tarragon
sondhi
hemorrhoids
nicos
kathe
sncb
natori
mappa
hermogenes
crisfield
lodestar
nrdc
coleg
edibility
asantehene
aqualung
rra
sacchetti
guanche
blackham
novoselic
highwood
saisons
spools
gijon
wpk
meijin
afrocentrism
contentid
aimar
oshii
nucleated
citizenships
gynaecological
yonhap
canford
jacking
lge
referents
tilled
bota
kunene
prud
locher
shikibu
navarone
filmfest
kapisa
hawkwood
stottlemeyer
glamor
ringlet
testi
katerini
kelvingrove
interlocutors
spid
phosphide
handcock
stammer
baringo
seekonk
kostov
pallister
bostonians
gamescom
callinan
wilkinsburg
staré
birthdates
stoopid
walwyn
cubical
groundsman
maadi
levens
edman
nanowires
rustler
consequentially
shb
gloved
parsnip
ybarra
reali
kuji
loveable
bushrod
achelous
fone
emv
dorinda
infidelities
denner
beyg
substantiates
bulli
salicylate
ambassadorship
vds
inarticulate
kirkdale
xbla
bontemps
mikkola
mccarey
dybbuk
canonisation
hogwash
oxidants
kilobyte
wettstein
speers
ezo
hillebrand
balladeer
humbling
bocce
beauford
anagni
klingenberg
foust
flecha
keye
quebecor
⅛
pdu
playthings
lpp
durward
hollenbeck
gossips
cini
siba
rafflesia
jaimie
rountree
bilk
gasperi
bhupinder
harapan
eyez
bowtie
stabbings
familias
snu
inco
potala
scheidt
benge
franceschi
osmund
toothbrushes
lomu
himanshu
moyano
nationalgalerie
gainful
florentin
khare
kadu
smn
lenhart
fahmi
appeasing
brochu
maremma
coeducation
bodoni
pib
ewers
brünnhilde
sandell
dnv
trivalent
abbreviating
shellharbour
airtrain
pimpri
blaga
trou
fukunaga
gittins
carmaker
mcateer
unlined
jags
palestina
mulkey
omics
ruhl
plop
pennebaker
harle
chapuis
salvatierra
latitudinal
balazs
rhapsodies
tuncay
soddy
bankside
arcanum
morshead
scheibe
généraux
meninas
annu
yamaoka
beaucaire
redbone
germinated
daher
exa
selex
tiwanaku
electroencephalography
giustino
robbinsville
ridwan
dimitry
thilo
ubaid
ureña
trin
javelinas
antigo
farrel
bollards
cravath
vlaardingen
asala
vinca
tadeo
lutetium
cici
sodhi
greenstreet
exoskeletons
asaba
nghe
dehumanizing
subdural
komeito
shoehorn
revitalisation
carcinogenicity
keta
tansey
brummell
brookland
chaperon
aways
photomultiplier
charta
clayson
equivocation
aureum
callin
annotating
baotou
janki
atanasov
bartell
pornographer
sablon
phebe
endeavoring
signoret
nightwatch
usbc
erh
braeden
neurotoxins
chippy
stokke
ingi
witwicky
lettie
ladainian
charmian
truncating
wnc
nambi
terza
khandwa
efg
usted
communards
palmeira
wso
constructional
rajni
dooly
koca
ashbee
diffrent
curial
familiarization
johannesson
vanilli
paclitaxel
boloria
hexes
mcvicar
mnd
naren
edelmann
dilli
bickerstaff
abie
reentering
sharer
democrático
sundazed
ebl
zucchero
kiswahili
duratec
foer
gavi
nch
antitoxin
pigskin
furay
klis
tracheotomy
pinjarra
kinsler
olha
macoupin
chicanos
justen
maginnis
malevolence
torero
voynich
heping
cloyd
reig
laeken
bierzo
pegaso
lelio
grr
andrae
dubrow
optimists
mandriva
rennet
nickle
heggie
maca
ghanaians
laci
grise
multicoloured
reawakening
wieck
slouch
internationaux
leathernecks
melanoleuca
deceives
tanagers
hemme
dittmar
wanaka
rns
ciné
migros
burritos
crabapple
manch
marañón
guv
absheron
dasari
hild
microfilms
tinny
middleham
fretwork
subash
ifo
appraise
rmf
oao
erosive
devol
prejean
mbna
amenia
jiong
kuribayashi
tings
silvered
towcester
rli
shukhov
laba
vastu
malco
waterless
locka
vermaak
atis
mcbroom
baynton
pey
carrum
andreja
aldobrandini
raus
lumens
mendi
dellums
howerd
biller
sevierville
shorthorn
turun
jabara
regnier
amidah
bunnymen
sweats
accidentals
entreprises
tomaszewski
debentures
idioma
alarcon
sarakhs
napanee
kornbluth
raveendran
nannie
roadless
surcouf
intervertebral
hunton
writable
saberhagen
buttock
bjarnason
skewers
galloped
nordberg
hinch
rovi
wakelin
higuera
pll
stokoe
jimmi
incalculable
menjou
quartetto
casselman
commerical
aerofoil
celio
lippard
maroochydore
gymraeg
pleydell
tamira
chaudhari
balme
slovenske
pictor
progestin
tats
delaroche
cravat
lampley
soylent
winkie
mullions
agilent
serik
mengelberg
klebsiella
verboten
beiping
edad
riverdogs
indiscipline
redeploy
maddocks
inequitable
fattening
nago
majdal
littorio
commendatory
flaco
hedren
kamikazes
etim
sialic
immeasurably
semiconducting
golems
flix
raincoats
givewell
stigmatization
etzel
centcom
lobbed
jaintia
doti
impugned
brind
constrict
jnu
merrin
moustaches
tennille
hache
aspe
bioremediation
liaising
decapitates
rotherfield
klosterneuburg
nyima
sameera
infuses
corsaire
lapaglia
botting
vogelsang
hangouts
piranesi
takakura
arjunan
aritcle
abounding
gamesmanship
windspeeds
vavasour
philipson
keeney
riazor
vilvoorde
foolhardy
ratites
wryly
jeered
mieke
ekin
sternal
urbanist
imprudent
literalism
feith
kufic
gulbuddin
waldmann
gird
icedogs
dorothée
perverting
synchronously
dwain
chattooga
moman
spektrum
cavs
wharncliffe
mbda
bergoglio
jedward
redistributing
pravec
bolaño
supers
awaz
sneer
barranca
shifa
gorrell
myp
lpm
irix
schickel
isostatic
interfaced
yaqoob
quattrocento
baad
guto
harian
menuet
syosset
cenk
danses
pallone
dunce
chastising
religieuse
corbridge
wilcock
neng
sardo
ahriman
sidek
hinrichs
gyros
louella
bashford
baretta
narrowband
percheron
skiles
ramberg
reinecke
agrostis
potsherds
orangerie
konopka
wertz
whitethroat
altamaha
dco
qube
redfish
droga
viterbi
ennui
afanasyev
exim
deangelis
pleura
kazunori
sujit
impey
laar
thorbjørn
barek
muhajirs
dreamweaver
bindon
batwing
yearned
mcglinchey
mcgivern
phobic
alassane
leota
ermengarde
roope
vire
bosé
thro
linkable
mszp
raved
eales
hardcourt
flit
fluor
efm
unequally
carisbrooke
roentgen
mallya
destry
idk
northerns
schwan
derangement
soumitra
stormer
vivier
bosmans
boubou
onr
hitotsubashi
sadek
scirocco
reinado
eib
patronize
leavy
mühldorf
porticos
dixson
agronomic
chromatographic
redid
zurbarán
garissa
niggling
contraptions
katter
enuff
wjw
camby
sensationalistic
huffpo
haslett
quantile
impounding
zucco
portocarrero
fruticosa
calender
amw
asem
damer
paktika
pish
cahaba
thinness
mashal
jigar
ointments
warpaint
nunthorpe
martinis
kabwe
kranji
flos
parlours
akhmetov
unblemished
jonty
colledge
rimless
baumbach
brocklehurst
calma
vadym
voinea
infiltrator
deana
ziguinchor
malim
weatherwax
predestined
unaligned
twyla
jetsam
bourdelle
seasick
willison
schulberg
thelwall
spinosaurus
coelacanth
harbert
streetball
ians
indefinately
rapturous
beran
vab
woll
greenall
taraki
brittas
suning
multistage
voortrekker
ianni
eoc
fitri
utpal
himal
yassine
dyersburg
télévisions
gorani
dnssec
subverts
kimbolton
barson
spillage
ltr
scandalized
mondello
hbcu
asger
krell
pollsters
allsop
choudary
laundered
eukaryote
fucks
takeaways
awka
vokes
placoderms
polydactyly
ravensworth
plassey
flog
woofer
incase
habilis
matings
rawtenstall
brasco
rangeley
barresi
fags
nejmeh
karmel
hurlingham
greystoke
shadbolt
paese
alcobendas
marone
ltl
hackettstown
instyle
bayamon
dvt
hengist
multics
croome
dionysios
tuppence
startin
torchy
sagawa
exigencies
gobain
awt
padgham
verandas
mazara
ozymandias
pme
schriever
lollobrigida
emts
zuri
calamitous
oikos
sandridge
stonestreet
sunapee
hamels
crucify
sophy
marsel
collymore
saoirse
tenpin
aughrim
subspecialty
negroni
rayfield
dutilleux
mccullagh
noye
qadisiyah
trutnov
velour
prettyman
kgs
vib
boesch
guercino
prestonpans
bookmobile
pomeranz
harddrive
rickety
rajkumari
pretexts
imposters
oncogenes
raph
edr
penrod
misstep
biogeochemical
douze
professionnel
shockey
wantagh
walruses
savannakhet
penndot
newsweekly
praslin
idu
lading
xplosion
sherbourne
griz
sevastova
sigüenza
mazen
prambanan
tarmo
soldati
shishkin
branes
dreadfully
irreverence
sieves
siqueira
aradhana
amell
habituation
gené
kirribilli
linna
weigand
mirrorless
souled
fakhri
mazurkas
bolduc
espinal
teamtennis
ruminant
nadzab
fuyu
swanee
savan
splattered
zille
adrià
rycroft
mantas
smacking
brakhage
mccants
tenri
schilt
semipalatinsk
girling
functionalized
sportsline
nassa
belson
rahner
btm
clercq
welborn
bakehouse
cowdray
leçons
handelsblatt
blacktop
sequins
venecia
saputra
orignal
motorcars
inscribe
pedaling
tob
forsman
banhart
gupte
endnote
tunicates
reversibility
dahlman
centrifugation
himara
collodion
circlet
heilig
beneficence
eclampsia
pipo
janjua
fiorella
framerate
chatrooms
rauparaha
carine
beautician
troia
chaudhury
melati
dickins
doddington
olb
prca
decimate
arcangel
treacherously
newbern
kantar
erichsen
gramatica
evora
sanlúcar
ngee
metroliner
asit
bergens
lewton
traunstein
neasden
sask
winkelman
apne
mady
elmslie
gaiters
hütter
mccullers
chihuly
luong
headnote
mapfre
schillings
caedmon
enchantments
florus
schuur
panting
shomron
automagically
magners
alces
partlow
notturno
eclogues
skoff
healesville
mauvais
fleisch
opv
aine
twirl
auberjonois
hajo
knowable
haredim
bradburn
kurla
goupil
chengchi
knapweed
imminently
，
quickened
hypersensitive
delson
widowmaker
genta
rieu
wring
nisshin
cyclically
kurti
taiz
banqiao
wonderwall
whoo
klim
ornithischian
prednisone
disobeys
jnf
truett
ichthyologists
siku
gingold
muggeridge
amphibole
concussive
catid
technic
uppers
spiess
ombres
actives
besnard
basketballer
calheta
accessions
berl
flypast
barkan
zeidler
fromme
björklund
bamiyan
entrap
aerogel
cnf
casadesus
sergeev
pcn
cauvin
untried
superego
parlay
savate
phthalate
goodkind
benayoun
arsed
divisiveness
morphou
nnr
pepperell
emetic
kincardineshire
scissorhands
souleymane
blainey
wapello
prete
thewlis
bick
rohm
mullard
thoughtfulness
reintegrate
splintering
adx
whitepaper
kardar
iestyn
entremont
beza
laminae
militaria
songdo
morville
pottawattamie
lampman
yaz
montgolfier
coso
azeez
thz
mandrel
anarchistic
peristyle
awaaz
azari
naif
cdos
istiqlal
ddi
ddl
ioannou
toshiki
taru
fizzle
trb
turman
starlite
transposon
arnolfini
astronomically
rosalina
bandh
bombast
cassiano
dialogic
tyrus
demus
yv
monceau
poète
osmonds
damara
sedgman
leino
boulle
latgale
heizer
cosmin
rawkus
bobbins
enmore
manz
platten
greenman
ryles
minutia
supervalu
nurtures
bernards
blockages
atalante
clansman
shuns
tamora
kusama
figlio
hookworm
existences
ibaf
chrystal
tsf
undesirables
sneezes
minahasa
kelemen
marica
rotund
reh
mwr
ductal
valbuena
conjurer
barfly
grossmont
hochuli
puta
priyadarshini
jingu
sidewall
animates
rch
heatherton
dunstall
shillelagh
donis
playset
cheez
titlist
solferino
singalong
luckiest
kossoff
crespin
willets
sotiris
shames
akina
hup
bluray
forsyte
mcmullin
ffv
parfum
gothamist
mallaig
zapping
frauenkirche
braswell
copan
stier
dextrose
whiteout
troves
calatayud
sakina
tissier
pertinax
rosada
foxley
neglectful
purposive
rifaat
lakebed
biotechnological
broomhill
freier
gleaves
priester
chevrier
textually
rambunctious
iccpr
walle
mvm
bluenose
sprightly
reprogram
stull
asparuhov
cwo
ishaan
overspill
diabelli
esco
damselfish
janae
issachar
drilon
giardia
centos
nacac
corll
ritsuko
stucky
marivaux
puncak
debakey
bernama
discotheque
caudill
gpt
meshing
krishnam
chucks
solex
dood
massari
hamadi
aricia
prestbury
hillview
audibly
anthropoid
ftw
niqab
wean
rawling
jbc
misspell
francona
jammy
subacute
protist
gkn
sirt
skoll
fot
kellar
farinelli
humdrum
nobly
clockmakers
bupropion
alders
coproduction
hazeltine
lymphoblastic
boger
breslow
ehealth
itemized
cappelletti
olentangy
guilderland
kushboo
agn
kamu
mult
tesh
zukerman
foca
methode
vahdat
brewton
hailstone
killy
moretz
burros
pelagia
lockridge
llan
szpilman
chex
mackenzies
oth
xylose
farouq
ocracoke
nutritionally
izabella
parisot
insan
touchwiz
donisthorpe
freischütz
reification
poliakoff
milanov
bth
rech
kitahara
skiba
chasen
granier
rijal
stereos
ocmulgee
rehberg
finra
limbed
grins
tomos
tamasha
greipel
unceasing
rescission
presentational
reffed
bartering
downdraft
petersberg
ghosting
shatrughan
weismann
nightwatchman
tonia
dehaven
irna
fogs
qun
immunosuppression
simkins
abrasives
peth
turnberry
polarizer
siya
fyne
zeven
desiccated
lifehacker
rajini
juntos
oury
cytomegalovirus
abertillery
haves
mockingbirds
hepatica
christiansburg
brith
grandjean
flicked
bettenhausen
fedak
connaughton
lithographers
fasci
prakriti
dilate
aycliffe
auda
ragnvald
shirelles
nerissa
isaksson
redheads
brumfield
seemann
nantahala
sustrans
asko
leberecht
ccds
jining
nrcs
motwani
vsm
taille
horsehair
severson
ressler
mutilating
cagan
vandermeer
bagatelles
kennebunk
otv
langport
huffpost
rabbids
caecilians
akasha
folland
overdosing
barclaycard
hollowell
yoro
wouk
jovana
cassville
microsecond
makeovers
argot
berita
hildy
petronia
niguel
luda
kotecha
venturers
korba
decamp
merope
whigham
particulier
acela
sudeikis
burzum
quietest
khakassia
scorcher
chomp
nuvo
bronislaw
alessandrini
hiddleston
neuendorf
genuineness
graal
holabird
manalapan
softshell
culbert
kayin
uart
triveni
uzès
prieur
bulwell
ralls
agenesis
kaden
clackmannan
heraldo
upstarts
snedden
akamatsu
agreeableness
bryony
depletes
realpolitik
magnier
commandeer
malic
oie
landcare
compensations
yaniv
nung
marsyas
maquoketa
fruited
chandana
griddle
beihai
hofmannsthal
gradisca
eyelash
accardo
frayn
mcalpin
colcord
misako
courvoisier
seismically
dehner
unrealistically
reignite
comely
omr
bolam
practicalities
jamelia
crispo
litigate
sauve
leadon
kislovodsk
propoganda
weatherill
unbanning
unionize
standley
chump
journaling
messin
failsafe
dunlin
aluf
takapuna
jagran
henwood
denizen
reinsertion
jhon
urbi
triangulated
olov
execs
broadfoot
arbat
iif
assiduous
boulet
cosh
halston
tailpiece
gazipur
xuchang
tischler
quy
harjo
storyboarded
amann
roser
hairdo
baoshan
schmaltz
onodera
ingold
crooning
kentuckians
ejections
rambutan
pembleton
exemple
mento
resh
pressly
lynbrook
muddied
saji
fierstein
shortridge
ysabel
planetoid
napoleone
ladislas
hillenbrand
meus
tamanrasset
trialling
gustin
westray
lienz
viens
nft
jouy
kunitz
motivic
mispronunciation
gasping
arsi
picken
copiously
ajinomoto
lfg
serpico
shediac
pesto
taeko
taar
moff
ncube
pulque
restrains
kittatinny
khmelnitsky
cheekbone
globosa
géricault
dbd
melodeon
shankland
bellemare
angstrom
dudbridge
brownrigg
inzamam
partials
horeb
figureheads
northlands
condes
kpm
musselwhite
coalbrookdale
rutkowski
veloce
tallent
arene
lauretta
proselytism
agim
davidii
reincarnate
arcas
daal
juristic
jaana
singa
goins
wannabes
briskly
hudd
funicello
dilawar
expressible
poinciana
yaks
payet
toiled
cannonade
seiki
levodopa
platina
vhdl
beppo
iguchi
qar
newsasia
huizenga
shak
svengali
depatie
shurtleff
pumpkinhead
apace
natas
claremore
cwi
snecma
wollaton
teofilo
mcmenamin
busied
bugey
sympathised
yurok
leboeuf
catty
tambura
reidsville
kerrie
asymmetries
mcnairy
gelasius
winchcombe
uccello
checkup
hesperides
majordomo
baris
arrl
ninette
relaid
umer
tamilnet
consumerist
thekla
armide
daubert
congratulation
beaucoup
dhafra
corto
giantess
avo
transcanada
orphaning
khanewal
penetrations
fuseli
yearn
kuczynski
jordanaires
kehr
tahara
gayoom
nabf
bynes
divulging
osteology
gomti
lofted
jonquière
toch
woodrat
taddei
morath
munnar
serzh
cartouches
fomented
mbabane
saionji
ureter
raindrop
grodin
forager
zio
tobler
holier
exactness
continence
bjk
murph
johari
mehsana
tyreke
dukinfield
probiotic
marleau
tevye
sealink
kurowski
appropriates
josefine
jitters
eldr
cornes
millbrae
desultory
gimignano
tripitaka
zihuatanejo
billowing
berbera
lyotard
newburn
sharecropper
soltani
puzzlement
jasa
herminia
gladness
cramm
troodon
chayanne
calendrical
meckiff
muerta
salamon
riebeeck
hunstanton
poitras
lodha
nedney
shadowlands
corvina
jurat
maen
birbal
ravinder
bimetallic
marney
peeking
lovemaking
ciliate
monarchic
amick
michelotti
tomek
shesh
cowritten
vaibhav
berling
grabow
flicka
berberian
derozan
pompeian
ricca
vaso
miffed
leaven
sangrur
btk
salchow
aosdána
taveras
lather
longstreth
zogby
daiwa
zafra
solum
muhly
horstmann
mcmann
kochanski
kibera
akaroa
krynica
bigeye
radials
iccf
stargell
cezary
bumi
tomica
venancio
keadilan
viscoelastic
beanpot
onet
thyra
obregon
waxwings
vava
abrasions
fontenot
montecatini
yorkists
seru
learmonth
szilard
salou
gunslingers
whalebone
ronco
dansville
blay
misheard
takayoshi
sidecars
gignac
deanne
lessees
airto
massasoit
wellhead
catalyse
rosemead
cergy
xbrl
vfc
wasilewski
counterarguments
showrunners
crackpots
xanthus
napper
pendennis
pupillary
fluvanna
madhesi
timisoara
chel
frenchwoman
newmont
erratics
severini
mullica
macari
vágner
cirilo
neilston
hollerith
dualist
collegehumor
togashi
ravidas
clelia
wacko
garcinia
concordant
kft
rgm
mirwais
podiatry
bridgit
morishima
khujand
mugham
linhares
equis
ufw
barla
schrute
hdc
slights
bayonetta
mcilwraith
noboa
weaponized
yoshii
defecating
urc
tannic
laidley
tufail
stettler
faneuil
chetumal
marshalled
∆
electrocardiogram
berchtold
mammalogy
kelpie
murgatroyd
spingarn
zuluaga
boven
oligosaccharides
kile
moins
crv
fakenham
tener
convertibility
ballas
curr
gulbrandsen
toren
hulett
chuka
hrazdan
dessel
tamarix
budgerigar
gendai
transience
brecher
remigio
sostenuto
mccotter
semih
dubreuil
ellul
clavel
selinger
pomus
overconfidence
weka
ishita
pardoning
pinstripe
vigneron
canetti
outwash
flatmate
geertz
fledglings
streamlines
fintech
slavishly
dámaso
turville
clasico
peloponnesus
centauro
aion
overexpressed
gilwell
kharlamov
marstrand
assemblymember
demerged
sexta
mccaslin
pestis
husted
cloke
trumble
feta
motorhead
krc
amidar
fendi
manlio
larbi
dadt
brucellosis
undergarment
valda
shevchuk
dexamethasone
urethane
uncluttered
stapled
nyah
rpms
achillea
avaliable
rinn
farting
reformulate
nueve
giti
lamda
bandwidths
toughened
musiques
gainsboro
diagrammatic
nosebleed
desafio
exocrine
joselito
pilsener
koivu
nonaka
haseltine
shigenobu
djm
ganas
resolvable
imatinib
chil
katelyn
orbicularis
regla
chaumière
gensler
unscom
houellebecq
balaj
frenzel
olf
gurnard
dandi
kentville
breckland
piratical
freudenstadt
boehmer
bellatrix
lelia
echevarría
defrocked
builth
bayous
wallechinsky
boies
noticiero
eleuthera
jingo
disclaim
maritain
colourist
ouessant
sarn
gargamel
sangallo
wyrd
fildes
lamento
luken
outstation
womad
windup
maravilla
pollio
mazzoleni
discoverable
kitzmiller
saintonge
epitopes
homesteader
panzers
krabbe
uninstalled
riffle
srd
rff
bilo
hernias
swanley
flickers
motorman
paraphilias
champasak
anthropos
ajdabiya
proofed
neurosurgeons
zamzam
neurodegeneration
crytek
paddler
yasu
wurz
avb
mcenery
brennaman
alloyed
cardano
killjoy
shahak
bookends
campagnolo
minns
osoyoos
duino
levay
meteoroid
sahin
kretschmann
zeena
scarry
convalescing
baladi
breadcrumbs
carnwath
earthing
dingdong
tinsukia
sessa
charro
corrin
heidler
toastmaster
mastro
okon
toreador
grackle
wetsuit
weirdos
thala
isoroku
biratnagar
munari
partakes
itam
tarlton
antenne
neurofibromatosis
silenus
istar
soth
cannibalized
chalmette
actuate
swithin
melcombe
cnp
rotenberg
irrigable
prepubescent
hurrying
cathepsin
honoka
pellow
simonyi
behzad
bloomed
kolya
fujimura
ibooks
glatz
alber
puccio
relaxant
boxall
upgradable
charissa
uele
intertwine
stormbringer
dushku
shiners
tudeh
oranjestad
schonberg
headways
tup
argyros
oodles
selzer
lidl
shikarpur
broadstone
daiko
behnke
horning
giampiero
abductees
tropicbird
gosset
doggedly
mmf
medaglia
scouse
hsg
yasmina
wolfhounds
forklifts
willetts
srk
rikers
piscis
neurobiological
overhill
exotics
jizan
talamanca
dappy
stardock
kyser
alpinist
undulations
wacha
kinkade
fincantieri
arevalo
nelsons
oncogenic
juxtapositions
kyodo
millcreek
stiffen
manohla
brehon
girt
ransack
fiorenza
tapley
quora
falah
itd
komsomolskaya
dappled
tsuruta
euroscepticism
reciter
pneumococcal
impugn
ocn
blackshear
campari
novin
hazlett
yahtzee
creeley
harbi
bescot
umpteenth
ktv
montazeri
jocelin
rothbart
tailwind
maros
nardini
galaga
kelaniya
downingtown
arava
cheatin
mutters
culham
benzo
barpeta
ingall
keays
broncs
walcot
sundiata
cattermole
pingat
cryopreservation
abboud
confiding
blackdown
kushi
jobbers
mehrauli
cevallos
auggie
misch
guestbook
scientism
berkel
aarón
morrone
androgyny
suppositions
henslow
obsessing
spion
hatakeyama
outhouses
bierman
unpatriotic
compas
kraal
zahl
oogie
citrine
hoopes
climie
narodowy
fuertes
wotherspoon
rezende
munt
tyrannosaur
goll
lynskey
widdrington
ritch
quirinal
eed
alexie
duse
torvill
cess
zosia
deactivating
taunggyi
klaudia
bruisers
cck
floes
possessors
kazuaki
nauseating
soldiering
vallone
pronger
strohm
hortensia
mbale
worksheet
kaspiysk
despard
trabuco
mehl
azharuddin
observatorio
lincolnton
idh
coiner
drechsler
vizard
wollen
stuntmen
ahlquist
juhan
kinghorn
kodai
mabon
zahara
steadicam
juhu
ikan
akko
eberly
papists
hamdard
meineke
wiglaf
javon
plumlee
emcees
frowns
caperton
questioners
capshaw
mutilate
sohag
voltmeter
vomits
hasek
kraig
zel
dugongs
hubie
bansi
shanter
lacaze
phosphorescent
gueye
millon
redlich
begich
parp
zibo
superstructures
mulu
gilg
muzyka
tetyana
rhymer
hinter
cioffi
salaf
riaan
caister
goober
baras
orbe
hilfe
kouvola
gansevoort
cherney
rameses
doerner
antonieta
siring
hosny
kixx
beadwork
eurocontrol
nesher
skipwith
beccaria
samata
scatological
kie
ripeness
lowrider
economía
esler
elitexc
stringently
madhubala
pictogram
evangelized
kettlewell
bermudo
quoll
dunderdale
hii
fmd
grachev
schadenfreude
reme
gluconate
gunga
islamisation
wheatsheaf
tronto
snf
xiaobo
fiemme
bodnar
leisler
deodoro
finasteride
preps
wurster
primeau
hamley
weihai
frictionless
persecutors
whas
willner
miti
gombrich
dissensions
heidt
nibbles
maccormick
denarii
canosa
sorbitol
cramond
balki
melman
akizuki
iacob
decapitating
greenlaw
changde
dongshan
dallenbach
nathalia
bagshot
bodyweight
langeland
sunitha
wizzard
personifies
aravalli
stayner
ahidjo
thermos
saca
henningsen
kanban
fonsi
shii
multidrug
corradi
cephas
kitajima
uhr
reinet
smersh
bumiputera
sunbelt
polonius
submillimeter
clevenger
mester
tuxpan
maden
calibrating
moradi
tfi
negativland
delvaux
shiratori
dulac
sfor
saifullah
rpe
idr
scotians
oppen
sincil
garabedian
sophomoric
vbr
heman
brightwell
fleishman
renegotiation
hamiltons
westway
ochilview
schaerbeek
fairhurst
orchestrates
copyists
karki
kipnis
rumson
seigneurie
plainchant
vivere
nanu
hankin
nanhai
jokey
chancey
garlin
decriminalized
arabization
lepe
krane
olatunji
poupée
psychically
nunavik
afdb
resurveyed
andro
qsl
dvp
zzap
compson
macfadden
tuckwell
polidori
zamfara
sju
shisha
paswan
ringgit
pavitra
neste
pierrette
barbecued
demis
shug
pericardium
thermostats
lithophane
samay
seeped
raymer
iin
floppies
manasquan
architraves
syntagma
transfection
saltaire
farriss
mclaurin
petronella
winckelmann
metzner
panchkula
maikel
wkbw
balled
laudanum
tirades
blc
cascia
outgrow
aeron
mahalaxmi
goalpara
martis
aqi
parwan
subcamps
clic
icam
racoon
repeatability
hutten
memoire
norgay
strep
cheh
romesh
spellbinder
bedwell
budha
lambada
checkmark
tifton
tinney
smbg
kearse
environ
sealants
bettye
hasselbaink
uncharged
cetina
olah
perishing
jeepneys
dahir
sadducees
gournay
extenders
annet
inness
iag
dalmeny
clamoring
alizadeh
dettingen
globin
jèrriais
tarboro
amicale
nodosa
comscore
tica
osteoblasts
langland
philosophes
befallen
sneering
trémoille
biti
surmounts
gellert
bonafide
ecovillage
hmmwv
ufs
seule
moltisanti
unperturbed
moinuddin
artmedia
miyahara
fronteras
poko
deriding
merola
guenter
immunized
ngor
gasherbrum
typists
avoidant
dowel
detmer
nancarrow
plutarco
museology
lopo
premodern
ringworm
ixa
lindauer
gesher
tsangpo
platitudes
sawfish
sekhmet
taye
ramification
misshapen
informix
brightcove
easters
urundi
tpt
missus
ferenczi
angrier
bsr
raich
griquas
bronwen
barot
myka
alur
histologic
sangin
precident
cge
barrault
ferg
ludford
caffeinated
yaacov
pions
marieke
ketan
prudently
atlanticus
hoaxing
licentious
jola
salameh
bohs
arboleda
skirmished
carnelian
elwha
narmer
acct
crenata
pleurotus
illicitly
newburg
sarti
yajima
zetec
dorrell
bakes
campione
lims
darvill
freeride
anmol
udai
unquiet
faulks
serafim
righted
hoz
kotto
juggalos
flailing
masonite
abdirahman
varkala
joongang
coldharbour
nemi
inada
dianthus
torturous
pulford
errr
shorthair
cuddles
kumo
espenson
anthurium
lipps
taraxacum
badghis
dratch
flintridge
reccomend
paedophiles
beedle
minch
gue
zoro
homeopaths
yining
ovalle
tct
ballinderry
illich
superga
onkel
spearmint
eppley
immunisation
parapsychologist
minne
bakary
groome
pantani
moorei
tevaram
rateable
matsuzaki
boijmans
pietist
samachar
gubbins
siecle
electroclash
iria
alwan
impious
grothe
wimberly
consecrating
bromma
eglwys
klin
antisemites
panamint
parasitized
rijswijk
lakefield
matangi
drom
touquet
sublingual
charice
outperforming
szell
hakon
henlopen
micelles
kisch
liev
cappy
foltz
harkleroad
kahlil
cathryn
lubricate
threnody
bearsville
tuthill
iou
spasmodic
nietzschean
hookup
barthe
prathap
fantasias
inflamatory
minhaj
barroom
epiphyte
marksville
organophosphate
sundby
lydd
kindertransport
incommunicado
sportsplex
huguette
ikuta
lachance
salvatori
peterkin
comair
mco
connersville
slavonian
trova
materializes
curiousity
rla
clf
sabathia
nahant
glowworm
tooled
ethelred
sandvik
borwick
dears
mals
tarascon
larmor
hodel
gearhart
klong
klebold
damascene
amphitheaters
malarkey
mimed
smouldering
umstead
bearman
bulrush
wigg
denniston
foos
weibel
govardhan
vacuums
kokanee
rnp
kensico
pratima
curassow
corpuscles
portables
hedayat
jasminum
giraldi
somer
waistband
corcovado
chansonnier
ersan
jaymes
uvs
insufferable
bevington
bages
klar
aliyu
bestival
bloodhounds
mercifully
liftback
striked
stolypin
takasago
effectual
louvred
shandon
megalomaniac
hutus
teleology
hammy
tehelka
brazenly
suess
bennelong
masten
steiermark
nasiriyah
ligia
quantock
matildas
embellishing
uffington
clicker
chira
neetu
quartett
issam
dinozzo
ecurie
clued
nederlander
makonnen
osterley
wieczorek
sisa
eggshells
fritters
takla
kumeyaay
millhouse
vasconcellos
selvin
plasterer
mcfeely
molesters
wickenheiser
bandmembers
hirado
laurents
bodysuit
cbssports
dumaresq
aronian
volitional
kinch
clar
jica
groundnuts
gasparilla
balaram
habanero
galati
colloquialisms
fandorin
gelugpa
lyse
methicillin
wier
siphoned
genderqueer
voilà
madchester
birdseye
obstinacy
sprawled
weightman
gouges
commonweal
apsa
tereshchenko
wolski
nuvolari
garis
ikot
sundew
lalli
broomsticks
grahams
awp
wrangle
whitelisting
godhra
justino
effacing
hoyts
massimino
oedema
antheil
sulejman
tangi
thermidor
topal
orbcomm
volkskrant
berthoud
velden
koryttseva
eggar
bartholdy
qana
topley
sandbars
pratama
ostmark
mprp
madaba
chahal
demonize
guiteau
kathrine
blissett
subordinating
sammon
adamou
leblond
sisowath
dorey
groby
vigneault
novarro
alsager
coba
grandville
azarbaijan
scrapper
hartshorn
krämer
chéreau
sugarfoot
integrins
keech
inverkeithing
titchfield
rossman
martos
headhunting
frontbench
deregistered
ectoplasm
reassessing
harems
harten
profaci
arvydas
fabris
dollywood
banega
debutantes
yildirim
quilty
tzur
breastworks
giverny
arrestor
decapitate
rossouw
strategical
viertel
psychodrama
troitsky
uladzimir
kamphaeng
ortmann
ainley
transmittance
hyperdrive
mckibbin
ponchielli
czolgosz
stablemates
sindi
adresses
lancey
bagno
lutetia
solemnis
wbal
schiffman
reflexed
bundesbank
jassi
sni
churchwarden
rievaulx
podesta
eade
kers
boatwright
massifs
pushback
colet
portofino
summerhouse
existe
triggs
figg
facepalm
traven
roelofs
heiau
gooey
enger
illmatic
zhangjiakou
adwords
terephthalate
partes
sinclar
dragonforce
gravitating
antihypertensive
amerie
shivalik
kotter
repents
vassiliev
beaufighters
nastassja
molins
liberality
aizen
sondre
nymphenburg
fauves
mcmorris
koharu
youkilis
putz
ashmead
westergaard
abulafia
afrocentric
hokum
crematogaster
haluk
handcuff
rambaldi
mediumwave
morlock
pdd
guen
güell
bohan
peaceably
seika
payed
tages
kogen
larkham
grayskull
gezi
arkan
lamé
coldcut
mindscape
gorden
settembre
lillis
bergamot
marchisio
mcilvaine
colberg
kushtia
aquilino
peco
spirulina
buffalos
fosco
viols
ferrarese
ragweed
loudermilk
jik
bwp
grazers
jogo
deri
kachina
corbières
dicke
mcginness
flybe
syke
halogenated
lawgiver
breit
cuss
vesti
defilement
tinkers
hornbeck
airshows
knowlege
segreto
gittings
penha
kelham
dicom
bathymetric
llanrwst
kutner
hughley
almo
theodolite
flavell
diwa
serpukhov
zante
ilmenite
reprocessed
hughson
aptn
christies
cappello
bienvenue
stourport
nipsey
nein
scalding
orania
reenactors
akkerman
malleson
amchitka
withlacoochee
danke
scottsbluff
bellisario
janda
strongpoints
demure
ebru
sagem
nonconforming
sml
offertory
biaggi
climes
avett
crawdaddy
externality
witbank
suraiya
naturwissenschaften
kohs
critchlow
grammatica
kishin
catoosa
polluters
éclair
bronzino
unequalled
claymation
poppers
reavers
marroquín
demopolis
mianyang
plumtree
orlan
castelfranco
oros
prindle
hassles
florescu
knoblauch
bastl
aengus
weah
kelmscott
oozes
anqing
baling
expectancies
tercer
shuttlesworth
manzi
ulva
wench
europ
hilder
basicly
bolitho
rinat
doublets
deride
drakkar
yreka
scid
ishizuka
mapmaker
tagliabue
tognazzi
mli
palamu
nadkarni
remarque
anyplace
hyborian
tribbles
trundle
louvers
silloth
technocrat
bernardus
brüder
beautified
stram
petrine
sips
armond
whiteway
cinematics
steuer
carper
maso
hia
paraclete
reshot
boselli
realnetworks
stim
noce
faceplate
hatchbacks
chamberlains
bartowski
shahzada
madelyne
evolutionist
pondok
kogo
laskin
schlechter
saravia
deers
angerer
mazzone
homans
gfx
etu
liveliness
modbury
mcgahee
mowed
chaput
nonexistant
contractility
villarrica
outsmart
aquin
congregating
shakeup
sippel
lagonda
paramahansa
engulfs
bilberry
gripper
sammie
skansen
leatherneck
yawl
sason
poso
disembarkation
marchena
parkhill
sacredness
tejeda
redemptorists
premji
vecht
leonia
lourens
bashkirs
gratuity
styler
stagflation
autoantibodies
abderrahmane
gadwall
detests
wegmann
ksr
tabernacles
thubten
indolent
esmaili
zalm
generalise
lyfe
colclough
idées
megalomania
gremelmayr
hitlers
pistachios
loblolly
mclarens
petterson
cheruiyot
haematology
saima
wordiness
aikens
iee
hoobastank
disdainful
arteriosclerosis
panahi
igboland
healings
estée
hygienist
homma
avendaño
bnt
khs
basma
bazalgette
bulba
comper
kadai
unimpeachable
dbp
kiichi
glf
granulocyte
sheeted
sleeker
doetinchem
scrutinise
reissuing
salutations
burian
grigoriy
exalt
reciprocally
publicis
retooling
colchagua
miroslava
minidoka
wiske
kingussie
caskey
blenheims
narsingh
shutdowns
sidonie
thio
yoshihide
deputised
jillette
handke
tenaciously
deary
giambi
porthole
bouin
seiken
befuddle
acrid
judds
sowa
nogi
armd
maricel
maidana
irm
saperstein
gevorg
microchips
servlet
wason
conteh
docx
janatha
uco
titanate
sanuki
agrégation
porton
lovelorn
pickaxe
faridkot
caplin
decease
ronchi
indrajit
flixton
bergdorf
kautilya
selección
sympathizes
linesmen
plainsong
batey
aver
peñas
streptococcal
ilb
dacko
retractions
mayerling
singhania
tarana
troels
garang
shaef
ardis
laflamme
welz
pilsudski
yakin
serological
benni
needling
ophüls
quieted
bort
przewalski
hailwood
pépin
glasson
mohair
cockles
winckler
lettow
louisiane
arng
shaku
lucidum
dalea
nanook
boardwalks
artform
forgan
gameiro
dunnett
lpf
steg
diamantidis
salutary
pipped
mowag
blueline
pkb
edmontosaurus
girouard
tournier
cunnilingus
krugersdorp
crowing
skarbek
mumbling
geoengineering
samuele
liquified
kaling
troponin
henao
yiyang
visors
triumf
viersen
gerusalemme
netter
piute
palk
nougat
scaramanga
sesil
mcchrystal
ramel
smokestacks
psps
ackworth
logit
pujari
invergordon
elmet
terracina
failover
chena
yolngu
prêt
photographically
bandshell
popularisation
whipper
yoshiro
blackball
ajar
ufp
zarb
stadthalle
taneyev
grünbaum
grigoris
rots
bruk
phin
reattached
datalink
parseghian
villarroel
ramadhan
lèse
sequoias
ewer
morinaga
rsx
whitewood
annoyances
soloviev
wootten
governesses
multifocal
chachapoyas
eave
delavan
sloppiness
legitimised
maisel
svenson
nais
nicked
boyett
cyclorama
gabino
undressing
sobs
avey
luzhou
santería
peeves
anzani
rhoden
acra
duffus
yossarian
southbury
saam
cuoco
eilers
reenacted
superintendency
guma
charango
treasonable
latu
mitsu
achenbach
tahan
ryoji
verkerk
reproductively
appliqué
musicus
obihiro
meen
shalva
rtt
doiron
probyn
indignity
pricey
chernenko
audios
bransfield
isna
tetbury
mischaracterized
thyssenkrupp
mcindoe
mimmo
sarov
hemorrhaging
gerrold
kestner
fightback
roff
neurodevelopmental
psac
arques
girdled
guis
addax
defers
bridgton
breitner
reoriented
bybee
extirpation
inl
potti
tudur
resumé
gunda
ladbrokes
smarties
classicists
fantastica
bostan
rondon
alexisonfire
danity
riyaz
baggs
cresting
brandywell
chrysanthemums
suspender
weirton
sibugay
ancoats
vertov
corno
wbai
saltworks
kratz
nccc
sparred
spencers
winfred
mykiss
bilko
absecon
pinero
heybridge
evolutionists
jse
wakely
timken
speeder
blacking
thorley
patco
söderström
molay
crafter
migrans
avchd
weddle
mcn
balas
aliant
bellenden
lasser
canario
wmds
canoas
gva
whymper
ascites
touma
lovins
skullcap
bodyslamming
wao
platanus
ilderton
procumbens
wuling
shoppes
verifications
paulos
rantoul
jaitley
gloversville
radim
fossati
tupperware
cecco
florez
squirts
unicast
brainwave
nazari
kilbourn
kebabs
tanguay
homemakers
leifer
microalgae
kubla
xiaoli
energizer
spaeth
wiedersehen
owosso
somatostatin
rtn
holidaying
rucksack
katherina
adg
boysen
thoresby
iihs
tomorrows
bangash
najma
kroos
disbands
voegelin
higueras
maestoso
secularised
maresca
yoshitsugu
polesitter
duart
lambo
simbel
finborough
epidaurus
balen
zir
vaganova
fordson
hotdog
brezovica
cooperations
mirsad
abashiri
perishes
runnerup
sékou
lichte
fixx
genn
biedermeier
fretilin
ituc
cordifolia
magos
coahoma
fluoro
vexing
knitters
bowater
stata
huichol
absaroka
preachy
discredits
mustonen
mathematic
peripheries
holsinger
kemba
edexcel
cpas
pinon
casque
uefi
edes
aurita
gogarty
obst
purism
cfn
hellerup
wardley
ridolfi
soldiery
inflect
hkd
mckiernan
hoan
dzhokhar
witi
eal
chilo
suat
cesarean
breconshire
transpire
loughner
makerfield
labelmate
sylvio
philadelphian
bangaru
purgatorio
chamblee
ganado
lusophone
procreate
calor
cantera
naki
segued
vess
spen
maltz
ernestina
heckle
riveros
inchicore
arteriovenous
francisville
pearcy
misstatement
fsr
mintzer
quercy
avison
wbt
krikor
larco
yougov
sokolac
schmalz
foulke
derogation
trampolines
orangey
mutis
sanchai
ekranas
hospitalisation
kalinowski
aldeia
baynard
slavica
nfr
placers
optoelectronics
greaser
farewells
daina
quigg
navalny
rumiko
kouba
lapponica
octo
clumped
dynamited
murrelet
tehrani
crosscut
biala
retell
comradeship
icicles
archy
bronc
marcoux
alstyne
regale
canelo
bili
shaddai
phenytoin
ethnobotany
emon
sexed
kozma
kasbah
westerhof
danyang
muckraking
divinyls
hermitages
betterton
salafis
byles
sgm
pten
kamaishi
consoling
bondevik
lavant
jorginho
zahoor
kayleigh
windass
pileated
cobar
irritants
demagogue
ragazzi
diaphragmatic
leyburn
cww
raymund
firebase
dayananda
whittall
pierres
speakman
transgressed
haggar
aluko
puttin
sportsmanlike
lymon
katchi
lemaitre
iosco
valedictory
keflavik
skiddaw
magloire
ulb
prineville
reynell
mainzer
withholds
diplomate
greasemonkey
miscreants
lieux
jayaraman
youssouf
scoffed
shoebox
ngn
hazar
feiffer
unhrc
webdav
nygren
algren
haymaker
arwen
agencia
gledhill
jigawa
jaiswal
durkee
southdown
yavin
vrij
dietitian
dalyell
tylenol
casterton
rhps
anb
sexto
stereotactic
elastomers
liles
mzm
samity
kotz
ragamuffin
revelle
tini
lunges
reconsiders
princedom
tooker
gimmicky
monreale
oatman
refiners
cooperage
emigres
huk
jägermeister
rehabilitations
lehmer
kubra
iller
manera
reframing
clorinda
mouret
phetchabun
lastman
gering
hobbled
bicyclist
ghaznavi
sigal
fuhrmann
nace
padiham
lavan
frattini
hoskyns
hungária
fahr
computerization
coffer
appearence
stribling
chaikin
martynas
hesham
sard
leibnitz
abhaya
ferriday
interdepartmental
bettor
campesino
bishara
socialisme
contortionist
limoux
teesta
vola
okereke
owais
beltrame
mocenigo
interspecies
wayzata
setsuko
cycleways
modifieds
rinko
blackening
buskirk
impe
repudiating
labine
chillers
atreus
hagop
blued
jaffray
canadas
dumbartonshire
nsaid
aken
visp
cholla
stallard
tuominen
nonet
bellick
persuasively
macdermott
pantelis
kela
mahbub
bilstein
korkmaz
keesler
menas
yorn
skerritt
tuckey
ganden
loansharking
nandrolone
sahra
choquette
bibel
ahve
dayana
evliya
bussed
interferometers
rosenkranz
namaqualand
atropos
moncayo
carbonation
harkat
kippenberger
stranahan
jfa
venturebeat
stob
transoms
qaa
voortrekkers
vernaculars
nicaraguans
monticelli
xjr
canas
deucalion
rappin
berzerk
hinchinbrook
peterhof
gaven
ezln
lamprecht
kidjo
hedi
cumbre
kavir
drawl
severny
leavening
boho
zoologische
sidibe
soldaten
intertitle
storerooms
jdbc
bbe
mwanawasa
bondurant
peramuna
tuite
dallam
cft
vapours
cordner
dli
ethnics
fugu
reactivating
phragmites
rifat
bowmen
kulikov
inara
custodio
azione
akshaya
hatchett
shead
clonakilty
matlin
tapeworms
peyer
thring
hamma
caylee
benyamin
queercore
allamakee
branscombe
teifi
wobblies
bather
siciliani
kuwabara
koester
dimeo
kudus
klugh
magor
kaltenbrunner
vandel
venkatraman
orderlies
fuka
abided
hoak
refills
ceramide
gregan
smither
obliquity
levator
krister
leukotriene
heightens
drost
antitumor
sonography
aktuell
moreh
zoho
tadayoshi
dola
kakapo
taymor
xanga
crêpe
noss
bronckhorst
jiffy
cretin
tsongas
kaline
neneh
valtteri
moneyball
balancer
kosinski
malki
lyneham
buzzsaw
corazones
hollaback
newsround
chatterji
kirkaldy
thorgerson
squandering
subha
ipos
clarinettist
monetize
zalewski
kullervo
altschul
pudendal
opportunistically
charbonnier
symbolical
biss
barbro
shirky
meteoritics
farrant
brosse
seliger
wiene
hukou
brading
dieckmann
stoking
onlooker
torques
bue
furuta
reponse
fnla
tolmin
leamy
matapan
ornithischia
emploi
cras
decodes
monaca
kenesaw
buckton
vijayaraghavan
zhanjiang
zaharia
fissured
ginde
gillow
piera
lateen
cfda
hameln
bungling
dismasted
tpd
rufford
snowpack
friedhelm
dogfighting
sierre
tebaldi
kennicott
mahendran
eram
eban
hdpe
tuvia
teviot
neilsen
lumbermen
temenggong
majesco
mattachine
zaharias
modugno
kulaks
majumder
diskin
rebecchi
linyi
extrasensory
diskette
kasserine
ysaÿe
reproaches
cosponsors
vanderjagt
hlaing
ashcan
pharmacokinetic
aiyar
unsuitability
kemsley
cardy
shibli
curcio
lince
ismaël
mulvaney
ascender
opn
tulipa
culhane
anterograde
martlesham
fontvieille
mayank
relinquishes
eithne
saharsa
laterals
flavorful
questing
keizai
vmro
tiangong
domestics
turbonegro
inveraray
campeau
compagno
panoply
yol
loneliest
prostration
rothbury
shrieks
scabs
sgd
douthit
llanidloes
stonor
concomitantly
mirco
loth
testamentary
gladiolus
strauch
bribie
sorna
kaempfert
pawlak
quadrupeds
isleta
masser
unction
volar
macleods
follis
esalen
viognier
interruptus
ruk
evolutionism
bulova
frit
irregardless
perham
senckenberg
imbecile
kost
spir
tuffs
hobos
chimeras
fassett
kast
toshihiro
kolarov
validus
ellas
quale
gordana
ijmuiden
puru
danseuse
ottaviani
arioso
landslip
khoa
iolani
globules
blackmar
atascadero
mccurtain
fuxing
gams
uruzgan
lubna
downshire
cracroft
margaritaville
pecora
yos
shailesh
realigning
iio
umbels
highmore
deliverable
bunraku
rodrick
immanence
borrelli
wtop
carbureted
privé
miscalculated
recriminations
campylobacter
dripped
zhob
garia
lakhan
meningococcal
weingartner
byword
westhoughton
stevensville
barong
petzold
educ
svb
apfa
acushnet
shinbun
singularis
shaeffer
nce
bopara
odder
eugenicist
opryland
bni
azuki
jajpur
nationalbibliothek
aleck
rmm
rubí
pawlowski
lazarenko
catv
shoichi
coalinga
endara
mychal
maskhadov
keris
zera
elastomer
broxburn
bradt
takemoto
reschedule
kolesnikov
outmaneuvered
falchuk
fariz
brenes
newcome
malkmus
pavey
kuncoro
raintree
litten
hustings
charleton
behinds
statesmanship
pervaiz
prioritizes
biocontrol
patek
mattapan
betas
sajjan
feelers
apk
morbidelli
suppport
seines
arinc
mulsanne
triumphing
arijit
westcountry
haslingden
goanna
ezell
makhmalbaf
cassiar
hersholt
programmatically
fiche
beckoning
siple
riou
kulp
demint
pjak
ditchling
huat
brumley
lutfi
mcdill
jörn
katzenbach
laminin
anuar
suni
beaujeu
trespasser
pisi
sherinian
hierapolis
hermiston
ichthyosis
ryoo
phai
kwee
busoga
boksburg
cordiale
oceano
greinke
facetiously
henriot
rovio
filibusters
nurseryman
udhampur
rens
metascore
scarecrows
sayyad
grimley
preseli
seweryn
receivables
bonga
magnesite
toninho
sadia
bws
wla
ethnical
atalaya
garp
rohani
prickles
sadko
tnk
mossadegh
geissler
nja
trussell
twaddle
pagadian
koreas
corm
skousen
swipes
materialization
stanway
kamakshi
econometrica
keratinocytes
abhorrence
preinstalled
loper
cernat
lave
martenot
elderberry
delfina
lovie
diabase
quirin
dobby
lexicons
cousens
kleber
bartos
editorialized
shinjiro
jaret
isaías
hilburn
frilled
jochem
patekar
bodyshell
cosco
okami
esca
chalfant
eystein
slaveholder
kayal
kriens
lotions
hungaroring
mattes
microblogging
zinnemann
holwell
slk
kuusamo
dorothee
dahlan
veiling
colocation
kruskal
pnu
matriculate
kimsey
hartberg
findlater
biophilia
tedd
hoopla
khotang
periodontitis
feedstocks
induct
ferre
controllability
fayez
multiscale
cruce
hamadryas
mattea
composited
saudia
demonization
broadness
lesa
meca
mirfield
maderna
jum
humorless
mapei
technorati
amil
parian
gdl
enmeshed
mitja
slutsky
fastlane
totoro
llanberis
blaustein
bwc
tinos
piñeiro
railroaders
charitably
ehren
araba
spotlighting
nasm
barrino
bojonegoro
homerun
jambalaya
poyet
otterburn
quintuplets
wino
bolliger
wheezing
mythe
tallahatchie
mariela
groener
cangas
drp
scheler
nithsdale
dasan
busboy
rejuvenating
sspx
natsuko
delectable
carnforth
cockrum
jowitt
captiva
ytterbium
abhor
naroda
klepp
eastchester
henslowe
stalinists
cele
hoffs
alexy
bich
robocup
hypnos
tokaj
soderberg
vivar
twardowski
estep
laplante
chambering
nursemaid
aticle
fleiss
garces
porlock
reinstates
brucella
chinnaswamy
gasset
hornos
bagheri
gip
ontiveros
piltdown
caner
tharoor
sanneh
beiteinu
vividness
skilton
pilas
bdt
syah
matveyev
cressey
curtailment
sollentuna
hominis
surefire
sulcata
astrium
bitmaps
rejoices
monoceros
dysphagia
irresistibly
bruggen
heini
domingues
zhenhai
rodi
enablers
whishaw
hrp
meandered
harrer
hackenberg
treng
mbd
ramamoorthy
hoku
civets
biosystems
bushing
interpolates
pricks
underfoot
aldergrove
desalvo
platers
eraserheads
lafourcade
slob
terzi
shutt
evandro
oarsman
lactase
serendipitous
rwb
ridgeland
turka
ingleton
terschelling
prancing
apayao
asco
elna
saddlers
thaïs
summitt
anping
impertinent
kolok
fanged
nto
dominators
defacement
pagán
mattheus
idolatrous
zlin
coronas
overeating
messageboard
panjshir
tase
reassembling
mbit
aldhelm
crevasses
racecourses
qalandar
stiers
ingoldsby
canonic
hutto
nyland
preda
sicko
depersonalization
byo
willunga
dynamometer
thayil
abuna
poyser
hattrick
ribald
zinder
bandoneon
zipf
frf
asheboro
testy
orifices
brandan
decriminalisation
cityhood
osma
itai
fuyang
heliocentrism
decrement
rissa
yeovilton
moneymaker
locksley
luddite
capac
beeline
unobservable
offshoring
foretell
capi
coenraad
mouskouri
noack
korban
lynes
rockslide
bretschneider
basilique
forbs
bracha
afscme
fishmongers
azi
grane
connivance
abiodun
autocomplete
watermen
cerca
brug
heere
liaoyang
gwenn
kume
pantelleria
netbeans
bessette
bishnu
fdb
squillari
btt
trueblood
stanislavsky
hänsel
pichichi
gaylor
ellman
orinda
freewheel
midriff
reinterpreting
clery
lionfish
woolford
marjoribanks
larrañaga
enunciation
harakat
strugatsky
wiltz
bebington
dahab
factfile
karisma
klien
millom
clearness
ottesen
biren
cushioning
wakeboarding
tryggvason
perrotta
frahm
mols
mystifying
katich
zall
vigevano
benzie
gaura
mabinogi
scythes
lasha
lamborn
kingi
haymakers
zindabad
griner
luisi
gilliard
stivers
stingless
blowup
celbridge
maik
scaredy
miga
exhorts
traquair
mcmenemy
perfecta
gugu
nefertari
aicc
musker
indolence
saturnian
florentines
adelphia
fronton
fhl
creem
newschannel
isen
radishes
mandvi
jisc
lithographed
mynach
gurun
rondane
regino
ultrafast
davidge
tlb
igra
eyüp
shalem
torne
caldo
dietetic
kazantzakis
albertosaurus
ladyland
goyette
abish
blacktip
sangita
masterminding
zhirinovsky
agriculturists
bagni
relix
devoy
hoverflies
medwin
areola
glint
annualized
semiramide
keynotes
forestalled
nbk
wroe
bezuidenhout
dula
mcgibbon
bonneau
squalls
kornberg
lowder
nissa
koenigsegg
bicton
bonamassa
tardiness
marquesa
sagitta
dahlin
mizu
freebird
foye
blowtorch
vsl
sixt
dunums
charmingly
ridsdale
kangchenjunga
carbamate
carragher
redacting
abh
nautica
tinting
timebomb
tomoka
reif
exclamations
kinghorne
rhames
sumlin
accentuation
gritstone
khalifah
bendy
yagura
zakho
haruhiko
stanshall
mcclory
understudied
banerji
marianus
shinshu
sukhumvit
marmaris
kabbah
akhilesh
apod
pihl
marineland
underclassmen
stt
caldicot
nasar
strandberg
baumholder
tgif
manobo
dvm
bleddyn
capsular
kintail
rdna
tinamous
confidants
petrovski
plaited
offord
counterweights
ugandans
ibook
sils
chava
ngoma
shorncliffe
siletz
floaters
stupak
posthuman
laryngitis
carryover
agila
scheck
rahl
waypoints
sandpoint
amv
basher
whitefriars
recirculating
feis
untrusted
lausitz
kumho
instigates
bensalem
kimbra
oilseeds
harrower
dodder
tomioka
duden
zula
lechuck
leder
souffle
deshannon
vulcano
valcour
grotesquely
pinglu
phillipa
springburn
perversity
rilling
shebang
autechre
debarred
newsham
oai
holbeck
anglerfish
gandolfi
sacristan
geraci
lollywood
nazeer
godwits
loog
givin
gabbar
cheick
minu
fabulously
labov
gweru
disruptiveness
belzec
vannin
tucumcari
daim
sieglinde
shinkai
kaley
vukovich
shawmut
fica
hydroponic
influencer
pindus
cascina
kübler
tuckett
ercan
nonacademic
mastitis
manfredo
mayte
adani
boumediene
wakasa
mml
marabout
stefánsson
elsey
betrayals
plinths
hurstpierpoint
starnes
torrentfreak
blokes
pnt
synthetics
eyjafjallajökull
gamay
screeds
fatuous
benmont
jackhammer
wellbore
galvanizing
kahal
batterie
ocellated
redesdale
forssell
ayotte
peeks
pierrepoint
diggin
azz
montoro
exculpatory
lallemand
milkshakes
phagocytes
buggles
caruthers
daito
strack
malva
acsi
blo
cvm
niculae
sabyasachi
marlette
sterry
hügel
simonon
upgradeable
träume
bicolored
traxx
mcelwee
moosehead
peaty
bellhop
lukaku
coccyx
praag
horbury
snipped
studdard
roussos
jantzen
ctn
pedernales
yuzo
changzhi
rutles
rbcs
rootless
americanos
smil
ursprung
disha
telmex
magennis
yunupingu
farhi
collisional
barneys
lami
cudgel
troggs
syzygy
gamblin
frithjof
girdles
chup
disincentive
veach
adventitious
hoven
buddle
churchward
horr
reichsmarks
laas
sportscars
scarabs
nontoxic
cll
ramamurthy
jale
pierrefonds
grupera
skeat
recurrences
waterholes
uveitis
falzon
dronfield
subsidiarity
tonny
digitata
unpredictably
cimbalom
anglophile
kabbalists
bukovsky
anesthesiologists
ghostwritten
harby
mapuches
instants
supergravity
macnicol
interline
ngl
beauséjour
dedman
aias
bomis
marciana
prelim
savak
commercialised
abbotsbury
tunga
flybys
venkataraman
hows
microlensing
perrysburg
blech
throught
amuses
winkleman
ubb
gamedaily
totowa
loto
hanni
ordon
eubie
redington
norns
kinesin
wroughton
nostromo
strad
posad
divac
nazmi
matus
orio
sarit
aviemore
volkmar
burpee
edx
quiere
pku
santas
chelios
zuloaga
manja
verfassungsschutz
foremen
fux
valvular
gidon
muskrats
ktp
alleviates
truelove
interconference
khalq
lakhdar
mures
gynecologic
yedioth
unpacking
gorbals
erhan
umana
wulfhere
bonum
bearable
zinger
woad
berna
hosa
foamy
jobbing
ensnared
sheremetyevo
reconnoiter
whitehorn
wht
salmiya
binny
currants
arten
yasuyuki
suka
eavesdrop
jina
tomcats
mvt
circumnavigating
barcarolle
murli
mccutchen
butera
vodkas
mangotsfield
prenuptial
wurtz
leftfield
scramjet
hiroto
crumple
photomontage
smail
cheektowaga
officious
kishen
muckle
ringsted
riess
dimasa
shikhar
sidelining
remapping
karima
kozluk
macewen
stygian
boons
belch
coas
batres
yuca
pahrump
inver
luganda
taplin
concocts
oles
budimir
plasminogen
damu
epigraphs
pismo
duncannon
kristel
cherif
wira
htp
balmont
thiocyanate
shunji
terrorizes
ficken
pjs
borotra
axworthy
tortona
crosswinds
raiatea
tunnell
coatzacoalcos
cristofori
clarington
kuranda
anemometer
minersville
agyeman
countback
lacanian
braley
metonymy
gedling
uygur
neurogenic
karcher
duduk
drt
thisbe
anspach
windshear
sould
hovell
valliant
saeid
bredon
extremly
obstructs
bayport
fowke
meglio
immagine
loveline
murrays
erv
servos
makhlouf
mahato
streptomycin
mcmeel
fairmile
isengard
kapor
hellstrom
periscopes
clavijo
mindstorms
catherina
chappie
tsoi
maunganui
travailleurs
oakeley
sveinsson
hyperspectral
evince
spofford
shortt
gebre
fonterra
dosimetry
wiliam
rathenau
kalkaska
barg
nccaa
aelita
wqxr
liveried
colbeck
borsellino
liping
steig
grandstanding
kaare
jaren
hnd
ifni
delacour
stagnating
maff
boyan
terenure
komachi
constanta
coralline
cantante
nwda
takata
benedicta
swt
corde
saiki
blockheads
hamadani
compulsively
debilitated
pnd
bichir
ramnath
espnews
royden
pasties
ekg
wasif
atcham
lij
murtala
sarton
faustin
kasturba
krawczyk
catecholamines
giovinco
betters
beltsville
exasperating
hamidullah
cipa
cait
lario
bullish
bele
emeril
reimers
outlays
amoebae
wholesaling
hyndburn
navis
oversights
takuro
stonechat
wynd
dhirubhai
katinka
asprilla
schneller
omap
vandermark
elektron
plusieurs
remembrancer
papadimitriou
swabs
hta
cintrón
shvetsov
kamelot
seraglio
vima
kibo
congenita
holker
shunts
gilbreth
powel
stok
occlusal
loango
reinier
superdelegates
bigler
dollfuss
allais
constructionist
binyon
yawata
bonnington
goslin
welshmen
karis
winterset
kosei
armbruster
cyberman
uighurs
shoos
manoeuvred
drozdov
eszter
madia
romola
hammar
gourami
rosaline
indicum
yng
pista
retouch
synched
hailu
sarracenia
brautigan
fmedsci
kozelek
bayo
margriet
currumbin
pne
hege
darknet
brahmananda
avowedly
papunya
byatt
stovepipe
roba
cedros
fmg
wmca
ejiofor
cellulitis
narvaez
markovic
englisch
kokrajhar
bumgarner
detestable
lamotta
sciencedaily
kalari
enchant
youm
kuriyama
tootie
vassall
decoud
readymade
asadi
damaris
nadav
danas
embouchure
doune
teus
masseuse
horseless
chiswell
kaija
bicornis
metropolises
gutteridge
niihau
chromaticism
uth
chouf
agenor
avium
schreber
outgassing
carpetbaggers
postmarks
heimlich
insectivore
plebiscites
roundheads
hallinan
administrate
jedwabne
oken
vilsack
battye
³
oppinion
greenside
achan
hannett
nawalparasi
tingting
dourif
dlx
legislatively
nagraj
gildo
lagasse
gedney
nordmark
gause
selmon
reoccupation
dalgety
gna
foreclose
favorita
borohydride
circuited
karekin
portly
barrancas
couse
keyarena
thalys
cupping
minbar
safia
floria
fortner
perennis
halm
doze
crummy
kaurna
recieving
oughton
coattails
venuti
kanshi
kuhlman
pharmacologists
raley
prachi
cachorro
kete
fishmonger
juster
relaible
hsueh
shimo
olms
weinbaum
shirebrook
lascaux
busybody
mutilations
wonkery
jorg
momus
ludhianvi
brokerages
bodhran
granat
sasktel
mascherano
mellie
guesthouses
prosocial
spokesmodel
pulcinella
warlow
perused
garowe
salgueiro
richert
kipner
hilali
fumo
szekely
maquette
katmai
wmap
barataria
proffitt
starker
divs
birthplaces
hannemann
ashville
dhoti
broking
nordhoff
temnospondyl
hengyang
warbirds
skolars
diverticulum
thielemans
chenault
kps
kalos
nettuno
personals
polyvalent
ehrenfeld
rhumba
ranuccio
victimised
hoppin
hackle
yellowlegs
cuscus
arnprior
patrimonial
wharfs
sistemas
kazaa
copal
danila
wuorinen
bellbird
riek
jeroboam
carfax
debenture
beckoned
povey
ygnacio
pucks
ioannidis
multivariable
habbaniya
krofft
aircel
igual
palam
pinup
duka
siegal
seme
chanticleers
emancipate
ferriss
minin
ajoy
simak
shantz
ditlev
brémond
xxxxxx
gothia
lampoons
takhar
measly
ratsiraka
paha
wets
famu
aquaticus
gilfillan
enliven
slbm
bessin
outhwaite
proprioception
kinner
rasim
toshima
tsuruga
surmount
hogenkamp
ducting
saddens
toilers
caped
redactor
karski
aph
pht
moone
cyanogen
tucuman
martaban
xenos
ldv
preprocessing
cartilages
lathom
withdrawl
tabori
tamburini
héloïse
eknath
saward
beara
amanuensis
quinctius
berney
kernaghan
avait
aylesworth
urologic
hardtack
curvatures
doctrinaire
kumkum
iulian
likert
greatschools
tishman
perrott
tambay
essec
varg
mmb
beeby
bitte
inpatients
modafinil
keila
ultimus
majnun
kubik
maras
yoshitoshi
infinitesimally
fatt
pakula
scarpelli
cartland
disbursements
pouched
niblo
basshunter
bosra
triplicate
lammer
pêche
juab
sonchat
biru
olio
gogoi
shareholdings
herberger
mmda
cesario
trx
pensioned
standaard
kpix
anastasi
uglier
southcott
kelani
reformism
papist
chernoff
glottis
manaf
tveit
rainger
gerb
flatworm
batasuna
photodiode
deberry
fmi
supertall
wainuiomata
ceccarelli
hydrofoils
downloader
nerio
revolutionist
buner
spacings
subchapter
andoni
respublika
antimafia
dollis
planalto
demas
arnel
carlucci
kiyosaki
eynsham
procopio
kidal
ferrall
dogpatch
yancheng
hawthornden
nydia
osmo
outgrowths
dowlais
heward
snowplow
jenney
breizh
papules
middlebrooks
debus
magnetics
szigeti
matchett
mutineer
thighed
arizmendi
patchett
pery
forsee
mullett
huot
pathetically
patchen
brize
gouri
cinzia
ponferrada
perforce
vivarium
sunniest
sonepur
lawa
hauppauge
correctable
cani
shal
zags
kumaraswamy
wttw
alevis
disowns
glissando
bokhari
cuf
sziget
emperador
attanasio
krige
tulse
knx
brienz
purifier
alacrity
genotyping
blab
welsch
woodcote
zubaydah
topmodel
sulman
lucchesi
hisamitsu
bbg
stanwood
ardian
jacquin
tarif
lecuona
butoh
mlr
jordans
heydt
immunologic
ahistorical
eremenko
mccone
genetta
overreach
mahalingam
dsu
dutoit
silbermann
cornets
gorontalo
shiroi
statutorily
tybalt
solarium
ragtag
georgiadis
gulam
harmonicas
rushcliffe
michaëlle
particularily
siavash
exploratorium
sartorial
vicariously
trended
nbt
eldin
elevens
tormentors
kuga
stoma
threadbare
venturer
myke
maturana
claverack
offtopic
ibp
mmg
wadding
chingy
bradlaugh
schrodinger
eleanore
plaice
pensées
houari
abegg
gallerist
circolo
witchhunt
adeel
conseiller
evaders
ahumada
villeroy
cornetist
filipacchi
mittler
kiyo
mansbridge
fagundes
leonhart
altro
neccesary
graving
okello
cuckfield
addario
sharath
parrotbill
calibur
madding
speedie
tindouf
santilli
convocations
snowfield
whare
eliason
witless
mcdavid
eyeless
ramdev
bergamasco
aql
barberi
baiser
avenel
lona
faik
bamboozle
kebede
nutritive
wingnut
spilsby
gunfights
mellowed
suor
hauk
vegard
rightward
kapilvastu
nogami
penalizing
juif
histopathology
kameez
hider
untainted
sartor
nereid
gddr
masina
pannell
xis
figueras
decoction
kolstad
emek
kukri
pedrito
tangy
corundum
myddelton
furrowed
hartig
newsline
samuelsen
unrecoverable
aboyne
bashevis
laneway
blak
jinjiang
jonna
babyshambles
balustraded
doens
winge
surgut
haneke
gordonstoun
icemen
caracciola
rmn
grilli
blakley
cyclingnews
abodes
kangri
mufulira
cutesy
henrys
oistrakh
jodo
lightbox
bullfights
sealab
shannen
bonsu
gnb
gallucci
hookups
lango
unocal
pomar
montañez
surjit
niente
burrs
praya
sharyn
leishmania
riverstone
thunderhead
hostos
savinykh
lothario
swettenham
lsv
micronutrients
bdi
oxana
kapi
earthman
sarfaraz
ranil
mitrokhin
apco
akgul
sweety
bwlch
dongfang
rosman
wenman
leeton
ruapehu
bonucci
petronio
jagjaguwar
área
zipra
luxton
invisibly
boxford
groep
suspensory
sunnybank
inquisitions
turkington
bosaso
citadelle
dingli
halevy
langi
kinzie
menasha
myfanwy
coronets
zenker
uncirculated
wachowskis
arcelormittal
bozell
hri
bukka
dehumanization
delanoë
aof
shimabukuro
ironies
martirosyan
prophethood
rösler
destructively
bartholdi
sabar
merkava
wynnewood
haid
brayden
taufel
marchesa
raki
amstetten
tiptoe
lunaire
cabanas
machakos
eyadéma
palmeiro
pevensie
marciniak
manteo
fondue
perversions
cornford
childhoods
turnoff
labute
eastville
tarasova
drozd
dramatis
durum
shadyside
gormenghast
auroras
kalil
ailesbury
phidias
kalinić
redundantly
gambela
cauda
povo
crèche
shrivastava
osservatore
fortuitously
gatesville
inishowen
veltins
couped
szymanski
mmd
yamoussoukro
westerham
laupheim
evernham
delpy
warrenpoint
kamina
aldana
sbe
shallot
herbalism
indesign
affan
krishnas
riesch
dowries
piatti
rawley
muret
adwa
daar
invesco
othe
mallinson
pulldown
moesha
chaurasia
petree
coleford
pantheons
jonnie
trochanter
aromatherapy
sproles
glatt
quadrillion
chiellini
rosaura
guti
hualapai
guston
obsess
bowerbird
santry
norquist
xilai
codey
improbably
francoeur
urf
prefigured
polizzi
kilmartin
mitropoulos
steinmann
torstein
axeman
csh
loveliest
kahle
malegaon
tyc
hypogonadism
bide
beker
couzens
maemo
leofric
onan
sholes
kindhearted
shango
collingswood
ferrone
alcina
neuroplasticity
mahru
hyperventilation
yushan
plagarism
akela
hirschman
willen
fanatically
automator
yeom
biljana
kini
aggrandizement
sacasa
efd
hougaard
halite
nerina
ballades
queluz
opi
hawtin
chandrababu
freida
briz
borromini
badness
barolo
milverton
mckennitt
rearmost
shain
penciling
huli
unashamedly
binchy
allusive
gaffigan
luteum
muso
dramatizing
chihara
schoellkopf
saphir
zanussi
fringilla
lutsenko
warding
remedying
arteriosus
golay
sloppily
scandia
mccreight
kensei
urfa
peseta
plummets
cydonia
fugal
riese
kubu
devis
yoshihito
schip
mahanta
lyonnaise
kavkaz
soloed
toka
roesler
salvor
zoffany
djoser
inducements
chukar
fitzgeralds
bkv
poot
vsd
mildest
mullahs
snobs
anticoagulants
persevering
hbos
unglazed
rachana
underestimates
mexia
tevin
marinov
encke
annealed
bedel
cranham
hyam
principio
jolanta
dipak
mayak
meilen
keqiang
gastineau
milligram
magico
medtronic
ebonyi
sorcha
graffito
verhofstadt
frasca
kaido
lanfranco
dolma
archerfield
dennys
borie
brettingham
wkrg
csepel
constricting
equitation
jafri
nozawa
samana
strutter
beardless
unbilled
fullmer
mainstreet
shigeki
blackcurrant
wuwei
brailsford
tonneau
dogan
angelle
duncanville
keytar
montanari
génération
shibboleth
scaramouche
eggleton
cht
wiscasset
barbarity
bhati
yal
superintended
daishi
teruo
ilog
naivete
leaper
lilyana
nooks
horii
urlacher
parkash
detaches
cavaradossi
spex
bandundu
caris
danieli
infringers
midpoints
jauhar
shlomi
umwelt
arsenije
etan
maguey
ghauri
tietjens
hirobumi
misano
ndongo
pancholi
crg
gurls
corsaro
jinsha
ohata
storytime
christinna
melchett
bankim
antiarrhythmic
nips
azriel
cyclophosphamide
shelah
eleftheria
jagr
nelsonville
combustor
shuya
arbore
catchweight
balor
zdenko
slanders
agan
cambium
toshimitsu
vonda
godowsky
xianfeng
enac
géant
falangist
lanl
soochow
inverell
physeter
spang
argentinians
propyl
comps
photocopier
sancerre
rockfall
wtae
mallam
dahlen
deansgate
paves
fxx
gusher
tiaan
charism
alula
rossmore
squarish
sallies
kfw
fowle
demonized
aegypti
cortijo
shallowness
romualdez
arachnoid
pocketful
roadworks
europeo
ammu
hetton
coleby
stricture
sublet
viraj
maino
morgellons
befuddled
twilley
qods
brutes
counterfeits
neun
gasa
lysaght
chinh
woffinden
spx
noires
comencini
heyworth
oreja
kyl
dami
barve
threlfall
kanyon
benedetta
adiabene
scarps
extremo
rainstorms
kanade
siping
foresta
condesa
inas
thau
unapologetically
isopods
rennert
curios
gauck
cunnington
corleonesi
colonising
verbier
hutterite
razorlight
pangolins
unsprung
beget
kallon
kroenke
bustards
brynn
repetitiveness
shuker
dunga
caselli
europol
tirado
chicxulub
bridgeville
acar
debauched
cheerfulness
seigneurial
peeress
occurences
eltz
resubmission
teno
delors
eprom
chartists
deliverer
ilf
wendl
malinche
wildcatters
watercolourist
kahlan
aftertaste
denialists
goalpost
shahryar
coochie
rathi
disappoints
megathrust
similes
daisley
kaew
galia
formalise
kaat
sial
heists
ffp
kinser
ryoichi
caretta
shtick
willebrand
megillah
dielectrics
olvido
tombaugh
impregnating
wilting
sokolow
mehrdad
veii
freon
rhyd
vimukthi
hyraxes
statto
recompression
unconditioned
birders
killdeer
jpegs
manthan
lcf
ovando
translatable
birgisson
compuware
backe
southold
norvegicus
protagoras
kidwelly
wristwatches
boehringer
marinescu
doughboys
lef
matchlock
counsell
brahmanbaria
cik
kyte
asprey
mancunian
johncock
andry
athen
helman
edam
yongkang
miandad
mcgillis
juge
dikshit
widodo
sugarhill
rebello
monohull
prokuplje
qnx
kronborg
portages
harrasment
peplum
interposed
salone
pask
ethier
kingstone
dring
prizewinner
wrottesley
reconnoitre
revving
tolbooth
sene
waisted
valdivieso
ignacia
mentawai
mansilla
chapeltown
nikolskoye
brainless
floras
senegambia
blockhead
lyndsay
menn
borkum
malmstrom
envenomation
wheatgrass
encrusting
senda
anitha
rakers
helly
mahua
hajer
schistosoma
metabolically
cojocaru
prospected
pinpointing
histrionics
beseech
subjugating
fadiman
gaud
gorsedd
torsos
saade
coppers
evic
chainmail
mirta
tiree
lant
schlemmer
aaw
blinks
alik
dalin
baggett
wui
wenceslao
beavan
macrinus
carreg
pavers
immunoglobulins
grrr
auchincloss
dorma
wringing
outranked
lege
gillmor
tharpe
nassif
betancur
warhurst
fingertip
hilli
pragati
digweed
trego
picacho
tablecloth
juliá
syncline
laksa
regionalised
longtail
kopa
girardin
mcdormand
hysterics
showmen
diviners
naldo
homogenized
inflammable
felonious
russells
glencore
dalman
incinerators
silverlake
spondylitis
levien
mineshaft
cuyamaca
woolman
denville
warlpiri
rautavaara
jupiler
overmars
misdiagnosis
yarlung
garagiola
culverhouse
enet
sildenafil
kvarner
gneisses
unburned
maguires
lebo
cremations
scintilla
tidworth
uaz
dupleix
giorgini
albom
kassar
troncoso
kunde
jobbik
retributive
toumani
yeosu
aitutaki
gunplay
hoskin
fontanelle
adebisi
froehlich
selectin
carsharing
borzage
sununu
gorleston
natyam
philipe
samet
halprin
leuchter
mangora
xscape
henbury
algy
iwona
aana
amilcar
callistemon
watari
varnishes
fayyad
traun
atmore
macallister
srr
guercio
funai
parrilla
kiyoko
oof
naivasha
intricacy
interlagos
doel
tarbosaurus
minix
basedow
lowlife
derbez
communicants
unashamed
banjoist
paret
hilberg
clapboards
notícias
pfd
furnivall
ayush
mekas
whorled
lubyanka
aldis
footfall
bancshares
wychwood
wehda
egge
dodos
willmore
kotani
aquilina
ramc
dutcher
dilwale
chubu
jöns
rapamycin
arieh
cholmeley
strainer
wingard
liberti
regularities
jiddu
aribert
ligi
uvarov
vleck
yevtushenko
pauvre
markoff
soucie
monograms
jael
vilify
muzzy
muqtada
lancie
huizinga
rubberized
colney
döblin
reemergence
holländer
shenley
voom
novotel
dago
bacchae
schoolmasters
stasiak
dramatize
bronchi
barkers
pinetop
capitulations
fritzi
gery
pulping
jankel
domi
radiodiffusion
costantini
ethridge
berkelium
marwa
vishwakarma
lambros
esposizione
foglio
ruess
perversely
skikda
pinho
malinda
grétry
arlott
chaminda
vetri
vea
algis
edah
roaster
maneri
pantheistic
gorb
leatherman
flandres
biggio
warre
catechesis
eyres
trendsetter
skidding
nazia
madariaga
yared
counteracts
technicals
particularities
opencourseware
reshuffling
remeber
multiuser
debasement
brousse
tiedemann
skews
auroville
pelee
dominum
itzik
kirloskar
urizen
kanton
analytes
perutz
bundesamt
bustan
entranceway
fatio
schlierenzauer
henrie
charmers
asakawa
subsidise
hoult
berlanga
damita
ramzy
seiyu
godsend
starkiller
furber
thuja
ronnies
olley
kostya
moseby
eddery
asyut
armrest
marshak
giraudoux
harve
handstand
telo
shenfield
airbrushed
treu
pandy
dessin
talman
unconstitutionality
miscalculations
gervasi
flouted
munakata
dubuffet
powerscourt
mpac
hawass
makiko
booger
translocations
rechecked
artland
deggendorf
eanes
gaku
contax
ashida
reliquaries
represses
tiler
leks
bluestem
furcula
friant
meridiana
portinari
sciencedirect
qantara
ratifies
iao
madinat
spuyten
hanoch
vente
campillo
chapo
spadefoot
eleuterio
despaired
hernani
naos
negrito
damião
tattler
quinten
topsfield
arae
dewatering
metromover
vidocq
elham
driehaus
ticha
maternally
ziba
scuffles
ornelas
telehealth
parlayed
andel
sankei
pase
fdj
despres
muz
asgar
wangan
bulma
imbibed
thalassa
arlovski
telcel
raub
doubletree
digitalized
haleem
mantz
karlos
ruppin
hermie
crackerjack
zaim
chichele
amsa
wisecracking
balzer
telefutura
oley
astyanax
signatura
trogons
rautahat
disentangle
deputising
norther
hindon
btp
exorcists
bachar
illuminators
stagnate
maciver
treeline
shoeshine
rng
esty
tsuchida
guterres
prweb
battlemented
powervr
brucker
radicalisation
aeronaut
signaller
arla
rodionov
indecipherable
nupe
boks
linchpin
broughty
replicant
ogs
banias
oscoda
malko
gallivan
oddi
tamron
milkmaid
shalwar
chartwell
blips
purus
hypotonia
kittin
priceline
bailes
clarus
dichromate
gozzi
ctg
kennebunkport
rexburg
ellard
tsukiji
roscosmos
gesso
roadmaster
bunching
rickson
ffh
clouston
verrazano
lhermitte
grubby
snorkelling
kunshan
vna
pistil
goulden
anau
antidotes
tapachula
decimalisation
explainable
exciter
paintwork
codicil
parag
manana
afo
vra
doina
tunnicliffe
talpur
pomare
fullers
clipse
ceratopsians
mckuen
tempranillo
jaypee
jow
ofr
sillier
baldassarre
libbed
shiozaki
outskirt
judie
martiniere
brachiosaurus
phonographs
yoann
brandenburger
skadar
piti
montesano
bever
deve
meinhardt
reynier
jukeboxes
scart
whenua
baise
mabuchi
profligate
ruba
earwigs
slitheen
exum
angelino
rawnsley
washi
thiokol
hsk
agnolo
ubl
crutchley
lopresti
sedov
crivoi
swasey
hitchhikers
crimmins
schnauzer
millpond
masafumi
heiligenstadt
ultrasparc
metallurgists
discolored
kiwifruit
infiniband
bissonette
vibo
enma
topkapi
cambrensis
bramah
fittler
hogmanay
boling
cedes
hemenway
viele
thye
spn
transacted
pianissimo
giap
afia
teso
nbdl
casaubon
kingswinford
morgentaler
perlin
druidic
goldthwait
signees
foh
dimms
ruddigore
portilla
oligopoly
honi
stargazing
warin
lightbulbs
ntd
duyvil
lettera
segeberg
kail
tonna
tardive
kaylie
bordesley
fitkin
perdix
oilseed
dilke
rivulets
updf
cored
povenmire
criminalizes
silvius
thile
archipenko
ranjha
bertran
moshannon
cainta
itaú
rosana
kachi
phenobarbital
verdot
middlesboro
handcart
hafod
muthanna
gastroesophageal
chesky
cordeliers
moeen
guillain
phetchaburi
sindbad
hannington
dabbs
munter
rockhopper
silkwood
hadda
christoff
masdar
mfl
langauge
paneer
unremitting
templemore
fixe
blockhaus
dandruff
gleanings
reawakened
arq
otw
ufford
wieniawski
ejaz
basi
lugansk
jasen
bfl
brinks
oyl
corks
bakiyev
websters
windshields
wankhede
pido
gripes
gamage
arness
garrisoning
boreanaz
bhasin
trackball
harken
badali
fbr
virulently
girlish
ismailov
happ
kulka
chohan
bankston
galerija
gorgonio
segun
jeyaretnam
windrush
husn
baía
theriot
davido
caloplaca
wallack
salicifolia
libreria
gazer
skylines
scampton
beverwijk
presti
nkurunziza
dangereuses
breakdance
wilderspool
spode
impinging
dukkha
sods
eset
harpies
jubba
ahlberg
thirtysomething
onyango
torridge
mairi
elysée
amped
burdock
peltonen
neshoba
sirianni
daves
arzu
tywyn
uriarte
harijan
diffracted
zhurnal
neukirchen
kinnoull
adhi
wailuku
gazed
popkin
sawchuk
tsuzuki
beene
risotto
kropp
outriggers
chiquito
nuo
drever
mudug
radiograph
belittles
jhs
scarpia
ironical
souring
letzigrund
timetabled
ajk
reportable
caterers
ercc
alliot
simonis
meristem
perrie
obeisance
tightest
ceramica
perforating
hauptman
euphony
counterclaim
regurgitating
outsized
menefee
headwind
spillman
lahood
rosneft
sandbag
deniece
sajak
hettinger
tulum
gesturing
teela
edwige
stainforth
gijs
kdm
windsurfers
subadult
mckinstry
wbrc
karuizawa
lovel
ichneumon
nacre
sechelt
metalocalypse
preciosa
sanad
lustron
philbrook
videotaping
jaegers
copas
brownstown
oulad
vpl
blankings
choirboys
coode
yurika
rivier
wilburys
acacias
ascorbate
roze
excusable
vestibules
gumshoe
timoleon
petek
moure
ilsley
biosensors
marlee
roquefort
brookins
pharisee
ouma
improbability
ftt
familles
splay
fono
dogen
igla
syrie
stenciled
luxeuil
bodenham
luella
eurosong
kaweah
tatanka
morny
alyce
notrump
flout
esra
ingratiate
ebt
kushal
polestar
blouin
charolais
micra
gaceta
blairsville
mlf
wth
koreana
ahle
oes
tiaret
crpf
satoh
raspe
cowgill
perjorative
blaring
cowrote
antennal
lini
misto
sirah
huckaby
moala
hongkou
bisque
swansong
pilley
chides
macchio
huaca
schlessinger
epoca
iliffe
gravest
mycotoxins
saja
zahar
hansbrough
transcendentalist
cervo
haba
centricity
sawtelle
userland
quiche
alberich
kehna
siffert
courbevoie
diagoras
nordea
cashback
wolford
pelzer
chernin
audre
laman
strongmen
regals
farndon
arabesques
jervois
fanconi
childline
hadas
belling
dokic
khem
specificities
vose
hellwig
petrovaradin
chastel
kutta
overtimes
ossip
moiseenko
anamorph
dematha
grapples
marske
emeline
endino
quarrelling
kutai
cawood
kanwal
devenish
mcgeoch
diabolic
aguado
inupiat
arline
tuath
mccartan
hyponatremia
cordwainer
engram
liberians
slotting
walke
barceloneta
denyse
arduin
tormes
maghera
atik
bors
carreira
subsiding
smiler
primatologist
wahi
remediate
assesment
troubleshoot
langage
familiarizing
hallström
usnews
privatizing
kigoma
saham
undercuts
srisailam
brocklebank
prudencio
surayud
moorea
rosal
kohinoor
schor
sturdee
saltburn
overdoing
stepsisters
ruder
szeto
mdn
wooler
ducat
thunderclap
directo
pwl
puce
huong
bunhill
tesch
narodnaya
extrapyramidal
voyaged
goar
bangle
mccoist
lhuillier
eurotunnel
fryar
brackenbury
onsager
immelman
shavit
keddie
neckties
mangos
grimy
canepa
persuaders
imprecision
inviolability
refines
mitsuhiro
quarrington
panhandles
foundering
rickettsia
comming
friggin
ormeau
cherubino
gayo
armytage
tartrate
clarkia
divining
constitutionalists
bloemendaal
accentuates
supercenter
hoosick
fairhope
romanek
communistic
petters
westmacott
acrylonitrile
contradistinction
pnac
patriarchates
conciousness
bandido
capernaum
arsdale
unquote
blauvelt
mencia
mattocks
mystically
cunningly
fga
jeanneret
auberon
touchback
yohanan
colaiuta
ecos
waterfronts
walloons
pbwiki
compulsions
rpk
malfi
anvils
afk
mauling
selamat
gunawardena
bov
complexo
cluedo
jare
euv
homeownership
junya
dicicco
chantrey
kardec
lampkin
kwasi
trat
groundhogs
dagwood
rvc
bjelland
darpan
posto
byfleet
fruitcake
downturns
grotta
kurile
cratered
bingöl
mcfadyen
uncrowned
xichang
raynes
saviours
beauly
sebadoh
calva
omt
biochem
strawson
loza
imaginatively
hellinger
suge
apfel
befits
muskeg
halimi
romanes
waterstones
bahria
centromere
unjustifiably
borkowski
chicas
ridpath
saugatuck
auriemma
koinonia
csw
tym
walberg
plantagenets
chrystie
comino
rodel
jfc
neatness
summum
rosalynn
tullow
northview
terfel
vampyr
becton
pyromania
muzaffer
chilcott
woos
skjold
kleenex
beres
preti
chloramphenicol
tremlett
adenomas
hopps
lackeys
albertans
brokenhearted
cecchetti
ambystoma
duby
buckcherry
lss
scholem
cottagers
aborting
phillinganes
noviny
oedipal
brosius
berchmans
whimbrel
ritalin
polloi
orlik
arclight
latona
sallow
ohp
oaken
hdds
mindsets
mitchison
mushers
chinas
dinnerware
cbca
uncompahgre
hartwall
polysilicon
tapiola
reisterstown
prothrombin
tambourines
nyayo
borre
aesculus
mccarroll
sequiturs
morongo
holistically
messias
bridgehampton
geode
waistline
salomón
essentialist
fresnes
keuka
handcraft
thieu
ultranationalist
zytek
mcdonell
carnie
lacquerware
duas
polwarth
disturbia
brackman
caballus
gondii
winooski
yichun
oia
humiliations
kerkorian
ugk
oakton
tsutsui
domenichino
galion
eventide
bayoneted
pois
carjacking
azania
valkyria
naysayers
greentree
thamesmead
licia
halvorson
enshrining
romanticised
illu
tdsb
interlocutory
hybridity
primi
krew
leiper
oriens
fafner
burstall
backbones
timidity
puritani
nary
malouda
crider
bartolini
simin
alleppey
sayeret
cnsa
kwaśniewski
freakish
lightyears
shinrikyo
embeddable
fasi
ibérica
shinui
hytner
lorazepam
applauds
navaratri
panik
gauld
baldrige
busacca
orgasmic
nyr
surest
anthropocene
missenden
phillipines
lavergne
nmsu
paraboloid
melancon
homies
jaf
halabi
teenie
nordschleife
patroon
seaborn
hockenheimring
wehner
forevermore
colomba
akwesasne
duis
pittsboro
yaguchi
merula
beyers
turgid
aculeatus
kinzua
dextromethorphan
merode
montella
armato
isom
saggers
tolmie
kombi
kusunoki
slimming
wenchang
liukin
purisima
lgr
kecil
pmdb
sokolova
signboard
saft
katu
catlins
uwi
broadmeadow
pasarell
sheepherders
lamma
andri
transposons
kimmeridge
statism
lyudmyla
privations
skowron
darlaston
furan
soomro
sadomasochistic
khasan
llŷn
kelsall
rawi
spatter
intersectional
vautrin
filson
glazebrook
mochtar
nolin
blurts
jarrar
diemer
unità
deveaux
emmert
hainanese
hitech
dacs
cdrs
utile
akure
baade
elliman
redfearn
brimmer
toofan
insite
lipoproteins
shophouses
hardi
motoki
sarafian
altidore
beleza
roseanna
frankton
woodburne
hausman
sanded
préludes
declassification
shaoqi
stroup
qubo
vco
duin
bonaduce
kida
melpomene
ridout
palatina
baldridge
malti
donaldsonville
singur
comfy
leiston
papago
swaim
yasunari
wimpey
ammount
gehman
catia
otan
livestrong
tourers
usami
invalidation
dearg
jetson
stavro
lanfranchi
mantar
hend
thuan
botto
atago
thibodeau
brotzman
shange
mauler
storages
comi
underbrush
canasta
sty
nichola
dramaturg
ignominious
chordates
lanta
dispensations
pkg
falvey
perebiynis
ptolemies
downpours
herms
wwj
urinated
residuary
hatherley
gree
exoticism
fyr
dinesen
hajnal
evangelisation
alwis
blackbuck
larin
mclauchlan
portail
viadana
hso
heong
beheadings
nedim
stalactite
feg
alyx
hirabayashi
eurobarometer
venkateshwara
kokugikan
dotrice
epicyclic
aurat
doremus
birge
weiskopf
dawnay
boasson
kielty
juncos
vervain
fluorescein
pocketbook
semon
shafik
hemodynamic
alexeev
exel
birsa
minora
doy
burwash
antikythera
illusionary
misstatements
grima
wilshere
menhirs
vergel
galbreath
ual
teats
courte
hiroo
tribu
banjara
southcote
bazargan
frontages
abyei
lavern
eildon
miscible
boosterism
américain
hippeastrum
calmette
sfsu
slaving
landstuhl
drane
terpenes
drumstick
surplice
kricfalusi
injun
gbh
fnl
experian
hauff
sundaland
reworkings
hjorth
kith
axion
oakman
callistus
quatermain
amidala
berowra
sarabia
bragi
shrimpton
teat
kanner
krw
gress
sowers
slyly
buckhannon
stifler
ilion
phèdre
dolin
brid
papio
hebraica
rpj
traceroute
nudism
nebe
vocab
petherton
mrcs
vicent
chhatra
buggers
ahmadou
flabby
sleipner
rifa
pivovarova
milloy
congressionally
wachs
phr
milkha
freshest
juxtapose
mixmaster
libitum
archos
convolvulus
abbruzzese
osby
mieko
inconvenienced
nayland
holsters
pontiffs
palooka
cloudburst
maelor
shenango
repellents
miru
jcl
matsuno
clairton
yogo
jeepers
roughshod
rearrested
saltpetre
sqs
skira
rohn
epmd
fifita
hyperreal
norian
swartzwelder
abdy
boattini
circularity
sheed
hypotheticals
stater
terauchi
cackling
chalco
dónal
quashing
tengiz
flyte
jaunt
sociality
novarum
cades
amul
deniability
prnewswire
dolci
dorland
centerpoint
rango
dongola
devar
steenburgen
mukta
turtur
clore
achham
ogoni
follo
frangipani
radebe
npi
quarts
rendón
sooo
swalec
amund
seabreeze
overexploitation
kunisada
ofs
rubina
chickahominy
levins
balladry
soave
wayfaring
busa
madwoman
swilly
satiety
woodinville
koyukuk
weigl
caulker
stylization
imes
heyes
oei
yücel
quickies
rouch
casc
barto
milonga
egeria
nonreligious
merreikh
chopard
recommender
duffin
surendran
obfuscating
xiangtan
dvin
riperton
implosive
distended
cubesats
breastfeed
pinstripes
nns
gerstner
etal
merryman
bedales
tlatoani
phylogenies
itty
tlaloc
mwp
ngurah
bercow
adewale
sgn
bethania
vendome
gulen
˚
statin
pharmacologically
rivest
searchin
criolla
degaussing
najwa
deflationary
cortices
ptl
zoff
donnan
wissam
xishan
shahjahan
winnifred
plasticine
dufy
beke
thaxter
aot
resentments
archaeoastronomy
cibo
boerne
pinole
corns
prekindergarten
myoclonus
goral
belbin
khorasani
alishan
pervade
penstock
raku
sinkings
roseman
talmudical
simic
tdcj
ceni
kma
pittosporum
rmg
canopied
deforms
zootopia
igt
zeleny
etp
cecum
seraing
flecked
lampi
helgeson
jayawardena
jostein
forefinger
debbi
pawtuxet
beb
billows
idas
lattes
fumarate
juscelino
harvill
oag
balts
plastron
serviceability
prowlers
higley
stiegler
kreisky
mixup
brok
occassions
haggart
frutos
iohannis
muffy
saqi
axford
mindbender
quins
otieno
antiproton
hambly
ermelo
reimbursements
gimlet
insurances
aporia
mazi
armley
bottcher
adès
yeu
legat
rudnick
bustillo
jharsuguda
tattle
grooving
apnic
palak
krishnaswamy
brecel
artificiality
leatherwood
kyoung
leverton
doubler
roue
trebišov
shoprite
araby
windu
lekha
kedzie
arber
shariati
rybka
schou
midem
fre
cnoc
gentrified
beauforts
moutray
alguersuari
strongarm
tros
forewords
mukilteo
ritsumeikan
djenné
keshia
scoggins
draping
katheryn
naini
meggie
thomassen
jodl
koponen
wamena
vars
shariat
yakimova
jub
amneris
britches
ejb
truism
wince
privation
alterman
osteogenesis
freestyles
ameobi
davidians
dzmm
futian
romanenko
sargento
salima
guatemalans
vandyke
diakité
colaba
navales
cornforth
savor
bellu
huidobro
esthetics
ecclesfield
dalliance
categorises
sinema
sverdlov
duritz
stradella
graebner
webgl
piazzale
taitz
conferment
mcdermid
zorin
chatelaine
keratitis
odt
exonerating
trebbiano
bartha
snb
greektown
mirkwood
rics
naxalites
bochco
kewanee
partula
youghiogheny
tetrafluoride
hallucinates
unwholesome
exploder
deferment
kyan
barf
madrasahs
opportunists
mispelling
ackerley
turbofans
ingenue
wmp
weatherboards
scowcroft
nasreddin
hizbul
lucier
campton
handmaiden
foghat
ferrum
creede
goads
fluorophore
kotwal
golam
slyke
clouser
hylas
yamagishi
bradtke
westhampton
multispectral
monsta
zeba
miyashita
subversives
kazooie
latics
envelops
premonitions
tamanna
voronov
berardo
raffael
glycerine
toyoko
firestarter
worsham
okotoks
noordwijk
kotak
epidemiologists
argan
cuckold
cluck
longstone
bronzed
santucci
clm
sesay
cotati
montblanc
explicate
oversubscribed
wfla
groce
denyer
fends
mentzer
trapezium
longines
rocketship
immediatly
katahdin
odours
guten
gangopadhyay
langat
tandil
breezeway
amadis
koby
frutescens
woodlice
dite
tilo
silkk
bottega
storz
bastin
tongo
junia
alekos
vasi
timperley
melkonian
upstaged
lopatin
stangl
ribcage
alcona
ilgwu
haggling
panas
maconie
undigested
nafi
useage
arkestra
cran
nerval
haslem
beevers
brûlé
bhar
mellows
nihonbashi
scorpionfish
nasreen
cordwell
lilleshall
zielinski
minzu
guarantors
stuckism
headpiece
wakehurst
tollgate
sensenbrenner
gelbart
dewald
altho
afterimage
roybal
almondvale
blenders
monomakh
theakston
driessen
kripal
marvelettes
pellicer
congaree
voo
restates
espers
sealey
claptrap
iechyd
gärtner
foetal
cinderford
silvertown
felician
sovremennik
goldsman
comorbidity
jarratt
sofitel
yoong
cattrall
yashpal
purdah
laury
dietitians
umist
heliodorus
smosh
compactflash
akeley
solvang
orley
skiffs
larios
potluck
amazonica
rachman
videocon
rando
céleste
helsby
sonars
fenlon
malmgren
nayagarh
codice
creaky
spewed
disfranchisement
stammering
kalt
mccovey
peutingeriana
anoeta
lungi
gherardo
evanier
larner
tumorigenesis
tityus
plats
havisham
naiman
rowett
juggled
seaworthiness
telegrapher
voyeuristic
gadgil
vittal
anchorwoman
garris
dromaeosaurid
biennials
gravelle
nacion
piebald
screener
baco
denni
blethyn
underappreciated
benzi
pacy
binod
celebre
sanitize
ladyhawke
kulu
enriqueta
cambon
negre
laka
mosier
earnt
staleys
autumns
glx
shepstone
kinard
ttg
dantan
caos
cley
vyner
harrowby
goldbergs
somerhalder
essanay
capsizes
klick
colorists
confidentially
claudiu
maiming
xvid
marke
knobby
transhumance
tiroler
pulu
epaulette
dbx
leff
conveyances
nibelungenlied
jubail
demoralize
vtm
norcia
canadarm
scherman
cahier
outflanking
ankaraspor
tailgating
wiseguy
punxsutawney
caldicott
vashi
ivonne
miéville
deyrolle
aplenty
nebbiolo
munden
kuldeep
palmate
worton
dissimilarity
fanciulla
chahine
oconomowoc
chérie
huskie
stoats
desirée
khadijah
totter
scrappers
critz
akel
ranatunga
barón
onge
vanek
kany
shimazaki
tanti
rofl
,but
sanctify
marmosets
zq
klett
nuon
niner
irrfan
beauté
mcteague
séverin
satpura
newscasters
eaf
monkman
gatun
soucy
seismograph
alpa
aubrac
propping
bertelsen
fabricators
rini
morial
sarwan
fowlkes
autonome
monn
zehra
squalus
carhart
kamali
jamerson
gcap
insures
horsford
wieser
bluefields
padovani
chesler
carstensen
overstretched
severinsen
berjaya
ottaway
albery
ulmanis
stagehand
gelo
hollen
osian
nexen
unhesitatingly
dignam
batalha
situates
poulain
skomina
sportier
refrigerants
rodway
grieco
thermocouple
esterase
polders
carbamazepine
abadie
klaxons
gidding
tsuneo
brassiere
hopedale
conoco
slimani
pkd
nuc
fickett
scio
aloni
marva
fashionista
gavilan
antonucci
tabern
atmospherics
fusil
vcds
contactmusic
wach
potentialities
abhiyan
multifamily
typify
brassens
caquetá
mayoress
simpkin
gachet
sriracha
heaphy
tarquinia
mallinckrodt
battlers
froebel
swirled
pessimist
nampula
territoriality
universale
drottningholm
neutralizes
champneys
stecher
therapeutically
mugshots
umu
thura
tardelli
digoxin
zaia
mongers
prorogation
spieth
nedbank
readies
putrefaction
parmelee
chinen
bergdahl
maske
kiddush
joux
gorce
genoveva
rosengarten
fanfares
imx
seiter
watlington
unus
akranes
pyatt
myburgh
ches
sunstar
klehr
blechnum
pozzuoli
linha
callison
tikaram
lsr
efsa
ncar
disputations
hallowe
udom
appell
kinkel
vocalisations
grandsire
costes
driberg
amyot
tailspin
wareing
varices
brister
corcyra
bushbaby
backhanded
vorkuta
foliated
paroxetine
cors
orientalia
dandong
probationer
unbeliever
daylighting
bridson
calque
husson
frigo
ngoni
yih
deric
ergs
ohh
harrie
comtes
symbolists
bayon
goschen
pfennig
reconvene
woodsworth
jerningham
twinkie
dahle
pchs
marre
codfish
sinew
trifles
labasa
hightstown
strutting
millo
temerity
tensioning
cévennes
wearside
stapes
ophthalmological
sodo
annua
hemorrhages
homoeopathy
compositionally
chrism
transiently
denso
santon
plexiglass
flink
cavallari
njn
bentall
frankenthaler
fluence
casar
niwas
quinteros
nisbett
ramanand
ogni
winant
disquieting
strick
infocomm
offing
fagg
griots
enppi
tfeu
nafs
houben
melander
neher
eaker
albinoni
vernonia
farne
enamul
koscheck
decisionmaking
realaudio
viduka
lasher
cftr
gerken
gaulois
lesch
zedekiah
capobianco
birobidzhan
prickle
gxp
bivariate
asynchronously
poleward
pangbourne
happend
walkerton
maars
tanuja
anglesea
maney
teige
sensical
pharcyde
okeke
jca
encrypts
ebcdic
onoda
metabolised
voznesensky
neshaminy
nemuro
plumages
lassi
antm
aconitum
chaska
anglos
voris
verrocchio
garni
usborne
nmm
sequelae
diciembre
bhubaneshwar
balthus
moneylenders
mayol
naudé
hoog
deckhand
waheeda
xfc
diederik
lebed
ximénez
jeers
lagunas
nonmetal
ksp
nooksack
blimey
bukavu
euphemistically
sluis
krem
clews
buyback
maathai
corpuz
nizza
suncor
commitee
wisps
fictionally
radiosurgery
tamina
loken
nanton
noes
huf
buis
cigna
bendtner
shantha
anchorite
funneling
grassroot
ryohei
ungureanu
lippitt
veber
supercooled
rethought
barings
usenix
janel
unintuitive
actinium
ownby
yae
waterberg
tsuruoka
sphinxes
chenery
cetshwayo
gowon
microns
nozaki
hindawi
atherosclerotic
stenger
renamings
strawn
illy
warders
sindelfingen
morganza
gandini
laxatives
abyad
linzi
jex
nasrin
ruane
synecdoche
galette
shamanov
steinke
slapshot
redwater
uummannaq
cerise
foxman
horman
halesworth
résultats
ductwork
thomasina
prostatectomy
rendon
synchronise
qtv
radburn
bushwalking
suppressant
konig
alzheimers
glbtq
giguère
shutterbug
todaro
haneef
archeologico
anoint
cousineau
mandalas
btg
phev
womaniser
animo
dichloromethane
koslov
duhon
pikesville
anri
adon
galax
pacifique
vogts
anticonvulsants
panzerfaust
aeroméxico
billingsgate
wallinger
okefenokee
burghardt
luanne
osos
mutabilis
airbrakes
planetesimals
reassembly
gnaw
imaginings
aplastic
gitanjali
pung
ronalds
buxom
northwind
sturbridge
laem
samian
cairney
tailend
kochanowski
swiftness
attucks
kinta
haitink
safwan
preening
lhotse
winkles
scheller
leonetti
duathlon
trickling
hatherton
roundness
choughs
turntablist
cers
caucasoid
cinecittà
kimpton
glis
allatoona
interrelationship
zacapa
tyro
sova
watchet
unece
extrovert
kowalska
bulstrode
superposed
ardingly
chetwode
militiaman
tejan
mizz
yuletide
shettleston
incapacitates
parenthetically
screwy
replaying
bazillion
zoila
shmidt
mccoury
mcshea
haralson
ciborium
toxoplasmosis
peepal
boor
compania
sayreville
sanogo
begoña
catechin
burgan
papis
dniprodzerzhynsk
fragoso
dickies
nullah
mengal
overmyer
immaculately
nakatani
utopianism
jsi
ghislaine
olimpic
postma
assignation
cardell
jogger
underestimation
loosed
russkaya
utpa
pantheist
sobha
bruxism
vti
outfitter
justman
inhabitation
lipski
absense
sillitoe
cito
vjekoslav
jingdezhen
frégate
encephalomyelitis
ibérico
mawashi
bernabe
infrasound
ebor
watercolorist
grof
rapson
kautokeino
cbx
fagans
uncategorised
revillagigedo
outgained
bentos
hajibeyov
winterhalter
backcourt
azzurra
alcmene
omnipresence
sherwani
kratochvil
repairable
reenters
erections
zoominfo
centinela
headrests
bollman
buonarroti
tillery
solera
kortchmar
ruffing
piddington
idfa
culley
mellal
tacony
oommen
jochum
gasper
samit
nolde
erbium
deducting
reedus
fauzi
tempura
handwaving
felids
ingrassia
poultney
sashi
encanto
horrendously
glasscock
kassandra
maassen
rhabdomyolysis
mobilizes
hottentot
arika
chauvet
waiheke
valadares
brenz
iowan
nilssen
duch
penson
rodda
akashic
padmavati
headey
haf
minehunters
coanda
brucei
ikram
mccarville
rosenman
sadleir
virga
binhai
superhit
snowballed
granado
brunettes
kalay
kittiwakes
apprehensions
cawthorne
sterols
embleton
normie
jennette
caillebotte
stronge
knower
butyrate
airfare
hartlaub
shwegu
lensky
sparkly
florentina
ctt
dyre
bomer
noxon
kazanlak
carrell
manoeuvrable
fayum
jsut
annen
tangmere
kostner
pharmacologic
kayenta
sarkisian
bresnan
nevelson
suji
kuya
cephalosporin
souces
lotuses
thatcham
burkert
airwave
paddlefish
hattin
holzmann
rondebosch
scafell
wildlands
moneylender
eaglet
chilvers
upington
romanovsky
iguala
autoexpress
shinichiro
républicain
graefe
fka
usccb
faizan
bindra
banska
ziller
blanchette
herby
morishita
naff
kiselyov
tric
couloir
esol
sibuyan
vanhanen
kentuckian
mcelhinney
dignify
mbaye
aircraftman
barbi
jedec
ioffe
kneading
raben
wallwork
eadweard
crédito
blatty
tearoom
stoichkov
gennadiy
clattering
tcw
istc
maryvale
labiche
masu
atsdr
puddling
tightens
farthings
antawn
vindex
chocolatier
burbot
wilentz
brows
rebroadcasting
kolff
tompkinsville
magnanimity
bokeh
abx
stancu
endoscope
belek
irrigating
studentship
nrb
trygg
kubuntu
bioreactor
microfluidic
kensit
coms
navigability
fiver
bemoans
chata
superfight
fishbein
qader
altissima
bussard
phoenicopterus
mawlid
succès
hargis
promotor
ucv
kelsang
ubaidah
dishwashers
rincewind
arrabal
mimms
croon
surmountable
ngok
malipiero
gyfun
wincanton
helfrich
wayfarers
shepperd
hellmouth
mistrusted
leoncio
blurt
mado
cyworld
toilette
menifee
dewine
knu
ziemia
palaeography
oresteia
wadhwa
shorebird
khayat
mires
sparsity
employments
danzón
frayne
marash
voges
iriarte
mughalsarai
beckons
mulled
hueco
tumbleweeds
longshoreman
minifigures
peonies
ravena
fiorentini
cdx
councilmembers
petróleos
manninen
ravishankar
gerbert
commie
fum
garigliano
mcginnity
tormenta
blackton
kratie
cumbres
isco
unmotivated
oosterbeek
lazzarini
relishes
celler
multiphase
lammas
gedi
informa
msci
miño
dispirited
rackspace
kpl
cephalosporins
coni
flagstad
shivnarine
mitts
accion
lunate
malfatti
garciaparra
longyear
payless
bumiputra
kasab
solie
gjilan
pardoe
fieri
mytton
tablighi
hallet
riggers
karang
presupposition
blier
mccree
epictetus
copt
oec
cattedrale
chateaux
generalists
hila
horsehead
norham
leachate
undercooked
tiberio
facey
ascendency
kaist
gemological
plaything
pickings
nvo
creepshow
jackknife
dooms
hums
echt
ngorongoro
bauza
rusholme
proclaimers
brewis
leghari
jailbait
buchalter
titmus
extravehicular
karlskoga
arifin
ehe
neeta
kenneally
clézio
justgiving
minshull
narodni
polack
briceño
longdon
leavis
supercells
beddoe
gulags
rauner
kobol
bandhu
guipúzcoa
stookey
anthill
wensley
wrotham
lissitzky
omnis
davern
chalcopyrite
hoots
oviduct
dejesus
chamal
trekker
farro
cryptographically
cranshaw
polonica
vestment
igbt
rira
pinchbeck
bmps
lebow
interparliamentary
chanters
dirham
guardado
metronidazole
hpl
brownlie
amphitryon
unclos
hollandsche
astronautica
tauride
pinkard
catolica
bayarena
nfhs
corddry
disfavour
vasilisa
songbooks
inadvertantly
opposable
repopulation
gintaras
eckman
calahorra
wedemeyer
kozue
carpinteria
radka
stringy
meiktila
azuay
unwelcoming
khural
donaire
cirkus
barzun
brazza
hulot
dizengoff
bidston
nahid
wdcs
alvensleben
fumaroles
sirindhorn
belgacom
feverishly
stockmen
eai
powles
businesswomen
probity
selectric
mousasi
coerces
blairgowrie
shambhu
adjudicators
asrar
mckesson
elektronik
tyrie
scullery
aminah
peabo
chugging
incongruent
woz
hmnb
dislodging
incorrupt
dichotomies
fireships
pyi
blag
meanies
refreshes
rapacious
aio
unbearably
lakki
hejduk
rioux
tsvetkov
clasts
orisha
imaginarium
westra
repackage
diaby
missolonghi
etting
repaved
newish
oge
isolator
divesting
rudenko
rosarito
dehavilland
stimulatory
foliar
skyrocketing
mosson
crystallographer
catagories
burnden
stevedore
vassili
rantanen
gothard
diagne
cliffhangers
rooyen
domenici
interloper
achleitner
inslee
murwillumbah
freebase
overloads
fazlur
ifsa
maor
indigestible
obiter
metrowest
remarrying
yashiro
reconsecrated
southerland
affixing
buel
preamplifier
meatloaf
purchasable
borth
lastra
kepi
emended
nafis
pamina
tactfully
subframe
hippopotamuses
catchall
bvg
thirlmere
whizz
margi
caprock
pacifics
septicemia
transoceanic
outvoted
kashmere
satis
ifans
exhume
riobamba
fedde
bwana
noureddine
brachytherapy
sahoo
spoo
marle
phmc
rheinhessen
engelhard
tamada
lera
ibaka
pestered
krahn
melanesians
lcn
mcat
hottinger
chemise
newsrooms
blondeau
cutoffs
yasmeen
levita
servi
rupel
usca
nagarajan
reanimation
vcf
jairus
ladue
squashes
bickers
gabbana
quillan
ands
brisebois
izhar
winkelmann
dhanraj
sultanov
smithee
lutter
fov
mazel
petya
reconverted
ceilidh
ufj
misuzu
bused
catwalks
shaaban
kawano
dressmaking
ridgemont
hammerton
horniman
bireh
subpart
amran
tracksuit
murty
decapod
reichmann
carlock
sapping
thrombus
icb
garmon
takamura
camryn
adumim
weaverville
pajaro
udayan
chown
santerre
polarize
misanthropy
eisele
muh
chisum
narberth
cytogenetics
isopod
inclosure
anacleto
ludden
sapped
metron
itsy
irksome
stratfor
straggled
lugg
holleman
semiramis
pleat
boosie
weakley
apolipoprotein
puerperal
barga
forni
rusu
ivies
sardinians
toco
vora
knotting
dauer
achaia
excruciatingly
sinfonica
capponi
nicoya
nwf
stankovic
hintze
commercializing
sulfoxide
dichroic
pâté
noctule
skoglund
bibliophiles
uig
welte
christiani
pajero
demetria
trumpington
placidia
steeves
ruzicka
mdk
wilner
kompakt
lemba
biagi
timpson
cintas
nunchuk
bottrop
bbk
durang
postern
maloy
ahonen
rafal
larrousse
regehr
kaelin
menchov
dlt
chrysis
fli
tinderbox
allinson
aboud
orie
showstopper
graciano
inspecteur
whodunnit
farol
baule
cgf
froch
inundating
bicalutamide
abit
bizkaia
dinitrogen
genitourinary
weizman
incendiaries
dugway
rangamati
cosigned
waziri
lolli
drool
millenarian
naderi
chembur
josiane
stannary
setauket
poltimore
alberdi
neuschwanstein
dodsworth
callard
kubek
astrée
kardia
vientos
dubourg
kalbi
zukor
subsume
manzana
hoque
unsuspected
pelorus
slideshows
holle
uncoupled
tuta
halwa
orcadian
parktown
flauto
maipo
ardour
skovgaard
refuelled
millerton
evensen
berend
safwat
unweighted
khurda
rockit
jaqueline
mcclymont
sinterklaas
epitaxy
trousseau
lagers
labbé
huzhou
refiner
coss
probs
oblonga
sayville
desmoulins
sauvages
sarani
azules
boje
golomb
pneumatics
clonard
albrighton
kaira
petard
arnall
kilduff
ista
levison
bielsa
poncet
hiley
brigit
glenroy
sones
vissarion
fisica
tangata
asos
urbaniak
gillie
laliberté
trashcan
beringia
krissy
dunked
declensions
morriss
philomel
dubno
hovland
quainton
muga
galeb
grigorescu
depailler
kyros
furyk
boussac
blazin
gilstrap
dredges
missi
ulp
croes
atum
moscovici
kark
hartson
insurgentes
woosnam
artemisinin
europos
takaki
rotondo
musicae
kheng
collierville
devere
gradin
abdicates
volmer
powter
pearman
ecuadorians
bluto
polat
sexting
turmel
eleison
azd
hartono
melitopol
panspermia
ewf
lesean
anticoagulation
monreal
zigbee
dampness
zakrzewski
firstenergy
oxm
owa
nwe
burkholder
sauda
alioto
mouna
behringer
wilbanks
snowmass
jakobstad
heder
judkins
gostivar
gulley
ghomeshi
loathes
donghai
bedfellows
haggai
colloquia
braybrooke
herrnstein
microenvironment
kingsdown
broda
laugharne
izabal
ennerdale
tragedia
dengler
prenzlauer
chafing
khandelwal
weardale
tombalbaye
prissy
kmph
askham
dhea
maram
mangla
azeroth
kexp
devgn
variazioni
pomace
kimmie
sakhon
semey
balbus
mimeographed
ghanshyam
brickhill
montale
cobs
bresnahan
kiriakou
alls
desrosiers
redhat
faucets
maurie
crapo
rowson
croucher
kayne
hopatcong
granbury
unprincipled
azmat
rinus
dionysia
boyles
ideo
stackable
kegel
pennyroyal
kenitra
pide
menza
retards
lajas
macrovision
kononenko
kloof
masataka
woodyatt
oved
spreewald
guimet
possiblity
nominalism
heming
nearshore
policia
genevan
pingu
stylianos
gurdas
mutts
abdulkadir
hugill
webex
frangieh
inet
cruzado
hunched
enigmas
hallvard
laza
schieffer
levitz
gorget
bestwick
stathis
pedipalps
schwalbe
monosodium
brockovich
widmann
soso
tillers
meninges
actualidad
habash
receptivity
gnocchi
neuropsychiatry
bantustans
gildea
fubar
croswell
chamunda
royo
novik
raban
blomkvist
picon
lcu
terezín
cowman
parastatal
cappa
missoni
rafn
miniver
letoya
floodway
yanofsky
pontiacs
zaher
purples
mcguffey
balut
vdi
venne
meurig
buridan
autorun
cripplegate
capstar
baudry
lacaille
lugoj
susteren
ksm
heilbron
cavelier
peyroux
seebohm
pecans
redfoo
lumo
mangel
lakmé
copei
bispo
prototyped
prasada
bph
arani
mediacom
pacoima
haitham
speedboats
straka
sciacca
horology
slevin
ldn
dullness
deadlier
paestum
puti
leta
chaebol
akino
freedomworks
saidu
arseniy
makedonski
danu
ovide
barleycorn
slashers
agne
björling
zazie
sunanda
jocky
fatehabad
nwl
vdot
tsawwassen
chug
hebel
reparative
pauker
zinner
glostrup
slimer
nixed
wetness
pectoris
fdm
fumigation
fti
sabado
shagari
aquiles
bahrani
kasprzak
spurling
caesarian
larroquette
heathers
feaf
vsp
reperfusion
barnham
rossella
venere
opinon
tmx
glenalmond
osmar
lifesize
pagny
leaderships
sloper
muldowney
stelvio
brumbaugh
vano
appends
houlding
ivanchuk
townhomes
belleview
ashrafiyeh
ceaselessly
byrum
bleiler
wtic
briony
esrc
gingerich
agta
misfired
kerli
infographic
customisation
snuggle
gnarled
rues
zemlinsky
pliska
starkie
puer
viewtiful
ivp
frederickson
evigan
picciotto
malcontents
doshisha
owlman
heidelbergensis
darwini
terabyte
stroker
lakshya
cartoony
cardiomyocytes
spallation
warka
curvaceous
minda
soos
attachés
mccardell
heartstrings
bbdo
wadala
oktyabr
morata
bohlin
fikri
kempes
acquittals
ipd
tpv
nesa
bng
hafs
laparoscopy
coonoor
netgear
aksoy
sverker
linolenic
prenzlau
grandly
waqas
bellin
interlake
deorbit
niaid
iwas
modjeska
etrog
csaky
unie
drownings
nuss
sabourin
folch
astrometric
foton
hamidi
estrogenic
maure
dismissively
koln
khajimba
prepayment
rotella
coaxing
redolent
parel
sabbah
woodshed
underlay
boothbay
musetta
homebuilding
zwinger
neuberg
bartonella
foretelling
revson
sieveking
dynasts
dragna
hidetoshi
bikel
lazaros
rallis
swarthy
ornithopod
bethan
debden
herington
thorton
tillett
ficci
fleadh
distrusts
leffingwell
pleso
darnton
alph
mecham
engenders
tonally
concetta
hanabusa
helvellyn
orients
volpone
pentacle
chassagne
tapajós
pcv
sentra
otmar
inquests
lunds
unexposed
collies
savva
steinmeier
victimisation
wapakoneta
earmark
boscobel
papiamento
gaidar
wjbk
fruitlessly
saddledome
amraam
painlevé
tzipi
kirkeby
dombey
nordica
goofing
lvl
arkel
crimped
jonasson
boase
gits
lockable
trampolining
giner
piek
gfw
texturing
loiter
swoops
pcmcia
hausner
kemmerer
outwitted
avenges
kayoko
rationalistic
ilunga
characterisations
bonnot
morayshire
reinterpretations
fomalhaut
motier
crip
commendatore
jaeckel
sighing
neet
terrarium
caac
larocca
vojtech
wakemed
swartwout
fantasizes
trumbauer
fearnet
mandira
telegraphist
spillways
ropa
sarsgaard
mcniven
occoquan
kihara
forethought
uzma
conleth
asensio
nbg
isam
rustication
lansana
nebi
numazu
toklas
grendon
janin
jnk
vasseur
jrue
steamrollers
kikwete
viridian
inflexibility
najeeb
ducas
bilad
middlemarch
walraven
farrokh
zooxanthellae
roomy
froelich
lazzari
jarle
protuberance
tempor
steigerwald
ntn
kome
scacchi
dryopteris
vogl
mabee
cauldrons
guárico
assateague
pericarditis
litigating
bhosale
lunden
bluewave
cranko
borgmann
salcombe
basch
gored
haparanda
yaobang
treinta
bulwarks
centrists
heldman
mukunda
brora
rockhurst
rums
musson
cowal
stammers
psmith
proctors
perfetti
stayer
oddo
successional
blackcomb
caffey
gatley
serow
neuritis
propionate
petz
hydrocodone
coachwork
talitha
boces
fichtner
zayat
glickstein
varick
intimidates
hempfield
mouzon
allopathic
shehri
tmr
usr
chindwin
bittencourt
gabbert
pamirs
nevel
wandle
dailly
baril
mattila
husker
horological
violative
edgartown
alway
friede
bidwill
quintette
uria
auv
natus
bdm
dakini
frontside
hummock
marzano
factum
sadak
paulhan
hominy
billionth
zuk
rosses
seared
lindeberg
tarso
borstein
pemmican
trotted
sarasate
maurine
teepee
adamczak
neptunus
tuman
azaz
luh
pmg
cognitions
kntv
posses
difford
tsvetana
schelde
tova
beriberi
clauss
bookcases
mcgirr
scobey
alexanders
railcard
keyport
parnes
inducer
lish
ciampa
cubas
ironworkers
madrona
teran
phrasings
gmat
srivastav
mahaveer
visto
springbank
hosier
kneaded
udder
pelling
vlogs
boules
sonisphere
heinsohn
seacroft
uhs
carolyne
tilford
zul
grolsch
unfertilized
creutz
bennani
flamme
obviating
okawa
jaba
naarden
mahy
saturating
tastefully
bushra
zeil
avidan
abidi
pedroia
ferritin
borage
sare
lorica
metsu
itcz
oladipo
lunceford
fredensborg
certianly
irwindale
anm
germinating
personify
akar
dava
silane
capades
bullfighters
anachronistically
estatal
marquises
erno
dutkiewicz
asar
misbehaved
herbalife
rayan
deflectors
equipo
bonspiel
culturelle
mehar
adly
hyden
cantley
jamborees
mnp
heeley
mariann
guillaumin
powerlines
franka
rampa
junipers
moris
nutcrackers
helots
throngs
billi
baranski
ity
quartile
bathtubs
marmite
nahman
inordinately
essel
aem
bathymetry
pulmonology
forded
ravalli
henckel
orosco
splices
abarca
cpsc
kahr
rauh
potton
transgress
immunizations
thackray
anahita
moistened
meja
comicbook
mccubbin
kies
gherman
hajjah
fleurieu
mcconnel
politis
lirico
lorax
huling
bajaur
chait
tampon
keibler
suliman
dach
gubin
autolycus
phantasia
itoi
zdenka
eelgrass
ganghwa
lucjan
doniger
quietness
wfld
frieder
westword
tsvetaeva
moire
broxton
shalev
westwick
dfi
cocina
persicaria
methylmercury
leadenhall
zeitoun
erding
betham
blurr
stanbic
keratoconus
milbourne
shouse
hominems
hotspurs
transliterating
biomimetic
neutering
tamwar
caliphates
talen
charny
dingman
preciado
ornamentals
drea
bulman
eventhough
minturn
cabinetry
sexier
sebago
bérard
sirajuddin
tresco
ausa
oaxacan
bju
pollet
rosicrucians
primitivo
nicko
riet
dayo
nishijima
parun
klopp
hawkish
kotal
skyview
qadim
westwind
antoniou
leonine
savaged
knoppix
chartiers
echevarria
azienda
qufu
meteo
aril
zoon
santigold
dispensationalism
glassblowing
mooi
couillard
fotografia
zapote
dongchuan
callendar
maltsev
pandev
colonus
awo
benbecula
anshe
pilbeam
tugging
isabell
hmos
stromatolites
enablement
fugit
lilah
seoane
verts
glycosylated
delphin
kilgallen
durazno
pariser
amyl
indexer
yuzawa
crystallised
menteng
seductress
morientes
vre
froid
dynevor
stukas
cañizares
volokh
slippy
stephon
tilsley
pohlmann
ruidoso
eyrie
firmino
cristiane
restaged
primatology
propitious
belmonts
oliviero
oberndorf
staughton
repercussion
soutar
manpads
munic
ameritrade
sitt
bryars
showplace
parshuram
neuraminidase
claughton
myelogenous
wrack
idd
mixx
poletti
blather
nimrud
rahma
dorotea
superiores
pawpaw
histologically
woodworker
subluxation
wyrley
montello
turnham
corwen
tartaric
griffis
appanoose
transcriptome
applejack
todhunter
predisposing
gál
qilin
kustom
neeley
hallstein
immunohistochemistry
taraf
photolithography
mantises
calcitonin
castalia
postmistress
yeardley
igniter
mids
machito
gridded
mosfets
ressources
portent
darah
groaning
hatto
fhs
zukofsky
fidget
chapala
showbox
gambir
makris
accentuating
dobzhansky
dargan
longuet
bronski
ozomatli
jingshan
boggle
ketogenic
pricked
alvarenga
fredericktown
hinn
reframe
stevedores
grittier
dewolfe
hdf
playmakers
embryological
calshot
rocio
rypien
ubykh
carentan
satanas
monopolist
woodyard
kelty
thatching
sharecropping
narborough
rathlin
kepa
takanohana
shizuo
trebor
foras
mousehole
reengineering
flemyng
ameri
leveque
streamy
disassembling
wholehearted
taels
wonk
gunder
esquires
kanun
crayton
pallette
gellman
tarps
selz
sightseers
papon
greylag
bikash
dursun
staunchest
velas
shoshoni
methodologically
everette
medstar
braam
tlv
ryrie
tyus
bathes
phuc
thorneycroft
servia
ebers
kadoorie
catrin
wanderlei
navasota
lacock
eugenicists
kafkaesque
goronwy
dier
cappie
natron
hermetically
slovenly
chenin
riada
putu
robbia
lockie
gleeful
savu
kristoffersen
shatila
counterfeiter
rocklin
manokwari
iwami
bdr
draycott
cnv
kiambu
cchs
hupp
gleann
pyxis
shoeing
alcoves
boogaard
namrata
abseiling
desjardin
dissing
gyp
boldfaced
aubergine
kasei
indexation
waterland
vejjajiva
sophos
wharfe
huckle
mansura
boffin
zawadzki
vct
rone
polaroids
impiety
bosie
fratricide
akaka
provocateurs
gyges
marchmont
swanky
comity
spingold
sawalha
kallman
dormice
hesitancy
birstall
lianas
dialled
gentner
elysia
norml
vaid
sgu
afonina
ferdy
ahrq
shallowly
yezhov
kirstein
birman
cannone
cesarini
shipmate
silents
zipped
thrale
aasen
consectetur
shrikant
pule
radiocommunication
ballyclare
waldner
gebauer
alexeyev
bogardus
quiksilver
piscataqua
popcap
priapus
brummel
gurnee
inno
cingular
veris
hotan
manca
levinsky
jigger
coville
maladministration
howlers
noëlle
sickest
adol
prodan
kawana
matri
medianews
kinberg
vlogger
alpacas
sftp
shub
jadeite
kilfenora
prematurity
jeopardise
dallapiccola
bemoaning
aracaju
merten
axminster
alfre
puth
bazi
madhukar
saguenéens
cleaveland
papo
joyeux
korbel
burri
kahului
teich
vmat
whiten
schirach
tulla
escudos
apportion
markovich
zupan
elco
kerridge
manabe
lisse
devadasi
sibbald
fowlers
morante
cism
jittery
dect
katar
dumbass
implementers
ghafoor
aeneus
tigo
bosquet
piscopo
rowand
canice
thiols
bii
dokdo
dyno
conejos
diseño
ogallala
podunk
kemalist
allora
chiffchaff
nucci
paspalum
cottonseed
añejo
drugstores
ecologic
vivacity
blotter
bnn
inac
yanga
ajw
zande
gardez
reginae
toogood
grunberg
firecrest
pisses
aht
smurfit
nally
circumvents
sanquhar
nailsea
neuchatel
experiance
barmy
pansa
avers
delmonico
toontown
klem
whi
toucans
holycross
ope
hoovers
blagdon
phonic
arhat
mounier
serrat
mistyped
newsworld
iemma
jdl
parchman
sakala
nudging
hypnotizes
thioredoxin
gingivitis
pugnacious
trachoma
ducey
schooley
guber
faurot
mercantilist
impatiently
longmen
meinen
jihadism
leprechauns
huitfeldt
metalhead
nijhuis
paragliders
patni
surkhet
zugspitze
yazawa
prudish
corbeau
gtm
ikeja
rokeach
charman
narrandera
micromanagement
marsland
danon
greenspun
heiser
hangovers
qualls
choluteca
bhan
caff
soundbites
rougier
yildiz
megacity
myrddin
parameswaran
bullington
jinling
federman
blackbaud
bankier
fmn
malebranche
weinreich
fantozzi
gep
poncelet
villaggio
saurav
flotte
blanshard
rampages
willink
plevna
hijinks
troyens
wipf
wike
toumba
grunfeld
sleman
alyona
cocoanut
grushko
ecotec
nela
encarnacion
voile
pue
impale
oldenbourg
roundtables
blakelock
monrad
boppard
ricardian
megane
portpatrick
odos
chindits
speculatively
seafoods
clampdown
rapt
croxley
prange
thd
corney
caistor
massé
birdwell
gallienne
machaut
antinomian
engendering
standpoints
tzar
crumbly
dess
silliest
shivdasani
cystine
decrypts
hangu
sodbury
larrikin
jiangmen
vocalise
sharlene
dekmeijere
tippi
deadlocks
mazurek
grigelis
courtright
guevarra
lydgate
amphipod
feuille
superweapon
bequeathing
libman
tapps
buju
bernet
fascinate
perodua
secchi
goncalves
widor
mhuire
blindsided
mauriac
changle
makovsky
contemptuously
steadfastness
tredwell
inculcated
covens
ermey
ppu
tarsiers
reliablity
reauthorized
nuaimi
grangetown
transvestites
cleanser
pinhas
seow
crne
martinière
cals
demyelinating
kait
brimfield
dunmow
tinga
unthinking
coblentz
kutxa
greyer
soubrette
biv
galvanize
audrina
shoelaces
icecap
sandel
wondolowski
oberursel
khodro
hixson
wunsch
merrillville
nanai
incorporations
warplane
unhygienic
gwangyang
johnathon
phaseout
bolter
synchronisms
neurophysiological
godstone
vendrell
orsa
negar
straube
mummery
huie
petrino
henryson
newborough
juna
hayk
yussuf
hazari
ogletree
avinguda
indwelling
kahta
iceni
stevia
sparx
yushin
tawfik
demetris
merryweather
prohibitionists
trongsa
kharbanda
imposible
xanthos
padula
hpe
tiradentes
convivial
pustules
clapped
informationweek
anantha
mcq
appletalk
toxicologist
wispy
scurlock
revolutionists
creche
skepta
dorough
coelophysis
snowdrop
clarkin
stastny
drigo
assignable
agit
mcdonogh
indirection
vitex
dramaturge
screwdrivers
pomerantz
autobus
dinoflagellate
paideia
antipode
concision
poetess
ruggeri
cataño
wykes
jpa
sutcliff
stebe
yare
matata
phaneuf
rummage
berard
jumanji
fannish
kirsti
senigallia
aiu
intime
indaba
hodkinson
gordillo
masroor
grump
sdg
andon
unreformed
hamida
durston
boychoir
protuberances
micronutrient
antipodean
rioch
tagliani
wadewitz
manege
botanically
fabritius
kohner
metropark
bijli
redskin
konarak
eyeglass
ljuba
marquetry
gertner
picquet
porticoes
snaith
spiraea
rybakov
licey
pedi
nawi
schroth
presario
judokas
belzoni
sanctimonious
chevrolets
believability
disbarment
multiforme
shillington
luchon
proclivities
junky
tomoaki
wrs
gardiners
gymnosperm
roundhouses
kassir
kateri
cmmi
bluntness
bodger
orser
chikatilo
swapan
vulliamy
markis
neurochemistry
neapolitans
packham
irlam
ampicillin
zanna
mitter
attitudinal
mailroom
temporality
serang
kremen
horseheads
optimising
tailpipe
litas
hazmi
miniato
lobachevsky
grapplers
chiasson
soulwax
navaho
subzero
diplock
southaven
fridley
jessa
jazirah
topor
chesil
starlifter
schroeter
bernsen
sello
amalienborg
badea
chogm
fitters
smellie
newtonmore
disneysea
wcca
amlin
clarets
ramkumar
stokley
padmore
meramec
cairncross
qaiser
konstantine
reischauer
evershed
warshaw
solh
sadeh
snyders
lamppost
duds
bednar
macrocephalus
ttd
victorine
teel
combated
epd
riazuddin
darmon
carbonari
remini
kozin
verhulst
phalaropes
bolla
silex
jaccard
mamedov
koertzen
bisection
simonet
nsm
jantjies
tomball
dfo
slb
cranley
geldern
ecx
sparkplug
romanies
chailly
kurien
bertold
aberdour
keister
donahoe
salins
alema
anhydrase
troubador
cryptosporidium
tydings
thaxton
philharmonique
perioperative
daines
boccioni
eskom
kuepper
pgl
donell
zade
liverpudlian
kyalami
bulgari
loners
menhaden
larosa
autobiographic
applescript
zeppo
chaffinch
moroney
azin
macca
hypergolic
danesh
fatto
coenen
boardinghouse
buttermere
haki
blustery
hochman
xinhuanet
monégasque
nicoletti
ganilau
gasconade
mtx
gonzáles
occitania
maudling
anklam
coxsackie
hadrosaurs
torching
galeotti
gose
issey
provencher
macavity
nitrites
leitz
epee
scrutineers
coursers
eichendorff
olie
kayu
arolsen
sebastiani
annand
olduvai
blanchflower
unio
badgley
départ
wides
pirkanmaa
rabuka
woodlark
callicebus
protégée
plastids
hasebe
hozier
sherpur
chevette
talese
luhya
quinney
colorfully
calzado
lorcan
ankita
perouse
scurfield
brandished
cumbrae
noriaki
seno
kangwon
casework
woi
mswati
senju
transcribes
doubleclick
immuno
oiling
kupe
hooped
vincenz
billeting
janjaweed
sidetrack
newsworthiness
badcock
erlandsson
iws
guentheri
heathlands
tox
mountainsides
resenting
guff
fumihiko
iah
weidmann
durrow
sublimity
dengel
fatalistic
kelliher
ond
tatmadaw
floodgate
frimpong
congee
samaranch
steinhauer
churchgoers
outfest
garenne
selflessly
branchville
edme
victuallers
fransen
airhead
briatore
shamefully
catelyn
bargnani
stridently
delahanty
devaughn
shabelle
elmbridge
adamantine
buffum
gotovina
kratt
meili
waylaid
kyren
jamb
monfalcone
delap
grano
abdulmutallab
cowrie
mediacityuk
nomar
klerksdorp
meazza
laxalt
paramahamsa
celestin
tasc
dubarry
romantique
weighton
rott
intermissions
muharrem
ellingson
odeh
klik
sighet
bersani
opoku
ctb
ljubo
messiahs
sketchup
orczy
crankshafts
apolinario
mélange
kingsmeadow
sanitorium
kiwanuka
alesana
roten
sabini
darting
chitons
mantels
laundress
mccambridge
wilhelmsen
myriads
redefines
sertraline
bitzer
ohe
hunks
kozlovsky
deppe
douse
brammer
snakeskin
quakertown
toric
cendrillon
vogels
cementum
mieczyslaw
withe
microclimates
brekke
faxed
highflyer
majka
cortège
oaa
atys
imca
tomales
steeping
femtosecond
croaking
fola
kihn
avonside
undermanned
etats
wyner
hoyte
denk
salwar
carbonyls
birthrate
khloé
claymont
grotesques
fawad
legon
customisable
moland
smithwick
metabolizing
kornfeld
iatrogenic
westworld
frenulum
noga
allée
cannondale
burse
apoc
melito
heaping
pappalardi
darell
naltrexone
chaiya
spindly
sorbo
gatting
luau
foxsports
hode
ixion
galler
jashari
kapiolani
dambulla
arditti
salafism
hadow
skamania
tnm
hiromu
inspiral
papuans
plagiarize
morone
umara
capitole
gagging
watercolorists
lundi
polystichum
relinquishment
demark
baglan
stransky
deridder
ragib
customizations
frigerio
swedenborgian
bhc
dyal
tarbuck
singhal
fanelli
titano
juran
cudmore
finiteness
adventuress
appert
ravioli
gustatory
brinson
maini
weisser
ngaio
smithing
arnaut
broadhead
pash
sweelinck
dongcheng
myshkin
guerrieri
rosenbach
fenelon
passa
covet
mitsuharu
ciprofloxacin
convulsive
beringer
bohli
kandiyohi
trackpad
rauscher
xincheng
dewees
rakha
linxia
hench
acrylamide
harnick
weinhold
balog
countdowns
reshoot
hellbent
bagri
salsoul
yaf
hogshead
buzek
beavercreek
chandlers
melisa
allcock
shaista
diamanti
parvovirus
polychromatic
quirinale
ugalde
thronged
dawit
knik
commodious
wra
saleable
yuanzhang
palen
apax
cutthroats
wecker
biela
dyckman
pattee
olesya
herry
rahel
anodyne
nrel
tooheys
grint
maré
huu
charanga
metamora
reducer
dimmock
corzo
chelmer
regressions
mylan
khandan
hundredweight
thakin
poppleton
leumi
espaillat
marihuana
jugan
subcritical
melanson
korematsu
colliculus
forestland
outeniqua
agos
xlib
rathmann
bainton
wut
pennzoil
fignon
corries
responce
quoque
spon
madiun
siuslaw
plushenko
kerin
undignified
dragoman
golde
savannahs
rumanian
makuhari
único
swampscott
edicion
blackmoor
arshavin
godhood
viegas
taxol
whys
certainties
blokhin
pompe
marbling
josue
transfrontier
decelerating
leptospermum
tkm
whirlpools
guerlain
trifonov
misconceived
berryessa
gusting
specialisations
misperception
diga
hohenfels
rausing
immacolata
krg
oprea
abdelhamid
fryatt
balks
highclere
indifferently
ottinger
dgm
outpointed
workroom
ncm
spinet
rending
deactivates
stayin
iordache
exergy
turina
luskin
fluorinated
derrike
keven
rüsselsheim
flippin
nother
markéta
macroom
bgb
chini
workgroups
cavafy
hedon
gentilhomme
polyposis
wiremu
severnaya
syringa
shoveling
rco
embargoed
streptococci
jincheng
izaguirre
damsels
cellach
trueba
aviaries
ceiriog
benbrook
ashar
damato
laxey
lubango
imanol
shako
monteagudo
arrigoni
footmen
supercups
ashkenaz
wenhua
emina
truesdell
vhsl
bristlecone
middleport
gwon
sayo
uselessly
bleriot
vtc
ruggedness
algie
givati
rogerio
teleplays
shero
homebound
counterclaims
ballynahinch
hewer
reamer
humm
dreads
södermalm
lenahan
dishing
bamar
crossville
helu
commodo
boch
meanness
ony
takehiko
rohwer
suber
slappy
jurgensen
paleoanthropology
holbert
dhananjay
gallura
jannings
hematocrit
lecca
mazatlan
rompuy
paddick
rastafarians
perforate
lrr
volcom
shish
luks
linthicum
prydain
lewa
yaounde
infantas
nupur
grunting
tahr
promotionally
bizness
poundstone
tbo
doru
barbets
gether
fenugreek
musher
usfa
renai
addysg
goldmember
remco
vme
scoffs
weimer
downplays
vegetatively
dtr
grotowski
ducie
biederman
eugenides
atiq
applebee
reckons
borgne
beaching
sarkissian
mangas
gobbo
yihan
greenshank
moder
horch
tinton
ple
cupids
yochai
hitchhiked
hitchings
littleborough
jaen
pranking
sturgeons
metreveli
rmr
kerins
madrox
animosities
anthropometric
schottenheimer
kosma
budworth
hande
gatehouses
oxton
hallucinate
clicquot
lenght
townson
partum
leftward
whitesell
carné
hqs
redoubled
overdo
mcilhenny
waterlow
veale
arkle
nudged
iram
aeterna
heptathletes
daisaku
jmi
azzi
lsk
kws
superheavy
shushan
amh
madryn
storyboarding
semler
hiiumaa
ryley
madar
haseena
bloggingheads
cnl
balco
fitt
bross
toadstool
ghori
muting
elcho
pimm
betacam
mojtaba
cottesmore
sincerest
mcsween
outen
miscast
grytviken
toku
poutine
jimeno
kendell
designee
norinco
grisha
lobi
stampe
adze
genco
siddall
brindabella
trishna
gafsa
hjelm
ramires
sharett
morrisey
halabja
irks
khand
gravenhurst
flighty
tofig
tactless
ofir
mackinlay
gerrans
hurray
freeh
kena
personifying
faulds
nedra
tarbox
maybole
pavao
knm
holsten
sidestepped
lna
hiba
grieb
bylines
trobriand
bunda
orkla
snore
peikoff
padden
wookey
putte
postmodernity
bergquist
eab
pecorino
vowles
ytb
fantoni
laurell
tangerines
manica
archaean
janardhan
intraoperative
foxconn
bens
。
palden
razz
pugilist
smitherman
aimes
scoff
tutwiler
candomblé
cornstarch
epsrc
katamon
sunstein
taharqa
makalu
rothenstein
parappa
sambucus
matvey
jagex
grinberg
kaleem
altobelli
clambake
eastcote
reiman
persaud
kentridge
metroparks
gattuso
luzhkov
moho
sjeng
battlezone
braybrook
hornig
talab
photometer
tyros
yabba
kenema
helluva
gle
wooding
ispahan
vossloh
dinning
chavanel
necromancers
egitto
lyonel
texier
calamaro
baglioni
wickedly
saviola
brickhouse
nier
chiemsee
siebe
kelling
cherche
habonim
kleiman
telangiectasia
svi
alpino
globalism
reheat
winterstein
haggerston
vrede
reaktion
furtive
culler
icos
dair
arkadi
fatin
backtracked
unshakable
funck
chital
redbrick
ziona
soufrière
francophile
beauchemin
republique
thach
noetic
uniacke
thore
gyurcsány
asheton
calorific
mariotti
naturists
torrealba
shahnaz
buttrey
khou
crealy
boatload
inchoate
crossfield
lenis
beloff
vanstone
decimating
beefsteak
cavalcante
jaggi
weyr
rhymesayers
nankana
stoddert
satun
schikaneder
brownsea
desormeaux
carphone
quercetin
exacts
lucker
upt
subhead
chandrayaan
informe
modellers
stonecutters
listless
equines
dueled
barad
beter
gynecomastia
teodorescu
márta
vusi
crac
diyar
yeatman
gile
niggles
spurns
brinda
moschata
reu
aile
ptu
rehearses
kaganovich
cyberstalking
piao
gtld
backstories
costain
bullosa
dicken
engelstad
sways
strahl
crossway
efp
vitt
recaptures
delimiting
gunnel
cupped
zamin
plexippus
easthampton
polarities
munni
michaelangelo
stuxnet
unperformed
dungarpur
kinnunen
torchbearer
corb
oauth
diagramming
defibrillators
heynckes
wwp
fabbrica
slaveholding
tippet
baudelaires
wersching
roquebrune
traherne
chinstrap
bordoni
combretum
catherines
farook
hutterites
everwood
combust
widebody
savita
poley
kenwright
hauntingly
céu
galligan
feebly
merrier
gillray
kramers
ariella
kurban
burnes
oligo
anoa
foret
cinemark
korangi
uffe
tolleson
mesi
augmentative
chafe
voiceprint
devaluing
neuropathology
electrospray
takai
semin
trivialities
unserviceable
measurably
inra
bgt
rizk
baria
artichokes
waitt
crenna
vertol
irondale
charlebois
intercooled
exacerbation
drophead
cpsl
amidon
hamara
bereza
jackpots
vindictiveness
itr
sinless
landsbanki
determinedly
kinneret
tucks
maclin
yasemin
lipopolysaccharide
lidice
canopic
octogenarian
ahti
americal
harket
terrorised
protestation
vanillin
problematically
aksai
skud
jamyang
pouget
pgn
marteau
atonality
folau
microscale
vitruvian
okhrana
sagres
cadieux
jianghu
corridos
corrido
voeckler
pennie
deloria
anvar
fto
catalytically
bluhm
minks
blackmer
ulyanov
mnsu
institutionalize
hankook
brights
wats
baskett
superstores
bulow
comey
tanglin
retool
caporetto
savvas
kangerlussuaq
kromer
purdon
curtly
gruyère
sitek
goondiwindi
theodoor
kaunitz
pmk
demographers
bilitis
landin
gingras
moxey
htlv
seybold
lazytown
harline
chevening
turchin
malmberg
pjm
hydropneumatic
angiogenic
doornbos
krasne
interlinking
doko
spectrophotometer
andaluz
xstrata
shahs
annalee
aak
vilá
hamstrung
monserrat
arlie
ziggo
urubamba
chiti
ionising
emeraude
crosswise
krannert
jansons
sojourners
hugin
arthritic
butterscotch
diano
assassinates
cafiero
torpedoing
hearthstone
exuded
uschi
alaba
cozzens
taimur
longboard
cbg
hogwood
silviculture
hourani
bacca
kfir
kanha
maudie
mennen
paveway
orosz
montvale
registro
paiutes
tavor
stretchered
giesbert
dingbat
copeman
wideman
chayim
filariasis
longline
cedrus
matamata
cohl
overachievers
voidable
kaili
miyama
micmac
umrg
euskirchen
contiguity
subcultural
facemask
lesar
raconteurs
hevesi
wangchuk
yili
toño
stephano
finks
sekine
bangin
payrolls
desenvolvimento
draeger
uncg
muirfield
zat
gadjah
hinchliffe
quadros
mellis
berlinetta
crestline
dannatt
trireme
serenissima
zeeuw
llanera
ideologist
mapusa
callings
gaudino
reznik
reuel
broadleaved
quast
pluggable
soiree
docter
darbyshire
publick
disbelieving
burghfield
snus
shubenacadie
jodha
bardolph
tenda
longbottom
sydnor
breakaways
crescenta
nabbed
anaglyph
tsukada
eleazer
esquina
confiance
crisscrossed
tàpies
dkv
harr
murnane
armero
cryogenically
coteaux
wrey
naaman
maim
alemany
finnan
newbiggin
kezia
poto
elser
nachos
yushu
kulak
gunness
shapely
fraile
zollverein
wolper
abschied
freek
chartism
hussle
codebook
petrone
liberatore
borinquen
carreon
collinge
kaolinite
zubkov
impregnation
kaizen
spams
nester
demyelination
rearranges
lagaan
swensen
faull
missourian
kooyong
recommence
clefs
glaw
ahrc
utamaro
finanza
monto
biphenyls
seydou
ouvrier
echidnas
expensively
cranstoun
fiero
matagalpa
headscarves
absorptive
inam
bakari
ismat
mauk
amorphophallus
eae
rosaries
loosestrife
boetticher
aumann
loanee
soumya
updater
morier
nyenrode
vatan
sophronia
hutchens
sulfonamide
matteucci
gcw
undocked
huaorani
sensorineural
underexposed
negaunee
tmn
rsu
cuhk
brunello
superjet
contrite
baggot
foxhall
safdie
bratunac
aidid
beranek
mercuric
ulis
braving
forsaking
latifi
zacarías
jianwen
clickbank
caws
ldk
caryatids
xdrive
janak
mde
yarrawonga
mcstay
zacharie
glancy
llong
feuilles
montalcino
illegals
armitstead
israels
defoliation
ppk
unai
aceves
supertram
itzá
lavelli
warham
lpb
varlamov
império
ciani
kringle
thiry
runcie
eightball
michna
vlbi
dahm
taxonomical
donnelley
rebuking
reddin
haddadi
austins
hmd
vipin
fantasizing
pallavicino
acheampong
rivularis
mineralisation
stieg
relevence
scallion
loges
iheu
haloperidol
edmiston
stevenston
suss
westfalenstadion
jomon
arlberg
mumma
comsat
belorussia
yingying
thrawn
chablis
novelistic
lyari
lüderitz
flounders
transavia
videojug
prattville
junon
somogyi
elah
yacine
soju
crasher
lono
carrs
brena
prell
roseberry
dico
olivero
ugyen
mahboob
bourdin
luard
vacationed
autzen
wrongness
fredriksen
romagnoli
itakura
maynor
lungo
luminal
freakazoid
leete
finniss
doku
preclearance
assayed
unconverted
groundsel
lur
cholerae
hungarica
fitment
inec
steevens
anaphylactic
vay
eckardt
sallah
centavo
ayios
schnorr
evenson
ashtanga
dymock
retest
brancaccio
guapo
semtex
mvno
tianshui
puleston
intangibles
graveolens
kito
cachaça
palika
vivanco
neander
treadaway
ouvrière
schull
legum
expediting
varians
débat
maitri
adeyemi
bielecki
liat
gerri
olander
tecnica
heger
apta
kumu
obviated
soirées
broomhall
loayza
ukranian
unfixable
papanikolaou
elkridge
babbit
electrophoretic
suphan
henoch
treorchy
romanza
knobloch
roché
brownhills
horsforth
karoly
orakzai
luol
jme
norville
granola
gaviota
knechtel
arbi
makowski
willingboro
francistown
ifsc
whatsonstage
beltane
bilevel
copia
coliform
heimann
actra
ishan
cleverest
hurtling
drewes
negrón
shahnawaz
astoundingly
staph
legalising
legitimated
abdurahman
unarmoured
czars
catus
lukyanenko
mmk
germersheim
interrupter
sangu
pipkin
antenor
bintan
chynoweth
noobs
boulting
ncat
hilden
partida
crailsheim
lunel
edan
mosi
alpinism
polyomavirus
shr
satra
psdb
nonconsecutive
hawkesworth
pallot
tancharoen
maula
fwb
gunnedah
baraat
mankell
upholder
menstruating
reasearch
exuma
creu
veras
hurum
ngb
portholes
oropeza
umr
hebraic
hems
germanos
contemporaneo
wiggling
contos
ravenglass
bellotto
brooklin
gruda
hammerschmidt
hagger
chavasse
raggi
falters
blatche
cantillon
huevos
leser
habiba
grisaille
villalpando
cofidis
peopling
ladywood
whet
caddick
studiously
unimog
lemmens
surfed
tisdall
strassman
beefy
falconbridge
knauf
dalgliesh
blocher
marham
mossop
hagiographical
abdala
krauser
honeys
blairstown
luangwa
liming
breitling
bromance
pegmatite
odorant
thur
nikulin
bogosian
energyaustralia
judenrat
tvoz
dyfi
diante
carcinoid
ecosse
ilmor
guice
hursley
excising
unpacked
bln
cristiani
britains
tofte
upcountry
avezzano
bhullar
kazys
grigorenko
pushrods
cinch
rockhounds
fernhill
clamour
obstructionist
steagall
waster
factcheck
kimon
sensationally
dickensian
raduga
bantustan
kames
lenzie
minehunter
reca
underdown
koninck
ruhe
medard
wmi
selahattin
succour
zipcode
neutralist
abdin
dewberry
redrafted
foulois
zucchi
lyrebird
valerii
slauson
fedorova
groundcover
wenshan
rainville
janowitz
munz
contortions
mutuality
protos
greengrocer
hurra
warhorse
berline
sheaffer
edenbridge
fernandinho
lehn
amphlett
ergin
ballerup
cbh
progeria
bayrou
tamerlan
wtbs
unenviable
manam
jacquot
mactaggart
angol
freediving
shapeless
attia
moscato
lhote
westbank
bashan
ghada
anel
codling
zsuzsanna
tarak
cech
guthlac
younus
perrey
chaparro
whyy
hoje
liff
headlock
fdg
lofting
mous
porterhouse
joely
brosque
sutan
snowbirds
heeren
tereshkova
marshaling
hudgins
everbank
undiluted
sarra
radhi
yaxley
neumayer
andamans
yevgeniya
gervasoni
lederberg
strathairn
chakrapani
cauchon
dagg
womanly
flórez
gaudreau
innovia
pilecki
mullis
gatien
aubagne
duncanson
craver
oesterreich
nureddin
dejavu
swooped
supershow
autozone
lengthens
payan
nrr
paramotor
apperson
kleinfeld
siskins
udas
hematologic
pdn
dishwashing
staniforth
caminha
stockyard
cyphers
masvidal
theus
thm
leschi
wispa
bugbear
herber
dolomitic
remarries
cymbidium
powergen
movember
ucg
sylver
wakil
hayhurst
tetras
mudgal
atlantas
contepomi
wend
feversham
thure
plainsman
roenick
interethnic
lubomir
magtf
soffer
gamgee
marklund
younge
gowland
mingyi
inclán
hassidic
salivation
reichelt
harlon
bullis
isv
povera
rajabhat
predella
wemba
trev
grantchester
photoshopping
slants
llave
goofs
levitated
conformists
diophantus
barsuk
trews
aicpa
isinglass
arland
architectura
dexterous
pecker
queensbridge
mahlangu
oadby
benedito
psychophysiology
entices
dandelions
eus
coudersport
suginami
douwe
hagberg
piatigorsky
detweiler
hostiles
ology
pandurang
imperishable
hinks
ryuki
barremian
mru
emulsifier
ensa
leatherstocking
dewulf
lasith
fyodorovna
asifa
ramlee
mellons
documentations
platting
pointillist
nifc
silivri
bennis
croly
nugroho
guglielmi
casati
mpe
alecto
nymphomaniac
nasties
carny
tarar
shilts
zillertal
bankrolled
desean
florestan
zongo
amoebic
infringer
vasser
hagopian
tomah
lipe
catamount
mewes
pout
tervuren
chowdhary
myocarditis
hungover
mauchly
jadranka
mediatised
portslade
gonad
westby
noiret
minear
kinesthetic
saraya
tuks
abstains
smulders
qandil
winsted
shepshed
ravensbourne
rorem
strathaven
denaby
macgibbon
icpc
drachmas
landesman
overpaid
redcoat
butyric
putten
parata
crans
sparebank
handguard
vowell
bhama
millinocket
wuerttemberg
hinoki
gutfeld
bierut
tenuously
nenana
brevoort
tippmann
lindman
saltz
href
cypripedium
fmo
inclusionary
tuggerah
sukha
mcginniss
mytishchi
batton
tesol
lightships
lemp
valkenburgh
prayerbook
bangalter
neate
rdt
quotidian
hashana
kidwell
vetoing
paille
kamber
encina
usvi
grenz
concannon
gumbs
batan
bakhtiyar
pietrangeli
antis
abedi
selfishly
honesdale
pank
scherr
bertalan
namsan
stricklin
lazaridis
spilsbury
undocking
cadenzas
calistoga
cenote
erms
unrecognisable
jurica
tokoroa
arisaka
maroto
warded
quee
kristan
singel
vrenna
sinaia
parcell
quizzed
wails
waltheof
pentito
ridgeley
feig
laurus
eku
izbica
lainie
pascoal
breydel
ryans
proba
archdruid
captaincies
spahr
iaquinta
polygamists
astudillo
cribbins
macguffin
solá
derk
juwan
streambed
saimaa
gerdes
lleras
sando
tinkerbell
caressing
renne
arnt
kastrup
esad
baganda
vaka
cajón
superbird
endovascular
automobil
castagna
nityananda
vinicio
anthropomorphized
entergy
majic
blanchardstown
installable
cynthiana
untergang
karroubi
thomasine
expropriate
geagea
stepanek
bellone
consequentialism
ghrelin
hexum
lucchini
towton
ucu
reginaldo
resound
canonizations
marye
grounder
geoffrion
bousfield
dunigan
ghiberti
unquestioning
punya
gedda
ostermann
witz
ordinaire
viau
gigue
tomino
tropicale
selander
boeng
kosmas
kattan
straggling
ikue
outpolling
sinon
xve
bouma
tifa
bambu
liuyang
olymp
glasper
bys
spruill
serry
carriere
cya
acda
ipkf
banzer
evangelina
meed
spellbinding
kookaburras
mikita
stanbury
meimei
olivas
colucci
volubilis
swin
narcissa
skeen
stomatal
anansie
wgsn
typus
andréa
treadle
kwanzaa
bettors
mdu
adelia
gogi
ochsner
svan
kazue
southee
menna
grenelle
woodblocks
musky
caligiuri
rvt
assante
bumstead
khadka
atla
syngenta
zviad
kompany
mckern
lipietz
ican
ncua
allgaier
buts
seocho
pred
hoofed
surekha
fota
sleipnir
siff
festspielhaus
railfans
amanah
casements
arabized
arnab
kolk
mukund
megastore
perkasa
railyard
bocuse
broadland
haddo
bündchen
pedants
bundesverdienstkreuz
chinoise
gassner
nems
usap
winchendon
troubleshooter
sanjo
anaesthetists
lomba
huntersville
mckelvie
sureties
insideout
woori
allayed
vrml
mopar
milarepa
rafinha
toxics
thelen
knockoff
dùn
nibelung
terrify
gomm
emendations
blackstreet
beca
lynas
mezcal
kiplagat
dorsch
ardzinba
simbirsk
vinyls
pretreatment
witchfinder
smallness
enis
tshabalala
mctiernan
roundworm
likley
boudet
evolver
lanesborough
notate
flirtations
fujin
evetts
gwenda
subroto
goldthorpe
idyllwild
fantasticks
badoer
jouko
buti
jayenge
methyltransferases
dotterel
mahamat
renmark
psittacosaurus
wansbeck
cossa
hypertransport
wingmen
kalm
lundby
gaer
reclassify
phillipson
hallahan
invigorate
binn
intersting
deltic
kingswear
casali
interrelations
salmonid
luker
krier
arncliffe
barnfield
phthalates
zhijun
kds
saccharin
codebreaker
karlis
aberfoyle
rhayader
bouterse
precognitive
purchas
dispositive
timeframes
lowville
polyamorous
rapidshare
korkut
bonnin
herbalists
squirting
tlp
cobleskill
limericks
mihaly
infirmities
ribisi
swiped
reut
ganong
calvaire
cystitis
sarney
crr
heworth
westberg
phaistos
adjani
cryogenics
namak
jaquet
rajmahal
yanagawa
nollaig
sirimavo
swerving
taplow
curva
gallops
moscheles
almer
exc
hexameters
semiannual
hubcaps
yasuharu
finalising
yudkin
dentsu
polythene
tartini
bernau
southtown
dewalt
seim
masset
lsf
paga
expiation
monopoli
opennet
kirker
behera
sbm
manel
clod
songster
sonido
obinna
cooldown
fiberboard
pomaks
signifigant
nègre
waiau
farnon
mofo
gwladys
philosophe
postiga
murga
fasb
farmar
panjim
firebombed
leddy
sukhwinder
eichberg
zhaoqing
amputate
fishpond
ahed
secretaryship
teikoku
antithrombin
uechi
pintos
kotze
doubtlessly
astrophotography
kolombangara
danh
congonhas
hanning
pezzi
hejazi
kilohertz
religous
pereiro
caille
brislington
slumps
cebus
omnibuses
dislocating
tristesse
patthar
firbank
virgili
monie
ccdi
olancho
recategorized
internee
rison
vorst
sorg
grottos
germanwings
virginio
avaricious
abutilon
algemeen
lamest
scotsmen
ihh
wenz
ruttan
montecarlo
panizza
maco
cayton
nudie
toronado
hyperparathyroidism
avinor
atypically
phytochemicals
anamur
ewok
carpaccio
rollerball
branwell
gezira
furqan
clemmensen
kingside
umw
reiffel
makos
scattershot
fetisov
oesophageal
sardy
gunk
smirk
reputational
bisht
markin
sequenza
cashews
gambusia
fairlawn
gorny
steeg
upd
marilou
emami
dhm
bipedalism
ires
lötschberg
psychos
pemphigus
lithe
ioof
danzan
yashoda
subsumes
arsalan
sixfold
larijani
vernalis
rotatable
banknorth
soroti
troisi
srx
avt
shorenstein
sarvodaya
devey
bolcom
zsuzsa
shaurya
paulton
candour
reichenberg
quicklime
diesen
jailers
cocu
unorganised
williamtown
ryusuke
qadian
percenters
brookshire
moet
viken
borane
slagle
unuseful
cunego
sautéed
tlds
zani
balderas
mixta
gauley
beu
salaryman
jasin
trona
aliona
missive
dfd
alcindor
sanko
tole
siragusa
inamdar
macchia
wernick
subwoofers
magnox
fras
jsw
reinstein
archtop
superimposing
kraton
bleibtreu
peñalosa
cleverley
prideful
aberffraw
juvenilia
etb
criminalised
temenos
lieberson
pravo
chapron
taenia
sarhad
wog
microfossils
procures
endorphin
kaus
latinization
hamama
gorgeously
crossen
acmi
wardrobes
samphire
plateosaurus
margai
gorenje
dabba
teterboro
bicep
drouet
unimodal
swick
bickham
kipps
consistantly
stilo
eniro
tene
ostrow
maspeth
leydon
ciardi
debunks
femen
barbastro
iguazu
mishkin
bultmann
reconditioning
evangelium
imperioli
dastan
kallas
schlöndorff
groupers
westhoff
punakha
savina
veja
semion
oxidising
blacken
inklings
egging
babo
sozopol
qashqai
tresses
awas
agin
karanja
lenthall
inhambane
sabmiller
rpd
cherrypicking
defibrillation
barakzai
werrington
starland
mondavi
sukhbir
centralise
snugly
recordkeeping
tomczak
cedilla
monstrance
idrees
loreen
boalt
bobak
badran
pasion
evison
cangzhou
luling
vohra
jofre
aggressions
mantelpiece
mccrimmon
mahagonny
wuzhou
wagh
bigham
publishable
bonynge
klec
oligodendrocytes
hadash
dharani
engadin
darkman
doxford
hottentots
pátzcuaro
lattimer
marner
amarjit
einion
leptotyphlops
coeditor
uja
pitbulls
tocopherol
horrifically
gamper
redoute
stalemated
topshop
banno
mehrtens
baddies
kirino
anusha
pétanque
tentacled
goolwa
earles
withing
lecouvreur
partitas
btl
arcachon
jeffress
fnm
burswood
dornbirn
shiprock
tilzer
megachurches
axholme
brochet
cognoscenti
attiyah
quinoline
sharansky
perec
jabo
noblewomen
windjammer
thetan
superimpose
quelea
oberammergau
reanimate
globalizing
mhr
tussaud
cotai
ezeiza
eustachio
kuai
etsy
böttcher
susans
ventress
schrock
hylan
crisper
glyfada
huub
shazia
amott
debnath
chairmanships
uros
dhana
vereinigte
newsies
jutes
greenlawn
anandan
aphrodisias
dollies
poku
debré
slee
mdpi
haulers
dimittis
dragstrip
jô
arlesey
müritz
nagaraja
manulife
regionalisation
leftism
nasscom
molucca
bastiat
cerone
concordes
rdb
ondina
ktn
wellwood
authorises
taillight
sandin
balcells
baildon
hapa
daire
tavera
refugia
foma
tarzi
shoudl
barthelme
dayuan
ruperto
elea
posy
nazarabad
irin
cyanosis
vetiver
vivants
emich
downslope
minurso
blandy
patteson
climaxing
bouman
dónde
ladera
akdeniz
preform
gielen
lunacharsky
vrana
zaloga
lactea
mirman
traiana
bevans
acclimatization
omv
marcelin
emei
otterlo
dromaeosaurids
cardone
slippin
wunna
freres
carlino
huayna
brau
azman
ghi
ngwenya
cowens
polyatomic
timofeev
metalled
surer
collectivisation
ikebana
marilla
naegele
lignum
frensham
viscose
cratons
handclapping
tyee
terbium
ascham
elopes
fitzjohn
burien
rushd
goffstown
nightside
graeber
federalization
pels
provenza
interposition
mahonia
extralegal
indiantown
lagat
beanz
zisis
cept
morticia
hazelhurst
babli
upplands
duveen
lamarckian
amandine
kanouté
attentively
wayamba
nonsteroidal
pims
muschamp
mcgonagle
netcom
furthur
kse
cresap
woyzeck
solli
lhamo
ivybridge
fernley
wagenknecht
shusaku
norrell
shoud
rono
swindling
siegbert
tauscher
umbral
sassnitz
tufte
christofer
pumphouse
overreact
laboratorio
aksyonov
puyol
kagel
limburger
energised
jeering
postponements
zephyrhills
subhumans
puissance
pilch
croyle
inconsolable
thistlethwaite
chrysostomos
eruv
kezar
overcharging
marigolds
stfu
abergele
miamisburg
riversharks
magaddino
goldener
foodie
blackshirt
hershberger
terrasse
ragni
cottonmouth
sessler
swifty
propst
plucks
abaca
siar
qishan
frie
deanship
amanpour
parkinsonism
randolf
torquil
cowed
maika
gypsys
alinsky
hauke
unicom
sahak
knabe
insubordinate
anthro
fulmars
chesty
metrocard
enclos
hirschsprung
digna
danehill
dibaba
ubm
bonapartist
dilfer
dinard
tortorella
carwin
rifampicin
pieterszoon
kyme
conches
doggerel
externals
orlin
wolbachia
poni
mabe
damai
zem
speciosus
hyperlocal
counterpoints
sovereigntist
synodical
poldark
hidayatullah
uce
lomb
shopaholic
mansingh
batfish
aftereffects
candra
bucked
aranguren
beltz
sfv
regev
overrules
hipódromo
colmenares
vpd
ballerini
vaccinia
wanita
berlinski
pulpwood
reconnoitered
azara
iiss
manhunters
weyler
opsin
hape
parzival
orli
medang
cloudiness
feeny
plu
lehning
moranis
cueing
gadchiroli
sammut
colavito
khmers
kishanganj
kenjiro
photoresist
galliard
tnp
oxidizers
kangan
demps
cockfield
suhas
outspent
perceivable
boyington
barzan
savernake
duman
lasch
callies
erris
raskolnikov
tamborine
holdstock
djarum
telemedia
saron
slayings
iese
lydie
paraparaumu
gunhild
gouna
llandrindod
jefferys
barin
ardnamurchan
doit
gazzara
tbe
vélo
barberry
rebalancing
jassem
hti
worshiper
yoshihara
narda
giz
lucera
bedwyn
leinsdorf
heimdal
seismologists
donoughmore
lebon
hinduja
sciascia
confluences
bht
buitrago
ambling
castaic
recalculated
towada
mimis
schoorel
liberata
gatty
adjei
constantina
cammie
feisal
destabilising
payola
aramid
beman
ballclub
parmigianino
grimwood
loredan
icca
arida
dold
mastication
humped
totemic
rearview
rüdesheim
delio
zoque
laminitis
shawm
majoritarian
backwell
papelbon
jacklin
heythrop
somchai
rembert
jobseekers
filkins
astronomic
metonym
gobelins
wallner
durrington
bocking
daulatabad
tamaqua
lrb
ranjana
wakeboard
defenestration
sparling
burgled
oseltamivir
godliness
cirri
montagny
marvelously
teese
manea
broudie
gringos
chila
prattle
arriola
evm
sepulcher
bighead
oreilles
gfi
mtz
beefing
anzus
denisa
salmonids
grantmaking
ojukwu
alliston
stabenow
maharlika
diep
posies
duele
holing
retarding
adey
winmau
pinault
unmanaged
abdicating
bombards
ruggieri
btob
roddam
npm
piscine
gonne
sherbini
unsentimental
plimer
silicosis
woodcocks
karapetyan
chango
koeln
primark
taylorville
jabr
ehara
torossian
rfr
dack
brewin
neckerchief
backes
yucaipa
medjugorje
wrentham
zeelandia
umut
firas
parmer
nathu
sny
anoxia
workbooks
flues
hamata
discriminative
merauke
dracaena
momoh
chagres
unicoi
tampons
sovetskaya
ratty
irresponsibly
confederal
litz
hoose
guardi
hamby
latifa
thorning
monne
gigawatt
wwd
nachum
banastre
rdo
bassée
buckfast
pegi
dcg
bactericidal
moloko
katydid
quarterfinalists
teste
eurico
guillet
mrqe
vlachos
maxed
intraparty
pastis
smuggles
haak
breccias
derwentwater
faddis
beakers
echinacea
hmis
ipsen
balloonists
bellerose
lopata
arter
padel
litherland
elmhirst
creve
mgf
residencia
christison
guntersville
yurii
scribbling
atw
macaronesia
warnecke
borton
beguine
carline
ankylosaurus
orana
platanthera
tullamarine
nfcr
sequestering
saltville
pivo
wts
lostpedia
purefoy
largesse
gracenote
allbritton
seneviratne
subsists
belfour
freundlich
banchory
forgivable
fse
keyworth
gaikwad
taxco
aleko
diaphragms
maderno
doob
middendorf
contrive
dialogical
methacrylate
leask
armless
hoodies
mucilage
pelias
dibs
baciu
gerty
vigeland
tarhan
nullifies
hotjobs
icsid
moutinho
concessionaires
safarov
legroom
snowbound
vaslav
svv
bioethanol
heyl
iztok
nolo
eschborn
webtv
fromelles
dalgleish
mulrooney
wacs
kilifi
celtica
aberaman
shabak
chesters
tungurahua
dungiven
brandl
sinning
jarrold
krasnov
pontardawe
gradings
rands
smithton
carnera
kabeer
cipollini
puttnam
sidharth
semmering
corradini
mercantil
dribbled
breithaupt
setton
kalita
embalse
pierogi
fabray
kinglake
sampedro
fireproofing
mentos
kochan
mussina
cissie
iconoclasts
olmi
généreux
antonovich
archundia
oversteer
anpp
wainscot
headrest
mangles
warnke
pelin
tawdry
gimson
pinetree
daxing
shoemaking
waists
bosun
deverell
superimposition
efts
kinane
eloping
salu
tobacconist
mittel
maranello
destructible
florine
inq
aylett
mccallion
lidiya
gaskins
gundry
pudgy
minimises
embury
bredesen
avron
desta
isandlwana
cowlishaw
coreopsis
littlewoods
bettino
kagera
widdowson
ramot
humanize
spironolactone
ampatuan
overend
penname
rainsy
solovetsky
menge
freiman
emotionality
deprogramming
garro
drumlin
globetrotter
abbiss
aerostar
wspa
xingtai
maghull
enroute
lefties
grobler
technopolis
suquamish
calumny
fazli
pennypacker
steckel
beobachter
ahm
matrox
vronsky
creaking
caston
izet
foxholes
botterill
otic
knauer
wanchai
sigfrid
tiantian
pescosolido
viveka
killiney
quim
cilmi
hannum
opinión
medios
localizations
ballasts
mussa
glos
redecoration
duer
vcp
totteridge
vuillard
superfinal
frontrunners
atanasoff
prophesying
saidabad
shyamal
temporomandibular
scaler
yamashina
suggestibility
akh
optoelectronic
homophobe
tongans
illiberal
rotisserie
imari
vitelli
akinori
paci
pelaez
cristin
madeiran
pepperoni
phunk
demaryius
workfare
concentrators
sneakily
yogendra
throughly
prolongs
ballyhoo
mccolgan
uptick
powderham
wylam
mixte
khaleel
sciarra
safavi
goodstein
psoriatic
shein
conurbations
krys
toughen
osteoclasts
kamlesh
bigard
hyssop
ertms
moustakas
toenail
tenbury
nessun
ohchr
brison
legere
alpaugh
garey
tabulate
handshakes
imagist
volstead
viren
thursby
beathard
malanga
ciaa
tensed
goehr
ivermectin
pni
recline
iurie
rostow
treetop
montargis
merak
donelly
pernfors
gherardi
buso
manat
arterials
iaas
cherrie
minelli
saski
scammed
esna
deyo
backhaul
eaglesham
mcelwain
zimbardo
fitzrovia
adjusters
jolfa
tomographic
trofimov
offeror
lamplighter
thibodeaux
fawaz
zarin
sahai
cornflower
struble
nuk
undoubtably
hovis
sissons
commissario
westcoast
dume
mkv
sweetening
borken
jiggy
tiffani
shontelle
oumarou
scouter
hildyard
pickpockets
bookend
noora
birdhouse
massachusett
ibogaine
tuborg
jrs
dardanelle
spatz
refloat
enteritis
jibes
haddix
milledge
mcgaw
blueish
fiac
klausen
thatcherism
diffusers
bolitoglossa
disenfranchise
yaquis
decaro
calvario
barberis
kall
pronk
darron
riso
sfpd
kohout
darra
fodio
aquanaut
zender
magique
shafei
omran
arvizu
nito
slugged
hemmer
osric
fokus
bergenfield
loulou
soran
tailfin
mathare
ddh
lerch
jidkova
vci
utt
ndrc
barthez
barnoldswick
forsteri
earpiece
eunomia
freewill
kujira
peristalsis
postmarked
redubbed
remar
wakeling
cassim
materazzi
henceforward
pitlochry
aflatoxin
renaudot
blakes
hotch
rearm
skaro
shuttlecock
mahram
untag
maybeck
effeminacy
algorithmically
forint
valeriya
lampposts
recio
feher
compère
hikmat
kayhan
csas
naras
unemotional
purton
uttaranchal
takaful
holliman
pock
guerres
polysyllabic
mapreduce
xabi
golmaal
arbab
heiland
comite
kaneto
puzzler
darvel
jitney
bowline
fownes
suckle
hondurans
sousaphone
cramlington
dessen
seacliff
gourcuff
halima
isos
sheff
bohinj
chins
bricklin
cambie
chlorpromazine
salutatorian
panayiotis
gigawatts
veces
potw
branton
wjr
jamshoro
brünn
adderall
solfège
grantsville
ofori
maisy
britishers
hidenori
overemphasis
isser
miniskirt
inclusivity
yifu
fura
pescadero
maralinga
bdk
sullavan
pierlot
ngari
carouge
yacob
dymchurch
kuznets
ayes
powe
perfections
herczeg
playsets
eritreans
hyades
geochemist
mashad
reverential
prestwood
nith
gosper
hiaasen
coati
laetare
zawya
moar
blott
helliwell
onstar
urayasu
eberstadt
sobule
jiaqing
fluorophores
roys
bolander
cchr
zhizn
wabbit
ruxandra
pyros
prowling
piazzas
fegan
arabo
sleuths
exudate
vesely
challah
xinyang
stringham
breno
maldita
staci
embl
edric
strathnaver
kanchenjunga
wetumpka
erard
domestique
bfbs
magri
madelaine
sixten
muny
luiseño
crewdson
enewetak
moru
azurite
externa
prude
ilulissat
indepedent
limiters
varitek
offley
fatmir
vfp
kabushiki
linq
daryll
vdsl
isea
andelman
ascione
hendaye
occlusive
budrys
barite
fukuzawa
headbanger
gallois
uyo
arteritis
buu
tulf
carpetbagger
brycheiniog
inducting
onca
rifugio
keleti
caravels
crecy
icra
icaza
suwanee
keldysh
terps
ahronoth
silicic
trexler
feedforward
swamping
frn
gangstas
sulfurous
premotor
antonyms
lutwyche
madruga
badgered
fluoresce
gerrards
reenact
swot
wews
kellum
toliara
bagenal
touareg
hantz
rikkyo
llanbadarn
wappinger
sproat
bordallo
bircham
calo
swaggering
multifarious
pallbearer
amundson
macondo
renova
triptychs
thalassarche
zenawi
applecross
manser
houseboats
lezcano
kic
goldthwaite
pmu
ahman
ferner
revanche
rahab
yone
dunklin
shaare
pseudonymously
cawthon
xoxo
maumoon
leerdam
seapower
aom
chattan
ledes
ferrin
topa
ramechhap
shunde
ehrenburg
denominazione
kupper
bhabhi
roath
ogan
coderre
chernova
raes
unaccented
habeeb
fairwood
poignantly
horder
aviles
revellers
tyla
earplugs
dehloran
nml
linfen
eveleth
mccaig
anyones
sokka
eversley
cupe
grandaddy
nsclc
coevorden
mcelhone
yamen
excercise
udell
tanwar
hogging
cjd
dulverton
jairam
ubud
bogong
celyn
sektor
hongbo
deportment
inhabitable
streetsville
cherrytree
salerni
bottrell
arati
presupposed
saguaros
blyden
fase
talento
azhari
bookplate
rogoff
larkhill
winbush
giraffa
trouts
vlade
evaporators
jagmohan
vakhsh
namecalling
compacting
morigami
hanza
reeperbahn
rossbach
barmby
quotidien
distel
monkwearmouth
mindedly
jobin
unrolled
tablespoon
extraterritoriality
lmb
agnon
alfredson
consistorial
minimisation
dweezil
superintend
joran
strock
kpi
overvalued
baybears
skipjacks
crugnola
memorise
cesana
speedcar
deta
morphic
knitter
funereal
passerelle
thrombotic
vocalize
priestman
wordstar
jarden
khieu
confiscates
caouette
anquan
ishrat
kriel
sizer
renae
hubba
euless
moshoeshoe
uchimura
courtside
aspirate
cohens
wächter
picone
uhlig
estey
orkin
kopitar
bordo
bardin
mugler
cerutti
navona
lysosome
algoa
youri
unexamined
objectified
eweek
mamdouh
pakeha
brooded
yellowcake
zst
schjelderup
churton
eprdf
vetinari
bucca
carriacou
lemus
doakes
tekla
cleanroom
npower
housekeepers
junipero
cutest
wks
benefactions
bianconi
wycherley
nna
heintzelman
aok
kabakov
manthey
unpleasantly
privée
kappler
hasely
clifftop
nichelle
workup
broxtowe
nervi
lysimachia
carlyon
bria
jaidev
karlsbad
albán
brey
polet
trypanosomiasis
overspending
wead
serology
laming
richemont
ekelund
requisitioning
rille
incivilities
moriches
minagawa
insulae
vilcabamba
kerrey
tarcisio
battlespace
charalambos
allready
vampira
extricated
hanoun
vickrey
orlowski
solitario
cotler
sobhan
puertollano
escargot
taze
unburied
disinfect
zenga
egyptological
afarensis
panfilov
corstorphine
nobuyoshi
laches
curi
lovestone
baigent
slivers
metalworks
chaiyaphum
molalla
lollar
brandish
enraging
ambu
kiyotaka
arcady
roughened
carlie
lavoe
polamalu
dorrington
ascendance
kracker
fychan
agarkar
nesuhi
balkhi
anche
leyes
myaskovsky
tucholsky
tequesta
drita
quoin
texoma
depreciate
vladas
wreathed
nnl
severally
filmy
nuna
hagenbeck
wahhabis
medawar
vergina
thiebaud
braye
vremya
yabe
nafees
munchkins
herri
alexiou
senorita
situating
smigel
vardanyan
prolly
clik
glenarm
aardvarks
atleti
vishwanathan
carita
ségolène
eboracum
rarebit
lecompte
shrift
jayasimha
cándido
centrica
créole
morrilton
skittish
auron
surfeit
dilettanti
ruban
cadwell
glassford
kaneshiro
ukyo
engvall
grajeda
buckenham
chéri
pember
immunofluorescence
grigoryev
aleta
queenslanders
hawkhurst
hundertwasser
zeh
tenanted
turriff
simonyan
skg
reedley
wlb
supernaturally
uncommunicative
broadhall
kobayakawa
kidwai
walrond
lynnfield
nanos
vae
problèmes
cica
unreliably
postpones
issoufou
chokehold
abalos
cuffed
lititz
sanaag
lowton
fairless
hauenstein
knap
vitals
senat
choux
tewari
ballentine
huguet
zarco
mecosta
mischaracterizing
moré
superieure
tunneled
doles
sweatt
mersch
rantings
duesberg
routings
patey
spick
clondalkin
iinet
tahsin
finegan
mundos
togethers
containerized
livesay
mawes
padania
bartek
neubert
scocco
lvg
seafire
ghoulish
rectitude
choros
kikyo
brinley
elte
dwarfing
unthank
misfolded
gerrish
aaronovitch
fbb
cyana
scrimshaw
pannu
nabob
laghman
nodded
heffley
consumptive
prologues
aise
shubin
chedid
treecreepers
yakushima
evasions
shiflett
carausius
overbury
stapley
bombadil
wooton
secc
putonghua
rtk
taiki
comeuppance
kayani
nejat
cattail
pxe
greenvale
pigg
holohan
hurtigruten
naag
lodestone
kailali
goffe
brynmawr
peskin
schemed
musicor
cottier
bayu
lijn
relives
mccalebb
bingbing
sealion
kove
foolin
munde
hpo
urgh
hackles
ignoramus
steelton
ruyi
fantasize
kabal
bordelais
norgate
lorene
assicurazioni
remixers
lavar
citicorp
bercovici
tweddle
rowlf
upv
zoll
morrice
jailbreaking
kissa
küng
gien
dmytryk
busbee
gabapentin
commack
enr
sugo
huangfu
fptp
metalmark
winningham
decriminalised
wingsuit
jankovic
aptitudes
kelvinside
bbf
jaroslaw
westgarth
koechner
ecologies
taronga
bramcote
varadarajan
definatly
hiranandani
dippers
endotracheal
morisot
arledge
kav
feininger
karratha
krakatau
prandelli
hargus
lloret
eveready
mimbres
foots
olo
garching
monotherapy
fatemeh
breakpoints
cordele
swig
conc
stanek
linsey
dyspepsia
lysa
wakhan
arauca
throttles
symmons
siebeck
lagarto
navdeep
unresolvable
virologists
dolezal
börje
unterfranken
uop
semipalmated
fridmann
runrig
etonian
shumlin
lashio
jayakumar
dcv
titchener
steelworker
munim
klaipeda
crazier
lordan
forbath
carbury
akhara
matthey
squeaking
hardenbergh
ortner
augmentations
glycan
boitano
mamiit
hinkler
sukiyaki
tashlin
dimock
uai
baddiel
baryonic
platformers
venevision
schmidhuber
ransacking
wisher
patuakhali
calisthenics
jessye
batemans
monga
agoraphobic
karon
robertshaw
meskwaki
fattest
bugliosi
fishwick
lebrecht
funda
chayefsky
nicolini
mahamadou
plaisirs
radyr
virtuality
feldt
destructiveness
pazzini
vossen
tandoori
brotha
huitzilopochtli
vedat
bestowal
nagler
muchalls
kaustuv
vias
homosapien
snooks
enforceability
proteas
sapulpa
despairs
brenchley
uninterruptible
westerbork
naic
nissin
aliza
hochi
mutism
aspergers
tawe
tumbledown
cavazos
coveney
tianshan
treece
isfahani
mireya
ndegeocello
sindhuli
lamarckism
haldon
wapi
sunstroke
brezno
twitches
verhagen
vtt
budai
linkers
klinefelter
belaúnde
idema
evreux
nalan
virtuosos
furries
squish
consolacion
tims
numinous
noria
ensley
merlins
krok
silverstream
brevik
wassermann
azabu
latimore
arnolds
kennywood
trium
marathoner
fischler
mirka
constantini
stooped
laning
lembit
sherilyn
kneejerk
%,
inglefield
outspokenness
carting
mends
vep
buca
eil
lambretta
baun
mcpeak
shahidi
wenonah
ohad
lalas
wkyc
reviewable
blaker
metroland
dispositional
javadi
esquerra
shapovalov
mawhinney
marquesan
veith
rustle
wielders
krapf
kaleidoscopes
compartmentalized
zelle
tontine
sycorax
wisse
melvil
buffeted
pangu
naat
sags
pgce
inglese
informatica
uncircumcised
wauters
coryton
clavis
kazuhito
austar
ellingsen
atha
abascal
absconding
sardesai
rvn
ardoyne
uncannily
hospitalier
godel
descript
footplate
dadaism
wecht
polden
povich
bloo
pecs
pranayama
ensnare
frasers
amies
kartini
penances
ryon
mcgeady
trafficante
preševo
fincke
glendive
ramsdell
babacar
yakir
polyneuropathy
ewoks
seres
sudesh
pagett
kizer
emoto
twiki
intellects
soundview
traverso
almog
steadiness
coffeeshop
mandaeans
paperbark
sambas
serle
mabey
derbi
erceg
bottesford
hocken
silurians
tokoro
apelles
dugger
flemmi
milnrow
haarlemmermeer
calpers
uncounted
quilliam
bubbled
brue
literalist
alphons
berlinguer
lipchitz
inhouse
flumes
engdahl
llorens
flamenca
woolfson
escándalo
filers
koppen
filarmonica
zorg
gobbledygook
jarome
microstate
rackley
stickman
toph
felin
changjiang
dusseldorf
schrier
heda
katsuyuki
abdon
ilja
pallotta
cratchit
kolker
flaked
lbp
pendent
placated
wowed
electrum
barbato
ayelet
hymie
belkacem
ashridge
sideboard
vanderpool
giulini
eagar
dickon
peterlee
horsburgh
soloway
clisson
dynamix
nimal
minidisc
bgn
greef
satirising
huntsmen
spadaro
frilly
ruffle
rohnert
rykov
tbt
spindletop
comprehensibility
balkhash
karm
jok
nighter
embroideries
priti
monopolizing
accessable
bushby
blaugrana
lescure
nura
jennet
tessellated
niyaz
cinquecento
hoghton
mujtaba
kojiro
evildoers
hanegev
selver
mckusick
stoners
neoplan
renege
valentim
bonecrusher
neorealist
bracy
confounds
beachheads
megamall
ankylosaur
hydrous
glenna
bambini
providential
esses
hryhoriy
pleshette
laguerta
casilda
microraptor
esben
swieten
jónsdóttir
wgr
mannington
whirled
seacole
izmailov
meaninglessness
uderzo
queneau
scali
reet
humored
maling
lyly
soundararajan
carport
edline
beheads
macdowall
tetteh
akhand
geely
tanel
mediatek
parmenter
hugger
clairsville
aralia
vsat
hains
muzaffargarh
raker
shiota
yoshito
narok
cadi
dayaks
delannoy
prière
pamphilj
madhepura
carelli
cumbrians
prebends
mallia
navon
uneaten
staplehurst
ulrica
stratagems
sitch
saulius
mkrtchyan
luzzi
goni
jonglei
leti
millets
corpsmen
harlesden
fattened
khayelitsha
caelum
whoi
skydrive
edem
norridgewock
shutoff
varejão
soooo
qft
guacamole
rajguru
patman
servet
carian
longmore
hryvnia
pavlenko
aftermaths
alí
kneeland
kiesling
gemelli
disparages
figge
confidences
shebelle
dosed
metall
cléo
furcifer
marable
reenacting
brockmann
quiddity
samant
hiromitsu
redrew
munawar
marber
hakobyan
ohanian
pegah
zanzibari
wagnalls
quabbin
boj
downcast
mazie
womble
hitlist
couvent
gnatcatcher
guangming
narnian
endesa
sarfatti
sylvius
nirenberg
aeronomy
lobate
faiyum
cholinesterase
abeer
khadim
rumbo
serkin
somerford
nakul
demarchi
musty
payerne
manie
papilloma
clydeside
treefrog
hnc
quivering
kiat
nodar
amaechi
previte
stet
jacquinot
bilardo
cullompton
duplass
mahad
pervasiveness
hafsa
aldin
hangmen
suras
noho
bantock
mirin
kiew
procureur
chunder
butner
maskin
cowbells
dasilva
consistancy
entrees
dinnington
bhairab
viju
aborts
schuurman
westar
quadruped
oliverio
rohs
fritzsche
winnicott
gastroenterologist
diphenyl
randor
collab
golla
hasard
selvi
scapegoating
krt
fcv
hughey
parmalat
moonfruit
velodromes
tversky
fiachra
dialer
withy
cristianos
hickel
wetherall
beeping
rosaria
peon
lavezzi
salvadore
indri
playthrough
fundraise
goddaughter
lindt
coasted
pyloric
cappielow
annia
hypnotised
cantoni
malviya
meusel
amalekites
postulation
birchfield
albia
audrain
basmati
nadin
kitara
profundity
rri
coh
legno
housley
kerikeri
dobyns
daas
brigstocke
thali
goiter
incinerate
seann
defazio
bobwhite
dyskobolia
briefcases
magnetospheric
silts
readwriteweb
khush
crowland
salkeld
peyser
wincott
warby
zhonghe
qaim
reaney
horrell
respectably
wobbling
brisket
hitchhike
anorexic
rhe
yanbu
onno
cashless
doorknob
salomons
liberalizing
cancionero
watten
nationalizing
infinito
anastasija
grimaud
yusheng
fardc
sexologists
patoka
foucher
chasuble
kiriyama
inkers
briquettes
kcna
bassem
bombonera
cupp
valentyn
dedi
gammage
hiromichi
tormo
danquah
pecked
mpx
otb
tinoco
imeem
zweifel
bardet
muong
bossman
promis
solidarność
glyde
buddhadeb
kofler
babalola
santillan
manigault
atascosa
,it
uws
codifies
postmen
wortman
mvr
zella
soleimani
carraro
lindale
pêcheurs
jouni
pigpen
digiorgio
velit
nereo
tinkered
rotolo
nitzan
bundu
grivas
ebersol
göncz
peening
ajah
swill
fathima
ahsa
inclining
motorrad
kammer
appleman
ersoy
multani
haberdasher
shaqiri
brahmacharya
sirloin
sahlins
digester
trumpeting
maggy
soloman
massaged
hallux
brennen
pictorials
capdevila
bolide
newmar
speckling
feaver
golovkin
montresor
araucana
bucco
fryeburg
sunsari
eike
tiina
kantipur
guericke
artilleryman
sushila
venal
hairpins
sportimes
ealy
desio
concorso
wardrop
callup
crafters
bvt
sanitizing
flamel
walldorf
mangabey
johnes
langurs
expansionary
spanked
sapwood
razumovsky
paragons
reunify
atash
industrializing
bachao
baumer
zizi
meathead
bonda
blanck
ablutions
nikol
serota
soroka
kodori
weltanschauung
khushboo
chelonia
meinert
lesya
fister
vwf
wtvj
stumm
nawada
honeycreeper
loche
tasters
oppo
lacrimosa
tenga
atrophic
bajram
christianize
navigazione
happold
gorgonzola
daguerreotypes
chiluba
windstorms
odysseas
byelections
westaway
allauddin
fauve
tkach
politécnica
maclure
sujet
kiyomi
vulpecula
wattis
marken
orri
gener
peverell
coben
renzetti
firefall
blumen
midgut
carnets
chouhan
ardee
deathwatch
ansara
loreena
katari
battey
tendring
arnar
contextualization
failsworth
féile
bouck
bonehead
donc
allegorically
franchisor
uncontacted
heitman
arzt
procurators
turabi
ukuleles
riverkeeper
placebos
metabotropic
tanit
nahan
lubeck
hagee
brogue
dietl
adalah
botes
kilrea
morona
crewmates
vestris
kieth
niimi
butterworths
nanavati
tills
blanes
olympos
bachand
cappelli
gylfi
brims
cloakroom
absolutly
sigurðardóttir
iconium
allotrope
leicht
chipp
rohana
loucks
dinanath
alderete
masochist
thl
francoism
syariah
wpxi
kalanidhi
basit
ultan
karle
corrals
schlieren
dousing
osney
lwr
hade
tonhalle
henney
belie
hadouken
jephson
blondy
floriculture
birkat
interjected
eisleben
llangefni
pitlane
sysadmin
friederich
yokogawa
rotimi
tijani
jsat
superannuated
senility
ordoñez
trinket
soeurs
eaglets
morisset
tgi
clothiers
fleer
bumthang
miev
duhig
naor
maysles
queretaro
marama
quaranta
wuerffel
sidley
harbourside
trym
branka
colinton
keitaro
lochan
eberstein
invoicing
fincham
shabba
cannings
tijerina
matam
sulkin
fahri
yapp
waku
raheja
thespians
rolan
freies
risaralda
talpiot
shanice
akhter
saarlouis
hellberg
buckden
bodegas
sanatoriums
kanker
pire
succor
ardalan
siboney
szentendre
unst
sibal
luteinizing
radchenko
papaioannou
minnis
aliwal
undiminished
sycophantic
stooping
gryf
zino
timiş
mwangi
cyclopean
roessler
ibk
chazal
fractionally
abbad
gomme
junkin
breadbasket
postmedia
netherfield
wiberg
cruk
foodland
courville
vonk
mesabi
farell
porsches
diamine
ummc
whaleboat
mancera
nuetral
tsd
committeewoman
fludd
sandpit
effusions
ravindran
duddon
squashing
eick
wyer
sayfutdinov
grbavica
suger
pij
eustachian
grebo
kuehn
nehra
spezza
peridotite
flandre
propranolol
trabert
electricals
squalene
tavi
worksheets
zócalo
dharmas
turnage
berlage
kingpins
grbs
tikhomirov
nastiest
requisitions
qutbuddin
broberg
strudel
getaways
hackl
rhetorics
auro
yaka
steklov
pondo
redheaded
naïveté
eventbrite
mardon
geocoding
hassen
heymans
chileno
maudit
hindhead
misono
kais
ncn
dismounting
shupe
gach
iseman
ronell
airfix
robinette
acasta
golborne
grownups
osburn
houdin
bonello
lukavac
fuzziness
pargeter
agv
endorphins
pennisetum
icoc
rothamsted
somites
schedeen
vestey
sentido
diahann
fishnet
sequera
sharipov
lowen
limite
alak
ebbs
areopagus
siddi
plott
bronzeville
ademir
aharoni
blowpipe
golubic
sakho
albuera
allender
glucuronide
ogunquit
sommes
addicks
gibe
swirsky
highton
azura
intoxicants
midamerica
ankylosing
nonchalantly
weaselly
hermansson
windfarm
unturned
tolerability
haskin
colerain
wresting
phalaris
mirabelle
buidhe
hokie
rueben
nacs
antinori
sollecito
clauson
kuran
districting
teary
lychgate
rike
allameh
jealousies
mazursky
ericka
batoni
wimps
shelia
frontally
burgi
campin
kimes
opis
industrialism
fishhook
larc
englishness
arnault
torno
matravers
dimarzio
faya
kazumasa
tuv
vukan
sennheiser
amália
clarkstown
assimilates
ketevan
chotiner
eunan
glueck
raters
bisa
frighteningly
casgrain
malwarebytes
vasileios
buffo
dolomiti
propublica
hogback
plastik
battlement
taisei
hevelius
fischel
nymphaeum
bti
moudon
olegário
moonie
fibroids
salom
lavina
fineman
grieux
vasanthi
bedale
transposable
jons
percept
elene
pasquotank
mccammon
overcharged
lezama
gik
labarbera
pfau
cucaracha
flecktones
toer
guffey
maner
musgraves
stormare
dholakia
potočnik
bayadère
wti
selima
maclear
chalices
barged
hamengkubuwono
sideonedummy
zilber
gokarna
inscribing
dwg
autobiographer
subclinical
mbts
brem
digregorio
brongniart
hoggan
doms
wyalusing
japs
dinardo
eastchurch
mauchline
abdominalis
dispels
khumbu
antwoord
celui
willowmoore
aen
nevesinje
grf
cfrb
wishlist
muntari
unef
anthocyanin
menomonie
verdin
typecasting
prepackaged
siman
coens
opr
exco
rials
chihuahuas
rulfo
maroun
gilbreath
jianguo
undergrads
indefinable
halbach
draymond
bailon
karlie
malani
mountaintops
yoshiwara
hofbauer
pettiness
rimas
keshi
pizzerias
breedon
yaacob
majik
cerruti
gussow
hindlimb
montealegre
guiche
kilbane
taus
elta
simulacrum
sems
bollington
delic
coldingham
obion
vefa
ensler
hossack
emiri
eckl
portlandia
misdemeanours
mcnerney
beurre
gmi
pleaser
amancio
djorkaeff
salz
catman
emraan
qualicum
housewarming
levente
necmettin
woolard
wdaf
ancram
malakal
garriga
netapp
teriyaki
leaman
podmore
lukman
amphotericin
smeets
gorkhas
cervia
petani
ebullient
feilden
cartledge
viardot
avocation
coupar
wamp
icct
agutter
eichner
carrozzeria
syenite
diatomaceous
warda
latencies
aravane
autonomies
blic
contentiousness
arthroscopy
illegitimately
forbush
decedents
lycra
scamming
tiku
demonizing
senan
acquirer
waushara
roky
aqim
deathtrap
mucky
carowinds
amelanchier
catsuit
mally
metafiction
becalmed
frisbie
murciélago
ruffner
elano
hopgood
ferrat
shantytown
tullis
koepp
garnock
noelia
berezin
ambiental
athey
nevinson
kuster
cannavale
webzines
kermes
wisn
stanbridge
telephonic
abro
valenta
ailanthus
talma
towler
mukundan
slovenski
hags
cavallaro
abdallahi
tippit
deweese
powells
inda
shaked
naza
dinosaurian
silchester
merhi
tetsuji
turkcell
mithraic
scheidemann
polster
chandrashekar
hydrogels
tkachev
begets
dikembe
privileging
ufologist
jagdalpur
parasaurolophus
mulayam
superbad
ftm
vilafranca
begot
hiralal
hothead
bergqvist
jingjing
knipe
shills
dipankar
defrag
curare
mateja
luverne
semel
edwardians
shulamit
sandfly
colorada
arnason
yishai
bvc
rowboats
barma
trelew
dragunov
rawcliffe
raquette
montis
brazauskas
writeups
melitta
voysey
konkuk
mcneile
pudu
gipp
aramburu
nady
padlocks
echeverria
deeks
sirio
restyling
saeb
ellendale
liven
imbrium
levick
honcho
upended
pathologically
amlwch
gaertner
weddington
jancis
dammers
burdening
samm
oza
solomos
nepalis
ferman
sylphide
janmashtami
armrests
straightens
paralegals
osr
donata
liversedge
woolridge
chlorosis
colling
inskip
whirlwinds
wyton
dennen
maruko
satz
dickel
berrett
gitlin
kaspars
birther
naledi
nasca
dpl
barrat
liberale
cof
tóibín
spybot
manitoban
adlard
altgeld
shaunavon
qinghe
chilhowee
passavant
ilitch
rosaleda
fitzhenry
lyres
replicable
verrall
sittler
biehn
shibu
varzi
workdays
jagland
cawthorn
pietrasanta
jenness
avoirdupois
manette
cowher
sonorities
differnt
ecpa
flatline
psychoses
doring
galliera
langman
christophersen
positio
nightshift
seara
parini
singson
kersh
unpromising
abled
alinghi
duru
mulvihill
wyly
mclaury
mohi
lasing
baranowski
gravier
polin
arcelor
tinguely
parksville
galuppi
ellinor
contoocook
singye
matranga
selfe
deluged
autorité
fenby
groveton
transsexuality
dnl
sadder
dossena
minustah
kulakov
valdas
destrehan
yarm
ardley
nieuwenhuis
tortious
soutine
konstantinou
pflug
servando
kitsis
ungrounded
pwll
roundtrip
gof
pinn
vlasto
carazo
byas
jmb
yaracuy
glomerulonephritis
pipettes
farfa
schwechat
acromegaly
granet
catterall
elastin
galeana
regedit
paulose
swivels
foretells
bango
mcsherry
unstuck
carlberg
bodog
pinza
barik
genzyme
meston
ferd
iberica
itta
lipstadt
proscribe
mishmar
buprenorphine
sunao
holanda
wallowing
magnifique
colletti
decrypting
aylsham
leptis
hentoff
swerves
pyx
tietz
varona
epididymis
amrani
morrall
khalif
rosemond
lauf
cachapoal
edgington
badie
saara
dure
currington
valencià
jml
babysitters
gitte
skibbereen
ciccarelli
gunsmiths
joosten
téchiné
frontiere
patheos
bracher
gravett
goldrush
padawan
festooned
granulomas
cornfields
egba
zenger
hammoud
manyara
ocb
gwillimbury
behram
worli
shahn
resupplying
loblaws
bagus
breakable
szalai
taskforces
caribbeans
jambs
houseplant
bracciano
peice
gyorgy
twn
sheepdogs
araria
sellwood
sablan
arqiva
dakhla
vagal
devastatingly
asselineau
suq
sfmoma
misappropriating
hermia
westerwelle
gresty
fethullah
fortezza
budleigh
impugning
smer
mihdhar
tirano
harking
rafelson
fforde
schoolbooks
hobb
superficiality
felker
blassie
canh
vympel
punctum
midazolam
ciam
bannan
marlton
ottey
dulaney
cuong
juiced
ncpc
antonsen
headbands
schagen
agudelo
nopcsa
pistes
sharmarke
dagnall
appreciations
palazzetto
mtwara
pausch
heney
recaro
davydova
hopley
aishah
cleat
dustjacket
lauterbrunnen
icelandair
rfef
mesaba
diel
zit
gebrselassie
essling
tetrodotoxin
indelicato
manal
freerepublic
propofol
lovebirds
kerimov
gbi
yamba
merch
buzzworthy
phm
fbn
mukerjee
messieurs
kassem
peploe
attercliffe
dondi
simkin
tenaya
parliamentarism
kimani
gullet
fethi
battistelli
probative
yuchen
bargello
batur
fitzclarence
mutlu
carlini
cadd
bromyard
sievert
ceduna
rouget
mesure
taccone
kiting
gnd
tropico
turkoman
vind
vocalizing
lachen
doggystyle
haddin
loughor
moyal
zendejas
hasted
calabresi
rilo
glenys
pomelo
giggleswick
devante
interoperate
tucking
janick
omidyar
acclimate
ferdie
rabinovitch
tdy
longings
frend
boater
ahf
freespace
okoro
spic
aliaksandr
insistently
wjla
pneumatically
periwinkles
willock
gaggle
latv
crich
tiemann
béjart
formiga
zuccotti
periódico
anap
langenthal
taillefer
levingston
lockstep
rasberry
wils
dysprosium
kalinina
reaganomics
lionhead
ricocheted
cluskey
soulé
cusk
harangue
atilio
zeidan
anisa
saurian
cnb
aneuploidy
sgl
salzach
nimeiry
wathen
capitation
deddf
redwings
tectonically
meegan
paella
rufe
clintonville
liberum
konigsberg
maertens
agbaje
jiban
unsubscribe
ruminations
nápoles
broc
olafur
emule
piccinini
frain
desecrating
cuius
flours
outgrowing
frederika
djerassi
geneology
rahmon
dros
boxx
richborough
ehle
shastry
hannen
hobley
remer
tirthankaras
furuholmen
dudayev
frivolously
gaetan
depo
morganfield
salmagundi
chasin
dropper
edis
stickley
jáuregui
noémie
tameka
lashawn
lessa
theyre
plcs
roeser
manuva
urbs
alexandretta
eixample
usmle
perdrix
impel
elize
prophète
thousandths
motz
farner
didone
wantonly
palindromes
whooper
amorebieta
khushal
versant
fiorillo
gazzo
karner
turlington
grizzard
aliaga
eloisa
lidell
czernin
bicultural
romanowski
phalle
dosn
hoddesdon
adekunle
miniaturists
ngael
padme
goodsell
principalship
mccredie
mackinder
poivre
jrp
madonnas
socialisation
julen
swanberg
idv
soundproof
magdalo
stanozolol
smx
megastar
roberton
widmore
jerod
baudoin
electrocute
morfa
winterborne
plitvice
takafumi
dyachenko
helldiver
eicke
flavourings
myton
trocadéro
rhydian
patino
leontyne
overstep
lavalin
payn
avishai
roughed
asari
banlieue
whitetip
sagna
munising
shiatsu
madelon
ironbound
mutational
actualized
marginalisation
amoxicillin
gullibility
thestreet
laja
alors
cameraria
jhonny
marlatt
theia
rigours
kalis
granulocytes
kaci
nsv
mrca
retrievable
gizzi
etosha
cowled
anomalously
bka
venky
proteinuria
ttx
bjerke
bisect
elcano
taihu
penrhos
suchy
meggs
cgo
gravesites
ignoble
beust
padarn
darshana
winwick
vanaja
cleavers
dinapoli
matthaei
blotted
valses
alekseev
atsb
aymon
paramecium
monopolization
skouras
medo
loftin
nazmul
debits
yokes
winnable
sebree
haggin
rende
porretta
akayev
predicaments
physiques
merom
zef
podhoretz
techies
ketoacidosis
peli
tendril
mcdougald
mcfaul
samatar
fif
daktronics
finning
lockington
halswell
laffan
scarponi
garita
censer
flowerheads
repairer
stateful
jetport
siamo
maffia
wryneck
ghosal
sombart
sieving
rudkin
shiitake
careca
thyristors
zabar
heffner
scheib
pnh
soylu
eydie
sekula
bajrami
wolfen
laocoön
grossberg
camisa
toleman
empting
fali
sakal
tutta
riquet
shagged
frinton
latticework
inopportune
recca
gubkin
phenotypically
shinjo
anticompetitive
barriga
bogolyubov
coover
ninetieth
papillion
algona
radames
lysozyme
goodrum
sipho
wbur
trist
bertinelli
oberthur
arosemena
kumai
wheaties
kgw
kompani
bergara
eddyville
oskars
tiaras
hoen
knill
arlt
koray
ollanta
pasinetti
quintile
nfo
stoff
cerney
berrocal
friable
arindam
jayasinghe
dunciad
pinkner
meols
futurology
kiewit
tabatabai
yueyang
reames
portway
damour
menarche
aptos
rieder
dotes
manayunk
friml
zahi
dehesa
ludovica
boakye
democratize
seach
fornax
dody
nezha
jayashree
xiaofeng
scuderi
asako
redactions
sayang
sakara
hobble
stockwood
erinn
uvic
morta
krauthammer
polecats
axp
trem
trimmers
dunnock
gilliat
treherbert
amfm
yoi
gabb
handbills
remunerated
jeld
azureus
harlots
nicomachean
diplopia
maccabeus
mckidd
indemnification
fryxell
ikat
garroway
waitz
festinger
kalmbach
noot
arsene
looses
karamat
magro
inestimable
tensing
burruss
bolingbrook
guamanian
trapezius
newsmaker
stepp
fartown
sizzla
neco
egoistic
iframe
tempora
moffet
floresiensis
tropea
stearn
besh
bieri
gramma
lamadrid
zegers
sabit
kilman
revill
hingston
sybilla
nordkapp
fva
ghajini
dores
polymerized
docsis
sclerosing
sunless
creamfields
hft
ottman
balik
wortmann
jenison
yha
vient
eastfield
horiguchi
sampan
acuna
attie
farma
seau
rbb
octant
tagish
drumheads
goût
mccawley
capitalizations
amalek
kisha
funnell
shalala
bleeckere
wieden
stinnett
aspersion
korver
ritts
savitt
hepa
censures
camelus
sivertsen
goetze
cerrillos
wildhorn
rhiw
athabascan
chinoiserie
foudroyant
bassel
bloxwich
yediot
tiros
scarification
nacreous
fiscus
cheekbones
whitemarsh
kenia
fujiyama
sagacity
aliev
cloudless
bruen
woops
jogs
suntec
lizabeth
birstein
bulldozing
browned
gtf
liriano
stepdaughters
pallor
sinclaire
walbridge
hultgren
saia
libations
pinfold
issoudun
baresi
neglectus
cindi
prentis
rewound
hecklers
quinceañera
mesrop
roepke
microspheres
ghazala
mahomed
sulky
softener
ates
limbe
shalamar
haverstock
tiko
belfer
lalbagh
wiedlin
kausar
suthep
alliss
rouault
tte
stylebook
bigamist
melih
overflights
akg
hoogland
pulborough
cebit
improvises
heuston
billinge
subdiscipline
bleakley
gianpaolo
hammes
mrv
sphynx
graywolf
quah
salen
corticotropin
mwd
duraid
sonnier
indecisiveness
seberg
isoc
thaws
sonthi
ncsoft
dpg
boite
schutt
mosler
bethenny
zoolander
enterococcus
dargaville
hoofddorp
commode
osgi
varmint
pitty
lenya
granddad
madhesh
dortmunder
krauze
clodia
anechoic
deceivers
merkley
playlisted
effectually
schuester
ryhope
intermarriages
alaimo
ridgewell
chevaux
donatelli
unió
ryzhkov
heyburn
cpap
uyuni
kracht
muggsy
halong
sebag
anthropocentric
nationalisms
pdh
antediluvian
kendrapara
sinitta
kogure
urinalysis
pseud
recapitalization
curcuma
reenlisted
sigulda
unfixed
vacher
boléro
terpstra
swaggart
rukeyser
aquash
auja
shd
tarda
garmendia
campanelli
flexors
schurman
motd
uruapan
saqlain
titmouse
osteosarcoma
interbred
hendryx
thermochemical
boquete
traeger
mothman
vignoles
dillane
kiddo
snaffle
sayegh
sosnowski
strassmann
evacuee
brrr
halpenny
aquidneck
santhal
dusts
pota
tarplin
sakyamuni
brinckerhoff
weissberg
copiers
clearwing
vinter
proinflammatory
bresciano
marinovich
romeos
peverel
oks
agapito
mtbf
censuring
meharry
vigilius
dahlquist
halon
joynt
laskar
clotted
deraa
maibaum
ubp
savarese
nidd
hdcp
mettler
jarecki
mosbacher
kommandant
coltan
introd
dandies
saucon
mcvicker
willemsen
posa
jounieh
spitball
scharfenberg
faits
bremmer
albinos
unawares
currin
drumcree
carps
gluttonous
fonti
lysacek
kitzhaber
feuillade
berserkers
webinar
physalis
consignments
chirps
mihalis
helminths
rodriquez
irishtown
königssee
reys
rohrabacher
beban
arrau
protozoans
jany
beacuse
magowan
racialized
wailes
trussed
allenwood
dunnet
theatreworks
frühling
nawabshah
muizz
cardon
freeth
hussam
magdi
lro
hepatology
nathuram
prufrock
deltaic
scarier
homesite
enuf
marnell
vondel
diwakar
carloads
clayfield
wunderkind
wizkids
continous
ebitda
extols
poiré
geldings
taylforth
professionalized
liqin
positivists
sayoko
multihull
avella
curriculums
blucher
darlow
rajshri
sportspersons
grindal
oussama
rippers
kushinagar
ieg
kaiserstuhl
urr
caamaño
yorkshireman
speyside
cazalet
friendliest
hydroids
daunt
perotti
muslimeen
tabun
triborough
hornless
riggle
timelessness
pinedale
lodgers
codebreakers
zadran
greenhow
jalbert
pellissier
capitolina
landini
dwr
asmar
nacha
costarred
setswana
catfight
taraji
antigravity
jacquelin
bumpkin
stockmann
hillin
rastogi
neurospora
gervin
luen
arau
tulou
ranong
ljubica
monessen
copywriting
horler
raspail
ajantha
contracture
uwb
caraballo
trave
heiman
nyholm
premo
bougainvillea
adversities
floorspace
napoletana
juanjo
uob
racketeers
marcopolo
idalia
maithripala
zhiyi
tates
gabbiadini
nyorai
sightedness
moncure
minoo
yoneda
interjecting
avio
tumblers
raniganj
fourah
nabataeans
identidad
autant
homiletics
deploring
dml
tuns
sullied
pach
prominences
lucila
gilks
peda
pembury
buttigieg
svetlanov
roxby
disegno
amcham
willenborg
frb
lickey
ocher
bcom
kealoha
vernay
bygraves
ionised
alfond
hilditch
shobana
newgrange
zeroed
ghezzi
bozkurt
korina
tingey
gowing
bucklin
kinrara
chappy
larrionda
：
ferncliff
hollinwood
latissimus
scardino
lunchroom
masers
firefights
unneccesary
distension
gole
sipos
silverfish
newchurch
miyauchi
dindo
putian
borella
oleson
wsis
fandel
kwabena
lamme
sushruta
yall
notman
kalakaua
wildey
kananga
discriminations
andøya
klavierstück
megabits
skybus
transcaucasus
dalvi
sembach
crossett
nutria
sorbet
jumbotron
cgiar
sanjiv
sweatshirts
piketty
siegrist
magnifier
construing
cantilupe
stmicroelectronics
bazilian
eggnog
rago
parimutuel
lidge
chummy
lichtman
selimiye
wakarusa
pcos
perturbing
plasticizers
kateb
rathmore
bluer
maidenhair
swaledale
schoenherr
deadeye
jarboe
buchi
leisz
califano
streaker
caspi
gavroche
tenderfoot
galarza
ioa
jawara
passeig
boberg
froment
mcot
mulago
apter
zilberman
inuk
mancinelli
brocklesby
ucar
baathist
apprehends
vids
whiskeytown
ollerton
soha
delighting
mizer
riverbeds
crittenton
hatchets
giske
pinguin
nagahama
cafta
fultz
duport
adivasis
pizzey
mariotte
acy
vade
dragonette
daresbury
independantly
karpis
hirakawa
bambridge
hadrosaurid
saraceni
federline
bagga
tendo
shary
pratley
ljiljana
kabat
erasers
wmm
peephole
lateralization
silvino
plimsoll
tynedale
magner
wolfeboro
suport
archrivals
heye
koin
maddern
bbp
keatinge
kagaku
misma
wmg
rawles
elmes
cleobury
modernizations
tranquillo
shrewdly
crumpler
boody
plaka
falle
streller
ptk
plagiarist
dissonances
abakan
schreuder
voicings
koichiro
manora
derniers
khairy
fratellis
maister
voluntarism
archerfish
cesa
pottage
greases
taglines
nij
jejomar
motos
paolozzi
carless
araceli
arleen
exposés
gleiwitz
vindicating
chitwood
blyleven
europeanism
loral
daunted
gerome
explicity
chapada
futrell
cosplayers
escamilla
britishness
impermanent
spinetta
marginalizing
arj
creasey
furu
clozapine
abinger
tessitura
kellan
sunscreens
dipiero
ruhpolding
spectrogram
woodbourne
angioedema
marjoram
mosharraf
huffer
sciatica
clarkii
zooropa
zennor
tettenhall
danilova
boyton
mieres
staggeringly
zhicheng
morabito
dowdall
microfluidics
glorieta
wigham
cluff
trinder
fbw
doudou
leafhoppers
gramenet
diclofenac
cjsc
chaussées
vroman
blewitt
mumy
greenburg
suur
imovie
destra
bayazid
toxoplasma
barrette
imagineer
gask
bevil
loosehead
protean
ccj
willkommen
goldbug
divorcée
farfus
lauris
arbitrating
polyamide
casteel
crichlow
mhi
hool
oscillated
euroregion
alara
scopolamine
hobday
marangoni
neca
gartside
thuggish
oosterhuis
pimples
kacper
healthgrades
recantation
lochore
itea
lecavalier
nosek
caned
petain
uzelac
christophers
excoriated
leggatt
unselected
tathagata
isoprene
tullia
trv
lubbe
rathdown
odgers
thulium
tutbury
craveonline
screven
ingests
quebradillas
carquinez
afrin
agraria
preventer
forcefield
pylades
cix
mrcp
traube
satow
neogothic
hogben
bonbon
coinages
draka
indignantly
madrugada
blaha
giesen
lapsing
dimon
deoxygenated
frederator
tassos
abattoirs
cadenas
orchha
wimberley
amblyopia
westlands
csos
statelessness
minored
buzznet
rundell
rajapakse
saphenous
dessa
claritas
ferozepur
qatada
macaluso
lembeck
ukrainka
berlingske
driveline
différence
coquina
lch
antipolis
liliom
calama
dishonourable
pneumocystis
chitarra
behm
newscenter
mofaz
prova
balon
pejman
thangka
redemptoris
beastmaster
choker
grimace
extrapolations
buzzes
intriguingly
shizhong
varano
besos
uninstalling
fibs
faba
estrous
tandridge
sahn
morgane
pann
halva
yicheng
nosebleeds
enjoining
boisset
bucur
puree
joanneum
winstar
katzen
chalais
kovach
gpcrs
solares
zawra
rowhouse
ovidio
jingoistic
lindenhurst
siegburg
circassia
benedick
lansingburgh
roughest
dennard
tarkio
tesson
torcs
corser
apres
kddi
gaily
balachandra
careening
gastown
deda
pols
chichewa
hisses
pilo
disarticulated
mcbryde
nays
cimabue
verplanck
gfr
montecristo
berre
quadrupling
leiner
longueval
favell
vesak
braziller
misdemeanour
shabtai
mentz
baqubah
tanzer
tarazona
mirah
soapnet
taina
penhall
ermin
unfrozen
pepsin
arbon
memorised
seka
exfoliation
martinek
badruddin
bensen
zuñiga
nagore
cerullo
daryle
perisher
yngling
chloral
carolco
ferryhill
wagenaar
khachatryan
gasometer
tatarinov
chloroquine
twan
reusability
pugni
physx
polyploidy
educationists
hura
slamet
rbk
chachoengsao
prajadhipok
plataforma
gompertz
chassidic
valuer
memorandums
treadmills
peterik
dirleton
muste
czestochowa
lws
superprep
ephram
hateley
mahdavi
brewood
gnasher
unities
mustaches
bya
gitta
aveline
valproate
clarens
fromage
vasodilator
chessmen
nikkan
ingenio
rajasekhara
southridge
bania
gualtiero
castellon
vte
nordling
haysbert
basehart
dorival
earworm
inzerillo
warbucks
farag
hymer
runde
gelatine
privity
navvies
nevile
hymnody
barrionuevo
sportster
skillman
nicholasville
farrago
capetillo
iwamura
arrc
rothberg
railroaded
fraternité
merab
glycans
adsit
udmurtia
laskey
poldi
teredo
hickerson
bankable
whitest
fandi
bromhead
dessinée
prescience
fradkin
voltigeur
roka
bandula
picc
ljunggren
sanyasi
hoogeveen
furillo
puu
paracas
longships
kemps
hamrick
rocard
forseeable
schatzberg
paling
snelgrove
darwinist
toyokuni
retrain
yawen
gigliotti
volvulus
steilacoom
westhead
chipettes
nasopharyngeal
rcl
willaumez
karev
faulconer
tarawera
biophysicists
hayer
insecticidal
kershner
angolans
casini
mansor
cosumnes
immunize
ramasar
wakeham
unasur
chinandega
pothier
extractors
weatherproof
villamil
nacio
vithal
sparke
hongi
thredbo
decicco
poddar
plebe
vertus
jau
distinctness
maslen
rinder
kilrush
thermes
pflag
whaleback
extracorporeal
catrina
candlebox
shinozaki
wqed
acord
tuxford
rolen
cervenka
baldwinsville
nyunt
leukoplakia
rober
maiolica
sahand
toadies
rayson
derogatis
hatsumi
franklins
woodchucks
mangosteen
karmann
luberon
circuiting
theriault
muwaffaq
asri
otd
amarah
recognisably
ché
nirupama
mentalism
emidio
bupa
pasquali
kheir
gesar
marelli
cristy
aie
decembrists
randeep
kiick
markievicz
serotina
prachanda
yatala
langara
bandler
kallur
unwinnable
bitterne
ipg
testarossa
faina
wijeyeratne
ibby
rochet
besoin
ssgt
acteon
stearic
nmfs
ioi
wynkoop
kashgari
vendler
kweller
pigmy
langholm
uncompensated
rockpile
mantooth
hailstones
duga
kantara
holmstrom
prebiotic
cossette
avati
oberdorf
cuanza
valaam
ruination
faggots
bonnies
tappin
tullus
danto
hirotaka
hna
lyndall
applewood
hrsa
forno
elchin
cozza
mwt
leukemias
caprio
olor
groesbeck
ivories
destructions
wriggle
vueling
abels
pavesi
ophélie
stockham
hotheaded
beo
wolfsohn
swingman
matua
olmecs
hct
mahle
cockroft
siniora
malham
adjuntas
openwork
photic
congener
vitalia
brymbo
masiello
viveca
unwto
drafthouse
manhandled
halfords
bhf
microelectromechanical
eked
nylund
rewired
sittwe
feminisms
econoline
piemontese
alternaria
bilt
captchas
alhassan
wdiv
gurpreet
sandby
malir
squander
malema
lowey
pileup
cerros
empresarial
rinker
blox
freeling
toal
apthorp
sangyo
kbb
samangan
niyazi
barnaba
dsf
tml
yehezkel
keffer
failla
teka
guzheng
fugs
ikin
reang
loxodonta
abcc
laxness
helmy
dalat
ntini
khashoggi
modder
kemeny
olaya
willer
meers
zajac
majer
tabackin
ridgeville
cordilleras
yulee
motoyama
phibes
yaffa
varnas
highball
birchard
jeunet
hydrocortisone
anthologised
kronen
koranic
muis
muggers
slimline
platov
pancrazio
cigale
economou
pharmaceutics
wilcoxon
locle
revokes
umali
dekkers
hawar
dcn
genia
enrage
shayna
raschke
marienbad
sukabumi
badillo
misner
ashtrays
idrc
traumatology
chacun
kurseong
luddington
ciw
kusuma
skuse
trn
humourist
woodlouse
humoured
walgreen
persano
shiney
gatecrasher
liberalize
zeljko
supercapacitors
orzabal
thackery
obstinately
plastique
sherrin
kericho
maglia
brainwaves
goyeneche
sanfilippo
chumley
gierek
altima
lairs
bardiya
wolfert
según
anees
llanthony
coxsone
gustavson
caulking
berube
hollick
zaa
pierfrancesco
ordinators
atka
luncheons
lihue
trevose
retractor
bonnes
konerko
sheindlin
wrather
misstated
lrg
arismendi
righi
eatonville
alos
kári
holmer
autoplay
cyberwarfare
lestrange
einsteins
scollay
digvijay
porten
hajdu
udal
bassler
joffé
chmerkovskiy
piacentini
encamp
jolanda
jabuka
mfn
poble
consciousnesses
jubb
ambra
lovelight
chimurenga
checkable
calvisano
chicagoan
dimartino
monody
carpel
tshombe
yudin
chapdelaine
crannog
carbonara
definate
eor
tercero
newsam
rückert
maekawa
shahan
alsa
dismembering
teratoma
falso
reta
laveau
bgr
bermudians
miasta
lums
mory
liebowitz
ghandi
shoto
planas
contestable
grignon
warmers
haniyeh
fuchu
abnegation
quileute
schnapps
brummies
kizito
wacom
nilesh
jesson
northeastwards
permeation
moch
worland
chree
strub
onc
herzig
geron
mezger
guajardo
waki
explicated
crois
caggiano
organismal
homann
scroller
perino
kamuzu
stefanelli
sterilizing
junejo
maltreated
chalcedony
farooqi
muddying
persecutor
ilah
antiphonal
tutorship
eer
pyrex
mercader
quinnell
dorsi
junfeng
tutzing
efcc
oscott
wellard
badshahi
newsy
kharian
relativist
vanguards
kenworth
burkhalter
cratering
zelma
fozzy
donaghadee
burki
vestergaard
buxus
kerzner
kawamata
marylanders
nondiscrimination
kingsgate
musiker
patridge
stockach
fallibility
turchynov
wynford
breukelen
poppel
pargo
shamen
sofiane
bitz
telefonica
karakalpakstan
slon
zinman
redbud
aspis
riper
paleface
canterville
sauro
ellenville
gönül
anglong
ueli
vered
avance
wasley
akhund
kouchner
giocondo
mineta
gagnier
redressed
cisse
soysa
halper
reinstituted
murre
donax
haemek
silberstein
saucepan
serah
auray
caviezel
mantling
eckington
recommissioning
pilz
seney
decelerated
infonet
meegeren
lienhard
diphenhydramine
deindustrialization
childishly
phylicia
dagoberto
pondexter
chucking
jacobitism
gerardi
anonymized
collingham
patassé
sterrett
ghan
sandboxing
kasyanov
sidestepping
buse
defensiveness
morgannwg
donaghmore
armah
mantuan
fyre
fennica
jijel
bivens
regmi
geomancy
gii
ferneyhough
krimsky
shive
maddon
wireframe
jayshree
choirboy
vescovo
likelier
rappe
jns
tsuboi
deplores
dmitrov
abq
pranav
boulware
glassell
faden
courchevel
unnoticeable
minnesotan
zaydi
eiler
redirections
graziers
baklava
lytvyn
usaa
democratizing
gastrin
mendonca
certifier
bason
picketers
gilo
deslauriers
charg
haynesville
signifiers
progestogen
wagler
chamoli
cafferty
biphasic
tellings
kamsky
cahen
autocross
ufrj
wynand
hakeim
grawemeyer
annonay
npe
hesjedal
sops
bonder
odintsovo
cosey
veta
hitec
fengtai
unmasks
gomery
tatung
pakhtuns
bullwhip
jabberwock
zuffa
gravid
litmanen
jatun
dhobi
lexy
baty
uht
osim
tabling
eyemouth
birzeit
glendening
insull
gogan
funkhouser
psap
snooki
llangattock
trull
itsukushima
fantino
cleavages
justitia
deciders
exactions
aleksandrovna
capen
berimbau
etang
neccessarily
fujisaki
boekel
jackett
barthelmess
leight
akshardham
hoenig
monetarist
gotemba
canne
pernis
neba
fancourt
humanizing
displeasing
formentera
clearcutting
darrington
gagner
kiraly
rickett
amiodarone
jerked
unearths
fabinho
kapler
upg
mogao
dulcinea
damaso
jefferis
spambots
crighton
sacajawea
cygne
greywater
endows
travelodge
panchromatic
troiano
iztapalapa
tareen
zivkovic
redlining
darner
adjutants
hackley
mallock
teves
reichman
tuyen
wiesen
meditates
appraising
transfixed
kaahumanu
mirrlees
nust
suspiria
cathey
dodgeville
accor
tedi
deerhunter
dimitra
sne
mabou
gurinder
wando
donadoni
meall
briel
skellig
sadu
autarky
cianci
dulag
darcey
dicamillo
loafers
jumma
demoss
boothferry
virna
bardsey
hardrock
powertrains
mery
binswanger
tamati
finau
osram
beading
schäuble
tenaga
confessionals
phyllida
scumbag
bernauer
skyliner
oguz
coxswains
unicyclist
hodgdon
brannen
kaifu
allott
fauji
elmi
ahsoka
bowens
xylitol
obtusely
drunkards
henreid
policyholder
flucht
cotham
demorest
outshone
passivation
westhill
thermohaline
tharanga
reichard
redfin
lanterne
physicochemical
leontief
abduh
waterstone
loden
roisin
waianae
luts
munchen
autauga
lujo
duvet
hontiveros
valorem
neft
yop
fixers
gisors
faliro
resister
beany
skol
tplf
roedelius
hrothgar
cathays
illes
seperation
zbornik
salahi
kolob
buet
chema
dalmuir
retinas
reassurances
hargraves
femicide
bew
kurbanov
piazzi
muffed
hartline
mcgahan
bergan
nakada
parfums
amrut
maurel
callier
baldessari
seductively
laurencin
supertanker
wpbsa
nole
proprioceptive
mcilrath
lockman
squealing
toseland
traxler
nart
belarussian
crimewatch
dheeraj
georgics
sculler
lafond
rockie
kostis
wholes
atapattu
hintz
elphick
lobanov
merlion
gullane
plasters
princetown
hartl
photorealism
lcas
tadamasa
najah
neighborly
livelier
onderdonk
ravers
internacionales
perrow
vaginalis
sabbaths
comparision
esti
krulak
kec
kasese
symbionese
aereo
stavridis
tasmanians
datsyuk
aretino
spittle
monocyte
juanma
ostler
roselyn
tagliamento
nachtigall
controllata
safet
moister
brasileiras
sizzler
kanemaru
modu
lianhua
arnould
adri
occasioning
gimble
hawkweed
hiebert
esaias
dows
rightists
analysands
ditties
kabanov
carano
relicensed
ridnour
slaine
liposomes
gstreamer
midship
flic
crue
vibrated
tuong
cleartype
abrera
bimah
holdin
bozon
ashiq
lavers
varo
castilho
easterling
valeska
brickfields
tugwell
stampedes
pierpoint
shorin
therfore
superchunk
towles
loz
dvir
dildos
kovtun
schl
lunokhod
bonser
trenchcoat
kombucha
handyside
pisanello
greenjackets
lalibela
dyas
vigée
embodiments
cubase
gaeltachta
mountainbike
marcon
magnesian
beton
mimmi
povilas
tragus
gots
partygoers
bles
tdl
launcelot
vanquishing
guare
gustloff
winkfield
glubb
flagpoles
downhole
giralda
dumpy
grobbelaar
ivars
eshel
yarde
clorox
glasshouses
rtx
ravello
whall
headstart
yelchin
simoncelli
kazuhisa
meseta
czeslaw
lemkin
sitton
farmersville
sculley
ictu
spartina
fawns
ballers
tinkerer
fibulae
uhde
commentors
aart
frison
malade
zawada
dwb
sair
alisher
mcgillicuddy
defeo
spargo
traf
ziemba
dnt
gandi
ango
memorializes
doppelgangers
schoonover
vivat
mckane
votto
valory
mapungubwe
juanfran
hypospadias
purfleet
khidr
phillpotts
phosphorescence
amuso
sexualization
hanumantha
susurluk
ingrams
filipp
koll
walterboro
fahl
lusa
nasiri
desouza
courtesies
koans
gretta
smoak
pão
whitebeam
bergé
maintenant
soulmates
orbitofrontal
kuqi
massenburg
lmt
kundara
raoult
besotted
kahanamoku
hame
isda
trouw
sellin
thorleif
radicalised
praemium
mariucci
jogye
diluent
vervet
welten
kapranos
osmunda
hro
kangas
kvam
backflow
sln
tomiko
molsheim
meyler
hicklin
ossietzky
hemanta
byelorussia
ladrón
ballagh
nonvenomous
bopha
bécaud
pierzynski
lectins
bungei
rockledge
untethered
portchester
flatow
electrocuting
rager
zech
shirl
mahuta
replications
ijf
jerold
chafed
dussault
barguna
mincing
wargrave
gloats
tío
poteau
macek
wayles
petsmart
adipocytes
locators
kamogawa
yolen
taare
stenzel
swindlers
parm
landeck
ncep
myelodysplastic
scrotal
cowlings
momoi
bogoria
denes
manoukian
spackman
gigged
enemas
rapinoe
keloid
ballyduff
piane
raghuram
rewritable
naimark
inglorious
wedgewood
strass
ekland
attac
ples
trances
flexes
templin
bredbury
microstructures
bauwens
ribblesdale
qingming
encyclopedie
sahakian
nerone
kissy
subheads
infarct
interlopers
hdms
priok
zinnia
junin
embarassment
infographics
samat
supersaturated
arabsat
jrm
elaphus
itawamba
limonene
phellinus
hach
lookahead
gbenga
margus
tofino
kidlington
pagsanjan
gotlieb
tabatabaei
katsav
marwah
sundowner
evey
navicular
llorar
eggplants
hartsdale
alfreda
birtley
tominaga
murk
refrigerating
miis
valu
zappos
kilworth
havin
boswells
biphenyl
jagga
ashik
ifex
accrues
ellan
sobe
conal
brayshaw
aldworth
tallchief
uvula
colchicine
chiwetel
eliphalet
koide
lofa
trawniki
soulchild
vitrification
bhatkal
jacobabad
schau
idsa
sebesky
coningham
acústico
denslow
casanare
noncombatants
schuschnigg
emel
testino
ommen
understeer
whimper
veh
soori
aquia
moushumi
yanez
helpfull
recyclers
balafon
carner
barani
miikka
moren
shider
lapsley
lemongrass
ithaka
rmaf
asca
garant
rusko
impudent
muench
sobeys
gordonsville
heermann
luciferase
centrepoint
honeoye
cyclassics
sentances
necati
updown
smallholdings
wickremasinghe
dalmia
countertop
dunam
hyson
binned
riego
axolotl
zhiyuan
ardizzone
cantorial
donlan
xieng
hobgood
rwenzori
anshuman
ilg
amarapura
ganji
elayne
aparecido
gumma
leod
filippino
cemex
garnets
toddington
concubinage
chambertin
yakup
personaly
fissionable
mabus
fantastico
lucre
hfe
nagare
summative
citybeat
loreley
klepper
timbo
garters
nieuws
signee
nsfnet
wanjiru
batta
sankofa
gebert
monetarism
phonograms
frankowski
shambala
brocklin
eddings
ausmus
phocoena
recomend
lovatt
marawi
bertelli
hablar
regs
northug
perlas
hellebore
determinable
anyhoo
wragge
catnip
xiaogang
hepatocyte
momir
shoa
munsch
contemplations
koori
majel
visages
caroling
galanter
pavlyuchenko
corydalis
takahara
khalfan
ashbridge
dualities
ninemsn
leeza
myanma
sacroiliac
degroot
ghiyath
islamiyya
remora
giffin
socialismo
frosh
cardross
brean
fluoroscopy
unvarnished
severest
descant
copiah
winterburn
arrighi
firehose
clappers
mutiara
despondency
gradus
adeeb
reorient
blundered
gorringe
cantilena
topcliffe
lyde
iffley
lambasting
skyfire
sasakawa
plunderers
pams
rsg
braugher
toor
guidolin
castlefield
trant
chus
chapterhouse
filesharing
ouarzazate
marisela
negredo
pored
badwater
omkar
sidewalls
fres
unladen
gravimetric
nph
ogo
gohil
amsberg
mistreat
huse
celebi
pfarrer
hioki
brandname
beachcombers
flook
fryman
aloes
ussocom
inkatha
mkapa
wassail
autrey
topknot
babys
chelsey
wkrc
fiorano
parsees
overshadows
charlwood
leguminous
vally
ulman
weakling
battier
stateroom
cille
rajouri
inuvialuit
caesura
niskanen
xdr
cloisonné
gofa
halewood
melchiorre
kacem
playmaking
therrien
brazelton
nausicaa
sistemi
adjunctive
zeckendorf
oliwa
lymm
tikriti
simen
insistance
kreps
zubrin
monumentality
mannish
salli
longsight
harks
pagenaud
weu
consequat
troth
skr
heartened
bachi
rastelli
berrington
nonessential
napoles
steyer
studholme
deerhoof
ouaddai
oxenham
mazeroski
forlani
ruelle
sanday
crx
sergo
genderless
presnell
sawfly
parslow
anonima
colmes
tilikum
gunnersbury
duris
arcadio
gitana
molyneaux
whiteland
stenborg
chelles
shafted
forebear
duret
societa
sholl
krinsky
barocco
holbach
noxubee
hisaishi
taik
graphology
hadiya
pvd
slangy
legalist
ussa
rbst
attenuating
etiological
gulabi
forkbeard
ethologist
weese
iberoamerican
mikula
demerger
rufo
deukmejian
haverty
skyros
unequaled
hectoring
sagredo
chiffre
tokay
soundz
ansaldobreda
ghosn
meuron
jamat
mixteco
rinderpest
penumbral
stenography
eron
tamao
oettinger
seacat
shotley
burritt
lunalilo
tgn
tipple
midsized
crinoline
webcasting
igls
audiard
jonassen
karren
preen
árbol
trickled
emceed
tavola
daniyal
agains
hardbound
mecano
colomiers
ellsbury
aminata
oast
ludwick
pratham
andaz
esherick
olteanu
knab
bernthal
bibliotheque
soeur
ladakhi
klutz
kölsch
prepping
gekas
hennes
lachie
orica
jointing
transmuted
sarvis
adelstein
motiv
viia
chanchal
dongxiang
lalith
mehtab
hellhole
gadkari
cjtf
vianello
gda
coment
penwortham
conciliate
gokwe
wingspans
astrea
peirsol
losh
borrero
gwenllian
sisodia
panamanians
armourer
yella
lianyungang
riggi
jobseeker
thermoplastics
scenographer
maysa
zeebo
adom
adshead
vadi
glossolalia
mishka
hergest
tamino
vanian
seagrove
piquant
pyjama
piguet
boathouses
tripfilms
montara
greenbriar
martinville
satterlee
chasms
siento
deq
semedo
schaper
howorth
litke
trahan
smashers
ghsa
veltroni
salines
giambologna
homed
jlp
hallandale
noreaga
agnel
locura
coexists
runar
apparat
querida
underwhelmed
arounds
stuf
civilize
cgv
hardaker
schalkwyk
scarpetta
muscovites
yunque
nucleases
qeii
leanna
midd
hedera
waner
dippy
freddi
prostrated
ruairí
ecclesall
lobito
derik
larache
incriminated
kfm
orna
varadero
winzip
intermixing
globalised
frier
vermonters
farel
spindrift
discretely
kranjska
shuckburgh
hillah
inscriptional
tweens
woc
fages
miyoko
hemagglutinin
farzana
prolate
azizabad
pava
klif
prancer
mxn
homayoun
goldacre
rentz
yanis
nightgown
vns
balcer
epidermolysis
apiary
rehmat
gover
sumptuary
bonfils
hsb
timmermans
lemgo
tighthead
avary
ginnie
tulyaganova
maruja
northanger
vacuuming
civiltà
grd
mcmahons
bfb
suha
gesundheit
wurzburg
graner
goris
pictoris
bá
pinnick
chianese
numeration
brockington
localizer
lobbing
privatise
dedes
sabratha
alexandro
vaporizes
takasugi
thumbing
rizo
skiathos
okuma
mcbrain
magoon
bhuyan
rotifers
norquay
outperforms
demaine
kmbc
holyroodhouse
ccis
silvey
hageman
joyously
knowledges
linac
shijie
shuar
yaffe
dogsled
hematological
hypnotizing
indistinctly
tarkanian
dbb
rewinding
venatici
marji
winnick
kokan
atangana
zidan
bredow
taavi
somersaults
sleepwalkers
bertman
loesch
owada
calpurnia
mcguiness
bristling
azami
ehow
sequim
longgang
tadanobu
idowu
laisse
descender
thaçi
parga
cadle
alojz
freewheelin
continously
gavyn
boykins
tritton
cowans
lesbia
aquilani
torda
sauternes
lipsett
phosphatidylcholine
dunces
chicos
ilyasova
rutenberg
braude
majolica
savidge
earthed
housefly
dorit
specters
abercynon
depresses
daddah
giovanardi
breyfogle
equalisation
katusha
kinnell
drzewiecki
halfmoon
manvers
ncte
tonsillitis
addenbrooke
einsatzkommando
sontarans
tef
syndicale
tobia
alienware
cazenove
stoutly
kerpen
lafuente
wanga
bodiam
farson
copulating
boquerón
righthand
brohm
kookmin
peifer
natzweiler
vanderhoof
sinfulness
manjhi
tabar
rafic
luces
chagnon
concretes
scioscia
rhun
dubay
teratogenic
artzi
impoundments
selvan
elbowed
handfuls
intimations
garran
raggio
pomerol
sommeil
doubtfire
orvis
moabit
cleckheaton
cyberjaya
behesht
blasé
albena
itala
soqosoqo
malleability
ilt
suzman
somes
scalded
menchaca
enthoven
enloe
kinzer
trenching
morrish
pareil
uncoated
immodest
overstates
hydrogel
qbs
quinby
chinnery
cubero
storybooks
petchey
pernice
postpaid
ballenger
effet
misreported
foraged
chaar
confectioners
hwange
sportv
cheapness
inclusively
tzotzil
deicing
yusaku
nummi
syphax
jarlath
aubameyang
cotte
tré
yabu
vocale
churcher
belisha
chafer
emmi
wonton
ratoath
kontos
gauhar
revote
laith
sobell
zillah
avulsion
bonnefoy
lukis
lcac
ayden
enrica
tranquilizers
legitimization
shennong
gellner
donya
permeating
zigong
dowsett
simians
manouchehr
azraq
clinger
radicans
allright
isotonic
shiism
thalmann
fowell
odc
meko
corkery
kindt
rubes
hardscrabble
thiomersal
patas
toksvig
bootheel
monocytogenes
dymaxion
rotta
cournoyer
ellos
temazepam
kempis
monozygotic
finnis
polyesters
coppélia
aktau
grifters
furet
sassen
gnarly
mandolinist
impacto
hellier
vicari
peppery
dalaman
tellez
aigburth
curried
indigenes
lacon
mightier
korzeniowski
dirnt
bracegirdle
philbert
epona
jea
ormesby
forwarder
armco
eeden
choctawhatchee
skolkovo
pirillo
romijn
coxen
kilham
maceda
liudmila
lionello
boatbuilding
meyerowitz
wisla
courtneidge
bagehot
ropati
crocheted
chlorella
letang
witts
aguilas
arbenz
detractor
kephart
rense
taibbi
vizcarra
osher
michiyo
boissier
makira
neanderthalensis
lochgelly
revolutionibus
saucier
eyak
bloodsport
bogaert
wyomissing
sudi
boesel
madha
dragonslayer
trama
cytogenetic
pev
gairloch
wiik
bucovina
taverne
deblois
technika
crated
fyvie
mariangela
scarth
rouben
teargas
verite
haeundae
seachange
jwt
kajsa
lunan
pappe
ined
humpy
duggal
petrella
palicki
wintle
adde
deford
saluki
eusko
candoli
hissy
telma
raffy
costumer
johnnies
poultice
enforcements
activites
jora
goshi
starfield
torrini
resultantly
ksawery
blease
plumley
zithers
friern
saiz
dhimmis
descamps
goosefoot
integrase
verbalize
simulacra
peerce
mameluke
dessner
atlantium
musumeci
obdurate
hefer
modenese
chidi
zaghloul
ponza
tonypandy
animists
nikka
pullover
raymonda
gpg
gilkes
mainboard
yahaya
degnan
roycroft
sigiriya
ouseley
constrictions
chingiz
ignatyev
novation
kadlec
greenfields
brutalized
baltes
alazraqui
royales
dode
arfon
mayim
ginko
thrupp
flywheels
waistcoats
hartke
tuebingen
potto
kuruppu
surender
wsfa
vpns
wst
lpt
reinke
dalloway
preoperative
subnormal
muftis
gelded
boxley
mirela
canadiana
mitnick
landmarked
shibukawa
sabertooth
indisposed
nobler
molko
winterson
kinison
peyre
kindia
windpipe
deduplication
hymes
analagous
nase
brownsburg
sharpeville
whiteville
berms
ifes
hilts
vetlanda
harlaxton
lissy
corneas
donnersmarck
hadise
scattergood
feelies
nigg
lieberstein
afyon
carbonized
pellagra
pantalone
tulowitzki
ciii
alledged
joist
yaseen
chewton
bvr
siddle
fairborn
monobloc
chakravorty
multisensory
relight
djed
boncompagni
bbi
thataway
sumiko
salaman
cabiria
benquerença
manske
woolcock
darger
indemnities
enumclaw
footrace
umra
disclaiming
gvwr
japp
graphed
pattana
backbenches
harthill
curro
felicitation
lunes
ucmj
sterett
girod
pétursson
crucibles
sarner
leveller
turtleneck
beveland
patricide
timerman
praseodymium
reeb
conard
yuichiro
tablo
vyse
tiefenbach
warsame
mcteer
ideograms
gerasimos
collegially
decamped
flansburgh
viviano
sumeet
seligmann
feargal
chaque
velikaya
quivers
cancion
hatzis
stonecutter
wetering
rajar
terrestrially
deadball
pinas
kaymer
crinum
bittan
rescaled
proletarians
shirtwaist
glyndwr
polyphase
nanowire
hagadol
temas
transmute
aestheticism
skream
barbar
dne
martone
macke
jep
derrière
gunnlaugsson
amiya
sobrante
hauz
sledd
goldwin
preempting
dsps
burrowers
ilarion
arrs
wendling
paddon
bobek
asmahan
ofthe
ifac
defrancesco
tindersticks
chicanery
peroni
derfel
kvist
whelchel
farjeon
quitely
pruden
hisense
wasdale
jozsef
mailers
noorani
underaged
derbys
veloz
barbas
ilminster
recchi
duckie
relicense
garrel
teapots
lurked
iml
underemployment
gentamicin
collioure
hygienists
kiveton
tradewinds
replant
urueta
mscs
presenta
karthala
cedillo
deza
sandri
knottingley
nairi
mackall
zesco
gcp
klasky
nrotc
lura
misbegotten
noris
unrra
beaven
dangdut
asmus
bhoj
vag
kuok
aist
nappi
morina
archabbey
comunal
pasanen
ramle
metastasized
waqa
komando
christodoulos
blunderbuss
chesterman
bagher
bloodstains
rubato
callicarpa
copthorne
sitra
antagonizes
exhibitionism
canella
goodier
bvs
japhet
spean
christiansted
multiyear
charivari
vigoda
ephialtes
fotopoulos
idiosyncrasy
daewon
phagan
ribbit
hollidaysburg
ouseph
genevois
shyne
meritocratic
trever
henkin
annelise
preziosi
hote
erebuni
tassigny
salme
vietnamization
grandiflorum
tulln
blairmore
tabley
mickael
zeldin
lakshmanan
condrey
soundbite
vishniac
stennett
shimamura
cioran
marabou
savour
pails
bayona
nazimuddin
psychonauts
eubalaena
endocrinologists
borin
acomb
feltscher
cribbage
theobromine
pyper
natriuretic
szalay
perdida
donlon
storrow
stollery
brierly
starsailor
calver
psychosurgery
borgman
intermix
muhl
draghi
backlighting
hersi
baffert
haws
numerus
hosford
srilankan
tresa
mazz
paresis
hotze
chettinad
spinrad
fumigatus
gradi
rajnandgaon
fyfield
laatste
nuenen
miata
shillingford
sandokan
irion
stordahl
finaly
chio
eskenazi
parkas
mukalla
regenstein
claddagh
microelectronic
juppé
roadtrip
hartvig
huerfano
spews
avonmore
rigler
kevork
cabrio
introverts
squiggle
rootstocks
sapsucker
bergstein
royd
inconsiderable
pierhead
aeb
deontological
noshiro
eisenmann
louris
bateaux
lawfulness
loughgall
lanzi
koppa
cks
skirball
vaad
hosaka
amando
francks
ayscough
altamonte
paradorn
wakeford
hmr
muttley
petch
shantung
saltus
epochal
mausolea
akpan
hohl
skyland
noiseless
solidaire
ovington
portisch
mesosphere
penser
peral
kokin
wimbish
hanly
komsomolets
apposite
staters
whitefly
cutlet
mainardi
gahagan
pietistic
dilek
steventon
nerys
chunqiu
dieguito
bodas
scenarist
klok
cropredy
conure
glimcher
dreamboat
knowns
zernike
pelting
knobbed
praz
perilla
blanches
fcu
hots
pneumonitis
orga
torpey
banwell
padalecki
halyard
motorcity
apha
nokes
aspar
royko
imhof
zegna
ajka
candie
meyerhoff
mollify
cellino
ontarians
keltie
malmström
halis
cubi
patryk
ryunosuke
paynes
folan
reachability
biennium
joseba
stolonifera
elsworth
dymond
puentes
charrette
umair
twm
pirating
sufian
tramore
stefanov
nuttin
knoop
gitane
hendersons
buendía
tabaré
beaufoy
inactivates
onalaska
gwasg
roydon
beuerlein
incentivize
amaker
menina
choli
masia
redmon
goldson
hortensius
demoralization
ushaw
bessey
kangal
coeval
dvla
usurps
lafave
wajahat
kasthuri
ingatestone
brocka
anlaby
bravia
boonsboro
pybus
rubashkin
litigations
corneum
facta
petukhov
soled
lambertus
elian
neemo
sachse
dupas
afropop
oseberg
bagasse
schwarzbaum
guanghua
plf
hadza
stank
townhead
thre
kempson
venditti
rosenau
sendero
sanghvi
avtovaz
rutt
maximums
yizhi
murcer
zangpo
barish
martie
yob
readington
inefficiently
chesbro
rowton
dì
satter
yavneh
merpati
aureole
dongo
glitzy
horsted
uppity
semiclassical
pontormo
tti
budak
alvechurch
mukachevo
dourdan
sisimiut
quichua
skeptically
gonzi
autoblog
taseer
longship
dehydrogenation
laisenia
pupi
benenden
gunvor
melanocytic
apgar
soonest
gargiulo
hansteen
voghera
rogowski
kissi
cibolo
gadson
reichsbank
goop
macheath
habil
holzapfel
thula
egas
tfas
källström
northwick
phb
cuddling
wengen
asaad
elkan
aju
knowledgebase
freja
marzuki
rocka
futureheads
zhanna
hardisty
slobodna
poix
maybelline
casavant
lavage
sandiganbayan
simnel
lbi
blinker
gavino
canó
trouper
groundout
vincit
metopes
hijrah
demarcating
hoye
rlif
spangdahlem
fratres
bychkov
boredoms
garut
changelings
comunidades
allum
meadowhall
solenoids
osaki
gersh
richt
tolu
tomio
neophytes
kittinger
callista
dejuan
rekindles
natuna
magath
crillon
shaoguan
toffler
dinko
elitch
menuetto
rijk
kansei
schladming
glacially
boscastle
noory
defrost
lummi
rosalee
aurignacian
pennefather
yulianto
gcf
witwer
spath
canali
biswajit
tarandus
mentira
mattock
shehhi
simas
cahit
conferral
anglade
prêtre
ruwenzori
gunge
gongadze
wilmott
canvassers
acpo
calas
duracell
nkunda
gono
patting
larbert
guram
scarcer
oradell
allotting
portside
tribology
haemorrhagic
selenite
naxal
marimbas
hayom
kaftan
carra
syrus
livsey
kristijan
camphill
moringa
tibco
magicjack
cowslip
stoch
kovalyov
costars
cruck
staved
wadhurst
pkwy
telesur
chatelain
ihe
iasb
calendula
cherryville
endres
belem
leola
cheep
lesli
miryang
rabie
impute
workingman
ignat
zilli
mettmann
kuldip
curto
scicluna
yot
dehaene
shumate
paver
arbres
bongiorno
crosslink
wbu
quartiers
chikungunya
backhaus
greeter
cimb
matlack
peni
fcn
airless
wardi
klobuchar
bombardiers
dacorum
tibbett
wilga
destructs
ftu
pegmatites
tiziana
snively
intensional
cradling
werchter
rodr
stipes
bandarban
werneth
nudists
matchroom
hollmann
heiskell
torridon
katzenjammer
quraishi
tosti
hleb
fiamma
tykwer
laliberte
sater
abaris
tickner
alethea
shirokov
abrahamsen
leuluai
cananea
biopolymers
noscript
parterres
ardglass
klcc
nejm
massalia
renick
allom
assynt
madams
vicary
eppler
winson
rahmatullah
shapps
bobov
hardtops
proofreaders
blots
baubles
cowtown
nuthall
laufenburg
naseeb
hoquiam
vuillaume
tshering
avaz
livability
spiralled
perfidy
placoderm
shirov
espectador
trishul
emrick
decompress
springy
biryukov
mikako
daldry
cauley
nourishes
nasta
pichel
samvel
skatalites
latourette
trouville
obel
logy
ciotat
spilotro
texted
scads
cologna
priscila
proscriptions
esx
terrero
egl
tabib
rishton
dema
sojo
filion
naviglio
doriot
aggravates
zundel
twitchy
juco
vata
cafaro
kimberlin
sopranino
overlie
naama
yanai
purée
porteus
muzzles
enterovirus
casto
jochi
ghostwriters
needlepoint
ortolan
harrying
milborne
wailer
idiomatically
christou
mmos
jouan
dehart
balah
appletree
darom
haldar
newville
niem
dews
liberalised
agnete
hamnett
guardrails
machale
hoxne
ornithopter
urp
tallit
getrag
stoffel
exhort
stilling
upsetters
bastet
schwerner
overberg
agip
amoretti
russified
inconclusively
yearlings
spelthorne
guptas
hcr
rach
reveries
adulterer
cappellini
syllogistic
regionalization
synergistically
azan
sœur
southcentral
smeal
khursheed
skanska
cherkess
hodgetts
pilloried
multimode
littrell
outpacing
thies
vulcanized
aihara
carmouche
tricare
wicken
roosevelts
clyburn
aerea
ringspot
khim
defore
imbibe
microcar
ceram
moneyed
balbriggan
ngv
caffarelli
avie
anfal
tuggle
paxos
underwrote
vouching
teoh
jhoom
huay
klayton
kijima
xss
nabo
manoogian
contrada
wads
trafic
poring
barreda
vatanen
hittin
pullum
leconfield
inseminated
morgado
longy
pushover
wigger
itô
enkidu
breathalyzer
heppenheim
hgs
diethylamide
otford
bigwig
bagpiper
dambusters
akula
poulos
lourie
mountview
subbuteo
bruckman
villy
cogwheel
abutted
okuno
taghi
kmg
disinfected
demirci
tolerably
gauntlett
amauri
tricolored
sparro
goofed
rihm
wheater
regalo
skybox
frothingham
harshman
fellaini
burmester
charle
motoren
iftekhar
englishtown
appurtenances
veenendaal
ignalina
seijun
lissouba
kotel
cvv
bingaman
massee
venray
wroxham
drayson
presumable
velveteen
alexakis
vaziri
salzmann
ctesias
ramalingam
bonatti
polluter
risco
anzu
athenagoras
cobe
wojcik
dentil
arale
hydrophones
campaspe
nahas
bowra
shrieve
hilgard
mappin
rjh
mephedrone
polli
mawlana
stamkos
eigil
mgk
fitzhardinge
werkbund
fortnum
spellers
autodrome
yit
oversupply
shusterman
checkboxes
tuz
pwo
kosal
rogallo
vdw
tantallon
reciprocation
zarek
waz
mfb
mchattie
woolner
simp
courtrai
whang
swifter
wiegmann
berck
mckees
engelman
flettner
claypole
lanphier
pelfrey
breakingnews
nuland
makmur
thoroton
serc
derya
uncharacterized
moher
relatedly
beaupre
ellender
misdeed
mcbreen
hanoverians
castaing
agudas
cawston
grater
electorally
itaipu
berkey
disgusts
wakeup
dennistoun
kulish
faiza
kastel
frierson
déjeuner
bulking
antiseptics
aberfeldy
azamat
mufasa
jedd
thp
transfigured
humvees
wsoc
mcconkey
gidi
jeopardizes
kankkunen
zhukova
tarns
abbs
rumblings
ultrashort
donie
sheema
nqakula
hardliner
stadtmuseum
gaëlle
hanada
arps
broiled
rehire
fakih
peregrin
vaccinate
piermont
mrk
unos
chamakh
tesfaye
clanranald
mccarrick
infesting
tekakwitha
blears
zeuxis
arboriculture
gweedore
gergen
markarian
recycler
goines
toxicants
principessa
taklamakan
lamberg
turnor
fll
drolet
foliate
ahron
waxworks
kirkton
shilluk
aerie
keikaku
actionaid
yuill
hatano
vessey
kosha
directionally
sankuru
miettinen
cornelissen
virani
vehari
longchamps
uncorroborated
spellchecker
carnotaurus
brasch
muza
jagdpanzer
reheated
viby
gresford
barques
parool
eyebeam
connells
armer
rubbermaid
zaida
cual
emoluments
nalwa
dissuading
georgine
hube
memórias
assail
ingratitude
guttmacher
levein
bridles
personnes
reflexion
orality
gpe
gaona
timezones
hord
soaks
menage
preparers
sluggo
summered
glockner
illusionistic
wuornos
tramlines
uen
rinds
clagett
bannerjee
excommunications
bfp
polytrack
woodway
backstrom
arrowroot
osca
hemery
maslov
nazionali
héroïque
ncca
meem
parklife
churns
rajyam
facially
crocco
douglasville
kavner
ardiles
windowsill
corallo
tanking
tegmental
ussuriysk
aifa
lurkers
charbel
otherside
porges
lelong
duncalf
caraquet
rangifer
samaniego
psilocin
caselaw
rendlesham
morali
beamline
zumbi
readjust
pazuzu
aloysia
sturmer
seatac
nistor
seumas
nyra
gellhorn
imada
caramelized
ofili
persinger
fluidly
yata
gratify
bedia
djawadi
electromyography
afrikan
remitting
federales
buttonhole
fuglsang
halbe
yanji
bradykinin
suozzi
lallana
oolitic
jibrin
kips
leshan
benj
meis
fussing
furie
snowblind
dowdell
milana
massaquoi
robbinsdale
terribles
papazian
swoopes
unreactive
aono
ufi
zhendong
vögel
carrà
oakhill
purslane
hirani
mourdock
jawhar
prickett
limekiln
sers
buthelezi
guicciardini
mugwort
ryk
cupra
klarsfeld
marquees
arianne
tsmc
admir
winklevoss
dishevelled
grat
mendacious
hallin
mondegreen
kawkab
emptor
fady
vipr
billick
infilling
krajewski
plattner
kalmus
pite
woodcarvers
tanay
donatas
nippert
squished
lmd
alatau
naogaon
mickens
vorderman
cremin
foskett
praxiteles
blaby
propter
woollard
shiseido
aggrey
cmn
lude
pret
skvortsov
kprc
pantoliano
sagunto
tahini
gorgons
prisco
agbonlahor
barritt
terroristic
dorien
weingart
portmeirion
meekly
creedon
gcu
thandie
smidt
galletti
bellezza
bundoran
ekholm
ouedraogo
tetuan
softimage
corms
klansman
precendent
fibro
korie
percolating
farhadi
lehnert
returners
ruderman
sesac
himani
timofey
auclair
daka
grossen
kemer
chkalov
kirpal
elsdon
esquipulas
cilliers
janardan
imbues
unda
sarker
blanchot
whio
ocm
takasu
misconstrue
lolcat
banns
temperton
rasoul
piggies
crandell
papageno
gullett
fethard
daytrotter
serializing
alighted
vaginas
tiberium
matinees
hermas
visualizer
sirtis
monodrama
mistitled
lilias
noosphere
bagg
monywa
uncertified
depopulating
horsman
tair
aledo
goldsby
ucav
abueva
douglases
kwanza
forli
karpin
gunny
simeoni
nicea
titmice
craddick
taillon
tonsil
beuno
kamisese
applicator
pravasi
sarel
nanteuil
castellet
silkin
cataluña
tesserae
takehiro
nard
hydrae
africano
babis
waterbird
aflac
wynalda
patitucci
labette
lmr
arndale
botnets
mirabile
lykina
mincemeat
ranthambore
gati
dkny
benetti
dauntsey
hydrogeology
calendaring
kalmia
creager
merve
nociceptive
grosch
fetes
sumana
austal
leath
reined
kiprusoff
darod
londoño
jeung
trec
parchments
madu
jwa
vinohrady
joma
ebersberg
ontologically
hermeneutical
muawiya
oric
ravensthorpe
rill
abol
galasso
dqa
mapmakers
wncn
duprat
emslie
ecml
anaplastic
rendang
plowshares
bheema
delmont
yuman
nastier
forgings
kpt
nvr
impulsiveness
yasi
eisenbach
aldon
eem
rawer
medievalism
sexless
strolls
midships
mirani
grizzle
helan
formalin
meaford
gwalia
devery
liaqat
tuberculous
cossiga
otg
condones
assar
bero
atropurpurea
railroader
crin
spinto
clindamycin
beem
didyma
darc
relaxants
tongzhou
herbals
stambaugh
brockenhurst
albanesi
mashburn
buncrana
chastisement
domergue
portstewart
chukwu
sarmento
tremens
dennie
rescaling
synopsys
shamu
cariou
mishan
cabbagetown
ciau
galkin
mirus
kyösti
arromanches
préval
reprap
rajai
criminalisation
caloosahatchee
hongyan
weisinger
churchyards
jolted
pipi
crosslinked
dostal
brf
beeler
ibadi
bargoed
hazlet
gibsons
kevyn
blowouts
lahori
conecuh
luckless
saggio
vver
umlauts
erects
sutphin
cafu
aironi
stutsman
hypertonic
briquette
tonson
moola
kirlian
sahgal
dalem
lubov
allchin
lodes
herlin
sabater
steelwork
marbletown
veining
zellner
subdirectory
ronaldsay
acir
knowe
expatriation
kvb
reverser
maylands
mercurys
ghanim
transcriber
yablonski
serme
yoker
fjeld
ipiranga
peil
mazatec
craney
cropsey
shirreff
branche
gardo
jungmann
helwig
caq
subzone
duri
luisito
lvs
impasto
postbank
magliano
malahat
repatriating
tholos
baoli
instills
dosh
tyrannosaurid
newfane
baik
huberty
karenni
ratiopharm
handl
glaive
akabane
amare
ebbed
lambourne
correr
aweigh
rastrick
gilled
solé
yid
comedienne
trenet
alembic
garzanti
chaw
apv
kamio
anangu
tuomo
madrileña
hypoallergenic
cristine
klosters
shimei
consecrations
dilutions
cordray
snooze
zerlina
lannon
coltishall
indubitably
pmla
kashiwazaki
endotoxin
bokan
qassem
yamano
caspersen
lockinge
jadi
suttie
definetely
lynxes
tonalities
metalloproteinase
gosdin
nums
minichiello
pummeled
sonera
coury
hohne
cauthen
slmc
aecom
jakubowski
halitosis
terraformed
jerusha
torchbearers
underachieving
marijo
lorenzana
bootmaker
apeejay
mulcaire
itw
flashmob
pledger
adalat
kolam
anisimov
winterreise
denne
madill
paralympians
brittleness
pasargadae
zigzags
atlantico
medaled
naxi
uvc
cairngorm
alaya
leir
léonide
marranos
joep
guarana
clague
kleiza
harn
metabolizes
sulieman
schantz
realclimate
telewest
neutralising
lumberyard
maté
suskind
schwalbach
balete
seavey
blat
insta
nastos
boya
coulda
melby
scottdale
diebenkorn
totp
sweetnam
predispositions
inverleith
ltt
bioethicist
hawi
giesler
leggio
grbac
berny
alberoni
stetter
whitesides
widmerpool
aïda
caddyshack
lipset
spruces
quivira
aiglon
hurriyet
muluzi
arntzen
junket
commentate
poeple
blaize
afgan
prashad
unravelled
immobilizing
pergamino
polkadot
silsbee
epling
zombo
educationalists
jockeying
neonate
hyping
sugimura
lazarsfeld
mko
greenlanders
shwedagon
altun
corporative
matadi
backspin
clubbers
canajoharie
euratom
dassel
solitudes
ransford
pugs
dépôt
hendley
flappers
pethidine
onsets
humic
afx
cornthwaite
swordsmith
sidorov
kraushaar
sisteron
baor
examinee
primakov
immobilised
enraptured
konsthall
memetics
ycl
aars
mishin
konow
wrp
kirkcudbrightshire
luckhurst
sextuple
ostracods
fidelia
eckard
scally
korsakoff
licit
victimless
peeved
perfidious
idir
cih
pinkus
sputtered
thermopolis
figueredo
biton
oly
vova
horrigan
aduriz
kuts
effi
corine
bakst
manderson
preparative
pindad
slingerland
monken
owaisi
toji
zainul
escp
sculpts
hudud
ivette
yma
khusro
impromptus
halfaya
farfán
mafi
higby
reichenhall
intrest
buton
bajada
viñas
neild
anciennes
matkal
opg
perin
synchronizer
pembe
communautaire
seydoux
meader
singed
wheler
caroni
stogies
majorette
thunk
spork
tonics
dismutase
submitters
tageszeitung
floodlighting
zich
barbaros
evashevski
cuyp
sativus
morat
terayama
tigr
osv
sipri
experimentalism
shuttering
felisa
prothro
debbarma
rait
fritzl
alport
overlander
coade
szegedi
eser
lunged
imed
kelmendi
kahrizak
jahanara
hedo
referenceable
están
terblanche
asat
asociacion
staw
tieng
karmin
manicure
kushnir
ligeia
arlin
caspases
squinting
lualua
bienen
audet
dussehra
basar
iraheta
jutarnji
parth
mimoun
crewkerne
residente
frowning
pakpattan
camejo
hunterston
rne
sarasin
gevaert
placide
baitul
demott
kuber
wigand
fiddly
trex
rewiring
elr
acclimated
klinikum
undisguised
breastfed
hypercholesterolemia
micros
frolich
neilan
alprazolam
chiklis
kuttan
carlgren
brinkerhoff
reuleaux
wbtv
canina
eosinophil
ellon
farahan
hepatotoxicity
osvald
jahra
hoddinott
divulges
scrawny
jotaro
waze
kripalu
asturiana
cozzi
extern
disbelievers
getzlaf
taihang
gopnik
immerses
weyrich
trillanes
kuehl
hambone
yungay
pacto
zoellick
jelling
bhool
vinho
saloni
esdaile
luxuriously
lalah
mcmorrow
rinna
frolicking
mawgan
horwath
ibert
medlar
bucephalus
playhouses
crystallisation
neturei
embryologist
juez
ranier
mashimo
wrubel
geof
unkel
herminio
temblor
valin
hydrophone
cabras
lynmouth
ruoff
episkopi
munaf
sabel
título
revaz
benchers
varallo
jusepe
prosecutes
lante
kaleh
lph
almudena
westies
kichwa
nazif
weinert
thermography
quemoy
hunsaker
wangler
aloo
carpooling
pommes
farnley
krenn
widney
clohessy
hally
unfeeling
rangelands
wutai
bres
fagioli
bhakkar
tommasini
passionist
chasetown
needleman
sampla
hydrosphere
brer
makinson
pfäffikon
tobolowsky
awsat
pcso
odaiba
puyang
decentralize
concreted
hgm
rituximab
counterbalancing
evacuates
contrails
svedberg
bannen
bonomo
liveable
sadlier
maxence
sensus
crisscross
grabe
newpaper
mcilwaine
janni
reik
djakarta
mcguckin
kercheval
rovs
singletons
irlande
parasitologist
kotagiri
honington
mihael
mizner
retta
wows
odalisque
khela
euc
hapi
welham
hellgate
peronists
iacs
metrosexual
uddingston
limnonectes
janny
gettys
fabulist
lapworth
pishin
yuzhny
chuckwagon
syncs
scones
yusif
quango
yabloko
rozario
furano
gyration
wheelies
puncheon
chaisson
swindells
amadi
cgp
expiratory
scheyer
shortsighted
cowpox
zohan
entangle
scamper
maplestory
krystian
ultrafilter
klavierstücke
gats
solaire
stw
runkle
cuerda
elvir
rangeela
breathtakingly
widdecombe
laux
condie
islamiah
spatiotemporal
conjugating
ravitch
bouchier
tsewang
icsc
yous
shoutcast
griffo
sajama
shiraki
alizee
principale
wastebasket
freshfield
reverdy
silvani
mentis
hydroquinone
jado
bombshells
nork
jalabert
shoplifter
emscher
glowed
edgell
exomars
clayworth
vladimirov
doted
gooder
limey
kuriles
salihi
azzurro
innkeepers
statments
donath
trotha
meral
earwax
statt
dorus
thuram
karney
limbless
isoniazid
tropica
forestal
wellston
noha
sullenberger
pugacheva
celebrezze
arpino
richet
fluffed
matchbook
bordin
jazmin
sige
tasik
saenger
corsicans
penury
castronovo
datar
langworthy
lanolin
stolac
acas
dermody
fondant
bva
alvim
kert
lutenists
shimamoto
ausonius
smolny
suvero
deogarh
maillol
dysmorphic
nordwind
simko
yohei
kitayama
contraire
jsu
aught
cantilevers
vanderlei
logout
drd
pakal
unobjectionable
iorek
hawkey
idli
sundeep
ultramar
adbrite
abdellatif
buckeridge
jasim
haylie
sinopoli
persichetti
digbeth
wallich
ghettoization
hollandse
goitre
dinaburg
politi
identifiably
emmanouil
thep
beyaz
kusi
quispe
westerplatte
montesquiou
sackets
cherishes
tangling
valene
fortman
basak
veasey
krikorian
labarthe
brawny
waterspout
ceremonials
difesa
eilert
rixon
concoctions
claudie
mantia
thyestes
douchebag
netivot
steelyard
damariscotta
marveled
ballona
beamforming
gunmetal
labadze
weitzel
bossu
acec
rhapsodie
hest
necessaries
chanterelle
spaceshiptwo
indemnify
chernow
frechette
champagnat
pigou
owers
rukavina
gpf
coqui
ruto
neef
kaprow
mirisch
groo
tona
petrocelli
fondren
bookers
corinto
marvelman
narrowboat
dalriada
stovl
semetic
gernika
stipple
mosaad
botan
sassone
acco
slapdash
tajo
mandula
acinetobacter
mangrum
isocyanate
pertamina
teleconferencing
kirmani
cowsill
mikee
ziolkowski
helikon
jazzmen
sebum
aubers
fosca
cubbon
hendre
ravenscraig
schallert
ioe
polishes
motorbooks
carbolic
fabricates
ameritech
hoopoes
elisabeta
iriomote
shayla
beautifying
magnan
galangal
reasserting
reassigning
siddhu
alife
jarnail
flom
yilin
ringnes
vicodin
dynegy
invidious
brancato
worksite
accretionary
gregoria
enterobacter
industrias
solt
goldhaber
feminized
arguedas
uncultured
stringency
publicaffairs
frogfish
interschool
osmanthus
furrer
cld
maximised
hotlines
erni
alaknanda
roethke
xanax
telecommuting
nauen
chugoku
berghaus
waterbuck
reits
poveda
adcs
roswitha
pountney
nephrotic
baci
spach
tombo
anlong
oehler
stumbo
vassell
disinvestment
darnay
gorshin
stagno
spel
criminological
gslv
kazunari
cleisthenes
alano
provine
kokhav
azua
torey
wingrove
rivonia
fondling
kosor
piggybacking
flurries
vigier
goulash
tinkling
cavallini
ladykillers
cataldi
stowaways
lorenzini
qalam
sgb
pendarvis
zugzwang
xinxiang
manetti
hermeto
chimbote
dulhania
nowakowski
outshine
fiddled
hanner
marsteller
rhu
gagaku
carborundum
goodtime
oecs
mochida
sberbank
fev
barfoot
murcian
rienzo
ladylike
dafa
slotback
erice
baard
btb
nobutaka
charrière
whan
trinley
porcelains
cullin
wagar
idel
prohaska
isakson
simplemente
gern
indelibly
quartermasters
afri
muskox
nonius
writtle
mignot
gamson
debretts
grizz
franchet
kiwa
blackhole
corbitt
millibars
harshaw
loafer
turnarounds
fonz
mhra
aquilegia
stepchild
satcher
rietz
aráoz
telethons
personam
yushi
allergenic
intramurals
certitude
meryem
kram
madres
qalat
kickball
althaus
schizoaffective
euthanised
angkorian
takayanagi
interrelation
mouflon
nuristani
isbister
encase
tosco
intimation
dji
cilley
webbs
aking
adell
dynast
howerton
barrasso
acheived
tamari
reoccupy
involvment
coonskin
formule
tansley
icebound
queenside
raffia
cockfosters
cyclooxygenase
audenshaw
interactional
pilkey
histria
kadare
catapulting
devilder
tetrahydrocannabinol
midcontinent
neris
zakhar
pastorals
shafqat
tapti
wheezy
deprivations
artemus
tomonaga
erec
bookmarklet
faheem
buckfastleigh
paek
abramo
wonderbra
posterolateral
cawthra
elderslie
seamans
ribonucleotide
yuxi
motivators
drian
taban
isolators
warneke
mileages
bhavna
longhua
miccoli
wilcots
circulo
bickersteth
shakil
stylites
crafoord
bgl
gizmos
abot
reedbuck
melick
briles
pirogov
belva
lovedale
negley
matewan
eichel
eyetoy
wrongdoers
dimitrij
ticaret
mown
suppers
brisker
wedgie
gasps
gilruth
unpressurized
eita
fortson
odescalchi
cdte
reneging
golia
shammar
sweetser
claverhouse
kameng
kalinka
sajida
doublespeak
koudelka
collocated
stagehands
decavalcante
schad
tiye
paratus
fallingbostel
kans
moretto
meche
yotam
delli
unsalted
haberman
sharpener
barât
abdulhamid
mabrouk
cryptogram
malvin
tanimbar
batesian
wernham
devyani
acquis
semaphores
conceptualism
rostron
blundering
geiser
freeloader
raita
tearooms
kaikohe
maturely
chowchilla
colada
menkes
poissons
verran
apoyo
pittoni
granato
letellier
willowherb
coltness
charito
soco
makoni
saut
jollibee
sousou
weicker
litterateur
whippany
ozerov
serrana
diversities
adminstrative
keston
stamatis
spinocerebellar
reisch
arenig
bromwell
iole
puhl
itx
mattern
westerland
southpark
mierlo
keskin
valcartier
columbians
deben
chacao
aliki
kalymnos
percentiles
doner
dulany
ayoade
virgenes
revegetation
vovk
usfsa
armé
teign
okk
gruss
mikki
vasko
snak
neonatology
belabor
galitzine
magas
deseronto
quartos
karrie
blackstaff
ethz
essayed
chinquapin
spinnerets
asexuality
epitaxial
volcanologist
furnival
dransfield
stapler
squirm
handprints
eucom
silcock
alutiiq
gherkin
amorsolo
merozoites
pavillion
hillwood
nafa
xtr
hingley
strapless
peons
zaks
varios
toupee
lartigue
lpn
farron
faur
bulford
cret
joa
capricornus
polsky
landscaper
vaporizer
rimrock
caucasia
ahmadzai
stolpersteine
levitra
faulkes
kims
negrín
gup
christelle
midc
bealach
networker
aberaeron
sangay
kemptville
absentees
franciosa
velvets
goyen
desales
zandi
slugfest
wignall
pyeongtaek
spiritu
intermingle
rre
ieng
monzon
wangari
antwan
mcanuff
villamor
welliver
antiterrorism
centerfield
ical
echa
vacillated
kapok
cerveza
gloat
amadu
gilardi
penitentiaries
gorseinon
meursault
uberto
alya
través
miñoso
offsides
yonex
wachau
kowalewski
bouwmeester
tanf
gyllene
lamplugh
manford
vitrified
inspite
chiodos
hmx
conchos
poulidor
turre
sweezy
descanso
raasay
rossoneri
pasuruan
sazonov
spaceballs
swilling
cradled
corredor
marmi
jindabyne
ragnarsson
sfk
kermani
touba
tallink
bottomlands
kinyarwanda
bellus
mmpi
stopovers
desrochers
savana
chunking
cacciatore
feridun
blinkers
ittefaq
tailfins
franking
kokang
senderos
pointlessness
grondin
lumea
boîte
satam
tottenville
inkheart
harvestman
sorolla
stoplight
lisson
crail
buchtel
pantaloons
toffees
nytorv
tollcross
bigley
idrissa
marcil
ryuhei
chadds
realizable
depositories
quorums
dembo
honorio
skylon
nkf
cenotaphs
paddlewheel
vipul
mazurkiewicz
barrowclough
waterwheels
mehrotra
fluorosis
gianfrancesco
eytan
leavey
ladi
forsook
mischaracterize
qca
jio
budnick
burkart
albe
tragedian
rondel
kissling
refract
mentalities
perrineau
krld
monsen
iacobucci
quadruplets
bevo
ecobank
ripp
leke
molfetta
ventrolateral
vinzenz
bracero
alpines
mckevitt
monovalent
parkhomenko
embarassed
cgh
funtime
iser
tretiak
brechlin
keenum
caspary
erogenous
beeville
genesius
kielder
obert
pfoa
snagging
arbel
paulsboro
imsi
borriello
repetto
sayd
ciliates
tatsuta
klock
nrbq
mikhael
equateur
salmaan
clayburgh
giardello
kobrin
obl
baldelli
datt
bergheim
liesel
bekka
martelly
pyrrha
elad
milsom
erda
prabowo
karem
volksoper
neuheisel
keya
gortari
julin
megaprojects
canora
looy
sabas
mccaskey
sweetgum
sepe
deba
jovica
pupo
baconian
tearle
sevenths
suero
kasprowicz
tanigawa
cohost
carrée
puspa
kacha
whitsett
fogging
boghossian
churchland
seamstresses
glin
loic
contextualizing
isidora
miani
jokela
tickles
quercia
middlemiss
bellomo
peyo
arben
dilshad
fsg
arinze
tobogganing
woolfe
akademija
kasner
describable
witting
halkbank
streich
vervoort
yantar
signorini
weide
veere
mislabelled
blandine
aparri
wyandanch
thebe
bordaberry
campbelli
arwa
squeaker
yesod
omnicom
weck
luppi
mequon
bluesbreakers
mauritshuis
coseley
crede
langside
blenkinsop
bathrobe
motzkin
mudanjiang
cucina
willig
tuku
exorcised
newegg
mazin
brantly
ohioans
bronk
avermaet
salvinorin
bruschi
airlangga
briskin
ditson
tto
wedd
burlison
nicolaou
brainpower
santelli
inbal
tenens
polyploid
masamichi
aeroport
heimbach
hofheinz
acclimatisation
seadragon
hassling
parapsychologists
juggles
figari
mulliken
unplugging
rationalizations
maravillas
lolich
lelo
kohoutek
seasprite
abdeslam
teun
anpr
yangzi
patisserie
elucidates
maternelle
punnett
laron
photogravure
balladur
karmen
lichter
naomichi
riitta
tarvisio
beber
irsay
aashish
pathein
passphrase
shizhen
novakovic
romeyn
investec
samphan
torte
macer
cisa
mejuto
bols
chytridiomycosis
daunte
markman
parkfield
orsino
slideshare
wellborn
titty
bourret
gilcrease
takenouchi
toff
kersal
stap
colectivo
nli
silversun
jovita
wanes
bautz
risible
lunging
tyramine
hatanaka
fuengirola
lustgarten
revenger
tanita
hohenschönhausen
perito
mullican
medwick
piperazine
aiman
claesson
mazzetti
leporello
heraclides
zuhair
copperas
propensities
nymex
babil
scarisbrick
bjorklund
nechayev
trimark
bitterest
lenna
vulvar
moapa
getcha
usw
backdraft
matcham
chemistries
ambros
kneed
bruma
gammons
islwyn
stirk
unguja
goutam
stirrings
bowa
hlc
misbehaves
reddening
branksome
riemenschneider
bmf
acna
biogeochemistry
itas
wrongheaded
saturnia
marinade
kuttanad
rrp
ecclesiastically
pecks
instil
logjam
forró
digga
postulator
enl
bareli
procida
onedin
filemon
courson
pulham
pictographic
azzarello
toone
osmolarity
orji
dubya
insularity
inconveniently
arzamas
wawona
babine
librarything
nadp
ocasio
staller
umemoto
maxpreps
alviso
zervos
magnifies
ziani
kiper
trekkies
dcns
hipolito
tamzin
sthalekar
discussants
martinsen
lilford
piazzetta
jivan
dogmatically
jiwa
büchel
labe
balustrading
woollcott
rovinj
pflueger
tarnowska
mortared
jiayi
trotwood
sauteed
aabb
luckner
contestation
daruma
heta
neurobiologist
lunney
maois
hoofs
barkha
excellencies
mudbrick
nandana
davon
diptychs
trumpeted
nishant
scandinavica
miroshnichenko
socialised
infoseek
desailly
kazuno
raudabaugh
yelizaveta
inputted
glorioso
folkman
cey
ccas
mitsuya
elution
hedingham
irrelevancies
rollovers
blacky
igoe
leonardtown
dorjee
pawhuska
panah
viard
marylin
calathes
laidler
knits
representativeness
intermediation
santoni
ayrault
guangyuan
heddon
ziering
tonys
hemangioma
kulasekara
reutter
contrivances
delahunt
chadli
disinfecting
fulwell
roussin
chirnside
bujang
interborough
panskura
dirlewanger
lindzen
hayati
ridgeback
riles
bedazzled
nordgren
rossio
neuroma
waechter
castroville
nmd
orlen
vandenbergh
urtext
hallier
idolize
voeux
preorders
dente
mifepristone
trawsfynydd
intertemporal
randles
langlais
probationers
lamberts
anding
uxo
candelabrum
damad
ploys
tarentaise
caulder
gujaratis
vob
meinl
chazy
etchells
soz
akkar
petroc
summonses
clavin
glum
zarra
xabier
bagnoli
palwal
chenda
classifiable
natufian
bitdefender
guastavino
pratica
tithonus
ferit
dracut
peatlands
udoh
apic
stachel
gava
khedira
juancho
racialism
chinnor
foreshortening
subdivides
dulli
homestake
warioware
bergersen
discourtesy
teofil
ataris
wixom
bâ
lambing
tannahill
elworthy
assayas
glau
voigtländer
sukhi
dispersants
newbattle
nonlocal
dodin
okajima
tima
fisticuffs
lastfm
catalani
ekd
nordics
zaghawa
wenvoe
loli
thickener
kaupthing
sojourns
wappingers
coquet
awaking
taranis
suttle
viciousness
dysgenesis
baartman
heeney
holmium
hassocks
smurfette
leontine
flamboyance
hathcock
kakkar
sneinton
hajek
bösendorfer
fieldston
phung
tuckerton
neeme
sidewise
castleberry
tullibardine
mirandela
obaidullah
jadoo
lacewing
keiichiro
harpa
bichon
bingu
tanvi
palminteri
datang
doneraile
dhhs
kroupa
triest
papst
tresckow
talis
mohar
hfa
andantes
abhainn
rockmore
hald
necc
thielen
mahopac
reverberations
chissano
digos
prehospital
marilu
elbaz
frolics
jettisoning
gulik
kokko
christl
schaghticoke
birdsville
eei
helberg
rachna
vaishno
astier
jupitus
fetting
runton
wme
beba
consign
farzad
erdrich
erevan
mifare
zare
fufu
goodacre
orientals
dilapidation
ciudades
laffoon
closeups
littlestown
hees
stabilisers
hartpury
debridement
marcato
tfsi
betanzos
nosey
demaio
kusch
melian
keoghan
catmull
encalada
dars
policarpo
rawness
cullom
fieger
glittery
russky
demobbed
coomer
ddot
teliasonera
kepner
abusir
estrellita
böhler
chernomyrdin
berlinger
brah
sreesanth
heckmondwike
frederich
depressus
frea
officina
mesbah
bushati
pirsig
mant
furloughed
valsecchi
woode
otpor
malcomson
miraz
hise
jazzmaster
tovah
suprise
stremme
doonan
blogg
solutrean
cassill
cvi
florek
summerson
pazienza
amahl
lubetkin
kassala
horgen
wades
fex
nakane
dmrc
helford
centuria
abeles
kitschy
watchword
photocopiers
lenina
hohn
volf
scarman
saralee
gonzague
lebowitz
westphalen
isaia
ooooh
addlestone
tallboy
faunce
suttles
dethroning
tediously
kreuk
cryme
bingle
hidehiko
headboard
delphian
bavo
multihulls
unallocated
sprecher
etzioni
yauch
giovinazzo
mthatha
sonybmg
curmudgeon
ebbing
teleconference
notchback
bowsher
perjured
mirtha
soulshock
ignatiev
nyonya
terrorising
tuberville
revolucion
coxwell
tgwu
jovin
ecoles
nóbrega
isayev
bafokeng
temer
tarini
kuning
immunoassay
processus
biochemically
guineans
fynn
outros
almanza
desaparecidos
cronan
miracleman
silje
jsoc
pilchard
zada
&#
bricolage
iftar
noriyasu
qingyang
georgeson
soundproofing
dorotheus
bordoloi
unmaking
bissonnette
totes
maybin
mrinalini
vanzant
agbayani
bouzid
caloris
panguni
chos
hummels
comacchio
eav
naryshkin
nlra
fontan
shovelnose
infini
illimani
cony
mdw
chiding
julep
nidderdale
flowerpot
bachelot
mersea
pacesetter
ichinose
kiesler
hig
factotum
minet
cassi
hahne
desperadoes
fritsche
foosball
bacteremia
nubes
ryutaro
laïcité
trawled
bursitis
dishonorably
romare
patar
dehydrating
arshile
pronovost
aircrafts
poulains
tsutaya
dhamar
retread
transgressing
asur
fluting
heartfield
penthouses
jodeci
christoffersen
laroque
ongaro
wardman
skytte
comillas
switchgrass
florizel
becki
itay
chastened
vitantonio
heinen
rusts
fsd
barnston
gorst
sudano
aminul
dohring
immolated
petabytes
ruas
underbody
lusher
altria
giwa
industrialize
downtowns
tekapo
zehnder
masoom
desiccant
yonan
lepine
moscoe
nottage
nashoba
claspers
bornemann
jawf
jizzakh
chattaway
fieldbus
wachowski
walkthroughs
quoit
zot
shits
alcubierre
yuliana
harkey
jif
maryan
bioaccumulation
skillset
lyin
marktplatz
filemaker
harasser
bathyscaphe
balbina
rempel
kaolack
borbon
kleinert
ricasoli
antagonise
shuffleboard
volleyed
acconci
shanda
altin
dominical
reinoso
roseboro
weissmann
beckenstein
bawtry
fysh
brizuela
improvisatory
yyz
cabezón
ballett
jori
psalters
liversidge
crabby
alsthom
patchouli
dissects
sucka
snicker
simonian
grandiosity
aghdashloo
liggins
potentiometers
quaver
gruening
galpharm
sichuanese
zealotry
trehalose
parihar
bodes
saux
baranof
lambro
obstructionism
usonian
crosswalks
gahr
meti
primitivist
upul
ule
ecatepec
ecclesiological
shoaling
corexit
latinum
urgings
naden
coudenhove
cester
heirarchy
honorine
gutzon
contactor
elitzur
ancon
negations
kalasin
neocon
libeled
lampre
folco
chhabra
bifid
nle
auspice
climent
iguazú
ripton
whould
mimsy
gonthier
greger
vegemite
wheats
frydman
unnavigable
voxels
apos
loggias
sideshows
gorley
clawing
takemura
gaelscoil
cihan
lapo
htin
rabbitte
immokalee
klöden
mpw
xicheng
schach
perplexity
xist
gugino
tuddenham
marianela
wij
shabir
garganey
figeac
arkley
gelati
wookiee
makis
pwyll
sann
petiolaris
crossbars
mycologia
calamine
lopate
exhilaration
montélimar
uniqlo
kabri
pras
segawa
fraulein
schleich
commerciales
indignities
tschumi
saddlebred
totaly
monounsaturated
waipahu
barwa
charme
cappon
grisi
unpasteurized
movsisyan
neemuch
tamai
saraceno
shaws
mcgeehan
fleche
sporozoites
kellock
risca
wisley
kundra
konovalov
fairwater
loonies
aacc
cayzer
bandanna
redhorse
pyr
milbanke
europacorp
beddgelert
launderers
ullapool
laggan
hasid
pumphrey
inocencio
shrivastav
hirschfelder
suryavarman
rangitoto
falak
trebles
potentate
morada
ranta
kilcoy
etchingham
romas
swatting
rosamunde
subcontract
pedroza
kikongo
ifrc
manek
relais
comprehends
sleighs
verbania
laurentide
ouahab
wanadoo
aerodynamicist
quickbooks
schenn
tractarian
egen
metalworker
hughton
wll
pedregal
belgrad
creedmoor
vierzon
hht
felber
thind
mozaffar
baboo
hrg
pedalling
baserunning
ladas
adaptively
derksen
genkai
sungoliath
uncoupling
fritter
prsa
bovington
exhaling
roughead
attias
francese
blart
darwinia
mcbrien
hubner
sumburgh
agreeably
chitnis
clarkdale
norovirus
harbingers
rodino
hincapie
evocations
fossilization
melor
sacerdotal
reil
lacasse
irri
silvas
wyant
shifnal
zedler
jebsen
lemminkäinen
pyramide
mittenwald
rwandans
mitla
houldsworth
simmental
ancel
egli
haarp
vraie
brominated
allspice
alguien
raskulinecz
greaney
isard
sestri
tarpan
vaping
pirzada
makarenko
kaylan
dhalla
unerring
larina
woolnough
mühlberg
talkeetna
zie
delt
gabber
hysteric
bruit
hainsworth
garko
queasy
balsom
shuey
finkle
maari
deeney
otunga
innisfree
invasiveness
sulfonate
consuela
granulomatosis
polytheists
staverton
roundworms
tanimoto
synchronizes
gerow
abertay
petts
bakir
stifles
heathcoat
snoddy
seidenberg
ceph
dest
loures
complet
schlichting
gowans
atiku
antiga
armey
jailor
sensi
snared
cavin
veldt
gsz
stradey
scripter
elsbeth
kaliko
carrott
egea
skint
infront
barretts
aaib
hanus
svidler
uncontaminated
paralyzes
pandolfi
mcglone
shifrin
aftercare
okocha
allandale
duchin
tosun
uremic
wavers
hamence
dwc
bial
pomeranians
mehring
jamma
euonymus
arveladze
blaire
gwt
daingerfield
amniocentesis
tetzlaff
lfb
washerwoman
wknr
tapis
kmox
aztlan
aravinda
faget
boel
lhakhang
sambuca
catabolic
paintballs
highwire
barsky
millerand
jiawei
acsa
jula
nte
nethack
somerfield
dessler
repartee
woop
guillon
cpw
rolpa
streetwear
lewicki
ungarn
grav
caractacus
romme
crestone
radiotelephone
pilothouse
floret
oco
naranja
upali
farfetched
audiotape
vittori
thingies
hemostasis
cornetto
hoerner
deschampsia
abuelo
fazilka
reticulation
hertog
jarzombek
vues
dardo
kastrati
abhors
fernande
fridges
colorable
karlson
pledgers
troodos
forstmann
kvn
carwyn
starshine
saretta
maragall
rockumentary
digo
stewardesses
zimin
compartmentalization
ademola
maradi
pianistic
meskhetian
socceroo
reder
communions
dumars
sparser
strummed
helal
macrocosm
stonecrop
goeldi
schönebeck
santillán
coale
apollyon
howat
sherer
rushford
briens
whitland
coia
habituated
londons
stripling
munidopsis
dispite
courseware
dogging
greatbatch
shoeburyness
twyman
proove
hanamaki
qura
shepway
admonishments
wri
rostrevor
rowsell
jeron
montereau
dawat
saronic
cruder
jackalope
rotoscoping
luxo
cholangitis
tblisi
klosterman
wickremesinghe
meinertzhagen
carollo
iof
unbranded
laulu
kleptomania
morphologic
oxyura
muette
reye
mazrui
functionalization
worzel
girdwood
tidier
bhrt
jata
availing
veeder
yota
silberberg
sheller
unswerving
sabat
mosaicism
hano
stell
donachie
munier
vaida
laius
bischofshofen
plattsburg
cosmopolis
brg
hadham
abdirashid
laak
kelurahan
edip
leatherette
kavan
saravanamuttu
rulebooks
uchaf
teatime
nies
courtin
compressions
daubenton
curbishley
obr
atallah
grampound
syston
olufsen
rottingdean
photosensitivity
panisse
rustem
stana
trudie
judgeships
landraces
xinyun
mimeograph
micke
zyuganov
wadleigh
quilters
closter
varin
sinopec
faried
gallos
mosco
albelda
hermida
geigy
protoceratops
malyshev
onix
maplin
gses
monumentally
zittel
proops
yatton
vizquel
habermann
vosloo
dovercourt
baynham
ushkowitz
hazra
sorokina
wertham
jeanes
chaudhri
muad
perowne
woolloomooloo
cortelyou
riserva
ifb
sketchpad
recertification
hakusan
prosciutto
baitullah
zoysa
collates
bernadine
prigent
undernourished
meadowland
palmitic
adamek
atrophied
japantown
rispoli
almansa
levelheaded
despoiled
hanham
cdep
sharam
hoggart
criqui
cipm
gondolier
nicolaides
coonhound
adsorb
padbury
krenz
bilawal
callousness
sarmad
radlett
dreck
maring
cnw
eliphas
massingham
danseur
bywaters
profiteers
crystallizing
missioner
clurman
earthrise
delict
duplicator
thira
safflower
mkp
bahawalnagar
shakoor
tomfoolery
arcis
stankiewicz
sowden
illiquid
bailee
morakot
gordano
buttar
ickenham
relearn
rantau
amichai
onias
waldon
helleborine
mewat
sansepolcro
milion
peshtigo
bromides
litle
clodagh
puppeteering
girma
gotoku
abkco
jakab
opprobrium
verbruggen
dachstein
beeld
selfhood
saci
riopelle
beckles
toggling
superstardom
pilaf
cruzes
jarvik
kellgren
himba
pinard
uniq
pieler
overestimating
chelle
southlands
menses
nemea
woozy
ordsall
jamo
nishimoto
tmk
tobermore
dharavi
quelch
skittle
sanzo
rosaleen
fascinates
loosens
hapsburg
opalescent
fpt
unexciting
vacates
apas
tiaa
optometric
bentwaters
devalues
sputter
jobert
asharq
itzehoe
groner
unassociated
peñón
trumpy
sollers
aesthete
ishioka
mangels
arsal
pestana
lifesavers
campanini
ondas
aldie
microflora
mortmain
qvga
wednesfield
caddis
nipping
businesslike
fallot
syeda
frederique
coor
extraverted
unmil
nzd
acceptances
bokova
lockbourne
staudt
insipidus
bpb
dehaan
mannino
tailhook
aptera
moncef
pointillism
chametz
dgh
ngt
sarcomas
laogai
tredinnick
zeaxanthin
thielemann
dcms
sheerin
malibran
phasers
innerrhoden
depressingly
hurtubise
gatow
bastardo
pullers
blé
archa
tatlin
sudduth
castrati
giocoso
punctate
puryear
jerkin
hertsmere
mebane
creutzfeldt
moorside
chanology
evac
weyman
zaur
portier
esop
microsurgery
talton
vore
ruah
militantly
blighty
peppe
szabados
tiangco
tangaroa
adbusters
quello
honjo
tremelling
paraskeva
tawas
keelback
dorne
baumgardner
woodsmen
doeschate
lgs
dpf
redeemers
collimator
heuser
canino
galles
mammuthus
shajarian
iuds
siber
septicaemia
kishwaukee
nakheel
micrurus
udrih
hornberger
thel
disaccharide
kawhi
kpo
sedes
vanderslice
peccaries
wardlow
clutched
adhoc
champenois
fep
forgone
burntwood
minnesotans
hemochromatosis
arsenite
hypocritically
gabreski
hanin
guenon
reitan
funderburk
womanising
balochis
cabaye
tabarin
kapanen
jablonsky
angstroms
stubbins
iliya
keary
atoning
cammarano
reptilians
paise
decreeing
vlastimir
zunyi
pingle
bovo
dauber
itk
toso
mortgagee
diné
cccn
shallowest
mooch
planed
greensville
ceryx
interconnector
dismantlement
laze
subbiah
heswall
groes
ignominy
krul
berggruen
reptar
brygge
foulger
ganser
danial
achy
impaction
tahil
beckers
bhang
mainieri
blaser
veoh
pmm
parratt
unhelpfully
chandhok
godda
plumeria
klute
hesler
roughs
kostelanetz
hambletonian
srw
elleray
siga
hingle
ays
rantzen
aymer
penthesilea
trekkie
northborough
peiffer
mcgreevy
montrachet
kuwahara
ello
ncba
deverill
eells
basim
matzah
deichmann
soldiered
duminy
yefet
bratwurst
weinrib
bidet
killingly
margrit
iip
tyche
duany
vacillating
hinksey
gnn
suryanarayana
accokeek
unexperienced
nalbari
bottum
angelita
milankovitch
golota
médias
ewc
naila
perceptually
swains
benckiser
baral
strandzha
jargony
termer
subbarao
noppawan
instructables
cholangiocarcinoma
diwata
hypnotics
ptf
veu
citalopram
sachsenring
paleography
ognenovski
bignell
neuroleptic
pettifer
mceuen
quantic
senden
ryno
argun
wicb
knebel
contempo
skyy
chilpancingo
salang
pridham
souvanna
dowty
attentiveness
hiraoka
spearfishing
pdq
unrealised
apds
kefir
cudjoe
coactivator
kopenhagen
kharaj
hsan
naiads
rolette
aparently
cantero
purloined
hechler
vautour
maccarone
shrewdness
woodcutters
ihop
desmet
shuisky
garlock
waiblingen
nenshi
mcisaac
nyos
majeur
elmander
tava
chifeng
custance
brookshier
snelson
aveley
leterme
escamillo
kjersti
lycées
tyrer
lashings
benwell
gardée
wdsu
kozloff
sexyback
mels
yehia
gilmanton
supaul
yarkon
promulgates
purposeless
rnib
iwabuchi
xueliang
cecina
murmuring
brigandage
garraty
rodge
brugger
cyro
amta
saurin
flannagan
beckie
latticed
llandysul
quelque
bakhshi
destructed
jinny
gordin
scribbles
baghi
parodist
buea
yorkie
pench
nishihara
rolleiflex
lurleen
draftsmen
arh
thromboembolism
kerio
neusiedl
sathe
kapell
skyclad
carneros
countermanded
whitham
broaddus
schoor
owo
burkard
padar
ecthr
linhart
flatman
abar
livecd
sohmer
stonewalled
cribbed
axilla
kobuk
uxmal
emmott
mcculley
pbt
gabriels
robotically
brangwyn
voulgaris
strebel
mondino
cutback
vivante
firaxis
unapproachable
ktvt
heinonen
rasping
raychaudhuri
goodbody
yamanouchi
oot
arrester
knibb
lydford
deroy
dowagiac
zelinsky
icograda
torgersen
stefanovic
shinwari
midhat
sfe
dejong
miff
medtner
calavera
myrie
hott
blackfield
besta
helenius
yamana
disbursing
changa
pilato
azulejo
muneer
naschy
ramseur
géraldine
squishy
jiaozuo
dongping
partis
hansie
cedarhurst
unpolluted
negritos
uberti
mlada
datapoint
kersee
deckert
nedda
armchairs
luxembourgeois
replicants
murgia
meacher
doyne
entrepôt
pflugerville
colleyville
joyless
predication
bayramov
preferrably
nvm
epauletted
palmata
ketty
mommie
kollek
interdicted
paree
positano
nunley
oyelowo
transalpine
baguley
bellowing
parabolas
monalisa
chirpy
cambell
tidus
ungaro
soltaniyeh
redoubtable
toubro
hagemann
sumrall
nobuaki
wellow
ezcurra
whateley
spliff
chorrillos
asprin
barash
hamri
ransdell
evrard
dotter
reverently
malha
chittick
chromeo
mte
kristaps
winnetou
marjolein
emasculated
medinah
papá
stinker
arthus
tolomei
lissner
bagneux
hexavalent
arborist
wintu
drabek
ecuadorean
chucked
madrassas
pyong
holofcener
margalla
novaeangliae
prioritising
urry
ottis
mitteleuropa
renteria
pve
ameliorating
yglesias
pattillo
nakia
ruen
aniseed
furner
senne
nnp
gibert
paranhos
siegelman
politicos
sourire
blauer
glided
phosphoenolpyruvate
fortwo
haxhi
seleznyov
barefooted
cercla
danaë
berendt
nurburgring
chye
larrys
abdominis
asphalted
fightstar
ronconi
ohsu
counterparties
learnings
grondona
vandergrift
aerospatiale
benmore
speach
jacklyn
madrone
relevantly
nonnative
amboinensis
grayslake
horatia
metatarsals
supercharging
thrum
rushmoor
boozman
chauvelin
pertinence
warboys
willacy
pickton
taboada
narula
flintham
rapace
titta
bottlebrush
vuong
croupier
hoogenband
soroush
strapline
shirow
ikuko
kuusisto
boonsak
naci
desensitized
boingboing
melanotan
yeller
derivate
internalizing
giganta
putbus
schoolkids
ondra
edenderry
packington
purifies
selkie
cios
mkiii
slytherin
visualizes
shortfin
geoje
caddell
undeciphered
pradera
budan
robertsbridge
mowlem
ismo
pyogenes
konchalovsky
mirian
trivialize
sobhuza
pitons
monna
riesel
toltecs
triaxial
dossi
moonbeams
panathenaic
bonariensis
yagoda
brugha
spiderweb
sorriso
swaythling
furrier
monsell
viglione
critism
pursey
gazzaniga
tomalin
leafcutter
handford
succinic
hoynes
belliard
qsi
cerva
muezzin
emendation
hoonah
xiaolong
righter
wamu
loughery
lafortune
elum
victrola
yack
kaisers
bunratty
mccudden
julissa
schepisi
vics
taruskin
coalisland
gracing
facchini
zhengyi
ennoblement
maroulis
riri
bge
anglogold
huddart
batali
vesterbro
lesabre
cottee
yoshifumi
balsall
siki
peridot
fisc
lages
gardyne
averred
rochat
grímsson
nauseous
sposa
cdv
mainlanders
posillipo
panfilo
cullowhee
sorn
cime
rohatyn
hlb
thunderstruck
clonazepam
niskayuna
tenko
italicizing
mazzei
interdisciplinarity
peyron
petrina
melter
mirepoix
shat
bjs
jrr
hostetter
aabenraa
liveryman
vilifying
khaya
castillian
conforti
bivouacked
sogni
amitai
wariness
blunts
brasileiros
figes
charrington
nukus
mni
goudreau
goldings
persuader
lilt
alvy
grifo
kerogen
aybar
withnail
boral
colomer
ristorante
helminth
verrazzano
rej
imb
nephrite
contraversial
distrusting
chinense
gattis
bonnici
stanislao
agogo
qaddafi
burlingham
gild
muchachos
kubin
sunne
saadeh
axelle
hostelry
meixner
marybeth
alehouse
tabaco
herewith
spetses
maitra
begay
mofa
sonitpur
schiano
lianna
gaydar
zamba
variola
smouha
pirouette
siôn
gwi
abla
woodhill
domesticate
bondsman
brimble
sextuplets
malene
edmondo
gardell
biocompatible
louviers
tzitzit
nomani
tribbett
addi
mokoena
unevenness
gampel
moppet
dusable
dovecot
cockatiel
peepers
golic
purines
rakai
sharking
chmielewski
kinkead
lackner
fluoroquinolones
bairns
cpk
backroads
tianping
casl
iroh
chamdo
cluelessness
pierina
beru
industrials
altec
harl
thé
rondell
fredette
hoarders
gooderham
lancefield
solidworks
margolyes
gasparri
langstone
vedomosti
ecclesiasticus
redistributions
surfliner
deshawn
diaconescu
utusan
undersecretaries
bajío
huggy
autoerotic
sirotkina
burnage
syrupy
crucian
bullae
lanman
onley
johnno
gédéon
lauber
ifeanyi
adhan
gervaise
tcherepnin
woudl
jika
jetter
kemmer
consell
transgenderism
glatzer
curlin
söder
puerco
conigliaro
leora
chelly
epix
swynford
lievens
vermeille
tansu
miska
shcherbakov
bdl
chahta
touchwood
poveri
cuma
marcks
michiels
shakespears
labview
cogburn
leaseholders
posehn
stree
patersons
garett
underemployed
onorato
leeke
fust
selenia
nutbush
olanzapine
mkk
convolutions
halffter
rintoul
caméra
pagès
hendrich
guarany
bosson
caltagirone
jamrud
mahovlich
varon
olhos
parul
pitchman
minkoff
callaloo
mutuel
utsumi
geldenhuys
greenacres
acheive
rosholt
solier
nonferrous
synthesising
ouzel
squeegee
ecomog
presaging
festing
juneteenth
bière
frewer
trifoliata
whirligig
astrud
diosa
staggs
piv
guldberg
goner
deeps
eola
gigaom
timey
finden
jacquetta
stender
capacious
ciders
thermosphere
bienes
mcswain
stickiness
rainworth
attenuates
schwarzman
halvard
eidson
presage
trafigura
federacion
ostertag
batajnica
adjuvants
chromic
howze
farnam
passionfruit
fmcg
guille
abjuration
wiregrass
quat
stegmann
bedfont
qsm
felina
danel
najar
rasi
unawareness
sivaraman
cuan
decries
dyker
historial
absolom
barnby
hwe
kahuta
danilovich
bokhara
viviers
laune
castleblayney
highgrove
frustum
crickhowell
devall
airdropped
vaporizing
höfer
decriminalize
sullinger
victoriously
ladenburg
yashar
rummaging
cootes
pittock
touhy
poręba
ashtead
disassociation
kishtwar
ptas
jasika
transect
bfw
rudie
dupatta
crozer
moralists
permissable
cassiterite
shibh
stanwick
sabbatarian
disconcerted
fresca
halloway
kirsi
shraga
thymic
khatoon
heidari
sublayer
ehrlichman
hypoglossal
graniteville
papert
laager
wroc
rosolino
perivale
icefields
decouple
marcoussis
natrix
ethnos
bookended
moton
stenerud
kemah
penev
frizzle
tunick
japaridze
lourd
sicut
sels
kishon
uncorrupted
narine
electrotherapy
transposes
bonnat
significations
vido
cufflinks
neuropathies
fpso
damasio
mancos
balestra
kureishi
tomassi
ponchatoula
downrange
anorak
breves
wherewithal
monolayers
endothelin
plr
amash
construcciones
mustafi
girdlestone
saffar
prx
olivar
gutt
neils
willies
psychokinetic
teradata
fowlis
kinetically
nykänen
maccormac
espagnol
undof
darfield
disbelieved
senger
robley
dlna
hamit
mtsu
kensett
muffet
dispersant
porpora
moundsville
manxman
wikiscanner
cevert
schnur
sirdar
holzhausen
iberdrola
faites
ungerer
rigoni
werber
rache
dimmers
googe
meatless
krank
cregg
sandstorms
wsd
taconite
elbowing
blackhearts
galimberti
salan
keiller
hematuria
icebox
babka
homeboys
sterk
pequod
mgv
toybox
wonderment
lastovo
akst
móvil
unip
saclay
pollini
guill
wplj
criminalise
schibsted
hont
isakov
gerau
fuehrer
sendo
nyle
guerrino
janicki
hypothesizing
sleepaway
pillet
grimble
bazelon
wubbzy
druidism
barramundi
iwashita
pich
gerolstein
panico
grella
flamin
majerle
vallées
althought
hilger
lizette
ngugi
sensitize
diii
agre
illiterates
bpmn
stubbe
domecq
premeditation
frossard
vasarely
deloris
categorisations
phalange
cornstalk
milde
cyanoacrylate
shands
shibayama
niemand
factitious
mareth
unadjusted
bpel
dhami
fatos
imv
porcher
jeana
hugos
rivage
rastrelli
stayers
yippie
bindweed
kudlow
kheyl
laner
iturbi
morels
agd
tomiyama
shankman
quebecer
económico
amethi
merstham
finicky
bundrick
wicki
landolt
kulongoski
uglich
stowing
kinahan
jagdeep
cullis
castaldi
noahide
lazzara
laboratorium
viveros
mekons
neuromodulation
ginther
impérial
friedgen
filner
undemanding
farndale
sitti
foyers
algeri
voiture
iddo
gantries
flik
tibbits
elastomeric
claytor
rentable
orascom
merli
mutans
killingsworth
flikr
micr
bringers
nomis
truces
consols
giana
gammer
naqi
crisostomo
pkn
kesa
sanjoy
tiegs
bodrov
treed
yuke
northcroft
pdms
crewing
endel
vedova
masorti
semplice
cetti
halk
montoursville
blanke
khattar
ownerships
yippies
mear
richi
limites
glasse
drypoint
vaghela
jewelery
westling
barcelo
asteria
mindell
langhans
optimo
lojze
nutters
noncompetitive
budgen
valeo
pantries
yingkou
tapsell
aaup
johnie
vaginosis
grot
madhab
hongik
wordgirl
ravenwood
morcheeba
skandia
chaga
freemans
ifvs
magome
tarwater
qum
fabregas
villacoublay
fluoridated
buchen
wobbles
motorhomes
incoherently
sternwheel
stows
whittingstall
asahara
lustration
metanoia
fichera
chikako
edgley
emrah
isozaki
korol
agco
stokesley
lazzeri
kado
diabate
nanomedicine
ivete
civilizational
wracking
zodiacs
hede
lasley
livens
grene
parfrey
tropfest
judiciaries
handbell
glossopharyngeal
unzipped
chaat
heliosphere
fulltime
harpreet
deflating
conceits
ciri
henlow
delfi
horsetails
blacktail
blackhead
attala
applique
vva
kindermann
neah
ritualism
rhagoletis
bolena
fenger
johanns
filmore
paynesville
diferent
mccaleb
challinor
fibular
acount
inseparably
naslund
schultheiss
dvl
puits
halcombe
jetset
avus
cjb
beatniks
bowron
baars
mmog
penology
fortiori
steelpan
ungovernable
hepple
conchas
yalla
daydreamer
arhats
tusa
generalising
reinsdorf
roundups
sandile
varroa
zocalo
pitying
budhi
ducos
knanaya
roscoff
mccollough
wsvn
accu
incat
dipika
soward
bellmore
techsters
johal
eig
ainscough
chunghwa
carolee
arterton
severodvinsk
orishas
sicker
henton
bharatha
taveta
holmquist
tost
nezami
conjunctival
talu
southesk
shuter
artibonite
docents
rumbles
malins
bmws
karume
bedbugs
blowfly
haik
melisma
heri
markwell
momi
zwirner
blackbox
marnay
nitz
assouline
fcps
multipart
desroches
achmad
intranets
dunois
weinreb
gehrke
damson
annaud
jered
mongkol
waxwork
hocker
lizardo
polisher
lisnard
chitosan
probables
bredin
cabalism
polyrhythms
hbl
adulterers
cinequest
causse
cederberg
paea
mias
hekla
sprayers
ichijo
cbj
rodos
academi
decisiveness
frede
khater
ashu
mohnish
tempeh
kinescopes
rowlatt
proteomic
fixations
biocompatibility
coring
tía
bullhorn
oakleaf
netsuke
marijke
cosatu
goldston
goalkickers
islamonline
guerriero
regraded
takahisa
lucene
byelaws
tiant
kcra
dalembert
rodolph
hiren
nahanni
edelson
turberville
cleora
enticement
apocrine
visco
jumpsuits
gigahertz
bloodshy
paleta
chines
saraj
norling
abai
wilken
durkan
vigen
fashanu
emelie
livadia
bixi
miscarries
lenart
desprez
underland
kirstin
denisova
venlafaxine
moco
srbs
pricking
katsuo
greca
mcgowen
dirigibles
northlake
attwater
gourdon
krabbé
sanlam
vermiculite
settee
baugé
toggles
mosa
carmon
showboats
werde
sandworms
pacifistic
newfoundlanders
oberholtzer
tiebreaking
flr
umbridge
bandelier
niassa
feiler
ytl
shevlin
sestriere
splotches
buffets
upwey
uim
integumentary
souks
bollaert
femke
supercede
noten
rahall
graton
mafa
yaser
tottington
wiart
fajar
ottar
zenyatta
nondescripts
kowtow
seelye
fehmi
tribemates
kjeldsen
macrocyclic
myelination
wrns
ramlila
beilein
emotionalism
lithonia
hubbub
hedden
maysan
mukasey
olerud
streetview
orthographically
junked
animales
persimmons
harrisons
flec
sinfonía
doraville
revitalising
klingberg
gabra
spac
redzone
augury
turbocharging
squarepusher
brawled
anttila
berkner
ridgeview
nastia
destouches
fanni
mng
dater
tathiana
casos
sours
imprisonments
rohlfs
akutan
kondratiev
overachiever
aboutrika
inundate
intemperance
hanbok
superpipe
arsons
weekley
neurocognitive
rowallan
officiant
koru
ladywell
boschi
butare
voorhies
cychwyn
modise
celie
juvénal
tolomeo
valvano
accreditors
harleston
uhc
redbox
collude
doerksen
sahlin
mcgloin
biomorphic
paulk
lazica
torrado
didio
meia
vasistha
chantel
miyawaki
mccorquodale
fuit
outliving
gangways
klaassen
xda
hobgoblins
shootin
sherratt
reconstituting
trotskyite
btf
severna
jagiello
hesa
phreaking
bushcraft
sarim
kungfu
kalbarri
fdu
salant
todes
aysha
sephora
vautier
diatta
soliloquies
anes
puchkova
duffer
lynagh
petrick
saccades
velupillai
esselen
dodik
fleetingly
ausaid
biasi
concarneau
hiten
desmarest
sinmun
sentimentalism
helming
medavoy
dissidia
lette
relit
macrolide
hobsons
melbury
breytenbach
warryn
lavanya
anthracis
zadora
bahonar
makurdi
goodloe
sarcelles
ylang
edmar
koenen
qbe
shaham
conlee
barometers
politti
trashes
godo
grechko
pbworks
chandon
antz
branwen
renna
visnu
birns
mogis
socar
garnishes
roj
montpetit
ehrich
veronesi
clubrooms
duruflé
boganda
heyn
aboutus
georgio
gallian
fixit
disavowing
schoonhoven
zeek
loganville
subcontracting
noach
debray
nizhniy
drystone
shangqiu
westerfeld
dgse
talaud
healthsouth
minuto
nephila
chumbley
linne
belka
whitely
bentz
sutured
ikuo
wohlfarth
shaoyang
allograft
bashiri
hypercalcemia
parolees
carcharias
eastmond
certifiers
lacerated
tiz
dovedale
falabella
genotoxic
belfiore
mocky
mechel
weatherboarded
recieves
fixin
luddites
waht
margaritas
proeski
houdon
sasan
citgo
vilmorin
nathanial
microvascular
kimbo
thongchai
yateley
ineffectively
karkh
bsw
schone
toggled
bankole
synephrine
kelland
vetere
perloff
mitsotakis
doud
ambigious
mikvah
chiki
veysel
shearson
superheating
casuistry
nacc
necking
territorials
nonuniform
lavo
loiseau
farabundo
folkes
leykis
arenac
bzw
danegeld
drazen
misquotes
craniosynostosis
holmby
goos
aqr
jóhanna
toprak
succesfully
husa
segi
oenone
derides
bumpass
cwu
rosenbergs
sankrail
duignan
hamoud
wbez
ebf
gummadi
sitiveni
ostentation
jacor
chilworth
katee
isotretinoin
techo
shelli
ardens
burgage
papillons
vgf
ustrzyki
padamsee
peyman
cowpea
munenori
argyria
ludgershall
wedell
matsuki
quixtar
lekki
lochmaben
xfs
hightown
kujo
unsweetened
cabinetmakers
yusufzai
boeheim
annulling
amec
jitka
execrable
alvorada
wittenoom
adecco
deadlands
chivu
marasigan
nuestros
broglio
mayme
swiveling
ultralights
walkup
clotaire
contrail
vastus
crinkle
villafuerte
preservers
rehana
flagstones
bellfield
undset
intial
wicksell
briars
locklin
michot
boker
chhu
morbi
borgata
almy
leonowens
guadagnini
polos
colorization
poquoson
radan
christiano
chicoine
beki
kouichi
hya
kitzinger
deadspin
codnor
exorcising
assimilationist
gallone
unseaworthy
spinsters
dulas
bpt
arrol
binfield
seeburg
eliahu
bunky
paatelainen
pheu
inapropriate
sisterly
muldrow
baggies
gombrowicz
sidesteps
sixsmith
bryer
paranal
dunkle
bouazizi
ktvk
ditmars
maclise
perlo
rippy
dulled
sdv
shyster
epiglottis
wurttemberg
loyalism
resourcing
duikers
titelman
teuber
ostad
baramati
girlicious
guofeng
gimmie
firouz
opilio
kolesnik
aguardiente
kcop
perreau
reinking
theorising
ayas
korth
mbia
publicans
elizabeths
fusible
fdlr
bahour
zanjani
reprimanding
wainfleet
enshrinement
kofman
masso
homeopath
bascomb
fastballs
goroka
goannas
diehards
bloxam
crooners
speakerphone
guiney
tatsuno
sonnabend
earlobe
ordinariates
wbap
lightens
comunity
scholefield
schmelzer
constancia
endodontics
lattin
proact
holling
jarek
ruses
conze
levassor
dako
machos
voraciously
unfruitful
braindead
wcu
penglai
uneasily
pcrm
atomistic
crepes
portugues
almgren
truby
belter
annihilates
peonage
seccombe
tŷ
laist
wardroom
arguta
mnac
rolandas
laigh
chittenango
selfoss
farquaad
groupama
eluting
mefistofele
nifong
whitesboro
mamer
limbourg
kanchanpur
finan
quwain
roeland
naches
kivas
geus
tanzi
shamba
bhuvan
ballhaus
betton
draven
ardeer
kempt
sharifi
hegan
ontv
prominant
siegman
hattusa
verticality
borey
pieman
piñon
burgee
satirises
stretchy
masher
doubtfully
garat
bbt
evr
ketton
rerecording
terabithia
kaycee
outranks
tennison
mocho
moskal
chickweed
gunjan
bénédicte
catala
metacognition
lnt
carrickmacross
nixdorf
buffel
czajkowski
iatse
edg
priddis
unidas
redouble
vaulx
slammin
pretences
touchscreens
pervious
evernote
timecop
blankety
pudge
cabalistic
yamini
stewartstown
coaxes
armonia
sellouts
yongfu
trembled
bridgepoint
lactones
handal
cfx
meka
jansky
mickleover
gutai
commerson
nitromethane
smarmy
bookham
swigert
ixtapa
asobi
cono
heyns
toome
trumping
liborio
canan
finfish
maccabee
sanur
eilenberg
urt
eglantine
gami
solders
alsea
silverthorne
soubirous
burgmann
malvolio
batsu
mataura
rediscovers
cyberia
gewürztraminer
farka
outpace
wenche
oleta
bocaue
shalford
recoded
profeta
mocambo
deedes
monklands
oir
parvanov
shaya
krop
afflicts
corcuera
repolarization
jamón
nagpal
matfield
lillies
dicarlo
rodinia
cogently
balsillie
gallitzin
englishes
rahmi
castillejo
checkmates
ayatullah
majalis
termoli
mgd
alleman
noseband
dunluce
laia
adelaida
dahrendorf
reworks
shant
bobbed
gallinari
hammerbeam
taldykorgan
ultimatums
caple
maryon
bungoma
killybegs
unimpaired
blurriness
heijden
waaay
stolper
michalka
chongjin
bushmills
gamely
knaus
manian
sapientia
kinsolving
transgene
steffensen
shaara
mannesmann
microgram
delas
karakalpak
sofya
ciclosporin
methvin
infectivity
irbm
draves
avari
henlein
javafx
dauphins
bolloré
badon
gravediggers
leichter
shunyi
puntos
boudou
sawicki
doke
plenipotentiaries
appétit
untroubled
donan
turpitude
yellowman
deano
jansher
kinkladze
jacksboro
haryanto
corollaries
karz
rhead
roguish
liran
seawright
morawski
outlands
barcoding
caudillos
babergh
ijsselmeer
marjane
onenote
confabulation
pople
kidapawan
scrying
sabry
wsg
ebonics
sexiness
delton
upb
duckweed
chessie
blitar
tuxedos
taveuni
replicative
kununurra
spectrums
oregonians
lfr
guymon
presqu
steny
tainting
jaarsveld
scea
raviv
brigida
semiahmoo
deconstructive
liquefy
clouse
bolinao
butterfat
jovanotti
travelocity
hackford
ayuso
meghnad
marinaro
inox
odst
weidler
pelini
aeropuertos
kettleborough
brownfields
birra
arantes
kdfw
gennari
dollie
lotter
lynge
nautique
luhan
lrn
szechuan
struthof
heg
ebell
oxfords
jahi
schtick
doradus
murasame
fihri
billig
passionless
eliminators
dolton
janik
zilina
snm
hoardings
okinawans
dovetailed
wennington
elz
kiger
inferiors
chromatically
rheinberger
heyde
nedelcheva
reapplying
gleick
tarnow
fedir
cedartown
moschino
xiaodong
drawbridges
trabecular
peeblesshire
sloe
lochinvar
tejanos
lyonne
chiodo
boit
railfreight
shapell
trailblazing
koral
ajr
picchi
bagerhat
hyp
emdr
ferrario
minga
remunerative
saturno
effaced
adzes
preternatural
coeruleus
nabulsi
loveliness
soltis
wttg
burgdorferi
snips
pangkor
ilena
desportes
gardella
roamer
foale
provera
tempests
wormley
sconce
rimer
ryegrass
carto
schapelle
amte
abiko
fgr
dumm
mudflap
lethally
fordingbridge
kantei
agentur
developement
mandie
geoscientists
piccirilli
threlkeld
tuu
prados
valtonen
bruff
fairlady
sobo
breckin
greenhithe
gearóid
symetra
olivers
assal
luuk
salaria
alcocer
dausa
katri
yindi
hockaday
ochieng
punchlines
ashtiani
lann
winogradsky
saddar
bunder
footholds
balderston
skai
gowariker
shetlands
vermeule
zanotti
mistle
logbooks
ulladulla
equalizers
bently
afpfl
nagayama
audiophiles
toan
chalybeate
plenitude
raiganj
brabin
titanosaur
buffed
haranguing
snellville
mangu
taipower
graininess
stillingfleet
khoei
klages
weymann
hsf
nones
bioidentical
nrcc
edendale
dorrego
lauritsen
indicting
kymi
inès
babia
kaida
myelitis
hlm
funktion
samueli
miscreant
capricho
pasek
fabares
draconic
väyrynen
acutally
pachauri
restocking
atelopus
darel
nghi
adenomatous
defeatist
ndong
shirted
mosuo
syreeta
jewess
dornblaser
bouba
inspectah
muggles
eyespot
cuticular
deodar
duprez
shehata
shahed
ypsilon
aseem
deayton
heitz
goncharova
orko
carma
scitech
oel
myres
mintoff
llibre
ryuzo
uberoi
photoshoots
lable
messala
actinic
baronne
mullock
hinsley
dolphinarium
polarising
embolus
biehl
laie
sativum
sendmail
bergere
reselected
harbourmaster
teensy
mattu
bartolozzi
vivos
radovic
fevre
sfeir
trossachs
sachet
anklets
avic
feh
caseworker
asadullah
arguelles
corriente
penrice
quain
extempore
hamdy
undefinable
summerton
churchills
oursler
rusudan
kere
shumpert
chiavari
relator
kimya
bino
nocturnals
rse
quitter
czarina
rotch
garbett
umaña
shlaim
adobo
eddisbury
aini
muttered
kronenberg
skehan
kosminsky
bibo
correio
zahran
cutlers
nabila
lyf
beckingham
leuschner
raees
rawsthorne
subir
desecrate
corriveau
meteoroids
jaideep
sweltering
hijau
nightlight
bema
craigs
anubhav
naunton
asbestosis
matsusaka
pilson
schendel
faten
earlswood
aldona
gujjars
tygart
proteoglycans
rebuttable
azizah
laven
pirbright
nru
jayce
souverain
futenma
pame
rathke
aoshima
interweaves
temperment
penley
pontin
piché
indoctrinate
tomblin
protractor
codebreaking
pettman
hobyo
dule
killjoys
daca
epeli
kenda
bransford
translucency
exhibitionist
lasswell
cooperman
preethi
polmont
camac
dimidiatus
mizen
forktail
riccione
chac
zandra
thomaskirche
pingyang
konarski
angelicus
rominger
bedsit
pharmacotherapy
liaocheng
verco
aidoo
przemysl
maiwand
longhand
sheene
leybourne
amrum
callery
waybill
bradleys
manono
hypoplastic
aykut
arencibia
florham
moneim
periodontology
adriane
penola
gnomic
outlasting
speedier
leijonhufvud
hsls
cremate
gushed
kolin
wensum
bittaker
solukhumbu
babek
piggly
mantello
vha
campero
weisskopf
givi
whistlers
recoils
overestimation
rheged
appearences
paps
getup
strader
misse
appstore
unimak
hewison
wiltord
hypokalemia
punctuating
ekins
hocks
striatal
clunie
beaky
chowdhry
noia
nextera
samani
mealing
awana
saldivar
zephyrus
dickstein
ewg
coppel
sessue
sirois
eckel
duerr
goforth
itg
yupeng
barnhouse
crazily
facetime
hinchey
orac
pasdar
ramprakash
laurindo
garrels
zatara
obscurantism
bisham
laudate
demolishes
allsup
samaraweera
toux
carpeaux
mikolaj
namechecks
neti
endecott
ccv
hurtin
serafinowicz
towners
bickerstaffe
rantala
divis
theorbo
lowball
peroneal
cerdan
kerzhakov
monas
kiritimati
foti
rostenkowski
sandblasting
castrillo
wireshark
fawzia
deterrents
bended
jamesville
monemvasia
pogge
nadan
flashover
inal
misperceptions
amsterdamse
theophylline
preparer
philanderer
salzberg
eska
thromboxane
halman
klayman
brade
conversano
pito
filan
purcellville
deis
boxee
whe
itworld
coggan
coche
federale
guralnick
crawshaw
petropoulos
certs
dld
abductee
askwith
tokuda
tenaglia
samp
bongani
binah
muley
outweighing
ermenegildo
mambas
ndaa
boulos
roditi
synaesthesia
yeston
volkert
abadia
communicant
lunatus
hoodoos
hartle
nuru
ecorse
scaphoid
scdc
repossess
medinet
zookeys
greiff
oligomeric
uttley
nizwa
fairbridge
ddf
komu
armel
natation
lanzmann
metlakatla
laforest
incomprehension
depuy
amamiya
fatten
mazzocchi
intranasal
antwi
volleyballers
gadgetry
evenhanded
enp
taimanov
maeght
darwinii
transformable
tacy
kolber
nowlin
champaigne
romanée
olafsson
waksman
birthe
westernised
trapt
whereever
meike
serica
antartica
engelen
giard
penck
anticapitalist
disinhibition
herbology
winterfest
lovells
valenza
wmur
concessional
armorer
quong
yasuaki
sft
kiana
tragical
cashiered
lescott
guben
marthinus
havasupai
kmsp
agonising
steane
gwas
paschke
ratri
baah
koskela
politzer
trudell
wendie
mirek
waiouru
polyrhythmic
moffit
meshach
spellcheck
shafto
farnaby
uthai
tappers
unalterable
komala
mris
borers
inhalant
drivable
nishad
lampton
aune
fior
polyandrous
hypnotherapist
pilat
norbeck
bte
agonies
ryazanov
gittin
capelin
notaro
makina
fastenings
kalra
alfresco
sichel
makro
lese
sacca
desiderata
djan
klump
kulla
simancas
diamanda
maleficarum
restock
haberdashery
malakhov
juggernauts
chads
seyss
lycopene
morsel
schuld
stratotankers
makita
zobrist
hengshui
atlantida
stuhr
carnavalet
vatos
fertitta
shujaat
ageist
oastler
jom
aloisio
jayna
cariappa
demountable
eyeliner
mwi
hudaydah
leawood
modul
natcher
bobigny
adia
starmer
dagenais
cound
henequen
interferons
uwp
podujevo
cargos
frewen
mehreen
sellier
tuberculin
moerdijk
pneumonic
inos
lyoto
hcd
peoplesoft
seun
unomig
largish
sleng
clyst
madaris
sodexo
underreported
lovestruck
matchings
repays
wudu
eurocentrism
germont
desautels
beland
torito
dekel
willcock
welsby
uncompromisingly
trichloroethylene
lakha
pohamba
nich
indec
confucianist
chipstead
terrana
tanqueray
sanhe
telematic
portents
seveso
contostavlos
politican
hitchen
tiberiu
teneriffe
ecri
sews
chesser
boisvert
sophiatown
okonkwo
volcanically
harfleur
bodman
orthotics
melhus
vaswani
samarai
pipp
kawakawa
recessional
lowie
rosetown
elvan
kufr
brushfire
mehler
floridsdorf
bloodrayne
punx
nihang
tutankhamen
bayardo
normandin
puebloans
colen
waage
cursors
tugendhat
lammy
karagounis
luchetti
cottingley
shunichi
parce
roxanna
panaro
pengcheng
convoyed
badarpur
zafira
eha
zaventem
verdana
reser
pagés
zhangjiajie
expressjet
gerontological
govindarajan
rtmp
zipp
alroy
doyon
moonstruck
tsou
newlove
kitch
strongsville
capitani
berghe
luteal
gliomas
zenimax
propoggia
cinar
financière
trivers
presentment
maldivians
motril
metasearch
workshopped
chiru
norouzi
ajla
torme
pennsboro
wellknown
atapuerca
guage
sukarnoputri
iannone
bober
autotrader
schreker
gaydon
thoburn
makaay
unarguably
vicenzo
hammondsport
mirs
evia
automat
pontins
cloying
lensed
imaizumi
momentos
privada
alwi
wearisome
valdo
fierceness
tatupu
tahira
crossways
ashcombe
dantzler
probaly
cineworld
netra
jhangvi
ilker
heister
okkervil
salbutamol
comprimise
shahdol
minott
almunia
childlessness
temco
lawrenson
nomenklatura
reinvigorating
misremembered
assonance
farington
keenness
entreat
northstead
logi
lumby
hyneman
morganti
worplesdon
afcs
monopolised
atau
reinvest
pitiless
dubuisson
onlive
diederich
golenbock
porcia
convy
montross
madra
tweedledee
madson
goleman
zaun
parminder
govett
wanderley
oppressions
italdesign
madjer
stuntwoman
fulks
abendroth
painlessly
despising
tobie
dalgarno
grupos
painswick
harkens
thiess
paled
yorgos
beeper
zarzuelas
perfil
teppo
modernista
lishui
ganatra
cohabit
hotelling
binzhou
takato
chetak
vicks
litem
bleyer
complainer
trichoderma
ritsema
dieren
faln
malpractices
wearmouth
flamingoes
telepictures
dressen
ppaca
herschelle
leitmotifs
stoltzfus
schieffelin
noticia
mbah
schizophrenics
guiraud
dimmitt
zippel
puga
fuxi
ejaculatory
alagna
chafin
salvadorian
parkey
kable
moed
leam
clouet
bahuguna
experimentalist
comores
hypertrichosis
beardslee
harriss
milks
lubezki
ravings
ketoconazole
makemake
jenne
serguei
cnx
farahani
donzel
flamand
meurer
gades
kaige
nishina
bizzle
tendler
fusa
mckerrow
nakamori
burmans
manvel
dorantes
trippi
jenkyns
connellan
aspel
didon
modarres
shucks
foisted
schelotto
whicher
calibrations
nyein
bretz
braaten
namiki
shpilband
giao
sindy
thirlwall
someting
paraplegics
scuppered
paramor
skoog
koibito
tamponade
schaik
allahyar
slipways
hareide
bridleways
beckert
clandeboye
upavon
reallocate
gàidhlig
kratzer
immensity
tweeters
reintroduces
saghir
vmas
mcclennan
fourfourtwo
terrie
nankin
workmanlike
kentwell
gamel
clynes
suja
darrah
rottenberg
haidari
lattuada
rangoli
mahra
neurophysiologist
autore
piloto
veerappa
baró
strolled
szubin
serletic
fow
barnack
titcomb
lihua
melanocyte
nlg
claeys
clausnitzer
bestor
cleverer
coloane
moghaddam
callejas
ibirapuera
mildren
bylined
nickey
rejigged
reacquainted
aughton
oppel
tendancy
borjomi
dziga
gladman
darci
burla
bashung
concussed
geth
gitai
battlecry
anari
sodje
atlassian
sja
warroad
unde
vakulenko
triplette
untelevised
christs
battledress
tahiri
baban
annakin
imagers
hacktivist
ideapad
kruis
gowerton
brantingham
randow
aimo
harpeth
slutty
sublimate
florilegium
osyth
crowdy
latynina
gandon
atrazine
kibler
tatian
paranthropus
optique
soundclash
jrb
hideko
pichardo
accuracies
kempff
furosemide
simcox
magidson
komma
boroughmuir
bouvard
lindsborg
waterlogging
assuaged
rookeries
stigmatizing
kleinberg
eccleshill
medha
wismer
lader
bagnold
devadas
photodynamic
fathy
mariazell
narrowcast
mallrats
vannier
leniently
minutae
speakeasies
upperclassman
coontz
grooveshark
ishizaki
uggs
solaar
elantra
grabner
danz
jyotsna
shrivenham
kolwezi
peppa
nicotero
cerrato
nazri
mrta
rezko
raisman
bobi
halmos
putamen
cimber
lepidopterist
ewr
unreinforced
beidou
outstrip
trailfinders
sinews
treponema
normalise
kanne
garima
fbf
tellin
baima
bolly
blassingame
zoa
holcim
matthewson
helgenberger
bluing
ldcs
catchiest
frontotemporal
bunkie
cintron
pastrami
rarick
copyrighting
obsessional
chilmark
raif
siders
craggs
outran
wakatipu
ikey
gey
configures
stubai
fingerprinted
beamon
laslo
guingona
kutti
doggone
takfir
uriburu
keet
gameboy
hepzibah
effete
hyaluronic
versteeg
dearman
usareur
schelle
verbeke
mckinnell
guerrini
delis
unfaithfulness
slingshots
tawton
rabha
risala
namer
franzese
fresenius
kupferberg
erbakan
bongiovanni
carousing
palynology
pharmacia
xanana
kyne
talo
watsonians
warte
serendipitously
coller
cespedes
gilmar
hungnam
globalist
multistep
rubis
kingsolver
deepsea
byner
samaya
laima
buisness
orthographical
signallers
diagon
dunrobin
kenzi
ocker
wallenius
paichadze
teachout
mckillip
demagogues
bodi
boxster
ilea
roha
pentridge
farmworker
peacefulness
belon
sadanand
greenburgh
tekoa
disablement
wayuu
verdone
marocco
tarjei
erwan
noemí
laks
pince
verdonk
usumacinta
yoan
goz
kafa
hephzibah
sigar
brioche
ojala
ophiocordyceps
italico
snowfields
mikkelson
regge
falardeau
obliterates
jowkar
coolock
longwinded
ransoms
overwrites
propanol
spaz
competencia
transhumanists
lattanzi
brecksville
cleanstart
colunga
chasez
yanko
reshoots
veys
zaps
yeronga
trefor
ellum
huffy
pagels
forensically
moynahan
akimbo
capitales
dmax
cowering
taints
seretse
macbain
ceferino
arpeggiated
kroto
ringsend
pokhran
siver
manzanilla
overdriven
downstage
rubinson
texeira
madron
gohain
knope
casanovas
abz
kahala
genscher
cabazon
institutionalizing
positronic
shewan
tubac
heid
levertov
riv
chandrasekaran
jakobs
mesrine
sabia
pebbly
karama
jole
mispelled
playaz
newsflash
kolleg
barilla
cloonan
cuello
undisputable
phosphoinositide
haacke
iliff
beggin
nunhead
bienvenu
lecky
milliliters
bachelard
amaretto
longform
syal
løkke
internode
modano
niceville
imperiali
lutton
levchenko
naqib
greenan
dharm
brattain
savall
subbing
autoweek
impositions
rattana
gudauta
manneh
ketola
bubb
bovids
kamani
eichinger
allers
crocuta
christophorus
gripsholm
clubb
jostling
matiz
mcinally
staithes
zollner
boscov
irretrievable
katif
palmiro
shugart
souad
iob
hongqi
iwase
chone
allana
windblown
ahlam
knopp
cilea
keizo
bujanovac
donora
bauch
sabol
saner
nawrocki
recapping
schlichter
lizz
oisin
flims
incinerating
flickinger
tamarindo
relabeled
ampk
melanistic
platino
nansha
nitroglycerine
açaí
ellerton
otake
pasini
nihad
adamov
hubel
altough
inayatullah
lycidas
drepung
wrongdoer
arthroplasty
villedieu
kovacevic
monje
rondine
aymeric
nordine
herrman
orelsan
oguchi
pushin
leval
hydrometeorological
esade
sibert
otávio
pigeonholed
haagen
spouted
clines
levuka
lightweights
chiusi
wayde
tetum
enoki
cottenham
caver
khur
lithospheric
akande
primeros
babbington
entorhinal
paparazzo
nanas
cassina
dozy
bolli
gramme
oroonoko
fluidic
khadra
etiam
deheart
collignon
campesina
sacy
rubberband
dno
nebraskan
murderess
deak
cambay
lile
sheetmetal
mundaka
unitel
chunyu
mehri
ceol
bilis
wertmüller
zoli
navenby
dongjin
extensa
ceasefires
rzewski
negligable
lasek
spattered
caballito
myhill
woggle
larghetto
sebastiaan
bascombe
shotter
ekstra
gillham
beaujoire
bronchus
haron
gyngell
artas
pennsville
gayane
nordfjord
skybridge
yusei
fifes
grenadine
robic
consequentialist
bishr
grafica
roffey
bruyères
bucksport
atre
rehashes
oddness
emigdio
poteat
yutang
vasaloppet
baharestan
curbelo
carraway
jolts
köllerer
vratislav
lycan
tarte
panchito
nopal
stegemann
ccbc
tonkinese
tantrik
mavin
mcconnachie
bafut
onaga
oblomov
ignorantly
kuwaitis
sibs
asanga
lown
worryingly
promusicae
ranganath
corpore
jauhari
printouts
corra
comparatives
kuria
nali
reassumed
jazzman
brüggen
nealy
edonkey
biotechnologies
davalos
alani
kaushalya
mockler
kakuta
sakar
chiappe
classism
ericksen
impracticality
jetman
glr
kitti
ujung
warmington
supplications
rushent
newsgathering
kereta
luiss
bioavailable
rmd
jailbird
drach
ftf
wadud
shinyanga
pucker
fargas
drams
newsquest
pepitone
ahlert
hornady
tatlock
lincolnville
mindel
falsifications
dulal
interop
berthon
ladybirds
hajjar
sandstrom
superbikes
mousquetaires
kinman
palmi
despertar
kinchen
ptg
isitt
dronning
fakery
straton
downturned
surcharged
commemoratives
linington
sensex
neoteny
asq
snidely
carluke
summertown
galkayo
giorgetto
liquide
kuntar
achingly
obamas
estenssoro
wrenched
sawley
roesch
negin
dignan
reproved
regier
pachi
azathioprine
problemas
suturing
synuclein
urkel
winne
volkszeitung
grumps
mardones
lippa
beliefnet
gruenberg
meador
inm
commiting
mufflers
epitomize
ischaemic
ranft
joannie
chenoa
kameny
fangshan
romanow
lillington
bricktown
visualising
spreaders
castner
gimmes
giono
farhang
gabay
tanzim
questor
demodulation
fransson
edfu
mytilus
bouch
countach
jeshurun
nocioni
mells
twelvetrees
rhinehart
biram
ankang
abbeydale
dalan
ndn
whin
ibas
tranent
sarlat
islamiya
caimans
miras
andreou
realest
shahriyar
cleto
ohioan
bulo
elga
wisborg
floy
mallaby
rongji
barchetta
herseth
aberthaw
devildriver
saval
satrapies
geodynamics
stegman
rearming
bethke
titillation
faubert
sorgen
destructing
bagshawe
edgecumbe
mckoy
coulombe
logarithmically
shrader
dinge
conundrums
aranha
murmured
aslo
chooser
sprawls
misrepresentative
nafplio
recyclables
piatek
partage
sanctifying
errs
tangen
oxcart
faregates
scottrade
wurman
leadup
sirico
finless
fastness
dittmer
nesters
mpvs
colella
viljanen
mountable
kadisha
copped
goethite
chiens
bavay
youthfulness
gimblett
teer
aberfan
coastwise
greenwalt
messerschmidt
massu
energomash
hitcher
yadollah
breslauer
kodjo
dibb
illogic
erewhon
ollier
sidamo
majorelle
egi
rennick
aronowitz
diokno
mopped
zakuani
icsa
bertsch
blazey
huapi
ajmi
guss
vmo
muzzled
kucera
unsubtle
theisen
seikan
pharyngitis
balashov
akroyd
biderman
besler
savitch
lujack
unreasonableness
lutein
exactitude
highwater
anns
sjp
khryapa
baquedano
santeria
elpida
straffan
rheidol
croshaw
settipani
sdd
cambiasso
wheelbases
naj
pressey
rubbra
embraceable
gellért
blatently
merchandisers
seussical
ceausescu
daido
samah
aylin
yinka
coucher
merseybeat
bloomsday
cordyline
odierno
deselected
shawbury
gizmondo
woodthorpe
laurynas
ferid
karpenko
henryi
ballena
abdiweli
bairam
albendazole
sorong
shahani
coprolites
eastin
cheops
breggin
coracle
garbajosa
capitola
frane
chedworth
ganze
dubber
sadaharu
threadneedle
osen
marignane
townsley
jagannathan
freakonomics
hosta
beatitude
parthenogenetic
shinhan
ceja
phrasebook
wiranto
irobot
shrovetide
rotich
pedy
convulsion
concocting
gerizim
auris
frumkin
terezin
dither
lfn
subsidizes
karrer
medemblik
ludd
castafiore
sajna
nickles
novoye
beitou
levonorgestrel
eddin
freighting
inne
annelies
ohlson
embalmer
khad
chisolm
mashriq
fazeley
soglo
nabih
rancorous
kehinde
eatontown
swaddling
rodenbach
snedeker
drey
birdied
misri
sxc
srdjan
ferney
gumede
waitomo
clearlake
ress
aristocats
catie
martland
trainable
affronted
gintoki
menounos
routs
powerset
billson
tonie
gosfield
interstitials
moniuszko
bilski
ellett
substantiality
silwan
teraflops
alginate
ruvuma
birner
ivanpah
febres
blaschke
doodlebug
yakuts
musc
neversoft
reconnoitering
crispian
alphonsa
antiprotons
samye
yoshitomo
fluorouracil
oaten
‚
meadowcroft
kyprianou
namtha
jyske
loudmouth
flamini
ibne
barsha
morinda
lyndale
downshift
mikati
ecolo
kurányi
bjc
geismar
braeburn
pedretti
tuum
sowmya
mckeehan
boortz
fanmail
garcin
gothel
karmiel
bradgate
manzur
sycophant
sparknotes
waard
casilla
métier
pappano
kovalam
carini
topolino
pettibon
maginn
fike
arye
spera
uniti
bernardes
nivel
annville
broodmares
valuev
verghese
accompanists
eddowes
seydi
azizul
ladon
siaka
emsley
tompson
waggle
parried
officine
buba
measham
zentner
hatami
bonefish
scarboro
piccolos
ierapetra
jego
primigenius
vitorino
tylney
giuly
berson
twingo
tyo
firmicutes
oln
mansergh
ikonen
skeete
bleakness
anesthetist
monstrously
offsprings
temkin
données
buffing
scba
kumul
shahjalal
torrez
breathers
anticorruption
ebensburg
orbitz
caines
esten
manuelle
hesitations
mortimore
graun
dary
tomasso
naproxen
cwfc
pinpoints
acuerdo
impalas
polvo
mehlman
hisa
lesmahagow
rapanui
honeymooned
banky
softest
haiyang
epigastric
nzz
episcopalianism
schildkraut
quemado
bekenstein
ranna
baltz
heartening
barrot
parrington
kyoji
kesel
meux
mandeep
iren
fraunces
ditchley
geren
shortlists
goopy
cambia
edvald
edmonstone
namibians
vollmann
tornatore
brik
branstad
kunle
houlgate
bernay
morato
offen
rockcastle
rivolta
blunstone
mbf
hayseed
sugarbush
hards
knollwood
campello
pontifice
cgl
severly
gillberg
famke
forestation
kruglov
egawa
supercharge
rassam
prolix
gheluvelt
pequannock
beko
elfrida
augurs
mutatis
bigamous
crosfield
halberg
framer
watusi
fod
acq
evangelic
bunkering
balilla
uttlesford
csj
marinello
ballymote
rockaways
mattaponi
heeling
tck
sandham
endocannabinoid
loonie
olshansky
tanked
hydrolysed
baldwins
richelle
alimentarius
assyriology
clingan
handicapper
amna
fiyero
bernardsville
viloria
bolten
thunbergii
ecko
squiggly
digitech
bika
rebecka
ragstone
transborder
millepied
tbg
arkenstone
navneet
ikaria
denaro
shebaa
decs
pizzonia
hertwig
mulroy
mccallie
kingsholm
froud
gowanda
micelle
krasin
goatskin
iomega
kochhar
pletcher
fourway
béart
meirion
muncaster
kunin
eversole
illusionists
klingler
corkhill
fevered
remorseless
stodgy
pinsker
anandamide
whitted
davidsen
cryptococcus
sharpley
mcmurphy
icas
maness
iphoto
darras
zahira
gyres
coober
laminations
beutel
pielke
dioxane
projets
stacia
interlocks
jingling
disingenuously
ximen
nonaligned
intersport
wreckless
ruark
coler
davidov
rhel
leviathans
jessee
womenfolk
tde
penicillins
avelar
calapan
ignatov
dicing
shinsei
ezzat
mudflat
interweave
deltona
naiveté
disinclination
mangi
muzorewa
lucette
hybridizes
ardon
novacek
heterodoxy
aunque
symptomatology
gundu
fati
marineris
sorbara
prance
dundy
zuzanna
amitriptyline
eligio
bioreactors
chorion
azéma
colonisers
japonais
phileas
sadhna
amabile
jianhua
cyfarthfa
arboga
malaco
buttimer
salitre
seijo
smucker
excello
montcada
lrdg
sabharwal
venti
pantene
saheba
acupressure
rajahs
unclothed
koroleva
ratigan
bellport
honaker
ramkhamhaeng
hachenburg
ballachulish
laprade
reson
freethinking
espa
baruchel
morado
payen
ertel
pramila
montecchi
justiniano
lbo
temirtau
mutai
essene
charbagh
zutons
américains
bashes
asriel
edinho
wetsuits
amoeboid
dissed
cofferdam
bhaji
menes
panabaker
trespassed
infuriate
étrangères
nastya
cooing
bartenstein
delorenzo
yentl
wibc
skinless
boonah
krumholtz
consulta
evy
hyperkalemia
goji
undulated
gopperth
bassani
auricula
elsass
goodling
flw
flashcards
stowey
nutz
svevo
meaner
hameau
shigenori
anhedonia
gamache
hayti
persimilis
kennaway
kadiri
yurakucho
fringey
syndicator
gaslamp
arizpe
tiwary
rancourt
vitolo
unfortified
yazdani
doctrove
lightstone
thermocouples
groenewald
sybaris
hean
sgpc
gorenstein
herbes
millin
mosty
parejo
keer
speedskater
ferrel
donyell
hirwaun
theobroma
deserto
hrbek
skil
sirene
shouguang
confabulate
khafre
soucek
shoalwater
recognizance
leadbelly
hurlford
feux
boehme
jamai
frites
commandeering
tinicum
roboticists
hounsfield
balanta
hagai
kombu
loret
morgenthaler
vindolanda
bussi
tokuyama
qena
sibsagar
primedia
durrett
coarseness
kurhaus
hasnain
bringhurst
sandwith
lortie
alaeddin
huracan
ddo
hawala
soini
hiw
cowshed
ljubljanica
rasche
scandic
brahminy
moli
hickock
hefford
fedeli
bilas
discoloured
tougaloo
shirogane
chappe
nehring
sanmenxia
ludwell
krausz
karplus
verdú
umehara
sesa
tamiment
ario
rigueur
charita
langtree
yehudit
hemer
cassy
tortuguero
somo
oshiro
multicenter
cormoran
tamuz
dcom
hardboard
ngk
bayati
ossington
conchs
bennettsville
barona
bennink
blushes
kahraman
deadhead
roughneck
ghd
partout
sommet
driskell
shinchosha
finola
argall
tsumura
safka
refractories
krome
warsop
giustina
caol
somnolence
kerkhove
parkstone
telle
pummel
dille
aita
mkm
maton
profundo
xel
bejar
exudates
brienza
gtz
achilleas
relph
muntasir
wfmt
ruffa
lefthand
throbs
ambitiously
craigmillar
lasswade
koshy
salterton
generaly
udyog
misprinted
torstar
idled
nearsighted
backstabbing
balsamic
osea
victimhood
mariza
amschel
conciliator
blonsky
pickthall
tammar
kimmage
eufor
khizr
mablethorpe
wdl
boustany
stepladder
malzahn
coalbed
cerys
kwansei
gericke
whalsay
ademi
eardrums
amsterdams
bletso
ghostwriting
ingratiating
bacs
apprenticing
referable
jovanka
emaar
myositis
itri
pitied
chauffeurs
huji
baus
cunts
palko
cseh
fehrenbach
joice
sbh
analytica
pahlen
slacking
goodin
banisadr
schoenbaum
matthys
chimo
roustabout
orts
netizen
shoelace
sfera
tuqiri
charmes
shehan
thsi
pawlikowski
cagiva
peds
startles
sigatoka
pacioli
cavender
obayashi
rabkin
verno
bunts
meiwa
kimonos
bristled
monahans
mcclay
rallo
urologists
youde
masakadza
followill
sirat
pratas
punative
unshielded
beemer
unbending
unlikelihood
sward
nilesat
hoehne
ogerman
microfilmed
icelander
alvida
mikro
cheeseburgers
steckler
mercouri
kouassi
blare
multistate
acrimoniously
belluschi
livigno
jetting
tollefson
palang
unobtrusively
rakovica
prompter
retrocession
homoeroticism
minuets
returnee
ventris
zoomable
signboards
langhe
pocky
anklet
qiong
franceville
boire
rooy
recirculated
bladon
elveden
tradespeople
chaudry
ebow
brewerton
jiamusi
birchgrove
triangulations
employable
meroe
khalis
nial
rothley
contaminates
kotra
haccp
endacott
fidrych
glovebox
spacy
flamboyantly
seismometer
yussef
hughesville
piombo
alacranes
momotaro
activa
bisharat
spacewar
tvbs
yearnings
dirtier
matute
kofoed
antiparticles
vandalization
mbu
compactor
wagg
arihant
mcbeth
truncheon
bioethical
thoroughgoing
steppers
blowjob
tuilagi
ccma
remzi
spdc
lunghi
verbinski
berr
brosseau
avshalom
esquiline
baillieston
mehbooba
vidhu
avas
dajani
euroncap
scaglione
troubleshooters
nitrification
zipporah
filat
femm
monacan
hagans
crabbing
rula
ncpa
raquin
byt
panufnik
friendswood
bergsma
pachyderm
kuchi
hacohen
casstevens
aurorae
mallen
dzmitry
wiseau
asado
skipworth
yonghe
doctrinally
tast
multilateralism
cattails
shahdad
unconquerable
sequined
castmates
heindel
modak
swanscombe
huby
ajp
allardice
coran
oin
macguire
sarki
gubaidulina
nevo
birchington
geetanjali
brumm
tracee
palomas
bumpus
bunim
dira
roselawn
ezri
mankowitz
judt
dokkum
arks
isse
pomposity
vpm
demande
plautdietsch
kosuge
nurnberg
ibon
kalindi
keiretsu
buser
negril
baldus
ghoulies
amii
overcoats
feldenkrais
daymond
flagellate
uninterruptedly
morsels
dispersions
pharmaceutica
overhunting
musl
kptv
electrocardiography
shchukin
westdale
juliani
harehills
puteh
rotundus
abuela
kynoch
totley
cizre
malzberg
parques
xishuangbanna
agitato
cryptome
evry
pwe
rockier
hymnbook
laterano
nikhat
tilehurst
nginx
mckeag
anesthetized
jangly
wagoneer
whippy
kushwaha
esata
stothard
lowveld
subsector
reformations
conways
arnd
bleier
kinnan
zaya
keylogger
khalatbari
synanon
jennens
crookham
childishness
plb
cromlech
salvages
jarrard
crispell
disproof
kinver
asten
farfan
chamran
kyrle
kaio
rwp
agonized
medem
stratten
lavagna
serhat
motioned
hickie
madhi
bragdon
hotak
kasson
tristeza
majida
wrasses
lusting
emlen
nanocrystals
mobilizations
bonifacius
prayerful
mccreadie
dreghorn
shoehorned
obviates
ohira
pbuh
lcpl
rebelliousness
paolina
baser
backwash
wilda
dijo
bolat
lévêque
platero
lightwood
christakis
fortenberry
misspoke
tremper
killinger
mws
orsola
phenotyping
goltzius
tartt
benaroya
workability
maille
uhi
sanayi
reyno
bitrates
katla
archaisms
avnet
edhi
vembanad
arris
docosahexaenoic
tersely
zuzu
nouadhibou
undersheriff
belpre
zemlja
langwith
kalorama
vidler
defreitas
sarginson
housebuilding
kusano
soeharto
lle
trots
ormes
avira
delahunty
katzir
enthusiasms
upholsterer
banamex
kiem
gulla
benq
preeclampsia
kakko
talbots
hamstead
bleating
sunnier
phenology
alexandrians
jais
ovarense
billfish
haledon
maenads
mclelland
tilston
hanya
prochnow
wben
amenorrhea
prio
nyren
unroofed
myofascial
mailbag
deuterated
katainen
monifieth
najat
allura
cazadores
blatz
ceviche
fortunei
classing
stolze
noachian
nesterenko
standpipe
ventilate
giedroyc
anoushka
rela
thuringiensis
cheevers
drachten
turca
babayev
rosenstiel
copycats
mensae
shohat
tartikoff
yanqing
mcclair
stereophile
palmitate
escogido
lubao
pinchuk
melena
aliu
cicerone
colorants
kilmacolm
thoda
ciampino
ponda
hobeika
cybersquatting
sux
mattick
thill
tudyk
barata
dch
waterspouts
eshleman
whicker
murguía
filipa
denice
malapropisms
whfs
palapa
expressivity
ingrown
klac
ahava
laurenz
oberlander
bica
messersmith
iliana
taks
rockettes
hitsville
syktyvkar
witchery
dipa
jiangyin
vot
ruairi
wegmans
najam
juta
caldecote
righteously
tuberosum
konotop
oare
toothcomb
cilacap
reichsmarschall
oste
radiogenic
revocable
gentility
timonium
physick
brolly
balma
albar
shanksville
cansino
reclined
barnsdall
vno
hackner
tiswas
torpoint
hutcheon
rulli
deinstitutionalization
wolkenstein
stoup
osmose
bourdeaux
bradys
ator
handsomest
bellah
ustasha
gaetz
ekanayake
silverbird
gusman
sistem
henfield
caradon
ghotki
blackhorse
noko
reeler
summations
rampaged
baili
upmanship
stotz
duba
heiliger
clenbuterol
hofkirche
bijl
vfs
rollinson
pillion
knole
brookman
collezione
shophouse
staghorn
daoust
zahrani
crossbills
fiorini
murnaghan
marcinkiewicz
gebze
beitbridge
absolving
fortuné
lidded
eyepieces
coorong
pmh
adelino
dvrs
extention
extremophiles
tausch
editorialising
brittingham
marcas
feiglin
wente
drenica
kelvedon
bulbar
muffat
endodontic
connon
harner
multipolar
trimpe
gabrielsson
grabbers
astraeus
belal
boatner
empyrean
rafted
sampford
starless
keylock
chlorofluorocarbons
vorobyov
procope
pashley
beliveau
ifma
cullinane
medcalf
landesbank
pinturicchio
louison
furat
floatation
batshit
kineton
fullwood
hogbin
denness
mikayla
panchos
entreated
gulper
knaves
laybourn
saparmurat
magalhaes
steelbacks
morvern
indice
wahabi
unforeseeable
strieber
threadlike
galin
cuvée
evader
midyat
usepa
mockups
jhr
licentiousness
cunene
jenova
denature
motomura
grigorian
amai
bormio
dortch
deathstalker
morwenna
combatives
sequin
staiger
risperidone
electromagnetically
ngawa
prepress
orginally
ixia
kilkeel
bangali
mustain
gages
guana
fetid
trespasses
wattie
rgc
ropp
bayit
nailsworth
verão
unaccepted
livi
kirriemuir
bavasi
chiapa
feedings
materializing
gobs
quare
tartakovsky
batroun
mendicants
fallis
motihari
stites
carmeli
cranmore
imagineers
datchet
acculturated
ajai
blasphemer
uglow
hummocks
crawlspace
hugon
yorkin
amarc
erkel
cantieri
schruff
flannan
memex
schelin
itim
manifeste
jeno
tangs
schoolbook
farmhands
cnnsi
microfabrication
hotchner
gdm
atitlan
quizmaster
swisscom
stilettos
nauta
thit
independance
eberl
dumper
noseworthy
chynna
nardelli
bringin
hawza
achala
elettra
moratti
faille
rotationally
disunion
baggott
benchmarked
leatham
montalembert
sugoi
gursky
quartey
tablecloths
liaised
creamed
denaturing
nonpareil
gloaming
leetch
raffarin
lauzun
caffi
provisionals
tasmin
virenque
nylons
tropopause
outstripping
kennerly
arscott
vanderveer
blackboards
barril
paramotors
hierocles
tregaron
postcolonialism
nostalgie
ceallaigh
dorot
crw
ryr
hagens
ssid
sturminster
shac
haier
grannis
lundeberg
taihe
renouard
tristes
azoulay
hhmi
fortuneteller
multifactorial
stoyanova
scheid
cheerio
graffitti
tessema
digard
diddly
squanto
tortelli
desmarets
zedtwitz
boza
kishoreganj
putted
haggle
akunin
wendler
fastpass
schotte
mckendry
khail
drat
koltsov
janu
tabatha
sambat
maxson
marville
tomonori
operability
lackadaisical
saladino
burgiel
skerrett
athi
pavlovian
randburg
bettws
malky
taurog
farmerville
botica
songhua
goñi
woche
poyang
kawakita
reacquire
picco
berthelsen
sicht
shealy
desson
zuazo
masqueraded
eburones
sunol
handprint
scapulae
sechin
solva
steatosis
tortura
bossche
hassoun
taizé
camoys
tramiel
cressman
saariaho
ronaldson
braggart
baso
tanha
sociologically
sokurov
febvre
pubertal
orco
quilombo
baryton
inosine
deshi
kalergi
nhan
quiescence
youngsville
maudits
dubravka
peatland
lacto
bellingen
delbarton
artane
kaua
goudeau
chirino
jeanna
niese
mutagens
htf
veo
langsdorff
gondwanaland
suli
pookie
corbière
robinia
haddow
alcorta
arowana
wampler
schottenstein
timbral
bronowski
extracurriculars
sillett
northvale
umrah
jaspal
zygotes
aungier
muspratt
airdrops
kuduro
larf
ohayon
vieta
nečas
balneario
lench
beeny
majura
bhusan
lbv
pogson
igloos
raouf
klip
pérignon
firelight
bnb
clytie
tandems
entangling
adulteress
gilham
origliasso
leidschendam
smashwords
spiffy
pullar
wmt
belgard
visicalc
beesly
raybould
lieshout
peconic
taubes
coquerel
kourosh
kulikova
puddy
redes
abbi
lirica
chorleywood
dagang
meaningfulness
bergesen
wgrz
iacono
uglies
homemaking
giesecke
spuds
pardis
stratocumulus
dextroamphetamine
morrisania
erps
biopharmaceuticals
biase
deberg
babysits
regularize
porphyrins
mikimoto
quattroporte
lefts
deaflympics
boxofficemojo
kazmir
ranas
dccc
gazebos
crisanto
flanery
barnstormer
flm
waltraud
netminder
yachvili
anticlimactic
excl
windspeed
speare
trilled
siao
expanders
sriperumbudur
dainton
konk
nakdong
hoyerswerda
saltley
berd
nekounam
sylwia
vikash
fishtail
ked
garrigan
ashan
widom
electricidad
nimi
antimicrobials
counterfeited
marvan
lankershim
konik
cfz
faludi
administrates
hooijdonk
pioli
gracy
follet
australiana
bacteroides
eccleshall
inhalers
buscando
lavis
chiho
reso
seagrasses
pianola
draa
fluoroquinolone
misael
isaev
talan
poach
crilly
schneemann
bacteriologists
waffling
disfellowshipped
satkhira
biochar
heidecker
polypodium
yoichiro
phlegmatic
yevgenia
aper
norvo
orthodontists
mazzy
sitdown
jerico
linstead
mesivta
kuningan
knucklehead
whalum
qaqortoq
contortion
bistable
bebeto
swagman
ridler
predisposes
solovki
geathers
anteroom
turchi
turnin
plumbago
aham
abierta
paramountcy
evalyn
hydrologist
wif
skd
empath
brightlingsea
chipley
kunga
rwr
dystrophin
belching
clerkships
olivi
nicolosi
queensborough
broaching
finningley
stati
bfn
udonis
broadgate
luper
rockliff
officiates
vnu
andermatt
batara
hustled
taqwa
mwene
unmeasured
cryptomeria
jahnke
tarah
devaux
uif
mönch
schlumpf
rimac
ilic
tais
kepulauan
edensor
dings
malpass
arpin
subtractions
clapperton
gernatt
grès
mutawa
flowerbeds
begonias
kanayama
zuccari
combinational
troublemaking
bäumer
pcusa
turkel
magnitsky
aegyptiacus
lome
whish
lapus
bonta
ayodele
baly
ufuk
ambrosiano
alse
chughtai
disburse
zsl
prasanta
goteborg
nosocomial
centigrade
boondock
zeni
namgyel
heterogenous
styris
xmb
oaf
salubrious
nitto
josten
excerpta
pcap
affonso
loams
suhaimi
mome
bermudan
owney
senad
bellino
shirur
pinkas
kulan
sanwa
spacefaring
financiero
charros
bayraktar
minders
erratum
boning
torremolinos
visuospatial
disproportional
shefali
washable
boby
skitch
macroevolution
aacr
flyaway
perine
swinson
filmgoers
counterrevolution
yepremian
bagman
mcdyess
rollerblading
vlaminck
colyton
fouke
kerinci
virally
zoeller
aasha
cornbury
pavitt
snuffed
gachot
bage
douaumont
vinni
kenway
actualité
mujtahid
otec
calculable
bataclan
vassa
gilera
questar
vannucci
queeg
equalizes
soner
hewat
gaula
sunbeams
pilotless
risalpur
butchart
garrote
paolucci
rile
nativ
marpol
baracoa
ouchi
multimap
chemehuevi
tambellini
ouko
inulin
nces
stouter
cja
pooper
phalen
cnnmoney
turkomans
dogar
mikhailovsky
grope
serina
nette
sangalo
vatnajökull
diorio
jerker
alabi
colostomy
lounsbury
lns
byer
glasvegas
lifeway
platja
critisism
echenique
allwright
qisas
brk
fmm
toothpicks
kaela
meum
tthe
albritton
nunda
whittling
temesvár
timar
katti
transvestism
heckmann
dingos
rodley
stalagmite
yoshimatsu
crisply
waag
brütal
rockstars
orfeu
darrang
godey
pistils
moots
lehua
valadon
mihaylov
ornithischians
interreg
demaria
torrevieja
kjaer
outmaneuver
wilsey
tortue
dzagoev
efimov
yoseloff
riboswitch
pannalal
christoper
riprap
boulais
newbould
moes
firewater
tsukahara
ahadith
paceman
skopelos
cantone
citronella
woodsen
polycythemia
mulund
ify
antinomianism
khronos
rectories
fantasized
penlee
qualm
clyfford
numberless
juxtapoz
yaeger
wiffen
sigonella
whatsover
briere
bathonian
zaira
yab
pufnstuf
seyfarth
carona
thors
pusser
lacalle
rym
solemnized
mitar
khrunichev
marclay
shandi
buckfield
seismometers
boese
talay
homestay
obuchi
santur
lluis
marburger
bochy
sokolovsky
cappel
emberton
gerner
tafawa
izzi
looseness
marsabit
fiddlin
lipuma
dullea
thole
mtsensk
porcello
grazes
eviscerated
hatboro
hurriyat
tii
liveliest
seiffert
mazumder
aravena
nejad
motes
windlesham
ragione
sylwester
volksbank
earthmoving
froot
donaggio
sarkin
rece
millbury
seyyid
ryen
icheon
ruehl
moliere
weimann
czarnecki
koscielny
daouda
kuleshov
iacp
menz
fabulosos
makeni
dacus
qurna
ishares
metasequoia
miffy
confucians
dufty
pazos
silveria
wrappings
shabani
kehler
marsal
saffy
cury
clydesdales
radomski
mancilla
howison
pesonen
horon
latsis
iturbe
fixative
sundridge
compendia
tyrannosaurs
assortments
feaster
pettah
preachings
snowballing
schrei
emmie
emond
ginola
kopel
begone
callously
evenki
clec
nwankwo
zapad
mazzilli
premenstrual
clandon
boruc
jobling
ostade
cujo
highspeed
oxygene
bny
mavra
singrauli
gilberte
seashores
kctv
gudmundsson
koronadal
maltais
teare
suprematism
mdv
otep
raghubir
sacp
toshihide
extrication
bärbel
scroggs
bazil
samish
littleport
fairland
lamonte
nigris
aliyeva
kaput
porthleven
nomex
kinesis
moina
greider
meltdowns
woodcliff
caldara
stanchions
saras
salzkammergut
kingsborough
korf
riverrun
exigent
pccw
brightwood
salvacion
dinton
langfang
unquestioningly
qusay
manjit
tzeltal
westpark
agnesi
sobchak
ergaster
fungible
dedo
verrett
kingsnorth
palanquins
theunis
shudras
tucana
morino
loganathan
vanoise
firestop
lavallee
fordism
tharparkar
vilches
brinjal
rambus
faucher
lungotevere
reedbeds
carso
funnyordie
kaptur
gossau
hayashibara
berges
rancheros
mitzpe
disaffiliation
nicols
eftpos
feints
etive
subsuming
toothfish
kajima
epiphanies
worsfold
honderich
annfield
suwaidi
tepes
orkestar
stayman
hemicycle
quern
mengs
baju
musavat
burnishing
bujor
fash
schori
druckenmiller
saag
vosburgh
whacks
cockfight
luntz
foist
tepeyac
biped
misfires
laparotomy
khazraji
blase
zhenya
sauder
freddo
monisha
mboya
sacem
kuralt
apparant
ipse
gholson
gillo
telcordia
amrapali
wambaugh
stults
rowden
fireweed
barrón
krla
bortolo
tach
karch
hunches
tipis
foreshortened
belletti
tgm
onoe
eggheads
panniers
cashflow
murphree
rosenquist
wolke
oen
rizza
partaken
biretta
umf
lamarre
orlandini
umoja
lomo
leavesden
unexpandable
ragheb
athas
lano
petn
krasnoroutskaya
petroff
troutdale
nabatean
vagif
infests
garzon
tuto
souce
peppercorns
carnies
korpi
litchi
danan
voloshin
sakr
ulta
riner
orexin
plisetskaya
wone
mcgrail
monhegan
blistered
hichens
noort
dorton
zambrotta
nahj
alexandrova
zaentz
thuot
mihailov
redeploying
tallard
psalmist
khami
bulged
klimek
enam
kandari
lsl
tonghua
chalan
oddy
strategize
afaq
anacapa
olavo
sisir
ginty
mallalieu
kadosh
tishchenko
celanese
leyser
mgt
mcculloh
shardlow
delaporte
unproved
samland
oliveto
wjac
bogdani
ciutadella
clubfoot
embolization
syringae
siim
langstaff
ambos
renat
johannson
jaywalking
occassionally
mante
kushite
androstenedione
saceur
hematologist
drumlanrig
birdwatcher
gassama
noisier
lete
kalonji
icecream
takebe
bref
hedonist
bronkhorst
evenness
hermanson
warble
drumkit
kerbs
gub
fundemental
wittstock
moviefone
josefin
kretz
machineries
domitilla
seraphine
trampas
soufflé
thimerosal
thordarson
chimaltenango
ssdi
dimmick
capgemini
soloff
golspie
hypoglycemic
braund
thirlby
pomerance
weeper
proces
sniffles
redbreast
metromix
backfiring
forton
dillmann
perpetration
tauron
playbills
kifissia
armeno
juneja
hanalei
annegret
tandoor
lmf
darro
nicci
calientes
paratha
premixed
hanami
zloty
gleave
krupps
colvig
suffocates
paraty
kassner
edinson
karjakin
ambrus
batiatus
barbarin
concentus
yellowhammer
meckler
khorezm
cwg
venuto
ampoule
girolami
welcher
leeland
staphylococcal
keds
temerin
nasd
bemusement
smidgen
wanner
carstensz
karmakar
patronise
tasa
alcl
tripathy
dealbata
trilemma
tinu
pharsalia
ockendon
thiagarajan
sousuke
keziah
goetghebuer
albayrak
retrofits
deprecates
marijana
tamme
hiroyasu
derivates
simpsonville
doormat
photobooks
boarman
cherepanov
mutandis
arzoo
betina
stargazers
shox
inteco
yuzhang
prospectively
odebrecht
ojinaga
ilustrado
egeland
hanafin
valdarno
warrenville
scheurer
cedarburg
bearsted
jeffcoat
dancey
effervescence
macvicar
trucco
khanjar
ittehad
kabba
bumblefoot
okemah
possessiveness
fearfully
milija
galbally
plunk
bluetones
grimond
schiavi
nakamichi
exterminators
gavron
midan
oakbrook
blueclaws
kukla
slaw
woodcrest
wenhui
ceg
menacingly
faucon
polyphenol
deuk
harpswell
beerwah
abzug
harchester
terius
fml
afars
maoris
atitlán
volksbühne
akf
akam
kongs
muesli
northcutt
demond
sharers
hounsou
kista
gleaners
presencia
permatang
luggers
pittenger
dockets
ketel
pudi
tignes
ranjani
treffry
digitalization
reenie
gijsbert
undersurface
amontillado
uzan
hudec
serevi
rainiest
favorit
atoned
noar
sirène
dmsp
longville
bruel
fernan
boie
porkchop
kreuziger
bellen
narducci
sismi
desnoyers
hhh
folky
penalise
mckenny
teetotaler
gambardella
containerization
shying
junctures
frontière
anoles
aptana
husni
knobel
piramal
bakal
pingo
xinghua
pmb
bearss
arlecchino
cati
mlakar
nanostructured
axia
wbns
wormer
spitze
randfontein
vadhana
pentaprism
wetten
emis
xts
lorant
awww
alpers
currys
trialing
pathétique
besieges
harber
mentmore
inculcating
securitas
zager
trimesters
gange
schuberth
sabzi
safecracker
inoculate
satpal
jinni
chinoy
pantha
malcontent
lavilla
hoefer
exalting
klaver
rapporteurs
leakes
sirin
pachamama
bonners
foli
chubbuck
institucional
bourj
pascali
mahane
llanishen
dpw
upperville
nieuwendyk
coronial
crackles
cheetos
cottonwoods
elsom
redevelopments
homunculi
nantlle
pully
penan
purgative
mudras
jindo
barling
podravka
fifpro
werent
jalili
lochee
asafa
semo
nmo
carvallo
karakocan
mitsukoshi
batsuit
droving
laube
trinitas
cadwgan
cabala
shondells
ngcobo
ravidass
thrombocytopenic
islandia
motormouth
blackshaw
ballinrobe
hengchun
winterbotham
sorento
cantat
ohman
cincpac
dukie
elzy
forfeitures
melanomas
trav
phyu
sneath
nazz
javanshir
suroor
elaeagnus
maleki
degenkolb
jenkintown
hinshelwood
maiya
kfmb
chappaquiddick
ntare
beatific
nagina
haydée
olano
nanotechnologies
yunos
amiata
composted
surv
griet
portada
sunia
noval
commies
pressurize
lisburne
merseytravel
illustrato
higashino
demoing
neisser
tondu
alexandrou
dibango
asbo
twycross
straightness
pallice
polityka
writhe
giacosa
koff
varnhagen
juon
reattach
semitrailer
vestryman
crimping
ktrk
judaean
smorgasbord
canam
tiba
schwabach
reijo
tenace
guren
ypa
feek
hax
yiannopoulos
dihydrotestosterone
angwin
gremio
keiner
pittance
zworykin
ruffs
cousine
pigna
whimsically
bugaboo
ajloun
nezahualcoyotl
terneuzen
petrosal
sharifah
lightsabers
rosca
udm
veon
branting
reim
lesher
objectivists
beame
oho
compunction
aubenas
givet
sucralose
branner
azem
circumambulation
leakages
khosa
elixirs
berde
macoun
gatekeeping
dorte
duparc
culme
tammie
andreoli
kohr
vollard
¯
scandale
prepuce
banfi
beauvois
conformant
singleplayer
squillace
lits
abyssinians
sparkbrook
zadie
politicize
afsar
certaine
lectureships
ocotillo
arceneaux
audiofile
cuter
lubanga
meilleure
uzeyir
mcghie
msy
excavates
burek
fengshan
ghimire
meades
kley
keyvan
arcimboldo
cuscuna
jayma
thot
mandisa
reneau
meekness
azn
maidment
shabaka
yanking
hitzfeld
neben
pragya
lutcher
bernstadt
mosquitofish
aerostat
xlt
molas
breakouts
bessell
kiron
tausend
goyo
ayresome
individualists
yoked
gayer
stotts
kyokai
biblis
insulates
trux
subcomandante
avetisyan
hambrick
davachi
mccosh
reider
darda
theophile
reseeded
ntia
intrested
beckel
yamamah
goian
ulhas
kadoma
survivalism
frindall
hoveyda
immunosuppressant
commins
promontories
stroger
masahito
steatoda
bakrie
montee
smartbook
scurrying
chaffetz
seibold
nauset
crais
ibara
verdeans
shrum
redbull
sharleen
fforest
imprecisely
sawflies
kahlenberg
ballsbridge
whelp
nicastro
donita
reagle
westleigh
legarrette
forestalling
karola
sodden
musin
besiktas
huazhong
fatcat
kluane
bondfield
constantinou
intension
cotonsport
jannis
mouly
huarte
atrás
matryoshka
yac
losin
shockoe
rhod
slatter
tragicomic
rrt
presidencial
jogos
interdependencies
astrocytoma
limacina
poorman
pointes
manele
seabra
untaxed
pathfinding
kinnaman
bendall
laxer
nanchong
soaker
valters
donnellys
sculpturing
wordpad
mahtab
biographia
kazushige
terim
michihiro
poesy
lissauer
commerford
brunelle
preflight
dakotan
enlace
birkhead
insensible
luciferin
ginning
vallejos
niggle
frontierland
biodefense
mammalogists
zanella
lochlann
gabriola
espindola
olton
pldm
campau
dollmaker
mermoz
leukoencephalopathy
immortalize
felted
dolson
darga
yarmolenko
nuhu
delatour
xvs
aikins
banteng
comana
omara
supe
ireneusz
jaluit
hammerlock
xizhi
thandi
ilwaco
solare
garet
cadair
chlorhexidine
bithorn
hural
manjimup
jugo
newb
mutualist
taradale
goicoechea
fulla
nanometres
borea
heifers
altis
compsognathus
paresthesia
lakhani
blanketing
separateness
santuzza
demet
hershman
sebold
portree
jagatsinghpur
hambling
proj
aima
horno
momofuku
samarasinghe
discontinues
motlanthe
gyrfalcon
macungie
demilitarised
axbridge
mulcaster
sclerotic
tomaselli
boeke
klöckner
argonauta
cavorting
artel
scathingly
handymen
shotover
drumlins
chitinous
boak
neuropsychopharmacology
malavasi
novelisations
velorum
monir
philodendron
dulin
hurr
fasching
nissanka
macromolecule
sunrail
jih
astell
hoopa
unformed
wci
lookalikes
okl
kingsclere
lufeng
sqa
modzelewski
opo
inflatables
chane
speedline
gawthorpe
batuta
schönherr
salvinia
shestakov
rospigliosi
loboda
anacondas
kitchenette
mavro
endosulfan
kyai
duetted
reial
halki
traineeship
thiopental
mainlines
herma
michalek
escutcheons
legrain
mcmasters
seelig
minoans
meshell
skylarks
rapha
nakazato
relavant
scribed
sakka
kernal
gdt
tikki
reiber
housemartins
hef
orlean
kirksey
dacascos
gonin
bwb
mccorvey
orlok
fábregas
cassells
dheere
hyphy
acipenser
bloedel
songkran
ariyoshi
degradable
razo
biodiverse
yurts
shohada
lvds
glenfiddich
hunchbacked
belchertown
denigrates
latinisation
acupuncturist
minc
guaviare
zits
karpinski
sout
greisen
freyre
skykomish
ghadames
nashashibi
ometepe
aleuts
mapperley
prosiebensat
strawmen
sinoatrial
azerrad
nutcase
portentous
gubler
cinquième
noyer
maligning
infoshop
cussing
rotarian
kissam
wendat
fredericka
khvostov
alow
oraon
asotin
jalaun
kilz
temin
duve
scrollwork
peche
adamkus
milanesi
turbin
sniffs
rzeznik
rosseau
guli
kasuri
warid
implausibility
almario
battagram
castlewellan
cofounders
palhares
contraventions
lenghty
perturb
ardila
germann
ciénaga
nubra
longquan
khagaria
bennati
vickerman
potentates
azolla
sarp
bents
ainhoa
rafiki
bedknobs
wagstaffe
zopa
stec
nube
chuzzlewit
bradby
htoo
cepr
hamnet
liturgically
coston
kiesel
creag
rmas
gaoler
walkmen
beardstown
gapes
benediktsson
spectaculars
tonalá
repenting
costel
dega
beranger
detoured
ufologists
unseasonably
theiler
nordlund
partenope
kujalleq
sleiman
housecleaning
golino
khaw
aspidistra
zaslow
picnickers
popple
tetiana
accedes
recognizability
riverworld
poots
wearied
benoy
benten
fionnuala
kirkley
overflight
gentles
blancmange
rsh
starched
mesmerising
kady
metzenbaum
recharges
wheelbarrows
mazzotta
awesomely
dresch
birthmarks
tsunoda
tweedledum
twila
inderjit
ianthe
merna
iwork
iseo
asimo
montlake
hamlett
sodano
munching
radd
hammamet
brewmaster
eschbach
rutabaga
nsrc
hodler
bifurcations
earlobes
ispr
amuck
junjie
parisse
trialists
bromham
terrifies
mentee
frese
sriharikota
wch
rodden
tvw
lvovsky
typhimurium
dumbbells
lewie
fiddlehead
eplf
glycation
quagliarella
ondemand
canuto
ephrussi
leemans
windscale
djanogly
rosebush
greenfinch
dietzel
bellot
sizewell
glienicke
kiyohara
auditionees
bandito
polperro
pawsey
shehab
tomentosum
holmqvist
heydari
duje
intertwines
munnings
prostatitis
cottons
kashtan
megamind
cosmographia
craic
killingholme
barzilai
bierhoff
kapel
ryuta
admitedly
joell
pourri
rdd
mujahedeen
brinsmead
brisa
undershirt
hoaxers
kante
soffit
ranted
riblets
jemal
olbrich
nixa
kold
acea
respublica
jinxed
jojoba
urwin
actc
lindvall
aecl
vlasenica
shawneetown
shortbread
rheum
bogeyed
shf
hitmaker
benighted
penhaligon
clinica
georgieva
dhingra
demitra
anax
hennen
canot
gunports
troodontid
spuriously
caché
dtic
galambos
phen
millport
benefaction
fuwa
lestari
crania
brucie
mugo
comley
squibs
saberi
logothetis
permissibility
coleslaw
jape
scriptable
ackbar
bould
rys
crz
udorn
iracema
noisettes
cadetship
outlawry
lovich
gallion
trankov
clum
mahabodhi
steadied
daryn
kevon
couching
thrane
ropers
narelle
bossard
dilling
coproduced
roatán
nccu
antitrypsin
windstar
chirbury
bwalya
vasca
hikurangi
hubcap
crowden
kalypso
wesh
geotagging
brewpubs
koyuki
fmx
epilepticus
kinko
janklow
melty
xanthe
vallés
urm
mahaffy
schoolfriend
djalma
skeel
stiffs
awestruck
gruevski
pneumoconiosis
concordances
zuleta
frear
ariba
namibe
invovled
zwaan
timman
zettel
shayan
bosques
ennedi
plym
prostituted
nafplion
leopoldville
bisignano
molts
gois
cavaillon
waives
madou
beiser
mclynn
merryfield
experimentalists
anuak
somov
wtvt
irreversibility
barrantes
amylose
charioteers
cuvelier
skolimowski
mundra
kamasutra
timekeepers
deadening
quirkiness
tappa
negreanu
razorbill
petershill
nctc
doull
estepona
ditta
ranee
preliminarily
weihe
nohara
rackmount
balkrishna
bortolotti
sanski
sherley
hylands
cok
osk
duchene
mtbe
vandervoort
vaxholm
readjusted
ttb
planked
tahitians
espero
bottomline
xinmin
sudler
gholamreza
coalescent
chanhassen
haake
rubell
miyazato
geim
jolimont
hoarau
portelli
dinga
fazle
hartranft
babbs
wou
sporades
lakis
lahi
cordle
ivanishvili
temir
namdar
xiwen
gluckman
søndergaard
geschwind
zambezia
goodlatte
tatsuro
jirí
stickam
chisti
crosson
ehh
zvenigorod
titanosaurs
dazu
sherani
coughed
telemarketers
luman
tinsmith
morzine
archaism
ferrule
rousselot
csce
thermobaric
gaden
rúben
consolations
nafisi
enciso
maty
rappa
haçienda
fingerpicking
redressing
zien
unsystematic
bcw
wilm
brons
canalised
subthalamic
shiwen
lowii
scroggins
sge
bressan
cibrian
ansible
castrate
mitha
szechwan
loisel
sailboard
cotchery
kahuku
coureur
rakan
dalio
gorged
rimba
sitchin
akhmed
konkona
mosasaur
marianist
hypoventilation
taborn
manacles
illi
unrighteous
traversable
kisselgoff
düül
ausiello
craneflies
balderton
marum
pliant
portaferry
belisle
leef
fbt
beamlines
chanteur
papaloukas
letwin
hasankeyf
tfca
azithromycin
luli
avani
orti
paines
pcsos
chambrun
windley
roedean
canzona
knowshon
nalle
vicens
chani
pulsipher
sapperton
jaafari
arachnophobia
nasib
westlink
presas
skylink
vollrath
burstow
antwon
merrivale
mobbs
firle
baddow
meditator
laforgue
graying
dahlstrom
kurumba
karawang
vermonter
kaltenbach
checkups
outstations
mcla
toffoli
gonu
topicality
nonmetallic
comesa
lucano
sokratis
dispensable
acrocanthosaurus
ontong
realisations
rupo
sonnenhof
byres
synthasite
snakepit
kutv
phillie
scherchen
langstroth
renyi
petaflops
ceawlin
yohann
gomberg
franci
mustards
bibs
inia
reoccur
fitzwarren
leelee
sprigs
cavy
radiochemistry
jyri
waldseemüller
siegenthaler
shipborne
stubbington
chesterville
opaline
phrenic
sulking
inishmore
earphone
screensavers
sardonically
stylo
kovalevsky
braylon
jatoi
shelden
tibbitt
plutocracy
fanfan
lanxess
droege
disenfranchising
pixelation
macrobert
guignard
delavayi
lamaze
baptise
sahid
jeanty
fadhil
accardi
keynesianism
contouring
fatiha
diabolo
pymatuning
marq
differentiator
sudip
dickerman
kleiser
meredyth
obara
diggory
berlanti
adrs
tenuta
kamerlingh
dibenedetto
marchais
macandrew
debnam
blaauw
genetical
subtests
layed
ellisville
impresa
binga
aerovironment
dildar
insurrectionists
torreya
buckstone
vaisakhi
mosimann
unluckily
foundlings
taberna
kernighan
fatf
railwayman
goatherd
iops
doorsteps
anica
moët
proselytization
sentamu
unseld
diarrheal
snijders
aboul
reframed
dubrovka
microbiol
sarpong
commonwealths
giornata
glenridding
tattershall
peattie
maraschino
amnat
changning
commentates
hutong
bomblets
kurdo
krishnakumar
microtransactions
dohm
detains
plavi
pannella
barrons
mohnke
lwl
gahanna
beny
zenas
abeid
yibin
trittico
coalmine
overplayed
vengerov
montalva
enercon
tarjan
urease
patin
inserm
itanagar
nonpayment
malignaggi
casandra
skocpol
retorting
bargate
bleh
hirson
abse
embarrassments
parmi
melani
tickhill
hakama
perrette
pickerington
weightings
pizzetti
antipersonnel
planetariums
forsett
shio
ebersohn
trematodes
mckale
gunasekera
caddisflies
jopling
kovin
pimlott
pigtail
dunbarton
kyp
pequeños
nbcc
grayer
bluesmen
lohn
fogelman
swar
liben
macovei
fixate
sarjeant
passman
whinging
pbf
waterslide
biscotti
nasco
colostrum
herbison
clearstream
longtan
yuanyuan
meurice
activ
bonneted
suz
peten
brockbank
stimulators
mcnicholas
shamblin
kieswetter
longtown
ludovici
insurrectionist
shirahama
mutti
fermentable
tamaz
streamliners
forwarders
colmore
étrangers
marcom
nuccio
erysimum
dirceu
swapnil
coudert
monofilament
pitter
tananarive
cashion
dockworkers
leppert
shiplake
intonations
gamecenter
pedo
transferability
cathinone
reusch
pokot
gooseneck
blackspot
individualised
stiffly
siza
ryce
woodcroft
bita
dalt
governmentally
byre
amroth
prams
weslaco
troxler
ternan
dilan
senescent
coum
freckle
panny
agenzia
enderlin
matamba
manari
rahimov
egbe
leeann
kursaal
expecially
minories
imposture
topples
napoletano
ununseptium
hims
pluses
jdrf
wagenen
josu
reggiani
carrizal
sulphurous
tiang
goodna
lauck
gresh
saifi
justiciable
bargeboards
olg
katznelson
deafblind
remediated
siac
bassingbourn
faccio
riksbank
cinc
tvg
explicable
donatien
felicitous
disinherit
epsa
toyin
erico
cottony
protoplasm
obsequious
refashioned
winkworth
zazou
countertops
durov
ordonez
volland
testability
crenulated
dickov
fishburn
inchcape
enciphered
federigo
windschuttle
ystalyfera
maschio
swinden
bahrainis
nill
thiong
workchoices
kanes
maccoby
claassen
xoom
darryn
nwosu
goumas
colliders
landsdowne
seigner
winchmore
sbx
zionsville
masn
sbarro
ferencz
adelie
raworth
serano
globule
rajib
edmonston
periosteum
heys
willette
wcdma
callable
pamukkale
krupskaya
custodianship
mazon
electroporation
leviev
hershkovitz
feore
bernheimer
sententiae
pictograph
ajram
herston
sotirios
svindal
austronesians
gatemouth
satch
postville
exide
yeonpyeong
atbara
foros
cubillas
paleoclimate
stonemasonry
ruddington
sundt
calomel
primadonna
lichtenberger
jeopardised
experimentations
sightlines
tomm
valore
amate
yahel
yus
eaw
liberato
emulsifiers
calcaneus
doled
oww
norie
roseum
arbella
sebright
warschauer
felten
curiouser
pushbutton
petach
mamoulian
swannanoa
packager
hefce
williton
maresme
jonkheer
minorly
paraiba
balcones
witan
chillingworth
ngarrindjeri
abridge
bendit
intimidators
overshooting
chaung
caractère
regulative
hollington
nordhaus
protostar
streetscapes
holli
oversimplifying
mawer
dobry
monastero
ocna
mevlana
kabalevsky
lembo
veneered
bushkill
yuhua
basilone
jobeth
muhanna
regressing
dastur
mroz
pernet
seawell
kaneria
searls
creameries
udmr
razdan
worldbeat
unswept
sternhell
constructivists
haddy
biopharma
lossing
mashona
jony
kpp
pfleger
grune
viñoly
aetos
haussler
denia
davida
macgraw
kreidler
tzaneen
stuarda
maturities
bjarke
eklavya
kenichiro
riddoch
slazenger
wahidi
trombino
allensworth
otolith
boychuk
niit
pattan
ieremia
wgy
quenneville
randalls
lovas
babic
unestablished
payphones
penas
beezus
teilo
inarguably
cabi
otti
mondlane
brou
boxleitner
kuby
ornithopods
prefacing
chitta
outcompete
unsa
margerie
ardoin
syllogisms
breakbeats
adalgisa
osteotomy
massanutten
gingham
deerskin
hillen
tsvetan
swallowtails
outsell
beerman
hennequin
babineaux
agyemang
rass
bobrovsky
duva
choicest
lammert
imposts
bassnectar
giordana
idlers
antonsson
seacombe
ariya
terraform
beatboxer
lintott
schematically
brushstroke
adiga
cedd
ardito
rheinpark
dwyfor
coninck
bellydance
lippy
puting
veatch
hakki
lyubimov
acquisitive
sahana
ahlmann
mycotoxin
keepsakes
minow
reigniting
balata
fidh
djordjevic
riehle
ahlers
shamar
tumaco
abrahamian
chorten
kuehne
korir
elberton
macshane
camelids
mendo
dendera
hermaphroditism
rrf
ménière
bzp
cityline
nectarine
opf
maharajganj
monetarily
husbandman
elv
kieu
abdoul
julesburg
vézelay
hotness
faxing
barrosa
hilma
finnian
chubs
gribbin
suddenlink
fortuin
ewelina
feiner
abdelrahman
abortifacient
muffs
catafalque
crescenzi
nibiru
glenny
potbelly
castanheira
scrim
imbeciles
konstam
kataev
braeside
fáilte
grol
raghib
telekinetically
walkouts
voya
fatemi
caresses
contractures
babich
odorrana
shizuko
gefen
remanufactured
mitroglou
morgon
gnawa
chromosphere
sbo
tlalnepantla
elektronika
purring
ambre
schreder
simien
seborrheic
catalin
rosatom
yosano
filiz
filippa
malchow
secu
dorna
regionalized
sandiford
liaises
flunked
mmps
rajnath
clotworthy
permanency
wyrick
ceh
rolaids
darwyn
borich
ceta
visioning
ferriero
plod
suas
cabel
lebedeva
neuroscientific
neurofeedback
harissa
khaja
neocolonialism
coldham
agrochemicals
dworsky
guapos
bacchanal
kuznetsk
vakula
ribavirin
blon
nanuet
marylou
mwana
promethea
knowhow
eliseu
canaima
killough
mebyon
isinbayeva
haedo
bridgers
sambalpuri
tessera
gittens
schneid
truther
quemada
hocine
debugged
necropsy
tupa
reverberated
blowdown
yamma
wifredo
disproportionally
anirban
afterparty
sanatoria
zolpidem
hedegaard
boesky
buchheim
gearshift
oborne
valetta
equidae
dighi
enquires
unexpurgated
swarthout
flitwick
guillemette
intresting
smac
machala
fedden
athis
aylestone
jaffé
meiners
parasols
skender
yonemoto
whitmarsh
culturales
gismondi
childwall
stanleyville
setts
vassy
chrysanthus
disorient
berenbaum
szymański
sanlih
realclearpolitics
unsaved
weisel
llew
timonen
vermeersch
sulak
kurnell
witkiewicz
sakigake
phenylketonuria
orthwein
patenaude
arison
rebeck
schine
smileys
cabrales
parlamento
eigg
mallin
moeder
bakau
zuleika
barkham
zipcar
bourgois
apiculata
tauno
dhp
pagla
leps
gombert
elongating
shahpura
ghesquière
trefriw
htb
hatzfeld
granero
kvly
mutability
valon
dysarthria
jewsbury
gaafar
tremulous
curig
toné
callister
cutaways
proslavery
venkaiah
guale
zaken
nwsa
hacket
banak
moonless
suggestively
segers
sallied
contrives
craster
landside
henig
markson
depressurization
peress
colantoni
rubbings
swatches
wetzler
tappets
onomichi
demange
commer
seferis
atomization
glenbard
cauldwell
twixt
swineherd
gsma
mpsf
antel
tineo
trethewey
hilarie
branyan
sawbridgeworth
elhanan
dongbu
mangue
incorrectness
wbcsd
pownal
sulk
qanun
cranny
mehrab
anissa
clent
filadelfia
lrcp
misteri
malolactic
sayce
arcona
videophone
tidning
dyche
leksand
trico
rge
takuji
smoothest
nodosum
auke
riverwood
discerns
broseley
jaubert
ribchester
intimating
baseboard
senga
voyevoda
kalma
cibin
cannata
barotrauma
caresse
uip
mothball
paisano
mellberg
topolski
tosin
tembe
enjoyably
giry
chilkat
fraport
offputting
milivoj
mythili
tgr
gilels
icare
kelch
semitransparent
seances
palmarès
bahlul
ipcs
bunten
saheli
phyo
tez
senge
sublett
noncombatant
patcham
princeville
mexès
woud
gorney
camano
goodwins
bertucci
tweenies
pisans
soldan
mármol
larnach
bouder
burkle
verkhnyaya
chehab
knussen
arbois
meriweather
jiaying
vilniaus
aldf
bluejay
mautner
minotaurs
zusammenarbeit
emy
foresti
stehr
churchgate
propitiate
lindenbaum
fallas
farad
dacoits
tuti
ohlendorf
belloni
ostrea
dapeng
affordances
aweil
brabus
lebus
springfields
laer
schroer
asarco
harim
paleoanthropologist
flowerdew
fatback
copeia
carella
truthout
likeliest
schneeberger
tanai
ataka
teta
hha
capurro
churlish
stylishly
misspells
pvsm
supremacism
highnote
hwee
meirionnydd
whitebark
bilingue
chaptal
plasticizer
elg
devotedly
slowpoke
guinot
mersing
corradino
poydras
refried
unian
regroups
blagden
impregnates
michelob
kalitta
pury
hrabal
rucci
exuding
pulverised
mres
dfr
severomorsk
claas
boisselle
cernuda
mercante
dcnr
feasted
comunitat
tinus
ansermet
pemberley
reingold
irrelevancy
brainwashes
ehrhart
ppar
boyaca
khidmat
goodpasture
grindley
wazoo
servis
ekeren
violons
csec
gramatically
clewer
inrush
manute
freest
novelized
encantada
clatter
hamre
vidhi
ooc
anniesland
polites
pirenne
nimo
arstechnica
townhill
eliud
residencial
whisenhunt
nccn
noke
hibben
afolabi
daringly
kirman
douglaston
asocial
unive
abolhassan
kamble
marcucci
bumbershoot
phouma
ridenour
copernicium
ptak
mohawke
fadec
régence
almquist
tunggal
ionotropic
ewings
ganso
magwitch
caddesi
sevi
dpe
accretions
pechstein
alycia
saxbe
kpfk
hinging
lauwers
bletchingley
essig
wtp
secularity
lemak
greenberger
tointon
storke
takia
barendrecht
scourged
swamplands
congenitally
snia
prophetically
yeun
craiului
cak
chaki
obsesses
rahane
consilium
griping
conisbrough
charismatics
langsford
taggert
milone
romig
bromborough
watabe
gerwig
skiable
thakore
colvile
prosumer
lundie
makani
berthiaume
kilcher
roslindale
lamitan
kosten
adlershof
expressionless
icsu
newent
imke
kienzle
emac
launderette
tiberi
ruffy
monsalve
genotypic
forefoot
lordsburg
scorns
pinki
bullous
meep
blythswood
cognizable
haldi
travellin
raptorial
velocimetry
louch
knighting
formalizes
mulready
mily
stenting
commandeers
jedidiah
broilers
shalal
glassed
akishino
hatra
swearengen
idolised
pesantren
everage
frosch
tyrosinase
dimos
bramson
feckenham
roundwood
baselitz
zimmerli
lilavati
jinotega
gavazzi
vallotton
aiea
naheed
theatr
soukup
agers
quiney
millsboro
amanti
faders
bishopbriggs
jahad
jeremi
mccrum
solbakken
colten
clumber
léogâne
mashina
glanz
clason
hamal
eeepc
jdf
sarver
klea
kirkhope
bennison
gosnold
estyn
letha
ekpo
talence
kornet
dersu
nonbelievers
leister
chaud
tepee
slayed
majali
kingo
trebled
kheer
sporobolus
expropriations
faile
studier
looby
flori
rezek
pembridge
shumaker
vehbi
sikka
glashow
sezgin
jandal
khafji
natterer
fechter
outgames
musidora
gyroplane
navara
simus
yanev
sparkler
derrek
tamiflu
bonnaire
kumalo
ucha
ostentatiously
crull
coulouris
samaha
garen
aspersa
naef
whcih
diffusely
arwel
felici
bovill
koidu
rooty
azat
conerly
beckum
knac
merloni
xai
yuppies
pauleta
wajir
nhon
medoff
gragg
mushroomed
cervélo
raiford
haymon
porites
abdiel
sellon
levet
guanxi
mayock
foundationalism
defeatism
disunited
recasts
paternally
cinnaminson
placencia
navesink
anomalistic
transnationalism
rasc
polhill
ohi
atlit
pandur
waaaay
pardue
chaine
tweedle
swannell
pudovkin
hitchins
siller
faurisson
ahtna
efford
deet
veenstra
insurable
tellegen
zigzagging
schlesser
sannikov
emboli
weedman
nandor
rochas
sainted
ioanna
wildwater
gumpert
dowels
pluripotency
attires
oxendine
electrocutes
ceefax
dogpile
pouncing
yachtsmen
bocock
piccirillo
rason
skłodowska
agros
kba
seyni
knapton
imss
straten
quetzals
maglie
baikie
bousman
lati
banani
dewani
laleham
huaihai
shored
tradeshow
kalim
jianzhong
clonidine
dentoni
konz
tasered
cassadaga
builtin
acetates
xiaojing
sadiku
agah
vold
spangle
prizewinners
earthscan
yokneam
swofford
horcrux
waverton
recoiling
wilhem
kellog
kre
abbiati
misconstruing
ravanelli
lemper
vicini
ders
sabermetrics
wolvercote
bearwood
oversaturated
weingut
quex
hoeppner
maizière
nosecone
yuengling
connel
tarbat
nitsch
vicinities
waru
amadio
kittur
lloris
epigraphist
nxg
nationalise
swinburn
deprez
woodchurch
winesburg
bakos
wilmshurst
hackel
liras
ratto
rakel
systematize
mouawad
retransmitted
kettleman
xinxing
tianwei
pachycephalosaurus
cachexia
stosch
cuppa
nivkh
cherrington
transesterification
moazzam
parnaby
mannin
reclassifying
zeri
metaphysically
elephanta
cdw
calment
haslinger
gustavsen
merenda
braunstone
nordman
ipil
treschow
southshore
abord
associational
lineside
ghaus
ganim
silverthorn
balayan
aala
cruzi
arcangeli
rheem
pardi
babita
bloemen
pavie
penedo
sobrinho
mahmudi
carsey
sherawat
keaveney
beccari
fractionated
ekurhuleni
realness
jsd
ponciano
toobin
ramnaresh
tamkang
worldcup
ezz
squab
cragin
mooning
boreman
akindele
senador
inviolate
conditionality
hgf
blaye
promisingly
katsuta
mamani
chaotically
jeal
womankind
dollinger
harmonists
folkets
numerological
dka
bamse
yothu
subsidising
rogo
bagi
pirrie
whisks
arbury
reawaken
oddsson
lsg
pompon
culter
donskoi
unicycling
triga
awqaf
highcliffe
peps
trinet
comverse
bessone
sgro
postholes
westpoint
springboro
oceanarium
cona
kriegel
grantown
indentures
schlag
urinetown
piercer
abms
pandoras
sundries
théatre
manou
aica
metastasize
ardo
osseous
verica
goossen
nestles
skadden
restivo
réalité
hubby
wormser
lizbeth
campanas
aben
aparo
cryan
windbreak
ticinese
bachus
sportaccord
derryberry
tongyeong
nepalnews
woodstown
kneipp
pharmacopeia
mendell
horsell
dairyman
htr
mudflows
moville
dotto
ogino
praetorians
yune
truthers
montezemolo
sherr
victoriana
criminalising
aloísio
halvorssen
peterbilt
tazawa
millstream
tschaikovsky
nado
narinder
kastor
gobrecht
schnitger
lokey
portugalete
cnooc
letterforms
karavan
steff
uspa
peñafiel
gransden
zygotic
edginton
keowee
larke
superliner
abass
dakhil
parceled
brownhill
lahij
pede
washingtonville
regensberg
indiggo
haltemprice
starliner
sprayer
protegé
creggan
toran
rustico
shellshock
ramsbotham
esai
blust
lorikeets
tingwell
diskettes
danchenko
hockett
gagnoa
ellerbee
amerind
dragone
buzzell
chatillon
mckissick
zacharia
einziger
sorce
marchwood
imploding
sangwan
kroy
softline
luxembourger
kinnard
courante
jarbidge
oddjob
preferrable
relicensing
dirtbag
xcor
leuenberger
seawalls
keirn
arny
estamos
phenomenons
mccardle
fadlallah
nonplussed
sabari
vlm
fulop
rooth
neot
govert
lawbreakers
unburnt
wonks
renderers
loreal
holdovers
redlight
kanthi
coachbuilding
lgp
spoelstra
tijara
jember
terrifically
struth
monthon
vojinovic
pyromaniac
quanto
enza
kerfoot
esperia
peca
kharg
alexe
liquidambar
thiophene
memorialised
kaifi
wtam
pulsatile
hadlock
warlingham
clancey
trivialization
wakatsuki
nudges
schwenk
netlist
flappy
codina
casserly
kyohei
wenjie
soiling
gads
polyurethanes
hosie
yba
initializing
harless
jiyuan
gasteyer
tipaza
gulberg
luqman
kyffin
warf
compellingly
falconers
salcido
tendre
omp
borok
blitzed
boto
lisco
arghandab
senecas
sasu
bastidas
boumerdès
jakaya
odemwingie
daugaard
hurtle
ratcheting
glendenning
vortec
angiolini
cannizzaro
ult
dreamscapes
trevorrow
myerscough
uck
godfred
schaick
gonaïves
seagle
magnetoresistance
vergata
wiggy
rhib
insein
yocum
yampolsky
pwnage
charline
königswinter
enitharmon
aesculapius
diaa
bokor
krentz
yews
northbourne
karena
flys
nahm
criccieth
flyball
fossiles
uneventfully
countersuit
hbp
grundmann
televangelists
sady
wilmerding
pradier
zhulin
ouzo
screencast
grabski
hotwire
brittania
dwane
geraniums
deori
microplate
tearjerker
nuka
misnomers
interdicting
reichlen
wrung
tenofovir
worldliness
paladini
tennet
wondercon
superunknown
oped
cryptologist
yadier
jevon
nuva
petrosino
totale
southfields
ssrs
hedgecock
scac
niddrie
sadruddin
sporn
ixs
rheological
rocas
virgos
hafizullah
ulundi
immunogenic
woolgar
wda
yokote
removers
annatto
ciudadano
kintner
bacio
dudamel
kitsilano
sapphic
dundon
allogeneic
tikolo
ralegh
ufcw
fishtank
heffelfinger
spqr
seyrig
hawkinsville
stevedoring
allem
schweiker
fespaco
idot
aldrete
ndoye
katsuji
capuleti
tothill
baldacchino
wojewoda
mozer
lindland
keitai
shafir
deemster
ellingwood
dosimeter
célèbres
teter
segregationists
delen
harmonising
darkhan
intoned
erotically
pariente
poz
countersued
kaftanzoglio
sparklehorse
moldenhauer
rootlets
towboat
jasbir
ahmadov
najimy
avent
dpo
mcfc
tomohisa
internetworking
zygons
vermette
dge
relaxin
troya
athanasiadis
surfs
anerood
yasa
reorganisations
blacklick
overstuffed
aprils
helgesen
girne
heinola
pluvial
paleoconservative
simplistically
iwatani
garraway
meddler
pontet
gleditsch
advisability
dedrick
chh
baldauf
buma
ardi
unenclosed
newsbeat
oropharyngeal
preempts
militarised
mulatta
unconstructed
moonglow
tihanyi
deterministically
flaunted
showell
assan
rinku
fornell
llanbedr
clínica
cramton
ryul
blixa
cartmell
mouat
stenmark
célestins
gyantse
robofish
uzair
allways
dangriga
musan
openview
instep
gld
xcp
assaying
fahs
dictaphone
neese
gyrodyne
msha
gillnets
prosthodontics
lafe
biotopes
maciek
breland
goodhew
raffo
wadesboro
xinzhou
heartedness
politicisation
alemanno
multistory
ferrini
franson
dippenaar
fallingwater
leehom
bissel
nbf
rattenbury
olas
stigmatised
thornes
climbié
krabby
relegates
beholding
spay
ktvi
souviens
fraternization
funcom
applegarth
fulbert
brining
lorenc
ribbe
sidearms
coté
billman
ganzel
vainikolo
hongshan
ksbw
smsc
lapide
mahn
lello
buettner
chonan
marut
ery
yorty
turgay
tws
rubins
ballsy
michelis
azulejos
revascularization
perrelli
bodoland
pullach
erth
zrt
wolstencroft
cuddington
rebuts
eyot
ensminger
ngum
gallinger
yerington
manoranjan
dallman
twilights
gouaches
hilfenhaus
gandara
repudiates
mastan
haseen
setouchi
priv
copulations
walli
saddlery
akhavan
garma
swadlincote
tadoussac
hacken
lakeridge
shirking
prantl
contusions
appy
plesner
matriarchs
macrobiotic
kolm
hypervelocity
concords
lintz
bho
sharqiyah
teachable
pitiable
yaro
didia
slackened
hurford
evertsen
badenhorst
botstein
sviridov
amplio
chauffeured
lefever
okemos
centrosaurus
jades
dahiya
meldon
hesder
gillispie
koekkoek
kaleva
garioch
støre
narges
corbould
nicaise
woolson
krasimir
emigrates
snorre
boff
discolouration
rangell
kwela
committment
leukodystrophy
sanchar
isay
syf
impish
unicity
zijlstra
bucko
dfp
huddy
nonselective
splashy
sonda
icer
ksb
arvon
lacquers
catholicity
laira
mccahill
icbc
tenino
ramal
becali
laurelton
riemer
suttons
repugnance
modded
aksana
wansdyke
thomann
sladek
zecca
sodomized
bemberg
narcos
amadeu
pumpkinseed
tanami
nordoff
mcnicol
marlen
guoli
beisbol
moza
recouderc
unconventionally
wagan
morganwg
höller
haeng
béal
barging
skb
lasses
delicatessens
rody
renkin
competizione
mcfarlin
drydocked
jackanory
paolino
raytown
lamade
vica
derogatorily
stahel
ekstrom
caulkins
katzenstein
hacha
mississippians
walts
lucescu
brydan
jeng
carolinum
suydam
implausibly
lyssa
laghi
glackens
lail
sheriffdom
telecasters
didonato
turkiye
penalizes
alexx
syvret
mcphatter
wivb
eisendrath
beilin
ladybugs
quot
headbutting
veri
sotnikova
ponteland
eddine
westfields
tubeless
wibowo
baine
fetishistic
dabur
natasa
kokorin
bandmember
sadiya
losman
ogburn
lamed
glaister
ismene
ecotypes
gaols
lansley
ochil
freshen
franklinville
medlocke
haff
tornabuoni
semipro
quoth
schedulers
welner
humidifier
brookgreen
thamarai
rhapsodic
laughingstock
brebner
deneen
rupesh
planetside
kaps
andriana
nyla
dominium
congos
shahidullah
ywam
jic
gunasekara
hayloft
suzannah
gullion
nomadism
coulier
gandak
valdemoro
ishtiaq
turion
deign
incarcerate
moriwaki
pleasingly
psat
ginobili
slive
cojedes
bleich
wijewardene
teplitz
irg
schwob
jameela
rigoberta
curatorship
penetrators
coquelin
ltz
emyr
jhaveri
sunsplash
tenser
burdine
striken
varyag
sestiere
toton
helaine
mahanagar
apiculture
rongbuk
anjir
rouser
vaguer
nattier
allori
franchione
tebbetts
mavado
loaiza
rahan
simvastatin
boyana
valdir
butland
massana
iowans
spitzbergen
kerkhof
thorndale
picmg
epileptics
willich
negrin
funiculars
spcs
taneytown
omelet
salthouse
daljit
loughridge
kundi
javiera
kens
ahlstrom
cognos
sartaj
trafton
fedra
eurocom
dymphna
niblett
umari
stumbleupon
tenochtitlán
cheska
malec
marang
akinola
impellitteri
ellinger
icsi
opb
dweck
dorst
voiculescu
wurzelbacher
schuette
fortino
danese
lexeme
ayukawa
medecine
babayan
hanjin
merfyn
kango
cytisus
valeurs
burgoon
bachan
cyberattacks
stormbreaker
suntan
kahoolawe
monia
daglish
ramar
slone
novelised
buchmann
sultani
schemers
ballu
hectoliter
graptolites
stammerer
corpuscle
gwn
chaturanga
mecum
alalakh
alatriste
brissot
pelecanos
czukay
mcps
sidor
grajales
clowney
tgl
olegs
laicized
sykesville
panera
troubetzkoy
ciera
pmcs
videoclips
lapita
unglamorous
blitzes
uned
carrasquel
vgik
bensley
thorington
mansuri
julietta
forwood
glassnote
grimwade
najera
borgwarner
ngi
kard
burle
fullam
bakan
comex
pdv
debat
badung
pomodoro
vorn
felsen
toothy
bloss
wildstein
roomful
lydians
darkhorse
milfoil
funtastic
aapg
coaldale
mynd
kabala
ghazaleh
briffa
kiprop
koehn
argyris
vesco
niblock
vinne
zavadil
stonegate
huntingtower
brickmaking
likeminded
bohl
dgo
kastle
trimethylamine
kenfig
balck
prosecco
alderaan
shawfield
christmases
avez
valori
bumpin
tibby
risser
docetaxel
plagne
ysleta
vend
müstair
arkangel
neerja
prosperi
mizuta
fallaci
groggy
tweedmouth
shimkus
brame
antagonisms
forschungszentrum
titmuss
flagellated
granatstein
cepheids
rygge
schank
hierarchal
wayfinding
clobbered
daddario
dicentra
angella
michoacana
oosten
tranny
orrego
kpcc
unflappable
byculla
transfused
remota
norheim
brookhart
wcpo
loree
spieler
pepfar
thornalley
clenching
hatreds
guadaloupe
jir
poincare
reutov
herbin
caci
charkhi
norcal
oquirrh
mubarek
uceda
mckitrick
battel
heilbronner
salamone
stieber
brouwers
actuating
simonelli
boorstin
vishakhapatnam
taua
bodacious
browz
tushingham
patterdale
multibeam
possessory
wenjun
choire
aopa
scawen
choloma
scie
oduro
griego
verda
norlin
minnick
tijana
greenly
rodchenko
desjarlais
thalheim
sanden
tinkoff
harf
sharpens
spermatic
nzru
borei
generalizable
allhallows
stuttered
haskett
cezanne
rylstone
medicate
byronic
sallinen
dilruba
imbuing
acetylcysteine
manrico
groninger
antichi
berka
nermin
svitlana
skillsusa
bailiwicks
azza
hadean
bisher
dargie
unethically
civis
röhl
droog
majles
stolid
backstairs
frankau
rawn
lifu
appetizing
sheetz
creedy
kgaa
carillion
alexej
oyston
penetrant
epicureanism
pokrovka
mongrels
vänskä
organi
gaude
unachievable
agudo
qadisiya
stodgier
consortiums
slager
dealbreaker
khalilov
keadby
passu
cavallero
traumatizing
multitouch
marito
razorblade
hallein
iaps
reveling
cedi
borns
lagrave
szolkowy
photostream
canady
loye
jijiga
alpo
alsup
heedless
grails
polarizers
senates
lappland
jingyuan
hbi
keppoch
upraised
nesser
postproduction
inflames
gnashing
sanka
neuropsychologist
talybont
magirus
trammel
aloofness
contrastingly
nucleosides
despierta
cicotte
teertha
peepshow
lemmons
snailfish
wellings
rafat
gaonkar
halfon
gamu
cyanuric
cielito
shitting
tolkin
gradualism
kanellis
rynek
plagiarising
alness
ujjal
weigle
neé
naafi
kormos
fulbeck
smudges
interna
dénouement
molenaar
perfects
whee
ninds
aping
gelled
huws
goffs
russophobia
loga
doji
firkin
kopassus
calaway
geppi
faruq
wptz
chalons
bigmouth
mcgrane
iara
paulsson
kawara
chinery
syncytial
fretwell
uiq
shawarma
capuzzo
suran
safonov
betaine
boyet
southwind
dupuytren
buah
developable
triangulate
montagnier
tertre
rajin
dangles
mursi
mcginlay
tolon
toorop
thenceforward
djordje
emunim
easterner
yasumasa
ragone
hardstaff
nneka
largescale
glaciological
gastelum
renaults
atic
overdrawn
dupplin
grimethorpe
implementer
twinsburg
duron
venning
chashma
sibusiso
tadmor
todeschini
rupnarayan
miseducation
chondrocytes
askam
brihanmumbai
samsa
gervinho
yubari
seogwipo
hilltown
adfa
jltv
auslander
apostel
trombley
lgt
fadeev
tufo
manekshaw
zeroing
ouidah
fireclay
coltman
deray
wankdorf
delocalization
coveralls
transrapid
maximalist
cloninger
alie
kirwin
bezels
flyhalf
kopaonik
milada
usdot
vrubel
valproic
inosanto
improver
formartine
pawcatuck
dunstone
pröll
montalcini
harring
karuk
salpingidis
berrick
palmistry
grn
lhr
sunyani
secteur
petrides
adis
guadalupian
kayama
pixy
paschalis
psuv
kalanga
eckhoff
maneesh
breitman
glassblower
stylez
trastuzumab
tszyu
brodovitch
medicals
churnet
karakol
sucessful
showjumping
straaten
capas
shaftsbury
hambleden
ambersons
milosz
werff
lulled
embley
carthago
panucci
pushups
hunn
swm
lione
blz
dsh
wallichii
sies
mtpa
roderigo
unscr
urbach
pko
elstow
umberleigh
yvel
montesa
semicircles
sjoerd
zenden
lanie
legard
unbaptized
lastras
penders
awasthi
trepanation
quarantines
bérubé
frameline
cohousing
mashrafe
silberling
paduan
dooming
greyed
locution
acog
azrieli
knb
marriot
indecently
popocatépetl
balearica
moffo
unwieldly
rados
muito
glitchy
hibernates
quasicrystals
jaggers
dismally
arat
caber
hevia
argylls
nairo
milliliter
radler
esguerra
uziel
asra
ikiru
feckless
shishir
lahav
chuckling
reimposed
inerrant
winched
feil
jurvetson
cyclosa
flagon
atack
bidault
signalmen
tth
disclaims
isasi
luzi
unclassifiable
doling
macba
glauco
recliner
mahin
dubliner
kanchelskis
evenimentul
fucus
falkiner
dorée
kalaupapa
eglington
scelsi
stidham
emelia
jallow
triply
zax
phytochemical
woodleigh
tamburlaine
klondyke
unsullied
zillow
gigha
nikole
sanni
liotard
ashaari
huka
wistfully
condover
seymore
rosiglitazone
weerasinghe
vimpelcom
chevra
brightens
sulfonamides
ledecky
roes
greenglass
olynyk
zanjeer
momoa
mislaid
ctia
kittle
byerly
claygate
savolainen
czvitkovics
muckraker
culicoides
westonbirt
seleção
tuaregs
dryads
threadless
wih
thir
musqueam
cloacal
jaspar
pedler
babalu
sequoiadendron
terrorise
reveled
coreligionists
breadboard
yucheng
dauda
madine
thinkquest
abortionist
hinkel
firehawk
enlistees
bitchin
comfrey
ruach
durak
fomina
filastin
faberge
klasfeld
curieux
haseeb
goodwillie
recoding
kwadwo
sidis
gomaa
obes
makonde
discordance
justis
tarporley
hamrin
aulos
fadhli
patte
neurochemical
bloodstain
adilson
hacarmel
jongen
kulit
stav
bunte
restructurings
duverger
massereene
modin
megalomaniacal
sedia
hamiltonians
scrublands
fallada
stereoscope
laimbeer
villella
belltown
mafias
kryptos
neger
grenouille
smudged
duyn
marshaled
kule
joynson
airlifting
trivializing
diverticulitis
sibson
lowth
wilted
paralysing
benesch
lasham
brasier
hucker
weobley
concentrically
djite
sheeler
intercedes
goater
ryokan
backfill
deistic
tioman
tokiko
gaus
denko
demoralising
makossa
transfield
pallekele
middlemore
mdewakanton
aimable
misjudgment
dangled
endon
schnitt
ation
chunlai
celibidache
zah
bhumika
maccracken
sclafani
higginbottom
packy
shaz
trapster
litel
ponticelli
sprewell
padlocked
ghita
chailey
santonio
altschuler
reposed
abdelwahab
dubus
retinoid
nonda
extortionist
tunnellers
pewaukee
deadhorse
linezolid
vogues
bucha
withstands
gernon
impinges
efb
bladet
fondest
frostbitten
unfired
turnings
pintu
mih
greven
parentis
marabouts
moralism
muddies
baruth
virgina
hosed
sneers
arashiyama
amero
sabahi
hotseat
doisneau
mustached
fattori
tabara
heis
resonably
saddening
gnoll
peery
interconnectivity
carrageenan
postlude
refiled
quadripartite
riccò
assasination
chambres
inbounds
roboto
concertación
thingie
kenyah
enskede
wrona
lul
pigozzi
cankers
byc
beida
morandini
ifds
tarpaulins
nystad
orszag
syk
unbundle
sicknesses
silkie
novar
gronk
llys
kjartansson
elyot
antrum
bivalent
nuf
highlighter
sipah
hegg
salada
schayes
ghouse
stoiber
teutons
emerich
szdsz
karakul
lenehan
sarkisyan
hodie
allanson
grigorieva
laettner
commissaries
heneghan
simplot
wiston
vize
illiana
fuchida
placentals
ahola
chike
reinvents
grizedale
mercaz
muji
allocator
lein
bucarest
overenthusiastic
gatta
probabilistically
imst
traversée
mesfin
collodi
pribilof
distresses
nicd
staffa
ronning
marloes
unifier
unhurried
knaggs
zerbo
hary
minish
rengo
aneesh
zhihao
jjl
approriate
tmo
girado
malines
stewing
hpr
wagah
corbu
jaundiced
undershot
biding
bryzgalov
fews
quella
knapman
taupe
mertesacker
cardenio
pegge
kredi
beneficially
sawar
glassmaker
ransoming
balasko
spaceframe
lullabye
centa
sublimated
scatterbrain
boonesborough
fosbury
kli
ardant
kissell
hirschi
fratianno
unitized
westshore
noson
accomodation
fairlee
haart
whannell
adath
loveman
sres
cowin
neuhof
taraborrelli
jurupa
panozzo
liek
mandaean
rippled
stik
benefactress
mucci
wohlers
bandhavgarh
metalsmith
bmb
pitstops
octuplets
cernay
yarnall
ronaldsway
valeant
danilovsky
durlacher
levenshulme
rubab
duranty
aslef
logsdon
fank
xxy
macivor
agosti
waun
darnielle
kangoo
unfurling
sweb
shambu
hendrickx
luxembourgers
nideffer
richford
lakshmibai
labar
nabe
imperceptibly
sallee
geilo
cachao
kernewek
peaky
blanched
closedown
alist
tozeur
tyas
alyth
renea
gites
khlebnikov
satou
slx
slaveowners
eho
pataca
particularism
hachioji
schary
prejudge
adduce
gigot
weegee
rootsy
shey
nikaia
magis
mclouth
ldi
lovelady
pheochromocytoma
gaffaney
sandyford
bloomquist
cappagh
mirer
deister
redhouse
knockhill
muffle
rippin
zowie
heatter
vostochny
sensitizing
perego
giovannini
varshavsky
estragon
wringer
stavans
lkab
camaguey
horáček
tonioli
polytech
jordyn
giandomenico
bubbler
heinberg
rubicam
toughs
dumpsters
koyanagi
lambic
tyseley
joyal
kaplinsky
newsmen
mcfadzean
avilla
ammerman
chatel
ortrud
vishnevskaya
granddaddy
stapling
nuada
albergo
colossi
hudswell
buni
eeva
beguiled
zuffenhausen
steratore
renesas
kenenisa
chakan
mindjet
baclofen
wahkiakum
panchal
germanophile
koningin
surfrider
flubber
salesmanship
freestyler
crenellations
ruedi
junos
agroecology
dayglo
captivates
demystify
algerie
blobby
guthridge
ecton
brockdorff
gambaccini
hepper
demodulator
mellower
suryakant
rotan
romine
snowbank
outrank
malesherbes
precedential
debateable
truls
taxied
lehel
eades
boere
submersed
cwp
tegmark
multiplexers
teddybears
bevins
roslan
gawd
shubra
rudbeck
greedily
katsouranis
teseo
thatcherite
shiela
whn
crescenzo
multivitamin
folwell
gerrymander
wmal
tazio
allene
crescendos
paroisse
baled
mhaonaigh
shaila
isizulu
dungey
mittermeier
consensually
commonground
kalinovik
kailasa
tfn
bullpens
sauropodomorph
mörner
pulli
kutz
shaa
hofner
sutor
sheyenne
cink
miano
ferrucci
lottomatica
mcharg
holck
dmo
schweickart
cannibalize
riboswitches
jawai
fascistic
excreting
lodwick
mozley
motorable
withey
lundbeck
dunja
veldhuis
counterattacking
exhibitioner
hoffner
booksurge
algimantas
behooves
xara
mannus
traurig
materialists
clachan
carribean
vpu
tournay
collectivized
fréchette
martiri
commingled
saltsburg
cheverly
bachelier
openside
bedwas
inri
selk
chunhua
benedicte
hassled
nastasia
grannies
ziege
sinisa
usmanov
unpardonable
ellena
heimer
helguera
reffering
adobes
dudas
tamen
suprising
ganay
flagrante
asatru

komische
hercus
schnier
leventis
scialfa
frischmann
khokhlov
llopis
gober
thijssen
ducato
komai
sholing
cookhouse
wachsman
nightstick
drydocks
cbcs
lasagne
sacrement
tearaway
ustr
kaiden
aniket
forbesii
seasickness
limmud
nimes
ivangorod
sankoh
indymac
dhb
astroparticle
cuauhtemoc
decon
gwennap
norgrove
kelvinator
kindlmann
waba
estee
goulette
rebekka
oximetry
sympatico
jingoism
dorky
subhan
böttger
microsatellites
pangalos
amad
quinet
shergar
berrow
lucido
democratico
bdw
kaypro
dongyang
onitsuka
xiaoshan
tuban
namazi
sandison
kobelco
maccormack
paraneoplastic
milke
jurisprudential
henrickson
brutalities
claggett
sheeba
countersigned
nutella
communiqués
chiana
riboud
taur
gingko
cleome
silverleaf
hoefler
curculio
katsuki
estève
duplo
lilya
morcom
canidate
harrowden
kaim
napierville
alama
avadi
handshaking
valvoline
asla
dipaolo
ziwei
rore
sibilla
solf
taedong
electroluminescent
kingfield
beaminster
gradel
lavoy
dufner
ashlyn
montolivo
huebel
megafon
billiken
vukovic
golijov
donavan
arara
handforth
mckissack
keret
reinfeld
pdsa
absolon
scruple
revved
duquesnoy
wycoff
stoxx
rezvan
highworth
unspent
stratfield
nuwan
pierrick
larra
cnpc
breadwinners
zorka
aulia
cedrick
hkust
nachshon
finsen
hamzeh
inhales
foglia
balasubramanian
scandinavium
hustla
lughnasa
chupa
creach
caughey
comptoir
wojtyla
malpighi
coraopolis
poya
mrb
zachry
arkoff
blomkamp
checkbook
looff
vorm
cruddas
ultrasounds
naciones
mauris
incongruously
ibrahimi
okina
unoci
tardigrades
rezzonico
peristaltic
nafc
catlow
huell
abrazo
kemo
croplands
kealakekua
woai
abap
vivero
pengelly
polyansky
gmf
heru
ingela
makhaya
godmanchester
kutschera
ignasi
ftb
leprous
hailstorms
apheresis
unexcavated
kerrison
supernature
egle
angelides
morlot
huaxia
sht
coralville
pcx
misandry
bujar
uzo
labib
gabrielsen
infiltrations
bartolucci
fluconazole
churston
mortiz
leotards
cfpb
fruita
tallangatta
ricarda
lancôme
pengilly
merwede
survivalists
setif
petrini
puiu
tasikmalaya
zahm
diori
witzleben
cke
brooksbank
mkr
roissy
millas
ricos
landrith
hakimi
dandekar
cooperators
avons
aldredge
rhm
inital
doggart
labouchere
jamun
rutelli
ucn
rimantas
mundesley
gart
cuidado
lamoure
huatulco
sojourned
temu
tutela
statuesque
cvb
formalisation
bonifaz
sadaat
sirik
scampi
obuasi
bloem
blanning
trista
leisen
garegin
frej
dahaneh
distefano
guesstimate
illsley
seeler
markups
torretta
quavers
pragmatists
procambarus
nawabzada
locog
igh
kohrs
mapmaking
lipsey
exploiters
exclusiveness
northover
carrabba
exacta
matrona
methow
thackerville
wjar
assistantship
blomstedt
valgus
onex
tolgoi
tmj
ybbs
tevatron
icesave
gualeguaychú
codie
mellifluous
aceveda
diablada
bicuspid
wimsatt
embroiderer
pittard
westerveld
bumba
bestie
reflexology
silversmithing
cotroceni
freakout
arsenale
hertling
suining
alimentation
thurs
stottlemyre
aui
wellsburg
hydromorphone
gants
millette
gonorrhoeae
takanobu
clucas
rockery
calon
plinio
brunelli
tidelands
hotson
ténèbres
frenchay
pansies
tagetes
poile
overcooked
kaukauna
malaccan
sazerac
quarterbacked
larriva
arcaro
caerau
compendious
simper
markdown
sindical
gookin
prinses
londra
srilanka
petraglia
eog
shirehampton
frothing
wwn
combiners
legado
priviledges
nazer
mukarram
belvin
talukdar
ashihara
cambered
myre
meins
damac
elmsall
vugt
derman
eylandt
rofe
astar
stefon
fisting
mocca
sunroom
reli
shechter
reactivates
monstrosities
pancoast
zazi
rebalance
foad
martella
hyperhidrosis
rockhouse
mnj
lingappa
lijun
eidgah
rcog
gordonia
gaisford
squidoo
bellocchio
badam
glb
pyres
ginnifer
rogel
lacayo
fonction
cleveleys
kuow
marlinspike
yongding
flails
wadlow
puttkamer
schizo
savon
polaco
banatski
whitnall
wendorf
fokin
pastas
vokoun
gerrymandered
biocon
oxymoronic
acclimation
aspie
orangi
savickas
thn
graystone
delarue
meaker
altshuler
prestonsburg
pequots
turnagain
noodling
immunogenicity
mariane
ritualist
plotnikov
ivona
antonioli
gatland
effa
militarists
venkataraghavan
davidsson
mahl
orrick
galbi
pujo
sangye
galanos
suominen
gillum
broadman
respire
rynn
montanaro
ultrasonics
ncg
rohner
pittas
burtin
koçak
isothiocyanate
daira
dietsch
mynah
uor
aloma
neergaard
igawa
rudolfo
kaabi
uds
bamana
carcase
chryseobacterium
haygood
mulki
saaz
malter
mcvitie
veritate
sycophants
accumulative
lygia
gleich
hawsawi
diversa
braunston
reiniger
hosoi
pharmacol
imei
acri
dinneen
rancic
edet
iványi
saho
chinaglia
rockridge
taslima
turkle
raymon
sgv
latiff
heun
mouli
jaroslava
bryar
esoterica
rafo
chizuko
obscenely
delf
clementino
turow
hilding
wiegert
bromell
wenig
badra
gloag
electropositive
snaking
bordeleau
murle
trebuchets
birling
ramalinga
rosendal
wainstein
garga
tragi
helou
selsdon
yakovenko
kassovitz
polycrates
hénin
fonzi
marzouki
backdating
gaarder
depor
fortum
locksmiths
selkirkshire
clausus
reaps
kehrer
harkavy
huimin
jimerson
poppycock
guestrooms
xterra
rph
torreon
goodeve
sanoma
khruschev
supermoto
coachmen
bolek
dandan
heffalump
stinkhorn
wilburton
fritts
beauman
impudence
nephrologist
clendenin
zabeel
hornbuckle
lactoferrin
moissac
wolinski
mehtar
inebriation
toile
sprue
mechanoid
hominum
gedge
marazzi
waterkloof
kaiulani
termas
laet
eléonore
aarushi
leant
supernovas
joburg
toula
maglione
futhermore
saturates
chesson
gunna
hitesh
scimitars
wallfisch
ghimpu
switchboards
deamer
righ
returnable
latonia
khaira
fatone
kobarid
proselytising
perplex
apperley
ferlin
schüssel
mitsuyoshi
diamantopoulos
widder
kaoma
smalling
belleisle
riyals
comedie
imminence
exosphere
pierro
amnesties
walbeck
ankylosaurs
skokomish
troutbeck
filippos
sloc
leron
leggat
titova
lifo
fleener
macaire
wigman
amenta
kimmitt
apoe
multidirectional
brixworth
katzer
sofiya
follmer
citywalk
ferial
leyendecker
restatements
nduka
ozols
weisbrot
althoff
javagal
cottin
librae
zhongxing
usrowing
geomagnetism
expungement
lael
sorbent
worthlessness
attunement
plaisted
mujahidin
baldly
mealworms
totenberg
phenolics
larcombe
culina
tlaquepaque
ladonna
senghenydd
percolate
kaska
protheroe
agnar
claudy
wijesinghe
luding
compered
sturdivant
pattullo
bcra
luza
vatika
baraja
hsfp
witticisms
marcu
somethign
twiddle
corvino
schmieder
wicksteed
pyeong
minicamp
purposing
andrzejewski
ohhh
astc
oxonian
coved
throaty
brined
bluford
unlabelled
pache
strelka
daskalakis
assiut
dabbles
treichville
inds
fortrose
cardi
bakare
eryngium
townline
spagnola
mulching
zentropa
badel
exr
fremington
workaholics
pulpy
eigo
afterhours
ventromedial
merker
dragonair
warrented
scrat
ifj
trotz
gauna
dekay
sidr
merrylands
wenjing
enduringly
sayat
cantharellus
jdam
greenview
nivola
easyshare
benkler
overproduced
ocularis
sarich
khabibulin
mathy
flatware
leadbitter
comedown
semporna
guterman
attinger
guinee
crosshair
lidstrom
abernant
getafix
duckbill
heym
knockers
bucktown
vivino
mamuka
goldsberry
harambee
xjs
sahira
onchan
stablemaster
usurpations
ingenuous
ebbesen
leontes
rkk
sportsground
ulusoy
yeow
recrimination
seyoum
bekim
arous
herkules
midnighters
picher
boote
gambara
lyke
rimshot
getman
agel
himyar
sadam
romar
cerqueira
katunayake
cappelle
haxby
pickpocketing
schmied
drenching
shinned
thiha
kabin
bagamoyo
upholland
siewert
opara
universita
actualize
phevs
denucci
katou
castmate
samatha
viliame
budig
supplicant
marki
lynds
prednisolone
mignard
chakar
benifit
ilizarov
soundcard
nannerl
kimmins
megabit
dones
szostak
persky
belliveau
pontivy
panayi
colnbrook
paulistano
ordains
dosari
coniglio
edlin
avin
chiccarelli
bouley
vitaliano
turbinate
israelita
auchi
basecamp
gisella
eurogroup
portmarnock
meise
colesville
teaparty
viroqua
bernacchi
svga
barlett
huffing
hassa
dinnerstein
benbella
pria
koyu
watseka
novica
pseudolus
gradkowski
neelum
duplantier
palmgren
grego
labbe
picha
ranford
anticipations
aaps
trarbach
crampons
gasimov
edibles
pronation
iaith
malpaso
manika
dahlias
ruvo
epizootic
luisão
popmart
kurtág
amfar
panegyrics
nazanin
streiff
msj
hardstone
timpanist
baranovsky
siamak
noblet
sesamoid
rbe
owston
killion
sibyls
ironworker
insincerity
unceasingly
seppala
handi
saguna
avascular
conjuror
dulong
paribus
nafisa
truc
jogged
hawser
wrg
dragway
blackjacks
epicure
parda
blockley
hullett
boyertown
shud
standardizes
vitiello
hogfather
bloque
enviroment
yokel
aminoglycoside
neuropeptides
nonimmigrant
marris
strobes
techdirt
touchstones
walp
formalising
beedi
anjaan
abrego
alman
tonsillectomy
lovette
especialy
mikaël
descarga
khorshid
cruddy
hyppolite
sangma
aphthous
cardellini
maniema
versos
reagon
imbroglio
malbaie
pinotti
kumquat
tammet
forsch
rotblat
gatson
jonni
hesburgh
canonica
instantiate
secessionism
tickers
eim
vlasic
missoulian
curtained
mahjoub
chatters
samlesbury
donagh
sheek
kokoity
zueva
pearle
pieniny
misdirect
gouw
catgut
chatteris
yelland
unprintable
andromaque
rostered
soldotna
mxc
castaños
stinnes
salomone
banzuke
livedoor
dooling
powderly
keynesians
gosha
geli
fukada
rijke
lanercost
burrington
easingwold
hartnoll
tgc
foudre
illusionism
eling
refortified
smithkline
guerriere
tobyhanna
ilich
sumulong
leninists
gamez
uncitral
embolden
mandera
audette
mozes
soulsby
bossing
majestically
creus
brazzi
mujuru
sarao
hekou
groynes
coope
blegen
wintersun
agli
geldart
thersites
alna
axim
schutztruppe
defendable
moens
yassir
chinami
hgt
hippa
cing
ksla
hampe
stonefish
pashupati
norbertine
clawhammer
darzi
zaveri
ootacamund
psychotronic
marmande
stoss
ovitz
lambchop
velda
moncloa
plaits
daulton
migi
tese
mangham
cantillo
bianchetti
vasiliki
gouws
heliospheric
fbe
barwise
awls
mantaro
bogeys
maloja
huhn
midcourse
inkblot
gayfield
erga
nchc
saadawi
pecher
lates
regresses
fryers
chael
picturehouse
washbourne
shope
giffnock
utb
airburst
lembah
blackguard
parreira
kawan
benso
thery
eboué
snoops
yaws
bleeps
malnourishment
etranger
microburst
alrighty
recombining
icrp
zairian
transhipment
luik
samedov
millage
ducker
acey
empey
somersby
sternfeld
westridge
jiggle
venters
combles
laster
camerini
dawgz
tuiasosopo
featherless
lefortovo
tallarico
ahearne
luneta
savarin
wiltern
perranporth
curdled
tildesley
cissokho
akathisia
poundland
valencians
rovelli
arbib
homebuilders
risin
prazeres
kokopelli
vittoriosa
biwott
avais
antivari
anjani
vago
runningmate
valediction
lgc
slickly
raddatz
abre
devananda
multicomponent
razali
dubb
discription
mozarts
swooning
calotype
swaby
richartz
headmastership
adichie
hardangervidda
flaunts
arbed
rematches
kingsale
tottering
narcissists
complexions
brunning
lipo
borings
forenames
zhiming
hubo
cornick
dogleg
pummeling
schmeiser
marketshare
racecars
yoy
akn
fondre
beguines
targhee
ossorio
tln
bryher
mewa
turrialba
keisler
maccready
neurotypical
coneflower
overcapacity
senio
yick
sodi
accreting
mcnay
crestfallen
heuchera
misericords
neurovascular
webstore
hornbach
arvesen
glaeser
haynesworth
takita
marree
quidam
jonbenét
dominie
stager
lamona
delamination
mcternan
perr
lusso
emiliana
venerating
moonshiners
limonov
flypaper
hoppen
sfh
supress
necesary
nextwave
soph
faymann
kingsoft
recuperates
wiht
waart
guanylate
lewdness
feldshuh
besmirch
wining
schweigen
lindos
rouble
furby
pirveli
fanwood
trimethoprim
botte
tiebreaks
scalene
tacs
dunkerton
echoplex
creusa
driel
gainsford
quartus
ringel
appologize
cryotherapy
passarella
kalsi
phooey
puissant
turmoils
nimruz
inthe
orris
pagewanted
revalued
mahu
mathworks
lamarca
obnoxiously
avrich
creditworthiness
greatorex
dandini
pavlou
vasp
kandhamal
rajang
kathimerini
abrikosov
comdex
harlaw
alvor
wabco
kandeh
bego
ktunaxa
kuniaki
polypore
juhasz
euthanize
undulation
talgarth
latos
braben
keesha
krauth
teaspoons
spiderwick
hellogoodbye
velten
undulate
unhistorical
buntline
pogrebnyak
delamare
lungren
hanway
kelner
gettier
gunsan
mtw
brott
unprivileged
druckman
dawlat
funnelled
sterilizations
criner
gasse
phencyclidine
beefeater
mankiw
gazet
untalented
siemaszko
autónomo
fallah
boam
blackrod
amnestied
camero
lipodystrophy
sankyo
excommunicating
cohere
scandalised
steams
totentanz
glyndon
polyphosphate
robbert
mukasa
amarcord
sxt
seismographs
blf
distillates
valujet
brutalism
aaker
perrotin
tejera
pakokku
bengoechea
naber
ballyconnell
menards
torborg
tomaz
unclimbed
hethersett
zanon
buzi
thermographic
simplicissimus
rodica
vieilles
averil
englands
peeing
nordsee
nurit
welchman
yathrib
hoisington
mingora
noz
zambada
asare
slurring
kalli
brandies
berrier
mackays
dunsfold
dehydrate
rippey
vdb
partite
ewelme
miep
purkiss
esslemont
mccoughtry
marter
montenotte
sujan
ium
glt
mackendrick
pabón
hagg
odorous
castelar
jedinak
enjoins
shukhevych
moskvina
keena
outa
daghestan
zulema
taiaroa
katydids
putts
fazakerley
hian
holguin
slumbering
rimon
rusal
etiologies
cavo
bonventre
maci
uncommented
reineke
kwp
torras
robertas
titley
naess
erminia
busher
hummed
faivre
disd
unitech
honeywood
bemoan
runup
qaid
bussing
zurbriggen
blattner
gerth
amstutz
bruntsfield
marchington
meriel
nonentity
dramatisations
rubidoux
botwin
stai
baranova
freep
maxïmo
alerta
lovefilm
manwaring
iua
spoonerism
kgmb
städel
dawoud
eyam
crumley
honghe
coches
grutter
solal
gosplan
sculptress
doar
parahippocampal
arsehole
daytimes
khambhat
ceteris
afflalo
extruder
ncsl
carême
bigtime
kawy
belyakov
lambesis
politicans
iliamna
edelbrock
barazzutti
wadhams
twe
routley
burbach
thresh
remez
uut
forgoes
phosphatidylserine
lulla
busdriver
elizabethans
greathouse
chenggong
daho
smolarek
teviotdale
adap
spohn
debes
bollea
pozner
heitkamp
wiehe
degenhardt
cunninghams
ewb
reawakens
localist
rearden
jackey
cartage
cassandre
whirring
seiran
untangling
temujin
grenze
sherkat
jarosz
nikoli
maladie
risha
asao
peredur
bops
vieuxtemps
haise
lifers
ultraviolence
takahito
deerfoot
jeer
jtv
inessa
destatis
diliberto
doosra
kostyantyn
pontremoli
cadaqués
hink
lakme
halutz
debarking
issigonis
birnam
jóhannesson
députés
nuuanu
cpcs
originalism
hancox
osis
manen
onkar
unchain
odermatt
uprightness
michaele
leggenda
artan
themba
megu
scudo
rogozin
mcmoran
soken
cleophas
jaycen
achondroplasia
oronsay
purnea
waynes
potente
allbusiness
ancrum
wanchope
yanovsky
ghannouchi
babilonia
infraero
sonn
faim
annemasse
greivis
fori
ringen
kneebone
fratricidal
crowthorne
tatti
storable
booka
breitenstein
pucallpa
federate
mcneice
kirschbaum
abood
envisaging
mcaloon
eridge
hoki
refinished
gillot
clingy
eigner
devraj
eurabia
pryderi
tradenames
turnbuckles
bonzi
baten
hilf
triforium
damacy
jolissaint
wasikowska
giat
meranti
westers
tunnelled
noteriety
reimburses
torphichen
tolstoi
geering
enroth
spinouts
evangelia
changement
uttermost
pirri
sleepwalk
eqs
kamille
sonck
sinuiju
woda
nubar
holmesdale
donck
tigrayan
ejemplo
sermo
underfunding
balboni
houseplants
heilprin
personalising
knockabout
uio
chokwe
nyai
deitz
steitz
phot
rutka
chenies
werman
yoseph
toolmaker
assim
loane
thg
minford
ggt
holey
mastrangelo
laundrette
jolliet
whitened
powershares
scheidegger
jowl
rehhagel
mazdaspeed
pathophysiological
ysa
kimbrel
prempeh
blampied
shok
uranga
girotti
donot
tager
underslung
unexploited
sallam
darrieux
longbows
neile
kets
risner
laverdure
dextran
panwar
olliver
bourdeau
reconfirm
cardstock
golisano
duenas
outsize
baratta
meyrin
camilli
rcra
hobbie
llanfyllin
sihamoni
lynnette
nfdc
puglisi
barragan
vanesa
vaccinating
schumaker
bindery
hummelstown
benouza
chequamegon
presteigne
optronics
addicting
mattera
udayana
eisuke
univeristy
myotonic
pattenden
albro
imparcial
peddled
xyy
vouliagmeni
pamp
ivel
aje
ercall
futilely
spragg
ureters
blotchy
dilithium
roederer
grania
akhlaq
koukal
poka
loriot
wilcke
dadaists
tomasa
underperformance
reemerge
zorich
iroda
batyr
suvi
kilar
bergholt
satiate
lockouts
proxmire
langtang
ensslin
lawe
sabanci
babuyan
diko
clangers
matchmakers
scarff
huei
zikr
koker
ajinkya
tonnant
choshi
bedloe
wpl
diopter
lucerna
kacy
natoli
grandpré
ballantrae
auchan
fgv
mahat
nrd
lcbo
leiser
bassols
edzell
dorries
dotti
cbz
kubicek
whatmore
varrick
slindon
anhang
dragonfish
micawber
destabilised
mannock
deady
ramer
pontarlier
pome
tinseltown
datas
praline
pittenweem
lapi
baucau
neurasthenia
divin
hcb
iberoamericano
arrestees
keaney
exosome
culford
woodmont
shatskikh
adroitly
aircon
consorting
caca
patak
thair
daju
reflow
massospondylus
iadb
belhadj
taffeta
krysten
miggy
gless
groll
jarndyce
oswal
encinas
croston
lutze
ogwen
decontaminated
wrightstown
eisaku
balibo
potocka
vomeronasal
souder
defuniak
kathlyn
irradiating
tarra
baaz
clissold
citri
inveresk
signups
zema
syntek
effing
hilaria
expandability
smooths
hanisch
friedlaender
mccaughan
recoiled
christofias
acocks
sapelo
leider
asphyxiated
malathion
viktória
acidified
dixi
quaintance
nuran
tokar
formia
roosa
nirav
fusi
padrino
valek
caas
raro
leupold
watchin
florance
epimetheus
harmel
alfonsi
sirajganj
barq
kashf
sampat
koech
tepuis
leskinen
hollowing
totalisator
osogbo
hvidt
sirnas
heartgold
bosham
betchworth
nicknaming
nehlen
ploshchad
beil
rotheram
hudong
dernbach
frade
declamatory
plasterboard
pahokee
tacrolimus
annamaria
matina
kfwb
middles
straitjackets
koppers
siddal
corail
nasda
billo
pimen
yaphank
cockell
youll
smartt
holway
tomasello
chimaeras
mccargo
jutiapa
allergan
tomasini
marchionni
swerling
curiosa
usualy
niyogi
camaiore
fetishists
vincere
mortalities
nbu
haddaway
monteros
federspiel
kasauli
metafilter
vao
prendre
mandai
mgi
pablos
ın
skipp
schaan
sike
aillet
allia
pritt
minge
vicenta
curwood
cinderblock
bowersox
modiano
mullings
untreatable
twangy
erian
tanimura
dahod
ramaz
mortazavi
screes
schlitt
metabolomics
miroslaw
vilks
booysen
apicius
interpose
mariquita
plausable
cayetana
kibet
morihiro
swashbucklers
marve
lobkowitz
hearken
fager
psyches
cavuto
determinist
compadre
soem
counihan
accies
thermus
vartiainen
shobu
awns
zadkine
specifiers
khorana
hazeldine
beason
transantarctic
campfield
israr
portamento
whitewall
dmitrij
psoas
avens
panju
mamdani
bashmet
muchmore
ksf
addey
dawna
nyassi
lamoriello
spiegler
massingberd
cameri
arietta
engorged
lupica
boyens
meshwork
borked
boze
fockers
laplanche
felicita
crapp
lomi
alerte
hatchard
enchants
unstudied
tordenskjold
algonquins
batmen
prosopagnosia
bamboozled
alaoui
comixology
korobov
lambrechts
scornfully
farooqui
ditchburn
dominicano
erasable
metrobank
urbanski
duellists
allwood
qas
fpb
mebo
footstep
backrest
glm
dunmanway
corbetta
phalloides
hyperpigmentation
eskin
guardino
ilike
michiana
moderations
cornishman
evinces
judes
zhirkov
lukowich
bronchiolitis
rokko
irritations
rexhepi
shugborough
galecki
garron
torstensson
pruyn
atomized
hemed
botello
poteet
surikov
vladek
chequer
caballos
awori
regiomontanus
ewhc
yerma
denitrification
bohanon
êtes
advertorials
stockades
nudelman
tarasenko
siona
fenter
brener
chiaramonte
minghui
malikov
mauriello
radjabov
springtails
terefe
muqrin
scullers
butthead
intec
quach
peko
advantageously
flatfoot
marí
oklahomans
clemm
sidle
netweaver
banyoles
voima
parmigiano
tredici
malakar
eday
dapple
urbandale
urinates
bolex
utaka
wre
treffers
finntroll
lindenau
headrick
expatriated
mogilny
portlethen
faran
bluebottle
fourplay
goog
westtown
layon
abare
dfff
kolehmainen
mowrer
sisera
tobira
kosmo
neocons
fumarole
iniquities
michale
suppan
bisphosphonates
asdf
dapa
pierino
twosome
shehbaz
ffynnon
andruw
minicab
shigeyuki
miret
mobileme
zaide
comptia
staab
nyarko
satrapi
philipstown
lovitt
lancy
donnay
bustani
rosler
­
bellos
piasa
garnsey
rotha
borchard
intercompany
bloop
pccs
simoneau
isachsen
annetta
brandts
crossbreeds
sunning
movida
maddened
duetto
swindles
midp
closs
meningoencephalitis
autogyros
mida
heatseeker
hakuna
piggery
emancipatory
minko
lighty
liveth
dellacroce
mcnee
elastically
ledezma
haeju
wasmuth
updrafts
nuruddin
mafikeng
xylophones
carvell
gottehrer
ibraimi
borley
małopolska
drainpipe
nicollette
verbascum
lickin
stard
schuldt
singler
nabarro
sagacious
lashari
catchier
mronz
wildernesses
starmaker
simonini
armenteros
naus
daat
ouali
essl
rhr
hafer
romay
whitesburg
eusa
noughts
lej
candombe
nonsence
safronov
smarta
alphand
xibalba
vladyslav
rasha
vajiralongkorn
vanbiesbrouck
hatfields
pilfered
tiktaalik
bledel
reynes
koppe
osawatomie
satpathy
avraam
soleri
zebo
skorpion
daag
brüno
unsimulated
evelio
betuwe
quek
polycentric
kellam
gennevilliers
buggin
andreini
cerveteri
rootkits
alptekin
smugly
seductions
grocott
tadros
quila
dewlap
cleddau
birdlike
clepsydra
highrises
mkz
priorat
hetrick
rembrandts
pratten
baquba
rtw
alvie
yorath
ptm
chilis
trysts
cheilitis
baccarin
munte
amans
kahnweiler
mjf
jamu
noer
bynoe
rougerie
exploitive
henpecked
shahe
abano
ueberroth
drage
abberley
pugmire
yangsan
shohreh
perlite
nijinska
hiace
sdlc
photophobia
benenson
amagansett
magon
abbes
palates
ttk
tripel
imaan
woolery
madalyn
meebo
bhavesh
bishopton
domokos
gimbels
nasogastric
pnw
rajhi
neak
lwd
nazione
schleifer
gummo
charisteas
surmising
benon
suwardi
swen
mycorrhiza
zohn
ksfo
ledwidge
rotoscoped
boisbriand
whiteread
kostin
gladney
krauskopf
manzai
sted
kowalik
extrusions
manimal
brehmer
pescado
kagen
matusow
mallette
ronchetti
kovic
bislama
argillaceous
cutcliffe
helmore
chador
hypogeum
regn
fgb
cherone
buist
blache
mouldy
wkaq
dorrian
climaco
hudnut
sahashi
rouf
aysel
madeup
kirovsk
legatum
grippers
gerunds
snowdrops
agor
roderich
segmenting
fontbonne
staffords
brr
amorosa
karonga
oii
nardiello
vads
avod
cochon
bonini
econet
maziar
alicea
aveo
podiatrist
velikiy
preller
dollman
stoloff
citylife
margolies
humpbacked
costi
kepala
cryengine
guérard
maila
donoho
balat
caprioli
relishing
awais
coate
talaq
rlv
freights
mountainview
congressperson
nachbar
jouffroy
thermophilus
trevis
marilena
gritti
cheli
schoenstatt
pirjo
mugar
existentialists
eeee
haina
crossers
govia
leptospirosis
jundallah
hewetson
foday
curettage
soccerplex
tsuchiura
helfgott
mlbpa
inb
whipsnade
honeybourne
admixtures
hussy
dystopic
everthing
tweener
philosophizing
schrade
menne
parri
kuramoto
mosqueda
googol
siarhei
noren
koseki
barrandov
daisey
keffiyeh
huyghe
barrelhouse
berkovich
vanderlip
moghadam
kiah
matancera
hotwells
rimet
urgel
evar
papes
braggs
urdd
lvr
unmemorable
yadegar
dewing
bovingdon
toxicities
bahian
schneck
russkoye
mylor
vrabel
tymon
schmeisser
bookbinders
myfc
italicum
leterrier
karstens
wilhoit
collingsworth
bawdsey
susini
chiriqui
casspi
koyo
dayle
zubaida
alvernia
berroa
pothos
niwot
varis
longwall
africanized
endresen
seman
gign
ostin
shintoism
krays
clobber
attal
highnesses
rába
savitsky
lagman
yips
francesconi
lafosse
spener
tahreek
leashes
oncolytic
mccahon
epinal
tawan
adelboden
rgn
infor
matton
debusschere
wgst
corofin
coletta
shielfield
chegwin
carwardine
imer
kleptomaniac
ildebrando
centros
kyat
izo
satine
ricoeur
chavous
exacerbations
jaimes
subregional
mcbeath
domenick
hpu
osbournes
ebx
azcárraga
belizeans
sillman
dxf
herrerasaurus
thaddaeus
fukumoto
hemma
usgbc
yrsa
rockton
downunder
asle
djimon
thornett
hollering
kime
chatuchak
bailyn
syncom
schiaffino
radulov
kreiner
willesee
murawski
bowdlerized
laforce
caspe
checco
villalonga
zakharova
shearim
saly
samoyed
hasbara
olay
azria
challen
lenexa
natesan
untranslatable
judice
icma
northmoor
odenbach
cems
gumy
doñana
lalaurie
dabashi
shostak
aberconwy
kajiado
nimmer
unpronounceable
furcal
corruptly
giertych
hochstein
danell
dalmore
capece
urashima
irrigates
limpid
swu
diaphanous
wxia
preimplantation
ulv
mclure
mcqueary
sizzlers
externalizing
schwer
techsystems
innerspace
utrillo
hanjour
maraj
henges
rothen
sykora
durutti
bgi
moralising
toposa
redecorate
castellari
olaudah
incomparably
prunier
lateline
sicario
belotti
simeonov
solondz
rancour
gemsbok
rickwood
guelleh
schoendienst
boodles
heatstroke
musawi
satilla
digestibility
blamire
choptank
fhi
tourmalet
archbald
unlogged
gattaca
tastemaker
jonquil
khris
sniffed
oetker
heteronormative
oteri
proietti
teff
danil
televisual
bigleaf
combusted
shinko
lechlade
mezuzah
drori
joschka
sleeplessness
arcadium
insole
luminaire
nonvolatile
gaugamela
burzynski
jallieu
kameron
ackerson
grou
cairoli
mowden
aspirates
esbjörn
kreisau
spenny
kristiina
intraepithelial
tusked
wittenham
jagoda
unhyphenated
misremembering
dimucci
arlan
montuno
gez
particuarly
mazarine
voyer
inagua
flaxen
cuillin
mamilla
utilis
novosel
gurvitz
bozzo
cortines
popularizers
dukhan
parochialism
timoney
libia
choroidal
jostled
denílson
cpy
stanciu
wedging
editorialist
costlier
hectoliters
winberg
authenticator
hasen
lotman
marazion
sug
mancow
isiolo
sogang
bodenheimer
zipping
jlc
quiera
moulavi
drobnjak
wheen
latasha
bargo
palantir
biobank
esson
melandri
greenhaven
moussaieff
crippa
swit
jests
htun
bracht
hofland
becchio
oleksander
shiekh
radiographers
spectating
gobbler
verulamium
saulo
shaposhnikov
hearses
mezquita
ayyam
grouting
scrapbooking
quietism
markiewicz
windhorst
ogof
replayability
vanwyngarden
cassaday
debevoise
octopi
eiken
shinta
degla
akre
habanos
vinick
myostatin
bummed
steinheim
protectionists
yanase
wraysbury
bourdillon
lacewings
harassers
lightroom
rajskub
kamaal
veyette
ritzy
polke
illescas
emmets
blackmailers
easby
nool
nycha
yatta
garrulous
siggi
couceiro
collishaw
ponente
syphilitic
kandasamy
hyperlinking
ranville
enunciate
surkh
gnawed
sefi
ajn
talvin
wftv
schnellenberger
stri
attas
ranjini
bricklaying
bedecked
ghika
almanzo
jibs
ergun
kalanchoe
arzhan
dening
ritualised
pekingese
gakushuin
repartition
parvis
stardate
phytosanitary
tuer
litigators
ribonucleic
delma
swarts
zdeno
zohaib
stojanovic
tadano
fulsome
waterproofed
arundinacea
caestecker
undistributed
slatted
ferentz
bouteille
neccesarily
babini
bareiro
glessner
takeya
ankers
countryfile
shenker
luffa
haraguchi
vybz
incontro
strop
whisperers
oncological
moyra
teuku
siddig
parer
castillos
thomlinson
lulac
doubleheaders
margy
tache
cockspur
morgenpost
unlikable
enstone
skowronek
polegate
paba
monolinguals
eget
lorence
bermeo
motonari
metter
hooten
pelissier
braamhaar
shailene
aamc
tahi
flippantly
herstmonceux
helipads
neth
balmoor
apexes
ninagawa
icfi
yehiel
objectifying
fletchers
anally
cambra
sequestrated
lmao
asiri
meze
geisinger
fishermans
georgievski
missin
herberts
winnow
powley
pyrites
minner
comparators
intrathecal
norddeutsche
temporada
deshaun
zalgiris
wdg
farda
kompong
brettell
tirona
passaro
vagana
samwise
cumnor
thermic
arensky
polian
kreisel
inspektor
deutschlandradio
interspersing
acho
discothèques
loewi
phf
ribatejo
worde
conformism
screeners
burtonwood
blandness
chloramine
gelai
zubizarreta
grommet
sodomites
gratuities
contraindication
varlet
onboarding
insúa
halawa
rosellini
lusi
ferrybridge
biosolids
dynes
septien
gerasim
apoplectic
mirabello
merest
hovel
ferdous
fallsburg
stickle
gapped
macrory
pesqueira
detoxify
pavelka
marda
nemerov
pedophilic
nervy
uthappa
kurylenko
donlin
recidivist
schwarzlose
montană
thelin
bronchioles
heaved
haddenham
baculum
daubed
roxon
tokaji
altuna
dyrdek
oompa
klaxon
overfished
magrini
carmencita
trengove
spj
hardcovers
mccorkell
buttressing
kdd
extortionate
eslami
grandage
fleshtones
ncmp
swags
coline
percieve
mauceri
missives
alarmism
haltwhistle
honved
distributorship
dalling
asencio
mcelhatton
halcon
banpo
brmb
kohls
schwabing
maitreyi
jewson
linebaugh
aracoeli
cresskill
laforet
scantlebury
nrma
lambis
sentier
corpulent
oher
jessalyn
kidson
kuzmich
mosk
wenk
thornborough
prodigiously
mastiffs
pejoratives
padley
callout
gretl
journo
khamsa
deripaska
laville
changxing
eatons
samur
ficker
westchase
siol
jhabvala
volle
blindspot
cristescu
tsankov
culcheth
geremia
bestiaries
fluorocarbon
maoming
calado
brocchi
whippoorwill
gurian
cuticles
heisey
taler
bramblett
possibles
debeers
osteoclast
gleed
crossgates
riegle
oldboy
nadra
yasukawa
yoffe
yobo
chateaugay
rehoused
nucleate
iruma
weigert
tanar
méridien
ressel
kaisei
arrupe
rfo
slipshod
withernsea
castros
hardcourts
nare
oligarchies
slink
satyanarayan
acaba
buczek
ï
vilkov
praziquantel
yutong
asiaweek
tortora
brotman
cordwood
volage
peguero
rdu
taketh
monetized
quattrocchi
signis
tinning
virgile
schwartzberg
kadeer
peover
thiamin
otowa
kyphosis
fpd
ascertainable
butterball
hawkshead
faros
jpm
lacertae
rumpf
ridgetop
bargen
difficultly
driesell
superspeed
geach
concertgoers
cremaster
opata
arundale
hurlock
lemmer
kirkkonummi
ganea
milen
technicolour
matheu
weasely
lbm
maeno
vanwall
baddesley
lespinasse
lipmann
pichu
leeder
haendel
mitad
dant
pedicab
gándara
hallan
paoletti
imbrie
khesar
legitmate
pigsty
xetra
sahuarita
arundo
deiner
understate
cholecystitis
cadigan
sebert
glenmont
moltmann
overeager
nmf
vionnet
adressing
toxoid
paravicini
digue
centralising
biosensor
dickhead
ackman
kheel
scmp
privatizations
calmar
repaving
shasha
freas
procurer
sundered
atorvastatin
vituperative
certian
reconversion
muddling
skidelsky
yanshan
mountrail
zakarian
propositioned
hamstrings
tropism
giorgetti
bmh
coddling
axelson
slighting
gravesham
checchi
phytoremediation
fitzwater
psuedo
myleene
rosenwinkel
cnm
macaco
bardney
cefalù
ozias
miniskirts
kouyate
tiggy
rockefellers
frühe
kibworth
benadir
miscounted
ovenbird
strout
inundations
igad
monheim
moomins
pachanga
llyr
cheeta
dearmond
sundowners
goed
nihilists
okpara
slouching
ostra
godín
calcraft
wombourne
villela
placita
starlog
tabassum
jinji
releasable
psychotherapies
latecomer
pervis
belleek
johana
termez
odet
quesne
chancy
prisa
magix
lakey
activin
rattner
pavlina
newsmedia
allrounder
basketballs
kailin
freel
saveur
dileo
sadaf
formalistic
postulant
kmpc
minkow
splenectomy
lors
kbd
donovans
towy
inessential
maharoof
sugai
getchell
freebies
paiement
jois
righty
kafe
empathise
quetzalcoatlus
stieb
schiro
kuruvilla
stavisky
leeves
tmv
oktyabrskaya
leukotrienes
karnac
cosponsor
descalzos
jarett
magnetar
baljit
whitebait
savoyards
namings
sugimori
buffeting
fumagalli
edrington
meggan
dozing
rosten
haston
mema
tvl
tigar
quacker
fillip
gaitan
sanjuro
edgerly
piette
festo
ssnp
lustiger
unsubstantial
wrgb
pohlman
cofield
cobley
bruckmann
hardier
providenciales
bastianini
dryly
lombardozzi
oea
tueni
skitter
koushik
karnik
smiting
scabrous
syriana
dowe
fringy
hemingford
laca
banyumas
treebeard
mosaddeq
saren
skyhorse
bridgett
resi
professionalize
sturmovik
nans
whu
peachum
rabinowicz
titrated
chicheley
virta
extinguishes
casiano
bodyboarding
fcpa
taicang
transgressors
galmudug
piffle
danta
vermaelen
parador
forsythia
kempster
cubo
pries
imprudence
belova
babbo
garsington
remmel
eop
wittily
zollinger
applicators
speake
diffident
scrounge
avit
torok
gerstenberg
shiwei
goomba
auletta
lusby
kipping
antbirds
ferm
stanfill
lugares
steeplechaser
blaik
lavette
zillur
hustvedt
misapplying
laffin
kerchief
zeina
akinnuoye
murer
arrochar
rabidly
kripalani
johner
buhr
deseado
tsuno
nyhan
kurek
mushrooming
compston
firn
jiefang
changqing
handwashing
terpene
schuylerville
kyril
cautley
stene
iesus
silicones
kjos
vincristine
urbanites
cobie
kuechly
incompletion
masucci
greatrex
bodner
cartuja
femto
terrel
dacheng
qissa
tieling
basnet
gyn
dynamis
gvt
slicked
soetoro
kljestan
hofstetter
benecke
becht
tulpehocken
grinderman
radioiodine
produits
pohick
chatrapati
coolbaugh
luminaires
montreat
acklam
kgc
eip
piques
corked
landsburg
sandes
stever
frayser
dendrimers
paia
igd
communitarianism
attapeu
wdjt
baiul
michetti
plumbed
abysmally
zavaleta
hawkstone
antropov
xinyu
tornillo
asman
sursock
xsara
lannemezan
kwtv
pollin
wlad
lemi
trichardt
freestylers
brixia
sccs
ethem
fluellen
sandersville
braunschweiger
reiten
clopidogrel
eulogised
sdks
fishback
fraying
ugarov
eijk
navjot
ringrose
daini
coital
mulugeta
haaland
teutul
recategorised
larabee
pashupatinath
grimal
alhaj
raymore
bourlon
mattino
kanerva
mtas
mihama
dulcamara
luffenham
jaqua
suef
kemmler
ngm
jackdaws
oppermann
hived
meeke
elish
dargo
unlikeable
dafna
sunninghill
yamaji
tennesseans
atomizer
marcelli
teegarden
trapezoids
sammer
homebuyers
bensimon
pacta
washouts
danso
chassé
plock
solitarily
bsba
oceanica
hijras
kiyomizu
haefliger
bvn
buntine
psychopathological
panja
wchs
villamizar
hougue
arses
mandore
raveonettes
heiligenkreuz
jokhang
zarza
chatelet
pacepa
attiya
ivarsson
northington
thialf
glenfinnan
nemoto
messam
cicco
karolos
restlessly
eppa
biberman
wallah
pilchuck
directe
deuchar
dimethyltryptamine
bejan
erekat
abraded
trackwork
nagla
noby
wpri
ammonoosuc
tilefish
candelas
vizio
hayashida
bluesky
peppering
footlight
delegitimize
epoc
arci
vorpal
unacquainted
goyt
paralyse
lipponen
hobbema
alleanza
ejidos
gachibowli
fructuoso
faumuina
moiseyev
valarie
saintfield
sccc
chaiken
doubledays
mergenthaler
ihab
senatore
cyberinfrastructure
dhd
scaglia
villian
scald
lycanthrope
vural
fondle
bressler
wiba
bevill
bifurcate
guarisco
akhmad
bacco
tautog
hierophant
sardinero
photoacoustic
irini
rungkat
berrie
harkes
cellblock
akhurst
hurtig
sharratt
thurmont
causley
monegan
cleckley
ansf
crumpsall
dimitriou
giannakopoulos
tweek
hacke
caspari
miscanthus
retuned
kaluza
gunby
curdling
decisional
delamar
minson
fulminate
begumpet
robinet
schizotypal
healthily
auditoria
watty
cussons
miva
devyn
hession
nchs
negligibly
sundering
kosovars
brodick
photoperiod
lius
cardia
weetabix
grehan
hho
netrebko
resko
eurosceptics
tenenbaums
mhg
sugared
euthyphro
herkomer
mirando
fitial
casei
biomaterial
defintely
pettengill
munchies
nambucca
gilyard
tanin
sharpsville
brehme
vengaboys
pazin
videographers
galal
cording
distichum
ludic
zilei
clickers
dirrty
airgun
eichenberg
heaves
wahiawa
wyland
sportscasting
melech
nuwas
typesetters
susser
wlwt
walenty
nisra
lehne
hawiye
itinéraire
pachulia
gabonensis
aeo
neumeister
unshaven
rapin
kertesz
galwegians
karamoja
santacroce
mikulov
caniggia
goodheart
radiographer
hutz
turu
serenading
chide
ufd
mahout
nickie
moku
minuten
kwale
villus
radmila
tonda
crossbencher
abating
punchers
oger
cornerhouse
climatically
litman
borak
megaplex
fiordiligi
hruska
tanabata
marcellina
holdren
emperatriz
soopafly
biffen
cupples
janiculum
lillias
arigato
glimmerglass
gladrags
tossup
decamps
profilers
pariwar
cantin
quieting
barsi
currying
douri
schuch
knp
sallisaw
expends
arbeloa
somaiya
idabel
shinseki
koteas
alderdice
ghoda
refinanced
expurgated
zych
ambulant
borwell
gruesomely
procreative
unitedhealth
huckett
areco
barbourville
maita
cael
cowlairs
ispahani
bulldoze
buchberger
bleiberg
daubigny
politicizing
aicher
abruzzese
dabas
rde
wordie
kensi
rashied
fanad
skaff
bollen
indivisibility
sirkeci
tradescantia
rachele
fayyaz
doidge
ramallo
meningitidis
cheddi
londesborough
dangerbird
osier
marsico
uachtaráin
muntaner
tucows
hinesville
rusin
wvga
hyltin
hachem
solor
frightfully
panicky
ozuna
hazes
borowitz
sunwoo
krarup
overstrand
pisarev
cspi
slbms
soundman
islamized
kamps
dysplastic
insuperable
capozzi
ajayan
shibumi
catastrophism
mokri
muchas
urbanity
ioannides
gme
flandern
springside
stegosaur
toles
colletta
begrudge
baquet
badland
kleinwort
cbot
wellen
hidi
poom
buggs
hda
yanukovich
fibrotic
mcareavey
skempton
platner
twittering
vasilios
checkin
tamarac
pignon
potshots
mulberries
makua
leane
weininger
zipline
russolo
brandishes
clathrates
snubs
unsupportive
kalter
alyeska
tpk
herschend
belnap
kurahashi
steinert
tracs
baff
kopec
nasaa
saltergate
pichot
klc
hasting
slaked
frailties
bfr
kumaritashvili
zehr
yeshayahu
rayos
sarde
toils
ellmann
birkebeiner
parrying
anae
meetinghouses
ponni
kakai
niac
zafarullah
commercialise
procaccini
maxene
frodingham
condenet
ghgs
braman
deyan
museeuw
madej
langbein
bicicleta
wunderhorn
molé
adenocarcinomas
aubade
kiryas
maximises
cacia
hydroplanes
cemetary
cafos
phorm
fauntroy
straggler
djam
encana
glocester
barchester
monumentum
dewyze
gonadotropins
anderston
vercoe
rebrov
disemboweled
elrich
benxi
neeman
reauthorize
dennings
windber
ettlinger
anila
kuwari
silone
wilbarger
northpark
mcalinden
peeper
spikey
saung
ramshaw
qazwini
rossett
corica
txu
mychael
trussville
souper
retested
jauregui
unsettle
tilles
patkar
ucode
tuneup
glisson
karabagh
kendrew
rumina
gadara
grobe
paramo
fames
guiomar
epaulets
encinal
bandersnatch
countrywoman
okai
digic
piot
hacky
abbass
tuuli
limps
zamperini
aerocar
barkatullah
boones
prelinger
duranguense
hynd
cowgate
guelmim
hopital
ilhéus
bramshill
asfaw
heirens
cabanillas
bulatov
solter
drench
lignano
iod
wineland
spanker
unsponsored
cominform
soundstages
tomov
shoham
abkhazians
lanfang
rickardsson
nationen
chikan
imdad
mortgaging
distler
cymdeithas
chivo
shrugging
haidt
kimmelman
echocardiogram
dagfinn
pokljuka
ramanan
marginality
ridgley
moutiers
onel
fixity
anhalter
ravenscourt
reddened
fiorino
taimyr
mielziner
moharram
vxworks
molter
unpatented
fangled
eltinge
driskill
lulls
bayamo
tenner
giorgis
friston
villacorta
cheang
saddington
schwebel
seresin
tpn
enterprize
shirvani
personalisation
chrisette
itraconazole
stemme
cruzados
keast
delk
warford
monotypes
liddiard
fuxin
liba
moonshiner
ferments
lrl
jahani
correlational
cucumis
salyer
stadholder
acos
falzone
neuropathologist
pacu
sherritt
sereni
gidea
toehold
manpreet
gpv
avensis
rauno
bassmaster
euskaltel
haslar
althouse
balmes
chatroulette
stough
chikamatsu
samosa
deery
wahba
detwiler
herreweghe
benburb
mudcat
bereket
meachem
ticky
toback
ysl
nissar
urate
oberwesel
widgery
kex
wannstedt
obziler
sickbed
faultline
aberdyfi
amby
bandari
mulitple
kutless
blackballed
foundas
antacid
tribespeople
candelario
pauillac
beyeler
boersma
coalhouse
yossef
zakayev
iacopo
hubbards
swappable
weiz
whitener
elderkin
kasr
shourie
viorica
rewire
vmd
sodastream
blackhill
soemthing
kernersville
katas
sverrisson
consorzio
jarhead
thornfield
anticlinal
slik
klepacki
mudflow
appassionato
kaimur
anneka
kunstgewerbeschule
multiphoton
importante
alcazaba
altura
kyler
grackles
neshat
kovner
pharaon
molex
jli
cref
matmos
garum
ghirardi
chobe
reald
dwork
kampo
muren
alecia
paiwan
virola
shareable
mirabelli
mothballs
swissôtel
apostolov
bobolink
teitur
elfed
behaviourally
impeaching
bastiaan
dannemora
psychologies
rescorla
batha
shahabi
finckenstein
namc
zindel
meder
crk
shalabi
maral
ekran
nahon
mountstuart
afterburners
sexualised
enchaîné
arney
proneness
antistatic
jaff
sorrenti
quantcast
lasd
narimanov
tidd
rebid
cerri
buer
bogyoke
pardoner
hargest
kronfeld
parricide
lachiusa
northen
sulfonylureas
leny
rhenus
cleugh
regality
sakazaki
bundesanstalt
freia
fogler
schluter
kgosi
vxr
grovel
wfd
superfan
subcutaneously
wretches
azcona
yusoff
lipoic
thistledown
buxbaum
thumbprint
oversimplify
siregar
embezzler
dups
csrt
strömstad
whoppers
crapper
scheidegg
avalokiteshvara
rockey
azimi
grigio
bertorelli
cassani
flattish
qinghua
hewish
zulma
heusinger
boozy
tverskaya
atem
brochs
warrensville
labrecque
cjm
utterback
mirallas
tipoff
combativeness
yangs
methanogens
bahal
sleepytime
ohtsuka
sceneries
delhaize
brunete
craigellachie
carri
glovers
hosoda
obeidi
anegada
gyratory
teamer
sjogren
laotians
blonds
jerram
kjr
nonoy
cundall
arribas
laska
helloworld
ibad
malise
mobberley
fulgida
battlegroups
messiness
kamaluddin
tusc
disavows
phoolan
distributable
zingiberaceae
ogston
remold
pastern
farha
doula
commonness
effenberg
manges
vbi
esfahani
sterilised
qtc
kello
foamed
vanvitelli
aalsmeer
ichinoseki
belcastro
railsback
kurin
bergmans
minimalists
syllabuses
shigeta
mitchels
liese
munis
metaplasia
fesch
wiltse
hassanzadeh
radiotelevision
helmsdale
northwesternmost
auxilium
gulli
yiming
fiercer
malgorzata
slama
nefa
waddon
bedwetting
sokolowski
kichi
rizhao
deserta
fnd
moonves
connoting
bunney
amoung
precancerous
repp
drongos
neera
deno
juddmonte
cadzow
koregaon
tiptronic
cwl
polman
matahari
showstoppers
mehul
krispies
grainer
biomedcentral
cresco
propagators
dosent
vioxx
siskind
lgn
diacetyl
pulped
cajole
onlf
danford
weigall
hudlin
postdocs
uncleared
powerboats
turbomachinery
perpich
eccrine
coface
bronchiectasis
ramdin
gulmarg
kristyn
oestrogen
luini
pailin
keypads
sego
kemar
lysy
caulk
bbv
juskalian
muamba
monzonite
biegel
dracul
witbooi
pfund
ambassadeurs
chelm
mereworth
talcum
frivilous
soichi
ketapang
tantawi
redraft
schoon
haizhu
antifungals
cropp
armine
twelves
nazarenko
musicares
yangtse
alric
gelais
escuelas
wsmv
temperamentally
josslyn
mithradates
fanbases
daleville
mumin
plasterers
facchetti
elmdale
gerster
whitwick
swartberg
vagner
serlio
annina
elron
turnour
kente
nikanor
contento
kleppe
dishman
mcgonigle
cenotes
wwiii
redgate
vignelli
muan
nvq
jiangbei
aschenbach
mikoshi
onchocerciasis
romanced
studding
fishmeal
speroni
kitale
urla
berel
jaswal
kaempfer
tollways
terna
macrozamia
grinling
funt
guardroom
meridiani
surguja
overtown
imipramine
paani
syncrude
dermer
youcef
veracruzana
runtimes
pasteurella
riyal
coppicing
weepu
prica
eastcheap
superstorm
sois
kaura
asara
kuffour
sunnat
zisk
happyness
qrf
allitt
ruhland
mcguffin
mandoline
bothmer
iparraguirre
straitened
coordinadora
webelos
hudler
quaich
tychowo
meadowsweet
medicea
dushyant
satchmo
breaston
mimpi
scorm
kemboi
moosic
tean
anosmia
oestrus
dayman
jarvie
khir
kirra
readopted
cimetidine
mcshann
cauchi
mystification
tidball
scioli
marquês
prader
saundra
polesworth
religiousness
moakler
sanitaire
bircher
fantasist
nostalgically
numbingly
cire
faryl
meconium
sampans
corke
seperating
fadul
waupun
stringfield
wegg
soulsilver
minky
kirkup
zhongyi
cruncher
ibma
enquist
rengifo
lizana
charring
arreridj
facio
kizlyar
mohseni
schoolmistress
knaben
pulpo
moncho
devadasis
ganson
noya
osiel
luers
smote
stroboscopic
defter
ekonomi
laetrile
kxas
internalised
werf
anokhi
ayanna
terrington
markieff
bbw
birgunj
alfi
qif
beaird
kansans
burketown
lemche
dilton
depressor
wildparkstadion
saidov
altimeters
hannie
glassmakers
saudek
rhodia
dippel
herff
yeddyurappa
pego
kassab
dunnington
britian
craftmanship
motl
menevia
joeys
dingbats
fowleri
remap
spiceworld
woetzel
solovyeva
oldschool
cisg
incandescence
cultist
rockline
rajpura
uksc
khalilzad
giannakis
winging
delineations
gregorc
gurbanguly
tautou
outsmarted
wonderboy
dvora
backdoors
karabekir
sesimbra
gonchar
arrivederci
moonwalker
demonoid
alfasi
uggams
ravil
teals
nibbler
tadesse
ronay
unsurprised
hsca
olliff
izaki
interchangable
ascertainment
feltz
ikechukwu
serviceberry
libbing
croxteth
latrell
slogging
shapcott
yaakob
gérin
aartsen
maniacally
miramontes
ravenstahl
pluralities
sundararajan
knoedler
investissement
incontestable
liatris
friburgo
masturbates
appetit
cannabidiol
decc
ballyfermot
markovits
sherin
ghk
tatev
lamotrigine
pitifully
centr
marvi
andenes
muktananda
bloodily
hommel
cartogram
qmc
marai
agnarsson
lululemon
juergens
aicha
panhandling
toyne
cherishing
rooter
radecki
coes
leandra
thua
unifem
capetown
enlightens
nishat
santisteban
elion
garlington
vorstadt
roubaud
javaid
fistulas
bootstrapped
hakko
sigfried
hatoum
maytals
millwork
stanky
collaterals
scribblenauts
capleton
miac
ifam
dlo
coppock
roboticist
kinked
yapa
roofer
jayco
mazzara
sluicing
chidgey
shenk
hbcus
prospers
cambo
pbn
macwilliam
huys
balt
karanga
frightfest
ponomaryov
quetiapine
goikoetxea
wlodzimierz
lcross
mcgrigor
mufid
ramez
greenhead
anac
backstreets
takimoto
radiopharmaceuticals
warlick
eizo
fiza
manacor
molave
htut
vivan
kitgum
teguh
ternent
mailloux
yearsley
filomeno
unrolling
misir
warmongering
buttonwood
idles
parni
keolis
culdrose
organophosphates
gaillac
boyette
almondsbury
mariss
batmanglij
sureste
benefield
nidia
deviancy
zizzo
summersville
growlers
rangjung
cips
moonroof
mangabeira
monopropellant
pushdown
brasstown
andrieu
vri
borderlines
ahlu
oira
pendry
lzb
howa
cockerham
defuses
horsewoman
hamworthy
coalport
eker
bulked
bertos
reliefweb
kienholz
horlicks
talentless
evinrude
keum
sacconi
mechanistically
carrano
viktors
fishlock
guzik
schertz
mullany
situationists
yanjing
digitizer
squeals
tayshaun
ergen
indexical
ekklesia
penikett
thinley
brownstones
warnborough
sigismondi
weever
heol
interorbital
kalemegdan
fabs
snubbing
estefania
jajouka
schwall
hdx
kway
costarricense
semiotician
ffw
apba
stuber
duncans
tapton
warth
briganti
tampers
wistow
hydroxybutyrate
gallate
mcmc
mammograms
sanzar
lysenkoism
mmmmm
demidova
gazans
mokhtari
imtech
benedictions
gargrave
npas
woio
legba
kiai
xev
indiscernible
exter
lodève
qms
pakse
culiacan
entrancing
chansonniers
vindaloo
pardy
neuhauser
hmu
guidotti
carquest
shahaf
speedsters
brousseau
psychobiology
poos
kasongo
sirri
unquenchable
shold
fabians
roomba
heitinga
gurgling
godsey
linkous
ratnayake
antagonised
stirrer
lividus
herridge
sethna
wieners
turnback
changan
caha
trabelsi
charlaine
pepinster
megastores
nafziger
plancha
durres
akobo
pointon
replenishes
parolee
evisceration
communiques
hoolahan
dudman
stickball
deniable
patricios
reiley
liesbeth
maritimum
demeaned
barer
aleida
skindred
ccel
lpsn
abscond
ovaltine
elizabeta
barbat
akhir
nautor
baraki
dodaf
chronobiology
spoto
disproportion
linocuts
palmisano
fishbourne
anamosa
kcia
rickenbach
shavian
papoulias
jodoin
hgv
anwyl
slapper
nimbin
goodge
brookeborough
kurfürstendamm
mercaptan
gnh
joconde
bhb
manège
jayesh
schuon
ofac
paho
interclub
goizueta
bufalino
omarosa
sandblasted
hotbeds
listers
baryonyx
barboursville
penland
sousveillance
hambach
seabourn
mostow
pvl
zabid
laham
gote
formatter
burkman
ytd
norwid
agonistes
chiselled
gunaratne
verme
snowberry
garrotxa
lovage
gutnick
flowcharts
telescoped
yoka
chillingham
soules
lusail
zervas
saule
lannion
viewfinders
cockscomb
trappists
andray
tahoua
jabbawockeez
parras
sundstrand
grungy
topsport
macnair
baccalieri
fonder
marojejy
hybridizing
holdaway
ledet
burness
occurances
phenylephrine
hanania
rebirths
sarli
acording
curmudgeonly
pearcey
maham
zeferino
carlstadt
tayport
cacophonous
nalco
denervation
neeleman
lovegame
scotswood
ringa
sumon
lenda
marantz
devitto
abani
deiniol
stelzer
dunas
yongzhou
papageorgiou
brusca
slas
immunohistochemical
midkiff
caia
gwathmey
weavings
flatmates
cammi
imbibing
sulfated
xeroderma
shanghaied
swivelling
adlib
acceptible
finkler
namal
newtyle
thumps
feit
jeavons
newbolt
gazal
thirkell
graef
extravaganzas
nortec
shukor
hetland
greatful
mopan
inducers
ngp
internews
bulgur
phoenixes
tayfun
zus
souster
organismic
bradbery
mandovi
cwr
hiroya
glencross
sahota
sanitizer
clamdiggers
eido
brignone
atiya
absolves
garbed
woolverton
deniro
schoemaker
ypm
flaying
vassaras
moldoveanu
eikenberry
schmutz
firehouses
happisburgh
semak
ieva
difc
cholestasis
shipmaster
inwa
pcna
bhatinda
inglish
vanoc
timesheets
fratta
margetson
omeprazole
levina
pablito
sawah
lfe
mapleson
liquigas
phototherapy
heena
aboubacar
koger
duggie
gushi
htay
beckon
jackaroo
vatuvei
andrewartha
bonchurch
filibustered
oyj
latae
rivieres
godderz
wahoos
stigmatize
psst
malamute
alvina
cryonic
meschede
obscurities
wellesbourne
blueshift
garvie
reveres
timberlands
copello
guider
ivanek
kassebaum
bolckow
porcel
omentum
interjects
canarians
invision
jannus
beauchamps
braly
ningning
yampa
odium
rosengren
subito
disaggregated
glynneath
servicer
worldwatch
hopoate
bucquoy
felsenstein
tugged
wrinkling
venkatesan
carefull
merde
sunsilk
andrianov
laumann
upwood
chapelton
unties
alz
chaunac
hedger
cspan
shioya
kabira
proyas
deepdene
firebomb
hamet
egar
broyhill
tmw
barasch
bulaq
mutiple
ghafar
kapolei
iodized
frewin
dehmel
snehal
courmayeur
helma
zazu
karli
mesmeric
aree
bichel
mellado
leadsom
juicer
azzedine
hyacinths
dits
tade
quechuan
macerich
fingaz
headbanging
reptans
leutze
necochea
strokeplay
pallotti
sarla
barnea
carbonneau
ichthus
busuttil
maintainance
mallo
mcatee
visconde
adamec
ramadani
tamin
seia
detent
hopeman
saxa
blinkhorn
bement
domke
siedah
gwe
barnabe
sudairi
ramakant
naysmith
viesturs
picadilly
bacton
buglers
sareen
eimer
karter
nesha
sukuk
tavia
chavarria
laurenson
scansion
khoren
pequena
monterosso
lcb
ayna
unreason
losa
labuda
maroof
silao
stojko
trinidadians
cuaderno
dolliver
nestmates
decompressed
henryka
keithley
meaden
reichen
losar
stromer
leahey
camoranesi
screentime
thonet
lookers
pumpers
norfleet
collectorate
rakhimov
kunstsammlungen
ccna
trinny
spremberg
encantado
sneh
swardson
thrid
defiling
valcourt
dechen
mcrobbie
oxycontin
ekatarina
kolli
dumaine
bushong
yujin
sharrow
crusted
surkov
keenen
headwall
strozier
erawan
fullstop
afzaal
batboy
molesta
longmuir
bogging
mapledurham
neutralisation
cando
achakzai
belugas
dainius
contentiously
pressurizing
cbcp
maquettes
nosing
mohamoud
russophile
soay
debrief
martires
kelm
buckhaven
projeto
bendik
aminoglycosides
qinling
nze
streambanks
wadih
shambling
lozier
routemasters
freeness
swoosie
avnery
bastianich
poperinge
markowski
sciuto
capecchi
rodenticide
deliveryman
umuc
pindari
glogovac
fazlullah
pellicano
reran
ajou
vesteraalen
hydroxyethyl
desolated
aama
fluorspar
falconieri
westmark
mahasaya
gumm
nuj
colorant
nejc
klu
shorthands
luvin
wertenbaker
monja
zaytuna
xuande
auxerrois
mthethwa
duprey
savoye
dethronement
morente
dantonio
bhala
outwits
scovell
lawdy
caffyn
talai
tamam
carabiner
wurtzel
peirson
vilfredo
minifigure
manoeuvering
laur
anelli
haislip
waterview
pescatore
madgwick
madri
afters
gpmg
bechard
jordanstown
davoud
sitatunga
bankrupting
annamacharya
shouldering
vinterberg
nymphal
murase
hoving
otilia
motus
calame
manteau
trumpler
smirking
hemraj
xingfang
tasca
puffballs
chep
cwd
mightn
foodies
mccraw
doxey
zhizhi
wilbon
dawu
bedbug
adcom
wicking
xiaoxia
bootie
multiregional
bozak
magilla
jianying
huallaga
obcs
unshakeable
palad
brychan
meissonier
tille
brayford
xrs
oktibbeha
caparo
stalder
curcumin
barillas
solimões
thisday
moas
appointive
unzip
marathe
gremillion
humping
cowbirds
touristy
mariyam
esrange
aiw
bne
swados
karet
efflorescence
ovine
swainsboro
salzer
sendler
eynesbury
compatibly
rumbled
cubital
speciosum
ecke
dabbed
strube
kalugin
fitzmorris
franzi
bonpensiero
tinubu
tahta
fornix
brosch
sweatman
engblom
katju
poff
tems
puttenham
natoma
nativists
khoe
stecker
brymer
sorsa
classicizing
knifed
massart
boonstra
heatedly
ballydoyle
sulfa
pinhal
bongard
marouane
revolta
stoically
fassel
ricardos
apocalypto
pharmacovigilance
inextricable
serdyukov
glyptotek
wookie
micrographs
niceness
menchú
ishim
bashaw
prigorodny
dkr
halsman
cominco
overprinting
dorel
clairette
achterberg
petrossian
resurrections
kabuli
palpably
brockhurst
sawmilling
kettner
ekdahl
mozgov
trustable
sadiki
uhhh
kowa
sleevenotes
mcnelly
baisha
shidao
paracha
armlock
thurl
ymcas
priestland
karlen
narendran
jetways
stratas
chami
yermolov
medicalization
herland
gusle
southerton
dimensioning
thundercat
woulda
mcree
nield
elvia
wattens
ruysdael
macfarlan
harrachov
consanguineous
kopelman
rautins
sobhy
overtopped
fhc
tillmann
cbso
kazin
slowdowns
insha
matuidi
khaitan
deconstructs
ouseburn
pilaris
shopian
saltier
joba
ibla
grammes
lydbrook
maoz
vouet
iptc
dhari
quinteto
iliadis
rentier
berkely
professionalisation
luy
lorri
castaldo
halcrow
conservatorship
celerity
trashigang
speciesism
zeiger
delly
volcanologists
ridenhour
nunney
dehumanized
gongora
blisworth
blumlein
meserve
neckarsulm
ikoyi
avtandil
panfish
gida
smets
vkontakte
rykiel
triano
auken
zombieland
schurr
iyaz
applesauce
teepees
cumings
engelke
caerlaverock
newtongrange
viagens
wode
downloaders
bullingdon
wentzville
kaleo
alderwood
benzion
jaffery
tetons
voegele
fauzia
colligan
garanti
dumdum
chukyo
overmatched
wissmann
xaml
ediz
ratdog
pocantico
knauss
shameem
mandalorian
grabowsky
harsanyi
kaulitz
hochtief
emini
talita
dietician
botanicals
yanick
heathkit
semenya
yatim
physalus
yelle
aéroports
dorabella
protium
offiah
genesys
footages
ncss
mcclurkin
bosso
fould
tiryns
ilgauskas
pangani
araiza
transracial
deathblow
leering
rheinberg
hemlocks
touchbacks
evangelisti
hebi
sangria
braggins
lizzani
lucite
mingles
vendel
ecsa
labianca
worsnop
wthr
dalys
beetz
laymon
napoleons
wollman
hinchingbrooke
villaflor
kessenich
wainer
sedlmayr
gruffalo
babylone
sabarkantha
chote
mantler
misi
yadavs
maladjusted
deoxyribonucleic
arko
kiteboarding
cellucci
waterboy
saeedi
alemayehu
idec
prionailurus
klopfer
cornerman
cauterize
equiv
spacial
resor
cordgrass
grimbergen
selecter
knd
hazim
botella
haunches
royaux
pharmacodynamics
drik
benaissa
jerman
nibbling
enteropathy
extrude
huzzah
saijo
marlyn
overdevelopment
pfannenstiel
semir
aonghas
momper
stauber
cordeaux
olejniczak
liphook
shadowplay
eichenlaub
unenforced
khadir
darran
fehn
ahmeti
hirshberg
ganderbal
pentz
overtaxed
koco
feldkamp
labra
lewthwaite
frr
satirise
bapat
swink
tilework
idolaters
waterjet
bijoux
mudford
subotnick
ghostlight
mccombe
aspies
glaspell
dyncorp
mannings
neomycin
anothers
onita
lup
dwd
parisiennes
copulas
paskal
edlington
beetroots
joscelyn
narsimha
gartrell
moley
ukase
guerrouj
dilophosaurus
haykal
autochrome
mcx
minaya
krama
subban
rwf
tappahannock
biomimicry
bachs
waks
mahina
bolz
marzotto
prosky
tourettes
zox
imageworks
polari
boitel
serenaded
pusch
isreal
cevat
noordin
dotan
teds
welches
lundstrom
meadowvale
tokenization
lubis
nanay
villiger
agecroft
pachacuti
maiman
lucus
kozel
simtek
lense
brodkorb
leukemic
nephrectomy
danubius
gangaram
outrider
misquotation
moyale
newmarch
cocha
poseur
cityplace
caersws
charn
distills
nonie
ishizaka
exige
expunging
udny
pauvres
bickmore
kpu
rosenfels
rumbek
houla
aetate
fredricks
monos
cyn
balkin
weisgerber
blindsight
pierpaolo
alling
catherall
stargates
haavisto
tuason
tiss
toothpastes
baabda
panizzi
legnani
kingham
semillon
amphoras
rituparno
wreford
adelanto
chopstick
harrisonville
foredeck
housden
cosin
gurmeet
telegraphers
sepah
desaturated
heslington
fye
loquacious
solara
intouch
shims
mozdok
tragödie
portobelo
burglarized
shahadat
budgerigars
festoons
netty
sabita
mitm
chuter
holsey
empiric
khary
prepositioning
shabran
priore
hilltowns
aspyr
prahalad
xrd
reville
samb
labled
oversensitive
berdyansk
associati
kawal
cyma
kozinski
loera
secretase
monroes
dietze
yafo
kaisi
nochi
sardinha
ethinyl
cribbing
marcinko
sackhoff
larochelle
reconquering
subliminally
oldtimers
eshelman
earland
egocentrism
moscatel
sondergaard
lieth
rish
freemont
speechwriters
comfortdelgro
bertozzi
yukihiko
octoroon
hypocalcemia
festen
srn
tupaia
dahlke
weitere
pfos
foggiest
premie
davorin
sigurdur
schunk
smolinski
alysia
abiel
berlaymont
antiemetic
mccoo
pentheus
heitmann
waud
anastasiou
horndean
kozyrev
gorecki
blasingame
cifra
subhi
affeldt
foll
cockshutt
glaus
brandauer
mdiv
hematomas
jovis
dres
njoku
superdog
cyclodextrin
retype
kowroski
hedd
antúnez
stupider
warland
rudaki
aslanov
lopilato
dunnes
ceasar
nandalal
pinetown
madin
scarrow
saegusa
mcgonigal
tefft
implementable
verged
decorous
ponselle
milewski
kwo
zielonka
güiza
lmo
wolesi
vogon
sonus
genovesi
petherick
deconcini
corktown
vibram
sanai
whitegate
enzersdorf
nonlethal
provisos
gner
lauterbur
mowlam
articals
ystradgynlais
ambergate
sdxc
dashain
chunga
orduña
berrabah
taylan
proteges
laich
diaoyu
lamanna
atomically
tames
venir
bouna
ryer
edilberto
hlf
lakelands
quino
kolingba
ostracon
disavowal
nanpa
maberly
tassajara
costumers
kci
ﬁve
berninger
terregles
andoh
fedoruk
trivialized
lockton
disgustingly
santosham
dombasle
ultraconservative
kryuchkov
paranaguá
flanner
fecund
sawston
merlene
documentable
loquitur
caliendo
ashtown
fewell
nahdlatul
letterboxed
kinswoman
tylo
recchia
arocha
nonwoven
cybulski
ballykelly
klossner
erler
poblete
crenelated
foresman
ujan
vilvens
ruether
azb
unloads
majoor
dishonoured
louganis
spectrally
jenga
macneal
wattisham
perfringens
pulwama
parto
ukrainy
chinchillas
garik
marenghi
dethick
juticalpa
tlx
feeler
rads
pungency
peces
habibur
repower
minsi
alvino
leccia
hassayampa
capucine
lifar
dgb
templon
memmo
ensuite
incharge
nostell
durcan
kerryn
murfin
labradors
boggart
bioweapons
mecom
petta
kilty
aliança
storrington
ulsterbus
bishopville
neurologically
cacau
hyperlipidemia
gjokaj
babrak
wateraid
alpher
nanayakkara
torvald
lbk
mcgeachie
pupillage
rosmer
braco
picoult
doaba
dragline
hizbollah
herzogenaurach
valognes
hondas
poing
sgurr
mickel
teacups
barkston
sweetgrass
naadu
gbk
wagle
chermayeff
polonez
dpms
lordy
dorthe
hershfield
aidc
moonlights
steinsaltz
wisk
qalqilya
sighed
coomes
ndwandwe
nyac
zilkha
yellowed
miera
lewknor
counterforce
mineko
atomique
hickories
acpa
goujon
edmilson
booga
pustular
pallenberg
needwood
weybourne
soejima
résumés
goldington
mplm
spayed
uncertainly
cytological
bastardized
slutskaya
sherine
integrationist
etops
girlband
russow
wangpo
demeans
rothfuss
gwasanaeth
fashawn
longjumeau
ledovskikh
echeverri
instore
ntf
cdre
webshots
oenology
sentani
photorealist
plebiscito
insieme
vlaar
singes
shafran
spurrell
apichatpong
antidiuretic
hirscher
pirner
eynon
foulk
calluses
encumbrance
probated
anthonis
bange
schaerer
effacement
gothams
duopolies
vitz
airbrushing
coscarelli
legatee
kriz
vrelo
desisted
censorious
browbeat
smelser
janss
miche
gisenyi
cholecystectomy
hiskey
parijs
mekel
kustodiev
silverline
hardeep
lobban
hribar
clapboarded
monavie
peper
cobban
eisenberger
beledweyne
mancroft
cockeyed
puddin
maged
moocher
giugliano
windpower
jaago
ashvin
thurloe
larcom
pmw
lekman
interahamwe
nobi
scalpels
clothilde
tallec
mistreats
vawter
trudge
doot
sye
hepp
dufftown
gauvin
existentially
ekpe
mangler
banken
myalgia
einojuhani
pingyao
tukur
steinbock
cerc
oxhey
igraine
minimizer
elapses
winegrowers
lebec
alchemilla
sulpicio
golzar
yashica
whodini
schrempf
hamadeh
forcier
crocetta
cattelan
theissen
politeia
krastev
esfandiari
steeve
gilbride
gokey
inexpressible
blader
sparrowhawks
dominants
empanadas
dibernardo
melchoir
wisecracks
isel
yawar
beaglehole
mescal
sveen
realitatea
caverna
cynara
deglaciation
enbw
blackland
mcdivitt
baldin
unicornis
polycom
stenographers
buttram
monotreme
inverlochy
lemaster
roure
balland
bandas
policymaker
emote
castiglia
committe
gaar
orzechowski
mallender
greenspring
westlund
nangle
tounkara
waxers
upnor
ganoderma
hyperalgesia
appr
galvis
vaizey
borenstein
merchandiser
zinser
durness
sansoni
glico
parducci
yeshurun
merta
transglutaminase
coulston
stepien
beasties
meningioma
nektarios
lalji
timmermann
mandylor
gofer
sysml
reha
luthuli
ibru
outerspace
bellion
tarkin
fletton
preordered
deworming
dewy
propre
feyzabad
jatta
slumbers
piñatas
ceto
jero
charcuterie
dougray
kimathi
rossmann
mudguards
bareboat
nolf
catrine
centrelink
schouler
greenlees
gueule
yonemura
bilious
rankled
redway
duelists
claribel
winspear
michnik
moonman
hollo
namaz
sinnett
holms
dogville
rastas
andrian
thawra
ditte
roffe
bangerter
albam
submunitions
adamczyk
fiv
ruthe
hiland
lowlanders
rutting
walczak
pasang
satar
slosh
hybridise
salimi
kepco
salemi
chamlong
khunti
chenu
blandly
gummidge
rsb
wolframalpha
ranikhet
invalidly
oxygène
rombach
shipston
tillmans
moglia
magliana
berigan
footrot
konstantinidis
anabela
metrica
gomal
desnos
fien
conowingo
mondy
colonizes
montebelluna
hitmakers
chevreul
unadvertised
dholes
kalon
sparklers
schrager
korem
combattants
denouncement
nauseated
kvirikashvili
endean
acceso
tickler
celik
ljp
kulgam
hypocenter
htwe
megyn
osg
mccarten
yaphet
possil
positronium
nagasu
ataque
dongxing
greathead
rajo
glenlivet
sctp
fascias
girardet
lbgt
influxes
linky
keyshawn
ansty
mattathias
marginals
utrera
diodati
microphthalmia
burping
cruelest
cbk
coaker
dehiwala
extractable
plurinational
newco
gyeongbokgung
abdullaev
sast
celador
carrico
keelan
ketley
feghouli
colori
liska
kingery
ripstein
xiaoxiang
cadaval
rackstraw
longmoor
underglaze
kvass
brulle
politcal
cornershop
latinist
rockfalls
weisner
vsg
yukichi
carriles
hmhs
alzate
siop
pirc
persuasiveness
randee
bergues
matcha
demagoguery
holts
flapped
kemmel
moortown
spendlove
cybersex
extensors
tryed
gravettian
pinkertons
ossuaries
weinmann
carls
frisson
khateeb
sulton
sconces
irthlingborough
carterville
imperially
asgari
flatters
assata
pecoraro
blindfolds
aapa
samovar
prunty
guimaraes
collaboratory
sapone
yerger
nanostructure
plouay
nault
herp
shinda
conversationalist
citoyens
okrent
klenner
twisleton
ireen
gametap
bilbray
barh
jannah
albasini
thumbed
appassionata
borle
rushforth
bende
guzzo
catchfly
festoon
cuon
tsukishima
hovde
mureaux
yabucoa
zagar
tienes
banik
terlingua
oxidise
whoriskey
nagbe
kollo
aday
cochiti
homeplace
taufa
tjrc
forthbank
sternbach
kallies
louka
ilva
strongheart
danijela
scrapie
kanehara
dreidel
macalpin
pentastar
peker
ichabods
pavon
wadsley
daikin
brüggemann
ratchets
unsparing
aramac
haarde
oce
misprision
calcitriol
megaron
draughtsmanship
spitak
tftp
bassac
stadtschloss
zetter
iben
assumpta
fld
gga
papyrology
eustachy
redundent
salpeter
dadeland
cheder
leid
wilmut
seagraves
osteria
brunssum
glorietta
bace
bemrose
houseboy
bruhl
gastronomique
praful
judgepedia
buzan
rochereau
lempicka
manoeuver
artiodactyls
tirzah
noujaim
ninotchka
vaile
carmena
tinkham
landru
cattani
deutschlandfunk
reekie
sonnino
fingerhut
hookes
anyanwu
dahal
bergs
kailahun
qvale
vhd
arthington
ideologists
itochu
ravished
benirschke
rovin
qsr
gosta
ponomariov
walesa
villazón
ledi
protectiveness
lucozade
defrosting
weeny
radway
ruediger
ramchand
fazel
kalaf
paletta
elysa
ledra
cayne
marketa
climat
growled
harward
fresnillo
precipitations
qabbani
maraba
exurban
thiazide
nachtmusik
brana
voisey
burrowed
biassed
teetotaller
cléber
faggin
loxahatchee
mammen
mcgaughey
cegb
mooting
underclassman
oleds
rbg
mazzoli
unfurl
okhla
jpr
korin
chidiac
mindbenders
mascheroni
maceachern
rangana
bocchi
privata
kikuko
rehmann
inglenook
bongiovi
waterlily
runet
causton
shuk
polybutadiene
whirls
kondylis
binjai
dalu
buscaglia
oratories
epler
expressively
victuals
waples
farkle
zenji
vcm
bentine
agronomique
anzor
standon
somerby
triazole
mersereau
retarder
salote
cajoled
oceanport
gamil
witsel
bigpond
potently
alternans
vocalisation
aibo
tetracyclines
snettisham
barcia
stoppa
punj
sindelar
digitalised
bontempi
misgav
paaske
shapwick
belturbet
crotchet
chulanont
cedrik
¨
leafing
yezidi
ahlgren
mousy
businesswire
rhianna
trussardi
corticosterone
clime
audace
askins
karelin
margareth
kaikai
timewasting
camouflaging
torrejon
devendorf
arpita
cameleon
yeasayer
harrisson
richmal
dondero
toyooka
headend
oppong
ibr
kewpie
berkson
focaccia
amadori
tues
bkk
flossmoor
kombo
racino
neas
margules
southville
understating
groped
fourvière
sprit
kupwara
doleman
flagman
scahill
wtnh
hellweg
lymphoproliferative
boublil
rovshan
rossier
dassler
pzpn
scuzz
marikana
sath
vaknin
farmall
grainville
teddie
washtub
solennelle
manggarai
yazaki
szyk
uniques
awale
assortative
hscs
asghari
nadarajah
incurables
buckey
dewes
lechwe
starkman
refosco
weida
yunan
klier
slags
alkatiri
frocks
oliynyk
misjudgement
jaman
petropolis
messaggero
alatorre
dunmurry
gerace
obliterans
sackcloth
spradlin
erlendur
roszak
glares
wolfensohn
skirving
oddone
nolans
stollen
regularised
pomigliano
zaven
kramden
wights
congolaise
bartoszewski
pellington
desalinated
brockhampton
khairi
dobrynin
dredgers
aicardi
spanners
gwc
trackdown
arlit
petrolina
paita
khalifeh
pachachi
charalambous
battell
haole
baddie
shirane
thereunder
mandora
ticklish
enfeebled
steinweg
ntds
staroffice
leow
immolations
velopark
fumosa
noncommittal
sciarrino
maíz
hibakusha
rebeka
extruding
stirrers
fetlock
istiklal
ewenny
circumcisions
whome
yeas
hornstein
amerongen
unilateralism
countrywomen
wassef
kanth
stra
aoraki
lifecycles
trickey
tamago
sameh
kellet
yibo
attractant
gitega
kemptown
pree
toom
prejudicing
worldvision
futch
folmar
mosser
harahan
bardonecchia
chrysanthos
giganotosaurus
ugetsu
calcineurin
casseroles
catchiness
selick
boniek
asilomar
eynsford
lho
varlam
zeisler
wolfswinkel
karpal
lauderhill
nepad
gauger
shailaja
garnham
chandipur
harborside
englebert
wojtowicz
lanson
raylene
denio
vldl
sonique
keko
wmar
thaicom
busi
kovar
metacognitive
teeters
goldreich
rinella
wahed
sensorial
fankhauser
sheens
hernanes
lochnagar
oculist
turnblad
shutterstock
armande
madrassah
demarcates
magal
acellular
niedermeyer
compendiums
canda
kint
hukam
wenning
greasers
březina
mahathera
bibendum
wittmer
igda
butchie
rimu
vartanian
bigs
incompetency
leeser
saxmundham
bomarzo
schulhoff
marschner
furled
entrées
shadia
scabiosa
elburz
rhabdomyosarcoma
durer
imagem
hypochondria
mushfiqur
studdert
foula
heilbrunn
jmg
soslan
wooller
jayantha
höcker
gendre
anhanguera
fle
balde
silkstone
qtr
drue
disowning
ayt
croissants
kariuki
mekhi
lyngdoh
mousey
surridge
kopecks
oleum
castanet
durnan
tejedor
aripiprazole
frideswide
vattimo
kelleys
paperweight
durwood
toshikazu
mahdavikia
kirkgate
gasca
jlb
hickstead
aggresive
mccance
pracha
philles
fgd
nclc
globals
proffesional
martucci
razorfish
jokester
abdelmalek
ostium
martinican
huggel
outrunning
perote
habte
fatness
joga
namecheck
kadeem
pesquera
errett
upl
haselden
kooten
youku
depredation
chanin
swangard
cric
thorncliffe
jameh
tartare
casita
ebina
skelmorlie
mcgarr
ecmo
granicus
mucked
socrate
techrepublic
zidovudine
pleasley
manawa
gummow
pitfour
birlik
benincasa
rexach
bertoli
adeliza
paycock
repro
aradan
malhi
shiren
glanton
buysse
bashley
gamliel
arja
iue
irex
copake
atget
foris
carlstrom
zerkalo
borgetti
nechells
sevin
scrimgeour
biocides
sinkiang
diplomatist
nanofibers
carbineers
eirini
horsemeat
junebug
oad
motorization
ilwu
franchini
teac
sobey
zhigang
davidovsky
serenely
kaminey
cicilline
osteopath
kashdan
unscholarly
miyakojima
yanda
wassoulou
pedigreed
subscale
avoriaz
anwer
hinode
surco
corroding
franzini
aéropostale
welts
cholon
raun
rawl
zopiclone
injera
uremia
khilafah
electroencephalogram
yulon
contentless
aluminized
foxfield
djebbour
outokumpu
butai
før
lowney
linate
vudu
timbrell
dery
typographically
munfordville
herbstreit
ichat
kazbek
appendiculata
castelluccio
dinur
anki
domu
ratte
pennsbury
littbarski
comiso
clowne
telemarketer
tiferet
procurements
goelz
girded
snowcat
alphabeat
sulby
macgrath
caunter
handbill
sunline
andreeva
pilfering
reynald
chenard
warg
differnet
gunbarrel
bowlin
biswa
hailee
iscb
sandrock
fincen
abscam
psia
jairzinho
vff
jist
demetre
ketsana
wobegon
ketosis
fustian
ishara
maiga
shindler
daine
pomponio
kurochkin
collonges
bruyere
cilt
eiu
wilsonian
nescafé
firdous
bhuiyan
colbys
ehn
haskel
demoniac
zuccarelli
pearlstein
mauls
abita
primitively
yasuoka
newsfeed
gréco
rachlin
poids
budda
kataria
eht
distention
blustering
mandarino
nfca
ssrc
soame
seales
centenaire
trattoria
netheravon
pontificating
endroit
gaubert
malapropism
glassfish
gillom
macrina
northpoint
tirkey
véra
nicam
matricide
soberly
picosecond
trentini
meridien
uncodified
koronis
aktiv
bizerta
solovyev
ligaya
carcosa
shoplifters
simoes
bearish
stephin
avra
solsona
salicornia
kildee
lushly
dosimeters
serse
luristan
chiat
mihashi
shiyi
vexation
koffman
factly
quynh
ulbrich
ollantaytambo
philippou
voth
jinlong
aforethought
tamping
balsdon
airmanship
kamper
duavata
gregersen
ugl
inova
trkb
heckert
diuresis
perronet
padron
hevel
ssat
ufton
bootylicious
badara
behoove
armwrestling
marmaray
errorless
bugbee
wallhead
findus
flatboat
bapuji
atomium
tuimavave
mccleery
solemnities
mahfuz
amax
achour
monashee
mechnikov
canonbury
lucic
astutely
sumptuously
nassir
prepon
mawla
wilmarth
homogeneously
leckhampton
komati
wanyama
hallgren
antirrhinum
michaelides
sportpaleis
moviemaker
bockscar
piddle
humourless
otari
bandoneón
duddingston
ihnen
hollers
amezcua
zaoui
vaguest
reshuffles
whitemore
panamera
dki
egner
mechlin
lauryl
revering
ifilm
elmaleh
clasica
bibbo
jop
freshener
acetabular
lents
voici
manchego
kuniko
braggadocio
castlerea
bechtolsheim
tatsuhiko
arktika
maharaji
nevena
rajagopalan
bylsma
geostrategic
ornery
colab
truculent
keyloggers
bingyu
dobrescu
rubinek
wikki
playfirst
silman
goudhurst
kurtwood
jifu
mousseau
tiede
bulganin
crummey
socioeconomically
redwine
inrockuptibles
stammheim
belene
pbj
schmo
rampurhat
tornquist
czechoslovaks
ghirardelli
spondon
monacelli
eiichiro
numantia
bruhns
pattabhi
haycraft
aleatoric
spangles
smarting
zhelev
fuzion
akosombo
draheim
beachley
castricum
srei
lovelier
peerzada
manyika
middlewood
encroaches
petrochina
miyatake
minsheng
saliers
yeesh
schachner
steerforth
bebbington
pastoring
acquits
eyelets
ruchi
macassar
pierluisi
mccowen
atsu
microfiber
scrollable
golmud
folliculitis
taroudant
motter
hond
torra
rosia
treg
overlanders
nayer
hallard
torkel
ildar
prateek
serialize
nalbandyan
standells
mckeith
abrogating
pawlett
sumar
mram
milbury
munificence
petegem
enzi
viewport
wriggling
widescale
paster
pursell
dottore
fobs
derrickson
meraj
cultivations
baxters
fischl
tablespoons
spaciousness
geeson
shapero
pousada
pennywhistle
mccanns
kossmann
ansen
hydrometer
neeru
hurworth
wno
cknw
gerety
photogrammetric
greenie
wrinkly
berretti
prostrations
vincy
waldwick
flans
mamy
pinkel
sunglass
barno
bleda
choleric
mpofu
spoerri
eidlitz
puffleg
workforces
iberostar
poliziano
bjarte
achen
kirpan
machpelah
dorrie
lockey
interweb
bidzina
seena
neuenschwander
revi
lacetti
idil
pagosa
preconception
shahrokh
dollop
budgett
wackenhut
stealin
ploeg
prefiguring
fanie
tomjanovich
cherniss
galit
thunnus
cancelation
blantant
hady
systemwide
jakks
zhenhua
harriton
aset
bystrov
matterson
bromont
kochs
capstick
razaleigh
schranz
kuzmina
gioeli
minigolf
grasshoff
disaffiliate
emh
sonneborn
trichophyton
rijo
sevgi
amitri
etcc
monigo
clayden
memphremagog
garri
deora
borana
mustachioed
learie
doura
manholes
elcock
amalfitano
envoi
eppie
shoreview
jpc
nalepa
allestree
impro
lilywhites
mourant
curs
troglodyte
adamas
konstanze
gasman
trickles
francescoli
intercessory
vredenburg
langrishe
explicitness
lurcher
jinzhong
ohle
tropa
areias
renk
ftz
mcgugan
apley
vlaeminck
forslund
ohkawa
gustavia
hironaka
jocularly
extravagances
whipps
cheesesteak
npdes
tarvin
plourde
sagnier
poulsbo
casma
nevilles
lawang
rabel
vistavision
vaden
thie
otha
kandersteg
unconfined
reas
melanism
francigena
dissimulation
kpis
transco
bevilaqua
canaday
anwarul
dadlani
mixco
kryukov
orbetello
oophorectomy
poltergeists
enlight
swinfen
muirkirk
swathed
galtieri
medlin
pft
dilara
jarawa
beic
furloughs
rossie
gundi
odwalla
vieru
acanthamoeba
orsolya
loosest
sarine
rorquals
crellin
westens
piège
foucauld
stamey
antiparasitic
dîner
vmt
asters
buffoonery
bronislava
alexandersson
solario
covetous
admiringly
gernreich
aldonza
pinniped
angélil
giddins
treehugger
malon
salvatores
combattimento
mcloone
prpic
cudicini
ambrosi
dabbing
pappalardo
petroni
liù
bihan
willaston
happer
peronne
carlinville
bers
rootham
overbite
novaes
ceqa
korbut
vandenbroucke
satins
disi
radiobiology
takaka
zhongyu
seaga
reshef
sarkies
thakker
mandra
coverdell
raavan
rolles
benzer
kipsang
bhupendra
mptp
brookner
christenings
itto
directionless
barbadians
kibby
conservancies
percenter
nagqu
pantnagar
durables
eustaquio
rayners
manikin
shw
salvific
tme
marwijk
stewartry
kellyanne
rupununi
bernera
amyas
gantner
helander
wolfgramm
casu
theri
endearingly
pennoyer
schlee
jigga
nonu
dæmon
marchione
wyggeston
castlewood
keesing
brisley
weinfeld
casitas
perforator
warmonger
showdowns
flippo
polyrhythm
midrand
sheperd
grainne
nattrass
kittlitz
obis
quondam
wilkinsons
hallas
zorica
novikova
tarter
flandreau
potsie
kelk
kodar
dawan
beghe
exene
incentivized
satinder
toppenish
foodways
beatin
trendle
carice
gamekeepers
vitalij
husin
minorsky
dejection
kpbs
lezak
ratnakar
sahabi
ilda
mullarkey
wombs
kosgei
ocklawaha
breezed
schismatics
gateposts
arcobaleno
almendra
rippingtons
nordenstam
zions
doppio
florette
jdp
rhynie
neverwhere
cullerton
retraces
hanny
tournant
afis
dandin
camomile
gammell
carencro
triclosan
enslaves
coel
vlast
spamhaus
schwartzel
congealed
aquired
kirchherr
barrs
matheran
kurmanbek
fifeshire
madingley
berquist
carino
meszaros
brouillard
overestimates
microsdhc
sohi
matcher
cazeneuve
lifer
samual
dongs
vitello
olczyk
shahir
wingnuts
bludgeons
zachery
unrelentingly
prenatally
asscher
speedometers
mongia
pagode
pranked
mgn
tremeloes
latt
asilah
hias
shimoni
najee
skywarrior
kwch
trata
battened
thaxted
sejdiu
delko
vietminh
feifei
aamar
troilo
matalon
persil
kawanabe
dapo
ckx
mishal
cccs
chamberland
undecidability
colescott
shilla
changshan
oxidiser
kitchell
opper
qurbani
beeck
palaver
ligety
papayas
nerz
stewardson
reeser
ilin
fijo
malverne
bouldin
aaahh
iwu
weisbrod
redlynch
volkspartei
kalima
freelander
alveston
duloxetine
anguiano
vrdoljak
segismundo
insurgence
schwendinger
livistona
manicurist
kidde
asociados
edkins
lajeunesse
wbir
suvari
keiki
dflp
batard
khadem
hallstrom
krishnapur
kahlon
tupamaros
underrepresentation
sopher
freelon
illeana
menocal
thewrap
ticehurst
cathkin
fregosi
carloway
kadee
fagerbakke
amberjack
cgu
adriaanse
ramstad
annable
bogoliubov
sportbike
wakkanai
wizo
mahanama
aidy
ermington
wennberg
kafirs
linslade
carrozza
freighted
pussyfoot
floridas
upscaled
chytrid
tyrel
stutters
feldheim
visoki
nektar
bulava
mombassa
choson
anticlines
vertonghen
liebeslieder
leocadia
campan
ugu
yawing
blackwelder
peston
misspent
fring
ansys
charlbury
steingrímur
roaf
ghandour
mistimed
platformed
crimewave
rodenstock
ayoob
piene
roodt
sawiris
hammerson
odn
ladles
stenning
maska
neustift
vanport
raef
etx
carnock
txn
soare
fargeau
bratu
trepassey
hypomania
wjfk
winford
freni
lithos
fronta
freston
insets
shannons
ciolek
gpd
chiddy
orofacial
torosaurus
eag
anthracnose
greenshirts
pirmin
chene
gwillim
dispossess
lavik
mouette
richeson
zaugg
acces
draftsmanship
kajagoogoo
colwick
kleopatra
tvweek
rgr
nulty
huckerby
apol
loui
bifurcating
superordinate
multivolume
jumu
fumed
warhorses
lcv
geopolymer
polunin
rodenberg
purtell
glsen
pakman
infinitude
lactis
yeamans
belser
landownership
homewrecker
lme
westdeutsche
undistorted
pyrethrum
consuegra
baburam
padovan
messerli
whatman
maughold
pawlik
subscales
rubinoff
nyrup
sissinghurst
soloveichik
bowerbank
tlm
photokina
faan
kaap
nese
chinar
tkr
pollens
uncw
khordad
haleh
jalapenos
larivière
manageress
sittang
ichthys
abramowicz
stamatopoulos
pasdaran
völklingen
braamfontein
brierfield
knop
pastured
parasitologists
tangye
mireia
milow
philomath
malkiel
misoprostol
waterbeach
heckerling
fossen
dehp
vsr
shirey
outpolled
pounamu
overconsumption
riegler
statkraft
llanymynech
pllc
noles
akerlof
zenn
prairieville
sanchita
grumbles
collingridge
hindbrain
chutneys
seyi
gpn
stanage
molder
easdale
jancker
impassive
ferrick
sidedness
adms
paymasters
janz
cryosphere
feza
webbie
mcgeough
petrotrin
swg
kleinhans
stutzman
rathcoole
taue
bohrer
domaines
whiskeys
sachie
wylye
koes
ansted
xiomara
chenzhou
furong
restall
eriskay
papain
boul
acurate
braf
hautala
rabiah
harridge
hillbrow
laffy
mcanulty
meldahl
kiehl
rubery
caunes
agression
dimpled
keti
morenci
constrictive
valorization
shekinah
eeny
trinucleotide
congar
navnirman
occassion
gapping
sonko
kilvert
canzonetta
kwacha
mbombela
camelon
aider
durão
gestel
levack
thinkprogress
inocentes
shopko
usgp
holzinger
bettega
frickin
pregabalin
javadov
burbs
rotenone
altendorf
mitsouko
lanyards
modan
sotos
masp
kooy
famished
moluccensis
rosand
postino
bleeped
torro
khedekar
unmanageably
cavallino
klusener
lyd
junaluska
dusek
rogosin
carmakers
chng
rottweilers
suckled
bioweapon
hoodwink
urijah
chisora
uib
wolfer
boyarin
sentech
cryptanalysts
bludger
tagua
gesink
gambrell
ohsas
klooster
conneely
mittweida
mediapro
dawber
buntingford
tabou
tatura
rhinestones
micaceous
fleckenstein
essor
hunwick
saidy
farney
moisei
artscape
otey
prehistorian
magleby
zusi
crepsley
dnes
worsthorne
rockband
actuelle
prefigures
jungr
nonpolitical
titer
braak
bigalow
erhart
managable
palach
nampo
negocios
jianping
unmapped
sheikhdom
cruttenden
streetlife
hosp
sudre
sobran
redpoint
homestand
nonbinding
quadriplegia
firat
ponor
bokar
romeoville
mellan
calix
fetlar
odorants
berlins
keralites
orol
crocket
dories
tongling
galibier
fmqb
basauri
cutlets
lanc
slawomir
oyun
geliebte
mahnke
aljubarrota
stitcher
retallack
azinger
cacace
nalder
honourary
masirah
sangkum
dumais
putto
persad
hanby
blairs
blickling
haymond
bobrowski
serey
freesia
ferodo
bourses
scoles
trubshaw
cristopher
sketty
keralite
waukon
hekmat
ciment
pepoli
jentsch
pictoral
commendably
breathlessness
scogin
littauer
theuns
adamle
damaraland
turbidites
emmel
nilan
neurol
nre
grimthorpe
firebombs
getulio
durbridge
dego
thuong
basicaly
tamiko
ballinamallard
anabelle
imput
kostadin
butty
dinter
fener
foregut
finocchiaro
ribicoff
murphysboro
amritanandamayi
coincidently
benfleet
adhikar
backgrounder
taining
sebo
nikica
blankly
globalize
hillcoat
stockbroking
homophobes
zna
chalks
clarey
imputing
abberton
veazey
goatfish
propithecus
insultingly
acms
campomanes
piatkus
derrell
despaigne
ukaea
bargy
lewi
wankers
adamovich
preprogrammed
manabat
rbn
pavin
lings
flummoxed
jarocho
bikur
fuglesang
erwinia
youself
syu
chanoine
wolken
huayi
acadien
bilingually
civitanova
ilyumzhinov
hikikomori
belda
crispness
supercarrier
wauconda
lytchett
tzedakah
vijender
humbler
albertino
illiniwek
lomita
exaggeratedly
zillo
jowzjan
lache
wunderland
geelani
majkowski
transair
anaesthesiology
ignorable
dhss
schuessler
unambitious
ikar
tooby
tineretului
beltways
carmichaels
leura
paradine
quercifolia
clarithromycin
saddlebags
hvad
oisans
pearn
nakadai
rafales
eccentrically
wehrle
mislabeling
arkie
inconspicuously
rakia
majeski
tummel
monotheist
medardo
dfat
odenton
youve
ditmas
forstner
kition
gamefish
sundaes
bandolier
lettings
dutroux
dextrous
sace
bissinger
britos
netcong
inappropiate
doillon
blotto
mellat
moghuls
gadot
dingolfing
natapei
abstinent
holtzbrinck
draggin
rozenberg
homefield
purex
metacarpals
summerhayes
hyaluronan
americain
philippson
nayla
coplan
tvd
eastertide
montée
cambs
liyanage
eastwind
karstadt
lafite
spectroscope
cumulated
misclassified
klippel
kotv
whinchat
wallstreet
oney
myotonia
headly
mudder
matviyenko
atay
frankenhausen
pointblank
debasing
mimura
taurino
inversus
zadek
sigsbee
guoqiang
snowiest
landslips
ceftriaxone
okoli
acquitting
preceeded
multiplane
undisputedly
contibutions
sitz
tracon
frankley
madhusudhan
folkerts
manhoef
plusnet
xetv
regner
receptionists
lewenivanua
chamaeleo
cocom
diamorphine
blakeway
easterby
kinburn
espousal
compressus
casula
sext
marocain
jumpman
mischaracterizations
parisyan
schedler
echegaray
honker
impulsion
baudo
immutability
soudani
komura
tussocks
boodle
pprune
shaddad
coppiced
rokita
dully
berget
harroun
penfolds
nishani
weighbridge
mancino
hedworth
bateleur
stoessel
choosy
lathers
isoflavone
miscues
shige
guirado
buffoons
manasse
nisl
chuzhou
ilma
traina
etd
dzbb
randomize
tpms
bullivant
fotsis
sportfishing
econo
sterkfontein
moncks
groomer
knie
bifrost
citaro
flightline
acgme
gillings
dyads
disentangled
ehp
mendips
enyart
lamoni
redistributive
elfyn
pioglitazone
hagner
ministrations
preteens
veilleux
roadmaps
bioprocess
tamal
plantlife
aravis
shimoyama
urbanowicz
frederikke
villejuif
viscardi
cij
bernières
kokh
lorente
bourgeoise
druten
tuifly
kdb
accreditor
armoire
lupul
gabrielson
quarterman
witchy
thornwood
karolyi
celandine
campany
lehar
stenz
keizersgracht
tomohiko
dantewada
johnst
komiya
chimei
koganei
nazli
dockland
superdelegate
eyp
entrepot
slumming
lount
kosuth
urushiol
nurettin
kilim
fronsac
dizdar
leers
taï
zumaya
parkchester
clewiston
changshu
challenor
grognard
mallows
drucilla
clamorous
laminating
guadagno
wolong
jlt
gardi
rmk
europcar
zaharoff
terawatt
washingtonians
schwegler
janno
beqiri
faouzi
garfagnana
snagglepuss
høeg
gameplan
sawrey
nonconformism
nookie
atak
bilino
overarm
snapp
passarelli
eumetsat
hayyan
ranbaxy
intracerebral
advocare
frt
smv
raymi
butley
sieff
ferres
arthas
fastow
barnier
wenninger
pratik
cherven
lobar
pitsea
trasimeno
chumakov
xueqin
willans
protic
amitav
carsen
tyrannicide
avista
krishnamacharya
franzoni
gumienny
untraditional
shotput
accesible
jeram
uniroyal
heidsieck
chona
passito
yasuhito
vincenza
cotman
bliley
vassiliki
numer
dockweiler
esas
mutambara
intercessor
egleston
wools
katsunori
dalbeattie
gangrenous
fairyhouse
alrosa
steinkopf
corralled
mzee
ogwumike
nasturtium
kls
bml
makri
zaillian
psyched
nawar
stepin
wabe
cads
trichotillomania
bilyaletdinov
lincolnwood
deruyter
typewriting
naruhito
wolmar
kassis
bradesco
invernizzi
podrinje
roloson
ranmore
chypre
temi
genya
silverlink
shambolic
bastow
ozren
vouches
appo
nvi
sligh
satyanand
losch
lutwidge
penick
gelert
slavo
metalheads
kozintsev
yandell
balka
delingpole
wtxf
cowhand
champagnes
galip
spottswood
reverberate
coalmines
sappington
scrutinising
cagey
contemporânea
northeasternmost
limahl
holyland
befalls
traficant
crveni
unrewarding
tavy
ottoline
compostable
doumergue
korgis
gtmo
monstre
plectranthus
mourvèdre
tharwa
gebhart
botz
zaritsky
bartholemew
sassa
arons
weblogic
sachets
ormrod
reverberating
siyum
distractor
savidan
roughton
clamored
aaah
kravets
brigman
ambers
istaf
keeshan
linichuk
tommorow
lankenau
sekimoto
padilha
mncs
reseach
wojnarowicz
thicknesse
carone
palmquist
zellerbach
yanjun
fme
epia
cyo
slussen
curtice
pressurisation
leysin
mandeb
marzouk
convo
ebolavirus
nimmons
midyear
haemolytic
phoblacht
bastable
kipketer
minni
gushes
manoharan
lonzo
songcraft
birchenough
parran
fritjof
colosio
dieck
annotator
trendsetters
grubbing
diffa
abramovitz
fki
bojinov
ammer
wakeful
marles
trompette
refurnished
sirenians
delpech
auria
nevzat
carland
jaufre
manchild
basov
mckey
funland
seleznev
hivemind
moshing
eiki
fröbe
crorepati
salik
whirly
tascam
bleomycin
bauble
baying
finstad
acrophobia
mandana
yesha
hieronymous
zoraida
supervolcano
razzies
delran
jover
perfluorinated
cabri
roediger
cipd
reciters
cappiello
awwa
intussusception
pribram
cerveris
fitchett
egginton
tandragee
ricken
markelov
farabee
faerber
xisco
gesine
immel
efremov
petiot
pól
mabley
vornado
baystate
handbells
tingler
serapio
rande
ispa
olza
visiongain
bowering
massow
antiepileptic
remonstrated
pmos
phentermine
gracchi
hulley
appam
waay
stippling
accs
hcmc
shomrim
groenewegen
developpement
cgg
dacapo
krusen
twt
mikhailova
postbox
ibstock
muntean
laybourne
mrtt
schwein
qinglong
coagulant
besame
recodo
waban
apar
abbemuseum
nyota
yoriko
xplore
reep
dattilo
banyuls
nikiforos
spermaceti
cravo
boffa
unsheathed
rkm
freind
goddamned
kering
makgadikgadi
chertok
sisti
guite
sandblast
capris
alimi
abshire
reedham
colicchio
chkheidze
lyth
locascio
keansburg
wynwood
burditt
balletmaster
huckster
muntu
confit
bukittinggi
sportal
emancipator
ninnis
mcelligott
cbrne
oyak
idrive
pazo
chaand
jowhar
mameli
crouches
akra
enzensberger
ghyll
bermel
musiri
ixe
ruhleben
vilanch
gregynog
derian
beshara
reveley
jordie
nanoelectronics
thorman
bossert
pinback
hiruma
sucky
impost
blunting
bathysphere
creased
standridge
hoopers
nerine
allason
zinni
greenshields
jihadis
checkouts
mirante
baekeland
netsky
schongauer
phuntsok
clamouring
potapov
arkanoid
harangued
americom
horween
kakha
neall
inoculations
meece
denisof
woodchips
cyriac
hundal
dharmesh
foretaste
cookout
chanh
mccrone
bundesländer
dromgoole
hunnam
dormael
sonika
kayden
viernheim
madhok
warbrick
regrown
novis
oslin
supercollider
rudrapur
mayson
newswires
plasmons
snakeroot
parisa
oswaldtwistle
calve
comt
costelloe
leimert
dahmen
abim
cowsills
ashima
neang
ligertwood
standart
kubicki
monpa
mekki
ousley
tulia
ellicottville
kilpin
kirovski
koosman
recapitulates
ewin
theatregoers
valentiner
fima
yeagley
bavand
harrel
repaints
ashari
expresscard
practicability
octagons
syndicating
mafraq
spired
hanak
marani
dathan
neitzel
camulodunum
schoof
inducts
simitis
resited
gaudier
dumbadze
forbear
khasavyurt
heasley
lutherville
ifab
frenette
demby
lubich
ntk
turksat
folley
weekenders
disses
rupnik
perov
extradimensional
fayoum
kindai
skogen
polymorphous
ayuda
wingecarribee
locatable
abramsky
besigye
methley
jockstrap
assefa
blazek
metrological
enplanement
maisky
oser
wurzels
hengoed
nocenti
wyles
misuari
mcinnerny
fenske
caiazzo
cockiness
madiha
spycatcher
poniatowska
prognostication
yorkdale
kavadi
emulsified
cokes
schulenberg
recoba
söderlund
fortinbras
roadable
sreenath
ebihara
haunter
landfalling
strahovski
burdell
artemyev
gutknecht
perata
audiotapes
fleshlight
briest
reyner
brigden
lorge
botija
qol
oktar
spoliation
gjertsen
tronc
mouthfeel
ciega
magnifications
kratts
toeing
denr
lmn
allagash
dogtooth
disfigure
cellmates
eissa
diljit
ombo
dedeaux
prarthana
danker
midianites
nutjob
burba
pelinka
glk
singlehanded
waddingham
derisory
deodorants
innit
armthorpe
gautrain
reformasi
rossetto
crennel
aurland
blowed
dettmer
speights
buicks
tranh
wittrock
frappier
vitzthum
altizer
kmel
joblo
magat
rozema
mottes
tigana
villalta
mbari
thumped
toshinori
helling
baviera
televison
drosselmeyer
pforzheimer
tavel
pragyan
kudrin
expeditor
shuten
topline
brijesh
yarwood
unimaginably
airtours
xolotl
kerik
amali
jannik
poolbeg
uplifts
obong
unmoving
rnav
jde
wtsp
andorian
lathbury
armonico
sated
gaggenau
roumanian
ehmke
bigeard
scrovegni
heterodontosaurus
saria
glencorse
wideawake
tolpuddle
flesher
binges
stickel
artech
stefanova
mickleham
baulk
covets
capriccioso
connectome
malivai
jaros
metoyer
wulsin
succes
unreceptive
schonfeld
kapela
oceanview
equifax
chalking
pyg
damnable
azaad
cpx
longan
gravette
garzelli
deddington
cddb
takahama
keurig
blakesley
stilgoe
scrawl
whillans
holdsclaw
consigning
orit
homel
angmering
goulds
delauro
lugubrious
unfenced
hurwicz
pipal
condescendingly
bedd
extinguishment
ekblad
tipster
sterjovski
tadjoura
yennenga
cortney
karppinen
chimu
korla
vyrnwy
rumley
pimped
ornamentations
protasov
electrica
jennys
shafie
downscaled
melsheimer
costacurta
teignbridge
intech
gbowee
lawnmowers
millones
breeam
elián
sentir
farsighted
desanto
partied
northcentral
saucerful
restalrig
maisonettes
ethnographies
xuereb
centris
lacie
azpeitia
nvh
ggp
gornik
tapout
dahlmann
leatrice
bovines
siria
hartel
mmol
gilsig
haviv
ations
pramana
uihlein
decongestant
bobst
quinolones
hillgrove
reeth
fiammetta
passato
scats
tabuchi
riesman
keates
haiden
racisme
escom
dragos
unbalancing
niemiec
munns
teshima
harboe
myla
boeung
definiton
recouping
coatis
herrell
redlasso
tobaccos
clelland
bottisham
mamat
nicholai
torralba
solomona
propitiation
audiogram
adkisson
rochers
glees
eile
garthwaite
martinair
thygesen
experiencia
teleri
bisschop
romila
theobalds
modernly
snv
rakish
bevacizumab
kuchin
agot
vinoo
aaha
hunny
factbox
oll
wenda
sesar
fahnestock
abagnale
hodgen
towline
nonchalance
marashi
silom
rbw
opet
glg
kondrashin
younan
kroft
mbaqanga
linage
wetherbee
tribally
navaratnam
edwardson
craine
koolau
ratatat
jrf
superyacht
aasa
demichelis
vinalopó
colver
kernot
kornman
okechukwu
nontechnical
pindling
sesc
elephantiasis
longbowmen
brzeski
omma
kump
adrenocorticotropic
pedantically
westham
contactors
metronomy
wheaten
sesterces
comtois
weakish
forbears
transferral
warkentin
subcontinental
hideyo
wfsb
machrihanish
geras
shinwa
sindoor
cojo
kpho
levit
keebler
lasserre
lanni
rioter
alary
musco
chiffons
sakhr
otlet
gatch
mpho
komplex
haraam
sylphides
schaeffler
adea
mitzvahs
transcode
uponor
gunaratna
hanashi
plumper
caban
convalesce
cochère
rupali
hawton
senese
sosnovsky
penydarren
zlatni
ubach
barabas
pobjoy
drinkard
ansip
marven
stifel
baroud
eion
boatlift
skoal
placa
inescapably
bheag
sabalan
wakering
yuqing
manero
bami
prefuse
newz
mckercher
tuscania
romped
asier
obituarist
doorjamb
kirst
rehabbing
eyman
pemphigoid
ovulate
frontcourt
borwein
lemerre
ppps
tollhouse
ancón
cabaña
sharipova
kamae
jamesburg
sledgehammers
monoblock
bluest
akyab
itera
mandiyú
penteado
bradner
qadr
banaba
sachlichkeit
reorganizes
fellah
mcdine
stouts
ninawa
prizemoney
marschallin
trainmen
charef
cordyceps
newick
aahs
lodgement
zewail
raisons
jhumpa
autio
manganiello
kwangju
devorah
thinners
wpro
frizzy
photodetector
ecstasies
biogen
rabbitbrush
muhyiddin
papastathopoulos
rosko
distin
stemless
pulsatilla
koston
shanked
airbourne
tranby
pastores
transversus
tarcher
zacher
aena
chessmaster
eleftheriou
amade
vivitar
cabourg
ehlert
salemme
finkelman
zaranj
winblad
bagnolet
kickhams
felicitate
gebbie
densification
scrivens
powick
unquoted
unmoderated
hairlike
shuba
diena
racicot
jonjo
eatonton
dcis
owasso
pozzato
tabe
vietti
duni
saling
ninan
zehi
starbreeze
wallula
sheck
brogdon
fioretti
sheckler
piontek
mcgarrity
zetterling
leevi
guertin
nfi
ykk
milberg
poisk
miren
geluk
tureen
seita
stretchable
smcs
fuzzball
terraplane
dfh
maketh
simulink
vasoactive
levitas
patz
eastwell
bronagh
mcgurn
finalisation
garnaut
nemorino
restocked
brigata
elkind
arnal
pozos
machair
maheu
ratzel
wfor
acquiesces
cívica
stuermer
thameside
paata
alioune
yel
csps
giveth
wildbad
protools
zuberi
lisch
naveh
legitimizes
nuckolls
generosa
aastha
kerlin
silkair
foxbat
yanagida
shoba
schofields
mcmeekin
tefl
cardarelli
kingsman
routier
hubler
sonoko
battrick
ludin
mezz
rappoport
balerno
wainman
asja
boan
auma
fitzgibbons
namjoo
scatterbrained
nainggolan
retzlaff
shuvo
yafan
apapa
chauve
vernes
reversi
blinkered
omanis
describer
spooling
ledin
dépôts
mprs
matey
walkability
mercs
capek
lugia
briny
bunetta
picante
borini
pleasantness
whickham
professore
sardana
buzzin
grunter
zeinab
lohas
reheating
aalam
zagging
metasedimentary
lyden
yunjin
supercapacitor
feedlot
montone
shahida
volontè
ledebur
edgerrin
ivans
elvington
epaper
demartini
wynona
ewig
unflagging
lagunitas
palmero
pelsall
answerer
dairyland
paterfamilias
elgort
whitall
propertied
dinakaran
tiya
gullwing
mohonk
uncivilised
faas
broin
winokur
ozai
coffelt
proficiently
unalienable
braziers
bashers
bloodworm
mistrustful
raizo
sanzio
sabur
lockean
bitam
rottman
wassmann
husaini
heavyset
centrino
atalay
naxalbari
verdade
skakel
oberstar
arakelyan
riseborough
exultation
breillat
hulce
snog
hutchesons
aglow
stanes
keckley
sbir
masterstroke
donnel
penk
taiho
catting
livanos
immersions
genov
tomme
gymnasien
abizaid
tulalip
sobriquets
gallstone
swoosh
antoun
spagnuolo
findel
emeriti
bierko
flunk
ileus
impossibilities
mcmurtrie
wiretapped
winiarski
horlogerie
chataway
lesvos
tubed
aspens
reappoint
evt
merisi
bantering
vtech
arregui
monell
unimportance
erixon
myopathies
shahla
athans
polhemus
pollokshields
malakai
tammam
coagulated
krehbiel
escap
emomali
minoxidil
eww
persa
disorganisation
friedreich
isaaq
ederle
koldo
hermel
goodfriend
bwyd
emancipating
collon
cholecystokinin
sandbagging
rices
dynamique
duffryn
idon
leonides
becouse
wny
xiaoqing
manza
zucca
sakiko
dammann
danzi
castellini
dhal
ravitz
pietre
mojito
sayyida
killorglin
underplayed
ooga
youporn
jeromy
dongting
ispat
jull
igbos
iittala
dillards
gallega
guler
mcqueens
wcbo
insomuch
flaxseed
wibf
meanwood
arūnas
agglomerate
absurdism
goodwrench
calaf
asds
trichomonas
olhão
mudassar
demurrer
visitante
windsors
mushin
inapt
caminiti
kaysville
mathe
pasqualino
kiowas
chavira
prawo
karapiro
canzoneri
phosphorous
kemin
hellzapoppin
clarett
cadden
cullercoats
peggie
markopoulos
engagingly
chicanes
skytrax
loewenberg
shacklock
lampshade
becontree
jimy
indpendent
whereon
vizcaino
dramani
figments
contrariwise
validators
skylit
azoff
tularemia
lamorna
wust
antonova
hewa
coagulate
brantwood
bellmon
casciano
homebuilder
françaix
brauns
gartland
radtke
mtech
triiodothyronine
albiol
haloes
haicheng
myrin
affric
niecy
carneal
nilda
noffke
kanza
luganville
rattail
colautti
siemon
jehlum
peals
nyce
hogi
welke
hammerless
hartridge
mollified
butanediol
laterna
thoas
pastebin
intercessions
hasanuddin
sealings
filmstrip
caudwell
guenevere
hostettler
cavanna
kerron
outshot
drewery
smid
ranker
eelco
playwrighting
thian
tepa
naturalize
boffo
wansford
steadier
cristie
snacking
giacomini
dulaimi
nccpg
reininger
delfonics
diament
doust
shipbreaking
mariama
scheiber
gaydamak
palmo
cfrp
baginda
granberg
epoxies
acmc
fairhead
gribben
gittleman
compartmented
overreliance
airt
alworth
kopple
crossdressing
lemann
schoeck
ksdk
transglobal
larkfield
carrig
femurs
farshad
kabinett
babbacombe
manics
sasai
emulations
subspecialties
robuchon
prizzi
nordhagen
xenomorph
sparseness
lastuvka
derny
pictorialism
orsett
salaspils
chubais
alane
rainfed
algarrobo
gutu
hornacek
gangi
kondapalli
otehr
speedman
saor
mountfield
catalino
fatai
bidu
böcker
heidemarie
ugi
fuckers
halyburton
milquetoast
alpilles
galeton
eastburn
pukka
immigrations
cloacae
geminata
upyd
steinunn
coffield
paschi
thoughtlessly
verina
sofaer
poststructuralist
defund
sproston
slithering
vapi
samudrala
crassostrea
snakebites
sgarbi
twits
demonte
klinge
antagonising
gabri
barangaroo
butor
mochan
scor
rosenburg
dolcetto
brinck
cyberchase
indesit
futcher
kirkegaard
matras
sparsh
jagielka
drumchapel
cordiality
precipices
zanelli
wibw
gateman
sannat
menkauhor
ulsterman
dysphonia
zitting
oura
markes
muchos
hereunder
gwendraeth
palaung
clawless
barentsburg
deluding
sekhon
voller
diaconu
neuroanatomical
callimachi
goneril
watkiss
pascals
opps
wilkos
doonican
procedurals
panafrican
llanddewi
austintown
kittrell
hooksett
ameristar
buxted
barakah
vlahos
storys
kyrillos
bwindi
turbat
wingrave
terex
paraquat
chickenfoot
jabbarov
enchong
oyler
machaca
folkston
hengshan
guanggu
curies
reforested
serralves
qingshui
zissou
katsutoshi
monogenic
mbala
alstott
abanindranath
louver
leininger
copel
nocton
rhizobia
fabrikant
stephie
paleoclimatology
mobilicity
rigamonti
patxi
marandi
adjudicates
baize
linscott
gmanews
nabhan
walda
noteboom
feret
quilla
penzler
buesa
bittman
foggo
gestured
jrd
ebsworth
liautaud
skaf
taganka
philippos
carhampton
holliger
kfs
bholu
champenoise
sanderstead
natomas
transcendentalists
noughties
bazookas
waidhofen
streng
darktown
piquette
memorex
breer
hastens
fruitfully
nasb
sahakyan
bonaly
wavin
overspeed
shredders
wachenheim
harpal
pedagogies
gendebien
lispector
jurowski
bessler
frappe
eastick
accola
berrian
afrodisiac
argens
hyperstimulation
vato
roubini
michalik
maniche
eyelet
dysthymia
glams
sanon
mfsb
misson
mnn
saleability
clw
mardas
dayspring
overhung
falsettos
chapmans
perone
southdale
alev
portsoy
yanes
teissier
heidrich
monda
giai
finkbeiner
techland
quader
parenteau
drell
dhok
mbembe
tsujimoto
cooties
nadiadwala
poher
dunnellon
scherz
bado
sunrises
mullein
overcomplicated
harnish
bosma
telecomunicaciones
yardbird
freestyling
refillable
microhabitats
georgetowns
jableh
tasneem
lintner
schirripa
mythologized
mudi
frangipane
barye
barner
twic
weera
pressel
naggar
bonte
dallaglio
nighthorse
wex
adits
nycta
marondera
tbms
lemley
dto
koneru
dysautonomia
belchite
agganis
tomczyk
möhne
hoarder
vcg
studland
featherbed
selten
elide
poythress
pretentiousness
fraker
headstocks
sedating
quinolone
technotronic
keyring
fiddes
intelectual
etic
sztuk
carlesimo
uninflected
killip
dtes
wassmer
venerates
permethrin
infrasonic
vitiated
terao
greenscreen
jodhaa
jnc
levelland
luns
nardone
patou
neds
ronquillo
bouwer
kiyokawa
arrestee
badinter
staphylococci
staniford
quoits
kirstine
freedonia
chigasaki
moyses
bule
melanocortin
alibert
moskalenko
enas
freixo
valadez
luxemburger
hiltz
glenrowan
hackathons
gofman
servin
liberte
ingber
midterms
pitou
retaliations
sumption
tumblin
rainmakers
dependently
stoermer
phyllostachys
dissuades
bouet
philyaw
résidence
oopsy
kidzania
siegert
marigot
manosque
antacids
stettinius
claverton
mingas
splendors
ivanauskas
godt
abdulhadi
celek
toz
glavin
timchenko
lyrica
chefchaouen
theorise
calar
gulou
corroborative
misstating
coogler
unshared
rahba
uhp
debary
aprs
chenille
fasth
treue
fagel
candreva
ledisi
grownup
stereotypic
konso
windstream
liepaja
qic
gimcrack
rebozo
theunissen
recurrently
etymologist
buraimi
drr
wychavon
mjg
itokawa
motzfeldt
sunshade
rockliffe
buan
sice
unscrewed
berhane
sonett
kololo
giardina
vamped
godchild
tartak
ulmann
thuraya
mellophone
angelia
corá
sundell
shipbreakers
unprecedentedly
cromie
cobbling
dulci
macatee
clé
rosecroft
crofter
kestenbaum
megacities
breakeven
pyin
lerista
kopek
dreading
velaro
assez
cheju
bouza
khuzdar
digesters
nctm
garvagh
lutin
therm
irreproachable
ratho
tefé
tega
hurly
mascolo
volman
killyleagh
dinorah
eimear
lumenick
nazem
arshi
dahr
disdains
anthropocentrism
commissione
fukaya
lingenfelter
paperboys
dittman
mehlis
beeding
mackerels
zebrowski
tinyurl
widdop
jorquera
axi
kaffe
takahiko
gabar
demaret
ajose
ashcraft
barinholtz
cartwheels
avf
odegard
jottings
kizu
berbick
patted
satiation
samhsa
soes
displease
perast
bmm
nordqvist
panga
ravilious
horsed
threepence
ldpe
gaydos
krysta
atalaia
loupe
burti
twerton
dabei
canin
kever
dunfield
orphism
lisas
ladra
chole
epting
pitfield
maoi
stefanik
weee
wiehl
ateneum
defries
flydubai
moak
serbsky
evenk
mallikarjun
parlin
meriem
dimitriadis
decontaminate
woolas
lussi
toucher
alborada
smallholding
panem
wydler
sypniewski
cossington
huangshi
guiting
varone
blystone
andrias
haniya
rinjani
ostende
ruthwell
maurois
gautami
debatably
changhe
murrey
yeghishe
ight
hooter
unplayed
policewomen
bleeder
gansler
nearside
bartov
cetane
malladi
cadaveric
maters
alpay
fukami
overemphasized
rennison
vincentians
ransomware
gryposaurus
monusco
liverani
magsi
mcgeary
normanhurst
alcopop
aztek
swallower
laurentians
blitzing
gunson
rashness
wher
enslin
yoplait
barricading
dicen
attfield
banh
lavrio
stechford
bellahouston
loyo
sifter
praunheim
lechler
ylva
abdelbaset
luchs
ineos
hazanavicius
fauvist
latz
shelmerdine
twinkies
storekeepers
sloopy
mandarina
fransisco
simeulue
proscribes
blome
edgewise
overstayed
apportioning
shackley
epn
mazor
ethnomusicologists
wedi
dangerousness
petersburgh
pither
castlemartin
compos
whiles
sdot
canalside
suprisingly
obfuscates
suad
galabank
rusia
laughingly
faras
downeaster
aramark
grokster
peed
denna
homecare
slowinski
skyship
sirion
garbus
randon
nabeul
appletrees
fusillade
richy
dillahunt
pangloss
kerekes
knoblock
ravenal
deknight
keser
chapati
barbacoa
baster
metagenomics
solvability
stretham
paleobotanist
peb
nhgri
kalinda
safaricom
heade
mauthner
croitoru
nieuwegein
karnow
secrest
lucasian
humen
benanti
fauns
seemly
szymborska
borzoi
landeta
perton
escapology
busywork
ormont
poors
ferina
rosalinde
thatgamecompany
ebsa
pooler
satheesh
lifschitz
apts
wissel
hesselink
purolator
khil
complexly
stirchley
benacerraf
narin
maytham
handlin
conine
wilhelmsson
leuzinger
metes
hige
kuze
marvão
bartkowski
feddersen
bazeley
aamodt
papiers
cosmonautics
elmsford
tamano
enameling
khudai
garnacha
jurby
zom
dander
bonacci
diethylene
necklines
manohara
gelin
veirs
phosphite
laureen
robbi
masahide
krupski
horsefly
arief
tideswell
yaeko
reibel
chhina
aldwyn
aonuma
lova
sansui
rammohan
proudman
steadying
irreplacable
signifcant
niveau
nonunion
staar
hakala
vanderpoel
nowinski
caral
evelin
underreporting
dysert
kiessling
keenest
ette
ghosted
meddlesome
bacc
anastasiades
harton
stumptown
piromalli
postdates
tahnoun
turba
snee
eroi
busfield
sitarist
licensors
kandla
linnane
shechita
taneja
zachar
opacities
elyas
brockenbrough
balibar
barriere
krymsk
kalach
fluker
exultant
zircons
telcom
smocks
iodate
ncic
bandiagara
tzi
mouri
almont
dangote
ultrahigh
colourings
benali
turcios
costea
bruyneel
npy
reinterprets
molecularly
reversers
bootsie
lowrance
sesso
scheinberg
tyngsborough
kieschnick
zirkel
mechner
igal
bmxer
wittenburg
sayward
darion
hypergiant
toryism
otsuki
metzen
samhan
mazra
disgorge
margetts
kitanoumi
backplate
doswell
baaba
schupp
withdean
prepay
flatus
lorrin
transpower
lenkiewicz
thrashes
cheena
zunino
pieroni
kosti
siaya
burdisso
kayvan
woodworm
jalouse
frerichs
djalili
sohl
ulvi
lobanovsky
develope
ullal
pollutes
clipstone
angele
amoah
roussanne
muggy
laderman
papazoglou
aromatica
supercilious
pakka
mycobacterial
zamil
formularies
aminuddin
motorstorm
sinama
searl
sadlers
bostridge
heartworm
syndromic
jairaj
saturnine
vanik
bukharian
mouchel
norfork
roday
jalousie
tramonto
yalobusha
kassian
dinham
retransmit
hilmer
pesters
gainford
heldt
sjm
jamea
sautet
autobahnen
carrió
prouvé
nanopore
foppish
mesopotamians
silmaril
desulfurization
minitel
marans
rigmarole
galleried
caramba
toyotas
minoring
arisaig
propounding
tensioner
wathiq
gregerson
alemu
mender
kanz
kcnc
arborfield
harmonizes
unbarred
oap
burrel
confalonieri
metso
untestable
whatton
patang
chicle
lavalas
sneezed
northman
rasika
mundari
docudramas
areata
tuktoyaktuk
noatak
spectrophotometry
windsong
amacuro
loquat
pried
beinart
partin
thorens
tway
hireling
ashkan
sidnei
daedelus
estancias
convergences
homare
ceesay
samjhauta
crimefighter
wollemi
cinematically
hemerocallis
rubbo
impetigo
gotto
bof
andalou
stencilled
miraglia
mammut
portora
kilbeggan
wiegel
vollenweider
overcall
baglione
wakeley
oiwa
terespol
antiplatelet
deese
skims
bahoo
carquefou
extraordinaires
taproom
narwhals
decena
chimenti
minsters
tideland
gardasil
rubeus
ginormous
bergier
toughening
streatley
cascarino
watmough
passcode
canastota
dumba
botchan
tankred
kcmo
screamfest
cejas
roussy
amorgos
ryding
satelite
cassowaries
torma
klavan
orgueil
oligodendrocyte
newspoll
excretes
whitestown
mccarren
glenmark
turfgrass
yorubas
victimology
dolmans
debu
norina
demilitarisation
ocelots
dzehalevich
emx
killiecrankie
hagin
nelms
farriers
wanderson
doff
grether
josefsson
dwdm
haunch
lupins
hoaxed
schnebel
kulon
ngah
isringhausen
thielmann
bogues
kyoichi
jalapeños
niah
kurtley
atrociously
veno
gynes
zhangke
gillnet
elitists
porzio
carmont
mellors
osteonecrosis
labruce
danno
birdshot
cardini
piñeyro
unsworn
esche
armatures
mistra
quijada
scotton
dubos
aspergillosis
profondo
tempter
angarsk
aimi
hatrick
gooseberries
strathern
escoto
lwa
judentum
hellbilly
mcquay
madon
natwar
majuli
katv
higinio
pdgf
ipas
knyazev
stereoscopy
musoma
woodbrook
coronaria
prevalently
tarvaris
pawning
cancan
mirpuri
linnets
isobutyl
dibakar
perlow
sulis
nappies
iddesleigh
unhampered
fehling
brynjar
ketterer
knapdale
portella
rowsley
thuc
mermelstein
cazenave
spiciness
maquinna
brambell
manannan
northenden
pantaloon
mazury
sowetan
hajib
devotionals
parkinsons
véron
sossamon
akhenaton
nippy
kickflip
pathmark
moorilla
krong
mitti
gett
palito
amicorum
glynde
atol
mijas
regurgitates
cannet
fesler
harnisch
mouloud
vladimiro
izrael
gehl
mulanje
splattering
keino
nagatomo
réaumur
omondi
lella
luchadores
poyer
foulness
patters
públicos
arev
mamelukes
biotec
foriegn
militarisation
trichet
gernsheim
venturas
supersound
progres
macallan
adame
steeplechases
bodelwyddan
dankert
cuozzo
firelord
sunu
rightsholder
psychoacoustic
hajong
slimed
selebi
schut
antireligious
hebblethwaite
dilatory
schorsch
whomsoever
ashin
dockrell
selanne
ngala
stagioni
brenne
politicus
oreos
akhmedov
eberts
newsmagazines
blasien
dncg
snoek
foregate
expectedly
marcovici
dissapointed
eugenol
blumenberg
thundercloud
mythologist
salvoes
grandkids
plainness
cendant
empiricists
paprocki
soliloquize
kadirgamar
hormisdas
inupiaq
duchesnay
shinnie
boteach
dimi
outstrips
necesito
hooting
aprc
prizewinning
esmerelda
reum
pegues
kasra
nafeez
gearhead
wayna
stirton
lessors
titers
radioactively
skold
seabass
unmerited
quolls
nushi
vaira
thankfulness
mikell
witherby
wtvf
ltcm
estos
proles
kapala
janas
ruinas
dhows
ivon
belville
bkr
ocio
ergonomically
nscc
aletsch
broadlands
wanborough
liberati
demystifying
mikra
immunosuppressants
wedmore
parlato
brandenberg
waldoboro
deposes
ogp
reconnoitred
steamworks
mckew
llr
mitri
wiffle
biocentrism
meddings
bajar
dobrogea
lhotka
dotel
razu
poite
oldland
mahabir
crile
gelinas
metroline
girgenti
littleworth
manucho
rustad
beakman
kdaf
waxhaw
habel
chook
oizo
sough
parthenia
triable
guerard
hoagie
muncey
segerstrom
sagua
sudbrook
prestons
munjal
socalled
parmanand
nahshon
dhahab
moviemaking
aleksandras
hsts
barrau
waterbed
bhanot
marwell
crenellate
colindale
brioni
kustova
botez
mozzie
maierhofer
pingali
kluwe
matapedia
mateer
louanne
fastcompany
serranos
krippner
redburn
pios
mindi
orangeman
cower
travesties
tinsulanonda
multicam
gravelled
socioeconomics
styal
escada
grasstrack
offit
ctcs
cerina
dados
nioc
biddenden
corporately
lilit
neyer
candover
gubby
malchus
saygun
giguere
dorsa
kpelle
nicoli
kolan
hartness
atjeh
hili
cockett
usmar
tauba
caj
seicento
sifang
dictyostelium
umansky
ramayya
neema
roulet
kinn
kimhi
worsbrough
gouzenko
karlan
ungaretti
lehighton
tige
videocassettes
swerdlow
efsf
warfighters
reeded
holtville
ventas
issas
rathburn
casados
helicina
harnik
bernardina
gowin
nabor
yurovsky
soberanes
aeris
fuelwood
longneck
elvises
marchuk
vendela
arken
irawan
kensho
pohlad
mitridate
pantazis
frerotte
lubomyr
mystere
dreadfuls
higgin
mck
seelbach
lincou
niraj
götzis
unresponsiveness
ollerenshaw
mirto
pastorini
balquhidder
gadea
anyon
aerobraking
takushoku
dachshunds
aitch
solorzano
libet
fouda
moreni
lalgarh
gaullism
ironmongers
robida
bandolero
finegold
virsa
othellos
raiola
brainin
cernavodă
ptacek
wroten
cathedrale
pistole
renren
hollin
eszterhas
hfm
caridi
thelwell
favio
reionization
heppenstall
outnumbers
amah
sya
remak
windowpane
copco
shriveled
bpk
aargh
halifaxes
warehoused
inbar
spagnoli
salver
laxminarayan
mesmerised
orotava
tne
hehehe
cym
transacting
brambling
regrading
rugger
ebey
greers
ammended
toasty
netease
irabu
yushun
denari
gwynfor
presbyopia
lawther
weishan
wist
bayldon
greenwashing
reanimates
dimopoulos
milhous
ameriprise
gurrumul
hoathly
fruta
leggo
terashima
blueback
woodspring
ggs
novocaine
dibnah
muchnick
roumi
gudkov
swafford
kantrowitz
reassertion
stanner
buckby
manlove
cumulate
rupal
sognefjord
milita
hazza
flippy
churkin
imtiyaz
princetonian
perahia
amperage
wwt
swedesboro
julee
stephany
propolis
punitively
dipstick
kelburn
cabg
prevarication
lusted
cardillo
boxmeer
supercedes
jialing
davyd
tennapel
enh
covetousness
cymraeg
monton
hollanders
wicky
campidoglio
reefing
poinar
genzlinger
ockenden
tantalising
breakages
killerton
jagadeesh
pullinger
tasburgh
lasme
goodpaster
excises
hardway
troxell
llu
craiglockhart
schnack
opsec
prajatantra
ivanna
argali
depreciating
shuhada
weylandt
velshi
klimenko
declawing
pedis
caging
interprofessional
impassible
zoellner
cisac
hyves
arthaus
coatsworth
bedini
mellinger
prising
ipac
gulak
thighbone
pramoedya
hieu
wws
bordetella
wolfit
castrop
omneya
microcars
nsps
parcham
hesmondhalgh
badagry
bogans
gimondi
rieko
gitler
omprakash
donec
muhajiroun
gilden
suppes
voina
gilders
ypt
chasselas
obermaier
fdma
mineralised
saraband
pullan
lunula
tabarro
francesi
newfangled
jasinski
bresee
esnault
epitomises
slepian
atrac
koldewey
jimson
gspc
laslett
bedrosian
flecker
hll
idolizing
saola
ischaemia
magdy
nembe
handlooms
tachira
noid
eulex
vulgarities
dicko
cheesemakers
riklis
iiid
hirsute
trach
digressing
krick
deadheads
humblest
tresvant
bwe
torney
olusola
donavon
moross
seiberling
hiranuma
roadwork
irmo
wellpoint
tlrs
shadrack
osteopaths
unrepeatable
lutefisk
madonia
intergraph
gurl
flaten
noura
garren
gedde
qasmi
katzmann
bazzi
jalopy
strongroom
simmers
chukwuma
nosenko
horseshit
freeserve
lissette
pragma
propper
horseplay
discriminator
swalwell
pianto
magistrale
assaad
hispanos
canvasback
troell
coolum
privies
upslope
gahn
patrício
smartass
tripa
vittek
tarapore
hirsutism
dushkin
falseness
turl
vishesh
deerhurst
electrometer
sonos
amedee
obon
preheating
springboards
bonfield
superzoom
taizo
hawkmoth
hatful
candlelit
ampera
dipendra
belote
proprietress
malen
spirale
abdoun
moyglare
desch
renegotiating
izza
buechler
thibeault
luxemburgo
burnup
shkoder
qianjiang
haroldo
chauth
cantrill
amorrow
orrville
bowerbirds
msgs
margita
ladyship
aguero
silverliner
sorgi
aphoristic
brightline
bjornson
barkworth
rebell
boggess
sheepish
playtex
darman
rosevear
ussc
morganville
persue
presidentially
oikonomou
vibha
ardross
systemc
philippic
toxicant
escheat
guízar
llanwern
glycosaminoglycans
guigou
shadid
northop
dededo
stratiform
officemax
siriraj
binali
qts
hasdell
plumeri
psycholinguistic
vlk
sofar
nosler
mystras
olausson
trevena
doesnot
stegeman
hornburg
mujeeb
esfandiar
altimetry
preheated
cohesin
zelter
bové
leites
shortall
goshawks
dph
polymaths
haaga
webobjects
tunc
prioritisation
quelqu
foppa
buttercream
senftenberg
minuses
fixating
maninder
democratica
harmsen
jagjivan
tonder
udoka
gessen
gbps
matula
floride
kingsize
kelle
railbed
etches
steketee
groundshare
dockum
scrutineering
roughy
urano
xms
garst
plumly
maleme
orate
mischievously
cookes
zientara
kinna
chiquimula
velle
poca
yco
moonies
arrillaga
tacita
conveyancer
ollila
miika
demel
martines
hbt
seraphin
uncharitable
tuazon
muktafi
amorality
clownish
sixaxis
redentor
solich
quires
mccumber
ghaleb
improvers
chalgrove
melhem
keßler
firer
unflinchingly
bolivariana
leeching
adger
inao
footstool
ndungane
luanshya
papà
yasar
sundews
bambra
gibbings
filipo
gayness
inching
lianhe
duetting
astill
mignonette
nephrons
cader
ravensburger
ahorros
shaowu
oyamada
aydar
lizarazu
bengel
ixl
cuffaro
vieni
kins
melismatic
conductorship
nelda
naep
ebp
delux
cadoxton
mcneilly
fundament
bonan
careened
nordmeyer
woodchester
yippee
raphel
aksakov
goude
pictionary
fiorucci
toptenreviews
vegfr
arguin
bulling
algunos
sciortino
fator
coban
prieuré
chacín
finell
norddeich
normalizes
hamhung
keshari
manageability
woodshop
internationalis
kumon
whyteleafe
shiffrin
baluchis
buana
valdai
unmonitored
perim
snipping
wuerl
gwlad
silbert
cockaigne
yaari
brookhouse
chelsie
askance
shigeharu
nyf
unbuttoned
rigali
pantaleo
stamboul
personalise
oxx
anyi
coneheads
zhongnanhai
kabbala
anneal
owasp
relly
nonito
wemple
goosey
driv
milicia
cementitious
umtata
standford
kammermusik
lindum
huds
depasquale
deivid
isotopically
blizzcon
maharam
sise
merca
kangding
wakey
abcb
tamboura
rocester
evildoer
watada
dyscalculia
predawn
gaullists
garfish
barbagallo
nasopharynx
lugu
cameco
fuisz
keturah
loadmaster
goeth
steading
tqm
apotex
tinti
noem
cerdà
malingering
goodlad
wallows
eudy
gunthorpe
drinan
defecated
permira
riveters
ukhta
techne
elx
somercotes
hania
miyabe
madlock
incapability
ukhov
luti
rawmarsh
nccs
schanze
halaby
elliptically
ipcress
imedi
tremé
floyds
schwert
backyardigans
pesek
autistics
improvment
upscaling
charalampos
malott
tendonitis
telcos
soothsayers
torabi
mtbs
monomoy
kayano
mandella
burnbank
karmazin
comiccon
pdci
ggc
samis
ogam
adleman
montefalco
gersten
ghaffari
camu
sagittae
gaffey
llwynypia
dadd
candleford
tauzin
shali
basser
rebbie
adiponectin
mutairi
hvm
teruyuki
boltons
henckels
hym
calstock
yanan
pancham
timestamped
chalumeau
defiles
forough
gaffar
barch
kucher
nubile
sportin
y,z
dribbler
vinyasa
socky
wned
unaccountably
monkee
cuckolded
duroc
microprobe
fereydoun
xwb
samuda
birol
controvery
dokey
evanson
maestranza
egat
transuranium
kazantsev
hymnus
recke
shergold
rushkoff
perceptibly
langenhagen
maccabaeus
thern
ancho
youse
tpu
única
irún
gons
schulten
chirinos
mauritians
donella
brainstormed
authenticates
nosema
photoworks
brezina
hengst
usamriid
ironweed
bundesnachrichtendienst
ktar
spratlys
lundvall
laframboise
rinfret
makena
ief
proceedure
melisande
crêpes
fidelma
rodia
leapfrogging
bozza
clarkes
badong
chichén
cyproterone
dalbello
hiscox
donruss
umezawa
sagia
hexamer
liddel
strongbox
durley
boix
goulandris
metodo
dorfmeister
thakor
babbel
beyle
armfeldt
facinelli
ankrah
cuv
almaraz
whitechurch
pepperpot
amerasian
godby
zakia
qia
liseberg
halberds
zhujiang
untended
zhenyuan
baliga
goba
vish
ballygawley
disengages
rzeszow
shaefer
yuda
gamestar
hbm
peev
frontieres
thaad
uep
backless
kolpak
mercatus
yanzhou
helmick
yassa
izi
solidary
exalts
pachycephalosaurs
unipol
whb
gesamtkunstwerk
peller
euboean
debriefed
takeshima
shortz
mceachran
brockes
sweated
pasteurised
malians
headbutts
coveting
shawan
alyaksandr
repertorio
cuties
simes
iui
militate
naftogaz
farnesina
deductibles
shibani
kortright
kozol
cabooses
kienast
plautz
gropper
deregulate
doorknobs
maer
aspheric
dusa
cossart
caes
yorga
darma
guarín
yoshiyasu
brugghen
krotov
rasheeda
sarratt
mughrabi
sulev
peckforton
sainsburys
praneeth
abdulkarim
cantalejo
schoeller
dkim
songkick
brode
heighington
mout
fadhel
ekstrand
frighteners
stanishev
mccrady
carax
taibo
matthiesen
extranet
partyka
gornal
hiroe
honganji
hgc
haem
philine
ramsi
silko
dieringer
magana
ecfa
edsac
heighway
transcutaneous
rearguards
unicycles
brayan
morishige
pixton
waibel
mushing
shambo
foxrock
yohai
amoria
llega
jaji
keychains
kambli
meare
gunfleet
tricorn
westcliffe
grindle
bifurcates
wissler
raposa
dæmons
gret
portbury
bucaram
harumafuji
belchior
ramshorn
hairbrush
ferryboats
charla
dullest
shubik
boisseau
cascone
oodnadatta
creveld
melquiades
bronchopneumonia
mcgreal
slushy
tullock
goldendale
hotties
ibragim
antwone
voinescu
sheraz
mjp
andreasson
mertes
loralai
restaurante
mbuti
lighthill
varnado
soulbury
cosel
counterfactuals
adz
keiron
guilbeau
gabourey
latehar
chamara
porosus
elnora
stodart
sharone
proskauer
enriquillo
ghazanfar
ijk
mesdag
redshaw
counterproposal
regueiro
immoderate
corgis
boulud
hourigan
minhang
dantley
masr
shirota
toscani
devons
xuhui
ulstermen
zonnebeke
fubon
gitxsan
kliff
rurutu
boey
allnutt
alfetta
mcclements
lafaro
sadrist
bettys
inanity
sadasivan
gangsterism
osweiler
erosions
convulsed
royse
homarus
palmanova
blurted
spodek
supernal
atrato
marubeni
xiping
despain
aquifolium
puckering
yazbek
chaussee
endogenously
wayan
wops
lamey
premade
kufi
lungless
axé
anonymizing
longjing
thornham
sysco
raghunathan
terrelle
carleen
panmunjom
manzar
unanue
povetkin
vahagn
wuss
magellanicus
unreached
fawdon
histologist
tirunesh
harben
hakewill
zembla
tadalafil
amstell
takamado
zookeepers
hassinger
marico
lakra
kibbe
norr
grasser
glimmering
devel
klare
moneymore
auty
complainers
technobabble
immunomodulatory
suddenness
tabulations
savic
bylot
repairmen
clunk
omertà
polwhele
kipchoge
stayton
burakumin
relocatable
armie
witherington
abbreviates
skellingthorpe
rodnina
ezaki
wsv
covenanting
holidayed
gigolos
oasys
bicoastal
eccc
metrix
tremec
dispersers
vecsey
scooch
harra
tierno
quente
coquilles
pietrzak
snooper
despereaux
calbuco
bessbrook
ellenbrook
byut
vilasrao
medulloblastoma
peskov
winefride
dyslexics
kozuka
escondida
larmer
legalizes
samoset
radomes
radivoje
fettered
bellringer
tepito
liturgist
palitana
cunxin
froyo
knappertsbusch
vesperia
precio
radics
impinged
mapi
heiligendamm
degassing
unemployable
fakers
mulato
saraki
punkin
infliximab
patay
motorcross
pashmina
pariahs
spiderlings
gurs
luckie
therin
abstentionist
beaudet
chimneypiece
revolutionise
boudicca
bainter
canudos
lewallen
schlei
enchiladas
almagor
pontarddulais
gess
sagarin
consummating
asato
stapel
orebody
tanzanians
mechem
steinborn
goram
burjanadze
gonesse
qiantang
kalhor
beye
loganair
hankerson
thrombolysis
ninos
mcraney
rbmk
wardwell
viñales
attah
oresund
redbourn
pâtisserie
babad
dounreay
workmates
beurs
wplg
popi
decosta
fowling
cascabel
ruffino
caroe
certifiable
circumcise
vendramin
honigmann
galica
löwy
kirilov
westerling
patara
fisherfolk
admited
sjöstedt
monkseaton
colerne
niza
hagiographers
gatcombe
scherzando
gumption
bershad
svanberg
gracin
beenhakker
muthulakshmi
parlett
boiko
zitouni
dhammika
bioaccumulate
shindell
coworking
ningde
suprachiasmatic
mccalman
barkingside
snowdrift
clady
taubert
maximov
mirzoyan
broadmead
amap
dscs
blackground
rahr
martinborough
potboiler
noul
sumners
trachomatis
beaudin
loebner
risby
huashan
wapato
weeting
lystrosaurus
castonguay
hidekazu
hilland
dtn
bumbry
faute
unbowed
treyarch
metaverse
barnton
macaya
pazzo
sourness
goias
sarona
pepino
mavers
desireable
maniatis
renishaw
sharafuddin
savaii
haythornthwaite
conservatoires
langner
progestins
eicosanoids
maldah
mcpheeters
reggiano
karakum
frys
newspapermen
mawdsley
lleol
delicto
ayon
excell
basílio
recapped
khot
orba
tordoff
armagnacs
krein
kitani
outwitting
shimmin
varty
laurer
saikyo
nymf
fastway
burcham
shewhart
mencap
earthlike
nilton
asenjo
wiv
cybc
weissenberg
sweatpants
burkeville
amalgams
leoneans
reviver
endsley
alfriston
murghab
lako
spurning
siega
llerena
arcadi
ruxley
harambe
myelofibrosis
doublethink
probations
snyderman
personel
wullie
enews
edon
squealer
sauchie
feres
horseland
hewing
hignett
kimm
horfield
ardisson
isahaya
tallet
vellacott
arnage
kröd
restrictively
apam
arese
iolanda
disentis
bishnoi
arsht
reseau
jousts
tracht
tdg
stenka
kmw
overacting
grizabella
korova
compusa
forefeet
subsaharan
paoletta
radionics
tarbet
quaestors
aghion
befit
crannies
whsmith
dgca
dunsford
haversham
zahed
coreana
tawakoni
jerling
ariffin
quintela
yumoto
pixma
woerner
bonsor
dysphoric
unmis
thaiday
già
malocclusion
quipping
sarstedt
mirkarimi
sabrosa
yoshiteru
ankiel
fago
chazan
assumedly
négritude
elysees
kuzman
conk
retinoids
marcinkus
weissenberger
kaita
lrrp
unserved
portreath
bermudas
ladan
embroidering
pertz
cannoli
vont
carll
mougins
sahrawis
microblog
ideale
chlorpyrifos
cyt
pauh
elzevir
khuri
gradualist
chabat
vdr
valentijn
wintney
croteau
tokara
denneny
uruma
shinbone
fixings
extremophile
rsta
padalka
blowgun
backfilled
rockferry
majorettes
tuomi
detta
nonprofessional
vichada
kenmure
ufr
birder
lünen
stourhead
wate
assy
lide
peacehaven
lombo
mikheyev
gumley
islamique
giesbrecht
lefler
emberá
inyokern
concertant
marlou
markit
counterespionage
primas
luder
nightcrawlers
kerrier
vassilios
rudbeckia
kanaga
bakayoko
clanging
preconditioning
kitaj
jec
stillbirths
reyburn
coalmining
arrhythmic
laclau
amirkabir
brandin
chlo
pcas
artz
zhixiang
screwtape
ostpolitik
greggii
bluestreak
srikkanth
wtvr
drupa
agarwala
weisberger
nailatikau
brayley
derdiyok
indhu
busybox
kiesewetter
huddlestone
janiszewski
meditators
gherardini
faizi
economize
libidinal
lidos
campagnola
greenfeld
derakhshan
mesylate
pien
cornelian
sikharulidze
akzo
lemm
lasdun
qpi
pouting
heygate
hurndall
nijs
nurullah
dijck
dabiri
embarras
misfolding
setrakian
exfiltration
renold
skolnik
antonacci
kearton
moyna
defarge
tolkein
ghil
celecoxib
sevareid
atoc
muktar
jalebi
eyzaguirre
whacky
pascall
augustín
adjourning
makins
coprinus
freshford
satiated
infospace
tasi
legare
unibank
zandonai
mcguane
educause
tatts
ataullah
plebes
shyer
fessler
coye
cercis
noisette
frottage
frager
reinders
opic
sose
ringlets
coex
feilhaber
gavilanes
schneiter
darvin
falko
armindo
aghai
salfit
snellgrove
nrx
womp
lambiel
nighters
ligabue
occlude
kumbum
powassan
bellmer
ladson
cqb
tricot
tumu
centurytel
goven
gardenhire
weakside
mealey
wassell
juston
camaros
grybauskaitė
sundaravej
techstars
lissie
sponging
kilogrammes
teotihuacán
lazarides
spacehab
bellaghy
garw
srdan
bazille
numminen
quaritch
hordley
mojahedin
melsungen
ardern
warily
lombe
melanopsin
mpenza
oviatt
fantini
zarabad
betsie
stuy
neovascularization
epinions
devita
samsam
frankenweenie
almendros
swihart
seulement
giudecca
gaudette
gibs
foxwell
transmeta
pistola
cusd
solaria
truncheons
blackwells
procrustes
belan
absorbency
lemarchand
renseignements
seagoville
aeroelastic
miscarry
transferee
segreti
refaeli
arenales
santhi
gorlin
lgi
maruf
bayberry
albertz
roton
caretaking
reynders
costarring
bizri
madore
ceneri
heimo
balonne
behrouz
mclusky
dannemann
franju
koenigsberg
gávea
gabilondo
gipsies
islamica
equipoise
kudai
marulanda
moacir
abidal
parenchymal
teras
midwater
jajarkot
kolkatta
trailway
squirted
vallentine
kronwall
ratledge
kavak
domtar
starvin
khorkina
pinstriped
ejc
crepuscule
baksa
asbl
milliners
predilections
uhu
hollinghurst
schweich
gwilt
unenlightened
bossiney
nkc
ambiorix
conjoining
railyards
daloa
satans
lapan
glimmers
lotze
gordeeva
marada
skintight
geminiani
ruland
pard
superblock
iriver
obersalzberg
laveen
schakowsky
eriq
satisfyingly
yapi
wesselmann
mmv
adelgid
glomar
avuncular
choden
grubber
abts
cuadros
henhouse
repressors
tyurin
ocf
dxm
shepherdson
zeitun
lummus
dismantles
caddisfly
brechtian
rhic
overcharge
venial
quagmires
jueves
hasumi
farago
classwork
neuenkirchen
childrearing
evf
delie
whetton
handmaidens
pake
carsley
maisch
rayville
careerbuilder
duderstadt
chenonceau
parsee
kallikrein
karenga
fudged
bacci
postion
martellus
coughton
pelicula
balsan
fangraphs
throgs
grischuk
thirlestane
chuma
saltiness
conceptualisation
motoo
kenkichi
samil
candlepower
sergii
silverheels
airlifts
jook
gerron
ceps
keiran
fetchit
bellard
shirish
corncob
inched
monett
denstone
shuddering
fadeout
racey
prittlewell
naná
westermarck
gratzer
zwerg
ustedes
ruina
weening
brealey
belfries
barde
courtown
calleva
dramaturgical
blogtalkradio
vasomotor
pottle
underwire
unfilmed
whiddon
vona
pennar
kompressor
porec
devouard
obermeyer
lisbet
shaddix
kafes
popat
mchardy
horrobin
filipovic
hewins
appertaining
yts
dromaeosauridae
shingler
giral
wua
belfrage
prefabrication
richings
blean
naadam
mauboussin
breffni
csulb
plomer
intercontemporain
okoboji
fernbank
sympathising
lainey
institutionalisation
nmma
inscape
catling
farinacci
manka
colourfully
hoka
nanjiani
tsuchimoto
heldens
conjuncture
cassella
veltliner
ottavia
poolesville
coly
escapologist
mitchinson
trohman
jarabe
biographically
adderbury
krouse
carissimi
fazed
soquel
mrad
tierpark
stfc
metrocentre
demystified
zaidan
itsekiri
saed
kasumigaseki
snappier
uee
pinecastle
kamari
ranglin
severiano
croesor
matyas
dtg
sarsen
kiyonaga
entertainingly
avramov
keinan
kazuyo
condemnatory
wbff
punditry
llwyn
twisp
etro
interdependency
kisoro
bendt
kremnica
bièvre
metaphysician
ruggedized
centeredness
navvy
raca
frontstretch
schwadron
kooser
baverstock
witteveen
schueler
dats
lrh
baith
pommier
outers
calin
scrapple
ghanzi
lumbreras
cyanides
akhmat
samudio
acronis
clasper
lobules
darted
ashfall
americanisms
kaung
pectorals
bezeq
burrillville
tenting
kovr
facelifts
bisceglie
briegel
norrish
gase
florens
rivaz
fishtown
batti
katic
mahabat
submerges
magu
unama
fithian
fitbit
amfissa
oligarchical
countertransference
junglee
brians
lindstrand
jaipal
topacio
ulas
tuel
peti
chadd
laziest
moster
panynj
cairnes
ebbers
zhvania
glenrock
bradstock
agbo
sherm
curteis
cohutta
moskow
pomerleau
sabeel
kornel
htet
naut
whirlaway
viduthalai
frontalot
pigmentosum
donné
mwamba
reja
trr
akala
calland
geu
boardgames
melanosomes
tyrod
serums
dovetails
hodnet
ineke
godparent
protectable
tutus
papakonstantinou
resistible
sémillon
twachtman
florissants
rajala
cowhide
photoionization
forero
tjx
cdss
subcomponents
visegrad
aioi
kirwa
ruritania
perishables
sollima
giussani
basir
ultramontane
certa
hhi
taroko
watene
vieng
lochy
nardin
pederast
udders
durano
interwebs
ritonavir
ruido
canner
yingling
deionized
mbatha
gonen
littmann
rsw
enskilda
vilamoura
goot
venetta
verbage
puea
khanji
mcgilligan
bulbus
schlueter
alterna
supranuclear
meiri
denisovich
sharland
operationalized
aerialist
chaperoned
rumbaugh
eagels
hooey
sapsford
bankrupts
brookley
thorkildsen
ruefully
monterroso
xiaoxiao
bakili
rolandi
dorie
nho
heavey
shangla
casterbridge
tinner
teslas
sonogram
lavista
caat
erythropoiesis
frankmusik
williamsburgh
manjeet
kotlikoff
bartl
gottman
fromkin
refounding
hve
pelamis
agma
unalakleet
tiner
hideto
zinsou
multon
bomans
ancholme
foward
ashrafi
jörgensen
terazije
askegard
kinte
prh
morency
calcagno
colombano
wurzel
disrespectfully
amiably
pichet
cactuses
viljo
vonne
springett
nespresso
vestel
lauchlan
myongji
skateparks
warneford
millenarianism
affectations
tootoo
ultime
presages
stubbings
koslow
shantytowns
naeyc
chinggis
hexose
diggity
frari
coya
janvrin
mompou
mondsee
romiley
cunneen
talboys
hamartoma
amodio
gawn
judaization
earlston
lww
demotions
afrl
culotte
downfalls
importations
sidesaddle
glotzbach
busman
binley
abdolhossein
pih
eio
poppen
interliga
deanie
teetotal
cardonald
alexina
buzet
reintegrating
herengracht
risan
eicosapentaenoic
tverskoy
wilcher
lotan
thesen
luminoso
jianlian
muise
cheapskate
nungambakkam
medlicott
armourers
ciber
royalle
gabai
sulfone
balearics
otolaryngologist
gorseth
ulanova
ruffman
centerpieces
farentino
akyol
speeders
sistership
poage
slimes
tfx
highley
somerdale
bardal
slovik
sussan
martir
coltart
daifu
quinns
polychromed
nagl
tarnishes
ardipithecus
mililani
kandar
shortish
geoffry
cabby
alverson
domiciliary
kormákur
muelle
pitchblende
repopulating
sarpa
nidre
pellerano
tattooist
galili
hochhuth
dutchtown
monocrystalline
condobolin
windbreaks
apisai
sabinas
frangible
woodsville
lambuth
bicentennials
wilhite
syron
breckman
strug
uher
squeezebox
octanol
heeb
cortesi
liangqiao
hme
zizek
junor
contort
swedien
kennaugh
cavada
ozanne
tuusula
hypernova
hayesville
hardon
carrigaline
ague
leechburg
nikifor
balaclavas
mckain
prabodh
higman
lewisite
buruma
anslinger
unspeakably
hyperdub
fusebox
dillons
primos
laikipia
chicka
─
quehanna
dyin
activesync
manabi
bnfl
npcc
weipa
arnalds
jabulani
contrade
lannoo
sólyom
stilled
overcurrent
superhumanly
josephina
sapan
detmers
multisystem
connoted
buress
ntb
reedman
histoplasmosis
afterworld
chiavenna
itchiness
measat
cohon
coti
bander
ghillie
bickell
bacteriostatic
segantini
monogrammed
aweys
strathpeffer
cetto
sugamo
brooksby
colander
hypersexuality
tomizawa
waldschmidt
ferring
dujiangyan
blencathra
pizzazz
glasswork
ildefons
crape
winders
kuenn
darchinyan
youd
gelignite
pharmacogenomics
accomac
ameren
portner
granqvist
offramp
accelerant
motomachi
scrushy
katyń
darvas
castelao
klingman
spitefully
ricaurte
arzobispo
sipple
thackwell
ponson
lrm
elibrary
kasun
gurcharan
dbn
cavefish
collapsable
trango
degan
myrtles
ramadoss
condello
saverin
ftir
legendarily
tindle
tradescant
pantagraph
masing
bubby
lisvane
prochaska
invectives
nevaeh
cannibalizing
bierbaum
massar
fayad
gollan
herberg
candlewood
bandish
droppin
mehmeti
verni
entablatures
malmros
dote
rodong
operalia
mirande
baerwald
shapleigh
margueritte
barrancabermeja
scabious
sunrays
shott
ariodante
mongabay
merrymaking
soulier
perugini
breakwell
flossing
aptheker
cfos
ralliart
schanz
houshmandzadeh
feare
chy
gwaun
brackins
aerosolized
kiawah
didacticism
dardel
delude
allos
raisbeck
abasi
carretero
fertilizes
samplings
bassenthwaite
pantun
leaseholder
slavik
millson
qattara
editorialize
osibisa
baldinger
schmoll
obligates
heawood
avida
tyzack
tcv
wieber
servicewomen
candlepin
coronis
rationalising
pallant
couturiers
gugg
jauron
berkus
skellington
ooxml
bouffant
karmanos
warmups
hanningfield
fasttrack
rodgau
hassard
xiaoyi
gasifier
showoff
hagos
veau
lenzerheide
aventinus
luter
anum
calme
breaky
wainaina
addabbo
maricar
venceremos
gilet
yarmuk
natch
gorkhaland
wiu
bhoys
westminister
surtsey
siffre
executively
mangus
campani
fuqing
hashima
westwater
shapour
kenward
pilita
verbeeck
rfra
moktar
przemyslaw
elmasry
morga
saluja
simsim
procrastinate
trophee
cassetti
gelabert
interpenetration
hvc
killamarsh
badley
estremadura
robeck
apicella
maleh
hert
weinke
millares
nachtigal
kuharich
kanemura
lachs
incurably
teps
outscore
kaisen
bahjat
koufos
schrenk
engrained
brayne
economico
policier
lietz
seamaster
elnur
lukic
wheres
charmbracelet
benante
iselle
waked
karros
cohabited
buyten
balamand
veneracion
omohundro
atoyac
sursum
uncollegial
furley
epes
schabas
programmability
karega
kluszewski
alydar
granberry
outsides
arbos
olumide
segodnya
intu
kottak
kovachev
thae
suiter
fert
gurmukh
ngouabi
kirkhill
kanaeva
brookeville
abbandonata
faul
gumble
paulius
oudry
clarkesville
etchison
blowhard
salvadorean
fairvote
zomer
tarsem
lysyl
pollina
vibrantly
shabu
kaliska
bororo
erkelenz
mohite
mammogram
udin
privatbank
hchc
caccioppoli
reputability
burgener
hanaoka
wittels
meur
pionniers
fmcsa
trimdon
schaar
cornflakes
longport
crutchlow
unvisited
smps
wheelwrights
morros
hower
woodlock
marquart
karpeles
afspc
tradewind
bobin
kight
aquavit
guavas
priebus
recomposed
clearwell
blackleg
snowdrifts
lalwani
jelani
edenvale
sóller
arsia
anx
wangmo
bilgrami
thermoregulatory
clucking
vendee
emersons
visy
farole
ypo
daehan
kidbrooke
franschhoek
ripert
brummie
werve
loudi
garstin
meteosat
shabaz
mcvea
kanamori
doorly
furkan
monetizing
fellside
aldar
emeth
hasil
jesty
blaina
parlow
hesitance
hollandaise
pissy
siddis
harmston
listeriosis
abramovic
biopics
putina
phthalic
nzoia
europeanization
stacpoole
dombrovskis
frontin
fitzcarraldo
shadings
cambronne
xenophobe
chomet
gwalchmai
widdicombe
stuntz
sharqiya
geda
gilfach
mykel
compagnoni
nordjylland
ingvald
sanjiang
citysearch
lichenologist
ziebach
colwall
capriciously
fairouz
pellatt
hakuho
sandfield
hussin
evn
dartboard
darst
chope
schlink
khap
bendor
chiarello
zalmay
theorises
galisteo
biopsychosocial
greasepaint
gff
apuzzo
jesperson
ovett
gwanghwamun
kalen
wsn
novack
nightspot
cumber
nyam
llera
calafate
friedberger
heeds
landgren
redecorating
taleh
asmir
adalia
staite
lapchick
munira
vija
sperone
entreaty
databanks
dropkin
shariatmadari
carlina
mavinga
barefield
innaccurate
malba
orgreave
industriousness
handcuffing
medhi
supernaturalism
kcp
lavrenti
inherantly
trellises
multigenerational
grans
macintoshes
pilobolus
daiva
synthetases
ruttledge
larches
erpingham
kaut
bedaux
tashima
revamps
pityriasis
overshoots
reevaluating
arkansans
mappy
allcroft
jotted
cialdini
mastung
kashyyyk
lemington
maraca
kurkjian
uvas
longobardi
kriging
congolais
péladeau
virage
darabi
taverners
samin
wans
carlisi
tracheostomy
acciona
luhr
sydor
hippogriff
atiu
gaume
itec
mackichan
pelaw
manchesters
wadkins
godefroid
brownjohn
bossed
laryngoscope
sychev
radegast
forcella
moroso
falaknuma
grein
basiliensis
keawe
audemars
adcox
penderyn
sogavare
compadres
haotian
rodewald
baleful
phial
sweetums
hostelling
ojc
wesco
titleist
aleix
sturmey
renison
chawton
daraga
kadowaki
kingsburg
solebury
savion
humaid
testaccio
lipolysis
arulpragasam
michelmore
tizzy
bethânia
bartholomaeus
promociones
pineview
mohammadreza
azubuike
snaring
epas
carapaces
matthaeus
esterson
kabongo
scient
alade
anerley
ridesharing
arvidson
propinquity
gyle
coronagraph
prothesis
kotoka
walis
pastorelli
tsca
wilberg
amsler
rajamani
nastasi
unforgotten
petanque
chinhoyi
intocable
ivillage
technol
adenhart
kirks
weissensee
tambunan
nicoleta
mirpurkhas
pannal
pawnbrokers
woodlake
ligula
haradinaj
ishant
penni
steakhouses
kabaret
flakey
fewster
claypoole
kharif
blandin
govier
effectuate
gilot
dentine
keate
ciss
supernormal
lancair
suhana
apy
hussman
padian
lognormal
organotin
chrysostome
thuggery
ibrahimov
mackrell
nyrb
bladderwort
jankauskas
outpourings
ustashe
kashmira
youyi
thalictrum
gangwar
pentobarbital
mabye
cspa
evett
rakowski
rabello
tpx
usms
brokenshire
radamel
gracefield
rockmart
confrérie
obiora
makhmudov
wili
heeger
kitties
abraha
ashlin
badaling
elmley
vardalos
amazin
somatoform
bunnings
carlill
calcaterra
unfused
wheter
había
lynnhaven
amantadine
sticklebacks
frae
kuman
publicizes
budiansky
sirola
hypermobility
caponigro
nagila
hybridus
iwk
cardinali
boguslaw
pateman
armoring
appg
rameshwaram
laclos
dongles
newdow
scdp
huntoon
fcj
exportable
murdoc
halkin
uncreative
luthra
aristos
flied
njai
floren
woodborough
obin
ravenclaw
butman
airguns
forouhar
tektites
benaki
bragan
unlearn
sangaré
debility
imazu
snazzy
caldey
gamin
ifpri
mixology
ogm
killearn
nemeses
loutraki
gautreaux
ricocheting
issi
koistinen
kuffner
harang
unsurfaced
prum
maliciousness
transmittal
schroedinger
sugano
popsci
adairsville
cicinho
binbrook
chinglish
kenting
samardzija
hoor
zindler
chadda
edenfield
boco
iler
inoke
wvtm
ephesian
ambriz
mait
bedsheets
monico
resected
fogelson
marcelina
lytell
evicts
apurva
robstown
jombang
rfq
motionplus
diminuendo
seeff
tagliavini
tailcoat
gotama
cosson
dims
bolender
mainer
brandao
nams
moneo
lowkey
cudgels
alinda
dearer
lawan
nahash
annadale
snavely
dainichi
adenoid
expectorant
ceremonious
rdh
reconciliations
takiko
enrols
wakka
reneges
libelled
eagleville
buchanon
mcgown
sinkers
willo
exhall
ghari
thermoluminescence
carmassi
izon
scours
qurban
koloman
morsy
merchantville
hookham
pyruvic
underwings
ceara
fuiste
munby
cinquantenaire
dewart
askren
unvaccinated
coolants
hemans
nmea
leadhills
ionizes
camba
pontyclun
shaner
brandwood
chiva
netiquette
opti
recapitulate
abobo
imperil
larrieu
arrecife
maccabe
stabilises
dudok
ducret
ecv
ambert
spillers
okkas
shuangliu
mørch
sheeley
wango
carajás
antiserum
embezzle
kumin
yakovleva
onis
sarraj
paet
jacot
tenneco
disconsolate
pisciotta
lazic
rheas
ostbahnhof
sanming
myk
mariola
krikalev
splatters
caruthersville
bigland
gaters
acerenza
mangat
gosch
icta
harkis
undelivered
ysp
feminista
zlotys
fintry
alerce
yarmuth
herskovits
kielbasa
myelopathy
cogen
dagoba
kellas
chengzhi
sabic
egc
mascall
bertoia
thode
djia
ippolita
mosey
jilava
brunhilde
earnie
olalla
datsuns
bandaging
goyard
rhue
chivenor
chary
devrient
ndo
scabbards
mumbled
gajewski
heegaard
sadok
gieco
moer
topiramate
beezy
jolting
federalsburg
venkatachalam
reinette
kaladze
roxo
pohle
tookie
fickling
pnina
foeticide
oversteps
butrint
millionairess
kiprotich
thruston
leedham
huddling
frolinat
bussière
dissention
daae
fillan
killanin
nomenclatures
anglicana
vazirani
arleta
aderholt
ivanoff
bevelled
mamontov
sanaz
eastover
zullo
scholey
micanopy
mellanby
setubal
torke
creer
computerisation
malis
heppell
kirknewton
acap
chare
gusset
sheriffmuir
bagnols
nosworthy
waukee
mansoni
twined
gtt
liberta
fuddy
electable
sukanta
hauch
etnies
marilynn
papac
singerman
lampshades
showtimes
swissinfo
osnabruck
buggered
tuthmosis
gorkhali
oyen
bishvat
vittel
pinochle
arann
cadens
kauto
fauver
plugger
meatier
madala
blomquist
kaffee
blanchfield
steinmeyer
saltram
jeanerette
egrem
thoren
centrex
handfull
centrosomes
toppo
lucidly
regas
pimco
bearn
ferrate
nathi
pities
hamms
unscored
seraphic
sandcastles
garcon
shukan
goodmans
boscovich
dofasco
evington
steenson
roughcast
fumikazu
molefi
tanmay
hawdon
interport
luaka
mitchellville
injudicious
oldknow
asyr
neeti
elongates
germinates
radleys
armilla
okulaja
titties
diandra
pufa
flightpath
mayle
coopersmith
loehr
mukhabarat
smcc
artless
gencer
dolmayan
marianao
kessell
nnpc
haltern
tuscans
bodson
qim
harmen
adolfas
campoli
craftwork
misreporting
dostana
gerould
nemacolin
schoening
bendable
mlx
memebers
fennesz
sawt
cotesworth
indika
hippler
overtopping
eversion
melco
delattre
brancepeth
mawa
indep
parmley
gypsophila
viiis
lichtsteiner
dashan
necco
transtech
poplawski
cameronian
yankel
airspeeds
kitted
médiathèque
xscale
burgau
tenix
scutaro
balfron
griffeth
capitulating
springtown
whoring
zorra
vavasseur
lochside
yohe
giovine
accosts
bpe
esotropia
rooflines
seraphs
breker
ourself
couchette
kaleri
lofthus
arithmetically
fritton
finnemore
phrao
wcp
osterhaus
hollandsworth
blach
eemian
clitherow
bungert
mismanaging
lokhandwala
riall
makana
geochelone
rosés
cidra
colloq
kutuzova
bestwood
kirshenbaum
crda
loeser
uttrakhand
frais
paediatricians
ratchford
jytte
oremus
hpf
menlove
ciguatera
muhammadi
treforest
desheng
dogue
objectify
zhaojun
trautwein
isic
heigham
middleweights
argenis
tujuh
delica
ontrack
teairra
adieux
cotts
jerai
trea
frugally
hewed
dulux
kallan
consoli
standin
lundeen
hahndorf
invitationals
darroch
rakic
hiromasa
massmutual
nitinol
natales
kennison
goldline
huludao
howdah
brays
acrolein
magots
uncaged
uneditable
tappet
rosarian
spatafore
maures
killilea
mouseketeer
waterpower
avijit
dovizioso
darkie
hifikepunye
gpk
poges
horndon
demetrious
unitedhealthcare
urna
brusquely
evoluzione
superieur
fauth
naté
echlin
viets
dagher
lbd
isobars
milot
carbonia
cebr
blackalicious
unwounded
singhvi
rocklands
pachuco
amoo
matese
apears
hirshfield
houssaye
boerum
dullin
tryp
kickstarted
isserlis
exuberantly
thimbles
rothmann
vedado
bushism
makor
chazen
intermarrying
elegantissima
tonko
kambar
baltoro
latigo
muckross
documentid
lunger
valentinos
pacwest
hollandi
smout
merengues
mcalmont
fmu
masin
frati
voyeurs
minassian
yakunin
wacc
gpw
endive
sirc
siero
ruutu
gucht
goofball
mykines
atiyeh
hako
moring
milieus
scaremongering
hafid
wightlink
pipiens
minetti
scharpling
longhai
rossia
shigar
macgeorge
bratcher
wangford
snt
pandal
extirpate
repellant
armitt
molden
tubo
tabler
novye
cheesemaking
iggulden
defla
mushu
bailo
colorations
marsy
caked
ecuadoran
brunvand
blogsite
strandings
irrecoverable
greengard
kullman
ugni
walby
ampoules
inler
kess
starbird
boulger
cvh
debreu
meurs
rauber
recanting
crotchety
carnivalesque
megalitres
ecclesbourne
serret
hdw
minetta
overflying
crotona
volosozhar
mamoun
jerrard
nefyn
sirolimus
mohideen
cenacle
wees
visionaire
facchinetti
poels
firedrake
darbuka
megève
ethnobotanist
ivig
rindler
nikkhah
laack
melker
proinsias
tumby
masoli
shelleys
guede
bevacqua
dunchurch
hiestand
cellulase
springhead
zubeida
jetlag
brookie
jemini
weatherbee
kuypers
fromentin
sbz
balakovo
civilising
muraoka
unrepaired
otoliths
deste
thierno
strategie
siggy
biplab
tulk
langsam
giladi
suel
eavesdrops
polya
mulenga
adamowicz
noreste
welspun
wenxiu
jacquette
gnash
quester
dawda
nessi
materialises
glenties
kapaun
mogra
magdaleno
ressam
behler
villalon
besen
almera
chamorros
castrating
takacs
roome
headsail
schaer
borrel
sangala
cits
wangjing
frishberg
tomin
laloo
pwned
marolles
interpublic
sanitization
newmans
mcwherter
keal
meggett
vinaigrette
goolsby
luckinbill
tortelier
anatomies
developerworks
strums
medows
diverticula
individualization
beseeching
graus
stanislava
guinier
pendine
barberio
karlberg
trixi
drivetrains
creeggan
meile
bodde
ecuyer
wnep
randallstown
zzzz
junod
thse
garve
neethling
darville
weiller
moix
milord
fpg
chriss
calcavecchia
purdum
rothkopf
suppository
rennard
emig
merryn
culzean
uwic
zaret
beauchief
spencerian
samms
greying
matuszak
cesspit
snork
untying
needell
liberopoulos
posener
orderliness
ayyash
selvage
balgonie
remoter
osmaston
pako
crianlarich
lovemore
laduke
ozer
stasio
containerised
handwoven
aury
wwb
grean
wolfs
yayla
neighboured
icecube
zemfira
placating
sirf
castano
depaolo
croons
pasmore
pocius
vernons
cryptozoological
boddie
glassdoor
skrela
macafee
curent
insitu
iccr
rudloff
berek
pronin
sociopaths
netten
flogger
kameni
gascón
axiata
shawky
kotick
trimingham
messageboards
blabbering
haidinger
schacter
zavvi
lendvai
maccorkindale
stronsay
gibernau
peyto
charlap
izuru
deek
sipes
frugi
stepbrothers
waldir
inoculating
janeth
vindicates
denisse
extricating
frediano
cocchi
hypervisors
dorai
qusai
cico
nvs
blahnik
manesar
greyling
witchdoctor
constricts
shorties
szczerbiak
corringham
downhearted
chix
ricochets
gusinsky
enology
bisconti
ymc
perama
lubang
aryana
asok
ellenbogen
tommies
buol
castalian
scoular
ingar
zaslavsky
riehen
bonampak
orilla
havilah
eavesdropper
dmanisi
sesh
ragab
betteridge
guzzling
braydon
meigle
bahati
buil
publicidad
dokka
kerrick
csto
issaquena
pacitti
dembowski
panno
ragaz
grohmann
woodlots
afghanistani
oubliette
parachinar
presense
annaly
puhi
uncommercial
bullmore
sipp
muskego
blacher
nantgarw
compay
twellman
stampings
lifebuoy
blasphemies
licklider
antlered
mstrkrft
maranda
villawood
hiti
hathersage
jaguarundi
gmv
korsholm
fantasmic
cuti
querulous
accf
calacanis
untaet
bezoar
sheepishly
clokey
malecón
konst
jatte
filderstadt
textor
marçal
divertissements
tergesen
cammack
digitalism
penberthy
manures
kennis
guruve
lassila
athe
gerra
likelihoods
spiridonov
keamy
senshu
permissiveness
dogana
snj
iffi
canyoning
wahyu
circumspection
picchio
saranda
weda
unwired
forkner
triebel
raymondville
homosassa
kocian
giovanelli
arkush
charmin
bresler
brimhall
spondylus
wittek
kappe
vasanta
fairbrass
bishopston
inni
fers
gigo
hanon
rautenbach
pimento
chiqui
monachorum
sweepstake
herse
klapisch
platitude
gidney
clower
bioassay
patarkatsishvili
picart
theydon
abdelkrim
sapin
teymourian
dhiraj
carmike
mhe
munslow
sooth
sterzing
fistball
kachel
gusted
mcneish
rosenow
bajos
yorkston
adeola
rosenbluth
hartsock
zumthor
darina
cusi
contalmaison
ikbal
kamra
saltspring
ppis
vasoconstrictor
helia
floch
portwood
malawians
gojra
gambari
earmuffs
surti
rodford
conair
pronounceable
depts
tahirih
dongsheng
macrolides
abseil
etn
chizik
pouncey
boocock
sanitised
geiringer
yifei
yonas
pantothenic
eikrem
meeny
asps
grassington
estimable
karten
botas
tifereth
gregoriana
meersman
chronographs
unfabulous
kizza
bovec
alland
nanocrystalline
richburg
exbury
yermo
demitri
wgf
juby
ouistreham
kealey
winglet
hertzberger
baskaran
tiw
septentrional
taag
knapping
aeromexico
oromos
gutless
dombi
tilke
opatówek
cottington
najim
enfin
hinwil
fiducia
barkow
apfelbaum
geebung
makka
yeary
ignashevich
qahtan
dissimilarities
passey
anqi
numerously
odoratum
schirn
bradberry
ochroleuca
hockessin
millicom
gardners
barzel
gedera
refunding
firebaugh
mutv
lambley
splott
chocks
hultman
habgood
souped
fitzpaine
trialware
bozcaada
slapton
scarano
dufrasne
maynes
jrg
pannon
cryptozoologist
reille
skelos
hoodia
banaue
guillermina
wiseguys
fotheringhay
vengsarkar
hikma
hulan
grayed
amz
willin
madai
lyngen
plemons
ewyas
theys
barab
kuso
nitrosamines
wlra
balcom
bourgault
varkaus
poochie
abuzz
stadtpark
propound
belga
bottoming
unionised
copson
soporific
swanzey
trenholm
kulob
sundhage
hayslip
vinces
fatat
targett
jianzhi
darksiders
fiesco
hampdens
tpj
campervan
thorner
cheechoo
sabbaticals
hendred
baalbeck
devrim
nutraceuticals
winlock
marinara
santora
populariser
chaan
indissoluble
petrakis
surprize
retton
osser
donaldo
earvin
monstro
biscop
orestiada
blythburgh
hemon
chella
kornblum
enteral
otology
topnotch
amemiya
honeycombed
relived
gaetana
krook
ommanney
wella
jampa
finnissy
tassell
ysbyty
sergel
fertilise
seraphina
papabile
sherston
pobol
amonasro
polyscope
geibel
duno
ghp
flunks
ondi
xidan
fetishist
tdh
ersin
vatsa
nasz
pepler
shereen
trancoso
tuoi
deusen
conformable
howman
merkaz
bbo
deti
lwanga
photosynthesize
hissène
farncombe
pakefield
jimin
artek
chapare
judaizing
lizotte
dcps
jenko
yanling
tividale
sveshnikov
barkai
cardiogenic
dilger
electrostatically
olexander
kwasniewski
vanston
wina
pueri
schechner
raiz
hinda
monsun
satwant
liberalise
intensifiers
middleditch
oake
panguitch
hensleigh
spaying
cerar
washougal
turangi
meadmore
highams
sumaya
gomen
cianjur
cuche
stephensen
telephoning
contee
sloshing
convery
hormigueros
manuchehr
yassen
peppermints
riski
barnhardt
interceding
quarrie
dermatomyositis
rajnikanth
unamended
walburg
gaynes
xmrv
sotir
salboni
euell
jeppson
countee
valpolicella
roewe
ruia
oubangui
crewson
twirler
pandiani
bushwalkers
schopf
maroua
partypoker
elmers
wolowitz
systemverilog
queerness
lonelyhearts
vizsla
peered
bouc
slurries
makau
ramotswe
marzieh
knightswood
sorelle
morwood
chomping
bwin
teia
mahaney
glaces
kitman
mccamey
vaccari
hartville
athleta
hamano
grosbeaks
lence
enterica
frigidaire
medicean
herradura
baseballer
kyrill
palade
riestra
acquisto
ganeshan
mcnicholl
knutzen
lucht
kipruto
antigovernment
rawalakot
gadahn
castlemilk
millrose
streamside
deerpark
countertenors
zinjibar
elsen
yattendon
ireport
metoclopramide
dragsters
carminati
bradbourne
durieux
accorsi
urca
diederick
contaminations
entel
kiselev
bulimic
xilin
cockeysville
cryptograms
gento
newsweeklies
lout
marna
pvf
sabermetric
exhuming
lyndonville
deignan
bottlenecked
sabet
birthers
clearnet
saloman
niseko
bombes
bilboa
unconsecrated
evincing
bardach
hatteberg
mallowan
katai
axler
conspiracist
rutstein
manhattans
goguen
bondoukou
abrial
anticlimax
nutri
sarana
talismanic
coupeville
concreteness
chindit
riccobono
spagnolo
marguerita
offerer
posi
gammal
marett
dansby
ethias
handelman
patkai
akie
scowl
chargeback
benke
rosch
terp
verwood
lmm
meisels
liangyu
arlyn
wakatobi
kenly
badhan
dissociating
canonize
bodos
shortbus
agressively
shubham
fowlie
flageolet
schobert
franey
wolbers
cherchez
vitous
petrovka
kalogeropoulos
dukan
cione
unpretty
matts
sulivan
moginie
hena
maggette
ccfc
gwaltney
lettieri
coundon
puel
hatcham
martinsson
laton
cloudesley
torti
kajaki
anderegg
nunu
netsuite
middlebrow
arcor
mitro
crinan
bkc
wiersma
wimple
bioregional
sifa
faramarz
lemna
zambo
bielawa
fbm
prole
kemet
adenoviruses
traut
chcf
sooooo
brington
behavoir
dilema
lacera
laghari
chillida
rasam
cordula
meadowview
jostle
heidrun
supertest
jamalullail
khakis
fffd
shuna
ilife
constabularies
photocell
inmost
notating
ermitage
perfetto
waterkeeper
yelm
floresville
straightforwardness
greenspoint
salatin
jiayu
alleycats
kragh
korten
augh
couverture
wilman
colophons
brawner
walsenburg
signos
dreja
rafeeq
refusenik
amatory
dextre
regola
reznikov
pinda
nuti
frate
jogjakarta
creeped
périchole
areni
longlines
bennu
coleville
lonie
tuukka
linschoten
nobels
portuondo
trieu
yongxing
portsdown
esmée
bastardy
bleaches
maayan
marak
scpa
hovnanian
shigeaki
heavener
exigency
primerica
kandaswamy
adley
mendte
centromeres
stanmer
yucatecan
walmley
fiano
berlyn
walburn
lennartsson
deliriously
overflew
melikian
paly
oakbank
gandil
divined
daou
marmor
mamaia
ntg
leasowe
concealer
resignalling
lundmark
zini
lenham
onyewu
sandrich
coments
labastida
vuoso
amicitiae
pilav
ringens
capeci
skalski
dubourdieu
titch
feyd
taikai
libin
raghuveer
solnit
siberians
ahanta
thackley
schmucker
nields
rammell
janabi
schwartze
lety
zagazig
budgam
marshyangdi
walkaway
grévin
cicala
belsham
shengli
broadcloth
kulczyk
birgir
facia
yangyang
elya
jitem
ultraviolent
warrendale
vanina
arminio
lightheadedness
panchmahal
dunny
kessels
appleworks
waitressing
colsanitas
streett
lenience
rolfing
diari
convoying
apso
presle
entartete
prostituting
khandro
kabar
smal
multilane
quantz
hookworms
perfused
lipper
saltman
favorability
meere
cothran
rodier
hoeffel
annah
hopland
tias
gaoyang
poops
gebran
debunker
oldfields
hodo
bequeaths
etha
antispyware
rheticus
refinishing
lowson
breastplates
marib
apon
ourstage
tionne
gulaal
steinhauser
terreiro
paranaque
misti
encumbrances
croall
hapmap
quinquennial
goytisolo
wetterling
borremans
understudies
mayuka
ballaugh
chavda
castlehill
mjc
dominico
factsheets
succinctness
receptiveness
abbyy
tsintsadze
planchon
cityhopper
tajín
akeelah
bogdanoff
unnecesary
brosh
ramco
fienberg
nicodemo
yanjiao
wghp
sonatrach
fallston
weideman
icepick
xsi
sercey
majima
sautoy
oughtn
wilsford
stradale
vuze
frack
smartrip
thoman
rozhestvensky
bakun
whimpering
baren
nykvist
heka
bathampton
virg
kwapis
olivarez
mabhouh
beloeil
epargne
hawkinge
bardolino
marcescens
bedstead
agresti
thaung
souljah
steffani
ritc
jérome
frades
braselton
malfa
medlen
yañez
lippold
soliven
welin
nasif
racette
aija
ruttenberg
babeş
matsukawa
unmediated
bgy
bahnhofstrasse
cootehill
skullduggery
batasan
alaw
propitiated
balz
norrmalm
brasi
giuntoli
kinuyo
festers
nrps
fuenmayor
hmiel
nunan
framemaker
reprobate
wilpon
blobel
odysseys
ghilzai
zoonoses
bridgeland
dendrobatidis
prego
bbj
billingshurst
delite
grassie
moawad
volaris
raceways
knxv
psim
formalists
pyrrhotite
trover
vacillation
brunzell
lucasville
andam
solido
sueno
pensnett
benihana
hockings
b,c
heuvelmans
meulenhoff
hyacinthoides
ecklund
vidrio
hyong
tums
techtarget
isoflavones
vohor
beitrage
nasseri
playmobil
turkmenbashi
hoodbhoy
leewards
willams
unkown
mirasol
houstonian
refaced
chetna
mishler
betfred
railamerica
siani
docwra
joyriding
larmore
teardown
enuresis
arey
piedrahita
direkt
danfoss
pateley
monopolise
libbie
vpk
toldo
fullbright
kunert
americanize
troikas
arly
gasperini
falconio
reapportioned
piperno
triomphant
tudou
singable
freebooter
wymore
masella
winny
nghia
erdemir
grunion
logix
repayable
ccar
gladwyn
olp
roder
berko
darry
boggo
cittadini
poststructuralism
hartert
restenosis
shellenberger
plzen
macks
mogok
wego
kinneil
luquillo
melky
aggborough
scherfig
sgo
usssa
phun
acesulfame
tlf
volodin
tabish
norc
peekaboo
cnemaspis
smales
xixi
padshah
kulash
calello
nonmembers
hellner
asham
pretties
giannopoulos
risse
changbai
seeber
brugmann
broody
siddharta
rcg
krumbach
supertankers
braunsberg
adeleke
ibv
goalmouth
jeckle
eastridge
mileva
kelby
thaipusam
rigondeaux
flattest
entryways
nyoman
wheathampstead
yarosh
oropharynx
evasiveness
gáis
shawqi
heterosis
austrinus
transgenerational
vrb
aloi
blatchley
gitter
tlg
alibhai
sarie
buras
diethelm
ihrc
fayssal
aspell
colhoun
nonesense
polycomb
querns
transferrable
melillo
brodbeck
fatalis
eurowings
hawfinch
lucke
pacte
diversey
mignone
fifers
natsuo
cholly
haraldsen
crutzen
pedestrianized
mekka
alpujarras
cmts
allamah
tamerlano
summerlee
thermostatic
debited
scherf
jwst
trerice
aleksidze
lütken
andujar
ishchenko
rexona
spacecrafts
lindemans
teele
remengesau
horticulturalists
eurochart
bluffer
nazaryan
siumut
certo
pinups
boxgrove
martorano
accenting
jitra
fieldfare
xmi
mershon
chadbourn
nozuka
dhumal
adelard
sekt
electroless
macchiato
forbury
witczak
irreducibly
mutasa
nordan
minocycline
ugwu
braising
bitti
straightway
flatliners
segui
differentiations
sirian
fafsa
hundi
webkinz
guiltless
hepatomegaly
prostacyclin
pisana
wolffe
ssbns
dutschke
janai
filipov
bonnybridge
lubec
lihou
rocketman
kranenburg
overheats
glister
catán
nowt
wehbe
fizzles
mirel
mollies
khawar
kriv
kohistani
woodcarvings
antiquorum
tubeway
peulh
blackstrap
januari
appliqués
brugmansia
newsagency
provincialism
howsam
medalled
maytime
guira
aasif
bizard
tjuta
venga
iams
heitler
tuonela
wgl
loomer
alers
liminality
gochujang
tilos
bovard
stoilov
makars
wadge
lochend
frendo
seckel
bosal
abondance
originaly
multiline
courteeners
uncontained
sarason
cokie
ceas
wenli
rififi
mezvinsky
neki
minutewomen
understates
poppier
krakowskie
polytheist
agas
fabel
heesters
ainsdale
yefimov
capybaras
vastra
glenburn
lynchpin
imageboard
anglada
impresarios
jalloh
luminosa
yarnton
yarnold
bramer
levinger
kalaallisut
plymstock
vainqueur
romantische
conflictual
lefrak
ngwe
lamido
xingyi
ropewalk
templesmith
disgorgement
vikes
nyoka
volleying
byzantinist
shandwick
dinata
stinchcomb
nintendogs
mzuzu
mariposas
brugada
mongkok
gouldsboro
wanzer
poil
snx
biagini
batsheva
kazeem
wbr
afterthoughts
yountville
paxon
riderless
ekibastuz
derval
eberlein
panhandlers
shwayze
powszechna
mertensia
villeda
likeliness
oystermouth
blackbushe
sidey
micoud
constand
pastan
autorotation
revenged
lumines
alparslan
warmoth
paterlini
plut
devault
synnott
nikkel
blye
shawnna
exciters
sophistic
pacificorp
esquel
ceder
noorul
bulley
caremark
turbeville
orah
stenhammar
beninati
balakian
selinux
montagnana
esea
coddle
dereck
peñasquitos
chilham
limpsfield
haydee
marketeer
benching
admixed
espressivo
moanalua
orecchio
domen
tanui
waithe
spolsky
mumme
marring
cains
overal
shagreen
rascon
hadramaut
normalising
finnbogason
feneri
xenobiotics
minneola
palmera
boun
cognita
basters
oldsmar
farallones
femsa
tenny
neptuno
confuciusornis
webchat
olawale
bondarev
gauloises
sadistically
kuja
sobhi
homestretch
idus
adelin
manaudou
wesfarmers
silverwater
golab
laryea
bams
urai
srey
ismar
abhilash
larimore
wgcl
boucheron
nitesh
bugner
roady
amoli
jref
highers
mythago
cushendall
ntozake
sturr
sabzwari
samsonite
magnano
marsicano
renger
onny
incarnates
riat
volonté
oatey
absorbable
zeytinburnu
laran
élites
mfis
perfidia
lambertini
saufley
prend
vmx
befor
neonicotinoids
nonproprietary
azadeh
jayde
navsea
leggy
bidart
plett
stoudt
actis
wojcicki
victorinox
meopham
corkscrews
luego
exosomes
confirmable
behrami
kongers
lurching
boltanski
inexpert
obligatorily
requa
kozy
moakley
walkington
blomdahl
trowell
vampaneze
quddus
podiatrists
ciaccia
mccary
kirchoff
bildu
serratia
rayer
huseyin
watros
hnb
cheswick
kotowski
baadasssss
bruces
hett
jcu
lilien
mhealth
claessens
tatsuki
abdulkareem
kazlauskas
dressmakers
mansha
evangelise
masculinization
ahsanullah
expedience
ovie
divot
childbed
unpitched
salvaterra
middleboro
shair
scotney
barela
micael
everywoman
manahawkin
legris
viridiflora
negga
frua
undercount
commingling
kunashir
mamerto
quisenberry
vnn
andreadis
intracity
stemmer
tapie
necrophiliac
feulner
kuhle
muizenberg
rocke
foregrounds
kabbadi
pawiak
vitara
funkier
pedrera
izakaya
retinues
dingwalls
regenerations
douve
faca
grindstones
enclaved
couchman
gisa
dipeptidyl
brustein
skyrockets
takaharu
bejaia
debase
gascons
brownmiller
ramstedt
cajas
herzlich
yankey
medicaments
papermill
hevea
nanorods
farb
epw
hydrops
mistranslations
recived
sudd
kisho
schnepf
lanco
viverra
elsy
junning
davoli
juric
woolens
zoomer
colwich
trystan
tamlyn
durra
kraddick
varcoe
temaru
pasca
renesse
schie
ueshima
preordained
iei
nestos
valise
tanne
parkinsonian
sener
piëch
corbo
yeganeh
numbed
ayinde
rotundo
cestius
mctigue
torontonians
schweikert
radion
clusterfuck
jtb
ordet
rudall
duenna
mclin
ishizawa
ramie
batte
multiday
argueta
achuthanandan
trivializes
kimbrell
jazzed
niton
ccx
sciorra
hundt
smartcards
giebel
attridge
gheyn
eurotrash
tober
pitka
jagdeo
jackhammers
christan
plai
ritzman
aati
topolobampo
trog
hmmmmm
migliaccio
carena
thusfar
backround
nocerino
choteau
wulong
paroubek
phlebotomy
altavilla
baher
lindane
larrazábal
tesori
scrapheap
sportback
sabara
cadotte
périphérique
lft
jrcc
kuntsevo
croquette
mesdames
welti
wpd
oegstgeest
valdéz
urbania
shinar
queyras
pétionville
timidly
harakiri
atole
kizzy
yizhou
kronberger
agog
greengate
otamendi
kundry
kondratyev
osieck
karpman
asgiriya
utian
impecunious
sarhan
bjerregaard
ferzan
abtahi
raynsford
stobbs
bitmapped
unionpay
bembe
stelco
holga
starla
awfulness
cohler
sublimely
lopera
nosov
fionna
dispiriting
hollaender
naimi
mcfarlan
seehofer
scra
carabaya
rcds
soothes
appan
ballybofey
dervla
vogelstein
twinn
homos
limos
petur
verdure
humanization
demur
exmor
lasko
yoki
anglicisms
earthwatch
jugnot
jasso
changling
buerger
jumbos
nairac
devco
messori
kordell
hanekom
carping
sariñana
kauppinen
lifelock
mulaney
gete
accustom
insley
cannibalised
bailhache
criseyde
obradovic
redistributes
badoo
hykeham
scrumptious
elettronica
cnhi
komeda
southie
tarsia
zirkle
fukudome
sonoita
machover
immunoassays
gangly
kartemquin
imaginationland
podsednik
arakaki
pienza
skus
kanzi
runcible
easterhouse
weining
maconochie
berning
gatlif
vaster
braço
accrete
lisk
edun
sammarco
goelet
izarra
spicejet
stigmatic
feron
halma
appley
cuadrilla
idiocracy
luban
centura
recollecting
weizenbaum
thrapston
ssgn
shengelia
liuzzo
towey
poleg
mudeford
bakaj
chandor
gyalwang
lrd
udeid
sortilèges
tavárez
nordquist
anv
obligor
gerundive
gangloff
milvian
moler
kizomba
auditore
multiprotocol
klans
mühe
cronkhite
meningiomas
cfpa
cachi
coloradas
unredeemed
cunniff
kdv
berhanu
rillington
shorthaired
halsell
sârbu
capulets
isobar
flameout
porajmos
castigating
autocrats
zabbaleen
villosum
ruddiman
arthaud
lactobacilli
amaliada
sanan
homar
wkrn
satsu
badmouthing
alshon
straiton
shanel
jolles
subaqueous
diekmann
rooij
mitrovic
palmier
crépin
teucrium
mitred
bancorporation
maslak
nzx
miseria
lasson
anindya
volek
tadley
congres
flipnote
alpinists
phog
creak
crichel
coonamble
kevins
castlederg
bellowhead
reproof
walpurgisnacht
antivirals
tylden
magilton
gainfully
ampney
peeked
servility
toppan
sparano
ruminating
neua
overdetermined
limulus
steersman
slaters
latters
debabrata
kopacz
arcuri
movingly
saltwell
montandon
plantsman
fiats
whoopie
danette
kunsthal
milanés
bragin
senario
loor
cryptorchidism
zareh
chesshyre
laurium
terrien
sobrino
exsist
aglionby
enevoldsen
stohl
rousso
cyrill
jakobi
kuramochi
stratocasters
connexin
irrefutably
migori
iesa
himmelfarb
gohmert
vdo
hochul
gidman
asthenia
hais
olegario
yentob
nemmers
lubick
kobori
aracely
pensieri
stanimir
wirthlin
kumarakom
mccullen
buntin
aberlour
kelabit
lorbek
faena
boorer
phas
elwick
oshie
cullimore
xiaomin
omnisport
premies
wirtanen
chislett
mouza
larrie
isy
siemer
leunig
hyperkinetic
takoyaki
squirming
litwin
jobcentre
dorridge
robinett
tompkinson
nitrobenzene
neocatechumenal
godbold
martic
getto
dkc
excelent
maruo
namikawa
tausig
dallimore
grinde
planarian
geovanni
ftaa
slateford
rambin
prai
moyola
shuka
safehouses
gigabits
fgcu
drylands
forints
hundredfold
saheed
gerontologist
golombek
aulin
espuelas
vanderpump
karoon
ables
butterfingers
whith
pyrolytic
haayin
magomayev
magdelene
easiness
purwakarta
weiguo
noticably
tibbles
nyingchi
hinzman
webiste
mooncakes
parrs
thyrotropin
memantine
genc
radko
meindl
scrimmages
divisi
roehl
seabright
georgica
postive
triac
kardos
stannington
meltham
aardwolf
jaanus
shrimpers
tugay
housebound
sotero
vanderbeek
garamendi
saric
forayed
supervisorial
cacc
riddling
gallager
headcorn
collagenase
firebrick
rocketplane
pennon
mekon
shamsur
sanaullah
haikus
premia
akinyele
deconstructionist
couderc
antifeminist
yohn
iracing
conscripting
buiten
carriker
fruin
shapinsay
henryetta
hiatuses
höch
buza
dellal
cogger
heale
kieta
alliluyeva
tecuci
helders
wiklund
standoffish
windiest
terrasson
rosenior
sudhanshu
zigman
dslam
enco
nvg
tsum
woodpile
plonk
jingyu
dirges
seeder
grings
willers
pampulha
quirinius
babacan
realis
pallikal
hindraf
cdti
yongping
seipel
bilgin
slamball
brelade
cavalla
zaca
chinaski
praagh
fuzed
thetans
bairn
xiannian
wearhouse
submarino
dystrophies
kulwant
cial
hulten
chernyakhovsky
skall
macbean
segan
nathusius
knifepoint
superferry
bjorkman
alinea
hosh
gamkrelidze
mawddach
stickles
scourging
langille
kpe
uniqa
pridgen
nottawasaga
barbells
werkmeister
buffetaut
starves
môr
shahbandar
sudetic
hormonally
tookey
nazan
egoists
teske
spains
kesten
jehuda
starmedia
coyly
shua
cerana
mendacity
daiquiri
mikva
pictorially
memristor
tricker
odontology
backlist
daube
rongai
vestroia
unavailing
elrick
addressees
hinnen
multitudinous
ghostley
waterworth
pitton
miltos
legler
dubner
malenchenko
trefoils
nosedive
vels
reachout
cazale
summitted
afip
tietjen
handfield
patchily
propellerhead
gilts
arboreum
lauth
tappy
missie
crispino
helheim
nelmes
tippah
vanniyar
feldon
screencasting
bijlmer
manningtree
arria
marich
holzwarth
schlender
petrik
banlieues
seigel
metallurgic
dustman
porsha
kmov
ozick
gayda
wijngaarden
frights
aceros
morozevich
abdulah
enchilada
kallie
huntingtin
oliseh
jollies
jegan
lindenwold
belligerency
bouchon
quadricycle
gentlest
stentorian
tavon
recalculate
reckoner
stratfordians
hellhounds
telegu
deeg
ambalangoda
liburd
vlasenko
gogolak
unwinds
geoffery
farsight
heen
julich
rewinds
softwoods
wretchedness
printworks
ursini
woldingham
toxie
lidle
stilson
odendaal
chis
imos
satchwell
disincentives
gorelick
wallman
mckinnie
misericordiae
tecno
kickham
ashwani
sharifa
petralia
kuhr
keenlyside
anastos
aformentioned
linnhe
shawne
parasailing
armengol
derec
papps
ár
répétiteur
safri
davion
beder
durgin
glatorian
lomaiviti
patchway
hamberg
faustini
orok
kapono
duhan
franglen
barelli
celata
summerskill
rachida
stereolithography
verifone
suona
constanzo
homeplug
hazelden
evanovich
crackin
hipwell
spherules
trammps
piersall
leonardos
margiela
karttunen
rodallega
havanna
zamor
distil
darrius
mentees
tostitos
cutlasses
leimbach
drf
sloatsburg
scorelines
scotoma
ruhrgebiet
vassil
toplessness
blofield
allouez
letham
dinorwic
saphira
fulgence
emburey
thornber
agaves
occassional
goehring
unsan
balado
knoe
beiersdorf
perabo
broths
rixton
hilliers
ledwith
dizygotic
hangeul
contextualised
wiechert
telesis
spicher
kolis
mendizabal
remanufacturing
bensenville
hippopotami
dudleys
sturman
galeazzi
nawfal
grumbled
neocortical
tinman
cheongsam
byr
behaviorists
moonlite
arlésienne
cerion
horrifies
kagiso
alver
mcgivney
maseko
worlock
chimamanda
deringer
badrul
atiba
quencher
limuru
wearables
angelman
southbourne
dawsonville
tomich
harperone
sabreliner
idemitsu
scholte
pâte
konkin
stache
gasthaus
cobus
usefullness
viscerally
sreedhar
bachelder
sahajanand
luneburg
attali
barbin
organochlorine
cagoule
fiorito
pitz
sprog
dedecker
ellijay
tschichold
hollman
delalande
dinnie
haynsworth
cointreau
sutera
jumilla
lonborg
glowingly
elkann
emollient
rogov
spaceborne
ringle
ebene
tadworth
golin
pasodoble
uchitel
hertlein
techfest
cyanogenic
reemerging
hogtown
khejuri
senbei
dpn
nanocomposites
explorative
vostro
fraticelli
boyi
matsutake
carbapenem
hertie
francofolies
bandsmen
prehaps
deronda
syt
yayi
tramuntana
stoychev
dilhorne
hereabouts
lappalainen
palatability
percolator
arpeggione
rubha
mdh
meret
uscc
jaglom
malodorous
chambray
yeap
chromatograph
leisha
covenantal
mishearing
calmes
callicoon
auli
clubber
turiaf
ystwyth
charnin
nondisclosure
danticat
ridgelines
peruzzo
steinhart
blek
horrorland
chettri
parastatals
vertes
hailemariam
kayunga
mendl
scratcher
isador
boxsets
cahuachi
southview
barbaresco
nonindigenous
bhakra
pavy
sabba
bazile
sparkford
nmg
ringback
karie
hoiles
acquiescing
sleepyhead
versi
fereydoon
abendblatt
hyperextension
esposizioni
queequeg
furney
fomc
nonclassical
camuy
dhakal
dwele
dabi
condat
elvet
berganza
nimby
enstrom
galdikas
nikonov
cauterization
bbsrc
yuni
sarabjit
dundry
satchell
longland
puleo
pistilli
kisen
divison
astarita
centralizes
patwari
isetta
aquiline
araucania
weidinger
pinga
insource
bocchino
enterocolitis
bisse
ratso
dingles
marjo
umtali
mljet
efr
robusto
igb
lydden
borujerdi
mulvane
jaster
intones
softeners
rationalizes
dumbwaiter
lehren
pervasively
acocella
hfl
savoring
tijeras
necator
jabala
vilhjalmur
destabilisation
nhler
foccart
dipeptide
initialed
cuchillo
karasawa
nkandla
eske
multijet
hissed
ehi
belke
passin
boondoggle
zengin
jeta
doyenne
transaminases
sunburned
vlg
caudle
jaydee
fumie
cypria
stingley
timucuan
somerleyton
averts
naude
staving
krewes
yetta
seigi
karimun
yingli
foreshock
cbms
börner
flimsiest
poena
weeze
bellecour
jerichow
aames
mazan
trainload
bitlocker
okeanos
southill
communale
jila
expropriating
mifid
anytown
texada
battie
treisman
zacharek
shamma
wollesen
cosham
torticollis
wrathall
argov
scheinman
cutmore
biomonitoring
hardik
kiddle
thisara
allopurinol
alldis
cardale
widecombe
delport
mohel
kasch
abjure
tbv
spawners
contis
framboise
wujek
glemham
orkneys
yuge
seke
jollie
paxil
barcombe
speir
bloomingdales
covo
madhvani
raan
fiumara
boxborough
kizil
tierce
rhinovirus
heiki
neblett
catsuits
drw
westmalle
dysgraphia
brinkworth
rybczynski
goddam
hassane
kudasai
dungloe
microbus
outreaches
nonliving
anouar
sakamaki
tgp
disalvo
sneered
omfg
bharucha
beah
instillation
havelis
jks
lintas
mummer
shabwah
kirt
lunardi
fujima
cauquenes
monomania
acsm
chism
wns
goudge
naiqama
theun
nappanee
balbuena
dré
noynoy
gamp
demobilize
ucluelet
frontmen
ratterman
seegers
liebes
klapper
coppelia
fleeces
neurotics
cedeno
wangyal
kramarenko
garreg
bgv
petkim
outwork
kaberle
carolis
zimba
powrie
glanmire
gnjilane
bavaro
liet
amadiya
godsell
stavrou
gehlot
retweeted
xle
brownout
irritatingly
cartlidge
rutz
jamarcus
lofar
gravediggaz
ktf
solicitude
lehmbruck
curfman
batya
suhaib
unifi
moszkowski
ibnu
antiphospholipid
mcnall
malefactors
cushy
vye
coffy
vzw
rtls
manufacturability
raheen
frombork
nepeta
delizia
drimnagh
picoseconds
tripodi
kihei
gaetani
tuh
ransomes
ccgt
dymoke
machell
etoiles
nikitas
helminen
demarcations
soothed
primroses
rivesaltes
vison
perman
okolo
hazael
readman
icehockey
hettiarachchi
sulejmani
unschooled
thinnes
tosches
acdc
greka
naughtiness
brodin
gerges
mosteller
subsidization
escalations
pennal
meye
knigge
bursey
zahida
linklaters
montfermeil
marketeers
taieb
martialis
macfarquhar
geoduck
maryellen
nuestras
browbeating
ivanovski
mejlis
mactier
allal
whut
zaiko
fairbourne
mauvaise
manase
repairers
geishas
ecdc
duco
tzigane
afsana
anello
winegar
idealize
andis
veco
gemalto
abrasiveness
faires
singal
pavoni
voalavo
recaptcha
tattenham
upholders
berriedale
deconstructivism
zambians
pathmanathan
jingyi
chilson
unsg
cleal
gottfrid
torcello
scovel
stoping
classon
oakengates
timberg
rudisha
pizzorno
doomwatch
clippard
zoghbi
footway
razov
bondsmen
slidin
flitting
ghuman
vons
mollah
kirschenbaum
affluents
crippler
asanuma
manichean
rohinton
bremanger
lívia
ucits
labarre
wamsley
boobie
nipah
azzaro
quintillion
nudd
heav
seery
krekar
grizel
shunryu
dreifuss
mockridge
spaceports
ryba
stoped
maluma
ealey
medd
satisfactions
shandaken
marram
commotions
immelt
rukai
googoosh
rgp
isec
pumpernickel
dallow
alj
shigematsu
narcis
footedness
raghunandan
luxuria
ezy
tressler
yasnaya
willens
hoepner
jianfeng
ridolfo
kiski
wyncote
topolánek
kyc
broadgreen
intraday
gollop
sriramulu
cotgrave
obeng
hamani
previa
vrooman
tregony
litte
tsakhiagiin
paksas
gummed
backtracks
dujuan
proenza
intepretation
musti
diler
heliopause
preindustrial
plaiting
sugaya
ishfaq
diesendorf
calli
intersperse
lindiwe
carreiro
schoolchild
vejvoda
amont
jabi
reyn
warsh
shamshir
cleanness
bbh
recondite
anastasis
boskovic
cfra
studt
soupe
ebensee
xinxin
utri
cavalese
favola
dravs
agag
whittell
maati
husam
hypnotists
toonerville
okur
paranoiac
denfeld
jpod
fardeen
lenn
artifex
haes
vivianne
pratts
barbès
chelli
roumain
natanson
soumah
latacunga
guerry
mecc
bellon
impastato
rendova
packie
kostic
pbg
heren
tucanae
edell
karyotyping
cannady
cict
mhatre
condi
drawstring
chudnovsky
scroobius
windsurf
wrangles
furriers
sunexpress
arnout
awst
shortlived
abderrahim
culin
forsell
pratti
videoconference
wimbush
fictionalization
sangharsh
sambu
savane
unidiomatic
augenblick
saltires
jym
mugambi
trofim
overwintered
onramp
iasc
gametime
euphronios
sokak
nnenna
horcruxes
karlsplatz
tyce
shatz
kilmuir
dewit
sorceresses
aics
sakia
chagrined
borko
pennard
autocourse
diltz
aqib
toepfer
oppositionists
ofek
reified
halkirk
pebworth
raczkowski
dikeman
yukiya
absented
dixey
fayet
kampa
vaitupu
masaba
yellowbeard
ellenton
orthodoxies
sonae
hockenberry
sefolosha
enderle
irradiate
uninfluenced
interlayer
goldwag
nasally
rozman
mcgahern
vanderbilts
duranti
mccrery
signorile
rogg
fearghal
stehekin
blunk
appealable
derussy
floorboard
moxham
roadhouses
ciac
bouillabaisse
dropsonde
ciat
rebalanced
leana
priveleges
bardez
cozier
montet
katongo
sautter
begetting
movieland
pettingill
netherwood
guajiro
albertoni
mirrorball
sociaux
shinwell
paliwal
incompetently
pilli
larrieux
mimran
escs
klipsch
wopat
kitaoka
mcsheffrey
steeltown
interworking
brilliants
gardy
defilippis
gcos
clapperboard
glaum
rybicki
munificent
fni
irakere
wirths
coronati
mccroskey
pataky
elvire
needlefish
panko
biglia
nataly
sojka
kirinyaga
abridging
icey
nevern
kalinago
tillerman
passbook
yingzhou
jodelle
plaskitt
gianelli
kiet
donator
anthos
googlemaps
milstead
jerónimos
staker
cabranes
matschie
likhachev
coypu
röhrl
prettiness
ballen
proconsuls
zanoni
raffel
besso
slamannan
pjsc
camkii
kret
zol
mitchelson
reinstallation
intercountry
garling
heathman
damin
horstman
norgren
ktxa
uclan
sharf
namp
vist
garlasco
entreats
attari
nabs
zeisel
drecker
sixways
positiva
victimes
roundway
drek
gionta
shans
kealy
yati
didim
brancusi
fischbacher
pasturing
porche
napes
mascaras
wekesa
wohlgemuth
aout
hessilhead
willerby
delek
goswell
entwine
longhaired
meanderings
madaripur
unsociable
katsuhito
reffered
bilheimer
domesticating
burkill
akhalkalaki
chuseok
asiavision
yane
laboe
benezit
kemco
ostiense
seafish
balda
zhiyong
reihana
muradov
engles
nisman
kopje
preponderant
simular
westerby
hotten
reanalyzed
kummel
palaeoecology
loek
crookers
archimede
csrc
debilitation
radionavigation
soldierly
samih
ploughshare
halphen
lurssen
baldwyn
hotaling
streetdance
eeas
frizzled
tuteja
sape
sweitzer
delaine
moravcsik
seafair
skulking
klr
uncompromised
herle
delone
turnbow
billin
slesinger
bloons
morrisseau
loadout
kibbeh
calmac
zuhri
ardoch
revelled
shvut
garderobe
excelencia
quadrimaculatus
unmee
reynoldsburg
deanda
wkn
ustc
diminishment
balshaw
breger
barbecuing
omelettes
augers
estacada
seafires
fistfights
kurnaz
benett
klindworth
wair
chagossians
kosloff
ueg
khatron
jiguang
,so
haseley
hummocky
rtaf
strathkelvin
wkbn
syers
ravenstone
foschi
carcharodon
michelia
whippings
kday
skot
ruffe
staysail
tzion
aota
lesniak
fudging
semiprofessional
lyria
dgps
dihydroxyacetone
levie
perlstein
qawi
astiz
flamm
aubier
ronna
schwede
freeburg
mantoux
bramantyo
maleness
acle
airland
dissolutions
wildsmith
carnevali
beanbag
delgrosso
candan
tejaswini
louk
plec
mapinfo
kurdistani
cmes
descente
cécilia
felicien
technologic
europeanists
nepalgunj
klopfenstein
aesch
sonin
unknot
auel
whinney
peacockery
neuroendocrinology
expedients
pterygium
bulrushes
xga
godfree
dongpo
norv
sgouros
paulins
inadmissibility
steenberg
palazuelos
jank
amathus
copywriters
maartje
bushwackers
ingleborough
brusa
karpen
vonage
maspalomas
bergers
auctioneering
llanharan
encases
mcraven
faherty
pyrokinetic
digitise
ernsting
myeloproliferative
cajoling
shakar
jlo
diopters
fero
monosyllables
colaco
freedberg
penknife
wwlp
montois
kalmunai
hathway
eireann
reinisch
distractors
dellavedova
scip
rtnda
ausubel
rached
liddon
mallery
jenaro
kendallville
meehl
reeking
egorova
raffaelli
lagrone
guiliano
pignut
allofs
marzouq
westwego
rastenburg
gosney
hichem
khq
schapira
juca
lohara
syam
flitch
seatruck
mrls
antiquarks
schwarzmann
hexagrams
hucksters
khc
heavitree
critchfield
ruxpin
hudood
autotune
scotstown
hafren
billable
mousley
cheil
lopa
perella
dondo
forcings
pankin
furtively
laths
whoosh
simcock
sughra
wetherspoons
anemias
eveland
udrp
houstonians
yarden
hermanns
aukland
spacebar
proscribing
zango
comportment
menstrie
nucleobases
subeditor
crownsville
hurvitz
stanchfield
sensorium
claman
accommodative
roobarb
godi
becquerels
taumalolo
zinsser
ruminate
sherfield
felip
topinka
postprandial
neoguri
pahad
druon
lonigan
guantanamera
heiskanen
cleere
sumio
governement
staf
cruciani
hornik
klapa
ayora
airboat
todds
nesquik
khodynka
llantarnam
fougasse
doleful
phlebitis
okoh
kozinn
tigon
dubonnet
triumphalism
wegerle
sternhagen
neurodiversity
wates
photosmart
infn
squeaked
vexations
raichel
branchburg
midsayap
plese
songaila
balderstone
mewn
ingpen
berneray
marnier
deras
clampitt
paranjpe
morosi
gossen
breault
multitask
iaapa
informatization
progs
ningaloo
tacticians
pillon
schiavelli
wongso
gingers
didar
cndp
ulaan
quesiton
overindulgence
talisker
fuda
discographical
hauritz
petherbridge
grillz
irit
twg
scalawags
lebas
farshid
luciferian
wcr
crimeans
betamethasone
polyolefin
clairol
francos
ventilatory
buxhoeveden
holmesburg
chenghua
troche
woodling
houshang
adai
alejandrino
nessim
porchester
novem
braud
meeropol
dorigo
vandervelde
singhs
becerril
hauksson
kuter
besim
macronutrients
overclock
oración
ruga
kullen
asola
mastrantonio
burghoff
caucused
wachsmann
flsa
carrolls
feshbach
etape
hopcroft
flophouse
westergren
perfumers
freeskiing
unshaken
orianthi
azis
colmer
volsky
asianweek
coleus
wernbloom
gunzburg
errigal
ualbany
pushto
estell
derwood
grassby
timescape
spani
ridzuan
unratified
bernoldi
slawson
understudying
rld
gosselaar
gallwey
bikeways
tejpal
koegel
negrita
spag
chrysotile
lucetta
zanclean
datz
mcgettigan
jurevicius
otn
policastro
pettyfer
othmer
hongbin
pelphrey
lulav
vilella
cuckmere
nierenberg
delimits
freital
mystify
fairholme
zelnik
decelerates
swamis
diran
castigate
merkerson
bulis
elswhere
mazzeo
amru
ewins
searsport
tastemakers
petare
margrete
fowkes
jiaming
equivilent
mckeldin
huseby
vasilyeva
wiedmann
restyle
anorthosite
rangy
nanog
groover
bantus
qra
cshl
grymes
jomar
cordel
kolka
murthi
phthisis
gunwale
wonderlic
bionda
araminta
foaling
awada
strainers
paish
sles
joannou
hanbin
junkman
tercel
uley
neals
bizzell
lardy
illumined
whifflet
weenies
scifo
leipold
narnians
workwear
paetsch
carris
personnels
gunji
souphanouvong
netroots
hectolitres
merga
kyrylo
corsie
fesenko
cuber
süskind
whited
beanies
christon
snarkiness
mypyramid
mousepad
malartic
izabel
silvretta
halyna
pottenger
landale
leonsis
acclaims
serino
darl
skar
standardly
eya
cfw
iglehart
taylormade
cantigny
zajonc
inamori
pézenas
lokman
attn
marioni
payami
aliye
sauers
osagie
oxburgh
epicureans
rocketboom
kameoka
portel
takashimaya
toyopet
goertzen
yariv
schaech
yojiro
ccleaner
haylett
mechatronic
johanssen
kononov
ercp
wesendonck
brokop
mcad
botswanan
terk
shudders
choto
dissembling
burghead
wasti
yeldham
akinfenwa
valensi
lingvo
izvestiya
warung
stanisic
dumai
eharmony
unmetered
georgiades
memmi
howled
mankin
merchantability
freeney
anonymizer
namus
burps
economique
jadeed
sthree
lucayan
factchecker
superstate
shante
forelock
jaheim
athanasiou
bsee
ezzard
tamta
amruta
bookworld
playpen
dharmaraj
metabolise
rezo
votorantim
nolle
choosen
bunion
fpmt
jenji
waterslides
uncomplimentary
lwp
sikdar
zic
parkhotel
guayanilla
sokcho
oah
ondimba
dilator
lgl
taymiyya
signficance
kiskunhalas
egotist
araripe
nni
vespri
ishino
barny
tomohito
tjc
velzen
transects
gastronomical
gameworld
profligacy
richwoods
eckley
lendale
geochemists
walsworth
kuryakin
beerbaum
eisel
skewering
waldie
boatloads
essm
statut
wery
gerbrand
kiwami
giff
decipherable
gastropub
hesseman
unexpressed
jucker
tejero
detente
workprint
haughley
canalis
naem
mccalmont
oakington
lifeways
renauld
dibella
easson
incomers
itsm
jacs
kaprekar
nwfa
looter
oakden
suppiah
intersperses
foxit
ecotopia
terrifyingly
doubloons
recognizer
glasspool
tapo
féret
davidsons
paranasal
internetwork
roulin
agritourism
midford
pandav
sodomite
wurth
vanel
gyda
sachdeva
altesse
fumetti
kandiah
szwed
waldenbooks
lambdin
itvs
pilchards
neurokinin
minnich
multiphasic
solipsist
bmu
idealizing
eilidh
vaselines
grealish
belinelli
ashurnasirpal
ahb
carcieri
lybrand
heaslip
sigg
bottlings
sawin
sliwa
gladding
asashoryu
enumerator
orquera
semenova
taitt
giedrius
waha
sneakin
bhagwandas
cange
gardini
viscid
lepic
jehad
frognal
redsox
clastres
salming
laoshan
polychromy
burca
vugar
tingay
amaurosis
deshields
steindl
estherville
unfermented
desplechin
gearoid
xds
mangusta
abdillah
yellower
defrayed
vandeveer
wagman
endura
pickoff
estriol
dsch
wormleighton
stepanenko
sharnbrook
cadenet
derana
khane
chovevei
needlegrass
padwa
ghorbani
chukwuemeka
backchannel
raunch
mateship
buraidah
casamayor
shalikashvili
kainate
astrolabes
preiser
abat
sagged
cygan
tavarez
diagrammed
perillo
remissions
bradly
seedbed
lesly
canfora
baqer
petróleo
bacar
kilberry
mildness
realestate
coupole
gaddy
gmh
ancillaries
teyana
sile
spiaggia
kwanten
dourados
synesthetic
wunderbar
feick
baratz
flosse
musandam
nyhavn
loreta
overprotected
conto
astwood
fruto
sölden
tricorne
haass
schusterman
keigwin
khawr
codifications
kapali
unfulfilling
ogu
halbrook
samiullah
yabby
heshan
lederhosen
tehillim
sipi
charner
fesa
fromer
bongbong
stefanki
southwesternmost
mizelle
commisso
onrushing
sejima
pibor
immured
bussie
khosravi
swatted
hatikvah
panpipes
melonie
seraj
ulo
pelkey
verbiest
mccombie
ingibjörg
vecchione
wagamama
raxworthy
russin
tansman
candis
aynak
bovino
scalzo
menter
guttridge
immunosorbent
gallowgate
rrv
gialle
chazelle
broadnax
kounellis
airfreight
xiaoling
palimony
fynes
gorokhova
jiggers
micromanage
cnor
karmichael
aapt
quisqueya
solicitous
namita
pachacamac
neuquen
pessimists
talend
greengrocers
saryu
uniao
hendershot
barbon
markan
sawtry
bahaa
yamcha
givry
unaipon
inflammations
stabber
yubin
provos
intercommunication
augé
decio
henrion
aaryn
guangfu
sagada
miral
leguizamón
legitimising
crysler
courtemanche
hennesy
cockington
leibman
conventus
jodhi
sheeps
egidijus
privett
chicksands
hudal
rinke
finestra
borun
aiping
cesarewitch
rainout
faïence
jagielski
moisturizer
trombetta
gauzy
holmewood
wabush
weerasethakul
milholland
sandall
gargle
skyla
overthinking
jeanson
anglocentric
ochocinco
paperweights
khammouane
submissiveness
westbeth
markeaton
newroz
wmgm
wymark
driftin
storen
sectorial
shecky
kittles
redrafting
dalitz
airglow
groupwise
rotonde
hellfighters
kinka
gurmit
kaser
mondulkiri
andrieux
athaliah
clatworthy
pepes
personnal
conviviality
vizcarrondo
joeri
makharadze
pipino
welly
bovril
sletten
krylenko
studiolo
matis
odem
augustina
alario
felicidade
harned
wildt
crisscrossing
fibrinolysis
lamplight
ennals
buczkowski
carpentersville
upe
blavatnik
prospectuses
quested
fissions
muraki
bundibugyo
wiederkehr
scourfield
devastations
jolicoeur
kardon
unvarying
grl
exfat
unwatchable
alshammar
snarks
cliffsnotes
arngrim
davtyan
zurzach
wrightington
boice
kostopoulos
ostroff
manezh
multicasting
ocotal
ziemer
ronee
perianal
tegid
kumgang
palena
fischetti
ijm
wbi
jsb
spenders
martyrdoms
implacably
unbelieving
optio
ejaculating
soacha
lawanda
guiro
clothworkers
horehound
beaubourg
qattan
psyllid
mcgrain
wittes
maniaci
jailbroken
cegielski
rtca
alphege
moonlighted
vanves
leinonen
tzipora
avaaz
motijheel
roura
ferrofluid
georgianna
anthroposophic
kunka
noiseworks
photorefractive
etm
typifying
desir
tiburtina
slaved
glaslyn
siyi
passel
brailey
prosector
elgie
mcpartlin
sportsbook
ghazl
kellow
patiya
mispronounce
wrangham
farenthold
cohesively
duping
hanshaw
rrl
inish
maufe
panya
polus
profiteer
whinge
almen
woodfill
clugston
inhibin
ruchill
heese
assem
chaplaincies
frazzled
hensler
memorising
beens
galizia
gruver
cuttle
forthrightly
tyack
ndfb
neubiberg
kaptain
csny
oscarsson
aristizábal
doofus
cosmides
bomblet
micaiah
hammami
rahmati
wohlfahrt
rickrolling
wickett
sparkled
zadig
lombardini
denbeaux
jordanville
beauteous
tishreen
destabilizes
kuchen
rossiyskaya
mitel
liberton
sranan
spintronics
garlanded
bernero
sellick
jiwan
agaisnt
celant
rayamajhi
ameriquest
kolles
ingvarsson
trenta
touray
coulby
murieta
asafoetida
durling
strathy
servizi
wyville
moxibustion
roundy
playschool
edgecomb
somersworth
altissimo
ziana
menteur
qilian
alao
topographies
mankiller
foreclosing
azarcon
nordal
misfeasance
uninstaller
médica
zeiler
ameba
balang
wrtv
dhaba
sehorn
ludivine
neighbourly
wheelmen
banesto
antinuclear
chiddingfold
yansheng
kbjr
danceteria
parlby
splendidus
incroyable
severine
kreiss
darrick
wailua
skoko
rovner
goodenow
yodeler
wujiang
mrazek
freundel
dangi
squiggles
aberlady
diabolique
mcspadden
morgues
harinder
toadstools
valiante
unfortunates
comac
icho
tihany
princen
onigiri
infringment
gratin
circumscribing
phimosis
marvellously
hardeeville
subedi
bohle
riai
mallaber
tobita
lariviere
melan
cullens
reliabilty
bonanni
evins
maneka
rockley
endoskeleton
sumthin
chargesheet
florsheim
trembley
chignik
kishu
tariana
jinhae
ktvx
benvenisti
cottonmouths
imec
reichstein
wholemeal
favaro
cierra
conciergerie
bakeware
vasgersian
iln
sedano
bluestein
blitt
dahshur
phytoestrogens
tannersville
basden
pterodactyls
amed
sagrera
harptree
firestorms
birthrates
gulnara
soie
sixpack
delila
karow
ayen
vereshchagin
tieghem
notifier
hoatzin
rahbar
durio
baisden
vides
castellucci
massara
didrikson
icrisat
hamor
offner
rothorn
zieba
niram
reversionary
snowplows
ratmansky
fundi
kelloggs
gapyeong
chriqui
tasse
mannis
shennan
cozma
rostova
eeghen
undershaft
jinggoy
destine
mosaico
synaspismos
sumant
colaiste
kurve
bhante
teahouses
impertinence
clottey
osteoblast
uhry
avontuur
clon
epicatechin
hagupit
ghaddar
mürren
sohel
ough
gosar
benvolio
democratized
shakespeares
floodings
deividas
helbling
kmk
unitar
scorton
sabs
freilassing
erstad
brancaster
keuning
molybdenite
leuser
geula
anythign
custodia
thrilla
retrospection
cappetta
buki
stefanini
ails
velleman
pelleas
toyshop
tawana
cleanses
kouri
johri
kdc
jabor
turo
dragones
collaged
gorai
bohman
oneasia
outdid
karpovich
séraphine
butterwick
godar
horchata
andreani
phalangists
poppi
druse
itten
ixus
pade
nipon
salmonellosis
cuy
alting
divisionism
wallie
guinta
firmed
jeannet
bolkestein
covenanted
beccaloni
dedan
berdimuhamedow
vilbel
segueing
cmcs
hrubesch
paysans
hanwei
masimo
gushan
kretzmer
radzi
kassie
bacri
usurer
desisto
shotcrete
burs
cherrywood
medianoche
traber
chuvalo
wilhelma
imacs
momtaz
daykin
trient
arnhold
munsan
fahed
booters
acj
danah
pillinger
kofa
spandauer
gjerde
brighi
semioticians
gibber
xserve
corrigenda
mullumbimby
gho
hillhurst
jurats
bergeson
josi
pompa
natto
millbay
surt
geran
satriano
mingay
freeplay
gaman
gapp
bresslaw
coull
crowson
mcfetridge
adebowale
peppas
egomaniac
macedonio
nars
songa
kuss
poundage
inorder
morrin
caythorpe
shuyang
interrogatories
bronston
oliveras
ceren
chigumbura
tappi
tingo
tulley
djo
docility
caam
sevillian
geddie
ciudadana
knead
metherell
rubida
ncsc
irven
boardrooms
hotell
salinization
kinabatangan
sedo
telep
kalala
enlarger
danoff
weathercaster
maragos
frappé
aerotech
tywi
glorie
dharmachakra
adiemus
hudsonville
bobbye
hellerman
fpn
erla
minchinhampton
genny
salanter
tsurumaki
appelt
zinzan
litvinoff
consistencies
beauchesne
obraz
carluccio
charwoman
giacobbe
vivaro
calpain
sapodilla
musonda
woakes
dogstar
arns
nooijer
alexio
anai
budiman
fingerlings
satnam
wexham
interchurch
cmhc
arrack
brazell
cottontails
blinked
laurey
mottley
hermsdorf
qanbar
pixeljunk
cuby
fourball
zareen
raybon
rebhorn
microwaving
monkshood
decentralizing
benozzo
skela
glaziers
nussle
locoroco
molano
vimmerby
dietrichson
aquinnah
guernseys
nedeli
counterpoise
cannella
clardy
wrixon
propylaea
blatch
frohlich
orfila
phonetician
thommy
buraku
bornand
pechanga
herria
aterciopelados
pendelton
propellors
ashlag
lixin
ivanenko
comancheros
tschopp
solarcity
kollmann
hiden
synchronising
noci
supersize
ecus
demetra
oelwein
instrumentations
proficiencies
garrigus
kyriacou
classement
pencoed
cubanos
crosscutting
lueders
vellai
cassata
shahul
grupp
bergsten
zecharia
metzingen
palea
duihua
messers
vicino
saliency
hobert
armyworm
etec
mispronouncing
sacheon
altix
syngnathidae
doka
jefferts
csat
leaseback
greenlighted
nikolova
florica
papagayo
sloyan
jba
hotton
feedlots
recalde
essi
jelks
evc
linspire
rahayu
unconsummated
warchild
lellis
verité
chisso
ketut
calcot
onslaughts
sucia
haemolymph
requirments
lapdog
zingg
toytown
roader
sesostris
mirtazapine
guideposts
otterton
ramorum
navarin
deinococcus
alnmouth
westampton
qadam
bonwit
verducci
antonopoulos
opinons
controversal
thinkable
hardwork
feltman
kukulkan
contraflow
dalliances
sportacus
labre
stenographic
biotransformation
compactpci
khaing
golshan
palmatum
alperin
rizzio
kurung
kdh
blacula
saltford
jedermann
hasim
cyberport
origanum
naud
yvr
mutliple
vandergriff
desaguadero
mckennon
liván
vocero
babysat
awlad
mollohan
hgp
tiddy
ebbert
talegaon
yiorgos
redcat
ellenor
cosmopolitans
legitamate
embarq
peotone
malalai
posthuma
kajiyama
marilynne
scheuermann
waskow
contin
dand
traversi
talmon
asmik
megaproject
nevius
cheves
stallworthy
goheen
polwart
elusiveness
knayth
dehri
wakin
esport
bibbs
callanish
burkinabe
panero
doogan
batterson
peric
flub
ltn
coarctation
flindt
garmsir
punit
iachr
sanket
einat
illana
merricks
touran
staithe
andraé
malaysiakini
predjudice
sumika
anwr
adeang
buddhadev
mattituck
fluss
mcclarnon
feile
matola
candon
mipi
babyz
turfway
randomizing
calayan
bathans
fadia
midsouth
lightkeeper
crites
ludens
fikr
gegard
comand
lerici
krit
glasco
virtex
lefrançois
dunwoodie
kukui
amic
mensing
hfp
lorella
jabbing
trazodone
astori
hbg
kleen
namdaemun
przybilla
bentleys
michoud
cleavon
stirlings
hahahaha
weatherization
joyo
goller
sandwick
taja
pyelonephritis
hlp
terrestrials
gurira
hegle
fwc
wobbegong
olding
elzinga
eltingville
minny
bluffed
toadflax
nakhla
luchini
boorda
caronia
runout
costessey
penhallow
dysrhythmia
approver
egberto
dazs
sayad
vanita
ovulatory
squitieri
eifion
griffithii
snick
powderhorn
asiata
konner
tremolos
liptak
reducers
juiceman
raffled
yuyuan
materi
gustavian
roundoff
cnas
cuu
seborga
budworm
dispersals
endell
fontella
hesser
automaticity
magglio
asali
ghad
suckered
whiti
hogweed
cynosure
narcisa
obediently
rukhsana
sentara
consalvo
tissington
thorsteinsson
mathon
hazir
recapitulated
depaola
funkytown
shamkhal
floridan
flyswatter
vanko
hinrichsen
varto
frevo
wrva
preheat
postale
cnic
mahara
footbal
layar
annonymous
ainsty
albertazzi
empyema
expiate
kharms
hardtalk
malaparte
birkby
aboubakar
kawata
bushfield
xiangdong
zaltzman
sahr
knipper
sieh
eagleman
darkwood
ballater
sdio
pfn
pawlicki
caldy
pinzgauer
baumol
olek
faeroe
bonesteel
postle
sackey
totino
carnero
rinka
battiste
graciousness
particulière
nulle
mopani
chadians
doppleganger
buday
readymades
bourland
netherdale
sonicbids
bertola
hfi
klasen
twemlow
xfire
paneth
bunia
htd
suw
toiler
specsavers
outfalls
boneheaded
prasetyo
fcoe
highcroft
pinatar
chondrules
mazzarella
maxwells
honeyboy
herra
vigas
orlo
cheyrou
educacional
reteamed
seps
snarls
stillington
ebeneezer
reconsolidation
seiche
drydocking
footfalls
balter
hanney
loenen
rajavi
huckins
thymol
teba
levofloxacin
matthewman
mainali
atwal
beymer
qcs
jaswinder
razaq
jordache
nacala
tabacalera
yvo
avenal
dobler
donncha
jklf
bendiksen
matza
ejaculated
unreconstructed
tabet
bellier
tradeable
hiru
foroughi
uslan
cmac
nygard
tenley
bothroyd
reede
zacchaeus
alburgh
muammer
phippsburg
nki
utmb
vernazza
cartosat
bulanov
profaned
navaja
surjeet
gabbay
mennea
allods
keymer
bullett
biozentrum
schoolies
deianira
forgas
sextans
hseng
impractically
gazit
grennan
eastbury
hashoah
psychodynamics
piff
ranocchia
snarf
olejnik
uhler
kingspan
beeley
anishinabe
gumdrop
pantaleone
krasnow
hasyim
counterterrorist
breanne
nzrl
freedland
slbc
cuil
kurn
fruitlands
lecher
oit
drumbeats
cenac
voulkos
boatbuilder
arturs
endpapers
brinklow
kenickie
aneka
reznicek
gerets
fadeaway
nuneham
cockatiels
tadahiro
lonrho
decter
lobanovskyi
pebbled
sighthill
kurir
speen
agrochemical
btd
fcra
sassoferrato
daylily
fapesp
heyliger
jiahu
rielly
mubariz
türkan
holste
sufia
minit
prtc
litening
minev
undernutrition
initally
derks
sillars
yaghi
orchy
recrossed
gorog
kunama
arki
manzella
mcglade
jayakody
salata
auchmuty
rimando
reedsport
konosuke
rjc
bomford
jonang
atec
markka
polic
cantelli
anyting
enumerators
berty
groyne
benger
swailes
repatriates
aethelred
tryfan
abud
siamang
alredy
muin
foard
varta
debrah
afterburn
giancola
rainman
pollards
nagged
tergat
deason
adventurism
culliford
yettaw
bouschet
birse
snit
nanyuki
windjammers
ukrinform
gerti
bommer
yewtree
rutherfordton
stoyanovich
apotheker
teresópolis
harbinson
lorik
kabo
bowmore
unfitness
sugrue
pogodin
minahan
demars
bellissima
embellishes
phad
aigis
jefri
impliedly
albermarle
merlotte
raceday
narm
ajia
akbulut
atsuo
brenly
cosewic
thaye
peggotty
freebooters
toxicologists
chrysantha
gaghan
jongleurs
taraneh
asef
monsarrat
microcosmic
mankins
kcd
brueggemann
lumbosacral
taskin
chemoattractant
jellied
thuban
aldbourne
markinch
criado
chitlin
orgeron
gvs
earbuds
impermissibly
saomai
buckminsterfullerene
hédi
epigrammatic
njoroge
munhall
uncleanness
berresford
buste
migliori
ganci
faught
localise
kolla
ringneck
okpo
choling
mapother
penington
vinalhaven
mengel
elting
mistinguett
komarno
pleming
megapolis
colavita
felli
zeilinger
massport
escribano
siriano
vilest
groenendaal
goffey
osmers
nettelbeck
apparels
rbr
hygeia
soapboxes
dechert
almodovar
airth
weeki
jouer
squillaci
yangpyeong
vette
shipway
katimavik
pipex
avails
twofour
pomerium
welbourn
decriminalizing
northfields
ainger
landsbergis
blaen
gasparotto
mendax
blabber
wagners
kumyk
gdnf
jesting
notaras
threateningly
defeis
birtle
papy
braunwald
rimmington
polyclinics
kneeled
spoony
repast
rocketing
tmh
trapdoors
joash
wicklund
quatorze
tamburello
tonguing
deceitfully
maxted
schiffner
synesthetes
nesler
elysee
tunay
gemignani
hockin
naugle
microwaved
calì
nimani
berro
giratina
lavrenty
presale
kiriyenko
shachar
katsidis
hairpiece
rotondi
brunskill
haitai
bhuta
ibope
diktat
keloids
jordanne
coloboma
winlaton
deare
kraak
fape
conflux
mundel
meel
agunah
noisemakers
devesh
kamenetz
kirkburton
nuxhall
onagawa
kornfield
grimson
recode
nsaa
heelan
aggrandisement
proses
caborca
circumnavigations
ershov
msconfig
plimoth
collipark
labarge
gotan
silverpoint
rustum
hews
hirohiko
plomo
dillsboro
combatted
felucca
kussmaul
nizkor
ardan
schriner
lockroy
keshavan
ulua
iafc
dushi
gaustad
scotusblog
tianmen
eppard
banuelos
mazy
sarakatsani
widdows
alyse
nectarines
goodbar
slighter
hopin
aerodyne
requiems
nscs
karada
damnit
brainiest
samurais
lennar
splunk
bassas
sonthofen
drooped
yellowwood
masius
sealife
kishwar
hypnagogic
kavalan
virals
joio
bonadio
hypothermic
byeon
aislinn
isobutane
believably
eidelman
mellett
ardara
bonbons
jacinda
bartolomei
flatout
acree
derge
noell
downburst
binu
jannatabad
unrehearsed
afrasiab
wargamers
collusive
bracci
carnivale
bitting
eguren
appologies
colasanto
thami
shavelson
cjp
nale
micaëla
heah
transeau
oiticica
cabrita
synovitis
baazigar
earlene
licari
exisiting
venugopala
alemdar
wbb
xingjian
minervois
pestovo
diamondhead
biebrza
decastro
popovych
kaysen
jgi
gudbrandsdalen
jacobina
carmageddon
shchusev
purda
hellersdorf
vaquita
somsak
maston
stonham
wuest
halkyn
smutty
siano
jedda
macroalgae
nasus
nedo
lopped
sico
multichoice
bft
kobler
creatore
micucci
duston
broster
shouter
regimentation
anaerobically
danowski
galoshes
biasca
plucknett
dhir
furans
diversely
fitful
houda
gharbia
stampley
paccar
bhowmik
checo
heatherwick
liniment
orbi
ody
hemangiomas
wynnstay
balkman
mocker
ziemann
nollet
arntz
parcelled
linkup
asadollah
coity
fiendishly
bleeth
dingus
athiest
damia
digitising
financiera
nyqvist
messen
bardfield
liacouras
volkswagens
llcs
marsanne
quinlivan
marvis
ellerby
nicco
huntingtons
chaabi
abscission
kgalema
honeyed
ruit
edger
mazowsze
riegert
meisters
subida
gediz
jergens
aufidius
bajer
acreages
elsayed
unmounted
unspotted
mosheh
lavillenie
solr
oversampling
hamerton
amiruddin
softley
shutterfly
menden
readthrough
challe
hydronic
stupar
eoi
deferments
carbó
caramoor
finalizes
mousinho
aprista
yazan
zabi
brandenstein
selinunte
illum
brachypodium
reagans
southmead
centurio
buzaglo
enslow
ferlito
móra
quintas
nonresidents
sabai
peregrines
kdvr
caldbeck
gerbe
kadhi
dibner
succisa
eqt
rivelin
sabrewing
kimberling
dumitriu
gonder
toker
christianizing
cimic
comorbidities
bacalhau
mckeague
guta
raburn
artifical
akzonobel
pinfield
llewellin
genesi
krasno
zatlers
cartaya
deavere
incertus
listerine
kensley
bittu
ingénue
thoracotomy
kostecki
rebellin
ohene
derated
siham
bucholz
gribbon
lepel
esteli
widman
clerico
ethylbenzene
embroider
luminus
wfo
gurunath
matveev
cuffley
krasheninnikov
havo
jammal
jiazhen
terzian
camelo
quietude
pocheon
lems
nyali
cnac
spartel
galu
studds
unmissable
essent
zanamivir
witticism
sangoma
serero
rondos
spivakov
volpato
tannis
philosphy
sarafina
rasher
trimarans
hartlebury
wxrt
fullington
whiterock
desford
moumouni
jambe
sezs
arimaa
packards
dragonetti
afterimages
shuzo
ztv
proliferator
nanka
rueful
fayerweather
smeg
itep
eitc
herek
saens
lairg
veerle
nockels
balser
oreilly
ardoz
lovelife
girons
mccreesh
adaa
kalicharan
dehydroepiandrosterone
disgracefully
ripsaw
bellanger
rafto
sarc
gwasanaethau
csia
jagath
rajmata
lepidopterists
uwais
disarmingly
prerelease
subercaseaux
detling
abk
locutions
icmi
hruby
boudiaf
engorgement
franki
ccmp
zhiqiang
dhanda
budeaux
globs
lamble
meshulam
kukkonen
pampering
shantel
zilker
djeparov
nashiri
novoselov
balmaseda
piezoelectricity
nagamura
marange
gibril
saum
bidmead
slatington
lupercalia
soce
rapley
uunet
cerén
bastone
telework
blading
dbrs
halderman
ferreras
malinke
grayston
iph
lydiate
airton
rightwards
curzio
arular
benzine
phanatic
foldout
dewdrop
valdepeñas
townie
ivankov
dices
tutton
surtout
bittle
khazarian
megi
sulston
samaritaine
oreskes
wainscott
brard
fbu
sectarians
tirrell
kucharski
auxiliadora
sametime
neuroradiology
icis
aonach
winslade
muskat
overemphasize
titterton
soad
floodwall
delmon
weinzierl
tondi
namas
kyriakou
cerulli
kimera
croze
kammuri
bnai
mawby
paymer
thanawat
babycenter
hawl
dementias
ellam
eyen
spicebush
breathlessly
bashundhara
snn
ishimaru
amaan
hornist
saltation
uicc
pithiviers
baxandall
hollinshead
iten
medroxyprogesterone
tusken
mawali
kubilius
spicier
jumaa
skeels
polie
bajazet
lamivudine
topman
shikun
feda
coxcomb
finitude
govou
ampon
elaina
etel
quennell
samten
passingly
chlor
keedy
gracechurch
tinges
angsty
opendns
grohe
tarantini
matsura
spooktacular
renker
queally
restituted
vng
chessboxing
vause
marfil
sieved
strasburger
stamler
airheads
maehara
synthonia
instrumentalities
mge
sensurround
acheloos
mubadala
filhos
foreseeability
bickered
copano
microfracture
eeo
haeften
manier
arrant
newschool
gaine
hiplife
karwan
lorentsen
wachee
scorza
pyra
yanping
delwar
kameshwar
mbw
kirya
mutator
sscs
alsdorf
kumuls
partanen
eschscholzia
desenzano
maskey
soundbox
rathdrum
khodadad
incomprehensibility
maryja
denniss
edmc
tempa
emissive
holtzclaw
charco
szadek
antonetti
pean
roelant
sulfites
audiologists
costanera
itinerants
chayes
liubov
boumsong
relenting
spreadable
kcpq
taia
courter
otherwords
króna
rescinds
gayan
outranking
isolationists
wassaic
benthall
bandanas
priego
ragen
aztecas
noton
bienstock
iliopoulos
gatecrash
medicating
gumtree
proft
benziger
taqlid
legging
astri
beachgoers
wistrich
saujana
nease
burdge
spearritt
halasz
gingrey
timbs
lennoxtown
darning
grises
ascom
magnetizing
motorpoint
sqdn
indels
mascotte
wackernagel
unworldly
overripe
fugato
dhf
nixons
inchbald
peartree
teitel
tka
baug
racin
nonwhite
tiptoes
birdsongs
bosnich
intimidatory
soulages
isim
cristino
belshaw
mazal
djibo
panpsychism
danzer
pochin
tads
kyran
etalk
homeschoolers
wilbury
edgren
horologist
worts
sinosauropteryx
bedhead
samand
iniquitous
moralities
raincy
doodling
granovetter
cifs
johnsonburg
franek
vbl
kimmons
mulatu
germanys
relicts
havertown
perdues
thundered
cgil
disintegrator
forgione
mealybugs
dreambox
unbleached
fiskars
khosro
parramore
fule
frankenmuth
bayanihan
moidart
kurultai
zwilling
malecki
hopetown
speziale
insidiously
kapalua
pees
lavandula
dadullah
haberland
mesocyclone
elenor
airfrance
kuhne
astaxanthin
commandante
floristry
brandler
sycophancy
merin
teather
javy
bekmambetov
copi
pelada
kundt
mabhida
hice
stauch
enfranchise
caereinion
wajdi
legrottaglie
disorientated
larentowicz
norlander
rimal
dorcus
ccrc
sivalingam
gors
hollenbach
lubos
rattay
soarin
sigrist
bertarelli
nolting
vonnie
amerson
notifiable
spacewalkers
deigned
gurgle
narbona
feustel
croagh
naville
glitnir
abdullatif
jeshua
groudle
dionysiou
menfolk
farland
sunriver
saensomboonsuk
sutphen
arbol
salido
migas
provencal
rocastle
gorsuch
invermere
syncopations
mullaghmore
filmland
sluggishly
sagala
procrastinating
laurita
choosers
sagd
gosl
zabol
msimang
charif
acerra
mcgeer
harworth
denting
cmea
ardeche
abucay
dogz
galien
senni
strugnell
levinthal
quiambao
infarcts
mirae
gorriti
krash
eame
zingarelli
chernyshev
soroptimist
traves
hnp
zhores
dysmenorrhea
belbo
lolcats
woks
gerini
fintona
deci
ravish
iturup
mondawmin
decors
delbridge
transantiago
slaloms
mathcounts
mabuza
autocorrect
riha
willshire
kombarov
cairnryan
schweigert
innovatively
akanksha
hayflick
sommerfeldt
landhi
spinella
strahm
sterritt
formspring
revol
machiavellianism
pwb
unferth
josceline
clamber
misidentifications
debelius
lobotomized
ringu
nasrollah
stort
teresia
newsdesk
paradises
taiheiyo
gobblers
fusari
provolone
makarem
kuzu
alxa
tapeless
elzbieta
rebodied
inconveniencing
frug
trufant
flextech
autoclaves
krstic
repatriations
multivalent
chirped
joropo
sandis
neumaier
boshell
arsace
lellouche
birkner
cymmer
uliana
bronzo
racon
demagogic
kapros
utilitarians
opheim
quotidiano
feinman
sudek
lewisporte
mothered
circumlocution
sumati
exurbs
noomi
swineshead
chungnam
csokas
houtte
cseries
morry
estima
ereader
reiver
conradt
duppy
bipeds
intarsia
pendeen
bungy
bracamonte
banane
lurton
musina
kippa
sesam
dithered
dionisi
hoiberg
souders
brodo
khazana
abdeen
haja
personnages
trichinosis
lajja
businesspersons
csid
ndis
tamarkin
almaguer
chessell
hattem
cabrol
brillo
netcraft
barz
ossman
growin
afsa
outspokenly
scénic
arafa
meana
schadt
gastrectomy
umhlanga
shatterhand
nowland
adwick
bahamians
nynex
linocut
harebrained
mutesa
blemished
macgruber
deshaies
kalaallit
karsavina
rondeaux
flashcard
thamrin
oneal
ponchos
authorisations
ostlund
hellingly
cosmi
jerryd
tailbone
schamus
shreck
ottomar
masturbatory
calzone
osteomalacia
smeralda
tamminen
kotsay
reuschel
trezise
vidagany
punctuations
hobs
vallet
ephebophilia
reuteri
generalizability
soszynski
eurocode
enkhbayar
decaffeinated
spertus
vogeler
surve
eja
deerhound
ophthalmia
viguerie
uly
sanyu
schneiders
chahe
safranbolu
europride
ipy
temba
blackbrook
harap
erasures
recalibrated
jockeyed
chantler
woodvine
kaufusi
htl
petrification
mamey
sabates
peregrino
jooss
brogues
scherzi
exclusionist
sundog
dogwoods
teem
colorblindness
sanguinary
jayakrishnan
carros
dren
yasuhisa
doute
palaeontologica
reanimating
dixwell
suai
booktrust
momchil
gannaway
rawail
alverstone
scuff
cumbie
dynamiting
obus
yangpu
tularensis
bahamontes
forder
assegai
inhalants
pronator
liklihood
amitava
bici
emmanuele
switchman
commandoes
sundstrom
mochudi
rebuffing
lipases
batswana
anousheh
badboy
sahadevan
blosser
poyle
restormel
haïm
salafists
fanuc
littlepage
cuori
ganjabad
manoff
vertiginous
togni
kapos
wimprine
pehrsson
fourty
goubert
stinkers
sunjata
codis
setra
gaisberg
weatherley
circulators
eskandarian
debaser
ddn
shibi
colca
termon
huffaker
belanova
undercliff
nonus
kesang
samawah
poons
saltwood
lothbury
saluzzi
stojanovski
barnsdale
limekilns
allbäck
dimwit
kartheiser
toques
ccafs
armington
disrobing
ziyu
tzintzuntzan
mediabistro
yitong
konaré
longrich
classé
evariste
vallette
preamplifiers
guttering
mably
baishi
jcd
tellem
germanies
floggings
larysa
neesham
strongside
hhr
approche
krunoslav
pagham
países
springwatch
molitva
rockbox
tought
hippest
eskandar
luzenac
ardabili
freema
dogface
monocultures
pruvot
tavan
condensations
successfull
miyaji
khom
wastrel
elkland
scaleless
salek
somerhill
brunhoff
convertino
mommies
oase
squarer
sweetback
jianzhou
warzycha
sheriden
allchurch
ismaning
popgun
pressmen
clearwire
baisakhi
norac
backpage
mbasogo
iiro
parisii
hartsburg
malach
iwe
jianli
donnacha
satoyama
tofts
alpern
primatologists
waterhead
digable
baalu
tyle
limerence
awen
guzzanti
lochwinnoch
llandinam
dolen
lachenmann
heery
billu
byk
campoamor
ludwin
uncalibrated
highnam
brisas
overath
additonal
allsorts
trevone
rivel
mnu
komissarov
jerseyville
pagliuca
pursed
vanpool
pfft
pinjore
construes
pscs
truehd
spanair
bytham
serta
iziko
usis
popek
nené
demarchelier
ndour
kudoh
gleek
woofers
hauger
petrou
uncatalogued
autostrade
xuecheng
sociopathy
solomonoff
baisho
boboli
robichaux
iaff
xiaowei
wasters
dieters
lightburn
dgk
bourbourg
wenchuan
pinera
hoffmeyer
sanguisorba
alyssum
rasel
hooping
lovren
mayacamas
tedisco
genzano
kipa
leckey
navair
chinks
llanover
virgatum
cardrona
rosensweig
teni
observador
casasola
qomi
peugeots
malmqvist
teagle
intercoolers
avionic
naranjos
fastbreak
serg
halfin
mpn
bafflement
rebrands
vacuity
marting
navasky
toulalan
zesty
celsa
muston
morones
kythera
nespoli
reenactor
unnamable
parbati
aow
skyride
sachkhere
penmaenmawr
tumilty
requesters
soongsil
pipping
geerts
zuckerkandl
tulipifera
backburner
dusenberry
ardua
thiede
marto
immobilizes
congers
jonte
igcc
kaczorowski
bakso
ghurair
imputations
coreen
gobsmacked
glycolic
securitized
qingcheng
chusid
jorja
portet
lepeophtheirus
rambaud
calcifications
ipro
thurible
unschooling
therry
tabac
mudejar
pergolas
sverrir
kneen
quadroon
perrys
harnois
tragedians
sicl
nival
barzagli
mcgirt
ludvik
urticating
alibris
pratim
zhoukoudian
mucositis
degraff
aurally
reimbursing
umran
theale
hfr
léaud
larrain
exonerates
naharin
ipmi
motorail
rebelution
kbytes
disembowelment
musallam
izturis
eulogizing
switcheroo
leibler
constrictors
readouts
ryderstedt
checotah
acetylsalicylic
salita
ferras
mammoliti
brasseries
godward
jarana
pirogue
washwood
guohua
reichle
gaprindashvili
bifacial
mudfish
schochet
ummmm
shadegg
zauner
cejudo
ewhurst
kosan
cravero
hanle
telstraclear
bahagia
milonakis
dinkum
fujiki
chintz
verplank
kroeker
boxted
drais
steinhagen
phalangist
loftis
iwr
oddballs
megaw
stanleys
nuernberg
explination
fliess
arsala
moulson
liling
packin
lionized
seathwaite
spittoon
pietrangelo
harjinder
esophagitis
cultish
assessable
naydenova
osirak
angriest
packagers
panayiotou
nathans
climo
shechtman
purvey
bortolami
kilovolt
drakeford
lanterman
magistris
porthcurno
cosier
clearway
eliteness
forenoon
ctls
gunbattle
claudication
erfan
jamahl
sheelagh
polystar
seang
wadd
tradecraft
cummer
computex
transpiring
extroversion
espie
tuten
forestieri
misinforming
hipc
calixa
taso
pinctada
coonawarra
florican
bourgas
stelly
opsahl
demelza
defelice
nikolaou
prudy
weichsel
forewarning
cayos
loriod
admont
weltklasse
tewaaraton
roski
vojta
xerostomia
gerolsteiner
katon
sobered
fichter
dodgem
grafite
cimt
tecnologico
mixi
tugger
wtrf
grahm
reavey
kaimana
truslow
boeselager
ploiesti
gopalaswamy
armscor
swed
vocalese
anganwadi
bordet
gewirtz
mlambo
aedas
rauff
andrology
ratcatcher
mcgahey
farlington
huiyuan
abejas
furstenfeld
knockoffs
matadero
wratten
gilmor
pentwyn
barrenness
decoteau
bulga
jerrell
tuhoe
narcissi
gallopin
servicemember
rotators
pasaje
chatt
dunkerley
sadollah
gandharan
kust
kamada
zevallos
casero
repertories
multivitamins
quantel
farnes
kotorska
graphix
kinzinger
bonal
brondesbury
latchkey
lucking
byward
invasor
fumigated
wearying
dunkard
eaubonne
cherri
certains
lobular
pengam
fatema
kurihama
aucilla
cdcs
flyboys
reifer
jankulovski
kegworth
proscriptive
unconsidered
forca
esos
transfering
palco
jenas
ffn
hispanidad
bakool
ayun
buttercups
postscripts
strohmeyer
valuers
mumias
rheon
mowrey
avallon
norb
latas
suckley
lykken
cuttyhunk
veryan
fariñas
katabatic
hasni
kiz
octyl
holobyte
muscling
manteno
maike
kema
timeslip
wordsley
nocebo
akuffo
thaindian
fanum
margvelashvili
gleadless
liberalising
bcis
ngg
negash
vinger
acclamations
kakuei
cassuto
unconference
maaike
shangdu
christal
almazan
argueing
magaly
alpbach
zelinski
pó
hartshill
winelands
spatola
caricaturing
humby
shardul
dystonic
clerides
jenolan
lizza
waterski
keran
metr
reverberates
malpais
eventer
peterka
neuropharmacology
arnoult
zhenwu
stj
neen
ifri
yahoos
okonedo
ribalta
odumegwu
rued
blackbook
satur
tambien
bunmi
mccampbell
slahi
pnds
yukimi
ironhead
septuagenarian
lelant
athole
pingdingshan
troccoli
crossbeam
prohm
queralt
davood
dislocate
filipiak
detroiter
grismer
megafaunal
magpul
limu
blagrave
datacenters
bergeret
complexioned
zemke
digiovanni
moulsecoomb
ohrdruf
narratively
mallozzi
misal
otisville
kerviel
chaklala
cantanhede
yameen
moony
afroz
reckord
darent
mipim
belak
lydeard
falana
milda
digitel
zakim
nankang
mcilwain
kbit
cannelton
tricarico
briavels
cruickshanks
mojca
unresearched
premedical
berenyi
pizzichini
tamerton
bads
nige
ebara
jonn
soderstrom
gofal
harston
giambrone
oppresses
eurobank
shrivel
ipic
nurdin
fakhro
nannes
weyers
inya
oxygenate
etemad
englischer
mentorships
madad
gtn
thangkas
obb
unfreeze
eurodisco
bisulfate
dolar
ashigara
seagren
solinger
milliarcseconds
rossner
gallini
pacifici
chesworth
inzunza
pearsons
suffruticosa
stoffer
aulie
weyand
palooza
haiqiang
epu
aoh
eggermont
unthanks
iriondo
refahiye
vinyard
roce
chindia
urasawa
orense
misjudging
shelvey
madjid
blumenstein
dlb
ruelas
wgbs
trinkle
cucchi
xinghai
tupe
jabalia
bloodborne
clammy
frisbees
ivailo
iatp
renie
lehder
teofisto
cargas
tosser
swiftwater
etools
doumanian
battleline
atchley
teana
folkish
muscari
hanken
hamaker
sudol
aviatrix
tristana
dekle
chueca
airlifter
njal
haruyama
schotten
snakeheads
majorino
horribilis
iveson
radarsat
klout
arvanitis
exercisable
kadim
rosenblat
smelts
waymark
tropicália
kiuchi
sunseri
tokitsukaze
billen
underachiever
gorbea
bipartisanship
shaoshan
sheepshanks
oughtred
menthe
sevillano
vinous
soundless
cheesecloth
harassments
verri
rafalski
conad
gips
sican
kene
intractability
kilcoo
qinglin
lehmkuhl
armellini
mokulele
lauricella
mancina
pomade
towpaths
glycyrrhiza
zipes
electrochemically
kirui
rolfes
charlee
tinariwen
cosio
nwsc
tsegaye
autosports
bissix
misprints
savoca
conflits
westoe
rhinemaidens
checkoff
traudl
ktb
hilaly
mariga
claghorn
facist
cahow
izapa
spotts
lotsa
kobs
schweik
gulotta
kippah
presson
jentzsch
wdrb
maniwaki
panerai
glew
sniped
striggio
cowton
panela
schrobenhausen
komiyama
neutralizer
kaczor
vicentina
zimbra
pigtailed
sagor
amulya
vlahov
hadl
pickel
teeling
penkovsky
misting
isetan
sticht
johans
gallaway
kaliya
casazza
xeni
lauterpacht
endoscopes
mhór
lumpkins
greiser
animé
cucurbits
beristain
jamet
stampeded
sermoneta
jerboas
wedo
battiston
merel
sutherlin
nontheless
stansfeld
geremi
verstraete
sxl
spadafora
péan
policial
plumelec
gräfenberg
lafd
meniscal
plasmapheresis
chèvre
lls
cimex
skor
wodi
indiecade
calegari
retrenched
unchristian
kindergartners
bharara
steinbrück
cissp
zilin
chosing
batallion
raec
katin
sahagun
dugi
symi
sendoff
misconfigured
kgtv
jeopardising
relaxations
montanan
doorjambs
maradiaga
senedd
dinmore
copus
narla
shimotsuki
photobiology
habibul
modbus
haxton
escrick
southpeak
stz
resealed
gairy
yusen
sledmere
breukink
hamdallah
dusseldorp
reintroductions
littlecote
kshb
longet
kova
dolemite
wptv
morando
clingman
malpica
wenbin
kogler
muhr
weeb
devia
eeckhout
kaladan
warleigh
firdasari
hanae
unilingual
tsat
kassin
connelley
arauz
folklorico
timson
headhunted
falch
hispanicbusiness
charmouth
supraspinatus
rhymefest
wilen
gasolina
wennerström
ailton
pakubuwono
maxing
concetto
yohanna
stenness
skeaping
bumbu
flumazenil
denardo
rutnam
triamcinolone
tameness
jianjun
dellwood
profanation
nade
lidz
dieffenbach
mahari
mayrhofer
katzrin
mompesson
pashtu
fcx
yunior
betwen
barooah
khokhlova
stanzione
phife
tseten
corbetts
rychlak
wilker
decanted
shiso
deaner
tadese
dargomyzhsky
waipawa
preshow
togas
hemis
unstrung
whiton
mishaal
martinu
salguero
houseflies
bortz
saalbach
assoluta
rahon
slimey
aurizon
fishkin
basepoint
excavatum
superdrug
delucia
nanortalik
daftary
mostro
marise
alexithymia
mufon
alanbrooke
namara
eggesford
lazur
cyclosporine
minidv
kcts
penygraig
tentacular
longis
pichai
tussey
lockland
kitasato
iashvili
tecnicos
zhangjiagang
behbahani
monarda
pettinger
adisa
silvertip
constipated
nevio
cellcom
hatzidakis
roseworthy
buckham
grisanti
verdejo
ferriter
rockfest
queensgate
doubloon
liliya
autofill
bêtes
jva
teeuwen
arivaca
kizhi
responsable
matejka
sorek
marouf
calders
fornari
skywriting
kaari
mailey
bioplastics
guttuso
routiers
ayim
cabotage
bloggy
baragwanath
poxy
panula
marchetto
pasubio
agona
crumpets
holdenville
marvelled
latecomers
foston
kuhner
happenin
kaino
extrapolates
dramatises
influencial
spottiswood
glengarnock
keratinocyte
nder
ccms
hurstwood
pepsodent
tilli
feigen
popeyes
bollin
vranitzky
demes
caruth
kingsfield
suleymanov
elanor
coffi
gastroenterologists
monkfish
kaysone
iela
brinkmanship
plattsmouth
lezgins
yenne
pupating
ssci
foulon
mossa
gerdau
mokbel
malonga
helenium
maisonette
alcee
deepali
huddie
liebeck
abdulwahab
logística
everdeen
chlamydophila
forresters
marku
nebojsa
zichron
amuru
royalism
streete
winfree
kendi
bentgrass
tsentr
kozik
modelica
sabie
mouseketeers
sunderbans
naohiro
cineaste
coiffure
scappoose
theoren
godinez
daimaru
lympstone
mongar
astakhov
moshood
pizzolo
warlordism
flahive
christofi
hiam
hutzler
karyo
giss
uteri
boetti
vodianova
vobis
hillar
refco
prognoses
cipla
zagallo
learnable
anomalocaris
protti
taibu
pinxton
goudey
varos
blackhouse
yie
birkenshaw
aissa
sullom
rúnar
zebina
songyuan
kotwica
zarautz
hogland
photothermal
varmus
categorie
fxr
werc
welliton
guge
kinmont
curently
chave
rumple
sorpresa
siran
breading
pandilla
tripplehorn
spritz
régent
schwind
wynkyn
fresheners
hypacrosaurus
wallenpaupack
cassese
agapanthus
maniwa
rothera
watchfulness
dovers
hermagoras
overworking
weaf
itzin
standiford
ruddell
ivinghoe
padfield
orbea
kimock
canche
lisztomania
zitong
shipload
spectroscopically
woude
superpositions
pami
deslandes
marchini
hatfill
moviegoer
tennys
yunshan
stabroek
photinus
metabolisms
rxs
amatya
mehserle
beibei
mikus
deadbolt
tipps
kaikhosru
dixville
luchterhand
lodo
natc
backa
underachievers
wavves
fambrough
portale
aktiebolaget
holloways
delfs
godown
woomble
subvention
interservice
sjoberg
supermercados
,who
lightsey
helghast
mednick
rishworth
tillicoultry
bukh
gaige
greasewood
langenfeld
mulley
deportee
dfj
atila
harkers
rigobert
worldspace
tafa
wellford
runamuck
bureaucratically
malinois
timan
benjani
heever
posluszny
farnan
agarwood
moskos
abello
squawking
pleopods
siwei
dinda
laithwaite
gravell
yaara
xaverians
gacaca
dameon
alyss
disembarks
shingwauk
conaty
soursop
yacouba
snowville
chirundu
primatech
ntrs
lizardi
cappuccini
orka
mucker
losee
gozzoli
nussbaumer
bilfinger
willsher
bario
cocksure
canedo
reliables
swetnam
dron
jagua
mwalimu
afifi
donatoni
richton
zelen
snarled
kunihiro
eisbach
horseriding
manchanda
industrija
shanon
rabot
milliman
ascutney
mottershead
dynaflow
penmon
corneau
fruehauf
albrechtsen
nalc
crampon
hlas
fabuleux
muslimin
kinfolk
wallström
changji
silin
fertita
tiefer
aijaz
andronico
sukhbaatar
hamidou
recoloured
yoku
michala
carga
rambova
finnart
aftershave
nzb
laitinen
kingstree
ufe
aracena
glasner
masacre
pyrethroid
levenstein
quiros
cuddyer
dilshod
antiheroes
jamon
preverbal
nicktoon
nouel
popkov
rathje
tigerman
uate
exordium
borup
parwana
disquisition
elberta
sauma
friedheim
uscs
arac
nrao
mnh
supraventricular
codelco
bhe
tourneys
silverwing
astrocyte
execration
jaywick
acom
albana
bezzina
bope
papered
apli
hallifax
jnj
kissane
voicemails
noctilucent
hassania
useing
subthreshold
chungbuk
tgt
stroe
dresher
buttafuoco
makinen
ovshinsky
fenoglio
hermelin
tlw
lissan
disrespects
anuja
berriman
shwartz
magnetoencephalography
naevus
rebollo
aull
dirtbags
structuralists
vandervell
neeld
kabui
matassa
tdci
willeke
gerba
cockrill
tuts
metten
noisemaker
cityflyer
hornpipes
perrigo
magnette
procon
diomed
lapoint
broomall
townies
montbard
ineptly
frijoles
clockworks
dpv
cammalleri
karvonen
sardina
magnificens
seidensticker
leesa
kindles
communic
cemaes
arib
investitures
yafi
coriolan
bellecourt
korma
sharpshooting
wildfell
victoriei
cohle
vallas
cherax
humbleness
jbg
azizia
dohnanyi
olimar
leapfrogged
fineberg
kirishitan
khewra
gosain
brayer
cambourne
verlinden
cail
ortego
bsaa
groomsmen
lycans
amaar
felstead
hyperuricemia
thematics
chanu
esterel
staleness
cordovan
undiplomatic
shtern
radosh
chemerinsky
kieslowski
stendahl
minderbinder
termly
entrain
schrott
asexuals
celestron
mazak
valade
millmoor
bencomo
qnb
elfi
menning
teleservices
promptness
petten
figline
beija
assab
osso
premnath
baixas
tabeling
caragh
galanti
flaca
fragata
veldkamp
strangeland
knightdale
caretas
tonegawa
goodliffe
uralsk
lerer
desconocido
moccia
chiloe
becka
terral
commodified
brassington
macher
ornans
melleray
tvf
schnauss
selvig
kibbey
turesson
vorhaus
lindsell
yoshika
inawashiro
nitsche
schaechter
derring
rodarte
bursledon
walkathon
zele
chasey
annies
towe
honkala
overbay
fasch
kazoos
coluccio
bandsman
csikszentmihalyi
esy
puech
ashly
guianese
steerer
sarid
kijang
greenpower
episodically
russett
nontheistic
yis
busson
balletic
craniotomy
shortnose
pheng
perel
mabelle
skibo
morgridge
nijenhuis
yelps
sleepwear
moodysson
pedot
butterly
triband
vinos
sonangol
phonak
escandalo
jeffes
lysol
baptizes
atara
shrewish
bhide
gohl
lunchtimes
torroella
borovets
upswept
brasted
slewing
venaria
zoonosis
drowne
leveille
gazin
turpis
essilor
indexers
mingei
parrado
lobs
utis
fleshes
carabiners
joda
murkier
cerrejón
edurne
kosaraju
neubau
beha
conclusory
badilla
hegedus
ramadhin
monkhood
greenlands
mcintee
bosoms
fheis
mcgaugh
birsay
glassner
filarial
sics
jaso
polemicists
jugnu
intimidator
proselytized
porntip
alleycat
cheesemaker
tuscia
kiewa
dachas
pettie
garbanzo
neurofibrillary
bics
biosimilar
kaysar
staviski
hillsville
holburne
landell
corbell
yogurts
munera
stanlow
manandhar
auchterarder
nadon
shitload
islamorada
aciclovir
sabapathy
haytor
unasked
malkani
blan
ejf
kunzang
klasa
clemo
mustafaa
revit
gingerly
hollybush
gamlingay
motorplex
pastori
wallum
walrath
wamego
stranorlar
constanten
itani
karez
gipper
nashwan
khata
limmer
triticale
kentland
spumante
ziskin
tryggvi
asts
stemple
bourbeau
pjd
gentlewomen
selja
etoposide
shoda
urman
qadhi
multis
sabbir
grody
finedon
burgstaller
niether
mangino
pinetum
tsusho
stettner
exertional
cipolletti
wcac
spaso
eiti
ashmun
wisch
benzedrine
chandrasekar
misclassification
myrow
ofgem
brazile
tappen
osseointegration
loades
edifício
tranquillitatis
zeile
leshem
recluses
llorca
modesitt
tayyib
bnz
hdtvs
heylen
beebo
kersting
sobolewski
ketner
raio
kostitsyn
tinelli
binayak
mondesir
metaxa
doughton
ightham
sacu
antibacterials
drummonds
caylloma
brûlée
balcerowicz
occupancies
kamioka
huessy
kumail
manker
minchev
trembler
fiola
sugarcrm
sherf
apposed
kso
pathirana
pinchback
cibeles
ilcs
levothyroxine
aced
liebel
rodelinda
vancamp
sturdily
mainstreamed
jamhuri
bangert
jetpacks
bruker
biogs
moonshot
electricité
stupefied
infallibly
pépé
timbuctoo
frary
electrique
bacigalupo
jitu
chiddingstone
sarcosine
mcbrearty
nutr
esquinas
intoning
slyde
squyres
weathersby
palladianism
schelp
twentysomething
preki
refractions
mearly
trauth
slusser
coquettish
dewhirst
tague
tocumen
ohmi
hayatullah
domiciles
rooi
comgall
yarber
newnownext
jinyu
ghajar
stratego
eastvale
balawi
balko
dinged
mooreland
larken
doihara
weixing
meiring
sifakas
ravelli
voltz
mawkish
tsvi
egolf
gastarbeiter
aouzou
pogany
harmse
vagos
retronym
bacchanalia
keulen
misinform
uproarious
felty
hasham
genito
orphean
gerlinde
iroko
rabiu
persecutory
gadon
ajan
overholser
trên
beachill
overhyped
cicc
najmuddin
tashman
liljana
efecto
thaer
bravehearts
tmax
calumnies
shmita
unscrew
coalminer
pegylated
stampeding
zanini
mercey
crumpacker
macerated
zarai
assignees
griffing
bookforum
koth
mmbtu
plunderer
arents
anticommunism
huiwen
mcclintic
screeches
habtoor
vgm
elsaesser
vergès
carload
kilocalories
jhong
bery
scaa
ruppel
difrancesco
afellay
achieng
jephcott
mytouch
humpbacks
hegner
wallaces
majdi
guangshen
genei
crago
qingshan
khnl
clx
schygulla
pulheim
butetown
wlf
cims
costinha
intersted
tagliapietra
bollettieri
lexden
juggy
nandlal
rull
caaf
shoppingtown
zaini
bronchopulmonary
saltalamacchia
gbit
farraday
abil
zambelli
dihydrocodeine
vilasini
thumbtack
miescher
jalpa
courttv
binyan
pangasius
schlereth
bijaya
freiberga
valer
stenciling
roundhill
slaughterer
limone
klimova
mayorov
tommo
façon
cineteca
hanzel
cherono
tweel
memorialization
immobilien
arlindo
evanna
jessika
nkhata
razif
faura
spermatheca
manini
mocksville
chemosensory
polland
jantz
coiro
kopechne
cleansers
maenad
precht
nominet
jenckes
chickie
shonn
horserace
riska
facc
unfurnished
serbelloni
declercq
koury
groothuis
duberman
cofton
penjing
javeriana
philiphaugh
chiado
robocalls
actualities
ilian
hemostatic
squeers
loewner
fabricant
monin
kunzel
clovio
cualquier
bilin
batching
huayang
hulsey
wender
makamba
medicolegal
discomforts
matelot
supercheap
arnone
unamid
webshop
jasem
danau
xtec
acetonide
boaventura
boche
daykundi
icaew
emry
butti
ejaculates
subsite
goldstream
ulmo
derogative
zwan
czapski
waterflow
serebryakov
resolven
hepner
eiteljorg
bronzini
cunanan
slaidburn
macie
villeret
clwb
verissimo
fibich
gustine
frigging
thanon
illogan
eumenides
toeic
completley
eshoo
oleanna
goudelock
clanfield
dagr
atambayev
mangochi
yseult
yadira
mejillones
haward
dippin
cuddihy
harrows
barnabus
campagnaro
fluffing
schuhmacher
haveeru
godlewski
damani
braddy
constantines
overfly
sunpower
cetin
pinnell
bensinger
readin
embroiderers
chillington
loda
blowholes
cybi
erazo
enshi
serm
gingin
javelina
marieta
rancheras
envisat
unworked
qeh
ahwatukee
akker
umnak
buckskins
gabelli
mosto
atotonilco
naufahu
capitalisations
corell
unwerth
atong
mmogs
electrochromic
landel
youngson
multiplatinum
pader
oxshott
hacen
equestre
aitmatov
supersports
hanish
codsall
pontis
gholami
interchangably
scholder
pinin
motamed
rufai
totus
stape
horizontals
ringland
wse
scheier
tassin
lamkin
palffy
zavaroni
vetra
menya
hune
trailside
lissack
boswellia
trauner
hyd
deyn
woodes
meshal
thalheimer
grimacing
stagliano
skywards
threave
defreeze
lundahl
sattam
kachinas
pooped
wetheral
janetta
dicussed
deline
kerkar
vigia
frohnmayer
collister
beltrão
malov
eisenhart
rold
puriri
doozy
chiseldon
umdnj
shaznay
quogue
pells
spiv
stovetop
wrington
abersoch
headpieces
dreadlock
mycenaeans
setchell
marick
typescripts
driveshafts
ssme
tunel
babydoll
cottrill
parenthetic
muroto
narina
lhv
brithdir
saxitoxin
douthat
reznikoff
silangan
trouve
wendall
dungarees
debunkers
bifa
yil
outcries
euxton
koslowski
tramples
rathfriland
quadricentennial
avallone
arconada
ppq
ncoa
ingvild
urth
saffire
paseka
foshay
shrimping
auh
nightclubbing
strete
ngunnawal
turfed
josselyn
björgólfur
netflow
kinvara
maracatu
mcgroarty
dilana
mckayla
apperance
lanter
perseveres
itaituba
grabeel
bispebjerg
shripad
lactococcus
lemelin
amplexicaulis
mcnew
pantech
chisos
parries
beldon
rockslides
yur
fika
somnambulist
tianna
croma
masahisa
klestil
strank
madiba
dongbei
daylong
pcaob
fgp
keron
bortnick
kotb
lithotripsy
sakineh
comoran
ragù
iex
macina
paradip
andalas
girata
bizot
wigwams
boaden
dirani
escitalopram
ctbto
luebke
afdc
sundman
reisfeld
kfan
beefs
fakirs
yarkovsky
fasd
christofferson
reedsburg
punctilious
pinholes
jaggar
deandra
speedweeks
amerasia
interbrand
canvasser
kutama
rauhut
morlacchi
jbel
towell
belmarsh
pannonica
runk
snsd
investcorp
degette
streetwalkers
chooks
algodones
swang
ibos
mediatheque
knapped
ingomar
quetico
srbi
winked
unagi
isabey
mravinsky
noritake
ghaly
avera
diann
✤
argentum
ebrard
esslinger
mhairi
jaylen
michl
montmorillonite
edps
lovechild
objectif
umbricht
piskunov
piros
alltogether
eavis
dormont
petley
geostrategy
fxs
maalik
johnstons
lanterna
hoos
kochen
greenbelts
nassib
acquainting
sgf
hegyi
agranulocytosis
waldrep
ruddick
sekka
mcneeley
guye
hedgpeth
ogyen
dyersville
ploughmen
balgownie
lejon
stojanov
trefoloni
bloodsucker
takehisa
crémant
doland
saltsman
habbaniyah
rouvier
hypocrisies
obree
andsnes
datil
mazlan
hagon
mastronardi
perper
singleness
uncolored
topas
koul
hardiest
addingham
restinga
outriders
diagnosable
baucom
standdown
demont
ambushers
figurations
willendorf
reinold
coun
tonsillar
allsburg
exotique
timoci
wowereit
samaa
sinagra
fouche
creditably
belgo
allenhurst
hongren
mollard
sarmatia
pinewoods
hnat
superleggera
ramsauer
kammen
abaroa
porcini
katib
mieses
bistros
vetrivel
shackling
bremond
physicals
pourcel
karunakara
uncaught
uams
landgraaf
kace
saine
vemork
cpre
britanny
kisling
deji
bellmont
grindon
pratte
loevy
mingyuan
wonkette
papanicolaou
borras
shihabi
dolours
underachievement
maulavi
koor
dérive
jayasekera
besnik
verio
eisenstaedt
goodis
deuba
capus
rotis
pickover
sedar
satt
dartnell
penair
cavenagh
harshbarger
solida
knaak
institue
majestie
chiaro
luisita
rewe
hagstrom
cardium
zittrain
eyecon
pemulwuy
cowperthwaite
pletnev
cheron
tirunal
hypertrophied
patency
jerad
galesville
stlouis
pirotta
declaimed
gerst
liffe
cinematical
lwc
jahvid
toolson
tibaldi
masuku
castlerock
rethinks
muratov
boisei
sugarcreek
bounder
resul
cothi
beasant
starlit
osbourn
dya
sturdiness
skechers
ozona
merican
myn
encephalopathies
lews
brocton
penzias
sagano
tenex
fmla
cusworth
sunwing
pippy
bordelon
landsgemeinde
elmfield
haci
geant
charrier
embrapa
gigapixel
overqualified
sorín
karpo
theretofore
alterio
lengel
audy
castillejos
ardgour
frenemy
miladin
deyang
langberg
konko
undercarriages
jeschke
freekick
quintiles
birdwatch
petabyte
miniaturised
birker
bonfiglio
knaphill
grzybowski
reepham
chengyu
tremadog
horni
rcv
jasser
kjellgren
arrowe
colourised
jagt
dunloy
zonk
incompletions
efas
piquer
loso
earthship
shochet
deescalate
rolodex
priapism
perkinson
tankage
rivkah
suthers
commision
whihc
waclaw
guidoni
hazon
lucanian
weightier
kaslik
kassi
sleepin
huilong
emasculation
trank
rxr
skiatook
optronic
sicher
fishscale
cabanes
mytalk
kroh
burgeo
cuper
doughy
cordials
tsukioka
campur
caplet
schnellmann
broadheath
galanthus
crcl
ccts
uptodate
meeran
tadic
kbw
asiatics
leïla
otelo
apéritif
godrevy
inchcolm
savigar
fackler
solsbury
woodchip
quints
metwally
priolo
panarin
niggli
breezewood
abjured
dcsf
potanin
hartzler
enchanters
jcpc
speros
gatot
woroniecki
trago
castex
khaibar
eavan
bolwell
melones
faes
friesner
pennisi
cystatin
cfsp
difiore
agami
micko
borowiec
hodgenville
nariko
cascadian
knowhere
ferebee
langhoff
yagudin
bith
akinwande
kapow
temirkanov
crackpottery
illing
nzoc
fitzharris
hydrilla
tanura
tasia
uvo
verson
kaleta
kalapana
valcke
reappraised
dillin
chinamen
tanni
grunsky
chmp
akwasi
lisner
loughrey
qdos
moldau
nushagak
deducts
sylviane
dwaine
veiller
cabriolets
arle
telindus
archipelagic
foxdale
luders
babesiosis
electrik
mammillary
iseries
patato
lakhpat
stutterer
gmrs
cropley
mahakam
emilion
campamento
kasane
portraitists
ramathibodi
sugarbaker
wirtschaftswunder
acorah
rankins
bellissimo
polypterus
raida
anicia
voorschoten
aberle
pettway
instanced
júcar
ghedi
bleckner
morazan
preisner
shapey
dipu
ognissanti
anastacio
gunj
dhupia
hurva
lewinski
spalletti
kyrre
blee
hueber
floccari
farningham
velayudhan
cytopathology
kotova
aorangi
orthoses
minz
agriprocessors
mecu
pyrethroids
twix
rfn
azali
vog
corruptor
olana
haughtiness
mayama
convivium
mroe
kahekili
barretta
roborough
saltdean
bonderman
tarbela
murney
springerville
shorr
nosh
commish
monotheists
unimplemented
momaday
kickass
workhorses
nebulizer
homozygosity
regaled
nols
ciak
tianyu
stavropoulos
napolean
beere
lofland
jatinder
mezzanines
jazzfest
servery
marei
longboats
hoppity
ngaire
arensberg
stensland
gargling
justyn
jalopnik
bongs
glazkov
bleary
flender
kahveci
gayheart
dutugemunu
logrolling
corvid
chaohu
boadilla
gobin
kolon
nightrider
circumlunar
compañeros
annise
vlore
charterer
zuck
yemenia
schwenninger
newitt
navali
luzuriaga
kaboul
cortini
emsa
sunnism
blumenauer
leuba
slagging
koce
shena
tanny
mikie
dongying
caymans
polyhymnia
presburger
dober
cahal
mutola
candidatures
maxtor
konigsburg
macbooks
gossman
janan
agere
shatkin
recusals
vvip
wipha
roenicke
anghiari
ajemian
rutigliano
terlizzi
wolfers
evictee
saibal
washingtons
nici
syringomyelia
bezold
dribbles
eapc
dismaying
yagya
fontanella
bolds
niam
foxhill
flits
fores
mimieux
divorcees
azfar
hwacheon
cazes
eventualities
szewczyk
nortons
treherne
ogbu
beanfield
ference
slotnick
stiffkey
orting
unglued
groop
bocci
anthing
defensin
chavs
laisser
clercs
christianne
tidiane
alaniz
tnl
fawwaz
kedia
auswärtiges
cipes
emiratis
spivack
teabag
hadal
quashie
nurserymen
protegee
rajpath
barcino
whithouse
richardsons
answerers
ipec
merteuil
splenda
avians
watcha
tucheng
musashimaru
arvs
astounds
mckeough
boppin
troller
buckhalter
dahlhaus
lusha
centerstage
wirehaired
lovington
hayder
sauchiehall
familiarised
shneiderman
rusbridger
antolín
jassin
shopfront
bronchoscopy
shuafat
groundings
conen
baetic
weder
hydrologically
inflecting
cantini
hypochlorous
conventionalized
hng
liposome
hadjidakis
coastguards
chinatrust
zaydis
smurthwaite
somwhere
usaha
anticholinergics
lincroft
neuroleptics
thujone
baquero
topel
flybridge
qaanaaq
camplin
graphische
boothman
pawa
supercalifragilisticexpialidocious
anxin
belloy
thumbtacks
centime
colourblind
matrika
bloodsucking
schindel
gateau
pickaxes
executory
chimanimani
foose
kidane
anisha
irsa
worthier
shoichiro
bavel
repka
niugini
scj
waken
flashers
beavertail
stev
rosel
ollis
thereunto
muhannad
campeon
pakhomov
damphousse
tyer
mosshart
modernities
unfitting
ventosa
externship
crimestoppers
nondiscriminatory
mignini
nastran
machir
cortright
gyrating
liou
notar
neurobehavioral
roki
asklepios
phenanthrene
mittie
chayote
carbonization
últimos
rakoczy
sumps
songkla
sehat
segregates
aprica
carraig
dejah
harborview
istook
throwaways
vgt
untimed
tatty
viliami
rodding
radde
minkin
buq
gaillardia
serveral
almereyda
blazkowicz
cbos
charalambides
koppes
gaetti
wmtw
bonfanti
haibin
salame
coccia
uscb
prosauropods
mammatus
usip
walberswick
vota
uncinate
halia
sortir
luckier
lederle
calafat
munny
orgon
fangirl
firebreak
ashim
fulbourn
maritim
evoque
hunsicker
pippins
lefthanded
etymologists
jarai
landowski
medin
buerk
medmenham
neofelis
habibollah
bco
ultramodern
weatherston
unequipped
zettler
heceta
partee
angiogram
fastrak
hindmarch
incontrovertibly
mdis
dimino
disassociating
surena
barbershops
robotham
endlich
untutored
gennifer
hannoversche
lanre
suppositories
kauppi
suplee
stackers
rapiers
whorwood
harrad
nasza
broaches
outraging
fauria
kiira
voyce
dunnell
farang
delftware
audiologist
incinerates
herriott
fattal
congresspersons
businessinsider
gamemaker
averett
rendezvousing
naspers
avere
marzan
spfa
gabrielino
endz
angulation
mariz
tomey
kcsm
abandonments
mccunn
merchandised
maxcy
daudi
centex
quiles
rrb
downend
buth
lucasta
walger
hengistbury
rizla
lefel
vladeck
konfrontasi
bromby
bernhoft
luyt
snogging
borkman
bloodier
meloche
flextronics
galstyan
confort
norbit
hudsucker
stratis
clach
bertine
ruhlman
fahmida
baudet
bethge
followups
christopoulos
parera
counterrevolutionaries
crowbars
ncate
kiprono
spelunking
rieser
jerricho
cabera
skewes
jianxin
percolates
barile
castlegate
trovoada
nimbly
schiraldi
skypark
mittleman
tetela
ionov
wolstanton
dziwisz
parsnips
psychostimulant
udl
herlitz
neutrogena
indicts
agian
transunion
namir
felbrigg
donalds
clevland
pétrus
valveless
laugher
aperitif
evere
kylee
matteis
mcandrews
pigweed
mervyns
brügger
garel
kamprad
casamento
jerusalemite
stamets
romaric
wetware
raczynski
unshackled
ranitidine
pattnaik
yardy
knoweth
brimelow
vimont
preckwinkle
coomans
thaden
cancio
obrien
kariwa
contextualise
beadles
altomare
nevitt
naung
musicmatch
gisburn
brocades
kastler
szczebrzeszyn
gendler
overholt
alexandrite
cpsa
riffel
kinzel
pinet
maxus
petero
jallikattu
capirossi
nagpra
analysers
flixster
lulling
karens
duhaime
alexian
bigness
pankratz
reprocess
landore
surtax
slimbridge
loubser
lcg
montesi
wahlen
tessio
televen
oberweis
bakhit
vogons
hinky
inea
quintos
torto
nypa
stothert
biocide
sanghera
zumbo
intramuscularly
niederhoffer
nonstops
dropoff
pedlars
tendu
mones
imformation
inot
cpuc
mchedlidze
rosalita
desalegn
ramb
giampietro
inma
axxess
bergler
falluja
scozzafava
petrifying
backstabber
kaiya
sory
newminster
warbles
visagie
skycar
coalesces
neuhoff
admi
harput
transurban
gronow
urzúa
halicki
sirakov
agramonte
alighiero
margarite
zerhouni
fidan
butaritari
sakonnet
rustington
jcaho
bettoni
reconstructionists
toretto
embsay
myl
osnes
afriyie
rnz
dirrell
schlamme
kalemie
mekonnen
humidifiers
sanlu
shoshan
gasson
guzan
strevens
labriola
hemswell
herky
hiob
unlettered
rdl
slep
ennobling
gravida
zacky
purdey
gavialis
reabsorb
blackpoll
henrikson
parboiled
mooncake
psychobabble
zhengding
wsaz
scows
loweswater
crunches
everardo
hasu
hdm
bonamy
deblanc
remonstrances
epidermidis
häusler
heskett
mceveety
amatuer
kappes
knoc
kilcoyne
qadis
mikiko
phalanxes
bioblitz
dael
hinske
homeruns
signoff
klinsky
gitanos
stolte
ebright
winnett
pitsligo
plomin
carwash
agolli
quesadillas
remolded
ringstead
griffes
committeemen
huella
lco
krenwinkel
irland
poplarville
sherron
breif
ambrosetti
eastry
fininvest
rampe
sapote
tanzanite
percents
seijas
elenore
creux
govender
licciardello
edmon
demurrage
saraiya
energizes
canem
mesko
amlak
anbari
stelter
csdp
henricks
chassidim
camb
trawick
highroad
afaf
slackness
keba
yangjiang
dawsons
punchier
redoctane
naïf
dresner
glebova
armlets
tratt
kff
ibolya
cockneys
bolina
carboplatin
mathie
fetcham
florido
heacock
emanuelson
myxomatosis
postmodernists
gormaz
hattab
sawers
glandorf
tyminski
norihiro
sciri
krik
milicevic
naranjeros
artw
scates
basting
gigandet
liyuan
freedomland
schier
collingbourne
skeggs
jahanzeb
sheldrick
organogenesis
shashanka
orgaz
colourists
peppiatt
promyelocytic
ashden
stotesbury
tonners
heraeus
reovirus
berriew
zecchino
roxio
cywinski
bluestocking
aramon
madyan
veneziana
aldate
tfts
lagares
pfcs
ranevskaya
winster
purgation
bunkum
brentnall
thilan
snorted
disabuse
neuroeconomics
marcq
peaslee
cynopterus
impeachable
kilojoules
rahaman
yarrell
virpi
muggings
chega
hossaini
epistemologies
weisheit
wallbridge
giacalone
wetherspoon
mccooey
rebennack
finito
ypc
belarmino
noja
unmanagable
zpm
sabac
pbd
mfe
headbutted
reimpose
freilich
drays
eatwell
officialy
emrich
utkin
ecpat
dilhara
koonce
graybill
dontrelle
whitbourne
lozoya
wroblewski
zaf
crocheting
smbc
borsch
cods
lockard
ophthalmoscope
brioude
skywatch
ivindo
treas
tirole
clausa
lamizana
goathland
hemdale
stoneking
photinia
chingachgook
winawer
eskew
mcgillion
bcv
cartographical
rebirthing
marsoc
dressup
abdc
harangues
bellefeuille
insupportable
bulgaricus
afterbirth
mortgagor
smilie
outpointing
ghotbi
penseroso
panchenko
salsbury
wgba
kwek
equilibrate
obuya
medzilaborce
wildscreen
hattaway
smeeton
mcwhinney
putah
porridges
attash
extortionists
lemn
scandalizing
osako
tsotsi
heorhiy
cockfights
futtaim
folklórico
ostuni
infarctions
hfg
navetta
deathlike
ondoy
gwyther
konchesky
helleborus
hakamada
causer
rdio
floury
hejian
cringing
dracena
rastus
responsiblity
moul
khloe
zelimkhan
toufic
fontanes
ingratiated
redshirts
shanwei
tompa
moretonhampstead
photograms
neofascist
carnauba
palis
parainfluenza
heinke
amboseli
fakin
ffordd
solyndra
brickner
houn
phut
hemorrhoid
hillington
nater
sarwat
irkut
cndd
atca
capgras
danakil
avpr
orbelian
glentworth
lennertz
leiris
bumppo
crsp
makela
sillyness
peremptorily
latchmere
fascinations
fecteau
juicing
meddled
cax
ritva
unscramble
cycler
nijo
gulian
extendible
imas
teijin
megaphones
tokenism
grouted
suppressions
seigniorage
weronika
hampsten
samandar
lishman
rissington
oakum
gynradd
garlett
chulabhorn
pulex
umunna
ruis
bandwith
tarmizi
soskin
aigen
doyles
seamed
achin
pvh
selles
baptising
tramaine
arnside
irsn
gnomeo
embarrasing
lamberth
mucopolysaccharidosis
antispam
molehills
trewin
colly
nhrc
rotundas
grabban
flannels
tegin
ammendment
honeycreepers
skymark
trys
citti
naomie
moso
waye
misreads
roary
duffle
walentynowicz
saadiyat
cumani
figl
tursi
infernalis
tallant
hys
klsx
apethorpe
vecindad
sensationalize
meden
pesenti
urdang
scheide
partem
shengnan
landhaus
leucadia
nicotera
norvig
alimentos
rids
trifon
encasement
gianandrea
hoehn
swoope
feferman
viticulturist
demotes
toun
icarian
swop
xiangfan
sirona
alfasud
belyea
decurtis
ncta
boyland
quaaludes
kettlebell
cearbhall
brustad
mousing
lovastatin
terremoto
korky
ofakim
baseload
meteoritic
vetsera
madonnina
secedes
breadcrumb
stcw
bauzá
osam
ramaphosa
bachpan
barrak
shahrastani
bricket
manafort
sketcher
zhizhong
lengthiest
solium
midcentury
brightstar
abdelghani
brignac
bandirma
jadwin
prestonwood
oxtail
atheros
fushi
kiy
hunkin
peewit
pressroom
calixte
remnick
mcevedy
schrag
eurobonds
bhm
traje
thsrc
damico
sames
huszti
randol
aberdaron
bugden
astrogeology
resuscitating
candyland
julija
streptocarpus
cortada
lound
dearmer
actividades
avold
velociraptors
serono
propagandizing
etuhu
amihan
gambits
declassify
kanamycin
piltz
schmoe
niek
beir
eklutna
seishiro
schwalb
benjelloun
verpakovskis
mcavennie
measurer
ursine
bies
rubirosa
obeisances
popsicles
filch
gavdos
blashfield
pulsifer
steidle
etap
nazeri
sagmeister
eslava
deryn
aey
entropia
nesty
ciotti
thanou
highlandtown
daylami
shoenfeld
bradl
gdot
legree
wcnc
kytv
pesh
carparks
lyes
windthorst
becase
redemptions
khandi
seimone
jorvik
garlits
balkany
duvauchelle
andrassy
jenelle
alson
kozakiewicz
danum
balkanization
derelicts
wallgate
widefield
peiser
taling
coard
nawang
lazear
roadshows
irizar
ohuruogu
snitching
chokri
tecan
shamoon
powerpack
combustibles
sofoklis
baltusrol
buggie
icds
footsoldiers
binnacle
attenders
gysi
stagione
repas
amerikkka
slaby
yongli
cordish
okfuskee
ineffectually
shaher
beardy
pangnirtung
hathorne
mouches
cordwainers
myakka
gaudencio
kostal
bullys
atias
riendeau
dalmellington
sapiro
dius
engebretsen
welp
bernbach
léotard
keever
boylen
wieslaw
cornmarket
mazloum
sirajul
murchie
binner
samedan
tuckers
polisi
tunji
souvlaki
anastassia
marywood
ozette
vevay
letraset
vishakha
bargmann
ataxic
granturismo
mahmudullah
cammarata
bhutta
decesare
barberie
hypercapnia
hancocks
chinos
sandiacre
exline
thayet
sapience
defames
aatish
ºc
limeliters
crosscurrents
darkish
croppers
terraza
mottet
donelon
rudloe
taren
carns
hallucinated
vange
inversiones
etau
stobo
verster
takatoshi
connoisseurship
jamshedji
inpe
paraskevopoulos
waldenström
koyaanisqatsi
transportations
sudley
spiderwebs
esam
pictorialist
nichd
punia
penco
grinter
daham
oerter
madol
unprimed
verhaeren
amoebas
nispel
musayev
wkar
marnham
artc
yuyao
serowe
hilborn
garfinkle
oxymorphone
kxly
wolfberry
haverfield
farkash
demunn
burgle
countenanced
barnstead
marianists
kleinrock
eliassen
hembree
hypnotically
intervenor
mockus
moeran
komon
zhongbo
headwinds
pauken
poblenou
souderton
dreamless
macheda
blockaders
lamos
ubar
rampersad
mladic
rajender
kauhajoki
activewear
inskeep
tagoe
eliran
jayadev
clywedog
kazinsky
behrmann
eyl
signposting
csea
kretek
mpge
pollokshaws
goertz
minihan
nasrat
nives
hilco
conry
racc
voght
pène
jenrette
carilion
tumulty
interrupters
fnf
wiegman
bogdanova
wesseling
dewis
vibra
idrissi
bullrich
abouts
feehily
trustful
resend
splish
wamc
resounded
reppert
ganter
gorder
ukrop
henno
coachworks
timetabling
priskin
divertimenti
pabellon
chiche
miny
chardonnet
modou
typhi
teems
vajira
llandow
cfh
excepts
folktronica
ninjitsu
clatskanie
henni
bragh
sudar
wilfork
udoji
verapamil
effortful
pharming
loyer
chabala
hurtt
cirino
moglen
mackley
virtualisation
itaipava
oml
croissy
chatou
lubrano
kindergarteners
sanchis
mediana
piw
rosta
hindmost
broers
outmanoeuvred
lorian
davyhulme
prescriber
pangelinan
bushbury
belmokhtar
edwidge
pulsate
altheim
immunologically
polymetallic
yelping
alonge
rosneath
kiszko
engman
campora
broude
biersack
cerra
channell
fsis
haramain
apaseo
kosman
scelidosaurus
lyss
bctv
santorelli
satre
fifthly
vicereine
orabi
cayla
seosan
cbre
costo
shahidul
ancell
raiment
nrv
eulogistic
shivananda
senko
adwan
lanchbery
chepe
garre
turtleback
thanassis
worster
skinwalkers
latika
witmeyer
adeney
bahan
connan
astorino
tzemach
attractants
dastagir
rifampin
herlong
tenebrous
escapers
limor
copybook
zav
laxenburg
shopgirl
wissem
obame
múgica
informaiton
sopi
buring
masetti
corentyne
aquarist
obasi
matawin
meggitt
keathley
gammond
talalay
loebs
keevil
dipturus
neyra
mcgiver
tahquitz
bartoletti
larn
geita
syniverse
degollado
jarom
bobrow
umaid
reincorporation
biters
ultramarathons
atex
brillstein
occc
seeberger
vajna
elisheva
malc
omers
sherpao
yahav
oooo
kimberlee
idis
pycroft
acara
rossant
savelyev
csxt
tedrow
suguri
commiserate
rothenberger
deby
gisbourne
mcflynn
norbulingka
ulyett
ximenez
diatomite
unindicted
kareen
daelim
pgw
espci
luxair
scoutmasters
avidity
corsage
lindor
carburetion
cashell
polyamides
karrar
ebinger
renovator
schijndel
sickens
willimon
masculin
misfiring
blatnik
augusts
confidantes
oberste
popemobile
xiaofei
starlink
camile
wase
eisinger
queering
reedsville
sheldonian
monochord
levered
rushi
libelling
ecosphere
multiview
tomate
transall
tifft
helpe
hotpot
spruced
wafs
reinvestigation
duddell
interdigital
harad
mandic
kulagin
stanback
yangshan
thoughout
scandalously
holograph
gagarina
mcfaddin
rawiri
volmar
rías
doub
nanometre
cias
hargett
stinkin
crumpet
boiro
fauziah
minucci
corrodes
hollingbourne
sicheng
arborists
sultanahmet
maliks
tobis
demanda
chastelain
nusret
tridacna
bucklebury
orteig
moscovitch
avea
lineout
hilles
yuzhou
domar
xifeng
livengood
rouass
longeron
sportsworld
arroyito
corozzo
lorak
paillard
satala
hosler
scrutinizes
photodetectors
stairsteps
herget
boehlert
militello
lollies
okies
infotainers
bellusci
meridor
eloff
sarsat
duale
sirhowy
steacy
moorfield
metrotech
loker
annamma
bédié
hinnom
beric
lalich
devaluations
theatergoers
browsable
dasch
endotoxins
ohb
woss
kugluktuk
galabru
zarr
bioacoustics
sauropodomorphs
saltney
culbreath
mauck
pendency
dejagah
reguera
spampinato
drumroll
sucumbíos
solipsistic
lindpere
chorda
explorable
bunjaku
closeout
stokesay
allegre
callebaut
haniel
unthinkingly
lahun
risberg
actaully
haenel
vassiliou
leukopenia
insulza
fleuri
transgressor
remon
moorehouse
minotti
couhig
pedicure
ridgers
malbranque
coronilla
netherhall
zakes
strache
votel
mediterranee
nahmias
isport
bilger
hayseeds
maindy
richenthal
virgine
librescu
ebbitt
gramlich
boonie
lcts
bonnor
decaydance
grecco
coffeeshops
spinello
medco
gasb
plasson
wisemen
mahals
edek
chapell
ohiopyle
akara
jorrit
ivone
loel
mitsunobu
starsmith
metafile
mantu
sturzo
gruchy
oord
revolutionising
changfeng
beheira
directgov
schyler
isight
doyley
aberg
ranters
lobectomy
workups
phenomenom
mccs
sambava
lehendakari
khanal
szeemann
stickland
mocatta
lutoslawski
aafia
tarong
homewares
lantry
radzinsky
cefalo
brokenness
salini
madidi
troxel
hueytown
bentiu
kahui
cantalupo
deciphers
gobbledegook
jimmer
noisiest
ndes
marcora
geers
parthiv
mustique
siberut
rushmere
orahovac
tuat
aez
conkers
fricka
rambled
xacml
risus
tapert
sheers
hetauda
chirwa
kopylov
szohr
futagawa
commments
rtz
apocalypses
shamo
angeliki
topnews
dessay
chalford
oesch
thermogenesis
microevolution
nobuteru
montrealer
bridgeway
epicenters
refound
colonially
feal
chère
waitlist
samosas
yassi
sampoerna
victa
pross
eija
nanhua
kalihi
explicating
anamaria
haymaking
guoyu
concertantes
argentière
fleta
maxell
tibetology
bostons
hfb
farry
dremel
rmjm
portago
sainath
casona
philharmonics
foxcatcher
liden
misplacement
mefloquine
tyndrum
darbishire
ktbc
unalloyed
bpn
jabbour
koehne
minocqua
efmd
tazi
velino
collabo
teschner
concious
juhn
scorekeeper
bulkington
tarsalis
nasality
appendixes
samsing
krahulik
chervil
pleo
aex
shabbona
murley
maintainence
flunitrazepam
pegeen
victimize
mtbi
arcam
hyperrealism
neea
abassi
yalding
imco
vgc
maeby
darleen
williford
wprost
ivison
spirochete
barabati
sibella
apatzingán
hashmat
bewitch
ajaib
gelson
amputating
sads
unibet
interjet
pppoe
arrt
geomorphologist
clemen
biswal
edamaruku
hairball
fawcus
guoliang
terzis
thyroidectomy
berkovic
calan
rosiers
coillte
bugloss
valleyfair
delane
oeser
deckchair
montjeu
hambrecht
excisions
edris
alfama
qeqertarsuaq
tracksuits
bense
vvn
gâteau
tetany
castellitto
stunners
familiarisation
ironmongery
nexgen
daquin
explainer
borghetti
guandong
xanthan
kiichiro
intented
flatted
zeuner
blonder
preformatted
voorn
edhec
recants
undyed
deidrick
venality
rileys
gimps
rooibos
glynnis
rasco
blechman
churruca
maxxi
cellan
katty
harai
béatrix
nonmilitary
lenine
dke
emili
hiles
hammell
sungard
jacobinism
salur
hallowes
webshow
facilitative
sopel
restructures
includable
aeds
disfigurements
okonomiyaki
shimpo
ringmer
boogerd
carneddau
sommeliers
nebraskans
turay
schopp
neuroprotection
nevs
guesser
durrus
wvec
vaculik
lewellen
mescaleros
mcmackin
cloward
weaponization
cortado
tiahrt
daker
ortonville
caborn
darrelle
polmar
byland
terumi
semb
ormen
karrin
galyani
robertsdale
nanney
waipara
bryggen
rossford
caputi
unmeasurable
airfares
pajon
redentore
dauvergne
altcar
verbrechen
bbtv
bourzat
nazarova
demodex
mondini
seya
progr
gordievsky
regrowing
agba
squashsite
porath
sitemap
tommorrow
manh
rubbia
farmacia
hedstrom
benrubi
bleaker
woth
barmore
jacir
megastructure
defar
mnb
kisnorbo
truglio
nothign
hooman
silvi
mikeladze
ceniza
rysselberghe
systematizing
crusaded
tasco
estacion
springborg
eorl
microporous
barukh
aseman
sidiki
fresu
besterman
pujara
gleditsia
denbo
cebrián
taittinger
pachter
ballymurphy
filoni
liquidations
chsaa
yongbyon
conservativism
hydrochlorothiazide
jurca
rexy
namik
thermoacoustic
favero
wykagyl
chuckled
dwindles
hertzen
pashayev
winepress
argc
farzand
nazneen
petacci
mukri
oneidas
achive
hazle
andreanof
shabalin
shef
salpa
mottingham
testings
multiband
sharmin
sinlaku
nicot
pestles
lapthorne
etgar
castelvetrano
radack
nutfield
afroman
schoenfelder
munmorah
grandfathering
keyt
wbls
electropunk
kardam
lumpectomy
aflam
damasco
kashkari
kazutaka
sarlo
lokmat
lordstown
chalupa
lycanthropes
frejus
borrello
mulls
oresharski
dlh
hydrophobia
kinosaki
riam
hambridge
rhoose
solca
baecker
vli
planifolia
incepted
hadnt
ahmedinejad
pocho
südtiroler
gleams
sumaila
tautomer
intestacy
tabligh
airbox
mulas
lujiazui
rebs
albertas
kingwell
remodels
stampfer
klammer
aviat
dubal
dirtbombs
moshiri
erco
rosinski
ubt
stoneage
moonshadow
cambiaso
biologicals
cleber
pressbox
berrio
worldline
mercola
xolo
hetti
puetz
dearlove
responsability
aliceville
makiadi
danns
omai
felgate
culls
enthrone
truisms
rothermel
superclub
gcg
trumpton
liljeblad
equable
grafitti
seedpods
trouillot
hearkens
arpi
glasier
paihia
hallenstadion
brams
mishael
salia
cliona
febuary
boisson
octopamine
valco
bedeviled
feenstra
chevènement
stalbridge
hilsea
clèves
bilirakis
narrowboats
gartz
lurched
rainton
uei
moonen
koshland
lobregat
spectrometric
peinado
nrsc
gladstonian
verret
reist
pvm
slover
sumita
nominalist
grovelling
kachemak
quechee
bozzi
rhinecliff
collations
tarkwa
myojin
meily
sherbert
ricksen
sabaneta
rajen
bootblack
keenor
godkin
eclac
rathkeale
gbd
pozas
bertholf
jso
warsash
athenia
ambassade
kopit
gfh
grouville
manuele
beholders
makropulos
capex
mayos
loginov
miranshah
techy
cumby
alishah
leaker
bensoussan
hybla
sjt
tweezer
unha
ikuno
upstroke
presiden
nikora
pajares
biopolymer
neeskens
defrance
buchler
bosetti
albrechts
kelsi
superhard
sasagawa
petti
stiffeners
commis
vrm
luncheonette
toulson
bronchospasm
cvetko
befitted
atención
eustachius
jicheng
redstate
careerist
sambad
sarychev
dunson
manakins
relabel
quiznos
torosidis
knighten
ecis
mortarboard
phomvihane
billotte
piergiorgio
ibew
khairallah
milers
djoko
ecolab
ebtekar
transpac
caparisoned
ecstatically
tubthumping
morter
reasserts
hamil
rakiura
kinkajou
mcgarvie
meritor
zoia
radok
rightsholders
swezey
reformulating
rathgeber
pabo
flossy
vadnais
ditchfield
chally
bitterman
sirkus
buttrick
intimacies
nxumalo
casalesi
nadesan
ceilinged
ledeen
gomo
frenais
schutter
etalon
manias
karsan
dtw
kadan
ignominiously
habesha
phipson
becque
panoche
irujo
kernis
pattens
untypical
rudner
versoza
vidigal
ukai
pedometer
hamiltoni
haldun
sottsass
nesse
reichsleiter
lukka
rutkiewicz
tuomioja
gerbera
methone
tourniquets
creese
navratil
carnesecca
stithians
supplants
sepulchres
spil
webring
parlous
depriest
kodu
golb
jazzwise
puppini
clientelism
hangang
porkpie
benevolently
shippingport
xuemei
tischer
garnishment
deinterlacing
janisch
bmv
ewens
ebusiness
sperrin
remonstrate
valances
fissioning
odilia
miomir
buoniconti
mauny
jinghong
kindl
tiva
hatikva
awda
hagibis
tyvek
shouf
stepfamily
mealtimes
mzoli
openmoko
gungor
hense
strategizing
nipomo
litsa
tvone
joshu
bowhill
tyrannosauroid
chingola
yavlinsky
kalifa
chipko
mulally
fleetham
yasue
balvenie
howzat
poblano
erotics
shidler
timeshifted
phrenologist
couserans
gitelman
verzasca
synergism
thadeus
modelinia
pedestrianisation
amlodipine
stemmons
pincay
bedsheet
suchard
kutlu
bellhouse
londen
walkersville
straughan
evdokia
watco
russy
gardon
eking
donnino
fmb
foce
ruffins
norstedts
penteli
chid
boucherie
berrys
caselle
passet
sobat
urdiales
landsberger
korrespondent
heatly
harangody
supersoldier
ummagumma
shepherdsville
flashier
chiou
morman
sanghavi
jumptv
coxhead
apella
vsh
cjs
repulses
effectuated
pyman
renzulli
meliton
blockings
ptn
hartsell
natio
igoogle
chancay
qhd
aboot
yego
schans
circumnavigates
multa
ybm
tombak
hauteur
blanching
digicams
courneuve
commentor
kyriakides
csca
heumann
avaiable
hones
hrl
pgb
soteria
clayborn
montcourt
fedewa
beggining
satterwhite
skerrit
booher
anencephaly
chabukiani
raunds
hoofer
soseki
teasley
fieldcraft
strutted
ksas
undercliffe
bikila
hanen
fundings
issawi
slifer
gooneratne
moneybookers
damelin
berezhnaya
primorsk
reker
graceless
glissandi
bozek
decongest
verdienstorden
niffenegger
jiaozi
halimah
virgie
mesmerize
klingsor
pashan
chesterford
karita
pithead
ramree
haleiwa
ibuka
firtash
krey
gumboot
carnedd
annat
schub
motherlode
realizability
blasket
blicher
piestewa
electus
kuujjuaq
alirio
leisel
musyoka
ariely
florals
conkey
nstc
shabat
pentagrams
codorus
werkstätte
licciardi
norin
queenan
teynham
excipients
praca
possibilty
marchesini
vaishyas
tineke
rudderless
randalstown
lickers
egov
calderbank
cryosurgery
bernis
dumbreck
quazi
vasodilators
americanas
fastrack
viridiana
clifty
fanis
braddan
appt
cenon
shamshabad
kawempe
lykens
baiana
shoval
miramare
cinephile
carbonera
banim
prevotella
fuzzier
saranga
svanidze
pigliucci
torpid
dressy
astir
cummerbund
zib
circumscribe
hktdc
nikolais
girvin
wickenden
bentwood
stefanidis
panamsat
ieu
magnasco
nhmrc
maquiladoras
frumpy
screengrab
tarapoto
naela
jeetu
vldlr
jrt
bazzani
undeliverable
wcsc
zonealarm
schulthess
coarsest
ebaum
niitsu
clubroom
odone
greatham
inattentional
huband
awdurdodau
campisi
breastmilk
gharials
borchgrave
shouty
clennell
barran
obliviousness
flab
tafalla
riemsdyk
hooe
eisenhauer
ccsu
mechtild
dzurinda
quarmby
pistoletto
largeness
buies
wettig
yodelling
conceptualist
jabri
frauke
upgradation
melburnian
maniago
cinzano
wurts
screenname
tianxiang
infestans
weeksville
pelevin
honley
irbe
extorts
lieto
dwarfish
dixiecrat
antecessor
haixi
vocm
demonetized
totland
nicholaus
overstaying
nearsightedness
kowt
naegleria
zaobao
albas
podell
treille
tebay
olayinka
rakoto
muccino
lampanelli
smackover
fooks
cryopreserved
latka
rassa
kilovolts
supai
souquet
mentionable
warchus
unworthiness
conveniens
rezillos
denkova
slagter
heshmat
haaf
cebeci
devotchka
ketubah
frausto
traeth
mubashir
kaller
leaming
artusi
ferreting
cocksucker
abimbola
hitwise
caeser
manava
aerogels
arag
goldhawk
saskpower
morpholino
grindrod
coastwatch
salmela
scx
geoghan
pinckneyville
programas
negrini
pinata
heinle
pyles
swedberg
millipore
rostami
bayad
sanyang
banaz
fody
elding
schérer
yongsheng
karly
supergene
baste
hoyles
fendt
darold
osipova
xiaoshuai
bohannan
phenological
peza
sfakianakis
finny
mondor
bishopthorpe
sahour
transcom
mbandaka
metrobuses
rongelap
rasmuson
yoshiya
mentana
aigars
aluka
aronov
obscurantist
uchtdorf
mevagissey
filby
willowbank
uakari
sempione
zotto
modders
shroyer
klampenborg
dargin
balicki
baltra
jinfeng
ddk
eastlands
lingonberry
rooters
zafir
ndugu
downhills
charite
zastrow
handwork
dallesandro
booties
alusi
donwood
interbrew
blountville
mangueira
ficticious
marchbanks
kfox
bkl
shinny
auri
havey
kinnison
tetrazzini
iwamatsu
alquist
almondbury
kretser
gamebryo
compassionately
unframed
adjudicative
gouais
batucada
fairhill
defrancis
bayed
rubey
utami
dinty
cioni
hzds
sinfonias
avinu
tangkak
oreg
tyn
fetherston
wighton
dramatical
yurchikhin
likability
dallington
chusan
bandt
tamargo
zafy
vasilescu
wano
wendo
mcguffie
tawheed
symmonds
lability
onawa
hollioake
silsden
bowood
torrecilla
klown
comittee
westacott
karygiannis
ambe
mansueto
omarska
kaspi
wilhelmi
kallenbach
neslihan
margene
podolak
webbook
wilshaw
choules
shoebill
fushan
galanin
cisf
javale
extortions
hugel
menand
limitada
cathouse
patricroft
teeing
omero
biriyani
scribbler
arcel
nost
heatherington
undervalue
desegregating
pockmarked
viviparity
burgeoned
kovats
gpas
alero
chaddha
illig
ducktown
clayborne
asphyxiating
xihe
amod
dtx
lazarevic
metop
horlock
toumanova
mezey
layng
wanzhou
macksville
piozzi
stekelenburg
ridgedale
bartomeu
commissionaire
kinnelon
laboratoires
rodrik
flowerhead
durex
morcillo
godfroy
zargar
melior
kozloduy
bonnel
calke
baughan
cwmavon
goessling
petric
mkx
barnbrook
straffen
foxwood
toddle
dspd
schmucks
rumana
sytycd
absolutists
ejaculations
ravensdale
ziegesar
cyberdyne
libidinous
waldock
bluntnose
pachycephalosaur
actos
wzzm
rodell
zenón
legace
drinkware
acclimatize
warehouseman
cryptosporidiosis
panzhihua
furling
xiaopeng
ninel
overpayment
archard
firetruck
mbokani
coolen
ickworth
steffes
davidstow
kpnx
iraola
ommegang
fergison
vladimer
storrie
makefield
gullick
hotrod
abrahamyan
acron
clunkers
ziehl
lynford
splurge
wilensky
wahat
mestel
strehl
trademarking
unmanly
bancomer
demag
laken
kitri
stansberry
emmeloord
perko
standage
chording
nolensville
reclassed
uzala
guergis
kran
shrake
criers
cliftonhill
thondup
adepoju
lullingstone
julita
tonnages
transvaginal
enamelling
alll
jiwani
weakland
hairsplitting
jilib
antep
fyffes
stoen
gajda
tremolite
fremd
hugoton
nion
bojinka
fing
insectarium
jigging
partium
imgs
borgonovo
greenbury
bhavsar
winnall
gutenburg
burnquist
hanft
ameera
gunrunner
superminis
soubriquet
monterotondo
malkangiri
bishopscourt
ebrington
obscuration
undescended
fanthorpe
wilmont
patinated
lennix
shalamov
smudging
lenska
hydrologists
discreditable
laragh
shimbashi
remilitarization
decertification
atrophies
downlands
ndamukong
bamforth
schmiedeberg
zaldy
licker
micklethwaite
roone
dalisay
linalool
flowerpots
hirotsugu
simplicio
bogumil
dirtee
steamrolled
alysha
sengstacke
wheeze
gouled
khwar
goyim
fruitfulness
blant
tixall
bookscan
newfoundlander
aquaporin
chigurh
joblessness
blackheads
stracey
uppercuts
myklebust
eurotech
sluggishness
zubar
virajpet
elvaston
unselfishness
mottaki
electret
grossest
nieuwenhuizen
magneti
worle
squam
reassembles
spurting
mingxing
overgrazed
humidification
cassone
scrapings
buchsbaum
cospar
deganwy
acuteness
mezzosoprano
ious
bulworth
llanarth
godmothers
strelna
ruritanian
brdy
assos
oaky
lammermuir
craighill
lengies
eskandari
malteser
palmtop
leyritz
scms
nescafe
huntspill
dellas
fatha
oaksey
jsg
jeanpierre
woodrush
systematised
mahfood
limas
launius
shemar
menjivar
clewlow
baweja
cevennes
njoya
maleate
xuefeng
shadyac
bertinotti
hoarseness
lavazza
bajracharya
slimmest
cyanotic
boundries
glave
focker
reaser
baumel
nicking
castagneto
africom
tanweer
machination
fhp
thugz
interminably
hammarskjold
macroscale
marzia
monopolists
cataloochee
gornergrat
bohemianism
zelia
benjarvus
wohlforth
balladares
crisologo
paloschi
buelow
domestiques
vgl
chudasama
bsec
sharlet
dhimmitude
hazelius
popeil
chemtrails
gelfond
sahwa
tewes
dhiman
astors
nemorosa
duning
andronic
quackers
bartolotta
ambras
rambis
dunkelman
bertino
vemuri
sitagliptin
crazes
lancret
elasticities
impels
jamaa
rabee
cotulla
romel
leupp
salinan
palanan
spungen
rebated
sherrilyn
kalmaegi
keycard
nfci
mokena
lovera
pullmantur
sarson
roven
wrobel
visaginas
levoy
yele
billingslea
weiming
busemann
sankha
sambizanga
downstroke
vernissage
rotarians
hunding
damak
xingguo
particularized
stoler
angi
mirzayev
mathies
technip
patriota
tastier
shrubsole
priebke
dismore
plaint
kerrod
shann
photochromic
larreta
swizzle
psyllium
pipefishes
tymoshchuk
rozin
bauers
durmaz
anodizing
walheim
vandewalle
pellizotti
hengel
adaptec
ellerbrock
unkindness
thrombolytic
mcgilvray
baloy
ghaith
seadogs
cerrig
newes
rushin
iiif
hunker
gortyn
barningham
retallick
sji
lembke
hongyuan
gramophones
zykov
darlton
sxs
eten
astronomico
fastbacks
baisers
gilston
lukšić
velichko
ptaszynski
goldsmithing
schultheis
marial
rastaman
normington
okakura
longingly
nauck
ayse
sigle
sardari
dupee
handicappers
yegorova
ipca
ludy
klinker
harvards
gasped
zingaro
sirko
aurelie
brazi
gravitates
broomhead
yamaki
sweetmeats
jinggangshan
cuchulainn
shootist
duggars
bandes
tashilhunpo
campano
rossler
taqueria
shannyn
hogge
yunes
foodbank
delbene
pascaline
troiani
abada
darín
mvula
sanou
contactable
rudiger
canessa
combino
idex
cowls
dyspraxia
coluche
esclaves
sobyanin
garos
husna
prust
reinfection
trestman
weyoun
yawa
inkpen
almonacid
nighty
lagana
nsis
kmiec
nublu
procuratorate
tianyuan
cenedlaethol
akman
browses
sobhraj
indemnified
höfe
streb
koplik
prade
nalls
amarin
defaulters
genitalium
mergansers
patka
heti
goodby
klever
oruzgan
lubuk
pendeford
godeau
callipers
crucifixions
kumbaya
palafrugell
beisel
kentfield
plyler
ibama
prejudging
subin
punja
keratoplasty
lisnaskea
gignoux
kuc
nairu
deibert
detectability
akos
yamadayev
fugro
renascent
sinnemahoning
methylprednisolone
locomotiv
cydney
magistrature
eisteddfodau
pegwell
epigenome
binaca
rabbo
lawers
kuz
chocola
quibell
baorong
coko
voros
nooses
macritchie
labaki
medtech
daphna
appen
clementon
suissa
slanging
maccarinelli
aprox
somniferum
cyberknife
alferov
koreeda
hanz
krisna
faizullah
cardioverter
trudgill
matchsticks
vietor
aksa
sahraoui
sixpenny
suqian
spinka
foetuses
alboran
unmentionable
macnelly
yonathan
fuenzalida
streetsboro
kippen
philippon
evalueserve
carvalhal
silbury
goudarzi
zollo
hemoglobinuria
jotun
douds
malmsten
aurantium
chadema
anglians
diphu
stabb
sholapur
ficarra
ruedas
kotok
cutrone
speedways
roomies
gallow
paroxysm
validations
bassil
goodger
bizimungu
arapaima
fingerless
nonnie
callouts
comports
sedrick
gaut
motional
portlock
banus
masamoto
booboo
kissena
iribarren
rockcorps
bkm
rll
maghi
fionnula
rupavahini
holberton
cadential
berneri
soor
sharpeners
contr
tianxia
tiramisu
jlg
carlen
hyndland
pck
noureddin
dunkers
maribeth
eenie
kopell
dukw
tuula
ruckelshaus
visentin
prixs
echolalia
mingun
algonkian
brisbin
madl
stuffit
greate
grift
wielkopolskie
prevue
lelièvre
bratkowski
piroska
chieveley
trocchi
coloccini
enlivening
kloos
fogged
châtelard
davidsonville
kranjec
eanna
teratomas
utube
letterheads
flechette
cdfi
lobert
congregant
wildhorse
annasophia
pandarus
ricart
fucheng
tammaro
hardage
becaue
megastructures
tandi
narong
crps
confindustria
crosshill
litang
eiche
pathetique
rudest
boof
karadzic
plotless
pulcini
sioned
hyperplastic
neemrana
shanaya
garmo
tanton
poujade
marzel
katims
merlet
chafford
wassup
garud
rockpool
richens
tawni
infirmaries
gruene
godawful
hingst
dunipace
keyboarding
kerse
bindel
mugford
moate
suzano
diadora
almar
kilted
benna
chauvinists
gordes
legman
charleswood
abbett
pbz
byler
muskoxen
koukou
marzorati
porker
espinar
phalaborwa
criscito
wahbi
flacq
pottsgrove
aquaponics
apap
pahala
dalyan
manantial
trunklines
galeras
griebel
pedestrianism
whiteinch
moenkopi
iwp
procainamide
murrells
reutilization
ondansetron
heliostat
oxenberg
heileman
subtexts
gernhardt
turnley
wiesenfeld
wuyuan
disentangling
enap
shuli
oroya
samon
aconite
apocalypticism
gluckstein
anston
northleach
horyn
lfk
hanban
hewerdine
triclinium
nerijus
takenaga
quacquarelli
gamston
folgers
bureacracy
modele
pedicabs
cerebra
audis
wgal
minyanim
seana
golondrina
squally
dyble
vmix
skues
tesio
specificially
mabie
schodt
maciejewski
bogas
metropcs
alza
skitters
yere
genzken
fugly
frenk
grameenphone
rigal
fereshteh
swallowfield
erdenet
redcross
scapegoated
gilzean
limina
mihalik
yasur
numericable
hannant
asmaa
aace
fornos
jipeng
certaintly
neola
christoforou
baugur
somersham
toppin
tollett
zocchi
greason
impellers
sumeria
franciska
lashon
heintze
righton
mrh
shojaei
fancifully
wepener
abdula
tramontana
comberton
paua
behemoths
paumen
wickland
yalom
glocal
takeishi
sceptres
creal
ovingdean
gerke
ntcc
aitzaz
orejuela
paraffins
pixellated
rohingyas
proxied
mcgaha
formato
greasley
haltom
heit
leshchenko
marenco
nechama
garvaghy
breite
claque
ratanpur
peregrini
kiyohiko
chovanec
gaida
kilik
aulas
fermentations
jakup
counteraction
dnfs
prawa
bootstraps
mengzi
hillwalking
routhier
backings
soltero
hanahan
fossilize
gamersgate
felter
macoris
blachman
wkyt
franktown
roehm
yazhou
utopians
microbicides
ghale
masturbated
vercauteren
handballs
tinne
sunblock
zanten
burdi
lewsey
martley
lgf
leveaux
helmshore
rabil
accouterments
caram
rucks
clearbrook
ruotolo
geting
mccraney
mho
shazza
koroni
jizo
saitoti
harville
persicum
jhc
rühle
disempowered
revenu
louds
dehumidifier
lingwood
klop
upbraided
piret
miyakawa
summerbee
externalized
khovanshchina
rediker
velikov
dilating
pulmonaria
redtop
peyret
drillbit
galloper
mjb
hynix
frenchkiss
wastin
msdos
ortmayer
trachsel
revengeful
monach
croque
artadi
groundwood
liimatainen
guancha
vsphere
thebus
alibek
furuno
cantine
shust
murua
lanser
collegeboard
nalis
granillo
ketsbaia
fastens
cakaudrove
unpreparedness
reserach
angarano
artifices
handsfree
decompressing
kiff
nessman
farrior
tancock
skyboxes
joses
hamamelis
agoos
stoehr
vinegars
leye
ravaglia
savignac
muddles
hogestyn
freetime
wastefulness
klüft
tullos
komedia
neato
leonova
deejaying
teallach
grunert
leafe
cicierega
neele
pentavalent
debelle
andreyevna
azf
telesat
mmse
druyan
feticide
séralini
hokuetsu
wrase
pesi
bankunited
cavenaghi
meq
limbert
carpani
iclei
unlistenable
ponorogo
trimley
ascl
meiers
ayles
lijst
bardini
hoberg
trombay
lovibond
overvoltage
visitacion
coloso
vmebus
scalera
burao
gaard
brockworth
hegar
quarrier
fenichel
nonfictional
demyanenko
dinkel
norful
praesidium
grobman
neediest
boucek
baniszewski
allfrey
iora
scheidler
polyuria
chandrasekhara
nthe
froglet
bijlani
privatising
ilango
klossowski
leuer
taarabt
chagaev
andreyevsky
reen
ciy
galilea
prosequi
libeling
assef
fossella
prouder
shivakumar
jacobellis
inle
bujak
mabs
stodola
opnion
memminger
daele
acyclovir
simlish
mpemba
bauerle
mufc
kamrani
nanfang
murakawa
baksi
inserter
koob
whisking
millgate
couve
gassy
welled
ehrenstein
ahsha
worsdell
frucht
fiaf
adomah
mustaqbal
dweeb
baihe
blagg
incomprehensibly
bettles
mardis
stoppin
beacause
fradkov
sinewy
detrimentally
chhotu
washerwomen
pinko
birdsboro
protasiewicz
fischerspooner
nasc
mailes
rajkhowa
caravanning
gianmaria
gauloise
cumbo
cxx
beckey
mahasweta
nnsa
calella
stoppini
skibbe
flippancy
oxwich
fariborz
leki
breon
relaunches
sforzesco
hohaia
anif
massimi
montignac
mouthwashes
allonby
hemifacial
quinidine
barcamp
kaddour
bilanz
caygill
fii
bartons
treharne
coloniser
ritualistically
wagyu
bonora
kloves
chandrakumar
parrino
warter
mackovic
marau
astonishes
knoydart
handelsman
baldhead
teicher
unrelieved
hucks
schipa
goze
cameroonians
librería
crunched
moulvi
pantiles
goodlettsville
hambros
piccini
wisnu
kudryashov
burciaga
baltzell
kiszka
tabulator
sequitor
snir
marneuli
aushev
transgresses
gratefulness
torroba
untangled
trethowan
scuttlebutt
acie
mühlfeld
tista
pageboy
bethuel
yers
egotistic
ragg
califon
adrenalina
cmaj
mizanur
gethard
bidden
yongfeng
cumana
nonsmokers
schommer
kossi
chinamasa
liedholm
taleghani
wareheim
paintbox
sylvana
kurils
linemates
begain
opnav
riesgo
cifaretto
kleindienst
kombinat
ronchamp
hyett
infrequency
kluk
pentewan
crepaldi
aeroplan
trun
niru
willhite
metronews
sharla
gazoo
bcbs
atavism
crazee
densham
nummelin
daq
steinkamp
intracytoplasmic
clac
higer
libertini
invasively
sintef
grounders
unpredicted
sprowston
crummock
reimagine
philarmonic
tricorder
hunniford
hugli
quinces
okri
jaray
unhorsed
unsent
scaneagle
catemaco
morovis
azel
whiteflies
navfac
ouda
mtns
esxi
weststar
milicic
miliary
papillomatosis
leysen
gration
tlt
lewanika
jast
svod
etoys
wisdoms
tintype
jutted
periodontics
elchibey
sharana
bounteous
friedeburg
gambell
olexandr
mullery
shanidar
hauschild
majorie
variegation
nairnshire
rastro
regata
twinings
jpii
bellefleur
peterhansel
commencements
mceldowney
pracy
enmities
aitc
arnison
ithe
caerdydd
fairpoint
gripp
moika
dolt
hitzlsperger
donnacona
houches
trillin
renovo
shellie
sotiriou
headwaiter
loven
bocek
morshed
schelklingen
kille
stiehm
fermier
arberry
pistolero
developped
froberg
khatibi
palmed
foregrounding
bll
tocchet
piekarski
richardville
prefigure
mcilveen
accussed
bolstad
meinhold
framestore
vittles
parnevik
grh
counternarcotics
evah
couteur
batal
tambopata
krieghoff
scarfs
arry
anisimova
ivt
bahen
inertness
dowbiggin
wallers
onera
julians
murrin
cullberg
inx
olitski
ouston
iolanta
shinola
intervale
tayyab
porbeagle
furi
wheather
chakiris
gambas
akamine
mogote
peevish
papamichael
keening
shivas
ostry
fembot
wishin
emulsifying
darwins
gutheinz
canapé
marmol
xpressmusic
moallem
boxtree
lizarraga
croquettes
scharoun
immunosuppressed
intermittency
fossi
steber
herblock
plentifully
thymosin
psig
kenwyne
kanka
kärntner
armetta
sellersville
barnt
pulverizing
mancebo
peppin
panteg
barcus
aarnio
vermeij
viani
trumpf
karriere
gouger
shirato
beur
ijc
secessions
kongregate
onair
karges
sokoudjou
plantier
servicers
vocus
katkov
ngen
jne
kompass
durazo
flatrock
mealtime
chatchai
kochin
soller
ugur
aleatory
juantorena
huntingford
rouxel
canelli
bulykin
divestitures
icmr
hyperopia
shelsley
okhta
lightwater
sinnamon
cise
longlegs
citrulline
consorted
paone
jiewen
zhihui
walkovers
fitow
gcaa
backstopped
indya
albarrán
ocz
wolle
photosynthesizing
microstrategy
discontentment
lanotte
stillson
sensini
schoolkid
gerta
simpatico
botín
longside
abos
achuthan
bequia
zhaotong
stanbrook
fahrettin
rossello
sirkka
ingleses
bleeping
trena
stippled
plotz
corrugations
hoeksema
aspull
euromillions
swx
vsop
kamya
jelley
kresse
schiehallion
oduber
trka
edgemere
shindand
basey
yufu
standerton
giuda
atascocita
navion
reconvening
liebezeit
fletching
pinery
alikhan
sasco
sigit
tipitina
forss
holdrege
mughlai
tix
umeboshi
enlow
bobsleds
litespeed
verka
rivertown
tacconi
rorippa
skerryvore
globulins
tzec
marilee
awadi
mozambicans
banyamulenge
berendsen
learoyd
atisha
estudiante
idole
niccolini
blaz
miesque
quitclaim
foxford
hiltzik
krebbs
hayhoe
olvidados
persistance
balloted
decourcy
nscn
blakk
dwf
brenin
naïvely
stoecker
kamiizumi
gvp
hallmarking
haematological
kayapo
mediagroup
moute
ulma
nnaji
cherrelle
triodion
ribet
dialga
tardebigge
cssd
lechero
cosman
newtownhamilton
luttwak
blondi
clampetts
carandiru
zairean
doughs
genos
zidek
barosaurus
albiston
benegas
alverstoke
delima
diceros
incontinent
evilness
mariachis
brodowski
multiway
euphausia
refractometer
pattammal
uny
chbosky
danelo
waterfoot
ryogoku
meselson
burck
orwig
pooter
ncrc
landaulet
mgy
contino
talmudists
ogren
dazzles
chancer
wfts
acon
sergiyev
wordlessly
kirichenko
muela
buble
andreina
masen
eloísa
galinsky
tavakoli
brighstone
rheumatologist
aabar
pulgas
arsenopyrite
quenstedt
freidel
borum
trenchtown
unromantic
edmée
brisbois
ahla
cressing
helguson
peria
mccreevy
benglis
cheminformatics
freudenstein
raggy
tembisa
jeyaraj
nassi
vieth
powerbuilder
uplinks
soutter
kaprun
cacioppo
stockier
epifania
honoraria
ricciarelli
pevney
szeliga
nightspots
unkei
liberto
bassiouni
malk
ferndown
budweis
suppressants
hildred
perlich
callowhill
arseneault
jupe
bielema
berkovits
sammes
venkman
karamay
biennales
burtis
tasci
backsliding
coln
matip
oppositionist
lalime
dubinin
unseasoned
detroiters
consorcio
potch
havemann
hänsch
soulive
catorce
manulis
assertively
diapering
margarethen
vettriano
lifejackets
methylmalonic
zeth
awr
reekers
clingmans
pitmen
kathman
gatz
azimut
sincelejo
wofl
asume
bioenergetics
militated
langeais
yankovsky
heleen
monkeypox
koenigs
killowen
oestreich
vallot
belaying
bejo
donghua
navels
thakurta
veness
ellagic
folliard
malassezia
shughart
autorickshaw
pinecone
hosey
spaceland
raycroft
pachora
trolli
pfe
mutualists
hetemaj
vishnevsky
cobija
wian
gotv
tangentopoli
achyut
gerbrandy
jabu
windbag
fanck
ummayad
natasja
shoplift
beachwear
greis
bovespa
artiles
hatf
wigfall
bisphosphonate
llf
anod
verdine
bjerge
cardioversion
achewood
tubin
mbv
senechal
mccrindle
ramia
virosa
barce
stuever
hanigan
taekema
akj
wolfsonian
aiya
clayman
lakai
selm
bnot
crug
galerías
sunward
ather
stavely
wkd
transexual
hangup
degreed
sidus
lanne
cattanach
kubel
altenmarkt
escherich
klebnikov
giralt
cornflake
jangling
ncas
dominy
ragama
lamlash
timewatch
blei
rogalski
compromiso
miracolo
gossow
dawar
kafiristan
wtem
sironi
streambank
yanar
aloy
moisten
gagah
summerstage
aleu
alcudia
mamedyarov
nilla
koerber
jowls
pepco
ohshima
sugaring
hilário
hongdae
cavalo
krust
suburbanites
rscg
kousa
schwahn
galison
pocitos
foregrounded
ctk
delphini
cloudberry
seafrance
marjoe
kharchenko
bagayoko
hartselle
askhat
hirschl
abubakr
kumarasamy
pandals
pounces
birdmen
juva
obselete
popenoe
ebby
flotta
newpapers
haematopoietic
maracana
remise
cassey
reynal
jiles
positif
ruffell
dioses
kanzler
fuh
loftier
halam
legos
overslept
nwachukwu
menc
blong
flye
pandang
bakoyannis
kastellet
thas
comping
rechecking
nikbakht
nataf
leprae
akifumi
disrobe
hannula
anshu
zirin
boullion
buchanans
prescribers
moonah
luteolin
slagel
grantsburg
lifestream
anzalone
marchibroda
metronet
barrero
wierzbicki
redesignating
economidis
diago
nqf
mudlark
tiznit
pasquino
elaborative
yakusho
shorthouse
meeus
janica
teletypewriter
mudville
malty
guanacos
hinnant
treleaven
scalpers
vennegoor
hasna
greatcoat
interisland
dollhouses
rhiwbina
gannan
kitezh
jeudi
karinska
expediently
sturgill
thibetanus
afta
camerota
neitz
convers
azambuja
unpersuaded
lgen
kissers
christenunie
ettie
beloki
alesandro
mountsorrel
veneman
kamboni
seminara
bakassi
schow
whittenburg
milliwatt
tagliaferro
stancliffe
preplanned
tocopilla
talisa
boutelle
mergen
dumervil
chelimo
kokkinos
macandrews
delonte
shuren
roosmalen
boogieman
haveing
fbos
scofidio
larnaka
incising
orgiastic
shammari
templiers
gilma
gutfreund
kig
crabbers
bioprocessing
petulance
unenrolled
partiers
xperiment
cottman
squelched
yasuni
cabannes
sparkhill
longshanks
mongla
almazbek
tredyffrin
fariba
rønneberg
broyard
northmore
eckbo
sgae
acars
jimmies
ratting
acps
chalkley
meseret
mcilvanney
vanquishes
manolete
bikeshare
tristen
krejci
aysgarth
carmelina
covel
garance
kyriazis
mccole
sugaree
llŷr
instone
guh
vainglory
trescott
toiba
tdu
dulling
khadafi
galiana
baltzer
verrilli
washo
australopithecines
whitesville
centerfielder
kébé
skag
metallization
collomb
kottar
neila
boyo
mclibel
minsmere
pepperidge
nemer
scoffing
introitus
busco
kabini
cowon
meliá
bárcena
koeberg
culligan
giasone
tongren
pedrad
cernea
reviser
botches
yeliseyev
postdate
gamberini
sveum
barbarez
narron
leire
wittiest
qujing
mecs
kindel
mechanize
seiber
tudgay
maby
orangefield
colleville
biggy
tagami
mangosuthu
norgaard
mitag
aulaqi
uisge
awwad
lindhardt
fehl
breastbone
hethel
damiana
naturalizing
endler
coutard
suppleness
wizened
wearily
necedah
weist
burgman
hampus
kemalism
matrikonopc
peschisolido
gvk
datacom
gestating
aczel
shimmers
eih
breadon
helpings
riahi
sabbatarianism
sheilah
gambiae
salable
menisci
rizhskaya
zago
cossey
woronov
gottstein
erlinda
expansively
owino
fochabers
mullaly
bratislav
refutable
woolsack
alperton
seccion
euphonious
mckone
radiothon
dillion
kaina
unblockable
damron
wytv
unsexed
gulpilil
cherise
hotkey
psychogeography
monae
ticad
froilan
mamonov
alstroemeria
cez
bajofondo
boyack
vuna
moua
mondorf
codependent
worldnet
symbiodinium
paleoanthropologists
remek
polymyxin
gallier
stanchion
meerschaum
averagely
bigazzi
bordone
excoriation
harbutt
pippard
econometrician
kipiani
prostrating
gateacre
guerneville
simmon
ptw
zhar
boddam
rimawi
bioactivity
tweaker
particualr
sportscene
autobiographie
piecework
loooong
kubus
wukesong
dobe
boogers
pincio
mazars
wormsley
sharpham
reko
phillippi
sissies
apolonio
stemi
tullie
deif
bullit
mungiu
aldermanbury
gallner
merse
votkinsk
istructe
uhmwpe
katmandu
bittering
vinda
movil
occitane
outsmarts
sozzani
postseasons
hurban
russophiles
nakasongola
rabanne
gracen
kiunga
lochgilphead
duelled
itek
interventionists
siphiwe
dunns
waltzer
swickard
ilopango
hadland
abutbul
rambow
lealtad
cityjet
freelove
mashreq
xperience
lambent
coult
veltheim
locs
altius
hegemann
dumbstruck
fwv
gyrate
stojkov
muge
facer
oufkir
sueddeutsche
pernía
linsky
nko
lasry
dionysiac
störtebeker
breidenbach
garreau
quiff
baruta
roatan
wafaa
dujail
allenton
daptone
potshot
cyberlaw
radiometers
orwin
viguier
begrudging
jonathas
cheekily
inomata
revver
gtlds
tyronne
maligne
eugeni
nikolaeva
bealey
accomodating
bougival
sleestak
sapkowski
irh
necid
arnette
recalcitrance
shurmur
kagurazaka
dtour
cysticercosis
looong
mungos
binta
paska
gaizka
sinharaja
mowery
haeger
dockstader
summiting
vucetich
wabo
arsine
vanover
orengo
hudon
kokesh
avara
battenburg
haylock
anomic
treemonisha
kaminer
naoshi
residentially
congresbury
tombola
permai
baijiu
griffel
chelmno
shortie
cordingley
simplexes
ipscs
brunetta
gilde
cihr
perdanakusuma
dahlak
cherrybrook
griseum
jiali
kastri
frood
cantare
caniglia
tuvans
intoxicant
siegl
genden
aniak
sempra
maniscalco
erlkönig
glibly
sneezy
ticotin
bagla
mtnl
psaltis
kedem
globetrotting
mangia
unsearchable
goron
migliorini
sesma
cheapen
diles
shakhbut
karakaya
hybridised
suasion
nonsurgical
osio
hemmingsen
adventuresome
bellview
sulgrave
saltine
calen
nigina
bermann
cinthia
rinzler
splendours
promesse
wallcoverings
cyrankiewicz
brackeen
wineglass
maquina
bavier
adot
cloudera
microsurgical
vadisi
milieux
onuki
siddick
reimbursable
noorda
eliécer
shadwick
outcompeted
chatterer
interbay
biochemicals
kittridge
esala
gortner
glenohumeral
etchegaray
bierley
delenda
troisgros
protais
adebimpe
reinvesting
douses
scadding
kolbert
rödl
maynas
mcferran
dishforth
mariappa
commissaires
mohammedia
unkindly
neque
neurones
shukra
mathys
soldatov
mavens
ilgar
adnet
evv
washings
shikata
mainlander
carandini
syrett
radiogram
tatarsky
pritikin
boge
bracamontes
wobbler
salvati
spieker
asms
quiverfull
baselessly
rentoul
eurekalert
simao
floreana
muehl
gorta
methodic
lastest
micronesians
turiddu
vasiliu
sanjit
ackee
wcd
fachtna
weasle
holkins
mtawarira
rohatgi
pyla
superspy
bigband
glenavy
deadend
chilcot
canalization
melikyan
gaspe
naruko
minasian
twiddling
strachur
tolleshunt
sej
ftx
barkov
churchdown
makovicky
osterwald
häagen
gmhc
hadary
kilmory
clwydian
kinjo
divvy
ascencio
crambe
dobermann
nonproductive
serenitatis
prosaically
reforged
oliech
corish
coopted
cheezburger
spads
flobots
stroudwater
druggists
melodist
puppeteered
nccam
birbiglia
deconstructivist
koeppen
symbioses
genistein
rampone
distribuidora
thulani
bengston
gethsemani
standpipes
motortrend
overby
valgeir
leagrave
amsoil
isvs
ringroad
ledward
aspd
weepy
zoloft
gilmerton
anzoategui
fairtex
maclaughlin
longcroft
zahawi
mcmicken
cottone
uncongenial
lugging
wikicity
jills
leveler
windchill
ghormley
launde
bornheimer
nantyglo
mofokeng
kampusch
ingwersen
worthily
nhlers
konte
huayuan
treviranus
aleka
reenacts
prahl
bazemore
suzann
creg
shies
cout
coatlicue
desilva
broo
aceval
kikukawa
sanyuan
theaker
svobodny
outsprinted
podded
ryken
hironori
foronda
thunderhawk
arepa
carys
grun
intiman
vanquisher
ruxin
lukyanov
mudcrutch
booi
pecari
sadda
markakis
hankering
blombos
cupper
visione
safta
lonegan
picturesquely
oldrich
longnor
regensburger
madhumita
huntingburg
barnstorm
loanhead
checkland
tesler
pavano
baguettes
piggot
ghasemi
kavalier
prys
osias
noggle
asatiani
olins
skell
collini
chevin
sloughing
chass
scheels
thorntons
noyori
panh
murt
bluemont
priviledge
sandbrook
jesco
currach
lenzen
cboe
shernoff
pascrell
knobbly
maintenence
lechery
hanasi
wnct
naturopath
hegang
chyler
quadruples
pietrelcina
shiming
deely
medivac
worthwile
irsp
nagorny
conrads
nprc
kragen
earmarking
ftg
tretorn
meltem
oogway
endy
javel
rattazzi
longshaw
sroka
oximeter
sweeties
wboc
bizz
boldy
dvorkin
elkabetz
ashti
arkalyk
frantzen
downdrafts
starcross
oumou
beamont
itsu
muhammet
rafina
remits
stefen
ibac
welterweights
zivko
saffo
dopp
emar
bodhgaya
bourgueil
redheugh
plensa
keese
plainsmen
himala
carlon
lape
kreipe
candee
khandu
beketov
vlp
zebroski
devonte
zhangjiang
gallaghers
nightstand
breeching
sunrooms
gerben
connotative
crowed
blandón
strano
bodart
tavish
dimitroff
bertoncini
kibeho
konnie
wahaha
copernic
praet
roero
cardamone
petkoff
scillonian
visable
rihan
broiling
mundum
lenon
belaid
flouride
goldikova
dwango
grev
urusov
northbank
damps
jenning
ahhhh
adeleye
engulfment
thrangu
mcgauran
spataro
spaziani
proanthocyanidins
buchbinder
batwa
totesport
ergometer
smailes
presby
chenes
zimmers
sheshan
sopore
heublein
swirly
evdo
nonni
beholds
janhunen
salvatrucha
bocquet
bazinet
cashen
nssf
connectu
merrington
snaggletooth
odili
latinoamerica
marsili
sist
ardudwy
kolari
catullo
keglevich
witht
tamasheq
peguis
pnin
guileless
cigi
unrooted
rayhan
paulauskas
temerarios
degreasing
smithsburg
garaudy
neulion
brownings
kauf
huntin
acacio
shabbily
omron
süss
yeezy
niccol
spek
boombastic
ragno
papeles
cherrypicked
subrogation
shortlisting
boatright
crofty
biava
björgvin
ultracold
durieu
mccarley
concupiscence
gelt
nyet
glennville
ariss
snris
fluoresces
rondi
marren
jianghuai
griess
bilek
eoraptor
asid
antidiabetic
reygadas
genset
kipawa
tardes
whoah
hunsinger
sneider
encausse
graynor
tottie
chumps
walshaw
qiyuan
parme
arges
malcolmson
baulch
iorg
lre
mixcoac
rozan
galeri
iocs
eulenberg
snitzer
ferulic
josee
turkman
reasi
flatirons
sweetmeat
goranson
genographic
volodya
titina
pennino
orpha
styl
surette
quillin
crawfordville
kaieteur
xiaojun
cocoanuts
lindhout
girardon
fenham
plitt
shapland
muris
dougga
bayoumi
margareten
dolia
noordoostpolder
ieb
spruit
barac
sipadan
lusardi
ronayne
feitelson
arboretums
vanegas
housemaids
penzo
crcs
flaxton
superorganism
proisy
huba
badda
bárcenas
plys
ziegel
rauschenbusch
shittu
abdelkarim
pbdes
arizonan
otmoor
markoe
flot
mosteiro
nafas
frühbeck
exabytes
microparticles
meridiano
slacken
kameo
kaegi
godet
cherrapunji
employes
zerbinetta
babulal
primarly
zamparini
edale
crona
maxthon
kerney
microdot
incongruities
kobzon
schlieben
pierrots
jebus
hoffberger
lordkipanidze
preposterously
goodwick
naganuma
pasche
wtm
truncus
mualla
davidow
diablerets
sadurní
seiberg
diffract
balkars
jvs
debashish
falloon
sclerotherapy
westcarr
sujin
shoesmith
llansantffraid
julu
hgvs
kettledrum
streetwalker
spott
nevisian
resounds
shifra
valpo
ambati
kouno
tamp
aspi
amylin
toofer
kirstenbosch
corrinne
barracked
zayandeh
yucky
ccat
hsas
hlavaty
lustbader
fitzy
lanese
oocyst
chargés
knoebels
chasewater
difícil
overspend
shoman
zhol
yaoshi
ldpr
galvanising
forzani
wheelan
kressley
ignitor
reely
mindlin
donda
misalliance
barnwood
asile
alker
portnow
aginst
toschi
bettison
multilink
ikaika
cozad
spacedev
lubber
maycock
rotters
penitente
guntis
proliferates
butan
aiha
petke
arzew
offhandedly
gunwales
doumit
signaal
lams
dextro
bakka
loincloths
flutists
calil
lunched
krukenberg
lupis
jysk
belying
acounts
poulsson
anastase
ghostbuster
fauci
timmis
qpo
givhan
ifg
azeredo
powerbase
katyal
xrf
tritiya
claustro
strate
scatology
blokland
upwell
kmgh
nsfc
doutzen
bunu
xkr
myoko
henham
helmke
moammar
ghareeb
privado
ivth
chapuisat
franssen
arenys
mellem
dallan
noseda
borrowash
uclick
blueefficiency
jannette
unassembled
weerakoon
gitt
freckleton
overbuilt
mercaptopurine
budvar
helmerich
geysir
reassortment
ilab
clarisa
revelling
transneft
wolverley
geddis
derealization
wette
delmonte
ridot
allas
chabi
castmember
bergendahl
streetfighter
heringsdorf
cryptogams
kopecky
xaviera
flinching
oppressively
coning
berardino
poliana
geotextile
screencasts
zemaitis
bugeja
identikit
puricelli
botwood
tierkreis
feve
macknight
kadazandusun
zahur
shallcross
kilvey
kalalau
piras
wasg
kopra
kihlstedt
pyrford
gottschall
expeed
morrible
amga
calmy
ikl
motyka
superglue
antiinflammatory
geezers
schlecht
boliden
anani
moorhens
wjxt
esky
pingel
woodsy
mugan
ylli
merrells
asellus
kalona
kebbell
stranmillis
tubau
aures
matache
losi
twite
reinsurers
dtb
verry
superfine
lri
arkhipova
jibber
rosoff
naulakha
fermenters
stylophone
shoney
counterspy
snakelike
chanan
zwigoff
captopril
mirow
cleopas
cronshaw
tretiakov
eichen
bekas
dockworker
atossa
kirsan
nsel
saleha
ihg
tvu
rossellino
rygel
burgmeier
ryazansky
harir
platypuses
jimtown
collamer
cellulite
vrdolyak
hubbardton
zahniser
botnia
reimagines
guero
intolerably
buderus
opendoc
ranu
laczkó
thundersley
fatullah
airo
propanediol
bulent
anjem
peashooter
postgres
landguard
hras
conjurers
nymark
erek
orleanians
sindona
forer
gothick
greeters
ballykissangel
rester
pistolet
kreitzer
perogative
eatin
elvina
kudelski
stoev
newsbreak
chillingly
aydan
jmf
actavis
supernanny
multifuel
syerston
silverdocs
mikail
galkina
kerrin
changé
guai
kingmakers
markas
playdate
mssa
comprehensives
kenealy
bushwhacker
edginess
vgp
helbing
toddla
solovay
levinsohn
antoniuk
baldursson
jamelle
hoback
mollen
blixt
jmr
taffarel
gozitan
bolometer
braunlage
mabiala
borghini
flandrin
libe
radicalize
octreotide
homogenize
irrigators
hanuka
zeff
meilan
maluf
agostinelli
narsai
lipitor
nifedipine
akhmim
tripti
guitart
saavn
xiaochun
rotberg
uckg
erdf
massamba
cilgerran
adeptly
ryser
everist
moomintroll
overwater
talansky
substitutable
mousetraps
gsusa
knmi
beville
benicàssim
fashola
questi
malmquist
piche
feltwell
reattributed
seiichiro
tande
psychoanalyse
melniboné
banzhaf
bouveret
stemmle
mcnealy
tude
diethylstilbestrol
rendre
nippers
inacio
supercruise
harpurhey
herrenhausen
lipoma
winogrand
riefkohl
heslov
petrilli
micco
harpham
saturna
rosemarkie
isara
maqdisi
julyan
abasolo
shiftless
disbelieves
yilong
manzie
carcharodontosaurus
brandolini
flacso
nolita
edds
flashforwards
tunks
baruto
hotte
waipoua
rensch
drippy
contentedly
whiteoak
sidel
christijan
niello
longniddry
brcko
diisocyanate
lagrimas
perenco
thoughtcrime
ryad
venit
walkabouts
copses
againt
hto
venusians
mayron
stainland
monied
mulisch
eminences
chartbuster
pensford
reoffending
mbanza
tentation
pafford
argentinosaurus
grushecky
incb
kawalerowicz
exagerated
wxix
bricmont
distressingly
camelina
melawati
femtocell
authoress
havili
niehoff
philtrum
dowdle
parrillas
johne
bathwick
koret
mockers
klippan
yalong
unowned
furo
cking
sirènes
citycenter
gjoa
sieberi
pavlidis
kcci
chamique
ghalibaf
boedo
zanes
pankov
peapack
erandio
riordans
absolutley
kewl
unprejudiced
alwaleed
eggan
nethercutt
russom
ventry
maquet
spiritualistic
karkare
ance
heglig
engracia
koeppel
bychan
kierszenbaum
kluster
homfray
zwz
iacc
grossglockner
mulet
ehrlichiosis
raices
modafferi
tedglobal
seabus
kuam
xiaozhao
kellee
niesen
kopeck
cafo
cotidianul
eamont
coody
dalbir
rozi
hvb
maben
niessen
harnham
mitarbeiter
morigaon
saddling
nazr
entourages
cailloux
antoaneta
stankov
mphahlele
obsequies
ostern
sicel
yachats
etxeberria
sadists
phreak
blecher
ceratosaur
sinjin
floorless
manises
bialy
umizoomi
déborah
choonhavan
hinchley
geringer
recapitalisation
phyl
aals
decaux
crashworthiness
westbroek
farivar
shaftoe
dairymen
dundrod
ergotamine
brookstone
wapner
eyedea
wagenbach
espadrilles
ryberg
smike
jiminez
nebrija
gerde
pugilists
bellmawr
zore
wtvd
resells
pust
plaids
cyberrays
wek
oborona
esmer
gerwyn
nuan
cheapo
sealskin
genin
kinmel
amien
catechins
gautreau
kirit
coolspotters
konishiki
essman
voyles
ohlmeyer
xandra
clementis
hiep
molluscum
marrus
raziq
arvey
borates
qudsi
biw
hensman
puffers
decreasingly
narwa
sanj
kawar
spheniscus
unisphere
scodelario
syncml
gajdusek
springman
aapi
santogold
boughner
ashhurst
zaouia
furuhashi
santes
selye
jmj
matsa
lancelyn
provenances
sandee
lonhro
htx
donté
tincher
medicinenet
plateaued
abaga
mercuri
lesieur
recourses
putas
disklavier
mortons
countermand
intersil
cataplexy
unsubsidized
nahta
letarte
bozhidar
freediver
nityanand
garabed
baggers
trivago
cilegon
atencio
phantasms
groundstrokes
mdct
gvm
aalten
langfield
galder
meraz
fictious
basharat
yant
heartbreaks
dierker
lempert
muha
scuds
ceq
arriagada
ryb
metacafe
vaginally
farsta
bristolian
osterhout
hoppa
sabca
assir
manorbier
golinski
briedis
pasricha
zonday
ipms
antispasmodic
mistura
khasis
undrinkable
athletico
rikabi
fröjdfeldt
stefanski
hollanda
mercredi
suder
norteña
regales
amrinder
shoshanna
shoebridge
lizárraga
smss
huaman
isabellae
ekmeleddin
rogovin
ercs
opw
heubach
bellamkonda
somani
rashba
gillanders
koetter
lanced
weathercock
mcgarty
bandbox
anitra
rorqual
cutsem
quantal
artcraft
dtf
dermoid
pantycelyn
somnambulism
amasses
pallonji
laywoman
electrospinning
thirsting
tritschler
sarwari
cogency
ashover
levisa
raeford
glori
kleven
diadems
recommendable
aeternam
gerrick
prosecutable
bogar
chiau
iheanacho
kaag
tadiran
linemate
costabile
lingan
multiservice
tsur
presupposing
toyohira
meale
abengoa
sixer
savusavu
jeannin
masatsugu
tattling
tamsyn
monsef
gulland
pianka
jignesh
chanco
bouie
luath
sope
forsa
energise
mcleary
enb
baristas
lovebox
hangeland
snet
grammies
bagian
premenopausal
seubert
plages
fiddy
omnimedia
cowing
torger
rwandese
viraat
orlistat
kinnersley
kamaruddin
gottschalks
klukowski
chetwood
barstool
wuerzburg
niavaran
teasel
wanhua
errante
karembeu
gulda
terreros
trudging
masco
byck
donaghey
graphitic
yablonsky
occlusions
craxton
kovacic
stampers
ryobi
banques
volgodonsk
cprs
hagge
sohal
speccy
sked
consumerlab
bhaumik
ferko
higly
dardan
errick
retesting
misbach
magaliesberg
kabbani
vlieger
belzberg
hupmobile
enculturation
olimpiada
compatable
antman
bagirov
bagpipers
goofiness
fanchon
antelme
fuzzed
idms
dadaab
micali
wuttke
tatenda
whiggish
seher
reconciliatory
moncoutié
dramatising
dibden
dowman
gurin
xenotransplantation
jkf
unprinted
cataglyphis
gigantour
positivistic
kelpies
mansudae
thermometry
gameshows
livs
penalising
bibbed
cuk
transshipped
minarik
paleographer
ifis
feigin
mururoa
zyryanov
laureles
ruini
mogae
mraps
nerazzurri
pice
kakul
seidemann
raivo
maximino
esterhuizen
jacey
hambali
copé
morford
nagaya
mellouli
sinecures
primorska
bestest
eyeballing
macd
overharvesting
absconds
antoniadis
anesthetists
teston
ertz
bushtucker
exi
maisuradze
ambidexterity
larking
australopithecine
workmate
mcmillon
hegley
airds
graveled
millilitre
eastbrook
yake
sentimentally
outsmarting
carjacked
vorobiev
thrillingly
flutters
tortas
wickrematunge
azita
,where
maltzan
wodaabe
sccp
itaewon
bikie
fusobacterium
samadi
navstar
educacion
polyus
haribo
conventionality
rautiainen
consolo
hypomanic
hsph
shangkun
winkelried
guiry
grrl
payg
audo
chatmon
maaa
oomen
mammalogist
butenko
saharawi
donee
yir
gibralter
canoers
catts
postum
ajak
unnacceptable
layback
vlcc
rapf
colwood
minal
sompting
miyashiro
ponch
segways
langenscheidt
tempsford
overrepresentation
kvinesdal
tiltman
johaug
bombeck
nsls
colsterworth
buche
toady
blackburnian
thickeners
wenyuan
deliberatly
rosenhaus
knoller
kolodny
poplin
lustmord
tamela
siendo
copiague
eev
unblinking
derülo
cargolux
pates
gwaith
klinck
dzhugashvili
montagues
glahn
brox
avms
gresini
wtoc
jeoffrey
fout
halban
cliente
pkware
henbane
mothercare
bellybutton
earpieces
videodrome
seing
amabel
jhp
cortazar
freshfields
boops
mazzella
boym
kistner
rodborough
rathor
hentrich
neckwear
majko
waterjets
sauri
benzon
autopilots
platooned
lewisboro
iuu
kisiel
idealisation
bonifay
roxann
greenline
rbt
normaly
maroth
bloco
gillain
horenstein
adrie
steege
rewatch
iweb
catlike
banri
patriotically
lewe
scheuring
dräger
snowline
dagobah
plange
okimoto
affliated
houndstooth
rhon
luxford
goldtop
arthralgia
cesky
hagey
ayoreo
daalder
mongala
penz
rapu
alladin
isohunt
essentia
pokhrel
cuing
frenemies
velocidad
ogbonna
linet
kuhns
kaiparowits
brabec
rothblatt
grossinger
nirad
enwezor
tooze
vézère
tayla
bloatware
swh
rompe
netherthorpe
transalta
silbo
grenda
sridharan
spectacor
crimefighting
louima
genoud
millilitres
incentivizing
iweala
geoglyphs
malloys
easynet
meladze
boldak
baldon
aghajari
gorak
qaitbay
chuquicamata
offically
chocano
melwood
awsome
waleska
salicylates
ambarawa
biffi
tyendinaga
stotfold
imclone
minjok
runabouts
oltmanns
abdelhak
kebe
conures
poweredge
rowans
dyspareunia
recolonized
prezzo
smallbore
prinsengracht
stanborough
vandenburg
whch
camming
standoffs
lutèce
luwak
cortinas
luxenberg
duratorq
degc
bussan
bouyeri
manzon
irbil
ekotto
bayman
whippets
balikatan
paddleball
niell
wiggily
adolfs
ifoam
perrino
wlross
toyz
gaoled
hydrangeas
bougherra
erap
francey
baute
francy
hegge
reinterment
plungers
zivic
zelin
countrypolitan
winkies
ldm
yukata
slimness
posti
kolko
mahorn
checketts
lhm
bohbot
braying
tunisi
sorella
thorrington
turds
yangcheng
mazzanti
jcf
danida
demagogy
tkvarcheli
schlein
bronchoconstriction
falkow
loping
jeita
osumi
noddies
manorville
kxan
nlos
brodgar
posers
finos
quangos
shinbashi
qai
galdo
caking
terrey
murman
palmerton
eury
cystadenoma
cucu
dolbeau
borjas
goalby
farzaneh
greedo
lauaki
nariaki
maneki
chatri
hurtgen
hsupa
nanofabrication
dataquest
eitam
intially
cabbies
frump
rebury
gafni
woodies
sopp
mobilink
iloko
plunders
multivac
metab
chatwal
centerfolds
dusi
mytravel
shayler
steinle
ordinands
mease
ravencroft
papademos
aleesha
derailleurs
dessoff
dosha
glasbury
talash
namechecked
purnama
arganda
calhoon
argentaria
némirovsky
averitt
afuera
webroot
braceros
mekorot
roschmann
ravidassia
wez
darmody
mkhize
klecko
fleeced
cecere
breyten
tbu
conficker
nanjo
stinchcombe
choderlos
pein
unstinting
focht
kobal
judee
lezgin
koleva
dalilah
kht
westminsters
zambry
pialat
perfectionists
bylund
amadei
arrecifes
sdsm
safiya
jacobsohn
cybil
ssis
tevere
ipaq
bemerton
barsa
twirled
vanille
mellini
suckering
guebuza
aquanauts
spisak
swiftest
vinet
greive
goslings
tigerlily
rosnay
mizos
harlen
wesolowski
ceska
batcheller
ryazantsev
thurlby
hennell
eleme
fraiser
irremediable
polasek
janeen
alne
yitzhaki
jawlensky
burgettstown
affordably
zewde
nocht
hulet
icaro
basdeo
decklid
zurek
sabean
adjourns
maaouya
overmuch
nabel
mekelle
jiulong
atran
anthimos
fettiplace
salesgirl
gibault
lineas
hatz
pedrick
campolo
jid
teatri
grabau
altruist
heeler
pennekamp
omaar
vtp
wsfs
toshiharu
haberfeld
wincer
bulmers
cennamo
soirees
abalo
chabal
bourneville
chelwood
adages
aimard
garwin
koepcke
trimalchio
shinbo
riddims
gottardi
badwan
carvery
leashed
schaufuss
ballybay
pintar
weena
micrometeoroid
papadopoulou
proske
humanlike
villatoro
gazers
rescigno
cocody
bosher
abolfazl
manukian
hannifin
lamphere
gamine
coverley
tippler
screwvala
demong
netjets
mue
jethwa
txiki
citabria
wakker
rolvenden
semidesert
audlem
saurischian
paudie
notetaking
riddhi
danley
alberg
maddeningly
crudeness
pagulayan
acat
mossie
patillo
saipem
crabgrass
nanosatellite
defination
bellefield
conradie
megna
shaer
matalan
karia
dharker
seetharaman
jathika
oehme
pyracantha
hitti
luckes
invensys
athelete
rvi
oxeye
hirer
quarterbacking
minkler
garnes
seribu
klaveness
msgt
gncc
krizan
disqus
vigurs
degussa
petersons
gumprecht
riskiest
podres
gangeticus
talarico
poth
taseko
spolin
wrightbus
semmler
methot
mbenga
pertile
toubin
shamong
ncds
asfour
wellbutrin
romanticizing
lamis
jashn
lazarou
assailing
kamins
guimond
oleifera
chiefland
suprematist
parthenocissus
banaszak
pharmacodynamic
akila
mshsl
mannini
tassotti
steeplechasing
honeypots
péché
gorefest
reverberant
sherie
abena
flighted
pucklechurch
lavonne
montebourg
conversazione
wfmz
influentially
sonde
requited
creepiness
ozcan
posselt
meadowlarks
bunchy
haught
fluorescently
tekwar
burkinshaw
pinschers
tristam
palfreyman
strivings
bresnick
posch
thatchers
rilles
dunstaffnage
adjaye
pacher
smolenski
schepers
bushkin
decongestants
greenies
miet
smallworld
cryptozoologists
kielburger
krishen
behnken
avv
boretti
vigilantly
shakila
petruzzi
greentech
wrockwardine
lere
verey
macko
xtm
britvic
scal
thodoris
matthies
ventola
zitouna
polydactyl
soapstar
lamming
pontesbury
isat
ptd
polinsky
tanen
greacen
mamah
tubifex
sucess
gdps
sibon
eshun
sunburnt
wollert
gwydyr
xcaret
pipistrel
tolles
loie
elvy
methemoglobinemia
ginni
proactiv
fiberoptic
downmarket
escucha
vandam
metula
nodo
taiyi
raptures
cfca
raiwind
scalloway
abjection
markopoulo
obligingly
shoddily
persbrandt
archdaily
chioma
bellido
pagliaro
megajoules
suggestible
hinkson
bivio
kakei
pancevo
khaldoun
artforms
grodner
hushing
forstall
linthorpe
ousts
ustaz
ayot
bloodroot
microcirculation
bigwigs
acrs
sidepods
makinwa
bumgardner
kinson
hanksville
sancar
salaya
keth
rushan
sheheen
unremittingly
kupi
pemra
durani
otm
maberry
nhulunbuy
lingwu
naturalia
dager
lichtenfeld
pukki
romanija
haino
umlauf
bmac
musicke
sahelanthropus
irishness
supereva
sekong
wahlstrom
sheepscot
kohlhaas
bloomin
ultratech
kretchmer
intelius
delagrange
golfweek
minnewaska
isci
faty
valiquette
tourne
larusso
guermantes
dateable
franglais
zolotarev
louche
hurok
rolands
muscadet
winscombe
avvenire
beantown
chellaston
tacchini
unremarked
buzby
roomier
piously
despotovski
vinerian
brunk
lordswood
sörensen
mezzotints
tamesis
francina
vallarino
supercarriers
onthophagus
detestation
erzerum
royces
yanagihara
lineaments
hristos
inbee
yellowface
egu
regionalists
ponding
psyops
zacks
gunsberg
psychotics
hosken
tailender
chicama
conjoin
ngh
nzf
haweswater
kedge
intraventricular
xers
gunflint
romps
uliginosa
surreys
hlt
prettily
chako
zenair
asomugha
kunzru
borsos
boes
jarrahdale
pamintuan
matrixx
lamentably
kierra
bonazzoli
rowes
skandinaviska
mcburnie
romanticize
vuvuzela
moisturizing
circovirus
castrogiovanni
basilea
lappe
stinton
ponceau
wolframite
adma
contemporani
giuffrida
outshined
foncier
quos
khaldi
fenghua
sasho
malloum
tulafono
tibidabo
roychoudhury
televangelism
falletta
glucosinolates
whacker
roset
novenas
guga
hyphenating
subfolders
malveaux
chuffed
jumel
ansbacher
botching
droppers
fritchie
konaka
bamfield
raissa
significances
langeberg
ogidi
meting
morritt
maggin
mawi
whiner
mondriaan
ratter
lerew
torishima
leinwand
grigny
imara
torigoe
rzayev
rieth
henrick
kyrgiakos
feffer
velos
rooz
daff
chadash
kalkan
giannetti
shibley
langeveldt
zhongguancun
saidaiji
panaca
schlanger
reproaching
kni
rustie
honkytonk
maisha
squanders
discomforting
hamachi
freeholds
schs
lirio
burpham
isme
droops
toonz
tanh
freixa
gomersal
viewshed
lasar
zayani
balancers
puw
malorie
borgs
erzulie
merriwether
ksaz
diggi
genette
morss
amenophis
mcelrath
clontibret
froots
keoni
limner
chuxiong
lyburn
phrma
reche
chidester
amounderness
paluch
colbourne
sariska
indubitable
tettey
shuwa
tyahnybok
adicts
harked
fhfa
arround
gyrations
ruaha
discuses
medland
jpt
gyron
ght
lafreniere
lumidee
gravities
mercan
biosocial
assimilative
parasitizing
oilmen
seadog
ethelburga
capitolio
malariae
biocidal
gope
bruccoli
interquartile
bassen
producible
venema
funchess
windbreaker
dimaio
selcuk
laxmibai
richters
schroll
perhpas
orbus
arthropathy
stockinger
floorpan
kiszczak
loughead
araras
estridge
gorry
wildcatter
yedlin
mappers
guthy
andrii
vasospasm
baraan
indispensible
tyrannies
tolbiac
landeros
gunnislake
freecycle
jobi
twinjet
carras
shorouk
maschke
perquisites
shafroth
agstar
yusupova
sticklers
whatta
nohl
nijdam
stoykov
bootes
juncosa
jessamy
sujoy
defoliated
mientkiewicz
anorectic
camarones
aftertax
copperman
waterbrook
achiote
cqs
waurika
maybes
psoralen
langmead
gaudete
vernaccia
ginge
neuropsychiatrist
gsas
ednam
yttling
gilbertsville
lamantia
chodas
spazz
cavos
microlights
munif
oestreicher
cheatle
heugh
balamory
armenio
pcms
dolgov
kahrs
esencia
renfrow
acklington
paektu
eschauer
blackbelly
hongkongers
graczyk
eeu
ahca
shbg
rulin
greenness
sharonville
miscue
nyhus
acclimatise
aerosystems
avrom
gengo
inconstancy
hpm
qassimi
reconcilable
vitek
llandrillo
kanie
ghattas
hysen
lamoreaux
langauges
maseno
multimodality
superbrands
donard
worx
flad
shinju
greetham
harsch
foreignness
cagr
baichung
totty
refuseniks
adenuga
mountlake
krumme
topflight
propecia
snorkelers
davro
muyu
dobrica
annaka
submittal
halba
soulié
rieck
glickenhaus
snitches
whiteknights
massengill
aharonov
stihl
nake
overdrafts
chandrasiri
chion
daytrippers
sollee
talibon
incautious
bathos
bleck
finnessey
bouazza
westly
babenco
lundborg
jungwirth
wgms
biggert
parasitosis
shambaugh
mandich
speightstown
superlattice
tolos
solman
oldpark
aerovias
disallowance
bloodsuckers
surfline
meijers
eyde
totec
gabbidon
flexilis
bournbrook
etherton
lairig
fortinet
zuckert
rodanthe
saawariya
grimaces
hurlbert
afoa
jacquire
rowney
brumidi
chornovil
civils
acupuncturists
trichomoniasis
moneylending
corboy
tripler
badder
codpiece
eifs
eryk
feraud
manasieva
thissen
imperato
etno
portends
lyashko
phulbani
mattinson
hufflepuff
puckered
raloxifene
iruña
gbf
figgy
upreti
marinades
abecassis
motts
sundeck
cryptonomicon
mumblecore
pietz
gluzman
springwell
esqueda
opin
curae
shafting
nebot
capodanno
takeno
joch
strikebreaking
dimity
pempengco
durational
chalvey
dalkowski
correlli
njs
gwersyllt
brandstätter
sirica
warmley
gawk
paschall
mamola
platooning
inexistent
fauteuil
ninis
overdiagnosis
netzarim
denters
kutscher
throughway
wztv
pawnshops
furin
thabet
elvio
moakes
huasco
mamay
payap
miralem
umgeni
warrents
tanki
hntb
bodysnatchers
mannucci
undresses
principato
roerig
gerada
enio
stoltzman
shurman
slory
devilishly
oneto
nkosazana
presciently
bladecenter
fukumura
ranby
mincher
widad
housebreaking
vaporised
bailer
ooms
bengie
saussy
shavar
nervión
dinovo
rabaa
asoke
bellver
thringstone
duzer
defrosted
winternationals
hyne
akingbola
boultbee
underwrites
echeveria
unamerican
mattin
egemen
palaeogeography
lambsdorff
argy
unbinding
brigante
aptamer
marchio
bcbg
flatbreads
earlsfield
shofner
dunay
ecstacy
jiwon
jaenisch
udy
javal
potes
nimis
gerberding
torslanda
salves
busboys
waveriders
houghtaling
lehnhoff
masuzoe
conceicao
kelechi
matley
verdicchio
flatau
asdrúbal
larenz
capewell
mwb
travelator
guangcheng
geotagged
jabel
gamlin
blissfield
darwinists
malpeque
besley
eimert
stamfordham
transaero
exploitations
larena
thall
lukashenka
fermentum
chimoio
purp
bakala
colegrove
gugulethu
xenocrates
uninsulated
laskowski
theodoridou
fsx
prusik
rgk
werts
astound
kuhrt
olema
meditational
amess
algarotti
wendkos
ashrawi
practicals
doubleton
viscosities
tge
spaceboy
shoka
wahls
szenes
dcaf
bresch
handgrip
navteq
shyly
pamper
shokat
lasi
shibetsu
washford
varias
zuroff
silkroad
orh
karmah
namedropping
macbeath
benigna
derrington
bodor
fairlington
nishizaki
letsie
bolet
explora
eschede
talgat
qilla
chitungwiza
dubie
viriathus
akili
cryptologists
barbies
howgill
yanchev
seversk
rupinder
gardnerville
gleim
mermet
vezzali
boseman
myun
gabions
boyers
overreached
leoluca
zwol
levithan
xiangning
collagens
kristinia
leslau
edtv
wlp
codi
bielby
togaf
vawa
wrda
olev
vicker
johndoe
hensen
iov
chuggington
gardos
otso
teir
dodworth
hvp
protostars
morana
nfln
giric
kigen
spitter
overscan
harat
walding
shemtov
meeta
aepyornis
mezei
schayer
fibroid
yatsenko
einsteinian
hobden
sprucing
thow
marinol
meguid
ruppe
perjurers
cepero
koshiba
béchamel
aspden
verny
changeless
jene
elektroprivreda
angelin
matousek
mortlach
bobbled
antitussive
fluffer
siddarth
rehung
xenograft
overawed
councilpersons
incanto
chipboard
bernas
dehumanize
jibjab
spermicide
inhalational
ahlin
pigford
saik
gavage
khatir
mytholmroyd
momchilgrad
rori
midea
wega
sorgue
rosling
wiveliscombe
coruna
contorno
boffins
aún
bouwman
yero
awre
blixseth
hihifo
milram
arnarson
superdrag
nossiter
cristofer
riddlesworth
xer
wlox
benini
msic
internation
haury
palpated
scavenges
footrest
irpinia
jcg
hendriksen
motonobu
akyaka
stroganoff
alphin
kollias
manovich
peterle
carhartt
kuhio
tattletales
trevisani
gwendal
pennings
neachtain
scomi
grissell
sharna
papermakers
tadej
stargardt
onecare
scrounging
grewe
millender
dhai
craftily
formateur
merwan
covelli
sakkara
bastianelli
gerena
rasm
betley
latrice
colombus
fdle
blights
bernabeu
lilliputian
pierangelo
cossio
lipscombe
glycols
deitrick
malua
qma
huaqiao
xinsheng
boria
rosental
fiaich
durgan
bryag
knowlesi
contrasty
dowagers
yulong
slighly
slurpee
kosheen
dubov
airconditioning
europes
therian
dtra
chartreux
lrf
tfu
cordera
funnyman
atheletes
minestrone
yemassee
freedy
celera
demant
loquasto
euregio
vanderlinden
clanking
courteau
scatting
schily
deiss
kidan
stiffelio
rerunning
wiuff
strelitzia
evron
mistype
micklegate
kalnoky
otunbayeva
slowe
alsatians
leszczynski
khreshchatyk
tsitsikamma
stensgaard
gugliotta
shakai
baratashvili
cimini
yasawa
motson
pxi
zolciak
emmetsburg
riascos
eustice
kimora
newtownbutler
miniclip
deemphasized
jungfraujoch
mercuria
tassoni
brazda
groveport
phulbari
edgaras
firmani
kubby
limin
blerim
raschi
haigler
popsugar
visitscotland
tellis
staffroom
birkerts
wraf
darek
mosko
epton
radiolocation
oversite
ombudsperson
lambot
dunville
foudy
tiffs
cenelec
moov
cleminson
hogsmeade
amanatidis
kobach
spartiate
alkire
tenuissima
patels
pinger
condy
rolim
neatest
dirda
hajdari
beazer
dande
toones
powerstation
ilec
jaybird
dafeng
medhin
finklestein
skywriter
yary
megale
sipc
rajdeep
authoritarians
prabal
suya
parros
publio
sulfamethoxazole
zano
encouragements
estec
gamon
kintbury
evered
nonfatal
godon
bacau
mpondo
mobis
romario
delfont
gorme
demián
glenluce
liron
werre
mosop
kakabadze
talhah
dossett
galon
ivorians
kourouma
seyval
voiles
eskisehir
sejersted
sheepwash
thorius
aspira
berwanger
onl
intone
winfrith
sunniva
mainers
ayed
mkg
negritude
leyba
betim
dermochelys
crewmate
calmest
rossmoor
volturi
raniero
heliostats
bulteel
brigette
mangope
starzl
wildermuth
gherardesca
baitfish
unboxed
danesi
tariel
hildale
morelle
siat
stefka
follia
agayev
jonás
steelcase
photomask
ishbel
stachura
tirmizi
sunuwar
buehlmann
margelov
kitv
anote
vorticist
catfights
wysong
wideout
muscadine
gilchrest
villians
switchblades
teutsch
rdas
restaging
gallions
kenter
hitlerism
cusson
duisenberg
gcms
qcf
trattou
fitoussi
itns
birkenstock
lisping
alzado
danniella
girgis
saleslady
vukov
chupacabras
gunstock
roulston
retrogression
salloum
vassileva
smerch
siyabonga
meisinger
lahinch
unice
sperlonga
hemlines
rampaul
giresse
feser
doorbells
greenore
starobin
jambon
bertrando
tannoy
ledoyen
macmillian
dobsonian
pavicevic
hollamby
tocotrienol
troian
gft
babji
okalik
escriva
rajbanshi
savoldelli
preddy
stethem
shovelling
fuzzies
bffs
storkyrkan
silverhill
bluetec
maaskant
angelillo
olisa
kabunsuan
markinson
saurel
goyle
interdit
guidant
trisakti
nugegoda
candaba
capaccio
paimio
gizo
arbeter
furen
shippagan
nely
animesh
shrm
wodiczko
makram
rohloff
demare
jiuzhaigou
crmp
muzi
ordina
falkingham
omand
suzlon
siedle
scaglietti
piotrovsky
pescheria
bondies
hopefulness
smy
durlston
pharmd
doormen
gashes
hexter
unrewarded
pentlands
psychostimulants
feudalist
standifer
taschner
konate
gillin
pliego
cacs
hicken
verdura
condict
dalecarlia
acmd
groseilliers
waxcap
urias
jerel
mylonas
sakellariou
wangs
bigbury
mustoe
topete
poled
smirks
llynfi
mobin
phab
gundel
misers
declinations
dallinger
aerator
hanneke
footprinting
seffner
treaters
marcianise
burnish
jamail
robbs
niñas
cheston
mzimba
freakishly
isokon
doxy
kxjb
parvaz
crosshaven
lookingglass
sambi
paciano
cockleshell
hustad
hotlist
verschoor
cifa
carafe
kobelev
doney
lumpen
bellavia
protazanov
stahlschmidt
herida
mindtree
ehrhoff
asgeir
reappearances
gambrill
inishbofin
stadlen
belonger
marchionne
nka
ziq
lychnis
maeder
pentreath
nhr
levanto
yokoo
spinothalamic
pratto
maquiladora
tomlins
maeva
tumas
noyd
desikan
petechiae
kanine
riteish
helplines
fefferman
adelita
liansheng
yardarm
degroote
bezige
jeebies
setty
medair
jmw
duvernoy
biolay
amaris
vdt
sanal
laam
miramonte
livestation
remen
derkach
impotency
bruss
melies
konecny
gurdeep
preconfigured
moriarity
nystrand
hyperinsulinism
klyne
pampers
stratmann
choisir
muhsen
yumin
canelas
benzalkonium
murrysville
bernes
antipholus
ezeh
campout
zhongxin
maltster
assche
damaru
dunnage
bunds
adastra
tindemans
goorjian
jims
lijo
katten
hotpoint
becuse
thomasian
tidiness
tmvp
zhongjian
sufferance
zersenay
baralt
cioroianu
sarkari
chaitén
guaraná
illiam
corless
hassanein
morys
conesa
llao
hishammuddin
shuichiro
zarifi
bauls
attalla
wwu
ackerly
ugrian
gdst
asadov
jangmi
hoong
assents
bja
affandi
emceeing
usag
genz
mellers
waah
techint
cheddleton
nute
spdr
cokey
ashburner
fakty
giannelli
bührle
gorlitz
brodsworth
nhf
kazanjian
teddi
motovun
koci
amerks
breithorn
topcoat
stohr
callegari
steiff
kozar
longenecker
biers
heldentenor
caffin
armelle
coddled
hazels
dutchy
damaliscus
clavicles
nephrops
roey
jansa
bromate
beechgrove
yahyah
stembridge
veltri
otton
firebreaks
adeane
davari
kinderman
hayfork
sabath
geordies
paharpur
millhauser
epigenetically
kandia
sundon
foulsham
wawarsing
workrooms
muckers
strelley
morayfield
qvt
graining
piked
tohill
stratham
smyril
ecbc
mclucas
bisso
routable
antolin
feedburner
bindaas
gromova
suelo
grassmarket
checkmated
denilson
lancang
huebsch
metier
googolplex
fascinatingly
lybia
serendip
norihiko
grottaglie
brodkin
rabotnicki
agbar
grecians
bowyers
anchiornis
qbert
turab
burgio
tayback
geschke
sawhorse
fleeshman
exhumations
perfectv
middleville
elsasser
cranswick
berdahl
camco
arreton
resistence
jacarepaguá
werbeniuk
weisgall
misconducts
militates
codebooks
njenga
kadison
glenbervie
daigneault
skunkworks
intradermal
anj
qassab
datastream
sinisalo
kvaerner
walthers
friedensreich
wesham
rageh
perryton
kidner
kandra
abercarn
cavens
dovi
subra
hender
carsport
furmint
schjeldahl
dannel
landscapers
pigmentary
fbl
würtz
furbish
sarcos
stuc
fenin
remes
edms
memari
locicero
dorfmann
hodapp
teosinte
watkinsville
fountainhall
seith
inspirer
jireh
issuances
oceanology
rawk
cogley
victimizing
measureable
nnt
kesri
pastilles
oseary
wendron
oleoresin
greenspoon
rouyer
pencader
hotung
tonn
klengel
assertations
stammler
abiword
autech
heene
kligman
espcially
bacevich
afganistan
entier
jpf
adomian
wect
deets
winos
zenko
cyanohydrin
jamshidi
rajadhyaksha
nockamixon
rajmohan
ajusco
transection
villagran
gowling
moultonborough
sikma
bahah
samdrup
qsa
munnerlyn
basciano
wirkola
pezza
robaina
particualrly
prioleau
hanspeter
bitner
knsd
mbengue
sheberghan
flinty
maon
apob
membury
pockriss
hussien
deerstalker
hendricken
vindictively
laurene
chinedu
someya
odourless
notter
bluma
heddy
mcgeechan
nabateans
sokha
hammack
prognostics
vishnevetsky
progess
oasi
lacouture
gwala
razvan
shirzad
hugman
vanette
adresseavisen
aleksanyan
zeichner
babas
bews
pasni
bassanio
fleshly
ophthalmologic
binkie
dadrian
cesenatico
geduld
shoen
cipro
aquatints
sasin
poelvoorde
buzzanca
orthotic
mingulay
thickset
bbss
dilson
hazarat
oswiecim
goedel
zandberg
insted
xla
glamorized
sheran
reeta
infosec
unengaged
muckrakers
marianelli
kippers
buentello
rybolovlev
auvsi
cesarani
kotchman
spirometry
markfield
sakoda
constancio
avellan
planchette
deaderick
mope
wheatly
mixner
gusenbauer
cfac
diwaniyah
infosphere
winnberg
stiliyan
appleford
antich
seydel
varco
bambenek
downsville
tulipe
vabre
geoeye
adipic
heebie
glemp
jaleo
pomc
klees
crowberry
workless
rueter
lapper
rongcheng
radiopharmaceutical
nalgo
ceratitis
sensationalizing
derma
siput
daljeet
storebrand
hizballah
anonymised
hids
scee
luzinski
sandars
carmello
filice
rotas
huur
asiya
zanda
duttine
dixton
niceto
merrison
konstantopoulos
acetazolamide
sincerly
picou
plaats
uncompelling
skycycle
peatbog
yutian
piner
dsos
siegmann
gkids
timofei
tompion
jakobovits
dempsie
eryri
impington
beauval
puchi
salway
jins
cannibalization
rivarol
shawa
mangaung
godstow
kellenberger
schieber
ofqual
outloud
talulah
instream
dolle
dolgarrog
piddock
limi
yuyama
lustick
shishapangma
mascha
furedi
tropiques
prefabs
eisenhardt
cropscience
earlsferry
tilth
ibraheem
parliment
nephrogenic
eaglescliffe
jazzie
ethe
eastbank
biggerstaff
usurers
lourenco
bewes
intercropping
europeanist
cymry
szczesny
knoke
keyte
roopesh
honorato
weepers
mikaeel
queijo
ayanbadejo
moonface
limthongkul
markethill
zagunis
lcos
altig
hondros
steakley
illah
milou
merrilee
illarionov
albeniz
yetis
shier
thurmaston
frak
moritsugu
rutted
ikhlas
hamdaoui
barsotti
unselfishly
furlow
bravi
godwins
cecconi
antwaan
unclearly
changemakers
byob
kookie
terseness
southchurch
yuhuan
anycase
reinga
isolina
churros
moiseev
cdsa
pridemore
stickies
fbg
shinners
mcglinn
echs
fewkes
symonette
fasil
mezza
therme
aggelos
moonwalks
orangetown
yandle
subparagraph
jazzier
cines
ullico
hitchman
endsleigh
trademe
fernery
fawsley
manatt
voge
petrassi
qurei
suau
rimma
borracho
stiner
éclat
drem
forestdale
axelos
bobsledding
ramfis
diani
baxi
anchusa
hyperventilating
vougeot
ostp
villaume
cames
rajpipla
kolpino
licet
morellet
kettleby
heugel
sumatriptan
demeure
ansary
teece
batia
delvina
indoctrinating
hemiplegic
teko
workaday
pulitzers
pyt
lelie
kiyani
gemologist
corymbosum
darcel
franked
shiono
lapuz
rabeh
wcmh
viettel
strykers
cherryfield
mournfully
mosinee
rozanski
sohm
compstat
zhixin
skeletonized
normatively
béarnaise
nestler
snodland
addin
kirkliston
demick
baturina
wasner
summerhays
unreduced
audiocassette
tmcnet
aucun
bichette
strine
sulphides
chainrings
oonagh
whiteclay
pachman
falcinelli
abimael
deckhands
ctbt
collura
powertech
explantion
florrick
hearkened
seend
skenderaj
ulcerations
tennen
camaret
pulvermacher
fusaro
customizer
basca
adik
dumpers
blace
specifc
deluce
zgoda
alsip
smugness
hublot
bioterror
pettyjohn
geat
grutas
wtov
montelongo
budinger
reshad
sonneveld
rmo
xpert
cihat
reassuringly
madikizela
cantle
calciopoli
rezaee
internalisation
kurka
keddy
cuajimalpa
displeases
deliquescent
sahani
trommel
jinchuan
ratepayer
beml
wbcs
chw
labbadia
wintershall
abstergo
overweening
wahida
ullage
watchband
manina
clouthier
atin
relativly
keyon
toyosaki
excretions
germon
warmongers
narz
speth
jungk
stonefly
shoul
bugno
gigantor
palamau
disingenous
seige
behaviourist
vaginoplasty
guyan
udalguri
schyman
bilthoven
submunition
goldcorp
phyllo
wagenseil
luyindula
streetside
ivanschitz
groaned
mohammedans
dayparts
medunjanin
gilkeson
lechi
welburn
kagari
valere
˘
mugabi
pmoi
smotherman
tateishi
anner
chersky
djourou
palley
sooden
eleftherotypia
chromophores
airbridge
utsira
narsingdi
amplifications
chimerical
mcmonagle
staplers
colborn
kuck
abrigo
leadgate
vanasse
kumars
abdolreza
ezekial
commissionership
partys
morrissette
vocoders
ooltewah
woodturning
drtv
wdw
chage
launders
charyn
matalin
enesco
sightless
roseline
bipropellant
hohensee
kiken
liferay
khori
mourneview
henzell
pierian
scoones
compiegne
anglophobia
viteri
schortsanitis
samsudin
precipitators
yanaka
pinx
freshet
izhmash
friman
minnaar
betrayers
neurolinguistics
endocannabinoids
controle
osse
massiel
bruinsma
miragaia
pejic
hearkening
alee
hamud
kotche
leuthard
myddleton
gasmi
vender
ventersdorp
landport
ervand
kingshill
antimo
gedman
tiddlywinks
maceachen
moonglows
subscapularis
apurimac
elkem
bijon
slota
tarsila
waen
nasw
punga
businessworld
crunchie
shariatpur
hufford
breazeale
eyesores
helpmate
bifidobacterium
juventino
thion
bilimoria
shoutout
enochs
bachianas
usery
stiassny
resan
schefter
stoplights
personajes
corpi
lzr
micheldever
sinaloan
herscher
natalizumab
hlh
froehling
centralists
rsis
untarnished
misgovernment
anad
rimmel
seli
britwell
wetteland
peetz
putain
anjema
roccella
cwrt
emboldening
labrang
concealable
cleverdon
eynde
showtunes
mindfreak
chintan
sinor
caseloads
unionizing
descision
bater
wanja
tstf
morningwood
nicorette
citronelle
seaming
katehi
kkh
lassitude
toynton
gamertag
natra
brijuni
tolani
vellu
kogut
cheddington
augarten
teza
guitarrón
stumpff
fleecing
anikulapo
grottes
toia
jaimee
strensall
communing
ests
bouchette
gansel
pagliarulo
filppula
ateret
mcts
sturry
bologoye
weoley
phazon
surprized
lieberthal
dcvo
cuzzoni
dehua
steir
ribbeck
izzedine
elq
larche
anaemic
garfias
brackens
kickstand
cooperativo
clementines
wtb
bandol
morleys
xavante
gkp
robens
dreisbach
ofws
amelle
lyndsy
zarooni
sphincters
ouzounian
buffoonish
jayasena
discrepant
azizan
johnette
mcz
janneke
mastella
lipizzan
darrall
kohlhase
fefa
nagakura
kasimov
delatte
darenth
reichart
wormy
tavano
jordanstone
lasater
methamphetamines
vedeno
benzocaine
protectively
maerten
nyiragongo
zaccheroni
anacaona
kernville
bibury
goldenson
reggia
oeschger
winthorpe
thela
fibreboard
barnsbury
wrd
debolt
kipkoech
goproud
straley
duralde
goldbloom
aivar
goosnargh
piedimonte
logvinenko
haptics
yasgur
jafarov
wgu
superball
turfs
carthay
ntaganda
paing
ballistically
annihilators
blacke
harrys
oracabessa
forro
shurley
zuba
thurleigh
zydrunas
magnums
artworld
roei
giannino
religare
fundin
innovates
aizlewood
semra
kenwyn
ouimette
sangbad
blonay
moisturizers
wonnacott
gruhn
eddleman
kesteren
polysomnography
groundskeepers
taffs
illuminata
quickstrike
milcah
autoantibody
saumya
cruciferous
recker
leandersson
neustadter
sestina
harjit
prestia
jimoh
mudde
regnerus
bockris
utkarsh
paramananda
epatha
communality
varient
nanson
tarakanov
reids
sabis
ciit
cacus
becs
fumigant
crimond
inditex
maheswaran
emera
proelite
handclap
aumonier
phytosterols
niederreiter
tnbc
lipsticks
phonies
ullin
gentrifying
bettmann
osterberg
evette
grayback
sarif
alehouses
dovetailing
cluxton
nuzi
gooney
saffah
slorc
huanghua
tuch
sultanas
vimla
armerina
tomac
abdulle
seybert
hust
heacham
haematologist
lucidi
mamaev
fagor
jsx
nonviable
backstrap
kpax
zisman
allodynia
kibosh
olyroos
smolan
lappé
capitalisme
porphyra
quarryman
senia
coiba
parkington
morimura
oxidization
reedbed
katcher
wafi
replogle
pencilling
kriegler
aliquam
metus
fairuza
aioc
cerp
arbonne
makenzie
indiepop
supergun
homestyle
huevo
stonebraker
lafell
zeidman
recuperative
chrysothemis
unlockables
abdollahi
dewatered
eyler
warnick
laniado
haeri
palazzina
braemer
hosley
dohan
masuka
bellerby
zanja
minnpost
towyn
deflator
nikitina
posin
burrup
stitchers
dromey
rá
cacheu
siyad
lockney
blit
jiggling
hirokawa
giya
massinissa
sheilla
chollima
mularkey
zuoren
eridania
xiaotian
nccr
addonizio
saxman
koumas
pinarello
baldomir
clemmer
ktbs
inculcation
marlan
jehanne
wiswall
semiquavers
ventanas
speas
appignanesi
komoro
nvqs
kowalsky
tetrahymena
buczynski
zindani
katanec
aldag
beccy
gradishar
yerxa
hogfish
datebook
grandmaison
kayam
foreshocks
carreno
branstetter
kerameikos
notturna
cocozza
josimar
owlpen
aguinaga
chulym
bersham
storni
muscularity
tannhauser
gangemi
sentinelese
summariser
paravel
vassos
wdtn
elop
mishu
meylan
butin
bunke
pocking
hdacs
mahamane
xinqiao
klüver
mobitel
nanocomposite
afognak
stonecipher
mossend
rumtek
gilderoy
doppelgängers
nissl
suvorova
zachodni
fowlerville
ssps
sevres
whippingham
averoff
organica
gooders
niteroi
brasa
ptes
glasman
saragosa
breadline
svf
akaash
grimaldo
dilligence
gbt
wussy
namby
ugochukwu
flagellants
deneau
pavelic
gilhooly
uhlich
lebert
mardini
mxp
ahmer
wlky
statman
kkl
temel
helmes
deputise
hilgenberg
quadrifoglio
paydirt
turteltaub
pochettino
anointment
hcfc
cheeger
wretzky
cunniffe
grandmama
sharn
baraza
mccreath
salkin
ruti
eiriksson
meriva
horam
sydsvenskan
verdoorn
contretemps
homogenizing
subleased
glorieuses
sobin
engelberger
overlarge
khandekar
tumorigenic
lalami
micrometeorites
wely
clubby
mocidade
feathertop
allehanda
rademaker
thekkady
carpegna
ipab
salum
vendace
quitters
mallat
bindura
nectars
wabtec
cmbs
financings
hué
antiship
lighton
fullfill
procurve
bourcier
beji
renseignement
larges
tacuma
wembury
piltch
saher
superhydrophobic
ineson
fondled
condescend
leithen
atcha
initialled
shillito
operationalize
grolle
larrazabal
unmovic
apologetically
ineptness
insana
redial
zver
windfalls
squeakquel
nowrasteh
vladi
aardsma
pneumophila
ceriani
homeware
dazzlingly
countrified
fledgeling
adesso
neistat
andia
forden
saughall
gardar
nacewa
reorienting
astrologist
dowa
hobbins
crownpoint
nagareyama
roundell
feck
moelis
qataris
sural
consolini
depaiva
couts
vws
ismaila
ependymoma
söderman
teobaldo
reems
cafod
santen
hadag
peltzer
saddlebag
fullard
drut
wdi
ionut
goodway
launchcast
nucleoli
coscia
connerly
pisoni
hilleman
pascucci
powless
tomine
tatge
maxam
vlogging
cprf
paur
everland
beitzel
buriganga
cyres
columbaria
nexter
sakakawea
koguryo
muche
awassa
droeshout
mcclenaghan
tedy
ccap
cayeux
lindskog
oswell
silicified
usdp
scalfaro
nng
kechiche
gbx
mawle
mutuo
shicheng
lastminute
adjudications
selda
prina
teutenberg
ehd
roshini
kapsabet
nabavi
hapton
rafiqul
classicising
floorplans
roquemore
jpp
celcom
yohji
kabanová
securicor
marstons
kese
newsnet
nppl
cervelli
hudnall
blackhat
almus
fidgeting
economicus
tangney
perchlorates
leibbrandt
vvc
invergarry
stellarium
nephrotoxicity
vienen
tregenza
sinaga
stadelmann
jabra
lindop
carletti
grandiloquent
heterophyllus
sbragia
hakem
hdh
schiffrin
overtop
whitinsville
ashis
miangul
harvington
reverends
donaghue
fenfluramine
olonga
ulk
turgeman
shati
appealingly
crolla
beauxis
fsas
petitclerc
hollywoodland
axiomatically
agps
seaquarium
wilmerhale
treva
bickler
melaine
martyna
cgb
digitalglobe
trovato
holytown
ayisha
exhilarated
boddingtons
kandji
roomette
nitv
callups
apprise
hyperammonemia
zokora
axumite
huapango
kahama
biji
truckstop
afshan
lmdc
jenette
litterbug
thayne
photomontages
dauncey
hample
somebodies
lovegood
bahariya
kcm
rallings
bolgatanga
dfk
carolwood
fluorescens
rovno
yuming
whizzing
alyas
unlearning
bomberos
goolies
trochowski
fertilising
humaneness
mahrt
sipped
npfl
scowen
hospitably
adriel
crittall
chikane
layo
buckyballs
grafing
shuping
camarata
yakimenko
bieksa
sacko
detraction
tielman
oche
zdzislaw
bodiless
clivia
suhayl
hallaton
catalon
dominga
opes
kaimuki
imagina
integris
arkia
kasmin
lexcen
munsif
whitbeck
mcnasty
accessibly
rokkasho
chegwidden
toguri
iub
batth
leston
münchausen
linganore
kernell
nodi
ihle
severall
pravia
adolphson
kez
harrar
gayest
aucklanders
footlocker
lippens
israfil
canlas
rainless
maytown
snodgress
andersonstown
hanzhi
kingda
laitman
divisadero
thoss
dooher
alfredi
wuhl
prorogue
jenever
anterselva
orus
soheil
svelte
nnu
theat
daragh
larussa
paskin
kakure
masayo
bibimbap
radkersburg
durno
amodeo
stepmom
jiayuguan
parvaneh
richaud
bowtell
kleptocracy
ferren
umea
avoda
weaseling
liwei
kalispel
unchurched
tobor
iteso
japandroids
delaughter
nationalizations
ajuga
patong
krosnick
khutbah
turano
dhaheri
palombo
dongjing
ruhm
crosskeys
baroe
dishware
sheetrock
hamhuis
reichl
zaatari
clumpy
rigidus
stobie
mezrich
totalizing
pargas
mccoubrey
misjudge
eastnor
bullrun
koziol
borschberg
zhenyu
ortenberg
larrocha
gleniffer
glenullin
sapara
pueblito
voluble
danjuma
nordbank
rezone
ngawi
alac
haircare
berden
ertan
ranum
hristova
bananaman
cumene
oldring
gyor
brynmor
cockerels
hidayah
alstead
immobilizer
ortman
knxt
snood
passalacqua
edify
abdelhafid
cymer
jutras
cuccia
sawaya
kuebler
archenemies
galvano
lisbie
chinley
sellaband
barff
bohler
grubman
hillcroft
ixi
greenbrae
yuyan
estaban
garble
lmx
magin
outré
darch
ecolabel
otavio
auw
kerrii
paderno
karamojong
wiliams
seppelt
herceptin
sombreros
rednal
seago
audibility
wifes
premachandran
ellyse
lunev
talloires
qasimov
thorazine
boumedienne
languishes
wibberley
fonoti
tophane
kathoey
wonda
escambray
impeachments
tenancingo
heliborne
vasya
obenshain
alexandroupolis
wijeratne
anycast
biglow
whernside
atika
launderer
aisc
cfdt
mirinda
gunja
bowlen
bleszinski
moulitsas
versicherung
seikaly
ucca
mischel
tomasetti
bellina
bodices
bisan
humanum
lübben
scramblers
stieff
mimetics
golob
jinns
ihram
allibone
yalda
tristate
sloughed
jouissance
deale
achal
oxgangs
carrickmore
spaceguard
sunbathe
rusland
yuo
cowpeas
radici
monokini
mosedale
ferree
puft
stashes
kiep
undertand
mcinerny
sertoma
churchgoer
consecrates
wallboard
schoene
haleyville
gassers
movieline
pardal
torkham
bognar
frant
olwyn
mauritanians
gustavs
hemmingway
hydes
cambus
instamatic
hanim
elfstedentocht
documentarians
taslim
nimni
kuser
indeterminable
cawkwell
mumcu
lakemont
bombi
huntford
dorning
machame
caramels
peleides
kyjov
ceridwen
breville
piquionne
traversa
shuguang
dixiecrats
giussano
disgracing
neelie
shadravan
copenhaver
readyboost
acess
cerd
unscented
hanso
stebbings
themarker
tomonobu
hypophosphatasia
dunlevy
audrius
cllrs
conformers
sbac
dificult
muamer
beogradska
shiffman
maull
dudesons
blowflies
majnoon
avocations
dampens
witn
houlder
mujaheddin
stigers
rhees
sereny
sigwart
misapprehensions
porush
hamanaka
maurissa
moratinos
auklets
blithfield
hitan
moskovitz
wobbled
quitted
motivos
trinitron
arquitectonica
raincoast
zeeb
fengcheng
commonplaces
knurled
tibshelf
imide
gerdts
scalawag
filicide
juban
derwentside
tangental
catarrh
verdoux
buav
smadar
greensmith
ephebe
toreadors
repa
carolers
jozy
leonids
yeang
boquillas
risher
nunchucks
misreadings
palletized
munnabhai
peloquin
cholesteryl
marilia
flatlined
lamsdorf
lenalidomide
ckoi
spose
westendorf
espo
graca
pasveer
salignac
showboating
bandying
langsdorf
herjavec
coene
unlighted
lagartos
baracus
weinzweig
sazan
emitt
illogically
harbury
kirati
fttp
dramatise
morillon
aperçu
yitzhar
chippings
yaogan
obligating
billesley
doodlebops
agrio
coccineum
scobbie
accelleration
jobsite
prikhodko
briggate
allmand
unparliamentary
haemon
pieczenik
densho
pimental
ifw
lixian
newpark
delgados
komba
mascoma
sawahlunto
natterjack
catcalls
ritola
parnassian
delavigne
gravesen
laryngology
nagori
waran
adye
seifer
quantick
thornleigh
isobelle
narasaki
pmbok
hyesan
scorning
rogala
alberghetti
compote
wimer
panella
depoe
czyz
curveballs
silvstedt
salifou
blyde
phv
etiwanda
falfa
pagent
hypotensive
aphrodisiacs
ahas
charnas
jinbei
capasso
naptha
maale
tolonen
misplacing
choppin
cocq
tweddell
scrooged
lankarani
dorwin
cruzan
fujianese
alisdair
goranov
razzmatazz
csra
groundfish
crystallises
hypersexual
europro
avto
scowling
ssts
fretz
nzt
kaarle
insanitary
internationalize
convoke
teferi
formichetti
perinton
cayless
oberle
grael
jellystone
dunsworth
emolument
iraklion
afriqiyah
menthon
mykelti
schlachter
airtricity
nanograms
dewolff
rnt
endplate
middendorp
berhan
gerardine
laâyoune
caserio
dewpoint
farebox
visma
paramjit
orlac
raitz
aqs
jomsom
hoilett
tangalle
noades
proudlock
waconia
sauceda
asthmatics
longabaugh
casselton
customising
scrumpy
unruffled
raffray
batho
ottobrunn
wna
lamberhurst
foolery
overplay
moneymaking
vitalic
grimstone
owuld
navone
sustainlane
burnstein
unloving
pgms
hetz
rocinante
eegs
thousandfold
moraleja
idara
ironfist
accl
basen
ishimura
pimpinella
cronberg
iford
knowth
exurb
bodysuits
perforatum
temo
shizue
streeterville
adaminaby
paonia
mimivirus
thorat
jaidee
doigts
hepcat
chukka
aeroponic
ringley
acaster
satnav
daktari
sublimates
irus
pisgat
verratti
rinkeby
cantore
iwelumo
gardot
harmeet
gutch
joypad
waterparks
waggner
häfner
tigresses
capizzi
buildups
sphygmomanometer
colleran
khooni
diogu
utting
epicene
mandane
sucessfully
accelerando
carasso
incarcerations
cinedigm
wud
rxlist
mutilates
sandline
nijman
borchetta
wieters
nitwits
gillott
cedarwood
adone
chryst
blaffer
wyne
summering
iosefa
adir
frontex
remerged
brinegar
superspeedways
harmfulness
ghazarian
mauren
nicety
purebreds
melungeons
aberporth
oldsmobiles
payá
jaisingh
ciénega
afh
deepings
sandag
mercosul
kamancheh
ksat
likey
llanrhaeadr
bucksbaum
emaciation
pentyrch
ichihashi
rehouse
discomfiture
glamourous
hga
softs
hawkin
flashpoints
hebrang
maltodextrin
bohnett
sigala
bookfair
crimeline
hajri
bgg
proliant
choueifat
wasan
sonhos
crisman
vocalised
cataloguer
ihar
intex
abac
menchik
recalibration
luminato
umberger
gatra
lemonia
angouleme
astuteness
beik
amurri
gibbous
varvatos
adebola
convulsing
winker
putland
hatin
stutterers
engro
baim
sambrook
kirbyi
chisox
wolfsschanze
marcinelle
frommers
markovitz
schiavon
bartmann
bobbit
koivunen
luttazzi
passero
strobilanthes
alipate
quander
selvaggio
defragmenter
treharris
trounce
monthlies
tnfa
narragansetts
diester
cwru
hillers
maurici
oab
smoller
ommission
feudalistic
bernath
qimonda
blunsdon
scholfield
silda
cordus
cohanim
bodle
exhales
heupel
fickleness
nonlinearities
dulcimers
kalau
overactivity
deathrow
tyntesfield
bealls
resurged
hickmott
crau
slivovitz
bunner
bsat
brontës
felpham
coracles
bspa
karslake
ebit
dukeries
barrowlands
arats
leibrandt
shahenshah
wittwer
makavejev
heyuan
leimen
silberbauer
holifield
mongan
schaden
etwall
debbouze
buffini
pecknold
sensitisation
perv
degolyer
kinya
canel
komisarjevsky
lemonis
gjelsvik
katariina
shawki
oltp
zeneca
siau
tecton
xiaohui
frasch
ulcerated
enock
handman
casartelli
orchestrators
lucks
nazira
earthers
oakford
subtlest
loitzl
ardell
diar
spiritwood
apear
ususally
stanchev
bechtold
buddenbrooks
mahnaz
vpx
gelida
projectionists
spawar
gele
stomachache
aaberg
froide
ultralite
micol
nalli
lobbe
unsalaried
enrollee
hfo
adolpho
claires
gangplank
donofrio
southpoint
desmin
bready
chalayan
booky
chillar
thorugh
ltx
fibrocartilage
hibernaculum
buya
marula
palia
berried
dieudonne
streetly
hatab
fischman
macara
thurm
prometric
dorward
holoprosencephaly
shanan
sosuke
thede
deafened
kuchera
piercey
osia
netzero
stoppani
rankers
fiancés
indigenization
ornish
phimister
magothy
bedar
moueix
metn
aitkenhead
vinum
ikb
pnnl
cretinism
heighton
ermakov
deroche
ctos
triballi
tootles
consommé
tocsin
chessa
cassinelli
gabol
burkey
chandio
bompas
sulayem
rousers
aesthetes
tigercat
koretz
eurypterid
vituperation
skeins
fridrich
freakum
tendresse
pocketknife
naydenov
pamphleteering
thyne
uldis
fidelman
kezi
berch
guishan
poligny
afghanis
sansho
webmethods
moorad
stupefying
lukla
precocity
ruberg
prebisch
perkasie
quetelet
hoogovens
indovina
transmuting
mercados
beco
zeitler
ivalo
sukie
keatings
spurway
geeti
calfskin
además
yalo
enterococci
tihama
ackers
wpgc
netherlanders
arteche
shinier
flopsy
outcaste
abscisic
ebeid
parijat
dedza
betrayus
forsmark
munmu
thibaudet
collectivistic
talebi
tatneft
groomers
koat
mcgrattan
factus
slurp
biggies
analogized
pichette
angor
commoditization
ameena
hurstbourne
ivm
auerswald
ettal
nutraceutical
feer
fcca
maultsby
subjectivities
spanks
exora
couvert
voorhoeve
irrationalism
gregorie
addictiveness
amerykah
transcaspian
curral
rozanne
regus
ayerst
bolick
cheoil
foschini
dtrace
bhut
gibberd
leckwith
ljungqvist
okoth
dvf
pottu
gingell
abdolkarim
fossile
khadar
conchata
kenard
pretto
kharma
acebo
mejias
assocham
eshbach
treuer
kirkbymoorside
buyoya
ghazab
opio
massasauga
fialho
moremi
smolt
champéry
hazed
ocx
fitfully
slothful
stephensi
oozed
tuffin
annear
treeton
antiquaires
gfci
tomelty
okcupid
tenontosaurus
keishi
navantia
khizar
batstone
heartsease
brunhild
flightdeck
wakata
laurice
abily
ataur
miedema
empathizing
vetro
chanterelles
mammas
percolated
coppard
fems
franjic
lucheng
siwanoy
hiawassee
wairakei
dwar
kooistra
sikand
colugos
archibugi
biopower
sgrena
jalapeno
baffler
durdle
semidouble
patau
ljungman
peros
fundoshi
maniitsoq
butterbur
buske
vocalism
vpo
kerbing
folkingham
lindenstrauss
gudmundson
mehitabel
ldo
downscaling
hanis
artistshare
senneterre
rizzotti
spsl
batf
grap
animadversions
melismas
interposing
louisy
laneways
chalkhill
psma
tossers
borri
buljan
oncken
soic
dbg
rehbein
ailred
prehn
yoba
ppas
nyas
amalgamates
maione
silvering
hemsedal
adkin
rhines
maerdy
huser
kingstanding
skerne
hody
interrobang
lindu
solntsevskaya
goisern
guadagnino
fatiguing
osheaga
bleat
motorama
permenant
westie
polyvinylidene
combet
grunsven
barbastelle
thoughtworks
nbpa
omro
crumpton
alltime
interpellation
berlino
lccs
charmless
fdt
precalculus
golm
tinca
eaglebank
loosemore
bhimbetka
drippings
nosair
solley
nemat
nithyananda
nashat
fasel
stonybrook
poizner
tredway
maybellene
emk
compatability
gamesa
thnk
utl
hongwei
juicers
chiaia
tenleytown
konkola
khayyat
nadas
serrato
brez
kreft
tadatoshi
slye
hurdes
szarkowski
shangaan
newth
kias
contriving
benadryl
parman
wfuv
underinsured
comstar
bidford
mdz
pmx
biochip
cltv
tofan
euromed
wikianswers
mrtv
cornfeld
goen
beharry
mcnuggets
hulen
ogiwara
lipschutz
toub
soleá
vodochody
tonnelle
woul
softie
scalextric
grindlay
acidophilus
paonta
sebrango
opionion
multiform
kythnos
xiaoyan
romie
jruby
kirundi
dinnertime
defour
shokai
afw
kleck
chondrosarcoma
forcefulness
wyness
wakame
deceits
tahj
maritimo
pizer
croci
hegira
rutley
rokocoko
muliaina
lurches
crushingly
phytologist
disinhibited
gmpte
convenors
monarca
lynches
serfaty
pellucid
lyness
distillations
herger
dilates
ofrenda
abdelmajid
ansaldi
jonell
detriments
weatherson
risinger
fizzing
matzke
rora
swillington
giannoulias
sarcomeres
xingang
pokharel
sorich
fraboni
matano
trehan
senaki
sayago
swagga
marginalise
pridmore
galanis
kmtv
marro
dins
barnicle
boreray
pagnotta
beckner
dudh
erwartung
icad
gaillot
cuthberts
edghill
tydeman
princessa
withnell
bradish
channer
thrifts
cardiel
akbank
heterotopic
gevrey
vacuously
coslet
screwup
strathcarron
crociata
fylingdales
whic
pelagio
leade
monheit
khazanah
tabacco
carrabassett
gitomer
dgf
dasani
rodo
nasso
kirchners
dulé
langshaw
melius
zaccardo
waddles
pellinore
unidimensional
hopei
socko
buckwalter
bagert
britannias
cityville
misk
gamst
dauphinoise
epley
sauze
humph
kenza
derickson
empanada
levier
rolton
deforge
isely
aridjis
downeast
wonderfull
ashwaubenon
staver
saintliness
keybank
auctor
considerd
mazibuko
jezek
nurturance
marijn
wischnewski
netiv
keepass
gamm
conflit
junkanoo
pinzon
millhouses
koumei
andile
danin
haitao
crinkled
reapplication
servier
ocana
wnyt
westwell
clendenon
canari
vcam
kasota
henck
brawne
oxman
grisebach
ggb
ebeye
fraid
trackable
aggi
bandoleros
retinyl
boyajian
reboarded
archicad
pedagogically
klaveren
lavasa
wevers
lattisaw
saccule
maringa
walliser
briercliffe
eluard
gangbang
kafer
aams
kromowidjojo
wum
staghounds
soulard
battaglini
esmaeil
keagy
peldon
barreras
ohanyan
kelda
metasploit
medimmune
timss
bpv
prödl
choque
linebarger
vespasianus
dunthorne
frosties
herranz
tilal
boola
villepinte
potatoe
nigersaurus
facenda
moulted
holleran
culmore
chany
bobrova
nomade
woodhams
sidbury
scho
gotz
hardingham
werblin
holderman
harkening
outspread
kosen
dragonoid
biffo
baggie
troussier
velva
pramanik
mmtpa
empa
wijngaarde
premer
henneberger
miang
celcius
wazirs
aphc
loton
frayling
consignee
contadora
japans
reknown
sorey
lancastria
bowmans
gainor
fonua
chlorobenzene
translocating
sbcs
prendes
isere
midlake
dubble
mamula
ofb
gayley
choekyi
recruitments
sugerman
kermorgant
gpws
sertich
kalifornia
magnetised
zenia
eggy
xiangxiang
novembers
showaddywaddy
rommedahl
lithologies
bocskai
hongxing
seeders
crowdsource
laingsburg
transversality
malthe
sundre
mahmoudi
tetrachloroethylene
acclimatized
aurelien
nationsbank
solemnize
khalife
mesan
glenshee
preud
holsteins
sandboarding
kencana
redouté
parolin
katalyst
davoren
henries
champêtre
jellybeans
olis
organisationally
baslow
trophyless
honeck
borneman
aynesworth
linter
wener
landecker
magnay
ackoff
sportage
maiti
cockenzie
visualisations
mccourty
silsoe
ninewells
peders
rabelo
postoffice
cesr
olina
hentemann
arato
hersant
murzyn
titon
marrara
weaste
patia
murderball
roddey
moresco
avdeyev
rapira
rainin
gispert
zheleznogorsk
angula
straightener
barbering
warao
supermac
winteringham
schmale
minik
cyndy
schoop
wiimote
leedom
tootle
windless
thornliebank
deviled
wakai
osw
parroted
yakisoba
gizenga
njc
cucine
wieren
keezer
niggardly
bishopstoke
doxiadis
tegla
louderback
lazzaretto
overpainted
chaigneau
arah
wellbeloved
yammer
bankcard
fellay
leweni
gercke
leitmotiv
mitres
bettin
chida
loof
doos
curatola
lomma
hornak
dalva
barrass
wasfi
younce
brownshirts
autolink
kornelia
brahan
swartland
malchin
madone
interst
ayten
uras
haule
mahyar
sedwick
elumelu
lorman
adderly
capeman
adrenoleukodystrophy
gururaj
meachum
damilola
khairat
krupnik
alasania
forsakes
franciacorta
hanff
facials
mettre
gabu
wcax
plucker
colpa
markert
dulces
lusinchi
ifra
dreamings
cafritz
rever
exept
qaisar
wilkshire
haemochromatosis
paicv
ludham
phantasies
scalloping
sassaman
gayoso
demonically
rikk
campiglio
bscs
liepa
osgodby
marienplatz
bromford
plattekill
imahara
kery
darndest
sternin
kohara
neuadd
traffics
propounds
girodias
bronchodilator
ozyegin
twinge
eluned
gendreau
ochamchire
burbury
balwinder
indiscriminant
tognoni
leffe
ellistown
raymont
krach
corazza
sizzles
pecota
crudo
jwm
colorize
mrak
thubron
raymonds
vansh
instapundit
tanat
ledwinka
affymetrix
tabarly
pfahler
pureed
vembu
ziyuan
sextette
ochamchira
photosynth
expansionists
southmost
ballfield
feburary
neumeyer
anichebe
pvcs
yakitori
lomonaco
tegner
dunbeath
zehava
connetquot
qpcr
hoogendijk
sdw
deacetylases
edhem
ponytails
remenham
iodp
nehi
loussier
tamilian
honiball
daydreamin
lucyna
lamone
cnsc
mccrane
cdcr
islamophobe
emulsification
caressed
sfas
solignac
dystrophic
wdtv
clubhead
elstob
effin
edra
lehzen
armands
bootup
peggle
nihl
gbn
iniciativa
disrobed
mauzy
achel
strumpet
jokic
luchtvaart
bocharov
myalgic
prudery
munchkinland
howatt
fauchard
imls
fhd
defunding
doorstop
xango
unsynchronized
atea
annunciata
amtran
bhv
fulu
wifebeater
kabylia
mahones
bercovitch
bartonville
forepaws
retyping
urbanizing
dalloz
chryston
canwell
phibbs
ntca
marfleet
teagan
unfancied
sotherton
cocorosie
marli
unigo
biodegradability
odr
precisionist
verra
ramidus
delisa
precariousness
tolstoyan
masar
beechman
rainmaking
inboxes
belet
bengawan
fcas
pehlivan
friesinger
chaa
mulqueen
unfitted
losail
mireles
weifeng
spigots
gratifications
abayomi
rozzi
casone
ktt
southway
capulin
bressay
soci
zobelle
hisingen
bedraggled
mceachin
fervid
tooles
pabuji
synergetic
demonised
stepmothers
unfortunatley
delaplane
bearzot
kliper
passthrough
bikita
portageville
trotternish
ehmann
capoue
durmitor
kakati
tataouine
napf
sorrowing
greenisland
charanjit
bollworm
sunter
barbudan
nationalising
photoblog
guyatt
bayville
roundstone
bashkirov
zahav
stylin
scrumhalf
galatas
vendola
hooson
uncontracted
alsc
bulsara
abruptness
granik
eligable
schrad
lateraled
teetotalism
kalac
poilu
hins
monologist
quaglia
borzov
roquelaure
beseler
hobbles
olympiahalle
mashayekhi
lucchino
lasgo
broederbond
spectrographs
jowers
béguin
soldner
renck
mawdesley
berkhout
vhi
shekau
novikoff
graffham
skywalkers
aellen
chorded
hibbett
klutzy
bizzarri
licinio
outshines
haub
stonefield
yuste
karey
ripia
baffa
booyah
akinyemi
volksfront
kühl
romanet
bassford
saadallah
blackwill
giriraj
sidetracks
grippo
handhold
songun
avoth
slackening
vilallonga
lunching
arseneau
claudet
northlight
linlin
weprin
flyways
tunisair
mccarney
misapply
knollenberg
psychiko
tauran
morphosis
calabozo
bohart
shadowboxing
untraced
marcegaglia
afua
goecke
siouxland
hesperornis
shoard
sigmon
olia
repêchage
ruche
baset
butterfinger
caponi
ghaghra
gobbled
waad
septuple
autogenous
alamin
sundancer
wenford
lowri
wermuth
wauchula
capitulates
subclan
edelmiro
bmrb
soss
shreeve
lemberger
kintampo
hamate
hallettsville
bigfork
hostal
earthshine
succint
jadoon
doca
arandas
gundog
levanon
trullo
wanis
laber
zealousness
stoor
shitamachi
waissel
chodron
kheli
lanthier
leming
aghios
aegerter
balestre
termagant
fukatsu
aasu
gawdy
hipkins
qatargas
shoucheng
furlanetto
lumos
pugilistic
kanmon
mensalão
gpon
zohreh
polydipsia
primare
confectionary
rgu
smallbone
chibana
kalik
theros
jeay
carryout
hrb
marittimo
decanting
multimatic
wissing
hawija
masinga
shleifer
schindlers
uwch
yadlin
salahaddin
poovey
noorderlicht
gunsmithing
kxtv
brusatte
hamr
lhéritier
grzyb
attackman
bardhan
louna
bettes
preforms
frosti
proverbially
puett
conciliazione
mansaray
griefs
guidence
iason
nonvoting
pyinmana
wainright
measurability
unacceptability
leics
lefevere
luf
gabii
whiffenpoofs
dibona
precooked
sommerlath
fisken
kifl
rindal
moping
counce
toupée
meritt
tavani
rgl
matiur
lingotto
drieu
stamer
ledonne
sagot
kjer
sauza
atwan
bourdet
hodnett
maws
satyamurthy
lugged
kulig
sounion
crie
akhras
trinca
mcgruff
troodontidae
angliss
nhadau
raco
lobov
mnangagwa
shortform
unwrapping
attachmate
kandilli
mtel
sulaimaniyah
alberici
shishkov
pined
levander
hennin
allenstown
zalayeta
geopolitically
fluorocarbons
blameworthy
guettel
quicksands
toughie
blathering
roughwood
listservs
riptides
krenzel
protezione
pijesak
shewell
paragua
ascherson
lapolla
epilepsies
fany
scabby
matikainen
uprise
krqe
wireimage
nbbj
mashallah
mohri
voicexml
endwar
whiners
ilaya
aebischer
kollman
stroppa
rosukrenergo
aristocracies
northtown
danya
lustrum
gourde
nemr
motsepe
bossley
tuncer
mickleburgh
colombière
mincer
kipyego
nsca
akhnaten
coppins
urbanos
parnham
jaal
kanis
lugt
wykoff
macunaíma
luxon
flipbook
hopfner
ruleville
vólquez
gleevec
viser
wendlandt
ghannam
vtx
geimer
surdu
unappetizing
arverne
klarwein
dipti
vwd
leist
yarmulke
pokeweed
transcoder
hillmer
noctiluca
actuall
howry
camouflages
ferrovial
maltzahn
contiki
unlikeliest
curtails
lefcourt
pavlopoulos
burmantofts
countercoup
sweetcorn
chaiyya
virji
taxpaying
inbhir
oltmans
piliyandala
galex
honkers
winnersh
verlagsgruppe
peltola
orren
lacher
mincho
pblv
volstad
reatard
hairdos
longlands
pretax
terezinha
gafford
bussiness
tesser
ahadi
hurriya
springtail
bonavena
bronchodilators
boylesports
ilton
collington
gattai
dubailand
talagi
saoud
carrothers
odelay
anier
polytonality
agle
clockers
dannielle
mahre
vidale
ronet
jenet
erjon
dantis
injil
bakero
modasa
roselyne
defragment
yuting
hellp
trebilcock
rungwe
lollo
odair
laffit
traphagen
aceto
gallay
poseable
jhunjhunwala
dumisani
yuzhong
dorell
basepaths
joline
reflets
smush
jannes
templecombe
clst
tanongsak
dussollier
poolman
shorebank
lysate
haslington
corduff
cianciarulo
proffers
woodrum
wessler
bunde
scholium
libdem
fenning
vinessa
condren
beddau
cornhole
columbite
ibach
fison
manucharyan
coppersmiths
cardena
vittorini
wampanoags
renia
hijet
ganton
arrojo
fetu
iajuddin
giddiness
xandros
aebi
elmstead
colomé
bollmann
equinix
chichiri
shackell
paartalu
sonobuoys
wics
wellies
haresh
schimpf
kabara
tdx
girdling
kozelsk
waymouth
sciencenow
murden
starstreak
perenne
obfuscatory
ultrathin
disproportionality
weequahic
newmilns
schilsky
magaluf
clambered
eggum
brehaut
kaplow
sicne
wackiness
fidgety
riffraff
taillamps
rinck
rakshit
moneygram
donleavy
yucai
donators
lorig
twigger
icps
smola
hauber
roquetas
munchie
gasthof
masamitsu
jish
bsds
informacion
toutain
rehavia
dmw
virgilian
crear
occupationally
bhamra
unroll
verheiden
binzer
mandatorily
cinergy
ovenbirds
rocamadour
khalida
tsos
oxpecker
ottosson
magli
lotito
endang
leden
zadan
mischaracterizes
humanness
rollercoasters
cantey
dustan
rizos
aafes
brightling
goseong
anisur
drako
sporran
belligerently
okwui
leviton
nenndorf
califf
flouts
mcquoid
dagsavisen
deewan
sexson
hellbender
posthaste
chinati
giberson
ekster
upgrader
cales
vssc
revalidation
judengasse
stramonium
mathi
contrapunctus
kefu
agazzi
bellette
creevy
owne
rugge
eglon
rajaraman
andreyeva
aversions
klumb
cowle
retrying
gtin
brathay
underfed
bichat
cassida
reinsurer
contursi
unsecure
agnelo
tollerton
hollein
reinertsen
rifka
ryal
radovich
gottheil
yaqubi
cornard
remillard
tylers
undeservedly
uphoff
uysal
hualong
vipond
buildable
dropwort
bowkett
nulato
spanaway
ansdell
indyk
pedaled
fantuzzi
roji
ayob
bookkeepers
sudani
compa
corita
tourelles
cichocki
zangara
ehart
hoai
dirie
clases
sparkie
dunagan
bilma
boualem
scei
bickleigh
railay
kenwick
colrain
elliotts
mouride
silcott
vallois
gawande
mushaima
peplowski
schoener
mazuz
rapacity
decroux
rollerskating
bgh
cauterets
bortle
rideshare
chmiel
jtd
blumenschein
amcc
javaris
yahud
necromantic
rukavytsya
smithills
colorno
azab
trus
zura
grommets
yosi
jahoda
jiemin
missen
implimented
finback
agnone
puppo
russki
dougy
yantian
kubilay
synthe
unrevised
villere
ratted
joar
singlets
interiority
wehlen
ndikumana
loux
microenterprise
formaggio
scooper
mjs
steiners
sanitizers
mackle
oversold
foreward
celgene
sensually
feldafing
rockface
overfilled
rinde
hofbräuhaus
oschin
eulália
translucence
perspectival
pecha
kaukab
gaiam
sadowitz
haysi
paganelli
linos
vassilev
saadet
bratman
pjp
makowsky
faiveley
wady
alday
stoeckel
sidman
ronis
lerby
overidentification
donno
kitanglad
plandome
karahan
blackamoor
journalese
willfulness
nager
borren
pmsa
dwek
nsit
goetzman
bonvin
hoeck
faves
alkmund
cherny
choppa
wholesomeness
kutaragi
neid
paetec
poisoners
divinations
tiffen
nonylphenol
ballybeg
jarque
haehnel
uttaran
geoint
vitol
bergenline
orfèvres
pmv
rales
ostro
unpatched
noshir
erme
benedicts
zubieta
splinting
varadkar
fournaise
caspa
chipinge
marzena
supplicants
toneless
motherships
antiquarianism
hoeft
clambering
rigidities
rosbach
lfv
dayer
marchés
engelaar
schmoke
avita
pachyrhinosaurus
chere
kandelaki
holzner
augustyniak
upend
kleban
ragon
agitates
tylertown
perriman
góes
ahus
sackings
kempter
penkala
zubeldia
knipp
janela
potterton
culbreth
foxall
yetholm
sandling
akter
tehnika
californie
romping
baic
udzungwa
deibler
inadvertant
baduy
exploiter
unprompted
hakimullah
josephy
tollin
nachle
deveron
anthropologically
dowiyogo
nfld
lawngtlai
uxb
tahlia
titlis
companie
samura
komunyakaa
uslu
goodden
mitrani
faga
hoveton
yordanka
mableton
greasing
budanov
hillarys
throstle
espinel
codjia
mcclenahan
gladstein
wuzhong
guzzler
doinel
grippe
hanton
lolol
gruben
taguba
brina
hawkinson
hygrometer
wible
cigarini
profi
hungers
stoffels
choudhuri
stachowski
seein
ruffling
machavariani
senreich
leighs
recondo
nhlanhla
documentarist
jozi
barmes
spackle
kamman
procrastinated
fredda
bucy
cowick
bonvicini
sundarban
thurnham
lassana
butyrka
buitenen
dronten
mamasapano
bisto
porc
iconix
kimbro
matronly
kilju
qatil
razza
padda
cudlitz
finchampstead
mondonville
lison
morgenstein
zii
roadworthy
nanri
syndicators
mutal
rezai
miskovsky
bookstall
manguel
venipuncture
tributyltin
lawfare
adeniyi
jehova
atheroma
explicates
lavernock
greenawalt
beaney
rollox
usherette
ahwahnee
catspaw
tensioners
jaxson
dmitriyev
tchani
delyn
dolgin
maryfield
krens
sinquefield
wieler
wilis
duku
jacome
sheffey
ythan
kavin
saige
jaoui
moorage
dufault
uvr
fatalist
prazak
schwaiger
ballons
inured
beese
polyolefins
larrick
bonera
kitada
wiesinger
spratling
curdle
bbci
seegar
domb
inverkip
nango
nordenham
carrascosa
detre
ambrym
phormium
uygun
defoliants
mbulu
lafco
fischli
turncoats
seasteading
cannaregio
ballingry
rangasamy
affligem
glazers
broecker
darkon
challange
gibbus
baschurch
fraternizing
coccidioidomycosis
shetler
skrall
holburn
susceptibilities
polarise
akvavit
albats
zefiro
bogusky
franchize
bishopwearmouth
casseurs
goitom
sleights
aleksejs
bouchut
pulverize
albopictus
achatz
kinokuniya
colistin
befalling
kibar
himan
etanercept
mastocytosis
hamou
sugarplum
kazakhmys
ranchland
reassignments
homotherium
radoslaw
zbc
milingo
kupperman
koyi
deregulating
hymettus
tussles
androgyne
losartan
massai
weatherboarding
nusbaum
hallworth
pompilio
oehlen
challengeable
gunnerside
murciano
koppu
wahnfried
chéret
aerobically
starline
sinofsky
kwashiorkor
ranawat
safenet
rusli
beya
publik
kampman
bolthouse
bufano
caldarium
sunchon
lauret
aswat
telephonist
latanya
gayet
driefontein
yadong
panek
mirail
grochowski
mejri
barloworld
displayable
sawbridge
fedexfield
neuroanatomist
brightnesses
immunobiology
jeffersonians
riak
arek
bmcc
cinda
fraîche
artreview
antitheses
geiranger
subassemblies
watersmeet
hovingham
stockroom
brennand
musante
bruer
cidre
nautico
spreadbury
ambuj
harilal
cucinotta
powergrid
schaber
pucara
moggach
gollwitzer
rerio
hyosung
masterless
brinsworth
rusha
preclusion
pehrson
kleeman
maie
codependency
mégret
keagan
ishani
puning
bogost
dierdre
radziner
protoplasmic
glackin
rehousing
sabhal
clasen
lekking
granatelli
semans
nulli
fleurie
precursory
anderer
sarro
unornamented
khanabad
reexamining
pliva
kalomira
geib
woore
neubacher
gwilliam
waljama
macaron
pettijohn
yuhas
oginga
aeroportuario
freshening
boin
overlea
internationalised
ariff
mozote
squawks
sijan
lechuga
smartness
aptamers
antiproliferative
halfhearted
mccreedy
seehausen
kabua
snowmobilers
kersley
conjuction
morcombe
duerden
vandoorne
pertiwi
swatter
joson
sweed
palmy
angelology
budnik
navarette
denmead
georgetti
butterfish
nitpicker
caravane
chugg
envi
brost
dybdahl
schroders
tett
hony
offloads
dubnyk
refrigerate
omerta
verities
chandlery
crudity
mastercraft
bollier
relearning
esteros
bangguo
hammacher
kandil
violences
fallers
pouya
tabone
banko
enthrall
albach
raber
haggins
articulators
causae
kindleberger
scampered
tikun
karalius
yawns
congruency
forestay
koken
balangay
barhi
saib
frady
berlingo
makart
traceback
transito
zeynel
tessé
heffler
tytherington
milea
mison
stanground
indranil
deemer
steeplejack
cansler
perno
kande
mwale
farinata
hippe
quilp
okin
swn
rassinier
rusnak
lesperance
chatta
renos
llanwrtyd
mashaei
barnie
reinprecht
prophesized
lybster
bbqs
skydance
mazower
bralower
aveva
warnaco
seamy
adoum
lochleven
phylis
kiersten
zorc
stibnite
alysheba
rimula
gocha
woulfe
securom
debacles
deuteride
rybnikov
greenspon
catalysing
pursing
pcts
geg
wurzach
yaroshenko
deflates
parlane
estacio
pozdnyakov
kreek
profesionales
dardis
lillingston
windsock
additionality
killie
savagnin
crescens
oarfish
changnyeong
cirs
fortnam
optimizers
peslier
abcp
quintain
rudas
helfand
gadir
vitrine
daul
salovey
lawrenz
adkison
glengormley
aspendos
raanana
arlidge
threet
voicework
mathern
monchengladbach
akutsu
attingham
odam
badil
diatreme
cjn
mercuries
tunecore
taurean
flavonols
woolpit
fafard
makeda
sanand
kalaba
zanger
ayal
croutons
nnamani
defecates
gisli
pocketwatch
rozsa
euractiv
chaix
murguia
hammerman
geodynamic
busche
millenniums
fiorelli
cossutta
kesgrave
scovill
sekret
charulata
liferaft
basketweave
hemodynamics
worcesters
multiphonics
harebell
schechtman
niuatoputapu
konvict
jetway
badat
bujsaim
nuhiu
schnoor
horribles
eriboll
wika
madlyn
ubr
alewives
gavitt
reeltime
moinul
moonstones
abiquiu
rictus
idyl
azpilicueta
vatel
machacek
ninigret
marinero
ambro
ginna
enf
konrads
deprecatory
klimowicz
alrewas
omh
groper
beatha
shichinin
dkm
landreau
tanuku
rathborne
ockbrook
lykos
teunis
swo
mispronounces
pianura
vwp
gassan
styli
aurigny
heijn
commentaires
sklyarov
decontrol
hils
zaitseva
burtons
mulches
overstimulation
neuk
infinis
baccus
bogazici
waso
kanesatake
controvertial
rollup
leijer
showmatch
immolate
serber
mayardit
olayan
kermanshahi
ullens
stoc
broaddrick
gnm
fulginiti
ugbo
debach
avago
qqq
shipai
urquell
goldrick
kyowa
karvinen
scagliotti
manderley
primitivists
depressurized
backspacer
lightshow
lvad
carimi
chhay
arval
pahan
knoyle
adjudge
boening
obrestad
sandycove
mismeasure
millionths
lillet
proffit
maisey
jbt
wakako
boozing
iuc
hemley
ascensions
natgeo
pashkov
glaise
healthwatch
mapfumo
dovecotes
najem
bouchez
inflexion
lmh
solider
machlin
palamara
readerships
puo
sparty
renovates
cervinia
palkina
biros
qada
audino
kerckhove
goodlett
virginias
bucherer
kleeb
thrombophilia
dausset
neocolonial
fraisse
caitriona
atwt
mastheads
uzel
lowing
birostris
digeorge
bauld
shuaib
clode
gubi
handholds
nutmegs
marotte
tetlow
sabelli
williamsbridge
yanyan
cunegonde
clanger
balink
innuendoes
ccoo
succi
courrèges
sanshui
rills
vigdis
trainability
beauts
vamping
huntingdale
kraybill
augustino
pashtunwali
varah
ltj
chautard
genel
tractability
mourvedre
clods
nilmar
blodget
liuna
ranter
teasingly
diefenbach
surveil
abjectly
keyham
twirls
synaptics
dharmakirti
grindlays
moelwyn
loughman
guipuzcoa
dreaper
jinxing
pandeli
stielike
ruwais
lonardo
grassmere
andravida
skenfrith
epact
meaulnes
lahiya
skarlatos
adella
prizm
marimekko
elgood
mì
yolu
nehme
behrooz
hopen
resurrectionist
feverfew
jacomb
ortwin
intar
cryostat
pointier
dugarry
tecom
pauma
rochestie
javer
sarf
ommitted
ptz
smtv
kirklington
christien
poynor
bertolotti
bakdash
gibsland
mănescu
nonaggression
probally
infinita
indigents
designees
parasitise
doblin
arcidiacono
steelmaker
unobtainium
samari
sephy
ansu
chacombe
westerleigh
kalp
pulte
yasufumi
schlicht
kraters
wichman
dwarika
artimus
ravenscrag
zenati
alternativo
shikasta
elmir
xinfeng
nikel
janah
kcrg
merenptah
zaniewska
sunkara
subsectors
noao
hagins
borro
komara
hansung
tangibly
yousefi
galliani
artemesia
nanobiotechnology
needlestick
websense
cryosat
procuress
mamzer
boogey
jawaid
skarn
honiss
arvor
fennessy
hydrogeological
crvenkovski
zwiesel
willughby
fanya
lempel
skubiszewski
tongham
suster
wintory
sauls
desco
wythall
sterckx
unrests
musacchio
satriale
boneta
luddy
threshed
kragthorpe
saun
brontes
petunias
okruashvili
cichy
widcombe
zazzle
jimny
dielman
abderrahman
insh
pestilential
liskov
miaow
periorbital
stimulations
antibalas
jhi
corporatised
knaths
loseley
vegar
duflo
etchemendy
christens
compactors
mickley
unsuccesful
baldonnel
zicree
noak
adiposity
managment
kolesov
puter
phyfe
hinthada
malaxa
relativists
westhaven
dŵr
sakhawat
fordwich
lindens
wildy
cartwrights
ristic
hapoalim
mukhtarov
generaciones
moneyline
maliha
zambello
aglianico
fandral
polywell
ippc
prucha
plantronics
luten
odier
laurino
myford
vrla
joggins
xiyang
domnina
brickfield
superheroines
llay
ferriz
tobacconists
yiyuan
hangleton
lxr
qadar
koivuranta
comforters
otavalo
minger
osmin
oconaluftee
rannells
sikat
driedger
lisak
temnothorax
borrás
lambright
bassets
sacramone
secondmarket
weilerstein
weeton
prayas
paracels
bufalo
thundersnow
intervision
dittmann
symbiotically
cervin
qiaotou
godbey
porizkova
newsvine
potencies
gotay
mkandawire
gyrocopter
cirie
readjusting
meos
hueys
tidmarsh
weiping
forfait
artos
detmar
albinia
wiemer
rieke
milliwatts
handovers
teazer
globalfoundries
enjoyments
wddm
fvc
babers
corkill
dilbeek
heavylift
grandmont
roved
shakeela
elgart
gevinson
dreiberg
brandel
rasin
nsec
hoshikawa
myhrvold
hemnes
vestibulum
gatenby
paolis
audel
amblecote
sntv
rozel
hiders
scrounged
fás
alcivar
titas
pasque
meritage
hibachi
coliforms
closeburn
qaryat
manically
gambinos
linzey
aerostats
bennekom
romen
usuhs
packinghouse
haysville
ftas
sunam
nately
godane
chipchase
explications
aleksic
shipps
derlei
manganelli
rido
ignatiy
nahari
keverne
wqs
timesheet
,then
strallen
attkisson
wavendon
kuffar
forteviot
jollity
sharifs
xiaolu
fergy
cheverny
haysom
mhor
bashy
boesen
miyun
carenza
marketwire
bosox
peasley
marathoners
witheridge
mazzacurati
cutdown
romanesco
magaziner
seront
prodromal
tőkés
tuitions
bosingwa
kazee
gerin
nwn
gedrick
colindres
gref
necklaced
windfarms
roelandts
sistina
msms
wtn
somen
woolcott
spectrographic
easthampstead
vegh
redeker
portzamparc
booy
undecipherable
arnotts
goaled
mcerlane
engleman
sympathises
howett
kullmann
cablecard
hoggs
martinetti
zierer
gershwins
klon
nikken
actualizing
bobic
greetland
gurov
pmln
stets
ranton
glassberg
reuser
leibold
sapsuckers
maliszewski
dunner
bajnai
fritos
intercessors
giftware
mislav
manot
denkinger
powerfull
evangelising
feierabend
leatherjacket
mcginest
wangfujing
maenan
gorran
zelikow
guralnik
hockeyroos
oupa
capstones
rocklahoma
bivouacs
zirc
chodorow
fluttered
krathong
satyarthi
thornby
cabaniss
parilla
kozminski
matilija
gornell
kancho
emcc
baginton
tweeds
reprograms
yaqiong
pacolli
heimerdinger
baitz
zamarripa
stirratt
averin
sperl
newmains
grisedale
kancheli
trebol
hirschfield
rase
fishergate
iffat
windell
hazlerigg
angelenos
pegboard
sawano
mansoori
bdos
prodrugs
trashers
earnock
dunvant
vélizy
rabon
liers
bahamonde
vallie
busked
plights
jtm
spgb
palud
nastase
lezignan
yosh
chicharrón
ellner
cornacchia
sharonov
yamatai
hinesburg
shosholoza
articulator
muffling
wxga
yergin
ccee
sube
orgasmatron
tanno
zweibel
alleg
janay
xpc
effrontery
hanun
kharas
sureness
okaka
reinsch
imh
denge
mcelhenny
entenza
wisler
petillo
husan
romera
fortey
shneider
ohia
avowal
chebbi
paphitis
qingpu
bearly
sewri
hahnenkamm
theatermania
meekins
farinha
pashtunistan
khudyakov
northcoast
sozzi
leverty
abersychan
zaatar
lagniappe
bolar
remineralization
undefeatable
issia
bulku
acerno
wava
kadafi
rossport
juntendo
hdk
clein
chagford
faah
pseudacorus
remelted
kilgallon
kapchorwa
quarantining
nobelist
mohib
xiaoxu
gondjout
malby
zippered
formidably
bidegain
pasc
pressurise
sofian
seropositive
onuoha
ratier
bergsland
insoles
wyatville
preeya
hayatou
suunto
bablake
detoxifying
mountfitchet
scorekeeping
egilsson
habsi
nazeing
orizzonti
kaneva
xinyuan
kavaguti
emoting
dubash
mahallas
burqas
milnthorpe
schulmann
kassen
courir
købke
anakinra
supressed
metheringham
denyce
aeronauts
anxiolytics
freedia
extemporaneously
clomipramine
seiden
rbma
walb
putout
defne
clericus
sazanami
attritional
bifolia
folgate
indianhead
dungu
bussel
scholer
gumming
aproach
billow
nawabi
kilrain
abasement
vrp
belongingness
amwa
choudry
thornback
pitroda
bredahl
transurethral
paticular
annin
perretta
staubli
exfoliating
kitteridge
spem
eitzen
abdulai
ribadu
liftgate
eastney
rivergate
stalky
pluta
hwp
beldham
liborius
maestrale
midis
mahto
snax
nolet
chacho
barmston
beneficiation
agglomerated
derwen
evonik
markyate
cocido
fleuriot
sinsinawa
shifang
acklin
fidele
starcom
retzer
iglinsky
harpooned
heeswijk
askern
vosa
événements
majola
isch
parliamentarianism
windex
saunder
fluoroscopic
rivadeneira
rochel
imon
airservices
tudhope
splicer
misidentifying
ragueneau
karrimor
doublers
suranne
magendie
mishina
hamoaze
krotz
letscher
egami
yusaf
aerin
hakin
senselessly
ussd
marzullo
sirpa
tadataka
nyanda
lejla
maschler
saganowski
superheroic
shallenberger
metamorphosen
horrorshow
chongzuo
lauten
iribe
nonrenewable
skok
pariaman
ondcp
coul
coldbrook
triche
karnowski
tollefsen
tretinoin
poudel
ncaas
ogando
googlebot
sextants
teressa
charterers
faren
ridgewater
advertized
caic
oreck
faseb
ubben
fofs
ghailani
accost
gerlich
entirity
hilleary
sonetti
trickiest
tailpipes
ostracize
mohieddin
naturedly
hovels
lampl
dizaei
radioland
aquarela
lozells
flyting
dispur
vorontsova
otherland
wpo
moulden
karissa
marimuthu
cranie
tuilaepa
rateb
tousignant
esterbrook
hashemian
maladroit
stupids
hatherleigh
darkin
mamut
fanfest
upfronts
haing
dorfsman
athanasia
erythropoietic
yaping
gyasi
fishbach
scallon
mjm
mipcom
dayyan
muteness
seldin
cakebread
natallia
turkishness
nonalcoholic
managership
wildwoods
shrady
breezing
madadi
matelica
maximalism
nermal
sandbaggers
unappreciative
strenger
lambeg
snowglobe
logistik
knuckleheads
tillingbourne
cardiomyocyte
hfq
fellman
haddiscoe
giaquinto
khouang
bazza
yamai
krystina
winnenden
chavela
annemie
cordileone
blea
showjumper
gaver
riney
lisfranc
larmour
purifoy
ozell
phyllosilicates
hohman
thinkfilm
grevemberg
negócios
nucor
okeover
torday
bacteroidetes
danyel
salekhard
surveilled
pettet
dabao
borowsky
wallem
propsed
tilmann
maddens
travanti
jankovich
antonito
muya
brainpop
kodaly
hovsepian
icat
karnail
obika
strittmatter
bruntingthorpe
precisions
theurer
rebounders
nuncius
calvocoressi
diack
hochevar
popocatepetl
lacunar
colvard
cowers
feras
zonen
ustica
bijie
moitessier
backslide
glais
vortexes
mohajer
psychographic
lizzi
revocations
abergwili
kofta
galvanise
smas
hatty
tkachyov
matucana
mervis
mirowski
musavi
frostproof
sadowsky
akte
rohypnol
puspita
shoukry
grevers
harbormaster
lymond
archant
infectiously
highview
beaubois
invloved
budgies
biologique
dietzen
tribosphenic
ouc
icbn
daska
selita
straightaways
recinos
vinoodh
gemcitabine
miraval
tter
wylfa
reyhan
besmirched
zhilin
klyuyev
boroujerdi
catergory
chofu
andr
colie
creekmore
ahenakew
janiak
cadeaux
poonia
randian
kuszczak
dennington
tanika
abdulsalam
pardey
skomer
unlikley
antsy
yurie
kouroussa
bizenjo
bechmann
langbaurgh
akta
yurevich
arza
mcwane
brindled
boomtowns
fscs
sheepherder
karson
acnes
barsac
uninviting
lexile
averbuch
shults
hallion
ghencea
medoc
endarterectomy
whatsit
barbarina
humbard
rousay
tuncel
nucleare
pyongtaek
hetal
sarles
tabernae
obliques
passfield
stori
majar
escapements
corporeality
yusri
poortvliet
minshew
seeling
gunawardene
lording
lutie
meric
moceanu
backpocket
flambards
chaskalson
brickmaker
gateside
verduzco
prope
pulao
cronista
edelen
ruhengeri
maiella
lotario
leinbach
leil
chelford
rogate
lupines
whatevers
mumby
marinakis
fastweb
uniqua
carsdirect
spitznagel
roso
khine
balladeers
dementors
lingyu
italys
scribbly
mutley
saky
lichuan
lightless
triazolam
hullo
memorium
acquaintanceship
pptp
gladwyne
annigoni
linnie
artschwager
fremm
roncal
kirsh
acedia
rabka
paleis
schreckengost
laprise
cbct
angularity
reflectometer
goldfoot
herpetic
tarsis
exclamatory
skittering
pepping
sunport
häring
maidservants
mcmordie
limington
kaffirs
necrolysis
crystallise
dinastia
maotai
veerman
flitter
wazed
ektachrome
emotes
rerated
acrosome
marveling
pikine
hasib
pression
germicidal
boringly
carefulness
jubair
campoy
epirb
kende
bounceback
louts
ajello
moita
hollowness
ledgard
glinton
multigene
atrai
tudorbethan
allg
pflaum
pouillon
aschenbrenner
angelyne
budds
heureka
sheepfold
farokh
unwelcomed
morlan
lightworks
schrage
zhiwei
heptones
naics
humar
millns
scalby
felmy
outbidding
faceplates
basenji
shukur
ernö
hindelang
zerelda
trusler
unocha
foscarini
geosystems
lefevour
fegley
schmaltzy
isw
zentai
maximisation
shaha
ungentlemanly
olguin
epithermal
spiegeltent
cerner
fengyi
networkworld
klann
wuhai
fery
paygrade
simyra
ppos
mcvaugh
tinniswood
blueskin
rrh
pahlavan
backflips
ngd
mosquée
irritans
wongs
odac
tisane
kivalina
shamshuddin
yoknapatawpha
cinto
younès
ostracod
grindin
arsham
tocopherols
abertzale
cinemateca
lauric
oppens
deshon
paloheimo
môme
hydatid
opdyke
pervin
upsides
frecklington
corbiere
bobe
intikhab
crisil
tankards
skandinavisk
delahoussaye
disqualifier
kohle
viramontes
ginting
abraaj
aisleyne
brabants
kangol
wavelike
jamario
harib
arrhythmogenic
basualdo
meria
wessington
baobabs
parvenu
kohonen
sfard
chamitoff
spelunker
lownes
greencard
hakola
rhome
kilims
diardi
metsamor
psychonomic
colls
parche
hoshide
gadabout
ecumenist
riffed
progressiveness
warehousemen
geesink
steuber
chkalovsky
saffold
shangrila
ghettoes
minie
pudil
allesandro
manneken
aeroponics
eenhoorn
sexpot
scrutton
salsberg
mugisha
mallison
bcsc
herbers
boghosian
alico
eichorn
tref
tessitore
hertzfeld
sfakia
chevelles
karystos
mccurley
anual
ruili
elze
altice
jeev
lebaran
garas
gefilte
qarni
torm
wonderous
halldorson
zulaikha
stuk
filleted
stoeger
mikuriya
ettington
stateville
beekes
implodes
kiliwa
mironescu
barkeep
eibner
unpublicized
reknowned
strati
kingma
nakahira
zbikowski
lindholme
caisses
chrisp
beign
maccrimmon
cosic
fraizer
instow
parchin
mpq
vavoom
taepodong
murbach
kehlmann
plx
marghera
jennett
awilda
whitewell
jiyun
tropicalia
absinthium
hakman
cockettes
heveningham
mufi
katisha
rimonabant
erics
baturin
ramboll
micropayment
sokolnicheskaya
anticolonial
moeser
centerport
exame
nouriel
giba
krishnendu
argyl
emboss
kolp
checkley
scroby
samois
kuroneko
wemp
lbg
qdr
bashkirian
madobe
ahlawat
civoniceva
kernohan
pennycook
henricsson
martien
hotfix
interlinks
oversexed
steyne
bankrobber
ilas
archaelogical
stansby
tilahun
sylphy
milchan
mussi
steinbruck
nahhas
mistajam
thermostatically
delee
trachycarpus
zuiker
polenz
duberry
boabdil
matsuzawa
shorthead
pyon
sophistical
eisman
teshigahara
pugno
ngcuka
titchwell
tetro
ocps
dachs
tobs
mcdougle
whitemoor
striper
duntocher
villasenor
perras
blatchington
untung
dispirito
synn
epicurious
torriente
pitera
foyles
moisander
zambrana
emos
farnerud
lotts
romansch
foong
shirlington
inquirers
codswallop
bythewood
kaufhof
gonson
solanine
streib
claverie
kisatchie
trajano
lipu
impolitic
geha
capitain
blowups
sajda
govtech
magliozzi
anoints
lakanal
contemporaneity
lops
prts
fgi
kirori
ramius
seuthes
nottebohm
kornienko
reprieves
siby
pruss
ashenden
mision
montevergine
harkaway
schaack
rahil
warga
equipement
clocker
deepness
maslenitsa
kadhimiya
jide
hdcam
largeau
kpf
laurine
footwall
masire
aacn
creepiest
overexploited
perihan
guggenheimer
merill
luedtke
breu
coldblood
emrs
etg
paysanne
lindert
launchings
imbecility
delforge
sherzai
mayet
rashawn
houblon
innovativeness
wtvy
dgn
visteon
stocco
savvis
faucheux
alittle
karera
crondall
oee
mariotto
finigan
kennish
shouters
conflictive
bedroomed
undramatic
mayland
advertisments
malae
predeal
bakkal
fanciulli
fakhar
rosenvinge
universitaet
djilas
nihar
fsma
bolanos
stegosaurs
glucksman
jaza
chancellory
delice
intelligibly
overachieving
vorilhon
sretensky
delaram
balchin
roadchef
rodat
terrio
harison
mavic
bilde
palmore
sintez
kilobits
catroux
twillie
okot
firths
nafion
zoet
bossie
katigbak
scuse
gvhd
waitaha
hilling
charouz
witcomb
keratectomy
mehmud
malbon
wegelius
lockshin
minnillo
policja
usti
dravet
janosik
assenting
kosgey
felica
yiadom
postindustrial
tanaquil
usurious
schererville
criticsm
threating
broza
fritze
mcphie
cspc
ludgrove
nativité
sevmash
kywe
biswajeet
dunsinane
francolini
pepeng
mustansiriya
idei
niceley
mollica
zatkoff
apics
defoliant
servat
hereof
cossu
oceanlab
lythrum
kratka
dagerman
gendun
ciesla
odera
mangy
syntel
yorkshiremen
laquelle
wainganga
wvia
müllerin
aronsson
thabiso
alandi
jellyroll
ghanam
herrod
signally
tohan
mwenezi
shafaq
willenberg
picards
hoed
cyberterrorism
immobilisation
aulakh
careen
foleshill
murrindindi
astolfo
fornes
sublevel
peverley
stultz
zagorakis
karpf
participators
cdns
hubspot
wolmer
wimmin
nutjobs
biodegrade
darkplace
subdermal
hunkeler
yongchuan
schepens
highfill
fucile
déluge
parvo
selchow
sauerbrey
funkee
thise
yacoubian
rehabilitates
tatin
leymah
laindon
damerel
embolic
monterde
reunidos
dolgan
convalesced
chartham
misstate
chorney
idealab
uniter
fullbrook
duncraig
moodiness
jaspreet
ipps
zaldivar
lavash
rgh
marinoni
bubs
overpeck
disgruntlement
homeports
noue
miscommunications
abrash
sangare
koncz
dubilier
seyran
terrail
thrun
overstay
bardy
desertec
theodorou
gmeiner
wataniya
makari
fluvoxamine
oberholser
chatrier
yaizu
playtone
acoustician
holim
arrate
piara
hutin
werley
ratched
portioned
afram
iwpr
baerga
woosley
launer
hyperreality
debarkation
anglicism
sstl
outdoing
tarle
nzc
wefts
shunk
roychowdhury
taufua
elph
moctar
leija
shuggie
martill
zemer
dimick
dicuss
declamations
erwitt
kunhardt
vinings
parameshwaran
holan
moussawi
prizefighting
kirkmichael
ulceby
ouds
unexecuted
umland
coff
subtotal
literalists
addit
devisive
gurwitch
manzullo
mudry
dewei
longbenton
goldfinches
thieriot
mulumba
josina
laboy
cavalries
juande
newscorp
mihm
tweedale
abry
diffrence
rosenhan
liferafts
ballouchy
threatt
grabavoy
tarique
myrto
divakar
vaporizers
pindell
hols
hyperacusis
rabassa
feest
agness
faap
abisko
intraconference
ladell
brandwein
relyea
kruczek
polshek
ayar
frizell
cizek
thereza
ardboe
cleanrooms
ruca
brunetto
controled
obliviously
cossío
degema
peopel
watc
moreaux
lifespring
veizer
bolado
yerwada
rowbury
neroli
haselton
frauenfelder
idrissou
freeston
iwaya
jumex
ypi
moggridge
pykrete
nimule
melanic
hollier
gilsa
buffay
puzzlers
landrover
svenja
ottershaw
ragamuffins
eefje
edstrom
tekel
thei
mcweeny
irukandji
lantirn
huckabees
folli
lchs
handycam
hamsher
cantil
zouaoui
interleukins
vespro
aragones
gne
beac
radzinski
seznec
cyberattack
soza
bernabei
neligan
ceap
craigmore
amhurst
kakan
markranstädt
vrouwe
houchard
harwinton
reloadable
pommy
ezechiel
etablissement
pintails
mirboo
kirkoswald
stickin
easly
laurentien
gunten
apotheke
procede
icrf
floud
unctuous
vacationer
layec
gnadenhutten
uclg
stüler
journos
soldano
hopefield
okumu
abras
desaturation
overreacts
micklefield
robiskie
tarlochan
montre
recalculation
docomomo
kunyang
frechen
mubin
mccartin
dismounts
borowy
mojado
julies
stoel
shoveled
woodhenge
severns
ques
perambulation
tahuna
gork
diac
guideways
scandone
antifa
muzzi
mispellings
astrologically
diomande
allots
kanbar
jitender
degradations
vileness
wetback
rosmah
sridhara
gaugh
norborne
korach
fradley
ibri
manzer
ecom
frasor
umid
wicketkeepers
lancelin
perriello
strohl
vassalboro
teleshopping
samuela
easthope
syring
primack
trimboli
capucci
hous
cicogna
tabucchi
sates
redouane
kargbo
simoncini
milov
opensocial
varnell
brockhoff
pongolle
methemoglobin
preprinted
sonnenburg
zarakolu
arien
ecas
triqui
ruyan
derangements
simos
fireblade
synan
fekadu
kepis
menwith
kakodkar
halama
lahmeyer
ammiano
europan
brabender
berbere
agonize
paulien
mathu
audaciously
alioramus
parkyn
hamis
hidebound
cavit
tintwistle

sidway
picardi
unhatched
murfree
gashimov
hermawan
kores
tarapaca
buckboard
koothrappali
doz
zonn
vcat
millercoors
saqqaq
snri
calida
kalanianaole
mellody
aneurism
mercieca
spello
rosoboronexport
malkowski
shaffner
qinghong
eyeopener
karnofsky
ecgbert
lelyveld
milkmaids
henmi
lhi
dyslipidemia
arwenack
tyerman
innerleithen
eichenwald
rosaly
regrade
conflagrations
giler
nacro
zathura
youview
ebk
shehadeh
strida
kargar
greenhoff
pheomelanin
wellmann
cyberathlete
seamers
chapped
szirtes
picault
evinger
theofanis
hoffert
financer
barbours
reiger
shepardson
kabamba
monotones
asikainen
hegazy
mwinyi
zoabi
foodtown
desparate
colchian
mccrystal
hairiness
hardwar
therapeutical
messagelabs
rades
boryeong
suggestiveness
encontrar
zeese
chavalit
blasdell
rubai
rocamora
cornbrook
cultra
rejectionist
zherdev
crusting
mhn
russos
iadt
gotts
shela
masochists
ettridge
tchaikowsky
ciavarella
betto
kiddieland
trilla
balakot
lubell
randazza
zzzzzz
renon
harpole
aspland
glushkov
gyulafehérvár
palimpsests
puking
eppolito
kinlet
lightsquared
mynci
reclines
greenes
celona
calanus
nowdays
geragos
riggin
paraprofessionals
pawsox
abukar
pullet
lorusso
shahba
abdoh
stölzl
murni
elsi
sterlite
knidos
muratore
beavon
northiam
minutest
töpfer
efw
rededicate
martinon
duros
hefeweizen
gorgie
revisted
voiron
levitates
worldskills
sansha
pearlie
thnks
bayton
ravelston
lipsyte
decouverte
beitz
coalpit
valedictorians
cudlipp
hazira
brashness
lgw
saughton
ferriere
granjon
lipnicki
gendercide
ziggurats
lolas
idiosyncratically
sensationalised
cremins
deductibility
superyachts
buscher
modularized
zes
tecolutla
salka
sterba
ozama
mmmbop
asuma
prebuilt
melbourn
mahinmi
satisified
ambien
rehavam
laywomen
wahren
flaum
evart
maghoma
verstegen
undestand
sigir
phlomis
bowthorpe
scutt
fetoprotein
shahroudi
corporacion
monee
mersad
rockeries
stehlin
blago
unstained
liquify
scourges
demonstratively
fullmoon
süssmayr
diamantes
productiveness
parreno
ecru
kolinda
clinking
adedeji
rehov
bookland
cahora
wendelstedt
schuetz
bohen
solinas
frasco
sheli
eijden
zahraa
redoubling
tamr
kentuck
spymasters
madaraka
invergowrie
kà
ndv
lugovoy
fontoura
kanas
bartiromo
transducing
perner
walland
pentatone
bénouville
mungiki
harberton
infelicities
govil
glancey
sherrer
teranishi
chesebrough
ments
afewerki
ruley
muzzleloading
jackfield
söze
layin
zlobin
hypoglycaemia
bridgeford
wordier
recommitted
sightly
ultrafine
firkins
gartree
advanta
kuntner
rasselas
unscrewing
havenstein
yorkey
disinterment
awar
nizamani
puskar
kuangwei
subparts
soulfulness
sketchers
jaunts
vahedi
laris
gayla
vilakazi
dessaix
banisters
ayda
wreake
hadra
westvleteren
schneiderlin
lougher
giardiasis
annmarie
kakuma
bellowed
unncessary
unquantifiable
pelmorex
mayar
feemster
golddiggers
flunky
southmoor
sikandra
hennesey
revett
nakedly
quasthoff
murcott
reinjured
hindson
molan
lusia
mogambo
stope
shene
wipp
brylcreem
patroling
chauviré
germanotta
hawksbills
adee
liposomal
sorrowfully
tankless
zarka
premix
lavishing
guangyi
zeira
plantlets
cooperator
recoverability
witherell
fusionist
salgar
pharyngula
gilreath
shallop
steelmen
obsessiveness
lettable
aughnacloy
alread
kishikawa
infernos
disasterous
onychomycosis
jorie
diarmid
nurminen
fakhry
mohegans
catsup
macmullen
makerbot
chukchansi
masquerader
charabanc
diffidence
diaboliques
ndez
stillie
langenbrunner
stavoren
paraskevas
wenyu
calp
obraztsova
superdrive
fleitz
stoick
duey
pianeta
osbeck
ponsa
toiletry
eshaq
togiak
rereleases
etorofu
okino
intellij
jath
isolda
andratx
tertulia
bradenham
ukas
mediterraneans
haathi
kilmun
tatia
senft
bucyk
deindustrialisation
bigman
dehumidifiers
scorsone
makkonen
turnhouse
kuhak
berghoff
gashed
pallini
schlosberg
barlaston
freephone
hentgen
concurrences
luner
fiduciaries
asphyxiate
humera
pauperism
aubuchon
aquos
walpin
cosmetologist
sounes
remanding
lonza
unmindful
lopen
icef
pogatetz
favalli
gliosis
oversimplifies
deardorff
accomodated
rokach
bussmann
greeno
hamat
newbrough
valorie
zigmund
tomiya
cadaverous
bedwyr
dumpsite
webbers
baccalaureat
hellings
kerbside
invitrogen
briskman
hogans
aneth
transmittable
zelnick
butyrskaya
bpu
corna
dmitriyevsky
makarevich
apma
vuckovic
kampe
boukman
breakdancer
bondra
fraise
parazynski
aberdovey
watersons
sightline
bedgebury
maddest
wastewaters
sampit
pingjiang
slake
btus
riddel
handfasting
jungleland
filipowicz
scoon
zhihua
rapino
asaoka
adulatory
maselli
taimoor
rehydrate
concreting
afanasiev
uani
qingyun
myners
ruijin
gideons
zilliacus
aggrandize
levamisole
waltersdorf
briarcliffe
workes
milkwood
windproof
fwy
lobell
neaga
vinall
trapps
brainbox
elvie
raiko
raevsky
xlendi
silvo
pacemen
tacurong
walewska
caorle
christenberry
dongya
agey
liveline
conceptwave
brucan
guowei
hartwich
mayoralties
dyak
kecap
poptones
phenylbutazone
mehrangarh
schexnayder
worboys
niyi
subalterns
vallauris
banorte
mcglothlin
stasiuk
lvd
quandaries
thromboplastin
gasco
perroud
zhijie
aerotrain
shopfronts
mutualisms
pierse
fundraised
verison
kozai
smic
saabye
stockmarket
inseminate
spurdog
weichel
gourmets
amphioxus
tulkarem
bbrc
gholston
soufiane
megakaryocytes
rubem
corrada
kgi
attachable
smooch
fulghum
trepanning
wickedest
shincliffe
cardamon
glyder
garendon
singsong
foxhunter
duquet
coattail
rushe
belisa
subnotebook
zantzinger
bastyr
delegitimization
archigram
chacewater
thek
templepatrick
mackenroth
smythson
dxa
taquet
blunter
stanka
linthouse
khutba
radulescu
moyar
muszaphar
paranjape
forstchen
wloclawek
ashlynn
medine
cowriter
bumiller
knopper
kondopoga
stainback
boerner
vierra
demonologist
arnol
joleen
defectives
graffitied
cubbage
lelis
aircrewman
ugali
abengourou
mousie
mww
gettleman
monetta
graine
menez
jetée
notwist
urenco
complexing
inwardness
zussman
passably
hafencity
armado
anantara
nunberg
carno
shenzen
monetarists
jahid
jetro
whined
terceiro
aleen
deregistration
kangnam
armenta
wordsmithing
aunties
mulgan
fanmi
greenhornes
lescano
griefing
gtos
tgb
vilda
kornbluh
terracottas
roho
morwenstow
hübbe
tuckman
thimpu
stepps
vietnams
piercers
khoza
dubcek
paulescu
netherley
larg
tjapaltjarri
brownes
florianopolis
underdrawing
demacio
robdal
microcassette
katsuaki
nebulas
blaupunkt
chippendales
dromedaries
waaaaay
cardine
souillac
dattani
churchwell
villone
eow
zabransky
sendhil
freshened
lcss
edey
sabaya
amrozi
cmss
squelching
tittel
uddhav
jayaweera
riesener
correy
chelton
suncruz
utai
altium
amirite
backbreaking
ohama
hiyo
amarjeet
jindra
passyunk
laquidara
engleheart
teimour
shanmuga
carthew
nhatky
kinchla
parlotones
molinia
creditanstalt
kinfauns
buspirone
thena
overstressed
youngberg
fjc
treiber
douthwaite
clandestino
ellens
ishimatsu
appetitive
wongsawat
downpayment
costen
wainhouse
cahills
sunderlands
biabiany
lattanzio
rigell
cinelu
dalis
babushka
nethercott
lorius
kasprzyk
letten
pandigital
obodo
pearlescent
jaison
ferreiras
sicken
chatan
unpleasing
tomasulo
cags
isae
phillipine
ragozin
hotwired
idria
lovebug
bundt
ghanbari
geographics
bergomi
limberlost
mutz
oglebay
necesarily
dependance
pirovano
internasional
liposarcoma
rutshuru
πr
georgij
wnv
woolhouse
timsbury
koach
explosiveness
isho
mayanja
casquets
scarily
lawerence
pilibaitis
cardiovasc
cinelli
ostaig
eaga
nccic
evanton
bomani
rumore
craughwell
throneberry
coxes
dustbins
interrelatedness
softphone
dursunbey
fleurier
jonno
superimposes
kawin
comedically
subcontracts
jtrs
ozimek
cignetti
corsetti
dziemianowicz
kolby
lorens
ginowan
michalowski
asbel
hemophiliac
krant
multisite
venora
isixhosa
mackler
sparsholt
argenti
exempla
rwi
pedram
mossadeq
thatto
ccne
lalchand
nihill
ferrill
trendsetting
hoffecker
xiaoguang
vims
yve
liow
temmink
calt
twinbrook
yahiro
heylia
padberg
tumb
kypros
abdelilah
tharman
hungar
wellock
manassa
smerconish
weich
exclusivist
gusau
dewinter
ibexes
sragow
driers
whitcher
norley
wenge
leisa
safrole
immortalizing
hafta
milongas
koenigswald
fedorowicz
jermy
serviss
jimmo
kaiseki
jolokia
backbiting
schiltz
gaywood
doulting
kiga
armwood
hardacre
aiders
macmullan
jauncey
oblongs
mythmaking
granath
glaswegians
sck
hepcidin
struss
ndk
nymphet
reince
eilis
rogie
crouzet
kog
etkin
sprayberry
taormino
sanco
radiochemical
sqi
gislason
atieno
gridlocked
vasanti
toxocara
stiffest
rozendaal
broodstock
heuheu
micklethwait
ruza
loranger
nonexclusive
ippr
argoed
syndrom
sturdevant
purlie
shinfield
slocumb
menorahs
gostiny
pixi
disabused
zaib
jakie
zipzer
nrega
fafa
fayt
beachcombing
iconically
myeongdong
dolling
wce
thebaud
dianetic
shaiba
academicism
leucistic
behaviorial
goldfine
zeynab
silvennoinen
piggle
lubovitch
eddins
airlocks
preschooler
rany
agglomerates
folta
petralli
videotron
sepi
shvets
prebiotics
drizzled
adoptable
obos
profilin
kerrs
biang
reclamations
kemna
dbk
ardeatine
devilbiss
ehrler
samaan
tryweryn
baisley
preffered
reenen
baddrol
depue
theodolites
odoratus
incandescents
crangle
liebster
roué
odni
geoscientist
villares
sheinberg
mendeley
xdcam
busek
kalua
shimao
thurland
uncorked
sandero
barkun
ghione
wintergarden
mycoskie
jbm
iasp
boula
zelienople
speechwriting
dettmann
manipulable
oncol
vrg
arugula
salaheddine
kgr
ndma
bedwellty
boseong
hatbox
nly
cuvilliés
schwengel
dinnerladies
reccommend
tareque
jindrich
jamul
ourso
quyen
alican
scotsport
bürki
jilt
indescribably
caracoles
shanmuganathan
sowe
athwart
harrick
ahmat
floaty
seelos
oluwole
shrimper
mithen
penrhiwceiber
chiseling
mahale
disapointed
copings
dishy
egerszegi
rezvani
tsogo
argentario
dispassion
laverde
ayd
broadstreet
delev
wafb
cherubic
sader
ibec
bezanson
wassailing
grishuk
wedin
cision
grimaldis
szczepaniak
epitomizing
kest
leaseholds
tuenti
rubberised
flexx
trawden
wehle
kirka
mashiko
bheinn
dispossessing
patrushev
pratfalls
apostolou
sweta
newhey
wyms
microsites
narcoleptic
nahla
supari
forkball
narus
odle
mackowiak
zurcher
dunstanburgh
tetri
grammatology
colorism
castlebay
cmte
noninfectious
roitfeld
biches
wittelsbachs
soltam
buckram
careerism
burnouts
currahee
vasara
uccs
poller
skokholm
turbaned
firnas
oxberry
pensionable
shirlee
daoists
milbrett
lilys
codford
rubiales
nmci
hoglund
kartell
rensselaerville
juridically
filipetti
alw
trailered
telem
pathbreaking
mezze
portswood
deecke
methylhexanamine
enrc
tcpa
tarell
vasectomies
aplus
yenagoa
eij
tsultrim
lerida
asrc
candybar
marista
stobbart
buchheit
bioengineered
pacifiers
haapala
mackeson
elwy
moelfre
grymalska
ziman
nishiguchi
nakumatt
laraque
yertle
kanju
woodmore
planetout
felch
biomimetics
drypool
xerces
roley
hershkowitz
okasha
mizzy
magnetopause
doublings
tardigrade
beauticians
postiglione
zipfel
ajmera
lubricity
mechtilde
quadruplet
laughin
kyunggi
valdano
nauruans
zaydan
spri
hominoid
rahardjo
reydon
cincy
interlachen
auman
sarne
yankers
roomer
hallmarked
nakhoda
aprender
huysegems
tetraplegic
speedmaster
yima
mimos
gouty
keitt
micrometeorite
tarquinio
refluxing
fuimaono
abergel
saxondale
mallinder
nadda
batanga
osor
icdc
jillson
srikant
subbotin
kisspeptin
matteau
linson
delme
felson
amcs
payscale
ncra
steamin
oberholzer
futaleufú
sagaponack
travelmate
abair
macconnell
shong
kolaghat
weltner
hummers
jonz
nordlinger
upstaging
facebreaker
embakasi
varel
prea
hoegh
michler
glossopteris
udrea
sevil
wlib
mckown
mismatching
cavewoman
jaggy
yamagami
sigge
deko
outboards
brancacci
bashkiria
jiwei
scarratt
cebuanos
urena
kuttab
rupie
rickroll
salhab
cavaday
rolvaag
naoupu
squabbled
nahalal
fractionating
aldinger
overlake
hoverspeed
pamby
weschler
klove
hirings
indosat
bianculli
maualuga
ghiaurov
dambach
fleagle
cogman
nacido
porlamar
dijana
savored
mammone
alving
proferred
esquerda
bespeaks
celtel
dnevni
alpinvest
bclc
zwerin
milley
pisey
partos
jerre
sousaphones
faythe
reichenowi
demonise
interpenetrating
guangya
sunitinib
metoprolol
penhale
retranslated
pract
lactulose
dwn
hayler
petito
breneman
balers
preppie
lovelle
whitting
brakemen
hasp
wisbey
olinger
marielena
arjo
macramé
parwez
chameau
spätzle
fainaru
loftiest
cius
bogdanor
neustar
berau
fleder
whitethorn
bendlerblock
lessac
caran
lenbachhaus
beihong
cogliano
palombi
disdainfully
haycox
popayan
laviana
nyathi
overflown
canovas
blough
inverary
yiping
rabab
takanami
bungles
pantaleón
supersemar
imrei
astringency
cariocas
tirtzu
inukshuk
veridian
schweikart
ansin
turbie
terryville
frado
angarita
eichengreen
canai
passionflower
thiet
gasbags
jutkiewicz
helmar
joseline
ivany
gudermes
aright
scalpay
ellert
lacerating
fokina
robicheaux
cxl
storeman
shenkman
thomassin
bastardi
ibekwe
orosi
permanents
antidemocratic
neshek
bentzen
okrika
stridency
piscitelli
helis
pankratov
toniolo
oline
scheunemann
edgemoor
pinola
reemerges
containerisation
triwizard
surveyusa
inchkeith
mungai
reseal
counterdrug
gaugler
caffé
kashkashian
ence
mantou
desensitize
acctually
altmeyer
newitz
anosike
ntuli
dallek
irsyad
stahlberg
fahan
dalgaard
rabu
igate
radloff
rhodey
eyestripe
horatian
bankrolling
breman
immobilise
needled
mccullin
sinas
chademo
dormans
novellist
doraiswamy
shahana
experimenta
sniffers
shafiul
maslowski
pridie
regularise
leekpai
canobbio
tianyou
mitsugu
journolist
kalaya
eyad
idy
luchsinger
maiken
sensitiveness
mokaba
malvidin
porté
steelmakers
paulistas
jili
cataratas
depigmentation
punctuates
andamanica
tecta
guibord
principlists
coiners
transparence
plack
darkies
pavelec
maraval
sivori
edite
hechinger
runnemede
hirtle
recolonize
exs
pliyev
breamore
macierewicz
rehydrated
crecente
bigwood
pinkins
acsc
doonbeg
litvin
anthelme
conseillers
reestablishes
kapadze
ectopia
determing
xiaoying
operationalization
shlomit
bignone
wymer
suchan
siyam
varyingly
quiron
figgs
deene
bacigalupi
oswalds
croisette
dalwood
gunmaker
osita
chaddesden
terentyev
bagheria
ejogo
fallsview
bordellos
pianism
consignations
decaf
ccfl
vacheron
zaba
freezone
slowey
robshaw
grinkov
imray
pompiers
alchemic
kanbara
tritle
peppi
hightide
forelli
sportswriting
israelian
charbonnel
kylián
lanhydrock
chilbolton
nipe
wellesz
memeber
kerbel
orache
marbut
douville
breugel
vatopedi
retana
reconvert
saprykin
kief
haval
geminis
crouton
scharner
metallers
kornegay
worldham
idma
zubeidi
dsge
karlene
borota
basker
kangleipak
intercuts
carrilho
demineralization
santina
huguenin
coverly
belacqua
tchen
packhorses
rajapaksha
funked
joz
namuth
reignites
raska
afforested
fakhravar
maharastra
brimson
xmc
pinegar
antioco
narela
klark
teddies
madrileño
pokka
nunsense
legorreta
syle
oxalates
qumi
haner
heresay
deqing
pilati
vaus
bafin
canigou
zenos
completive
hauben
shanzhai
auditioner
cortexiphan
letson
pierotti
medjool
iguatemi
authoritive
reucassel
vernie
martock
palaia
atarot
kawas
coromoto
madelin
mova
bolme
jopson
stacee
maltman
unspecialized
mamboundou
outsources
zavos
sarni
censorial
tbwa
taing
khanaqin
shatalov
ransohoff
vladmir
specula
buongiorno
seronegative
blandon
massin
bassuener
lohra
consuelos
mensdorff
pergament
muen
schang
varya
sarraute
ruv
carbapenems
flourens
samruk
oidium
gashouse
salako
landingham
rokus
icus
whitepapers
verweij
speakes
ehab
atrios
anarchical
wctv
roiling
kines
tootin
halat
speediest
ungoverned
souray
antivenin
koplowitz
rustamov
impune
lathrup
echeverry
opon
onnagata
meintjes
belot
olman
caprea
ratajczak
savable
proffering
idelson
weli
vermontville
ctrs
longnan
mvl
lippett
bobick
abloy
nikolski
estreicher
ginsu
brutto
targetable
novator
resona
comahue
inishmaan
ascs
superabundance
equivalencies
ruweisat
shrooms
thrombi
architeuthis
zaruba
dinho
pokernews
tzvetan
tumin
turlin
excoriating
molefe
scriveners
ackermans
logans
crpc
reefers
iannuzzi
baquerizo
whay
specialis
kingsthorpe
hajra
egomaniacal
grunsfeld
vasovagal
swidler
resorbed
pharo
cosmopolite
julier
shuffler
fathia
popples
disneynature
leili
jussie
bollani
spiezio
clal
scalper
patapon
mounded
econômica
wafiq
numi
belardi
prasenjit
sadow
dominations
lapps
ravera
torsades
dargai
dauger
autopsied
gullberg
waft
glw
estephan
bernales
wachmann
medad
orangina
aerate
mantain
citied
lovellette
kasztner
orioli
licensable
ragley
horizontality
hypomagnesemia
imaz
dkba
benhamou
registan
polier
gwaelod
hensarling
exsanguination
mirghani
garavaglia
mcnees
propagandize
margiotta
sobranie
ahds
spoilsport
norridge
veldman
mennie
blundy
vainglorious
celedón
kelcey
bunthorne
olodum
hardberger
thate
precociously
gibbering
twirlers
oramo
thiselton
magnanti
sombat
zbornak
cammaerts
sippin
metamorphosing
ellipticals
brahem
thrasymachus
italicizes
evangelio
lempiras
succinylcholine
semimonthly
ventriloquists
helenio
dcfs
ixnay
djamila
weimaraner
fotini
leasure
shuford
heatwaves
focolare
sambil
kessock
coomber
hanikra
diegan
thumba
bowmaker
roelf
morrey
nocentelli
pankey
mazzuca
sandvig
tatshenshini
evraz
atrix
soyka
ebsary
malino
edito
khanjian
kraljevica
napravnik
wintery
cannas
wycheck
labradoodle
schuco
indelicate
mhlanga
wickert
dustbowl
alderbrook
prydie
inebriate
studbooks
reuland
ladwig
shinsegae
darwitz
shanower
copthall
nametag
pyfrom
domoic
labrocca
familiarising
highbush
cherryvale
riegger
amdocs
halak
rebekkah
krankl
geman
malyan
sasikala
aubisque
kyauk
cza
baldoni
aerostructures
bunted
duplantis
trousered
interserve
ivd
meiwes
wycliff
jolle
mixu
brms
lalic
ruggerio
papercuts
cordura
kaplowitz
rendina
mistley
wockhardt
cawsand
plutocratic
boxfish
uko
karelis
backlights
bevs
katoh
pacom
praktica
torode
catsimatidis
bareheaded
goulder
communityamerica
schalcken
amoudi
khaemwaset
capanna
nafe
postales
chrismukkah
charlayne
wilgus
runneth
otman
mitic
cardiorespiratory
timko
curreri
aliadière
eraclea
hieronymi
phonecall
pekao
sichem
melchester
gosht
morever
geissman
andalucian
océ
twd
pitsford
mettawee
varosha
superheat
chernikov
rylander
rabson
kamarudin
factious
schrieffer
arresters
trigorin
steinbauer
heatsinks
splodge
uneca
nakamatsu
destremau
acklins
desmodus
kawika
casselberry
abéché
nyad
gabbiano
bilyk
theatric
himmelreich
uhse
saloniki
boppers
joorabchian
mrj
maybrook
holocausts
getaria
iaconelli
kesselman
venditte
bobbleheads
spooners
elkader
biskind
nobilo
rafel
sensitizer
timebase
domracheva
echavarria
pilferage
reilley
intersexual
wadl
tincknell
dheri
linse
ppcs
mushkil
ponchartrain
paraprofessional
irresolvable
barez
yars
diedre
isaa
sofyan
lazebnik
batrachochytrium
antigona
aveda
tatas
intrawest
reiserfs
riady
aaqib
barnburners
sypher
toners
pitesti
farhana
pilanesberg
olatunde
hedsor
miked
waihopai
cobourne
trevathan
kelderman
kltv
gines
hackforth
wsls
houri
bankrate
khdeir
desalinization
milanesa
wilborn
unpractical
fujikura
vavau
gandhar
uemoa
bennewitz
hardley
wishman
daboll
marjon
orefice
dadford
thik
zeder
pallion
augst
guman
vigouroux
kanev
reoccurs
kidpower
posas
wichard
masoe
schmelz
gader
hmn
hmph
moxi
swinnen
keersmaeker
upending
ukti
geesthacht
otterbourne
empaneled
dcaa
supersized
lahmar
lightsail
lipica
americanised
felindre
garnishing
giarrusso
clennon
mutrux
badiane
horris
vindice
maní
deolali
schundler
minquan
lessin
graziadio
klinkhammer
nuyens
barú
tollund
ciccolini
sangen
occhipinti
sitars
webre
vitrectomy
amethysts
thackston
lohri
ballardini
mcgeeney
telecare
husting
accotink
cedia
lamblia
chittering
ssns
riach
liveris
normandale
baloi
efface
ecfr
jejuni
strs
margarette
nembo
lameck
setoguchi
irglová
hetzer
astellas
rumps
schar
radel
lakoba
eoe
fraiche
tessalit
apigenin
oldmeldrum
ampt
routray
ladwp
cheapens
longhope
counterbalances
menkin
lize
eliasberg
kashag
walsch
sharmaine
roraback
yesteryears
marean
tuulikki
fuerst
baggini
zhucheng
leintwardine
aeromagnetic
lipshitz
jurman
soapland
tisci
systembolaget
itin
assistent
bioassays
millivolts
comalapa
micromanaging
mingchao
borloo
sscc
safai
lamott
éminence
laubrock
duffner
suchinda
abade
ollivierre
penalva
flaim
humaira
irrelavent
boneham
outfoxed
delabole
houndsditch
idrac
murro
strobing
imagistic
dalessandro
wolfley
neurogenetics
myelomonocytic
cabrito
adls
jihads
mangudadatu
teamers
fidra
overgeneralization
joique
uncrewed
lifejacket
llaman
kyron
crèches
reissig
ynysybwl
aspers
koenders
prescriptively
shiley
snorts
woolhampton
dook
ponnelle
ajm
pseudoscorpions
aldersley
djankov
cohosh
hultin
cornrows
mesaieed
roccaforte
precourt
negoro
beldame
warramunga
epcc
stangeland
bolotin
abets
suicidality
microsystem
falchi
downshifting
slea
pagos
kerplunk
swarf
kolata
chri
seawards
plainest
diplome
nusseibeh
heronry
nsta
zishan
wyndam
jurnee
meah
meadors
vadims
maddren
haversack
bamidele
hahm
chanos
blackcaps
cagni
winnowed
blaak
gilette
abdelhadi
bikfaya
diametric
soundproofed
zonker
organizaciones
texter
bwt
jerebko
aborn
floetry
sensitised
hindlip
badmouth
plooy
masci
allopregnanolone
mascaro
sulloway
sww
aristolochic
dilettantes
rowdiness
mixologist
feijoada
anacapri
casby
hacktivism
bauke
pussies
soussan
celmins
marzocco
mrcc
gamero
itchycoo
stoles
wrigglesworth
payaso
peacocking
ruocco
promethazine
rlh
adelberg
murcutt
revanchist
mestral
rhosllannerchrugog
trainloads
ayso
yasbeck
faizul
oberpfaffenhofen
anseong
oakworth
scis
sidelong
adeje
colback
karti
svec
upbringings
amsouth
onj
fujitani
wsbt
corie
connexus
departements
ehrs
alhurra
cloudscape
huxleyi
garavito
refloating
prospera
retweet
hepler
fabela
guoqing
mytown
voluntariness
otx
killefer
mcelwaine
fridrik
lawnswood
gsxr
petcare
kval
ristow
bustles
wizzy
cushnie
darche
savov
rangiroa
apparatchik
straggly
hoerni
ruak
kotin
firenza
riyo
calicivirus
jousse
eurosystem
dimuro
mindshare
monopsony
ricciotti
wkmg
granoff
chipps
raim
wähler
sylvette
kempin
duwayne
leick
pollino
garfein
starikov
dowds
swissotel
chairmans
jelsa
stael
tomatometer
aubergines
partakers
mugwumps
borhan
mjj
shiping
mohannad
individualize
apgujeong
virtualtourist
airbaltic
mvi
sunseeker
tachograph
lillibridge
andric
huaiwen
ruku
bahcall
hellmer
shital
lumme
yelkouan
convergys
pettine
nimai
mirch
encored
craniopagus
manseng
kincora
bloodworms
brohi
eareckson
wnuk
norrman
alkartasuna
anuses
gainers
chalupny
tazza
leatherwork
mughniyeh
dukurs
mazagon
pccc
togan
kligerman
foshee
phillippa
amac
rotaries
hvorostovsky
auvs
hattestad
claybrook
becomming
monsour
weaponize
calcs
youki
portend
eys
thoe
wandell
gracefulness
inmobiliaria
collagist
viceversa
apian
giovannoni
zevulun
magnifiers
shellard
disapprovingly
limnos
luebeck
pummels
ioda
suenaga
otterness
ikenberry
linbury
kolakowski
hagenauer
marchiano
conventionalism
thst
olarte
ijv
demobilizing
jingsheng
enoh
pajhwok
lansburgh
lowick
balliett
mementoes
prudes
secher
eijiro
schuss
inglett
tuvaluans
rubie
additon
berkswell
roaders
chopwell
,that
dosunmu
nordlys
reformatories
chopo
modish
sherrell
cluses
chicky
hamoudi
parx
trumpkin
deify
polachek
babbino
clubmen
smilla
aganist
kordestan
belches
greaseman
folkard
maneuverings
umenyiora
superfluids
tobel
belland
dxo
matrixes
parioli
unredacted
gorie
rault
tocci
craske
demre
rieff
madhouses
xico
inattentiveness
tenuto
pillager
tweedbank
delderfield
noirish
bassy
barlborough
pretentions
heintzman
grazioli
drx
breadsall
inorganics
kabukicho
stewarding
sharkboy
hargan
trixter
aesthetician
antolini
harinath
kroemer
phranc
groeschel
hotbird
worgan
shikumen
iphigenie
cadiou
mystifies
bacnet
alian
sawer
wanek
groveling
ducote
bontecou
gafoor
bruchac
usfda
persistency
rusesabagina
mahouts
seersucker
intrinsics
samini
dissapear
cafs
conspiracists
soboba
unwarrented
glitterati
anexo
sfumato
frigidarium
ghanashyam
keyzer
seff
gadomski
chicco
deliberates
ceco
wers
earith
specchi
leitchfield
cerralvo
callar
westlawn
haukeland
enghelab
toroweap
actioning
woodfin
maramba
orchiectomy
fletc
staus
dijkgraaf
interrelate
kulina
henchwoman
ewloe
bolívares
kesi
journaled
josephe
dystopias
unshaded
cooped
mehigan
hypothesises
dunalley
wamala
castan
lelli
stewartville
shaggs
spradley
hanaway
carpena
nras
bouli
wudunn
sellinger
dusko
jbara
eliav
freerunning
mutuelle
puggy
tuymans
shavlik
fizer
viendra
krukow
shoyu
geoffrensis
schiemann
vytas
palest
buttiglione
stompanato
manhandling
cordingly
arizonans
mank
sussed
yuja
inkblots
angsana
starn
ghettoized
vezo
vladika
legatees
crazyhorse
kogod
kamalabadi
mazouz
dassow
turckheim
gardere
packrat
llanvihangel
bloomingburg
biazon
bouge
pinette
baymen
figuera
sproxton
privilages
wafted
hindfoot
comon
demurs
maksoud
ponsot
piehl
noonu
naziism
sacrococcygeal
polytomy
bascially
maroš
zaccaro
digipen
howgate
misstates
rebun
generacion
harmlessness
etfe
bilborough
costis
hirosue
triptans
polovtsian
westernisation
folarin
dazzy
faton
perenchio
emran
handbasket
mittelstand
suhler
delvoye
chast
shiancoe
gainesway
jackfish
toolboxes
bunzl
pureness
chlorogenic
walpurga
razmak
baghban
filezilla
aldecoa
fulin
sapcote
exsultet
bledlow
abusiveness
rlx
jlr
sulfuryl
bittercress
eassie
flamant
conda
merica
brutale
hasnat
toine
dahms
fivemiletown
golondrinas
kld
checkpost
organoleptic
symbion
planetoids
busst
boyang
camira
basescu
woudenberg
congruous
chouchou
cvetkovic
segu
meiser
shigeto
chebet
carabello
fernet
lambertz
rackheath
bibliometric
pidcock
paghman
hitcham
ventresca
hatchell
safm
terron
penzer
ordinariness
vicarages
masnadieri
froemming
isserman
consolidator
akano
unedo
sexwale
thorsons
klimke
unltd
degn
dgl
suellen
sidetracking
kashem
buttonwillow
harbourview
nyungwe
allbright
nydam
rastafarianism
blockbusting
aswa
freshwaters
dallon
amoebiasis
herzenberg
livingroom
wolsingham
banavie
malielegaoi
boshier
poire
whoville
iwakuma
intellipedia
hdfs
loyals
orrock
strangulated
inshallah
macal
whther
trentin
zhongwei
kwwl
khwai
oxmoor
velux
exobiology
cintia
crumpling
helprin
arren
pajitnov
bohunice
usally
renunciations
noirmont
oio
jonkers
carabiniere
brek
cepu
landesberg
siroco
kleitman
gollin
niwano
prejudgment
lowdham
dilkes
antifolk
behrang
baggaley
ollman
tourish
birdseed
sidecut
omalu
siefert
formanek
hashid
haad
csrs
oae
clippy
collarless
jaroff
ogling
chryslers
schwenke
fayne
chaak
munib
bkt
brockholes
döpfner
mondschein
fratto
brignoles
roffman
mindfreedom
pachouri
mishchenko
rushey
georgeta
dobtcheff
baldia
cablelabs
wahle
kenana
priddle
discoverability
albigensians
seculars
remmert
ostomy
rothenbaum
hogansville
carven
arced
sportscast
manipuris
unconvincingly
thouroughly
kitz
holstered
fêted
ganglioside
jll
becta
hadspen
kornacki
taniela
wintel
tugade
shili
conferees
plopped
lcra
stolk
sodha
arslanian
devilfish
bravas
miedler
lazaroff
saim
makary
shamanist
ollé
vails
bastardization
sayeda
adfc
bilyeu
sopka
thomases
metolius
jugoslavia
chopan
adja
korsmo
mirvis
lavonia
prescod
westonzoyland
eagleswood
istockphoto
ormolu
haralds
shrieked
inchinnan
sayeeda
witsch
jorunn
elephunk
norrath
watersport
snodin
mstp
americanairlines
maldwyn
shriekers
myasthenic
tgd
ennstal
begur
demko
ffas
umh
odejayi
amrullah
batre
duckham
disneyworld
chanty
atopy
sinani
kopernikus
cutbush
becaus
hypermodern
biyani
nguoi
gainst
dioula
kreskin
zamaneh
dreamily
drebin
sevim
mza
slabbert
ifremer
rahima
burbano
eugenijus
folker
elsberry
pekoe
degale
adubato
filumena
mundie
raelian
toolshed
giusy
overambitious
internalise
wargo
adrastea
affronts
gotabhaya
tvam
customizes
unchaste
shlain
logiudice
delobel
maratea
stuhlbarg
paradoxum
letterboxes
hollerin
stallholders
adium
mahmudiyah
marlos
chaderton
bodedern
soundwaves
ohler
ivanko
sylphs
donalda
pinhoe
callon
kritsky
bahgat
ilfc
ralfe
hullin
kaufhaus
marszalek
rinky
cihangir
etam
falanga
standi
midgette
malecon
kenis
izotov
ciganlija
venza
kirkness
linderoth
kreitman
tomberlin
lamping
clotheslines
zellman
binkowski
morsbach
kabaeva
underarms
petruzelli
borgström
autostereoscopic
steenis
pertinently
kandis
trinkaus
neowiz
nesil
scandella
boab
gonnet
sinikka
ballerino
abeele
klezmatics
highpoints
kexin
straubel
jacquez
nextmedia
demised
widford
dabbagh
theanine
sleepovers
johannesberg
asay
sequesters
moralez
cecal
ranck
glidepath
helixes
bervie
bialowieza
tshisekedi
flatulent
sluman
warsteiner
debauche
hamady
torrontés
yangshuo
clts
bryants
kertzer
miyasaka
handscroll
reformational
jye
silsby
sabkha
rafati
régimes
schorn
föhn
gyeltsen
buice
hootkins
weisenburger
zabludowicz
peke
dannen
emira
cccu
morín
rigourous
boulahrouz
mysa
glanders
romanticists
smartmoney
lovefool
balliet
trease
glenfarclas
sugartown
niane
nahalat
murison
darpariaethau
cederschiöld
johnsrud
harlee
humorului
hereon
shuberts
fcic
sirkin
kominsky
forgues
bolivianos
blomgren
paleckis
autosuggestion
maani
daven
rhodiola
hepatorenal
tatana
itria
palop
pillowman
tagaq
unfaltering
utahns
suntech
sumsion
jawbreakers
nmdc
microgeneration
delistings
myrtus
feuermann
roylance
monache
eakes
donadel
networkers
kazadi
reorg
pegel
ketaki
shiploads
ccdev
traille
sidna
suranga
tippie
fette
ardie
waddling
stinkwood
tintner
smokejumpers
honeymooning
comunist
houchin
meperidine
biaggio
hogsback
skutnik
dygert
epcor
crang
auspiciousness
kimberlites
springfest
maggard
tabea
gaitonde
flickered
unarguable
broadswords
mealamu
mischka
requiescat
weaklings
agonizingly
nibelungs
pinizzotto
marcondes
nagalingam
mallinger
ashkar
dagmara
weidong
disapprobation
laju
że
kidby
dinenage
altran
honkin
jamesport
loaner
shoni
ramgopal
obsessives
teetzel
kohlman
gravidarum
sportsnight
artman
ywain
aand
zzzzz
farmborough
mermin
aymerich
knaster
paoay
belaire
laufenberg
eifert
suschitzky
barrowby
kallat
irek
woltz
watter
bakhtawar
kaarel
ayanda
hucklebuck
gatherum
shinmachi
holne
doggerland
filgate
jacc
tark
whitta
meritus
rushbrook
hydrox
ecocide
misick
trimmel
miniter
bobola
arabists
sivivatu
kosin
dogmatix
lanos
ipea
mutrie
punishers
platzer
mehan
pantaleoni
onely
stampfel
unpacks
ayliffe
tianfu
hullah
eulis
sovan
televises
dieldrin
aliah
gretz
cowherds
sweetin
falstad
lydell
kodjoe
dialers
neuregulin
capicola
benjamina
cki
heaver
goliah
karacan
predations
hornes
inscribes
yerby
londis
elledge
lookbook
grindell
poje
pactual
katselas
fechteler
grinzane
atteridgeville
piddling
umeki
cooky
kightly
planty
pentothal
coupet
flewelling
rechnitz
arsan
vilseck
patronages
memorisation
larroque
murless
lamacchia
reclosed
hrach
shawlands
unbridgeable
myrta
labidi
yangmingshan
lell
pracht
holtzberg
liveleak
dahari
kosem
tallil
eaps
beadie
jerko
ellenberg
grapefruits
charton
wallbank
tweakers
edirisinghe
vocationally
yusufu
quen
jatinegara
boughey
tullett
omark
lendu
eneko
telehouse
ettv
cracraft
dengie
ají
villumsen
begala
reasonability
jihan
thandwe
allergist
beacom
larive
mohanad
isaca
hoeing
gure
chandramouli
bordley
molaison
externalization
fontán
schanberg
nutrilite
dacoity
weart
kavaja
manati
dble
narramore
recomendation
brazillian
langguth
spycraft
barreling
kubelik
postol
cobell
rompetrol
kawakubo
chiappa
brittin
knibbs
frazar
lowliest
bonao
ooma
peasy
directoral
lhf
fanzone
calamar
montreaux
virdi
dorks
lamsweerde
emadi
bullrings
olimb
careys
aflatoxins
williamses
zarine
bottlerocket
indomethacin
suppliant
allderdice
pumpido
cigala
levchin
rachubka
tinatin
saju
valvo
hudis
dalmarnock
cubbins
georgopoulos
terina
stoneground
calibra
impressiveness
soundchecks
balanda
preconditioned
poky
carthon
protamine
gourevitch
giovanny
poupon
bakhita
igli
inspectorates
cervino
bunz
tokin
traditionals
vly
adeptness
morcilla
nganga
thiaroye
jentz
selichot
soppy
kraits
blincoe
waisale
makoun
kecoughtan
ducommun
zahler
kamuela
abridgements
dougans
représentant
lopping
kopan
gottingen
despatching
einfeld
coppertone
delegado
funches
superheavyweight
tajine
kuusankoski
spooled
flaxley
instonians
beseeches
downwash
jauss
peated
firstmerit
disgree
musulin
arblaster
mabberley
lukan
mansford
northend
touchpoints
gulping
breskens
nalley
spastics
saila
vadas
khalek
nuth
knoche
athari
ketv
dogcatcher
retama
spireites
enigmatically
goldwell
rines
herricks
senekal
brouillette
pollione
hairier
chethan
butson
evos
powershift
discords
staphorst
handwerker
ardleigh
ungated
mujahadeen
jewkes
blumstein
yafai
obx
bacalao
sexinfo
turnesa
kulatunga
nowacki
pocketbooks
unops
connived
erucic
lozowick
pbgc
repack
pactum
lanciani
deerslayer
ottilia
croisade
krogstad
carulla
mench
iava
youngers
rodders
wetz
panderichthys
relabeling
franchiser
multilateration
palookaville
shadoe
braai
mwape
stylee
disinheriting
onfield
saiqa
pirouettes
gafa
enx
kleeberg
nufc
hillsman
microanalysis
albertsen
colosseo
cecs
nicholsons
schueller
askerov
bisky
dinizio
bogland
atenolol
silvy
mushir
castigates
carteris
tiravanija
quarterbridge
cyffredinol
bouy
bellringers
douc
dainis
hanvey
nouman
daviot
mahli
javari
reclusion
atai
thrybergh
nyuk
marjah
newsbusters
linpeng
rivetted
craigleith
zisa
bedevilled
saletan
synetic
mcalary
mccloughan
tasleem
ramphele
shrouding
themsleves
shekh
backlands
paisan
sagtikos
zarzycki
tanioka
beaudouin
entryism
valadier
churro
imison
folman
gobelin
extensiveness
oedekerk
smidge
skicross
multispecialty
rabbie
razik
supermicro
dannon
hongli
nzrfu
morans
mangku
lakeway
euromonitor
paralyses
lurdes
ekaterine
similarily
fbx
zakheim
queenfish
xuanhua
toploader
datacasting
mediaite
kappen
hechi
potiskum
ruscoe
prees
softee
ismaeel
javie
bgf
sufa
varella
neels
zooniverse
douar
marimow
astrocytomas
piggins
cnnfn
yuille
douillet
khalilullah
mashaal
rapel
nicolaï
middeck
daubney
fagles
hadia
gadaffi
apeman
mnouchkine
bartlow
abey
rockstone
broe
tomoo
smbs
benchtop
argyropoulos
fsia
pigeonholing
aviram
ayari
brundidge
morasca
struggler
herrig
mayoría
sivertson
rubenfeld
thrillist
mfume
anw
jakin
questers
strathblane
tieto
yps
aiz
schenkenberg
kazakova
marfin
durenberger
warrener
caland
cew
maruca
sobbed
caravella
makeups
kristos
madni
obt
galahs
kasasbeh
swingler
instable
edv
carreer
abotu
fidi
baltia
canners
zadkovich
tullibody
baumannii
similary
elsley
windrose
slann
boonyaratglin
dithers
kulusuk
asel
allures
cruisecritic
phulwari
trotskyites
parakh
losangeles
comins
futter
velveeta
marylander
henein
catatumbo
microstock
hurtles
mihailovich
individuated
puterbaugh
eqa
novitski
schnurr
educare
hemmat
muleteers
failaka
manuchar
lomachenko
sfjazz
preciousness
narkiewicz
haidara
eruzione
uncured
shuga
watsa
delaria
cohiba
jinhai
ruqaiya
glassblowers
scas
thebaine
karazin
nerja
guity
mitrofanov
palila
aloke
esmark
pechtold
sittig
annualised
fiocchi
nibbled
wrightsman
afridis
doley
buraka
tranzalpine
fauldhouse
morreale
epidemiol
transplantations
kazanka
addtion
tumescence
moulsford
klunk
flairs
incompetents
britnell
obstreperous
vaupel
pendens
warberg
joing
asael
kaws
amobi
comtec
hyndford
ghezali
brandie
unflagged
heckles
anabella
vyne
podger
duroy
thomsonii
polarizes
pisanu
aldemir
handcarts
akuila
imeche
oberhauser
masetto
heimberg
spinna
dehghan
blurton
charvis
megacolon
overcautious
compl
bagnasco
iljin
sacker
hollingdale
ironware
psychoville
hebes
glucksmann
dlpfc
chokin
drh
devender
modrow
acedemic
broomfields
amee
jadrolinija
afcc
newser
hibernacula
ixtapan
slenderness
scyld
compal
loanees
dolwyddelan
pencaitland
hamshahri
kinara
cuningham
grafs
tindley
ciena
blando
kendalls
shemaroo
mcfedries
topock
waterkant
hyoscyamine
chits
snel
duellman
zounds
jackling
akol
nfts
sfos
bruneteau
videre
jeptoo
korzun
mahna
zhilong
ecca
naeto
weihenstephan
torrico
earlsdon
falungong
friess
urfi
jeantet
rathman
gladioli
esrd
apro
ariary
mortimers
raphaelson
protus
bonallack
castledawson
piskor
glockenspiels
bayham
dritan
quickstart
mmw
sakyo
aquellos
dogberry
cablecom
coagh
pontcysyllte
rumbelow
africo
blackplanet
dmap
paretsky
walfrid
habashi
windies
clamming
combusting
ghf
misjudgments
elior
boler
kravica
garity
neupane
superteam
kovacevich
calanques
erromango
islandwide
calcutt
nhlbi
reappropriation
hofmanova
himley
barehanded
lochinver
brashly
puddletown
digswell
fortini
coxiella
whataya
mileham
congresspeople
gromek
articulately
rateau
eyerly
agresta
flahaut
moniaive
sarcopenia
chinguetti
bulgogi
aletti
abta
councilperson
firstenberg
lequel
isikoff
bosca
hagolan
fordice
hasti
leftwards
demming
breightmet
kcom
dahe
vacuumed
ooda
intersexuality
degryse
hosn
roofers
gerad
freebo
probasco
summerdale
kashk
pand
radif
copplestone
shammas
buscot
guney
aftershow
trischka
ibsley
slickness
cosac
wisa
baulked
ibcs
ktlk
balsams
kobie
chouette
crem
latke
grappenhall
kneecaps
nyoni
monder
backworth
cherimoya
diamé
serration
pestilent
cookoff
oneok
bianconeri
helmingham
promesa
duplexing
willmon
mountebank
ebus
falloff
olliffe
xivth
balash
vallat
rampantly
dainippon
hoffe
achmat
streamwood
periodicities
resetarits
nastily
burnable
maderas
benishek
quilici
entrapping
greggory
goldsworth
wiederaufbau
bignall
crummett
biafrans
xns
comolli
poyi
iddon
eavesdropped
jaud
triffid
reshooting
minick
goldens
zadro
gangte
woodroof
wittke
spruyt
temozolomide
incubi
yizhong
brownswood
asae
optum
carboline
bukar
charkaoui
nightstalker
bakley
silksworth
everbody
platforma
semaan
seidlin
raddon
ceus
sorbie
fieldsman
dyfan
veliz
kiyota
rothemund
teunissen
wmas
supachai
archaeologies
corrigendum
ganis
insignificantly
lakshminarayan
frud
waitsfield
rimkus
religionist
quoddy
pipher
myrtis
cgrp
immunogenetics
chevys
plomley
crivitz
toolmaking
dfas
commensals
esplin
wfb
scuttles
pamphleteers
croughton
sorrentine
ncv
minimovies
wheelman
weyrauch
latapy
bacchi
cloudcroft
wheelersburg
drechsel
palladin
egberts
dejiang
sagnol
albaugh
flyout
scotish
delauter
overwing
pollari
henske
stehle
submited
anie
birke
hyperemesis
moulsham
lanett
regrette
notw
burfict
recombines
halmi
telescopio
elkhound
anielewicz
uttal
minges
caroms
swishing
neier
urushadze
stackridge
aetiological
nalty
odoyo
tymoczko
przybyszewski
sherlyn
meininger
hedychium
farooque
metalized
noever
fenley
monocled
abene
naspa
chiasma
knavesmire
brewsters
heesen
sanikidze
inonu
fato
geomechanics
ledroit
clinkscales
hookahs
palanivel
hamu
porumboiu
copertino
velilla
payatas
gungho
hargreave
huseklepp
merja
perrée
lonmin
naoum
custards
barki
tuchel
kabocha
saronni
carnmoney
hanmin
mcso
wtih
cholodenko
mirk
intangibility
faintness
jiahui
doctora
fingerling
insall
polishers
ibargüen
tadakuni
ceel
heff
meowing
revuelta
waipio
dooks
patrica
unforseen
benthamiana
trau
otk
chynn
gorleben
popsy
zvika
tzena
gilhooley
bacsik
transcon
fillory
auther
mehamn
mcaslan
smyers
dinna
lazzaroni
fuw
dakotans
hipperholme
lauterbourg
beatlesque
joralemon
zuleikha
apil
liljedahl
scentless
industrialising
langkow
lekker
friending
rassel
motahari
brantas
pharmacogenetics
nationalbank
rikyu
tompall
fallaway
bacchanalian
brocks
shosha
eurus
guidepost
cirl
bated
preoccupy
coequal
hilltopper
komor
pianiste
besetting
getta
blayne
darklord
korus
laudi
cambusnethan
cordesman
brettler
meerow
bethencourt
darrien
alko
ingrate
agonizes
vandevelde
shoreside
grod
gulches
playskool
zigeunerweisen
boepple
creditability
worl
chondritic
sunzha
hobbesian
lorean
epcs
namirembe
hummert
roggio
ménilmontant
mcguckian
joeli
nightshirt
akabusi
gurfein
brockmeyer
garz
lmct
hindenberg
bolzoni
stormfury
wackiest
iecc
toos
azahari
letcombe
hardbody
shenar
zscaler
thean
remarketing
jetted
ilkin
demographia
mwenda
theoklitos
crosspiece
gurgenidze
parmigiana
showreel
vontobel
irishwoman
vulgarism
tumnus
devonish
bushisms
udb
stokenchurch
altadis
polunsky
blastoff
megaloceros
seion
kewa
najarian
genpact
dafen
bachelorettes
moretta
strategem
bayana
atomoxetine
rathe
bailin
dramatico
sorrels
inglehart
goey
reillys
hanceville
boldre
bernadett
gobbling
sophisticate
interministerial
ziekenhuis
chronis
chambourcin
sheats
jazzbo
tianjing
llanfaes
mandanda
daguerrotype
adenoids
apostolates
elastography
posession
jelli
sharafi
fuegos
popinjay
kingsdale
turowski
losse
immunizing
bejewelled
allured
stockett
redenbacher
fungicidal
minett
agroindustrial
phyto
rottier
weedsport
garsdale
bucklew
kinlochleven
samme
zigler
psammetichus
widgetbox
jixian
sarongs
sanny
shenae
traffik
agulla
johhny
hdn
sankaranarayanan
petkova
marije
performable
naturopaths
sheron
opare
danielli
euzkadi
wynns
reguarding
jco
hayner
viniculture
cheffins
hyodo
finnieston
ssas
reprobation
strothers
fumiaki
winterized
pabon
gherkins
opvs
hulshof
guiton
kirtlington
briem
palladia
edil
iied
fusilli
tizer
ryler
hedgepeth
mosholu
itsi
leblon
stondon
torrisi
vimto
mastroeni
pwrs
asinara
cremains
montanez
cleasby
ujjwal
ebele
boschetti
sayar
teramoto
scognamiglio
nbaa
mallord
snibston
usamah
progenies
ekati
undies
everone
skulled
untrammeled
samiya
softy
moily
iliopsoas
machimura
shipibo
richardt
quinquefolia
dragoncon
buile
searson
cerin
matsch
burnhope
mayaguana
pillagers
synchronises
cile
nadder
controversey
corendon
leskovec
keratinous
benevides
succeded
ceglie
rtus
moisson
sirtuin
shoehorning
mckearney
ttxgp
teleperformance
lignan
hookey
showpieces
maltbie
brkic
zaripov
ikramov
oppurtunity
bekaert
cluemaster
perusahaan
airmax
aurélia
kalpakkam
angley
harit
burham
sassine
sikha
originalist
vanderburg
phenacetin
sexxx
gingivalis
hollimon
bushwacker
tsuruya
gengenbach
anonymization
nahrawan
astrit
legitimating
hummable
gröner
armit
buhera
magasins
weerstandsbeweging
karlsberg
diyan
rofecoxib
recommit
langworth
tashjian
dionisia
lifesciences
domperidone
keshishian
fiveways
gartmore
gipping
mckennan
tolong
shash
danielpour
lanskaya
sebat
egton
mutahir
macdonogh
kieser
imperilled
kptm
sampietro
muramoto
ulin
taxidermied
taymouth
deddy
derin
malefactor
crann
endears
tattletale
merveille
arem
tackley
lauffer
korowai
newquist
mulry
seigfried
criminalises
cystoscopy
bomhard
upperton
sterman
tenoch
appart
laboulaye
chiweshe
candlish
chalifoux
paslay
linett
stabilities
ecks
karar
ekofisk
arritt
coprolite
crathes
mithali
landzaat
sherwell
chuppah
hypercritical
shives
delinda
accp
eclectica
qru
harelik
leistner
pirg
sukhum
vasilievna
kompania
dugal
iryani
serlin
poelman
herskovitz
sparber
directa
secularize
precipitator
noncombat
boyde
mcgiffin
lochcarron
scuffed
sauté
maglio
wasantha
toktogul
rollcage
durso
khorsandi
tonis
kouilou
quaintly
sanmarinese
liel
impoverish
nextdoor
kappos
hereto
lomana
prates
issl
pradel
wfi
pooping
thornely
skau
mumbly
prospection
reddaway
shortcrust
amvets
milage
enniskerry
kösen
qurashi
uncoloured
lezard
otas
rethymnon
anush
marena
skelley
pruner
holdich
djerma
woodpigeon
unconvicted
aronin
drollinger
okudaira
laubscher
viveiros
lumbly
levox
shefer
chlumsky
grrrr
damxung
longtemps
scrs
bulter
prutton
bohner
swavesey
bodwell
shavei
friedenberg
taverham
menomena
monnin
esdp
melching
wlbt
utsjoki
rerecord
senser
premierleague
islamo
merrymeeting
norplant
konec
huili
milioti
problemo
saraland
canzonas
peñate
werne
ifpa
graveline
damman
gracq
ornithomimids
magaw
butterick
gaian
southerndown
sisamouth
wpbf
pettrey
hillclimbs
arsov
surowiecki
maximiano
fatullayev
payner
encrustations
northline
arrestable
kurkov
robing
hinderance
peranakans
collecta
belaga
bawag
draperstown
aeroespacial
impactors
whisler
unpromoted
poderosa
agoutis
lehenga
menabilly
slackline
lomnica
pdus
iconia
kayley
stunna
tostan
cojones
witcombe
worrack
chewable
ghafur
canst
eyecatching
khazan
antitoxins
situationally
tostado
geremek
steineckert
juang
popworld
tagge
multipartite
rubbles
kockott
fxb
bednarski
ledum
adipiscing
sobhani
wisecrack
vathy
slating
bramhope
defrocking
kasparian
meiggs
filippenko
ivoirienne
virtualize
whirligigs
inarguable
kepier
unserious
capezio
tigerland
turndown
szczepanik
fiss
dinello
rotos
hadrien
languorous
snowsports
aspr
corncrake
daviau
onuma
sarducci
mihok
epistemologically
owensby
sydsvenska
nolot
doster
sherryl
ixworth
caucusing
jevan
pahars
shijian
planetology
cravats
imar
sabbiadoro
rahmah
lindenmayer
lineberger
tenho
mcclatchey
culpo
wittingly
burgoo
bearpark
vear
pleasureland
lagunita
ejectors
mmabatho
gionfriddo
misdirecting
elkana
carabelli
subframes
jonbenet
valleywag
afinogenov
cippenham
coud
deveronvale
wainwrights
liuba
ballfields
feasable
cirigliano
blythewood
menaker
unselective
igfa
haroche
hodgkiss
jurys
tcks
reincarnates
latendresse
liddington
vincenzi
cockbain
micropayments
weisenberg
arav
vinnicombe
hoola
dique
kissidougou
naumoski
bopping
rejig
pilsley
duplicators
rivieras
petignat
melhuish
pinheads
naisbitt
coffeemaker
socolow
kolomenskoye
verburg
masculinized
unimagined
trrs
poésy
mayaki
pixilation
clomifene
kalnins
scotter
houchen
cinsault
shahrzad
roskell
sosin
endevour
gastrostomy
conceptualise
chytilová
ecureuil
lowriders
bladnoch
santell
reattachment
ambassadorships
tysoe
amardeep
reinstitution
enigk
unwrap
scuffled
lubinsky
verheijen
leogang
frieser
intercoms
bellic
soundest
mcelhenney
indialantic
dogtanian
khda
garbarino
encourager
lambrusco
uptakes
blastomycosis
kostyuk
mercopress
ariela
schoff
dhokha
kontinent
ukrayiny
fractus
espenak
homestays
bontrager
perspicacity
atlixco
brightwater
ochres
pibe
sulks
davenham
memorizes
gaxiola
canson
leibovich
guscott
inventively
hrab
zaretsky
yapese
overstone
soldini
nonoxynol
koelle
soffits
bouras
tropp
paleopathology
emarat
breiner
scenting
rataj
ratcheted
taffe
vizion
sailele
coskun
profili
pulite
roest
kotch
interglacials
castelbajac
saywell
alvaston
jansport
huanuco
dysregulated
corinium
orakpo
danze
backplanes
surra
hanushek
dustpan
mahonri
vtl
twinz
leece
bgan
shukr
wuh
capitaland
moteur
okayed
dykema
rollright
sigalas
musen
oenologist
carderock
intuited
hypertriglyceridemia
suhag
tamudo
entreating
sheherazade
sottile
neumayr
volcanological
mbakwe
liscence
vaudevillians
niedermeier
streakers
schol
sonapur
sedlacek
candelabras
havill
waikoloa
moluccans
parbold
retching
whistlestop
melnichenko
vitellia
jujubes
merinos
burruchaga
virag
jamukha
courtot
pallares
koodo
myne
jackley
brazel
moladi
annalynne
borza
alwayson
bolikhamsai
peci
whiteadder
guseinov
jannati
passivhaus
boedeker
menchu
philipon
norriton
yasen
impetuosity
monohulls
cherrill
disanto
temanggung
expiatory
gulliksen
grabowska
holtet
unscreened
rehiring
mulesing
buchwalter
kohlmeyer
shinozuka
veran
doval
glossier
steininger
dadda
footpads
irwan
eradicates
zanatta
iiasa
cisek
zarko
capsa
tugnutt
delphinidin
veiw
dinges
sicoli
shadai
agler
strogatz
swansboro
ziga
flodin
adalimumab
gearan
fenosa
saied
gething
dadong
hartshead
knobler
infibulation
bahadar
ealham
vanderkaay
bouchey
schwannoma
yose
rubdown
rajabi
smarten
perra
chenega
tognetti
plaît
gabrial
kurtenbach
nuremburg
dictations
wappler
geds
drafty
banisteriopsis
margaritis
ryann
zerka
dottin
misconstrues
randt
eafe
baseboards
manvendra
jetters
betony
conehead
symone
duchemin
tasiilaq
sidlow
graybar
prudom
ilac
ntaryamira
todate
basura
npws
felger
prig
likly
ngola
seropian
eurogames
secondigliano
kylemore
funfairs
saffery
riverwoods
venkov
echaurren
grainge
sandtoft
hengrove
suwan
tonkatsu
indispensables
joanikije
datapath
moules
fwp
narender
ezzor
clevelander
nakfa
sundaresan
pretest
sonobuoy
baboquivari
economiques
hutsell
normandeau
munguia
uhlman
telegraphing
kornati
shulin
chcs
wilkey
chervenkov
fiords
mpssaa
misterton
imambargah
cespitosa
bustier
freeloaders
toodle
kinetica
keiler
outdoes
licheng
cannistraro
carseldine
nemiroff
ciar
internationalizing
crosshouse
denationalization
stepparents
munce
cdli
williamsons
sendek
yellowbird
motability
radamès
jawal
bithell
sisyphean
phal
norworth
arini
ajaz
nahb
oday
chimbonda
selmani
kresh
catino
movimientos
palan
faizon
connoquenessing
usme
burkesville
karytaina
landbank
miyao
unfertilised
deann
alexandrines
savides
koryaks
luse
ouyahia
scha
vollmar
minjur
burnitz
nymphomania
ashchurch
pipeworks
velayati
rinses
jti
filley
rossin
touchard
baiter
joycelyn
etheredge
subfloor
minskoff
kebbel
wdca
amortize
irasburg
roussopoulos
gazillions
kulpa
deschain
hilman
minniti
norick
allwine
untucked
laugesen
dervis
lambrou
recomposition
turista
mijikenda
diakite
andreychuk
vxi
ayuthaya
kukan
gunmakers
pmq
barthé
pyxidis
honea
mathijsen
enfranchising
eurich
wadhwani
surachai
sportifs
dauner
augean
haulover
provençale
ueb
sifford
galsi
merabishvili
solel
zoie
romantiques
boverton
boissieu
siia
garbs
snco
mctague
carcases
duggins
clemenson
pilarz
sulfonylurea
kitu
triboro
weinbach
usov
reincorporate
torqued
camlough
theocracies
jania
omegna
belleair
tuffey
togs
blasphemers
knuckled
pittville
tonyrefail
jerde
classist
tatoosh
hotfoot
matern
coupla
kawarau
erdei
klopper
reinis
shafique
glomerulosclerosis
keas
mikulak
gruwell
andacollo
gerima
goerke
florestal
nrh
opta
qoute
defi
songsters
edley
zaroff
kaeser
noti
complicite
upperhand
umai
brader
araf
smolder
jahreszeiten
dryosaurus
cruet
dadoo
uzzell
romilda
koshin
healthline
jooste
nabr
estopped
campogalliani
ictj
geodes
dayro
khoshaim
fastly
mousavian
bleedin
nutcases
scribblings
nemetz
metapan
mosaica
rodeheaver
psychologic
syz
swetman
arkadiy
unkrich
zhisheng
nagasato
gilvan
westendorp
qcc
erlotinib
knezevic
armina
chieu
cluzet
schinas
medani
bullheads
cassocks
lapstone
imporant
phyllosilicate
saloonkeeper
moodley
khanan
sammis
haughtily
capc
wikus
bredero
andersens
globalpost
silvis
campodónico
mosen
rakeysh
disquisitions
harut
toskala
donepezil
wakui
pinkman
kreizberg
conveners
jaleh
orlock
attemps
robart
antonian
henllys
hostelries
heckington
oliviers
fluide
amiesh
imagi
xfp
voxx
globecast
nht
tuckasegee
brumberg
rogachev
svenning
phurba
forvik
zappala
adherance
olari
ferdinandi
ambulation
jetsunma
corsano
draddy
doret
anonymize
gastaldello
burum
aih
negativism
friedewald
sharpies
membe
zamacona
skilbeck
rafiuddin
debark
ruke
entwisle
bressingham
raghuvanshi
jackasses
jabbed
madresfield
keitany
midlevel
yongxin
disturber
finjan
spicing
tolsma
huyen
meetin
ascertains
cutz
muralidharan
symms
villarosa
rumpled
boulby
niksic
monchique
klenk
andora
codron
freesheet
triers
hopis
mcconchie
hilker
roseway
yaohua
digressive
hadrill
rosengart
blimpie
lawbook
rollerblade
ohev
dcgs
lanzoni
speta
matatu
lowis
imperturbable
engert
tarmacked
gaboriau
hornyak
lawyerly
wilmoth
habis
mccarry
dobs
professionalised
schoeneweis
pasillas
obua
jagdale
patrulla
decimates
titillate
hippolito
rundel
kirkleatham
grasu
shoshi
lauca
efros
montavista
shvarts
schmerling
correns
lookstein
claramunt
bienert
portville
motera
kiyonari
frays
huckleberries
salari
lgbts
snorkels
gaup
tartous
leiby
cowburn
kafando
wiele
bayston
jerrie
cliffie
paintbrushes
kosonen
syse
kyson
toblerone
wilpert
rimm
respekt
lagravenese
leazes
churrasco
hengqin
conmen
mrozek
masazumi
agyness
brassier
crammer
cassopolis
dismissiveness
alwa
superweapons
gurry
rhody
bno
renfree
microstation
cocooned
tuller
wuv
stagefright
secca
thromboembolic
traineeships
cossor
grynberg
atsuro
beaubrun
seperates
yna
slaithwaite
pangburn
allocution
pyrethrins
maneely
hillstrand
cosens
umes
azahar
jbj
gayles
reveiz
avercamp
pellston
porphyrios
yfz
ardolino
huatai
wbk
weezy
pielmeier
churnalism
wilce
qutbi
personation
bandsaw
satele
feirstein
sotalia
kantilal
ermione
banchi
srtp
transgenes
finckel
wyndmoor
hofstad
horizontalis
giblets
gunsaulus
zukin
towery
rheault
disjunctions
depósitos
merhige
rathen
mcquarters
sidhwa
simiyu
dierkes
backswing
joxer
atletica
assessement
melone
grann
degner
rawda
laayoune
hilti
sweatbox
garota
toja
undercovers
vondie
lochearnhead
vett
apnoea
disinheritance
incommensurate
aduba
nepotistic
zorbas
deviously
pierside
covance
scareware
johannah
crumm
taoufik
sharry
clearpath
lucchetti
barasa
lacosta
museion
vanin
salvors
leira
aladar
decliner
selfsame
ekho
salyers
emcdda
bottari
choreographs
swiftlets
deines
cotard
remédios
grinstein
archey
pallin
possamai
jurmala
spogli
haverton
bertam
iino
rollerskate
kanjorski
isenhour
airhogs
maharero
faja
leyshon
skewen
palling
jariwala
auvinen
vershinin
lulay
ilze
perms
lowfield
enforcable
luedecke
screenprinting
apprently
thecodontosaurus
fishhooks
caapi
talamantes
squeezer
sachio
karamjit
carbis
polyculture
zardoz
tugg
laszewski
vedad
brimington
dranesville
lineback
imta
defunded
touger
puxi
amiability
matinicus
carisma
guthro
candeias
ottewell
reprod
martirano
galwey
camarero
ginocchio
dadon
shelvin
osterville
ludik
tothe
cqd
tittabawassee
loompas
blaspheme
börne
fedotova
laocoon
sarbaz
deeny
savident
zbb
bunshaft
faine
fkm
haddie
gibbens
diaspore
lucine
annuls
nutts
bourelly
impoliteness
lituus
barias
hendron
amoros
fosun
voudrais
coggs
endress
ulugbek
neurorehabilitation
zappia
trentonian
brico
lockbox
torpy
ponyboy
yizhar
equinoxe
kyonggi
diasporan
wonderfalls
cadabra
beyene
cheatgrass
ranny
intrusiveness
fergusons
gianotti
longjiang
gmx
leja
mattawamkeag
badsey
porcari
cormick
retuning
salzgeber
brabbins
economise
bondholder
stracke
cyf
isss
deitsch
kharafi
cohabitating
dolgoruky
derngate
ezarik
shochat
mskcc
murville
creuset
ottolenghi
handstands
ccps
itemize
cvma
xdsl
yizhak
faour
brisby
shewanella
nizhni
nuyen
korek
woollens
godowns
delfouneso
microbursts
covidien
treuhand
unrecovered
edwy
grimberg
clothespin
spectatorship
amim
sangs
acromioclavicular
penrhyndeudraeth
supernumeraries
unbidden
disdaining
glisten
macsween
poxvirus
nclr
larita
torksey
bodell
nardis
koelsch
monsoor
chalana
zucchetto
thaker
palyama
pregerson
periaqueductal
kinng
ppw
swilley
bodyboard
wesc
riggleman
martinot
sileby
mifi
subversions
srec
smallthorne
motorshow
apmc
cirone
swinley
rcgp
herran
essense
barentu
frecce
maynardville
flyertalk
maiale
durran
doest
repulsor
pinte
lprp
kinniburgh
codner
condylar
dakka
desforges
deele
voelkel
beinin
varmints
republishes
xma
heavenward
aroon
azahara
clachnacuddin
lhakpa
amiram
carrieri
pamplemousse
fayer
vasicek
isakhel
pechiney
nhd
weligama
marucci
lensbaby
cédras
wearstler
folowing
changez
myaungmya
fairydean
dorthea
arvelo
tsujii
horev
lauch
vasiliou
poopy
glassport
prestissimo
mccranie
halldor
bolls
scieszka
mengniu
waine
artspeak
scrappage
mickelsen
eble
trifunovic
wva

duneland
kunen
recolonization
bigfix
racquel
konare
diskerud
endellion
magnetize
picu
urbanite
duniv
philosophize
irremediably
approachability
brownbill
imesh
posibility
maggert
potternewton
stephans
vijitha
penclawdd
vof
longboards
cassillis
cruzat
rvu
wharmby
unsocial
lavishness
pralines
lefanu
peppermill
vietri
blacon
amiloride
freidrich
espe
recertified
hosing
baros
glenurquhart
skaarup
superjail
dhg
iwd
nazih
brocaded
stickier
malyn
teakwood
longerich
moxifloxacin
tnz
ushahidi
ganache
grumiaux
yuquan
modernisations
brandley
choji
unbanked
fauzan
sambac
vukmir
bulerías
elderfield
peszek
orthochromatic
neaves
salination
lorina
fosa
shaohong
gowadia
megginson
compensable
wisener
meconopsis
chmura
zöggeler
stanic
unbroadcast
ogogo
zinat
plyushch
rashaun
glba
exor
levetiracetam
rafta
dayron
kjærsgaard
ciona
feleti
dunelm
pubes
jasms
notizia
maraniss
prorated
veljohnson
russwurm
uncc
buback
venceslau
scratchcard
introspect
bulker
galer
scooba
sumus
ponant
grimonprez
mapple
dandie
dialectically
classicus
gundula
naseri
warmhearted
sioc
seedhill
markward
mckinzie
frenzies
paikiasothy
grannie
panajachel
scarwid
ngoy
beddow
gaborik
boulmer
polylactic
gwynneth
kolodin
jagang
pitsunda
anthracycline
cariani
dohme
zarem
prenger
slavitt
prak
carjacker
sakhir
childfund
sencer
mamady
tioté
xay
biltong
zendesk
dgac
jelloun
masterfile
slavov
ladybower
bupivacaine
drools
belridge
armouring
yodok
tumacacori
unluckiest
juvie
masley
pomonella
shawal
mdoc
thinkgeek
drysuit
learnin
psephologist
matityahu
wisam
hatosy
benami
soubry
bacino
regasification
tebessa
polyglots
nagell
almadén
materialising
deselect
monotonously
microelectrodes
brutalizing
incorruptibility
constructal
schumm
downscale
cude
mahasi
schönhauser
diamantis
schriver
akopian
rougham
nordhaug
rootstown
tribunale
webware
hoffstetter
hostler
knw
tolba
zadi
bunmei
flubbed
whaddya
shigetoshi
wenches
messolonghi
vasilij
covello
osmington
kimmell
citp
raytracing
wherries
müntefering
cleeves
touriga
keflezighi
coloradoan
shearmur
flamands
borree
megacorporation
ibbs
garvald
bulliet
halime
iav
sytem
hlophe
pacal
physarum
nyree
jorasses
cheok
unlawfulness
forthrightness
hrn
palevsky
chondrocyte
morani
piazzola
metas
couesnon
coxall
coathanger
blizard
troupers
oxidises
shahad
afzali
florie
incentivise
chlordane
cascata
itzkoff
agna
mastandrea
bustelo
guillotines
grabiner
bobbidi
tongi
wiertz
mardale
aquire
famosi
resalat
sportske
quiktrip
munther
bombproof
yaren
davino
rieussec
woodseats
tgvs
mullineux
montagnola
pomposa
indoles
freemind
hepplewhite
modwen
menageries
vaitkus
marez
sunrun
chuch
ursell
dilts
assiette
epon
flâneur
tabatinga
huttner
vmg
pramuka
kaab
manhire
sukhorukov
montagnani
isobutanol
roizen
exterminations
tadzio
zaragosa
friulano
macroglobulinemia
leaches
eppinger
pokies
gerhartsreiter
beyblades
sauvegarde
jtac
spirograph
ihes
brozman
wawrzyniak
guestroom
maresfield
smithland
liheap
quillian
calday
nontron
dehydrates
poorva
huddles
sobczak
golosov
przybylski
pekkala
gwithian
tebaldo
ndereba
shahzaib
kelsay
aldrick
lemel
exb
awasa
unitaria
autoregulation
saundersfoot
volcanogenic
tshepo
brachman
brandram
barath
rudston
malachowski
yongchun
stocken
wardington
tatnall
transeuropa
aktar
haggarty
kainos
langway
didem
triola
cleage
weihan
trojanowski
gccf
aptidon
lichty
mumo
cesspools
dmcc
semiprecious
augher
mallan
luarca
kashfi
watzmann
xclusive
cmcc
edner
slipperiness
nightcaps
norwin
feighan
sheetrit
grese
hhp
felicitations
shreeves
utseya
editoral
stromatolite
meself
moez
yemin
moondust
marczewski
luminol
coffeeville
stupnitsky
foolscap
duumvirate
beechy
sween
markou
crbc
mouchette
tiven
gutowski
capellan
beiber
wus
cheaney
stephentown
sinde
modchips
boudier
deeble
tundras
audsley
clotheslined
blastomeres
acheivements
dommage
sonnenfeldt
airmont
falleni
wabara
raewyn
indisposition
geck
fitzpatricks
allègre
yaletown
bouffard
cavehill
nezavisimaya
pelchat
jianming
marshwood
spearmon
enoksen
macgowran
fuzing
madian
tiida
mottahedeh
hyu
dahesh
rgi
glissandos
penketh
fieser
aultman
embera
biztalk
goldenseal
festered
dyane
nivison
ichigaya
krygier
saiyed
templ
coatham
gorgonia
synchrotrons
cheuse
libere
enad
crimps
malediction
ceaser
shettar
otaiba
catalanotto
guemes
kamunting
uncac
nadig
dyukov
prolapsed
trapero
ncri
sheeva
gastroparesis
wpbt
tonye
eggington
anathemas
twinkles
shampooing
blub
cource
dobrinsky
rotifer
cwmaman
mailonline
bilali
fogleman
hartikainen
bvl
tufano
dablam
veba
wledig
saimon
dopes
dorsiflexion
pelmeni
possesion
vibraphones
emsdetten
manipulatives
strathaird
bayandor
hamsterley
dromoland
somin
chups
bilney
parasitically
jabour
dirtnap
tripwires
heatherly
blackinton
bunnett
konjac
fahrni
knacker
chiaromonte
otherhand
seiches
qadiani
shick
banyard
manigat
lrad
witchell
meprobamate
cantele
altruistically
hooverville
lamrock
duncum
moloto
cayucos
cambiar
raraku
dislodges
thembu
stäfa
archeologically
eagleburger
recapitalized
blandina
piernas
savitha
fancast
alkyd
tulliallan
layabout
pipkins
crudest
fingerpost
pakistans
sawkins
juans
marteen
eggo
cerha
napolitan
eilon
lizhi
walen
pevs
yated
namche
backsides
sukkar
tiens
proportioning
arnet
tonsley
nalp
serpell
wangdi
deepcut
tonin
tarheel
nannette
hegenberger
brueckner
suspensive
pandera
fleetcenter
llaw
kikwit
cownose
hydroplaning
canfranc
ishige
rotoscope
epitomise
weichai
kiryienka
chouraqui
fradique
alarie
stalteri
diarios
cairnhill
reticulocyte
mettenberger
vorticism
sklodowska
klause
nevirapine
unseal
ruppersberger
shoukri
vibiana
vacherie
isleham
mkiv
keppie
jbi
neus
bienvenidos
verkhovsky
sadir
hunga
tokage
dubbin
haussman
zinny
crociere
yobs
auryn
deya
venkateswaran
neog
reepalu
knerr
holnicote
gervacio
glamorize
sportingbet
davol
catacamas
marrack
foiles
skillings
bokhary
ladys
singletrack
eyeworks
badir
seldomly
bunkley
iuss
lounger
lenczowski
theophanous
dynastes
bozidar
weixler
bahawal
disinterestedness
moshpit
coccolithophores
pultusk
ajita
scantling
sibur
cuvee
hatam
lleshi
dariel
ribadesella
keyah
konon
eyestrain
bosville
joner
sirus
singkil
wunderman
jäätteenmäki
fuest
bawling
papagiannis
adec
ayalew
amfortas
weichert
wawer
diq
dorrans
aguaruna
amberol
regretably
elbel
pasqualini
irta
dogsbody
ignitions
menik
serigraphs
aerially
materie
ekantipur
greenstock
rrd
nesi
dainelli
chabanais
monjo
achraf
katsuyoshi
diversityinc
unmovable
shipworm
gimmickry
hammerklavier
eriocnemis
balzani
balbirnie
epode
takuo
bombie
mortadella
aumf
ziebart
bowmer
obm
bonduelle
dockrill
sharbat
adisucipto
chojnacki
chaifetz
anglezarke
lancellotti
amatil
oldfather
desso
sugarfree
fuwei
reshapes
kassidy
frette
markee
incarcerating
xinping
kligler
amaré
bellan
peices
akinfeev
golog
lukash
tressa
maajid
mabrey
linkins
characterful
allanton
drakenstein
alcano
glosters
overpayments
jinda
escravos
myricks
tasnim
jehane
tricoloured
hewell
ghir
globovisión
moranbong
stocksfield
agonisingly
keji
pleno
sorrells
emnes
footie
uralvagonzavod
pirtle
shallal
cirm
havner
plews
meuser
quepos
peruga
writeable
diesmos
quicke
voronoff
grosgrain
ermua
glaven
plagiarists
schmitter
voguing
mcjob
donka
oberdorfer
anemometers
nevadan
bugun
tammo
rujano
babou
yanic
bolinder
keiley
jumbles

honeychurch
mutaween
akhundzada
brumback
hofs
radwell
binnig
piccione
soient
muckleshoot
bransholme
krushchev
teichman
kneeing
rggi
arzo
shonibare
iceplex
scurr
bougon
farahnaz
shumsky
tiznow
allthough
nikam
ogba
slipcased
rachmat
afsaneh
arrowwood
soneji
trezza
severnside
perseids
eisenbeis
unibanco
ascani
remaindered
aspo
gorongosa
hehn
dipropionate
arabinda
rizer
viano
ssem
authoritativeness
wardian
delena
ploeger
fujimaki
sappi
unprofessionally
brynamman
audiological
clodfelter
centrowitz
siwiec
coray
dharamraj
landenberg
mutato
yulan
lundkvist
biser
corruptible
marcolino
bordón
tempur
bruening
officiants
mssr
caídos
pondy
padanian
shagging
keens
synovium
cornflour
udraw
bumiputras
kalavati
sofrito
denzinger
intensives
bullcrap
tanada
jamsetji
bashas
vauvenargues
rythmes
mceveley
turkmani
ginsburgh
makaha
tsvetanov
niehs
emmc
rvp
haloid
aaargh
jesualdo
hansens
snaresbrook
orihara
simelane
pharmakon
habita
digitalb
hayal
exadata
snøhetta
turino
petrodollar
barklage
congregates
bukola
abron
furhter
aymen
nonsexual
marom
schierholtz
chrissa
cranworth
assou
durland
yout
hillmen
montsalvatge
bailamos
yusefi
deleón
montiglio
llanstephan
hassanali
subtile
zene
ribfest
shamelessness
selawik
makula
jibberish
vonlanthen
weepies
feindouno
widemouth
kreiger
tinners
dyserth
rugen
mojos
umoh
ernster
mawnan
raynard
romelu
nustar
brownley
lisahally
rupak
blehr
oligos
styraciflua
saire
abruption
elberon
santy
huguely
biddles
radiodurans
charke
brookmyre
bonsoir
arnow
junaidi
flowerdale
pingpong
sveinung
zillman
helali
seminoma
refah
beyrle
alho
yeaman
easterday
nettleham
weberman
detraining
daems
sarj
mantoloking
pollença
buzzie
genachowski
kydland
tarusa
drager
aharonot
spensley
duato
carmacks
pinhel
nacd
stephensons
kobiashvili
vildagliptin
garaventa
tokidoki
piketon
dueted
worner
crevier
oshodi
caporale
josefov
schanche
cytarabine
cheapened
masuko
galluzzi
aguiluz
mosese
clavo
chinwag
saffran
aygo
saxl
ptsa
belic
illertissen
ncell
chiney
ixchel
bainsford
overtraining
goheung
astillero
vcv
olympism
koppenberg
firminy
sinfin
refracts
skinnier
trebanos
bethalto
dence
deftness
obba
gressoney
lyuba
retroactivity
straubenzee
mitsunori
kearley
pontnewydd
killens
swensson
gallé
bhata
peepli
clamav
shiek
narasingha
novitsky
iaq
bendixen
digirolamo
camogli
cremers
tamped
marinating
diagnostically
lanker
sadad
gritton
fakt
geopolitik
kupchak
dulaim
semiannually
adibi
hogle
zuger
maravi
doctores
mohajerani
neener
nunns
worke
clubcard
pmdd
slughorn
perejil
gabbiani
diegues
mosasaurus
whiteleaf
derivitive
myria
throgmorton
solemnization
debroy
serps
daks
hfpa
poorhouses
colfe
rockness
selbourne
burgard
shadle
cliffwood
rosaiah
minorites
swayamvar
abramtsevo
dlj
birkinshaw
svara
konkret
untrimmed
sfrs
plibersek
chiltington
wkc
cadenhead
eii
seaperch
vassalo
aloa
dahlbeck
holborow
soufriere
casarsa
halda
cmtc
mcgrillen
avants
geurts
santanu
benison
blancornelas
rorer
ninilchik
arop
mattoo
penmachno
francella
umit
teufelsberg
coys
redeposited
hriday
acampo
contextualizes
heima
redha
windoze
decoupage
chenghai
mjo
rosière
cholecalciferol
nbcu
digon
liscard
samarrai
pixilated
cvf
yandi
roundball
kosteniuk
sysplex
fayzabad
dechant
emilija
ozouf
saidin
patrington
overclaiming
blissed
kalonzo
healthiness
sefrou
jerson
cobit
inape
gozer
rocheleau
misfiled
dembele
pursglove
houngan
costings
nevadas
burhop
agribusinesses
pierron
microfiltration
sabam
chillwave
kalfin
salkey
morda
disentanglement
papercutz
blacktips
inadmissable
thermoforming
giros
athanassios
wenzhong
friskies
rospa
reaccredited
tolkan
delorey
chugiak
finaghy
papering
witchhunts
chocolatiers
stupendously
blenkiron
schenkkan
folz
dromio
acemoglu
brookmans
imag
prostar
glowacki
grimsay
asselborn
duncansby
diametre
cursorily
hotchpotch
fgw
heligan
chatwood
salade
basili
marzocchi
grossers
andreia
jubeir
grua
rabaut
fadiga
moblin
parkton
morres
tamms
qaq
maynards
pompom
salsas
naughtie
nyahururu
gadiel
searight
roadsigns
loutherbourg
balby
starer
barkhan
niney
remyelination
susette
olubunmi
acpe
petrona
robertsons
freewheelers
chellie
ventham
meticulousness
reitzell
choirgirl
ederson
rianna
badakshan
overeat
stagedoor
nanocrystal
igl
turkified
kendrix
tweetie
extraditing
lakka
maladjustment
eastley
jusoh
kharja
cicchetti
discounters
alisia
scepters
wepper
diarrhoeal
ardiente
despues
disestablish
importantes
sots
volesky
perona
bryntirion
vinal
cretins
sackboy
availble
dayside
meline
balmori
laetoli
hfea
episiotomy
deboo
shopworn
taproots
mcskimming
ramola
muxloe
quickens
broden
loebsack
baiocchi
jicks
divans
biodynamics
jassm
alliedsignal
mcevilly
tableaus
kwiatkowska
brutha
moloi
chiredzi
holdem
ksee
zazula
municipals
basinski
piolet
sakib
pelser
bohjalian
kdi
gushiken
kralik
scotchman
qmul
castellina
fermina
dhiyab
undischarged
greenlighting
rosuvastatin
tuono
niagra
kastelli
salvager
notamment
kreeger
dateless
somaly
aviad
rusthall
bramford
thrillseekers
gurg
maquisards
eastmont
ethell
pizzuti
misidentify
bennigan
rosher
ramonet
dreijer
decouples
whomp
amelung
commanche
norkus
ultranationalists
anadol
myoblasts
metzl
shorto
derrinstown
mothballing
kinsfolk
leibel
yoshikuni
joura
odah
hakkarainen
navs
zankou
neuromas
genx
randerson
oafish
confusedly
boydston
protv
papadopulos
mudguard
trifasciatus
aequorea
spermatogonia
blashford
lysiak
demaree
roula
unfortuantely
erjavec
akinde
zahorski
barwin
klesko
zilpah
collyhurst
viewsonic
tamsen
mitchie
postfinance
reelzchannel
wartski
westfeldt
devic
allaerts
marysia
mazariegos
dinuba
roade
galperin
achievments
stridor
safc
ashkali
hafan
botolan
desbois
talam
mandeans
dhiren
strabag
sideview
dragas
segredo
percée
tossa
aedpa
paneuropean
mmrs
stiffy
clacks
klumps
clairvoyants
wallraff
richetti
koefoed
livaneli
sandfields
jubelirer
minou
bagai
erakor
situtation
kininmonth
stoy
cheesecakes
barkleys
glasshoughton
jarius
vernard
zemel
berewa
swallowers
boutte
kundig
krosa
lenhardt
lindhome
jellison
eldeen
cimmyt
bomet
firmenich
barzin
laholm
viles
orientate
siss
ametek
goshorn
honister
valastro
israelsson
englebright
connal
hungrier
feresten
visclosky
ivoirian
hsw
killgore
asika
chevrontexaco
tenere
giacomin
yavorsky
isua
pancuronium
burano
gastroenterological
güines
tamez
clatterbridge
ahmedi
philliskirk
creegan
wowt
jcn
goalline
porchlight
schu
raubenheimer
slovakians
maasvlakte
chongo
saeco
largess
hudepohl
autoworld
taynuilt
dhondup
rodnik
premarket
fraz
tricolours
imee
shabaks
periodico
albyn
gordis
dalmat
microbicide
heiti
akeel
caldon
ocado
hennacy
vsnl
fongshan
bredius
backhoes
aristegui
plotnikoff
tabárez
moonshining
ovp
badaruddin
petted
dimdim
educa
mirvs
qamdo
gearless
canadia
exequiel
tomotaka
yess
grayrigg
wistrom
golliwog
fiapf
buika
pterodactyloid
veneranda
seckler
hafun
jorda
diridon
quarless
zarafa
lessines
caundle
rautaruukki
playbacks
tomassini
knacks
pootie
aslr
custodiet
mullenweg
copacetic
maurstad
zmuda
mohns
patashnik
trosper
cabochon
tambach
khiam
asociado
beden
yuksel
machat
zfp
youle
groseclose
galipeau
sandflies
quie
couey
ghobadi
wftc
ermera
yakou
tulay
falsities
greenberry
vecchioni
chomhairle
pertusa
schager
swats
pluriel
pollinates
sheyla
cadder
snitterfield
daaé
chive
aghdam
underinvestment
mahamuni
hgi
legay
mawlawi
tadgh
martinov
srini
sibolga
hengdian
grassic
helson
sideway
kumana
bouncin
sidiq
politecnica
tripple
rogi
sunedison
saisiyat
classmen
contamines
cybernet
longues
unmusical
sopho
sixgill
blinken
orthopedist
sorber
pagonis
leenane
monaldi
flareup
spraining
adrain
reffing
ecocity
cameoed
incisional
lipsitz
hbu
thamesdown
penina
sambourne
fonejacker
galvao
clonus
fevronia
zerby
richebourg
otterson
blueness
filmstrips
djuric
loas
kernodle
mycophenolate
frontis
optim
wafting
bonas
equuleus
yamamori
merrymakers
psds
poilievre
fraidy
turnt
lovick
unseasonable
tastiest
kouadio
unchallenging
koita
sambueza
mandt
lyssavirus
nucular
kyriacos
cronauer
havnt
magilligan
standees
toshimasa
urmson
emett
langewiesche
jillani
nadirah
ssafa
bilocation
loroupe
refashioning
chaldon
coxsackievirus
pencier
unaged
selectees
wolfinger
tabulates
imputes
landres
tequilas
zorita
foppe
sillah
nabbing
sugarbeet
capay
raffetto
pueden
matmata
searchability
touvier
falles
celestis
araneda
buey
diriamba
voler
electroluminescence
pushpakumara
mattawoman
privalov
calmo
trivialising
unhook
ganey
hmmwvs
boada
daric
preemptions
halwai
pzev
naskar
lenah
mastracchio
anuta
sittenfeld
califone
llangwm
anec
trejos
binging
cansfield
aurukun
federating
oxandrolone
donnarumma
immodesty
spindel
biobased
struwwelpeter
pillot
mabius
ulay
lumbago
heluva
instal
knowl
salopek
roeber
kootz
riling
féraud
lubchenco
borka
mersham
auditable
bolney
radcot
montsouris
xiaoyun
porcelli
gadzooks
leaton
slovic
semley
averbukh
heikal
joanes
demetriades
aerators
mascota
shakhriyar
badbury
jackboots
sperms
hobbiton
badalucco
nizer
pny
wheatstraw
ferma
doubter
iglauer
radich
mcgillin
jaragua
handelsbanken
bhisham
childfree
perran
whinstone
luq
dietrick
nonny
nanon
egholm
lawder
seargent
googleplex
naxalism
broadneck
brassieres
ethologists
auvray
justs
tamarama
haikal
palter
halligen
gawky
mcaliskey
flatscreen
squints
anovulation
iapa
ccsp
agonised
golina
marasco
burnier
profe
jizhong
epperly
nyserda
ediscovery
reorders
melvindale
keetley
shoreward
newcrest
pagliarini
waldinger
chokers
shrubberies
tchenguiz
takie
pentiti
brownouts
uza
radic
wincing
shayk
ablate
alejandrina
feillu
proffessional
qaasim
verrone
wickerwork
olkiluoto
decebal
garozzo
phadnis
nicta
mediobanca
belives
matchboxes
shabib
bramshott
spaun
oceaneering
rescan
scheibner
landow
fanz
illion
wreg
safarzadeh
krest
freels
dentons
wallaroos
deschenes
tuneless
unterberg
ircs
gasland
jdub
nln
bifocal
kooner
kouakou
svartholm
lurgi
bottin
powerbroker
woodgrove
delbanco
barredo
osheroff
follansbee
finnell
pettini
yfc
taikang
triumphalist
therizinosaurs
mbtu
terbush
crucifying
pshe
rawest
nepenthe
tifosi
mortuis
ieso
abae
toge
pucher
orona
meggiorini
zesh
ciff
palce
mbalax
hoboes
misguidedly
decission
scavullo
dubas
equivocate
armendariz
backboards
sublease
lachrymose
edir
tise
lancha
conaie
inktomi
lecesne
craghead
logmein
sugarless
leedsichthys
sidan
mthembu
varnishing
cubin
azamara
itkin
cogdill
ytn
stutchbury
swarcliffe
bathtime
anshel
bogglingly
vtsiom
wive
graeae
liley
hadaway
fiocco
daish
notario
tosha
sutyagin
smiddy
vignerons
creaming
gardezi
ileostomy
himars
naughtiest
falfurrias
jega
karpinsky
gompert
yueng
ryabov
fakri
ipilimumab
fleischner
blagrove
hemse
aspc
spindled
amerada
barabara
bednarz
faiella
jagodzinski
optix
wateringbury
mcgegan
hully
katiyar
gendarmenmarkt
shrivelled
islams
megamouth
shurat
lacen
guruge
tajan
hende
yaverland
catani
shiran
chevreux
hillson
ransone
silcox
matzen
shoveller
sebastiao
mansbach
alila
racier
alizai
diagnostician
ifil
vavra
nevski
benedictis
urumaya
peles
coplay
ryke
bfgoodrich
pasternack
mcdonaldland
dref
clia
gyrls
illovo
equestrienne
basest
rasti
dyskeratosis
jarad
orgun
haidee
detectorists
stassi
diepkloof
molests
koecher
voke
crowl
lavishes
jnl
penpal
purling
boeckner
jent
subhani
mugnano
fahima
gisevius
konstmuseum
parlier
godshill
knux
creedal
elvino
prestigous
conservations
dorléac
avgn
douris
snaked
foglesong
jezierski
glanbia
kalvin
lawnside
nemet
egads
hodak
moonwatch
pommels
metronomes
barkman
duvel
effusively
minicar
buddenbrock
faryal
rollerblades
droney
blackhurst
phit
dicenzo
ringgenberg
divorcement
osmena
gelmini
milans
isomura
yiwen
fatoumata
attahiru
kopper
cofa
creigiau
washery
aronia
bolkonsky
chrys
yanna
makgoba
bartenieff
funn
shanken
yongala
kvoa
hyaenas
numberplate
rdw
klimas
andee
burghart
kenansville
otai
murashige
lodore
shoigu
donoughue
lojack
ysgolion
boulé
mediaweek
poultices
bidets
iyogi
lbb
roamers
storagetek
cellnet
mabyn
yuanqing
alaka
kranepool
impoverishing
canonico
cetuximab
pssst
zajec
folsey
aasm
mitie
khambatta
couder
bidco
bahk
voriconazole
bearse
reinvigoration
pibulsonggram
mpower
smallcap
bortezomib
djebar
idfc
aido
bromelain
hassey
salterns
gentz
thermogenic
weisburd
dydd
terrine
ollin
etuc
pedobear
annita
myrlie
bhoy
strøget
easels
smokefree
reaming
pavelski
wato
persecutes
eryn
taskings
rionda
bundanoon
dhere
casualities
beckworth
thakali
myeloperoxidase
festively
jablonka
hassenfeld
lissoni
pollarded
guirgis
hazelbaker
mcmakin
baidya
rhoderick
jonel
horwill
quisque
gestes
ssac
museumsquartier
reservas
blagoy
banharn
dolder
filmaker
harvestable
sandbagged
fouchet
ghostland
pizano
maver
duncton
borislow
stinginess
casentino
wakley
priora
loates
infoway
harmine
haraszti
assocation
kippel
nelken
chinchorro
aecc
dyment
ballards
laudehr
früh
rachell
attendence
noury
cognisance
scaramella
palmachim
pozniak
eisold
traveline
alamar
malzieu
fictionalizes
manit
sobh
halters
bacopa
kasdorf
akot
roomettes
bioethicists
dandyism
aafa
ponnusamy
ignorants
tamrat
tastee
wilnis
sotogrande
iccrom
incredulously
cavero
timewarp
siden
yardie
twohy
yquem
archs
rockafeller
muñeco
retrogressive
plinking
lobotomies
wildmon
adlabs
schrab
gutkind
carf
shaposhnikova
loffreda
indiscrete
glanusk
crescenzio
baleshwar
bombora
overfeeding
casy
anglicise
sravanthi
ipswitch
alwoodley
appro
truste
bastienne
nurgaliev
duble
nonflammable
adeniran
kinglist
shouyang
nbcolympics
setterfield
mugdock
sayasone
kurosu
poventud
monchegorsk
robonaut
paramours
ielemia
biscardi
seppänen
lewan
serait
gaffs
nafti
emplace
gogel
stratégique
comida
moodys
loutish
varshney
avy
unerringly
prefering
onie
schacher
himmelman
quater
shadd
spyrou
prah
wigdor
tskitishvili
sharable
clcs
kbh
hiwa
aleady
hrdlicka
nkole
motyl
ucpd
croaks
gazela
noddle
soysal
cyberarts
moshinsky
corporatization
premarin
milgate
holdco
dnestr
hurel
diat
gardened
jetton
soref
fangzhou
maguari
muno
inforce
yeaton
stant
gesser
lenardo
pigbag
bikey
symbolisms
hashizume
superwasp
dampener
megatrends
kapllani
nitazoxanide
matsson
looky
meeuws
poseurs
prendiville
incriminates
bidlack
layby
spady
muscio
highborn
tongaat
gibside
alhat
mulyani
roughage
raffa
talkradio
domizzi
cpic
pnrs
eringer
maslanka
serviette
timoner
moseneke
funmi
carrels
aadb
dierking
tein
terracycle
kardan
paffett
cautery
vaccinology
gubb
yesayan
guarenas
crathorne
abetment
gellatly
kiros
dagga
shearn
droogs
hooi
rooper
diagouraga
dälek
jalani
walchensee
monita
warith
wigfield
wice
kikka
abera
yonosuke
roese
choummaly
tafolla
qgp
lautier
ringelblum
novakovich
kamalesh
perriand
simpletons
skimboarding
pujehun
egot
wheedle
warshawski
evanshen
abdelnour
haiman
rscm
cichon
rodenticides
mattering
volvic
linebacking
xianwen
padano
chikyu
deaden
milic
siems
yellville
brinn
stockist
mckinna
pdca
granelli
imipenem
rahme
britcar
sakashita
wium
ramelow
massam
ffion
pbxs
greber
widger
shoreland
biogenetic
wytch
calio
tolzien
yemm
cusumano
atliens
mccague
forgy
chertkov
artesunate
siaosi
strathtay
dills
precuneus
valenzano
southcom
perahera
sandfish
causas
bourriaud
burtonsville
bannocks
stoppable
gudjonsson
saxilby
ataqatigiit
sommariva
commutations
pöttering
anestis
isted
mugica
hru
prejudged
boti
lecroy
whitter
bonine
gorging
rexer
pinchers
kebo
tradeston
somby
wailed
presbyteral
hekmati
wiebke
hardan
fkl
penshaw
nkala
gomersall
marcolini
hinkins
cfk
hibbins
multifaith
rieckhoff
khanfar
delwyn
chickenhawk
hylda
kreiz
kibuye
chiclet
alwatan
chegg
tirrenia
khashuri
geniez
lindoro
sentimentalist
haseman
wilderspin
mckowen
feerick
reforest
dropzone
vanger
shillinglaw
satc
alguacil
huntercombe
gilovich
gbg
basks
jaakkola
lorine
mcglew
grilo
revanchism
tral
bessi
pacaya
aptx
louann
johnna
hadjar
dashiki
magisters
goldi
burnat
etemadi
baric
gunnarson
nonis
stepparent
lentin
mccluer
jaguaribe
tumbo
gerris
demol
bonnen
stockstill
worcs
hoggett
gilpatric
prym
mancham
smallbrook
associados
olasky
hetzler
manfully
macas
hirschbeck
dubaku
flunking
cambuskenneth
calcifying
vinas
staniszewski
discussant
barnetta
eichman
lousma
boreyko
mazzi
feir
hamouda
parkar
feddans
peregrinations
banchan
gephart
njord
hanagan
sugiuchi
misplayed
kelang
lamichael
chaddock
sauerbraten
bonior
kashia
sutin
worchester
bochenski
bilberries
karlstrom
boukhari
hochstadt
sekulow
carfin
tienda
metacom
midson
aloneftis
beechen
balnagown
naringenin
leamer
eaddy
keyline
ockrent
yanbin
marsay
schiel
airforces
shiff
kaltag
girlfight
redrock
harimaya
slingo
gonorrhoea
slayden
amsterdammer
lechon
holmlund
graysmith
kondratyuk
dienstbier
amrabat
dunand
agwu
paternostro
keuchel
oshun
boxscores
daingean
knighthawk
perre
prozorov
gosia
adoc
chele
coronella
dykeman
kitaen
gazzaev
mcnown
songer
mincu
riseholme
hippocampi
idit
mawei
inon
personalizes
ständchen
tobgay
iliotibial
gonski
federov
piia
cardinham
girtin
jayashri
mcleese
kvue
bimbos
roston
ponche
ronel
clavulanic
cyclamate
fordell
scrutineer
bonfante
conchiglia
decanters
gyang
commanderie
marambio
novruzov
accessorized
elemento
prew
latroy
courchesne
soret
petrological
cherington
ciancimino
wassim
zifu
hanie
drys
timberwork
essentialy
portu
pousadas
backache
sedlar
rippe
nonspecialist
wickstrom
ethylamine
yohimbine
brv
norfolkline
paeans
wasko
asiedu
itron
ladu
maravilha
infelicitous
iets
yanwei
shenkar
gouldian
navorro
fayaz
lmk
knk
mosis
unsighted
krukowski
wyc
zoubi
duchesneau
klebb
nabiyev
rainsborough
badelj
demerge
jongkhar
alvington
delana
velits
qaidam
grunebaum
selbie
superjam
peplow
sikelel
maala
shmuley
sonification
casm
maasdam
chidlow
hunmanby
rumaila
gew
annamarie
ikemoto
gummies
ukba
woldenberg
askrigg
kamut
yehoram
metion
gesellen
romaní
flatback
fluticasone
affability
tollington
cnam
dipasquale
spatulas
ankan
karimloo
houts
harger
zhuangzhuang
battan
ulyana
unpublishable
demostrate
rlr
chateauneuf
chadi
wakerley
artistiques
wsyx
bodysurfing
kamewa
autoexposure
hvd
enosburg
excitingly
elten
fritzie
uists
salha
hessa
amref
gyrator
techblog
nordegg
heuliez
zenna
bozar
greensward
risques
morva
riggio
unbalances
coplon
mepham
warnapura
muter
hymans
parshin
mindreading
benkenstein
tunng
bullseyes
washbasin
derating
kapinos
recapitulating
purkey
morace
janjalani
czisny
disick
henllan
feichtinger
libano
mflops
kendle
slotkin
kokal
motorcraft
hankyoreh
micombero
mikardo
hypnotise
cohrs
sayako
punctually
faceoffs
usian
mukhin
wadler
circumscribes
renay
tiley
euskera
dnata
obtuseness
loncraine
buddie
bullshitting
mazzello
pattersons
wolyniec
niddk
odel
auki
uthr
decertified
krejza
gubernator
pflüger
shenhar
margallo
weidenbach
korniyenko
roundedness
hersee
fango
rspo
stellwagen
chande
gittes
rovell
enthuse
sanel
banalities
rotton
geissinger
vanderlyn
cyberlink
apostolis
induration
kindertotenlieder
egalité
puda
ignatia
krass
mauves
worming
abysses
bilqis
walderslade
howieson
expansiveness
badstuber
dettmar
dzama
wifey
lundon
decubitus
mellowing
textualism
bahla
chiusano
getachew
plinko
clicky
kudisch
moskvy
ambassadress
auten
riversides
kavinsky
seasonable
chengdong
tafer
roelants
samh
suplicy
slub
extrajudicially
dumlao
evilly
sonnett
pudwill
posca
nadella
postpile
celestini
attore
signon
oravec
demob
dystocia
gansey
ppmv
dargent
paananen
terrasses
octone
kosolapov
mgarr
williard
dickau
sauser
southon
niquet
cheslyn
nancie
touchingly
waddock
mcteigue
unobvious
kuah
masseurs
msba
ords
langney
eddleston
waldhaus
ashill
comision
haberkorn
equivocating
dabhol
dowlen
trysting
lagerbäck
earby
malave
barzman
nursling
woodhey
grabovski
aiono
flacks
mydd
ijtema
loueke
razek
junkets
amidi
chrebet
maghery
bereit
yokels
boatbuilders
raese
sfca
darine
steinbrueck
pomes
gorbanevskaya
jeck
sobti
turque
unremembered
gloop
effron
kindergarden
ceyhun
dyett
skirvin
boeri
sibneft
yaqoub
blendon
kritzer
reyat
inquisitiveness
valmir
snickering
ellenson
yudi
wendla
premalignant
vallehermoso
arner
decimalization
suuri
makie
buerkle
decoster
hurtled
dealtry
reddest
xochitl
mhlongo
delanie
palsson
pelu
hoedspruit
humam
karponosov
brinke
bicycled
kennie
spiering
killay
takac
bbcso
lentivirus
finishings
pouter
metalware
overdid
chloromethane
uahc
zimerman
yinan
overdeveloped
unauthenticated
skeoch
auchtermuchty
sifts
aghia
sanlitun
ocoa
discused
scupper
danet
unburdened
schwedler
circumferences
carolynne
anouska
pipedream
efavirenz
macroeconomy
abib
peaker
stanczak
mitoxantrone
grantmakers
misbranded
plink
innoventions
ibms
benington
pocari
rishab
unconcealed
staelens
hetar
evgen
mandrills
zuno
microdeletion
jirka
torchon
havea
allesley
unusal
hardebeck
pead
simplement
teetered
anonymising
voudouris
daerden
baldivieso
porrello
kuzin
minea
marketization
challanged
reichling
mercereau
covino
tonry
blute
overemphasizing
beever
superhighways
ifixit
huestis
ikramullah
bakradze
moussavi
nikolskaya
strakhov
velonews
axcess
phibsborough
paradisus
nimeiri
vashistha
hyperbilirubinemia
vrv
daara
rahnasto
abacavir
sulphite
tranquilized
reseeding
respirable
wrighton
fenzl
amonte
cordery
rreed
sculli
canoed
hussen
quess
etem
teruhiko
bendus
experince
hegemons
schnauzers
nassiri
jurrjens
papé
bagworth
cuckolds
deisler
niederhauser
fuquan
brissenden
hoho
broadhaven
makley
sibthorpe
thumpers
speraw
chibber
dzau
galimov
deleveraging
rmw
mudang
nonscientific
exculpate
jussy
aikau
perelli
globis
nienstedt
irredentists
brul
mcconnellsburg
aascu
hencke
paltalk
qianmen
kidzone
arborvitae
mgg
jakubowicz
loxford
silton
mcsharry
masika
niedecker
zucchetti
fimmel
kolton
kontakte
gundling
valaya
dobles
citro
cantaloupes
llanrumney
fiap
stebonheath
olkaria
itches
lawhorn
flatlander
ruari
yousry
koskie
enjo
nsai
zylberberg
tião
hotpants
detainer
panettone
marou
wjec
arrindell
lkp
madacy
scanzano
studen
cataclysms
sharlto
stwc
yambo
lurd
conason
baliani
letton
anderman
amper
fruitmarket
securitisation
phenoms
armoy
euobserver
overuses
lohnes
tainio
turse
ouf
cardiotoxicity
ominami
tichina
podrazik
phn
bosu
funaro
schuerholz
macaroon
hilferty
schoolrooms
perttu
softail
embajada
akehurst
nimura
verdell
abourezk
iwg
aiona
unipart
maggies
iken
aecs
crisafulli
savara
wezel
kauder
midgett
buttolph
markree
nkhotakota
leats
laydown
harkless
tesuque
mphasis
proably
achuar
crosiers
pregunta
plishka
lineham
talkshows
phytotherapy
schlage
artprice
brandenberger
brassaï
eviscerate
profet
omes
ensberg
garreth
goair
intraparietal
impregilo
cesg
bisutti
grimsbury
consumptions
whetter
jeanmaire
larian
cambi
lsds
glaspie
radermacher
carulli
ambx
goil
harmoni
carndonagh
iolas
coloradans
heisel
murofushi
nikesh
youqing
lamorisse
ragout
ischigualasto
zutty
pennywort
churl
tvardovsky
unprofessionalism
mapletoft
gianpiero
kimia
cleanings
likeability
kilu
rorion
bellbrook
metelli
castletownbere
barcellos
quadriplegics
madugalle
dalham
sembcorp
fiascos
eket
cosmosphere
craftsperson
caseworkers
baseness
entasis
famiglietti
beniwal
manches
uncrowded
calis
goutham
liquefying
brocq
lamonts
edwinstowe
lamplough
beston
furtiva
bleicher
clunker
ameet
tanney
bunnag
kolodziej
kondi
puttock
wiehle
suketu
schweers
teneues
lnu
takahira
disempowerment
cby
garad
minyard
markopolos
midcap
nerlinger
anshutz
autopolis
travaglio
puréed
exl
rezaul
junquera
ishola
thermostable
sudarsono
sxi
vaginismus
jovani
qik
disingenuousness
despoiling
grayden
politicise
illegalities
reconfigurations
carabajal
romanchuk
puteoli
whistlin
muriatic
scalers
stroppy
shoudn
tuija
campain
mirena
debeljak
laulala
saphire
pepín
koroi
castoreum
birecik
cutman
faleomavaega
papilledema
lannie
chizhov
hinny
brefi
rabbe
brunansky
bajas
kaleida
yuschenko
murin
knuckleballer
uncommanded
tetchy
pastorally
lfw
keetch
keteyian
swearwords
dayrit
dreesen
limpieza
lisanti
kremers
adrenalize
ibg
carius
cisr
responsibilty
cetirizine
suspenstories
panduro
tharthar
nazare
muray
barbarito
wenr
dreisam
qiming
panitch
intubated
storycorps
kalbfleisch
tabbing
wowwee
cutlip
selston
kantonalbank
spol
hirudo
unmarketable
hasley
gallinae
ashkin
kanamaru
lovullo
dimitrijevic
coppelius
retransmitting
acnielsen
extroverts
yough
kenon
sohaib
fuson
meskill
goltv
trautwig
sedgeford
mcj
refuelings
jaks
xex
banegas
grauniad
tianya
schwartzwalder
grooverider
marmer
timeshares
imation
hapka
psyd
laurila
muncher
estin
steamroll
ccba
lorenzino
sepideh
decoursey
padoa
coned
dapp
newbigging
tribbiani
dehumanising
sententious
pogrebin
ingenieria
kerne
greenroom
scanavino
malkhaz
rosete
anthills
amdur
leidseplein
sheerwater
pointman
lovelies
streat
nussey
pasteles
iber
mahlasela
schnapp
segalen
farveez
chudinov
stepashin
blaspheming
uproariously
climacteric
umcp
franich
eighe
witta
sè
fearns
ghulja
bheja
airspaces
adolfi
shankle
pilfer
shanmugaratnam
denesh
nobunari
santero
capodichino
mofatteh
probenecid
honeymoons
entender
snowboardcross
goudstikker
sharar
yasuhide
poundbury
lisiecki
forseen
kunkle
cackles
sharik
cropwell
juntunen
lueck
ganti
mutterings
semington
friargate
gamze
crymych
ireson
spectroradiometer
buzkashi
busyness
bioplastic
flightplan
tenaha
kely
moaned
idriz
mitcheldean
ferozeshah
epoetin
rokas
akora
keltic
stewarded
innervisions
eleri
queenwood
willmer
dibromide
bréguet
pickelhaube
solaiman
portanova
casley
janita
reincarnating
steppling
bresso
transferor
llandough
woerth
pkf
escwa
batenburg
watmore
drissa
mayeda
baab
xao
papantoniou
ringler
gelernter
cervone
playdom
mashiah
puttick
lobão
asperity
mrbm
ukeles
continuosly
chewin
solidarnosc
collegues
mallmann
kohlsaat
redcurrant
immunodeficiencies
dymally
multiport
isaach
manacled
inculcates
constitue
boggled
wite
terrytown
macki
renney
mohon
nopd
vavrinec
drillings
aftereffect
masino
shortsightedness
zhambyl
dindane
gfe
wayburn
gwot
acclimatised
rarified
shimoji
stalemates
virts
wena
pilocarpine
keyspan
homebrewers
atilano
keiths
maggart
mastopexy
orane
turcan
rhisiart
heatons
taqa
bobbies
defoliate
shibam
cdfa
bankwest
ncbs
kindliness
ashlie
dromaeosaur
wranglings
hjalmarsson
dharug
boyson
quintupled
adjudicatory
warrawee
prefilled
openair
candlenut
masqueraders
ultrastar
almac
koromo
sopris
robak
hispasat
livecycle
pheonix
lièvremont
marlis
trela
ostrosky
pflimlin
xiaochuan
elgoibar
calypsos
marcó
andale
wretchedly
dunera
formans
albio
rinaudo
sirigu
reiners
overhaulin
caravello
movieguide
briese
poidevin
suryadi
grewcock
amiina
cathi
moiz
taubenberger
bossio
obraniak
priviliges
upholsterers
permanant
davoudi
brokencyde
mouses
mythologically
ignatios
vardenafil
ratlines
holtkamp
azran
recirculate
benussi
simonsbath
sibutramine
abarbanel
tabulators
takeup
disrupter
bobber
warwicks
ladykiller
thupten
serac
visored
fresnos
videira
charland
kowald
dhalia
knowledgeably
damarcus
sniffy
appoggiatura
herrion
unprogrammed
radiocommunications
rockview
retrievals
seabeds
paramilitarism
redhills
pevear
aphasic
hauliers
clawdy
spirt
parnitha
eyraud
freinds
cosgriff
cyra
ryes
panetti
anyday
phw
inroad
chelsom
campanis
lifecasting
ruhle
snuffleupagus
adlan
mordden
sasamoto
delagrave
dauphinois
completists
farai
szanto
keogan
piga
glared
goyescas
ngon
torahs
nelio
hendrikse
digicam
baburen
srbijagas
vanak
ramsland
stanningley
austine
belsay
micronuclei
vespignani
chukhrai
hellers
calipatria
asherson
gotbaum
shangdong
oshio
obernai
mérieux
ujc
eede
firming
indycars
culverwell
birindelli
goodey
djou
norbreck
razorbills
mccartneys
konrath
petroski
billes
aronica
taubin
timegate
serratos
feiz
encryptions
esterline
labey
corestates
embrey
chantay
midamerican
siedlecki
elyashiv
afros
zaxby
ghassemi
hexed
karoui
masae
neidpath
napped
morgantina
cliquish
boystown
cowhands
utopic
rosenbauer
ghezzal
diterlizzi
pawing
urbanists
mcenaney
shawon
reassigns
amphenol
ponticum
lambrate
langhorn
oanh
bisby
lindheim
chakdara
stefaan
froxfield
abati
koner
labus
frankenburg
dihydrochloride
orzel
mccarthyite
tjeerd
chomps
investigacion
sifuentes
trenor
joycean
smilers
frax
aerolineas
assadi
corleto
zaslaw
mansally
geovanny
govanhill
hooiveld
ifaw
agyekum
wote
clamoured
counterintuitively
acidifying
sudin
nerka
slatin
milmo
rossif
shamberg
skloot
filderman
wenhao
recategorisation
frech
hospitalist
salko
yaohan
uncomprehending
wirawan
mottisfont
strikas
meschery
woodlanders
shian
ollivant
klowns
wingen
calved
bours
rasmusen
wenjin
omnitrax
helotes
europass
matchpoint
marziale
weizhou
gtech
sudipta
bockarie
mynarski
beheer
lubricates
bitterling
thisis
toothill
dohrmann
tianwen
pikus
follini
sweetbriar
weinger
xinhe
sahebganj
graser
lerach
suffragio
spinnakers
shirked
anvik
neria
halangahu
balasingham
wassan
pengrowth
busload
frequenter
eckes
macrumors
turanga
médicos
raemon
punchdrunk
tuca
yealmpton
undy
birdsell
bradway
kerckhoff
prisk
kimitsu
felicis
inactions
mihos
ucan
mariages
lekota
runscorer
mukham
keresley
molsky
wanlockhead
unreasoning
inhumanly
otterloo
jamies
ulliott
surhoff
cityside
chutki
chukwudi
toosey
viss
himma
differents
northlink
vinoy
cuni
snaky
tkc
nolly
taliya
metioned
buder
ekos
deviousness
overblowing
mordente
nouble
gazzard
lachemann
tigua
nsubuga
okitsu
quickplay
bakas
findability
abdulali
burpo
kowtowing
vidro
chambery
giorgadze
reinman
ndia
colstrip
discomfited
ledgerwood
dagomys
hattfjelldal
colruyt
jammie
silkmoth
bonza
reattaching
kibria
peyronie
mahsud
datteln
semliki
jiaxin
ingos
helius
cogwheels
khizanishvili
yefimova
remm
sentimentales
lihir
heartsong
hitchcockian
interrail
awni
jabarin
criddle
berwin
rhinorrhea
investable
nurlan
leithauser
hoorah
naburn
lengthly
mizque
tottel
vouchsafed
ciriello
howver
curis
toribiong
faiq
kingson
isaza
mudhar
earlies
cohee
rombauer
felisha
diversidad
pazcoguin
yeend
chateauroux
zaton
abubaker
miking
procrastinator
abdulsalami
cordaro
cliffy
terrano
beefeaters
retrials
harleigh
bluelight
meddles
rockmond
ignoramuses
stahler
vaporous
monye
nissinen
kathrada
borrett
ruhul
kuperberg
biglari
sarty
oueddei
catenaccio
pambula
saltimbanco
cncs
succintly
strollo
eulogize
littlerock
somersaulted
petruzzelli
dimmest
freestylegames
haspiel
wamo
scte
sajani
centereach
epcam
brisingr
nanoprobes
swetland
orthodontia
underexposure
hcca
devient
arkivmusic
turnipseed
tamaddon
catchup
silkscreened
lohmeyer
noncitizens
vidra
publicmind
calpine
emigh
esenin
redcastle
parasitical
cansei
ragu
castmembers
bureaucratization
varady
cosmically
raffinose
harvell
yoxall
shatterproof
vorel
allografts
denshaw
machesney
papaverine
liebestod
potapenko
delancy
dubina
austrade
unprepossessing
rigters
tafur
ghelderode
duelfer
rendcomb
competion
overexcited
titfield
jiving
hustles
claysburg
kessie
faultlines
bogale
lanthimos
sudamerica
runnicles
wildebeests
reposado
butternuts
serba
jacumba
compartmentalize
iols
appiano
beiteddine
naxals
linnemann
cossins
scholastically
cahalan
westons
wambui
scheu
nidec
casabella
girdhari
sakra
creativeness
dilks
speckman
galy
everychild
badsworth
mimika
ahwazi
keswani
oxyacetylene
askeri
arendse
mcelveen
micronized
porcupinefish
overburdening
magnanimously
ellagitannins
datastore
woodhorn
hairdryer
beersheva
piq
rapine
gleamed
rigshospitalet
jenners
sidereus
boritt
monoglot
ornithorhynchus
lawrences
ormand
nonrandom
dscc
illest
vpe
oxtoby
plascencia
banyon
nayon
carryduff
ufg
aligners
aafp
nonsensically
wragby
bamir
bico
mankowski
holgorsen
informaton
hne
gasparino
cilfynydd
pbe
torgerson
winchburgh
orya
jajuan
loredo
canevari
uthayan
biorefinery
mirebalais
bogata
adss
dogfighter
pancytopenia
improvident
conservativehome
carlotto
dioner
astrobiologist
handorf
riposo
cleadon
leyda
moquin
sickler
roozbeh
teevee
oneiric
feoktistov
jasmeet
busybodies
photis
unceremonious
repetiteur
datel
biguine
giacconi
minneriya
multiprocessors
mwyn
gateley
shortchanged
graul
aggrandising
carolini
misguide
strickson
swagg
gilliatt
mauga
kurten
raeside
perseveration
lavori
leandre
cloakrooms
excipient
kytx
eniola
buic
masferrer
pessina
sbobet
ninny
matsunaka
aidi
discounter
jepkosgei
wingett
koong
triodos
panayotis
lavy
miggs
bataar
wbng
zinna
placentas
melekeok
traceur
xrays
urokinase
yekhanurov
minooka
gastel
arisan
mohen
dederer
claybourne
jazan
mccarl
sekera
rhigos
proprietorships
iott
gaiser
función
muson
driesen
nutsy
pellentesque
verbalized
operacion
seeiso
unheroic
finberg
xiaomei
germanischer
ziti
nechung
sarft
rera
warhola
meshaal
frykman
abakaliki
liveness
avana
satbir
luetkemeyer
binghampton
sexualisation
dreamchild
zwerling
whnt
dalmally
arashiro
coaltown
tenía
boushey
gylfason
fincas
geiberger
tohmatsu
sharrer
kazyna
bolasie
oildale
enlivens
wicketkeeping
superstrong
aselsan
countrysides
elastics
kojic
mycoides
plagiarise
affiliative
karyl
jinghui
bioengineer
gogar
tenconi
micachu
oex
taean
gunville
rejean
paltiel
cipp
cdms
kirgiz
huifang
shuld
somiedo
offerton
ballarin
freidman
palpitation
grouches
vant
ragdale
sortation
wystan
ruhama
baizley
kaboodle
missiroli
afpa
vennard
dwts
babajan
evon
sizzlin
guanylyl
unhide
tubbataha
triphosphates
paperwhite
verveer
baiza
krivine
massel
polzin
harpending
irreverently
mmtc
pasteboard
bryanne
blanquette
ramadhani
arbatov
firmest
movado
votevets
sjo
grazalema
stert
dishonoring
morays
amortizing
clintonians
chunked
fmh
nykredit
schermer
chaloupka
sporza
beitia
eickhoff
counterproliferation
norcliffe
longcross
diltiazem
boatyards
brandings
barlinnie
treaded
orum
kohnstamm
ringwraiths
clipless
cragside
epri
sinatras
yazbeck
nilofar
cajori
landu
béhar
strikemaster
shefki
riccia
deehan
unvalidated
tulear
eremurus
chafetz
osteogenic
plyometrics
barnas
giacinta
rubbished
dpmo
fereday
earswick
firebrands
wpec
tual
gailes
budged
knettishall
unnameable
honorata
istabraq
hadopi
kremerata
hermantown
hollender
waitstaff
udeur
radinsky
taibi
whipworm
wijetunga
burster
hunchun
maybee
gymru
wunderkammer
tressell
ohly
fluorescents
antifascists
budka
javnosti
transafrica
raskob
hsy
hirak
sauver
gangitano
tafuna
telmarines
gaboon
hibler
perseid
castlecary
ensnaring
montse
ontake
brookstein
garua
seders
scotched
dionicio
basavanagudi
kloppers
rathnew
esac
dougald
littleover
gadzhi
exaudi
merchandises
roid
lumpenproletariat
couzinet
swarn
bobos
herze
behrs
gajendran
kompa
damnably
wiling
sros
stepdad
yeoville
nerida
untermann
queenswood
ghionea
volatilities
queensryche
flvs
slingbox
abrosimova
bedrich
billown
plently
shitrit
dunkeswell
pursel
kahla
turves
gunnera
krumping
bozic
gillaspie
shiyu
queller
fairall
bodek
strongwoman
whataburger
salivating
ingrao
molted
serat
geter
rungius
bartu
fopp
hadler
blackduck
benoa
vph
pianosa
sikua
radamisto
cwalina
clearways
familymart
goverdhan
recommission
knaap
lattitude
sealions
aswin
emani
pesar
glennis
seepages
gullette
kesling
gonia
ayto
behenna
spab
lakshmipathy
pablum
goodmayes
bahaman
murren
frogging
magaldi
karolyn
hareb
crinolines
creuzot
tachtsidis
unpatentable
cissbury
aquarama
gulps
ormsbee
taboga
sorlie
liljefors
osmel
microtech
urney
urpo
viglink
worra
sonequa
nehgs
clarabell
reschke
semshov
tushnet
cofie
stambler
sayeh
gatr
vidia
morrazo
talty
collingtree
sewel
parkesburg
hiway
microloans
dorsomedial
vacationland
cicle
flambé
barle
nyers
tononi
rucksacks
kurlander
becames
enunciating
hifu
vibia
missle
jihua
fedorchuk
bokros
matsuya
irinotecan
audiovisuals
bjcc
vecchiarelli
rehberger
demartino
mobileye
mascia
mcgarity
rjb
nimrods
swindlehurst
kalogridis
ethnocide
sancocho
merevale
inbreds
baharna
igbts
holtorf
luczo
thermophiles
eulas
estan
fragging
shebeen
arii
foodism
acip
haematoma
makdisi
rocknrolla
haffenden
efstratios
bugbears
rueff
grazyna
moignan
santarem
tinku
gardone
bbbb
nstar
undesirability
fouch
slaveowner
koonin
zhenzhen
waylay
veillette
gimje
merrilees
zut
kujtim
zondi
edgers
cosmogirl
prady
gronlund
dieticians
mazowe
taggle
sawy
marsalforn
vrf
arrogated
garven
hourlong
rungis
haemostasis
soliz
hkex
margaretville
kinlaw
gajjar
unlf
registrable
blankfein
harkema
irfb
bakersville
wrvs
limbers
lke
condron
sibrel
pokphand
footrests
midnapur
chimneypieces
whittles
nnrtis
wytham
seferihisar
plastilina
sawad
inotes
gorgoni
ensnares
zehir
nobukazu
attender
nuseibeh
tylorstown
heldenleben
caergwrle
eliades
tarlo
nhek
friendfinder
delrio
atriums
msft
hyperprolactinemia
gjm
jarel
defeater
evangelised
melva
bouchardon
colubris
tuson
kennedale
qanuni
chamula
appology
wienand
uncrossed
nedved
unmistakeable
patzcuaro
calverts
zhijian
lizars
hecatomb
capillas
recoinage
ruyton
genecards
snuffbox
minuti
sanclemente
filmstar
lawgivers
pqs
nadelman
complutensian
holmbury
egizio
cinephiles
bloxom
hoani
comitted
microscopii
gunalan
gosa
bandslam
bradbeer
firey
lisanne
omundson
galadari
resizable
obrigado
ceratotherium
pohjonen
herskowitz
ramezani
disadvantaging
feebleness
riskiness
imparato
analeigh
canonicorum
hmie
elswit
beeen
hardenhuish
granito
kandler
altor
cindrich
aaiun
cfif
industriously
osbaston
arifi
retreads
trumka
sharpsteen
gawne
ruders
hurtwood
jilbab
fredricksburg
thiaw
uchikawa
baozi
tarlow
karran
calbert
mitac
pomezia
puhua
henrichsen
naparstek
borovoy
persuing
ifrah
astrum
volny
conceptualising
muslera
hindalco
colegate
hidemasa
nyamweya
clayesmore
tailbacks
isbe
wainthropp
cacciari
doorne
oxiana
yasujiro
xeroxed
detc
overspent
teradyne
bachtiar
zaleplon
gipton
merita
kitesurfers
shirasawa
boilly
negress
szot
airtanker
ankama
lignocellulosic
thouless
acai
kosmala
zaabi
morchard
fairhall
shtetls
dicynodon
bredenkamp
gormless
highdown
moncler
postdated
marquita
herrada
entenmann
sharmas
emori
edyth
tounge
westernizing
breakdancers
tashard
colico
louisette
shakh
roudebush
tabin
xeriscaping
posterous
unhidden
valy
mccluster
koppell
mindo
allegretti
dullard
haith
nooruddin
torchia
arpan
vaste
qss
smailholm
carvill
meenan
tanko
gopalapuram
albanel
tudjman
eesa
knr
darijo
outmaneuvering
alviano
lekima
trancas
waddams
bhagirath
rafaello
lendrum
crocketts
calstrs
tirelli
shaps
schifcofske
porgie
cmdb
clappison
tryna
tunnicliff
estampas
magrathea
buncefield
jash
twizy
greasby
airconditioned
fesco
orgill
aprn
goolsbee
stashing
wyatts
interboro
hould
hoffbauer
mariem
barugh
yuanchao
davidtz
sywell
stratify
rudisill
hirola
morstan
strew
qayum
manolev
peralada
barmak
hemline
schiemer
swimbladder
seatpost
smalleye
maheen
karhan
shteyngart
gno
trakas
luxx
winching
hachiko
bosshard
ignarro
fricking
mourilyan
gonpo
fastidiously
rosenlund
srijan
tresidder
beerhouse
dabke
hourglasses
gouffran
tigerdirect
casolaro
hadeed
circhetta
stellen
donadio
euroclear
lichaj
spinosi
hevs
refus
millstein
zanele
rivara
mcnabs
kalmadi
gradwell
bezant
morrisson
kuijpers
upshall
warrell
gurnemanz
tarabay
mucks
indianness
esho
sileo
yasunobu
dismukes
derogate
hubail
ffii
chelated
bvu
eisenbud
wric
macwhirter
tomoji
informatively
kaestner
choppiness
fairstein
adulterant
atlantia
teall
cabinetmaking
chillen
eastpointe
homesites
sercan
ruttmann
musharaf
paasilinna
schliersee
tenderer
alexandar
usml
jawdat
drolma
parikka
swingate
rockette
jeryl
sweetpea
uplinked
belchers
runje
joannette
caru
usfk
amyris
habituate
bullers
delsea
exult
stokey
massiveness
nelthorpe
bladel
hovenkamp
aveni
vraiment
waagner
intrade
indahouse
chichijima
bofi
geare
mchunu
peredelkino
tehmina
hesford
pcj
intoxicate
popularist
balsley
vcjd
andrina
asterisked
hcw
xto
montrichard
naukluft
greenedge
osteopenia
aarif
pigovian
clavero
rauti
igas
tarpey
millepora
pratz
kaufer
seacom
bozzetto
bartter
tariku
ismayilova
semmel
reanne
giolito
rhf
rakestraw
deaker
oologah
eeek
bandipora
gefitinib
ccamlr
clybourn
modularis
mcgimpsey
churchs
branagan
nijholt
pough
alexandrino
flh
antek
slotbacks
minium
muser
thabane
kilday
meskimen
efua
gordhan
trepashkin
candiotti
inxile
bothma
qaly
windemere
yagur
mayerson
ellerker
saqer
osteoporotic
rolley
neowin
goler
konger
sawka
kondrat
pogorelov
schmemann
goodhead
whitner
offiong
arachchige
fardon
siggins
humungous
bruschetta
amoc
cerge
yachimovich
techiman
darty
wadworth
folse
gaviotas
chadburn
rtrs
barreiras
bacrot
pcca
wendlingen
rugh
heared
truswell
rosero
kafeel
hooding
nold
markina
adme
uofl
midnights
airtrack
skyscape
brookshaw
onida
greeklish
blmc
chugs
misskelley
willford
mukaber
tidbinbilla
pclinuxos
ninefold
telecomm
marginalising
gerstenfeld
bluecoats
cabibbo
ibizan
misogynists
rahter
shikotan
lentiviral
nantclwyd
eitingon
decrepitude
bedclothes
trimm
luzerner
benninger
niman
mezin
metzgar
nansi
burrower
lavaughn
jiko
kmex
kudelka
kalsu
deemphasize
freeloading
aacap
homering
mmtv
writerly
tevi
waterbug
trapiche
guedj
pavía
slithers
snobbishness
munnelly
quartiles
bisou
mccasland
jdu
segares
spoonfed
brassed
blackfly
liquefies
vertica
loevinger
sukhanov
mishkenot
langtoft
adegoke
skoch
hidcote
reflagging
hummin
fregate
mahamoud
minyanville
censo
shuckers
blankenhorn
grich
wxtv
lunder
cursos
bassinet
kaban
bestway
akamaru
netbank
foege
gîte
kazmierczak
loduca
bvh
putaruru
afrah
brandstrup
polymyositis
narayama
harendra
pitas
trewhella
pefc
saveourseas
rostraver
macklowe
entin
matachewan
zoromski
buccellati
kupol
boenisch
wallonian
limond
estro
iland
newcomerstown
angeleno
stevas
nevenka
kresa
chiatura
weve
footsie
sorab
itfc
behrends
olsberg
milovanovic
sovereigntists
pomander
fibrinolytic
luva
murvin
malaki
yanacocha
akarit
wagonload
pdcs
prostor
carotenes
dimasi
gustibus
gean
ampelmännchen
tresh
premis
wibsey
jeanny
nonagenarian
cutaia
hilbre
tywardreath
veira
tacheles
collegedale
saadé
wolo
godforsaken
pietschmann
epigenomics
batool
trypanosome
tchibo
niea
superstrings
cieca
bubblers
colectomy
ravetch
realworld
kazel
tollner
perdre
lyo
tortellini
arbia
grinned
wiskott
winget
tishrin
feuillère
ballasalla
hornbacher
veiel
toughbook
scrivner
mbare
chantrell
hennis
banket
dhalgren
holum
malerba
opentv
mehlville
ynyshir
imedia
suneet
kenken
enkel
skycam
vernita
ingelow
meeteetse
awali
frigidity
gnaws
woolacombe
blankley
trowse
zaozhuang
lafeber
thoughtlessness
hexamine
combusts
apparatchiks
gschwendtner
odometers
moaz
bompastor
nonhumans
almos
gemstar
nocturnally
thursley
reclosing
sublethal
swaddled
kilsby
guedioura
lcy
nande
breezer
lsst
agoa
asshat
cardelli
vulgarly
chanakyapuri
kros
swoons
tocks
priyamvada
homeboyz
kidulthood
mashaba
hospes
flowerbed
sany
pluscarden
jhan
kayle
mounter
caramoan
palaniappan
individu
golcar
cfoa
crrc
habas
janga
trathen
netsch
chavagnes
krief
mobipocket
fagbenle
baida
animality
doomy
bioproducts
kaaren
tongogara
hotez
mckinleyville
photocells
trone
falkowski
staib
shirreffs
phocomelia
berghain
sumber
khd
alio
mansfields
manhunts
fiscales
wijnants
aeri
axions
vassilieva
bonasera
iort
grillwork
porteños
morely
lamalfa
papplewick
mangahas
scire
disputatious
kgun
merley
guthman
assiduity
advertize
coursey
oosterbroek
pinchos
rasikh
clarifiers
vouvray
ionides
grigorov
howkins
maddi
exceedance
loll
bordelaise
swiney
akinbiyi
israilov
wengert
rocketeers
cataloger
qualifed
paunch
trovador
hyperpolarized
shuddered
carucci
fisons
buthe
trita
lonewolf
isenheim
nicetown
chudley
coarsening
scurried
evocatively
vignali
qmi
dassanayake
kivel
bassaleg
lodro
bottini
scoglio
blurting
kleon
agema
matv
pirog
gravitron
hadramawt
kvamme
millian
marline
foisting
cudd
kuras
radheshyam
uos
galani
fluoropolymers
bleakly
emmaline
pennywell
bluffdale
adhikary
anglicize
hwo
konary
duinen
apoligize
conahan
kundun
séguéla
wintz
snuggles
peuvent
dahua
monopolising
bowcott
siemering
darier
ndadaye
tychsen
nmai
tgh
ravoux
tayto
maarouf
meji
frango
radicalizing
geechee
roughan
baglio
bothel
cherkasova
broadwick
linguini
buluan
neshin
ilim
kaii
budaj
lvh
righto
elstead
reunifying
battipaglia
piang
bachelorhood
tcho
americares
dobriansky
furoate
tewson
rocinha
zolotaryov
kreon
tokmak
chiotis
jacarandas
capitalia
funnelling
contorting
apor
tiplady
dejar
elean
pontyberem
zulqarnain
unwaveringly
mogale
priebe
yuanjiang
leoz
epistemologist
birnbach
barrowford
nonreactive
godding
bbu
kisel
qibao
grittiness
thernstrom
glooms
sorimachi
overthrust
natixis
cennen
squishing
waesche
ztl
soku
opos
zolli
khagrachari
taybeh
thiazides
asayish
karole
yudina
mumok
maculinea
wissa
teabagging
rambos
hilter
granov
dragun
azzolini
caddington
actully
bugiri
finbow
bechir
brentsville
weaverham
bahais
ccsvi
allans
craighall
koussevitsky
owenton
readmissions
vidaurre
pancetta
stantonbury
forner
shimshal
tuyet
fireeye
lautenschlager
ribero
blewbury
gache
malinauskas
sealyham
elat
boyen
dragger
hovercrafts
dziuba
recordholder
senff
defroster
rensing
rehires
scotlands
lahej
jcdecaux
demario
totipotent
regularisation
zineb
rachunek
yonath
kanakaredes
hysterectomies
monoprints
burgermeister
karanganyar
gatski
incidentals
westy
somatotropin
corsetry
maceoin
intravitreal
hultberg
hastiness
gergel
avik
zarei
allí
caskie
brimful
adva
verdian
lokendra
lawfirm
charreada
fasken
greuel
queux
paccione
fukasawa
hectors
preventers
xconomy
norsworthy
cliniques
wachtler
saiten
undulates
regaling
reemployment
eastmoor
reacquiring
calgon
yra
beneteau
vekselberg
vinification
wulan
earful
taci
berezutski
talkington
boao
garzone
musbury
iisd
fakhreddine
hadzic
faithlessness
cubley
flightaware
okd
gubar
ritchson
veneziani
echard
midtable
figglehorn
brout
wireman
fictitiously
hellerstein
untuned
isono
ribadeo
shahrani
zeig
deshazer
haxhiu
alette
larrison
detoxified
woodsia
lôn
cuchulain
issimo
sual
pagis
supercop
lics
ewaso
rmbs
stada
namaskar
nanomaterial
crowmarsh
efv
muath
aeolia
vilcek
ropley
loizou
kupferman
canutillo
espin
raggatt
thater
maydan
contas
csbs
deeyah
drls
doua
hince
britisher
postherpetic
wjrt
astatke
billis
bonghwa
cebula
benskin
reneé
elvidge
harleysville
inverdale
rythm
naohito
ewerthon
flyboy
levington
dumble
liapis
strausbaugh
mierda
snowplough
somera
autobiographically
regrows
ihome
disfigures
fruitcakes
eglish
rhain
potentiating
anodised
paleness
mabbutt
nickelodeons
classier
ferrán
diuca
morlet
wdbo
igdir
lipomas
sadbhavana
schelte
berdichevsky
aarya
schlitterbahn
kyre
adada
scimeca
middleby
makkum
leetspeak
legislates
elkus
tromso
craigton
listenings
azaan
champfleury
carcillo
parbandhak
goonetilleke
jpj
menelaos
hoteles
orfalea
backoffice
dissuasion
kouao
brunker
antiquarium
soskice
bryner
muscleman
lacework
gafcon
unviewable
amarone
terc
bechtle
stanbery
greenspaces
dibutyl
cubbies
kretzmann
wutaishan
panchagarh
kirkstead
telramund
criminelle
kolonel
kingshighway
lazareff
taiana
tranquilizing
rianne
bleakest
queenly
citrinin
viton
leeanne
teaff
kastellorizo
jamy
hampl
whyman
stema
genro
samaira
fillans
discotheques
illarramendi
nonfat
taillevent
keiffer
groins
firecat
larocco
beattock
dannielynn
loibl
decoux
pramac
nahmad
ballman
fardell
naias
roofe
gravey
supercharges
microglobulin
pundt
ffolkes
hecks
paralyzer
mysto
zizou
tiering
xindu
prusiner
felman
euthanizing
elrose
wallgren
sanitaria
salsify
yela
mcanespie
babolat
hydroxamic
shanly
yadel
facp
hamaoka
bytyqi
citril
ennoble
macniven
jingdong
portglenone
hijli
pemon
exminster
heideman
bandeau
touati
bedevil
hervieu
formigoni
holahan
keola
munstead
larrinaga
deports
dimmable
prei
enikő
shiralee
degang
crozon
lautenbach
lazor
toberman
wormed
altmire
mcguff
petascale
midwesterners
hlw
fabasoft
lacinia
klindt
vogtle
rht
swiderski
urwah
damonte
silverwork
melée
croo
circumambulate
cpos
trounson
douching
dettwiler
lesk
llanddona
garrets
semlin
ueland
badry
malaprop
metronomic
abour
keilberth
vempati
megatrend
buildwas
lushness
odean
pendergrast
kilroe
ceramicists
sevcik
bozos
telscombe
ladybank
socialtext
forwent
blasberg
saifur
maestas
ramcharan
tenter
cgap
zysman
dolphinariums
aécio
wpht
zayid
wld
gaymer
dende
mythologizing
calvia
beachcroft
galerias
stubborness
unflavored
occasionaly
coixet
erber
lindegaard
monex
brangäne
rogow
hvga
defragmenting
hawara
imprudently
dickering
correspondant
iguarán
sifnos
jafr
mccrorie
gakkel
reelz
muqam
lugoff
magoula
ranina
hbe
bollock
willhelm
yasunaga
roorda
libermann
spigel
topolsky
zizka
gyllenhammar
gushy
pinstriping
perou
syler
biondini
tricastin
venerables
behnisch
itchington
brindis
bewilder
contendere
survivorman
superamerica
dolorous
gromada
tabua
statcounter
roszkowski
coelius
bided
rivelino
wuping
buket
goreham
yimin
cuitláhuac
itma
davignon
clop
sossusvlei
hydrobiology
antoniades
maghaberry
garafola
fgg
margutta
daniyar
colnaghi
electrostimulation
guarnerius
fastlink
tabcorp
lowline
sccrc
ktvb
landberg
mcjunkin
kfda
aapc
autorickshaws
internists
landfilled
biospheres
wenyan
gaunts
keszler
luminarias
weisbach
drillship
earthiness
dubawi
mostostal
primeur
kwassa
shtokman
gabara
hobbling
feminazi
europeenne
drybrough
photostat
crynant
heffington
buratti
shivendra
barriscale
eurojust
rimpoche
neudecker
isaack
zyuzin
miremont
willox
revolutionizes
pacioretty
xuri
hiremath
esquibel
scoggin
esoft
totani
jiron
mnajdra
openwave
czerwinski
sterno
daping
verdelho
zbv
élémentaires
underperform
kfp
apocryphally
customizability
zagier
allmen
righini
kapusta
filippis
spidery
exotically
cincom
roseann
prepa
johjima
ganem
hoelscher
cyberbully
hoolihan
bewsher
hasland
mutaz
rezza
kincraig
desmopressin
populare
busso
ombersley
lenfest
folkboat
efo
umbach
kawazoe
segol
calderoni
yandarbiyev
enshrouded
decelerator
schr
hasc
faiyaz
gallaga
islandmagee
domanski
hansis
quietcomfort
rhatigan
rikon
montaut
pankisi
scheibel
dirtied
comeaux
brender
antiterrorist
iacobescu
brummitt
gosier
chrissakes
dauterive
strontian
rugao
lightkeepers
bording
niaga
kujawa
danelle
givan
eastford
alevtina
jacy
steinbrecher
schimmelpfennig
unchosen
odl
rayvon
chernukhin
nicc
nicolaisen
retune
apuan
makeing
innodb
soulseek
possibily
fonio
sláma
pliosaur
jaran
razvi
filleul
janot
madliena
desalinate
kaiserhof
donyo
theoharis
doorpost
spirea
tousled
kondaveeti
estor
shalane
silka
neurodevelopment
pogs
causalities
geolog
tanenhaus
harkened
carshare
mccutchan
angelinos
nettlebed
painchaud
biesenbach
permadi
tregoning
definative
kuerti
guidugli
memorability
coverlets
fairgoers
molodaya
shieh
averre
explaning
hoebee
verbalization
riether
stranczek
,if
liudmyla
zharkov
scurria
jakeman
monastiraki
luber
hachama
aioli
odditorium
fucino
hayt
illumine
nagui
dtcc
hikind
alights
pynn
boeckmann
vitrolles
recessionary
perich
tooro
barrowland
koshelev
carmyllie
pigmeat
barrus
shoesource
photog
domanico
decoratifs
brorson
electrolytically
ncmec
mostafavi
appreciatively
stetler
luxottica
reverbs
gougeon
indecisively
eliopoulos
wholey
chihi
cogsworth
eldard
loughinisland
bulkers
allemann
sundermann
chemtrail
hardelot
footboard
masasi
shippy
unusualness
sincan
justness
swazis
guará
mukhtaran
withold
antiperspirant
kuci
boudhanath
talke
ballinascreen
dalhausser
kouwenhoven
okorie
balcomb
nishantha
sleddog
sebokeng
muzzling
wombling
hively
kurbaan
hoegaarden
bollox
sompo
dichen
proprieties
garnica
asph
zwanziger
scheper
nanotyrannus
zeltner
winnecke
fardre
rosello
pointner
rhinopithecus
thabeet
greste
yunfei
montier
abramova
xinjian
bischofberger
puttees
nikky
partings
inventorying
baxa
powderpuff
brouillet
belaboring
basted
jotting
corston
claytons
qipao
hirelings
samey
glenconner
chortens
labiaplasty
definitionally
billett
rezendes
gambut
chinmay
blady
beiler
giddish
jadot
meridith
whetted
junkyards
ghengis
bureij
tavernas
jongmyo
carrasquilla
wvon
buttonholes
lammtarra
twardy
dracunculiasis
spliting
boneshaker
unattributable
labral
epitomes
marcedes
kidrobot
nomiya
spiritan
bicchieri
packman
saigol
ossos
mehari
kochel
barbash
hotcakes
liscannor
tasselled
zankel
sillerman
burgum
sarasola
abortionists
decorus
kesen
staveren
jazayeri
vhc
artprize
corver
calibri
chignell
mbacké
dartfish
punny
bradie
palese
easterlies
pookutty
samworth
palming
quietist
hagit
wonogiri
multiverses
marrons
wrongfulness
nyons
puenzo
barlby
aleksandrowicz
ideacentre
umbarger
stimmung
efstathios
cinevegas
sertao
mumiy
icmec
brynjolfsson
rehabbed
salsman
appellative
okaz
feagin
gildor
wmb
fezziwig
infomedia
shurta
leleu
argenziano
kestutis
cupitt
recored
leamon
popik
dedlock
sotin
riblet
duckenfield
fuemana
coaley
pothas
vamc
petrof
osthoff
siwi
liqun
pranay
rooijen
zixi
freegard
bouda
turbulences
cambrils
philpotts
lewkowicz
habia
dziedzic
dongtan
langelinie
wailea
ailuropoda
looseleaf
reconnoitring
cabergoline
heidar
afet
railcards
svitek
hildenborough
meerman
actuates
nemitz
ingrooves
abermule
kasuya
decheng
repointed
avet
coddenham
mediawatch
trabucco
aapl
silbar
kollege
mofongo
klina
straightest
woan
godtube
sigmoidoscopy
promotors
pepto
pollinger
merzenich
sucher
charboneau
guoping
whitecliff
zingers
eaec
timbavati
morbidities
sharee
hasaan
arico
kaiko
ibragimova
documentum
sophism
medicexchange
ectoplasmic
bunkered
backbeats
casalino
shiau
olr
footsoldier
thetimes
chaiman
lagomorph
kotite
emong
stewartby
oligopolistic
hawpe
recordists
dalil
parabens
dyken
togian
ahavat
wathan
overide
hods
abdulnabi
sevengill
taghmaoui
mpika
vors
hdri
horcoff
drms
mignano
helicoptered
cavataio
embalmers
pestano
dcma
peruanos
hkma
dbo
silkscreens
kolosov
chiaureli
breadbox
conexant
cockburnspath
pillages
besancon
anshul
zebari
sfbc
ditsy
multipla
bruneval
ohri
gianlorenzo
aokigahara
clutz
odhar
lionhearts
canete
drex
slicers
tarmey
perachora
macia
grados
lvb
jianwei
auchterlonie
btq
gaped
farebrother
terminable
daufuskie
hbot
saltern
sonnleitner
diselenide
salei
wanya
orlofsky
ableman
vanney
saccani
ranomi
agaist
pennybacker
sidedly
perfer
safadi
sayeth
anhembi
monina
mignolet
dabeer
yusha
skalka
herodium
hecox
antipathetic
doore
uvi
chadlington
cornishmen
robustelli
hartzog
hiers
ooijer
oilsands
inarajan
dismembers
vlok
pelourinho
tonkolili
turkson
karamarko
jarc
probot
servicemaster
toumi
lesinski
athill
grodzinski
prosen
orapa
dotzler
musu
demimonde
wetv
rivalrous
aquaplaning
cullins
goldies
brennecke
gisin
selvey
berani
microdermabrasion
snuffles
zeevi
cozzolino
sweetbay
sukiya
mceleney
townswomen
hambrook
drano
racs
donigan
taravella
whyatt
panych
bhavans
rapturously
murtada
nvt
inexpedient
zakiya
delisi
uncorrectable
spillovers
kembo
kalms
meldrew
lynher
rosenfelt
dhavernas
globemasters
kiffmeyer
haraz
katama
coccinelle
garing
hammarberg
gobowen
arbizu
sallanches
zuhra
aschau
plagiarizes
benzylpiperazine
hido
selya
subretinal
zayda
zumbach
delagarza
rowswell
batelco
pandor
idealizations
indigenisation
kng
vigano
natpe
dodford
lotty
passent
nelis
abrol
pussyfooting
blidworth
uhrig
raili
kilonzo
bookending
whisenant
retter
gliha
hackert
decompensation
pasionaria
khaddam
laveranues
damnedest
tractebel
aprea
waskom
colorfulness
massen
deia
blippy
redmonds
mazzie
gayot
yevloyev
sludgy
harrises
ashna
saborio
dewdrops
dovolani
taligent
balie
lakmal
deads
charlus
careaga
gerrity
nedlloyd
fivefingers
cinci
onancock
bruegger
neveh
jettisons
scheinfeld
detectorist
xtrac
reming
zirbes
disfranchise
branigin
cynog
quindlen
voorheesville
hakansson
ratatosk
raws
tebogo
marrowbone
devora
nhleko
butalia
sidoli
fache
baptisia
mawas
torain
phillipp
dkt
hanjiang
blushed
suassuna
collaterally
rayno
bilharzia
delbos
buhai
cañaveral
matlow
nasab
faddish
stultifying
llangennech
ohhhh
kniphofia
trudged
yeste
saza
burani
budded
deap
genelle
benwick
phaedon
chiew
gawr
creigh
shurin
korydallos
valie
alite
kylian
borje
khobi
gorres
starchaser
vasik
jubilantly
negba
acheivement
mareen
urvan
gozzo
bureacrats
herrema
thake
atls
cawthron
mongiardo
croakers
moshtarak
osteopetrosis
quesos
capenhurst
pithoi
callidus
ogwr
groucutt
skynews
darcie
kassell
stuhlinger
zurga
ohsawa
ippa
heybeliada
schönemann
shepherdesses
aldunate
gnw
cringeworthy
reticulocytes
heggestad
asenov
quizzical
calfee
impermeability
manjoo
strugar
bagster
repointing
gastronomie
mazarakis
dongyue
ccla
apga
pueyo
joydeep
unbaked
eleventy
minsterley
ossama
castlerigg
skirrow
katsaris
codesharing
clynnog
barkey
natalka
hamidreza
mesud
pedros
trencin
bacari
bazzano
wornham
jackowski
fasher
burkhead
zhongchen
chignon
magistro
laundromats
tracfone
mcpeek
daunorubicin
caipirinha
warshak
itca
poutiainen
mirvac
reall
matakana
manoeuvrings
gite
astrantia
whipsaw
meows
doobies
silus
chaudron
hemin
inopinatus
cerie
angove
chiong
reata
astrophotographer
stibbe
misters
erlichman
happenned
sprake
screamadelica
vivus
simum
wittenborn
barcham
tikis
wellsprings
resprout
typer
ceroc
onomatopoetic
bobonaro
selmi
acclaiming
bruggink
ilegal
bertaux
modolo
huskins
bugbrooke
sludges
gomory
ilsfeld
orbin
bonci
bolinger
diaoyutai
steinauer
turbotax
regulary
ghilas
esps
kolja
tittensor
singo
nelon
amatrice
sipan
motherf
mvb
kelan
salsano
mcvean
dissapeared
bruerne
milbert
comanchero
suton
keela
oncologic
rapidio
ardeth
régua
libdems
anahi
ualr
trumans
genereux
ossawa
reinheitsgebot
menosky
esthetically
interpretational
herrenchiemsee
sunbathers
banditos
ottl
chuukese
decapitations
jaksa
redenomination
paulaner
sempervivum
natalicio
procrustean
faneca
hahoe
kumis
follwing
janell
elijo
rossoblu
ceridian
weik
nael
wolfensberger
hosseinpour
kaddouri
akhvlediani
rhossili
zeyar
excatly
apda
nimisha
fragasso
strensham
aerialists
clashfern
pasturelands
miazga
wending
chettiars
hohlbein
thormanby
bisenzio
postminimalism
espirit
dorge
impetuously
dulls
favara
dauman
contrabands
xiaobing
michod
outshining
incongruence
campell
martials
beckmesser
adagietto
buellton
stazzema
partovi
hodding
bupkis
prosimians
guaifenesin
kaarst
insound
vahidi
finocchio
cocca
yahrzeit
humdinger
bioelectric
stratasys
scumbags
llandybie
peasedown
binter
adroitness
belcore
puchner
jiechi
gapper
kandha
albaladejo
cgis
mozeliak
bagosora
tollet
wachler
cliviger
bashara
cnil
uplisted
circumstantially
leefe
mickeys
coppices
xsp
corniced
iema
cfbt
advisedly
pepple
hypocracy
riesenberg
yodh
landrigan
parmly
sideswiped
safiyya
berléand
lmsr
feijoo
ecotricity
angiopathy
lisps
pandered
wli
hollesley
leafleting
labuschagne
mutum
stangel
kosofsky
skycargo
elika
prolife
trinculo
sarig
susse
felicite
elts
riadh
frecheville
chlorofluorocarbon
peligroso
derrett
anvari
barzanji
onkyo
muarem
symphonist
djohar
oesterreichische
itamaraty
umpc
shinmun
mockumentaries
mihailova
veejay
brint
dzongs
kanat
epiphania
zahner
agx
soundexchange
dozed
saillant
nihr
milanovic
polyarthritis
abdool
aarc
nerved
spansion
kapral
mcgavick
monongah
luderitz
russkie
mabi
marxer
murrelets
conjecturing
rwt
tacker
arimura
kindof
tibro
prac
arcan
spodumene
cherrypick
huaqing
pushpak
redmann
baldino
druthers
davises
pyrethrin
bohne
uncf
itaipú
gingy
secularizing
conversationally
daddi
elodia
proser
kilger
popularizes
softswitch
emec
hyperbolically
manouche
dunum
bustamente
meadowood
pickaninny
creamers
yabo
flechsig
azodi
muumuu
pereulok
lothe
varujan
eshan
kaupas
socia
worman
ludeman
bahad
atlantans
lenoble
morari
proselytise
nograles
chalkboards
lindmark
aimco
hagas
hoved
leggott
aswany
pleon
herpin
krulik
tindill
trillick
salda
tauren
wastefully
butser
passard
abdominoplasty
inacceptable
pelagornithidae
naea
echolocating
tucanes
crawly
safavian
meifod
agropecuaria
empts
mrdja
bilbrook
saydam
cmon
imn
frenchified
carens
crossness
misnaming
rayas
skeie
saeeda
bergessio
bersin
glycated
visconte
coastland
watermans
blumenfield
muzammil
felecia
margeret
silodor
eyeshadow
sawadogo
beseeched
chipeta
deever
fernleigh
mbes
universalistic
alvo
andron
saltcedar
skimp
pederasts
kanaly
scotford
alexion
skv
khavari
kingsbarns
physioc
dorsoduro
lisnagarvey
mindspring
correnti
ninevah
fimian
steffie
wyken
innateness
strangio
ditech
antell
tumtum
parren
shoumatoff
noncredit
canyoneering
barsetshire
jaunes
rhomboids
prioritises
tribhuwan
mcconnelsville
blithering
rübenberge
geise
clerkin
vulgaria
pertemps
playsforsure
medalia
bankshares
amagiri
dummerston
fusha
manesh
cierre
sylbert
swearer
eskay
marolt
salahis
troch
petrouchka
icod
marsman
navickas
lynns
endi
galotti
barrys
hakes
scheindlin
zubi
ruegg
mairs
duffett
nexia
plazuela
moena
tuatapere
baluard
nilay
arfield
eadgyth
welbourne
zent
guzy
powar
furchgott
raulston
abudu
imraan
majano
keisel
bosavi
ehnes
electrocutions
rebiya
garriock
labaree
ossificans
pavis
savills
arboricultural
unbent
selley
ataxias
fulminating
ankleshwar
tylosaurus
lober
busteed
degus
littlehales
iddings
alameddine
zunes
durably
dealbreakers
famen
tomago
deferrals
vossler
uwins
faku
tudge
mannone
pardoel
quepasa
mcquiston
allegria
haaken
ethambutol
mâconnais
cogdell
signficantly
overexpress
parranda
rakhshan
timestep
remediating
knifing
seidenfaden
sanidad
mcwaters
jerrells
rousselet
mvnos
backroad
dufton
bndes
laax
skyworks
torreblanca
jerrel
loganberry
nubs
cullison
femaleness
unmc
buchdahl
djamal
lecterns
thinkpads
solidarnost
luddism
bvk
rostill
francies
evaldo
mozyakin
tracz
xiushui
rachal
arsc
melioidosis
sammartini
rosmersholm
memet
huanta
denominate
prestage
driburg
hawked
satura
nouvelliste
seksu
alltech
aanholt
pasqualoni
pontbriand
prober
westcote
homecourt
megarry
isabelita
subianto
busloads
kutsch
sapeurs
ohsweken
shaniqua
tutting
flowrider
stefanyshyn
brauch
munisteri
kordan
pichola
aksenov
binstead
wiil
bielenberg
ekeblad
surfwear
meulensteen
hbj
dahuk
veney
jalonen
beringei
muscardinus
knla
sherring
nadzeya
kemnay
sabb
whooshing
zenone
jingming
kellard
scic
propably
serigne
plimmer
kalenna
merrigan
carnyx
belman
carvoeiro
jantsch
tanztheater
bedfords
yakka
interpal
splats
ballclubs
ajdukiewicz
kutler
reticule
suto
personalty
trockel
wondrously
ferriol
nusserbayev
mandarich
huggetts
haggs
riotously
ysaye
kapsch
enticements
overriden
sloganeering
schimmer
momia
spleens
zhiwen
bunkhouses
ziprasidone
mspb
flagey
zanin
ysanne
mellisa
sabden
drakoulias
quikscat
qasemi
ablyazov
quintano
availabe
automotives
dayers
garbageman
vendaval
ranched
shafiei
prsc
thottam
roag
medhat
shevket
serrell
astete
crystle
iprs
messiter
assitance
herlovsen
shipbroker
renaultsport
koteswara
schanzer
lapize
quintuplet
hidding
biello
affraid
numis
univocal
nonono
cryptomnesia
vitet
hothfield
yanov
rayle
sissako
rrw
bilandic
coyoacan
stonier
dundrennan
oltra
moun
eliecer
conz
outmigration
powerlink
linzy
peddars
mastersingers
pankiewicz
hinck
couñago
firstrand
unsteadiness
gegner
lumbered
searchs
zanders
corvairs
knost
dermatopathology
dunmail
levs
branquinho
formulator
decriminalise
bandstands
indivdual
orthophosphate
hartsop
aggrolites
mahen
kingshurst
karmapas
labas
llanfairfechan
porterbrook
mannville
kevans
kerfluffle
mutara
brynteg
loutro
hardial
nonrecourse
iafrica
wims
adduces
clusaz
hrdy
deniau
mayfest
steinhäuser
ringold
goonewardena
rucha
mashes
nikiya
timme
delanco
etron
faussett
mikros
poupaud
gontard
firoza
jagatjit
ferrare
folksingers
wikramanayake
humax
najman
rasmi
demineralized
merkl
rebars
cornflowers
pitroipa
hotheads
fkk
ikebe
heartful
bohem
joanou
segontium
ianniello
invisalign
valjak
epigonus
erwood
kurara
motaung
decribed
sinkin
cous
campanario
brumwell
whitsitt
skavsta
masontown
maharashtrians
hotchkis
mirax
boggan
chamisa
politie
bryggman
gudmundur
fraher
disingenious
swainston
irureta
pavier
burias
howcast
minver
czapla
hogrefe
autonomo
kurson
bridgland
breacher
charfield
cargile
sexby
meltwaters
vignal
adriamycin
deselection
sanjin
ticketcity
whittingdale
sovremenny
sensa
circumlocutions
bcms
jiuhua
nielsens
yurij
greeny
cygnets
gelsey
improbabilities
faithfuls
mesquida
roggeveen
luthe
lorey
gome
attune
fole
ymax
smoosh
robocall
cantelon
dulcet
whedonesque
jwoww
faubel
hempton
peppler
ruwe
dichio
tookes
enamora
pâtissier
huthwaite
howle
empanelled
clearasil
stanescu
moverman
ajani
wahconah
annisquam
zonca
montjuic
miltown
dakers
transmissibility
sanmen
trimbach
mcdo
affluenza
falola
bondsteel
abertawe
consigli
springthorpe
cartmill
tomiichi
ankawa
raghuraj
buangkok
plaxo
deadwater
izady
companionate
depe
bancos
sidell
mandroid
candian
joleon
kaurismaki
cityzen
robar
tearfund
humbles
guotai
guanaja
acquafresca
ranty
saude
scruffs
kraenzlein
vaughters
eginton
almanzora
bridgemaster
sandercock
schebler
leinen
tehre
segale
rosebowl
plastination
alkylate
confinements
hasell
factionalized
saah
pumpe
ajaj
miconazole
magre
choclo
sigfusson
drusen
intertie
ordish
guodian
binyam
biglerville
treichel
eaglestone
pyy
shoebat
omrani
nordegren
vauclair
horrifyingly
kleiss
kronenbourg
kesner
husing
unsheltered
gospocentric
kumyks
wriston
tryptich
tamaro
sarwate
nellist
disquieted
minimo
plockton
lucado
immunodeficient
afari
reallocating
tootal
montefusco
umbilicals
pantalon
nakas
tidey
widdle
managerialism
splashin
assails
iannetta
aaco
meny
larded
zurer
prepayments
lloro
gamecity
partygaming
doffing
pavé
puppetmasters
energias
hafizi
jarren
disembowelled
kammerphilharmonie
rejoinders
lowthorpe
sarb
inss
eucryphia
guarentee
devinder
artola
ichimaru
tokonoma
stobi
montelimar
tongil
bioprospecting
planaria
jodee
judum
tatem
ofwat
duberstein
satinwood
coxheath
burghard
bridgegate
grasberg
murlough
rareness
wrth
babor
overreactions
sekkei
coiffeur
duboscq
xinlong
bingos
glug
schamberg
kuitunen
montanans
brechner
whiffen
cutshall
wtkk
galp
nadji
bufalini
girardelli
alaris
polu
bergelin
hollenberg
chawan
secy
sarazin
nasruddin
competely
steamrolling
bourns
noninterference
carnall
lepre
tokunbo
workover
clophill
nkem
betteshanger
daneman
koobi
vedernikov
atkinsons
revin
shrager
isbin
comapny
putinism
stawamus
concertinas
stateswoman
murderously
kandol
breydon
rittenberg
coylton
micex
rutina
malebo
hourihan
cheapening
baychester
shabazi
feltsman
sieghart
allocators
bicyclette
makeout
pcyc
kotido
piggie
siguiri
larrington
nessesary
veloster
weaponless
thorpeness
nilekani
jonelle
acxiom
uchenna
presper
purifications
underplay
dogmaels
semestre
mcleroy
masisi
enten
mccline
maricarmen
kilogramme
ryneveld
jermichael
velappan
courteline
observants
unnerves
grindr
kwtx
spiddal
shaalan
jamiel
altruists
alexeyeva
holda
afleet
transcribers
undrained
dahon
sandqvist
midwesterner
preparator
yermakov
fragniere
kwv
brahmbhatt
ifop
afkhami
bradshaws
trbovich
arcahaie
disassembles
duquoin
nigeriens
dillsburg
dentmon
lewman
elima
hovik
chawki
halstow
tellechea
arvell
bergmark
slurping
cheerless
hernandes
lly
zhuji
gdv
rajasa
buzhardt
haemophiliac
myrl
afic
yusup
dabbler
vry
sprezzatura
riels
wooburn
ngx
reregistered
muckler
inflects
cedella
ribaldry
backshall
unfussy
grosberg
striefsky
swabbing
kecman
kolesar
phuntsog
humidities
janish
kubacki
ruaridh
soms
bohlmann
phuentsholing
bradham
grosbard
musli
kason
papalia
friske
alabamians
iafrika
stifford
laicization
breughel
stampalia
elektrotoer
kraftwerke
cetiosauriscus
luctus
manzel
beyong
floodwalls
stejskal
prochazka
telefono
jamaah
hijabs
holgado
chargebacks
maxes
nanya
griffi
chais
undergrounds
mormando
suwat
hütz
unlatched
splicers
anogenital
disorientating
lurz
coloradan
ffos
excluder
gmes
aahe
mortuaries
picassos
synthesises
gll
fargate
grimness
onchocerca
trollery
rivercrest
smithdown
slops
ospi
sackman
aviance
liberalist
jiggly
nomoto
ganju
cruzeiros
yeares
afci
briody
daneyko
noyo
kurnitz
bramlet
gabion
dörrie
skorpios
cybershot
padura
richhill
hiraethog
klrt
churandy
voluptuousness
ulcombe
dirs
honeydripper
yahiya
bkw
wharfage
plantel
abased
slusher
greenheart
knowstone
risp
champignon
kneehigh
belstead
mcbratney
mava
reoccurrence
bancassurance
breds
shibboleths
ransacks
essendine
maxwelltown
hurlyburly
ahoghill
wowser
panjandrum
gvc
dialectologist
jurançon
weihnachtsmarkt
werkheiser
declaiming
vyntra
carbidopa
zhaohui
sitko
rockenfeller
firestation
palaeoclimatology
glop
futz
llw
tranexamic
misguiding
schlomo
shashikant
boediono
budiarto
depaula
flaine
elmbrook
nicholaw
socializes
jtp
maximalists
gilfoyle
hagwon
kanafani
zicari
tabletops
accosting
undisputably
qazigund
rafidain
ecns
aughts
maraldi
exenatide
fumé
sublimating
empi
stanlake
pether
sportz
youla
overfull
wamalwa
microtia
ringdahl
degress
jinyuan
oxenholme
caahep
ihedigbo
abps
tailgates
tbsp
amaka
linderhof
iuliano
jeongeup
sarraf
concelebrated
socios
bardell
irakly
vinik
florenceville
wedren
gigatonnes
windowsills
equidistance
drohan
pecho
nooyi
walcutt
saitta
crueler
stenholm
bifocals
granose
geometrics
balles
nebuad
mcrib
terabit
grunwick
storlien
cheongdo
obermann
isolations
ziaul
princi
fesser
bartosik
ageas
bristols
penygroes
bentleyville
swingtown
elong
fornham
kurlansky
stigmatisation
deià
perricone
ordell
klingenstein
hechter
tiguan
brouard
lauralee
grobet
sharah
everolimus
showmance
agrosciences
iufro
valliere
henenlotter
costley
kasser
ivg
spiderwort
playgoers
icefjord
blackboy
bewl
farges
ecopetrol
odessey
narveson
steinhof
noncitizen
kirchin
unutterable
dellys
sanglier
zoete
underlaid
thornhaugh
scitex
stolfi
delcambre
sstp
smisek
korber
insulins
feghali
swindall
westroads
shakman
grandmas
kenoyer
bristo
stongly
caoimhe
danos
feola
saggy
croyden
bankatlantic
ngongo
mchinji
possibilites
meditech
benesh
bukharov
kittipong
unmercifully
nemov
griles
neurotropic
anway
hayriye
homeaway
mailhot
outweight
fiander
springers
inniscarra
marthaler
gholamhossein
fchv
schimel
millibar
finelli
bruney
graziosi
speechly
yevkurov
urol
schedl
allaying
aguecheek
npia
zadroga
fujiya
jerrica
doublewide
ronal
rectally
beels
cridland
gladis
sangji
chewie
llangadog
unsettles
burkei
gawronski
tomboys
gailani
reganbooks
pothead
multisyllabic
ragbrai
lassic
farvardin
aronow
filé
unnerve
piskorski
cheika
delvalle
vesterbrogade
behdad
sibenik
daney
sangiran
hazlehead
wargnier
nasserite
carfrae
poofs
twofish
kimche
nowness
mannon
pleasers
fournel
fary
kaiserwald
basem
lopat
hcci
sentinelle
gabbi
latheef
kvbc
buckyball
schoomaker
seniormost
rosenboom
nacm
piszczek
uhaa
duflot
ijichi
ibori
macsharry
hartcher
coggin
retweets
arrr
thassos
waialae
pinehearst
eliad
spigelman
lisc
entrecôte
pigging
peelle
huaxi
tvland
dupontel
hensol
jawbones
lyminge
russkoe
scnt
jodhpurs
oscon
capannelle
pangram
farted
kthv
pipsqueak
cornelison
konono
carkner
elegar
giac
cumberlege
oehha
cober
momeni
suyama
entrepreneurialism
conservationism
enrichments
hcps
constructeurs
lilliputians
aselton
aith
rogner
blepharoplasty
mamoon
szukalski
uvira
allders
policys
superintelligent
craviotto
nicelli
burfitt
sechler
situationism
dooney
juvonen
nakorn
vujic
internments
meeson
tipplers
kosslyn
kamana
jncc
forecourts
suburbanite
kosdaq
nfib
trefilov
unknotting
sonan
mazagan
jujamcyn
tumescent
imiquimod
tabberer
rfps
galey
meshkini
ropey
nipp
mompati
colossally
tricolori
cremating
kazmunaygas
hoppenot
nebb
snapback
presenteeism
erectors
righties
monegros
shamas
brachet
unfastened
begelman
renou
hullavington
gumulya
uchino
superfecta
meihua
baneful
ladha
dorasan
minuted
eurorap
latton
montminy
clodoaldo
oshu
dieing
lahaie
gamey
amoss
bugel
neuters
lambrook
fungo
kabasele
unfurls
saunter
wingreen
playgroups
catalogers
lacina
pieterson
villechaize
rangemaster
wrox
hoiby
corren
lightheartedly
francisella
looksmart
gaglardi
fukang
portaventura
quintron
amre
cmrs
wyda
configurability
vinai
inseparability
indicom
shawish
lipofuscin
aneurysmal
biocatalysis
shalders
friendfeed
griddles
bennets
gorgui
zednik
steinfeldt
dums
tianma
kalw
ouspenskaya
justiciability
zwilich
lattakia
jarringly
crotchets
eaglin
unpasteurised
heinecke
doomadgee
dahabshiil
cuckney
clendening
solorio
manchev
chokling
strangeloves
superbia
superfans
cubillo
pasatieri
gioni
balice
bedspread
spiritedness
pifer
joling
turchini
jakosky
amrc
harwick
unadventurous
relex
yarlagadda
singalongs
pongala
annica
lessingham
kedrova
fiar
siepmann
seré
zicam
bryceland
baym
muneeb
differant
iapp
macaroons
adventurousness
grathwohl
lidington
qurans
mackesy
valuble
wunderle
ballgames
sugarcoat
stalham
secunia
jaiprakash
crossbeams
choristes
ponniah
paranoids
hibaldstow
belloumi
travelport
mdpv
shaklee
inflammatories
fifteens
chingari
dlbcl
probabilists
nicey
daggerboard
towboats
rimell
nocentini
qiagen
lare
wasmund
desquamation
buker
runts
frankweiler
karry
icex
kozmic
oxaliplatin
starkness
porthill
senba
adeni
pearblossom
gauvain
mehrzad
weetwood
desideratum
poett
atayev
tiera
naudet
eut
ofdma
huaibei
hydrocracking
gezhouba
runco
brimer
tengboche
andretta
heilbroner
dinks
danyal
carmazzi
vald
tesseracts
tallin
foetidissima
ballbreaker
mavridis
basinas
wehn
hibbitt
millerstown
andic
laino
enlargers
kalashian
yonglin
cerveny
kuller
hefley
dunckel
haxthausen
anthropomorphised
barsamian
berthod
restauranteur
indosuez
laperriere
goulard
bassplayer
hideaways
lijjat
guidewire
fauld
satyapal
jouet
manikins
smallscale
janakiraman
aiyer
hailer
salmeron
everlong
beatmaker
akerson
mabbott
alphie
brobdingnagian
parres
stoneridge
reconceived
ooooo
kinderdijk
cevahir
embroil
injuria
syko
bankoff
hasher
verderers
socastee
barewa
kilmessan
grovely
gissar
zylka
carrols
gulkana
ibai
beineix
cairene
directtv
bittel
tomake
seino
tytell
scalabrine
grotzinger
ecps
barefaced
debevec
tiare
bearhug
camelid
alium
vasiljkovic
sassan
beguinage
uliastai
hexachlorocyclohexane
stantz
allées
wieseltier
saccharides
chopp
rajendranath
enteroviruses
unitive
ciji
simlar
krack
nessen
bushwhacked
cbgbs
maimi
encumber
neoadjuvant
caddied
asep
mirit
gunnels
rzd
demoustier
wyff
macneacail
rezaian
obala
calamansi
dahlkemper
cdz
zieler
singeing
biotechnical
spio
wickwar
osbornes
timana
lightings
benney
moriconi
kunder
interruptible
stroock
dinorwig
millien
mows
pifa
fritillaries
limani
godchildren
piker
bankfield
bhojraj
bpcl
nyh
rodolpho
heary
babeu
zimring
rippert
buchina
capellini
tavare
masoumeh
shabwa
thangai
nelia
cfia
goscote
menorrhagia
mcleavy
racegoers
sheilds
spyders
sijia
ramush
insufferably
harinordoquy
hesp
imbricated
biet
youthquake
mouw
fariha
retractors
wickard
obss
dexion
kedwell
bibik
underbite
fursenko
ecotourist
aufgrund
erdan
semiha
volvos
antilock
timorous
mayobridge
stalnaker
audigier
barrueco
tollis
zolotukhin
yovani
fosi
kokubo
lafc
waterlilies
fettercairn
ornamenting
henkes
salmoni
starcatcher
cardioprotective
masloff
enoc
blucas
thomasin
arvest
mundhra
buachaille
enic
shewchuk
shortenings
chafes
dulais
kyivstar
morticians
immi
yauheni
danter
blondet
sahan
addazio
ranya
commensurately
cyclothymia
reponses
moennig
panagopoulos
leason
pommiers
oprichniki
cfmi
jossie
guit
alfonseca
centralian
ousman
serms
kamares
epigallocatechin
jingchu
bygrave
strangelets
tumlinson
leiweke
rihards
civicus
valbonne
seacraft
vaage
nagarjun
mhsa
languor
lhcb
rosborough
slh
nasra
lagergren
bassham
energising
lollypop
kxmb
motiur
merbau
muchness
tizzard
gioiosa
phull
muktinath
wndu
munhoz
thessalonika
adorni
aiim
pranger
krupin
trislander
jeckyll
anslow
muuga
etios
mccarthey
azzawi
gragnano
demeyer
oquawka
forkhill
peelers
jiaxiang
bintley
llanas
fernhurst
fajitas
dalene
heraud
boysie
condrieu
feasterville
skenderija
iedc
aliou
calonge
buckeystown
invoiced
tosta
stockists
caixaforum
suaram
andrist
vondelpark
distington
lovettsville
baculovirus
levinstein
graden
rasps
chrysalids
playmen
plekanec
somei
eclat
brinkburn
browell
suheil
mó
goldenvoice
ofn
derris
hajizadeh
legerdemain
georgei
umbrians
woodmancote
onw
parameterizations
alerus
gagnan
orlanda
cramb
narey
hydrolysate
honks
goradia
playlet
mozilo
xover
bonos
boconnoc
mezzocorona
mclardy
indentify
farideh
hunsbury
capco
bahrein
hayatabad
suicidology
undersecretariat
hitchhikes
aeriel
illanes
alogoskoufis
reindel
steenbeck
taim
levegh
chandeleur
essentiality
blepharospasm
mahender
corduner
okeford
kachu
charcoals
clockface
güttler
globokar
realisable
shavir
stiffens
allbrook
degeorge
nissenbaum
rectorial
ruskie
hcfcs
nyckel
disconfirmation
manelli
spidering
guignols
poloni
vongerichten
deschner
crivello
tiffanie
subsecretary
creason
panadol
farahi
waterboarded
extreamly
caerwys
crawlies
carpools
dulzura
sinz
pilrig
absoluteness
bhaya
krap
worksites
andriani
ilani
eculizumab
cincotta
prevelant
fiancées
greentrax
shotz
volpaia
wheldrake
certifiably
benignly
foxbusiness
mahay
moussaka
burnfoot
sakichi
absentmindedly
bowfinger
fehring
yantic
serpe
montecristi
benkert
luing
ripps
farrells
pspv
sprod
threesomes
hillfields
centella
birdbath
davita
mcing
picower
schwarcz
kocha
crigler
manassero
proximities
gruia
stradley
catweazle
didomenico
bendelow
edano
secour
starfruit
leavens
widden
killik
finessed
bke
duffs
boulden
hanegbi
uhlenhorst
diekman
jbic
maaco
unhealed
serap
imin
krm
haberl
microgrid
navaira
pusc
radakovich
hermance
zarouni
allai
kosse
boim
buttner
calipso
ventotene
shoffner
claverley
hoffenberg
rosane
sabetha
nilin
akale
floridi
zamecnik
füle
hallum
monter
liebreich
principlist
tumanov
barten
dobol
worfield
leafa
natela
greenwoods
hgcdte
zhaoyuan
langlie
norcom
smmc
mbugua
priggish
roshon
gmap
peperami
odlyzko
dowser
shaddock
polycarpe
obnoxiousness
prodrome
bracker
niemczyk
creuddyn
showbands
lewry
labranche
verschaffelt
cappuccio
hintsa
lamely
nolberto
mulhearn
bernaldo
sailesh
atayde
ortved
stubblebine
afeni
hyotei
sourse
mccorley
timpanists
wholescale
townscapes
tuffley
congruity
jigsaws
ballz
understaffing
mouthguard
maclaverty
atras
shella
reconfirming
nyts
mithai
mckeel
sapio
klauser
crystallographica
cycos
romeril
chanonry
tarasco
ctenophore
cozmo
nativo
sonestown
felinfoel
guttentag
lockheart
kbi
horseferry
wallia
coalición
noser
takamizawa
pheidippides
villoldo
omata
workshopping
daycares
plumm
dealin
doretta
upthread
refueler
joppatowne
klabin
hhg
sprats
bussink
rifu
triesman
aicar
desmosedici
wuld
teks
shomi
pzu
boundy
unlikeliness
wildavsky
zares
poppinga
depa
springboarding
bubblebath
budock
stevi
eitb
flots
medion
akamas
gigatons
kokkino
mimick
femtoseconds
lenel
alima
gonta
maleo
smolders
scrabster
baldeo
nullam
libary
ener
sebek
marginalizes
paulsgrove
computeractive
foglietta
shrna
srour
cabaletta
bance
josemaria
zennström
delich
sunshades
waunakee
jilts
pletikosa
bioresources
unchallengeable
zyzzyva
europeanisation
blanchar
roberty
ellef
hamdullah
macroeconomist
momos
pinkwater
livernois
creasing
htis
wxii
mcpike
biagiotti
eurex
yeses
guyanas
mentorn
encinos
weyant
jindong
jeudy
guihua
neuburger
kambi
outboxed
korch
,they
pastner
rochina
retiral
akris
halophytes
dilston
gildersome
mouthfuls
bkb
netti
yoxford
sunart
annother
bacot
kallos
acetosa
avastin
serbin
selworthy
zinberg
hoxby
hornbrook
pendley
julani
palauans
cwn
virgilijus
venetiaan
bergtraum
kaaki
swallen
pneumococcus
adron
schneeman
bullmoose
arces
bonna
northfork
peaces
josefson
kunigunda
buñol
demchenko
cyncoed
nubi
anupong
churchwide
aratama
gulia
chelsfield
blackstuff
cuéntame
sansonetti
lanners
sambadrome
definitley
membrillo
maylam
penally
temeke
germanico
cupidity
ruwa
ayiti
mauston
flagons
betweeen
medvedchuk
hegeman
gunsights
adila
parasocial
montem
bemersyde
readmit
jeelani
eckstrom
ladany
baised
cookridge
brunnhilde
zuppa
vauthier
narval
forequarters
tiefensee
admissable
jackboot
cringed
sahli
darks
suskin
daikyo
foulston
targamadze
cagnina
elyssa
raptured
povero
metricated
rutman
knuckey
handbuilt
autophagic
leiomyosarcoma
fortney
weelkes
daddio
glitterball
elice
dabigatran
lumax
thinh
flugga
petrovietnam
papathanassiou
buitoni
motoyasu
tabernas
eyharts
mousehold
jetz
osbaldwick
everday
yukai
korty
louay
iita
hoblyn
chiriboga
potage
camdenton
sabco
lehand
shucking
reprographics
ervil
chidsey
sarabi
friedmans
eufrosina
diametrical
eiders
dermatol
kinzler
monkeying
yegua
yaan
madlener
lappas
declaim
bananal
beeskow
privatisations
ahmir
zweden
wilzig
eventers
heyhoe
titti
rushby
newcourt
prety
giovannetti
melitz
jgc
moteab
disintermediation
fze
guayule
worldpride
staa
heekin
fermenter
odubade
méchant
teahen
aouad
byfuglien
unpressurised
borena
datapoints
cibotium
benhall
cefta
wohlstetter
assane
wingdings
shanto
aguerre
massicotte
hogin
specialness
corian
vollaro
lukasiak
qkd
ovulating
tonja
decarli
ghahremani
appier
remizov
bytheway
cmsa
quadzilla
eibach
sukey
urg
waystation
standring
squeamishness
willmot
bedpost
astroboy
furnell
afps
cawsey
fuencarral
underprepared
jijia
tavita
lojeski
lowi
pozole
rammellzee
msq
swierczynski
anstee
rollett
lastingly
maseratis
kirakosyan
hurun
onek
deerness
einzig
srms
hensher
souqs
kauffer
misplace
rosebay
waunfawr
boora
killshot
cantuta
sgoil
maguro
lfh
paranoias
stripey
cascella
taibach
lagunilla
eadfrith
sorafenib
repossi
battlefronts
ethnopharmacology
skiving
dolidze
barkell
haitien
milevskiy
deutschmarks
siyu
blacklion
haruta
vecellio
shiplap
modjadji
lundblad
clova
ihde
bufala
grishina
forgey
tomescu
elymians
senning
cxr
gohary
kaichi
galguduud
queendom
lagerberg
buzzkill
kakkad
muthana
decety
delara
panzeri
yigit
petagna
siderov
dignifying
mininova
paektusan
benka
cumulation
tyshawn
rayes
killingbeck
humerous
renegotiations
banzi
xeo
zarian
analysand
ngubane
pörtschach
coychurch
progressivity
blander
coastwatcher
denselow
zaffar
wron
endocast
cfact
borowiecki
brevig
nonbeliever
uncouple
sediba
quindio
mediapost
manicures
shakeout
mahgoub
myristic
quenches
whil
barming
hable
drachman
purdham
caunt
gledhow
cadila
stopcock
retreading
childersburg
perovic
rogerstone
vonck
stankowski
chelmsley
rifi
youman
momand
ewt
saper
hunstein
simma
hoene
congeal
arakel
aule
hammick
paddleboard
howmet
shafee
romasanta
gabas
shippan
conveyancers
leonide
dhanapala
wickremanayake
blantantly
rwandair
bancaria
karbon
kotara
itinere
outernet
shoy
daheim
langata
naccache
jianzhu
arthit
quesadilla
fatefully
qlogic
clybourne
showtune
stemcells
gorre
pelous
papahānaumokuākea
gatsos
overawe
titsey
markab
helgerud
dreamliners
meneely
tajir
atcheson
lisia
tarnation
schoolwide
terracini
boesak
grossfeld
goolden
incent
munish
siriwardene
misso
colins
webcor
wway
scourie
swir
jsj
positas
silopi
seining
guiglo
halem
adeyemo
wiedeman
kacie
phibsboro
ruggedly
soifer
yida
rodricks
abwr
barsoum
omegas
fajer
nanofiber
carnlough
irchester
babby
aaos
márai
stucker
lears
ilina
labanotation
sitel
lettuces
jera
sawit
attakora
fromt
changs
genessee
meitar
kvoo
brantham
steavenson
toshizo
brandman
urbanek
tajinder
danshi
sbas
varno
millworkers
werstler
grisel
cobscook
adelaja
barwood
statz
shrikhande
buccaneering
colbourn
stansell
bensons
grupero
wasel
kornilenko
elderflower
hosenfeld
crcc
simpy
astrada
ocreata
exacly
xti
panelli
cottontown
spiels
manglapus
puskas
présidente
mistyping
clonally
statelets
gizzards
kosowski
laggard
africare
espersen
wellfield
moneybags
thoen
skipsea
maskaev
wachsmuth
yiren
scratchings
gasque
quizzer
barikot
brandau
providentially
metropolit
mesika
resons
beehler
thurnby
comparitively
lambi
bobrick
piercarlo
lüdemann
woub
iketani
eolas
courgette
dissemble
dawsey
tyrangiel
replenishments
shihezi
janacek
equador
siloso
abubakari
azizov
mesón
busser
auxilio
oxf
vingaard
derosier
picolinate
szymkowiak
kcf
ayele
scabbardfish
unkillable
gacek
farocki
unrighteousness
givaudan
deadbeats
cetyl
grushin
geisenberger
sabour
perquisite
histoplasma
asiasoft
accidentaly
mycoses
shihri
daallo
gecas
merrygold
dhale
confabulations
faqeer
skeates
stiffener
halali
boura
cassaro
whiteabbey
catterson
kanuma
lexton
monteforte
rustics
particpate
flavorless
leaney
zdroj
decompensated
karsenty
sagely
gvardiya
bravin
udhagamandalam
halethorpe
athreya
wildau
bookmobiles
freeflow
jirón
cyle
gravies
olympiades
watervale
solemnised
eliding
gryzlov
moneygall
cental
ushanka
mayrhofen
propogate
balouch
lobley
hordle
callenbach
colposcopy
twerp
uphall
debashis
zibi
campling
avelina
crooker
beauregarde
darland
inview
unrepairable
ibarretxe
mccaughrean
bethia
acep
gauch
emptier
kapaa
shijingshan
endal
hairlessness
tunb
gasparin
absi
ctfc
mahlsdorf
bridgen
trecynon
antimonopoly
cohabitants
mirinae
voestalpine
namazie
auditioners
rosmarinus
ellacombe
bilmes
vesi
auspiciously
cipta
digitalisation
direly
perey
liscio
kingswell
zackary
unredeemable
gansbaai
ozs
lengai
eppel
cheorwon
lighthall
termine
kareli
rodenburg
kurzman
cheryll
anin
dexy
dingding
bukamal
willesborough
stickmen
manque
gelbard
brotherston
pilotto
hohler
gillmer
baronova
staehle
medef
astrodon
mald
tisi
librado
calnan
saumon
boisdale
warnham
depodesta
mercker
seabaugh
butautas
montas
baloyi
crownhill
lomaia
botten
damns
rizzolo
sigl
filamentary
unmodulated
riah
stanzel
writhed
norouz
doisy
odesk
naugahyde
microsemi
prows
enow
veramendi
regente
tucanos
professedly
laters
garshasp
finerty
clevers
icaf
welsman
cardiological
mangwana
golik
mpri
intellectuality
medaling
siddartha
grzelak
feruz
disconcertingly
bakeshop
barbate
cnvs
streambeds
orignial
weissenstein
bourjos
gangbusters
tarika
vaynerchuk
contextualising
lutine
tassler
regularizing
gaiter
broadclyst
zasada
licona
bellenger
rectenna
prewett
rhaeadr
huggers
golant
syas
houdek
castellvi
dumbness
hefter
hamamura
antithrombotic
retsina
schey
unbonded
restrictors
minack
ahome
eyefinity
perthus
fouracre
helyar
prayerbooks
certanly
galat
undershoot
massimov
sharkskin
beaut
untrodden
azarian
ajara
plisson
nyasha
uncoiled
jinglei
bresser
asesinos
mangurian
stannage
pirogues
nadym
malolo
marpessa
ofd
wallison
camerer
irrelevantly
slavich
duplexer
merenstein
meyrowitz
revile
ratemyprofessors
intune
lisl
suburbans
corralitos
mckeegan
proofpoint
dady
mesoamericans
coutries
asianews
irruption
lightbourne
ossabaw
helpage
jiff
hartsel
psos
yibna
cinched
wilcocks
flury
wishology
clomiphene
stup
baala
baptie
tirez
koops
alkhanov
huaneng
everex
coiffed
availabilities
solidere
cashner
hodan
recabarren
cvetic
nendaz
angiulo
dentsply
pagai
barazite
victime
jeanneney
jinky
vinther
upvc
marshack
cpz
waldbaum
febles
sandipan
romes
hspd
eckerman
frang
libatique
jawline
toha
smilow
heliophysics
ninkovich
wambsganss
tofield
ostergaard
joselo
fados
nassour
nelsinho
appologise
jetski
competant
medidata
rtds
inves
vestra
skryabin
urbanize
glasby
pagon
buess
pathobiology
vitanza
sargus
rowayton
berky
kaust
dawkin
cervellera
roiphe
griefers
komei
bealings
tajani
unexceptionable
nakaima
rdn
midcourt
busbar
mulheron
yoshisuke
charlatanism
tapeta
calderstones
undercapitalized
parmele
savoured
salivate
flechettes
stroj
kidar
izzah
kobin
jobing
osnaburgh
ecofin
theofanidis
photopigment
sandomir
effra
kepel
covan
brusco
kasubi
adulterants
foetidus
dubby
gorgone
waitzkin
wimbley
powerboating
medicinalis
lifeguarding
bmn
gropes
synthy
blethen
zaeef
ladyman
cellulases
venton
jersiaise
slmm
arzak
quiett
batfe
campilongo
stamberg
cler
sharqat
bishal
pedigo
florita
foreskins
rurrenabaque
blackmarket
gossiped
gaylen
escot
beschloss
accio
zileri
thereat
defaces
superville
guiyu
jefferey
jemmott
waymire
senlac
ceed
lasantha
fowlmere
gulangyu
hanung
selee
corrour
overington
schwantes
fallings
bambú
hughesnet
kilani
lunchboxes
gorby
gardenias
medgyessy
teraoka
nazik
cevian
bellhops
yeremenko
noves
lowlifes
hackwood
chippie
alpizar
scratchers
boilerhouse
nanninga
armona
wayn
devenport
lastpass
dyneema
kopin
recompensed
stormin
arraf
isreali
softies
calcific
heskin
subcity
malev
brewington
gastronome
hesen
luciola
natsios
iyan
craniosacral
grubin
tabloidy
edamame
chattisgarh
futral
mirant
doored
antonenko
rlj
medecins
cenarth
jejune
llamazares
skylarking
franchesca
wude
yeongam
veselka
miot
fastfood
lebovitz
deutschmark
jubran
eix
guédiguian
monan
radiosondes
ferrant
sajith
parasomnia
burkenroad
sensis
novelo
whispery
karumba
madiran
ohtake
freyssinet
permenantly
ondarroa
outbreeding
lijia
teca
brichant
lupillo
imrt
presumptious
sheathes
venusberg
weitzer
bbcs
ponzo
rowlings
kozakura
akber
stefanovich
toomua
immigrates
oloroso
robel
ashrafieh
conterno
kiele
gradgrind
prognostications
gavins
cannadine
huxford
ldg
girão
zenbu
thurow
doorkeepers
keppe
welco
wifely
schulhof
outdistanced
giallorossi
turski
cleavable
nehantic
peccadillo
freelances
heilbrun
lenor
chaumet
marlovian
silkmen
oculta
mcgartland
warumungu
vedia
leduff
bioterrorist
barberena
konchog
ryecroft
albarado
shehryar
ruhs
silverfast
tomassoni
flato
destinee
lisker
ditter
amena
vru
pepperberg
rickerby
mankoff
hendrawan
visanthe
macronucleus
kamler
whih
crossey
ockelbo
ekerot
sanbornton
yetman
mmsc
rutsey
unhchr
eristoff
perritt
subaquatic
maramures
cushley
crosas
neish
nbpf
ibram
denly
kamangar
relvas
merkt
reimold
walsgrave
rustu
boldyrev
walbank
chapline
visibilities
zhimin
patellofemoral
eex
leanza
slumlord
mmj
gutterson
nedry
mineable
hackbarth
lisdoonvarna
velours
cardoni
drai
zerbini
maccaig
batiuk
hyperrealist
eichenbaum
llanfoist
manthorpe
mueck
goic
nohant
carolo
neureuther
mcpp
rostering
afak
hawkings
whalin
chemould
steil
aalberg
knwo
entner
basudev
bantjes
astafyev
airheaded
yaque
houchens
nuez
extérieure
mamberamo
macauliffe
coppedge
dettman
unzipping
krombach
exonerations
antisepsis
fishmarket
rolon
kalafatis
bedsteads
gavigan
emeghara
englishwomen
serrania
gdn
slathered
ssms
jilian
barths
deuter
hindy
hesket
ferra
kovaleski
gabalfa
yijun
cryptochrome
gobber
sharktopus
shearon
marw
luno
niskala
kolpakov
murielle
rpx
sathorn
elseneer
esbl
kenthurst
comaneci
lopham
rummell
airtimes
achmet
busbars
sedric
servient
soilent
universalizing
kyriakidis
chirang
arkema
botteri
zaina
brancheau
sulaimaniya
cafr
telesystems
gewurztraminer
prostyle
lasan
aunjanue
clarins
fleeman
milkis
urpeth
blees
acryl
vinery
achron
eflornithine
haying
vinters
resistin
eisemann
cadishead
ulker
klodian
companionable
towage
bassir
brún
newi
youren
fanar
smullen
teuton
litterally
ostar
ibiquity
hauptstrasse
wolstein
aguacate
usss
anonim
brockham
capesize
szmanda
salihu
pemetrexed
aviel
catchword
contrario
ghiotto
vasallo
stagnates
velandia
gluttons
salti
mihajlovic
novastar
mysociety
unteres
haloed
backcross
gloryland
tullman
yga
angelitos
foulest
kondratieff
crif
ijp
quicky
federalisation
goldbeck
fedoras
avina
puji
emilee
meretricious
nawas
latonya
oversaturation
pyrexia
ghufran
lysippos
cafayate
gertrudes
riederer
banchieri
methwold
baldick
falasha
jaworowski
gosaibi
peculier
magara
vibroacoustic
jheri
untwisted
hepi
hunold
wagged
skullcandy
pathless
birthweight
penpont
henselt
cfcl
anthropoids
bachardy
lobiondo
welldon
reenlist
mcgorry
kurnia
spago
shougang
bangarra
updatable
tje
tolchard
raineri
deru
trevethin
kernochan
rieker
prigioniero
moneyness
onthe
kerasotes
troxy
cristhian
dummar
fruticans
bioenergetic
donside
weggis
lessore
atalla
yehonatan
aake
maje
kales
alere
lienau
toups
maximizer
grage
malbis
trepca
stenungsund
coper
loveys
poate
zajal
spierer
democratise
unimed
giusta
assoumani
mercaptans
adeli
genzebe
tuama
euh
shijun
piotrowska
bartkowiak
locky
laughren
yessica
feury
delfland
maidman
metalious
tielemans
filgrastim
ehlo
sirrel
bespeak
oich
sheu
langbehn
papermaker
elgen
exaggerator
cicak
budarin
lieing
hydrofluorocarbons
kissack
leodis
yungang
melvina
alverton
waisman
abstractionist
fassino
cumbers
muhanga
biobanks
savours
prausnitz
latulippe
burglarizing
delisha
muleta
busic
nicmos
gayen
akomfrah
rafaella
giuca
koshetz
tadawul
winzer
realogy
migrante
kontiola
wdef
plagarized
tadahito
abunimah
dickason
murino
kamli
hulsman
panhandler
matc
boath
ekram
barquero
legkov
poretsky
amuria
stratcom
piggeries
whitleys
bitingly
herfindahl
alanen
clignancourt
kular
brotton
waymon
fuliang
thrivent
etra
credico
kamaboko
heartbreakingly
overpainting
gego
arcu
rechy
cringle
stoclet
yeso
carate
cremator
rehear
borsalino
trivet
krekorian
dotun
sonner
sansum
cazzie
coyer
cuscatlan
pioneertown
customizers
irimia
provender
neotenous
marm
resemblence
snoot
superlight
bunking
duhks
everpresent
mosborough
rohrau
shallotte
firetrucks
casada
cuyuni
cattery
kolachi
lerato
faurschou
navigo
pengwern
camon
hezlet
holmesville
mccallan
ditz
deckchairs
bonsignore
reposing
starworld
coachbuild
feethams
hildenbrand
amoako
kazal
beeks
letendre
youyu
colpaert
selco
pimiento
parnia
cofresi
nabers
multipage
polemically
idonije
mitterer
nonphysical
wachner
mandoki
zales
wahler
inflowing
wowie
tirebiter
manhart
sarmayeh
dudka
hidradenitis
mohammadyar
ridging
aneke
saadoun
duanwu
siginificant
bluebonnets
sidique
witchford
ekici
osel
baliles
aleotti
redrow
serdes
belber
droppingly
propjet
oqo
greyboy
bitartrate
psinet
shervin
langis
hershberg
mutoko
filipinotown
clobbering
elkhan
anucha
mcwhinnie
drazan
raido
witherow
kme
zusak
neurolinguistic
pinget
nottle
michas
resarch
arnao
acsh
tradd
birck
wellingtonia
macoute
wnat
vloggers
evenlode
fulgham
ugonna
vinke
gaylon
vanderkam
colza
carnan
borregos
geniality
irbms
bracquemond
alckmin
yousendit
alloush
ascp
ferda
reinbert
fortunoff
kylesa
dockett
snowdown
tameer
sangatte
caramelization
ghl
skanky
courtships
carreta
obikwelu
neutralino
intermetallics
mongeau
dalmiya
pece
mayombe
assel
voirin
childhelp
venero
bunaken
moei
carrega
presspass
gearin
unswayed
bartold
immaturely
oahe
beqir
maximillion
anderl
cuidad
engemann
bluescope
tamlin
arcati
blakenham
mentalists
playlets
azw
duisberg
dunbier
greatcoats
mirams
counterblast
seismographic
szczerba
bubbi
oatcakes
usak
burnam
crewless
muntadhar
kimiya
underutilised
witkop
awani
kirkee
sbj
dnm
yunel
irbs
szetela
usec
grigorije
kutum
molls
zhihong
hatmaker
rvl
southminster
contrarians
dromara
annastacia
tooks
dashers
boschwitz
prator
eizaguirre
alliott
digged
renaut
laureat
diabolically
leopolda
heirship
undesirably
vallegrande
lorinda
akqa
egad
motorwagen
leclerq
sild
bordman
evdokimov
fengler
mazaheri
jereme
cailin
podeswa
disembowel
granita
worlde
chavit
amulree
auo
cach
pollos
choicepoint
duley
fonden
jeanjean
abergynolwyn
rubiks
fialka
activia
whirlybirds
airborn
fode
ponseti
rozina
iffco
unifrance
autoparts
centinel
exhibitionists
convergencia
ascod
dunhua
oneshot
sportstime
fishnets
otcbb
chowpatty
piris
sibghatullah
swartzentruber
nauert
barmaids
aev
inauspiciously
dancefloors
derogated
frontbencher
jeetan
peple
glumac
zhangs
xrp
chelomei
florman
udaltsov
bilik
maaten
daithí
pentapeptide
weidlinger
gloated
shalhevet
zoppo
gentex
backley
byock
meidner
digestif
nilas
granahan
roehrig
kafar
micropower
tweetdeck
eliska
antedating
phalloplasty
arcangues
trillionth
helgen
hasee
gehrmann
sitecore
hongguang
prabhjot
alyea
citylights
pitchkolan
ncx
amirante
renovators
hydroxychloroquine
broons
garishly
touhey
renninger
dolans
larrabeiti
gubba
screechy
neets
walikale
taisce
gundle
xfe
bradney
ekv
plavsic
coffe
tingri
homburger
biddenham
lecourt
subventions
kirshbaum
chds
everth
petering
hessey
roundtrips
muskett
beccafumi
faline
johnstones
peahens
unecessarily
vorotnikov
steorn
westbridge
naida
maev
gourriel
modnation
salvington
samaire
staes
sydnee
resections
nonobservant
kelvyn
charna
coordinative
durris
anze
vbac
milutinovic
menia
motorik
uelen
dancehalls
rokia
breyers
tudy
matulino
intosai
oxymorons
vises
munshin
punkish
nedergaard
studwell
xfr
markovsky
abbassi
rudebox
estalella
glycolaldehyde
anuga
nonmedical
trinitrate
challow
danzhou
honigman
petracca
steiglitz
overzealousness
miuccia
gyuto
pierini
gatchell
eriberto
hersonissos
megalania
delicado
orner
agvs
dambuster
vitaminwater
seye
aswani
toukan
dbas
tartrazine
kinbote
perdigão
helstad
gearchange
keelty
chaplow
humidified
snowcapped
tsatsos
mackintoshes
meanwell
glom
phototaxis
unscear
somborne
barthomley
korobkin
dunmall
mantelpieces
passeier
rumspringa
villarejo
rodmell
allsaints
capitalises
ovelar
cadoudal
chicon
photosensitizer
bowdlerised
retranslation
binsar
malori
ruminative
aflalo
isoflurane
maxiell
tureaud
mirro
zijian
boleskine
hartong
clickz
gerenuk
kopplin
gitanes
taumoepeau
dague
industriali
haverthwaite
gettelfinger
brassware
portended
jiuzhou
rakhal
hilberry
juaquin
marturano
chessplayer
dougher
corking
marseillan
wickersley
fengxian
sotoudeh
toreo
ueyama
mummert
langesund
cogbill
bipolarity
zylon
crackhead
gadlys
schwindt
zuleyka
pompously
sestieri
salespersons
nieuwoudt
raphaëlle
sukowa
enerji
churchgoing
watchkeeper
wttc
mazeika
perdicaris
insync
tulo
persada
ballywalter
arkins
ameliorative
arambula
rookes
codetermination
halai
hutarovich
reynie
mystérieuse
magnetotail
ferdin
gobeil
lemesurier
winstons
naad
bonitzer
ulibarri
asmal
oure
terol
zuehlke
closson
ternay
phusion
solutia
pisarcik
hekman
vulval
mascarpone
epipen
quants
gyrodactylus
sevran
teabags
ftca
insensibility
qpc
payin
severities
weyerhauser
tiendas
moosilauke
wakened
raap
ziobro
housel
brightfield
misagh
harja
battalia
microcap
dowlatabadi
tyrrells
izmaylov
deuda
craigend
traffig
mövenpick
yevseyev
plutocrats
bucshon
flasch
kendama
barkway
browbeaten
avramopoulos
ailin
aflp
predock
tapatio
ayllon
bozena
thyolo
tutone
aiuto
pruess
kulemin
froh
chanels
reul
pääbo
bape
byw
aouita
tarried
wfr
lammi
feyyaz
aerc
micmacs
restrung
calderhead
omotoyossi
seidell
brammall
jaine
hurdzan
cardno
guttenplan
outgrows
marnock
tresman
colbran
wyndhams
jabin
shoaf
funkel
okkalapa
kppc
rynd
schinasi
voogt
martorana
rewarming
armantrout
manoury
anvita
tenille
delargy
bausell
rurka
sebouh
atonic
gwd
panner
nixes
dotage
afam
unarticulated
lombarde
lanseria
fieldworkers
kikkoman
creteil
yijin
slowhand
edgeware
kolzig
neylon
altana
chatburn
cautiousness
kinning
bruntwood
piggybacked
breakroom
bancaire
huckabay
rathmell
vende
malarchuk
guarente
xenografts
cemi
evolvement
incoordination
uusitalo
tabai
cyclophilin
anjulie
edgworth
paterniti
gallai
horsedrawn
shrien
bluepoint
pompiliu
overeaters
lazin
naegle
ketura
kaleigh
thundarr
anozie
daus
cundey
nantel
goldsack
hackerman
prologis
quincentenary
kading
bryngwyn
wtw
bukhsh
rancidity
borca
anwaar
elephantopus
frasure
bsam
labii
vath
abce
brunon
wappel
confict
retyped
anotha
waj
broadwalk
geisa
miqdad
taxidermists
cascos
boatworks
predetermination
unreturned
mizukami
suposed
philippoteaux
cienaga
ndcc
insomniacs
rozhkov
sissener
engeler
achived
sgg
tarling
microvision
zoppi
albariño
passauer
koyuk
convallaria
alom
beanery
kamishibai
ruffier
rotliegend
lectus
bernert
overdramatic
grevy
undefiled
bhimani
laj
railsplitters
dangin
akinnagbe
pilsbury
wgaw
rotheray
cubing
zhurbin
zehetner
guanghan
gluey
slanderer
batini
quiets
sannie
cantora
kulis
stierlitz
receieved
howsham
apperances
arnaldur
hoisin
charliecard
kamanda
methylone
navigant
tsegay
tollbooths
coore
reenlistment
crooned
tsurphu
dabengwa
lavatera
warrap
refix
splashtop
hamleys
metrostar
floreal
colombani
apprendi
outwell
gweneth
wtok
wittner
donorschoose
goodnough
stuchlik
hosain
corkin
bioanalytical
negm
raffish
sigfússon
ljmu
erlestoke
leimgruber
andou
enunciates
juny
skander
yongqing
configurator
trustco
kswo
ketchen
hydara
faucons
reverso
emaus
bheki
avandaro
loungers
collender
shoniwa
sariwon
legimate
huggies
canonero
dolgoff
vits
tokarski
delauney
swordfighting
blauser
hooshang
kulyk
sollie
howardian
gerdemann
ndvi
khondaker
bentalha
diers
uspenski
benjo
egalitarians
rascist
michelman
somersaulting
thommen
yoff
rideable
carbisdale
squadmates
bergdoll
bambous
salee
zijin
ramsdale
llantilio
arguer
scaduto
toshifumi
zhifu
jinhui
decidely
clamato
karthaus
mushka
boothstown
rectifications
hellertown
ferguslie
uneccessary
japanophile
katawal
workplan
attributor
cumpston
schabowski
qarabagh
utj
bogusz
kernes
bartholomay
langhammer
dematteo
isfa
hinderaker
oldaker
feets
teenybopper
adland
cheteshwar
fashionologie
basudeb
carby
bramsen
prowls
hinxton
chaperoning
cledwyn
madonie
shoppach
crosspoint
pelkonen
gushers
eyecare
osipenko
abler
brima
igniters
unistar
hnpcc
henwick
cigno
cantagalo
aberra
cadley
yapping
kosilek
pavlica
engela
sgps
rustomji
ketterle
kreuter
corect
thabang
sirop
banac
sjb
wachiramanowong
holmesian
bartling
cortis
stainthorpe
fumar
gics
zairi
reresby
yarrington
mahd
seawifs
tropper
camberwick
sparv
leeflang
styer
kimmirut
baljeet
mclaughlan
fazul
borinqueneers
chanteys
woza
wandy
acers
propagandized
berhe
puah
grimalkin
raanan
weismuller
ngobeni
glatfelter
seila
xisha
brinon
antao
wabun
fabulists
puva
eventid
vaporisation
kiptanui
pinkeye
aedo
eavesdroppers
keyholes
telang
jianchang
oxblood
hartack
sepehr
agrihan
parnells
tacha
zaker
sulim
confortable
abongo
wistfulness
bankas
knepp
exerciser
malila
ensisheim
wvlt
ribby
reya
castellabate
quoz
exabyte
musayyib
mtoto
adesina
broadminded
fudosan
gehrlein
beezley
riemersma
preforming
kralove
finmere
camiel
cicoria
scob
hansch
puligny
resuscitates
delaere
fendrich
southwater
vanelli
whirr
johno
ajil
jgb
wangechi
kyats
tonini
trifu
wesely
aray
partwork
vitalii
saddiq
eshete
unrein
dureau
nattering
dukha
ormer
hydrating
orpik
nyangatom
anjanette
vilani
opdycke
vaibhavi
tipnis
prurience
unsealing
rockwool
clitoridectomy
burngreave
pianta
recumbents
mcconaghy
moonbow
gasometers
patpong
thuresson
playnow
sudsy
legibly
holdcroft
froglets
oliviera
fonzarelli
griptonite
salie
spatiales
bistritz
soulfully
yasuní
slickrock
greaseball
hyoscine
neringa
chimères
videoed
cutchogue
stergios
aritz
kanke
gtfo
kasrils
cardiol
paté
bayada
langrish
sarabeth
hofmans
abderhalden
gamehouse
herbet
wakefern
dozers
omeish
barbella
buley
exning
ripponden
iev
cohu
burfield
mammoplasty
carrock
inamorata
veiny
compris
carnoy
broidy
barrand
groundball
dogu
caprese
genesio
hallwalls
kwhy
tred
maulit
facinating
lifson
troed
testteam
steria
rajabzadeh
mjk
stron
buckleys
gulko
huntingtown
straughn
exoskeletal
studentships
sternburg
weizhen
plainwell
moonset
fings
antle
altmaier
multicolour
pfitzer
myddfai
getronics
corfiot
peggs
nossek
chaon
ortygia
glenford
taughannock
breeched
rokr
saffa
ramiah
mikovits
searby
interactives
ngardmau
luambo
fritchey
gwinner
mosisili
lizcano
husak
chavarri
nebiolo
grampy
vicolo
hancke
quadras
reggane
uonuma
amberson
bouchareb
metts
aipc
spik
addtional
contar
impasses
breward
rilly
overcalls
soilih
linny
ethne
caracazo
spicule
stairlift
laide
adlerstein
personnally
borve
clarel
csfb
perfluorocarbon
cecchetto
stanzi
anthonys
mainak
muraro
kailey
arnaout
prashar
alica
leor
donats
scraggly
rapada
azima
footbinding
asplenia
ppap
adjustability
skokloster
torrs
durjoy
harpring
sardelli
besra
hediger
reddings
baechler
moviestar
retconning
herxheim
intracardiac
bestower
afos
cerumen
spevack
masanaga
filippelli
convergència
ogd
bibbidi
kewpee
junkins
comotto
mandlik
berrill
shaibani
blasphemed
doscher
tonj
adhamiyah
humoreske
resus
remounting
kellye
dontae
daiso
monse
thannhauser
virological
akona
nasad
mulya
cnrt
reindeers
kusc
doronin
yuxian
gimpy
omotesando
aymond
grumley
kelam
rittman
ussery
ppx
impex
reckonings
koussa
conserver
matal
celent
troglio
mistrusts
hiner
kinkell
asraf
countrys
sonai
herault
malnati
entangles
diety
renehan
stendardo
engh
ansanelli
chaviano
raschid
jousters
salmaniya
whiffenpoof
quistgaard
bodeen
reportorial
aerating
drcongo
lidderdale
comed
microdissection
costcutter
stomu
carcharocles
sunniside
jurg
rousham
obadia
pieres
ekaterini
santaquin
carrbridge
croisic
khau
lonchakov
baatar
xantia
humphead
rawan
steinski
cinven
lochsa
sammir
plaat
muallem
serigraph
allmond
garbis
amcor
melamid
airwalk
aitcheson
terminuses
convalescents
glast
carrizosa
mirriam
lettin
montie
makhoul
mumbi
simmie
rosburg
serhan
arellanes
modernizers
gamania
engish
thalassaemia
hongmei
amirault
krovanh
velvel
orien
stelzner
paddleboarding
stolojan
pacelle
nadhim
abdelhakim
hachijo
loughgiel
lindall
garches
handcycle
devins
kurtzer
bleckmann
dumke
digresses
dstl
paperino
fouzia
poux
minchew
glowsticks
bootmakers
fogelsville
shudehill
kelsie
ingratiates
grecu
rotfeld
guanahani
zayyat
rebutia
unreciprocated
rubberneck
adevarul
pejeta
jarrin
voicebox
psychrometric
cuerno
chauliac
sarka
leves
poltical
labrada
onate
phats
stowford
bloaters
finistere
characterless
coxwold
micon
dengfeng
neuffer
manderlay
kincannon
reelect
steinacher
bloks
dln
colker
itemised
puopolo
computor
maxinquaye
batched
haagensen
unar
velvelettes
kilmany
koshino
taqwacore
kilocalorie
kosik
reformats
sanctuaire
afes
soliola
yazzie
arcosanti
deruta
haemophiliacs
salvarsan
libanus
enedina
teamtalk
screenwipe
icasualties
ziporyn
francescatti
udana
wolodarsky
donilon
neopolitan
crinkly
sistersville
karriem
rubert
dodman
fasih
nahasapeemapetilon
pungwe
lafrentz
mulville
ruettiger
rhinegold
houlston
carion
gilesgate
falconí
afterman
exelby
preprocessed
proctored
rootin
eisenia
versata
kijiji
matondo
cjeu
exorbitantly
winegardner
nhpc
maaz
barellan
zuhal
haikara
jostens
idiomas
hesson
korsten
lahman
darcus
chishty
alexanian
simonova
vandalisation
profitt
fishwife
vreni
princelings
stilley
igby
moschella
feherty
ggw
flemons
borio
nonmember
oxyhemoglobin
pierer
yorkhill
grenon
stockbreeding
quil
strzelczyk
hydrozoans
timanus
paranavitana
mucke
sdrs
daung
photocoagulation
tightwad
astacio
microcircuits
delory
inacurate
spandana
repass
tamyra
yopougon
borlongan
voluminously
garms
spurlin
elvers
chorn
lacedelli
deily
heyy
aliff
shahal
lifa
greave
duensing
tahmoh
atep
tridgell
gadhafi
dniestr
bronzing
cantering
bigpoint
vonetta
reappropriated
garath
drumpellier
economizing
rondonia
gyimah
maiava
derb
yacoob
wackos
ajala
saenko
dehumidification
prodigality
brookbank
methi
undeletable
whelton
ravichandar
throwin
carmyle
replaytv
petm
drash
alazraki
hellqvist
shellabarger
spino
reinnervation
lolling
michalopoulos
goodhand
ysc
valiasr
horobin
beardyman
disaggregation
khagendra
barkett
mcclellanville
gares
bfpo
buoninsegna
szoka
mcanallen
zoophiles
handong
ballygowan
somtimes
concievable
kagal
maruta
jocketty
pinotage
furbished
niver
ritters
hanapepe
splotch
malem
nhial
bananafish
recalculating
uddi
agms
naturalise
etag
cottager
iosseliani
nerpa
elidor
frisina
marinate
clewes
wfie
videogaming
hartly
hagy
kgan
taris
facture
koteshwar
ellmau
isri
redican
kakakhel
manze
fedco
cusimano
gareau
davone
laurencekirk
nouv
fanger
azize
copters
dhh
moimoi
isoline
termeer
glaciologists
rememeber
llangynwyd
rostal
pedroni
werkman
hypothesise
tollesbury
fahlman
lanus
eug
sellotape
schnabl
szatmari
rogart
rusape
hoyzer
nupen
raisings
newswoman
kallum
marshburn
barnetby
kovack
delaurentis
sapey
macosx
dusautoir
weightloss
biostatistician
ampico
lennons
vernoff
zmed
jurin
alices
sudarso
islambouli
lynndie
hascombe
játiva
hardies
sublicense
sluizer
dinkeloo
celex
kolobnev
playbooks
burtynsky
shadowless
gehr
akti
shringar
nazon
weasly
ncai
homola
benander
duncairn
magris
housemasters
agsa
woonton
akahoshi
astrov
hanborough
kleinsmith
barlows
jaric
ruolin
ewl
mateschitz
nonpathogenic
disgraces
wajih
begijnhof
winser
grimod
worths
linesville
abstainers
prestin
relm
aniara
disy
lenita
gagnaire
camaleón
liwen
solidum
poncey
headrush
muhs
themerson
deterding
threeway
seaspan
roncero
xinji
jawans
serifos
cussac
mokashi
saarbruecken
guelfi
pedja
magimel
ringham
pealing
myracle
arestrup
batebi
cpq
darwinius
ruardean
mnookin
tramon
kawthar
weinhandl
bavetta
poppit
componentry
rowetta
soundalike
bqe
sheko
stethoscopes
preconscious
karabulak
fusheng
melangell
acassuso
chastleton
prowled
elain
legget
polybrominated
malesani
carnosine
paritosh
liriope
grisoni
bernon
brandberg
hillmann
tomka
ustyugov
pluna
capsulatum
vengefully
birao
bedsole
lichtblau
samek
bettinger
brammo
brabenec
unfactual
lert
battani
kubis
sadden
slemp
chemotherapies
jingxi
flegrei
hohberg
lobstein
tupman
hennion
contruction
kalangala
aspros
realme
corporatisation
fischoff
ospital
brein
scatcherd
ribblehead
bookclub
lassan
kalliopi
moonachie
greediness
becnel
leapers
lupit
aldy
kalus
ashenfelter
branum
laister
coloristic
tesauro
jiye
flahavan
silveri
valspar
qiannan
holscher
whiffs
shiksa
adbul
liks
rupertswood
cockup
fuzzing
okpala
opacic
obvioulsy
godelieve
cué
mcdc
andrex
sweers
zumar
benacre
prepubertal
kilton
whitewashes
gorgeted
gambella
friscia
tetney
apcc
picabo
kulov
sathyu
arcano
sby
souleyman
schmooze
sinndar
majerski
hohle
imli
kapugedera
johnta
thursfield
synonomous
kasyan
mennin
glyptodonts
bickert
razzy
gurdev
geaney
pcast
afz
foreordained
medlars
shikabala
wilh
bza
ipss
rfw
panka
featherstonhaugh
carders
punker
negresco
eurobird
brutalize
ekeroth
concering
doucett
piatkowski
puchalski
coeds
handz
leboutillier
callejon
reny
perfectibility
whow
flattr
towry
zagorski
bjorlin
yenan
gitarama
floros
aner
peraino
hrytsenko
numark
konia
stationhouse
fortius
chiggers
embarass
pbms
amster
esselstyn
voltmeters
hemsby
fluidics
kalaeloa
rtms
baranyai
korbin
asmi
puletua
clingendael
pietarsaari
aliments
lustral
allami
priem
fedossova
extranjero
torresani
bioaccumulative
rlf
framwellgate
aeth
noncontiguous
cleyton
bedinger
nykesha
hamblett
jadon
minibike
wbe
jcvi
gillikin
housen
helú
hesco
ngema
daigh
warzones
joas
boscolo
rossitto
loyiso
bivar
sugarhouse
nemir
micrometeoroids
hogen
trainspotter
cusset
tratamiento
sockalexis
dtap
aphanomyces
elona
ifosfamide
dissectum
shlaes
krown
lubben
appi
rtkl
yordanova
gisp
shahriari
zonneveld
seaforths
brickwood
lajamanu
appraises
lewellyn
sineva
olestra
ultraportable
baloha
lexar
kedo
imrul
gunst
cryptogramophone
hospira
seafoam
shpeley
dsca
barbery
gudiya
pilbrow
thurley
aneel
nswc
hamai
garnons
negligee
shandra
pariol
basrur
panathinaiko
metreon
cutshaw
kantaro
shingleton
outlives
caddying
apert
lewine
odora
seligson
fencepost
kitchenettes
steegmans
sabawi
yatseniuk
pimply
scrawling
ariege
jeremiad
horen
unchr
sendings
kedgeree
ultrasensitive
playworks
pehaps
dormand
qmv
hagbourne
psaila
konfabulator
meduna
sieu
burried
stellas
fashir
elucidations
azamour
truecar
everynight
pitstone
groarke
cyder
desuetude
bocian
littleford
fregonese
morrisette
fairplex
bedspreads
poznanski
quiggle
hait
walbert
minstead
elward
kryger
nonconsensual
broadoak
diversifies
fountainbridge
saem
stoli
eesc
mordt
chinaberry
catcott
nalgene
lucet
hawsers
romneys
terrestar
schlussel
neuharth
busabout
ondrasik
whiteheads
shorrocks
dclg
haemorrhages
ahlus
watchkeeping
vih
externships
bedlinog
aspirator
jokwe
manmeet
bings
casner
dovish
coyuca
lonnen
chivilcoy
micronucleus
nobrega
jakobshavn
yakob
bleeders
informatie
mallonee
asfandyar
roamin
reclose
garnant
versyp
mistrusting
necromorphs
worlders
munsel
loughrigg
humanised
sudakshina
hinnigan
schulmeister
yanin
flitton
faraci
trepte
dmarc
sbirs
kouachi
cack
ceccaldi
ocarinas
wlt
onlooking
hoggatt
daybook
hiper
doriana
whodunits
ranallo
kirkheaton
avci
hidy
waldbühne
assael
lawr
oportunidades
morett
hollandale
fornaci
annabi
townsperson
palliation
karadas
sallying
venerdì
rokhri
pandher
stammen
verlyn
orgin
slacklining
voas
wkt
patru
mdea
meris
railpower
erel
pittsburghers
dambrot
mondragone
kirkstone
hieronimus
llanddulas
salena
orido
redactors
pollington
merelli
rouzer
ajaan
firestops
mrnd
wincobank
rheaume
biomarin
winterberry
kucharczyk
etns
hirwani
jopek
houli
earlybird
laboureur
savané
marras
bdn
mcausland
gaultmillau
lysandra
branda
chkhartishvili
phh
rysher
wangel
navitas
vesnik
globality
ranae
frankham
erlbach
moorends
persnickety
repacking
sissay
sulfasalazine
horsmonden
nury
louisianan
cwmbach
breeland
thone
pryke
nordyke
saracino
soemone
buffone
millea
suprun
transcorp
safwa
subburaman
ethicon
soulforce
electrodialysis
starin
géa
soozie
squabs
navo
yazov
saltmarshes
kovarik
bielawski
kohlrabi
justes
biltz
knowest
colinear
jenbacher
lousteau
mcmenamins
etiquettes
burtch
jaroussky
nafar
slavka
samouraï
ufu
quenton
stanely
dumay
tolimir
leistikow
tagliatelle
gelles
cregagh
xiangzhi
siskiyous
maxjazz
tusd
pamella
dontcha
traute
subversively
burray
sutomo
sideboards
melloni
mevis
zadik
camisole
hamson
ordure
imod
sowerberry
acoustica
caher
didanosine
topliss
cotinine
saes
ters
sedova
pettoruti
succesfull
govenment
biffa
ogunbiyi
bengeo
farty
tahari
haseeno
holaday
fauchon
suppossed
kristiansson
amongs
karise
kléberson
ausra
laras
verlinsky
concent
globalising
tanigaki
maeil
asteroseismology
filburn
calabaza
politicial
glassfibre
borgeaud
remarkables
whacko
unrepentantly
pozzovivo
aprd
commonsensical
dramane
isfield
couchsurfing
kenes
yorston
laq
norde
enterohepatic
zeger
cremello
folkstone
parasail
barel
cenovus
vanneste
diaphoresis
dpss
colaianni
irresolute
rideal
seys
raoni
embling
garavani
marcham
narro
pérec
sterlington
buffie
franciso
sophina
liverpudlians
burga
sokolski
kitley
abban
fussa
folinic
scarpati
stude
isovaleric
topcoder
blazo
seef
augusten
zehntner
auguries
hemond
hattery
zevs
yabbies
ekranoplan
gleans
retentions
cheal
crabapples
nordtveit
muthspiel
renouncement
kavre
cornavin
cker
iros
bachleda
chaldees
annisa
irton
mings
melasma
muravchik
mohai
gedeck
histroy
agcaoili
noyz
dopson
caveau
myza
bermeja
dergoul
jonet
nocella
cefalu
fangtasia
santeetlah
kratovil
schreffler
howabout
littel
unscrambled
ruzi
mughniyah
downdraught
trampler
alterra
antea
hasmik
demonstrandum
siahaan
mcqueeney
yekutiel
conable
geldard
kstu
millhone
laiwu
anla
mukogawa
jpac
strenght
kfh
fresson
weizen
twitpic
hadorn
erddig
ellerson
shono
carlita
khamovniki
chinda
veysey
scota
grotberg
djp
unfamilar
nitwit
clintwood
wansink
connington
humidor
moviemakers
carriageworks
besset
dewe
moussy
dolch
ziming
verdery
soif
plyometric
suchomimus
cstr
swype
mcrc
alrady
blaqstarr
opik
conceed
emmit
patternmaker
heidy
hydrobromide
pharmacal
sakartvelo
aleksy
nastar
violaine
woodmansee
daguin
seethe
lamya
readhead
berhalter
ambev
lignans
neuros
ruhrtriennale
crism
richenda
garefrekes
guinobatan
virtuously
tiptop
sauerbrun
balza
ladji
canright
cagno
moqtada
dundela
gringa
lcme
grascals
polebridge
comission
monterrubio
bellugi
autocue
csq
perturbs
apitzsch
yadana
krajcik
sandelin
ktrs
shirland
sharara
nyko
tenseness
kuito
wenming
geumgang
kidon
amout
shaqaqi
manke
akoto
claines
mautz
yoshikiyo
jeanrenaud
rouhollah
torbet
proceded
chuanqi
rochin
arvinder
carco
moluag
wickerman
viggers
hemric
stanlee
bachna
acerola
arrivé
ugv
recomendations
thrusted
bhimji
dubroff
sdrive
aronberg
fctc
ulka
cullyhanna
minja
kirkconnel
mtj
underachieved
hangnail
nolfi
abbondanza
pyeonghwa
triacanthos
balikbayan
enlistee
thous
skjelbred
mberengwa
dalto
minia
misconfiguration
neuvirth
teahupoo
cande
kiker
spritzer
ziying
braathen
giuli
guell
tbas
sibilia
tranberg
lenti
astigmatic
dialysate
birdhouses
schoolyards
dolinsky
norelli
haemorrhaging
tinier
postl
bupleurum
crommelynck
vlti
biki
peterlin
ezard
kexby
knackered
olaine
herer
bardales
primis
shorland
cvitanich
kamien
varicocele
gerstenmaier
tippetts
actionism
brusati
imbedding
battiscombe
ferals
bettine
czarniak
ziglar
trian
hiptop
botos
fleabag
akey
cuisinart
centerview
tryscorer
tilma
grebennikov
freeskier
dannelly
prognosticator
backen
refaat
blackcurrants
joughin
ruffolo
wildomar
wtvq
waghmare
rcvs
platel
klochkova
repitition
statesmanlike
laro
nokesville
darwich
pattishall
heiney
hospitalists
dazzlers
kullberg
kealia
verjee
gaieties
hibernated
lansac
trendiest
dejen
androni
whitstone
montechiaro
finkielkraut
besmirching
bluecross
zampogna
poutchkova
leoville
favorito
norine
wheely
coari
degray
sissonville
hailie
eveyone
afpc
khasab
sieb
sumin
deconstructions
rassi
birna
deprogrammers
carhuaz
bulcke
wfe
nesson
standers
forewarn
implosions
gervis
jeffcott
denley
powwows
biopsied
bedes
phylacteries
wastepaper
terrariums
hedgesville
knowlson
borodina
orsillo
tautly
multitalented
sherlockian
dankner
lehoux
bürgel
lwazi
shafiullah
launced
odero
saarloos
lotteria
redish
jammys
groys
pecc
ballynafeigh
favorited
lactalis
henschen
discrepency
faily
delman
gisha
nationalizes
brislin
carmax
flavie
acclimating
cruceta
reamed
shustov
berstein
charcter
adkinson
neuromarketing
verot
bitsie
portmahomack
cowcatcher
granfelt
unmit
jansma
annulments
wellton
abuelita
kefaya
leadley
whop
diference
wvla
brosnahan
rozonda
alcea
pardner
aviacion
sauget
jumbie
philipines
countersign
fukuchi
otoacoustic
jopp
pilning
fanjul
ripostes
kisko
soofi
openable
scaroni
worldport
kvitfjell
uppermill
ruden
nitroaniline
gwyer
babayaro
chanie
dicillo
dohmen
excommunicates
rendy
issele
tâche
democratique
gaidheal
zagel
seghir
stemm
xtv
salopian
fisma
rosenstrasse
katzin
beque
bruichladdich
breckon
ihave
hardnett
ratchaprasong
leavings
lintu
currywurst
ilmi
beefier
feminizing
gamechanger
relaxers
skripochka
claerbout
ventrella
outdoorsmen
yadkinville
nebres
vélib
arrgh
astons
slighty
takala
stears
luuq
pawnbroking
pignataro
yazeed
rankov
landcruiser
gurnam
harrovians
bultman
schussler
lewers
spinotti
dealy
turnersville
kajita
imaal
jobrani
dodgems
jaggies
pleonastic
gereshk
piratbyrån
fortingall
melchiondo
involed
vanino
repsonse
ggyc
querry
agera
meibion
nacke
quic
wedgies
farooki
foor
neffe
fitzwilliams
redistrict
dejectedly
ghadeer
madari
dolk
woodchopper
mebbe
fortymile
brants
driveability
lacotte
stargaze
intolerances
gambarini
alaux
stifelman
octodad
pugilism
deroo
almut
beglin
sunfest
lleyn
khorog
chesaning
tfb
halfwit
budrus
zarvos
gengis
mantovano
crevecoeur
terell
oechsle
entryist
whybrow
mitigations
mootoo
marcovicci
sighisoara
faenol
fundatie
loovens
siteadvisor
zaky
edenic
nonjudgmental
obici
ftos
turrill
nbh
silverstar
boudia
curcas
stuhldreher
scaletta
kranzler
sankai
textura
duric
glenmary
avgerinos
virmani
mutebi
autodidactic
triggerman
unessential
daiane
mispronunciations
cyangugu
larbalestier
coar
bruntlett
sybrand
vedantam
ghettoize
milstar
masip
capitanio
xingxing
peppone
bonhomie
headcase
brinkema
riolo
defintions
kiyan
alkon
girding
tão
piperine
purrs
druggy
wkow
boond
sadakazu
absents
willott
dongling
dicking
outcompeting
tunzelmann
lophelia
pagnozzi
zagan
grandfield
kishenji
minong
lutui
tôt
landkey
dumbell
paratrechina
antihypertensives
unsurpassable
kettling
pelino
durn
smoochy
allover
hishamuddin
kamte
goliaths
deichtorhallen
duthiers
metanarrative
diammonium
ligustica
valland
coie
lathen
windtunnel
silverbridge
bartles
intertek
gracanica
akopyan
vadivel
professionalizing
vanags
flanagin
pjn
armo
preziosa
narum
pokerface
baross
schotz
jialiang
kbyu
prasher
emigré
unskillful
diazinon
hisanori
xos
goldminer
numberous
darly
icpd
higgenson
adjoa
bsx
raphaelle
grinker
caressa
teemed
gassings
nayeem
sliva
vendy
wintersburg
mathangi
malaba
montenapoleone
fidell
giggly
amarasinghe
balal
depósito
fixates
commercializes
reeked
bobridge
mackell
klain
pennacchio
causus
nonresidential
litterateurs
noahs
pevensies
striscia
tursun
mimicks
himalayans
aimlessness
nothern
croisset
quasha
thistleton
rangnick
pomroy
ahari
pinkville
eddo
jaitly
meling
onh
metaweb
uuk
palmateer
madworld
palka
nure
wastell
yocheved
foldes
telescreen
berzon
nesar
jiuling
tunnard
gopalkrishna
hursti
potboilers
redefinitions
caze
dinerstein
jetbrains
kalamaki
yufeng
delbono
cartwheeled
heartrending
plancarte
narvesen
ekow
champcar
gallegly
ickey
dayley
guyler
gettis
lodwar
stefanowicz
mavuba
broderie
insouciance
rutu
milham
penylan
ostrum
michuki
yaqoubi
hongjun
makaya
korup
hollymount
astd
pervy
yanhong
recapitalize
odegaard
hunsley
unduplicated
hwyl
nvda
helfferich
xtp
jimmerson
sucres
moonfleet
nixey
fayzulin
schobel
sajad
barzee
brutalised
tapit
southcoates
aquis
repassed
wazi
présidence
beichuan
disconnections
leadmill
mezan
flexpoint
mayrand
staddon
gritting
pedlow
boericke
hutongs
parvesh
racho
lengthways
touchable
enemigos
lincicome
intermarket
baffinland
henkle
communicational
untrustworthiness
garduño
kyger
arano
yopal
riter
leitgeb
ohnesorg
susta
chusovitina
fortugno
hrawi
ashling
mortems
suddaby
economista
burjassot
besom
soldierfish
ticheli
periosteal
kindy
collectivists
alcopops
briner
fexofenadine
commercialising
bysiewicz
cocacola
alize
rwenzururu
abdisalam
sunmi
cottoni
gebru
notepads
fabe
cryoablation
grunty
baart
achaemenian
floberg
earthweb
zgs
jaitapur
jamilla
taula
rivermead
crouzon
bottone
ajijic
saltgrass
rashaad
gorner
monem
electrocardiographic
christianism
yoshizaki
geotag
pickfords
laroy
experientially
resizes
coosje
haselhurst
toadlets
westhay
autonation
novogratz
secas
ecosport
balneotherapy
gilje
smokehouses
sudhalter
whirlow
azmin
menagh
bladerunner
cigarillos
buzzcut
martinico
vivyan
groenewold
skys
biryukova
wannamaker
aberbargoed
springettsbury
goulue
pastoor
ringelmann
tubitak
chure
hueston
phogat
securite
fouhy
gambero
isx
jibran
latitudinarian
zampella
prq
regalbuto
turizm
seide
beltram
teakettle
etas
eschmann
galled
markale
pusing
bisgaard
hyphenates
afte
eryl
crims
motocycle
darris
kamaliya
palmsource
ndaba
akros
morasha
backrests
lemen
bobrinsky
fradulent
wraparounds
riano
jivamukti
inza
bluestockings
foible
kazuhide
tnuva
cohmad
urx
gurganus
gunwalloe
askatasuna
youngish
charlottes
constitucion
woolaston
indentified
imaginasian
apung
ramdan
drouhin
qahir
lascano
semashko
counsil
oiv
fleig
patchin
gedächtniskirche
cybercafe
ixo
cooly
langmaid
lykins
kzo
doell
harrisongs
pegrum
mourier
zix
lischka
jondal
sheknows
forestlands
dirndl
westferry
kolyada
goodyer
menahga
sébastian
slydini
recalibrate
bashall
honu
shithole
wardrope
beigel
penacook
fecht
graco
irvinestown
dubi
bichard
counterplay
saveri
langmann
ercot
tejal
shaybah
dipo
prises
takur
hebgen
idealogical
heijningen
filthiest
corruptive
nettop
ccta
davus
hoseth
randone
pumpherston
pary
gubbay
susac
samart
dessena
kittner
fahidi
lamasery
robilliard
actioner
armleder
ravening
navios
bresnik
oxygenator
hufkens
creake
tuberculosa
faluja
morphew
shirakaba
bradd
ojjeh
iskakov
matinecock
bellavance
lisberger
stingo
satchels
baille
naicu
halff
slayter
radiotracer
newin
hometime
flashbang
varnedoe
kawczynski
llangoed
azia
unknow
nordenberg
dica
kepple
maitlis
omaezaki
papillomaviruses
trashman
lutfur
osem
chesire
macharia
niida
paharganj
helfman
lennert
moggi
oglivie
siderúrgica
bectu
sawsan
mutsuo
capriciousness
gibeau
tandan
compagna
chepa
remax
llantysilio
silberg
euron
eurocodes
autocars
untempered
pseudomembranous
kightlinger
brainstorms
mondol
renken
llanafan
kauswagan
coqueiros
plumpy
pahinui
resealable
airlifters
arola
dhone
pilares
mondaine
imerese
rokka
burfoot
soens
tuffaceous
metters
crada
epically
stevey
martynova
veinlets
guiliana
kenefick
angelas
nurme
baño
underpayment
frankwell
robalo
hypermutation
hiyas
koules
chellam
shibui
mirzaei
edgings
counteroffer
societally
friedenthal
nokwe
housebuilder
corralejo
crudes
brunot
abes
shahri
hjs
vigeans
debriefings
bancs
guandique
bentota
vestar
mindgame
analisis
encom
larkey
reconfigures
unostentatious
schnetzer
atasoy
gliss
gegenbauer
moistening
bushmasters
laquan
skijoring
xiaojie
nebulizers
gco
penicheiro
worstead
scotlandspeople
nark
tasseled
teares
westow
beachball
ludek
kenge
tapiwa
talibans
wanniski
kykuit
palitha
gainsay
aamot
acidify
jarallah
cantv
goitein
neuropsychologists
desaulnier
perignon
korac
abeylegesse
glinn
aptenodytes
prps
miyamori
aldebert
iccc
nickol
fanel
trefil
meechan
literaly
hertzka
mahsa
uarts
sheepy
hougham
matlovich
antier
tricon
adonia
rrts
obeida
batheaston
unwra
wrecclesham
penalization
abiraterone
seter
nicholle
vermilye
benhur
sabaa
videsh
literarily
ecodesign
makiki
amona
doong
msns
joti
soyeon
neryungri
polyketides
bookworms
handpiece
schoolfriends
chinaware
czekaj
takura
greenhall
verrerie
yoa
seedier
shigellosis
wachman
wardensville
atlanticist
murck
krh
microtechnology
windhover
dawnn
perplexes
geeves
varejao
ossana
gjorge
outvote
karapetian
cdfs
wcos
fahrenden
fumihiro
montz
davinia
muraviev
sadullah
onozawa
qahira
fancying
helbert
gazdar
mbanga
shaunna
berzins
tretter
commixta
transcatheter
alphavirus
vilà
jarosite
baiters
sempé
illston
lodder
samoyeds
asaduzzaman
secuestro
varey
marcali
onselen
dfz
vaut
vaze
labella
pasquarelli
chetrit
auston
seascale
glenbuck
trien
emri
miniland
glod
fetcher
formlessness
lepard
cockley
dashon
umble
schmidts
büchler
goldwing
qalyubia
birchett
varadi
vernix
toughman
aletter
samim
zehri
rumman
bortoli
schillebeeckx
weinheimer
weijie
tantalizingly
spinnerette
borgnis
baldies
nagayasu
rejuvenates
gorio
pigeonholes
gustov
bankfoot
perfluorocarbons
mirafiori
whot
shcherbak
ruther
beermann
palaios
posess
glamorizing
blacklands
haefner
sassano
hoeryong
lehtovaara
papanikolis
talya
claesen
jaouad
blilie
thakrar
wordfast
checkmarks
flatterers
bettauer
shaari
tepi
soboleva
goarshausen
musashigawa
airtouch
unhooked
doorframe
ligang
dossari
boldenone
aurilia
skille
rectifies
calcott
obomanu
dahler
cabezon
edgmond
deutschlandlied
silverblatt
jungers
lindheimeri
whitsundays
marsans
sionko
seaburn
musse
rodeph
syverson
didenko
microbacterium
straumann
mmix
evaldas
portero
canaille
dmh
turahan
lemelle
freydis
feuvre
rajarathnam
mastenbroek
addlery
cholos
schwarzsee
tunie
loyn
gripens
tiptonville
pigtown
tarawneh
innerbelt
kellers
navman
haldin
willm
mollenkopf
doege
stoneygate
hadwin
belhaj
parapolitics
blackwoods
rimming
espelho
kutia
chaperons
gattopardo
zey
lugos
wycoller
residually
advil
rieppel
belsey
herzi
sanely
carbuncles
zhabei
heartstone
hpn
blotching
yuly
texturally
topalian
quandry
erbitux
nesrin
rbv
bellringing
lesnie
tumults
sylvanian
tastelessness
somis
kcen
ferpa
ostinatos
emara
whitefin
djermakoye
recentness
pangma
seegmiller
barlay
olenka
alphonsi
avlon
sadaka
glamazon
nimet
gmcs
siebels
cheatwood
rootworm
deltoids
llansteffan
stiebel
fessenheim
salicin
nbtc
schlicher
mediaflo
lensmen
zingaretti
cartloads
steans
werks
nattawut
sambath
anwari
delino
yardena
pirton
caccamo
hsx
mukhlis
gdh
claudi
adjournments
airflows
rechsteiner
golis
styluses
carusi
revkin
misgiving
masumoto
denegri
sheibani
maysoon
uvaria
rohi
schoenke
rheinenergie
yangquan
kuske
parquetry
sumantra
brisset
googler
marzook
mcconachie
malese
siphandon
hikkaduwa
klitgaard
mobilises
vindu
strause
greenhut
tcx
adiyiah
grischa
jummah
koranteng
rankle
noirot
éclairs
smallbridge
mecke
tetsuhiro
doxylamine
choplin
philadelphi
rossborough
nitel
oxybate
meulens
trolly
contortionists
ajas
pearisburg
vsel
kekana
wharry
panagiotopoulos
insureds
offie
curtainraiser
expectantly
soton
correlatives
rohitha
albala
duijn
moeaki
gabardine
robach
kaiman
dunnings
rhn
homogenised
inews
othar
outbox
resolvins
retox
deciles
salceda
boated
butzer
allicin
creepily
intramedullary
akumu
mahiga
roulade
fimbres
nangpa
lki
kplc
huarong
allix
djamaluddin
petrify
pietragalla
precised
fazzini
antoin
coinsurance
ambrosino
syna
foeniculum
cadgwith
primicias
yilma
sherzer
huilin
kilgannon
preternaturally
eeshwar
ladette
degaulle
zetra
mainlands
neediness
betten
tsujimura
chafik
argungu
gossan
mazzarri
lehmberg
whimple
trackback
carrasso
machugh
grumet
natuzzi
turducken
letchford
andthe
izen
hettich
inglesby
blenda
powderkeg
kilfoyle
spooning
kitov
tyneham
vatsyayana
hartburn
spanghero
haldenstein
nassos
renacci
barkann
proforma
sizun
wras
shumba
featherweights
houvenaghel
péry
illiad
disposables
moghal
odey
rozo
showstudio
questors
diehr
jesusa
loehmann
búzios
giblet
rajoub
nqr
shevtsova
lilach
locanda
devra
somekh
ilpo
yoshinoya
bonni
colloquies
vinit
biddick
schaffel
straightjacket
neighborliness
rolnik
mozaic
boomboxes
trapezes
haimar
pupusas
spack
hongyu
marketting
konecky
careens
bvf
sadah
malpartida
mascoutah
tronco
grafstein
cpes
warthen
terian
racoons
pashtoon
heidinger
faha
hilscher
olmeda
ravid
letch
winrock
cailliau
behavour
bergères
goffert
tcga
boulis
forna
heartline
colorama
hongyun
lindsays
edmands
fujitaka
alnabru
gorenberg
sabudana
oklo
yuldashev
fugelsang
zimny
kransky
sutterton
razim
hypoperfusion
deysel
malekzadeh
wallerawang
shifta
myrtos
yeye
garnishee
niçoise
ceric
kwamashu
ibbi
söll
paglen
frase
dozes
ucavs
tottle
lcmv
lehninger
yodels
serebriakova
notenboom
fabini
malacanang
pollie
zigzagged
wvf
mahbubani
whins
fuselier
amchit
sextillion
firby
mailbags
centralizers
cityview
multiculturalist
farmboy
scandlines
kostadinova
peregrym
suardi
tonina
tornero
redlined
ebg
kotkin
bernick
religon
kifah
bartholomews
hollyfield
relabelled
hamin
fiscella
bednall
genuflect
gussets
mackney
qualis
fena
leicesters
tarnower
beatties
manella
seroquel
tatang
lassos
shiron
closable
kilmallie
sighvatsson
woodiwiss
boscoreale
ghur
poeticus
arepas
loury
poha
dreamgirl
pboc
brunne
chlebowski
llangurig
mayom
vokey
skillsets
kalatozov
koom
salamin
bottai
enfold
scientistic
linglestown
bhengu
finagle
csts
crossbenchers
incaviglia
dutertre
bonura
perjurer
narrowsburg
pimstein
wellner
particpants
jrh
soula
protuberant
tianchi
greiss
teper
daybreakers
boonmee
nymt
allroad
kornhauser
mouradian
cowlick
jaffri
lnx
tonel
trimper
cassazione
schoodic
capadocia
houliston
kalikow
elfa
bracklesham
waul
gardom
aroldis
mccandlish
blurrier
aspropyrgos
bluestones
surnow
blitch
dislocates
stss
aldham
childre
grindheim
certaines
elsmore
finans
tjp
marreese
ropartz
ziploc
eyewash
wassel
karlton
takeley
gelabale
wazowski
andrij
esajas
lerangis
avvo
finma
hongji
porrata
secdef
divestments
mouriño
sapkota
washpost
uitp
shapoor
tieton
khac
recross
progressiva
germond
cashtown
elos
chemi
knuble
kusuda
miia
romanowsky
genth
corrao
plancher
hommages
eastway
dessy
klaar
verklärte
mccallany
mbour
mamadi
ottakar
oosterschelde
bceao
kooijman
kaca
eeles
pandin
saabs
pltw
zywicki
ppmd
kilspindie
rackoff
medigap
faiman
idolization
pepetela
dastgerdi
antipathes
edden
gamebird
darsi
fédrigo
promed
raupp
tabc
einbinder
chucker
coverlet
magnetoresistive
preeminently
piceance
hegerty
ikg
charnia
adaro
crawcrook
pej
knockholt
multidecadal
dvorák
littleham
revu
amoa
boalsburg
witchu
cosker
fortnights
moneychangers
etps
selinda
djw
synology
giannotti
adiel
santanna
lindback
smoothe
diyya
khazei
calloused
downspouts
huajun
raybestos
tanged
charmain
repowering
gröger
lukins
jouster
rakeem
upmost
facco
kepp
fassler
bartelt
caiola
amnestic
aquilini
gaos
kothe
sujal
hornbeams
faid
acually
bergl
eoa
dozo
mozartian
avow
biaza
scerbo
guerino
yoik
naturalistically
sherrif
paneriai
unlivable
relentlessness
balmond
louey
mikhailo
netl
ballymacoll
zuccarello
stonehurst
reserch
wjm
yogev
hoggins
smhi
chuff
satwa
jeppsson
khalip
eversholt
croteam
cuckoldry
roqueforti
tintinhull
mze
yelvington
nightfly
headcovering
siyanda
abdirizak
monvoisin
meisenheimer
womer
mocvd
thamer
reqs
roddie
eduction
unresectable
loverly
panino
housesteads
schoolmarm
teruya
ilyukhin
rayborn
pricier
vanderhoef
unreasoned
leriche
consecutives
titletown
midwich
zmeskal
livetv
brei
ibtisam
hdls
sadhvi
agalarov
ascd
glaude
dagpo
kedumim
koomen
papenfuss
khn
dfcs
varekai
crpd
giegerich
shepler
yandong
elderberries
threequarter
besancenot
hawas
brawdy
mallary
memorialise
krizia
ozturk
akune
electrabel
dorri
nanoengineering
yachtswoman
nightwear
gulval
emagine
beause
almalki
anyaoku
insulative
itic
dorosh
incenses
econometricians
semmens
teppanyaki
biery
cloch
juggs
blonska
johanneson
tabuse
urschel
drabinsky
teixidor
iwade
overfed
perfunctorily
scallywag
extemely
yizhuang
damascena
djivan
oppostion
taprobane
baali
komack
zezel
gire
septeto
skulk
telman
munnery
splitsville
ispi
barbarously
davinder
bartholomeusz
inus
zenovich
rangeen
tsiolkas
ranalli
snufkin
robilant
hajrudin
shivered
bakwa
holeman
bugaku
mitarai
ritzenhein
banny
cuttino
fortun
sothebys
goodnestone
henize
grai
garthdee
giengen
doily
altiero
elxsi
kalmiopsis
emachines
johni
sathyanarayana
menconi
foghorns
perminova
berniker
bienvenida
afikoman
onesided
kgsr
stegodyphus
yatomi
wajah
hypoactive
crms
decleir
lindbeck
pentacene
jahir
staniland
champetier
arlett
rodenkirchen
borletti
inflator
kensey
pinwheels
thurcroft
condotti
sybarites
nauticus
bartik
sepehri
formentor
umbers
settimio
almandine
hannukah
caprine
poinsettias
mossbank
krisha
ibou
dudin
karlyn
tetlock
aghaei
watsu
prerevolutionary
carletonville
scrunched
immobilising
zabou
lsms
kagwa
stensson
margreta
countercyclical
tume
kunuk
douanier
dualling
llanuwchllyn
gedan
domme
cnhc
pythonesque
cameraphone
whisnant
herson
brancion
teevan
miletic
overboost
duramax
grünenthal
intentionalist
megaplier
alamy
rluipa
wingerworth
machars
scaturro
getgo
buckalew
gambol
maniam
kunas
uiv
mockford
allina
kair
depositi
pantos
anstead
bufton
röttgen
sutherin
vacio
retegui
seghill
gobaith
anba
defaulter
nahimana
pontificates
kottaras
aló
krogius
laureys
scoonie
pereire
bastiansen
scut
treiman
dhaid
rangaswami
limmy
amihai
overcompensate
itslef
wtvm
thorbjorn
karone
meecham
baylin
blodwyn
ultrapure
avil
skrew
kouzmanoff
loiza
factortame
flatford
gaelle
belenko
krtv
mcallester
firewalking
ashli
proppant
socarides
juvincourt
anatinus
poneman
shaibah
propsal
mzungu
scattergun
torkington
sembler
mahayuddin
bickershaw
tuwhare
mepc
lockups
tosar
fuz
gaitskill
zozan
smartlink
tellme
kerlan
spaceward
naypyitaw
cheban
trolltech
kristiana
szeklerland
elburn
knapsacks
coronelli
cornball
carryl
vernus
bentt
ransley
ervan
dobrawa
gallan
sunbather
eskander
pineta
rirkrit
monea
bedu
psystar
northaw
kiala
fuensanta
tigrett
rubby
usinpac
buybacks
keilani
isuru
odil
perkowski
citect
millbourne
truvada
dolev
oishii
naby
hollygrove
kronenburg
villarica
bobkov
riffage
baldor
tpwd
babbles
irremovable
azimov
leapster
cherundolo
criquette
pleasuring
inhumanely
redgrove
sagana
cabas
vujovic
lillingstone
shantry
hooser
bendixsen
rzepczynski
mandawa
llangyfelach
sniffin
katcha
sirmione
boyardee
adnams
radioing
spirometer
promiscuously
alieu
millitary
arseholes
guasimas
leauge
titford
lenkov
diked
bazaruto
usst
addys
dickmann
estermann
sofres
dishonourably
aryo
taters
snoo
caraeff
handier
ocurred
jiujitsu
kyllo
tzahal
qambar
jackeline
brkich
konnichiwa
wanta
anesthetize
mardian
dylanesque
dsquared
clowe
goddards
drog
tcad
baroody
nmsi
pozarevac
daimlers
hootsuite
trully
phythian
steatohepatitis
germander
brougher
marcelline
doas
coroico
peroration
unmake
immunomodulator
celada
inlcuding
bermudiana
museological
tarriers
freis
gintis
borsen
hyperglycemic
saucepans
stathopoulos
werthmann
emprise
morefield
yulianti
dreama
makoy
khasanov
pdry
kalian
mongstad
leons
hairballs
mazzaro
quedgeley
hassim
shlesinger
upwellings
taigh
sickeningly
melaku
plethysmograph
seidner
cyngor
maumere
kangyo
savaging
dycus
laryngectomy
laminectomy
antonette
mangone
flexray
penone
geraerts
endtroducing
negahban
smertin
sarthak
ftes
sawford
balgowan
hestercombe
athiests
cbfa
roisman
dafis
seens
stonewalls
injuction
pressurising
cinnabon
inaam
ckmp
ammoniac
felshtinsky
kevjumba
giustra
hhn
desanctis
cieslak
miracosta
carlsten
bloodflow
talau
calamia
shaohua
alala
fluck
antrel
vicaire
budejovice
gumballs
zicklin
moea
rupen
haimovitz
franchisors
ekwueme
fleischmanns
pomersbach
gilsenan
arbutifolia
frates
rahmonov
accts
wakening
trundling
lexico
manzanas
keny
gamonal
finchingfield
cullybackey
disbeliever
mittermaier
unseasonal
buco
schoeffler
yarl
sibbles
thomopoulos
panke
tropicals
cleta
marchmain
ehhh
corletto
abdulqader
dubarbier
polygraphs
newspace
elbaum
noller
golshifteh
pavlovski
velloso
plantersville
ghawar
prespective
engholm
djokic
paintballing
mileti
pittsville
baitha
begetter
obdulio
zappers
jasons
wyfold
inpex
calderoli
fanton
contaldo
unevidenced
yihe
alexeeva
uncarved
nurgaliyev
goldfeder
caciocavallo
obligatoire
timbira
showunmi
imri
corani
honglei
mangers
kellison
obtainment
condescended
kerchiefs
falkvinge
canzano
boogies
pierret
cotti
ghedini
kanstantsin
croation
ogbonnaya
perrilloux
bovver
fcuk
seahouses
barabash
revitalizes
licko
multitracked
kyber
palelei
ericht
peyresourde
bialystock
holovaty
zarnecki
vecernji
yoffie
predappio
lapinski
mechler
oncoprotein
greatwood
steffans
stubing
champalimaud
forgacs
zmp
lanting
kinmount
payor
färm
lycabettus
driza
salong
schissler
copperweld
keiper
nandina
fronius
hoffen
scroogle
keam
majken
pakis
bergenheim
riester
zaran
tichon
mediavilla
ulead
gayl
balikh
bellvue
pullbacks
eurospeedway
faletau
silvaplana
libba
bendtsen
snegurochka
mikhnevich
cutover
dayville
aymaran
szczawnica
oxcarbazepine
histo
consoler
morre
chloramines
leontina
jaunting
handwara
keerti
asml
colefax
schaake
polynya
houran
maig
nizhegorodov
pachira
ticats
musks
mabu
reiterations
unprofitability
conciliating
rasouli
tudhoe
dexa
tannat
bergenia
frann
visvanathan
tayrona
ormat
sachtleben
crusell
desmonds
pactio
nefas
homerooms
stogumber
plakat
aranka
flamers
forswear
esomeprazole
loj
masterkey
problematics
decherd
bulverde
auldearn
verifiers
fbis
mansu
egoli
kanaks
btz
ivre
arist
filleting
undercounted
transportability
changsheng
algún
edilson
cinesite
morson
cacharel
swiftboating
wehby
ottendorfer
rhyn
interpetation
thohir
tomma
jumah
luhnow
caylor
mathen
wanat
sedates
fredon
deprave
flocke
sivs
eslinger
llangorse
mwv
meletios
tanksley
kohno
middleway
meisler
habegger
nevzlin
yakubovich
qorban
woodentops
driveable
ecat
refiling
zaiyi
hardener
ferihegy
treon
prizefight
agrium
jlpga
omarov
tatsfield
superlens
pmqs
speedtrap
gongshan
kesting
graviano
ehrc
coumadin
uspstf
manjiro
twizell
medicos
jeeter
culson
utcubamba
hubristic
dinette
recipies
gpss
trittin
adjacencies
mythologists
orduna
litzau
jamilah
dimaria
pallam
grandchamp
nutech
isoft
joppy
lucedale
rathmullan
darque
abrading
deekay
vlavianos
bichara
hiriart
maerlant
semrau
leco
kallawaya
migdalia
fergalicious
caravelli
diavik
sorgente
muscovado
edwar
allanah
sanitas
daumesnil
nethergate
innovatory
kailyard
tacvba
preska
zuill
inverbervie
expatica
cubie
rosti
bourgain
carisch
cranage
streatfield
gianola
immobilier
midlanders
daskalopoulos
korneyev
unbid
bhit
nischal
nyugati
vxl
emrit
dientes
susato
liebesman
predesignated
taynton
ksfy
krahmer
misdiagnoses
schellnhuber
nebahat
inefficacy
bartica
gasolines
stuffer
sickman
lyk
mimoza
zopf
intas
hejira
parati
ohlman
jfe
tlingits
gards
kitco
vampyrum
roşia
tekulve
aded
caudell
pearmain
mulana
enm
mutilator
beregi
landgrebe
mitchem
nyclu
oakhanger
discectomy
horsing
delinquencies
dykehead
mesothelium
saltimbanques
greyback
opentable
ezpeleta
sundal
moustachioed
tendance
ncrp
coid
pavanelli
leafield
rahnama
zins
kilembe
biodome
zide
birri
secretively
tullberg
greenpark
newgen
thrombopoietin
deflowered
alit
rasuk
sposato
scaphandre
bobbito
anophthalmia
paasio
sgw
refuels
turkistani
blackly
rine
spondylolisthesis
siddiky
guarnaschelli
loompa
brighty
winebrenner
fontaneda
nirta
geraty
kindig
hinglish
tuscon
srimuang
judder
kisber
dorade
ircon
elease
unbias
indianised
glassie
glaisdale
angoor
wdam
patagium
khek
boldrini
goldfeld
fardan
prophylactically
eliasch
senselessness
burim
lonna
waaa
equivelent
rhia
mbytes
tincidunt
seamark
unbacked
coursen
psychobiography
chitterlings
accessorize
disillusioning
nrlc
chiarella
hirschbiegel
vehicule
cullingworth
emps
worser
tuftonboro
mevissen
tamaryn
lekeitio
chichicastenango
goding
nivens
saloth
ruairidh
radiall
moisi
kolesnikova
frisa
brents
tetreault
lievre
woonasquatucket
sharri
zhaowen
eraring
yogen
mantee
northsound
disneys
dhafer
kerhonkson
terryland
rakt
kruif
oheb
wratting
usutu
crossford
honnold
heriots
gathright
friedrichstrasse
standlake
podebrady
roover
icrt
clawback
afgooye
cmms
rosano
hutches
immolating
ebling
basnett
sturz
kosch
cripes
mcgilchrist
guelzo
hagemeyer
fiermonte
articulable
stefanou
mjpeg
hederifolium
hadwen
jawaan
eyenga
wissant
sadikin
khayam
rusche
overinflated
dajuan
messily
olesa
ukil
lisinopril
mccleod
sasabe
dresnok
asciano
ingimarsson
bawl
jaynie
dockings
rykwert
nuha
mosab
biopiracy
scenically
carrolton
playtech
chiropodist
escombe
godwyn
alloways
bryndza
perambulator
burtsev
pentraeth
dillan
insulations
gritsenko
quantocks
sharaa
junck
coolman
baltiysky
ibookstore
artangel
casevac
blankenberg
aymes
reformulations
wayfinder
uncatchable
elliots
edholm
kazimi
swissport
chci
danskin
sagaro
spanbauer
magnoli
seneb
utti
mctear
skii
klemp
morillas
skah
asajj
tencor
sqd
hilke
abates
bitesize
outpaces
northglenn
hirji
wolgan
freeriding
ranadive
crooms
cifor
jennifers
mixson
omfif
brasilian
epcglobal
shigehiro
seatrout
jeté
helmsmen
gccs
fordney
polarstern
wbgh
waimanalo
gaca
manai
loopers
meanly
yevgen
chaucerian
cavey
miserliness
carnitas
mixtur
positionally
kietrz
verea
mikolajczyk
astras
cocentaina
atton
schoenenbourg
najin
xrx
bawls
quenelle
bakara
vainonen
plumwood
balikesir
sonero
xiuying
stagers
kamco
antimalarials
burder
counterprogramming
hoas
anthes
ajaria
zapater
chulack
wielgus
africas
troldhaugen
pasteurisation
buccellato
anying
ammos
danjaq
butrus
cotillo
warmack
israir
eizenstat
exford
breunig
vanhecke
afge
riffat
newtownstewart
suburbanized
granai
scrounger
kfsm
novozymes
sendup
snookered
deposal
brachetti
deramore
motoric
braw
malew
dearle
drinkhall
obgyn
coalwood
starehe
polyphonies
wolak
trenk
delysia
ibat
arhus
pekhart
panchita
kobel
swiffer
léoz
dolge
yanowitz
karley
widner
rsno
bertheau
nnrti
faad
disjoined
dreg
hammams
edwalton
budrio
nambaryn
qul
tanihara
kettani
stanco
csav
irinej
francon
zoopla
marinkovic
orginization
newbuildings
reinstitute
mesmerist
rothken
dragnea
piccone
shirburn
redesignate
odlanier
flameless
barvikha
jasminoides
lianyuan
microvesicles
wannian
noyola
sattari
marjayoun
ikettes
prouse
lorson
bouldery
yema
recognisability
sternness
beeswing
araj
dallen
mkek
transvestic
sherrick
qte
cheif
sigitas
ulee
unendurable
amandeep
magdelena
ridiculus
michelago
cadc
eoy
harmfully
galama
alayna
rafaeli
mellgren
penyrheol
mppc
maidenform
priceville
prweek
lbh
standfast
gallaccio
broschi
rainsville
elefun
dreessen
andenken
maunga
ltrs
phoo
expositional
salesi
cobridge
stoical
amondson
bessant
hillandale
slunk
soans
mcammond
nalidixic
appliquéd
eraill
todesco
risom
latife
alier
gdg
cavalierly
jegi
renningen
capdevielle
remorselessly
swanmore
mindblowing
bixente
simpering
celski
ramsons
maysara
suwalki
dominatrices
pulsates
rollerskates
antidiscrimination
vuoto
kumquats
ceibs
gadret
dmochowski
assistantships
villatte
duhart
nephin
vlj
newbuild
cpds
kokka
lefkow
strossen
impelling
froyle
corrick
disincorporation
cultism
tapash
chainlink
bédoin
whca
sloven
kadian
fortifies
tonderai
sedd
dangjin
shufu
achilleos
nugaal
jerusalems
eggbeater
homecomings
mikos
hryvnias
osedax
rollinsford
hornafrik
uresti
barrhill
trustor
twala
hohenwald
epicondylitis
bensouda
uwsa
naohiko
noji
ennen
kasza
sturua
lakay
megahits
toothaches
sahba
scorrier
papian
uigea
jff
brenston
aymaras
hariya
qaisi
cucuta
oedipe
surkis
dmgt
lacivita
beziers
meilyr
daylife
attles
godmen
unactivated
reham
mozah
juliaca
kelek
noordhoek
scouser
ljuboja
hrms
clou
farsightedness
chelonian
plasterk
schlangen
meiselas
cafasso
evenwood
osterwalder
veenker
examinership
anesi
toguchi
icid
blairite
expn
farinas
nanodiamonds
roschdy
escb
rigotti
shoplifted
katinas
vitreoretinal
cartonera
delocalised
paining
yct
birching
iparty
shahrir
urmanov
firwood
cocorico
ecms
turracher
pojar
gudu
caneira
hardings
unplugs
betemit
digressed
mohaqiq
komondor
wollenberg
hiltons
lafita
coltsfoot
lecia
leitzinger
cayre
citibus
ndna
naouri
peole
bowlmor
edde
modupe
irresistable
berin
limescale
imediately
wiggled
brandyn
farzan
dolent
scheving
braca
overextension
angelotti
sephton
zyazikov
langseth
carie
scappaticci
ridgebacks
beled
artemov
scampia
cheekpieces
keping
gatepost
coverack
moneyless
bramfield
massaponax
chaiwat
screwups
dyana
brooman
lightheartedness
gxg
melroy
geocache
vaccuum
foxing
otologist
masalit
opinel
branfoot
drakelow
mingyur
tennents
catcliffe
bandaid
bedpan
uksf
listicles
roseana
toolsets
hoblit
homelife
iads
ntw
beumer
ghorbanifar
crapping
endrick
hirsig
mineshafts
maktub
antiangiogenic
kaftans
sulabh
urumchi
copyboy
deogracias
yesenia
sxe
anzures
frv
electrocautery
hupehensis
miltary
leappad
albor
clypse
remanufacture
ergogenic
toubkal
undesireable
seroconversion
lochridge
mukhortova
viorst
fouque
gelis
flander
ciner
polga
morikami
mayest
mohibullah
chapstick
monkish
fanciest
itogon
screenful
kondracki
bagillt
lyor
crucorney
intercoastal
televicentro
poitrenaud
lynche
adney
berdos
froma
qit
undateable
aliveness
farecard
photocatalyst
liason
rayn
rumohr
demotix
bristolians
harleys
kalhu
knifes
tilders
sermanni
laurant
openvg
hileman
odongo
constructiveness
jurich
guincho
meerbeke
laibin
kaena
hendri
cobbers
maese
leftenant
fargodome
reappraisals
firooz
bukha
schroedter
oxenhope
fausset
wilkening
calarasi
lefebure
automattic
ringz
cheonggyecheon
geschonneck
gutsche
migron
pollutions
hdo
quadir
yinghua
mahowald
bajour
berntson
rtas
abessole
adex
ternhill
telesforo
crucifixus
spittoons
sahibs
amponsah
faucette
nisp
dunlea
unsmiling
postured
dragic
abergil
bizarreness
hauserman
conection
experianced
furl
liverman
eule
ayeni
mumuni
dervishi
bavington
oladele
copulated
lambir
sindall
ruskington
mccains
paralelo
millisieverts
snowbowl
landells
dawyck
mutinying
slic
kromm
baroudi
taake
suryani
bilsdale
servals
quartararo
antirheumatic
hoogerland
modiin
brog
aigas
overdale
abdelatif
journalistically
gaag
bumbo
fttc
bustros
heartwell
raduyev
edutopia
prodigals
varinder
excerpting
gabulov
dobermans
stussy
zud
lykov
pastafarianism
linssen
louro
usasoc
motsoaledi
balis
tzachi
kostek
linfoot
azmeh
monzo
rössing
epaulet
sweetbreads
abdennour
nugatory
schuback
dubiousness
helghan
blisland
openreach
yakobson
redbus
albarran
cockrel
xiaolian
ujiie
fazi
prachya
hydroxyurea
laisterdyke
emollients
melvern
holditch
seea
trimsaran
orietta
uson
buchholtz
engebretson
walleyes
bryncoch
lifecare
persuadable
minurcat
athor
mcquilken
unregenerate
gesticulating
maurilio
gozlan
hulland
hasheesh
rabinovitz
bmmi
pawleys
begleiter
tutak
orex
horschel
tsukuda
aregawi
monges
abrashi
sopo
axels
künast
harrowgate
durdham
blanker
pratfall
cerato
fults
sherill
stenotic
priss
spinball
bremridge
terenzio
thamel
angely
tufty
workboat
jaccoud
marymont
undernourishment
baishui
castorama
newid
grimmy
torrin
raincloud
coalminers
carmelitas
mancusi
mooching
slacktivism
atlason
sijie
goytre
smdc
antiracist
heaston
geovany
vidovic
sfinx
mesnier
liliyana
velis
pullovers
soluk
cji
benstead
comentary
thamm
fluegel
agner
disburses
ncqa
photoreconnaissance
southtrust
bessels
insinger
forecloses
banducci
naraghi
lomography
ventes
rareshare
romagne
hutzel
visioned
enalapril
schmeltzer
snitz
dxs
bearingpoint
airmotive
optra
aswini
cristol
acacus
shanno
tranquilize
klyce
tekere
edified
postell
rasnick
bucchino
instantaction
cervids
bsce
bedhampton
horsehay
rokos
degroff
forkey
bierk
nuzzle
elyakim
rosarno
cloy
sygma
krautheim
botataung
parkmore
kaillie
paratore
sitwells
tricomi
firedoglake
steier
menzi
rusizi
identifed
swerts
epro
conviasa
hardheaded
rearrest
laguiole
boulerice
countires
porthkerry
mccririck
schaunard
petina
murrison
quebeckers
weild
burde
victorianism
ncoic
porchetta
wowee
lddc
mozzi
weeraratne
zarrar
spred
jayalath
genedlaethol
altha
overlayed
unwraps
mythopoetic
wriggles
neros
jackel
calheiros
blackler
bournes
uzice
prestonfield
touchpads
turso
umberg
rohwedder
bouboulina
pepy
milnacipran
granot
zoncolan
bensham
gondwe
magnetotactic
bashore
jethou
conspicuity
facr
rosewarne
placerita
elbrick
ehrr
machain
idj
nabataea
dilweg
schnauz
myuran
melberg
corncobs
acera
swid
unbind
melkert
moodier
kiryandongo
coverups
goodwell
emporiums
melphalan
manitowish
wame
dellucci
baltimoreans
mountebanks
ceredig
hellesdon
untenured
simango
codirector
shamdasani
marygrove
dhaval
supression
finchale
windowpanes
csci
mutki
fitte
mcpake
bzh
lecker
dexedrine
civically
mamanuca
shishmaref
myheritage
patricof
mickleton
marianos
gizi
superflat
geffrye
rubislaw
trenary
teeman
arakcheyev
papava
beanbags
wordly
asot
glore
jayati
stae
togiola
nongoma
govs
aroung
dmas
volodia
musclebound
yens
sunley
hyeres
kluczynski
sury
pakong
canottieri
egb
essiet
crapload
imperiously
trouten
aytes
swint
arbaces
karva
fontainhas
arrigorriaga
hydrokinetic
zalla
handpainted
nccl
tennakoon
tankerton
bouchaud
uralkali
zuccarini
vcsels
rawski
presidental
woolos
keratoses
laymans
rydon
sneetches
espalier
slipcover
flather
bechtler
muddiman
buynaksk
marcoses
snowbelt
hanadi
winterslow
lliw
bockel
chollas
reile
bmibaby
nodak
zingales
cabdriver
genter
blacha
unpiloted
karalis
elmiger
mcmuffin
parbo
dealth
standardbreds
casaus
schneiderhan
manriquez
dresel
maxamed
ucil
hemorrhoidal
nubira
braatz
coercively
stortorget
stirrat
bewitchment
debauch
muliro
nijjar
krautheimer
bibhu
otgonbayar
mukkamala
oghi
grindleton
ipana
perzel
tidily
tohti
dooh
boroumand
modernizes
wangchen
crathie
peginterferon
quanell
blinkx
goulston
sê
refashion
efthimios
linell
sillinger
politian
broughshane
peffermill
merendino
tisei
skyguide
anfernee
dirr
pandorum
barraged
activeness
oilwell
honkey
stenehjem
guarnere
japonisme
rexel
tauss
reetz
onesta
amendable
humiliatingly
verel
tassy
baathists
yepez
hieber
laytonsville
idealizes
bilsborrow
earthwave
kummersdorf
manríquez
russek
swinderby
zieff
halwill
uuv
knaup
fazzi
wmsl
empathizes
copon

ermete
hayfields
twizzle
dominey
caveney
shobdon
bisaria
dunlough
theologists
giudicelli
hollyhocks
deuteronilus
mauney
wholely
rishel
loade
bivalirudin
steindorff
salati
korengal
grazian
humectant
zoysia
lamarckii
archila
anusara
ndd
copen
bolad
dorthy
jacono
tapulous
limpley
muranaka
atomised
dkar
smoulder
doorposts
daveed
paperworkers
kadivar
qosi
vactor
wolfington
greyness
oduya
amag
tabarak
greenhorns
polynice
natalegawa
pocos
voluntarios
acci
rescored
pastorek
xtube
weligton
gloeckner
errazuriz
racan
homesh
marichalar
disbrow
relined
abqaiq
contibute
gamerscore
antithetic
scartho
devolder
pseudonarcissus
pezzini
zuidema
dcmg
cadaverine
mulliqi
furusawa
colyn
undoable
shapingba
terrett
chlorotica
creţu
bacai
ncci
feba
dyanne
elbogen
feock
golon
robosapien
varndell
artley
sketchily
guilfest
contex
electrowetting
mazzotti
wliw
karantina
channahon
melnychenko
unretired
canonbie
dornsife
avrum
barathea
villans
syjuco
watchwords
shenwari
matea
kamdesh
tannishtha
linaker
limoncello
eephus
manifestoes
raboteau
unaesthetic
trouver
pertec
postillion
simler
chatr
shimmery
pcps
genex
govts
rummels
gheit
kinnon
elementaries
reengineered
padborg
tanishq
fattahi
quimica
tiryaki
arjay
retch
anquetin
serama
rudine
nasolabial
graduands
polymathic
kadugli
dks
westernize
hickton
gilwern
cortaderia
entombing
aptness
commerciality
showest
bolkvadze
sbdc
wyszynski
longi
polariser
couzins
artisanship
jsow
exar
htein
sumud
nonprescription
omgpop
windo
bindman
grossology
exhalations
karonen
potamos
pudney
pleating
mischer
trevitt
impenitent
beausire
baltan
nfx
mastoiditis
calitri
blogtv
aloisius
entine
tuinei
tahiliani
rakija
pilarczyk
hemagglutination
perreira
calcars
whing
wfn
corralling
woollcombe
injuns
varenicline
botc
hoseason
cernak
roohi
zagged
cairnie
dayanara
skirbeck
hourn
surace
studebakers
idiosyncracies
mallas
marrows
cogenhoe
mezcla
mahmoody
riddings
shamsudin
guangchang
zyb
algus
hipodromo
ticciati
workrate
worthiest
hhe
saunt
kctu
vatersay
googlebomb
horridly
darse
hemianopia
nonstarter
guruprasad
spiva
gamber
lacemaker
resit
phrai
unificationists
reigh
leasowes
innercity
villagomez
devoran
aboutaleb
outqualified
sermonizing
addded
raileurope
elkes
lovaas
octopodes
munkeby
filmfour
stolichnaya
kohm
schrafft
alwen
unhealthily
lamelas
mcelhaney
kayalar
scheible
giardi
mikulas
fundoplication
hannelius
cheesesteaks
buchannan
essaying
kiddington
therof
fawzy
twx
bujalski
arginase
balslev
korbi
mammadyarov
whittam
ultrabattery
ramrao
scootering
lefton
drizzling
ascó
pbrs
mehanna
hepher
kondh
impliment
diffey
sunburns
celliers
pauletta
slynn
ruffer
reacquisition
homebuyer
starbeck
colajanni
medalling
orten
longboarding
bergisel
voca
septuplets
assignations
tshirt
meisl
mopes
dijak
hwl
kosove
ludbrook
underclothes
trakker
rebney
teisseire
baxt
klor
thegame
bepi
psnc
premonitory
fotakis
limewood
ewww
revillon
recondition
synar
timika
becoz
alkis
underbid
yongbo
kiddin
overzealously
substanial
mohney
kakade
maafa
maxxpro
pascoag
assasinated
biologos
bodrogi
shigefumi
wasta
dorinel
penalosa
betide
gupton
xpcc
cukurova
rickaby
luiten
soperton
radiowaves
freij
kilic
dhruba
thomsonfly
vauvert
houseware
mvrdv
walem
motioning
flavourful
alferd
calangute
whitehills
jovians
placemaking
shrawan
saxer
mckinnis
abderrazak
mildy
maintaing
bickington
polzeath
broggi
chaddesley
ieps
brilliante
yantis
hitner
marong
gandhis
parities
shawanda
durando
cockers
elbers
cockman
staiano
damrau
steinbacher
wibble
contractionary
araldo
pelargoniums
fengjie
nonino
conks
ashjian
honeycombe
mandolino
shukrijumah
hamburglar
kreischberg
moominvalley
nowack
disseminator
illuminative
hafei
gwion
therma
nanoscopic
roughhouse
embratel
buechel
attukal
delousing
corteza
schoenmakers
rohrich
korshunova
bernadino
mazzolini
turcat
timothee
annotates
pitcock
poyraz
jakupovic
martagon
magsafe
freimuth
intraspecies
lemasters
qmu
hawgood
krx
champoux
swartout
maconchy
hsct
ramson
xiali
dmfs
snocap
jersild
generativity
manyata
stunk
reoffend
bernado
pennel
resurfacer
rothaus
beckstead
crye
hryb
tkeshelashvili
nenita
netbase
rokin
ncircle
pakora
booij
timimi
gendall
akognon
potsch
pontcanna
kadota
raib
bosisio
coldblooded
collimators
grelle
rabigh
moisey
midgham
katsiaryna
meole
glotzer
ledermann
thronging
dadis
smallcaps
galinski
patakis
dith
ouakam
jeanetta
grygera
astroland
kristeen
gratae
mouthy
breitbard
tradeshows
shinkin
shahrvand
raikkonen
distractingly
queeny
lavere
activehybrid
hde
deonarine
bodach
wesa
zwane
yingjie
benzing
dolbear
ubad
kurr
tprf
relatedto
moggy
agrestic
naofumi
buseck
gaslights
daters
firstbus
keavy
asotasi
exotiques
mirwaiz
vivie
wnuv
kranjc
improvs
omnifone
abatements
minumum
hich
mehretu
shuangqiao
businessobjects
tods
krehl
carim
waber
fieldorf
lcmc
kamanga
parad
dvalishvili
gritten
pavi
languge
creveling
ringwall
kunak
dmsa
spys
moiseevich
pentney
ojt
outrightly
greeba
geocachers
keqin
alterable
reactively
weissach
mlpa
corbacho
achievment
nimroz
polruan
civilise
azhdarchid
malissa
seducers
queerty
cottaging
techwin
fabens
krugerrand
kuragin
snam
outrageousness
lingor
pharmaceutically
inquisitr
uspap
zolo
abominably
lunik
ishiba
merad
françafrique
ainun
raisi
divemaster
rockhard
heartthrobs
beidler
neylan
chalonnaise
wideload
avini
olanrewaju
dexters
eckles
bornholmer
camelias
mcgeever
ritchies
mahmudul
trustmark
densus
brunsdon
ambrotype
mutchler
tepees
chickies
ndabaningi
bayoneting
payables
condemnable
graffeo
polaski
vhb
roketsan
whoomp
ammara
quinter
artemi
imz
greenboro
loyalhanna
popalzai
mcgarrell
handey
cxo
galdakao
lightcap
klatch
ekimov
ebullience
eurodollar
ditchingham
dumes
dimap
stear
schnucks
refreezing
pluss
bawku
cookey
contries
kuumba
masimov
schiattarella
quiting
yugoslavians
remotus
henschke
jokin
soyza
iraqiya
sartin
saharon
gutbucket
kamrul
grabham
unlocated
laggy
overpricing
ruakaka
ceske
dysmorphia
hoba
earlimart
biljon
reinwald
ratm
lingafelter
beverlee
naqqash
tdap
wolkoff
radulovich
muzzleloaders
bame
iame
jafargholi
boisture
chegutu
easterlin
ejike
nissans
seitaridis
secoya
ihd
lauran
gallman
sestito
lakhnavi
humphris
celotta
pittinger
mudavadi
kirven
briegleb
chinwe
hrpp
klapwijk
throwley
soumillon
futuregen
selecciones
bastani
karsts
manoeuvered
palinuro
irms
usherwood
southcoast
neukom
louca
hawx
malmborg
valdobbiadene
bowey
footstone
cavanah
kirchick
niedringhaus
sarcosuchus
excedrin
sabaudia
denbies
antiretrovirals
arkansan
scummy
rubinsky
kaylin
soonish
denicola
legrande
scapino
xom
dhlakama
prolixity
faughart
zorana
todaiji
perezhilton
pipilotti
colombine
confreres
czuma
saquarema
dtz
seethes
roflmao
unbiblical
montres
galbanum
oscommerce
biltine
sicco
hamisi
eulogising
avocational
xinfu
worby
flyger
diki
hagadone
witteman
saac
gianbattista
tsiskaridze
bokel
kishorn
envying
knoebel
lustleigh
partier
invigilator
marata
undershirts
bocco
lebowa
lotfollah
exhaustible
abaunza
overgrow
zentz
hoeber
marvyn
hedican
mccleskey
puelles
kuehnle
waing
mcgees
eyden
gogia
hassabis
scythed
chones
haloti
teleconferences
abag
traymore
tongchuan
beignets
mezuzot
ninkasi
sidoti
backhands
ensalada
ybas
petrache
pontsticill
estuardo
politicising
saltis
lavallette
ctas
impenetrability
karipidis
lanken
caerhays
wyth
hocutt
borell
monthes
rogé
cinestar
trupti
revenges
henahan
pfanner
preise
rends
macheteros
leshoure
multiculti
kyungnam
chillenden
bubalo
jambor
pursley
tmetuchl
conglomerations
mckayle
strous
vasks
pettifogging
puranik
uset
anso
bainum
qna
uncat
whinlatter
bittar
tarutao
chinu
pedregon
ncfl
macronutrient
nfkb
thorncombe
wcas
awak
hasin
paams
bozz
superinfection
mokes
suicidally
lininger
montsant
lifeboatmen
telepathology
convos
meagen
aers
adition
bleadon
biancheri
biotherapeutics
hasheem
kosmotras
badwi
twats
keratsini
musts
cutepdf
spicewood
ilsinho
wefald
bfca
altemus
pensione
yeter
villalona
ecologie
dearnley
barbwire
usaca
fanless
lituania
ditf
fibbers
hartstein
mozhaisk
webtop
fenyvesi
globalflyer
belived
severability
kyobo
badh
cissi
andritsaina
invista
sonographers
gwy
cernik
hardwire
purveyed
fovant
monsal
fishamble
cinefamily
rubano
artax
siegle
gtaa
dunecht
frommel
rpu
unquantified
staincliffe
loakes
koprivica
veech
hipness
billyboy
sdps
ruhrgas
drafi
mahamud
elides
orangewood
moominland
ruha
rozell
goatwhore
tvpa
hunchbacks
coronated
shillingstone
shortell
khalifas
wordage
zanganeh
eiden
sharrett
waab
vesko
icbl
chasses
nachtwey
watnall
heitner
thurmann
arbitrates
knödel
kohnen
telecomms
clannish
rusnok
belterra
shovelled
mputu
okell
jiyan
notarization
middletons
zdx
baches
ebbin
zubairi
mianzhu
dickran
tsec
citrin
acroteria
kellyn
dissatisfying
littl
grethel
bloats
emelin
zacharo
extrem
selph
iabc
creamware
wusb
npca
reoccupying
somontano
kremmling
malara
adeem
maarit
nannetta
guarin
couette
cheik
panitz
continentale
haemodialysis
catarino
negs
sennybridge
zamka
vishay
swanner
strichen
photopolymer
jeramy
gaft
polyneices
gazey
spearpoint
dualled
gundelach
rilwan
zaloom
triax
creaks
dasarathi
tasini
kortney
seeduwa
nnk
pineridge
bederson
tsotsobe
limpar
caldwells
lomon
sweetney
braungart
rustled
everbright
somehting
industrialise
croyde
croda
sodales
augenstein
jamain
ledsham
rodhe
leang
brockhill
facepaint
gdrs
ironist
paskov
bepicolombo
whiney
signifiant
loggie
maing
electrotechnology
mcds
sdsl
belcoo
prokopis
maffs
mcgeown
andiamo
jalees
harpaz
northrend
haros
platysma
davilla
schnapper
geyserville
ichinohe
channelview
lambdas
evis
doocey
maryinsky
humbie
pertis
zikri
mwnt
ilych
omnicity
centocor
staka
pulverizer
deryl
cabc
sterlings
sidd
egide
morston
gasifiers
candidats
setsuo
ballboy
visan
skrepenak
brajesh
vork
buntrock
thimmaiah
mccumbee
tchs
sleddale
ribbleton
bolkan
aufderheide
reifert
lavagirl
fuchsias
overexertion
kalana
yudu
citings
weidenbaum
spanx
redel
lagardere
torkelson
hunston
parentless
tantan
beelzebubs
dokumenta
legian
rascally
darusman
duiven
kobina
huizar
bladenboro
monasterios
benik
mamic
exposito
higareda
ratiocination
tufnel
kamtapur
epsiode
roseraie
kanowna
sullins
thrashy
bombmaker
zehner
peir
whitfeld
lby
lawing
nympho
unburden
qfc
sylvinho
sengi
golston
siptu
kingscliff
naaqs
viscious
hamedani
samsom
hacktivists
selvy
genstar
vosne
alagia
shweli
dorando
flygare
ubukata
tamiris
presentiment
tike
amicizia
gainline
korvette
tecpan
informationally
diski
rollouts
hepu
improtant
cienegas
bassong
blaubach
shenay
holender
misalignments
hybels
queudrue
crossgar
brondello
imbaba
holomisa
allio
dalwhinnie
chatree
stahelski
fuksas
olowu
outgaining
tubingen
headlee
geiko
multiunit
jenniskens
shuan
fangirls
jalila
ramsays
nooo
cognisant
lindinger
eveything
angeleri
windier
gelston
pudor
karagöl
herfurth
trepanier
ribena
bufford
enoteca
deininger
boxhead
regretable
eroticized
malyon
sandeno
boggis
massett
nagarhole
morfydd
greenmarket
workbenches
barchi
poletown
dahalo
elvi
scottishpower
bluehost
mogilevich
garcelle
mckenley
ongpin
chudzinski
guilloux
gvg
vart
eagling
diah
livarot
grapnel
chingis
sepaktakraw
tranquilli
ilaga
xingfu
bottomly
sehnaoui
onevoice
kenzaburo
masey
iltalehti
optionality
resettlements
ironik
ragghianti
sinohydro
steinhatchee
funks
drumnadrochit
hambright
porfiry
frik
halswelle
skane
sandefur
manc
snigger
channu
rabone
roriz
etj
eyedrops
mthfr
mbarga
huatabampo
rufforth
phenylethylamine
eventualy
interflora
serrao
beaverkill
privee
gscc
notkin
mudhole
asgaard
farmingville
conserv
bido
momenti
cager
fladmark
lukasik
hannawald
marlar
lroc
schiefer
vulcania
jmk
ciso
terzieff
wibe
cucuy
moussi
analogic
nrha
halterman
nodari
boetie
toaff
pinboard
dgx
ndri
picanto
issie
tand
vejer
changeovers
maddrell
huallanca
wadhawan
enyeama
weech
boisse
ratnasiri
brickbats
movables
contadina
starlab
creatura
brzezinka
zitron
rangely
mcconathy
valpak
kisielice
pachyderms
grousbeck
antonveneta
porticoed
maiani
buspar
critcism
simrall
romary
resendiz
houris
viiv
junxia
valrico
merche
cowger
mohnhaupt
xiuzhen
writhlington
sevoflurane
valhall
craigmyle
sontheimer
amama
giertz
aleshire
unconfident
testbench
dilon
verta
shailer
longson
jacobowitz
abrogates
straightline
fusina
reclad
kemperman
ellida
sunlike
celeriac
ukelele
brueggergosman
tiernach
poruri
smoketown
joselyn
danieley
paperclips
harmoniums
vilayanur
soufan
nyishi
khunying
breathalyser
liko
birdoswald
antia
darkrooms
pullens
horkesley
borrani
fezzes
rotunno
oundjian
mcroy
aquafina
twitters
lavagnino
hoffmans
fanhouse
yebda
outclass
orbitting
taotao
hanifah
gwahardd
glacé
pitmedden
allisons
sobolov
evercore
icna
aasb
skegby
pytka
accelerants
mcclard
tabbouleh
garnethill
lamborghinis
taiichi
mallie
nads
glasheen
zonia
sofinnova
anorgasmia
aycox
farizal
aast
zenoni
simiane
jewelweed
istm
arifjan
matteoli
yacoubi
aabc
paiz
heter
niaaa
acció
arzoumanian
ayyoub
nokta
incantatory
harrellson
chiran
mukherjea
deadliness
euphues
butyrolactone
minney
beauvale
infuriatingly
veruschka
xiaoqi
railey
huxham
burtka
hellens
delacruz
underwoods
keiter
zeewolde
dicha
muddler
succentor
babaoshan
loopnet
isabeli
sugested
unitard
vailima
abony
pataskala
resco
americablog
perkovic
bindoon
sennelager
asbpe
nedelin
thta
mji
unaccomplished
similair
duss
buchannon
austrey
bottaro
epalle
unclouded
starwave
caporaso
jazzin
fnx
alire
nhis
wheezes
labarca
barmaki
abdulrahim
kreindler
digibox
wjtv
writhes
yazdan
unharvested
alipio
gayler
aeschlimann
angelical
zevin
glenallen
buddo
rutsen
larrimore
korsgaard
zialcita
rbh
yongan
terroirs
zilberstein
frand
shallowing
obertan
elmohamady
cagny
egnew
svcs
teeb
weas
antidisestablishmentarianism
deliverers
iicd
geesh
bostich
bonifas
dyomin
maldef
yre
theu
psittacosis
apopo
servillo
verint
nofal
lytes
gongyi
bloodsports
afrik
shoushan
mancunians
dimness
cassoulet
rebuilder
reinharz
skyhigh
reira
groundstaff
entraps
bebout
exorcises
clayburn
richar
rabei
lucked
pinkies
transloading
groomsman
antan
mirbat
grannan
ogiek
tidies
essayistic
computerize
lengthily
ezeli
dochart
freymann
suer
littlebrook
sattelite
waramaug
lubo
safesearch
escandon
nuckols
spiriting
istithmar
cabriole
traprain
freindlich
sogou
champy
jary
tecktonik
lligat
sportwagon
blw
hamutenya
rowsthorn
aschwin
redstarts
vaccarini
grenadians
mgrs
chiya
offcuts
holekamp
aasan
jinga
laconically
unintelligibility
eccb
lychees
shatford
paddleboat
belkis
doddle
kinmonth
pilkhana
burtenshaw
mountainville
nowick
koellner
bjoern
bathrobes
codicote
sehome
ugland
kesse
comparitive
tanwir
barette
controll
radiowave
grens
marjanovic
andrean
tavris
batek
mesospheric
witha
rigsbee
nasaw
breathability
kenniff
weijun
nycomed
dörfler
statelet
braila
mussen
depite
emplacing
rosica
mindfully
iknow
canion
ogonyok
abshir
ambivalently
nethy
chequerboard
craftiness
bawana
snailbeach
mokae
hillend
zhongliang
rasiak
cprm
scampering
soud
blazar
sonenberg
inamoto
ducruet
nivkhs
lillehei
rangiri
sequenom
lorwin
githongo
sumthing
parred
msec
gooley
sedlescombe
cochinos
gotshal
katiba
fakoly
ronette
wanni
gazania
ripstop
minibar
commmons
weissenbach
reassume
jaras
penwell
geia
bput
cedis
ccea
franze
follie
bilges
glor
munto
lidy
koten
overcharges
sfmta
mwg
nhw
ogunleye
bekri
jerramy
dokan
runte
predannack
gostick
harbisson
yuskavage
havi
shrubb
megabucks
westell
assp
delectation
abrines
inexcusably
yoran
etelka
neeb
unhitched
fictionalizing
tsampa
fryzel
acquirers
aslockton
drawcard
loftiness
bogeymen
llagas
steyl
tippers
wevill
sweetbread
koelewijn
nanan
bismol
nicad
japes
ekp
komla
parve
zernov
dmdd
falconetti
transgenesis
overstocked
qap
raciti
eezs
cotana
vorhees
tamaya
scudetti
bolano
mulamba
metzelder
küpper
boasberg
archstone
gowy
eurocar
exsists
katchen
wijekoon
xiaoyang
fayemi
cooperrider
terrordome
mosset
bryghus
matsuhisa
lecanto
speedos
abdominals
pandove
gratuit
moviegoing
prevalences
henjak
awes
logjams
etak
cyrilic
volpini
hedonists
londyn
codger
allbutt
hartcliffe
sundram
smoko
stucki
insurrectional
khandker
denenberg
texmelucan
hooved
movilla
evoy
kalantari
hiccuping
khristenko
veverka
recommences
pentaerythritol
wirtschaftswoche
funtua
tokoyama
colugo
leatherheads
garscadden
torneos
sivanesan
leilah
pinton
afagh
vikar
guanling
doublemint
forebodings
nancys
klich
gazz
amyx
oefelein
querejeta
deaux
detouring
totto
webcasters
talland
westbahnhof
waith
cvrd
townhome
sweepings
cookley
timberman
halmosi
bojo
okemo
nerem
belorussians
dequenne
sougou
mursal
ingrow
halver
backsliders
unpremeditated
lainez
defensibility
banghart
sabayon
samho
premaratne
badgett
harvel
darfuri
chlamydial
opuwo
karlmark
goldhammer
relleno
gilgo
keeran
inkwells
refoulement
kohe
tretchikoff
rusafa
financeira
schapper
downbeats
quansah
aerobus
sgorr
thanvi
kurzawa
afeaki
tollman
voluntown
varenyky
madrassahs
lafca
bringuier
qadisha
drub
oposite
hussian
nurney
emtala
nyatanga
anyhting
baffour
collectivised
coert
angioma
strunsky
mislabel
copado
otniel
cruiserweights
eith
randomisation
ehen
greffier
fflur
juckes
premack
reprove
ajt
neep
embeth
concentra
sowards
anshen
hipkiss
austan
readymoney
caitríona
keshar
keleher
biosimilars
revalidated
hagelstein
rusling
wrangled
bchr
concow
sweetwaters
mpds
kathiresan
shumard
leslee
kontroll
lucita
ancestory
schoeneck
lasvegas
poquelin
tyronn
caires
feca
jobsworth
limetree
machrie
natera
mesick
whelpdale
flamberg
aloneness
minnear
ashenafi
honchos
fitschen
zubrus
kamus
thecityuk
baldemar
exia
hammed
keram
enak
lxb
bocht
gelineau
wineville
suhaila
garretts
manfreda
stairlifts
henault
suttee
sinowatz
huether
karnei
pominville
inextinguishable
belue
frisked
loppi
mazher
terrazza
fritas
oystering
nuca
délices
cmmb
millefiori
balking
creaney
ximending
cpoe
bajuk
elblag
scheetz
regurgitations
hubbins
kanwa
kimutai
compunctions
jimm
cancale
monans
phumzile
boet
pigram
opion
khodr
koumba
arbitrageur
pheromonal
shairp
mahbubur
snan
hannig
mcconnells
kehilat
foxhunting
makosi
humanizes
avst
taxies
rebuy
demeco
lougee
nsidc
strivers
egusi
acab
yurimaguas
hamiguitan
hatchings
selectee
gwynfryn
souri
hydras
nogle
alfon
sorpe
puddington
kasavubu
stumpers
frpi
fantabulous
drennon
sisterhoods
alwani
plumped
sweetlips
yerbury
tempier
grandinetti
tostada
paratene
titman
transmutes
kahlua
seymours
totok
boyda
linby
amital
levieva
gradney
hurka
sulkhan
conero
kropf
boasso
teleflex
dcri
torpedos
christofle
piaui
blackgang
meningococcus
bunget
fiserv
gwrych
infratil
denuclearization
avedisian
yumen
telogen
mirchandani
kringen
deshay
colesberry
fremontodendron
ignac
fathallah
shelekhov
alemtuzumab
glatiramer
blastocysts
boguski
eurocorps
hajari
xnview
safaryan
ballwin
hiranya
giubba
mihoko
rayl
loreburn
basix
sickos
yuhanna
farmyards
brittani
averroës
wasserstrom
grafenberg
adala
fasal
kontras
srivatsa
yanbo
stabbers
duhe
gysgt
barrells
sneads
quieten
lccc
lupane
hinga
chalfie
hellos
scoby
riduan
najeh
eggborough
teratogenicity
galten
jarvi
gurey
pkv
highbaugh
brauw
rsamd
willowy
jinmen
desanti
triay
decison
opiyo
meridan
nostrums
valeen
frankfurters
nautiluses
stotler
ghoneim
brecqhou
delestre
greeson
sliwinski
artwalk
loku
schnader
jaliens
taracena
fennoy
titbits
czajka
batkovic
kulfi
rotherwas
elegible
gudmundsen
montgenèvre
benakis
claming
speedcubing
wittekind
surayev
malarky
birdbrain
chalhoub
horridge
todorovic
factset
nke
newsfield
ddraig
shepis
masaga
ditu
gorostiaga
subbaraman
alaia
jukkasjärvi
panus
otiose
hemophiliacs
drezner
particleboard
havat
minsker
whitetails
circumnavigator
loverde
wanjiku
yinon
mavica
ozdemir
lamphey
ribhu
arrowood
gourmands
ginori
aiwf
entwhistle
antil
lombaerts
stiffed
alberga
berowne
heek
jumana
mannofield
converage
notifed
glodok
tharon
kamstra
mujra
neede
mònica
urasenke
capuçon
minoff
ruminates
philmore
nienke
talamo
barquín
nunnington
alledgedly
xol
nortriptyline
radlinski
koskoff
kalley
ndas
gofton
dookeran
deréon
dawne
cambe
brayman
tyrannous
alekna
iberico
vacco
navratra
thila
tairi
bagpuss
walgrave
ghanian
bovin
mercurey
chinde
trenwith
sollberger
streetlamps
greutert
pączki
nobodys
norweb
chartock
nutrasweet
sureshot
thuoc
sellman
inbounded
flatlanders
vaporising
isella
rizvan
bungler
hukawng
calsci
suryan
streck
ryue
dunitz
longstanton
modhera
kimbe
edivaldo
overtakers
lowa
shirah
gnlf
interdictions
bharananganam
vandenberghe
bhana
quattrone
radaronline
helpmates
serralles
unibrow
maryla
armanti
sasportas
tangela
lenôtre
vacillate
americanizing
gettman
surveilling
midmar
theway
hiel
anthopoulos
etok
sfra
rssi
kunes
arndell
rangelov
countermanding
korr
arshin
borisovka
superbug
gobstopper
buncha
interrogatory
belta
liberhan
ivlev
catarrhal
leftback
bilinski
jediism
samel
mehman
contraversy
tofo
mtvs
artex
kvalheim
panauti
hoodless
micheel
hairstylists
ciabatta
vht
richville
radiomen
joyes
ovalbumin
sqc
minnett
unnervingly
auxier
alavesa
dingmans
courbis
nimród
roban
myslef
greengage
begon
matala
itsa
forthe
unitt
apti
lionhearted
lawbreaking
boeta
hempsted
daimiel
efilm
movsesian
aleknagik
honourees
namkung
kirsteen
lhh
georeferencing
funnest
oxxo
grossness
macarthurs
strich
nikes
shivji
aldbury
simister
anhe
idzik
niurka
macena
ifakara
keala
summonsed
electrocardiograph
phalluses
tibooburra
hesley
oligopolies
irréversible
clarridge
shoora
brightmoor
fergerson
unal
dommett
mortifying
gerwin
jovenes
saracho
vagenas
suseo
marathis
hesme
kempa
ostrer
ferlo
shamin
falastin
gurvich
indigence
kervin
cieplak
vacillates
uksa
niloofar
umani
derriaghy
ksby
weem
sidelnikov
forlornly
venugopalan
reconsiderations
tarish
unsaleable
muthyala
pronatura
boerse
roehr
bambalapitiya
cushendun
frémaux
nasrudin
droughns
senecal
oladapo
lickliter
aulenti
moeketsi
breheny
bauby
aryal
obediah
riflery
daris
jutanugarn
diora
baiba
mckelway
favalora
remic
baratti
hualian
bothies
breastroke
leibish
michalewicz
winkless
cohre
sniggering
hijuelos
osswald
elsewise
qiz
airan
jukic
rhan
finci
villandry
lcfs
rochemback
filmdom
fmap
inclinometer
fodé
encouragingly
chanaka
sovereignties
sebastiane
chavista
slobodkin
kestel
avolon
ghandhi
joze
acdp
demoralisation
vanlandingham
marikar
alcalay
prejudicially
chiropody
brolsma
washstand
atwa
enfolded
attallah
cobey
remley
afrim
saturns
passera
leatherbarrow
nows
cheongwon
trevanian
lushnje
lidgett
centrefold
panchakarma
mishcon
brixius
silverswords
malavé
concidered
clearinghouses
destini
boiz
füsun
ligambi
demer
landler
baetens
jinxiang
ipratropium
unsatisfactorily
minoprio
koge
cunneyworth
angy
dharmasala
schmidheiny
rephotographed
beeri
patinas
zingo
promptings
iczm
denouncements
tamaru
heertje
calzadilla
wenxin
petteway
slaten
narkiss
wuwt
khazali
adeniji
nyamira
brainers
gnatcatchers
whorlton
katp
hyperextended
starbursts
douridas
leuchtenburg
memc
fiorilla
pozieres
recommencing
faiers
denford
onesie
rozental
looooong
delamielleure
misimpression
vandendriessche
khamees
tagliaferri
defoliating
redbrook
czuchry
theresienwiese
jigged
yardsticks
langgaard
lcps
oring
ehrhard
benza
mrgo
brezec
fountas
contenting
aigua
agoglia
sambhavna
cookstoves
norment
topica
destocking
lanikai
virginny
fcfa
coronini
bandele
cesp
pontllanfraith
curtsey
tabooed
wrapup
cxt
mcwethy
betsi
raborn
qionglai
eviscerating
ceranae
katouzian
razzi
artyomov
bûche
nibbs
dosch
woodworks
locali
froward
multiscreen
shubho
ericaceous
powa
lylah
clinkers
mclinden
galbiati
købmagergade
phip
kapoors
gillison
ihec
sharfuddin
cocooning
jaeden
anky
singhalese
baranoff
pelleted
shenington
powermac
jlm
spurted
dehnert
khune
aivd
jeffri
omair
zumiez
fauss
orick
frig
cooperativeness
ghafour
kinokawa
corriero
icrw
whur
rubiera
cusu
sombody
chumby
patacas
yerokhin
besla
lipizzaner
herefords
whateva
flyfishing
cuppers
korka
hortatory
figuerola
ascraeus
hillwalkers
butina
giantesses
taxanes
hardev
genuflection
emberley
opdal
valiha
badshahs
wiggenhall
coplin
fenstanton
agran
,you
paternalist
varughese
casellas
mordern
ruwart
ilos
kuca
tonghe
bolita
hamamoto
phinn
yunhe
maamoun
nazzal
insch
enviromental
freiamt
verbrugghe
istinye
shaugnessy
miryam
absorbtion
hoofnagle
perplexities
gonalons
ashwick
chachar
kvetch
insourcing
shorecrest
nutritionals
circumcising
woosh
valsartan
dimaporo
backwoodsman
igbinedion
glittered
lunts
eying
vigorito
filipowski
kilmeny
scratchcards
jogis
casseus
kreutzberg
oumi
lispenard
gennadios
desisa
ruchika
borf
asuquo
ffin
brunken
wiard
orignally
tatlow
doven
tesche
tagab
surapong
becknell
lqfp
simonstone
pallisers
defilippo
archibishop
jdr
wcjb
puttar
confab
despoil
commodes
blashill
bareknuckle
shieling
bateer
rebids
lanctot
hanauma
sharyland
klauss
ariga
almsick
ccss
skewbald
dosb
novellino
elot
briarcrest
hessell
heilemann
skaugen
spygate
reductively
hekma
emmannuelle
summerford
somnolent
gotobed
montegut
teashop
saxonburg
prairieland
jyles
fubu
canakkale
nammo
unclog
raqibul
roadbuilding
absurdistan
ozmen
attili
orrefors
quells
revalue
mfps
ganzi
rosily
casualness
penix
provitamin
reheard
aufhauser
frizington
klamer
kosb
luzzara
parklea
hangtown
darcheville
podrabinek
reinberg
nuttiness
avcs
manpack
yade
dispatchable
kmworld
inabilities
hockenhull
gardeur
phreaks
stfa
heffern
bahtiyar
plagens
bluemotion
doree
haslen
westburn
gobbles
munks
arteria
gdo
ubaida
hospitalizing
weninger
eckbert
turnball
suada
bookout
zwelithini
tuile
unorthodoxy
ermen
viably
oatway
icet
imk
broekhuizen
damehood
phills
magubane
hackler
csit
wiechmann
umkomaas
shevon
goldmines
cicig
fenglin
maguy
congi
virb
kazatomprom
trika
xchanging
immunotherapies
bobbsey
practioners
jaramogi
pretexting
harfield
rajamäki
unacademic
grifasi
badenov
crams
worldy
arranmore
psid
khonsari
ependymomas
jubouri
therriault
tabei
sivak
sozo
sischy
brettschneider
gunta
breasley
newhan
chonnam
beaupuy
reaccreditation
escalon
marijampole
milpas
pirkle
jailbreaks
altares
tatsuko
mssl
jwaneng
telis
breteler
dunseith
sixkiller
négociant
sutrisno
rosenfeldt
semenzato
markwick
reinbold
govedarica
emerica
nashawaty
steggles
fccc
halatau
clinkenbeard
kemsing
iriyama
ahuitzotl
lavaka
audibles
winsten
lamichhane
karaszewski
lenat
kübra
bugattis
oranim
distruption
stylize
lefleur
zegota
grupe
neurofibroma
buñuelos
cervoni
quami
goonhilly
archelon
rasiah
materialisation
moszkowicz
woolavington
fivers
braudy
latenight
laugerud
matadin
battis
josemi
majoritarianism
howcroft
decani
zogg
onil
hamao
draughon
limache
ronca
mamatha
montador
schurig
selc
belloq
emtec
hucclecote
unionisation
shangyu
confessore
gilat
brautigam
orthop
gachechiladze
marzec
contemplator
kaledin
rowlinson
dogmeat
stambouli
dgfi
culpin
bawaba
heve
molinar
insufficiencies
scuffling
onanism
fallowing
buyuk
chinee
decendants
oeuf
unfocussed
geldzahler
moutet
massys
rossmo
bamn
prospal
llansamlet
triangulating
sophea
impersonality
jabeen
salfords
munnik
cye
winterkorn
jamestowne
ettinghausen
countie
pinkava
stefi
saghafi
maccioni
ulrick
massager
lopota
basner
laughner
flambeur
manent
chandelle
portersville
blanchester
glouster
supercool
desaster
noooo
gyepes
skittled
kotulski
trandon
fernandino
ratomir
wiessner
chimayo
helmers
casadei
salsedo
metalink
hossen
paykel
benu
fantasiestücke
teuscher
fiesch
gardemeister
omrlp
numbskull
spelunkers
yuanlin
ncoc
kipchirchir
roets
goldhill
wyers
mugg
stanardsville
forker
cerletti
llanddwyn
unibond
wormlike
aagot
duskin
bandz
arancini
beierle
painshill
gebo
foucan
fengyang
emsc
beimel
falih
recolonised
dongsi
thudding
kenderdine
chollet
fawkham
plagiocephaly
ahip
kebri
santapaola
femtocells
middel
erwann
delphina
occhetto
preysler
volaré
maraging
tonteg
locater
charikar
jazzercise
outplaying
ctsa
dolto
kibale
darnestown
romayne
dennery
azcarraga
saunton
mucormycosis
emasculate
yike
kutiman
fedotenko
kasatka
timpone
montagnac
tayer
wolinsky
keem
bistrot
yemane
everyting
proximately
elgee
shafayat
sutar
bakkar
mcdorman
biancone
sagle
ladyfingers
microenvironments
versifier
neice
kadriye
bajillion
reengage
schmeidler
exalead
dominque
weitbrecht
lanard
tiwai
benally
lasn
sliman
biowarfare
baloise
gambians
ikonos
bairnsfather
rovsing
fawc
burdenko
zinovieff
pessary
bryers
jxl
dutchbat
cely
trinneer
trabue
hardhats
drabs
ratzon
contently
prestrud
megabases
hosty
dottori
stanfords
zolfo
dris
pokers
incarnating
eskan
arush
twana
parkridge
afful
moonlets
caldew
puddled
demonisation
kaylani
nacirema
eglingham
cherrydale
gerhaher
cedarcroft
eggimann
desena
investimentos
klingenschmitt
shortener
insincerely
vulpe
varne
salawati
staindrop
gutersloh
unamused
sightseer
hicp
huaixi
vocht
natalist
visnjic
vernice
tinnion
overpaying
fusaichi
putdown
zeltser
greenebaum
keily
kingsey
kempka
mcvittie
pahlevi
mutassim
yetnikoff
nazims
effexor
schmick
latisha
cletis
dipsea
olsens
jetons
subsample
coarelli
grear
borucki
malie
kcbd
collegue
spattering
olaiya
iarpa
nyha
mcy
sûr
densitometry
masunaga
gurwen
nlj
crosswicks
sehk
dawi
nedkov
denno
cilic
dedicatees
kuester
antipathies
porcellian
matriculants
risø
vilu
reinstadler
golez
sawe
pipestem
lindzon
strumble
abss
audiovox
shaldon
flyingbolt
byman
cleri
ruggiano
wimble
durdin
taleju
kunie
earthjustice
deloss
olazabal
nrityagram
standbys
nilar
crotches
speedbird
pettinato
vermiglio
ivas
kfsn
zarganar
mejorada
deghayes
montipora
dornum
corer
placket
pirnie
attatched
renacer
solio
chandola
bangour
scsl
blighting
nostos
standwithus
khuong
coulais
fullman
goldenhill
leisured
aglukkaq
supercenters
alpro
binetti
gudger
gomulka
incise
litan
riepe
guara
argi
jehl
niederauer
araroa
cens
baleh
misko
podgorski
kalyn
bettcher
haggadot
infills
borgstrom
skarz
tunne
vnr
tigecycline
punke
rmh
horrify
asyl
paudge
cedex
megill
munitis
antczak
breinigsville
nigo
jameses
slewed
verbale
banyak
conduced
sheild
katangan
strewed
prepublication
fluoropolymer
nationalistically
sedco
sinabung
mackanin
craciun
futurologist
preloading
lazarevich
derafsh
yonggang
truants
berretta
thawte
enayat
adamos
sodrel
quilling
daohugou
monochromes
gemelos
comar
tannock
paname
malthusianism
whre
carretta
sabaratnam
bines
lomasney
artemide
althingi
galster
ribboned
matrouh
enqelab
lligwy
abdillahi
ventrone
jawas
lightbourn
crocuses
eyeline
goodings
benzel
pirone
hiemstra
bantamweights
meshack
trackballs
vivancos
judiciária
gavarni
covic
colace
chulmleigh
wontons
cantinas
irfon
meeds
panoramica
yuans
wenling
stanikzai
grieder
stantec
frautschi
displaysearch
backfilling
whitnash
bcar
cordiner
atomico
tortoni
itel
arouna
santeiro
gantlet
jackup
teklehaimanot
llullaillaco
kotex
zhengsheng
raue
primarolo
fims
lesniewski
redelfs
hunanese
fransman
firethorn
occludes
eggold
ferrisburgh
farewelled
avrakotos
munsingwear
pinol
tuakau
wonjongkam
leilei
lihui
costard
daosheng
equiped
cheryomushki
egglescliffe
haverkamp
esmo
neora
chheda
coachways
pistoleros
evolvable
baissac
ashdon
concieved
prayerfully
proctitis
dipali
gitex
averette
suborned
niermann
ktva
harrovian
kamgar
hendershott
emminger
jurewicz
bicmos
prudishness
plaschke
gbeho
pizzaexpress
castigation
suppertime
momm
avoch
teklogix
unversed
morghab
featherstonehaugh
wflz
misperceived
euribor
paillé
straightfoward
kliman
flum
verruca
cfcm
branchflower
vanmeter
livadi
dequan
vicuna
montsho
orlev
hokanson
mcleans
marun
morrisroe
habayeb
cromwells
overbooked
mordue
sonoyta
griller
botkins
weatherproofing
pulmonologist
ghlas
xiaoning
deossie
palada
matmour
bankhaus
upcycling
baalen
crimefighters
limpert
pendre
dusenbery
xfactor
osin
telkiyski
webberville
rscn
skouris
henrard
lubinski
mafiosa
lardo
jakobshalle
pagliero
eurosatory
cromac
condori
slobbering
goobers
eaglesfield
iccn
finardi
laurien
kassan
tssa
dolinka
hothouses
prive
golovina
nykl
angiographic
vandeventer
flyman
ellenwood
mydin
pullicino
tamely
pappin
broadmarsh
gaido
tervo
kwakye
dacourt
raichle
karnstein
baghouse
leage
lonan
saulsberry
funmilayo
mayhall
cepacia
hardbacks
somehwere
fundable
glargine
bowmont
sanyukta
kroese
stickup
meadwestvaco
aurthur
kalynychenko
biggart
brundall
salicaria
liberationist
mccormac
delicia
hamadou
tiken
upchuck
photobooth
salomonsson
bellany
dehere
mardell
klecker
stuke
carsons
nierop
flourescent
ceinwen
armonía
fodors
fuggers
ufood
ankhesenamun
cukurs
witthaus
sanchia
abdulahi
czs
ctirad
pother
fatted
xlf
spaihts
sportsbooks
teched
hatchway
derriford
crumples
switz
loudmouthed
quorra
idoc
donsol
malignity
lookaround
salmeterol
piebalgs
mosha
anchin
dutrow
glenesk
kuliyapitiya
playon
bohmer
slad
stahnke
stajan
secretiveness
bhopa
gaebelein
bodansky
lovre
pentamidine
fealy
perceptiveness
aifm
togger
waterfire
kpsi
chediak
muresan
farino
subclassified
mozelle
nardella
axr
exageration
superfluity
layups
juicebox
burway
cursi
toyboy
infrabel
alysa
candaele
disprovable
bakkies
criminalist
januszczak
thirsts
plainspoken
cipollone
bricktop
gertjan
birsel
lawsons
dewars
concil
heynes
marsella
witzig
riblon
jacó
sauerwein
phantasmagoric
aibel
outlandishly
sdsr
khutsishvili
dettelbach
worldwideweb
follain
atj
humpin
freudians
pillowcase
capanne
yongjing
etlinger
listserve
gresser
hahahahaha
tompsett
siyaj
loffredo
riesenrad
neroni
ramaiya
snoozer
tewin
oxazolidinone
yuquot
rusnano
litto
berain
louisans
alcine
knook
mothertongue
scarcella
agaba
marshon
ventolin
rimantadine
storkey
ealier
wartell
polmadie
bracadale
khalden
downings
smaak
vinclozolin
gollnisch
spitalny
bradon
hughs
netafim
tynda
hinet
brickmakers
thurtell
fulp
vermeiren
telemadrid
despond
lfls
tacher
proceding
lemat
gulet
gilf
cyanidation
biopesticides
recoupment
paszkowski
kinmundy
subcomponent
askaig
abanto
phantasmagorical
budging
danais
bartin
taqaddum
luek
siemion
lupini
epidavros
regeneron
kinakh
atzori
viler
mccarthyist
lepchas
khelifa
mylroie
fennessey
maegan
mangabeys
churchy
damji
unmerciful
rosan
benmoussa
organizacion
chinking
jalula
parrinello
tyrolia
mikaele
bankasi
nishita
kohring
mclernon
sinaa
vitelloni
xposure
marinis
niumatalolo
nyjer
appli
leysdown
wibbly
reveller
kupets
hispanico
mbyte
walorski
keirrison
mickler
walhain
multitracking
enobarbus
sivok
livvy
crainey
cepsa
nieu
flins
tuberose
sandback
ramseys
basran
berthaud
sukhdeo
leonberger
cambron
zelizer
erke
vonitsa
dharmatma
kuljit
stehlik
haberle
caftan
memin
yudof
arrivabene
waxx
krulwich
nunatsiaq
speigel
yambio
subsidary
maralyn
tunley
léoville
vrolijk
mtkvari
folkie
roughley
belittlement
parakhouski
postpunk
detailer
convecting
pectins
stanberry
bellingrath
rolm
dvoretzky
kkp
pilfers
cyberworld
wanniarachchi
geidt
manninger
eirias
banglalink
vitiate
atones
zygi
overmach
cnsl
newgrass
bettridge
detsky
peprah
entacapone
esmee
lapides
unsought
visiters
originations
banse
sedbury
andreis
tanjug
vpls
rajula
guilden
emboldens
phalcon
duick
waretown
lampell
tibouchina
alando
rocksprings
squadmate
unenriched
nurofen
lazarist
homewards
zahiruddin
unexcelled
firestein
romancer
chodos
landsvirkjun
importunate
wiedemeijer
itchenor
esrf
feniger
kurosh
engo
mouhamed
gruyere
cattan
discoursed
cherilyn
diabetology
wahm
freerunner
goza
cicciolina
enckelman
galluccio
junshi
trulia
jarmon
bulkiness
jutge
grigoropoulos
travailler
stallingborough
shaff
argar
mths
goromonzi
clarksons
egee
lindrick
gusi
glanfield
sington
lollis
lemierre
viterra
hypophosphatemia
suhey
massei
amenability
bratby
guidances
cccl
boehler
padmasree
ledcor
flexographic
ozersky
ciric
illinoisan
tandra
glafcos
coppen
peellaert
polycarbonates
vidrine
agui
eriks
carlomagno
fetishized
meaninglessly
nurenberg
subsidises
ebchester
jcrc
nonwovens
orshansky
woodentop
trubee
hipple
manetta
dubuis
keysar
tsoukalas
giin
spreadshirt
hiort
pogorzelski
wienerschnitzel
caujolle
landsharks
binladin
bedient
sccm
astles
bses
diverticular
okochi
zelazo
crowton
biebl
dibben
huchet
cotner
jaslene
hilbig
schoenberger
southpointe
ahg
notepaper
ised
crossmember
tregua
ranchito
draughty
simbin
brinnington
samie
collinet
nekoosa
qingnian
offermann
welney
vilnai
nuzzo
afspa
pfann
yaqin
cinderellas
anyother
hecking
kidepo
shoeshiner
eminger
ijustine
mitchener
lassy
fingersmith
sidestream
fullfilled
unenumerated
triaged
plov
ishaya
motty
okaro
ayodeji
bohon
kulay
unilateralis
keauhou
aernout
tecia
deandrea
muscatatuck
outruns
benat
wannabees
wentloog
engquist
bantz
passacaille
referrers
quislings
indestructibility
groman
pacult
hobaugh
doocy
antik
lepik
ebrima
børs
wasafiri
menstruate
burgett
titanum
hazelett
stobbe
moorjani
sacriston
mmpa
stolar
southbrook
coffeen
sabih
hoei
bluteau
millennialist
blindfolding
kandak
yanomamo
zew
mceneaney
backheel
tunesmith
demoralise
nunnelee
shrimali
desicion
rheta
tawaf
milty
pister
mccobb
albarello
ganor
clopper
midttun
goetschius
frazz
crackington
zarang
listach
stopsley
portabella
biked
burriss
filkin
salukvadze
pivar
uraba
montemezzi
kownacki
gimi
aberdulais
meganeura
deincourt
ophthalmol
susantha
posnett
mbete
nieland
picornavirus
jelimo
petplan
gunrunning
yumei
rapo
spaggiari
hmda
maroilles
veigar
fenomeno
jumbling
drollery
gunvalson
weste
reefed
unboxing
zerner
ahrendt
rizzini
glamorama
potlucks
sterilise
cassington
girgaum
rushydro
adni
mylink
masayasu
kimbel
bonami
qiqi
pinprick
stencilling
grillon
silversea
transmogrified
talea
meuli
eqo
ktuu
dentice
mackereth
attaullah
taphouse
peneda
intertrust
sarpaneva
waight
neuroinflammation
fujiyoshida
fettuccine
rockrose
ochlocracy
spaceway
axium
undependable
dongdan
callaspo
macp
fdcpa
flinton
masciarelli
mannatech
forsworn
inlcude
yeakel
exulted
méribel
forbad
qayoom
omniture
unhallowed
berish
bluetick
vislab
prediabetes
earll
loratadine
ecuavisa
billström
izenour
ileto
caravansary
karara
pida
bellgrove
hackbridge
marse
boizot
petrak
kashechkin
kiknadze
romeus
scherbatsky
elorriaga
motiva
murum
ammari
lampitt
curth
stakhanovite
buchloh
hamberger
karnazes
earthrace
lustenberger
nehruvian
zarema
asira
vengence
supplicated
sibierski
torrone
tranquilo
scissoring
danged
desbarres
deinstitutionalisation
nafo
treelike
bungs
merchandizing
bishen
duffell
radica
rashidiya
scelfo
taty
volnay
leerhsen
bushbabies
agianst
inherant
sanso
laplaca
jozami
insolvencies
millisle
derouen
irania
thundershowers
chapmanville
hobnail
azobenzene
beedie
senzo
paavola
bizar
swelter
gillean
saucony
corrupter
crones
murgh
burdin
soueid
eena
kingsbrook
inlaws
nmwa
henrit
idamante
neurontin
nakanai
nurek
semprini
rully
barsby
bofa
brakni
pkl
manumaleuna
fasolt
sferra
deferoxamine
rakib
hogstrom
tesar
analy
ignitable
dugoni
newshounds
rhy
stupka
groeninge
scarnati
venla
cromley
kpcs
offeree
abergwyngregyn
fredrikson
angoff
downtowner
khemka
verla
champi
alakai
thamir
frikkie
xiantao
didgeridoos
cantillana
intertan
flng
hedqvist
calahan
mcgleish
lauberhorn
lygo
berlian
pessaries
osteo
goldstick
kvadrat
koska
dreyfusards
xiaobin
meisha
varini
silovs
betanews
placeres
msee
scandanavia
heurelho
aafl
orrorin
extrudes
gangmasters
alticor
ardersier
arrestors
rundall
flipsyde
amport
amézaga
benfey
nombreux
chagossian
secound
brodies
scriber
nackt
discolorations
playtv
romgaz
unscriptural
mundella
videophones
djilali
bodnia
escabeche
frita
dzhezkazgan
matopos
carbona
dihua
halkidiki
dishwater
llanhilleth
tropos
cruchaga
phmsa
attieh
piccolino
nissel
cnss
sebou
melder
flamanville
baky
kuroyanagi
shirra
scougall
jomaa
ilw
bluck
plymouths
mericle
creamier
zili
kosek
yira
dodoo
grahl
loynaz
postie
tigertail
lerg
kyrsten
eapen
awuah
uweinat
wackers
noella
graae
dshs
swets
mccolm
beechwoods
ephemerality
beltre
murcielago
jerpoint
concordville
ullas
vulpine
fmoc
lhomme
xeros
cinéaste
leflunomide
electronvolts
ingoldmells
klaviermusik
coucke
impera
administrational
hopera
unpick
romac
neeed
bppv
bertoglio
commotio
sprotbrough
opap
duntisbourne
meddeb
trmm
dweebs
graiguenamanagh
pocketpc
giselda
sadun
birthstone
cristman
uamh
gogledd
pidie
diverticulosis
colonialization
wtg
hardrict
macrobiotics
jaymie
estrich
aonbs
weyanoke
homen
creepier
wickers
arandora
obg
raring
mistretta
unsuk
hardhat
quernmore
swiveled
simat
limy
pennsburg
bedout
areta
bridgework
tidswell
learmont
erechtheion
yellowlees
powdering
systemax
oheka
weatherwise
whiteladies
indecorous
villita
pleau
mascott
fttx
pintsch
harff
odoms
larpent
lacava
lisha
vampy
armamentarium
litchurch
ouwehand
yerli
maricela
obduracy
walloped
butylene
wineman
cardenden
ecmm
fisi
bricking
binaisa
cber
jacquizz
matkin
hotnews
orchardson
mealor
erot
drambuie
dillistone
unawatuna
sunfeast
pammy
carbaryl
reesing
glyncorrwg
hahah
bicks
upbraids
cassutt
melanocarpa
piepoli
defaria
complementarities
freewheels
housemistress
musaid
taleban
dunkery
epifani
kfdm
miljkovic
smartmatic
brohan
salvagers
upsy
newmore
upperthorpe
unthreatening
magnetars
carletto
scuppers
kuenssberg
colorizing
reciever
unsual
guilded
easygroup
mentougou
musn
bindy
schudson
undiscriminating
sérieux
zaslav
vevers
molate
possilpark
agritubel
foxen
alra
arfi
kingskerswell
posnanski
penneys
religulous
faccini
kyei
northchurch
triet
saïfi
okolski
infocision
recliners
fludarabine
drooker
radfan
amoore
calliste
monomaniacal
carcinoembryonic
ravat
nordnorge
bockman
straighforward
harsent
minimoys
uhrich
cheezy
malbin
pdma
lockette
merchantable
moonis
wijesuriya
nonclinical
lindemulder
breitbach
lanoue
fretful
dazhong
narcy
maharajahs
idolises
cloudier
handcrafting
maizels
andersdotter
crisped
siasia
maimun
qitaihe
soulman
freudenberger
lovesickness
reinvestigate
blabla
wrose
smokovec
giddily
bloustein
qxl
gaulin
arduously
sible
grafters
cropduster
meghraj
eyrich
nasba
abdella
cardonnel
benchetrit
imja
avrig
libous
hbx
stief
tegs
undomesticated
latzke
frats
crossharbour
dragila
ingratiation
sellards
talukder
lainer
dechristopher
hohlraum
gurn
heisley
lausen
cbpp
panders
perceptively
andf
mastorakis
decentralise
hoper
lizano
chantels
greysteel
unelectable
braafheid
flinched
woodbrooke
merrall
hotplate
matana
raymar
whelps
abdun
verita
borjan
aftv
slogged
mycotic
lewises
rajiva
unexampled
zabavnik
tsay
muhiddin
endcliffe
woodlee
omoro
formalwear
hallenbeck
morquio
unsustainably
sigificant
carbonless
bandoliers
hichilema
reinaugurated
indispensability
swordfight
rohlman
spillett
ehsanullah
stalins
ascendent
neisha
benja
morri
growed
beus
wegelin
asby
kghm
outshooting
assadourian
menzie
fmrp
davia
kerly
anete
dichterliebe
pinny
viers
semisi
racketball
slonem
decriminalising
kaituma
xau
stamaty
privalova
tupungato
woelfel
woodchipper
nofziger
mitsch
promenading
lijie
stealey
berlant
cubbie
mirrione
aniko
kandara
efax
touqan
unselfconscious
mumsnet
impugns
gebremariam
brosses
diggnation
dabizas
abbington
venturesome
karbaschi
andruzzi
maiolo
kolonaki
guardamar
defendent
diestel
ceftazidime
epichlorohydrin
heggessey
golfs
cocotte
brynglas
dontrell
smartway
oler
delfeayo
noki
starband
oudsema
perphenazine
bloodwork
clalit
merki
maust
oyewole
faac
isik
lungin
unpointed
romao
sanie
hewit
kalandar
sibillini
xianghe
buffin
njr
fracci
amcom
yatesbury
dicorcia
annabell
numenta
puriton
entrepeneur
annys
boustani
aquil
arhuaco
atmosfera
karatz
cplex
nasimov
battlestations
bunuel
clemont
superbugs
houry
dangeard
dhaenens
sakhee
endocasts
gaung
doublestar
elaph
lagrotta
junan
priess
eftekhari
ynglings
bienniale
dimitrovski
exito
cracklings
rssb
harouna
misjudges
wltx
huixquilucan
neverthless
mclarney
ghanaba
domenik
joza
tharaud
yongming
travilla
iamgold
omalizumab
tommaseo
cagna
lagutin
rockfeller
overbalance
woolies
byrn
carryovers
arrison
kiberd
piria
makukula
moubayed
radislav
helmreich
volksdorf
tolk
forcados
tombazis
teigan
kuperus
njabulo
balzano
kilmeade
baswedan
hardart
coloreds
trecartin
morrel
depreciates
neilly
hoeller
haffar
lonn
chyron
apostolopoulos
spates
kasler
imbecilic
hillshire
heinig
kriegspiel
bertilsson
qmg
zatoka
sanei
stainfield
abdulmalik
adlestrop
maktoob
windes
dtsc
ognibene
gharavi
damasceno
tiddly
hemanshu
felgenhauer
shinga
cendrawasih
alfege
pluk
athletissima
rbbp
americanisation
kolhatkar
polypody
iddi
behcet
roosen
yaha
spitler
bakhtar
radatz
benucci
sumei
jerusalemites
szymany
bollini
brunstad
dormered
advancer
handwave
kathia
tokhi
biznesu
loosey
linster
inva
endometrioid
toshikatsu
maama
finton
anisette
prignano
tubeworm
naseema
tangelo
polich
saimir
bamburi
tanuj
shatti
leuthold
riddiford
cfsa
motoman
muick
iwarp
vestri
cornermen
panier
contango
tuckson
vanke
optout
fawell
ivinskaya
lisin
swip
shakara
lugny
wdel
ouroussoff
cuit
koharski
thorneloe
wynder
allgeier
sirous
doralee
ahlqvist
redbay
iframes
dormon
druidical
asbc
baechle
tahmasebi
inmigrantes
schalow
sahnoun
inveighed
gordeev
dessens
garrotte
llansawel
composts
dubowski
swimme
romei
viveur
overthink
trisko
posset
nebs
meskel
hansal
klumpp
tuigamala
indrawati
mayenburg
unsuitably
bharrat
lownds
lection
mangyongdae
ellerbeck
benificial
golaz
impostures
laphroaig
kymco
wooo
lorick
glasslands
fischbeck
glaz
bisciotti
aqc
vadzim
giersch
watoto
prosport
thamara
heroe
rentokil
linthwaite
proia
creich
jenine
lisetta
austwick
agliotti
numerologist
guilbaud
scinto
oliveria
disinvited
defamations
fidanza
seedcamp
niaf
aqis
seear
kpcb
hile
leadenham
zyprexa
mlstp
wessing
buscombe
bubbe
steepened
lartey
bensel
anglophilia
swoose
gnosall
belabored
lidholm
holik
tased
surfleet
kilminster
seamie
svete
mygatt
kristjansson
mccallin
spoonfuls
zetlin
tritiated
bullimore
mcauslan
microliter
comunn
gayford
tellabs
fourchon
wcfc
koed
branchless
arrogate
halfcourt
burningham
bubley
nagre
cotingas
autogrill
goenawan
pariseau
infinium
decrescendo
becketts
mandarine
gershenfeld
brambleton
remploy
symptomless
goulon
strupp
sigurimi
worters
westhuyzen
borm
artemision
naoshima
drily
krayzelburg
livnat
fgt
asne
okagbare
voh
carlyn
preval
horth
lessman
pieux
dissapointing
nosaj
eardisley
presland
safmarine
preservations
lamari
rscc
hilu
arnouville
kestelman
zour
dybek
ultracapacitors
muirton
henchoz
vyke
veton
bracers
wellham
floto
waggener
papathanasiou
aubie
haubold
plurk
nabj
bagneris
matzoh
wadie
cuonzo
metabolizers
kingsteignton
cesspits
piraro
cowcaddens
villis
mossville
natzler
coupa
okung
unfortunetly
climbable
julienned
goodhearted
sickroom
bosler
shevelove
ogri
cepal
issacs
duken
omeath
appuldurcombe
keukenhof
samways
navolato
marown
trokosi
henok
rhamnosus
swk
shamley
imia
cristales
swakop
drumbeg
barnstormed
solecism
ticos
unrefuted
salzano
bitts
engelland
katumbi
kendu
whitebrook
mepi
dastgir
unoriginality
chastized
unconcern
chesbrough
nuttgens
artemether
glaudini
economides
kavafian
hadan
puggle
brechfa
shiliang
qasam
ertürk
jerba
deso
aogo
visse
englehardt
zohur
sabogal
hoofers
cssf
whitebox
todrick
protoplanets
hallsworth
afshari
elpc
korshak
innumeracy
sybarite
shuming
gotzon
yemini
korans
brattbakk
arod
parhat
pécresse
mallis
redraws
panyarachun
lbma
stylisation
blaisdon
ekachai
wkts
westbrooks
pickney
molouk
geanakoplos
henrichs
stopwatches
kievskaya
autoworkers
hmw
nlv
haugli
dirigo
kirtling
stemware
jamari
mouhot
giambra
preesall
mallesons
sylvaine
unnoted
travelog
blakenhall
atome
audebert
ungenerous
poquito
criscuolo
landham
mizeur
mowen
chagra
canazei
sheilas
shahade
curvis
feloniously
flopper
glassworkers
kerruish
hergott
whitla
foodgrains
yasutake
merkland
vermejo
wolfendale
latkes
excrescences
tonita
togadia
zubaidah
mcverry
wwoz
diginotar
grudziadz
ebron
liyana
qualys
unfound
sesler
shembe
quanxing
amoungst
eigeman
toolan
mändoon
jurf
bearfoot
polfer
svae
wastwater
slipstreaming
underminer
carcassone
okuonghae
egglestone
propellent
embolisms
dyc
temascaltepec
unstudio
pbde
lulea
chippers
bridcutt
buerge
rayonier
mogel
usao
jobard
hierachy
napoleoni
uncooled
applebroog
uninstallation
tarator
nalen
rootlessness
perrottet
despatie
olando
ligthart
openbook
kingmambo
frewsburg
abbatoir
yanqui
loisaida
loescher
maffi
hoever
surete
msss
ferroalloys
hydroacoustic
santner
kerlikowske
glauser
beepers
wivern
cyark
koprulu
hypotrichosis
humphery
galella
coproducer
moqbel
keypoint
neckband
bruckhaus
onne
middlegate
vulgarian
cibula
smolen
bafflingly
holonyak
overstress
banche
teet
braveness
florale
chieftaincies
raafat
buscar
karcz
elfs
roustan
shelfari
inisheer
pultar
corbelli
pentel
sandeen
tatou
jajah
meiselman
arachnological
bires
albuterol
clarance
koepke
demeny
hradecky
bphil
smokescreens
gritted
magreb
griesel
teitelman
cadabby
caulked
marianella
karpa
nesconset
exoplanetary
jiroux
crantock
sayah
pernoud
verástegui
erker
hayduke
phillipstown
microcapsules
novatek
scifres
valeron
talvivaara
quirini
chiappetta
gurría
mozartiana
geosmin
eidelberg
kaavya
ospringe
newfields
verstraeten
korson
ruam
tuebrook
nanjemoy
sinnamary
schneer
angiolillo
shahinian
ensdorf
janota
hoobler
prolongations
gvl
brandweek
shariq
wachtell
mayda
cresent
cazayoux
carboxykinase
yamase
bmy
pontac
venas
audium
replanning
galleywood
polyhedrons
tristars
pageau
gyt
wilfley
daveyton
ciga
longone
bogomir
términos
deskford
piii
splashtown
microphotography
marrella
yundi
imane
mspa
ravachol
afor
babatundé
taysir
preliterate
juleps
aora
kislyak
treet
steines
marzelline
gardam
mtcr
conagua
niblick
eumc
cytosines
pcba
neelan
angeloni
grio
notus
yigo
jantjes
geale
icesat
opentravel
otsemobor
tahseen
minara
elokobi
klesla
manqué
cirrhotic
naguilian
bowhunting
hodsden
pattin
tweeny
rixi
biver
symond
godec
budgens
celac
schabir
jafarzadeh
knowlegde
civvy
metzker
rondot
milna
vulcanism
egnos
umbi
tajrish
seismograms
ghm
giostra
santalla
fhsu
marijane
olimpio
donnez
unrequested
halbreich
rakytskiy
godmanis
interring
moonbat
knechtges
hbss
cuddled
cptp
rudes
rcz
fumarase
bankboston
davutoglu
wayda
reddan
leedstown
ngandu
hudna
beeban
maarek
dewen
systemes
dawkes
rinca
lynfield
folino
karpets
danita
carnality
thunderclouds
mecanoo
midmorning
jiggetts
manahi
chupi
arbin
vean
utecht
hottelet
doagh
globalgiving
wilkomirski
kalami
zvimba
mesones
legras
ogonowski
duking
ladnier
moqed
ymm
tolentine
ubh
europeanized
hargens
pesic
chouest
spitzberg
brangelina
osteopontin
sistrunk
druker
jamesian
breder
roseola
hamze
rockoff
viggiano
rinspeed
mither
geodis
rouzi
zaytun
antithyroid
cibulka
kannemeyer
regardles
disengenuous
suffian
translunar
tchadensis
lynemouth
osnabrueck
hickersberger
wymeswold
ncbc
bday
haspel
foglights
ginia
palmettos
harto
rangin
fwm
dhali
patzer
okutsu
unstabilized
allariz
cnaa
mandagi
coving
gemco
semira
llaima
bluemner
blai
cuccioli
ojp
vbied
pasd
jabil
radipole
viyella
scrummaging
bacik
nexo
cryoprotectants
armathwaite
intensions
tzeitel
jiyao
romey
crymlyn
manhas
gaetjens
mabruk
irrevelant
molini
arec
saveable
uscirf
tingwall
respondants
jasjit
funspot
bonnyman
dependably
cuecat
siheyuan
yakubov
trybuna
superfood
wimedia
caramella
fotu
makala
kelsch
citycell
swankie
representatively
palmos
awarta
cannulae
portee
dpsg
scheen
raziya
tepecik
zhari
whiskery
stiperstones
oever
deskside
mawae
tenne
nres
adminstration
aava
unedifying
trieb
alveley
yerofeyev
kaktus
kotagede
freeheld
covais
veis
steinlager
tepperman
burnetii
austyn
mornhinweg
againe
milgrim
reponsible
romona
baribeau
fuzhong
unalterably
nordex
hrabowski
phap
mallar
isungset
moschitta
stadelheim
esthetician
khatuna
wesleys
herschler
tsuzumi
philistinism
kalmanovich
tarina
surobi
molavi
choueiri
starsem
hellebuyck
laane
operastar
arianda
bonati
mithal
cidi
speciﬁc
emro
rechristening
colemans
tianlong
doggies
forgie
realite
thumbsucker
samii
osthaus
meho
cooman
humanise
tacom
feczesin
jackbe
ruesch
tennell
diaco
padgate
nuptse
uon
walloping
spro
ornamentally
sunroofs
carsington
sydneysiders
asbos
leney
clifftops
ashara
cleansings
seiners
overselling
butcheries
toscan
larm
songkok
kelin
jarvinen
lauzen
immobiliser
citius
roell
haria
morbegno
holk
ellwanger
grayce
babyy
kalpoe
kosintseva
unaudited
trusov
bahador
firemaster
kreisleriana
tsri
elmgreen
arrrr
relationally
cudillero
melika
rzepka
gastronomica
sodis
paygo
zampino
gromer
redmoon
tianhua
purty
bennachie
lowish
lootings
tschetter
punked
mcconnon
geox
gartin
ballymacarrett
terrasar
shehnaz
schmier
jacomo
credos
dodiya
hirotoshi
bachner
tryton
maffey
onora
newmills
hidetora
dppe
topware
landfilling
igem
crerand
ternes
avilez
petlin
borse
storeng
chacaltaya
ukra
cordoning
surur
abitbol
witholding
lamsa
kemmelberg
ionomer
cyw
guardhouses
wheelspin
gatecrashing
rostad
entwining
wcrp
factfinding
gepetto
reforesting
braniel
broomes
nazeem
dumptruck
arthurdale
dilators
itzler
julieanne
unassimilated
butleigh
cuzner
giggled
abbou
agronomical
philomene
bonaiuti
ottavino
mecir
mohtarma
piteous
aryn
gallard
jundullah
cleer
javaone
bundler
pyott
reconstitutes
ribeye
mojaddedi
lopers
seatbacks
comported
vaporise
loginova
amping
teledensity
dedinje
boever
eigenberg
zamolodchikova
eyadema
ratico
fya
albarracin
ravasi
moosewood
vetos
fornarina
solazyme
fearfulness
neckarwestheim
sedlak
briceno
emmetts
effluence
meneghini
wawanesa
wuterich
claggart
camalig
circumambulating
mvovo
chiselling
hitlerite
buyung
ellinas
groomes
nayim
gearon
innocuously
gluskin
brida
mohamedi
mewing
retha
egames
laddish
rabina
fookes
deader
lauterstein
thushara
sonderkommandos
perspicacious
stempniak
uud
eji
globex
onofri
juicier
sebok
yeild
adul
redspot
waymart
kaczmarczyk
naquin
walkom
nomansland
vietjet
verhelst
colworth
soder
maskawa
hamstreet
struther
gerontocracy
liscomb
unmoored
technophobia
ckr
muckaty
pannus
pouty
xylenes
glading
dreamboats
edcs
budke
bechis
grumpiness
fadhl
jalon
labouisse
koperberg
drunker
higsons
sentebale
myersville
harvinder
poppie
photojournalistic
petrowski
sailortown
taranath
cinemagoers
proch
csfa
unrefueled
plek
grasslike
jezza
unreflective
cowey
sutanto
chlorpheniramine
schilawski
sentimentalists
lahcen
troutt
dighe
eleana
québecois
polyphenolic
battleborn
nseries
vaill
meital
smud
blet
liaoshen
firbeck
effectivly
barnehurst
frequenters
jishou
cardiomyopathies
gelashvili
hosam
wcrs
risebrough
kitchenaid
sucio
cecilienhof
dezenhall
otisfield
twante
entraining
edmeades
olaves
amulo
jehle
linera
wihtout
lateiner
cassen
atsi
vaccum
lucente
thees
vibrance
errm
sallai
decontextualized
rattlestick
algan
blini
rajnish
fannon
berzsenyi
goodsprings
kwoh
jayes
savell
antjie
kajiya
melchiot
tabane
tankerness
hirafu
gammopathy
abbadi
bcca
rotstein
smrekar
tibberton
freid
tophill
nienhuis
outdueled
mislabelling
bugaloos
bigdog
arkadina
kfoury
rezidor
wielun
xiap
derderian
bayrakdarian
sodomizing
turetzky
mclarnon
smallfilms
arcadis
tejinder
sljeme
oopsie
shirting
zaniness
filosa
ribamar
mahtani
gaulke
wjet
glenveagh
odgen
brushland
stancil
herlinda
srecs
mollinedo
syde
mennenga
plean
pompeiian
congresswomen
drawling
coppage
eakring
triallist
emergences
sonidos
casuistic
ameloblasts
writin
theoni
hospita
stranden
posteriors
rhinoviruses
acquaints
hoeflin
hakel
kilbrandon
rudenstine
gibbsboro
gnossiennes
guffman
riskless
uniprix
zoubek
preadolescent
lewenstein
sheely
allaway
lorried
quraan
preciseness
iglu
preassigned
ceec
annouced
bouzas
replacer
ollas
gouriet
holdups
adcenter
munchers
baharuddin
werburghs
worrier
dolomiten
outplay
ehman
candys
dirtiness
electricite
oshman
jiyoung
polys
vallini
whippersnapper
swri
joung
shimadzu
mcha
nonfarm
vakili
dawr
subandrio
veredus
particlar
hamodia
friss
heilpern
towan
wanlip
aaia
avtomat
uner
ostby
ultimas
hisato
broadhalfpenny
kissufim
mulched
effulgence
sheltie
grdina
josefowicz
eini
rasmusson
apicomplexans
grouses
cesaire
diseconomies
pollentier
churchfield
bodha
mendels
yavar
brighteners
kimlin
rogliano
dakich
scorpios
biomanufacturing
backpass
leonovich
klunder
injaz
roever
fusionfall
pifs
kimsooja
funiculì
zock
mendive
mcgoff
formisano
emtricitabine
liedekerke
melendi
preppers
stcs
loughbrickland
werbach
waigel
gameforge
emmanual
custo
miit
domonique
shockproof
khade
parlak
quarterlife
luthi
rumbler
livent
paredones
dentyne
rohullah
eilbacher
nakaji
restorable
safehaven
gossypol
kianna
spilker
adewole
saute
swingley
marggraf
bods
bromage
suduva
medicom
qayyarah
angoras
scoters
faleh
canizares
nanoporous
embalm
ccie
lagendijk
zoomerang
zorman
pfenning
megadrive
misidentifies
concret
arieli
perkinelmer
commericial
césars
paranoic
bolotowsky
dutasteride
crocks
brooder
vlasak
chimi
raunchier
leparoux
externalize
wagih
rothiemurchus
overbilling
smert
chikezie
zanno
demio
shrewton
parfois
soplica
schlong
amokachi
tinson
sinochem
schuetzen
dunnit
oxenbury
norfolks
psaki
zukowsky
asfordby
tigertailz
coalter
luncarty
chhun
strutton
danladi
lfd
icor
abiyev
paschali
ripetta
cameleers
githa
auriel
grazeley
forepaw
capucho
krauts
estanguet
turistas
dilligent
vivisectionist
hiatal
wessells
radiantly
bichsel
knotek
metinvest
crill
speegle
verkaik
portimao
neighing
mulvenna
sterilizer
coccolithophore
accessit
tomsula
norem
geothermally
roizman
assister
jader
krankies
bikeable
datacentre
edko
azhdarchids
candiru
mcnitt
tourian
mcus
childbirths
ljm
sodding
bravissimo
gravitt
disrupters
qingquan
toranzo
duggleby
lawd
shootaround
securitizations
bunye
microcephalic
plods
coopersburg
babani
soundarajan
antai
threee
ardler
wcaa
reice
multisectoral
fandemonium
langwell
guanghui
harsco
wogs
kiffe
macgillycuddy
travelex
lansdell
yumashev
tenterhooks
sandjak
waide
saffer
gaelan
codax
kambanda
gudina
dhanoa
hynie
laverick
risinghurst
teya
enquirers
assuaging
beles
fxcm
farenheit
sigmatel
titrating
morganstern
nutrisystem
streetman
castrates
dasornis
shreddies
boyata
favila
incra
ursodeoxycholic
candyfloss
stelarc
souse
rosabal
kneelers
iwb
presales
abdessalam
terrin
easebourne
sanctums
stichill
lechuza
skaug
ertugrul
hereunto
prinstein
patatas
entezami
blx
ricchetti
morral
yorio
tchoupitoulas
galenson
sasscer
misappropriate
josserand
lachanze
eesh
outsells
camuto
khee
mardie
geralyn
finham
gukurahundi
belters
wkl
nunchuck
ostapchuk
smtc
unpeeled
cipinang
gibbsville
heartsick
nonbank
pauzé
buchel
skiverton
upbeats
vacanti
squonk
lochmaddy
bannigan
culliver
krummholz
contect
leav
neddick
dashoguz
neupert
startline
fogey
hawt
niemeier
soliah
anick
beckii
kanze
repola
erpen
biobehavioral
tusayan
smos
pfennigs
blackdog
higgens
vilifies
waria
tyskie
dineley
karpe
zhr
chodo
slps
chaly
swinoujscie
swebus
pedersoli
mischaracterisation
hoofbeats
kalva
doorley
sellable
letheringsett
konchak
odelia
gfo
snowshoers
braunohler
pudlo
chianina
haricot
angotti
précieux
pretium
heurich
cullera
pasachoff
kalamity
blickensderfer
debmar
leavel
shmulik
blumenkrantz
bekah
winna
slyne
decontaminating
elazig
ului
basingwerk
nilotinib
sifo
attaran
latently
sajjadi
shajara
geuss
kotev
jensens
dequeen
cstb
gcon
burneside
khondji
ponderings
nweke
freudianism
bermond
xuesen
adere
zahr
harmonielehre
komano
dirigisme
tatp
eortc
abscence
kilcreggan
bouziane
livingsocial
shoichet
anastrozole
berntsson
suzane
goldmining
sagalassos
chombo
sherries
shiying
taloqan
inalienability
bvba
oesterle
refurbishes
supervisions
ardill
hodeida
valtin
elizabethae
usweb
listlessness
chrystia
fluoranthene
thunderhorse
brevin
silverbrook
reconvenes
tiffeny
tardises
inshaw
biocontainment
lenglet
murambatsvina
slithered
chophouse
alongwith
sirtuins
taubira
shamos
multiair
inturn
bajic
spyplane
zawahri
srah
phenylacetic
vaujany
choubey
liebau
uhb
cieslewicz
pathhead
creflo
vidim
refolding
gillmeister
galloppa
kathrein
jayceon
goodyears
genyk
ojok
roiled
vitznau
galavision
kalaa
jolan
omantel
filife
cotsen
bertolli
tisco
gopac
paddison
knightstone
modernizer
hybridising
nonresponsive
soused
arduini
kjellson
qasba
chows
unoffensive
danys
llaneras
mydans
rootedness
prominance
unlovable
dympna
hemington
lofaro
thierse
crabbed
penuel
mlat
neurexin
palethorpe
helaba
palitz
stoneworks
freret
overlanding
rbo
zeqiri
chiaverini
expiated
gravenstein
aliko
navigon
carlee
droppable
erotomania
smead
tinapa
benway
morain
britsh
graumann
pfaltzgraff
tabram
muzzleloader
bridalveil
pazder
betye
vicinanza
antipoverty
doubletake
relators
dallos
astaldi
ghandy
cammermeyer
chapelhall
shoshani
jewellry
misener
galvanization
zawadi
harano
tomatin
buzau
phonebooks
texels
arowanas
skladany
oppegard
dejun
chesa
annunciator
bushwhack
werthein
weasleys
kilravock
taghavi
helmig
himss
toxemia
gambar
currell
hanem
eldh
banyas
ulipristal
guanhua
malampaya
imlah
engraulis
unccd
carnon
thiery
downlisted
flaviviruses
celebrex
almaza
spruell
schoolyear
artero
dujana
jingmei
improvolympic
villines
treater
blackle
lavallade
opsiphanes
karanovic
dyskinesias
alÿs
chronister
betaworks
sooni
brüssow
kemsky
judenplatz
kdrv
buckrose
giorgios
vswr
stracathro
memi
kinlochbervie
bonakdar
ssees
brían
krajan
guadelupe
titterington
offenbacher
codevelopment
emn
quaalude
idiotically
stepfathers
steppingstone
blindsiding
zamore
loteria
kheil
burnopfield
uswitch
dulan
belaunde
gushee
alotta
indefinitly
autodidacts
gogglebox
lagemann
royersford
berends
harreld
buehring
organza
timis
garelli
curico
landshark
amigoni
midón
trendline
ecgs
romell
kennoway
bebber
lizi
elisions
shestack
sarenne
geyt
hesta
wenguang
ranua
bedeau
vacansoleil
nrtis
affion
ganzer
gendel
opies
wrenthorpe
exis
hure
hakuhodo
zulauf
devoutness
rihanoff
jianmin
hucles
vinified
authier
mythologised
periodontist
newchapel
velfrey
toosi
kerkeling
enkhbold
supo
ozploitation
noppadon
grea
jakim
radiologically
rehema
yoigo
ladarius
shamali
lypiatt
psychoeducational
pushup
kerris
brotchie
guffaw
amphioctopus
habbit
teritory
hermanis
reardan
freestyled
scarpitta
killea
clerestories
nonalignment
worbarrow
illuzzi
bullman
kavvadias
guesstimates
shantala
shora
bonfiglioli
cendana
budington
dafnis
bwm
enpi
ajvide
pubens
strassberg
tesich
shawi
hamied
kco
loserville
untrusting
spedale
mischaracterising
glacken
engano
wristlet
gaastra
drye
oyola
rielle
safafa
dykman
drdc
commmunity
susil
kaprielian
acet
wowza
skippering
nonrefundable
forfour
ubah
cerenkov
doright
wintergarten
shakily
reznick
garlow
kutsher
krystkowiak
danjahandz
bergwall
reforge
azuka
qalander
astrofisica
mbuya
usmnt
neurotically
eue
japanther
braginsky
visn
sicking
nough
ortlieb
communitarians
totonaca
poisonwood
countenances
namika
sapelli
citarum
spitteler
footitt
raimondas
tjarutja
manometry
lulo
doaks
abbeyfield
celsi
trummer
kilobit
domonic
amde
jasad
toolworks
mackechnie
iapt
vansbro
reinholt
unfretted
soutra
afterlives
powidz
zahalka
merryl
chetek
rueckert
spankers
benllech
denic
giovenale
andonis
nfip
sibongile
irrelavant
vavi
sharikov
bergelson
sohrabuddin
foxhunt
boneheads
intriago
camie
breadsticks
tokayev
llanilar
tards
nyamu
wotruba
uzoma
tollymore
dolega
steepening
overblow
pinworms
abkarian
cliq
elsener
odling
ferroalloy
agyei
goldemberg
shaheer
quizzers
inigoes
headrow
mechta
succotash
ishima
minffordd
kelter
kondracke
dibala
carboy
flatlines
harberger
jarah
balkestein
oaktown
aups
amorth
subletting
mindboggling
durif
tresillian
annobon
starliters
deriba
yamburg
issan
jayasekara
egidi
googlers
nabunturan
sportscasts
hopefuly
drivability
kenig
lehan
ractopamine
kabiri
brodkey
alacris
againts
bordeau
riccitiello
cuddie
lotterywest
keva
honeytrap
sehar
credle
hossegor
ollo
nutkins
sensuousness
karoi
zischler
serioux
aboville
charmley
sanatorio
bagpuize
centrepieces
ziskind
probelm
uncared
shirly
lowassa
oilcloth
mctc
newbottle
adman
lifg
offf
mideastern
jiyai
tremarco
polixenes
ropin
longannet
untruthfully
bigshot
viktorov
cratty
madey
suqami
simatupang
labate
volkow
subjectiveness
timesaver
psychoneuroimmunology
komisarek
galantamine
careering
stuttaford
christianna
bedier
cyberwar
jackiw
rotfl
stripers
mcse
haibach
cawdron
ayudar
insoo
delpierre
jimbaran
multistorey
kolenda
beleived
francophilia
merowe
sheikhdoms
netfront
slaugham
furd
whodunnits
schlenker
changhong
fruitiness
eonia
empathized
facemasks
houseful
granpa
tambaqui
whitebridge
msos
boulangerie
hfrs
pfefferberg
chishui
yetunde
drph
sharqawi
mistoffelees
solta
taleyarkhan
admiting
mesereau
franzia
casler
molera
ennepetal
msika
reboost
ordin
runnells
gockel
oshinsky
prezioso
thorvaldsens
thanksgivings
newarthill
treatement
extracranial
kandell
pushchairs
pálfi
alistar
lcca
jelic
gaelscoileanna
aalders
chryssa
munyaradzi
tévez
beguile
calcining
undervaluing
lineen
beidaihe
delval
eawag
evenhandedness
uncombined
fortuneswell
hogtied
hollered
solarte
harmonique
sechrist
aestheticians
ures
bioanalysis
ostermeier
strangelet
anido
bimla
crout
kasturirangan
pharris
swindley
kochavi
chancing
vintry
gerven
emberson
carmines
kvirkvelia
pogrebinsky
laughland
shneidman
aliana
heho
spelaeus
damerham
emana
gigerenzer
reinartz
ddda
mulligans
godement
milki
savitz
biviano
sabriye
abraj
suthar
siripala
cadarache
kipkorir
nendo
pavic
endplates
igcses
hajira
sounddock
kuhar
guenveur
lasnik
drymen
plasticized
opuses
lansman
haleigh
matsunami
serras
hermens
chiverton
baichwal
watsonian
mihajlov
votaw
mahas
drizzly
corpulence
alloted
outplacement
mussomeli
madain
imponderables
preambles
overbid
bengkalis
rabten
vincentelli
gedevanishvili
lathing
gake
carboxyhemoglobin
lemahieu
ownerless
abdulin
audibert
bucker
kappas
butembo
ourika
tiptoeing
towse
caustically
guhl
tragicus
löscher
inadvertence
dftd
isah
glenday
replicability
swahn
golling
sensualist
rajawali
ateeq
poulan
nuruzzaman
schoepf
plavix
dunkleman
barati
olajide
goign
ribao
izze
glints
isidingo
etown
yucel
karpat
gewargis
synovate
mattoso
sarcone
typicality
puddicombe
siqi
izhak
buttitta
deguerin
ottauquechee
sewerby
gullfoss
eranga
supacat
roastmaster
lidya
joines
rushbury
tendinopathy
kopelev
ilori
yayin
barsukov
ragano
muthee
cefni
vignoble
nacton
changjin
repave
melsom
cavils
castled
undermind
humiston
saona
ffiec
panameñista
stavudine
cornetti
latia
steelville
traidcraft
lemerle
imperdiet
aebn
sollicitationis
fuddruckers
burnhouse
scarba
woodgreen
abco
bioelectronics
kounen
importuning
werbowy
lightheaded
zozobra
vernhes
microbrew
moonscape
marchlewski
contexte
adgate
bordogna
naqdi
boded
loick
emigrés
midwicket
soru
dollarama
cdisc
backpedal
intervenors
wle
chunilal
zens
horstead
crappies
gartnavel
bozorgmehr
daylilies
karmali
lagavulin
sulphates
voltaggio
annely
matlinpatterson
newcastleton
underplaying
fossilisation
augue
euismod
castoro
talvi
ustvolskaya
freaknik
konen
thingamajig
fairfields
temara
wellins
iwatake
wallisch
coulrophobia
carntyne
cragun
rosha
listin
ostrove
eunson
hockham
shcherban
sipra
kubina
matlala
tontons
donnycarney
nightengale
tregs
dolled
universa
kutak
wawne
dsx
elmbank
tiy
barraco
stoerner
charlyne
klarman
stoeckl
temko
marzetti
dizzie
unladylike
herner
insistant
unworn
yoast
unfreedom
worswick
paen
evgueni
yamao
tantalisingly
dolens
pluralists
karpas
prescence
blumfield
limbal
sorabjee
ruthann
mulaudzi
chnage
aktan
parriott
skidby
kimberworth
oversampled
kapon
leylands
bigos
zanker
brachyglottis
gamesters
diagree
salafia
habitué
galunggung
marasmus
touchtone
chedjou
ativan
logorrhea
shazier
breschel
buzbee
leyer
probly
teiresias
nanoimprint
milz
herge
ziade
gaokao
grig
nihari
xeriscape
miptv
meulendijks
torghelle
tisk
landtroop
kresimir
kasman
semina
annik
muthukrishnan
unrolls
contemporània
ekoku
jacuzzis
linsay
lacp
tamares
sadducee
statscan
omond
forastero
ninians
gaerwen
pizzolato
repurchases
taquari
thorkil
westerton
dominicis
montuori
gentlefolk
shattock
favino
omnitel
sikkema
poliedro
ehler
tapentadol
kikue
neuroligin
phonecalls
drighlington
lilbourne
maitres
vercammen
scotswoman
genworth
selvarajah
dorsin
barbae
heldenplatz
sportscotland
monden
growden
wardriving
schweppe
eatoni
methylate
alviro
timahoe
classiest
garrad
almine
cristovão
megatonnes
princeling
eifionydd
ageyev
novinite
draughn
perkel
unrequired
tetranitrate
coyte
novatel
araripesuchus
oguma
steeplechasers
badshot
nmk
dallied
mortice
graphologist
belenky
norito
programed
obstfeld
westwinds
gick
daesan
juguetes
sadiqi
kayumov
yangdon
leasburg
mularczyk
landivisiau
spoc
lipgloss
meeth
shizzle
nuoto
glenmorangie
abbaszadeh
sidlaw
oakcrest
schaumberg
clintonian
hongju
spni
doublemoon
fattoria
tavenner
letheren
parex
deltek
weat
autolib
deheza
veet
wpsd
kpb
notatum
mctighe
mccadden
trewern
brillon
matusiak
ketron
ostiglia
demery
nieuwmarkt
darulaman
weixin
ulcerate
tizon
guaranties
amkor
lizanne
streetcorner
grigoriadis
ardbeg
thormann
pizazz
addressability
jonh
kerrera
barrott
hurth
gurner
quiltmaking
cattles
kettlethorpe
mrem
etto
boarfish
waialeale
jianguomen
megajoule
lieke
passantino
fup
dje
cregeen
legitimisation
starcher
jeda
worthpoint
differentiators
clucks
tabing
brassbound
maryline
azorian
yongxiang
twofer
stupp
odnoklassniki
shellings
boulestin
zambon
undereducated
usdaw
tonkov
salsburgh
laureana
sullying
fayçal
intentionalism
piznarski
sipson
thiéry
remediable
aslani
alasia
emmerton
cownie
gathegi
soly
travon
yetter
tahera
antigay
chilhowie
rieman
playrooms
attosecond
multidomain
shingai
bitc
ssdp
safaga
thilawa
tses
ramanlal
daudzai
slaymaker
tishkov
ballysillan
stephnie
fredricksen
skillen
answerphone
wilfong
scown
clubcorp
elver
permanantly
zaidis
leatherdale
nsas
sphenodon
bilour
wizzo
lenko
camorristi
therien
andriesse
pingguo
thumann
taktsang
mironenko
assaultive
ruinart
colletto
auror
rosu
learys
roughening
windau
cutforth
saulat
indictees
ingrain
glandon
uncoded
nebeker
gobby
brandee
hélas
tauseef
cyclosportive
feagles
potulny
defuser
curtner
ufdr
varazdin
cartal
rediscoveries
freshney
zuhdi
micciche
youngdahl
maranhao
fajita
branchage
ternura
perminov
postwick
eunjung
xpand
suppo
boldo
pipavav
inion
colangeli
sterilising
halbherr
oakthorpe
butovo
dunavant
blazingly
arouri
anderlini
gilderdale
unventilated
whiffle
huadian
cramerton
intersession
chymosin
kruschev
gemütlichkeit
sanitise
davitashvili
doeth
koty
allmark
dsj
riprock
atouba
stewartsville
cancilla
liiceanu
bempton
qzone
shenaz
torkan
amandi
callil
venrock
abramovitch
yrp
recherché
huggler
ﬁnd
recolonisation
ostracizing
cayden
depilatory
keppinger
nahunta
mahaboob
pyramis
schork
wilmotte
siok
univerity
hedwige
metri
urozgan
lessie
mettam
lighthorse
fluffier
putes
fatburger
jawwad
summe
nehal
bignold
gazetting
natassia
wedowee
haessler
cordobés
confounders
wilmes
zhongyong
nevarez
mohenjodaro
durakovic
gosto
auditionee
yosif
benifits
sklansky
cpni
serga
hwb
eisenreich
crescencio
tibias
scorpionflies
lironi
auza
drogin
mythically
wellingore
parapluie
humectants
motsu
geordan
mooned
epoxi
gonz
milies
verheugen
relased
hannock
landrush
henker
vanatta
macombs
cholent
chimere
somekind
seedheads
voreqe
delgo
aylmerton
görgl
hurns
traianos
microtransaction
porthtowan
kabil
disintegrative
shadeed
havlat
rodero
busicom
leinenkugel
gradiometer
marrieds
vittoriano
roths
inconceivably
neukomm
trashorras
gursharan
nkosinathi
postcommunist
rooley
mcclement
purewal
iswaran
paralomis
guessable
romberger
insitute
vbci
rollison
drawdowns
stansel
corteo
oblivian
superblocks
ultracentrifuge
nemescu
umschlagplatz
extel
exceptionality
lumpers
eberharter
stuard
moshonov
soekarnoputri
monya
secton
kasilof
nervet
worriedly
riviresa
cofre
nonconventional
filsinger
pupping
hoben
hiriya
tirian
arwad
duchaine
brooklier
goerlitz
livechat
humewood
glenridge
noisome
cranton
reyneke
bood
egcg
besford
pchr
dutchwoman
tediousness
kenson
animalism
wholesales
catterton
moskito
kresty
deshapriya
keells
fakhir
lavrador
superflex
rutty
filar
alchohol
dupire
kilteel
chalkias
surreality
kinnerton
nonsens
rodis
rapidfire
hashlosha
dessi
whinny
barke
cystinosis
forcer
eilberg
brodnax
sevele
courtlandt
eestor
luneau
lahlou
hokin
periolat
miag
wildtangent
schowalter
homeworks
januarie
hefin
schickler
groundballs
cassada
unsympathetically
steffin
cretton
threated
gusanos
griswald
leshner
odoriferous
chatam
impersonally
jaradat
landford
incontestably
gynaecologic
barkby
autothrottle
hiltrud
lossie
mittelstaedt
hajjarian
alspach
bakong
kics
religiose
clutts
cristofaro
viperfish
waymarks
garamba
swaney
jonesport
skotnicki
sniveling
excrescence
muqeem
petai
ghez
masterplanning
sycuan
pelter
mcluckie
cricut
shchekochikhin
tatianna
hobin
swinefleet
lamay
cambreling
uuj
todorovsky
ambah
tetrick
siafu
coppergate
supprt
varndean
leapin
vitrines
dongala
readdress
weakfish
chantha
ghadr
sidlesham
schairer
kathputli
vigilio
doenst
djh
zayatte
gbarnga
crosscheck
toffs
abbotabad
ousu
burcot
sirven
winecoff
dearbhla
mantzios
gilfus
tempestt
toity
mccarthys
grandfatherly
streptokinase
esera
sibiya
trisul
nowheresville
girardville
neteller
tailgunner
proschwitz
teambuilding
hempen
flubs
chaye
gunwharf
fratini
bundock
lahrs
meggetland
statists
ssae
gammy
losang
wyns
renco
negationist
immunised
dutse
carabosse
merav
macchiarini
rediculously
blezard
lubambo
broco
oatcake
jinro
marcal
macchu
ranawaka
reverentially
unfed
catshill
scamarcio
dassie
gvi
rosharon
nassco
amacom
cqi
handoffs
wijesekara
kolvenbach
arranz
lerolle
balsiger
habitués
mhf
eyedropper
shayegan
mageean
vanacore
mauran
zosen
erechtheum
symbiogenesis
zfn
razumkov
zhenping
adverting
butes
daypart
aigai
pagliari
fanpop
rabea
chamari
screamingly
kyar
tigrayans
scawby
beckhampton
arcelia
hangi
kasting
risperdal
shaffi
nkwocha
hagmann
giannuzzi
kampfer
vouchsafe
aasiya
moreoever
kliptown
gorzelanny
kget
leckenby
kabuga
powersports
provid
paypass
jackpine
guzzlers
otolaryngologists
beatley
kukors
herodion
messara
gameness
corseted
loomba
logisticians
matatiele
käpylä
masturbator
toter
glabellar
krysiak
narrowcasting
llona
pedmore
europen
sevastyanov
nabiha
twb
dcos
ferson
keynoted
roadrunning
shkval
affectively
ghods
valaitis
bluebeat
doxie
hannesson
steet
serein
hemas
wolffs
moneragala
spinderella
bågenholm
gfn
allenson
zorb
vles
fraternize
funiculà
chaun
gangbuster
entranceways
cango
lagrima
stupidities
artyukhin
trooped
clarine
obita
miniaturize
montera
iuli
tekamah
intead
barau
pekerman
riffola
mondatta
pennac
vasher
aggiornamento
ivankovich
wikinomics
flein
hittner
nizaris
deskjet
unmentionables
rizokarpaso
appelfeld
heshmati
barelvis
postoperatively
neurosky
deobandis
gemba
pianoro
middleeast
argota
guazzini
gullivers
zart
bioneers
transglobe
houdry
herbertson
courtine
lassoing
bkd
dahsyat
hanut
lienert
permed
paultons
letrozole
donta
celeski
sandside
lubaantun
greenmantle
broinowski
notecards
calzones
borgeson
eurobond
derbe
mullova
bodla
fontwell
mastrosimone
denault
guidestones
penallt
pascarella
leitman
gubelmann
bress
clunkier
ortgies
sheremet
rangsan
weisenborn
ratray
blandishments
fianceé
bududa
ciocan
cortazzi
boucaud
skytower
matrioshka
eychaner
cdbg
undammed
kleo
horkstow
senesi
varsavsky
tittlemouse
bestbuy
colorway
caramelised
underconsumption
proctology
saubers
shiroma
mirkovic
unretouched
chalino
jeanmarie
oreti
distict
mathmatical
cherna
ahlan
bollan
langfuhr
shelterbelt
humoring
greenert
sherwan
brank
winkerbean
clevelanders
superheroics
madkour
chaudary
fuhua
soeder
luiseno
kwoka
suject
serialist
brockhall
bonaiuto
hartill
vallery
giuntini
asoc
tordjman
jambos
nwg
capossela
cissoko
fraudulence
christianise
immediatley
predetermine
lomong
armacost
burrator
bessan
econtent
gambrills
plentyoffish
shabangu
tingi
vademecum
kludgy
aped
basked
akkermans
neidich
powhite
swashes
solomun
afriforum
paixao
liliensternus
auslese
gaziyev
merner
bondt
shakhnazarov
decares
rosalio
zabad
norbrook
intisar
antiscience
elluminate
kbmt
narjis
shoebury
grichting
mintlaw
mendicino
paleobotanists
breadstick
hirohisa
dluga
karoubi
artal
catic
gestate
partygoer
abdelrazik
microinsurance
egwu
pharmacoeconomics
bosquets
rottach
stuntwork
saburi
buccieri
toolmakers
busato
zaro
yasutoshi
familier
promet
benchwarmer
střední
lunetta
logistician
newbuilding
dailytech
prpa
mendini
gartsherrie
fumigate
gatehead
petrovics
warsak
pwds
waisting
quaeda
dcist
leale
chibhabha
teresi
dulle
brauhaus
mofford
aphorist
unfeminine
sabras
goerz
stiffler
itals
seelan
sliney
hoytema
myway
migenes
hipps
bugesera
escaper
flimby
narayanhity
bausman
tadini
karua
oppama
scrapyards
khormato
lofti
panzi
maikon
outclassing
benli
outmanoeuvre
suryo
apelike
horspath
chillcott
neuza
walstad
libardo
dagnan
giusi
cmpc
terendak
rowledge
batangan
leunen
fracasso
whitecoat
sibbett
somedays
mommas
montemurro
orchitis
sybella
gaffie
hammud
delyth
ilum
tapei
dahna
nohar
stoneferry
caporali
speedtv
venturis
kaufland
roelfzema
horine
bracingly
shaarey
bremhill
treichler
kornmann
mindspark
multigrain
kyries
sprey
scribblers
conatel
zinio
rasulo
soderquist
cukier
bulgy
monosyllable
cobwebby
fundamentalisms
oyvind
hobbyhorse
malfitano
clps
deonte
zair
expedites
entrenches
investimento
kaniguram
longaberger
toston
nuray
marlise
sawali
autoinjector
tahmineh
eventi
slooten
promotable
slatyer
leobardo
triviño
wahnsinn
salvato
larman
wenke
songzhuang
parkham
ginned
pamelyn
brainware
irrigator
bolerjack
ehiogu
keneseth
rebrandings
autocracies
gelligaer
dipsy
briargrove
deputize
sirak
tigi
brye
dombroski
stoneywood
selvakumar
mobisodes
hexing
alfold
offseasons
zduriencik
hemudu
domainkeys
glantaf
skanking
demayo
dandapani
jakovic
raydale
kesc
baculites
symptomatically
woodingdean
duyet
netanel
lockets
cercone
despondently
sanzenbacher
cedergren
longinotto
clads
brackla
gadzhiyev
tugend
mckeating
genaux
maccartney
chando
pendolinos
usts
slacked
datamonitor
aghahowa
naturi
ozwald
irro
andreolli
beraldo
shipu
merkers
kometal
flypasts
wickline
conrath
fazeli
naysayer
citterio
hinchliff
blakney
spaten
bburago
glieberman
macroregion
pulgarcito
dorsets
lobanova
recyclability
sobri
chinelo
triki
bullsh
aracataca
vitripennis
legwand
excellant
qassams
cumiskey
accessability
stellina
schardt
takamiyama
autocratically
splaying
handsomeness
lightle
cannich
wheatears
malony
kósa
cloherty
manycore
ropemaker
velindre
kcsa
skey
duffers
motorboating
weinsteins
ruukki
danspace
costolo
barouh
demobilise
callejo
welfarism
buiter
noumenal
bowties
hualalai
alethiometer
ngmoco
jerid
marciac
primitifs
scalea
pribylovsky
meirionydd
teruaki
unislamic
schmidtke
qera
lsis
firestreak
huarache
emporer
littermates
delaet
divincenzo
conventioneers
lewison
unexpectedness
junzi
dhindsa
paustian
weltman
wahhabist
gemberling
casperson
lvx
segas
falstein
westering
supersoft
bulks
schlozman
ticklers
arasa
almana
byrsa
mangaldas
lajčák
ayoubi
lojas
marychurch
tgfb
wideness
pochon
nalaga
chilango
colantuono
ciesielski
saltzberg
weisenfeld
sparticles
laventure
junking
lenhard
spinazzola
subkoff
hoffpauir
photogram
umtv
haltingly
hariyanto
transmogrification
cornpone
frankensteins
darial
mcvoy
asrat
intourist
gildernew
vavoua
doglike
migden
midcoast
karats
graville
adande
pontani
snoods
thornless
distractive
cume
whitesands
broadie
harple
nirupam
hoity
blinkbox
fremlin
hainton
vigabatrin
etendard
matarese
pait
photolithographic
lvsr
argetsinger
lapidot
wittenstein
silatech
cogo
aimal
karandikar
hypocricy
rhwng
biever
snuffing
wensheng
sofri
tackaberry
wibro
tennenbaum
kapl
hyseni
bowart
sherwoods
pzc
keratomileusis
akinobu
binzel
spelterini
pretreated
joppich
flutey
ismir
surani
kinz
bonnefous
mellion
bryza
cornershot
latavia
mehrabian
kassianides
eventus
catholictv
spinx
brassicas
pavley
aldaba
bacala
powerbar
antias
tipitapa
crudele
brilli
niederkorn
drobny
brimscombe
nuveman
bassolé
barnicoat
weinraub
shelle
beckstrom
hegi
raak
utila
compulsivity
goest
semitendinosus
synesthete
futuristics
klops
acquiescent
ulufa
rsna
narwal
muzzey
poipet
stiver
urbanworld
volcans
mmea
vleet
ichino
galiardi
vallario
itq
adnkronos
itos
gancia
catledge
vukasin
landsmen
csem
kruszewski
quenby
mpwapwa
yuriorkis
colligo
winata
vukich
mydeco
primitiveness
dbsa
bentov
schwentke
hspc
burgling
donelan
zakary
untrammelled
eacs
barnfather
kashka
bhangarh
valcareggi
haketa
kalskag
mansourah
nutrigenomics
quataert
puckish
computerizing
sabby
scarnato
mistic
ghettoizing
lomell
rubashov
pavee
winep
trolla
touchup
woodmansterne
flatbeds
daeron
quinol
objectional
smedes
antedate
rosebraugh
stearmans
hedayati
arenella
wsil
vredenburgh
behren
barany
awde
fazioli
grasty
minmetals
salvucci
hunsecker
hamifratz
balzar
tritech
polypharmacy
takana
zentralbank
sundress
imvu
minjun
schloesser
bodge
ningún
kebble
ferrochrome
dawdon
scuffing
darroll
unadopted
hilsum
glimpsing
prawle
roscioli
tunstead
almirall
eisenhut
pattar
rapho
mave
mohajir
soltys
enamelware
trzebinski
tintypes
piyapong
ghaem
louvel
dongseo
archea
cytter
transgenics
raedwald
sternest
martinovich
haisheng
contextualisation
immidiately
noncustodial
adhir
sessums
mantega
scandling
parlante
louttit
rumbas
brightwork
chehade
samaroo
unat
townfolk
hillstrom
mories
baginski
hayel
chansa
ngls
zhelyazkov
ecrc
habila
maryculter
hennon
frivolities
enano
ebird
shaymin
subjugates
herro
maximova
tarman
mynott
strewing
timbale
mckhan
dosnt
eljanov
farfel
baccini
demsey
sixpences
deltaville
pagden
consigny
schaafsma
marford
alashan
chinyere
lowness
zyvex
bertoletti
gasfield
larri
anandasangaree
panoptic
lyndell
crating
kijabe
rouland
selsam
lebogang
doodletown
fuan
pinella
lram
reddell
molterer
relatio
occure
baxenden
descriptives
rhame
azkadellia
pyrah
internalizes
schrempp
universalized
mazraa
patmon
grudzielanek
eternit
txurruka
stagey
imerys
quietened
islamification
dulhaniya
wholefoods
sinornithosaurus
uproots
ticuna
vuyo
leakers
combwich
timnath
zigo
shamwow
reman
architecting
howsoever
steffey
wolfy
tockwith
soueif
trenholme
rorimer
tieing
slcc
corales
nonconference
telestial
tregaskis
baisya
cocodrie
damaturu
htat
flourtown
hosemann
eruh
hangups
raegan
valand
tristin
caldow
maftei
louboutins
samawa
bocs
kindhearts
hopcraft
sakir
loyden
pelles
lopokova
costantinopoli
intercutting
iccas
magaz
quids
recomment
helem
desparately
hulings
diegans
nmw
georgas
bewbush
coulterville
matricardi
forand
straitlaced
tackiness
emailer
athersley
razzall
dollops
keiren
saimin
vraca
biolab
labeyrie
rygbi
moati
beachum
howaldt
frenetically
mbogo
francophiles
reçber
chocolatey
cherkaoui
cnngo
bridies
helferich
embarassingly
cervara
hazelbury
cimoli
cotija
flashings
stansgate
bartleson
pasito
tartaruga
prepositioned
shawsville
ramappa
kevi
axius
carrollwood
koutoubia
credenhill
erakat
torbinski
qixing
cumparsita
gazali
rowntrees
brusqueness
pomés
elystan
opprobrious
sollett
charlson
schmiedel
shafrazi
mames
kochav
buildout
barbini
bombon
systematisation
bioid
abeyta
backstabbers
hypnotising
kaieda
invercauld
nxr
howdon
stuben
mohabat
oberkfell
nezavisne
individualizing
chernick
chachas
sembene
unsexy
usnavi
krummel
pvg
macroprudential
naah
austerely
xhci
playzone
raser
precarity
highstreet
stimulative
rippingale
humira
youell
dalbey
mulumbu
qaderi
gamme
larranaga
cannily
shiremoor
astolfi
kieckhefer
ruthy
fnk
repoint
gatx
matsesta
allegrini
herlie
workum
ippodromo
dipaola
spelter
dhawal
aleynikov
paddleboats
junsheng
gadzuric
fanboyism
timotei
wardy
fogal
hanzlik
dervaux
migra
usam
scarified
endearments
kegler
orebodies
ajamu
flanimals
salinarum
stashwick
snowcats
eworld
kraprayoon
lazing
braxted
loanable
folden
ndx
tomobe
njeri
suppurativa
simpleminded
tonno
mccullouch
rakhmanov
debswana
cashflows
heidkamp
lozzi
centrebet
beggary
ncnb
convallis
hatpin
listrik
catchable
acnp
schaad
beatts
gentrify
matvienko
dannhauser
trevin
dible
vayner
russen
ababil
lavonte
barrasford
aiso
busies
maims
dejonge
parmjit
musuem
ifq
hdad
winterstoke
montets
hollyshorts
helos
sutliff
niky
khuzami
ashante
eastasia
plunked
cardoon
upskirt
zuban
stumper
shorthaul
cardis
braise
tziolis
songline
dochev
tayag
plagerism
grzywacz
mellott
federle
opra
scantron
yaghoubi
milmore
stryd
cafcass
hurrican
sundy
jutzi
horticulturally
atacms
hilari
gorditas
krans
dampf
oip
medialab
immolates
mbongeni
janning
lawnmarket
honeyghan
cascione
ditore
dynion
skepper
drumochter
bakich
trahern
hadnot
galí
spanierman
dickheads
dubbins
varricchio
coraghessan
belgrove
gagandeep
baghdadia
rentrak
airin
bierbauer
arcieri
xiaoqiang
achcar
usinor
handcross
torrelodones
advisorshares
ndubuisi
mahinder
gpos
systemization
panjwaii
premisses
sflc
akitaka
ozen
eximbank
daikanyama
tdecu
agapov
chuao
zelzal
woodwards
intuitiveness
relase
penallta
lebhar
rebuses
ndukwe
grispi
coronie
bawitdaba
waorani
pierola
almay
cusic
arvilla
mishor
nissenson
trivialise
temma
hahs
kaethe
overplaying
ricon
maysonet
zubik
claycomb
teleton
arturas
emmaville
ablow
appliques
wasen
centenaries
ballycarry
pintscher
scaur
verraros
polytonal
migrationwatch
badejo
pressurizes
ipet
nanthana
védrines
crittendon
swanland
xijin
codel
beutner
dalindyebo
manko
bortnikov
mucca
calza
naghi
teguise
carnduff
oakshott
brodney
natia
premal
infoline
silenzi
myia
dettol
okposo
thrumpton
choat
phix
spindelegger
meert
allahdad
baysal
smallfield
starkist
cpam
keano
psagot
santaland
palaeontologia
swartkrans
vernell
seastreak
purveying
dakotah
naybet
cutzamala
bottome
borusewicz
noemie
glendevon
eagly
wallendas
deanes
gerhards
gestated
culio
gurode
copmanthorpe
nanosatellites
heckuva
liverwurst
undoubtly
prifysgol
jarnot
mathema
adamsdown
darinka
effused
moonlighters
glocksen
waldrom
sevani
ymha
merret
garlieston
isns
keliher
kozyra
vilalta
shadier
ubari
hohensalzburg
shmaltz
kouts
postcoital
agentes
fruchter
sellal
gassée
todisco
nsga
kitigan
kitesurfer
castree
nkotb
englar
mostviertel
metrocards
moema
currenly
seabourne
agy
kristien
mitesh
tighar
naison
sleptsova
cauterizing
cracklin
stanojevic
kalida
afew
morowitz
bonesetter
depreciable
zorbing
egomania
firstmark
seatback
beguelin
absentminded
gyfer
schoenborn
darryll
pilotta
zinsmeister
odenberg
doctoroff
sebastion
facil
wattay
vranken
attaboy
francom
samidare
quirkier
parkerson
mailmen
srikumar
hesterberg
carsphairn
raphi
matech
rachlis
kbtx
laskier
artmann
meghani
schnider
postnuptial
tholins
sleightholme
unitron
liberations
spiegelhalter
telectroscope
makarim
joumana
mridha
serradilla
bhagyam
relighting
coreceptor
naica
gscs
fcsl
mediano
skyscanner
ehec
eulogizes
pressac
mieh
contractive
ashqar
eurolines
nemazee
pantex
baselworld
mccuistion
semenko
daryan
janitzio
hertenstein
eszopiclone
kankava
zelezny
rafia
maisa
securid
enflame
beyazit
tampopo
groundbreaker
padwick
ferral
buonocore
hightech
candiate
serwa
levitch
augenbraum
wittier
polyanthus
filmhouse
nalapat
daubhill
ktnv
sterkel
ceris
harmelin
emed
eirian
scarlette
lewandowska
mareks
adge
failand
rosindell
waxie
bourla
hypercar
yonamine
thoses
fatuma
gessel
lafta
clearcuts
combien
fielmann
prenup
kalder
datini
motlagh
wwpr
klawock
hosi
kotis
sirenas
carcel
stefanko
semiautonomous
ruggie
svenn
diglycerides
readsboro
ahronot
waitlisted
kurtyka
petrolatum
txakoli
djojohadikusumo
himmelb
kerttula
conkright
reamon
stryer
ouca
weine
torian
barcella
chetia
powerhead
divests
spetchley
whag
wildblue
stanch
assuredness
suborn
parsonages
unscrupulously
pocosin
rastegar
throckley
kerosine
depetris
nanopores
lington
argolic
centron
jetley
hattar
saltuk
cauby
kfyi
nibelheim
areeba
dtis
andreoni
vfg
oleguer
coviello
carrasquillo
akthar
bickenhill
sebba
tchula
shakeshaft
bragason
multiplay
misallocation
smartboard
sures
forehands
gabbie
peñaflorida
pulseless
rbz
livas
sagkeeng
moaner
metroshuttle
bioengineers
morrocco
campervans
fishcake
llanfechain
lulzim
mbct
farraj
terdiman
metrolina
peteris
shareen
ecofriendly
hachey
dishnetwork
farberow
necula
collingdale
hifa
panthi
kabob
mealworm
teji
scousers
genesco
creditworthy
cosign
migh
blahblahblah
outgoings
janczyk
fraiture
abbeytown
nawroz
reservable
devilry
trebling
ponnuru
phalguni
dzeko
bocoum
stierlin
gianetti
bekoff
azincourt
peppo
tearjerkers
duscher
sandpits
wetangula
distressful
blared
schlaudraff
electrodeless
desensitisation
penurious
texico
ecrs
anatel
kramar
mâche
halemaumau
megahit
sabera
ulitskaya
lahia
hisbah
repeatly
sangstha
langhart
kaechon
pavlis
arciniegas
hignell
ezzatollah
usx
rusco
kowalchuk
lowboy
pitchess
machno
albannach
usin
cobaea
idbs
rohrbaugh
nisku
shadowboxer
maroma
safonau
schisler
dyfrig
sherando
taurian
halliwells
wizzair
tengan
rocos
videon
muoio
maryjane
boad
fraschilla
eclectically
wormold
fibrillin
bourillon
mallwyd
kolter
villan
vemula
metris
lambadi
comau
welcombe
cawker
sivtsov
lashin
mapesbury
vearncombe
zoltar
cambó
kambo
klix
wallpapered
reversable
yasuchika
potsy
periclean
shumon
hisahito
remoulade
eaie
pouts
ferneley
shamel
raquela
gawlik
dqe
tessem
walkure
theire
demutualization
salaskar
akaji
pilkingtons
revolucionarias
troedyrhiw
nolt
lebrock
zvjezdan
bedrocks
esurance
soudley
rahama
plangent
gheen
bortolussi
poltrona
cichero
wentbridge
peahen
derartu
niebel
kharazi
telesales
bearley
kopitiam
bernville
kiriasis
dunakeszi
bastardised
dajka
browntown
pecvd
theimer
spaziale
matovu
cephus
duesterberg
cynghanedd
luxin
highjacked
skans
aberdonian
stitser
hewanorra
harrowed
zhongxun
nbad
maos
balderrama
enev
eshkeri
mfw
activee
subotsky
nyoro
scil
garantita
covereage
quennevais
professoriate
vulgus
natm
fajt
tooths
akef
piskun
nederlanders
laurinda
holbrooks
qfp
sandbridge
jacquemetton
exeunt
bregy
rentaghost
requirment
marrah
tida
abertis
obloquy
swicord
nestin
telegrammed
pecorini
blackistone
ripoffs
teleradiology
procreated
tonsberg
nmpa
darrieussecq
zucchelli
ahlbeck
taylorstown
kyogen
wedgetail
reinvestigated
varuzhan
biergarten
fams
kanevsky
manaj
poobah
rames
diqing
cawl
fickman
melady
schlachtensee
liguo
alaïa
samorost
terrones
augmon
mweka
kanagaratnam
schefer
heliskiing
addyman
karasyov
bilkis
carthel
maulden
wte
ewm
elw
khogyani
eigler
woolfenden
elsenham
pavlyuk
citypass
medek
nawash
godtfred
mellars
rodker
kiona
toubab
sanjayan
rudden
xceed
stumpp
ulev
kammerling
grafter
laverstock
yuejin
botwright
tarpaper
tunchev
ananthaswamy
sagiv
mlodinow
cherng
myfox
goofus
macdermid
frafjord
radric
agboyibo
kanoe
hypothesising
yuanmingyuan
scurries
anbumani
natalina
filipi
whisperings
haneen
teddybear
shovelton
walasiewicz
rotovision
ballem
levenberg
middleclass
robeco
genoways
grgich
brenman
downlinked
lavrovsky
somport
jenkem
radkov
dangoor
nsse
hobnob
unengaging
arafah
sorbets
itsuko
fruitport
panor
hairnet
cannnot
ebrt
compartmentalised
scattini
prieska
chaison
salunke
scraptoft
alberstein
habomai
vennel
reroutes
wft
pollensa
holtman
karal
dajie
schouman
mukhriz
jaklin
kdic
parvan
trainspotters
doncella
klieman
corlette
dairylea
kazaam
loeper
heilbut
edeka
karani
brainlab
endoglin
recidivists
demaris
candar
doddering
heiloo
herchel
rafidah
aaap
deadened
castellazzi
tsaritsyno
clecs
adili
fondas
magorium
fering
followership
subassembly
trilliums
matalam
gadhia
celena
palmilla
liraz
dasatinib
gombo
cammas
typhoo
afmadow
burgis
krishnapatnam
macoutes
vectren
amiriya
midlander
kalashnikovs
bellicosity
kostovski
profond
awford
mackean
jeffires
foxed
mortgagees
midsole
climer
pramipexole
dolichenus
steira
melanins
edvardsson
vallar
requalification
tuley
chickerell
efird
ceibal
kicevo
krynicki
caqueta
nakara
zfns
jobie
pickell
clumsier
walburge
dalya
mehregan
rabanal
firt
ichiba
parleys
serani
releford
jvb
jevans
abettors
mcdean
transcriptionist
ferryside
lasala
dilbeck
denuding
colums
artifically
lykaion
micho
insu
underrun
pillowcases
hpw
kkc
dresen
hightail
splays
drospirenone
recharger
cizik
gumo
wijers
througout
hombach
determinists
potten
usni
ulner
crole
deffenbaugh
angelidis
pvrs
oversimplifications
teggart
shapiros
purlieu
suspendisse
influentials
oakamoor
okah
defleur
njue
limply
feinting
krainik
nautiyal
paviland
eksi
osmany
inhalable
mimbs
janwillem
exotropia
prahok
malltraeth
doggers
brownsword
trivett
polon
metway
microgaming
ccid
babelgum
lochans
bossons
mve
swiper
wessell
boysenberry
mlbam
gopalaswami
britania
epazote
ghostwrote
shalon
jpi
sensate
bizcocho
stehling
muhib
cusop
adang
dobrowski
kanellos
dibbell
daybrook
gulyas
consigns
kemner
stanke
vickey
pineywoods
dsme
kizashi
naguru
frakt
tatad
tahri
morthens
pliability
ciociara
olofinjana
gunselman
samotlor
hamdeen
manegold
musella
shirasu
inventure
paskey
yinhe
countercultures
elisabete
nimesh
derriere
jesser
pikalyovo
onishchenko
sommersby
nicolelis
murlidhar
abraxis
waterboard
enig
burbo
thanatology
luffing
oxcarts
pacesetters
marbleized
tichelaar
buit
likasi
salsero
collbran
frizette
shamsie
haemorrhoids
violetas
hjc
cuckolding
trexlertown
mcketta
capuchino
supose
beug
lacerate
cronenweth
tourvel
steinsson
tennie
remingtons
passings
semakau
tiozzo
rvx
trumaine
immunomodulation
tuitavake
stobhill
pritsker
dewell
knowlegeable
evictees
annella
hancheng
tiddler
pzp
schetyna
gratteri
kroell
fraternized
punshon
pattyn
clonoe
lasica
matrics
dols
pdgfr
tovil
tredington
dinelli
mphela
interlined
teenick
celene
marrin
ecopsychology
mintal
sparco
tropicalismo
beinisch
bahre
qorvis
jubileum
fortesque
hussan
alagiri
osisko
hillmorton
uclh
venson
passeggiata
comestibles
consiglieri
coheres
verley
zirndorf
sutzkever
schollander
borck
lacosamide
toczek
nedum
rosabella
rafu
lecturership
kreutzberger
anmer
jicama
morvah
ronak
saidel
shahzia
earthsearch
dallying
holtom
amera
terhorst
edrs
hulanicki
chachapoya
geff
nicchi
sateen
reprazent
miscount
givon
gasim
fulshear
wartorn
shoenberg
sourpuss
dileita
bwx
propositioning
chunfeng
cypres
bodyworks
goliat
minich
fullagar
alnitak
disestablishing
iodised
witchetty
evercreech
exeption
lumas
olenicoff
bodys
ghiglia
hypovereinsbank
herwald
mour
thropp
kvetching
nabala
llangibby
imprecations
akoni
crazzy
adducing
earplug
papoutsis
verbals
hipotecario
ghafari
sweigert
symmetrix
trows
curgenven
ladak
takfiri
khalfallah
disgorging
yaccarino
grumblings
kindnesses
déesse
xolos
sayres
ricou
gloomily
snitker
chakib
mjt
faulder
degregorio
goddammit
lierre
lithwick
diagnosticians
kravetz
runing
bishoff
verkhny
chemcam
muttur
turtlenecks
pcma
assistence
laharrague
invertigo
gayman
cuntz
kelway
suggitt
bowle
khmaladze
neostigmine
plumping
simmo
yamoto
monopolizes
nonrestrictive
crushable
deaccession
misericord
gotchas
pomares
colaiacovo
schoeps
brainteaser
dirgantara
sanner
somberly
nfte
griever
sorbier
sanft
jobbins
sayulita
bossangoa
henly
grabert
zulfiya
shukman
fayence
gjirokaster
pasayat
sovietism
marisha
nickens
cambert
afable
halloumi
shizuki
marchbank
geriatrician
formell
definatley
yhe
shipler
deportable
shooed
seife
unwieldiness
naken
wokefield
esops
stoolball
kislak
casola
butana
gothika
asrm
cornillac
hessman
adenomyosis
goodrington
geritol
bullmastiff
deshazo
mcway
spinningfields
boonoo
bertolino
waistlines
retoucher
sankaty
nonentities
fernao
discrepencies
riedinger
desvaux
phytoestrogen
hunkered
torosay
alreay
jaymay
carree
peyronnet
doyens
tetherball
swfs
cusiter
shinnick
alsp
besseling
fansler
protoplanet
kainer
cowered
essy
samrajya
loogie
reusser
xlb
amerine
lazybones
muffett
siqin
smacker
delfini
moscot
lupoli
henio
itex
bellota
rautenberg
gernandt
penparcau
anuszkiewicz
ramblas
flagstar
dejoria
crossdressers
exogenesis
jro
resending
waswo
kynan
rollcentre
demopoulos
labone
barters
spacewalking
alfreds
pneumococci
hudi
wahib
wahweap
ajira
heerema
shavkat
malifa
bookcrossing
reacquaint
barryville
przemyk
catelynn
aereas
phonautograph
franscisco
bioelectricity
treister
bigscreen
woodsongs
boghall
gujran
hjartarson
bodnant
forsooth
strassburger
chaises
comen
wildenberg
scrips
yering
chitale
herschberger
neuroimage
tubeworms
unsucessful
amikacin
borenius
maxvill
mohamedou
jauzion
mapou
kildale
soumaila
shivashankar
kurtas
stickered
kreischer
jdate
toibin
däubler
worlwide
insensitively
rahlfs
areti
ispo
enjambment
dudmaston
raimunda
rienstra
unlovely
cytokinins
maschmeyer
ulanhu
genever
benefiel
okorocha
casimira
rozon
cutajar
catinari
brearly
journalisten
genetica
kdnd
felbridge
cdis
canouan
rakhat
dakis
laneham
shdsl
romcom
windover
galak
berfield
pernia
cymreig
islamophobes
rubaie
holzen
gtfs
popera
swarth
surojit
alotau
majok
halcro
orse
iene
quesion
nerikes
waggling
heginbotham
consolatory
ranworth
wertsch
barbarities
kaffer
napolitana
contradictorily
ultrasoft
naeemi
fomr
deontay
assemby
beltone
ridgetops
stainburn
mehrabi
marooning
doumen
jinmei
cillizza
ichaso
rsls
saei
disip
zekeriya
delucchi
arpels
rassoul
strettle
colsaerts
debbe
overemphasizes
vineeta
charminster
weimin
misplaces
liveability
limpias
blacksod
pelon
ieronymos
luw
descibed
pushtu
alfas
layovers
aukin
oxpeckers
hoshyar
antley
suisan
ifft
deconstructionists
kuney
pedone
haziness
taxe
crays
landisville
semioli
ntombi
bloome
maloway
tatoo
bahareh
lnl
minhinnick
countrie
klawe
rawabi
zatarain
unruliness
bacolet
rheoli
differnce
facism
neovascular
drear
endelman
odwa
marquel
xiaohong
kislitsyn
camou
netcare
tave
brooklynites
savely
lascio
inure
crues
javadekar
rowthorn
muniesa
dismissible
harbage
shortboard
moneywatch
likings
istead
arment
pietrzyk
jackers
aperghis
aobut
ahdal
rosevelt
dumar
finnentrop
settees
edaf
heidel
matveeva
whql
chernus
taxane
gurak
yowell
alsberg
gladyshev
psychopharmacological
ninewa
slinks
decriminalizes
squeezy
titze
birdcages
pinelawn
riggott
aaaand
nurgul
desensitizing
hadjer
coonrod
garl
karsums
falkender
flegt
babuino
umán
perchloroethylene
lakdawalla
dynam
tetsuzo
accretive
laurenne
bozan
tijan
nausheen
bootlid
slopping
bendheim
opernball
orfield
sanlucar
brogger
geophones
bootlace
glueing
ojd
kamryn
longshots
thoracolumbar
hobman
hodur
fonssagrives
traini
cyfres
liebler
personell
piousness
malashenko
maslansky
underdiagnosed
plackett
weiman
sacrificium
guaco
gerstel
sysmex
requestors
voytek
piemontesi
forestburgh
waghef
tsunekazu
jobarteh
bumpkins
tandel
extraditions
vcsel
wucherer
kiviat
llanbrynmair
davidowitz
jiayin
wayson
winmill
gollings
gurrola
ospar
claffey
goetsch
cangrejo
solu
changeability
tristani
harwin
achivements
philodendrons
atiyya
luecke
huges
chinnici
murkiness
expirations
hunthausen
selloff
bramlage
udca
biolabs
bozzolo
zarghun
snowpark
magnini
mchc
huanuni
huya
almrei
profoundest
zaraysk
finessing
ivery
sawbuck
menzer
claimer
dalmahoy
anabuki
nyyc
shuttlecocks
easther
oecologia
containership
nissman
triolo
mirfin
sankeys
skanks
sundholm
baseggio
nevland
maxa
bubis
dinunzio
resentencing
cacher
martelle
wizardly
holshouser
guyville
jianfu
majlinda
thirza
kypseli
bodey
isesco
photocard
pessimistically
altenrhein
kalachev
gamefly
retributions
tetaz
tapuach
souers
kampia
felinheli
khamar
easo
paroxysms
upthegrove
barbadillo
horsington
exulting
stoppelman
sciamma
annalong
dabis
levines
leinert
ycf
hrbacek
venuses
ristretto
ilna
outted
lacker
lahiff
banier
groopman
editorialists
agio
effulgent
longobardo
norimasa
oriane
dunger
hellam
prunedale
hovda
primmer
arroyave
hawkley
homeplate
ribic
izuka
recrossing
deuxieme
gumpel
unsurmountable
linteus
sensitizers
lamna
charry
derventio
intrusives
golinkin
xihai
ribolla
dilwyn
mcmap
voorde
aamna
ihemelu
superpartners
richart
yongjun
zasloff
mazzio
deibel
nemon
dfsa
wakestock
duder
fulgoni
fangyu
qct
ubiratan
kobeissi
coniscliffe
suctioning
waterscape
gormly
krin
complaisant
mclaughlins
prifti
tarisio
hemm
aneuploid
ungerleider
chunnel
coquelles
lyophilized
danneberg
mely
saido
reflectively
koppie
breeks
oberkampf
shemin
kerbala
compulsories
darioush
genrich
microcosms
roxxxy
nocturia
bicket
hpac
frizz
grandcamp
glub
mariné
zuccaro
scrutinises
whaplode
zecco
polge
trefniadau
brutalization
layaway
ozgur
ucatt
senegalus
madaki
gosal
fayza
alltwen
jangles
pedalled
vpf
loktionov
altoids
yingjiang
banjolele
suburbanisation
poyntzpass
spendings
leonidio
eskbank
yetminster
gaeseong
credenza
nadeen
twing
tkts
rehau
msrb
intell
onik
loughry
palletizing
sanctifies
herda
sprengelmeyer
lockdowns
huesman
corpas
stolzing
thamsanqa
barbazza
vocalising
rafid
csere
mccorkindale
kurmangazy
brakebills
sheilagh
hardeen
koert
commoditized
shahrazad
supervening
cavalleri
amsallem
ashei
blooding
copulates
calik
unanchored
irdeto
boltbus
laywer
alabbar
madchen
penllergaer
crowcroft
costières
piiroja
cantet
pueda
uluwatu
gilkison
alekseeva
exall
bruehl
misappropriations
dustoff
aliette
rosefeldt
bcaa
vencer
clareville
xintiandi
backmarker
wetherington
boomeranged
spamminess
talacre
namira
amphicar
kayonza
schaumann
aleisha
parboiling
extrememly
indulis
noncooperation
unreconciled
hengelbrock
wordsmiths
yian
gogland
aldam
inauthenticity
lingang
fibrates
balkwill
sprackling
staros
cassez
feriha
craigan
payees
thers
heathy
fundacao
flaneur
discretions
anglicanorum
sheas
tarum
rovensky
baojun
taton
rapkin
spiting
verleger
tabarrok
radnofsky
glioblastomas
draycote
berndtson
varaiya
countesthorpe
masui
hochbaum
quaintness
pluckley
fellig
reifsnyder
blanik
politer
appaling
thell
andasibe
clampus
itre
schwieger
grimey
cefas
yotta
rehmatullah
schwalger
lambskin
opex
afran
tristique
leski
erdös
zelasko
thwack
tomasevic
llanddeusant
diamantinasaurus
klonowski
swithinbank
corvara
folbre
lodish
tatro
comprehensibly
stingily
kuun
innsworth
soboroff
rodder
hansons
dunfanaghy
shivshankar
dialidol
clyman
magnaghi
dionisis
demographical
baroin
nirim
budesonide
tzatziki
accoring
torlakson
leigertwood
yoox
rekhi
taransay
pandeglang
hatvany
heidelbergcement
unordained
zuider
invertase
hellawell
fehily
recertify
dodrill
dymo
benicassim
unstamped
boluda
adsa
gostelow
ghostzapper
demmer
hassig
mylitta
umcor
corleones
cinisi
greenfly
shawkey
nosseck
scheila
centralen
penrhys
lockerby
linkoping
abusable
colantonio
pitahaya
cruellest
malicky
coldrick
crammond
mincey
burcombe
clipsham
montelena
lustily
cmha
kvea
vrbata
huls
storry
diara
mulherin
aedc
choise
ganciclovir
cazal
nordwall
faezeh
etos
zients
setser
arade
northeaster
fhimah
lewinter
pfra
jrfu
exasperate
nwaneri
onischuk
ispot
jiggles
tithed
sofias
lscg
yoncalla
farrance
immeadiately
freehill
nebbou
crume
frontbenchers
jerrycan
bontnewydd
yearby
pressers
palemon
dizin
hockfield
faki
founts
eataly
hamood
plga
kortan
underpaying
kunimura
recommencement
kteh
matloff
boardriders
rossing
svare
mssc
bizhan
flamstead
incitements
chamu
montek
frolick
fawzan
daequan
egglesfield
rutube
aibu
rabiya
ldas
musictoday
adulterating
lochgoilhead
famuyiwa
giovana
scripturally
hemsky
rawstorne
cenex
abeywardena
muffles
applecare
willke
ocassionally
achten
signaler
memorialising
nsbri
skovdahl
surajit
rehoming
alldays
krims
practicioners
autissier
anthocyanidins
sneck
allert
zwiebel
kayaked
,we
bismarckian
remedio
foucart
tripes
coryphodon
hddvd
revist
wiggo
stefanel
cegetel
villalva
gloor
hoelzer
tinte
survery
exquisitus
expertize
olara
utech
calyon
gennaker
phia
quantrell
christow
eunoia
latry
larapinta
bozovic
heredero
prosperidad
rotz
rooves
lebara
supressing
ferencvaros
berdmore
linighan
crable
edolphus
flirtatiously
overstaffed
witbier
jennerstown
chocky
dorschel
priyantha
rhydymwyn
samaroff
aknowledge
maplestead
ripudaman
byshovets
compliances
veguilla
liani
vichyssoise
brownsover
toura
harow
onetouch
chattrapati
wittersham
ogley
agrotourism
inergy
eassy
seductiveness
ctrc
weger
shanell
uludere
mcgladrey
neasham
xilitla
documenter
hardknott
mohareb
figues
motorcoaches
rogalin
kelleway
postilion
hypothecated
staedtler
bejart
koiwa
paesano
incautiously
thoes
narissa
numbersusa
lisabeth
belabour
hexal
kormakitis
zhus
waugaman
hommet
melgaard
kondoh
wenski
chunari
moty
gongan
licenser
ructions
stedham
mcwatters
firstbank
zabit
beaner
esplanades
garinagu
roadholding
espon
measha
isdell
wolfsthal
mohmands
shedded
zubeir
philippot
castronova
andrabi
jovel
holkeri
shrinker
sealaska
mcphedran
jordanelle
lumper
auder
lisagor
eletrobras
gallimaufry
breadmaking
oocl
burrabazar
westmoor
numerable
garçonne
nanophase
kanfer
joyriders
peñoles
litzenberger
klimchuk
madheshi
chavismo
sketchley
buwalda
brasso
jarka
grèves
lorentzon
toonces
adetokunbo
populaces
reinventions
nacchio
dongarra
napiers
treepeople
shifman
kandos
sylvaner
curandera
unimprovable
cantel
dlugosz
driggers
mentation
hoarfrost
perly
lasciviousness
lual
deyes
ineradicable
mxd
jouin
earlsfort
scriptment
cfius
koyra
canabal
freestate
janala
grinko
boolarra
youlgreave
synthon
cente
ryeland
ellmore
alaykum
strel
cornillet
cabret
deadheading
fmtv
yehudai
karenia
usurbil
leibell
transgenders
rantisi
nuttal
shagaya
holybourne
kupreanof
mallorie
buddleia
khanvilkar
vilija
demarre
ebaumsworld
surtitles
werlein
russets
uder
medicale
pechmann
vorobyev
matzos
commutair
scholtes
homestate
baddour
muminov
inebriates
matsikenyeri
deathtoll
boutsikaris
haicang
grv
gigalitres
eurodif
erzsebet
rintala
chalerm
bladud
rudetsky
serrin
ridgen
diapered
kobes
larbey
plaa
preventively
cristos
equalisers
hoften
zembiec
cubbington
gulabchand
dunsborough
gergis
debjani
jinshui
legan
whomping
ostreicher
mainbocher
weathervanes
spota
oduor
barisic
rememberance
aestheticization
pizzaro
geoss
navanethem
duramed
bioequivalence
divito
polesden
daurat
aurach
carports
dongsha
castellino
amea
jonsdottir
maray
muirhouse
crivella
nonsenses
babyish
gorebridge
nailz
belger
uhw
noirin
matchpoints
merceditas
calihan
oxleas
starrucca
jfr
ballynure
motorcades
zylberstein
khachigian
ekker
erdoes
equipos
wrings
gkm
kichiemon
figley
stagnancy
schonert
pathy
fenlands
grawe
misdated
banquette
pelto
waksal
sportives
schlepper
diala
ecolodge
smelley
regney
diod
reddam
leyna
aircell
khotso
oringer
crox
sanitariums
rpsi
izibor
peirano
brackenreid
joern
pontprennau
wadood
empedocle
chocula
hernon
worklife
pecina
wemmer
johnsonian
starphoenix
wrynn
irmas
heelis
carjackings
chloropicrin
asnelles
capab
omayra
dtmb
kyoo
paumier
immortalise
maholm
tanji
geoplin
beanball
efficent
vtg
housepainter
sipora
adisonline
pepperland
teacake
histolyticum
dareus
anwen
sparekassen
shayer
maiorana
swatis
tonetti
unsellable
karawaci
galinhas
décors
cloake
muridke
globalive
upk
babylons
tsys
repacked
earflaps
tartness
linch
whingeing
kopeks
meringues
vendt
czaban
maatouk
thirith
marangu
‘…
accutane
templewood
spiga
zenor
bardai
vassanji
chromosomally
pulawy
rufi
upsized
nutgrove
mdos
phillippines
anerio
ecotones
westheim
compean
downlinks
boybands
withee
nhps
pannett
christkind
iskandariyah
cavotec
neace
stinespring
readjustments
mcworld
gusmao
plew
fieldworker
wbig
mojadidi
chikovani
syda
callaham
kusina
vixie
gnakpa
wolton
kotey
tigelaar
woeste
optokinetic
rizwanur
kouyoumdjian
bettystown
sonza
alarmists
rösti
geohazards
frostrup
horseguards
signwriter
erleigh
iansa
sbisa
zhezkazgan
ieo
metsch
herzing
beleave
nimer
morejon
circumstancial
ogles
zakouma
cajoles
gruenfeld
workboats
mincy
pouf
cuxton
keusch
homebody
lesy
noele
fingerlike
doek
avenham
condotta
norbertines
pribyl
krasa
superchair
fedai
mcac
googlies
tragedienne
bairds
sragen
baradar
adolphine
exercisers
whiskas
sylve
notah
embarrased
aramex
catmore
boeckman
inntal
walham
australovenator
himsworth
maturi
joynes
wrye
northsea
zerpa
fluoroscope
pozsgay
feebleminded
abrons
wictor
seavers
cerak
sayigh
reify
charmings
fishpool
aroca
protectant
bhutani
scratchin
parasuraman
angrist
baires
legeay
telegenic
cumberford
procaccino
hilderbrand
habberley
maringouin
siljander
shipworms
pierremont
khansa
berbizier
diprima
bahner
selwin
qincheng
lansoprazole
defrees
precipitately
ingliston
relman
valtos
simun
frappuccino
fetchers
stute
ifco
sheering
zukowski
kristyna
escobal
longpre
fjd
passeth
ysi
kasperczak
zorthian
hohneck
michalska
pinckard
duez
thems
thelander
cadburys
qapu
superabundant
skirrid
golliwogs
nahma
byran
mocco
customink
shoshones
jaehn
adlers
rationalisations
byssal
neotel
sentri
gelderlander
wichert
appleinsider
reincorporating
electrophysiologist
servicable
revolte
schlabach
elzen
collaros
bunna
walles
norelco
enlightment
neuenfels
circumvesuviana
aideen
debasis
shamlan
satloff
chemotherapeutics
sindone
pharetra
katelin
gumline
darvell
knish
funkiest
reposes
sjodin
sives
addres
lacq
rasing
gumboots
jarron
bienfait
gridshell
yongchaiyudh
melyssa
torbjorn
irrc
alayon
mazzucco
tavaglione
submenu
rachad
mamenchisaurus
mclees
movietickets
videocore
umicore
hni
sucession
burlton
horchow
stoppered
aqel
jiaxuan
lookie
charterholders
swapper
gravamen
yardi
coastliner
adzuki
estudante
mouhamadou
pyrotechnical
kiesha
kambangan
peplinski
yinlong
eurovan
communi
brinkhorst
pasian
zaetta
mosney
plewman
bomas
lanzafame
gobena
anxiousness
incrementalism
brashares
aoac
bartkowicz
zytomirski
kleckner
monthlong
cernobbio
mainsteam
dunwell
zutter
vesty
pokolbin
poorter
maolin
mareb
scarcities
napqi
lonelier
garbling
winco
walshes
knightshayes
kosara
dampeners
shelat
naughten
hammerschlag
hypnotherapists
guzzle
underpriced
bervoets
installshield
gerbes
yalcin
zombification
keidanren
blocos
cywka
beidi
msdf
anyukov
berlind
beisner
ulchi
orchestrion
bambury
ilhabela
rebased
makumbi
pseudopanax
harch
jovially
ihsanoglu
salhus
plaintively
gurning
herenton
sadowska
chuah
temporoparietal
schmierer
marmalades
jbod
clarie
krolikowski
sleeze
lalia
overcompensating
holoman
malaguena
antipasto
munduruku
shalin
milliamps
falera
studivz
batad
excessiveness
mysterioso
fricken
forssman
sydbank
itouch
resealing
malul
hulkkonen
stybarrow
monteroni
sned
kandula
baricco
baladiyat
beachmaster
gemballa
haimendorf
superskills
chickenfeed
lorenzon
hornick
shirenewton
supernews
izambard
unterman
perspire
entomb
alfven
langbank
appelation
assistances
hazlemere
aspirating
maïa
mobistar
ameerah
semisolid
pucciarelli
borte
appalshop
zald
seini
aciman
singley
hukkelberg
thng
adaware
lisandra
motola
probaby
guyancourt
mediadefender
marsee
grumpier
archfiend
nourizadeh
baev
peddles
edgerley
feio
pinkberry
anzhela
walesby
skinningrove
lazell
reoccuring
karrada
itziar
leiderman
makola
mellman
actie
scofflaw
thade
sukari
aruga
recuses
semitropical
maxtv
lehal
topkick
gelberg
feedstuffs
skorton
aylesham
quiring
boardinghouses
pouliquen
baseliner
stosh
cockapoo
golfsmith
junes
djeli
mallorcan
jahurul
radul
steib
previos
kmx
filloux
ceff
oganov
deiana
pernickety
koether
nergis
bides
inapposite
twinkly
ateek
unfaired
dantzic
yazalde
banaji
rzb
rief
atps
keshubhai
akayesu
deutchman
bernelle
bacause
loznitsa
rouslan
galliford
ergasias
dimauro
sameen
freestar
pimozide
runako
acteal
grennell
mirrer
wollan
rusks
xueqi
sanitising
hemerdon
multiplicata
gulled
riabko
crittle
llanybydder
dimplex
salwen
sistah
caballe
quarrendon
jordens
thaci
sudhin
moshulu
vilem
exsisting
alexopoulos
submicroscopic
washerman
woodlynne
sigurdarson
posthole
discoursing
ilios
volponi
beteta
porical
birtwell
beleve
curabitur
changlong
idvd
takach
helmetta
läckberg
deterence
brenninkmeijer
colehill
sprüth
kontiki
tavella
kavlak
celinda
yiyi
rajaa
rabner
firebombings
regarder
kanarek
kanebo
ghysels
moheli
schroen
kierston
lochbroom
stech
iati
bressoud
karamoko
sadaqah
layzell
unflatteringly
crankiness
exista
briery
bolea
nightgowns
voorsanger
brondby
kozmo
beduin
morizet
dipesh
maiori
backrooms
wwjd
basabe
kessner
epicor
hypolite
suos
pettigo
javerbaum
verro
munley
pantoprazole
effluvia
ardkinglas
kakimoto
homebrewed
priding
iosia
louv
dege
pcmc
bidon
hangam
asthal
chapatis
childrenswear
digiacomo
patissier
merling
gasport
seima
lizer
assadullah
synergos
geske
berlie
preljocaj
dellis
leemann
nigiri
informaticians
mejorado
toano
beagrie
nvcc
gazprombank
nejame
freightways
kaiga
cornfed
bundeskriminalamt
sleepwalks
mcnevin
harrumph
protandim
raheb
valeriani
digium
muradi
jinong
doanh
demuro
caniparoli
alípio
bloodsworth
setara
flus
rephrases
palong
swimmin
obeidallah
timewise
cugno
schickedanz
schrieber
vanson
unoffical
vertis
fircroft
gerding
betterments
shappi
mydoom
coquelicot
ghillies
naq
compounder
sarshar
pachulski
dolbadarn
bilinda
obviosly
sutpen
efra
seclude
tysk
lakotas
hassin
krawiec
laroussi
raiffeisenbank
miyachi
demmler
maccagno
itau
aaq
anabaa
derakhshani
glenolden
exegeses
raisani
deathstar
marzabotto
transparancy
shyan
scrag
kidscape
zellous
palios
chequebook
hirshman
kamdar
sackur
sihala
ufdd
bracelin
silvercup
bomberger
awat
camin
crdi
shroder
turle
yolly
matumbi
siner
electability
imagemovers
josepho
matchbooks
hakizimana
sanli
marconnet
peskett
bibikov
ermina
valiance
bunglawala
kelser
hese
rantissi
cotteridge
wbbr
oreland
saharicus
underuse
olexiy
pushman
ophuls
cresaptown
kruje
meyjes
adduci
meadowside
denburn
imperatively
gibsonburg
radimov
aquacultural
conglomerated
telemaque
cytometer
freepbx
echave
autumnwatch
vucic
struever
flagger
mcmenamy
venki
ashely
dhis
reselection
slouched
gomboc
yurek
containerships
ltps
olaniyan
rayuela
arousals
enthralls
anora
joff
bjornsson
silvestrini
kadhir
repub
vaitheeswaran
kwatinetz
careflight
deadrise
gorlov
meneguzzi
lanschot
energo
fawehinmi
sapsan
gypo
tubemogul
hysaj
anovulatory
olesko
mafai
viaud
nikias
frisking
nuthouse
buaben
vorobey
calculability
mazzo
ponikarovsky
originalists
badonkadonk
undersize
democratising
cageprisoners
elsztain
jumpstarted
framley
decisionmakers
cayard
yunyang
mestia
olofson
alberca
gige
ilisu
yechury
magistretti
hubin
rigopulos
pipefitter
leganes
malah
proceedures
protokoll
shooglenifty
ercolani
attus
technosphere
robidas
disemvoweling
marangos
rolfs
ferromex
garics
decabde
laronde
awy
ffvs
dawdy
funso
lambasts
inso
joesph
orensanz
aztar
wkys
overstayers
mudpuppy
banji
visualises
lifenews
hubbel
kotake
navez
itrc
chipo
garmash
osbi
rasner
jamarko
abilify
fpk
kaneta
woodfox
fauzy
messerschmitts
chessani
fudgie
meinke
reproving
naím
rouart
percee
sergiev
nonwhites
polarbear
ilink
licken
nowroz
acamprosate
wahlert
manhandle
embracement
norihito
demonising
hosepipe
eurojet
shoegazer
ungers
hawtree
billesdon
zysk
wolfish
razmadze
kumpf
mannschaft
hartmans
bruzzese
pognon
tintina
athough
knep
yigong
changli
rautela
lakisha
godfray
lencquesaing
microgrids
mularoni
musabayev
kiptoo
blagojevic
romley
taphorn
depiero
recyclebank
posthorn
gharbiya
vijai
samwel
formely
blueway
taltala
macmath
calenders
beshty
moneyweek
grotjahn
ansal
stenmarck
situbondo
pedraja
sweeden
mankulam
piquancy
radicchio
rahmstorf
cpsf
homogenisation
gettinger
amazonians
deceitfulness
galard
varennikov
igam
guillame
morie
eisteddfods
herges
shahwani
mummify
keilson
younas
buik
lagrossa
molcho
anthracyclines
voelz
leiba
ngoo
rouco
gyrfalcons
bessinger
speek
freeville
seastrunk
jerilyn
pedasí
dustour
teletech
carerra
bundchen
verle
nystedt
garceau
indiviual
abbeyhill
setlur
acadamy
traumatize
wrair
wams
trahison
alpaslan
sumichrast
sals
baloon
cinquanta
bambach
laabs
proxim
sulamani
gensel
petrosa
tunefulness
skinnyman
contres
ardington
tsakane
trealaw
delaire
rechargable
direst
pluijm
tiddington
souchong
isues
bagful
evasively
thrustssc
protofeathers
maffett
menarini
recapitalizations
danielsville
urofsky
istúriz
polivka
heier
bassim
sterban
darlins
rodkin
gaubatz
bilked
wech
llop
tharwat
anelay
staredown
neglible
excoriates
santour
itabira
ruibal
scarnecchia
biggi
zuabi
avriel
forebearers
muireann
katurian
mazola
physiatrist
balintore
batmanghelidjh
unitaid
kuqa
perkis
nebulously
staddle
negret
haarsma
greatstone
castlepoint
mitx
sentell
uui
bongers
joles
soyabean
mallusk
alagiah
zft
miniatur
madlung
eclisse
darbus
votives
jonck
storari
hajigak
saderat
giannantonio
bessonov
timmonsville
forbis
extrahepatic
schemm
consorte
barrabas
solness
medflight
critisize
adipocere
lingmerth
earthshaking
villeta
kuoni
peacejam
pugilistica
alethia
eldercare
bokator
boilen
skandar
potager
moehring
lemmerman
thumm
sudlow
clipboards
mabasa
gulnar
nightdress
bendon
schwetz
crosshatching
chlorothalonil
echikunwoke
retrench
junggar
batchelors
konovalova
hsms
pursers
delusive
zedkaia
interviu
schlissel
stillwagon
bocaccio
neurostimulation
wankery
dialoge
latavius
plasmonics
augmentee
holc
everyway
olgas
superlove
kettley
nikaah
redshanks
muneo
roup
tiggers
herv
nonfried
graffanino
moggie
barbules
doormats
cumbias
ficca
ruzowitzky
matsakis
mindarus
astrachan
hersman
urick
yaxham
rovos
voser
tanden
cecils
waterphone
awj
touchez
naats
dauenhauer
yifat
makiya
dece
azucarera
erinys
miltonic
dietel
woollacott
nobo
bomback
axehead
laight
valueable
paracentesis
bobba
korchagin
livescribe
fangchenggang
puello
cryder
babie
zett
mammadli
genious
clytha
photomasks
rinuccio
maaninka
unsuspectingly
prufer
mitz
burbling
uppie
bolzan
overshoes
dayao
defraying
dilnot
intercon
mackiernan
breece
eargle
swol
hirotada
kator
sharapov
jurinac
aldwick
supervia
sewanhaka
dubee
oldwick
quines
horwitt
cymorth
onekama
theobold
psychotropics
hailin
toymakers
pommery
movs
drigg
mistrials
deutekom
americaspeaks
vodcast
coalburn
eurostars
thilina
wojciechowska
tshuva
jccc
brys
niched
haukaas
gasiorowski
ironville
worldgroup
gorbach
esentially
luminar
toshimi
insouciant
aenean
alesina
tempodrom
bruzelius
cammarelle
rachakonda
llangynidr
laguiller
buteau
beadman
raül
renita
antiperspirants
modrikamen
jurisprudent
forwardness
godlee
peseiro
gladsome
berting
adjud
flapjacks
adelsheim
eskender
descibe
oanda
zimride
interveners
descargas
glympton
lewer
sowerbutts
axenrot
gazimestan
holtsville
yoli
buggins
moredun
maduka
cybrid
transgaz
ejup
lesmo
orston
watsco
perpetrates
glyndyfrdwy
elasticated
noveck
onetto
janow
deejayed
disemboweling
caracappa
titeuf
schoenmaker
kayte
nahayan
sthe
stanich
eastaugh
druridge
ouja
wissman
bellizzi
bayron
mairtin
thiazolidinediones
friers
gofio
méïté
nonmarket
samassa
fornicating
sotg
watelet
swatragh
sayem
chhum
drossel
reguardless
muoi
vomitus
thielman
connerton
chambas
disfunctional
zelalem
rheinstein
raai
kktv
yuccas
littlebury
ussi
hejda
mithaq
aetn
bueb
belladrum
barcina
intermec
kaniel
foxhills
withholdings
bozarth
klingensmith
hasay
karthick
petzel
meyong
measurers
longplayer
golddigger
juki
malheiro
rajaiah
lanasa
venders
acig
sherone
sparer
opels
gobetti
dosser
ipass
astal
lerone
carvana
waffled
jells
blackridge
chaenomeles
ndidi
hooted
zoladz
marcle
leuci
borbolla
cattley
kharaz
kissology
cetp
destler
dualisms
vallecillo
antivir
indulgently
fasque
chatshow
copiapo
hanaro
thelonius
edelmira
icbs
snakey
wixon
cobbins
hornel
pantala
longwu
coudl
camfed
jerrys
inforamtion
liebestraum
dismuke
bcap
boeve
oohs
kuwano
manuelo
eagled
peacehealth
lihong
fukumitsu
exoatmospheric
folias
kellingley
techworld
talula
spyer
menis
ilaje
smukler
arton
wrests
gargar
leli
llansadwrn
kjrh
carfagna
claques
engelder
nsenga
seyam
pagodinho
hueck
namias
ssab
jailani
bellboys
razel
excitements
ktab
pozen
dudhope
buttoning
tocq
afilias
antidumping
mckirdy
xwd
bagaran
lauk
sapos
bacary
cynllunio
stentz
secos
sanita
nitkowski
miv
kosoy
garroted
wfg
sandier
buckelew
unles
nonspeaking
trentadue
yanet
magnarelli
swails
hwacha
dohn
multicentre
cultivatable
tiefenthaler
isilon
cadstar
vondra
nores
hollyrod
mugnier
kollin
bonifassi
alamshar
kyoritsu
majilis
boulkheir
functionless
tawse
sélavy
unlicenced
corryvreckan
stavsky
cruzadas
counterman
cantlow
ranitomeya
kondos
bankolé
mastectomies
schwarber
franczak
chapela
relook
secura
petersens
crombeen
firstar
ivedik
clickthrough
appartements
wisco
globalcom
ecclefechan
rospuda
flowline
orz
akunyili
zeldovich
shulan
demetz
cammon
andys
ncfe
lodis
lemas
jerkens
atmosphères
bucketful
schettler
travessa
workshare
turfing
tulliver
siguenza
dusshera
ycd
weatherstone
harrietsham
autarkic
suspiciousness
tapiriit
kanatami
ambuehl
ditmarsh
saletta
traister
scouller
orama
alekperov
godik
unpolitical
omakase
berenato
caligaris
knisley
prattling
pctv
lladró
mizzima
habbush
ryugyong
togther
shahra
fogden
lincomycin
necas
kallay
blogposts
mcaskill
gabeira
zolotas
pinggu
dagano
bedum
ecover
donges
rodamco
measureit
fragged
sonawane
silverjet
jieshou
maleeha
umph
bolani
whiteleys
glutted
diamondville
gyamfi
sansbury
guterson
hijinx
suddently
jkx
corpwatch
mindich
neutralises
autobody
stube
weltz
hendrina
kanodia
nastassia
minehan
urac
suraya
kandahari
quashes
overfill
eronen
viatical
rechtman
ambac
stuhl
wallice
ggr
mcfate
danaya
allerston
elphicke
pensby
rentiers
toldot
ocalan
yeshi
phytase
oaksterdam
llanbedrog
eboni
doerfler
bentson
peasmarsh
sanes
ovl
luwan
relin
alih
furnham
packiam
katies
gatten
cinemedia
engagé
francky
inservice
wonderstruck
klatsch
rolanda
rabit
ppts
furrey
erté
brase
ccim
hasner
daithi
nemone
liebeskind
connive
fack
sunliner
jalava
sabesp
aleya
loitered
spidered
glowering
brierre
gilsdorf
ziliak
klau
cesaria
dissembled
ilot
ivn
grondahl
scaw
sohlman
dapoxetine
esearch
sielecki
kamai
stulberg
girs
seamier
tokuji
neysa
wenfu
euroregions
bruggeman
inundates
preciously
vitellogenin
condign
redken
mytchett
egdon
alchin
gomorra
northwold
wrightman
mullholland
vomero
isuppli
kaanapali
nemenyi
leiken
jasmonic
kimolos
tuckshop
adductors
aliante
shichahai
downswing
nicean
montès
ashforth
nonthermal
iju
takiji
cheerily
harithi
torchmark
noncommunist
zappalà
tamsulosin
wruck
rijks
footwell
sarcoptic
portlands
horeca
vaporetto
itemizing
grousing
aliabadi
isehara
sadoski
penhill
overcompensation
larders
prevaricated
chengjun
rankles
terramax
newsfeeds
cfma
inat
antoniazzi
wssc
brinksmanship
raisuli
floorwalker
tucher
shahrokhi
cxs
akbay
lookback
statfjord
corbier
unfindable
webcrawler
crunchies
slabber
intragovernmental
henleaze
yarza
banjaran
decant
landsea
hemispherectomy
panpipe
reindl
karasik
teith
scatchard
bhulaiyaa
firebases
gleissner
bevell
lipin
garrin
genon
accomodations
trimarco
snappin
caliburn
couponing
briefness
imsc
crialese
siic
costella
moisan
hypermasculine
phillipi
lalka
calinda
viad
rgcs
villepreux
grenvilles
fenninger
exemestane
ingonish
wellheads
eltahawy
ccls
ballynoe
ganne
beidh
delshad
dersingham
nmtc
applin
chenowith
zaner
lodato
littlestone
llanbradach
kammback
mcinroy
duchaussoy
kalthoum
goldenrods
miyares
gehan
bainian
coverall
burckle
dismissable
juridica
lapdance
latshaw
cariparma
discribe
kyubey
burmistrov
snowbasin
doliche
eurofins
lucians
kersa
mceliece
volger
bazmee
evenin
dullah
cliver
novant
orsova
kaniskina
moderniser
skydives
livestreaming
inuits
karadere
coagulating
grayton
tribler
angang
kinking
unamplified
deryk
apanowicz
zimpher
guled
guyanan
sanvicente
zajicek
equilar
keidel
lietzke
zayo
sydling
foetid
bims
jaczko
bartenev
szajna
carjack
ravensbruck
grenoside
chebeague
rakau
desflurane
endourology
reprogrammable
repossessing
shakier
boykoff
arifa
dussel
tayseer
jawwal
averring
whitelands
glaubitz
declo
hamot
touchpoint
levelheadedness
zingy
greencine
kneza
pardners
strech
bioversity
sarandos
motorisation
oyebanjo
claretta
hedis
tooba
botolphs
sifma
lanty
immunochemistry
transorbital
piech
alegrías
debell
denish
farnsfield
prankish
youga
claudino
gramozi
parenté
overstretch
stadnicki
gerecht
bandurski
darshaan
coachhouse
griped
nuveen
sampanthan
wagonmaster
infeasibility
bernot
decock
buttering
excisional
internationalise
kacho
nanok
kuehner
valesky
bahloul
lyncombe
buckholtz
septembers
lamen
loadmasters
qualter
bonymaen
fikile
dallal
mofetil
prosor
defamer
refocuses
burglarize
lierop
dhimma
prabakaran
jabbo
bounderby
cowpen
forby
redzepi
shvartsman
hattrup
altantuya
pundir
trackpoint
gadling
trellised
nyarota
falbo
kabiru
sadlowski
labeler
zuerst
lcci
hinche
goverments
waygood
fraph
sculpturecenter
mvu
superlatively
brozek
sathers
dirtying
bekkering
devor
olafson
woodmill
encomiums
cocarde
suleimani
associa
blazars
spätlese
mandil
abendzeitung
idph
lenagan
seagoe
jiahua
sunja
nipun
chowhound
loctite
stricly
brisseau
suheir
swaggers
listyev
orleanian
briney
tdvision
luty
zahau
airfone
bonifield
myit
mantero
encrustation
sceptically
dcli
exergaming
hermosilla
fops
conforme
bagans
homochitto
shabi
cybercriminals
electrosurgical
mayodan
coulport
boeings
claria
magney
llansilin
frenkiel
rocawear
kansha
strathdevon
jovic
velina
kttc
momix
lieske
kovels
sturckow
nmv
heteren
drubbed
drange
smocking
mpact
omino
pikey
alschuler
discolour
deconstructionism
wentwood
johnsgard
llanystumdwy
ljs
daniken
merilyn
cozen
patoski
disinter
systran
greycroft
kasparaitis
cognex
sassoli
leckford
submachinegun
tiruchelvam
shockumentary
daling
bihi
margeaux
melodicism
freeroll
wisborough
langho
duyen
kuder
lambertson
mohácsi
ramonti
discher
inelegantly
hyperthyroid
rinascente
boudu
masnick
lawned
tannous
tawn
mvj
mnos
damaschke
poretta
jemina
dugher
copayments
darus
marissen
ledingham
slifka
windrows
aerosonde
txi
skarstedt
timolol
appeares
sisyphos
objectiveness
claster
birand
allessandro
zúniga
earthmover
fiddleheads
kuensel
drayer
maugeri
busying
fogl
smedberg
argumentativeness
fantana
remko
annoucement
fradin
rouche
bizzaro
movius
shakeri
wafts
runwell
scorches
firetrap
asist
tukiainen
gentlemans
phytomedicine
expocentre
semipermanent
serogroups
latto
twisties
exterieur
mortella
porkers
fontanarossa
camoletti
galiazzo
kurskaya
esoterically
fundaments
callosities
pulposus
wispers
andouille
cranor
koebel
jeesh
makaa
shebeens
eifman
kirkconnell
cryptococcal
idrs
lechter
corpach
sneem
mewelde
guffaws
gounon
groux
darbepoetin
alicat
mehnert
uyana
chillis
kabbage
brinsford
yahir
phoneline
sisario
ybf
strimple
tangley
buchinger
jabriya
baradero
goffriller
desousa
greenstead
moonrakers
aracruz
seehafer
acrobatically
vegoose
dunivant
flexa
ovr
yongpyong
seib
giugni
nesc
metzer
atmospherically
qof
seiple
skinnies
jiangyou
caulkers
anane
freixenet
toubon
khone
rtlm
mccornack
precharge
simonside
servas
insadong
bargeddie
stefaniuk
iyabo
kandace
psca
kudlak
cotinus
tapwater
khamdamov
gutka
chungmugong
renno
rupiahs
coykendall
ineluctable
marketo
viennale
iqair
costales
morwellham
mainsheet
yufei
taione
kalousek
alderwoman
pleanty
amirah
willsie
fulls
boskovich
tipsters
governator
ranst
cointrin
unequivocably
finlaggan
shovelers
sanbao
rkh
wjno
pseudobulbar
drabkin
mefford
heatlie
bakulin
hammen
jarrettsville
mcglohon
juyuan
sniegoski
stylishness
blogher
freihofer
jorritsma
controling
pichit
thommie
koyuncu
sericite
zygomaticus
blubaugh
subdirector
repackages
rolnick
stemcell
kambiz
pufas
stiger
italcementi
froomkin
slw
confusional
noureen
dhia
hutchesontown
lanx
ladbrooke
pmac
clariant
zupancic
lupron
stitzel
beamers
tarquins
snowsill
erinle
kinneir
fobbing
pilcrow
misjudgements
ferr
hyunmoo
choquequirao
phobail
biosolutions
marquetalia
tetrault
dehar
gih
dahej
postsecret
brambly
myoelectric
pawk
luttig
epassport
politz
cerith
felafel
muthaura
quazepam
hoody
estabilished
envenoming
chilcompton
filaria
amrhein
jobos
fillette
spumoni
homepna
ruckert
successorship
popal
superfoods
vallourec
kuomingtang
mistwalker
straumur
schnepp
drypoints
mardel
kleins
fleisig
evarist
acros
padnos
wolfes
aloka
lhari
naturalising
sekisui
kreiser
margarines
moalim
antuna
sheils
jinbao
summerhall
trewick
nanoelectromechanical
oskanian
hospitalfield
finkelhor
uconnect
gassco
gurneys
discribed
bathford
gornick
swearword
limted
bdelloid
beardsmore
aveos
deyhim
dasti
hanagata
mexi
dorianne
makine
hhf
bating
pastafarians
saveliev
verbalizing
souchak
kapuscinski
zitzewitz
goapele
achinoam
shafrir
fujicolor
bandannas
mamajuana
wildheart
santopietro
nephrologists
damore
gajic
vastine
bouris
hulshoff
perigord
richel
botanics
zbt
shangai
fasciana
médicas
harren
terro
zuyd
scarper
cooktop
duddridge
cfas
oshinowo
petaflop
ravenhall
barouch
zachos
superlink
raigmore
tyuratam
dacor
cosslett
openaccess
disgorged
viscuso
binson
scofflaws
lasix
phyliss
constructionists
mqtt
antasari
repond
heyningen
dattner
indissolubly
unperson
hbh
lgo
goman
longbranch
thando
advan
nyrstar
haubrich
devang
nolden
herremans
ruatoki
dadonov
domitien
qingyi
nelida
loudin
moskovskiy
rheola
wined
hucheng
vadala
marzu
begles
huus
talati
devanand
sumberg
maltagliati
lcia
gwernyfed
panau
abchurch
sophon
otterman
pushcarts
dollaz
gissurarson
ballymaloe
trinities
farke
iriki
gogua
bolsinger
puzzlingly
scarre
environics
motorweek
geobacter
bahaism
undreamed
palfreman
destineer
opcon
kromah
roboticized
midocean
enti
synergie
mattice
neval
fortwilliam
jessicah
huegill
rebollar
baltin
teath
claassens
dealmakers
firaaq
hexthorpe
worldpay
dynatac
shivery
sonsini
singlaub
sledged
leyde
wonted
difa
sasac
plowboy
vinokur
copple
surprenant
clubjenna
yetkin
stoneback
pescow
speonk
barbalho
metastasizing
tianqiao
letona
shapin
maendeleo
velutha
esperon
promus
encima
bunyard
dolphinton
eob
jorrocks
mephistophelian
tahmina
gaffers
mullaghbawn
camarão
millworker
importances
caraguatatuba
requite
birnbeck
sensitise
teets
vasic
bonnerjee
schuring
lightering
peens
dossevi
vandelay
beitunia
amae
gorer
mullikin
egestas
hodos
readaptation
ragbag
sporozoite
ugtt
merillat
hric
aneesa
bellyache
vorticists
sybren
oruma
labarbara
prognosticators
opdahl
gritt
twelfths
efstathiou
trinians
latty
eche
rainone
begger
asopos
tagula
keithsburg
wajed
gryder
ballyhalbert
kinal
mcbath
volokhonsky
samaw
reignier
xvycc
riverdeep
remifentanil
llangeitho
bazalt
sudlersville
forsters
pleck
jrl
ldrs
gregorek
nosegay
bizos
fruitbat
tortuously
hlongwane
establised
arram
rodric
guntars
beaned
coxford
rugamba
jiqing
trustwave
gokhan
nikopolidis
calow
dobroshi
denosumab
kuhlmeier
suilven
andorrans
dudum
kaokoland
totterdown
andas
detering
gerashchenko
mussenden
louisianans
overfilling
stepstone
mylod
ducci
graterford
skeer
cmgi
raffan
beadling
montelukast
lignac
trowels
wanty
zucconi
comis
terios
metallised
korchak
frangos
terril
glulam
parthy
winningly
prototypically
margarett
abers
slvr
aaviksoo
gamestation
cingolani
chavistas
frankin
caig
estime
distend
hccc
editoria
szarek
itacoatiara
maleng
pekovic
luzio
shafia
sukur
mayakovskaya
napfa
compliation
dubowitz
senrab
ukirt
aedan
monfries
thinkfree
morses
wazzan
spuc
reauthorizing
bornedal
psfk
shimron
bormes
abberation
sawchuck
meredydd
elefsina
strines
quickpath
siebrecht
atholton
nourry
pgti
barreca
serralunga
churt
politicises
eatables
efdss
rebeccah
allander
sportingly
nuiqsut
moneymakers
wpad
fadela
demerol
hellebaut
phleger
lixia
ciocca
pretentiously
anema
paim
goldwire
zarghami
courtnay
dharmsala
borodulina
fumento
moge
fanne
ivus
borissov
oroshi
nahem
surrattsville
whorton
oltrarno
intrapartum
weaponised
bvlgari
cosmovision
teetotalers
firor
nanabozho
thriftway
devided
puffiness
jitan
megacorporations
dhotis
noncommunicable
digitizes
accutron
brecciated
shacklewell
shapoval
rhl
bösch
longlevens
rauluni
rawaqa
onta
hujar
brainiacs
kaloyev
trefnant
temtchine
caseville
henrythenavigator
unstaged
archlute
forgotton
nikai
dellert
uncontestable
mattew
eroski
dribs
nishinari
rvca
vernerey
berkery
cdic
corralito
kambakkht
rudlin
mountville
desisting
spliffs
emmbrook
ayoola
lavigueur
mykhalyk
remonstrating
hango
nasturtiums
duhigg
danek
diebel
piccioli
raham
dalakhani
devenney
marimon
savvides
andren
capons
celam
hachiya
fendley
capezzone
gredler
wihout
minety
fourrier
ischgl
gammoudi
hiorns
wolaytta
harringtons
tamis
rabbinically
gurton
gusha
potentialy
aronsohn
reuil
bowflex
lassally
invitro
vancleave
aaea
toquero
repeller
csz
holah
roekel
antipodeans
caracter
obertauern
convulse
hojatoleslam
koudou
shobe
hccs
whitneys
hakkinen
lyson
gevalia
waberi
taho
rozycki
euna
marhaba
piment
fdx
denvir
tienne
camelbak
manaa
welll
spectate
hendro
beghin
moggio
hailong
minjiang
akuressa
mcquilkin
sharifuddin
schaffrath
powerbrokers
syomin
wahabis
craigiehall
chanctonbury
demonica
xinyao
idealistically
jossy
gutrune
provenge
katsunuma
shafii
hockridge
lambrigg
dynan
adultos
endostatin
postnatally
nephrolithiasis
isett
dissagree
mahanthappa
bachia
nameable
moger
otegi
tenero
célimène
andijon
raveh
prysmian
saartjie
zhaoxing
bernadina
castanon
natig
vinti
hablan
niggly
strib
marill
ridleys
alopecuroides
resurrexit
stopbadware
microcell
airola
décolletage
succar
jck
guvera
singstad
widders
vulnificus
eugena
backlots
problably
slickest
manufaktura
walravens
couchepin
benzies
boreland
duree
cwmdare
narrowbody
guangyan
zongchang
blessedly
corridas
rambukwella
autoglass
abelow
bendukidze
bintliff
moodswings
cullotta
maneuvre
archaeoraptor
hintertux
manischewitz
intermingles
lavendar
melles
petrifaction
livened
maesbury
ulic
unhealthful
bergerson
arone
stuffings
halekulani
antiquing
falutin
greenwash
briguglio
nexans
highlighters
togus
tejarat
tishby
stigman
talywain
chevis
ogundipe
cansdell
viaggiatori
purslow
laxfield
jingoist
dreamspark
cantagalli
ggf
crashlanded
pratice
ungracious
consignor
blueblood
sibeko
baliem
emisoras
boraas
knierim
maycon
qasab
deadmarsh
domoni
disent
aqraba
myah
ramandeep
insua
perspicuous
hawe
merval
chulak
daltry
rakhim
appartient
reginiussen
maundrell
neurectomy
gokana
steuerman
jiayuan
dpko
pollsmoor
architectonics
mccorry
seatown
bellapais
fasion
joyent
coherant
salonpas
kakas
epiglottitis
calvez
wilmers
seagrams
gingery
bregovic
zhenghua
langel
govea
syer
gamburtsev
spongers
jessops
monke
barthet
maxym
scheppers
pedicures
fpsc
vaadin
reclusiveness
iceworld
jahncke
corve
worc
purtan
munita
adigun
approvers
benzoylecgonine
bertuccio
salmonis
mpsa
mandatum
riddley
waulking
wlk
estranging
icx
planespotters
medicity
mononitrate
bonvoisin
lectularius
herculez
wannenburg
boxcutter
usdoe
tsna
lerberghe
poynt
rohrbough
barakeh
willowfield
piñón
yock
ranthambhore
biotherm
aanr
gerut
brisben
normark
antwain
khiem
turnstones
lisser
rahho
cowpoke
millio
wildgoose
rouvoet
frisoni
indefeasible
ljuboten
pullard
sanborns
triterpenoids
firefest
cohosts
yassar
koliba
brutt
datblygu
jemaa
cliquey
etling
porca
heezen
fadli
zirbel
isidra
ordener
simari
tlacaelel
olowokandi
seiha
ananiashvili
saile
thurlbeck
inholdings
irans
riunite
bhuwan
seppa
nikumaroro
benfold
jesca
kvas
attitash
deber
venzke
waltraute
bourdonnaye
jpmc
balms
unbridged
magetan
literaturhaus
elefteriades
gnx
tunnock
dansie
incensing
cutta
schriber
khazal
technogym
landbouwkrediet
kozhin
giacobone
pelot
galadima
yarema
hadebe
tempters
budos
strober
litsch
umeme
timbro
karibu
groscurth
riona
overnite
guéhenno
brothertoft
ngoche
harbourvest
lubecki
katsushi
cablecar
scrabbling
gadbois
hochkirch
touchless
fearmongering
midy
kerswell
horspool
bletchingdon
baldas
lyondell
yorvit
altenahr
foja
trenberth
backstab
plié
kweisi
leucism
lonwabo
inguri
militarize
angiomas
aulich
khada
heathery
fbop
mancur
samdup
druggie
dansili
christianities
quintavalle
chucha
socor
elapsing
cooktops
jabaliya
radetsky
finbank
carran
garned
cybook
demnig
wachsberger
yowza
tshuma
goresbrook
odihr
hultquist
genotyped
etis
bewteen
derogations
belled
morave
maalim
reisig
saikal
fortepianos
periwig
ixv
conerns
slaking
gruzdev
multum
mcshay
rutili
obetz
namic
gulgee
newarke
tingly
maulan
hillborough
panjwayi
toutou
byx
jounce
itajai
grafenwoehr
alpesh
shmoe
talfan
masari
houlahan
quavering
evanthia
kubodera
sachertorte
olivio
tatsushi
westerhoff
yothers
pehl
siemen
rafti
paicines
surpised
yamawaki
creekmur
pitchmen
plcb
copperhill
sediq
wilsher
clacking
melosh
sematech
nethope
gussy
hardyman
mediaroom
zanicchi
vahey
hondutel
fladbury
garnell
ilounge
gbas
ishai
mccuen
lilypad
freeda
bellaigue
colóns
tingles
ezzedine
gasana
ohim
divots
mcneece
bergren
multicar
berdnikov
slowik
lionville
stbs
skrimshire
jungla
muriqi
funking
beggarly
nessling
phedon
sfcg
incwala
shayesteh
planers
theorin
underworlds
cagsawa
stogner
alowed
maylin
crimebusters
blazquez
plce
pleather
elfenbein
chamkani
oerlemans
diker
sheasby
sonoco
weihua
zichichi
brendle
freuchie
levenmouth
okoronkwo
aadvantage
hillburn
dipyridamole
ceilidhs
mawazine
clir
cyclopropyl
tornai
genuity
philcox
isamuddin
weedkiller
mabton
incentivizes
chickened
lavasoft
wmgt
waxen
sonographic
runcton
sgma
moratoria
starchenko
baiga
fisherwoman
lyerla
renes
shabalov
presidenta
morvai
teleglobe
merckle
macfayden
ruyigi
cirrhosa
rohrwacher
telepacific
dropshot
démarche
karsa
souissi
kernicterus
kosoko
raistrick
loyce
waringstown
thevenot
motoaki
chugh
broening
sinx
smilingly
cispr
guangyu
brighthouse
maskless
immoveable
tolovana
analia
wapper
richner
sron
dunnam
manber
denhardt
blackline
munton
casquero
ldw
aasim
sivaraksa
fucci
technico
criticial
moleskin
andoain
keylor
qeiyafa
lantieri
rmdsz
guines
abettor
newsouth
naftohaz
wathba
portsoken
schjerfbeck
neiafu
reelections
shalaby
kopsa
lesin
lakenham
screengrabs
bitney
cimperman
chygrynskiy
urbanspoon
precedented
awsworth
estleman
savada
alpar
sandland
scamps
copay
camier
shrewdest
chessboards
dysfunctionality
tidbury
hormann
reisler
amerasians
trana
jetwing
stavola
bassetts
stum
alecky
homilist
ausone
assiri
erx
dobey
mudingayi
nardoni
teitelboim
erkut
pinpin
nanopoulos
scana
khanon
ezetimibe
cukic
ceop
ibold
kreder
boruff
laureth
calabashes
wheelz
kaib
melodee
inscrutability
reinitiated
markram
rambagh
gve
soothill
vrx
rozeboom
turai
mcrobert
heathcott
limones
motha
grobart
dunt
komamura
vogelzang
leucate
demilitarize
claritin
markby
harvison
marotti
luzhny
yaskawa
depass
musudan
fitzrandolph
stice
luhring
unuseable
weikel
appc
transesophageal
uptrend
klosi
boue
daishin
biovail
writedown
eople
nebet
pillorying
winkleigh
carias
caty
pfandbrief
codder
katlego
marlys
backpedaling
nordheimer
soehn
hrabosky
langsett
simexchange
mcewing
rollingwood
fabianki
kleis
changcheng
chermiti
avonworth
lisman
caterwauling
carafano
harbridge
mardirosian
mpsc
tsitsi
pke
youngarts
hazelhoff
filmakers
edworthy
whrc
raclette
zetsche
edozien
sidik
ezza
longrunning
rasaq
csfs
thinktanks
marno
sanoh
cloner
derald
arrasate
hoofprints
longdong
saftey
tchao
zalmai
beckstein
dxl
smurfy
quibbled
ibw
foubert
suffit
vibhu
coalface
sukenik
tsy
zhiping
centropolis
fieldsmen
jobtitle
karrow
artioli
tions
orlich
bujagali
aseel
finisterra
fcstone
ferati
inanities
faultlessly
lvrc
kidzui
mamby
exfoliated
futurologists
ssti
toxocariasis
phylip
wainewright
kadr
kalonas
brookview
mckelvy
siasi
jazelle
rigeur
doivent
eeriness
arncott
ironport
stealthier
darbi
dunscore
androstenone
khing
oldskool
indiabulls
rockling
allays
coltrin
pallen
umps
orangevale
khurbet
peepholes
elfers
bedawi
slupsk
bigamously
bussert
nasuwt
foxgloves
heuga
lamco
parmigiani
nowaday
sharki
shellman
westrup
pakalitha
punkers
frx
preventions
milbrook
washinton
pollitzer
hypothyroid
gandules
danilin
hawaa
luvs
dusart
riyan
pankova
rutabagas
elissalde
waismann
topweight
promulgator
portlanders
vaidhyanathan
holmoe
ibank
korosteleva
soltanov
jayasundera
datacore
unspun
brahea
ekl
knautia
froogle
ocassions
baohua
nadjib
astilbe
basulto
stanislawa
holdenhurst
gadhimai
stoia
tixkokob
jhony
pauric
bridgehouse
tabacchi
arostegui
rapanos
bendle
exults
desiccating
hondajet
abdulhameed
anejo
dimont
minnifield
fagging
bladeless
broström
northavon
parasomnias
iolta
firewalled
conaghan
rappelled
toivio
talor
probo
paskowitz
ogoniland
unpatterned
deech
mazrouei
usapa
chorba
vitasoy
sumilao
ballykinlar
atiga
cleen
zannier
holyport
kaslow
gardinier
mossos
roskelley
brittnee
kbfx
shaath
kofuku
ofri
suwandi
shmatko
overbanked
aridi
braskem
brynhyfryd
beechfield
aqe
picas
balconied
mandile
ariyan
scottishness
squillante
catapano
slyness
creches
kriangsak
kircheisen
araque
outman
chindamo
nawzad
kandawgyi
rabai
benaglio
palaly
umanzor
crackenthorpe
gonerby
fullilove
alamillo
burchenal
octomom
hiromoto
burniston
lolla
haskayne
kweichow
telemetric
unimposing
vildosola
rejigging
toua
mazzolai
blackfan
rundberg
longyi
adauto
editioned
ilter
hannas
devedjian
sbsp
rbtt
divita
komin
reengaged
berdy
mccrudden
westhouse
sautéing
tiem
wheeless
godleman
chairwomen
picada
haralambos
markens
morococha
latterday
hennah
handless
carneglia
lopatkina
roofie
cassagne
whatua
nalebuff
perchard
foulden
appeases
jumbuck
furthermost
ctba
ruhnke
dineh
bickett
trijicon
gweek
svensmark
andropause
haochen
borzakovskiy
domy
baileyville
plumbs
grigoryeva
pravastatin
depsite
karapatan
dornhelm
smae
myss
canonizing
parmitano
rantum
meave
sepilok
giaconda
lazzarino
vyle
antonakis
rettenbach
raphaela
hesilrige
bulbocodium
vlachopoulos
melodramatically
gandler
digeridoo
tacke
kolawole
barnidge
bozhkov
edik
dylans
savors
ncda
ginyard
paino
burnbrae
araoz
autochromes
archmere
ghaffur
meyde
raltegravir
cornier
strokkur
chardonnays
rubido
mediapart
trainman
corato
trasmissioni
astrobotic
elab
halber
wehrheim
lemalu
loseby
willistown
rohrbacher
nonoverlapping
rappels
arborio
overweighted
ekern
alaq
varengeville
sarun
liederkreis
seebold
dimichele
nasirov
stadnik
roginsky
holovak
miscounting
rabou
collectibility
mlps
roussell
thim
pretextual
cremonesi
fabes
atucha
muchin
daytrip
harrased
kavakos
ygm
hagendorf
begiristain
cipfa
bitel
currah
crosthwait
alipour
balala
barnathan
krasucki
bearak
mmmmmm
ratfish
dinakar
puniet
lamah
handson
okas
arcore
everythings
eiss
resurge
yodobashi
overextending
joceline
shakra
fukawa
aizenberg
montelibano
utans
universitys
nibh
memphians
dumouchel
cracco
ssy
subhankar
vilana
fellmeth
terance
windridge
zurawik
chalfin
qtel
goodrow
rure
proprietorial
willert
curette
frenchwomen
shashlik
lafaille
twelvefold
defacements
mollier
epirbs
devlins
monlam
zarkava
researchable
mabvuku
varilux
eiscat
pamoate
neoliberals
carteles
jearl
phrenologists
lisaraye
smeulders
itsself
mueenuddin
disempowering
keiskamma
disbar
toraman
gullfaks
moratoriums
villazon
scandentia
parochially
baselga
kirklevington
cattistock
oughtta
hsaw
cynulliad
gopalnath
slutzky
overvaluation
batterham
dallerup
ostalgie
henchy
governer
cija
ministerios
giangreco
liikanen
landay
clarsach
thermotherapy
oldford
homsi
ciervo
matzek
ibio
tereshinski
malford
komarek
murungi
brunier
zbynek
crocosmia
rohita
larcenous
mabahith
mastny
deschapelles
mozy
strykert
cladribine
ledgewood
liuwa
chlorambucil
needlelike
ganzes
hazarding
aeroflex
supraglacial
dimokratia
meina
willesley
groag
poucher
pisar
nubble
knego
chaffinches
tengzhou
bellamys
sadighi
chunkier
simphiwe
katchor
smartie
reimposing
haselberg
chellaney
univest
guillo
balester
meigh
bureacratic
floriane
moldable
jardinière
krisel
merey
stubbies
lutfullah
pioner
menemsha
whittock
nadjari
dupond
kolache
keaveny
bequette
atomstroyexport
furring
storcenter
raful
gänswein
monsterism
rettendon
morner
vigilancia
clemmie
schleper
lovekin
fennville
balena
prldef
karpovsky
osnos
lantigua
deubel
honeysett
pakhtoonkhwa
incises
nitrosamine
neaman
pensarn
ashika
boors
casoli
chagin
murko
griem
gruenebaum
gutterman
steltz
erionite
psychotically
pennridge
alawsat
undie
lepetit
vetrano
thomire
shaktoolik
linzhou
pensee
sundblom
rearick
kave
louer
stadnyk
cruciat
daning
illium
kliger
dormston
geac
sudarsan
thornlea
ekes
tyutin
pernick
llynclys
dedina
huguley
jstars
cavium
highcliff
ebbetts
suchitoto
skowronski
mathmos
nahim
torys
nightrunner
zurutuza
wherefores
linnehan
sartory
affianced
cooey
paratuberculosis
senesh
mishari
hree
dinkin
nrwa
dierama
fathomless
saltarelli
resequencing
lionell
davoodi
daidone
tanase
nargund
peapod
glassine
refsdal
hongtao
nonnegotiable
ishay
portakabin
flibanserin
slomka
huelin
walbottle
kengen
candling
galson
basketballers
pakiam
brazils
kadhum
suezmax
sparely
decompresses
eaglen
tamgho
boogity
winshape
thobe
microgreens
iqrit
dilaudid
treefrogs
colonises
osberg
contingently
valentinas
oatis
sexagenarian
oooooh
superbeasto
aucuba
négociants
leatherbacks
coquettes
crapped
equivocated
chytrids
bwn
cauterized
paintsil
shihua
shushkevich
bevendean
satio
hammarsten
cockshott
tomeing
tunnelers
doodler
hrtv
dingxiang
tinworth
abdelsalam
casaleggio
duddington
lacorte
gavidia
rundowns
bauduc
laureateship
cranefly
janigro
reaserch
lafford
hessing
hegemonies
torshavn
vartanyan
donahoo
wenqing
monterosa
nettleford
crosshatch
lysbeth
essenhigh
equivelant
hairpieces
beatiful
abama
irascibility
medanta
oloye
lassoed
ansbro
tregurtha
gardoni
asztalos
foxon
amarri
merchandize
aldwarke
declawed
fingolimod
paddingtons
bercher
matzinger
ciaccio
hoetger
hendin
bual
yukagir
misharin
muzu
shahrizat
brightsource
iccat
kemalists
stagner
obradors
teleworking
gisel
ilê
halu
amaranto
benardete
futi
aslett
suther
denka
peragallo
boazman
dû
abon
apiaries
fulfillments
rejecta
marijnen
chomo
compnay
abbassid
confederacion
chizek
killmer
nawruz
nantporth
telit
karkhi
hargesheimer
graters
khanya
padd
skachkov
grandstaff
jericó
iupati
diktats
splinted
lampen
ayerza
egekeze
theofilos
carryall
helvey
acrc
impedimenta
northall
sparq
maluso
bartrop
moneysupermarket
uglegorsk
nokie
cejka
neikrug
vespas
pepel
olfers
efail
grosman
haimes
tatic
nefazodone
mansukhani
paulen
undimmed
gourdine
zeland
zeyno
tayleur
chool
holzberg
kench
shamsheer
dungworth
litigiousness
whitefoot
tarm
stollman
videoblog
rohbock
nram
capecitabine
partsch
magaki
cabstar
pipedreams
frayer
nitu
unhesitating
sonntagszeitung
adiru
llanfairpwll
georgeanna
ciria
aghajani
aminath
kimaiyo
hawatmeh
plewa
kolonics
malakov
talhaiarn
nonancourt
alemao
baltimores
nkufo
apolipoproteins
zhangqiu
lavs
fransham
shamie
salotto
dumpton
yahr
suiters
homosexually
bonnethead
craigievar
nuzzi
heggs
kaminska
superstrength
slemmer
erasto
cunnane
gravgaard
whitebread
karatzas
mcvety
meteogroup
nagamori
headboards
imamverdiyev
geniès
mobay
adriyanti
overdoes
chenevert
casen
woodeshick
onuaku
nonagricultural
mabil
embezzles
polycarpou
kerobokan
saquinavir
timelord
antabuse
nabaa
epner
adefarasin
wildish
bruchweg
fleetboston
sputters
azmy
cadamarteri
puckeridge
klap
conibear
cantatrice
keckler
caique
unsoundness
salw
andronik
swanstrom
gaoming
neabsco
desecheo
yiquan
matney
frykowski
belcea
shoed
rebadging
genivi
dubbers
angelea
rissman
chrs
rodak
dropcam
impassively
cupholders
behing
groser
bòrd
tieshan
kenanga
openajax
hickinbottom
meminger
screenprints
vango
sisig
jamee
chemopreventive
jinghua
adorably
elouise
switkowski
rupeni
devany
shnaider
giribet
nusser
prestiti
grassini
oddsmakers
aramberri
luigino
jasur
purgatorial
mudlarks
rdecom
malangi
mambos
baree
spooned
takalani
idham
spraypaint
zeel
ixquick
eyeful
yonker
obfuscations
labash
sabrin
proaves
preit
hackleburg
rajnikant
neemia
katera
pukhov
backpages
dureza
radisys
greywolf
bustled
decentralising
hanifin
geochimica
quate
fhlbb
chimerica
maharajan
khawla
mitchelmore
sonesta
barbarino
peope
doncha
uninsurable
restrictiveness
chittum
choung
narghile
taishin
guobao
allyssa
bertrams
tessler
storwize
oompah
nistico
brabeck
killinghall
nogues
myoga
otokoyaku
pteropods
nitpickers
staiths
wilckens
expediter
francophobia
mckegney
muqdadiyah
bncc
fengxia
boriello
unterweger
ikm
strosberg
demotivating
coteries
gaikai
mometasone
cornog
yumkella
cablegate
watercooler
sanakoyev
eariler
kemira
lesjak
twohig
globovision
barabanov
jongerius
perezhogin
musican
milfield
aradi
simmy
albareda
uncomplaining
canditate
pilc
inheres
brema
fimbul
cje
mursell
gki
kaese
shlock
despairingly
invasives
hauptschulen
kyauktan
narsad
naari
mcingvale
michiyoshi
ssam
snif
metraux
divall
chps
oskay
errrr
oseni
chaunte
stokols
dimenna
weatherbys
marchick
wdet
morlon
roadstar
parknshop
lochboisdale
aready
talco
deoksugung
palander
bagna
ehad
ghostwrite
zinoman
anirvan
ballesta
zabawa
durring
performace
krikken
dalkon
ecohealth
walberton
tecnologica
movieplex
nysut
fucilla
immunogen
korfmann
krugel
byass
barjac
quainoo
iafrate
boell
khacheridi
krump
wessinger
thalib
bhogle
dandu
meziane
paleyfest
jianqing
koeverden
gillson
illogicality
conville
grims
orian
peynado
icsr
powney
suceeded
chesse
schwitzer
begrudged
donnall
decaffeination
heinemeier
barcena
kangura
chumachenko
huanhuan
galloways
gher
tamiia
costarica
jollof
katella
fetishization
ansoff
shuttlewood
leikin
yuden
lazarte
cnni
takefumi
taedonggang
colora
facussé
farese
melf
ahamad
greyfield
transferees
ogunsanya
soloistic
asbmr
homebred
ameal
masara
wigmaker
barsebäck
nmdp
wilstein
cotugno
stike
transman
disposer
robbyn
pusillanimous
panshanger
coultas
tumaini
caril
vandort
plasschaert
cesifo
cockamamie
nabby
mainsprings
degroat
ashia
itweek
paphides
atter
maenclochog
marshside
fixator
stauss
stigmatising
dreadlocked
traykov
vaughton
nichido
walmgate
dck
heledd
scajola
woodmoor
biaw
padalino
wiis
mckaig
quatercentenary
herpa
bockhorn
wde
lasy
duffing
bimson
towans
stupin
quienes
brads
homenet
prieb
jiahe
diprose
guttierez
canoy
backwardation
qelt
tarasyuk
sethusamudram
krayem
compubox
suvorovo
erts
laparoscopically
grimandi
juet
anagnostou
brevan
martek
sunair
responces
binbin
schoelkopf
graem
macarons
cheskin
pretinha
lizo
rayanne
molecomb
ohip
intertainment
garby
apirak
parcheesi
holdall
xapuri
mucolipidosis
latcham
tefaf
rihana
eirikur
dagogo
gfg
smoothen
poken
comparables
godoi
shorthold
ency
tillot
funniness
pollarding
erz
asmallworld
naht
duchamps
karora
mutko
poundstretcher
laffranchi
percocet
somkiat
asaduddin
snowbanks
wwin
sliming
micromuse
guriev
nazam
kolodner
stollenwerk
etchecolatz
montney
quiches
torgler
middlesborough
errickson
mortehoe
mediafax
lwg
sotalol
kemakeza
screeched
nannu
heathcock
resorb
luay
alveda
wmsc
nabati
llaves
cotehele
peeped
hedrich
amstar
treasuring
sablefish
betsen
muker
messud
roids
berkow
yusra
unobscured
ouvidor
macerate
terekhov
badinage
mruvka
deemphasizing
akalaitis
mussell
unguent
weatherbug
advancedtca
blackney
nutbrown
detesting
klci
krumov
turnmills
basnight
allistair
wrigleyville
bruecke
monohan
blaschka
hautman
tiralongo
commingle
yurtseven
mihails
radonski
instinet
fedun
magaro
frenches
teknaf
kalaitzidis
wallid
chewning
consolidators
intravaginal
filets
jork
lecomber
hyperthermic
recre
idata
kjla
gymboree
pavlick
hendricksen
velka
danneel
ficc
tules
userkare
dzr
yongfang
childminders
glascoed
tsipouro
houton
urgun
biopark
dockage
ceron
snailwell
untilled
stenham
mantech
murabaha
esure
ringmann
affi
vardanian
pedrillo
kibwezi
rakove
jev
chalid
pissarides
catera
cipressa
abideen
besigheim
misys
meersbrook
brafman
sandbostel
livock
unswervingly
russophobe
culantro
cervarix
kamminga
pieranunzi
robotti
alberman
zlateva
buonomo
rvsm
sarco
palinkas
wenhaston
greves
bedchambers
energis
depardon
camec
deighan
netezza
isys
adaora
jaywalker
marilyne
shewfelt
lockyear
ratby
sulake
boastfulness
dapples
arkinstall
irey
sothic
polonio
babaloo
obejas
fornicate
taggerty
marionville
hessels
relgious
elgible
carlitz
sspf
fidai
fnmtv
idisk
nayman
traumatically
farfalle
unbefitting
warke
romanticization
burnhams
indefatigably
puzey
castellamonte
reaux
darai
fallahian
opengate
peillon
cambusbarron
yasinsky
mottau
geseke
nonoperational
globke
downline
lifescience
skillets
schisgal
phototoxicity
winegard
tringham
synthes
castoff
cattenom
epitestosterone
pachon
pjtv
grayshott
hortefeux
familiarizes
stobs
huwei
tanasbourne
bhattal
expence
schifani
eglevsky
snarking
molodist
attrocities
rawdat
xra
paun
igrejas
payors
southcombe
fillongley
linders
nobuchika
pcrf
uzbeki
cnsas
batsto
blakedown
asuming
nubium
interacademy
blackens
mattey
florit
hjejle
maragh
threemile
blueshield
fxa
overtricks
khidir
sergia
volchenkov
smoldered
kelud
ghw
janti
emerik
oxymetazoline
apice
raffoul
confidencial
stamata
kingscott
antinea
biegler
reeman
manuring
collectability
khorol
klineberg
mcgiffert
mudzi
allmans
rashti
sunnylands
doggedness
loray
horseflies
bolshie
grimsdale
aerophile
hreinsson
oeufs
bouira
ruut
shoura
twerps
dermabrasion
levich
mahmoudiyah
tsikhan
hypocretin
lopud
tatenhill
szeklers
videographic
scheuerman
lampstand
aici
maxman
aliotta
crimplene
nautla
plachy
mizerak
tazer
roszel
dishearten
maoa
procreating
mancienne
karayev
unrefrigerated
demske
unabsorbed
tondabayashi
alipay
tortel
klipper
jardí
treshnish
observably
dreamlife
scenesters
oftel
vwr
ivet
yesilcay
wildung
minisd
anorexics
prid
nyquil
arija
caraher
angeletti
sindia
membrana
politesse
kahel
boullier
deakes
webvan
vecho
zasyadko
hellenikon
stonyfield
theocharis
feldsher
seperatist
raychem
unpleasent
orbinski
xacobeo
lorser
olufunke
eilish
scottow
njdep
mediatech
naturallyspeaking
yiannakis
kalisher
gnvq
caridee
flatfishes
griesemer
andell
leftie
signicant
gues
winnebagos
twaddell
bertagna
downspout
grundfos
toboggans
lifestory
bunawan
unfriend
noop
yellowy
kenjon
wojnar
acutal
tagicakibau
kootch
ciron
penndel
eurocamp
schussel
chamanga
vulputate
fuxingmen
lesego
stellmach
sharyl
picketts
maybourne
shober
delightedly
atus
biorhythms
ebben
chaudiere
strohmayer
governemnt
burgs
bkh
alfei
dhanak
marfo
tackie
baycrest
vickerson
parathas
mxi
mekeel
cottesbrooke
revuebar
komarovo
webgui
myspaces
tourneau
soeren
elstein
agadoo
aquilante
schnecksville
kasasa
shailja
kessy
sensuously
rabbitts
gardaland
kinnier
covaliu
gloried
freye
kluver
unripened
scannable
pollera
selside
siswanto
abashed
kunicki
amagertorv
sitation
sarafyan
compaign
albea
legesse
stepfan
relgion
supplicate
jhin
mysti
clennam
delpino
playfish
ludicrousness
krzeminski
suscipit
boxlike
biserka
santika
piata
ciccolella
elicia
damasks
aprés
trhe
salumi
spme
sholeh
jitterbugs
brownridge
calarco
redco
cked
champika
yoido
goestenkors
cleworth
megowan
richarlyson
moniter
impax
trenchantly
wishard
deoxycholate
kakkis
yien
hudur
cartha
exerpt
farrall
eckhouse
bipa
malkan
smolts
rukiya
ering
overbalanced
flavorpill
malesuada
remunerations
picadillo
kniffen
belimumab
chuyen
inexpressive
scro
lichterman
spittin
gearty
lamaism
starner
chessex
icehotel
zeanah
anchormen
wiederhold
emulex
tatio
birhan
axolotls
hoplon
scuzzy
ossifying
elaheh
lsta
messaoudi
lyveden
bernfeld
pronouncer
colp
magyarosaurus
feltus
serageldin
flexibilities
majeste
bunghole
barko
pinnington
chugai
kirchberger
almanzar
malouin
immorally
edgecote
magalie
palamos
lamagna
butorphanol
baoshi
lanig
mcnugget
icdl
skuli
statia
chengji
themselfs
bourgon
crappier
kumra
lipez
spooking
peruses
outlasts
sousatzka
ravenstonedale
zias
raskatov
albone
sundai
arean
shyu
guanzhuang
abdelrahim
solankis
pennyweight
gluepot
diuranate
darboven
gabey
behnaz
mopey
bashment
queston
prazosin
grisbi
purita
ashrita
daoudi
memorious
ausman
matzner
sumtotal
gubser
germanness
saana
nagae
crybabies
easterwood
trebah
bhindi
greenkeeper
newberger
dpk
redditt
sergeyeva
birthistle
ayars
cqg
rainelle
jeste
stikeman
terhi
chocolove
taskmasters
tumorous
soanes
contracorriente
schewe
lillicrap
trebunskaya
dodgertown
wellsway
vidarte
gyoza
rashers
fdk
feintuch
saedi
nedap
caino
escos
buddington
cussed
meservey
unmatchable
gtcr
weisfeld
healthsystem
leving
aesseal
pacemaking
mylife
justicialista
colorways
sarossy
taltal
santosa
blanken
ddim
hunty
vlisco
brodifacoum
afms
dedic
itaa
liepert
papaconstantinou
pesquet
ecumenically
symptomology
speakin
lindolfo
levitzky
sabuni
bunked
swingset
cuckooland
wowing
mulbarton
ecologica
rahila
methylenedioxymethamphetamine
robitussin
vulcanologist
quammen
prial
kulsum
agathidium
antionette
kinlochewe
troopergate
frittered
lummox
lagrande
nimbleness
dallachy
barou
achieva
mescudi
iseya
yibing
solexa
finol
ibtc
floh
drainpipes
mudbox
hayovel
petrucelli
morrises
raphaels
lecour
baolong
fenofibrate
fuerth
chakkara
deputizing
schachen
nanosensors
cantlon
zdb
ashera
westvale
kervern
latella
idenburg
nisc
lapsang
altoon
polota
rusper
hagglund
khatab
carvedilol
assurant
objectifies
cowfold
cypionate
lastweek
karsaz
hollyweird
sier
dalbec
dslams
felmersham
unwarrantable
stigmatizes
lowdon
onehope
marthas
pleasington
trearddur
diderich
emsis
salaad
cosmati
pennells
effluvium
disjuncture
pfaw
siné
sunit
kinan
malaquias
backchat
sdat
foulquier
shoofly
misquotations
blighter
agbeko
orphenadrine
snivelling
finesses
thewissen
swad
madisons
searfoss
subterfuges
arlacchi
bianet
beleives
noncancerous
buston
catellus
pronouncedly
benshan
deenie
abstractness
gantin
opsvik
errantly
kaseke
atomisation
heyder
pippig
clarissimus
hiit
gallichan
grouplet
bueche
hadir
toloa
stateman
fuchsberg
mouammar
lubar
wentai
untwist
cadboll
gemeinhardt
zoria
hpx
shaibu
undefendable
mclafferty
lamidi
hearle
longnecker
heartlessness
heliosheath
esmael
dewater
curlett
brancatelli
canters
kodwo
momberg
rangone
vended
nabobs
melees
leesport
obadele
ncvo
hhk
hooydonk
langerado
tinopolis
bootcamps
darsham
snac
pisau
martinoli
mehadrin
kroffts
hkjc
gudjon
hnz
delbruck
‪
overcooking
cogges
trashcans
bergy
vanco
reattribute
tinkhundla
litella
austinites
chechik
littner
dawie
irreconcilably
sheinwold
brooklynite
lorded
kyleakin
twines
anisole
abiye
cder
lemer
radiograms
shovlin
bosl
bedrest
virtualizing
génova
updaters
donnison
toppy
avigliano
ozier
preachin
gunrunners
milliband
postmarketing
targoviste
irelande
edaw
bouchra
kimco
abendanon
corojo
denesuline
chicontepec
dancewear
sahlgrenska
bawtree
funi
hpcs
filipenko
brex
liesbet
nilon
nallet
januario
mattishall
chetri
nomisma
overbye
maves
wechmar
trasporto
kacc
synchrocyclotron
boubker
mesurado
czege
spara
pintauro
smithey
disassociates
romeros
sjostedt
clemon
boxun
shihui
ingolfsson
corazzi
lubmin
flatteringly
gasohol
murias
ukt
overstretching
volleyballer
rhinosinusitis
dutti
rochlin
undersold
mdluli
arabtec
bortone
etame
tafadzwa
peosta
garyville
keath
flemmings
hamers
eupol
pdps
bhavin
gemenon
jihai
yunji
dezi
toccara
msibi
eprs
clogwyn
debilitate
liberatum
belbroughton
bacta
plupart
tucuxi
kdn
sadecki
earthier
kintamani
balbay
mmwr
mingala
zarni
mubanga
durrer
parch
dipple
wilkof
satiny
threequarters
ehlvest
posibly
nightie
corrias
ricke
brookwell
gordeno
tolkachev
tillable
olshanski
intrests
newcraighall
modoki
recantations
beerling
midfoot
durnin
nitzana
unders
groundskeeping
parnon
heungdeok
abrasively
onalfo
emasculating
newbald
celibates
yarnwinder
nelsan
whoredom
saccone
esman
clsa
airsickness
mcnatt
kneeler
bahij
carcamo
ezquerro
messia
solarwinds
rigatoni
unknowledgeable
blisteringly
thraves
helseth
chanta
abeni
marinela
savigne
teleprompters
axtel
helg
lackie
empac
daleiden
baylen
mizuko
warrilow
vocalises
globacom
osci
pitreavie
auriti
gallin
potbellied
openning
getgood
gayssot
wlodarczyk
boroff
harcum
sněžka
aprotinin
whatuira
diamon
oneil
kpvi
mspca
madel
kalichstein
rentsch
chemla
schleef
jeffco
saxelby
shudde
xac
verzosa
hohm
lichten
hadba
aerion
hyperphagia
beibi
gotthardt
kaliyar
baudilio
lockes
behling
movila
kayongo
boothville
altafjord
ronnell
transwestern
gibes
unitedstates
denseness
studing
nelnet
grandpas
sknl
agnews
duyck
cnse
stuebing
dealbook
yood
reviling
bnk
vsevolozhsk
pretensioners
pashkova
cofi
tobinick
geomyces
tyeb
petruk
jenilee
ubinas
mammoet
ecmc
delanne
rentmeester
achouri
jugtown
mamprusi
mathuram
zīle
hindpool
inaki
seomoz
beart
darmstaedter
salbi
nzimande
discomforted
waweru
khasbulatov
zarrilli
dobwalls
voinov
caminada
moorfoot
almand
uteruses
hies
asselt
baretto
ound
deveny
bleaklow
semenza
nagasaka
spivs
cernavoda
dalpe
thoron
annison
jazirat
cariad
stonesifer
trivialisation
wasik
llambias
bartella
bajramaj
gack
accordia
assasin
breitkreuz
ballymoss
streelman
sadoff
upritchard
quaggy
dowgate
procacci
feiyue
dullingham
aasia
hudes
cuthill
skorupski
cosmochimica
rescanned
sopchoppy
premedication
himym
cutco
lefrancois
tangibles
equivocally
tallness
depersonalized
fruitier
cvts
schoenbrunn
chambermaids
nonresponse
completey
broeke
sukoharjo
bátiz
oceanos
avadon
snackfood
gustard
discomfiting
liljestrand
brette
rapscallions
bleasby
foum
shulevitz
earler
echohawk
trovesi
wolferton
philisophical
welborne
sironko
straighteners
setpiece
valbruna
shammy
sergie
berom
machray
ertman
redeployments
piquing
wholistic
dongming
penmorfa
popovkin
heckenberg
jalovec
precut
narvin
iannaccone
baulcombe
iaap
wenta
coronados
novillero
llanedeyrn
fies
parulekar
tomicki
dazey
lendy
lilibeth
wacoal
bearskins
leonelli
peplau
huttle
trussel
breakway
muslins
drider
hasselquist
cwmcarn
librium
forsgren
deprogrammed
poleksic
mossburn
tansill
qiushi
kscb
ilkhani
innertube
psyclone
crippin
feedingstuffs
chege
pdufa
nevett
engineless
odobenus
scotforth
setpieces
eservices
theese
lowari
visitbritain
bartholome
cribwr
stenstrom
honked
virii
nwtf
lotha
abec
chitlins
gauer
wssd
gortin
lintern
xvt
tadman
schoo
blueliner
naimoli
hcpc
torwood
dimed
becaming
ncj
kreamer
milkfat
griffier
makhani
laffrey
aronimink
abrahami
agrawala
aimie
parceling
bugaboos
cyclery
principali
eyk
plasan
strelchik
prolotherapy
derailers
tieman
glenariff
blecker
wagnon
prondzynski
differance
unisons
oymyakon
benalouane
leguin
langsner
scaggsville
odeo
sease
resturant
supermajor
shamiya
obul
processionary
vanidades
damballa
firestream
trenin
herreid
absorbents
ngeny
estemirova
rhdc
textualist
nonfood
techweb
bsms
forese
aygun
regza
newens
cnep
tiddles
maesycwmmer
winslett
hybride
ncee
mosqueteros
rootbeer
muraqqa
guoxin
patsavas
breth
poiares
fatialofa
iwon
parzinger
gallerists
langthorne
craniectomy
pattering
jonquera
willocks
csss
danzon
ictaluridae
lemcke
cprc
harpooner
gubicza
wanchoo
kepu
disapeared
vitkov
untargeted
zaytoun
emanual
polino
oganessian
oyarzun
spermicides
expereince
keelin
unneccessarily
anthropomorphizing
sidarth
havanese
cedrone
appers
livan
stoodley
gulyaev
brocato
makov
mudpit
saify
catastrophy
slabinsky
pepsis
pajak
crisci
sisay
augelli
gencorp
firepit
rowdyism
tayeh
midpark
rivane
ioulia
camlet
bogdanowicz
transfixing
aqt
shefter
segurola
katzenberger
beckhams
kwas
calomiris
nembutal
dutson
dicrescenzo
terawatts
shiant
sadulayev
uña
abstemious
dunbrack
thermoregulate
careerists
subsites
butalbital
younkins
bedella
scapegrace
postapocalyptic
trifled
heyser
wangdu
hyster
electrolyzer
bancrofts
snowhill
jabarah
krugerrands
rouzier
trefousse
remodelers
maricle
boiron
petrofac
cynda
kissock
lifeboatman
goodish
biesen
keehan
leha
letteri
mobilisations
saull
jammat
woolsington
malevsky
weeley
mishel
mattingley
lones
mudawana
korecky
dagworthy
jahri
rudnay
madoyan
dealmaker
yoest
skelmanthorpe
romanzi
completist
pahlavis
tedo
waterweed
matsko
swampers
transys
adetunji
entelechy
mccrossan
lapt
gjerset
systemized
cagw
piloncillo
alcova
bernardis
leban
answere
interoffice
nongovernment
maquillage
remigiusz
ziska
walsum
inseam
newater
cropston
ambulante
pirozhki
manougian
skyr
forwardly
gossain
mcartor
kunnen
sdwa
lpsc
lehrmann
whassup
pudemo
trajkovic
eurail
sirwan
olmesartan
aratani
yiyun
moates
twitched
turpi
inmon
lugovoi
abbeystead
idzikowski
schauble
messagepad
springsource
disembowels
undercounting
demotivated
klpga
zemmama
provencio
livenation
imiela
kertus
ethiopiques
deceptiveness
gomshall
santolina
transluminal
mcowen
qrops
huvelle
arsenis
libran
cabic
revenging
scharin
yuzhen
sturrup
felitta
alspaugh
esteems
muttontown
roge
northdown
batrachotoxin
dubnov
alikhani
cornelly
outswinger
swabbed
towb
elmau
moutarde
westerdale
dilutive
chronologic
celsum
derrylin
polishness
prinknash
utx
lantin
trendier
iivari
mazunte
pederneiras
satinath
estranges
transflective
jahns
danella
borzois
aristóbulo
unusuals
timewarner
kruck
transversing
bessonova
verichip
burnaston
kaihui
jisheng
brascan
brung
qummi
malverde
mesler
seminis
cemr
wtnt
kenteris
varenna
savinova
mutsch
energem
chaze
hatiya
balzary
inportant
firebugs
ilchenko
oakwoods
superheros
punycode
featherbedding
slamdunk
stapeley
tecs
coverge
arocho
sundwall
bridgham
mucuri
poupard
asenso
bowlt
mckelvin
xenapp
rfh
qci
valorize
steeling
llanharry
rastall
incisiveness
unichem
looi
glutes
surroi
minibikes
barquera
chellomedia
nikhilesh
methylcellulose
gliori
thyer
pactor
pursuade
avz
barflies
sheppards
maliqi
zavyalov
bolkow
klepfisz
kenth
interros
laucala
unfriendliness
infatuations
gaddum
teros
neurotechnology
ruhnama
mischance
lumbers
rydalch
snoozing
ranadivé
krader
zypries
tarradellas
tithebarn
isothiocyanates
scirrotto
ivoryton
kinge
flicky
pmml
octoberfest
smokeout
bilic
ballyjamesduff
suring
bonnette
eems
muhibullah
indvidual
frostad
bayno
dayeh
cavallier
warentest
miviludes
jianhong
resurgences
ampules
sondermann
maraviroc
rempstone
cossman
khaosan
chiongbian
gyptians
liberationists
vaaler
sheepskins
dannemarie
iocl
edmonde
bacabal
ostman
aweful
immunoproteasome
throwable
burundians
ghazzawi
gwynt
klawitter
medfly
tensely
affirmance
intersputnik
saffrons
tremiti
pearler
earsdon
moorey
kouris
colonoscopies
pureheart
mickal
mcga
sphaerica
iisi
rosslea
fliss
prause
addle
raelians
hgr
tekna
vetches
hongxia
pelynt
imoca
kammerlander
tranquilliser
dioctyl
muzquiz
bupp
afit
emmonak
appearantly
estuarial
heiligenstein
gallais
rieslings
lewsley
taizz
yull
audrie
versaille
chokeholds
perfomed
stoneley
tyacke
squadronaires
guittard
michôd
fecr
cmec
sinnathamby
tureck
oposition
crissey
squillions
denims
inflexibly
kinslow
overextend
bobinski
jordis
xinli
doorns
unpicking
mexicantown
crassifolius
andraos
mubeen
niccum
opisthostoma
sireli
lamberty
yiddishkeit
wakao
chuwit
caboodle
vezzoli
glevum
craigmount
homegoods
parolles
maghazi
lorenzetto
dongmei
bashforth
aromatized
zalmen
treinen
magallon
bahlman
rrose
batar
stibel
ptj
inosinate
enfermedad
stripy
wanke
ampeloprasum
advogados
rojiblancos
kleinveldt
knauff
tostadas
kenen
unpermitted
nokelainen
cloudiest
hashahar
schwenksville
wennergren
jarchow
leutasch
incuding
yuwa
krestovnikoff
sobia
caiu
gilon
formoterol
prehen
legear
horsnell
imil
dossey
mhh
downwardly
reabsorbing
basche
zeroual
zillmer
sikahema
amendolara
throughputs
nawara
coldhearted
deshong
cheye
defanti
titter
superquinn
tlrc
lebda
bzdelik
dannay
stober
goeke
malinconia
hhgregg
behgjet
malarz
craignure
yurman
bucho
gunka
thomsett
norrena
butterman
szczur
snappish
mainconcept
jesses
transfair
rebuttle
mediu
elsby
cheesiness
longswamp
postflight
sherels
xedos
marikkar
poundcake
nonradioactive
abstractionists
savonnerie
gasbag
synners
dueting
loopt
strone
mercadolibre
wtaj
wwrc
rogne
kernick
anoma
tomasky
swimmingly
microdialysis
nadege
luminex
newcleus
cirad
kilmurray
ocse
armful
mazagran
malodor
claypit
frackowiak
miyakejima
unendorsed
trevon
baracuda
dashcam
randiv
castoffs
emak
reclassifications
borrie
frittata
jellema
shirat
fillis
catthorpe
tributed
accussations
dematerialized
dapkus
takotsubo
swivelled
bastwick
hilgay
carrodus
alonnisos
lukaszewski
duologue
hesistant
underproduction
arouch
pizzini
twal
cazuela
amukamara
amorosino
thhe
trannies
wisoff
dsrc
charleen
esbwr
enthusing
jacarezinho
oberau
voro
schuurmans
araia
premat
changhui
ladron
oapec
bengoa
gullotta
wanxiang
civc
microseismic
llangynog
recive
lobstering
saferworld
talwinder
convience
microblogs
hausken
keeslar
careered
kokoszka
brinnin
heberlein
moumen
loita
macrocell
weinzapfel
westrick
kulula
thriftiness
candesartan
gittisham
copdock
haulier
feus
claunch
lazarescu
moop
ravenously
ulemek
harperbusiness
decelerations
tkf
kangshung
farmersburg
celestica
wombacher
rubinho
ladwa
jotter
laverack
birbeck
momposina
frish
unbuckled
millfields
dejanira
laketon
manala
haakonsen
tillstrom
orcadians
rahs
zykina
riocan
radwin
hockeytown
toyen
ejg
serape
rebaudengo
kweon
schilthorn
enertia
yeki
belkovsky
kaputa
willinger
boart
atrisco
scampston
allums
electrocardiograms
cineplexes
laryngologist
rudham
saksena
treacly
strategized
sakie
twigged
hendi
recette
edar
glinting
lefkas
possable
gransha
christain
alteon
overpay
srijana
gwynant
eseries
whealy
laurean
brumer
hadewijch
yoani
putschists
bubas
vulvas
mebazaa
ongwen
buddon
lumar
fluegelhorn
zapiro
champex
shipshape
charecter
chawner
roadbeds
rohter
ehome
triston
zmievskaya
mcclarty
laaga
agla
manhattanites
bonenfant
exactas
oblinger
sahalee
mealie
hatrack
martinstown
supernationals
flowserve
wokalek
keraterm
carlat
superantispyware
arguez
teaspoonful
smartcity
nickless
etrace
poyner
relámpago
gurewitch
tobón
burled
bewailing
meriah
userra
petrodollars
puthukkudiyiruppu
avenyn
faidherbia
diictodon
kudankulam
pumfrey
fluorescing
maywand
momodou
glaciares
ciee
lochar
vonder
loehle
kubuabola
blash
dayoub
interlaces
budish
nosher
eslick
kailyn
rotherwick
encarsia
noriel
hankes
mirthful
boonchu
caled
winnisquam
informercial
cuill
marinoff
feniton
dirtcar
alleva
perspiring
suffuses
killoren
fingar
feminisation
specfically
unstick
oakenshaw
amrep
simliar
krumpet
byi
sojitz
conquerer
morsch
dragées
ichetucknee
gotomypc
gnomedex
openess
rossberg
niua
ndolo
discernibly
acholiland
sanit
cardullo
owg
skyservice
wriddhiman
trutanich
childminding
heartmate
joren
aramayo
snizort
gradison
alayne
sightsavers
gartcosh
hanesbrands
downpipes
manacle
chameleonic
olberman
criselda
zagurski
craigville
kronenwetter
pinking
bilili
mcguirewoods
yenta
garcez
psyllids
berenstein
nopa
satisifed
toneelgroep
xde
mainul
griesa
hankar
tartagal
visher
unirrigated
antigenically
torrigiano
freddoso
hinteregger
muglia
scandanavian
dzon
potrykus
appcelerator
sups
diacono
geffner
inchmurrin
furnishers
respers
eyssen
hutchcraft
minzhong
wojahn
badree
wikler
gloucs
kreimer
legna
litvack
indefiniteness
weigmann
permeabilities
droned
peramivir
unenforcable
artexpo
nenno
southers
wordwide
bucatinsky
himmelmann
euk
noorjahan
haideri
xte
affinia
oxygenates
oswyn
numbi
rajevac
braer
eduardas
préliminaires
thébault
fishies
aluwihare
avantha
jahanbegloo
edaps
tamie
nitot
scantly
khona
clonbrock
wessman
coquetry
moscowitz
matsuba
ballysax
godinet
steinbaum
churm
diepen
epsdt
elizardo
dieste
petulantly
stojkovic
biotechnologist
atfp
jaures
willand
bashur
kasse
solae
farzin
mardiros
tongkang
minitab
footner
christmassy
unclarity
sichting
attemp
hpcl
rafaele
boadrum
trelissick
arvel
massih
maume
ajg
ctis
spilka
coarsegold
burgalat
sotnikov
semco
solecki
sneezer
shumkov
armanda
knowling
zargari
farafra
miembros
cerasuolo
haufe
polastron
corbisiero
latting
placates
qlt
housego
poreda
pruthi
lachezar
zagats
cocal
korell
ible
hyperosmolar
werntz
evendale
pogosyan
togwell
kashagan
anothe
pecherov
kegley
macuga
jorgo
sviggum
filmbuff
arthurworrey
desiccate
cullers
museon
lagon
heydays
solove
fattie
lagrene
claverdon
gonadotrophin
bazell
eotvos
snapfish
voshon
kloner
cachuma
ampo
gordeyev
manaton
demeulemeester
klaveno
kincsem
weirding
vindija
solchaga
llandarcy
karos
sarbi
mindlessness
zulay
coiste
mtcc
sriskandarajah
biondetti
bewail
cherkasky
unassertive
sayano
wintersville
yachty
omotoso
cyrkle
wafula
ugueth
fluttery
iveljic
phonegap
labourlist
explict
marraige
mazure
stright
opticon
tarjeta
agrama
murrel
bossiness
hfn
dipiazza
datadyne
labèque
rafle
gopuras
goupy
donnés
metc
drissi
huwaidi
galtür
wutc
makey
hassiba
morleigh
absorptiometry
kendel
bruwer
cfts
centerburg
rajgopal
galácticos
cavenaugh
asplin
barcade
anyama
mennes
murugesu
orlaith
relat
hunkers
ichc
dodsal
glotz
symank
statoilhydro
fith
faeroes
edz
revisitation
celeno
eqip
darik
allmann
clancys
zawistowski
halau
moussambani
humbugs
anthe
amriki
mahla
bitu
nemchinov
arleston
oxney
hamito
nahai
wmus
geschwitz
sangpo
schrimpf
salarymen
landfair
aurubis
groundsharing
orebro
spokeswomen
theboat
phials
romanticizes
portos
birchmere
berghausen
proggy
mousses
faser
gomidas
savanah
brecknell
hulc
karic
roelandt
allyce
swoyersville
delegitimizing
reimplantation
keeter
hantman
xintai
anney
jaiden
minicom
housemother
gatecrashers
tindell
pipitone
reyher
truing
mbele
radanovich
mostapha
wachira
conflations
devellano
waspish
transnationally
franzos
humbleton
nsereko
smiffy
iping
goners
strandgade
rigano
supercrew
chens
prashanthi
liakopoulos
pirfenidone
dudleston
gambin
covad
rixos
tinklenberg
leijonborg
tapeh
gabrys
prou
densified
chicherova
weigelt
dechellis
hiong
demonizes
wilnecote
mazmanian
andelin
westclox
metaswitch
ameliorates
hassidim
iskan
feugiat
lidstone
admaston
nocere
redcaps
eqf
thakeham
streitenfeld
dishonouring
edocs
spowers
metenolone
riecke
motiveless
fydd
falettinme
tontitown
poptech
yanhua
craned
ossify
tianyulong
pedn
unembellished
jdw
grassle
rudyerd
shrivels
devcon
misdoings
ninio
eltis
tillou
tzortzis
ronkainen
sweid
premiss
konocti
borgdorff
bcfe
mcgourty
bushed
tamson
restrengthen
katalina
bhurban
pirus
nonfinancial
badaber
upconverted
traipsing
lurex
luvo
soosai
airtankers
fonart
baktun
icily
bitsadze
towelhead
kurkova
mitiga
cantarell
fragola
timespans
oxybenzone
bazzini
depoy
viharn
mubasher
adtech
bhavnani
mestas
illegibility
beydoun
chineese
hisle
corporatocracy
vassiliadis
altberg
lewisberry
klieg
debusk
schmutzler
dallis
teulet
preperation
unstimulated
qdoba
stammered
parure
ginjo
tinky
rightmire
alpargatas
unfeasibly
dzus
mutalib
armelagos
daylin
odintsov
vuzix
arette
basam
abpi
dustup
loida
coml
illume
tachbrook
seath
semiofficial
tomatillo
gladdened
cencosud
lisovsky
imperva
oluchi
pbsc
sisqo
popski
seaking
sibat
flocculent
roadworthiness
kiltegan
canf
kadie
vielmetter
otylia
amptp
imponderable
viagogo
eskdalemuir
shakiest
afiya
kazanas
ablating
brewdog
hwd
dissuasive
malafeev
intellegence
infinitas
haved
khemlani
chiemgauer
yianni
bananagrams
zkb
elounda
bourdages
bengtsfors
pcpa
karpowicz
measureless
purificacion
danseuses
llanrug
kamakawiwo
unvetted
chlamydoselachus
kwr
cooptation
kalabsha
benen
shazad
bague
makler
wandoan
venery
jiggery
georganne
volken
wieger
moslehi
wearin
hovick
penkhull
buric
karlovic
lasota
hoogervorst
giltbrook
taiyang
michou
ksde
crowner
bergemann
ruddi
sorba
pefki
luckock
dierckx
onamia
mangena
loubier
bellach
spratlan
sudjic
kaehler
bloodthirst
gerring
beachland
terrorises
ghermezi
karnilla
roslea
cristoforetti
aroud
untagging
geniune
viall
lancome
ekundayo
spiccato
lshtm
unlu
savouring
zampini
timewaster
mackubin
moharebeh
profonde
saltines
ritmanis
middlethorpe
pekinese
jetmir
pitsmoor
lexing
grigolo
badgery
clontz
maginness
ahnlab
sadigov
muhimbili
hatband
motherload
bronzés
hellgren
evidentially
reynish
eida
tillion
silveyra
dendreon
deanwood
nincompoop
winsnes
fornicators
dabit
drenches
centrix
hockensmith
reitwiesner
oluja
kinawley
iass
uncategorizable
snuol
ancyl
brandán
beagley
daggert
scorzonera
steinback
xiqing
asinof
hojjat
shaikha
onyekachi
imprest
patane
bidari
ranaudo
felizardo
habeck
staniel
dauntingly
learco
solidarités
aiston
pegoraro
metabolist
ciresi
pasic
bowfell
petroleos
mutchnick
rubaga
romanik
sakubva
bilking
bems
setareh
fuzztone
orrs
cackowski
beadnell
villasante
rogin
manvell
kocharian
livaudais
tailcone
littlechild
panormos
ayachi
margenau
cryogen
dispell
cruzvillegas
ibwc
unenthusiastically
istrate
jannaschii
wannabee
trellick
mukuru
amiee
kalandia
geraud
pitshanger
salinero
nycz
raay
vastitas
wilan
cctvs
chauke
sakio
schwazer
copine
wettengel
estorick
patsos
supergrid
lendon
vaisse
samsons
crestmont
dipton
plaku
sudac
kulbir
piossek
pirrone
sigifredo
eppy
cronberry
ntsu
tecdax
sidha
beyda
bcts
gelan
mingming
deisinger
beelman
spart
denga
akakpo
breinholt
marhoon
pickrell
huntziger
humanising
emanon
pentair
rajyavardhan
haeck
laina
acria
underqualified
nerenberg
samayoa
redgauntlet
hainer
lensch
maerl
enerkem
burkean
cullaville
stoneyford
gonk
adnoc
zinha
tussling
sicarios
razes
rakowitz
iraida
deoliveira
anathem
quirnheim
barouk
flancare
arnost
gaugin
glocca
siddiqa
ramelteon
gostin
jinglin
makapuu
orsted
molinelli
narayanhiti
gangbanger
repect
manalang
loyrette
almondo
rollier
malkiya
manobos
yashraj
slinking
holetown
zov
wisman
saland
pequenos
ostell
huadong
jerm
thwaytes
burdis
steeled
touman
tabel
afwerki
editorialised
oakgrove
portait
enfolding
memristors
santacon
feitian
sebestyen
hodell
cianfrance
liptrot
ponzu
currenty
zeitouni
pushchair
housecats
ettingshall
ishasha
mhcs
rickinghall
fujara
greasbrough
firstborns
unfading
beaurocracy
novitiates
berumen
hellraisers
sedgemore
kingmoor
chesuncook
lewites
defendor
sadasivam
pacc
mileson
kelber
degenerations
collards
massingale
dhcr
takal
mansson
freds
eponymy
patali
flitzer
skyrme
sharkia
mdrs
cartoonishly
structureless
wellawatte
edgett
husnu
ecchinswell
chitralada
fauconnet
xianglong
baldassari
hayarkon
wohler
ncor
inhalations
dhupa
tantia
onatopp
goetzmann
grayland
susur
gobbins
expressvu
randeniya
slinker
taygetos
nozari
zappo
restell
iggo
valerik
arhab
ginns
prizegiving
helft
metropoulos
hamshaw
necrotising
udry
crassness
craigneuk
moleskine
einars
gricar
ahv
shobukhova
aunger
bacn
leverenz
achak
itchin
interposes
cherelle
deats
jadav
dicterow
jamen
whisman
rosengard
marimón
shaheem
mainetti
samboja
invigorates
cineastes
nipr
rnl
antons
glenshaw
refacing
beths
scherzos
sweetish
skea
velouté
brutti
uniloc
zieman
wendeng
conspicuousness
chateaus
themeselves
outfought
liquidates
jamband
oica
chambo
eveno
umred
elachi
debenedictis
pricasso
gwede
baccalà
cliquot
anaplasmosis
grandnephews
tsakos
koshu
betsky
deaccessioning
kostrzewa
overwash
monocultural
wwoof
ladled
kinglassie
ilhami
misinforms
shakura
parapluies
trainline
transatlanticism
gillooly
romanski
cleamons
verreos
difelice
simsek
pareek
doodad
mainstreams
watzke
ccfa
grisewood
cynddelw
ufh
bellafante
muhaimin
portesham
unforgivably
accoutrement
maitree
zaplana
balatoni
parwich
bbls
lotina
shaima
iannelli
lavorgna
superstitiously
scarva
kelmarsh
beven
heps
nahed
undogmatic
shunda
pâtés
imperviousness
sepulchers
becco
liddi
fleecy
goanimate
lingonberries
fdo
mabior
hicieron
penbryn
cark
musone
ndiku
gentians
clamshells
echocardiographic
foba
enucleated
laverstoke
celebuzz
höll
simley
pettaway
nwafor
fentons
lindis
barbree
zubkova
mhin
buggying
fisette
karah
hillesden
kilbowie
grahamston
bisho
ditullio
bealer
chinch
spatuzza
sharston
jiddah
eifl
nahdi
rotr
birkhill
topotecan
kabaya
ntshona
matatus
ialysos
summersdale
nikkanen
wondergirls
tarian
bedsitter
soldout
peychaud
misconstruction
sabagh
elbulli
genral
perina
goppel
mathivanan
madueke
felsenthal
gloamin
unfriended
spinless
bunions
dufrene
tanygrisiau
michaelian
demin
mombo
portskewett
nataliia
yse
trenance
anythings
considred
myelodysplasia
gerassi
superpartner
aggresively
shipbrokers
schanker
tetrominoes
khachiyan
chambless
muhka
orphanides
udawalawe
dhondy
hutaree
flightsafety
kabashi
pickfair
varbanov
hajja
foldit
publow
nanodevices
siteman
deconditioning
islande
amputates
brogel
zeshin
shahristani
freshdirect
choler
meropenem
groeningen
hospedia
eveningwear
golitsin
farukh
lumberyards
jibu
helarctos
tortolita
laveno
usless
spacenet
painkilling
eiderdown
zier
pinnau
eclairs
kolen
landesk
brizard
cambier
piaggi
guilbaut
siegessäule
mcgoon
begnaud
tufiño
sapolu
greyston
lampinen
albpetrol
kempowski
bluf
illions
disenfranchises
muylaert
homayun
remue
barcoded
axman
shimasu
castagnetti
dalswinton
overstatements
devilliers
youdao
myelosuppression
sisemore
neckerchiefs
baura
monoline
rebagliati
decisión
jeary
cowdry
paessler
ilkhom
ghahramani
hoevenberg
booklovers
volen
soymilk
childen
biotics
steffon
weeda
bovenberg
parlays
dobbertin
wigged
duques
kariye
karpel
medinat
flavanols
vietnamnet
khandwala
skadarlija
bewailed
wiesmann
aylesbeare
horningsea
shuttler
suavity
bxvi
whitevale
mortoni
krzr
kerkow
electronix
trialogue
philosphical
sldn
melber
masch
ncnw
stranzl
widmar
melany
valian
paedo
raghad
seitan
paciello
elisse
minatitlan
afrol
najia
manyi
yuping
doilies
thebom
donowho
hallingbury
faffing
mahfoud
vulcanology
minisode
whackers
musalia
atmail
flics
annees
darsena
viglen
vacuities
iqn
nosedived
customisations
befrienders
trabajar
wested
kuperman
surrency
paedophilic
deeding
wigglers
svilen
llps
jumar
magundayao
localising
illiquidity
outmatch
durette
teodorin
sparkplugs
mahboub
plester
gasunie
consolers
zdunich
macellari
xiushan
mykal
marchon
seierstad
prilosec
frankcom
raditude
consumptives
stmicro
bradbourn
edleston
biohacking
dapena
savennières
bahanga
camatte
newsholme
territorialism
choge
cmds
wiseburn
csorba
snapdragons
hallisey
yubo
evets
lineweaver
hogget
kaiserman
stompie
doubletalk
bragman
tsvetayeva
janahi
narcocorrido
issu
grindavik
mzwandile
sputniks
sapochnik
mcelvaine
cajones
spritle
krestena
poleo
freegan
oxi
expalin
gawking
hartin
decembers
photuris
footways
garcha
dobel
shepitko
petursson
fastenal
malph
ibot
monua
critisized
serwer
kelps
guanfacine
synaesthetic
soderling
youk
kinemathek
meghni
philippidis
daggar
lourmarin
autographing
killinchy
killary
hanukah
mcelmo
lunk
rieper
algermissen
nichia
crannell
nonfunctioning
greenplum
grimmest
telmar
cherico
diacetylmorphine
amson
fiascoes
postgraduation
fungibility
entrenamiento
udeze
pearlington
huwara
garnero
kreitler
benzarti
mathebula
mnisi
citygarden
chocked
sabersky
butko
natynczyk
aleqa
radovanovic
bleo
mooty
autoshow
saamna
unclipped
waldi
almosts
macanudo
ktre
schubertiade
soooooo
zeune
gurnos
fictionalisation
seychelle
spellacy
millstadt
talx
pfefferle
bellway
grabill
hamdam
grassfield
sagheer
rostovtsev
archerd
undergird
berken
besuki
chevillon
atmar
watana
ibrar
spaceliner
kulvinder
jaleesa
thurne
qalys
iscar
spalter
oodle
youds
scotese
mazhilis
rajakaruna
easthouses
buczacki
honcode
christene
tahina
caynham
segars
mulrenan
fressingfield
mccamant
magden
keepership
wihs
dragge
abukhalil
unsustainability
jonrowe
sodan
benitz
atuona
kutesa
bluejeans
synergize
fakher
clootie
dipdive
winegrower
taiyaki
milanello
rivaroxaban
bodorgan
lewak
ayash
romed
fiser
scanzoni
ziolkowska
pedrazzini
jaico
hanemann
pontymoile
lukavica
koenemann
sutz
sandle
schifter
malagueta
ischenko
clementson
colliano
suon
shonna
peul
chrystina
allbaugh
hespèrion
rsmas
cognacs
hyaluronate
morphologist
viruet
collabnet
philanthropically
tabards
uelmen
baringa
yosvany
kajlich
yousra
sportman
seighford
dulse
barriques
werlin
kakata
tallac
counterpointed
matekoni
jinli
superbubble
mcclafferty
jalala
noveski
trelowarren
gauke
rochell
bushwood
forston
garf
lynsay
seiver
cigarroa
cridge
glowworms
nickolaus
agboola
reparable
albita
tawanda
natco
sangak
pinhão
biskupic
kleindeutschland
junn
emert
misremember
wyrsch
larding
parlaying
jobstown
worldspan
aharonian
photographics
nicaso
kalniete
poultrygeist
suspiro
hamfisted
adastral
ditzel
piccalilli
gavvy
baoying
nouhak
deidra
turnock
boonjumnong
cheren
gyger
onyszkiewicz
kablan
bartolotti
pado
zedo
polston
piena
mexx
gracioso
buzin
stimmel
bernall
bryantown
budenholzer
updegrove
rubbishing
howald
longparish
sulfonates
mckenry
abdullo
martinets
waxham
ricefields
doveridge
clarry
kaimin
stahmer
lutsen
tommasino
gastroschisis
brassfield
googlewhack
gelukpa
skyworth
artema
miltenberger
cabragh
berenike
preqin
jakubko
telelogic
thri
ctms
myfootballclub
hufton
dieta
cysteamine
soldevila
jeol
broodiness
macys
helvenston
knology
schumpeterian
ducktail
pnk
maylee
numskulls
norimitsu
afsm
decarava
charitybuzz
funghi
definer
zarabi
kholodov
adamowski
diène
kadikoy
mataya
raashee
beigi
yueqing
halation
aroun
reseed
bapco
fufill
pitlik
themepark
invigoration
pacula
schmidle
strathendrick
clearout
sosie
mcduffee
sternlicht
ahdi
pugnacity
tesei
dynabook
strogg
iwmi
penfro
avowing
intralinks
horovitch
hypes
proabably
mascarade
csco
beechcroft
pickax
crosswhite
dunghill
exmore
sixstring
zettl
dueholm
retinoschisis
slickline
manliest
lienholder
thorngate
sietas
didnot
simensen
sheinbein
mppt
eaglecrest
mptc
oelsner
bittinger
grangefield
namhong
arkengarthdale
tcca
winnemem
tarazi
valcarcel
leyner
danay
lessness
stickwork
mildews
tolver
robynn
snay
hinsch
kennemer
scottland
weidenfeller
montorgueil
pedf
koljonen
tamulis
birlas
polemis
visted
endobronchial
moheb
shearings
chamblin
firdos
tabart
benbrika
kabanova
jalin
yusi
skibsted
currans
lefkofsky
sucharski
falciani
hqn
freespire
tacey
literalness
herzsprung
sweda
ithaki
cpvc
luevano
zekai
paker
brackney
iwmf
slobber
brandstater
kriseman
golledge
moonwalking
garsten
elementum
juvin
weijden
everall
runswick
culdcept
rappold
songfest
shehla
rakotonirina
gîtes
horng
dichiera
hooff
domspatzen
nkululeko
ameln
gawley
trasher
sprinturf
radelet
kovachevich
polyamorists
trug
supersub
ollivander
maoulida
nykoluk
svq
vaila
sbinet
chilecito
calker
overgeneralized
glodwick
cassoni
haibo
traceurs
kyaukpadaung
counterclaimed
datelined
freiston
hermen
joannides
kinderszenen
sexi
caira
vishneva
afterglows
skinflint
happenning
setebos
lightstorm
keasey
compliancy
nubbin
hayya
ablator
jeffrén
graycliff
ultimatebet
kospi
reckers
abbar
kervezee
scup
chemiluminescent
cfao
kupfernagel
waterperry
epso
mussie
pommard
popii
etw
capbreton
masuma
teversham
beanpole
deceases
ashp
trudges
boccardi
purnomo
telenav
carliner
corah
weier
birzhan
menheniot
recompose
kupersmith
wramc
gpda
kafé
lechleiter
dortort
azema
badjao
terrex
lendvay
controverisal
iberá
edenmore
statice
shanteau
kutna
untransformed
alack
pseg
candidated
zanka
saesneg
santita
consummates
anjuna
perimenopause
turbary
tomkat
shonky
publicises
raslan
sannicandro
mazdas
hochstrasser
deerbrook
sophisticates
itson
arsic
ohlen
strathnairn
mulcahey
loston
broomhouse
dscp
pelliccia
ohss
uring
afkham
winglike
uclaf
kidds
madcaps
hamedi
trumball
mettawa
kemoeatu
ipala
aktogay
futu
amington
sumwalt
nooshin
stylizations
artline
multidistrict
agrument
pittaway
covin
witherall
machinarium
medison
finglass
kleinkirchheim
professorate
krassel
crolly
canolfan
halfheartedly
ulemas
giraudo
praktiker
eshed
oyamel
mlangeni
nickolay
anpi
baninter
tastebuds
hantaï
agentless
acrux
rasharkin
berecruited
genner
masseuses
hathershaw
vagas
kuric
ecotourists
harbo
skaftafell
massini
mourides
rascasse
dayjur
ghashiram
latheron
aldringham
kadhem
guiltily
shamai
pfeifle
ammirato
illbruck
dugar
healthways
ustashi
tengelmann
pavich
stutts
zizinho
gony
ahima
ketorolac
lifeworks
nenthead
ignatenko
sankore
empg
yerbabuena
nijhawan
kanani
hooah
vadrouille
lupercal
khakpour
guynes
spuy
timbrel
wellers
darwall
panici
mazahir
ashbrooke
longhaul
skying
shearin
ahali
methanogen
guttenbeil
grazebrook
puris
sammel
japanned
pinaud
sentelle
yacimientos
acknowleged
sechseläuten
neurochemicals
setola
depressurisation
touchflo
craiglist
webquest
faida
kolly
phokeng
gingerbreads
garganega
catchwords
batook
frêche
schaberg
froedtert
halesia
relenza
biersch
wiith
ispra
airdropping
parast
suzhen
ayish
mabaso
benmosche
innse
berol
ownes
daughtrey
ukho
tainton
perkovich
fleshman
tendinosis
bhabra
naftaly
holligan
aslaksen
eristic
ecap
iogen
chattenden
charise
medomsley
foments
impracticability
centenier
raaff
aletha
soulquarians
scarle
sourton
itno
cuzzi
kristeligt
stenin
venustas
ivester
bentel
underpopulation
piver
gornall
nutopia
katumba
yourname
onp
batiks
kabary
zenonas
goiania
scaffidi
colourable
farmwork
gooks
unmetalled
spurgin
darrach
ashlea
maixner
robathan
gezim
fargesia
youngquist
sofcs
dimmesdale
zaffaroni
cavins
repositions
joinson
gerontologists
kleinsasser
methimazole
resurging
xcelerator
sermet
oyedepo
poettering
infanticidal
cpat
kabulistan
atletic
vatz
grevious
beshir
vantagepoint
bogaard
shamoun
nejla
licitra
demutualisation
poncher
tavassoli
filthiness
optioning
samran
mycoplasmas
unshaped
placidly
rothiemay
keleman
tauke
vasisht
clouts
kruta
goransson
wirh
spadoni
sirtori
belder
longde
dallku
nienaber
counterpane
gollub
ography
brakspear
uzay
mitroff
namedrop
sonare
conceptional
wormit
pantalones
caremore
trepp
opaqueness
overbey
collaging
warbreck
dyet
chasity
priveledges
palihakkara
espeically
cosan
akokan
dezso
pupovac
tallwood
yezid
saavy
guoxing
fujihara
shenanigan
ceftaroline
beddor
vansanten
anoush
pabp
medwyn
hmmmmmm
haggui
dorayaki
outmanned
olba
kaurin
savundra
castera
wilfert
livesley
gennet
moataz
siekmann
intrafusal
dvts
castlemore
maccool
playscape
suicided
uzak
quilicura
masrani
twanging
fadal
pinche
narazaki
mbss
aliment
edmondes
naturalmotion
accoding
homestore
zauri
cwmparc
merkatz
toolis
afld
chromophobia
crisanti
cleeton
grenell
preecha
chilstrom
susheel
kandle
cathinones
micrographic
bushwhacking
hyat
blankness
loungewear
suek
aslanyan
unfractionated
wantonness
oecussi
euronet
sdms
bailong
harolyn
sandaza
scollen
gdls
travco
kasell
croner
labneh
baoguo
stocktwits
gradoli
opsware
hochstetler
castino
balkanized
methanotrophs
goldfein
keisling
vukcevic
levangie
jarbas
ebershoff
kifri
okays
karakia
kickabout
adamsons
incentivised
corsar
unsteadily
keval
electrolyzed
kkg
redrup
nobutora
dysplasias
hlx
frcc
conneally
kaniuk
cayugas
curraghs
vge
prophylactics
cylch
erviti
wangui
yashili
blumarine
roosted
hcas
nodder
countin
tripleheader
lolitas
yacona
guilloche
telepharmacy
evidance
vacillations
wilier
jives
cobuild
nashiro
ashibetsu
mouthguards
aouate
lardarius
simor
redus
enervating
rensen
presure
gerig
gonzago
incr
cluetrain
schlocky
civvies
llandegla
slok
twiga
manijeh
bencherif
allco
macker
ludtke
ornellas
shadowbox
mouldering
deboy
tchelitchew
myomectomy
gurjit
hotlanta
milliamperes
sareth
eiir
lebleu
semidetached
reifying
chmi
rving
kennamer
hymel
imette
palmpilot
dallet
gallenberger
hoerr
kenmochi
zinnias
huisken
manha
chayan
ballindalloch
yvaine
sveva
prayuth
stennes
premed
halcomb
chengappa
culpables
planman
karkowski
tiviakov
bedsits
qazis
nesu
epitomising
kkd
confiscatory
malook
hvtn
echolocate
tranen
newmill
juyongguan
brannum
tábata
alexsandr
berckmans
grimesthorpe
sheika
bleau
tjan
mongeon
kimveer
querbes
rebasing
avilable
blaum
allaf
gaspipe
boyner
airag
aniruddh
zubaidi
holdback
yihua
comporting
jezebels
shabandar
maggiolo
tcmc
anencephalic
gratefull
beken
ostk
delante
pellin
rnid
faridah
trapasso
oix
narkomfin
glosson
bretheren
wahyudi
stanhouse
zeisl
shamah
jetersville
lazarowicz
wajs
dgv
dueto
santeros
itsec
brassai
belfi
oneday
hegyeshalom
voevoda
mssp
buckel
chomiak
mandia
steinkuhler
enoxaparin
drumglass
charbonnet
innerhofer
kilmington
undesa
kazdin
monicker
unfreezing
ardtornish
tity
fouetté
squarepantis
merage
geijo
glengad
ohiohealth
octopod
semiconscious
sioe
teboho
brucks
bogoroditsky
tuschman
investee
lekiu
onewest
egyptomania
xiangjun
cheapshot
herzlinger
nailgun
toshihisa
loughney
anisakis
yapton
unkept
blintz
bounders
newscasting
laundryman
pute
heroku
beaning
metalic
vadera
xenophobes
janmukti
evite
bolitar
lorbeer
tillar
incisively
dongmyeong
gilds
kerswill
sarofim
bludworth
methyltestosterone
laire
weeder
leiman
reclaimers
colludes
greenkeepers
swoopo
schwartzenberg
ginor
biruta
marymoor
wunmi
labiosa
chastely
mikels
haddadin
csrts
metia
politick
ossetes
prbs
cavoukian
darboe
pockmarks
madhulika
pouw
doyel
dmsc
competetive
sagir
tirich
cutbank
atomize
ljubodrag
sauced
particpation
escapists
lekkas
snowdome
bennies
unfilmable
kaktovik
marès
utem
importent
hesc
blowpipes
aframax
barbrook
biutiful
larratt
serologically
ulvert
gamzatti
darity
pensylvanicum
ypu
reaons
carinhall
proselytizers
atpl
efca
erislandy
boiceville
amarilli
treehouses
karrenbauer
pliosaurs
amason
rapaille
zapa
yehya
prorok
nomvethe
arvier
gine
ceramides
revealingly
salihiyah
psq
checkerboards
vcloud
comres
inum
mapstone
patriated
keshawn
greenwall
luallen
supan
rocktron
sundback
fuseini
yav
pulikal
schappert
unbreak
astrobiological
lefthander
woodview
wiam
preisser
veritably
sagdiyev
danzinger
fakroun
hübschman
exultate
cinealta
heywoods
waffly
centerwatch
yastreb
torgiano
ivelin
pelcovitz
mulryne
ozak
boonies
quadricycles
avioli
guofu
raffensperger
clunking
zberg
glemsford
markheim
islandicus
unthought
samaddar
saunters
sluijs
tekkonkinkreet
deuell
jamesy
esmt
bpms
cessnas
qalandiya
guangli
prepper
krasker
mrha
trippel
mcbane
ottolini
hockwold
soumises
cornall
kwali
telespazio
aaaah
stallholder
turinui
hillerich
aidans
matthee
transsiberian
kandhari
beerens
netlog
qrio
tigerstyle
sexualizing
buttershaw
cebs
qrd
sicklerville
nafdac
bechor
wome
sweepingly
lechón
nlbm
videoid
murmurings
pfaelzer
doornail
fadesa
xerri
frogley
supersets
undergrounding
prelaunch
kopjes
demolisher
averchenko
ayris
deanshanger
breakfasted
inteligence
tomans
microfinancing
hartbeat
gotsiridze
electrifies
berris
hads
optiplex
dazi
schmoozing
natz
watamu
ionophores
clausing
schmidly
woodwardia
wiosna
novikovas
lubow
patootie
cscec
ratia
papiri
enormities
ekkart
herzogovina
poten
iexplore
emrooz
littledean
idealising
tradeport
nirvanix
bridezilla
shaorong
kruzenshtern
lucevan
gurdal
reusel
goldschmid
pradit
nizlopi
sindbis
pabrai
toussuire
fotiou
whizbang
mosch
nyn
glowna
tigges
cuaron
juiciest
bpos
kilnwick
coolbrands
imroz
plutocrat
tresser
pushtun
utts
szatkowski
warburtons
valvettithurai
pfluger
yanfeng
bohrman
himelfarb
weidel
arenavirus
bacsa
invincibile
arathorn
vocento
masthay
hydrodynamically
tobon
shiling
neigel
daulet
lipow
proroguing
joviality
larten
shericka
birpur
starpower
costebelle
vaser
maxum
cartwheeling
cornelsen
kharrazi
rubloff
globalise
boekelo
aayush
aliotti
comparisions
cioe
familicide
williamwood
undg
bossart
turbulance
oustanding
carpano
perraud
kortedala
zunior
fischlin
pornthip
cockling
framatome
photoflash
psychiatrically
lockin
nemirovsky
harmonisations
thorntree
leanness
farepak
friedler
pollara
fanboi
nbpp
recolonise
cringes
matteini
samiha
jackee
serricchio
praesent
akwaaba
ostfeld
werz
mendola
tuttles
metabolomic
binstock
alhaarth
timergara
comella
huraa
gelsinger
klingle
contrariness
lazara
pricewaterhouse
washcloth
spierig
newill
braggin
dambisa
pices
crunchers
yaponchik
mehlhaff
eits
uestlove
haggled
tonucci
codicils
kehillat
bromstad
simandou
madbury
slyder
teklemariam
berkhampstead
amezquita
demarche
trebetherick
antsohihy
mirabito
prais
baccelli
kenepuru
frankos
demutualized
foremark
hostas
quf
decarbonization
chastanet
trinchero
roomie
khowa
chaouki
mantica
analogizing
haydu
manifattura
cannizaro
soufflés
burnetts
nehar
whistlejacket
biddable
inconsolably
rouille
numbs
mixcloud
straubhaar
kountry
unpredep
kendro
gyrated
balpa
fayadh
icaap
certicom
ainars
herdwick
halliche
goligher
nightsticks
burnmouth
haverigg
sautner
ofheo
pokou
nteu
earing
sandpile
shalah
sarsden
denzer
bauknecht
nonporous
feifer
gxs
galthié
tostones
zanan
coutry
subependymal
cect
kowit
propitiating
terpning
odstock
marysol
griswolds
zied
makgeolli
feinted
thalassotherapy
noneconomic
warney
cookouts
confrontationally
defatted
contibuted
germà
biaxially
trindon
rauschenberger
househusband
reken
democractic
nistri
saids
youness
barama
fosterville
keroppi
hamuli
knapper
makkasan
moalem
noffsinger
dorre
chiropodists
cressona
mccart
yongzhi
koets
avobenzone
wisdomtree
mehrabpur
nocito
bodian
amercian
centreback
daywear
lancelets
hankmed
disorients
injectables
janee
commiseration
delibrately
whispy
rayven
kerstein
giclée
calleguas
natsuno
geere
skinstad
accommodationist
snekkersten
tortajada
steingraber
stechert
pijpers
polcies
hamshire
phantasmal
mcfarren
ancar
minuteness
snagfilms
behavious
jeremey
sherbrook
grandmotherly
congestions
sunami
vuono
shamva
directionals
deputes
ngudjolo
repossessions
ravenshead
raetz
swappers
horsenden
harandi
kobad
quamina
melitus
logrono
kynge
woodlea
haapanen
kiaran
sepon
otdr
columbans
videolink
unti
wampe
lyondellbasell
shanelle
gelardi
supplicating
deadness
newmyer
tautz
hreik
nafld
mercal
mendlowitz
freuds
neij
whaddaya
sanest
tacis
mpongwe
uwire
estafeta
boursicot
sheridans
respons
swauger
errored
ugas
menacho
simri
mianserin
beckhard
sharfstein
lanci
cids
hungrily
missles
avern
sannes
kelkoo
kassiopi
nordhus
catw
sumann
perinatology
tjaden
nonvisual
yuganskneftegaz
coedffranc
desser
aereos
leymarie
mktg
rustbelt
guei
chenette
valvetronic
quatt
robla
yaitanes
lutalo
beoley
mysteriousness
pajam
asmah
hamler
mizra
rangaiah
chiuso
rushy
statnett
tornante
haldiram
gtalk
mutitjulu
puroland
bruh
taglialatela
reinvigorates
formenti
shinri
pustelnik
louisine
mnich
aysun
hicok
gibbo
rabelaisian
tenderize
coxhoe
yanggakdo
yulieski
hfmd
saita
senoia
avinoam
demore
hamidur
emblazon
lampur
riceland
vilt
badeaux
gopio
mellerstain
souffrance
tsakopoulos
berkmar
osgerby
telefoni
ghanoush
hirshleifer
giambastiani
wittington
flensing
gershenzon
vatterott
witherslack
chakraborti
sneakiness
openmrs
ratjen
hierl
icross
dolga
afrasiabi
januszewski
trovan
catanach
jacquemontii
maffucci
gaffin
thermokarst
cropthorne
newling
blacksville
steepled
dantin
pomalidomide
voes
woldemariam
nirut
cuya
demystification
zakin
marzolini
catoche
herzenstein
wonderfulness
tendencias
awesomest
,not
convertor
serpentarium
tretton
atlasjet
caparros
mumsy
suceed
stouthearted
requejo
cyclope
winy
asharoken
husham
brumder
overstays
bekes
brushback
duckbilled
reallly
tassone
unfragmented
walson
odong
ethisphere
furzebrook
availibility
skytel
laurieston
preapproval
communed
thurton
sealable
hazily
grebner
lasp
euroarts
funari
aufschalke
kottwitz
stoneycroft
wickstead
stratta
sotirov
istan
cnosf
paleologos
nyagah
curphey
szaniawski
ciancio
agreee
wardhaugh
kaleidescape
urtubey
linguine
shawntae
belza
googleearth
pntl
ripplewood
fischers
skiway
sarwer
squiggy
tracys
adhiraj
hauswald
krzyzanowski
callam
superimpositions
paromita
lcec
thatn
gullo
rurally
wahiduddin
herti
nestande
ratifiers
naumovski
psychedelica
ucal
oykel
dolgoch
harridan
klym
muchall
muthoni
blackson
liker
limeade
servaas
fiacco
xiahe
islamically
reportid
fictionalize
medawachchiya
spratley
rezazadeh
geomedia
kearl
aravosis
virtualised
sessoms
lepeilbet
houze
rastrojos
langfeldt
humpers
drukman
radzikowski
healthscope
breindel
bergel
wpcs
ijamsville
harled
laudomia
nuvaring
ameria
thata
floella
unattractiveness
maib
lebid
presumeably
pecksniff
pappert
dkms
aara
matthis
middelheim
alleway
bulluck
everidge
bloodying
subas
desagana
rylie
petitti
huseman
rathjen
cybercrimes
expensing
cyclobenzaprine
magheralin
tamariki
jetboat
tobocman
tasch
acce
lledrod
socarras
crurotarsans
almasy
exasperates
qiliang
odland
seabeck
planarization
obligors
khandaker
dumpstaphunk
ataba
freej
guigui
moosejaw
tauqeer
ached
michielsen
nuart
parlourmaid
walchhofer
prequalification
bioshield
ahamd
vissers
iachimo
maligns
märzen
ildiko
mossler
imga
petrols
hainz
behmen
shatteringly
fertel
resignedly
engell
softlayer
ulsd
huaraches
deviser
pasetti
pittella
aulton
volksbanken
famouse
seargeant
percolators
ranibizumab
heida
bloater
skymiles
quilligan
gyroball
gofers
waldenfels
torquing
kalikimaka
antron
khumri
kandola
philippakis
zadra
battat
sasanqua
bialek
joani
paddleford
brynden
schallreuter
kuechenberg
spradling
norihisa
weinroth
chatelherault
peduzzi
baned
precent
shontayne
kerchove
bullfinches
abrahall
packbot
chieftess
copestake
kosak
irschick
vaciamadrid
judeh
chaghcharan
antillano
lgh
alleluias
tiding
aspall
cishan
sarnat
tramel
urbinati
mpcore
colliston
rissi
fonzworth
peiyuan
karatina
buenrostro
handelian
collee
consitution
flaggers
conrow
stunell
vicon
hourcade
ortal
boutle
gaowa
issing
iglo
tavecchio
dagestanis
sciennes
aisen
snookie
mayassa
gilgel
virality
korangal
cessa
gobern
bamf
middx
perlez
sandbeck
nephrol
cetainly
camisea
barbless
ferrostaal
hornitos
mandak
addley
bartholow
penhow
quenington
wilonsky
malborough
anthim
mrisho
antos
ynyslas
dilmah
strandhill
sciency
summerbell
jebet
inswinger
nonthreatening
naquib
delbonnel
purepecha
dastjerdi
woldu
unsparingly
rummages
lemes
smses
socialbakers
petrin
bedol
hockman
godlessness
bonhill
ungpakorn
etait
anaran
xijiang
baengnyeong
wahabism
dassu
dirtball
bairin
playpark
unmin
canidates
stingel
gudrún
whta
batholomew
mcmillion
belarusan
khalap
circuitously
descalzi
autoridades
thati
protz
luxemburgish
bouajila
slayback
rissmiller
liebst
electroma
rabeni
blackballing
kusadasi
poolewe
marret
angio
vedo
evg
harminder
veasley
silvestrov
joffrin
avellanarius
esiri
mucklow
condorrat
immortalizes
granfield
ifpma
snitchin
nerius
kelisa
staikos
rothert
lechat
overeagerness
crookedness
kinlock
tinsman
magueijo
oner
anglophiles
iavi
locane
necklacing
handiness
opulently
herschensohn
abilio
multicamera
perling
volpert
ayittey
restitutionary
boardercross
virdee
sagehen
hennicke
lizeth
disfiguration
fabulation
charol
micahel
mcsweeny
sosi
moneyfacts
hurson
hyppönen
yakushin
badawy
ultracapacitor
gaddes
hoty
deathtraps
taqwacores
marnoch
soroa
stoep
mullinger
ernen
qichen
hsdd
khannouchi
emons
microdots
cooperazione
setence
plaudit
syan
appello
villemin
rudresh
augello
quittner
lazonby
solderless
goryachev
actel
roustabouts
chicote
groovers
realschulen
kirdyapkin
banadir
garfinckel
embeded
billionths
tirabassi
groundlessly
giftshop
heatherwood
antonowicz
lamri
doomtown
heloisa
rangnekar
blendtec
caganer
guestlist
bourgie
schoeni
dandenault
astemirov
pikit
waterwise
lrz
jwr
badas
limpy
bellers
abastos
temitope
acore
garajonay
rihs
barrese
corbella
purrington
wladek
hiving
burble
radosavljevic
chunyang
santra
austrailia
caspofungin
venessa
pingping
monjack
escc
adado
gambira
jakati
particulares
peignoir
holystone
adran
lathem
landwind
lomans
underinformed
ottos
ddis
wottle
lakenhal
leuchten
papantonio
loewer
wilczynski
bipv
kingarth
prate
radioastronomy
layevska
telegraphe
obaidi
munem
gman
oelhoffen
fasolino
ational
hubbing
philipsz
kafta
catastrophist
hutments
gunite
giganticus
rashford
underskirt
boffi
colapinto
demello
deepal
vontaze
lockleaze
vardell
petrifies
balluta
stutton
zekri
cuis
mosti
wintv
paranavithana
suard
urness
schildknecht
wakimoto
chappa
riffelalp
guberman
amdy
barvas
planyc
gettler
kayyali
adenoidectomy
howevers
konchellah
preceed
outcompetes
carpoolers
dangermouse
rosemoor
ursua
cê
attivio
lumphanan
wuertz
optimises
microchannels
ewerby
imtc
quarterpipe
gius
bodiford
dehumanizes
kongevej
intermediated
monofin
cercas
duvets
skunked
pipistrelles
elcomsoft
tailenders
gastroduodenal
srodes
winothai
govermental
oltremare
thembi
misattributing
xpression
ravey
nodong
xingdong
grieveson
bolom
cahana
outdistancing
bloodred
cgro
nedeljkovic
bartletts
akpo
epv
amrany
supsa
laband
iconographies
zehetmair
lucketts
milinkovic
reargue
sagemiller
mervat
alveolitis
fluffs
passiveness
egarr
autarchy
pbcc
zurn
beyerle
guderzo
pfic
bredwardine
jiping
eulogia
cheniere
chionodoxa
craighouse
undistinguishable
celotex
ardfern
calco
intersexed
laganside
boehne
elfreda
radojko
quoile
hertsgaard
bitove
grinten
tadepalli
ludes
ginder
allensbach
caldarelli
smeltzer
spiritless
pennario
coile
desharnais
yelloly
jervey
timmie
getzel
iraklio
multilaterally
bruener
lugner
boskin
copaiba
arikan
hexamethylene
compatibilities
nadelmann
dromintee
timani
basinghall
isrotel
mahnkopf
celades
maimaiti
parirenyatwa
brannagh
pithily
hareth
venerini
decendant
daghlas
watchfield
miscasting
institutionalise
dutko
knic
roselee
salans
particularist
yordanis
yuniesky
closedness
scotchgard
ecolabelling
canjet
stenner
loher
midatlantic
echarri
lentos
asjha
brielmaier
hospitalize
wolmark
unchartered
khoresh
bronchoscope
palocci
resop
goggans
spoel
jhd
willebrands
korzen
wineskin
cleanout
prescreening
seona
tollard
delelis
hunx
smoggy
bettington
velchev
filz
tryptase
pokesdown
tablers
lembongan
kamie
ritournelle
baulks
moshier
stather
combustibility
nece
grieshaber
lungley
tenpa
maunde
urip
rondels
jessiman
rkn
haefeli
gatell
tentpole
martearena
pimmit
shorwell
dorment
comuzzi
maplins
milc
orandi
neuqua
rakhmonov
sebbe
chronopoulos
unibail
masto
stoutest
motamedi
knutton
vatter
ranchlands
jingzhi
argumenty
astec
cirac
mihangel
worleyparsons
bernhardsson
mellissa
fredin
ogrizovic
horreur
stormville
mellotrons
genowefa
battre
boultham
movenpick
mohamadou
skurnick
sautee
jomphe
gaube
aaii
kagin
dechter
abrahms
burela
bwiti
kilve
hayrides
documentry
trivialised
xva
fontainbleau
naimo
maciá
awtar
propoxyphene
galic
maranon
stripp
barari
nonsmoking
fatmi
abriel
iiu
sunrider
ciis
maruk
cdex
flyglobespan
powerplus
markwart
tornabene
geerlings
midgely
resynchronization
pullapilly
macspeech
therms
mwambutsa
todesbanden
dubhe
uhhhh
aviod
eiermann
ligairi
cils
floozy
niedere
boundry
ulstrup
glenmoor
caseyville
drongan
kitzbuhel
garcias
lacerte
quinsy
guilsfield
newfest
laili
sidang
omelek
schwartzkopf
ryzhikov
iovan
szafran
turbogenerators
arli
comfirmed
zoshi
bransten
gobabeb
vvi
cnns
momart
gaomi
softmax
condry
suhaim
rodowicz
jozias
littlebourne
meribel
monopolisation
facilier
fasola
leever
portera
grassmoor
lievin
epimedium
abecedarian
tzahi
bankamerica
luze
daubing
tryline
grandaughter
combourg
binamé
berria
rongshui
quaytman
metabolife
abdulrazak
anakena
upconversion
palepoi
upshift
natanya
slappers
bowburn
sirvent
gautieri
waterbus
pharmacoepidemiology
harush
cotterman
theh
pretention
dipascali
rhodin
healthspan
dzhioyev
saveurs
sitthichai
westlb
dergue
ferrazzi
liudas
celimene
cataleptic
faru
cedewain
dallison
pursuaded
yorongar
aite
gramscian
sludden
mehdorn
dyrosaurids
phenomen
purpusii
neuropathologists
pdss
duchscherer
devy
kevo
phoonk
dangel
jewfro
mindstorm
attenboroughii
nitshill
reporte
tiefenbrun
lisbona
windtalkers
qamber
giclas
homaged
rbgh
shiquan
gnv
discman
forcefields
warson
asbarez
sublicensed
troshev
iachini
makutsi
curlies
duprees
hershy
dujarric
dauch
simione
pedini
brizendine
bongi
staaf
ashapura
mistick
easkey
elenita
moayed
glauberman
grzywna
kagwanja
espc
chemises
dolkart
vannoy
dulaine
janifer
kamajors
stompe
hemy
ppta
jenoptik
saqlawiyah
mellowness
jaua
desigual
takas
panish
muguerza
marinucci
obies
jannine
botai
intracom
eyeblink
neurotrauma
polyaromatic
kibbutzniks
congealing
bakonyi
flashbulbs
meldreth
shaviv
hexam
positon
trippier
warfel
dorazio
corriston
quinsey
cattier
obana
macnabb
soreq
lukensmeyer
misdirects
fickel
touil
calda
suang
rallier
souleiman
shakiness
luxuriance
pavarini
weensy
alzahra
zisser
karachaganak
meacock
nibblers
wbx
kazakevich
econômico
stuhlmann
fibrosing
hashemzadeh
saliou
rolltop
logsdail
akepa
nitpicked
merzbach
agonis
sandborn
krupka
neumont
guttersnipe
keycards
thunderbox
prosthetist
nunzia
honua
kisor
skydeck
nobert
ridlington
voaden
bauermann
bechtolsheimer
inhs
ruinously
wendys
glenholme
dreiling
casserley
tableside
heven
fruitarian
stratifying
farri
ausaf
sewp
unauthentic
reinard
widenhofer
petfoods
sweetshop
zouerate
witchingham
fliehr
bandeja
oconnor
nanogram
electrosurgery
kimelman
morohashi
sprits
savater
rechargeables
meroni
melotti
okapis
helendale
uplawmoor
arcsight
pompea
significan
kjus
thoreson
zarou
osakabe
dongho
kampongs
arvato
astrobiologists
dittus
zarins
monterrico
mycle
mwaa
ceep
unanimated
codeblack
stomachaches
llanllyfni
nympsfield
goedert
levoir
ukrainska
panathlon
handsaw
rauth
semblence
metanarratives
lacavera
gwyndaf
appeaser
bruenchenhein
pontification
pensively
vlasák
maximizers
denninger
doerflinger
qawasmeh
borovec
oyeyemi
holidaymaker
zawodny
portell
ropery
enourmous
sickafoose
zambar
twentysomethings
oluoch
tollison
aframomum
scheie
rhosymedre
younglings
geocaches
redroofs
cashdan
floatable
pyrocumulus
traid
ptin
kneebody
winsett
swinbrook
ellson
achkar
calata
poillon
dendroctonus
whitsand
pharoahs
hinterglemm
sarvo
kufri
kundo
platonically
northbay
copnall
fmcs
salhouse
maghen
ligouri
wooderson
skateistan
idlout
raisner
nwk
weev
cristophe
stasevich
killeavy
biobutanol
diwaniya
fillo
catacutan
navigli
oxborough
propylthiouracil
garitano
sokhna
ohlund
aissatou
ceaa
groupuscule
ifema
tchmil
basini
operationalizing
disarranged
mtrs
channy
bleedings
hajaj
biotype
tiet
stelmakh
baumkuchen
downtrend
trellising
bookbag
publichealth
upholstering
sweatin
rallys
schrenker
rlw
amikam
barnardos
shanin
caneel
intergrated
schuil
immortalization
kokua
wieting
wladimiro
bigonzetti
kekich
chasanow
eatable
silverglate
fleenor
noogie
mulatos
houhai
hoagies
raichlen
sulaymaniya
undercovered
falsly
lobero
parky
similiarly
cherien
rhessi
cossham
shainman
treffinger
slapshots
tmap
iseas
alecko
badgeworth
alberobello
eirgrid
gemar
motezuma
inviable
jamstec
umca
lapushchenkova
longfleet
jamiya
trabants
cnnic
bulpitt
earther
blencoe
teardowns
tangalooma
wonderkid
cullet
lliwedd
sifiso
toben
ballykeel
drendel
focis
slenderer
emotiv
kurwenal
pakoras
mekele
prominente
nonorganic
arlauskis
vandereycken
zonderland
presbytère
preppies
titulaer
microblaze
ludovick
hawford
showery
jinpa
yomp
kaster
shizeng
dison
bonifant
rafetus
balfours
lrit
imperils
chubbs
oversleeping
compsci
spacewire
klaudt
turpen
biohazards
dreifort
allthingsd
beardo
pyritic
damásio
shasa
voicestream
superhot
bogachiel
nappier
moob
accretes
financers
radulovic
incorrigibly
barenholtz
malsor
blazejowski
ulimo
gurgles
olivarius
pytchley
perficient
lovestoned
abbondanzieri
thoratec
cloten
auyuittuq
counterpath
largly
psbs
lodh
lajitas
mieu
bihani
monot
weissbach
moskit
shirlie
hilliker
scottoline
abdussalam
cothill
crossenny
biodesign
chunkin
clubgoers
vogelaar
bookmen
feminity
picosat
wirsing
mamphela
ivanisevic
swigart
zuiderent
nigp
permal
lammerding
mgahinga
lornie
arcsoft
corrosives
totalitarians
nause
nalluri
lorino
fidencio
zizzi
wupatki
kroo
permament
pieties
egoi
dieker
padil
tricom
trefgarne
sidelocks
kroesen
freebee
acheiving
treglown
atheer
wellcare
lupeol
bhajji
hatmaking
khachik
findern
andreopoulos
porgies
cassity
slask
trebly
phonepayplus
ferrett
amieva
ghostlike
revelries
marrel
fance
lekgetho
hurll
pheobe
crianza
laikin
scauri
madurodam
overgate
yalumba
basmanny
hammertime
pricilla
ottenberg
conyer
gappa
tmcc
mcelfatrick
anthonioz
beautyberry
parlá
rauhihi
brezsny
lamara
spuhler
vivisector
nhlapo
traceries
brunches
decoutere
streeters
tadamori
rubenesque
polychronicon
abouth
tarangire
scopwick
loizides
reihan
kial
steeps
kingley
trauger
rugunda
sonagachi
tomenko
unbox
vanburn
wonderlands
tomasik
filevault
poringland
gunderman
averageness
bombsite
dibya
lakos
dobber
enfields
islamize
potoroos
joggling
kippy
azera
vallow
pornos
colino
fttb
msrc
matze
chevrefils
celf
cappucino
barkerend
laodicean
naide
rajak
bmhs
onen
najla
yatom
rxt
hirons
azziz
rudong
plodder
atba
snepsts
craigen
acquavella
worsts
scrinium
matsuhashi
carhop
drexell
wnbf
lauby
mlotek
balmat
aptt
crassly
darrett
caraco
lucimar
sandmen
liemba
titanoboa
efland
megalyn
lalmohan
cinderhill
shaikan
arbitrageurs
rogatory
parant
hesher
boomsma
ektar
pompeians
wanguo
camerawoman
curers
bigotted
lehra
homeplus
disembodiment
pacbell
cyclos
seym
moucha
cerridwen
momotombo
bouret
nanduri
celesio
bisnow
dannah
handpicking
ullamcorper
kathryne
scandrick
norne
blares
magnaye
burhans
athanasiu
ferriera
nhanes
sajous
darwent
alvarion
braccia
eekhout
stacher
inaudibly
aquascutum
marchwiel
chaverim
flowertots
baljinder
wissey
migiro
gibbets
sayong
possiblities
oshchepkov
kinton
ravetz
hadfields
potatos
tampion
jestyn
mohatta
dolts
wenjiang
danzey
gundogs
scalds
raske
lippspringe
ittai
eastcott
zylstra
hollingbury
novlene
sentimentalized
flystrike
handleman
comprimised
moviebuff
appropriators
tondar
goure
raskar
libeccio
montchanin
petrobrás
paradeplatz
simco
vwi
linneman
brayson
turweston
liszka
omnicare
merzak
sermilik
fisser
cathro
beyster
snoopers
dylon
iwpa
leppings
hurles
michiganders
hiney
emtman
hypochondriacs
villasanta
penninsula
europeanised
douchebags
eagen
pattisson
skimmia
roesner
phippen
pavant
enfinger
wolfberries
jessopp
dahmane
cochi
yasny
rkf
forbearing
sphero
luthar
volchkov
bishopsworth
falvo
umas
behnoud
yinger
venini
transfigure
longueira
teavee
wingates
sukhinova
doumar
duette
virada
kerelaw
uhmm
vachani
earthships
darunavir
muthulingam
sonkar
marchy
driftnet
hefferan
geml
epidauros
takeoka
titwood
vodcasts
montario
bornet
rowner
shervington
sommerstein
aktyubinsk
barding
sikhanyiso
assuages
dietzsch
hawed
ramnarine
lifeclass
resumptions
bookexpo
myroslava
hlady
elettaria
weatherburn
rockii
stracciatella
hurleys
musicstation
rawly
teinturier
ashelford
muxes
songtao
blastic
francoists
farangi
schappell
mostarda
omapere
mctwist
sheerman
shimba
binnington
effe
harles
marrese
shimpi
cassanova
halbeath
wenban
osteocalcin
buyin
rivulus
scrummage
emodi
louviere
muaz
amerasinghe
satel
democrazy
igbp
yaichi
relearned
reverbed
dpz
polycephalum
osac
paetz
aktenzeichen
nafh
khoroshkovsky
omelets
zhangmu
leamas
pandak
qud
cimm
trongate
midways
munyaneza
attemped
mccoig
ywc
peregrination
unipolarity
slobs
tetragon
prusak
sportsbet
fatsia
changho
gasteen
czeisler
sarubbi
lacome
saurischians
okecie
douby
lochead
promesses
tatafu
fitments
hattam
neifi
awra
legambiente
oilskin
eprocurement
hostilely
wilmorite
convivio
wingspread
sabaot
wattson
dazeley
chetry
swade
bacos
bernette
hairshirt
alavanos
mushahid
huichon
botwnnog
barnz
tacci
unfreezes
hoder
muthuvel
toshihito
itms
hodgin
jrtn
mesnick
cittaslow
bokma
vicissitude
majmudar
theatreland
anisul
ulich
prj
hollon
bcrs
lyminster
porphyromonas
qes
schoolbus
malkeinu
bartee
cowbit
conveyer
dezza
steart
plinky
ackah
stupefaction
danwei
ackner
nuweiba
jfx
mccreanor
braslavsky
ramipril
attique
peterculter
inciarte
vidic
accordant
encumbering
jochim
fallouts
tskhinval
mugla
tonghai
gallu
hausler
broadsided
commandaria
brogeland
medero
yipee
unaggressive
shakedowns
csst
ruttle
deblicker
foinavon
cppr
matsuev
ostling
blattberg
docusign
undemonstrative
lensbury
bänziger
woodbank
contry
curvier
unfavored
patronizes
bolometers
caterwaul
headends
addthis
strateg
dyspeptic
malandrino
lamprell
bolze
whatchamacallit
bulleting
haggas
franza
tortor
toensing
gressier
cisi
kirkcolm
kaiane
nmrc
mahasin
manoncourt
woudn
burkan
carolene
absenting
rondinone
aniva
romanko
vukicevic
rets
yakhouba
cumberworth
premacy
stenius
irresolution
avanzi
megginch
wolkers
bloglines
benzos
faurecia
alverthorpe
penglais
brfc
fizzling
galarrwuy
fassberg
declassifying
pinkaew
haensel
cattouse
rizeigat
leftish
muigai
supermajorities
badenhop
dimasalang
cvca
skoyles
barkhor
redrado
nagourney
clubmate
whimpers
aircar
icewine
pummelled
whitelee
munafo
mohanna
elmswell
karatzaferis
elscint
kroons
primesense
adelsohn
eidan
dhusamareb
domenicali
kurfirst
xct
setoodeh
racaille
mustapa
guiral
antidrug
hoggarth
theodoropoulos
ronaldshay
counterspin
serfaus
isssue
cadnam
drewermann
teesport
redated
kiess
chionoi
milhazes
veley
snafus
skep
chemoprophylaxis
massell
squareness
outsprinting
velikhov
nexhmije
tuppen
doyal
kapka
aossm
bilderback
muddier
embi
uthayakumar
mccheyne
iluc
maturer
gurland
atention
schloemer
stellifer
nals
vitalize
cahoot
scharping
assesments
bankamericard
themsche
woodlesford
yujiao
beral
demonologists
hydrick
bleys
guiping
tushita
gotomeeting
chijindu
latet
dandified
ewington
karimullah
ukcs
allsports
mccomish
chlorinating
muslimah
bricknell
ilgen
liitle
semerenko
nirenstein
milnsbridge
abdulkader
bront
kizawa
friedli
cathe
palatin
porong
gfz
garatti
anderes
natella
adenike
ommc
camak
schoolman
tyin
penlight
guenin
orientating
letestu
clise
saure
everytown
pukaskwa
cryptochromes
taihua
galekovic
larney
kacin
repurchasing
scusa
modifed
decended
newsreading
nefes
knome
anthropologic
transformerless
kenroy
mcdavitt
elwa
chasubles
ethylmercury
bublitz
metrohealth
odama
morebattle
kalkhoven
nashir
pencey
sondheimer
rooneys
antidoping
alela
nsba
tipner
leleux
cabbar
vocalizes
massetti
borgella
veracious
anagnostopoulos
rollable
marfuggi
shelnutt
abhorring
viticulturists
pokery
stanno
flusher
maymana
kerl
iyke
ralitsa
limaj
wikstrom
naevo
jern
tawake
intellegent
gaebler
atat
dumelow
ogx
ambion
triffin
twiins
youthbuild
jasperware
rauer
whissendine
mcleods
gerobatrachus
stefanescu
sifters
milosavljevic
denize
eyking
rauzzini
marksbury
dingel
wholegrain
shehade
kilson
lodin
mikesell
sherzad
medifast
mulisha
hemiscyllium
fadell
papermaster
vuguru
epilepsia
hardye
gibsonia
haeberlin
yankelovich
backlashes
supes
petushki
andreikin
dyllan
pratical
deveney
isses
jidda
lanugo
aaldef
nabozny
jerseyans
knockando
vittatoe
rummikub
nihonmachi
fuzi
gillenwater
financiere
gakayev
arraignments
bassat
medicinals
philospher
vandermeulen
consignees
sawhill
sjostrom
legitimates
bugajski
sangiovanni
museumsinsel
hyperventilate
genser
bevers
bowheads
lrip
sahal
mcgorman
brissie
beefalo
palpate
bluecar
agrifood
honigberg
speling
bayefsky
tangas
ejiro
nigsberg
killelea
orbiston
derwyn
weatherstripping
berrynarbor
catheterisation
guadet
deleeuw
fortepianist
rulling
doulas
wfed
nashes
babyhood
kunpeng
ikitsuki
bogden
gerberg
ucea
ambiga
rucking
melda
kendy
downshifts
aigu
jundt
navtej
shakhlin
stereoviews
nottie
vuvuzelas
tramontano
foulbrood
gaito
eatman
lumbala
mboma
gymnich
motoryacht
chollerford
hasl
cauter
keading
gelle
mommens
mangane
controversa
barbone
ryozo
uspis
garbowsky
excelente
realmente
bedfellow
norooz
fritted
afnan
felly
mooing
pyrotechnician
untiringly
haitong
batoka
toileting
exacttarget
practioner
mfon
granvia
trzeciak
treament
caijing
tremeirchion
witman
kadewe
celebutante
cucci
sundried
socol
ruvkun
madaleno
banahan
goldbugs
movment
parklike
ressurection
lecs
uerj
techni
koly
hoseini
unconventionality
barbeques
dawdle
cidp
walthour
cornstalks
kalakuta
vcb
withthe
swanpool
gruffly
aerodynamicists
bumming
theladders
significa
tobback
covingtons
linamar
paveletskaya
redlake
bharose
zytel
ciclovia
slavkin
rothay
middlemas
bootee
hotdish
verron
intelligenz
tarara
irom
santiesteban
bolens
pyrithione
subconjunctival
chachere
shpend
lyst
debriefs
bluecrest
kuppusamy
fikes
challender
guggenheims
chongde
muzito
worng
beilenson
zuhur
llangedwyn
farcot
kinkajous
febreze
shosetsu
kathar
cashpoint
rounsevell
souki
intercapital
godbolt
citarella
raqib
nguyet
exsultate
centrifugally
postlewaite
uwak
longnecks
undetonated
mitgang
bawar
adamses
fobbed
firepool
grimentz
blikkiesdorp
geophagy
katende
honeyball
iadc
linkshare
standeford
onstream
jouf
vampyrus
guelaguetza
renationalisation
neorealists
meakins
xitong
hauswirth
aapor
tituss
outworn
wildeboer
shusett
sunswift
behoves
gargett
willerslev
mamère
tenku
craftswoman
engraftment
shamuyarira
lefebre
manenti
centanni
belleayre
intenet
ucst
wscc
catano
icba
broon
karalahti
blackmans
kcas
werin
taroa
napitupulu
wagland
cothren
saprolite
santacana
tolvanen
iliushechkina
chandrakanthan
yosu
arquilla
bellars
mirikitani
ciofani
amytis
lambri
anagha
veerabhadran
usmonov
munitz
trosch
nasos
esquer
hrelja
bewsey
gluts
kiyoto
bennifer
bramerton
bradbrook
realigns
abkhazi
murakoshi
richelson
barkeley
zwentendorf
irvings
londono
greenfinches
ghriba
vallorcine
kmic
garnerville
rossos
frobel
dinamarca
mohring
bezerk
brookhurst
baranes
aariak
reighard
debello
galyna
mulkerrin
brebis
zhengyu
hairmyres
kittisak
ocfs
deanston
kingsweston
gazar
prospeed
geria
gestas
laini
madlala
fasters
truog
badrutt
biocapacity
goldcrests
considerately
gilbertian
botin
sanparks
dovico
ostendorf
hempseed
delerm
horseheath
qomolangma
pacentro
couso
bhelliom
truglia
haryono
skunky
zulkarnaen
darkland
vesconte
naite
minwoo
salteri
zimbler
icklesham
shalfleet
conductress
vagelos
zanies
emneth
haldanes
fiterman
morskoy
bantered
stojkovski
dadá
falt
gugliemi
chilblains
lownsdale
rainieri
claar
aceituno
stangroom
gritter
elarton
mallomys
robroyston
kelvim
ouside
erron
shernaz
nonrational
zaino
whyment
sedighi
almsot
mendik
goodnites
nandos
micromirror
croshere
fagatogo
heidrick
arunga
scarer
pikoli
bumrungrad
colourants
asna
radiotelescope
kidnaping
ewy
bishton
agustinus
revivified
cste
rooden
shamsa
llandaf
gagg
muthill
madiwala
dhgate
bercu
khoshbin
kones
mirzai
gaier
disaggregate
garcons
shavonte
partlet
krissi
chistopher
nsam
kaballah
partialy
mushrif
easler
forams
mcrobb
osmer
stuffiness
codeshares
demutualised
nosal
hamelech
pritpal
kippe
gemi
kentlands
orlow
howletts
mappable
indiscreetly
hillarious
infomania
exeo
colbrunn
gentilini
aigo
mouthwatering
metoffice
alinejad
brockweir
josefino
pellis
ringmasters
stilwater
yesco
soering
muntafiq
threescore
reboux
handgrenades
anatsui
taborsky
trehafod
xiaojiang
donaghcloney
harpooning
doodlebugs
tomintoul
buergenthal
mamabolo
pressoir
ebata
rimonim
einav
claimes
polartec
paauwe
bartletti
nettops
thumbscrews
adario
otedola
hvalur
enchantingly
montereale
sanc
groundfloor
jezzard
peltzman
duvoisin
bukuya
sexaholics
nicolaos
pmas
pacos
communigate
dsrs
swennen
kamyar
zhaozhong
travelall
kzf
envolved
futureproof
whimsies
jessell
waldren
restitutions
monji
friockheim
lecount
bacons
wbv
reignition
aleksandrina
selectiveness
kmiz
himat
uncoiling
carbofuran
quso
superweek
cavort
annino
ergometrine
volanakis
romsley
iwant
halifa
cilgwyn
muzenda
sacharow
unsay
patternmakers
richboro
kottkamp
eurazeo
dudinskaya
lightwell
laskos
ghisolfi
anixter
pirgs
hafa
liebscher
iraschko
shetlanders
ncts
sonnenblick
morenos
halamish
playphone
blandi
furer
hinnerk
kleinsmid
mérindol
kirchschlager
fainlight
protegée
gvaladze
southdowns
bransgore
sugarcoated
praa
miniaturisation
alfonsa
intralot
costless
enslen
constructora
niwari
desecrates
enwave
ganiyu
cadnant
qis
fernandopulle
indistinguishably
omilami
plonked
coalman
glenfarg
njoro
niedenthal
riggen
crouter
anthrozoology
ceniceros
uln
hevener
traigh
rogal
eslamian
beaus
buehrig
wygant
unboiled
lidgate
nurturer
gtcs
cranfills
kwast
bastie
sannoh
cunin
axlerod
reisel
jebara
bagaria
cointet
destierro
rimel
arefin
odyssean
zelotti
wijdenbosch
bught
ratchadamnoen
appeasers
kpsp
kwarta
accetta
pegatron
furball
jamme
squeek
pnes
boattail
climpson
banderillas
thumbstick
desecrations
oim
kemel
docusoap
rigth
firebricks
tulipwood
hairband
adeona
lorelle
tremorfa
astrom
kozub
dragovic
matco
dorschner
starlike
qalqilyah
tweeners
brohn
postdoctorate
microturbine
outmost
astarloa
trawangan
rummaged
kittleman
azizuddin
stingiest
naumi
gardai
suchart
pekarek
mmscfd
rocheford
mcswegan
mawenzi
jucu
morling
highjacking
barrer
shelterbox
cvx
gosberton
heliconias
parklawn
kewadin
regrew
soddu
mcletchie
tolz
coopetition
sbcc
wakeeney
trimpin
readercon
nafsa
nanoelectronic
rockcliff
killock
pocked
opoona
pixellation
penlan
outspending
comares
wellemeyer
spreyton
sedloski
ginsenosides
lemonier
antebi
bieito
esar
ruthanne
ligations
dordon
egpws
robleto
mdeq
robideau
ofisi
percona
putdowns
lidon
masilela
stollsteimer
vershbow
andreson
utilityman
villaine
kanehira
lorden
skweyiya
heilmeier
likoni
fonctionnaires
ordover
dahntay
placek
irwandi
honnor
surplices
zwiers
borribles
pangas
valldemossa
hamriyah
doetsch
metfield
lobotomised
chemoprevention
pmtct
ranunculoides
romito
glatman
snuggie
haertel
rasual
lieberknecht
ekert
sweedish
ogled
kopylova
tatsunori
phishers
tampax
petronijevic
mcelduff
engraves
umane
majette
sponsered
babyland
akpala
textless
yoghurts
kinyanjui
hayrick
vlts
eurest
sashay
dadc
tschida
tinkles
meloxicam
sportsdirect
vorinostat
kirbys
ihealth
mascardo
xkl
wiesberger
audiosurf
belgiorno
lilavois
penasco
borovac
yasukazu
themselve
nooooo
tsend
dariye
amebic
shimeji
kokkedal
perben
achelis
boxcutters
cluss
unigate
indur
tesl
solás
somalians
hvcc
gilfedder
adre
interart
soupir
haaften
weetzie
critized
carlotti
disturbers
tuiloma
wiszniewski
drogenbos
spellar
weigela
oluwale
vicke
whatnots
bmis
fetlocks
maij
addleshaw
raivio
aviara
petawatt
calamondin
shooing
ciste
untruthfulness
libicki
jointness
hayb
hannaway
mclauchlin
manhatten
counterweighted
kirrily
civilisational
chimichurri
mousawi
dogmersfield
yzaguirre
usbs
deloatch
nakhjavani
petrila
humulin
ousland
crossties
univesity
varnelis
slivka
acquia
manab
abdeh
terser
teape
karpasia
heelys
fishwives
selfs
amankwah
silcrete
waterpipe
footlong
mulero
maaran
waguespack
schwerdt
fernleaf
weishaar
telander
melmotte
yunas
arpc
flashblock
atsma
nyaga
fsta
ocv
rajbir
aldbrough
noscapine
kluever
aberdares
wapt
kindelan
dman
tappenden
tiputini
gitau
bertele
digimarc
rabii
oversensitivity
noilly
gandel
southpaws
proterra
corningware
birgeneau
apostolakis
chhibber
reediting
mokdad
stavronikita
iztaccihuatl
salit
schectman
tradesperson
slimehead
eastlund
hijiki
praxair
aasheim
fiskin
triboluminescence
upbraiding
alhassane
kansen
birkitt
nopales
wolz
sieburth
jazzing
vulvovaginal
psiphon
accruals
regather
georga
reoffer
sprüngli
slom
maksimovic
shelina
kibawe
goldby
uppark
popeater
krups
elizabet
dbj
esclusham
judiasm
roseborough
exterminates
conjoins
bootland
exhange
pengzhou
bywyd
pnueli
valinda
mormeck
lackenby
euroscience
rishawi
leest
evony
lindenwald
ithenticate
pouha
eclectics
chernovetsky
bieng
interpersonally
schellinger
protec
diad
oelze
arpu
cannibalise
gyaincain
roddis
kolkota
besthorpe
oggie
filmclub
voracity
liebrandt
wowbagger
iaca
brunstrom
hidary
larma
craker
garelochhead
lefavour
evro
lgbs
bramshaw
wesam
mekhala
screwballs
avsim
dixter
hashman
ridgwell
lajolo
himberg
kroke
gebelein
touchsmart
trebarwith
olarra
olafsen
atogwe
aerotoxic
wacked
dedeman
folkies
laterly
mycfo
ryskamp
vangsness
magaret
ners
seitoku
bluestripe
barratier
potjaman
sourcefire
hemicrania
baroch
lmgs
sharen
starfest
montay
staight
veneering
aircom
makiyivka
huichang
bulletstorm
diskeeper
flytraps
breezily
icls
arvizo
livelong
mitumba
lumsdon
groundsmen
bourguet
tyvon
designline
ciclavia
cupo
haluska
bráulio
greubel
congeries
solmssen
jianghua
gleno
disfavors
willl
turnell
kumsusan
eswar
entis
qype
kogge
bamboleo
mandli
jaberi
wildcoast
baduizm
papakostas
aeca
ujpest
lavel
famciclovir
insalubrious
wisocky
appal
juszkiewicz
xiaodi
cruzer
vadar
bidoun
yarou
caihou
woulds
insensate
frima
mayumba
preventatively
gossiper
untaken
rifqa
latinisms
kitemark
nahri
abercrave
brettenham
haizhou
partnoy
goeller
rudolpho
comtech
karbowski
tribemate
wickenby
bigest
perindopril
aruze
itandje
davidia
hintlesham
sende
daruka
sherba
qubaisi
fagone
eliminationist
seadrill
deliverability
dernis
braganca
guanica
salmani
kennys
barnburner
fyles
remeasured
souhaite
copayment
cowslips
omnilink
bernazard
magzine
ecla
creetown
sartini
wetterich
gernert
aqar
besche
banford
legalistically
llandwrog
claridges
wakaya
converstion
tirosh
rosthwaite
chattem
wufu
tregothnan
ljube
manawan
remarriages
valders
lifebelt
vivisected
tfiloh
tateh
veneza
cartload
wallworth
sbic
gansz
patriotica
dishon
colohan
neveldine
witout
toks
renau
ecocentric
pallidipennis
debka
steadies
qec
caméléon
naweed
xintian
hatheway
clevie
nalawade
regence
breashears
chada
aylott
gandal
toley
assington
cobr
pustule
zelensky
crookenden
amiriyah
favignana
smoothy
knappenberger
icklingham
alnilam
barassie
inseminating
sunup
windman
grundfest
rightnow
pascendi
overworks
euphorbias
niebaum
giebler
ahkami
pelletized
undersell
holstead
striano
hachemi
scammonden
demircan
jobbed
renhold
acculturate
chevre
montopoli
latitudinally
yaojin
flophouses
tubiana
dennet
fenceposts
kushev
paochinda
irps
spacs
fantasists
wakefields
arrak
grossenbacher
rehbinder
wnci
melnitz
mabul
hildon
wohlford
favretto
digicert
bryl
insecurely
contextualist
allpress
herrenknecht
cybernauts
seban
orwells
ermias
boming
jakk
bencheikh
matichuk
rabino
kribbella
creaser
contoversial
editorialise
prattsburgh
impaneled
langebaanweg
hapcheon
pgas
naziha
madeiros
blankinship
combers
montieth
spokesmodels
openminded
loscoe
routan
lorcaserin
khelil
sterchele
gaggers
ostrowsky
ajavon
gibsonton
onesteel
llandyrnog
kimel
swartzendruber
roas
tamasin
emulsify
harcus
solet
cuka
janhavi
refreeze
winik
zctu
ruzic
dizzyingly
inkoom
wskq
heage
apms
sapte
vanterpool
rebholz
yeghiayan
prestidigitation
faez
sersale
bynea
interveniens
bagnaia
wattanasin
helicos
fundu
kaeberlein
kessingland
alaei
mcging
elligible
galitsios
wazan
raucously
morago
trillionaire
brettle
ballintoy
bickersons
beffa
techinically
finalises
unitus
devree
mispricing
mengual
samory
mimara
aamulehti
innocenzi
gàidhealach
golby
nanofluidic
bidisha
pukes
vehanen
kavo
paradize
eera
kabura
snezana
hoaxsters
wellingtonians
ritblat
spacewalker
alerian
cracken
sheryll
cntf
hurdled
amreeka
heijne
kahaluu
przygoda
nasti
darba
muccioli
mtetwa
pohiva
backdropped
himmelblau
vivion
mundham
neukirchner
disapear
pelephone
lhg
archibong
buriton
pluriform
followeth
coux
partagas
semidocumentary
tanios
massob
breakover
microtel
indicies
liom
maxin
spanley
naïvety
summat
devalos
rokotov
ncma
dignissim
certes
xiaolongbao
preissing
mulryan
antrix
heedlessly
dinallo
nvision
tauqir
belltowers
croham
hackery
scharffen
tronson
shtreimel
berceanu
bhcs
hypersleep
wesner
lacta
howrey
ndic
dafi
molokini
kisaburo
massanet
zuiverloon
goldschmied
whitlingham
treut
shitte
reichmuth
alfonsin
eckelberry
claimers
chancers
sonntags
pogopalooza
warmsley
rosett
seeda
obhrai
bojaxhiu
benayahu
gsea
rsss
arzani
harkham
corixa
yayan
cutrona
neurocysticercosis
asmc
hörl
dukovany
sparham
levelers
choriomeningitis
ikes
derryl
bharwana
kitzbüheler
ppic
kaneez
puds
ballotta
jots
roëves
ananova
krzystof
soundbridge
dialoguing
caucaunibuca
nawe
proin
chalone
xianyi
porterie
bradburne
rahnavard
stevick
unevaluated
aceee
chelsia
microsofts
nondominant
kentro
shihuang
jansing
leadless
lonard
livemocha
pulchritude
gokarn
milvio
furutani
rememberer
edgeless
oteng
gleghorn
yxy
hrz
namastey
ramoche
mutahi
akhdam
pbtx
kagarlitsky
balestri
boosler
kupriyanov
topiaries
szechenyi
pampling
jerabek
everybodys
orata
ymi
mcniel
celedon
highfalutin
pallesen
shadbush
vvf
deradoorian
jink
fazly
mantei
addiewell
kurylo
esserman
monetisation
vectrix
cibarius
nothaus
triniti
lehew
skogland
elgan
ller
deflower
joho
chilwa
llangennith
froemke
powerwall
lundrigan
bumbles
stucture
petzner
sunshowers
hawo
woodwalton
renationalization
landaluce
coxing
perjuring
prequalified
fazackerley
washlet
opulus
aristomenis
leadwood
fichandler
troups
cufi
bigging
avda
rcom
supervoting
rohlinger
rudell
thorstensen
idealogy
chrimes
ramler
eidsvig
tawjihi
disconcert
lacaba
cwmtwrch
sumarni
globalists
nsmc
zanga
mardol
fondles
afact
lohaus
sensabaugh
redflex
fieldcrest
hez
squinty
hoj
melloy
sarmah
distortionary
clist
alvr
kushayb
seracs
cloisonne
hylander
haribhau
npsg
urgo
xiaowan
caxa
reroofed
diriye
cbas
schlucht
olomana
socioemotional
sobreira
turistica
horsehide
tulpan
homesense
pneumonias
higashikawa
schnatter
serajul
poolroom
moutai
aubyns
xinyue
horsea
polenzani
bhundu
neenan
saskin
scientifics
fiatlux
harray
talaa
amateurishly
fillipo
adonal
bilateralism
sonnanstine
mnemiopsis
londonistan
nirwana
cassara
basely
gonyea
shipshewana
methenamine
markeith
shkin
hodd
bolac
bewkes
cyfyngiadau
jirina
barrelling
ofda
annuitants
elveda
keahey
tihs
wenckheim
murrie
amytal
scaffolded
youssif
befuddlement
sportsquest
perlozzo
seuseu
maddix
cremyll
marriotts
ezzell
theunited
chairboys
painda
jerins
illiano
toiles
gulbahar
heatwole
qifang
pagli
yerkir
swaptions
enisa
wykehamist
wannes
peperoncino
diagnóstico
envato
denbury
duross
papandrea
kutuzovsky
jibreel
ziebell
kecksburg
orlandersmith
lamark
wulingyuan
gidleigh
luxoflux
gordman
callingham
chalkwell
similan
kapner
airboats
chillax
evisu
aedin
ecocert
coreys
gemologists
perjure
ostyn
decisioning
superfish
offhanded
idahor
widing
bunni
garreta
sylvina
heus
sharkwater
mcgilvary
boukar
balson
kokam
carstarphen
sitnik
wbli
dogfood
weinig
morici
armodafinil
jailbirds
unmined
ednet
gwk
emcs
polcyn
jazzer
knupp
acies
nddc
losper
feadship
barzilay
waternish
jieyu
carbonised
tomasic
girths
beauceron
moayad
khider
qustion
palfi
virginicum
faini
tonette
livestreamed
winnard
tattie
koryta
gevo
gyurta
bxs
noiseuse
ebags
ermelinda
edsels
bushier
maleta
cheikho
yashere
neuroregeneration
wedc
egerman
koenigswarter
bumbler
caffee
pigorini
lury
taoshi
consistenly
yonis
ejei
feklisov
sinegal
senwes
gapminder
nawaat
truvia
tournedos
reavie
tetepare
scissored
knicker
bergfors
alterian
viane
interweaved
tsilla
fruma
olchfa
tussled
vitalizing
dezhi
seekh
fyvush
subcommander
emal
thomass
libnan
hardbodies
falcarragh
conver
genitalis
christlike
nilufer
nemcova
epidexipteryx
weissbier
macduffie
rightback
drean
cnca
jettou
keskar
hoblitzell
androsch
cohran
hajiya
jaeschke
pishevar
wildbrain
foxtrots
unfathomably
baup
gregorini
hendrerit
afac
tailfeather
independants
skarsgard
tioram
gevorgian
flunkies
hullbridge
showstopping
treculia
aguer
unutilized
multimember
pratically
baltas
navacerrada
polpo
triquet
srcl
yoruban
circosta
barstable
ritzema
sembiring
ropinirole
eucomis
jagtiani
racecard
thimbleberry
picozzi
tehy
etretat
prizing
juico
ogah
fxpro
onrush
assests
hakkasan
schlapp
remunerate
spasmodically
heavyhanded
tellqvist
wadsleyite
starbury
brodziak
hametz
seethed
nutshells
frush
profootballtalk
seatrek
malzeard
morelock
shipbroking
moonlet
fairline
wladfa
djalal
sundrum
zern
tornell
kiuru
keema
labradoodles
aford
galardi
marabu
brezis
obiols
muirs
intimidations
abete
cirelli
rivm
skott
computerware
aysen
gezeichneten
cufflink
keycorp
rehobeth
nametags
issacharoff
madslien
laurenzi
vgas
rouler
otoro
yamashta
jibson
hults
awkard
weatherup
alic
rilee
sukup
ubilla
tinkerers
baraz
vermeers
sedins
maurward
issels
manbearpig
lathwell
trethomas
ofttimes
wistaria
excelle
novinger
nirc
bhuleshwar
lizama
powderhouse
reinfried
munsterman
usiminas
amrane
himelstein
radiocentre
manoug
oritse
sejal
hesmer
neroche
calmore
toyonaga
daneshmand
zyxel
biohazardous
batterer
hooo
fascher
tanter
strongin
snuffs
turnquest
ngeze
kersaudy
funkausstellung
chargepoint
homescreen
bigne
hurren
wajihuddin
sofo
redwick
duguet
azadliq
footes
sawbones
fangataufa
chalfen
prugh
fibrodysplasia
kozhevnikova
noeleen
gnad
soxx
uhlaender
angarola
mikhalyov
anbessa
soothsaying
hasenfus
weirich
valmeyer
ixy
xdc
calibrates
madarasz
muniyappa
fallo
nicklausse
toothaker
brelsford
waunarlwydd
gappy
sandale
sueing
kicc
edelin
glamping
pedowitz
laddering
veglio
buttington
addiss
dumain
dyw
hougland
iddison
kagy
unliked
kipsiro
schicker
forenza
bdx
yorkies
fusses
jakubauskas
lanette
karatu
realeased
stisted
trihalomethanes
raitis
yinghuo
wanderin
lapush
ladina
martinos
conjurors
chorwon
spiritedly
cowardliness
sabattini
weeraratna
kettell
elbryan
mundaneum
infectiousness
bhartia
pugil
hafele
miltefosine
lorenzoni
ubicom
fuch
plessix
sajed
effah
blessitt
azabal
addendums
benepe
blockwork
ptdc
coopt
sersen
hcso
tamogami
biddlecombe
anaesthesiologist
corticobasal
uskudar
nicker
shenice
gowned
careggi
fean
rockyou
amgueddfa
scoa
assualt
nearne
manses
omanthai
cowsheds
dillie
nuzzling
fsbo
hsj
edgeways
dilettantism
rahy
atomizing
schreibman
millholland
friesians
bovrisse
platnum
palmist
midafternoon
submittals
scandalize
lushoto
shwaas
meaders
liggan
frumin
cicalo
lymphadenectomy
malgoire
sosnowska
nazreen
resorbable
ozeri
qilu
cassop
flounces
hartsuff
mercede
lifeflight
didymo
tessenderlo
trochanteric
iepa
paidcontent
jerrelle
wallick
nowpublic
advergames
showcaves
quinquefasciatus
nidar
totters
binacional
lodsworth
dogsthorpe
darweesh
fougere
whetten
rickers
kadra
hatless
yoopers
pulsion
etampes
couloirs
philosophising
geneen
encipher
haggan
exisitng
dioxides
hedp
callixte
ifx
fallou
jazeerah
glenis
kamron
parhelion
bressanone
unbraced
snca
umile
quance
christabelle
eldra
chaffer
lachrymae
glocks
namers
ripatti
atomizers
schnieder
docketed
mastercraftsman
decertify
microbubble
akanbi
scheufele
ricer
canar
droguett
pannick
guayasamin
hesling
attenboroughi
milonas
sirnak
sabahat
acli
matossian
sigalit
hlatshwayo
dassen
bonafini
crute
whitekirk
galvalume
undervaluation
yokich
mitzel
peepul
megaraptor
pesonal
recrudescence
delling
cobblestoned
deilmann
domnérus
outsurance
gattelli
vollhardt
hypervigilance
seacoasts
mssb
mortifications
rcic
ralcorp
inciteful
gload
leinweber
miram
worple
jandarma
changeups
gratwicke
herden
sequestrations
portrack
unvaried
grossart
compromiser
dysynni
sukhvinder
hadrians
mansión
violens
euroatlantic
hassock
quicktake
tarita
judaicum
expain
plusses
fretts
zlb
taglia
jumadi
opificio
kobasew
licencia
scientic
burhoe
grucci
perked
brandford
bartelstein
anty
terho
pinecliffe
maneaters
gilhaney
bottke
wortel
fyf
dematerialize
subclauses
goglia
bbox
dafang
microcellular
rafie
followspot
anonimity
livingood
atkeson
knile
militating
leckrone
banaszek
shakespeareans
fumigants
strutz
farcet
ibagaza
unperceived
ursache
anounced
woeser
irrgang
sonograms
adlung
leventon
bridgemere
distributorships
awx
fuzzily
izetbegovic
hopers
adeo
oudolf
bendick
councell
caroler
massarotti
tabgha
yashvardhan
glengoyne
azizullah
devanadera
cutteslowe
subfreezing
cincinatti
shapeways
saemangeum
chhim
boskalis
lantus
pankhursts
suspicous
roxi
unrealizable
solidaria
donziger
unclimbable
gowins
referal
nossel
kurage
exculpation
donham
marmaton
stangmore
mogk
refurb
mitsamiouli
doodled
crossbenches
hirohide
haastrup
unpunctuated
bohnhoff
hollstein
ferrellgas
palermitan
derlis
vrdnik
floodtide
nosik
byrdgang
ailed
diffent
carpentieri
kowboy
perissinotto
bursars
mitreski
sobinsky
mullenix
enforcment
diebler
adrem
neuson
subramaniyam
abjuring
cripe
hajredin
deplorably
bocks
broucek
sturtz
janat
dinova
sanmu
slowpitch
hdssb
rabinovici
fiene
rhydyfelin
colao
krümmel
mansholt
epupa
lehideux
tjallingii
zelt
mourie
apalling
neighborhoodscout
oska
hurring
hunh
sudarat
hatkoff
ommissions
zagorec
chadwicks
replayable
hauf
barder
krithi
mahbubul
zajick
masticated
rakkas
pnz
walcote
rese
vohs
anwb
akri
schams
boerma
mancetter
mccan
juravinski
pieczonka
innocentive
hacerlo
blists
chalkie
ellenburg
malmsey
sportsblog
splintery
mehrangiz
coedpoeth
unstretched
styopa
tramped
highground
mathemagician
lapenna
melham
hitchner
ziebold
pinklon
nmac
saathoff
loura
gorfodi
kunashiri
imy
gasified
lagutenko
nekschot
sektioui
quadrennium
derrig
greier
tarakai
caussin
iowas
spoerry
derric
yuci
hogged
yawer
fotino
vaporub
islan
rember
sisoulith
bramdean
hoverboards
nabb
zabiullah
shahine
civia
ficek
trink
kirkcowan
steelberg
tabio
putrefying
jerkiness
pratesi
cocksfoot
polyacrylate
macci
branthwaite
youtubes
mievs
birdal
witzke
maltophilia
wayns
portgordon
abramyan
precociousness
meghji
buchlyvie
skrzypczak
pirom
satani
illyas
spanta
elaborateness
kazaure
simins
diffley
scrims
lakebeds
denevi
borodulin
whitebeams
mitchard
chairmain
remodelings
veraldi
lowlights
overcomplicate
canizales
oloffson
venmo
vyalitsyna
bigio
anthemion
wentao
harborplace
nhsc
recreationists
zaru
mcgreavy
benamou
loerrach
sinced
njuguna
linta
dgap
respecter
kargman
auriana
quaglino
bitched
prettify
leogrande
alagir
admittances
cloverhill
rayden
dyens
siaw
wagi
dishonors
vxx
dppa
khenin
vnl
shorefront
saachi
enslavers
tmos
tarrington
guntrip
gavels
holdman
nhcs
ceac
staios
buskila
cylab
haersma
fuerbringer
hannahan
fugaz
jerzey
complacently
manouchehri
fxt
playlogic
hofu
karahi
bellocco
discombobulated
extraterrestres
hamren
maeir
betc
katemodern
dialouge
almasi
nanogenerator
mcnarry
mintel
mcnelis
coundoul
oligodendroglioma
sssc
rxte
cogar
banquettes
emmanuelli
bursted
sheleg
yacov
karleen
anjuli
ikos
sarachan
icfa
mcmansions
mcchicken
pervan
craigsville
alvero
joellen
gwyllt
lainya
wcb
suleimenov
eastriggs
lamai
vanmechelen
ayoun
laksono
malinosky
corsellis
scharpf
easycruise
obreras
makhijani
delcarmen
lanipekun
otash
naifa
mecklenberg
stylz
lenderman
berezutskiy
candescent
urdangarin
farfour
attackmen
laiks
plopping
steffanie
kgil
guarico
kavuma
triacs
namoff
suckin
adbi
charness
raik
supernational
completer
unpicked
terabits
georgeas
coahuilensis
akhalgori
zeschuk
spanger
pcom
chantemerle
canoville
crizotinib
ujf
jidong
sendall
krasnoff
blaengarw
hexton
maute
amodu
gajardo
neuroimmunology
prendeville
sheeple
alsgaard
shanbaug
purkayastha
cablesystems
thoughful
shithouse
updyke
angliru
squirrely
newing
psdp
kesley
tumblety
hilgert
topco
aqualand
snaffles
seanie
buhs
senchenko
denorfia
rouff
grisez
dobinson
dystel
polyoma
kaseem
blabbing
cheesed
mengestu
pollert
comtemporary
brothas
cmta
ardgay
dzhabrailov
néerlandais
chorioamnionitis
afriad
hovig
throve
jedis
tsygankov
trouvés
bregenzerwald
umebayashi
ovoids
novarra
salier
inertialess
genung
andd
muravyev
slurve
azli
comare
sheward
sluiced
weirds
hautmont
boldrin
freiwald
herley
faisel
phillpott
kinrade
attrill
geeze
takefuji
cpjp
ilaiah
nalyvaichenko
upis
sibun
prisme
rasagiline
mcgunnigle
aberrantly
raqeeb
ngilu
daeschler
lessem
copuos
deigns
bouabdellah
twinax
wahiba
eleftheros
prav
circumciser
poxnora
urango
rustproofing
weili
altnaharra
hammersmark
perpective
mathlouthi
islamised
tomatina
sermitsiaq
canaport
middies
georgantas
jeronimus
epiphanic
verminous
nutbourne
dmards
endtimes
schadler
hummes
pourmand
madalina
prophesised
bicyclepa
westrom
alabamian
khagen
civiletti
resoluteness
beliaev
sheidlower
paey
protip
rugman
raisch
sawicka
phorid
llandygai
smoots
kajishima
subcomittee
mwaura
terracciano
haizlip
huista
premysl
babalawo
corporatized
shochu
oleanolic
candidness
benyo
girlies
sasken
sudharshan
beggers
cisma
seatless
itab
kanojia
martindell
cayan
woollam
rollberg
weisskirchen
barricada
transam
dyster
kostermans
vilardi
schwaab
hitchener
rukundo
aultbea
blastoma
hcdc
liyan
camrys
oozora
aeternitatis
withings
gruener
ebbett
pierogies
hesselbein
khardung
shopnbc
shandler
nnos
girobank
makutano
védrine
akhmadov
greeman
longomba
vitug
lalumière
bilsthorpe
kurucz
allscripts
noncooperative
birtukan
edwardo
mepal
groud
jaffas
stockford
vrettos
rigourously
mponda
riby
njogu
fatties
subsitute
reoccurred
labyrinthitis
erus
vilankulo
obreja
fondacaro
crosslin
volodina
longlife
koppang
hudghton
phosphokinase
donnée
butto
môquet
manucher
vuco
itcs
changyu
vrsa
degustation
plexicushion
couzin
albox
deloney
professionalise
soghoian
spaling
catellani
hanoune
ustar
jarjis
bravard
frolicsome
mashie
citywire
bertish
foreplanes
noela
charlone
manlike
brauchli
monoski
anaesthetized
amets
thrussell
jarquín
kivett
birdstone
masek
ripka
scroungers
allmost
loeys
pillock
prised
sintel
sundorne
suckerfish
xiaoyuan
zhiguo
baofeng
barich
fentiman
sivell
ludvigsson
lawyered
schoenstein
edlesborough
vermicomposting
latortue
lantinus
porphyries
guoguang
tcz
cuffing
ilri
nacionalni
kehres
dormanstown
blastema
epicentres
circumspectly
titler
sidu
inerting
boesman
sapia
oistins
avacha
microwavable
helictotrichon
meltzoff
tohopekaliga
izsak
arenzano
rollerbladers
treasa
himmelstein
geszti
bluesier
hedgers
taskers
tarrance
zelyony
vrtx
tasar
redmore
perfomance
danisco
adulterate
efj
desalting
veikkanen
hafeet
denationalized
rijs
yafei
defrancisco
neph
himrod
tailwinds
sulzmann
górriz
masterplans
truthy
heatmap
nöel
undan
colsa
andrographis
ticagrelor
ewea
giobbi
terrafugia
balentien
duara
soundtracked
chinked
lymphoedema
iccm
hazeldene
bohara
postmillennial
cherryholmes
oconus
izis
concierges
huncoat
foreseeably
myrsinites
asperatus
targowski
arbeitman
fued
afflerbach
keesey
ferrymen
spinsterhood
temeka
nondemocratic
furbearers
barrooms
benflis
janeites
bsfs
spadework
zoomers
replicon
froms
maicer
uruguyan
zepplin
telecommute
mosiuoa
protocells
footraces
naciri
presencer
winnik
arushi
franqui
benowitz
brigstock
krout
fleeter
almontaser
lifebelts
castellacci
komlos
dirlik
rotelli
perana
wiecek
attanayake
surugadai
perdis
tenderizer
fitrat
giansanti
baleka
lcgs
plender
clearcoat
mupirocin
akdt
loveleen
nobue
awel
floridia
awja
minkes
dubow
narraway
summerseat
sleekly
manganello
dresscode
jeyes
bilotta
tonganoxie
jalle
medders
uninvolving
narts
paolilla
chronicity
milbradt
rapini
quow
pivato
chilingirian
biid
semgroup
kokosing
mustaf
ingenico
kurtiz
orny
rosehaugh
iouri
almerares
prsp
ciclon
ccpl
jibing
ravich
derrin
wherehouse
korologos
cemlyn
pedalo
pilade
larrazolo
stolarz
wadah
fursan
briston
bluette
evaluable
wobbe
fesses
hongming
gahirmatha
dewall
kensworth
salamu
prebish
medbury
wtd
nurcan
krupke
griebnitzsee
mylyn
bilsby
huancheng
tajbakhsh
conguero
operatically
laerdal
deisseroth
haldol
gypsey
vigourously
bimi
spirent
imbed
washingborough
howwood
haqqi
wienermobile
kameel
entrevue
tatman
scelsa
huggan
kucan
mociño
attardi
mssrs
ozumba
namai
mcaffrey
borgas
timelessly
chuc
guizar
moturi
cobres
rigamarole
bankey
awendaw
crucifers
tabbies
minergie
stateliness
lingfeng
mesmerizes
binos
speechnow
cellblocks
wosner
narcocorridos
jonaitis
demerging
neusoft
bealeton
hearties
lauterbacher
correale
chappells
udhna
fitty
deleu
jagmeet
wentorf
lifestraw
sprunger
krepp
epec
eizenberg
marcoule
wve
navratras
abonnema
hackwork
furtwangler
kirkilas
cumbrous
asheru
tghe
dumanis
leaa
ancova
sultonov
hazelmere
aberglaslyn
hacham
bunnyranch
buycks
bhcc
hardscape
samore
otological
hiltermann
kaberuka
chanca
pendas
soapie
divorcé
waap
nisr
arsinée
pooya
zachs
pirker
verini
healthpartners
epis
tickenham
gfeller
fishless
keif
kaufhold
newsone
uninvestigated
blatner
oxjam
estrosi
tellos
britart
myca
hinckle
castrale
miscarrying
savuka
mangweni
odalis
bluesport
zhengrong
faraji
goatees
monikie
medhane
prestigiacomo
barbuto
videomaker
asesu
datamining
samast
sdsa
throssel
helideck
silvercrest
rijnmond
cissna
netmums
tameem
responsively
vansickle
myrtha
constanti
nuttery
phrs
eppink
darrent
machol
wackness
cientifica
steigenberger
ccbe
pentameters
rossignoli
trimborn
washboards
chrw
ravell
preslar
snai
halcion
gardeazábal
microexpressions
maxair
tristian
politest
regionwide
fiabci
hippotherapy
aubertine
nabilah
pober
tertzakian
cantered
stll
paperchase
burgundies
thaugsuban
haon
londel
agreda
kiyonori
hölldobler
bamigboye
galacto
jubinville
szczepanski
tradestation
echocardiograms
vadehra
mcfalls
atessa
luminant
kaming
saaransh
pandith
fennema
toothlike
fthe
stoianov
choosey
talysarn
bovvered
altherr
madekwe
malubay
konducta
djemaa
poppets
bordenave
renana
tmrc
jobes
xtend
nevoso
shurna
sses
sinigang
epidemiologically
mcclaughry
matusz
aiyaz
gynn
chappaz
nanosciences
wolsky
rosena
jabre
mechoso
turqoise
leanda
vanetta
bowings
tawab
mousepads
giordan
gieves
toradze
cityarts
opana
scai
majstorovic
zenzo
solters
lammie
sourcebits
vanderlaan
wallbanger
miskitos
sinkewitz
habaneros
coverciano
mouritsen
paleosuchus
frishman
warbled
malotki
zillionaire
glier
miuchi
entv
krystof
diagrid
batian
intitial
shtrum
jinpu
letterston
petrodiesel
mbewe
minicabs
amache
panthaki
traumatization
sciona
hazarded
hohagen
ctrip
adderstone
catsouras
kleagle
overdressed
delarosa
nrsro
markowska
wasey
scalfari
moonfire
monogamist
bussler
olivea
sarosa
haqqania
nethan
pitchy
domeniconi
blecha
asfi
amne
nicktropolis
maranto
advertsing
globalfest
unoxidized
elzer
tollie
sportsclub
fitzell
shinkolobwe
tarted
semillas
historics
krooked
curtsy
agroparistech
kröll
aneja
taneski
nway
guaranis
goona
vorus
kandice
patchiness
gossops
desists
mouffetard
genband
radostin
dranoff
bemowo
quicks
leavelle
trid
capman
pigheaded
synthesizable
jawid
jesurun
bekonscot
battsek
reanalysed
gieve
preassembled
yoking
afap
rossomando
benjafield
wielandt
pattingham
dicussing
pheloung
furbies
zilda
hitlerian
ikano
merchantcircle
peckman
gloomier
kochamma
commom
lavena
yalin
cinepolis
moorby
errey
heuch
unenjoyable
gonfreville
hachim
digerati
lassar
eleva
defenestrated
vauxhalls
beskidy
hoeveler
coworth
sambit
telkomsel
toshka
sompong
parde
tattoed
grovers
sahili
akiyo
denarau
marinca
silverburn
bandrowski
forword
phera
douet
yonks
falc
pobal
analytik
fouchécourt
sellstrom
kantaria
bxp
marqus
jenda
damazer
sorto
fantasising
syndications
ccoi
lcsw
rabinder
edgemar
fransico
pantridge
gwyr
maicosuel
fendrick
visine
tempero
shanghang
neons
ashuba
wiegersma
pinniger
sonicwall
samuelian
alderville
arbabi
shotty
bremelanotide
tadeus
foid
motzko
authorizer
millthorpe
montecasino
gedhun
afsheen
muqata
herrity
akbas
guisewite
trinitario
flimm
dogmatist
juliett
machsom
barsh
cdts
cabochons
sizov
beruti
negovan
stipanovich
glistens
golberg
splatt
bashfulness
mcilraith
paoloni
bradygames
klaff
zarella
tasering
hofbräu
reinvests
pommerenke
newteevee
shumack
soliders
pangestu
mumbengegwi
cheviots
davidsbündlertänze
riabouchinska
isaan
orongo
sooie
idependent
hellyar
lawbreaker
moatize
mennella
annamites
finnsson
nooka
verdu
nosei
swordtails
overtreatment
kaixi
marciel
clutchless
cherel
huwaider
vitagliano
currence
bedroll
odpm
weightwatchers
itanos
hannibalsson
ramday
lezayre
cotroneo
demattos
coghen
karmitz
hofreiter
penjamo
iiris
kriete
turetsky
superdry
filardi
transplantable
doshas
spitters
davanon
terakawa
sunwest
scoppetta
enth
iwin
bartelski
biomethane
yolonda
unallowable
holthus
tbis
ensco
metaio
catbells
zarrella
hasd
bloodlessly
brocail
kilotonnes
bettisfield
confortola
solecisms
goddin
mudiad
legeno
zolt
ersun
barrish
eurofly
baradei
boardshorts
virgle
tovi
tribolet
jbf
pushkov
hamide
euarchonta
turquino
hackleton
segraves
zoosk
iobit
bumb
darvis
weirded
tshirts
secularly
egregiousness
schale
vishnevskiy
shirty
teigh
klimaszewski
ayma
sweedler
samoun
paam
kondaiah
foltin
pantomimic
liche
alfirevic
cantarelli
nisour
rapuano
massabielle
petrocaribe
revello
submissively
neophobia
grinded
hren
yassky
shuold
carnally
francomb
gentlemens
madikwe
nasby
fluphenazine
temme
afcea
honeynet
groomsport
zertal
moutawakel
bourdette
spudgun
tofel
arguers
saporito
borino
zse
snoose
elati
antiphony
rozehnal
mosche
montsegur
cheasty
hairstreaks
strassler
hygienically
belenguer
kulayev
natiq
mcgoverns
nothstein
ralphy
djsi
kadija
hashmatullah
mindgames
zld
baathification
glutz
celexa
vanderberg
pgx
rashin
hewgill
aints
morken
efpia
blowoff
kayvon
hydrocolloids
martlew
pervomayskaya
morrogh
bonette
upala
keryn
oxidisation
tupamaro
denya
abdurehim
unpassable
franzke
kleinbaum
chamoiseau
ramazani
trepagnier
inswinging
déricourt
shkolnik
doesen
braziliense
quidco
draftfcb
qpp
gfb
prochlorperazine
mihalich
urzua
omtp
eigth
boneyards
asparaginase
ozment
loncar
demetrice
narisetti
ermou
paolillo
balladonia
roeck
superinjunction
sunesson
tucek
telegramme
harpagon
wendelboe
acuvue
kongou
biotechnologists
tvashtar
potolicchio
avineri
deriv
shinnery
fusae
lignocellulose
amgala
prominantly
bruseghin
defonseca
barbre
mikve
schroeck
pellon
rozenfeld
nebulized
wojnarowski
rapida
procurers
donkervoort
jaudon
whizzed
pcmh
maganti
pipc
terryl
morigi
keitha
secondments
eige
desenfans
mindbody
nineham
monoprix
pyapon
thejazz
lionza
mezzaluna
gever
guanzheng
credibilty
megalomaniacs
superfruit
tarasoff
suported
spennithorne
selvaratnam
captivation
catelli
smerdon
lubya
ddinbych
oplev
fenstermacher
kalluri
barach
ratiu
prayoga
dokoupil
compering
speakable
lesnevich
taffet
betimes
hensingham
usdan
chaupar
dongwon
tuataras
lno
ibish
rawlsian
lundegaard
longpigs
kakum
iuzzini
buoso
zmi
actelion
bips
chellberg
alphage
piloerection
approvable
xwe
vashist
dunley
ratliffe
kurzban
oryxes
qoe
nafeek
fiocruz
kientz
ccci
redhook
florescent
filarski
stinchfield
floggers
aapm
pollocks
kantis
crackerjacks
urquiola
jasey
figeľ
wathelet
eismann
shamsuddeen
loansharks
hypokalaemia
craner
nathen
triska
lpas
amge
herewini
alongkorn
fenyo
altangerel
restaveks
nimic
echávarri
nookat
yomps
spsa
kitesurf
antagonises
puckette
ujiri
compair
holthouse
pedrie
flagel
kickstarting
loutit
sivanandan
flitted
spintronic
unroasted
mukhtiar
unblinded
brenig
laventhol
downley
spufford
curre
innogy
telquel
harrowdown
evershot
majur
jongi
alpinestars
yajaira
rukiye
saturations
hounshell
woodston
sponheimer
jailors
rachet
lovefoxxx
engrafted
agap
korres
bombilla
ronacher
biner
mikla
makower
cofee
kluft
nesses
mantlepiece
farse
vanderheyden
eilene
jebi
huldai
karling
speedcubers
aandahl
scrivo
aproximately
wriggled
shads
betrand
microlending
swedens
bpx
matfen
ostracization
scrupulousness
borned
masoudi
entrekin
grinton
devecchio
marrinan
noordam
sprl
npsa
karaaslan
yanhai
bethersden
badiola
lamfalussy
siphonophore
andoversford
llanwnda
ferragudo
sadomasochist
kingslake
claypot
putzel
zampolli
balmford
indinavir
ilchev
wanging
landladies
smartwater
brugal
gowalla
buter
bargemen
hpakant
grasper
houweling
chemosphere
kumala
sophmore
yardville
ghardaia
metabolising
zivanovic
teleflora
ladda
caversfield
kazaks
hizumi
arfin
fracassa
jorc
lienhart
harpersville
gettings
batasi
dehghani
lochbaum
howtown
waywardness
yifter
ivh
vlv
wullschlager
recons
guanipa
veyrat
murehwa
mbai
zadokite
gellan
mashonda
dise
dethlefs
neller
papachristou
moralized
ahrends
ujs
navestock
fathali
rexes
grimus
trollies
avandia
lafontant
ngassa
tyonek
bolters
famau
inola
mediascape
kaback
hazey
mollett
presumptuously
dayjet
miled
profitless
jitin
myreside
semtech
sungevity
christia
tarren
brynley
domestos
pilsdon
kasliwal
johannsson
siamangs
thorougly
tabuaeran
omeros
rebreathing
madlen
rassmussen
georgelin
baudis
beinfest
gegechkori
tilleard
nonrepresentational
framfield
soshy
plaz
brulee
bernius
gebbia
grix
achoo
doubront
ebrary
endotoxemia
bowdlerization
tillingham
schudrich
anuzis
protetch
blotters
anoraks
bulding
yusufiyah
mallahan
papageorge
edderton
benettons
patikul
toumey
bosniac
explaing
winborne
kozlík
metallics
multiformat
mianchi
cdiscount
ottowa
sunman
hungered
kolodziejczyk
musaed
moosylvania
cwpt
synovus
okta
stavrakis
sumaria
commisar
captial
chastening
metrotv
ravand
araji
victorya
climping
llansanffraid
mesotherapy
shellshocked
gemmel
campaing
ndoors
truxillo
tayyeb
charpin
badreddin
ancier
oscc
barragem
annd
posterized
gaspin
dhoon
ermer
desynchronization
dansker
interocular
harkonen
falsey
zensho
flexcar
mlss
gadsen
dehydrator
toxo
chibás
lennig
troncon
yustman
tiete
blattman
aberkenfig
masoumi
dobsons
kirtlebridge
hulihee
tayyiba
blava
pricegrabber
colkirk
nonintervention
ponderously
kabine
goddio
lmos
goosby
sodded
arculus
worldatwork
shoukat
chivi
overexpansion
chemed
skeletally
marcona
arapey
burchart
teaford
muvico
mogford
caie
retweeting
upstages
leuprolide
marmorstein
carrall
rhosneigr
eastsound
cambian
polini
cedep
hadass
waltzed
honess
meslin
permana
galvanizes
undoubtable
procyanidins
clamper
jackknifed
damrell
boiseries
winterfield
stratou
exwick
lebwohl
orlinski
coleham
vaultier
beerenauslese
biqa
reguarly
paranthan
shortliffe
michaelle
emmentaler
senatorship
mathenge
ionise
sravan
woodenboat
jacory
lecrone
kuemper
sangwon
voluntarist
mollar
dingleberry
guetzloe
weisband
investissements
rockburne
sealine
prik
davitian
hettema
fettle
jallo
shatilov
keithville
napley
shirtsleeves
venoco
broks
gildan
mortonhall
elbourne
sodomize
anaesthetised
nayong
slotervaart
philion
manky
rivastigmine
yest
chinalco
triplexes
quecreek
michy
cartner
bandoleers
earthcam
bpas
studzinski
specialy
specint
faucibus
asgharzadeh
dematerialisation
moccas
bowo
weirauch
ziko
manged
hynds
delehanty
waldis
tiresomely
motorworks
loterie
cloudmark
lupolianski
hatsue
lopakhin
marylyn
jalaleddin
embosser
usbi
pekarsky
alstone
ryandan
rocholl
eucheuma
waaaaaay
sawma
vehicula
martikainen
crackled
abson
iaculis
turgoose
neubecker
kelburne
loone
flueger
treeby
ddec
gressenhall
degreaser
klavdia
lifschutz
robertet
herft
pacia
doppelt
vasilakos
croaked
hausfrau
vandervort
sangini
harlene
cottom
hollyman
debose
rollman
vencor
minchella
varrio
engy
touzani
hadari
vego
brona
sniffle
dominczyk
tembec
smites
jazztel
fellating
gramajo
aleluya
wedbush
itumeleng
inghams
odiferous
ghurka
ashfaque
garaad
kleypas
litcham
corsewall
orfanato
keneth
reseated
coladas
ammirati
minzhu
snci
berkmann
agust
soep
secom
mahantongo
hydroponically
chalong
witcover
regathered
weissinger
earthlife
fplc
pelczar
grotesquery
fttn
forgiveable
bhogal
greenguard
cloudberries
decemeber
laghdaf
yasuyoshi
gröben
fogiel
kades
crawfordsburn
fractionalization
chateauvieux
dorcan
kalsa
threadgold
campmates
ranan
flachau
ispir
priscah
stymies
cavi
netcast
guarachi
eao
musalla
augured
xianyou
hartsook
comica
stuntin
joustra
flippered
lurchers
zouheir
tysen
appletv
yscc
marlie
scarpino
pupusa
snorter
propogated
bastardisation
skypephone
femia
nebbish
chigwedere
pelote
gembicki
achray
spume
idrizaj
karabell
unidata
frania
gutkin
sties
grantors
hungering
ibda
sanyuanli
barrydale
sundahl
khashm
trabi
diagana
geeneus
georgiadou
messege
jamaludin
battening
tabouk
podor
legbourne
goodhope
fragomeni
droubi
inel
kamancha
maried
cuch
beween
bezabeh
harjeet
ogwell
madjeski
scafidi
undertray
jamdat
knl
tayaran
lezlie
alarums
squishes
terrys
squirmy
lieff
tepedino
coagulates
multicrystalline
meatyard
drysuits
khokar
achnasheen
corazzin
bareikis
noeth
thalis
kamoun
wessling
denaples
bibai
forbo
peipsi
eberwein
blusher
cotting
moptop
cynllun
decoste
pidc
extramusical
rbcc
zre
pwrr
spudis
ofcs
nitsa
ghettoisation
zarrillo
kikuyus
comisiones
nothwithstanding
sardiñas
indentification
chrysogenum
delacey
likhi
payano
tornay
suhel
sedgeley
bassington
ductless
kingseat
mesaoria
biproduct
amygdalae
sievwright
activies
rovetta
grandpre
particulalry
bedolla
noveda
farberman
caochangdi
lhakang
debie
siula
parmo
maltreating
masie
sanjust
fluhrer
dcha
immateriality
triquint
ufcu
critchett
solvit
boiardi
gfy
jesusita
zilia
handscrolls
hasia
despoliation
mowl
heldmann
liftin
artemida
postsoviet
thanki
wittiness
wangle
carolynn
gabble
memmel
willauer
nelp
khullar
ethosuximide
clogau
felstiner
keffi
aksarben
mizuna
gyroscopically
penenden
jerame
kelon
thurlaston
chisholms
especailly
mafe
besik
fillery
plée
tenderest
ingabire
thereabout
microsleeps
algeo
micromanaged
panepinto
uwchradd
rubbernecking
shafeek
ampfield
sippenhaft
bloodcurdling
amarg
buccino
putzier
itopride
samhadana
lymphopenia
matenopoulos
immunochemical
schoell
diamantopoulou
somehwat
kaloogian
mashhadani
samnang
overtrick
stasko
wadey
zehavi
sulemani
perusse
adiv
irranca
milborough
airblade
scattergories
modec
ludeke
wiland
ʼ
mawari
totilas
roundelay
ilit
samco
leyson
sreepur
aslet
chrb
pillories
ateker
committ
vosganian
llanboidy
vassel
marasciullo
warora
hinterstoisser
altero
petland
baralaba
oneamerica
kishna
cosmica
espied
spohrer
spluttering
troposcatter
repplier
agota
goswick
outis
zannino
holvey
poskitt
doleac
connock
tropeano
tupaz
itele
temelín
watermead
kfgo
unman
scrabulous
leol
carolae
wakiya
hadyn
tonetto
microloan
artcurial
redating
aggieland
lisek
quantitive
candolim
nayif
reboard
lambaste
wheelton
cucinelli
reconquers
tomochika
hospitalisations
mordad
kronish
outcastes
seht
atuna
medshare
sadza
leyne
marktl
liveing
briann
chiambretti
lucianne
osinski
helpern
udvikling
busines
niran
enom
szumowski
kounta
lobue
hypochondriacal
cizikas
browny
hiraan
unbelieveable
khulan
tenke
anyiam
draculas
sureño
wgz
federowicz
prasugrel
licalsi
glebelands
coulis
omnivision
tascón
stachybotrys
polyvore
wdcw
mauskopf
sarko
universalisation
hawthornes
mindworks
heph
lubega
rightmove
pullins
likins
decamping
vandyck
continueing
robinswood
sesquipedalian
menders
minvielle
caunton
nonjudicial
zelenin
freudberg
werdegar
tinius
hilmy
sexx
mapps
efimova
narusawa
lewen
xvm
danceny
ludemann
adré
fudges
continuer
mobilephone
purkis
corrosiveness
davidovici
divvied
huffine
balmedie
trócaire
moshassuck
technophobic
tarallo
untamable
ivin
barski
valdivielso
givings
strathmann
scarcest
barwari
eyeholes
focsani
gestoso
rossmere
wiederseh
rahum
saneh
piersol
miglioranzi
buttonwoods
parrys
habano
massaad
modrzejewski
freman
myotragus
mesage
interoute
kountz
cordina
judenrein
kalemba
raether
eiff
devid
caissie
zumo
csba
schoool
drumgelloch
newlife
legistorm
atomfilms
wakelyn
galyon
hammerin
saeijs
thornell
rifaximin
sextuplet
figi
genra
pitztal
khosh
scandar
vereinsbank
commix
teriparatide
artsfest
cavlan
jacquart
servicewoman
baiden
kalkin
frescas
chinyama
hecm
raqi
sanderlin
taulapapa
superjumbo
aliy
broughan
minley
photofinishing
hollybrook
zakian
munched
buildering
lebewohl
luebo
stearnes
denat
shterev
outpoint
wlae
yorp
highwoods
davola
posdnuos
groaner
easc
coffeepot
nyirenda
korder
fiberglas
evason
idahoan
biobio
gesticulations
helvin
goldenbridge
dmytruk
zammar
mesur
maquila
condoleeza
jacalyn
tounsi
tactlessness
pantsuit
hydrogeologist
witchweed
maintanance
cjv
chrissake
wanko
ayvazian
thandar
andeans
carringer
bivouacking
sahali
oios
ellsinore
pegswood
nivet
cabreira
klagsbrun
purposefulness
tartaro
burb
piecuch
guererro
teny
helsel
flugtag
reaon
gauder
emedia
labonge
milnathort
harnischfeger
diangelo
paisner
meatpackers
webbys
frelighsburg
poquette
mobtown
bagle
parcelling
corrugator
chode
lymbyc
nordion
knai
colognes
chason
bouwerie
skyeurope
loebe
bairbre
serialise
dripper
daffyd
porri
zhenwei
obviosuly
qac
washlets
foresterhill
distain
kumpel
barky
norimichi
registerd
abess
probabaly
piou
nihe
kerstens
crockenhill
offramps
rubberstamp
wcrf
ghurabaa
infirmed
ibraham
smolens
bruria
achba
shegog
filippetti
sandrin
huszar
humblebums
rwu
roanna
lisheng
minnawi
mercerville
beahm
rowenna
brainteasers
mizel
ebeam
colwill
slopped
yurko
sgcc
cavel
mirach
urooj
muttawakil
nasreddine
towbin
cmpi
blitstein
gurevitch
prevalance
arshack
underutilization
scibetta
podkamennaya
genuinly
santelices
zalben
rouhi
slaughterers
roskin
doyers
wwon
goldendoodle
knuffle
steggert
raymondo
shitov
reguly
mahallah
morphoses
boisterously
chicharrones
gikas
dragusha
webwise
switchoff
extravert
warschawski
schtroumpfs
economising
teedra
orrison
grotesquerie
frankee
sachkhand
shirvington
intellectualized
uige
harchibald
changhai
pluggers
lootens
schmatz
ayubi
rasho
npra
wessin
papcastle
seiff
silverbulletday
athaiya
primar
ferreted
pitmasters
hadri
lahoma
nzpa
marer
vegnews
drayage
streetlamp
tolin
radioplayer
landlessness
udalls
sooy
hugest
neonatologist
protégées
chunlan
petitt
kupiec
reynoldson
belloli
onesource
carlyss
cichowski
ctcc
ardebili
htg
mendle
meetze
greentop
tahmasb
minati
kerstetter
hutchin
yaowarat
jagemann
ramak
dokubo
freedon
sussmann
johnnetta
filoviruses
flextime
blankson
zollitsch
dewanna
cupriavidus
broms
nevatim
dijksterhuis
lubtchansky
coumba
ladurée
basang
heydey
spitballs
temazcal
wooh
gorur
tebas
jablon
abdurakhmanov
nadol
stiefvater
cystectomy
lionize
eleifend
argyles
nyiro
esbjornson
iraqiyya
cottontop
kasaev
scelerisque
landstown
aimson
yongnian
reupholstered
darg
vory
crudities
unpreserved
waie
limed
darbellay
lydbury
sageworks
catcheside
abbc
malonzo
capalbo
machaerus
withies
zeti
monbijou
critcher
aldorino
merrills
talari
icstis
bulygin
leibovitch
hissey
victimizes
prif
overcorrection
msamati
botteghe
birria
gigondas
galmpton
boyter
waqif
gorska
chilingarov
maltitol
schiappa
ruppersberg
ferruzzi
fazila
hornlike
mehaffy
kashti
ludworth
sibony
informercials
gissler
glutens
toothbrushing
kaligis
mootha
dmat
gueorgui
kummetz
yohana
jiamin
khayrat
allford
mouris
droping
unmarred
gogue
comensoli
gibberellic
juddering
comradely
abiertas
berluti
daskal
elyn
magniflex
sulfenic
setzuan
willersley
coyness
kharwar
orza
baingan
speedworks
schoep
doorcases
sølve
kearn
casie
beaujour
viqueira
qingli
pettinari
corré
hunminjeongeum
longhairs
renourishment
dadey
spirituall
mbunga
yampah
jendrick
osana
bogoro
boroson
dogood
keiland
carayon
pdip
cyclers
sheskin
mynetwork
crustless
gleitsman
maily
kalkstein
milbridge
millboro
startac
krutoy
aamco
bagli
minichmayr
tubercolosis
megaport
taricco
djukanovic
reinelt
jiggled
kaupp
woodroofe
sidewards
tonkinson
disapointing
ftld
comorians
amapa
hackable
crystallising
cleeland
tepalcatepec
golasa
paulick
carefusion
microencapsulation
galáctico
carnesale
lanjigarh
hris
distaso
vilan
shiah
batcha
thermax
schnecken
lecht
geneste
karaki
overspray
gussenhoven
lamparello
wigstock
merchan
tarkhnishvili
gestifute
pipefitters
firecrown
adiele
rajus
pfeffel
aÿ
thorlabs
kogalymavia
atna
bulleit
kagayama
ostrovskiy
zakhilwal
higgenbotham
nerud
jakson
dunholme
denisot
blickenstaff
usofa
annese
obliterative
macroeconomists
valloire
carrigans
anyinsah
dunkels
archly
bishko
wolftrap
fehd
moulaye
incommunicable
schink
zydus
paloverde
nuaman
amygdaloides
dfki
marava
babitsky
monetti
parami
delafose
roling
kolat
beula
berrey
safawi
lioret
underbellies
fokienia
plectrums
wackies
economized
meyr
rupi
katzav
hempfest
plutôt
arnull
promark
fiszman
oleaginous
digitimes
corelogic
celda
whaleman
claimable
petroli
diment
hexic
lulis
messiest
efaw
bucknam
fryklund
cheim
yongyuth
scarey
karnam
pohlen
orben
leiberman
questionaire
glassing
brachmann
zhongdian
linkebeek
gaggia
levring
fantasises
hydroxycut
stuttard
peruggia
wagley
eroglu
chomski
woodhoopoes
duffels
mhondoro
timpany
oando
chernyshova
srps
vulvodynia
milewicz
mischievousness
twinkled
rbge
simoneaux
tagaris
aeropostale
wadeson
agca
sloshed
ragers
reportings
abertridwr
adjmi
atempo
ditcheat
devorski
rosenthaler
hoogerwerf
kxxv
colva
farbstein
familiy
indentity
lausitzring
malamutes
packouz
telestrator
samter
aiag
nothung
againist
eligibles
modernisers
glistrup
matzerath
bébés
jaksche
whitsbury
anybodies
plazes
escalatory
tabakova
orzo
adumbrated
gusciora
oximeters
rayton
kippie
disharmonious
combatively
culyer
nsdp
mcbriar
aplon
zenobi
lexapro
nurun
emanuels
bombifrons
brossy
lamere
theyve
kioko
rasgotra
lusser
wissington
ostanek
jadson
depoliticized
steampacket
goldtrail
lucubrations
companywide
cpfc
sugarcube
touchups
alstrom
stankevitch
djimi
tcxo
currrent
nimb
rebe
echoey
khemir
stockers
rww
introspectively
frett
isoa
vasti
sidebotham
peluce
legitimises
buddin
kicklighter
confrères
ilegales
cecp
anchoveta
hovorka
tamakoshi
firin
haniff
kronenberger
sheps
yahn
slinn
callado
ceballo
aminullah
travalena
rusticity
admob
belligerant
rfea
veenhoven
mehrerau
hesters
rengel
refregier
geddington
freeagent
urby
knuts
knols
youthaids
espndeportes
lawrenny
topscored
zubillaga
koryn
ycombinator
abshagen
drzyzga
rhincodon
gunslinging
hoomanawanui
bajema
nimetz
strickling
stigmatise
opendata
owh
kommetjie
perello
scharfenberger
prolificacy
misspoken
naturaly
sivagurunathan
sedita
rainclouds
yahuda
usjfcom
drozdowski
servitto
canapés
saldarriaga
rifabutin
lundestad
senillosa
daytonas
cryospheric
chokeberry
douek
populuxe
tialata
odabasi
cattus
keynoter
postulations
keasler
stoltidis
marzilli
schlecker
romiti
ennahdha
exhumes
multiengine
integrati
idiocies
sportwagen
murrumbateman
haffey
wabbits
abandonned
hyperpower
enbrel
perfluorooctanoic
vielma
ragpicker
roob
calia
rodriques
warmness
intercarrier
hochreiter
winterling
miraculin
dbcp
naughties
wondermints
shobhaa
trundled
chavenage
cellardyke
bulería
avramovic
breathalyzers
matebeleland
medomak
messent
stanesby
sovereignity
fdcc
tangeman
reproachful
breezeblock
wwan
leverich
shuvee
semerci
cordone
paliperidone
probabilty
kanzius
etemaad
fathomed
wolszczan
acronymic
swooned
futons
shimaoka
morgenson
footpad
shovelware
dannell
babytalk
albig
iwar
upclose
turland
steinkellner
essiac
maluleke
lumpa
mudflaps
tiatia
mozingo
lakhvi
mortell
montlouis
overseal
shuying
conductorless
learndirect
sempo
tamy
ibell
televsion
fairton
upromise
nhbc
lobianco
tcby
shekari
celeritas
dmps
crossborder
twop
lustfully
ptychodus
namasivayam
roomster
mahjabeen
grandness
quett
hyboria
daqian
ciliau
smick
pochinki
wacka
ridc
wugang
hassman
numalink
lipner
ashprington
tachilek
dehan
boopsie
snifter
verrey
kanouse
mordkin
sollars
arbitation
youngquest
analisa
emarketing
bokov
wacholder
rutberg
minnies
aito
lumension
cozies
proselytizer
langrick
qubba
calina
embroiling
lifelight
taraporevala
seaching
naeim
blaenplwyf
pelindaba
tomatillos
ambari
suttree
trailering
satinover
rudds
renucci
bagwan
forestiere
peik
shahara
bicom
nsaliwa
vanowen
deliang
chunying
intralesional
sequi
eyeshot
olar
tronox
grotty
jibla
ballhandler
auberges
unweathered
colome
newnet
nazakat
reappraise
vashadze
braida
baldisseri
wijemanne
kodnani
insolently
notasulga
fortuno
maalot
fursa
bronllys
barcott
longmead
ouvry
neccessity
obscurantists
xuejun
blachly
hochschorner
mazlin
hoaglin
yuzana
aspex
biobrick
lasana
oculoplastic
kreditanstalt
irungu
syndey
buyse
resolvin
brajkovic
flowback
crematoriums
jumpseat
hassanpour
barnert
paetkau
petroskey
kulcsar
tomada
operationalised
casulties
kosto
nonperishable
zoldan
clemetson
sovann
owuor
karamani
mabandla
unproductively
frilford
bazire
holdeman
krisnan
baduel
radwa
tsokkos
massala
rumy
empirica
chente
farmgate
pardeep
canabalt
nyepi
romal
terria
awin
unpinned
derrygonnelly
khaima
kavoshgar
nghiem
frieth
suhair
arrue
brantner
promaxbda
eichbaum
opensky
tiscornia
haenyeo
balmaha
tchividjian
batko
mamlok
autotuned
hillfoot
blunderer
tinies
sanmina
﻿
leucovorin
cwmtawe
swy
fluhr
dudson
pellini
gbao
latterell
fusty
madie
bife
landmen
cxm
grindstaff
botesdale
galère
lisney
goodfield
jurkowski
vesce
masone
chidren
respa
belorukov
aggett
iding
depfa
waldingfield
lovingkindness
slanty
kembra
hoganson
aradhna
meryton
billmeyer
peculation
examing
psychosomatics
mayah
hodari
suckles
curborough
rdk
kopas
tokuoka
bakuriani
tassara
fsmb
casarez
reflectiveness
unrelatedly
plemmons
schlotterbeck
equilibriums
hayduk
solidarities
bodfari
sumaira
sunbeds
campustours
capd
lendal
palpating
wassall
afiuni
edusei
nutkin
jubeh
oriakhi
eastex
schlich
bulgheroni
scothern
fosgate
danjiangkou
ziebarth
wotte
nextbus
vertebroplasty
wethington
rivelli
muckamore
metatags
smadja
darvishan
fishcakes
ruperti
lebedyansky
rier
eyebar
eagleview
contary
micronas
huppe
senk
msbp
awadallah
roil
echosounders
lables
supportively
totin
subramanium
niw
innoshima
daston
theworld
biechele
gyngor
umezaki
palliate
bigbie
rotoworld
maurren
medress
mischaracterised
heilind
franceso
wheatfields
gojan
jilting
snader
dillehay
abstainer
coval
persell
ahtila
timimoun
pilip
hafetz
ookla
cercel
zannini
veikune
mousquetaire
zeppos
ossai
tolla
ogborn
musyoki
trudering
megapascals
vfn
limper
baetz
vickerstown
kobylt
duncansville
spatafora
vernetta
daraz
countermelodies
kinships
yerima
snuggly
reedie
yuam
noctambules
barschak
longliners
gafar
pooneryn
sgy
embarrasment
dobin
birnes
crockatt
arwyn
carniglia
grundberg
berdimuhamedov
frivolousness
investimenti
parahaemolyticus
hilltribe
badhwar
palavi
gosman
torren
steijn
tabron
kelvindale
shininess
allnut
halfs
ntamack
tnsm
incarcerates
brostrom
pivonka
hederman
uwt
mansurian
apcor
ideson
bisciglia
suboxone
riluzole
laxford
lokubandara
gerresheimer
lokum
wichniarek
vergo
hawing
ferragosto
dillema
commiserating
jaydon
smitham
cildo
mistruths
zulfahmi
kieber
croziers
takino
kittleson
vatcher
wickware
etone
bystry
stummer
duckpond
houman
njtc
phlebotomist
clobetasol
federley
stocktaking
bazaleti
verdehr
glenarden
morvich
biyi
flareups
csam
postrel
jailson
yasheng
peccadilloes
nonofficial
deltour
damjanovic
gomarsall
myntti
tpao
saret
nahles
viamichelin
kux
kojis
seedley
hoess
mroczek
ifh
lucaya
bauzon
joskow
bresonik
tonalpohualli
unobjective
valeric
brotons
pentrebach
mercexchange
subu
invigilators
pernetti
telfort
govone
dshea
chozas
cido
vasiljevic
viégas
bantom
rebroff
jochanaan
strategems
vecher
halvarsson
merkins
oggins
mullers
bretforton
letouzey
boerewors
craemer
vilca
sydell
uqm
aguad
lodestars
ethopian
shanab
metzelaars
rapisardi
flaig
fahem
fharraige
sloggett
totsy
deflowering
trailwalker
anklesaria
oppal
fufilled
godsiff
solvik
redistributor
abuelazam
harperley
vernham
hameur
feldeine
isaacman
negen
aldonin
rongrong
wandong
cluley
eedar
playaway
wlans
hilb
maziarz
erck
sociobiologists
sippola
dynavox
csbc
martham
kamall
unthreatened
fernholz
skwentna
skyjacked
luminita
ekel
blackaby
poulou
essure
chainless
disinflation
lyxor
penalises
quadrilatero
couser
overdriving
toxicologic
badgingarra
rutzen
harped
arcara
exposer
bleymaier
tolas
brissette
decroce
senioritis
arteriosclerotic
betzy
schweber
fishable
moralization
bogof
grauwe
sahag
sspc
remmers
tepoztlan
tauck
bangham
miessner
laddy
cartin
chunda
rereads
resig
beneduce
shweder
lazkao
ardvreck
wellstar
shahran
fese
bonsanti
impotant
pareo
cabindan
duckies
screenprint
synsepalum
fhn
coolhaus
washbasins
recher
darco
wiseacre
teletherapy
onechicago
avenido
giabiconi
cellared
shude
possitive
microtec
othmani
jialin
albertha
waak
countercharges
batterers
dutka
nordwand
mayeul
pasztor
weinglass
psihoyos
posuere
cahuita
languidly
prescreen
rhas
signator
gensets
kaysing
drumcliffe
venturella
bonani
jansrud
binita
jayawardhana
sipek
supportability
sulligent
klunky
gatecrashed
ibobi
langness
vman
bitencourt
jahromi
hoepfner
cheapflights
acfas
idabc
marchello
voivodina
patrich
ffrwd
mahall
ririko
disempower
grunted
bibliophilic
mndot
jenvey
whizzes
persued
pbso
gypsii
zalmoxes
fairwinds
mindbreeze
sadaqat
airwatch
disaffiliating
houdyshell
bafflegab
temsirolimus
ululation
totie
poohsticks
ngultrum
mushes
ranz
loelia
mollycoddle
midniters
krispie
brakha
varsalona
mangé
horseflesh
splendido
itzhaki
mangyans
filigreed
sampsonia
hintjens
lenzo
groupm
cranney
enova
ulukalala
arfan
kasarova
acquisitiveness
yetts
birdemic
vmpfc
seattleites
buffaloe
orfali
tibnin
nujaifi
zepa
bercht
parija
clodio
kedourie
japaneses
michito
cristoph
laurentina
bitterlemons
escuelita
porti
jixiang
nearline
gyrocopters
cheminant
barshop
bloviating
hoosen
scss
seakeepers
homelike
felsher
mostofi
defragging
ciganda
hougan
hobnailed
sealfon
albertin
loosli
ozbek
kibel
smar
woelk
canaccord
deregister
lapietra
wodka
bentalls
ciliax
runscorers
ibisworld
tocache
bambou
balthazard
mileageplus
wyckham
mulheren
herberto
sidewinding
bankson
raynold
quinonez
añoveros
atherogenic
hynkel
ginoli
thorborg
stridulating
presskit
longonot
hephner
netwrix
fontmell
lochos
bayen
odriozola
chinedum
sipprell
badmouthed
jacobe
koniag
arkholme
youngwood
nersesian
rebadge
outfields
chocoholic
latady
lodal
ampler
cymro
whitakers
cumpsty
paperbarks
dabbawala
shabqadar
huiming
sciencey
sartono
kinyua
secureworks
tangradi
charlesbank
hvt
scotmid
kivell
patisseries
retasked
ulba
stoutland
eids
minxia
eckehard
kameli
emag
urraco
lancias
delsing
cirincione
bakhoum
yeji
erran
professionel
glasthule
jabbering
trello
casano
aggrandized
benua
idsia
shedload
geys
collaboratives
tiarella
giftcards
cringleford
klatz
kalwar
aanestad
rocksalt
sowah
durcal
niketa
charoset
reatta
airband
follistatin
codgers
kulchytsky
okladnikov
odonkor
trecastle
chelem
establecimiento
massachussetts
radware
ibin
oide
snideness
beaufront
lovestory
revaluations
democratised
renovaveis
feistiness
thriplow
brobeck
collotype
matheis
dierick
patsatzoglou
vigneau
bieger
timet
odrick
belldegrun
beeghly
descried
zalka
castlehead
babbin
daisetta
wilkenson
meijo
rlpo
clios
reproducibly
disfunction
littlemoor
pumlumon
bloodvessel
perveen
cantalamessa
krivtsov
rosenberry
hlavsa
villarruel
kallam
sugár
tiltyard
asmundson
foolhardiness
lonnell
gandan
pickavance
thornberg
faiumu
zierlein
luccin
denlinger
languirand
gurvitch
farcically
exfoliate
luxuriantly
tremough
rimberg
sosanya
tippling
ladendorf
gittinger
jackanapes
chemex
notz
baluk
haiming
conciously
deregulations
wirajuda
tarran
aschner
chemmy
carpinella
gonner
subasic
mppa
retherford
outler
dittisham
dysc
braies
goofballs
fortensky
postering
lercara
aeeu
chorzow
noelani
immodestly
wittenberger
stecklein
hudema
sabti
bjornstad
haylor
refight
sulham
geomag
unnessary
lutzen
gigapan
thast
solntse
mobily
mikhalev
hentges
weisbrodt
orjan
quedagh
nonissue
shirtmaker
systym
salelologa
haltzman
noxzema
grotton
efimenko
aerobie
derridean
fcci
biscet
dobrygin
moshammer
tivey
nereida
clearchannel
onyett
kujovic
tastevin
jassar
esterman
yapper
embroiders
rondor
klotzbach
primettes
guling
kendler
gulleys
thommo
mohmed
keiss
helpston
sudep
sokos
fewtrell
cornwood
lévai
youssoupha
mimram
cance
innaurato
nightlights
intertubes
heatherette
chupack
vukadinovic
polsby
accentless
grafman
pusateri
intellectualization
redmarley
dziena
trollop
cses
exhibitionistic
paolantonio
reemphasize
hardrick
whineray
saisies
curiale
bawdiness
trostre
romalis
mondamin
housni
iraizoz
worldnow
spelke
dorismond
tempelman
unreviewable
dirties
amvs
pére
mateas
inuksuit
brookhiser
joyrides
tribunali
töben
berdine
paraben
hyderi
tuju
heythuysen
ikeme
roscos
heartmath
mccafé
cabbell
trenbolone
ihas
nonspecialists
oceanair
harlescott
gunnin
ondrejka
ngbs
omlt
headstand
shippon
phola
karabelas
dasovic
traumatizes
tarsha
paktel
manchurians
pggm
firewise
looke
staffin
reneses
giddyup
kolelas
tenev
mournes
markovics
baldcypress
oncotype
juked
govindji
khunu
vfi
tattooists
planningtorock
sickling
pleasantry
herrhausen
bonnevilles
ribalow
judyann
footless
raikov
bammy
geremy
micheler
limusaurus
garritano
antimissile
misreport
cauldon
lovefest
sawh
waldgirmes
hadep
gramzow
botein
zewe
crabmeat
stepovich
aeoi
pingju
esmaeel
histamines
ushaka
rosehearty
schavan
elmayer
apparati
sailboarding
singleminded
abdolmalek
tyddyn
lowborn
moonquakes
huysman
ropy
macchiavelli
ediplomacy
meilhan
zafón
mossimo
nolton
barkus
smrs
providian
mccareins
wheate
isikeli
cybercafes
lilje
bounden
fualaau
selek
infobae
kaviar
piledrivers
edvinas
noforn
porthdinllaen
ojjdp
landri
farrowing
malkoff
dwyre
kibuuka
othere
rakis
asab
melingriffith
geslin
thongloun
hereditas
levitina
dundonnell
emmies
ncst
wallowed
semrad
bewails
soludo
cousland
dungaree
bermange
ragosta
rodbourne
florien
cantiello
deyermond
siegele
poshard
nimit
elisofon
lavizan
blundeston
buonaiuti
heiferman
kretschman
aakre
dvorkovich
mbarushimana
kolanos
enlgish
waiwera
idolising
duhulow
miccio
singificant
nesch
superintendant
sapori
harraka
pxp
netbackup
bullheaded
krimmer
andrianova
rickhoff
gengler
fudoh
granick
relabelling
tasr
nasonia
spacca
someome
sesser
purt
omiai
rebuck
markha
unstitched
tabachnick
nilsa
uscar
tudes
solando
afco
photopigments
depressives
steffe
energoatom
resentfully
crapshoot
meledandri
soderlund
safetynet
postboxes
repurcussions
kyphoplasty
onstott
kasischke
cantlay
reymundo
mcilvain
bhasera
airily
kibitzer
shankley
vany
blackbridge
kostyantin
charmeuse
banyu
yewon
quinata
verfremdungseffekt
sugen
endoscopically
purlieus
adepitan
gracz
thermoelectrics
aceti
toves
nobacon
digitas
tofus
inteligent
aldeen
auchinairn
coule
mcniff
nonghyup
bushmans
iesc
caberet
ritterband
lapadite
viewfield
laudner
vorlich
funderburgh
thomsonreuters
woodseaves
jamii
sheinbaum
guamanians
yonda
ruckersville
musharaff
essop
schnall
adblue
credulously
boulevardier
kumbi
carrio
sungkar
brouse
resonantly
hingorani
flader
rurality
dongzhou
itinerancy
seesmic
analogize
lizaso
owoh
oystermen
fvi
asdrubal
whiteparish
wilcockson
flecking
blahyi
antiguans
einsatzstab
laor
duely
oberti
schlemiel
caroff
qunu
timmi
yobbo
schomer
suyapa
lisby
berkeleys
arispe
trut
regally
kvasov
trali
neronian
streif
fpw
woolstencroft
gatso
halvergate
noels
ondokuz
opionions
countermovement
sosei
winesap
delinsky
segler
prostrates
lonkar
spilum
tuitert
burnap
norten
misted
expensed
shiftwork
underactive
squirmed
kochman
mcquigg
chenet
richmont
remolding
playgoer
mendillo
wattsi
keylogging
bugarin
coruscating
bowlegged
faretta
iwokrama
pasqualina
chrgd
akubra
beetlebum
fernea
shicoff
solidaires
rahad
aleah
nashawn
krogman
bulthaup
riggings
folkington
unscrupulousness
rockot
kerpel
sonographer
hydroelectrical
tigan
lesbo
unforgettably
yanagishita
stdm
clickair
atazanavir
yisraeli
brixey
dianchi
thks
girts
renumeration
rübig
damschroder
tarkett
maitatsine
tedtalks
bartana
memsie
riojas
loosers
renilson
kiltarlity
mburu
peerman
gingles
pikler
goreau
mantria
hogarthian
botellón
campouts
agnant
yonnet
leestma
eyeshade
friended
vidan
taleggio
naeba
pendell
rafaelle
rayback
incompatable
shilowa
ketoprofen
journoud
grefe
supermedia
unamet
privelege
belvederes
krivoi
macdara
bellyaches
zisapel
laboratorios
biring
toyes
noryangjin
sudans
alfani
musahar
doorless
frittoli
welaka
deuchars
staudacher
villagio
albuminuria
retouches
cresseid
peréz
akinmusire
syesha
cfdr
jinger
fertilises
bonanzas
cgcc
ascribable
arutunian
khaznadar
brisac
backstroker
frohna
eckland
pshaw
heidepriem
stadtfeld
barrowfield
misbehaviors
panamian
merritts
dumor
lueneburg
belinga
saurs
kelbrook
spreckelsen
pavanello
babchuk
jericoacoara
ipsita
wenqian
jassam
marmorpalais
primanti
segnatura
behlendorf
macor
vichit
jaaber
maxmara
sumbanese
deboned
mukluks
provacative
sheerly
sabeer
kintsugi
papademetriou
dombrovsky
skora
autolog
watermain
mulgarath
ultrasuede
gleicher
scrupulosity
goydos
reinjection
broudy
zobor
vopak
inkie
countenancing
charkh
bearne
torakichi
tavakkoli
facilisis
doog
urkal
dmaa
freeski
boisar
scafaria
gussak
sarrasani
archambeault
samri
wolch
campiness
arzneimittel
bemusing
sares
gatrell
findochty
reinstatements
wordworld
tobianski
grgic
ifmr
tshogpa
committments
tcom
icehouses
freshies
katsoulis
repetti
trigiani
transmusicales
cloudbook
skelding
emmenecker
deola
batie
aliskiren
egilsay
greenley
ollabelle
larudee
snowpacks
lizzies
komives
infarcted
schifano
newswriters
georgakopoulos
voskuijl
demillo
lynval
maypoles
zuaiter
propogating
mittelstadt
ambady
juyan
giorgetta
crabbs
mashood
reaal
kichak
indiginous
fourscore
shizhao
southcorp
cohodas
wedtech
sucipto
confectionaries
ginestet
seiont
tamberg
sweetens
folates
soona
schipp
ramchandani
sadoun
bozinovski
yokomitsu
parentes
dind
riyanto
polaha
thorsteinn
boparai
storberget
texa
paylor
otps
fosset
maith
sumfest
unfamous
golbourne
nymphéas
mandata
wooer
saadullah
bioaerosols
sileshi
seatonian
passkey
pironti
spellberg
iqpc
cantamessa
beaverdale
jerrycans
darvon
vornic
lefko
dewoody
nkhoma
debdale
spyhunter
knies
otsuji
ravard
cofco
benberry
giniel
glynllifon
weichang
makkar
willowemoc
speechifying
bamborough
eynhallow
mnscu
gunduz
ipartment
hypothecation
covehithe
bonavita
meggido
cseke
equallogic
oloibiri
sourly
contine
paulis
cavitt
ahdaf
everwhere
raploch
kandarian
novemeber
paddack
harimoto
alaikum
pingan
hoseasons
unitised
ciemat
implausibilities
dsei
kuttelwascher
mughelli
procuratorates
minidiscs
yehl
wltm
labcorp
ujaama
kerbstones
saltanov
cavenham
bazilio
sookhdeo
avai
notos
sodexho
courrent
authentification
champness
ezenwa
buisiness
roocroft
hinkes
mulot
strachman
undercards
melkamu
krink
ultracompact
urethanes
weaubleau
lycanthropic
blanchelande
quickarrow
omcs
vashchuk
kreusch
placente
baatin
trutnev
ruding
shhhh
barzelay
billowed
riverain
biologi
hemoglobinopathies
sweatsuit
ingol
unforgiveable
ceroli
garboldisham
upmann
timbalier
zawar
dorenbos
sakher
lovric
kreig
chavhanga
awakino
dingess
sultanova
piquillo
spillages
vistaprint
reregister
racheal
outdoorsy
cradlesong
wenker
yaweh
frosterley
himo
karkov
ausburn
childminder
haahr
donastorg
omagbemi
nelima
payzant
kukoc
casetta
pinchi
mawrth
francene
mazzante
shufflers
prensky
npoess
chocolaterie
lincon
bauserman
huxter
bafétimbi
ruchir
lewars
liraglutide
marignan
torryburn
toddlin
diakate
shishou
bechamel
bartolina
gurnani
wennemars
knuckleduster
vanderlans
barnetts
barrea
pardeza
carriden
nutso
secrist
wolkonsky
squirrelly
custers
ennobles
micromachines
winwin
titcombe
solokha
homogenizer
flinches
britni
thelocal
lankov
yanming
paulene
mondia
ummed
keeth
amsale
vernors
myto
medis
glenrio
evps
ciocci
goldenacre
dolora
seatrade
diskus
bossini
xnet
epiphenomena
kristofor
balcarras
brillian
chartis
matichon
lancope
unsubscribed
benzopyrene
macfarlanes
moonalice
scopelliti
musaffah
foremothers
argyro
lélé
lewins
netbox
lornah
anella
gremin
computacenter
sdic
ithink
xiansheng
plodded
distending
nachmani
boochever
lhrh
jcq
amorello
mcmansion
acusing
parrella
thurairatnam
gheesling
rasey
heftier
toey
aboveboard
abdinur
openhearted
aquainted
shalan
buhne
rsma
ysrael
colourway
armiliato
nelfinavir
tealby
lihtc
tishkoff
alzner
ranin
sagastume
startpage
rowdier
fulljames
shoc
vanderhoff
telepsychiatry
saqqa
bmcs
polmaise
lincluden
bobošíková
calliper
hajin
lochaline
thorsell
cistulli
irenic
trefeglwys
connectable
wernersville
kettlebells
zunil
intercontinentalexchange
veeraswamy
invadopodia
smaltz
pekkarinen
chintzy
rocanville
batarfi
lovecchio
nanosheets
kolinsky
willfull
benach
pichushkin
limewash
examinable
kochetkova
derges
subnotebooks
seatings
multisim
jovicic
duffys
thac
arakelian
tomarchio
crusoes
lqts
biospheric
patchi
haselböck
camay
freckling
ironkey
reagins
tussler
gutjahr
starmore
apua
hanker
sumang
coppack
groupmates
koroman
enervated
woolls
boardmasters
regenhard
resino
terunobu
cyron
ioakim
jerrett
nyne
khaldei
rohling
seagrim
fanfreluche
vanno
baptistina
sumbe
romanticise
shehi
ivaw
sspa
liwu
zappelli
diffferent
tysse
dyckhoff
kolpakova
clevelands
renuart
retied
terminix
dippolito
dabdoub
cajasur
degennaro
blaeberry
boohoo
okny
hcz
hayehudi
ecmt
seleccion
pelago
caesarstone
dijeron
sartorelli
jerell
fmca
étouffée
cypermethrin
sergas
odiah
bloggie
seans
goldminers
egocentricity
mcanany
coziness
muhammads
emprunt
mimimum
jeyasingh
tulchin
buckypaper
lurasidone
shenderovich
pril
preapproved
spitefulness
hawkmoths
epiq
shouln
athawale
nellen
wipa
groundcovers
deonar
shansky
controverial
mwita
coleson
sbarbati
marlantes
salesroom
sadagopan
wallyford
spookily
courgettes
asphalting
foldberg
donellan
veterinaria
moscatelli
bodysgallen
fortt
ruith
macquitty
fishermens
shadley
ackwards
farrish
schioppa
mckaie
carnegies
althamer
warndon
sadique
cervero
hungwe
movens
torosian
scheuch
burkino
capicchioni
relaxer
drenning
shavon
braan
hollobone
desutter
maipu
stmt
lashoff
dilapidations
openx
playstations
tmsi
hother
photosynthesise
dladla
flummery
cibil
boorishness
lebovic
rhiwlas
birkedal
eiseman
rafed
miskiw
bugie
gumdrops
hahns
nissley
morduch
uncoil
baltijos
noninvasively
schole
andler
eidi
padrick
dhanani
abduwali
deparment
slumlords
mavrodi
mendivil
constâncio
darmawan
faloon
siloviki
pamarot
iskandariya
pranoto
arrowed
pricetag
inveigled
frontality
najafabadi
megastars
relevé
cbes
syahrir
subplate
mochary
microsomia
sinnerman
pladda
homotopia
mirali
usuall
dizzia
terramar
qtrax
béchir
dimitria
hartner
dreamspace
biosurveillance
demonaco
novellus
portz
ostersund
skimping
biopharm
buttala
trabeculectomy
ijaza
spraypainted
wessner
rubinfeld
graddick
cinv
kvvu
jianqiang
kozakai
luos
haeg
numskull
rabhas
multistrada
spergel
orosei
krauter
gulfshore
romanticises
skullcaps
leaser
aricent
clothespins
tlili
moscona
loudmouths
glamorization
coordinations
incognegro
haddara
glycyrrhizin
reconciler
edur
fekkai
honnappa
concomitants
zarinsky
caçapa
successfuly
mcphearson
natos
neyo
poppyland
queler
nlgn
avidyne
wavefield
sheathe
moominmamma
vuma
rorc
technophobe
lichtenthal
˜
wortzel
gialos
thimister
selter
nonini
titrate
aggy
pycraft
thoday
sloppier
selvarasa
magiera
fadhila
teriflunomide
mwah
bawadi
tantalize
sonrise
vanagon
sieversii
depakote
pentecostalist
icesheet
catched
karwa
staxton
schara
manster
arinsal
seemi
apartado
belyayeva
outbred
stambridge
neded
pantoum
lavolpe
seremaia
gulbransen
bermudagrass
localists
savastano
zdrojewski
rolwaling
pakuni
bautzer
sheiner
energen
dionte
bridon
ashong
goncalo
iaee
garate
dscr
stemgent
nawshirwan
fadem
billowy
crystl
landwarnet
lambastes
raffaelea
theberge
wordplays
seshego
nush
mccalister
kangle
microlitre
darrion
gleen
duffen
dayroom
overbuilding
litster
tapola
rosicky
cretinous
compartmentalisation
fizi
ticketsnow
rocuronium
gashing
ostracise
rutherfoord
chalai
uzis
janecek
wmh
aboya
linguiça
killyclogher
shirks
pastorello
jurisdictionally
tiffoney
courteen
yelwa
sutley
soupçon
matk
poju
felinfach
brocheré
eizenkot
nannygate
tonmawr
dargins
submissives
redleaf
disgustedly
greff
mixups
gisla
stampin
electroencephalograph
aluvihare
zakarya
paunchy
rpks
annelle
babywearing
langrée
kovalic
barracking
ratey
vollers
ajvar
nfer
feminised
ogas
dhoo
newbill
huntcliff
smartypants
langlo
chaouchi
krasair
gnant
monimail
nauseatingly
kolde
weixi
numeroff
kneeboarding
spanakopita
klyuchevskaya
overgeneralizing
astrologists
msrs
skylstad
remmington
gound
scaramucci
ashikodi
shirdel
levisham
nabaztag
mufg
kuchis
burres
takahide
decompressive
trella
temime
analyis
ziplines
ezzati
reframes
pummelling
mckail
malinka
chado
streamable
assumable
neidermeyer
mirandinha
iald
tabío
ruli
quitno
connett
huntik
zoutman
breemen
hossan
upcycled
tunjur
kerrys
jimmys
generationally
akiyda
eresearch
roudham
snw
grubisic
ademe
policie
nuthurst
kuhnen
epizootics
enthuses
roesel
walbourne
biasone
bangara
hapsburgs
christner
pramada
voracek
nbic
feinerman
colarusso
kimona
tophi
abakumova
undershaw
giezendanner
livshits
khaleq
minimed
wordsworths
palandri
ndele
daughton
heizo
aristizabal
mohammedi
ejigu
atim
yankowitz
delai
puttering
tradeking
disrepectful
andreotta
goodmorning
bosumtwi
giberti
bilbrey
fralin
tattingstone
idlis
eithun
bolsas
dweik
finanz
cengkareng
sivil
freeters
supras
parlon
bronconnier
bardelli
bochatay
unbothered
patzelt
grätzel
vanguardist
bagage
neurotology
llangernyw
tuohey
dellin
overladen
lilywhite
saveth
shirihai
adco
ovonic
moralize
mikvabia
elixer
kenric
schiebinger
drumond
billière
swiftian
parmeshwar
haering
braich
menello
overlawyered
nautically
arbory
neyts
hepting
zagoria
acing
coudn
thermoformed
denter
krisflyer
overvalue
baharom
haircutting
enviornment
mitc
arrelious
pomonkey
caballeria
canziani
grisdale
liechti
bazayev
tinpot
cihai
clambers
waterlines
telesto
tverdovsky
verjuice
nored
eusden
soumela
ropert
narcotrafficking
kismayu
twyning
poussaint
matsumi
gigajoules
antepartum
oksibil
plaister
numerate
spaghettios
greenstick
shadeland
citrullinated
oakdene
eneida
robiola
irreconcilables
vanj
jiadong
sarcasms
begue
doncieux
sanderlings
riminton
bobbers
ironsmith
greulich
zalis
subornation
backpacked
rebensburg
bauerlein
rejas
rahimullah
spalko
dogmatists
deblasio
proctologist
laggards
truckle
altimari
realarcade
respose
kaarin
ishikura
sherlocks
devening
guayaki
vergangenheitsbewältigung
ghostnet
wamphray
tocilizumab
rafizadeh
palantine
conchi
shamari
glenboig
kurant
serpas
peyrelongue
prestidge
hirvensalo
waliszewski
begram
comtrade
welshampton
smurl
pruna
oakar
dankers
capricorns
webasto
moolenaar
greenskeeper
guruli
spellcheckers
quarriers
dakoda
aphalara
brimin
shwekey
ammour
dtcs
colgrove
smalltime
finlaystone
bannisters
muscatel
rabago
monocles
guyett
lapdogs
krzykowski
talkativeness
aircraftsman
olimpick
ramlan
berdyev
fennig
rfmd
norry
actéon
naseerabad
cherlin
patriquin
cangnan
xiangjiang
gatete
moroun
nceas
merlis
veedersburg
sportcity
cheswardine
demystifies
smurfing
mcdarrah
munstermen
chekhovian
tutera
mcci
concertation
itfs
routinized
mandle
raghbir
breier
bentler
genval
bizimana
romoli
stabby
camusso
tchomogo
prematch
yaghoub
amny
woebegone
byrnison
athenahealth
sidlin
autenrieth
phenonemon
redound
individualities
mabin
shousha
dochia
flambéed
baramidze
yusmeiro
virtualizes
botel
bunei
anoto
compaore
holtham
shipe
connoly
gilbart
ramit
keams
yesnaby
pietikäinen
maxjet
hexion
stanners
finzel
limonade
mochdre
concatenations
monsummano
marcellis
tomandandy
briosco
borror
nautic
mcgladdery
mancall
vladivostock
nessel
santomero
kinoma
bahadir
unsterilized
swaption
toyko
abdulbaki
ramé
cheapskates
gaytan
webo
crisan
miraya
mirsaidov
jazzmobile
unsharpened
eternities
marketeering
blackjazz
azare
lanpher
delea
trajanov
saarbrucken
enwonwu
merriott
derker
rosener
desyatnikov
meraki
borchin
calmus
unet
epynt
defen
luocheng
dannenfelser
sanitaryware
saithe
soonchunhyang
backpay
marzol
paralell
stanz
etech
rja
freile
tricksy
makhanya
engima
manadon
xhaferi
dexterously
tapenade
mbusa
pommerening
salaheddin
warcup
longfor
ebenstein
hongda
capnography
wenwen
unparallel
sodini
outbids
kingway
wrvr
unisdr
piratebay
kerchner
newbrook
larrivey
bookaboo
nonviolently
vendange
octocorals
brunicardi
tury
lamictal
luketic
mcnairn
rostker
koffmann
irineo
guleria
pakleni
falin
picure
overplays
mbacke
rikhi
bobbles
maitha
clervoy
fareeda
leverington
bkx
renneberg
fujiang
tulong
roedel
natca
papadum
barkero
aaish
pflugrad
summerhaven
telerama
dulieu
mehru
interfer
newspapering
cretney
umetsu
losts
rauhala
mertins
suhaili
moscrop
papariga
yogan
couty
marex
greencore
phentolamine
iceboxes
linning
draskovic
lalani
beging
spicoli
cobalts
gallowhill
chancelleries
pents
cotechino
dhanabalan
delicioso
alliterate
apiarist
lorscheider
vandeveld
harstine
fastidiousness
offerred
massagers
fibernet
yamdena
cpga
attracta
ogio
wunderkinder
belview
wonewoc
catina
dementiev
karlinsky
larcenies
arpana
corthron
tintomara
raissi
stld
pougnet
babineau
juelich
socma
heyneke
intelli
metagenome
abercwmboi
officialese
mayella
audiocassettes
waplington
hindhaugh
labit
aarnes
advisees
saxagliptin
sihem
kikis
coconspirators
demises
lidow
orsoni
djurberg
bertron
niedenfuer
rhinology
uniworld
sdny
tideline
robling
gallou
offencive
boluses
poupou
mindreader
alwayz
nocsae
unrepentent
ziese
grandpappy
borràs
chordoma
shabery
bancoult
slushie
monolithically
suell
ashiana
digiday
micarelli
banters
sudafed
merseysiders
ramaley
quantique
meteorologically
englis
hindia
pulos
whar
ischinger
mundanity
quesillo
eglwyswrw
fedexcup
axline
drabness
tekonsha
guarrera
karasick
stenotrophomonas
hyperconnected
reventón
gurrieri
cosmetologists
potterspury
larod
selikoff
jianhui
oreskovich
crimini
ofter
glumly
darcys
shaowei
ateya
essandoh
bramos
goche
jlh
segretti
farmoor
ilonen
fraternisation
gareloch
abscessed
baraa
bikkembergs
skrenta
kandt
longforgan
accanto
palaeoanthropology
slavsky
sikov
garding
altheide
sebelia
baibakov
russkies
stolarczyk
ragdolls
gudvangen
sicsa
casbaa
nisin
fuencaliente
sunbed
sabuleti
bearshare
talei
tamango
olbrycht
scarola
highmount
karbalai
jasdaq
zorkin
estruch
admas
mckamey
rappolt
pintat
flatpack
acused
syntec
brightons
knowler
olowalu
upcharge
guessers
pagnini
ariat
pulag
spile
llysfaen
racingtheplanet
jesch
mondell
harmans
nortman
mdax
bladderworts
zeitlyn
irking
nixing
horsebridge
tamor
stovold
brimob
charalambidis
almyra
ferrington
heartier
yacuiba
remonstrations
eepco
airvana
forceably
sepetiba
jugaad
riveria
iyiola
simantov
judis
schaible
brauman
chromotherapy
uprg
delcroix
androsov
drumlines
deckle
plake
niuas
geitner
lebedinsky
gilauri
furture
rayyithunge
arcega
labutta
restor
nyambi
phobaeticus
makaton
calbee
aetf
karsner
appalls
crypticus
woodforest
thelondonpaper
hunain
haywoode
brotheridge
ruskell
skypoint
volquez
azada
limn
mccormicks
angkasawan
catfood
erani
eckerson
rentrée
mortin
pasquariello
cowplain
nartey
salved
derron
underlayer
osadebe
parleyed
urrego
ogbo
wasden
quindici
reeker
cgcs
inkjets
charith
fonnereau
olalekan
paraders
parodical
flear
bayart
undergirded
shieldhall
kelenna
ahhhhh
butre
yovich
lousie
copnor
duker
stotijn
bielat
predicable
anez
kamaljit
killhope
gussied
potocnik
appelqvist
crashlands
paveley
belvieu
damons
moulinex
bakalli
ganeri
overreaches
brolan
refere
intercommunity
baphuon
medicalized
yosvani
floodways
mousseline
kabaivanska
tarzian
emison
skiffington
beneatha
ciliberti
creds
whetting
beancurd
inoc
whelming
valad
burmaster
authentics
sweetzer
unicredito
ghedina
erzen
deema
mobinil
ishag
scenester
transshipping
houssine
talibanization
cafarella
decare
gejdenson
crickmer
hazley
albawaba
kayna
rhug
bencze
orice
unclutter
seismogenic
schwertner
chikomba
nakazono
swaleh
lungis
montelepre
allurements
roic
culinarily
givanildo
spencertown
vergoossen
mence
papadopulo
gabetta
niewood
ballyedmond
quante
murgu
sidespin
hollifield
gones
woodfords
yugos
abuelas
hostin
ewalt
lavisse
tido
lesnick
benjamen
brewerytown
portending
anonyma
falgout
myhrer
rhj
prokofieff
shujah
backdate
jeromes
forestar
cuénod
woundings
warrer
longhill
dehumanisation
orangeade
laffs
ccra
oldbridge
leffert
netco
gaudioso
hatib
debonaire
ficcadenti
renaissances
dowlin
pittencrieff
antunez
endalaust
gloominess
déja
sydykov
attra
mssd
gienger
koumakoye
suzon
vadum
radnich
interprofessionnel
parles
florales
saratogian
kishishev
easyrider
eastpak
wyotech
dunalastair
klempner
fahlgren
beezie
macel
pacbio
toeava
luliang
explainers
glomb
flashgun
quantas
everloving
florentyna
flexfuel
gutstadt
ponos
nordfeldt
bonannos
bochenek
alligin
biziou
canwick
morrissy
chipaya
crafford
springford
graveling
kanokogi
rakewell
workaholism
chauffeuring
nirmalya
nutricia
estoque
minneota
astronome
harkinson
vnexpress
ffynone
lardons
clowers
savenaca
refinable
kftc
starquest
winnefeld
wohlfart
hemopure
dunskey
ahrendts
wojnowski
goosens
kalid
endlessness
lingmell
woiwode
botryococcus
creadon
kirste
hopstop
munadi
impolitely
cortachy
natkin
livening
privilège
snpc
numerious
mamane
mathlete
manap
tzorvas
orams
sitings
ijaws
puthod
kalivoda
leverburgh
walkmans
spiewak
sazo
wbcc
sascoc
wadel
nivard
haldenby
policlinico
ochowicz
codding
biggish
sarosh
santulli
guanidinoacetate
anglepoise
felise
vivisimo
erlegh
benshoof
vnesheconombank
gulya
skinsuit
ovadiah
sandzak
galanes
belched
dccd
sconosciuta
scenerio
deviltry
skofterud
kapha
sokhan
theonas
holstrom
vilia
squelchy
danchev
semiskilled
venita
mikadze
biranchi
skarupa
thid
bitchiness
scrawls
muirend
myhren
footmarks
filenet
potties
motivepower
lambics
bidlo
stepfanie
toddlerhood
chegem
xobni
preponderantly
frodeno
ganes
brynmill
sterilizers
hapen
gallogly
khalidiyah
lafi
slitted
colacurcio
shearling
parnu
daskalaki
oshri
unguicularis
gerardia
baqeri
perfectionistic
oxera
ittre
cragnotti
bureaucratese
sebesta
palisadoes
opeka
falsgrave
pietras
treem
lawnchair
nayani
physiatrists
tamkin
salcey
unpackaged
telmisartan
bartak
greenwheel
hilbertz
allrecipes
festerling
jiming
triade
jumpshot
kgd
cornman
aboubakr
layhill
deployers
malnar
zangana
sindal
anifah
nimick
greenbrook
sylke
scheherezade
devier
ryals
bizbash
construccion
coelodonta
widmaier
gaggero
unife
inuendo
nonobjective
rouleaux
maow
cadenced
pohlig
sakano
celente
malaitans
lancker
ladds
donees
kordestani
pwap
aristy
elsje
répons
honsha
boccieri
whatsername
coathangers
wadeye
outpour
donnette
shamardal
senu
görg
barja
tatel
socco
hogger
amarchand
wendlebury
ofisa
zinin
cisternas
aiyana
noetzel
nagami
kirkwhelpington
padmapani
asug
stuey
slegers
merron
kritz
misbranding
seabridge
bishai
cheesehead
xec
wigig
dayenu
placemats
akello
octogenarians
rubial
tuleyev
leslye
hrusa
oleanders
unshod
ndjamena
kucerova
lanzer
uncaptured
underplays
hadippa
hubacek
cinematheques
turnbridge
catmint
bnsc
jinming
bryceson
lapindo
supramonte
mrff
motrin
aamd
deferentially
komadina
bevere
hefferon
balmore
jongro
lapido
unicor
massarella
crundwell
keatts
shlemiel
scanio
grimsey
unimaginatively
nfz
hatchlands
mndaa
waylett
sheffields
mccluggage
mancia
misdirections
solbes
desgranges
bulelani
buttu
lincolnway
partech
minaev
basaraba
recirculates
genauer
larraine
houstons
schlyter
sandora
norichika
netvibes
stansky
fleckeri
wapshott
farolito
samovars
brunious
mongie
lacovara
presidence
paradisiacal
bogdanos
iervolino
yessir
gridding
tankini
riems
mittendorf
bockius
agiza
irrelivant
autostadt
welthungerhilfe
baardson
khiid
superbeings
appollo
ortuzar
interphone
tardio
batham
kapadokya
gurfinkel
willises
snowsuit
snakeoil
housecall
vanpools
bsis
flassbeck
underbanked
everyplace
sagoe
ebble
vadinho
aeos
blurbed
reexamines
addas
boerrigter
lydersen
korunas
tsuper
subdwarfs
ashlawn
heagney
casel
campbellii
jumani
bandos
procedings
morskoi
ingénu
scit
balistreri
lusky
chapli
,what
monthy
doci
carrino
energex
edmark
bulte
abraço
towerhouse
godshall
hargittai
holusha
chasten
mexted
penarol
unattained
francome
cooliris
liying
condem
sandquist
iley
coxyde
ullock
adoo
changiz
chapelcross
shiyam
tokely
walkergate
imler
lavelanet
valler
holper
debouching
chothia
petritsch
monogastric
noorzai
sunzhensky
tanusree
meridia
glenmere
schoettle
oeltjen
sharktooth
cdfis
benetech
obstinance
cotty
artbeat
kregg
chivhu
trembly
rozita
dedge
sharak
llais
aerotec
cauliflowers
patronization
hammamy
appeard
teig
omov
kozun
transload
iolaire
matricula
teekay
consummately
shemyakina
manteiga
pitte
unwitnessed
bulgakova
bloomgarden
deursen
goodnights
adass
overvotes
becherer
andriyan
endeca
deaccessioned
deshun
winterbrook
fnma
marchesano
awwww
contagiousness
gillmoss
achao
acidly
puppyhood
deriso
okinotorishima
feierstein
flexicurity
trelewis
choksi
norrey
honeycrisp
womenpriests
telerik
rasky
immunomodulators
mendilibar
shareeka
dollarhide
hemichromis
sharemarket
unlikey
palins
tunheim
dikgang
poteri
consulation
enchained
idns
transparant
noodly
orsières
demoulin
heroismo
biomodels
covi
sghc
lenhoff
licea
ibera
remorsefully
dalkia
logboat
individualisation
hewage
pleb
llanrhystud
reemployed
satit
dissatisfactions
fiorani
flounce
faldbakken
hyperspeed
mansaf
cnty
silano
yasini
awaroa
hergert
romitelli
vujacic
bettinson
nimbuzz
quirkily
dikgacoi
mergia
overdependence
ddw
asciak
bunged
doveman
tamagotchis
boska
fxi
soung
hcbs
heronswood
eiffage
sigalet
mercaderes
stigall
orlinsky
segedunum
flna
cyndee
clobbers
sumani
bouhours
lardinois
ficano
rouhana
pubococcygeus
oportunity
boulianne
fotomat
morrells
binjamin
janene
bessac
ouazzani
sexists
temuri
ingenues
lobola
cisatracurium
beddingham
yelton
strathyre
comeliness
weinblatt
bjugstad
kusters
leches
stavropoleos
dunlavey
metastock
lauitiiti
overactivation
claeson
herati
skysails
fakey
megaregions
alil
mclogan
woodrell
talega
vergelegen
tomdispatch
ddgs
cimes
kampfner
trikke
hemann
cheparinov
wised
shors
braunsteiner
podobnik
suzue
merissa
torff
sswc
atsuto
dissolvable
runnion
strohmaier
niri
ashouri
mctier
ramras
zreik
ihuatzio
khaldoon
saffari
mamad
teleamazonas
bener
tulsky
fracs
aguanga
moneywise
rommer
dtmp
mengozzi
azzuri
amarilis
kafala
takiveikata
chantale
oscillococcinum
zurawski
illegaly
thompstone
boilersuit
choiniere
subscribership
mathiot
vgz
tajudeen
liangping
schuble
lowenfels
slews
yearend
greenmail
varez
matjaz
zuazua
taxista
boink
tecún
lundine
sasnal
quadracci
dhada
yeren
registerable
wilmsen
seibersdorf
tenative
antartic
gobstoppers
ruebush
safian
mauricette
llanon
ogunlesi
snuggled
ishmeet
klinkenborg
telegrafo
abidance
aragoneses
winda
adji
ovos
pedalers
oscr
abduljalil
misfortunate
kerching
rollkur
rodne
zaraah
ngrc
sloes
epeat
abdullayeva
bordowitz
aspart
dellacamera
bartholet
creton
horrall
beaumes
isacson
weisheng
ribner
kaibiles
espically
smyths
odent
parfaits
jarzembowski
paddlewheelers
fatik
khangura
eashing
boome
ilisa
boyadjian
dacal
onramps
haiqing
spanswick
swidnik
rhoncus
guibal
plummy
elstone
hyflux
panchina
swinstead
skanes
ovejuna
hentsch
ohoven
speedtest
disolved
cataloguers
wfirst
photgraph
mcninch
richeze
ronzoni
orender
topfree
handysize
wohlberg
manheimer
fryars
wigeons
talinn
hovan
trefeca
bloxx
azc
morhaime
omah
boastfully
iseq
mentholated
dhiab
ridgmont
britan
amys
suchus
krucial
elys
mockeries
saneamento
graywater
bolzaneto
aztreonam
misar
democràtica
tomotherapy
quiara
scarless
amendolia
playspace
pynes
pachacutec
guvnor
aronstein
acftu
houssein
inexpertly
goood
sunbursts
manba
semiautobiographical
perci
andresito
inoa
ducange
circlets
bizspark
jillion
sturdiest
constan
willberg
braker
chalom
amedure
nonconscious
supernet
vitreoretinopathy
hensby
sellindge
minipops
geronte
straussian
buttermarket
congenially
whir
osawe
yerself
strathie
malevolently
backmarkers
meulman
chalak
kenninghall
locandiera
katlehong
sanctimony
simunek
calasso
repletion
ileen
plauche
sideward
muwenda
tokayer
bereng
yenga
garelick
antioqueña
dhoinine
fich
vezzosi
actionist
entwines
kandie
blowtorches
dewain
chavdarov
allocable
meniere
rashean
unhinge
cedefop
nekkid
yutz
harmfull
ncip
cervid
escargots
lubrizol
dawdling
newports
obeso
katherines
schwaner
bevois
nailbiter
katherin
roseae
valjavec
wease
avient
volleyer
bogdanski
nachmanoff
holmsley
learing
shurland
roepstorff
fanga
gwybodaeth
somethng
hoogendyk
ticketless
pinecones
matison
santro
piram
nasara
superinjunctions
gourdes
tanera
ivlp
xaui
lunzer
feese
enterasys
undeb
cleofe
midblock
zeituni
pasquill
cantilevering
tooo
asni
tennants
shaoping
badware
impremedia
dowlatshahi
bijani
arcc
dunscombe
genmab
markovitch
tormore
onging
skirda
camdessus
hrvatin
laufman
saidah
khadaffy
aideed
albinson
bidognetti
ramify
xiapu
vonzell
superhet
tuomisto
mcmanaway
niznik
kobia
ukcat
handspun
bleats
meddlers
halavais
bamc
islamey
usarpac
huiqi
ribaudo
marant
mandrax
wirefly
tayman
kanja
bloodstreams
yelin
puces
cubers
autoalliance
deifying
sergant
corpsing
zumino
krumholz
marsis
shirqat
highwinds
labourites
hüttig
srirasmi
mayakoba
franklincovey
incompetant
labry
maltreat
talerico
stanols
caftans
autostart
molaskey
aquasco
tyaughton
artron
myaung
streeting
zaiser
denmon
fahrenkopf
makhteshim
guastella
offed
promedica
abbotswood
lidsky
kickstarts
sanakoev
falkus
amberleigh
nonworking
mbarek
kitenge
severer
saimdang
karoshi
tickbox
dronedarone
kasl
speediness
decaires
southcenter
ervs
sadock
squatty
skidmarks
snarr
spak
christion
scervino
timbits
chuku
scarers
phenylbutyrate
silverburst
exilim
tarnopolsky
nonsignificant
hmsa
summerteeth
trinchese
bojorquez
kiffen
iocg
tsikata
whne
abridges
kapchagay
rossnowlagh
bauler
grabovsky
qwe
scatty
cmz
wavecrest
statures
irvingia
fuar
artim
polyheme
sheikdom
aboriginies
murieston
babycham
ambinder
insensibly
reprofiled
lanzinger
almansor
barstools
bhusal
spiby
slader
baxterley
jendayi
voronova
molzahn
showbuzz
grasscutter
oyan
pealed
bolot
sierens
agelessness
tamest
prochaine
randolphs
lisianthus
jonai
afficionados
waterer
merali
bluffers
cjis
gregariousness
berding
sowter
shazli
vantongerloo
sveaas
bitsa
melot
sloshes
rubleva
poru
maltesers
rhoscolyn
ellies
fossdyke
gvh
aqmd
broadstock
lyerly
shipham
avantage
stasny
overcrowd
kharge
decieve
palavela
misiewicz
leapman
bashardost
chinkin
dormido
subsidisation
microconsole
streissguth
ingreso
vdim
kleinplatz
doogue
caricatural
aqsiq
aftergood
gardee
idiakez
malee
kashua
zonisamide
markiza
steepleton
alney
schmich
wtas
sumiton
tafara
marettimo
philosopy
litfin
irongate
bovisand
buizingen
eulberg
fornet
snic
resettable
marrion
berriz
jingyan
pegase
suffient
stieger
deafen
mcnabney
overdraw
penycae
cku
paronto
tregarth
siveter
flexibles
imporve
overrepresent
vnsny
nioplias
geneco
paroling
isouljaboytellem
alair
privilage
golebiowski
collesano
psls
gordita
dongliang
tensilica
dagne
aecb
lightmoor
viscusi
wapama
coffeemakers
jianyu
berthelier
mugavero
kiwanja
wallentin
kice
umds
ballycran
hidefumi
bednarczyk
mahdy
alqueva
bellovin
ruhemann
yammering
spaul
monteria
obanda
tonsilitis
sebha
colglazier
ribordy
minkovski
samode
boites
camy
danzante
thobela
orthe
minker
xiaosong
hashr
yesss
dacher
moormann
sangamo
degand
decluttering
schore
megadoses
biscot
nerdish
longhorned
canefields
senichi
minxin
izurieta
santoku
sullenly
mauldeth
kaney
atthe
ssentongo
xaar
peos
berz
familly
mobilizer
hollas
donskis
rollergirl
aperol
guangsheng
mitterand
rabanes
tolcher
trebelhorn
hgn
microenterprises
ruswarp
cogitation
oreodont
viant
thearc
tilera
nuoc
assurer
gwm
heppe
pakledinaz
pelvises
torquhil
pulizzi
levade
bastians
moscardini
anouchka
longano
evergreening
architected
kleinhenz
irmis
chilman
socalgas
colford
bendavid
peabodys
mutabar
cordrey
duvaliers
buiding
chainrai
dickert
bomere
blmis
mercantilists
serk
courtliness
glufosinate
castrillon
catya
hillblom
outten
tipoki
priciest
planetree
recompute
schillace
autodefensas
storrier
twitcher
stakeknife
cheekiness
bargemusic
cpri
cemetaries
carsia
bembeya
greisinger
douzaine
bromden
marites
britdoc
rahhal
nakamachi
abbadon
ifrss
dictor
brechtel
giammetti
srz
koito
casebolt
peipah
maked
barbury
cambone
lendingtree
overskirt
jasperreports
uninet
bekki
abilty
tredecim
supertarget
buttars
cluelessly
jinguang
linick
detainers
arobieke
unreadability
bonjean
pigsties
rathole
syddanmark
demeanors
scarifying
fastcase
lunesta
terzan
boov
obvously
cherohala
arrangment
kvarme
lacamoire
eabl
trepak
debarment
veze
shamkhani
schuermann
böge
deetman
stous
munches
holeshot
sarniensis
wetterhahn
salseros
fountainheads
floatopia
adarius
marbo
rindfleisch
schiebel
actt
intrasquad
micelotta
sledgers
calcify
mandaric
demagnetize
irrecoverably
jagaciak
spetzler
carpenteria
assuras
rocktober
dynamed
demontagnac
sodomised
aspal
katleman
beatbullying
stalags
espinola
subversiveness
tronolone
lekas
heptinstall
comissioned
ehx
pulverization
nter
roadgoing
zhitnik
draman
cabarete
ringless
chebrikov
sowton
mcartney
sahebzada
shabunda
uncrushed
serpette
promod
cncp
bateen
kuluk
weichselbaum
gangasagar
wheelis
copout
forceout
differen
iacet
aeroscraft
vray
ouaga
ramanuj
gutty
killean
honeysuckles
kendry
treml
icfr
comtex
whitelegg
voskuil
margasak
choephel
raquet
predications
openbts
sasae
moraghan
deciliter
kauser
heslin
mccane
zerhusen
momolu
nonstick
bateke
cesid
kumakura
teulu
vmap
battistello
nmsc
tapajos
struk
tengas
communites
hoenlein
kalandadze
sovcomflot
enery
adventurist
rfids
missis
couri
roomate
julika
martifer
karanas
sardini
comptel
haipeng
ilko
surber
telecity
draftsperson
vigneux
korede
overbooking
pricerunner
herkenhoff
gandelman
heanet
cavelli
kamol
bramnick
tibetian
skyjacker
yotaro
finklea
fundraises
crticism
songsmith
spreti
nubbins
nevadans
demonetization
pharmacare
ypersele
mceneny
kuapa
krvavec
preceeds
quenchers
tadulala
decaturville
taleo
folksmen
mayeux
enviously
furai
nccf
kislingbury
roote
friona
allahpundit
shabaan
verycd
chammah
wildig
mccaulley
irishcentral
dinnet
ellers
kelco
advfn
maced
tonen
pparc
albertyn
raikar
debulking
spagnoletti
whisby
mmic
fortysomething
kabr
krotoski
alijah
pechman
sajo
laborites
fermes
sicav
westcot
bawled
spätburgunder
bacheta
handymax
birbraer
yonfan
sligachan
vexes
yardeni
inequitably
transcendentally
tailcoats
peson
vakoc
shackler
americanese
nasiriya
koplin
alendronate
bottenfield
fead
tangena
tulloh
alevras
mouettes
lunacek
foulmouthed
korcula
teyon
fidlers
figleaf
godinton
houtz
pitsuwan
serioso
mammaries
schwass
japin
vibratos
varous
asoif
chahed
stockgrowers
birgham
khatchaturian
sahibabad
molotch
cromagnon
gollner
nolonger
macdevitt
hölzle
thadani
neftali
caseys
threepeat
solemani
elisra
evano
njonjo
archaelogy
dietlinde
cirella
gilkicker
nyth
baldersby
gianniotis
yonaguska
reemphasized
serpotta
demarinis
lockeford
joraanstad
danyelle
casanave
sleepout
cremates
aadil
vongsouthi
pretre
tihinen
haixia
vetco
nerger
mistranslating
ozinga
druyun
kalume
compartmentalizing
sanmiguel
stengade
terbutaline
seider
infraestructuras
adelgids
elonex
asias
ajirotutu
cocalero
haiba
lowlight
kadyrovtsy
sejad
bidness
lazne
commonhold
eastborough
schons
triflin
sakalas
tangherlini
karisimbi
archnemesis
hawcoat
tongxin
seany
derniere
poncy
charara
gloving
hallwood
zastudil
progam
sportsbusiness
bartoszewicz
choquehuanca
kerith
haqiqi
ttyl
mepis
rsme
tallentire
cemetry
efromovich
irishwomen
photobioreactors
snores
behavorial
pierrehumbert
shevin
golland
peasenhall
lickorish
houssam
milioni
unpublicised
icpdr
quyang
langwathby
moneybag
upselling
bbnp
daggle
ogunyemi
rogaine
calpin
kyrghyzstan
trads
hiscocks
lindenfeld
psychoanalyze
giannoni
vantini
contagions
rhostyllen
wayyy
emarati
othaya
smoothes
winckley
maisonet
vieau
torness
mehli
forseti
jerard
tafralis
sharits
resentenced
insiste
unguents
incestuously
sariyev
maccambridge
balbach
waterstock
tirey
dagong
yeske
flammang
nemko
journeaux
exarcheia
skedaddle
comitology
swiggs
bopd
barenblatt
halloy
austrailian
veuster
naeole
battlelines
bogush
autodialer
aijalon
cipel
oluwafemi
blackberrys
mewstone
videoboard
huppenthal
tromboncino
listees
gasparian
chaebols
moclips
kihansi
gunfighting
scaleable
subspecialists
miert
shibao
ccmc
nphs
practicioner
xunlei
starborough
harke
gottex
vgg
bisk
avalan
deoksu
bordeira
sinisterly
aghadowey
scialoja
ecrypt
poleglass
gibralfaro
mixa
xativa
jhw
milfontes
seillière
pallab
okam
cannop
yorkshires
begovic
baburova
paciotti
islom
disseminators
tweeks
wolfquest
faehn
brittny
townsell
precipitant
rydzyk
blairism
berki
ctic
rafiei
waraich
abyssus
onsens
fisherwick
cheontae
medix
pointfest
michigander
missned
bonekickers
cionnaith
khawja
naray
shaugh
profanely
trishaw
compil
codefendants
dubrul
gaiole
poofy
nuvinci
rovero
rouanet
plimsolls
metrovacesa
jlj
nrsros
caflisch
laferla
superabsorbent
nailin
spithill
cyberpsychology
ghanaati
trest
maehl
haberal
bayernlb
ambepussa
glancingly
werehog
comperes
monaci
kindie
kiren
kcmc
glavas
brauneck
repko
mooses
hygenic
unhip
subgenual
arunas
chatzky
bucketload
donayre
soaping
adamik
lataif
sitruk
birdstrike
hafsia
yezzi
plenette
sperrins
streeten
pandjaitan
menelas
tappings
onexim
semblances
anley
skiddy
totting
issak
courtley
jeremijenko
horsh
vaunting
lumpfish
piglia
choeung
dubrave
riosucio
diiulio
ceeney
dostoyevski
oleophobic
thaught
arastu
farmable
jeffre
steenhoven
gerren
ballgown
noras
merial
taurisano
pantelides
aggressivity
coadministration
nassarawa
ellough
ouanaminthe
chicchi
caofeidian
furla
scariness
dragados
hotsy
dangit
baosteel
dissel
cariso
brunger
kemmons
stovell
longini
basanez
bensch
houseboys
xuwen
deguzman
haridopolos
panarea
goerner
togbe
suborning
buidling
kharel
fragos
chionochloa
maintenace
gyeongbok
yaiza
clachnaharry
melissinos
clientèle
civilianised
ozm
smiter
theall
isayeva
jinke
browde
scything
gianfelice
knutzon
jagermeister
jinning
ankarafantsika
depaulo
eventim
hoffmaster
birkholz
leontiou
stinkweed
phoon
amendement
siala
prostates
unwins
trulock
macsorley
nibutani
autonet
enrobed
vorp
foreside
exiguous
guestimate
berley
nexium
didit
borensztein
moph
vonta
einstien
mtwapa
fengming
goiters
plusher
appartment
bahrke
malaguzzi
hadoram
possebon
thrombocythemia
clearcast
stipan
kalee
dades
keralan
jurjen
thatthe
slatton
sannine
roubin
lobortis
ataq
bruguiere
gauchito
admen
kierkegaardian
boromo
josipovic
unsuccesfully
chabane
freakley
cieh
molchan
yahadut
fratianne
gagen
lipsy
ashcott
convivir
ziehm
importuned
mirdamad
impeccability
cayer
smidgeon
liadov
pootle
synchs
tagammu
kamruzzaman
cbga
syfret
hotair
monchaux
maroussia
cheslow
didactically
eisentrager
charisms
ntelos
bowleaze
glawischnig
narcan
folashade
kanayo
navada
waistbands
blastomere
schallau
redds
xyrem
solarworld
atsa
toomay
raming
basenjis
adify
bluefire
dinkelspiel
introducers
levere
panagariya
avichai
ayson
bariatrics
otterspool
robosoft
kurtaran
akau
kozinets
yunqué
ngiti
feltner
uale
mundan
avows
inchnadamph
sensationalising
pitoniak
zumanity
croeserw
pedometers
knowingness
sunworld
roessner
heartiest
thiébaud
nurock
bajans
mcaree
mcgreggor
stamoulis
kuneva
kudamatsu
earthmovers
greenfingers
ndege
anniv
popken
machholz
rudic
anatevka
korzh
westeinde
otane
fotouhi
emarketer
nyaru
sterilizes
wisnieski
ecotoxicity
relaxnews
fouler
sabira
viaspace
moordown
minsa
benns
peverill
danwel
netsmart
ludgin
vyvanse
ffbc
nyanasamvara
locavore
wgae
craymer
thomspon
dpicm
phallological
spuck
lucenti
epiphenomenal
richardsson
placeshifting
khalifé
terreri
lepori
opensim
funnye
torie
hesam
enticingly
unbreathable
shargel
rhizomatic
bitung
marmarth
rosasco
croxford
daraei
luly
schnellbacher
rxi
ipbes
diresta
dynastar
beewolf
skobrev
surama
choirmasters
sdna
stingl
arnesby
attendent
suppor
göldi
jonothan
llywodraeth
steadiest
fraa
idolise
strummerville
strathdee
hushmail
arriaza
parsimoniously
schlopy
sturtze
soumare
ovesen
reargument
histoy
jihong
rottenness
horethorne
movieclips
bialetti
kvh
berteau
fromthe
chemezov
becu
baumber
cordey
ranchipur
nickols
valderama
borrowman
ultrasonically
spectron
outrace
oshrat
wasiak
hitchers
huangdao
midshires
giering
electricty
ringshall
kastelein
gancarczyk
gastman
orobio
khanin
tequan
yongjian
artmaking
tanamor
splutter
gudnason
melich
acknowlege
vengefulness
jintai
showerheads
mezereum
nehushtan
wizner
mccrabbe
shinwar
ringstrasse
torrentspy
kathoeys
ijburg
thokoza
ledden
sidestreets
djurdjevic
priciple
aihua
paleobiologist
zhengjun
minju
federalize
beharie
irwins
mcar
danniel
kalyuzhny
clientearth
babus
burish
britax
npes
plantée
chewer
dramamine
balgay
einig
allurement
counterfit
hafte
jamaran
vkr
levalley
jabbers
zebrawood
mathstar
mdds
outwoods
assembleias
molestor
storeowner
fetishize
rugani
wintner
shalgham
betrothals
supino
myspacing
vomitorium
hanit
arthrosis
tith
aehf
eerc
infields
bionz
arnowitt
kayf
tourmalines
tonning
wenzler
jingping
softballs
escomb
rundale
controversialists
eyeliners
malverns
sawani
kesho
centene
derrybeg
portio
fussiness
morrah
tahmima
nimir
surley
mereb
mijak
schyff
quanjude
shahier
kayanan
brumas
philosphers
mams
xtina
biomet
orpah
fingerpointing
merediths
unfeigned
orgad
gidden
melhuse
upnd
michella
strathmartine
herongate
ettajdid
decoturf
drossos
oxetane
kshirsagar
manseau
emson
aeltc
coraci
higgenbottom
sofield
wepf
borker
nanomechanical
zazai
jupille
ghussein
parricelli
cintiq
autopen
consultores
penetrable
copella
bugnion
gorenjska
suruchi
yejun
extell
roesgen
wibisono
merrihew
mullooly
buddenbrook
pistelli
tholl
zwecker
blogroll
somal
leeswood
zingale
grecos
savulescu
dyesol
fabozzi
pochards
papper
prototaxites
pfox
doughoregan
ovulated
blavat
raffaela
ballyhooed
sarisbury
chowdry
unki
ecall
gamebreaker
digitizers
digitalsports
horseriders
wunderteam
rosaryville
combee
weinshall
imagesoft
purwanto
remands
blekko
dembinski
bfsr
mandis
boeufs
maienschein
gladiatrix
kinbrace
xee
cregger
mcgahn
malkus
sovern
yankovich
sundsbø
cybex
rüttgers
sarky
matli
egoless
dwy
pruners
lungworm
collamore
porttitor
kucuk
unremovable
headingly
bandier
runback
tabú
urdapilleta
xrc
prosseda
namechecking
défilé
microwaveable
newbay
politbureau
renovables
cardioprotection
slimeball
distributer
schoot
lomeli
lazars
sharpish
mesnes
insitutions
kraska
mintimer
ticketholders
kehn
polical
stringencies
transship
bildeston
birky
transmen
clitorises
maasdriel
kaplans
tracinda
zilk
drapchi
tonello
unenhanced
uprichard
farter
gemfields
mambi
belnick
vidalin
mechling
ghida
wallpapering
corridan
dangrek
rhio
grandberry
colotto
nffe
flacon
glazman
jabby
gogerddan
dmos
rubinow
isobella
lootah
bruguier
thornbridge
lambersart
slather
passcodes
maturan
lawwell
stoschek
ersen
prechtl
stonestown
lengsfeld
crss
actuarially
lupercio
dimitrius
shizuishan
saharans
tanski
ghavami
zaslofsky
wakens
healthvault
boemre
outreaching
backstopping
ousters
paška
inhabitat
symphonists
grunenberg
underhandedly
treys
demann
naccarelli
kelava
alongi
pocar
mattrick
parritt
sherco
rwasa
reinebold
eastrington
gonaives
kwena
sogeti
unscrews
xke
beci
encryptor
cotoletta
belkhadem
traue
rafikov
anthropomorphize
chanoff
sergy
merideth
ravasio
airiness
nivins
centrify
manhours
avunculus
geekery
brisman
thway
hayfever
madhesis
dafter
unpleasantries
dickinsons
mulé
jeppestown
editis
klueh
torvik
lyndie
politicals
housesitter
anomoly
antiabortion
consitent
cytometers
sprouston
plaszow
frago
cynar
acti
cwmbwrla
manil
gazetta
killylea
junaibi
proactivity
abdulmajid
amalga
appetiser
miramshah
aquisition
polam
reljic
gárda
hamod
cristeta
velcade
prespecified
bodyboarders
hamzanama
arvan
disagee
croquant
bartal
healds
videx
clanchy
amoi
imoinda
riml
summerly
claisse
laboure
wildaid
briffault
barrymores
espys
scatterings
figurs
penybont
previti
karrine
chadway
kumudha
larish
ricchiuti
becomeing
choralis
trubshawe
frsb
kiza
vestara
meya
hartanto
fredricka
chloroformed
stanowski
drizzy
emler
jezz
doper
hitmaking
suprapto
jeser
igaming
cymbalta
weadock
sellam
chrd
ballykinler
stankovich
trass
mandale
dvoskin
pearled
haadi
benchill
cloran
karsay
villafana
bacelar
hustwit
ibeji
inventus
deregulatory
tirr
penumbras
colcannon
sodaro
reitmeier
kuentz
codexis
cindric
sabetta
elvs
drumaness
keqi
biehler
foresighted
cwmllynfell
zalloua
smarr
jenufa
ayorinde
musorgsky
metgod
gastropubs
kiriakos
columbines
antiracism
nezar
excercises
ivinson
kutlug
intersectoral
radicalising
torchio
squirters
akpinar
olgivanna
sudworth
domalpalli
alhamdulillah
malachias
podkarpacie
vqr
macacos
drumsurn
brandell
sidestreet
redisplay
sherchan
unmo
shotgunning
jaiku
tortorici
tabetha
englaro
alierta
konvicted
antiobesity
ailean
janaway
clubroot
theel
chalifour
mynt
guek
tillmon
pardilla
hagemeijer
steepen
plutoid
inhorn
magnetotelluric
critelli
harmonises
naotake
sesenta
helgerson
obesogens
unsell
yibai
protractors
lunesdale
lesesne
camperos
panoramix
wcbi
mwandishi
bourguignonne
composter
cossin
ebid
lambson
kilcrea
platek
ferngully
cutright
hadassa
riyale
perfectible
merouane
sheepherding
setian
morrocan
sampey
taohua
pasteurizing
shmarov
retreatment
perthnow
ologies
pivi
yente
rupprath
cachalia
lazim
allstream
hrsc
knabb
gilang
librizzi
kontrakt
edder
blacksummers
aminat
colage
syeed
mahfud
croson
tumukunde
petroleo
ryter
chloroprene
huitlacoche
poppleford
steephill
canala
supermercado
vilsmeier
oncall
maraud
mundanely
nonfactual
sakane
ornare
nihombashi
armini
atlantan
machlis
mcara
ferencsik
calitzdorp
relitigate
djellaba
houwelingen
nadaraja
berglin
nikolaevsk
pampore
mabarak
brotherson
chimpsky
freerice
glutamates
harpin
jested
mocan
stumpage
balmforth
kasule
karpluk
rashkow
maake
strigl
dagley
khazanchi
soltz
wertman
adta
stroughter
colicky
cavalcades
fsin
fiesp
bbbc
microbicidal
reappraising
graymont
hajian
ungi
fuction
maramotti
herchcovitch
celebrators
gridpoint
karamon
abdulhakim
ozkan
conciliators
beddows
tqt
electroplankton
dtos
caromed
initative
photoaging
tachographs
michelmersh
donehue
ndcs
qrr
hvala
undermain
hemophilus
sanghatana
uncleaned
gunters
wakeel
komarica
garabito
hetian
qmt
bollenbach
wittle
jaumann
fibo
rothgeb
ramda
pyrotechnicians
midmarket
bertschinger
undervalues
ljubinko
ajos
chavers
troqueer
izzadeen
collaring
bimanual
fogies
greenlander
cabat
ellertson
slocock
twibell
atalia
kartick
scandalising
akimova
nedam
dickleburgh
aplington
minisodes
tokaimura
unconcious
serasa
pasm
handbrakes
phanor
studiousness
clunkiness
sabangan
moonpie
kyota
mahvash
mudding
requalify
abdelwahid
thamesport
islero
rachleff
vancouverites
wouldham
tudose
zikim
woolcombe
baracks
mockbusters
maddicks
neuropsychologia
sweathogs
beraud
ongoings
portnall
nazarbaev
tegenkamp
stepsiblings
frankovich
powervm
juergensen
bayal
zahidi
hashmarks
halleux
ottoway
whitear
alasay
pargetter
brittish
berlijn
grotesqueness
dawidowski
theyskens
banyana
lowles
reconfirms
jaelen
dotsero
voegeli
enewsletter
diglis
medog
couraud
yulayev
insightfully
hcap
fremaux
joyfulness
moudarres
dopers
mandour
aesc
orlowsky
organix
ngabo
ereaders
impalpable
barszcz
buckmore
ottney
vecdi
epla
fobbs
bedsores
hedgehunter
longfields
amtrust
glogowski
medicalert
cinacalcet
colorfast
entrepreneurism
juppiter
figler
akharas
cycloramas
schi
heagy
soitec
doctore
ortez
kisseloff
bolivars
sheeni
barawe
fortunetelling
norona
amarillos
salamo
ahmadiya
browers
monneret
zhaoguo
odfjell
notorius
hanowski
juman
sambhogakaya
quickbird
federick
klockner
stojakovic
ciegas
worlingham
rasgas
chadors
overpopulate
konbu
trbn
calibrators
zhirov
hawelka
robiul
mcmissile
blackston
lunnon
boxhill
lakhwinder
weltschmerz
fasani
altounian
cladded
containerboard
sinfully
pavlides
jayatilleka
quinson
highweight
chacachacare
ryol
fumagillin
veligers
jabron
benzes
mothersill
hawkesley
daolin
verkündigung
letterbook
bakopoulos
crdb
aliano
bystrom
zesto
cosla
outworking
kleeck
birkland
yeay
catterline
gaeng
dainties
pacnet
anwan
bijlert
fidm
maisano
lachasse
downtimes
lapolice
cemm
mgus
unware
recontacted
eidinger
flounced
aiyegbeni
beav
culioli
jttf
norita
jermey
listecki
rodio
dabrowa
whup
egnor
kerwood
hanzala
montserratians
rebase
dumberer
analysys
nonmelanoma
sumka
gumbi
baugniet
respites
guruswamy
rosettenville
drumthwacket
quantin
trendiness
tarheels
puygrenier
redbelt
worstall
bellei
suzzanne
kleek
edmee
guidice
sedillo
olso
multimission
mclaverty
karelina
cablecast
cefepime
trevigiana
sathit
leavittsburg
vansummeren
knockbreda
kalfus
impudently
yengeni
smartdraw
friehling
pointwork
gatward
lilyturf
nogovitsyn
bobbe
efthymios
hughesy
philps
joola
krave
geisberg
hylonomus
centerman
taloned
malow
dibber
boanas
kipco
eisenhuth
chewits
maxia
soberania
listlessly
nyswaner
samoei
djangirov
bandwagons
thumbprints
wnyn
sarunas
myoglobinuria
tangibility
haski
artfulness
lagosta
lingdale
heugten
faustman
rebelliously
stacksteads
towhid
lamanda
toper
fayiz
stellman
vamoose
jewcy
murcheh
ressi
shunqing
ciolli
javarris
rmj
wises
burgerville
habig
kannywood
pawb
archibold
laciner
suneson
prevnar
borrachos
dinnigan
hitchock
kuhnle
agusto
dailytelegraph
stango
silvestrin
torregiani
proh
hinestroza
toted
cholewa
lackman
autists
faerch
crimsons
badme
fiancees
glorying
shimmying
dudfield
cheverton
likin
niloufar
charamba
mussarat
powerplays
texbook
mcgain
abdessemed
waldhorn
krolicki
detter
leurquin
mihas
pixetell
konw
shamik
nduom
redelivered
kosmix
approch
keyrings
korade
jaquette
gerischer
kcrs
shanwick
roomers
volutpat
biong
bavidge
birkman
kestin
chelsy
llanddaniel
duckers
pingleton
stofer
maricich
breard
getco
sabey
carpinello
sargentini
geovani
piershill
beixin
sanmar
lutzka
balestrieri
heptyl
dangermond
brilliantine
votebank
ncms
mikov
postfeminist
cbda
stylizing
andisheh
joella
lumpini
zaccardelli
telmatosaurus
bocephus
marrian
pierceton
akintunde
cutline
atheeb
agranov
dicksons
miad
sheered
korogocho
synthons
bazid
rossion
floodable
prawiro
moretown
gamesman
ropewalks
mulheim
desveaux
guardianships
campañas
jaschke
mcclave
aphl
privatair
agripino
aractingi
swinehart
putnum
mischler
becci
efps
foxhunters
paddleboards
ptbt
hongzhong
ritty
tozzo
porsch
wrenshall
minneiska
montbeliard
codell
emgrand
vissarionovich
puhinui
ahlbom
aning
hyt
kriger
demagnetized
selukwe
molestie
hadcock
selenoproteins
boyatt
theepan
pissant
paynton
firrhill
smuckers
cwj
lagrein
pitkanen
bukuru
wtu
velislav
drearily
pluri
drollet
tepidly
gaventa
evhen
bendu
apostolides
tebbitt
rasanayagam
seagreen
midelfort
foulard
soakai
santore
plassnik
protech
kozhara
alchemie
cliftons
susd
hellacious
telescreens
nugen
zingerman
barck
tuckingmill
echavarri
seanna
koyamaibole
communtiy
hornton
bardawil
cheleken
locorotondo
lapread
laria
blunderbusses
governate
arhp
evennett
thrashings
caldercruix
quirimbas
pokeman
viven
ovejas
lenwade
lasource
pastiched
larmina
catamite
faryd
ramadorai
bohnet
finex
quéré
semini
oldish
tejana
contente
chualar
bunac
rathwell
vivisectionists
merhav
meeder
dvorska
muntazir
rotgut
gaunless
yuguda
teneycke
riffworks
hemmerle
spankin
norweigian
eragny
myrmekiaphila
chaing
lungarno
scootin
taegutec
damningly
productivism
majoros
tarbock
phedre
prooves
frind
idioteque
slipware
eisgruber
ziffren
melittin
undular
ghozlan
carjackers
mccalliog
ultralow
platitudinous
concessioner
pornotube
kiltartan
nomee
outreau
barston
cilice
qaiyum
borroni
pmis
saboor
dowley
rsps
corbus
overcomplicating
fiell
aralsk
travelzoo
cdph
recomposing
trogontherii
chandak
supercasino
bedtimes
cusip
wdb
cwmgors
tohn
cheeseborough
trapido
sperandio
batka
xinjing
statpro
perfumo
newswomen
altinum
selvie
benchemsi
pevely
taxability
baghe
bofinger
kuchinsky
fanboyish
kaly
dowle
birrer
ctcl
girasoles
puked
beitler
trimspa
zeltingen
ailish
nespolo
cccd
gefter
quitclaimed
parcelforce
nbv
byaruhanga
deutrom
mussy
zeidel
inclinded
cachette
reegan
hamzat
psti
allegis
devauden
pacsun
qiwei
akerlund
elderspeak
xcellence
sheelah
unbolted
nesto
geeking
popover
dickes
resectable
iaam
cantabrigian
killey
veteri
xanterra
samouni
esgr
uncrc
awac
vsw
credability
rapetti
khudobin
beenham
mehus
anusorn
hyler
stanney
schubin
resnikoff
gibo
ncidq
foxtons
tippingpoint
bauger
beltzville
rinkside
mutuma
pedalos
disconnectedness
csikos
lozeau
oberhaus
aronne
ungroomed
dilruwan
lamboley
calleary
ulil
makhluf
flashiest
sanha
lukoff
unionport
busuioc
oddments
mihigo
gahl
terumo
tussocky
peskanov
perozzi
stignani
geor
serwotka
kleinzahler
taneycomo
moonacre
homelink
lydman
dumbshow
abdulwahid
elmy
shvat
affectionally
fungai
amande
sesno
glenallan
kcls
colacello
reformulates
castlebridge
rumbly
hollywell
craigston
leeum
sadis
gillerman
snuggling
seelaar
grealy
mcwhertor
uanu
sakiya
comacina
marggraff
unstinted
hirz
overutilization
reyle
arvans
bridi
oppressiveness
nerney
ingroia
schroff
laferriere
palladini
hallers
sunkissed
housecoat
fürmann
mortify
powr
msowoya
dorry
morupule
imperiling
byol
daquan
koranyi
disingenuity
zeitlinger
pandemrix
jasmila
exculpated
tardiff
fritwell
stewpot
kellwood
tolsta
forthampton
hongliang
ghadiri
hansdotter
robotized
patua
wuhayshi
bact
rassas
univercity
remicade
twinnie
naringin
pietramala
ladakhis
elsehwere
fdrs
kekexili
islamicized
roxwell
hanhardt
chihab
pasteurize
unring
chirla
wardana
pendet
coldhurst
bogarín
arbete
muhandis
chiren
splendore
naanee
spivet
peynaud
rosland
cmus
pdge
feniscowles
margairaz
llangrannog
chafets
thato
capitated
oudegracht
treston
cardsharp
oostendorp
callejeros
jesselyn
karadjordjevic
thania
hundered
celar
harishchandrachi
amisfield
particlarly
anouncement
purtzer
bmxers
nepc
juanjuan
polishuk
winborn
chewang
gatfield
easd
iglhrc
abdulqadir
jarrid
zovi
codecision
souffles
desko
agrs
hanretty
glenisla
actally
averis
backcasting
senner
snoras
fulminated
tator
yosser
bulevardul
striezelmarkt
hyperinflationary
killeshin
withouth
superbrand
loosley
marwani
lebenzon
chewier
carharrack
showin
petries
tueart
edrf
pgnig
suñol
throup
tibbott
ismailova
picnik
sundel
klonopin
jorde
fanene
lloy
merco
annuloplasty
montly
vlo
milcom
engross
jbe
shouldst
satiating
putley
freebasing
tjostolv
wijffels
cinemanow
francescana
giacobbi
wommack
nayda
priuses
starkest
ringuette
thurcaston
precog
mynatt
eurlings
yunding
alishba
glascote
ginen
kurras
complected
ipda
transmanche
realage
jeebus
bortolotto
purcey
salvant
huberto
kaidanov
mahboubeh
laugavegur
shuanghuan
elkview
yesil
flightstats
obadeyi
nasonov
felicio
zafarul
oversubscription
kupl
qadhafi
rollston
sarley
bedsides
speyrer
songtrack
klete
steepens
baddy
sportfive
lachrymal
nilles
kirklands
cecom
mmcc
stanyon
resuscitators
picturetel
ohlinger
eurocities
mlab
rizwana
wolferen
leiferkus
dewlaps
defrantz
tileworks
sarriegi
evjen
schaars
hadida
chillan
rexam
stooke
tongsun
dolcefino
gewertz
oneweb
gollon
macray
thalamotomy
schremp
sceptred
beechdale
milele
schulke
xiaohua
superiour
menoufia
ximeno
skyjacking
tranquillisers
befuddling
runion
mignons
manacapuru
compucom
hensick
shoutouts
goerne
ghettoised
veanne
superduper
trystin
ziama
monetise
flimsier
ingestible
sheley
lept
istre
wmmj
euthanise
solangi
measurment
merriest
leonatus
moscicki
stogdon
tyring
icewater
nalani
costamagna
truveo
sifry
trindad
vider
reflectography
genuflecting
painewebber
overcollection
kovalik
vontae
cenit
threespine
gruenewald
filmo
vocalink
unshorn
kinichi
khuram
woma
guixé
cranhill
misguidance
vibing
muzzin
chli
máncora
fanfarlo
borderer
deficiences
gruesomeness
discusing
blabs
mardani
treverton
charabancs
hubka
kinderland
shackford
hazin
sterilant
direc
hmic
benfit
yagman
crystalized
dehning
tahoes
forceable
anyango
olom
nontaxable
nonperforming
prinzi
venenatis
kameya
herterich
hrafnsson
roehler
loverdos
mcerlean
overule
scriptless
moochers
smartish
poag
poerio
microtomography
pufang
franses
crochets
lashbrook
samkange
grober
jamphel
vlcek
shionogi
rotschild
vapourised
snoad
moskin
swartley
panagi
pourville
madini
stayt
trucchio
thuds
magticom
shmueli
desperatly
lipinsky
sfda
zahri
craigentinny
clavulanate
reby
tweeden
vexillifer
nayeli
victimizers
hydrocephaly
margeson
burgeson
cannom
duvvuri
limning
wardsboro
omeir
zawacki
axelrad
minadeo
sakhra
ghalia
cinches
alabel
doodads
wadded
metacarta
commiserates
rodber
pasquesi
caryophyllus
inzko
satherley
inalienably
joose
traduced
mannelly
pillitteri
stetsons
burlando
evdokimova
gallante
thorstensson
unathletic
idodi
unitl
vriesendorp
oneview
soapsuds
whittamore
rewbell
zambonis
kadidal
misspending
subarus
basari
piperazines
mutula
daneshjoo
bridcut
minoso
fairoaks
malakh
chages
kornblith
viciedo
dulski
testamony
gumshoes
substorms
glower
stachowicz
urechean
samoëns
pruneda
polz
moxidectin
gendelman
summerhouses
ndem
induta
thethi
kiyoshiro
simeons
mieville
guevera
pembrokes
earthward
publised
possibe
outraised
chocs
multifold
viscri
pickier
mauresque
laparoscope
mizuuchi
dragset
hcrs
midflight
fedwire
novelis
fortville
meruelo
momar
libuse
brodt
hoecker
lasersoft
airbrushes
prasong
demarked
homeshare
ineffably
kickout
fvn
masoods
giggity
immunotherapeutic
liakhovich
despins
matthiola
gestations
gamov
agarwalla
pangsa
capio
tricolores
xrr
rainsberger
majal
begell
geerdes
treena
emmes
wesray
khazim
liothyronine
effecient
coliseums
andrejevs
biocare
wartman
undock
cheesewright
vollebæk
witthaya
arafiles
sekel
submachineguns
synnex
zollar
stonewood
frittering
regbo
symbio
fenestrations
husbanding
newmoon
zyg
villabate
ortel
hakainde
woylie
feltonville
asbs
baobao
weighings
hatemongering
rieseberg
tuiaki
pachl
pinafores
gyd
belaieff
grilse
hirigoyen
canings
huntsworth
zourabichvili
boppish
buhund
cohabitant
ssmc
felicisimo
bartis
historicos
shardlake
ardhendu
stadhampton
turps
witasick
transfat
shecter
harindra
nominaton
mapit
nvra
matavesi
brookstreet
excersize
wynnton
talkov
harmond
welshness
amigorena
xaba
araca
tartufo
chidchob
hedl
cressi
casd
pyridostigmine
relining
tacul
vialone
techart
thaleia
khudair
domanick
kieselstein
waniek
elbaneh
knijff
divani
luofu
shatha
rahill
auxillary
aight
wainy
ghio
ecta
sensitising
unscrambling
vaccinator
poletto
constantinidis
tilshead
anniversay
xactly
madeirans
haunters
weihenmayer
lewenza
binham
andringa
eschool
amtc
schmetterer
kaltenberg
vylegzhanin
guardedly
deprogram
inlcuded
eurocentral
macmerry
popejoy
ensing
laurentic
sandretto
garthmyl
jamileh
zemmouri
forsgate
tetracaine
madadian
anyar
graeter
kracher
economos
demodulators
jcom
glunz
yawed
glusman
tolchinsky
pogan
mitofsky
metacrawler
overnighters
sunquest
groms
mukhamedov
hmshost
oneword
wintonotitan
superorganisms
gunatilleke
jetfighter
newmachar
ittiam
glof
shrillness
solero
fonyo
phrathat
superclubs
kirzhach
monesi
swach
scolopendrium
verisk
bachelart
küntzel
unrebutted
maralal
coffeeberry
europhile
razzamatazz
yeyo
lemoncello
drumquin
threatning
tegtmeier
chukchis
wongsuwan
deludes
jolee
equibase
chandila
transportational
unhittable
toadying
shukran
spilhaus
gianopulos
sveningsson
oppertunity
wawan
teruhisa
rahv
avellini
hameldon
firesheep
jidosha
wordnik
triggermen
campanaro
rumbelows
glaud
nabatiyeh
beashel
veiko
minesite
nower
triarc
emosi
wildi
pandrol
srmg
fluoridate
parmet
krongard
octs
dalesandro
allocca
sehir
millier
cofadeh
hadjadj
blumine
haeussler
llanelian
manross
tagaz
sujeewa
mouglalis
bonu
bocog
manojlovic
larossa
koffiefontein
visitorship
caringbridge
milimeter
gsee
fukumi
ligron
tsip
rustles
mediavest
fdis
zamansky
ljubicic
purkersdorf
croe
varsseveld
fintray
guita
illington
guntheri
midon
apocalyptical
vitalized
grabois
loopholed
fulgenzio
padro
yangchun
golwg
dayong
giannattasio
suzyn
khalaji
ibata
khwazakhela
stroth
huidong
llanrhidian
slioch
kronick
rumblers
dasain
bodett
pivarnick
shirleys
albou
kerrygold
zeshan
absinthes
cpms
scelzo
etzler
rompaey
forgemasters
xiaolei
peerenboom
lounged
lliswerry
ghayur
migel
unbought
mecia
heggy
fehlbaum
birthparents
videocast
macanthony
famadihana
kraeutler
shiregreen
raeva
xudong
blanefield
elderhostel
hurowitz
falash
navidi
withour
samro
wiebo
elspa
goodweather
sportcaster
stilian
pdab
fiszer
moushmi
mcsporran
compulsary
rineke
fawnskin
krue
indispensably
caveated
grrrrr
foodchain
philamlife
nelc
bjordal
amusa
milio
bredel
lebeouf
littelfuse
kildea
sulforaphane
dialler
theilen
piroxicam
flordia
tenaris
vagator
russak
haplessly
eckerle
poplock
decendents
lassman
hinkelien
jarba
fatsis
homm
takotna
baldonado
aglietti
guelman
erhebung
glapion
sizzled
tomaszewska
redmill
mcflurry
offish
sukadana
benish
multiannual
maesglas
maidique
reagin
gillioz
sentaku
guadalest
demings
langill
vedera
censeo
endplay
fristoe
mixel
uncontainable
wieght
makas
delicates
navision
cioccolato
rasslin
wisdens
bershawn
malvertising
chunyan
wyberton
matchstalk
winklevosses
gaylard
registe
marnhull
jiansheng
tokes
reigon
revalidate
ofatumumab
atypicals
bewilders
sleestaks
redware
oxenhorn
delphia
janessa
sharbi
arabisation
rodnyansky
weleda
spinny
tidrow
mikeal
dipg
durose
detents
beitel
cemig
schoolbag
weybosset
xboxes
lmno
goldklang
iseran
fizdale
mcfie
neurohormones
kitchings
infinitus
seabank
warchest
chumming
fredell
neonatologists
arbelaez
dragonera
yankowskas
prevaricating
prosecuter
casten
mansuriya
treneman
synta
secluding
lockhead
kassirer
beltzner
europeaid
bonisteel
villopoto
lietzau
tighty
pandoro
parahydrogen
elefsis
chedzoy
cortas
flimwell
artuso
kallin
kritzman
trifan
glisters
grais
ungraceful
foscote
peronal
sciullo
musos
troyen
inoguchi
segalla
dudarova
irell
esspecially
fforestfach
scheringa
ricciardone
emmens
belykh
muturi
ahmadullah
caldmore
purdin
kubotan
teletubby
richarz
belous
multifunctionality
luben
queniborough
declarers
onces
jayda
tenncare
willcom
laboon
vanishings
plse
purees
successfactors
palenzuela
haría
dakins
payoneer
rtpi
hyannisport
crosscurrent
pricy
overacted
clais
contextualises
proliteracy
arterburn
hongu
muckelroy
extenze
neten
winnacunnet
schweihs
odalys
lourey
fakeness
lipszyc
tredworth
terui
borcherding
guiders
floozies
marvelling
dravot
loughcrew
inititally
komomo
herzigova
wyvis
ammendments
cymdeithasol
webfetti
grillini
stroem
bedner
oten
mattityahu
caladesi
lynard
tauer
burradon
nullahs
schoeler
myspacetv
annahar
banyans
phakic
wenfei
cheroot
breazeal
sonke
tiep
liwonde
diomansy
schumi
tonni
rephotography
pitchside
rushanara
aimster
calabrians
recordation
austrialian
achike
plasmoid
swanigan
hushpuppies
licencees
chromos
kiszely
hackescher
burhakaba
barung
rekowski
streched
tandas
oniani
intermittant
yawned
soojin
thurtle
jalapao
morgano
dadds
verolme
grigonis
pilhofer
chics
odeen
nrpa
alvilde
everblue
phear
manjinder
lortz
promiseland
informatin
smashmouth
zouhair
trusthouse
prizefighters
burhou
ropotamo
geggan
thileepan
rudoy
behre
gomphothere
attik
kimpel
sorzano
rudby
rasburicase
wdk
adhami
bloomy
cresci
dairymaid
pollies
rescuecom
salke
lilliana
ratemaking
abigael
plenaries
axley
peelings
bakry
workshy
orkopoulos
mulwray
consultee
barreno
fehmida
gwtw
wickaninnish
cchit
neighborworks
ymba
offtake
gottfred
fenthion
symyx
wrigglers
vulcanised
distillerie
oneconnect
szydlowski
transmorphers
peshwar
nargess
freeflight
bernardez
mbos
nativities
bowett
reback
semprun
mezhgan
potholed
conna
muzzammil
oberly
fange
fixmystreet
fraternising
petibon
ectd
twistable
bagatela
findlen
galerne
tatties
hemingses
andolsek
jija
aberpergwm
inrix
dicosmo
denunzio
fishtoft
haux
fiving
nijrab
catinca
yajun
glycemia
telnack
casani
tureens
brixmis
takwa
legalzoom
uhrmann
hershenson
playdough
funeralcare
sutardja
cbeyond
bugala
jersy
kaniz
filched
scampers
kontron
condorelli
palsas
chugger
brezinski
instrumentalized
sothoron
warks
israely
wassila
bejesus
tranio
acurately
liathach
mutarelli
serreze
pepole
konyves
centerbrook
excon
eigenmode
mimouni
payes
goshka
ragtop
amsc
shonga
shapewear
vullo
cheret
fitou
sunjay
barrit
teoma
carmontelle
ctol
dzhabrail
rapleaf
gusterson
tpcs
guerzoni
bigombe
llobera
hordichuk
herbsaint
oxyfuel
kishmaria
spotfire
rangle
baltimorean
ituran
culinaria
jatuporn
isofix
chelf
chre
dehousse
gymkhanas
scaparrotti
boaster
nafai
stimming
kobborg
danieal
huggles
changthang
ashtons
eufemiano
landsdown
omnisexual
zess
percipient
paulita
eversheds
curado
achnashellach
bbbs
lrmc
reynoldston
woessner
divaldo
weibrecht
netplay
macbird
aglieri
outq
mountnessing
tores
lamur
zaborski
lopinto
gibor
winey
safarik
proceratosaurus
beretania
agasi
restefond
tiedt
parten
painfull
microcurrent
bulgin
ringgo
akard
billik
componentone
ladens
potashcorp
ridgeon
grijpstra
arculli
wislawa
cardsharps
frataxin
dateland
yerushalaim
bargepole
licuria
kinyara
kureshi
ballog
seemd
walster
galamaz
prerow
acebes
chemring
stepover
bienkowski
asness
krsko
fastercures
swingometer
yangchen
sefl
götschl
servanda
bolarinwa
komfort
mofcom
wära
marette
sportsnite
monitory
maunders
daroff
chmagh
jlens
overstocking
ketteringham
fgx
chartchai
himadri
shimmies
niga
accoona
petion
clariden
shuqing
gardenesque
biegenwald
mfdc
meunière
svankmajer
pfab
hookway
victimise
busken
rosebushes
rimed
specfp
castrations
ultimatly
vatskalis
yids
radcom
sanela
feazel
educable
ambiences
sativex
anone
favrile
neurobiologists
trigged
gilliss
basista
laipply
shockhound
seasalter
hafidh
decipherer
kayson
ebere
momon
ciccarone
neau
incestual
seewagen
parling
merka
danario
dendur
fantan
intifadas
yeat
fensom
cohabitate
millenarians
ethinic
zabumba
ocassion
thistlegorm
servicenation
xoops
wsbc
griessel
elfish
zakrajsek
iftikar
acclimatizing
insertable
cholesteric
infantilization
getu
cinemex
heeks
hjelmeset
solinsky
spraker
tortes
melini
cimc
svetlova
dervaes
magmic
kielholz
nty
paragominas
willye
riverhills
santillo
grudzien
autoliv
jinko
loudhailer
taxotere
peopl
hakani
globins
athenee
archvillain
bryncethin
djenne
bredenbury
härö
manderino
getzville
elsenburg
peradventure
krei
swiat
senafe
carolling
deephaven
tvcs
danescourt
nellemann
turtledoves
ljubisa
traffick
rutto
roels
ozal
grippy
exagerating
pompoms
ither
kuiken
companionway
neziri
podlesh
vidanov
enodoc
hausas
lenadoon
twirly
spymonkey
talbiya
hayan
pellett
orab
laramee
lumieres
gautrey
balky
buckbee
xijing
haleema
sinnet
costliness
méret
finfer
euromanx
sandersons
wcwj
pougatch
baocheng
sacredly
caixin
beeker
birthrights
systemised
aucklander
wamakko
patronizingly
incans
aibek
poron
kasirye
strewth
liberalizations
groundouts
jaquith
landmannalaugar
taburiente
bulwinkle
vdh
kirp
javitz
ouertani
balach
metabolix
sexily
refiguring
gestair
aasld
cosl
antiporda
molmenti
lacunas
gwenyth
populistic
bopet
tulketh
wildbirds
mirrorshades
kutahya
sonnenuhr
bensin
securitize
zawe
dmvs
dovell
bradpole
cloggers
acropole
tetsworth
kyaing
sidero
dungavel
maltsters
clariion
dapagliflozin
yesipova
aronda
troedsson
kwambai
frob
chevannes
trusteeships
desalvio
rezwan
jokesters
hisself
gaymard
riecken
argott
superfreak
vidcast
ruetten
hairbrushes
aparent
junchang
tyhe
alshaya
alsf
pargat
parimarjan
sjf
molendinar
pundole
chatauqua
scolt
mobisode
siobhain
restuarant
malallah
unreadably
coinstar
gstp
frischer
azumazeki
exaltations
franjieh
meskerem
loizeaux
verdecchia
cottas
puthan
channings
chukwumerije
colthorpe
unintelligibly
sherazi
teamsheet
agoro
shamini
eshaya
elementis
almaco
fascinator
maximón
saganaki
trostle
fibulas
ababio
sifs
posssible
niedner
fogbound
settergren
trollers
hereu
roggensack
jhonas
leiths
maurkice
subotic
amhp
boachie
mlungisi
bewitches
digitals
enzyte
undersupplied
mirones
arabatzis
feudals
hatef
witkoff
samogon
shipmanagement
derbyhaven
pigeonnier
rockschool
mared
geeked
nanosolar
harlyn
bedlow
kadum
filtrona
scrase
planells
simplyhealth
nymeyer
lisey
boycotters
ljubic
defang
iqm
dunbavin
centerbridge
plumpjack
irrepressibly
hanah
overoptimistic
chepchumba
cossall
culotta
abiamiri
meys
psalidas
waterstreet
leivers
headcounts
pummelo
knobe
stathatos
chaddi
fricassee
chfc
inveighing
ruyle
shopkeeping
ahould
boumerdes
bolsterstone
douz
rautureau
ranelate
arpaia
envenomed
carth
majorite
lousiana
diptyque
odighizuwa
shenoi
kozmus
pison
unspecialised
baosheng
olesia
intrusively
nekipelov
kamy
cicchitto
mslt
paux
froncysyllte
klensch
nylen
cryovac
arzano
faycal
esperanca
rakishev
heshima
aschan
mcglothin
pcoip
ziqiang
khemiri
minidress
rajastan
gumercindo
efrati
norcot
supercapitalism
zestful
shoob
nonstate
smolar
lesters
feiring
milds
eyrum
unteachable
mineworker
exhilaratingly
myelomeningocele
sitbon
powellii
indofood
alego
zahia
gueiros
haula
sipper
cabeus
buschel
monomaniac
keed
freiras
naek
hurney
englande
dramat
morawska
lanzaro
sudova
meshumar
fundies
ozunu
permut
hapshash
onasis
betão
xiaopei
loval
minghua
bohorquez
sidener
hemminger
fsid
maggoty
diveroli
colourant
gemmy
caulcrick
tosatti
hydrocharis
zhenxing
nyctv
koup
siimes
fojut
figdor
chuckers
gayeton
liniang
stadelman
yathreb
fingland
havenco
flirtatiousness
burkhoff
baksheesh
torley
developping
yassini
bogusevic
alabau
aquada
featherwork
arbours
gilfeather
lervik
rowhedge
superport
deluges
massone
mawsynram
acasa
scarsbrook
moulinsart
scif
kircus
tyreek
inania
gulino
valueclick
unifiers
eichholtz
kobar
coquetdale
improvidently
jakez
pradas
earwicker
qargha
mentel
fionia
hawesville
ponchaud
esolar
stalmine
borghezio
moistness
edutrust
othewise
bamler
ranque
forray
kilchoan
moim
dcct
adulteries
sosua
auditee
acsb
cahirciveen
strickly
lochtefeld
descisions
orsborn
querencia
uyesugi
ajumogobia
appelhans
barrena
dapibus
qingchuan
chelston
deputizes
dilwar
squealed
millionnaire
shearston
remedia
amrutlal
nerad
rossburg
prettied
dreifus
sportiness
savella
kiiza
kashkin
croxon
unflyable
andother
endodontists
katen
thsoe
ballygally
hutman
calafeteanu
hypermasculinity
balhaf
woodnesborough
gecf
buchser
opeiu
norene
gavina
vergie
absurda
thickett
signalers
passop
saeson
verticle
mohajirs
stamsnijder
theplanet
heatherdown
lataillade
billys
toxigenic
urbal
admonitory
khbs
reuveni
karpan
kukes
nyana
zarlink
pongsaklek
kriesel
enjoyability
faskally
codefendant
declaims
joyrider
cantinero
trevarno
wantz
jahagirdar
noorullah
sokolovic
gouze
viemeister
oelschig
bienwald
biotime
ilgenfritz
rasoulof
middlehurst
mastrogiacomo
reburials
clickjacking
miskimmin
gruman
leyh
bradsher
gabetti
roenbergensis
autoworker
candrea
bdds
hemorrhaged
westerkirk
binman
nimbler
affalterbach
farmdale
convinient
forfend
praxedis
tantaros
icaac
boleyns
kashou
linchpins
bernabo
jelled
hardeners
frends
mgx
zian
brevetoxin
opression
aaus
apkws
mellersh
bartlam
roundshot
topis
gerindra
cheseborough
boerman
transamerican
schiferli
chdi
engorge
smashbox
prinsendam
knobhead
greu
preselect
saharkhiz
neily
juett
reconnaisance
dobrzanski
koofi
ortmeyer
imprisonable
ipplepen
gentilozzi
getaneh
pyleva
zengo
memsahib
yappy
serouj
familys
gusteau
predicating
kralovec
microvolt
transmontanus
strief
serrie
anoosh
yahiye
ninas
perurail
quickenborne
saaeed
moneysavingexpert
cherenchikov
mccollin
heca
masahashi
bookstalls
simbol
announcment
jnpt
kikay
shely
winai
guayabera
skirling
bouzereau
groupwide
mezes
chuprov
masculinizing
radwanska
winegrape
huget
tanhouse
akenhead
stalkerish
beguiles
wallsten
piratpartiet
guiberson
colak
pacuare
wahhabists
kortz
meisch
curlicues
plunking
hardihood
jamus
madell
tullian
mitered
neverfail
jerseyman
pianigiani
titv
aspirins
epke
solomonov
travner
germino
mountainbiking
lortkipanidze
disciplinarians
schnoebelen
metastasizes
troake
mccatty
shellacking
haasis
rodrique
dorette
armazones
kymberly
ftsa
quinacrine
karsen
thurairajah
mumpower
rukman
churcham
unachieved
paunovic
bookspan
endurable
pastilla
poncirus
madeleines
troest
channellock
cratfield
boadi
sashka
gehri
seefeldt
backstops
scheinin
properness
cicek
blumenherst
sinuously
rockpools
korto
revercomb
hammoudi
contorts
eisenhofer
blacksea
panbanisha
skemp
retests
karamehmet
empathically
diegnan
ramnarain
mochovce
ankergren
strole
faceman
bográn
comman
radhames
rosmarino
spiropoulos
unfortuately
oovoo
lubetzky
chukarin
alpenglow
spycher
sycomore
petrosky
telemax
tardily
jeret
veenhuizen
craniums
twatt
towable
beardsworth
wduq
unobservant
konarka
picograms
snuppy
mangelsdorf
kushnick
träumerei
damery
travelgate
ndpb
xinran
rehabs
littlefeather
identifing
sickie
nkh
zillig
artayev
telecomunicazioni
pichugin
zollman
kabalagala
boscarino
kecia
ezrahi
rodgerson
notebaert
pibil
hupe
bushton
tidiest
ccxr
farshchian
rosemberg
resales
derrogatory
buntu
himslef
medialink
vitaya
kimchee
nazarena
salewicz
twell
klouman
bedian
sequens
blaszczak
earbud
upadhya
springiness
kaffeeklatsch
glcc
langum
numberplates
herzstein
catalunyan
mandjou
zuul
longborough
shieldaig
ecfs
krisch
sirba
gww
mikela
casteen
inambari
zuberbühler
vatukoula
hizmetleri
diffcult
arnove
mejid
sahimi
petroleums
impregnability
llwybr
magorian
faps
tervela
tayr
acccount
inchture
dextera
dmpk
delmi
qinglian
xtended
stiemsma
crispbread
mcrobie
robesonia
irniq
unclarified
opportunely
kewstoke
damadola
danamon
twango
sonderborg
stargroves
openskies
drewsen
irwa
gielinor
luridly
kaatz
tweetup
dentler
ginsparg
craan
partain
eyob
rodzinski
stairmaster
liberzon
luigs
jirau
patientslikeme
windcatcher
springall
zoumana
gledson
rellenos
starstrukk
reichler
rundles
pasando
zmh
laoreet
maydown
ruffley
ousama
ergocalciferol
dellapergola
rabczewska
lacob
iguacu
infed
ebri
heijde
aggs
chunming
landspeeder
churchfields
hacksaws
vermiculture
ashdale
carwile
specked
severgnini
longsuffering
liyong
cheesiest
kennacraig
lindzey
dussen
kadak
deslatte
earthday
guden
sniffen
groskreutz
arkoma
beaute
acorda
tscherrig
hawkyard
mungle
synthroid
counterpuncher
danenberg
molseed
wixams
chugged
bylaugh
kroschel
corones
tallack
sehbai
fiaz
caolan
haywain
suances
dsts
haruf
valmik
moumou
concerened
tahani
stuffers
undervotes
vertrue
mogielnicki
compellent
exercize
switek
gegenschein
nonpareils
bishil
ungifted
laines
cybercafé
croplife
overproduce
rwdsu
onli
folzenlogen
unbeatens
geebee
sherford
guanglie
jacobello
minsterworth
ballis
lennick
bunfight
kilimnik
bristows
zahradil
libertinage
paleoclimatologist
humberhead
osayomi
tenents
attara
sabiu
strattera
dohnal
networx
mitsuwa
mayis
muffley
aduc
smeekens
kupczyk
qingwei
hoshiyar
financieros
bhanji
csrd
edvige
uncracked
camboriu
sadberge
sportshift
shopzilla
reinsured
ersol
fictionwise
schweder
decomp
dinwiddy
tankleff
nscd
restage
jordao
helgemo
poyiadjis
reinjected
fixedly
halloweens
cyi
fernau
munters
fcms
cruso
enterotoxigenic
thurlestone
dysgu
brako
esfa
deiniolen
cunis
goodfellows
reginal
wijesinha
mesmerise
interconnectors
genthod
kuishinbo
dauster
viropharma
qingping
hatzistergos
bertho
demulder
ampd
mikulic
oduwole
akete
kollars
stortoni
rakotomalala
betadine
hings
consolmagno
cheen
arefe
rthe
maryi
milkers
fringers
tuitama
supervolcanoes
verzbicas
exequatur
jampol
supertitles
mashego
griswell
necton
hemispherx
sedgehill
sotik
germicide
familicides
zamick
tahona
briden
hocknull
byat
aldercar
pitilessly
janeczek
naqoyqatsi
mchaffie
lubavitchers
systematise
mikheev
samkon
gasters
fleeson
excoriate
lienard
supersonically
moeritherium
coorey
heggem
forestalls
foat
whartons
pffft
shevell
lievesley
momber
daejon
douthitt
staccatos
actimel
maddelena
breighton
unpalatability
highend
corma
cynergy
jagtar
demeester
europarc
fastcat
maryn
bancolombia
btcv
dispicable
abdah
eskisehirspor
vesuvian
schalit
pidd
sunnyview
wwrd
gharibi
uggh
miquon
roddon
rasid
fendel
newzbin
mostest
golen
aglietta
damsons
byetta
ypfb
sabermetricians
sterger
rofé
paxinos
trevillion
heidbrink
kokslien
granacci
vacchiano
teber
fenstermaker
yanchi
teleco
nonbiological
maise
ghiraldini
kimizuka
mormans
cioccio
achache
cordi
bernett
dromm
junmai
heycock
wouldbe
dreaminess
atherectomy
conexion
mazzocco
scharfstein
yolles
lubero
shagwell
scooting
enguri
svejk
mardom
theyer
demitasse
rommen
gumatj
wooroonooran
limelife
aprille
parcher
hopey
setka
savouries
sulca
langold
lurgashall
thrailkill
zhenlin
allou
mamirauá
continuances
rayovac
komisarz
hamoked
bullhorns
maidenberg
fawer
duffaut
lobolo
maktum
kirm
dallat
floortime
stuckmann
soliai
phuntsho
siemers
panaceas
keyholder
selvedge
izumida
mukadam
bailu
egusquiza
astorg
mattru
assasinations
dimitrenko
thomasnet
toopi
bocchini
kosier
stellone
daurio
speedbump
seggerman
fourtou
wineberg
postema
physiologies
soetens
solectron
pobst
proliferations
resler
babassu
milbrath
sockington
sordidness
beaford
tumminia
ekkehart
santofimio
sucedió
birchley
benichou
wouold
repligen
epyon
olshan
panitumumab
carsick
weggen
bouake
mernagh
lingbi
reprobates
coughenour
beethovenian
aberafan
uralde
mohmmad
conill
comcel
snooped
resourcefully
imberhorne
hypobaric
criminogenic
couchettes
dribblers
blesma
jumalon
lorek
giau
newstar
facci
yuguang
cobler
trullie
zarnesti
shvedov
hainey
hatsuko
receipient
eggishorn
cardiometabolic
polsinelli
poxon
wpfw
benac
chudy
byelaw
aristodemo
caenby
vaugrenard
choppily
alcotts
ismaiel
velardi
agyapong
overwatering
panjwai
screenless
thinkbox
colubrine
anglophobe
estatic
medai
abdelali
gainsaying
lengshuijiang
unshelled
eurid
walmarts
greenbook
yujun
marakesh
sportsound
deminers
jenice
copart
poults
agj
piggybacks
photoplus
sollis
cosper
ooyala
mangement
qfn
suddath
ichilov
corlough
silverfox
cargolifter
cullingham
matier
gaughran
flumist
untaught
procuratorial
shortstown
shichimi
pragnell
ariels
streamium
turquet
showerhead
tastykake
mohadi
darcos
topfield
februrary
gebara
belgraders
lazari
loai
luckin
khakwani
zhouli
gentzler
cafemom
sequestrate
kaspa
coiffures
hawaiiana
wixted
boulcott
demartin
reenergized
pixelsense
diminutions
sugdens
denars
successtech
feroni
attwooll
appose
lobotomize
adjured
romanticising
pyrroloquinoline
pancaro
naamani
rannazzisi
pierrelatte
treponemal
turbonick
paluku
matrícula
yakoub
llanelwedd
spatters
dofe
programe
duenyas
asmin
vagnini
fishtailing
rodius
koroneia
fasuba
kefi
nietzche
indrio
caddock
healthmap
sellen
schwartzmann
gunster
shukria
dukoff
schickendantz
nonn
oldness
breakings
ywha
craning
xiaosheng
cedu
bway
zackie
gainsco
sigmundsson
ramezan
kleeblatt
wartsila
planetologist
tollgates
succussion
jaffarabad
stankey
sobig
hessin
danot
humidifying
myojo
metsing
henken
ahuva
joanny
fahrer
anthropomorphisms
unsafely
crewcut
krz
fadavi
longball
mokwa
densley
mmac
bfrs
weans
zatopek
palacky
ctsi
grozier
arrey
turbodiesels
frisé
quitline
directlink
usich
gitti
hyn
tananta
cluver
spaccarotella
rumel
pignone
philson
somboon
carecen
basiji
christkindlmarkt
hematologists
redetermination
budzinski
preens
falen
envio
texican
productized
mgaloblishvili
shashamane
inmotion
dawodu
marzuq
mythomania
walgren
bairo
beyound
bikindi
zardad
lawall
leweb
aniya
msil
kynance
saalim
zacek
sutters
milpark
atlanticism
warnie
americanness
jinlin
woudstra
udawatte
cloughton
danbert
dabeli
trich
handholding
aorn
doeringer
oxholm
affliation
barrineau
suncom
cottoned
homecrest
prettejohn
lindenberger
studham
dragonaires
gibara
lehrke
eljero
thunks
proporz
epcm
brunonia
earworms
viewmaster
atcher
whant
ulery
tofurky
zentsov
wisewood
wahabbi
rumbustious
parkdean
izembek
esdale
carrafa
blaxill
frieslandcampina
brawer
yuks
qarnain
livedrive
siplin
qtopia
philex
rehabilitators
fomboni
searingly
bloxworth
beofre
availablity
geneina
azacitidine
osinde
continious
mahata
sungnyemun
damirchi
vcj
basils
alaron
naccho
boozed
carigali
redresses
yelda
carithers
menuez
owlets
volynets
hoveringham
gierhart
castleview
fauchald
boldwood
pantic
newsie
aeroscout
lamama
firewalling
upcs
pischetsrieder
warmings
belstaff
familes
envia
deliverymen
theth
pastie
hataway
cemevi
mesco
obtrusively
embryologists
beverlywood
volochkova
canipe
talkingpointsmemo
sejour
djoudi
bohio
petteril
maqsoud
heinisch
lols
kutralam
micó
canlis
hibner
misinformative
testees
reisberg
chapur
krendl
bosche
territorian
bienenfeld
craigroyston
cagnotto
institutionalising
roguery
thumma
kneedler
kinvig
durants
lohrke
motsi
coater
romanby
ritcheson
thevenin
densuke
kampani
anhydrobiosis
amnesiacs
tadek
aljs
messanger
ggbs
overabundant
buckels
kerlon
contesters
trilli
postition
damazin
nisenson
muscatelli
tolfree
brosio
circs
esmod
rhinog
vétheuil
medinas
subotzky
microgrants
alacer
mittelman
iise
belongers
badmouths
prax
ketam
mechanise
kahney
rheumatologists
hellebores
fullfil
plasticisers
valorized
swifton
beinhart
denève
esab
multicasts
lindborg
dermontti
pabla
staropramen
louwerse
maloofs
vosshall
cisternino
markopoulou
freegans
fervency
tambal
arenson
ramkalawan
masalai
woodstone
mahroug
ladyboys
bgz
chalkstone
showhouse
offcial
culleton
brutinel
zendai
bernando
makubuya
beledweyn
cpci
fantastik
geothermic
jazzers
chabris
rozes
kastari
hwu
tereska
jerseyan
gansta
weigold
fizzes
mursley
descants
sourasky
hazer
wanandi
weissert
bridgespan
fourniret
sedately
kapowsin
karabel
ktvn
sucuk
aljofree
polyarchy
equivocations
witchita
prozone
welday
castucci
cristalino
teabaggers
overweighting
designedly
anmd
effortlessness
krawcheck
shutts
niyamgiri
gohouri
suchiate
allihies
manikfan
tazo
yousri
tanovic
voinjama
kelvinbridge
wheeltappers
authorless
twineham
reseating
uploadable
ladya
fraynd
dardar
doozie
ambuhl
chapulines
meryon
gartshore
woodacre
alkermes
coberley
soaper
hadhari
kwiat
pollastri
windler
cristen
farecards
overlearning
villacis
berlau
yardwork
hadja
bonarelli
derenzo
turnes
miyakonojo
voltec
itogi
ramdass
lorello
walsoken
igcp
springmann
odalisques
nénette
avigail
politicaly
kazam
linothorax
reblochon
filberts
longformacus
camaya
makriev
bogomila
geolocator
dysmotility
heane
kanhar
talkbacks
sabeen
chievres
lillico
wedgbury
levete
zalkind
cutely
shufflin
habayit
carfilzomib
castelaz
hoovervilles
nurc
morange
ffo
dhurandhar
stahle
chacin
dlco
karabits
quaff
indemnifying
eggli
tautness
akhromeyev
suhre
grainthorpe
narec
sculled
paperworks
sonofabitch
nonhazardous
wanstall
nabilone
beastiality
ecoop
outie
medicalisation
atchity
macroregions
langeloth
gordien
sugeng
oschmann
biener
darrian
daswani
shanaka
ibanga
ictqatar
ockerby
suckler
troublous
chumbe
yakum
gunks
strouss
floricultural
galloon
presentence
earline
paramjeet
pontygwaith
tiphaine
turkevich
ehrenkrantz
whiteknighttwo
sianis
ploner
womanizers
dratsang
eejit
lastings
rothbaum
chikin
delva
tormarton
radicalise
magnetocaloric
boockvar
rehr
colliton
heimans
ackergill
khadijeh
makgatho
manicurists
bavaud
solage
saleiro
teetotallers
leisurewear
kindertransports
autocare
kilonewtons
servies
obradovich
msst
funkiness
capsulitis
noisiness
peaceforce
nickolls
sagrantino
arcalis
budhia
apartness
shortcutting
mcaleenan
novotna
cinephilia
emta
montanas
enkhbat
uspi
mahnken
siry
sunrice
lza
twinset
sarara
roseannadanna
russianness
makudi
batpod
nakazaki
maguiresbridge
kariobangi
apportions
hansheng
viko
lsrs
critisim
newboy
bethard
adgp
bruley
offchurch
averna
canisbay
chipolopolo
batsmanship
guajillo
traka
johson
thougt
kidani
viscaya
shiveluch
elving
begums
ngobe
corjova
bamian
soderblom
cfcc
perilipin
implanon
gybe
entrapments
wtwp
vassilenko
commmon
foggers
bromery
fourt
methylxanthine
pukach
chyulu
krisher
blueshirt
thicklip
phreakers
clappy
halkias
decrow
frolik
sudderth
fiki
klotho
ilaoa
liebesverbot
vinegary
forcemeat
gibel
deodorizing
oregan
panousis
rcuk
marsannay
unconscionably
dodji
prono
fulde
efsi
nesci
nbcs
zakri
jonckheer
eyser
subtextual
yuyi
hcfa
glossip
favazza
uniphase
tweeking
absar
koblin
outgun
yeasty
zantac
shafiee
mattiace
rinpoches
ospel
handcycling
anoc
gevel
bouron
lebling
uplander
jamora
linpus
ineed
esquith
vlccs
punaro
shopowner
shaktar
bereavements
osala
drwal
dissimulate
streitfeld
dalbandin
sydes
mischaracterise
steingarten
percussively
sogecable
stonethwaite
dabbah
conformality
beaufret
weedpatch
honko
cowlin
viatrovych
locationfree
munnell
rasit
wielaert
sumptuousness
lubarsky
valenki
sauntering
peebleshire
brancowitz
panthongtae
accha
dodgeballs
brendanawicz
linebrink
lispro
enkianthus
despoiler
torrellas
gericault
chifa
swannery
demonstratable
bummers
dipshit
musicans
armyan
patetico
shapir
eyecatcher
fixturing
shorris
infostrada
kuenne
blurp
cagerz
chewers
khowst
setaimata
annihilations
medeco
benslimane
hanslip
tincup
cardiography
bassinets
popsters
aardonyx
manocha
choosed
sirett
goncz
mimicing
achte
leaze
monstruous
entitites
riksgränsen
gcig
whoopsie
dunauskas
adnam
runned
movi
clearers
fluffers
beignet
delago
whippin
guiterrez
maskrey
drummy
wadan
personalis
zicherman
jerious
falkoff
gangling
mayford
takle
stamile
prepcom
conceptualists
crumey
triptan
dester
khator
drudges
growly
someon
hathern
rusutsu
cybrids
eneloop
diprete
measurables
abishek
unchaperoned
hudhayfa
cvii
mayahuel
gebremeskel
huissen
bezbarua
adarand
vitters
barczewski
dershwitz
zilly
goldworm
wooters
riemerschmid
cfmp
sogetsu
plainpalais
turchyn
macnichol
szarzewski
caresource
radina
creran
reevaluates
piekarska
waafs
ndms
tyszka
senel
scrine
vivens
sahadi
jahanshahi
konovalovas
genetech
insideview
squinted
hilsenrath
erisman
kwarteng
podvig
sheinkin
wagnerism
hipmunk
konstan
cenveo
pottsboro
auxis
patricola
zinczenko
bibulous
plonker
lowitja
bacp
rampi
trusteer
acceleware
pillowed
wilbekin
bouasone
abseiled
barelwi
snitched
kortajarena
wrecsam
persepctive
aikwood
zeidenberg
palenquero
sopes
homegroup
microchipping
frogspawn
uncapping
halabjaee
pittington
nymann
sociodemographic
wagtmans
boster
llangammarch
brfss
takudzwa
celltech
trueness
mccallen
inphonic
lancman
skovorodino
mohnton
seland
millenial
krigsman
bertholle
affini
metapneumovirus
oldway
mushkin
buonanno
debiting
spankings
lidong
patzke
aosis
disoproxil
llanfaethlu
emagin
lcvs
ellaby
stockholding
wssa
sentinal
fornaro
hartleys
huetamo
namasia
hodsock
angiograms
elashi
bilaterals
ponde
mingardo
hudman
rusiya
nagrin
neytiri
curvo
meite
juridico
windansea
countercharge
drisht
gutwein
cswa
degibri
shortchanging
rubios
gwaa
accountabilities
pechalat
borysiewicz
costumier
kyel
hafiza
linbeck
tacle
akinwolere
klaiman
dhanji
huppertz
snookers
tiano
sheinfeld
papalote
poliklinik
passingham
pantsuits
ricanek
lisant
whitecraigs
rehydrating
ccfls
laundrymen
seriocomic
röschmann
evolutional
britany
blankenbuehler
sterotypes
multipolarity
carryin
bootjack
loucas
plebian
ranulfo
taekwang
trobaugh
detoxing
convinved
essary
jannero
lpcs
splosh
kaprosuchus
zeda
wavel
anadin
rydingsvard
multibank
porthoustock
mainella
eisenbarth
ruscote
baltasound
nyumbani
rwn
itzkowitz
tsybin
schopman
alfsen
kallir
cario
competative
unapparent
tyngsboro
agustien
aastra
mingxia
parsky
apatzingan
winnicki
vallhund
dimondale
bogason
limbrick

leibfried
ermakova
merafhe
qubad
calfed
suanne
xup
layabouts
abaetetuba
chayne
balkenhol
alfy
financeasia
bamy
eviscerates
aylor
lecharles
flyspeck
gadflies
vapers
extemporized
liedel
crabill
twelvemonth
jual
kaloko
tillyer
keyla
babula
mumbadevi
duynhoven
guanting
reinholdt
pommie
lalvani
priestner
gelila
lansdorp
immage
pisarski
zmax
campuswide
daane
gersch
scissortail
silar
schtroumpf
intertrigo
dunnavant
semiologist
hideousness
aromashodu
thunderheads
naffah
ammash
staehli
sigall
microangiopathic
hanmi
penlington
chnc
kitware
galderma
pericolo
voskoboinikov
alohanet
molycorp
governable
dostie
jyvaskyla
epiphanny
kalista
showiness
pilotfish
contol
refrozen
althogh
passan
funtion
slumgullion
facioscapulohumeral
muscletech
evrensel
gargunnock
ereira
livolsi
popbitch
bettyhill
shriveling
sunweb
interpretion
momondo
iusacell
keyur
ruya
stagy
kuntal
minature
achtenberg
alkhateeb
eetpu
kazemzadeh
kellington
vulgarized
windels
paypoint
whick
acito
ereng
birdlip
murietta
nareit
yri
uzoh
frenken
curatolo
trippet
auret
jackiey
gugin
utstarcom
dohle
bertille
ascots
zaenal
gestingthorpe
wheedling
brutalizes
kentisbury
nvax
interpark
efestivals
saiidi
modibbo
winterval
mencke
docuseries
britts
feering
tauntaun
sermonette
tobiasz
mabelvale
congeals
cessations
stetzer
ezzeddine
learmond
frontierswoman
godam
skaer
uneo
argumental
aquiring
exhaustingly
gilleland
distorter
dongtai
plastico
strathfoyle
priorswood
zanier
bqi
greilinger
gargani
credentialled
zengel
mcelheny
caerwyn
ambilight
yargelis
groeslon
sheera
idylwood
elevenses
springen
kayembe
mixenden
ramazotti
luetz
dreman
viter
mikitenko
maznov
mushed
pillers
mikhailichenko
apprising
edreams
yehu
soyoung
burco
beechmount
workd
geekdom
anglophobic
segye
sxu
rizwaan
instils
kaguta
removalist
shaindlin
clubmoor
cricqueville
abolqasem
seizinger
tgen
hopelab
tigue
kotlowitz
axyridis
whitcup
shvo
ageno
meliha
neuzil
firewalk
ilusion
kalsang
onep
cppib
golabek
schidlof
clephan
irradiator
burack
infonavit
craiglang
baykeeper
wmn
jsmith
dhahr
moven
policiais
huacana
manocchio
zazzo
waytha
garridos
transregional
eurodisney
wrinklies
verkhovensky
evjue
mandujano
tomatos
clarach
mahamid
baylee
backrow
hardinges
ansingh
dinam
seminyak
fracked
wellinger
boskoff
yaima
gangwal
specualtion
fieldin
screenvision
yezza
ksara
phoniness
mcdreamy
butiaba
terrapass
provinciality
wuxga
irrigations
montori
adfd
googins
tiberghien
srulik
cerasoli
offbase
tinajero
thomet
thesauruses
ruccolo
rohleder
bextra
redlinger
parsad
joone
coffa
wallasea
jirgas
wildfowling
gryshchenko
mountainair
laxon
gozal
morch
arcapita
wahlund
hamdania
vblock
cantebury
lhatse
qudratullah
schmith
ieepa
loquats
enodeb
wiffleball
gritzner
kuelap
kingsmore
mccomiskey
thirdparty
rigorousness
tongyu
swanilda
furn
prouts
buitenweg
latis
kmtc
hkiff
nowakowska
zagelbaum
helpin
karatas
blechacz
croley
amerian
schuett
unsuspicious
railfanning
greczyn
ripperger
botros
dolberg
quamrul
playlot
meglena
muradova
goepfert
bintou
interruptive
mdec
bencosme
overstressing
ryane
sering
brimham
ulch
rehomed
npfit
elterman
danus
mahealani
schrey
roflcon
uncharismatic
cachaito
tsho
coverable
haafhd
nanomanufacturing
clinico
chisaki
sashed
gatorland
wlodek
tusko
gurka
manolada
shynola
seimon
antillon
cliquishness
asalouyeh
whhr
wristed
stolley
huselius
cellulaire
creaked
witchdoctors
schollaert
gemino
dayen
kosoff
disfarmer
bruhaha
coloccia
pawprints
miniaturism
salvy
karnani
tabnak
untameable
jerron
kagle
sodiq
xbd
lachica
clearheaded
morjim
flyaround
downpipe
gocong
impulsora
durmus
poterba
artemev
khomri
processionals
usds
refi
paylin
danesfield
dreariness
wanky
stromback
saylorville
silkman
huchon
netnod
unmistakeably
muhlbach
fdw
natko
replanned
cestero
wagonloads
rejuvenator
universites
writeoff
deiter
abdusalam
aperçus
phana
feltri
gotterdammerung
formalises
marroni
cabourn
goreski
munkacsi
traco
sovietskaya
kusy
kadem
daturas
prathiba
jarema
lutzes
vartkes
equitability
aircard
trica
kurnit
rydin
erkmen
zakani
heikel
incretin
streetballers
groundstroke
goudé
chhang
pinpricks
abbis
prestigeous
gamalinda
aneela
fescues
npls
sandsend
somun
carytown
kostos
insititute
nysba
freegate
twiddled
poltoranin
lazimpat
unsmoked
sotigui
graafland
truffe
nightscape
lignocaine
veloci
glaviano
jeffster
townrow
codenomicon
acceleron
commerically
damluji
blueliners
kenchington
swordmaking
yanke
steinbuch
nuernberger
schudel
grubbers
metaltech
blogsphere
troshin
oldsters
gvozden
postwoman
gamy
dogtags
ringger
pulloxhill
yabuta
jeramie
vicp
prommegger
afpak
semeniuk
slaski
stardoll
launchpads
paulerspury
gugliemo
cornichon
seedpod
alhadeff
davidovitch
trimtabs
nbty
nsgc
chatellerault
confect
wesberry
stauner
dunnan
solmonese
cnsi
housebroken
matherne
hamstringing
badakhshi
schoeffel
ghemawat
savner
rokni
sisneros
schirm
frogbit
hoerl
hindery
monitise
cartograms
ssese
wwwt
delgaudio
beby
puos
dimatteo
cookstove
roslyakova
comparethemarket
vmw
passot
pyone
moehler
beleif
casty
mansky
volleyballs
tataki
infinera
miens
siliquastrum
scacco
wifo
brunacci
djevdet
toeaina
kuitca
jemmett
slovis
esfs
tahereh
patijn
dennerby
hittle
shamwari
elbæk
kilgraston
tewis
chroman
efan
fornatale
olugbenga
vlahakis
fugere
tunafish
emerse
ferlazzo
prilepin
feickert
endcap
mustafic
puertasaurus
towelling
bensalah
kominas
rhewl
monticchio
hler
hirshey
mavrommatis
bowzer
ioannidou
laganas
preggers
glengall
ossis
ingore
gkc
bettter
goligoski
daon
mvy
hgte
lenze
strathdearn
copperware
boogeymen
hermoza
sportline
motshekga
unwedded
bayboro
neurohormone
jessenia
axona
deim
khalfoun
blisses
taoudenni
snacktime
wayyyy
hlcs
charne
boliver
oriali
zolezzi
cerdeira
martez
rosens
navarri
branzburg
bentyne
rheal
odibo
preeta
reinsalu
mishit
arkless
karisoke
happonen
apio
gulledge
websit
jadcherla
bareiss
reimagination
dobek
wolmi
extranets
pathumthani
meeuw
kensinger
maysie
baumgardt
dockable
sovo
junlong
bankrupcy
whli
montrell
recontextualizing
leasingham
burqini
peddar
haversacks
cmag
rgbl
ablates
kalinsky
lewey
seyama
gamrekeli
putback
cormican
portavogie
bitange
lecun
gica
scaffolder
trusties
yusop
kahtani
burtle
wienerberger
avenidas
drolly
automobilia
natelashvili
rainbolt
cashiering
msti
sinkford
sherene
nethercote
deocampo
underrate
reticulin
perfluorooctane
inters
reeducated
inforum
bazoum
holick
wildor
fcis
distell
euphamism
djg
sammies
artzt
airstreams
finckh
iakobashvili
krakoff
sior
knockaround
stortini
unknotted
raskamboni
pobl
selis
lashmar
punnoose
casia
getsy
breitweiser
atfs
overrate
whitefeather
bgea
alsobrook
wernich
safilo
haarder
bewilderingly
worldfish
sportmax
kalsoom
glasow
igbc
grandiloquence
marineo
sigmond
shiara
werthner
chse
provigil
freepers
positiveness
mishears
antor
jammet
wanze
shershah
bucholtz
aloise
youdale
kosmatka
arleth
finchem
shackelton
doddy
fusker
biagioli
oleochemicals
zakzaky
listenin
dulon
bushie
victimizer
firgrove
rakovic
preclusive
macgarry
niap
krieble
speilberg
wingle
beatable
parthenis
kitbag
fennemore
sweetarts
craftworkers
nessan
iooss
worsdale
hyponatraemia
talog
jjw
penelas
dartnall
somashekhar
isrg
carrivick
gorle
rudich
wheadon
seilala
szapocznikow
hodeidah
haricots
buhlmann
faccenda
retin
merkens
kabore
biley
malburg
restavek
shaine
aneuploidies
grafe
selfors
beirendonck
svedin
nanjie
calim
deshea
plastiscines
kamiar
divsion
lovingood
bessacarr
elci
mabira
bosic
distractibility
siron
wolpo
girozentrale
apeared
hefetz
homeliness
ashoura
benotto
frankum
durnbaugh
kivisto
jaeck
dicapo
torrefaction
towednack
uncelebrated
kitting
mbaya
matous
liefers
manglona
wolitzer
handwringing
esmie
audretsch
mboweni
dvids
krisztian
yousof
cebrian
palazzos
mahendradatta
maietta
polzer
porokara
playwrite
overmeyer
bumptious
grendell
scatted
kobaladze
sinotruk
zafran
bedimo
uncontroverted
scripters
telepathe
defeasance
frauenliebe
neeve
mineau
jalloud
zettabytes
crappiest
kelts
holieway
piddly
zubeyr
conditionalities
tashigang
rennies
coffering
armaly
cutdowns
kuncewicz
chisenhall
adwell
townsends
hengameh
schlotzsky
mikele
caravanners
biocatalysts
stonkus
rodemeyer
oxygenators
mutar
zuman
radam
bluh
vladamir
mickolio
hairbreadth
repurposes
tronti
mcilquham
wouln
kyno
araldite
merini
fréthun
sivia
dupraz
loeff
rickrolled
flinchum
larkmead
yarmulkes
natagora
ccia
samimi
sunsphere
sabili
heartworms
jaisal
spedan
anticuchos
uliano
alesund
fullfilling
preferance
medvedenko
tethong
smala
guiseppe
blacklaw
kounis
bonchester
dodaro
thunderpants
rettie
paychex
gyde
heims
barcel
capula
sarsens
sistach
blaes
souch
simplicities
snowsport
babik
pulmo
handoyo
ordona
flexplay
maiza
dworaczyk
giannetta
chorine
riquewihr
cartegena
shurberg
gurvinder
luson
angangueo
desulphurisation
savoys
yealm
toposcope
baerii
tsontakis
mirthless
dognapping
proflowers
voake
argentineans
mondovino
sensationalise
girerd
openspace
gassi
oaked
landgate
americhem
awing
miyar
cantábrica
tinari
chapri
tejendra
calorically
grundler
renschler
saryusz
baalsrud
masury
collateralization
rolovich
zurick
sulistyo
neurotoxicology
vitabile
bracka
langhurst
rockism
jianqi
ellick
omnigraffle
duttons
rogha
whixall
zipadelli
fromager
techsoup
mobitv
ecia
mcweeney
viirs
winnifrith
refocussed
barie
rosebrook
hopfully
mcadd
trowers
saitek
bayla
simranjit
fabretti
leedes
scombroid
henrike
aios
sotu
femail
calty
logista
mystifications
braingate
chopiniana
rittikrai
mazziotti
pardeeville
rowdiest
larvicide
rjf
fauque
buchter
falliero
amiando
aixam
headgirl
paulhus
whdi
maxo
lakeisha
panopoulos
mcnease
subtheme
vlps
moysey
srixon
intractible
toughly
lobstermen
abromaitis
gavrilovic
consern
sightsee
anderszewski
kinaesthetic
superduperman
operatorship
sandhogs
avout
rapscallion
fluffery
brachylophus
leithead
makrokosmos
ismi
ocme
intifadah
kazipur
jinka
averys
surpise
hurch
vandenhurk
bottino
punningly
ctvrtlik
lecat
idania
visting
zenz
sounio
volunteermatch
gessow
polone
kuptana
berberyan
tavuk
pappajohn
roughhousing
amateurishness
bottlenecking
wedlake
trusthorpe
cossery
businessfirst
dalibard
dictums
ortas
sinder
vigourous
stainbrook
proustian
norz
fieldhead
llyswen
lithman
unionise
yisa
hippeau
outstayed
sboe
bembry
hayneedle
beorma
hoeilaart
yandall
sibby
orrible
metrosexuality
lightrail
masspike
lovsin
areen
ménages
abari
yarkin
henline
scraggy
vittoz
mehdizadeh
hakurozan
tibbitts
rovan
drewitt
posterchild
kalup
hewar
eichelberg
henoko
saillard
petroecuador
eyfs
torchlit
asustek
hundon
peetie
scrunch
paliwoda
qatanani
zewdie
axeheads
chupp
ahistoric
camcopter
torina
rmcs
dissappointed
jaid
sportshall
genty
christianophobia
tommys
krolow
spurwink
ginbot
othella
ladislaw
syntext
plaisant
moodiesburn
songza
wohlsen
cremeans
measurments
lenghts
satyric
waterwall
ranahan
carrog
mohla
ceterum
firaz
solney
oxybutynin
cederstrom
mojada
krawitz
davitaia
avantime
makayla
chaffing
idiq
chuansha
occaisional
aslanbek
duragesic
boundstone
faeldon
thejournal
shrapnell
baulking
meths
trewhitt
kidded
faulcon
flighting
brosky
kornukov
mcsweegan
prulifloxacin
wollmann
implantations
durbars
toughens
simioni
pbbs
casebere
shuyi
fraises
pariyar
agüeros
limer
freise
fibbing
jianchao
groshek
shenai
falkous
zavada
cantuária
potkin
superpressure
causewayend
triulzi
centrestage
sixthly
klamm
cambor
kamuran
marmur
sennott
giovanetti
upbraid
charater
speciously
bamogo
rugbywa
dartboards
teared
aliaune
dejian
etemenanki
castels
ibrahimovic
metsys
consenso
niedzwiecki
carslaw
kloden
stogie
halloysite
chunchu
carret
remgro
bonerama
rayward
patnick
vlingo
lamex
wirtshaus
faggioni
matthaus
glozel
mishandle
untampered
kifer
eboo
tathagat
beyti
vinorelbine
jarvey
montol
lapatinib
chashama
kitterman
waigo
yvie
restaraunt
hoogendoorn
garvock
sesana
candicacy
tarloff
backcloth
hardus
eynulla
supercuts
gymnema
conditio
faeroese
chandlee
aapd
dsip
skrzypek
halloweekends
rbocs
formost
zeliha
bloops
shemekia
mannina
photofit
spraycan
mesothelin
intubate
downies
schiesser
schramsberg
brazenness
jakl
hochheimer
libiamo
evenhandedly
volano
buresh
unintegrated
pedrique
cleco
silvermist
urbahn
ceruleus
opk
czeski
overbloated
matharu
galewood
billey
photoessay
horóscopos
medarex
lundekvam
gauweiler
ibritumomab
djohan
hilliards
soffe
boogy
plasmasphere
peñaherrera
yameogo
polypill
vetsch
zigmond
hoskinson
johnjay
mathletics
fater
apodyterium
ganina
glifberg
deromedi
gesticulation
soudas
sophi
njeim
mukhran
trinka
copaxone
corroon
gressingham
jungbluth
choua
curdles
chinachem
liniger
pilliod
sanstead
fraim
freder
rosamilia
poohed
vasilj
salc
eqm
kurzem
nosbaum
fireline
fecklessness
oskam
jianshui
neighourhood
babip
guillotining
norikazu
unseparated
unificationist
kriegsmann
goity
ollestad
ebj
ghonda
buscon
forewoman
cinetic
seig
welegedara
huntleigh
leaseplan
topup
igrp
goeglein
ssgs
stonum
hbeag
zeru
sophisms
nyambura
alexon
offe
bayji
brabantia
dyball
herendeen
hazrati
freeskiers
krysztof
longri
yochelson
micalizzi
wellmark
montcuq
ashafa
ungerman
limewashed
aquarians
sapiyev
windbreakers
nonoo
quirkiest
nonya
acurio
broomielaw
liebfraumilch
lipservice
amaria
adverted
beauchene
representive
nannying
domata
haijun
cockcrow
palfreeman
mpingo
hilderbran
wenallt
odontologist
mechri
theuriau
ledwaba
splotched
semanas
abulhassan
holben
breagh
blancaneaux
eerola
unsubsidised
hardpressed
glasenberg
gaidhlig
downhome
petrusich
cusine
mariastella
nedrow
unobtanium
hadeel
tolfrey
aiskew
effen
shern
multiculturalists
protosevich
untransmitted
keyingham
woodgrange
janisse
chakarov
masharawi
vestibulitis
magnotti
barrettes
howlands
fumigating
geggie
krivets
kensing
purls
misztal
drawstrings
legalisms
laught
biobanking
chiamparino
akhatova
ussuriysky
musis
frenchness
pumpsie
chironex
inprivate
telemar
multistemmed
schabel
tressider
peenemunde
striegel
wörld
kashikojima
seec
mttf
fortina
citroëns
cashwell
tymoschuk
infering
fauquet
garrec
prabhudas
perosn
harell
schliessler
happé
spiralfrog
confiture
deeka
prettified
bcec
protas
bjornsen
cocodrilo
aeromax
crookedest
gphc
bolmer
bemo
beacher
hnlc
kulsoom
jabouri
sitake
muneera
flimflam
stessel
jalalaqsi
noncontact
homebirth
guigal
dabblers
alterative
rohrlich
reassuming
glusker
rowohl
tongariki
benezra
eisma
claure
goudas
haoyuan
tomcikova
pornchai
fedoroff
blythedale
papania
truluck
werbner
jalmar
francheska
yuppy
tetrahydrogestrinone
stassinopoulos
nerdiness
nasheet
iedd
subdudes
iesha
jayawardane
funeraria
schwaber
undeceived
backgrounding
pagotto
afbf
overgrowing
jatras
claycomo
vsoe
zagorin
zind
ramrakha
ginwala
unaccountability
thge
frieman
mckays
hymotion
anthera
computershare
rinzin
dubsky
mullady
revivification
ifap
rayad
desking
tetanurans
atcm
dexatrim
jamster
reifler
amfilohije
telsey
donckers
ujian
zecha
hadidi
definied
fereidoun
hardianto
citris
soverign
ballyholme
rashwan
cadaques
korstin
petrole
vivified
addyston
recontact
rewalk
lamman
wunsche
untenanted
knafo
pinballs
rownd
disapora
efimkin
ecotax
passbooks
hyderbad
photofinder
hoffler
boxwoods
memogate
goese
raimy
madderty
zakuski
failte
bernotas
ilesanmi
stenseth
dehuff
kollerstrom
adalius
pawed
harardhere
fortea
carlsmith
shaf
lybbert
irritably
regularising
poissant
nelahozeves
mcarthurglen
fawned
pashman
autum
engelsman
bollocking
carvacrol
lonmay
descrimination
muthalik
leyman
koreen
mwy
undulant
himalyan
fareshare
wellfare
necropsies
turtletaub
rebibbia
septics
irmen
physiognomies
saihi
bacchanals
naher
pearlstine
melck
bustline
dhore
kobak
cpfl
ssao
guven
tourbook
kaletsky
pathologizing
bacardí
hewas
neurocrine
marcotti
uprate
tingelhoff
bruv
kufour
jinzhu
streiter
bsafe
meonstoke
eannes
feris
chatterly
pickert
polyphonia
amantle
securitizing
costermongers
karee
gortney
fairlead
turag
lakwena
friddi
johnshaven
bucksburn
micklewright
tallula
halef
athman
finshed
malaythong
hachijojima
africam
sudbrooke
ugone
dets
bcam
writedowns
oelrich
tailgaters
milquet
ustda
stagnetti
streetscaping
shalrie
leao
shoate
pinetops
holmfield
nonallergic
octavien
howff
qalqiliya
kukula
viadeo
ferreria
safeness
mroué
frattare
manyfold
haggans
abiy
redlines
primative
grapecity
probowl
corynne
shuaibu
suley
graffam
counterfoil
maliphant
perfumado
shabina
privia
graindorge
hemdan
lenotre
crunchier
midground
stirman
gemfibrozil
deconcentration
wieckowski
inabata
dolwyn
simbarashe
freemarket
pessin
deitrich
bbmf
neophytou
ailun
micklewhite
anonymizers
coutaz
shushtari
tamarisks
laboeuf
zocco
puppie
restudied
ushida
byberg
navjivan
ngic
splatting
yingde
birchen
footboards
luppino
nmsa
loaghtan
pagpag
regionality
rabadan
twiddles
bonnan
bluemel
wended
honeycut
albouy
panosian
azdak
htike
malelane
maracá
owles
capurso
sigd
pejo
zinacantán
petkus
thokozani
mude
tredennick
devidas
nfcs
shroeder
combinable
scrabo
hultzen
makarkin
loking
yasso
gurski
jianwu
motricity
goyet
hipgrave
eguiguren
gunbalanya
gotoassist
dulnain
geddit
enlightenments
microcircuit
zebraman
vareille
demjanov
speechmaking
schwinge
pontnewynydd
greenbird
distronic
kilgetty
moinard
batbold
trollip
aerosolization
taimina
naclerio
huffines
abua
laohu
loafs
nffc
koju
perrenial
timesdaily
greybeard
lagisquet
amorously
suely
wssrc
lowrys
holehouse
behal
nazarbayeva
munyemana
fullscale
tullin
shigihara
linnen
miskinyar
accordi
earthshattering
pasaban
blameworthiness
juzhong
tendancies
deptula
hapuku
seone
haybridge
lostine
messsage
cybersyn
ceferin
rubasingham
dioscoro
huiping
sarafin
socrata
qaraqosh
stemp
allsport
minkley
mateys
ocwen
cpcc
slyfield
perdriel
narrowminded
naimat
mwenga
wonford
dichlorvos
nationlink
ingrates
osterholm
hokes
rudzinski
dumez
binatone
bumetanide
kartchner
menegazzo
schmancy
pueblan
pinioned
martydom
seiyaku
dehumanised
liljeroth
blackspots
danbolt
synergic
cellaring
lochard
butkiewicz
nikolin
democrates
hamidu
achacachi
costock
squeeky
daxon
perugian
chrysostomou
collete
irinel
tuttosport
kriukov
blackey
sorger
nakam
endonasal
dacc
islamofascist
luchar
scheeren
movial
bloodmoney
kisimul
marianske
psychometricians
aeropro
joltin
pavkovic
ippi
chammas
provied
ared
hansville
martinovic
bandeh
burys
ety
northcom
carbin
sporkin
belkheir
bandow
offor
migita
twankey
mistiming
brantano
hrcp
prugo
sott
mascons
tarchi
freezeout
nxs
artomatic
unweave
straggle
hersov
biofiltration
jiangyong
vandeven
greiling
prakoso
onkelinx
storeowners
kuzui
msra
furama
metastasised
karenna
riederalp
rubbishy
superhorse
barclayhedge
sazegara
mariscos
caspians
pettistree
solofa
wcva
intergovernmentalism
congue
cariforum
gradwohl
affuso
parliamo
dosas
mazzolino
canarypox
afterlight
schickhardt
thouvenin
jocund
dawayne
fagella
thallon
modha
millegan
sidhom
picciano
khalwat
mcallan
webtrends
mergel
chazin
hoenikker
indeterminately
abuot
xylorimba
majadele
clenches
burket
dambe
zybina
vivika
busara
oberbeck
polands
pereirinha
hennard
marangi
degraffenreid
kaco
baaf
palmeraie
sprowl
rasheen
estebanez
goatskins
spudded
chasnoff
pertusi
raphanel
larenas
slamon
applys
monovision
maitake
reasonble
weleetka
reappointing
overnighting
tirolean
scri
brendler
wozencroft
avishay
ontime
hoeger
hesla
qnap
coccinia
dashanzi
technophile
applix
xebra
supermodifieds
paresi
cheddars
horrillo
throughtout
crackly
devetzi
vladimira
magy
frippery
kasanga
mingfu
straphangers
twinlab
unappetising
henleys
untidiness
slivenko
medaris
libbrecht
gramer
tornielli
rmhc
euronest
reaganite
otaola
leofoo
stijnen
dieties
walek
sukka
aldara
umpierre
hirshfeld
fursdon
melanosome
relatability
chalkland
resupplies
mediocrities
freiss
rosenblith
mackmin
augustinussen
melendrez
lahmacun
pershin
becuz
paje
fourposter
elleman
hauxton
riddlesden
huaining
frueh
redoubles
pettifor
neighbourliness
mehmen
horologists
gelid
instrinsic
fundos
zingler
carnalea
heedlessness
raghuvansh
potec
reductionists
derecognition
pleurocoelus
veronis
farnaz
blinkin
galettes
plaks
fetishizing
glazzard
brockless
criticims
plurilateral
cwdm
bassekou
cyveillance
conneh
strumpf
potholing
spoletta
zlob
universtiy
spmet
millings
vilborg
jorrie
piffero
ambulate
djerf
fanzhi
sadosky
largos
pinetti
presler
fourplex
debido
aquaduct
makasi
bachtel
ismp
jinfu
slavenka
quiana
remonstrates
bacchan
filippone
qijin
definity
lipford
moais
wattleton
patocka
essentialists
gosk
depressurize
evenor
candyce
necrophorum
jagm
tewell
spreen
xintang
droped
trocken
charmane
yums
carrizales
bussone
kawishiwi
senewiratne
tongliang
inflations
nanoball
lagueux
degout
screwfix
logierait
allvoices
awadagin
newtonhill
kesuma
foge
optimer
endovenous
pirozzi
oooops
bikehut
tyman
galeas
econlog
outturn
tycroes
preconference
mafiosos
salimah
learjets
felsinger
jieddo
wmet
zavis
experion
thornburn
gwpf
chorused
chorusing
sleezy
abelii
manrara
wesly
peddocks
kaitz
encounted
mital
atban
malsam
blar
neglia
bloaty
contentedness
ellzey
preem
yansong
pesan
canstar
prances
fedcup
scarcroft
altamount
hawkhill
refurnishing
mainstone
flossenburg
kindergartner
sevastapol
hydrogenics
najwan
hayen
satran
nixzmary
sosenko
jnto
utsubo
reargued
boncath
astrotech
khalik
prestone
ratilal
bussiere
ratley
yuhui
pletsch
prevarications
kuriyan
crupi
tranfer
absolue
bloe
shinning
steadyshot
polution
actblue
polytrauma
viewforth
sovietologist
brossart
aislin
technoserve
sokwanele
attrice
buwono
annice
katterbach
bushwalker
rumblefish
transporation
züricher
panitchpakdi
kaklamanis
panariti
hicklenton
delegator
pauza
huelsman
seigal
margolick
seguing
microfibre
bemporad
mullaittivu
interbody
pugach
rostowski
hyperresponsiveness
greffe
phillipian
hoopy
djuma
bosnjak
churchly
killpack
richardo
soundpost
sailosi
fceda
adderson
facetiousness
xirrus
moontide
mixologists
svanen
isamar
stms
gladue
myphone
backwardly
zimet
nahh
surpassingly
gratch
batori
beixing
khalilur
dubl
tsypin
trgovac
praesepe
overindulgent
lllp
sciarrone
stemberg
silicification
antione
westburg
labourism
megavision
lexecon
bilgili
flylo
sieberg
khrystyna
hissa
republicain
cwmni
smily
kayar
macoumba
cupuaçu
gethers
phillipos
officiousness
euroland
gaila
habitate
genoux
spreadtrum
hmri
dausgaard
midco
furmansky
tefal
outshoot
heico
gothicism
braghetto
kendray
pelizzoli
alridge
tatarella
otherworldliness
sharsheret
ocrf
tamburro
matrixed
porthmeor
digeplayer
ouzinkie
carmignac
hawrami
mattotti
pfiester
hadelich
aguillon
trinamul
ehsanul
lespwa
enshrinees
vatikiotis
laduca
ssempa
bathetic
pullups
usfp
echus
sodergren
cowdroy
hugue
ohhhhh
glucocerebrosidase
scardina
seignorage
tillerson
desensitised
gholamali
fromagerie
tregantle
papillifera
leppan
idzi
smushed
vanauken
khairullah
valassis
karuturi
ramsin
antiphonally
handpick
mbola
ncsf
lenins
mouthless
bestman
sakowski
teahan
probalby
strazza
slaska
erenberg
zuley
zongyang
sapeur
peppertree
autologic
caesaraugusta
belliraj
unanimis
badale
haythe
portinho
sibbit
remède
nightime
montasser
russotto
tastiness
generalisimo
kromhout
kochno
argleton
nepalganj
lograsso
lipowski
hochiminh
policiy
bishay
squaddie
juth
glowstick
goldenhersh
carouse
tooty
nunchakus
javaheri
caixia
coupal
peshkin
biotropica
lifelessness
trye
hrmm
dilday
glenoaks
ntas
shoping
muivah
openedge
bourjois
kornreich
slades
surcharging
foxey
diogenis
berkshares
nisnevich
sebutinde
pendletons
nosheen
espey
hasanat
padoan
meiju
slotover
bruggemann
aljira
cloepfil
cohenour
gxx
rhag
bojeador
salberg
pfge
stratigos
guiso
bandaids
lambdarail
nidcd
ouriel
affie
indestructable
stabilitrak
aspray
sillero
venneri
russiaville
dalaigh
hardell
microemulsion
regg
entemena
idahoans
updm
icnirp
patheon
mazzon
shiao
stathern
talev
bogollagama
pluzhnikov
motoyuki
afesd
wanfeng
pickiness
stiel
purell
sopko
simplexity
waterbeds
damane
flautas
poge
shokri
superfluously
lusheng
eliah
sanguillen
bookless
cynamon
forewent
ormondroyd
farset
retie
lasalvia
eghbali
estis
robogames
frodebu
euboeans
jdsu
plateros
skymall
summerworks
unenlightening
genshaft
precor
dimeco
melcer
paelinck
plasmoids
simonovic
neske
lumefantrine
slavick
lazarof
mouloungui
holloware
conax
skitube
talamini
havelet
herion
brammertz
mccgwire
peregoy
palpitating
gatherin
beardie
murdin
softhearted
lesnik
hartkopf
lutai
ipaa
churchouse
imprecatory
kilclooney
nardozzi
timney
viewability
issed
serby
jiasheng
friarage
actuals
subcompacts
fliegauf
sproughton
hillas
jocularity
dilantin
whelmed
seedbeds
houa
heimowitz
casteless
haluptzok
nietzel
deeann
notarised
proform
cregneash
trimarchi
frodon
lauría
dushu
jastram
mattone
burts
haselsteiner
eglash
collyns
bodyshop
wakelam
huggable
tacular
hlegu
wehre
burutu
marketsite
rihab
unutterably
darfurian
weghe
bacong
kearon
yamabori
aais
jamile
sarajevan
huria
truphone
furbank
precooled
westenthaler
oeav
foleyet
strassfeld
nonverbally
creagan
meterological
drugmaker
mhks
wincenc
marmorated
kukura
darnit
carapintadas
hevey
naidus
hanzal
mongeham
loesing
gmcr
generes
espinet
cucciniello
dispersible
althoug
winola
draghici
injust
camoflauge
chichilnisky
syy
alvarezsauridae
wicha
kuksiks
ccai
rishad
flailed
eaglewood
reagen
fivesome
toolstation
hosman
balentine
misron
westworth
adiala
druglord
interferring
stepfamilies
simhat
buglass
zuza
smatterings
varifocal
urself
forcedly
cimla
yulgok
wnsl
khalfani
acevo
humpday
hundreth
havar
satisify
khilnani
kitutu
endeth
serifovic
grinshill
smgf
kekova
ghouri
continuers
dieumerci
masaai
sredoje
tuell
puuhonua
jarena
lifeskills
bellydancer
plch
fenwicks
abderahmane
burstwick
bandpage
dstt
outisde
haora
unascertained
dumyat
tongli
schmuhl
roake
lisbellaw
braintrust
kooba
gramley
webaward
poin
kissick
tactility
liveops
jerkily
mbalula
drefach
freindly
julz
insaat
akilah
furrowing
oversupplied
lixx
llangunnor
abfa
meserole
silm
lawhon
shoven
carrai
hoyes
troob
zafirovski
migdale
tourmates
edmisten
tastefulness
mongin
allergists
doomsayers
activitiy
graboff
angon
lovemark
shengzhou
agins
hefferman
whammo
asenapine
moppett
zenghelis
kiwan
tikes
yalincak
vinyards
fantasised
bozburun
epel
negawatts
pisaro
plygain
macaninch
plener
respectfulness
baguer
reletively
akhtuba
shilbottle
houria
amika
welzer
sarukhan
sakovich
mythologize
mixamo
toolbank
precycling
tedtalk
oscawana
littermate
rhamnous
pegasys
llangadwaladr
edwen
audiance
icgc
rutha
spadeadam
dspic
gansky
grode
alverez
suhrawadi
dhall
kulwinder
miscalculating
buggering
duluoz
siasconset
berdzenishvili
chandrajit
jackbooted
bleistein
marshalswick
kimmerghame
newsnow
bahdanovich
wierenga
bouhired
restylane
shanzu
barfing
nascence
fedmart
chalcot
koltz
uofm
qio
kooshian
undergirds
slipup
freeganism
jhsv
asba
rhyan
freerun
monory
kadijk
iplex
madeiras
ciralsky
worricker
chesterbrook
badii
keudell
gebregziabher
discoloring
glögg
unzips
lecq
dovydas
martyring
mantric
cocreator
itsuo
manhunting
nasdijj
helzberg
priamo
axway
ondiviela
bankings
coolican
stoneyburn
stureplan
parinya
roofscape
kols
pusic
sportiest
adalian
katrantzou
fiechter
bouphavanh
kaptchuk
werksman
handpump
labowitz
koizora
numbskulls
dukei
dmpa
pparg
muhabura
scatalogical
galschiot
havern
jorvan
meinir
vaccino
cifras
ditib
scorebook
hakkar
pinocchios
nyjo
fraternise
jawaher
outteridge
pttep
madshus
acquisti
parexel
adhamiya
ikela
stilly
weisses
telekomunikasi
quahogs
toymaster
condliffe
morguard
thrombectomy
vallies
septime
reicht
azenberg
braless
sanah
cherikoff
couth
effectuating
akkas
sturmgeist
supped
imich
mutsamudu
helibras
properous
grebeshkov
wenying
assous
bhujel
vivisections
hormuud
saksit
schryer
philadephia
jarell
blandest
dumfrieshire
briois
kadry
depowering
choel
lisztian
opekta
deltha
aboim
melodiously
mesac
janurary
putzi
mancs
consequenses
fernbrook
ivankovic
concered
gasmask
ispad
phay
firecrests
ruckdeschel
lebeuf
pylant
lanks
karoliina
hunnan
reventon
sparkwell
luvox
gravlax
marlbrook
balius
fluxys
shakim
tootill
petraia
pushbike
bowsers
caulton
murtazin
bispectral
meccas
captian
primatological
cherubism
perrodin
knackers
budongo
cosbys
magnificient
dropside
faci
staffieri
urbanik
tiuna
dimhrs
greenlief
mereu
izaac
riggans
hyperphosphatemia
keshwar
opers
bronchodilation
sonkin
kindergartener
ongkili
geiman
bancor
chamoux
woould
adefovir
grubacic
tiffini
maussa
loofah
mareel
kalkot
aayo
pruvost
boroditsky
phobe
microflex
truckie
yeshorim
collman
juiciness
véry
jianing
basterra
whatstandwell
abady
insitution
strubby
spirig
friedenstag
turkomen
rundek
mtiliga
junhong
ikpeng
defrank
jojoy
federalizing
balsamico
skovhus
mayview
orru
maaren
lukac
tarconi
minnewanka
flavourless
niederstetten
forshew
jahed
lelisa
aeco
alberghini
constuction
sursis
remaing
rausa
geophone
fussen
pestilences
allante
procoptodon
kamathipura
unoosa
expeditioners
pickoffs
trock
datatreasury
airshaft
cajastur
appollonio
hauert
jayakrishna
pedraz
monsur
eayrs
jaik
weinandy
ultracool
antosca
arismendy
baudach
cosmeston
facists
pompes
kazenergy
traiteur
zuying
fouzi
caramelize
mcenhill
imbibes
aviacsa
smei
jillie
hopla
trevelin
ekranoplans
cndh
haoming
mcgranahan
mantashe
mortera
cravins
conniver
perrish
protaras
waylaying
tils
monath
bolduan
unfun
teepell
aranas
practicably
dolciani
porosities
hayde
nationalisations
clickstream
murjani
isthe
waterings
favs
southernhay
waberthwaite
irupa
papillaris
jantel
manyang
nanetta
robinzine
jetstreams
stompy
allergenicity
preparators
fiallos
sherdil
magsamen
krafsur
mvsu
kohna
ftts
echandia
kiester
bonauto
safty
everlastingly
graminearum
orango
ectc
kallaugher
nonfederal
mokko
tsolekile
amirav
handmark
longicollum
ostberg
grgurich
divilly
acrefair
conaco
pinguino
groundbreakers
klemke
boîtes
finacial
billers
zubulake
mentals
makhmur
stovepipes
dudly
cuisiniers
gurling
soze
akinwale
ponifasio
cortonwood
bodycount
zidar
aburizal
biolcati
mesirow
sparton
momcilo
careworn
shayes
breceda
tranchell
cleated
diluvian
cluzel
diblasi
kringles
skyterra
roecker
gessoed
cianchette
bethann
lhen
lynchian
wowowow
ptwc
rumpy
teratogenesis
slipcovers
alion
psychedelically
rafei
fitchie
tawakal
lauries
fernado
torsella
mitsuyuki
mues
foldi
daufresne
akuntsu
chitterne
keyhan
hobnails
camio
kambas
hyperarousal
kooga
mujahed
golikova
statendam
safecracking
gaffoor
videocamera
battleplan
dychtwald
chenai
teaware
tuitupou
exsistence
crispier
proelio
floured
tringale
tensleep
norberta
deferiprone
whakatu
antegrade
ravva
leinberger
dimwits
ingestions
pellizzer
sixthsense
pijon
gtbank
scriptlogic
heartlessly
chesnoff
hafif
androstadienone
dechu
jolies
crisa
manjarrez
poryes
shortley
brochette
wansley
mamund
filaggrin
sufentanil
zniber
swallownest
dardari
weeble
cinf
obsequiousness
paulison
kamvar
charentais
perriam
crackberry
aobadai
ravishingly
bottorff
fargher
annointed
jiangqiao
sticca
upadhye
rebind
esgair
lakpa
varities
procrastinates
repulsiveness
hadadi
moinian
nipsa
fiacconi
jafco
chukwurah
casher
asiapacific
teleworkers
straeuli
acoss
asik
breitz
nikiski
bekken
ansolabehere
wyboston
foux
darwinians
hajir
plateauing
uetz
shimit
wuori
gatecrashes
cotrubas
leenders
topquest
nacif
dyma
urgelles
facciolo
aspey
fadila
bavis
anadigics
shingirai
komis
sbpd
burbanks
laynce
pilocytic
kilbrannan
gotabaya
jaywalk
houchins
forr
doernbecher
viewpark
antwaun
uprating
unexcited
fabisch
gartman
unexcused
zackheim
gcsf
lyubomirsky
apcoa
kidstuff
argaty
pineiro
darkley
trocaire
chaussettes
skinfold
yenni
raffaelle
retreaded
oursel
hoverlloyd
elasticized
labban
gogarth
suvir
unimark
brunkert
ndayizeye
gnpc
degs
correctives
porrini
dafs
kloiber
gruma
hosannas
reinstituting
batterjee
maindee
navajivan
edac
koningshoeven
michos
trenchcoats
shartava
didace
loreli
methysticum
rishard
flambe
xichun
bokaer
dundar
mcconnaughey
vranich
khodaidad
multistoried
barrique
derrik
connives
depuration
mininum
masutani
palamidi
stutzmann
xhosas
whoot
needlecraft
thunderously
actiq
flagellating
cherilus
tenderized
sockwell
maëlle
fleckney
imprecation
ohmori
dirir
arzate
siberica
covill
beriosova
blackback
onic
marylouise
implimentation
gerbeau
zijiang
salvationis
bellm
teabagger
vujanovic
achivement
jitish
heising
somerley
betu
blokey
fanyi
kampamba
tremelo
jart
kozicki
palmaz
goniurosaurus
farbrace
cyrenaics
startsev
extraditable
cammish
anchimaa
phanfare
duntulm
lilliesleaf
mudman
wonderettes
shrilly
panksepp
aromatised
margaretting
doly
jly
tayyaba
qualitas
declutter
catchin
gorazde
eurochambres
amburn
wema
sherrybaby
mashamaite
tapasi
stolzer
mepkin
schaeff
warfront
myachi
schook
psaps
suffo
trism
humlebaek
depersonalisation
mirabaud
zakour
noreiga
mdts
embery
rossotti
nelstrop
pityana
bookstaver
kaiun
microencapsulated
suavely
yixiang
dariga
abdrazakov
stavis
hunnisett
arrogating
skyspace
ulreich
barella
poopers
debases
bokashi
unosat
schepp
fosseway
excesive
gormally
prevacid
tubney
nickelsville
lemv
punchout
rodek
sarabandes
ruegen
jamaine
grush
gellard
handpieces
bacciocchi
abakar
thalman
melchisedek
cuyabeno
kahut
bielan
leggate
pakage
hasama
westmarland
hodgeson
nantymoel
keara
ranjbaran
sierrita
arridy
mortaring
gpj
mikhelson
bdav
dimethicone
zumbado
agrement
orlandos
roughen
tianwang
bloms
asmd
numerologists
muffuletta
gasanov
nicarico
zoubir
teléfonos
tugenensis
byrdie
aercap
tucc
pseudorabies
garrowby
waubay
crapola
potthoff
ejn
biochips
livneh
koches
lequan
yarwell
technologia
kalvenes
jekel
iassogna
handberg
daunton
kinepolis
lightwaves
katseli
descibes
savre
ukse
demotivate
empathising
muwafaq
jianlong
larssons
spearville
microbuses
refreshers
chinnarat
tsas
weibring
goodfella
tatums
whorehouses
percassi
kaltenegger
ebley
delineators
ladyboy
selloana
cronie
tartine
shehadi
hutshing
fasbender
entertaiment
quenelles
netherbury
gulfview
paredon
hexogen
atitude
nusakambangan
lappen
zadick
allerleirauh
growingly
fogbank
obizzi
pumpkinseeds
vmeste
romenesko
shelor
topazes
addicott
despommier
cheesemonger
shairon
simers
benyus
kibitz
artouz
zardana
macgown
amilly
althin
nonnenmacher
eligble
bolillo
stufflebeem
bazant
gazpromneft
schwag
reexaminations
hoornik
carignane
procrastinators
svitzer
quisthoudt
tessel
completers
refuelers
fronk
hunia
techamerica
geci
bairu
jaeson
henbest
keauna
lbos
elisheba
seligsohn
tetulia
ballacraine
niersbach
villez
hobnobbing
uncommissioned
moscovite
fourpenny
faram
ebonie
banyala
metrologic
overproof
mugniyah
truenorth
havret
haydens
eppleton
nsct
inners
pittu
embitter
burlakov
nemore
immensly
elfrieda
qaidat
outworked
ectaco
demory
tolchester
boxiong
gukasyan
dansette
brogdale
prision
acress
eimskip
echourouk
keasling
tamkeen
pancur
saltmarket
penywaun
woodstove
preu
inanely
leibham
farhod
saio
waterscapes
dewitte
banderilla
bazardo
cangemi
agagu
indridason
sidka
kasit
megatonne
yegge
jeopardises
sweney
ataollah
commscope
phuti
keler
millionare
yudell
countryfolk
gabfest
striesow
paac
incentivising
humidors
anci
briege
soliris
leakycon
kloberg
belco
unalike
yachin
prisbrey
tanevski
odina
correal
xanders
iavaroni
sudachi
deskovic
geech
massoudi
seekingalpha
thanis
skelhorn
ovm
zinédine
buehner
pontrhydyfen
atenza
blowhards
hardass
cymuned
sebel
outragous
bmce
ziliani
patamona
testors
lawdit
aboulafia
avanessian
saidou
wasmosy
chemtura
märkl
villafane
vesselina
twentyfold
stereovision
slivered
wknd
iwamasa
bartulis
jaidon
koliada
supersite
mcstravick
eiser
flibbertigibbet
mccluney
movsar
cobertura
powernet
chambishi
sugito
kralick
brunkhorst
sabki
mccullah
tiriac
piiroinen
tlali
barusso
fricks
sensative
rochman
forestiers
thefind
yijie
nonprofessionals
splotchy
extenuation
dinoire
tweeked
parishoners
adongo
lymphangioleiomyomatosis
benaim
berardenga
quadrillions
mildner
brodax
nkechi
tamoil
sabik
arnzen
schlundt
mondshein
pandoran
wisser
marlane
samardo
leyhill
treki
merilee
ablations
feinmann
malleswari
banasik
kiplingi
enagh
hmyoi
shakeups
mollino
knickknacks
ballinacurra
kapsalis
farringford
intrahealth
rewatched
sunridge
boelter
litif
zertuche
tamegroute
skinnerian
xtl
cablecards
merchtem
makel
ixis
laguage
nasril
notimex
heathside
livered
flintoft
immunotoxins
fensterstock
rajabali
omolo
purnululu
hestitate
pbts
tuskeegee
efsm
parching
sucré
kosminen
mmxi
sulat
disinterestedly
borberg
desribed
dharun
reveler
lapasset
stowupland
litwak
barovier
namb
coproductions
barratts
mcgraths
cerfontaine
salisu
worrincy
housey
denike
pouncy
theaterworks
ochils
ercol
laseter
paydays
bruzzo
fcvs
sodersten
adzic
shahtoosh
forsbacka
brakke
formulators
poulard
mohtadi
tahaan
marculescu
zahava
firecontrol
platespin
ruhakana
rennix
gloveboxes
schleifstein
workstream
wildearth
yosuf
golway
highcross
gunes
firedog
mydomain
streckfus
unhappiest
moyock
shof
mistreatments
zanoli
transfuse
traumatising
nooke
warborough
trigeneration
blangkon
tapner
yazpik
dehap
weyhill
parkay
conferenced
lunchables
acharacle
propriano
viridor
ndung
mappus
officals
sgts
corsino
encrust
allanbank
underselling
accme
kataib
gocar
maniatv
genchi
duboff
sculpturally
kaamulan
stubbly
kikuya
hometeam
whovian
chelmarsh
biopolis
theatregoer
veiws
ceis
bernstine
alejos
zipolite
sopogy
penyffordd
movoto
miquita
dogsledding
cranioplasty
leppla
dfps
pmps
danshui
chiman
benschoten
decarbonisation
buzios
aestheticized
valuably
mammographic
ibmec
garimella
winterization
verfiy
mlec
phobics
sheevaplug
benkahla
ligitimate
enj
krokidas
sosnik
shantelle
onal
organochlorines
monicas
lukaya
lawer
neosporin
unstopped
leatherbury
agues
kittenish
aldicarb
krippendorf
billcliff
aritzia
altounyan
philadanco
rombi
geomicrobiology
irrationalist
martsvaladze
jittering
gazeley
zaretski
visceglia
louette
rambaut
jesuses
bidmc
studentsfirst
maladapted
blindspots
cartee
bookpeople
blud
perske
prahalis
trinite
smartzone
guanciale
dorch
silah
pegol
kettled
bhuri
fearnhead
jabril
lybeck
coeck
topalli
esdc
filofax
armoires
voxbone
taky
proshkin
roneeka
thca
scariolo
dolceacqua
omnipod
smarden
belsat
pxd
naalehu
bradsell
dbts
onesies
macleane
kikkan
notarize
midthun
decommitted
prosed
gaucín
sqaure
seasides
speal
banif
mutasim
rodeway
oeri
copstick
oldcorn
underlayment
mannkind
hallsands
rickatson
magpas
lanceros
kvitova
unfashionably
blasband
luisotti
favo
churchillian
gasoil
matuska
karademir
fettering
lifetouch
nadil
lupset
tienanmen
lesiak
highjack
photodamage
kingsfold
morizono
gigaton
kiver
porteñas
synfuels
wojtkowiak
emersonian
umhs
wating
intveld
arthuis
poutre
fulking
openfilm
bagrodia
akhundzadeh
nonbusiness
sharvit
extr
buehrer
blueburger
fokou
compex
taer
supercute
paronnaud
boiman
diperna
kirop
salfi
phelp
anaylsis
avanade
fluorodeoxyglucose
vasovasostomy
worp
micromagic
pardonable
forebearance
labossiere
esle
drq
quinziato
goedde
maurices
sanliurfa
koozie
chebotareva
kunimatsu
akobian
bicarb
tobchi
sunnies
detroits
volx
unhealthiness
tohatchi
tumelty
hidipo
yotel
flowlines
bethards
fingerman
mollick
complaing
wookies
madieu
lothing
prometa
backlin
koomson
piddletrenthide
nacco
revoltella
balze
chiquet
nndc
aejmc
ducarme
uncastrated
chokey
intacct
sorpasso
bluegold
disposability
polignano
kiplimo
whittard
trailor
glanrhyd
matel
takeing
augmenter
salaris
meloan
totaljobs
paulmier
shutup
mizon
korde
colbath
mcgirk
yrm
filmgoing
verklempt
lovrek
occaisions
inul
gopaul
butzner
pruritis
andrades
healdtown
padavan
jinfang
bramly
questro
montecinos
riia
epea
mulyadi
isackson
integrationists
nisource
neobaroque
demora
crosstour
focac
saukville
kostyra
buce
crickmore
gliocladium
kogas
surreally
mediasmart
lassale
blueprinted
brannick
glycaemic
husbanded
donnafugata
tilleke
realated
brookers
molaa
azcarate
staake
schmaler
treprostinil
littlejohns
muntazar
lincy
cumo
dogus
overtaxing
carisa
andreassi
pipemaker
baoquan
permissibly
haniska
anindilyakwa
liangliang
haverland
villehuchet
omidvar
thusitha
ralepelle
havlik
eyerman
pushka
aresco
edisons
cinecitta
graddy
jiyul
asrael
sexpo
psychoanalyzed
telecommuters
flookburgh
entitiled
goulson
krod
mccahery
outmaneuvers
sadovy
deschacht
wienzeile
appetising
baritonal
changeing
pondberry
seawings
alarid
nakarawa
dicyclopentadiene
cege
nbrf
wdn
sugammadex
undt
attendings
louthan
dzf
fuifui
telepods
multimillionaires
kirkharle
yirmiyahu
kircubbin
glagow
gospic
neic
moonmen
perenyi
cerre
maanshan
ledvina
motamedian
rosenfarb
britting
stukel
efss
desirably
pricer
gugerty
chumbawumba
eiberg
rontgen
palipehutu
helyer
vitiates
oxclose
jiangong
dewael
talvitie
theret
jalawla
erquy
peaslake
conzen
ghizzoni
rumasa
badanov
macnulty
metson
schleip
deworm
avelon
ticketnetwork
lurlene
khukuri
fjällbacka
bearder
jozefina
concerta
rehling
balou
wmik
sofiko
wrage
eehv
barioni
ultrices
condimentum
caymus
jenkinsville
murmuri
pirouetting
destitutes
galitz
nagita
lapthorn
decilitre
transferjet
dixmoor
fireflys
blumel
endorsment
langbar
factive
gustus
immoderately
governability
shomari
tejocote
clockhouse
burlaka
persing
dmitriyeva
osteoprotegerin
diyab
podgy
demokratia
terrycloth
toonattik
brennon
kamela
hameiri
forristal
fecs
rockmount
wardah
quaysides
mcaneny
durney
alkalaj
pirouet
glengary
lovcen
devenick
haylee
garrotted
promenaders
ochberg
senoko
diakhate
rostki
myrle
slansky
mhlw
jiga
clov
seesaws
nunnallee
widenius
pyrek
weaponizing
francop
edag
sadoway
stavrakakis
corncockle
discimus
cesarian
dawney
judiciousness
roumieh
wellpark
guojun
hodsdon
rosende
nbaf
labid
spectactular
antezana
haddrill
velculescu
headships
mainsails
pattered
alred
culata
forebearer
chalie
sheader
seeburger
oubiña
dsmb
heubeck
devilment
katayev
camín
gorson
predesigned
feep
saucisson
bertocci
brevetti
industryweek
lumer
loiterers
loadouts
nstemi
leaphart
titillated
aghazadeh
tigipko
corrugating
saldate
activase
gailhaguet
qimin
khaskheli
enevoldson
monay
bordner
rohlin
interxion
munusamy
menchel
welters
brooders
kuong
louvish
tocca
sucsy
humouring
codorníu
darwesh
henneberry
hankamer
altinger
krisp
wpas
purevdorj
zinifex
herbed
itac
vergassola
wellsian
cpsi
mlyn
acustico
multitasked
lman
undeviating
suzukis
loizidou
brachfeld
thornier
marassa
jamiri
anesthetizing
dhaher
klag
ronke
consentual
podd
umme
proppants
peasholm
ingenuousness
upda
hopyard
scrod
sundome
koblenzer
morroco
amlaw
bishopswood
tollbar
probab
rownhams
pemco
nembhard
rocchio
grapeseed
kreger
puistola
sulej
nonfeasance
certificants
lessy
fareboxes
pslra
gruv
kirbyjon
liukkonen
handzlik
opri
uhy
behaviourists
tropically
walvius
byrdak
teardo
balblair
dedaye
ayuntamientos
graterol
overstaffing
bekkevold
kentisbeare
kloesel
légumes
benedettini
gianini
utilizations
galka
trawlermen
subex
klauer
ndjili
electees
tillen
kysor
porcell
yoshiji
ardec
servisair
rapiscan
geronimus
millimetric
faliraki
herreras
manawanui
marmonte
leggs
balsara
capercaillies
minari
outdrew
stockingford
jalaludin
terrmel
hanieh
bipap
drance
diedrick
telesford
svich
silverados
babyfather
nashvillians
housebreaker
cholita
towelettes
gurrelieder
crookall
upends
destines
pictbridge
appealling
sensée
ronney
bentoel
laviano
llanfrothen
apixaban
vlodrop
sneakier
lahair
brushfires
unreadiness
goserelin
sleepily
maldistribution
dixies
birkmann
heidmann
hirohata
fundie
reitmans
qsp
campbellsburg
bridgmohan
lööw
cambois
akkuyu
garce
photobox
placidity
clendenen
riggall
marathoning
boduan
rosenstrauch
interplays
wyedean
njpac
unqualifiedly
mayger
stalberg
mazsalaca
mutianyu
absher
anouschka
towerstream
bachani
chepchirchir
touhig
khoro
iwcs
blairhall
qizhong
zurbaran
coremedia
opposit
aispuro
utla
waxer
cotney
halldin
setai
portese
concealers
requalified
reshard
zondeki
lavendon
microids
archwilio
ceftobiprole
krawetz
bonacina
terrye
qaed
quivar
idiosyncracy
incrementalist
weldele
rascism
chenchen
civillian
wasowski
alchymist
ghyslain
plawgo
ebberston
crailo
whitminster
callcredit
debonis
petroceltic
sterotype
ericksson
hillsdown
sameshima
vicken
skivvy
skyscraping
deiced
biomedica
taktser
mignoni
representitive
gordinier
cervantez
kellogs
perhap
sofort
deferasirox
blubs
shrewbury
africanisation
vaders
mukluk
biebel
drobot
schotter
bubblehead
zhongmin
acidulous
kathyrn
interbike
tigist
ruddles
ruszkowski
brewfest
rehaul
starchefs
wertheimers
towbar
ambach
icti
roberston
barhoum
merzouga
conergy
mcalarney
bellens
sibbi
megalitre
armentano
takingitglobal
vagliano
newswriter
ciechanow
dreyers
marcheline
pownce
spicker
khitab
rischer
portin
ecoh
vacherin
frucor
carbomb
hanout
vernand
cherukuri
wvsg
schroyer
synder
acronymed
schlieker
hartinah
mancot
blackflies
slimfast
brillembourg
blokhuijsen
fellate
nulliparous
chakari
litwa
berinsfield
nicor
greened
decribes
naameh
confessionally
knoweldge
hoeg
exeat
shivpur
safieh
walkern
ravellette
ozio
hickeys
delapre
döhle
daimary
paymentech
pengelley
mankani
littleness
montegrappa
jabalya
landlubber
keyse
apprpriate
ltee
travelwatch
ccpr
crockfords
maradonna
ferraiolo
segerberg
taluto
temporizing
newtok
hately
picrite
yandiyev
zonules
pawelski
microfractures
lilygreen
backtalk
stanleytown
lippiatt
roentgenology
quacked
poplavskaya
squeeks
nctj
larminie
gaviña
fulanis
phonebox
hanoverton
denominal
sautman
virtuousness
ayouch
sybaritic
giltner
felciano
moodily
jagpal
godsall
hijja
micropro
villency
moceri
spakovsky
heselden
jolivette
unsalable
greavsie
relton
acbd
hindquarter
wishnow
uceta
harmolodics
ancilliary
mejicanos
glencolmcille
soleau
farmanara
lifevantage
garnishments
zhenli
wincy
repossesses
chikhaoui
goolam
bhum
massachusettes
mayetta
tyondai
eliasen
transfusing
frale
mullavey
yehude
gauzes
steads
shaddick
calçots
labourite
shameen
alexus
spoetzl
kearneys
refenes
safinia
paugh
resumés
irradiates
mostafaei
gurkan
penksa
pacquette
côt
staybridge
samsova
innolux
precontract
hendrikje
wuhou
diniyar
somethig
barqi
cleansweep
ghazwan
salafranca
recolonizing
vbp
fissuring
ecsl
poortman
franak
lisitsyn
viktualienmarkt
glasspar
adagios
polimeni
chibas
glassbox
gural
fzc
opencable
labno
liene
factionalised
uchiyamada
toasties
roodee
wifa
cornucopias
bpce
jevington
interlandi
balasooriya
twizzles
kleffner
freefalling
pupilage
whirlybird
underdone
morcos
yanover
suckale
syaiful
geroge
moabi
westamerica
dorsainvil
tessaro
kedington
nightmarishly
aleipata
devisingh
blakroc
pienkowski
sandinismo
lawee
galliher
urton
amorales
bioindicators
suvo
govenor
bullah
nhsca
ganbaatar
canvasbacks
satelites
usry
semet
portune
cranksets
munchetty
ulley
tauntingly
nardello
periquita
infanticides
jinyan
gratifyingly
assitant
latchman
cervezas
ngel
hirokami
holsboer
dissappeared
eastsiders
commenee
stupples
menhennitt
cablesystem
dellenbaugh
tigta
cunarder
annibynnol
ralley
wayt
masterfoods
gareeb
shaden
sepura
wheals
minnikhanov
srebro
plastinated
copilots
loulie
topgolf
okg
pentala
brugnoli
kaesviharn
malthusians
hackbart
teehee
volkers
yawuru
pliner
unseeing
collyers
infuser
smartgate
rectums
anbinder
charendoff
inexistence
lyndel
imron
lorely
viscarra
tofas
alabaman
fladgate
herberman
dupouy
volubility
europabio
rosabeth
hectolitre
truckline
ralink
reelecting
metzstein
cheryle
apneas
altom
nonscience
saroornagar
zabell
derra
inexplicit
koltur
whimsey
physalia
motived
buoyantly
rewatching
gayakwad
loquacity
reindorp
yaps
bryukhanov
bidborough
bilat
sailani
stohler
norair
caminati
bourdy
twaalfhoven
painton
faughnan
adriene
hanmore
weitzner
misun
mozza
ebuya
rychter
viticole
eloran
frontyard
landmarking
gerhardstein
hemingways
videoegg
shaoul
celozzi
derwish
macrame
apfc
millvina
murueta
ashkabad
overdrafting
prodger
pogles
grammatikos
belgacem
palenques
joseff
zhongqing
uytdehaage
revisioned
nailah
mombacho
basell
elote
taposh
mussett
luonnotar
backslapping
mincom
aguak
kurashvili
lazzeroni
xenserver
aï
aloia
prejudgement
weirdnesses
naturalizes
komalah
wamsutter
ggd
akima
elayna
scls
kortum
discbox
multiroom
zhilei
eilian
grofman
hatstand
rebaudioside
nymans
ecoflex
necesssary
bicky
belhelvie
globescan
whitcliffe
legwear
steung
biventricular
riquna
vidnovic
marinker
gastronomes
woolwell
mulipola
rensin
chrichton
martonyi
airgas
huffs
pointscoring
ughh
flays
shinkle
numara
marchman
prickling
beiruti
dalcross
mdrc
wieldy
houngbo
kasle
armaggedon
feminize
planit
heamoor
androgenetic
contenta
piazze
jocose
spezi
rsgs
mading
acidulated
pulmonologists
deparle
ksenya
minimun
koepnick
slaa
lifestock
espadrille
sonim
squaretrade
farmery
marecic
beqaj
pflieger
westleton
radhia
tianyang
rejectionists
wheating
thje
carlstedt
tinwell
contigency
theraputic
andirons
hormats
odms
expro
dudeism
sacranie
satpayev
seroma
drewniak
shnayerson
radmilovic
maccullagh
ballanger
avli
mallek
filizzola
loftsson
shumake
kontis
dutheil
gussman
sunee
mooren
queenland
cubria
galgaduud
rothfield
bahrom
aricept
southwoods
guca
josic
bagir
lagosians
oberriet
lydstep
scirica
speedlight
fvd
southstar
havaianas
kayce
pochalla
shchuchye
brooklynn
werning
beamsley
froriep
gergo
stockel
santoyo
esack
bollwerk
parlapiano
tssm
fluview
onechanbara
caciotta
chineses
azarakhsh
aggers
charmat
pielou
acpm
bosen
hapilon
medevacs
yedidya
piffaro
inmet
nonintrusive
feldblum
vittachi
bosta
agok
lighthorne
halfar
squirearchy
binde
bedevere
nanotechnologist
tafua
mtcs
hulhudhoo
trivelli
damanaki
amortisation
ulrichsberg
benstock
forearmed
isab
craptastic
vilató
cready
qgc
kasatochi
legalises
veysonnaz
fion
puffa
lakesides
hyperpyrexia
tehching
lonsdorf
keggy
bakkali
tentativeness
fattouh
hirschorn
jaliya
ascherman
fitiuta
konyshev
borring
skurla
acquity
windygates
undersoil
bureaucratisation
hymy
wainio
emelda
billinghay
protocell
gottesdiener
cottered
lakhbir
altick
kawananakoa
thalken
lionised
resurection
partitionist
worldpanel
rasied
irglova
untanned
varriale
microcastle
willden
mahamed
poolhouse
gurtler
worthenbury
hyperflexion
déclassé
nekesa
sarcococca
arcandor
binette
monarchos
athrawon
falcones
rosellen
djiboutians
itag
paruk
weavering
avendano
bridezillas
zirinsky
axeon
tunin
noushin
dunion
jialu
kessai
milwall
lethaia
legnini
vipa
westmost
mirabela
submenus
dogwalker
indietracks
debruce
tesev
retrenchments
gallocatechin
cymoedd
kooyman
commagere
mcnamaras
baayork
marquitos
hoppenstedt
polysexual
horsely
amerigroup
newbyth
uría
nomadically
chirruping
lague
blaenrhondda
andromaca
bernos
celling
natters
koening
djama
oostvaardersplassen
mythmakers
centredness
schlep
shitless
hipswell
derai
misplay
lechea
hewings
lypsinka
carene
ncrb
cleanable
myeni
hauner
pinkowski
dorene
sorour
llazar
ilsey
michalczyk
terzic
quinceañeras
hollowood
fozzard
grasonville
feofanova
grinnin
basex
stermer
spacefarers
admet
intoxicates
lightboxes
ruing
sabreen
cursiter
ribis
sochua
smokler
nyid
planipes
alizad
dopod
desley
subari
minashvili
noseley
polissia
tigerlogic
bradstone
stikes
görlach
indivisibly
mallnitz
swizzels
taishet
glendy
spainish
hellicar
reesei
gaudily
strategery
gursel
taxations
bonica
masarik
koogle
chinotto
soumyadeep
flipchart
lambrinidis
mondialogo
freshens
priorslee
danyl
avantis
cnq
outswingers
flavanol
zhongsheng
lelkes
howies
teanaway
vinales
dannis
kcdc
cervelo
tallygenicom
shamlian
knickknack
chesterland
amrich
leidholdt
sevastianov
shipowning
repellency
navarsete
kentia
bdcs
surfable
gougne
raucci
bracketology
sangars
wonderhowto
buchert
tosatto
blaris
nextag
prilukov
beansprouts
supersweet
lobjanidze
algiz
gnip
impetuousness
sheffi
kankowski
edeline
stoups
monumentalism
fredericksen
naturalizations
mosts
esom
woodloch
atlanticare
nasima
gutwillig
buglife
buckteeth
readmittance
rubh
barmoor
balmacara
brignall
langelaan
passtime
unremunerative
crego
eastcoast
mcneel
grovelands
pushiness
patharghata
emmas
tylar
savol
reassesses
cheshires
accoustic
overcorrected
stoglin
freshbrook
iwth
olgun
mediaguardian
relaford
avjet
kwando
khamal
ohryzko
aquagenic
frowny
peura
overexposing
paybacks
trés
rsos
necar
ocassional
sadato
haematocrit
trelech
barbetta
laitner
advancetrac
popley
terpsichorean
aapp
tenspeed
hedh
draggy
aberman
khawani
epok
elese
tachyarrhythmias
ogbogu
choucair
degioia
medelci
kullas
gaspara
autoload
winnin
readmitting
flotman
maridjan
cheongshim
greenzo
jozini
kefraya
paciente
omrah
washaway
tolchin
rosehip
wiatt
microstamping
shyest
jahmal
gieseke
piccolini
cheneys
biddings
possokhov
vovak
guintoli
ddca
nimda
fontina
spottswoode
artington
hymned
huyghue
maame
dobrzynski
konis
draggers
acelino
sietsema
chalki
ppds
abdelfatah
xinguang
devide
derio
bindis
brassie
mispriced
prinn
vitetta
mckeand
showier
blakelaw
cillo
choudhrie
sinja
drinkell
sazka
growe
cobolli
khosrowshahi
unretiring
bakhos
ramales
keslar
dormobile
sauerbreij
streusel
wideranging
fugett
schmiedl
fenney
gigamedia
jamont
broadspeed
erson
fygi
neurosensory
wlga
fuci
wheb
sundowning
tolzmann
bouveresse
marisat
rossow
cracchiolo
cobles
overgeneralize
lubricious
stravinski
cetnar
bersagliere
labeef
brottman
shinyaku
benthamite
mugridge
hurk
codere
unpruned
wineapple
laccd
tpbs
authentications
seagen
contantly
smule
maeklong
eqp
zomet
mhango
chippies
seratonin
krawchuk
chellsie
gonig
eett
regester
westa
wtgb
pansing
abiocor
leeched
manikpuri
ignatio
derenoncourt
meranda
institutos
elavil
juventute
abdulhak
irwyn
hempstone
unfried
tarfa
blowjobs
eyeborg
scharr
orked
aldert
tthat
gudino
dinp
helmkamp
ususal
poelten
dormansland
nicolaidis
iosh
aimin
browett
deskey
paradjanov
kilbarry
mckissic
fodge
shalita
concealments
zandbergen
breakstone
pakzad
usfj
stralen
lamattina
undergirding
kwiecinski
clobberin
mahiedine
hartop
chareh
priyadarshi
rarig
thaworn
carskadon
sarms
korbich
jinsong
parametres
portoroz
watumull
caroselli
vcit
maniar
coagulans
camilia
vancouverite
winterswyk
proview
gadarene
europarties
sichrovsky
longsands
uaisele
mpack
knockwurst
kipen
bulimics
slobbery
truvelo
stephanou
thembinkosi
kalanick
lineartronic
hebog
instinctually
cluing
douggie
mortada
amberton
kivilo
bctd
camoufleurs
umkhanyakude
sovereignist
arbaeen
kumartuli
theodores
negotiability
lindwer
pinck
ganaxolone
oakapple
apprehensively
damier
johna
izetta
wowsers
jearld
kingsand
jazmyn
longyuan
guangrong
cerpa
dhalwala
debis
hibernal
depressors
eastfields
sprink
heeter
lbws
hamami
mandjeck
gimmelwald
tellef
liechty
spiffing
hnidy
hemodynamically
pratichetti
muchiri
pleitez
poat
neurasthenic
antwine
cprw
eiroa
odontochelys
djakpa
mosad
dugs
farinholt
silich
preacherman
kaylene
znaider
casnewydd
jazil
sonova
liebehenschel
lightpath
tumulak
haridi
sarajevans
chistian
sidy
mcmickle
pisarek
biomatrica
kreplach
yannaras
santalab
bangemann
spilborghs
zarela
nuuausala
kianoush
murambi
verbenas
softtop
fachie
scudders
esders
koplow
neiers
kuniya
hennegan
marshgate
healeys
panick
croquembouche
unstintingly
quantifiably
kesslers
pancrate
trestrail
kreinberg
owein
wainscoted
balitsch
pfpa
delord
forground
lowliness
glenmora
djedje
embroided
sambol
iraqia
hcsc
windish
tpas
quisp
sarvari
spritual
backdown
electrosensory
multitenant
kawecki
mikala
thermomix
zornow
wilbourn
rabideau
tobas
unclad
ginkgoes
hineman
sayanogorsk
deyda
tourment
amke
tornberg
garrowhill
fabulousness
nazma
taouil
foodbanks
sagel
commity
thinus
boisi
podoconiosis
kijak
bauge
coopering
bevanites
malibus
promotoras
rivr
finnimore
ekmark
nissay
ebbesmeyer
unlamented
boutté
amakudari
inniskillin
hojeij
nxdn
scoldings
webcaster
naptime
akahi
yerzhan
buttenwieser
kinkos
unconsumed
guaracara
sulcer
afflatus
baman
rupai
hsmp
emilsson
ettl
candleholders
garana
gregucci
abff
paretti
accurist
mantained
osondu
kbro
wurly
fluster
sonyericsson
lartin
multiorgan
unhas
stakeford
bloviate
sacyr
agrilife
quizz
assunto
asbill
amellus
postsurgical
malcorra
crullers
pullup
longueur
anticolonialist
czyzewski
kachan
shulz
musze
watchout
sabuk
incoherency
arlingham
malarek
fogger
appexchange
bagnato
kitwana
maxiumum
jasvinder
cydonie
srizbi
rown
jixun
kools
iniative
risper
sandlings
safronova
avitan
jtekt
kotula
danzantes
touchpaper
tarlok
shipard
euryn
rozana
clofibrate
lovern
zonked
shiells
postnasal
ynysawdre
dunira
kenion
dobratz
designworks
molestations
galeya
uncontradicted
salterhebble
verjus
rotork
papau
shamhat
boqueria
theere
reformable
realer
crematories
berberi
gurmani
tibaijuka
fleischhacker
ukfi
adreview
kianga
iwanowski
paumgarten
wisut
fullpower
ebang
utsi
barbulescu
bedin
civillians
melucci
icher
abitibibowater
habituating
dasheen
hefting
alperson
bdmv
shamala
medievals
scilingo
kalanisi
dejia
bacharuddin
dongdong
masticator
ruymbeke
ansumane
upgraders
multitasker
squa
atsunori
mandri
ecard
biedenbach
dolphus
muhoozi
deedy
hospedales
tishina
phonophobia
attitute
sinsemilla
eulogist
rabani
amritpal
dubyna
pavri
quresh
portering
mansory
shinique
gettting
tocker
ngun
whitegates
sanitarian
serralta
spanfeller
tosic
farsad
illegalized
studesville
securement
gamaleya
apmg
tommyknocker
skirl
glimepiride
ziadeh
jakubec
benizri
unmourned
signifcantly
yemens
prewritten
pleasanter
abdille
captious
kocic
fenceline
africat
akinsanya
trecia
eqv
pomfrey
feedbag
shuttlers
sellas
koundara
magassa
saleswomen
stopps
salesclerk
khatar
dyilo
ogou
scums
checky
hasip
mnay
livinallongo
underminded
hibbitts
neuticles
jinchi
woodmans
ficelle
kovatchev
guilmette
deatherage
hanjra
musliu
ditherington
allanbrook
cristini
capsaicinoids
leubsdorf
kreher
processability
stirland
nemeroff
joyon
peagler
roswall
surtain
monsivais
hosek
nichopoulos
borjesson
trenter
murfitt
llanwrda
ffrdc
bassolino
hcpt
mukwege
sommerer
bahmanyar
dimora
gertten
supera
szmyd
huncote
skevington
ssga
bannout
michikaze
maddala
envalira
dakowicz
wuthnow
alaton
pixo
savik
bannang
hermé
mediterrean
cloyingly
spinnerbait
abanazar
standa
felley
recces
linteau
pragasam
kindlon
renowed
ternium
spos
faegre
qabatiya
anmm
lgtb
rectennas
barcarolles
orrey
pohan
cyburbia
phallocentric
zeppole
mishloach
meadowhead
riomaggiore
rolinek
koina
zemp
calibos
swished
mahameed
girlington
arraign
monsignors
camerman
dehnamaki
ellickson
abdelhaleem
herritarrok
narisawa
subpoenaing
marincola
leukemans
watcombe
witters
poipu
cloudbreak
daggy
mzamane
powerchip
containable
panjsher
eiopa
clubface
cannelle
kurbatova
resegregation
hillberry
fedders
tenberken
adamsdale
nonmusical
ingenius
dissostichus
myyearbook
sfsg
mofeed
tuninter
anynana
bockting
leswalt
bhagmati
tokarz
jalozai
reinitiate
tassle
cottonelle
cadjehoun
carprofen
nnrs
pakulak
malinverni
calise
arraying
sungold
ffred
recalibrating
sanussi
greebe
beiji
secretaire
guleghina
giannoli
pennycress
mongillo
hellishly
grydeland
coorstek
smudgy
guilts
crase
khurrum
comastri
kleinke
beautifull
boguslavsky
liberis
iwsc
chainey
livoti
mollifying
gleadall
vishny
coberly
postconviction
gabis
scool
protocal
zimov
arcone
mutie
willekens
clarium
kagay
carlise
yawp
biowatch
lupfer
covario
guiterman
bikhchandani
halleran
jasperse
mabula
polumbus
bulukumba
diglipur
zanzibaris
garona
shrives
kinstler
progamme
tennesse
citrusy
torrenueva
janneh
mycocepurus
compostion
zaimi
broida
celsia
undertrained
ullyot
cineas
pincushions
mouterde
moisseev
naseum
estover
eluana
sumisip
moonbounce
scrimmaging
fartlek
thieve
gambro
hayshed
impresiones
simearth
matela
putrefied
demarkation
danchin
kronospan
bullbars
kenith
warstone
bunkbed
primoz
incredibad
haustein
roastery
weilin
dinamika
ruoppolo
doddie
goodlooking
stylers
prepossessing
prechtel
qpe
seracini
kohal
pfsweb
bundaleer
rockenbach
nonsocial
barnuevo
interwove
carbapenemases
ozgan
gamila
oppossed
gopin
vinnell
crawick
harilela
verdick
getjar
butzel
rjk
demurely
beloussov
housebuilders
sniffling
nawijn
sharoni
fremstad
rightholder
cheko
hbas
dxy
convienent
shaimiev
cannonsville
bittermann
outrushed
garaud
mariwan
nèg
sadwrn
pahlsson
pingus
hockerton
pockett
convertors
chiappone
farceur
rivada
choucroute
aggravations
africentric
areopolis
faddy
sweco
glamorously
clermiston
transvestitism
alualu
skyrace
novavax
moganshan
shreiner
rtss
zoric
wotsits
liuyuan
milek
pantalaimon
borsboom
satterly
tocris
emvco
palevol
keigan
gerbasi
woukd
drumoak
kustow
fajon
longtin
fazoli
gohpur
inlaw
scherber
velib
slumbered
brennans
mokin
béar
kirschke
fractionator
ignizio
tantillo
galgadud
malcuit
atomizes
opensparc
mississipi
trawlerman
dashe
bingguo
fitze
dascombe
eumm
visnovsky
vvmf
dissappear
blaxhall
molchanova
gilberdyke
donewald
refinances
maléter
kimunya
delboy
scraton
handoko
wilsonart
earlie
apelt
dinaw
barbagelata
mafuta
mundford
unwisdom
scuffs
chrismas
altmore
goulbourne
duaner
technophiles
tobiasen
kuye
gielow
midani
ojinnaka
overnutrition
naruc
instigations
jalopies
shestakova
coreth
julphar
klatten
superpole
escura
ratanak
foodists
gravetye
bidadi
hafley
pinchao
wiedeking
easeus
meraux
housewifery
nuseirat
fatwallet
gableman
dehui
safdari
mckines
sweatband
venetz
lowcock
aruz
ukfc
sendlein
dementri
rifs
flocken
greatland
schelomo
reres
pikelet
marketisation
clouted
akivi
tenovus
saayman
eiter
simner
soraida
moniza
genuses
zootechnical
opinionator
kolsch
ohlde
patchworks
emun
brendt
koobface
saera
unlevel
luhukay
pointillistic
learnvest
valke
swarnamali
ballinspittle
impeaches
shilly
nsso
capuzzi
hettinga
akerboom
goslee
bakesale
polakoff
grumpily
falloch
radas
rentas
shourd
smackers
scharlau
boxier
ryongchon
littlefair
chichon
moohan
matsunichi
eifler
spiegelau
prestonburg
maleka
kuol
tillack
falfield
solfest
tsujihara
tacchino
gerallt
villier
surata
beurden
cadmore
photoshops
bastardos
kizzie
creels
vref
actigraphy
igps
snaer
broyer
shafak
hoefflin
tided
nicus
nanko
lufti
millisievert
manick
funestus
pompadours
sreenevasan
kipunji
toupees
melland
agnico
fusce
gassville
antici
maclochlainn
reverby
equitas
canani
urmo
shticks
marcey
proxicom
dogaru
scowls
litl
xinkai
damagingly
untracked
polovets
surgan
geissmann
nourredine
graybeal
bawdrip
emotively
daughdrill
ultrarunner
cefixime
pruis
steinways
aashe
yungchen
glanzer
oppenheimerfunds
hamade
minyon
heparins
caqh
bodenhausen
varnavas
robyne
unparallelled
lickley
doneness
piraya
bodeguita
roscigno
bridgefoot
podgor
tetz
somjit
overachieved
hokkanen
pamphilon
lighweight
torsello
advertainment
amhi
frean
condescends
thalasso
ghinwa
higazy
roves
mackeown
ninnies
dusinberre
apgc
mapoe
hoekman
cpsia
vipp
chuño
unexercised
polisar
maponyane
clattered
rofes
kanning
jetsetter
patsaouras
lochrie
mugrabi
boyt
bowdlerizing
backstabbed
drimmer
bakhash
geralds
eastwoods
shakesperean
betokens
sambazon
hocked
restitute
eshet
seftel
roben
whinfell
sensys
staudigl
wonggoun
chattered
wising
llandel
shacked
oxiclean
faldingworth
bigan
riesgraf
higl
nontransparent
competiton
yuyun
xilisoft
pinkos
payre
burkeman
scbt
geofencing
frustation
sherree
celebrative
jaslovské
millecam
seadream
choos
alkerton
snarly
manzanitas
sitzman
seekins
simhon
sodertalje
boeker
iiroc
teie
convio
sampley
shageluk
photomicrography
priviledged
antonelle
rheinallt
mcuh
mahound
valacyclovir
ednos
topex
llansannan
heterosexually
palant
penjaringan
biothrax
backlick
dothill
constructability
zorab
carpatair
eary
pattonsburg
regulski
bevanite
naiades
strumpshaw
wilberger
koele
inheritence
cablenet
rosegg
barzak
farhatullah
edgeplay
ziplining
aldsworth
lifepoint
camal
intractably
ryz
branders
ction
pattis
muchnic
arrrgh
seroxat
kolinski
ilecs
admist
inactives
harperpress
bhittani
deveci
tamps
zimmitti
fanleaf
udfs
burnton
particpating
chargeability
unterkircher
unbelted
advantest
nyfd
dilbagh
priscu
execpt
tohme
nsri
daignault
kacyvenski
degredation
sucessor
evgenios
mkoyan
delozier
yein
puddefoot
britcom
acousticians
iqor
trachelospermum
updateable
circumlocutory
wayah
burell
pachysandra
mles
maime
menie
crestron
unworkably
sondek
justifed
misshapes
bajevic
daydreamed
critchell
ventor
psychopharmacologist
abde
westerdam
peggi
besties
torteval
stultified
pomelos
tiedeman
durnell
aissami
ferryport
reenergize
confected
harlestone
roncagliolo
stridulate
indiepix
lianying
wescom
cebull
sensitizes
newble
mukhsin
shogren
hagatna
ratnoff
teyona
covanta
stanish
texterity
thermador
laicism
vendeuse
handlowy
stephney
diybio
empingham
disolve
soutane
gorard
bezzaz
guerdon
hebrard
nightclothes
aasra
unpf
balalaikas
alperon
tatling
reopenings
alienage
küblis
hagwons
nodia
raffling
marcaccio
denude
wenneker
bermanbraun
standefer
cisen
cacak
emmanuella
lazarre
disalvatore
orpe
wigford
giampa
moldea
luitel
muziektheater
menchie
aggarwala
bimco
lanesfield
rayful
trancelike
coaters
alhomayed
vantages
ogunkoya
campylobacteriosis
zgonina
disbursal
jannot
redwell
nonmaterial
quane
contraints
xenith
jassen
veroni
brulte
cleopatras
overhasty
xendesktop
saney
sitkoff
sunderman
reoperation
morgenroth
pearlfishers
flagmen
taeke
biagianti
alchemyapi
sciubba
teesville
domoney
withywood
nuriyev
chonda
manarola
covansys
sbsi
baccalaureates
wenjian
indem
sarracino
cohabits
hastle
miza
hubig
beyler
kermer
morawiecki
tenneson
powerboost
labrea
patima
youming
aarebrot
rimvydas
dxn
fuds
shrimsley
sciutto
barbut
palexpo
pooni
cypriaca
polgreen
dienten
butterfields
maymon
akumal
evertson
uncheckable
mingazov
matrook
wallisdown
thynnus
unwonted
muqbil
meguiar
dibono
pistolera
tonnies
carolle
powerhaul
tirador
mcmillans
pulzetti
echochrome
raïssa
lekic
flubbing
presold
denominating
pasquerilla
lagmay
quickoffice
empathetically
gayner
zoueva
kreil
gereb
bassmasters
redcurrants
myguide
monzel
samardzic
amiry
sequenzas
accelerative
honohan
hermer
baver
nooij
dijkstal
thornewill
contrats
kaune
vatted
sherriffs
microbrews
abrazos
kriemler
hartmarx
infelicity
janjic
mechelle
sege
kwizera
aldermoor
crystalised
datagrid
macadamias
nugee
gourin
beauracracy
ghany
kajeet
burles
hadramout
kineto
seguchi
bauw
halbritter
misher
praderas
armhole
nyagan
lessel
snorky
trivialises
jabur
ekren
nbfa
shocknek
kristyan
amapari
₂
broadwindsor
hiraman
perano
uapb
swooshing
ucac
friskney
hemmerman
sanping
pasttime
gallbladders
avionica
emannuel
auchincruive
spotz
kunken
ramita
irrelevances
outstate
sidat
microturbines
kidspace
zinnie
ulumi
pastırma
kilcooley
narah
helzer
portioning
bimpson
fenners
peaston
trotty
rezac
wisconsinites
ssos
dohyo
blighters
ranchettes
twangs
dushane
grandiosely
steamfitters
eiras
slovin
pollicino
neurotypicals
abdikadir
malseed
vogelman
theroy
tranquiliser
mbango
shrewsberry
bensedrine
nxec
jonquiere
wedeman
noncombustible
goalwards
barzegar
moriston
ardossi
kauders
benabib
hypercholesterolaemia
annuit
maginley
kanoute
bedells
fideo
rexroad
yamaichi
reimposition
brynwood
eppi
travi
ruegsegger
freeny
lemmouchia
enfolds
huntresses
philanthropical
areds
wyevale
uibhist
facebooks
slithy
braymer
inverie
ultrawide
vlautin
pitau
smutniak
magwood
daleen
ostracizes
pinchin
schumanns
kloser
hotsauce
tushan
unauthoritative
mullivaikkal
bryld
izad
selphy
kornat
crousillat
headquaters
dolgen
roscuro
rabois
swietokrzyskie
murewa
fouth
morfessis
eilen
mollway
engla
euthanizes
critize
ivalu
lokuarachchi
westco
afterplay
readably
locati
poienari
leonnig
zezima
soundwalk
jmcc
tillim
moblog
bloodthirstiness
guama
chels
liplock
baaj
kylesku
riann
hohlwein
onelink
baalak
quaggas
tallan
minouche
frecker
chemaxon
pennybags
brekk
hebl
agbogbloshie
flagstick
samakuva
brundige
neuralgic
qdi
newswipe
beaucarne
chenxi
timeworn
aozou
lampost
vigneri
hoggle
yardumian
mapson
merisant
okagawa
ciroma
houge
gubay
pynzenyk
tacolneston
khumjung
ppip
edry
mackiev
clai
kadek
jarraud
gtis
windmilling
carringtons
quivered
fruitadens
puzder
ambitiousness
hakstol
waino
cucchiara
rgbe
shuvalova
dwygyfylchi
coverted
castrucci
readier
frz
pupkin
nowthen
underequipped
ambuklao
brumbelow
biotoxins
wolrd
fabada
reeep
graunke
iog
dismasting
useem
lanahan
binya
nayiri
nawiliwili
eyeballed
outperformance
tomcod
nickelsburg
muriuki
durman
dannat
tavernari
whitewalls
saaid
zeen
warshow
koolaid
mpgs
winced
hounddog
chilga
eckenrode
lensless
hypermobile
decentered
mcgauley
miyanda
momment
shafilea
ranabir
fathoming
laret
shedrack
refinish
maruge
zeoli
ferrugia
dioni
hettrick
rozek
comito
pleasured
cipulis
biogasoline
gnep
balloo
replastered
zolty
zilinskas
havelsan
econetic
rcdm
fattoush
foregoes
gojet
greetwell
tedenby
newshound
grabel
atterberry
aleg
amitay
squeezable
finty
celebres
turbervill
caydee
shillue
txema
fairmead
fiscals
vaziani
sintim
trandahl
khanani
fuggles
immunoregulatory
stupefy
hookline
csaa
cisowski
ambrosial
miljus
cyclamens
getback
vancil
withypool
ncsd
shoulberg
ghesquiere
tabarez
schoppert
fallons
maarof
enw
cbrc
certainteed
peniket
inceptions
chowdown
legette
pebley
ushs
astarloza
burrata
jazairi
dcra
patisia
daryoush
couderay
caiafa
chortling
enernoc
becaues
desalinating
shandley
tonkovich
oxygenating
caminata
polner
nakouzi
gearheads
nogar
listkiewicz
poteen
gangar
hoptman
overprocessed
dubernard
remilitarisation
spianato
bathpool
bagrami
decitabine
utzschneider
intego
houpt
nordholm
amorin
shujaa
sadka
questionings
muranen
nekaris
datelines
sholar
uitikon
srch
inexactitude
cothay
hengeveld
griso
onesearch
defrock
foister
eruptum
vincebus
blubbering
reasor
stihler
multicare
zazueta
supranationalism
mapendo
treizième
paxi
teletalk
toing
scabbed
desia
donnachie
shulock
begor
chappies
yogpeeth
backpedaled
jermon
flinger
kicanas
trigano
cottrer
automats
utec
konduz
jauntily
lukesh
druskin
khwani
infosport
dederang
dobrynska
accusingly
napolioni
tunison
icimod
jops
fireguard
openadr
geocenter
bassingham
mcclatchie
wolkowitz
tuffah
ezawa
alnylam
shanghvi
jayla
zarar
adtran
whap
abysova
gormans
healthplan
faronics
taback
catcall
hcis
jeeze
orchises
osserman
optimiser
mluleki
piggledy
kayrouz
holographically
betw
zwirn
cazaban
boerwinkle
inglethorpe
doomsayer
ponden
microwatts
assimilator
carnochan
dfsp
tanarus
sakazakii
benisch
kowtowed
bardelys
borispol
haeusler
americanconnection
dansky
imaginer
enflamed
collery
ravikanth
immy
cancian
trulove
choper
leegin
tickencote
cylinda
hoevelaken
ionizers
intratumoral
pambo
szekesfehervar
kezerashvili
quyet
montenvers
helfert
santaniello
ciil
vdma
transubstantiated
sdlt
baseer
thweatt
letherman
jillings
highridge
dornin
bronglais
porthgain
haveman
omegle
lulucf
kazakhtelecom
ellingworth
kaseman
airão
standstills
nkomati
cowpokes
goldnadel
aclj
vaccinators
richarda
droperidol
fastnesses
eulogise
crosbys
shuzhen
marran
shoeboxes
casden
paradizo
genuis
yaalon
addresss
vrinat
klyuev
caberta
jasmines
takanaka
geocell
thandiwe
prammer
swordbearer
enthusiatic
goldheart
gyude
globespan
aaditya
gdgt
hoeness
prilocaine
palihapitiya
perisic
anchorfree
norned
wronging
wammies
sublingually
dobbies
riener
pongs
bezunesh
zya
rosemeadow
dsny
didden
levell
quann
schaedler
anwa
sangomas
mzymta
undreamt
babikian
diggles
howroyd
duckworths
guoyuan
gentzel
hlavac
ashlyns
monarc
afront
druick
plantiff
selldorf
snickersville
arisman
falacy
refind
kokol
shopsin
unticked
huska
uyeno
pettys
leebron
commet
ubogu
shadowcrew
unseeable
fppc
petfinder
pizzitola
gueffroy
guangjin
wanly
tyrannically
electrospun
omachi
cvent
gandhians
dorice
titbit
zwolinski
freedivers
paltel
gasior
biederitz
meuris
nongame
quemener
feldmeijer
bargal
imagenet
pomerado
penaranda
gumus
overindulging
obdii
interinstitutional
ausnet
mxenge
outgrossed
missileers
tarvit
koopmann
cizeta
flahooley
salutatory
overstimulated
meeru
tricyclics
jamarr
bocardo
donini
llanwonno
hogganfield
butylparaben
spoofer
chandrayan
moutain
wakuda
serried
callused
elies
verryth
zackenberg
helth
colemore
powerschool
bidonvilles
pucillo
delimkhanov
dummying
jakon
iress
sergerie
calica
egeon
hatefully
wisbar
vyatchanin
tecsar
soffa
peakirk
lisen
brudos
tiedown
rohen
qarnns
bootprint
racf
limc
lissett
danly
schwemmer
scwr
hardoon
brunnermeier
shanika
brooked
ginnis
nonpartisanship
journee
schafft
kveta
commisioner
niueans
kandeel
goovaerts
bandhs
carinci
neimark
booysens
sadou
tallington
agraz
putterman
helliesen
seliga
zinkhan
trickiness
comunities
diddling
hearin
zurabov
caplets
bonking
teichner
beardshaw
soysambu
beting
keshawarz
whatevs
ergometers
acmg
sibilance
lania
activant
eudald
pincock
probabtion
leatherby
falavigna
passeron
vigliotti
sherifa
fahringer
fezzik
palguta
encomia
modahl
actl
blazwick
koolman
ephemerals
canós
guiliani
kyemon
slavonice
cuautitlan
quaffing
kariana
norview
iomart
incluso
dmea
sparaxis
wheelabrator
horsewomen
kozulin
yuesheng
devean
riar
easi
foushee
liexian
tolkiens
disenrollment
auwe
thouvenel
invacare
jesta
piotti
motoblur
unitrin
mwaruwari
gambril
scurrah
shawbridge
engelmayer
taniya
tactlessly
beegees
schuelke
pantechnicon
hollaway
methody
saddi
bollito
kolmanskop
kilm
mounding
muthamma
kirkbean
mvelaphanda
matrikon
muteba
massachussets
chacahua
labwani
kofe
familiarities
menzieshill
touaregs
alertly
overprotection
baltal
grandey
demonstated
mefeedia
enemys
debtmerica
renderos
tiagabine
onsi
mocap
mandelberg
maykel
footbrake
happends
hawliau
venediktov
doumeira
khaddar
roksanda
egeraat
kazini
appraoch
eymet
kaindi
deghati
passanger
noghaideli
nojo
uninspected
microcredits
bullshitters
zoomtext
extricates
valaika
ivernia
sgdl
reccommended
daeg
kaluyituka
kutik
linerboard
kincorth
slovenliness
graincorp
profiteroles
jaeggi
samenow
keffiyehs
mannerly
scsr
safaa
fainthearted
mockney
schmautz
wndc
landrin
harshing
asbjorn
airstar
mccalpin
polyone
vizhi
hisakazu
twizzlers
supersensitivity
brahima
wyburd
abeta
choreopoem
linds
kacst
denef
activerain
baluyevsky
tarves
cokers
koua
begbies
hoodwinking
tischman
ranelin
molnau
ustyurt
salmona
morstead
mishon
moskovia
rawsthorn
faceplant
innerchange
ludmer
kassoma
geeker
clamors
frithelstock
sambuco
thiruchelvam
bixley
submetering
latman
rhawnhurst
joei
mangaliso
shedlock
fluharty
norwick
ayariga
hijrat
skyrise
howgego
hoefkens
lichtblick
bisti
underclasses
austerberry
pedestrianization
thumpin
guisasola
fooding
sancan
literariness
whelans
koonse
stocktake
banafsheh
porfiri
rettberg
estore
worobec
govilon
haipe
phrasebooks
maccuish
itasha
mazut
vitiating
sabertooths
furfuryl
unhooking
dimpling
chanchez
kangai
lingani
eebee
brovtsev
desoer
pebbledash
zocor
ghorban
poisioning
dafora
cairon
shpe
subleases
artesanias
dodes
quantapoint
mehmanparast
mwcnts
alledge
calvey
talb
editon
cvvt
gateaux
waggett
babnik
druckmaschinen
allido
phok
gjelsten
lostwinds
hodr
goumri
decisionmaker
aussiebum
gigaset
iqe
scholtens
maplecroft
underinsurance
jameelah
lipoteichoic
ruhlin
unilateralist
lungomare
redenominated
tilmant
sallon
telebrands
squelches
landaluze
recipie
extremest
howardforums
rudakova
xiwang
earthweek
lgfl
qmd
kowalczuk
padaca
queloz
jegou
sharzer
petricone
nanh
studioso
arenda
brager
vikingstad
clario
ruenroeng
gazzam
woodturner
dloc
campolina
fourballs
egaming
paraguana
mashers
migrantes
everchanging
mandabach
hochhalter
popmoney
hongbing
laborda
nouria
bottigheimer
jenco
ningming
hyperglycaemia
kanjo
guthlaxton
hateship
loveship
marjeh
zentiva
aipla
kucherena
glammed
intimidatingly
buzza
kryczka
longabardi
wellie
fastskin
korns
tiotropium
izala
schoonebeek
casadio
tarictic
mikalah
hinnies
piscatorial
hitchmough
nimetazepam
subsquently
wickramaratne
connubial
kukovich
cannelloni
reline
bierwirth
frykberg
morcone
ebad
guosheng
littondale
primario
masalas
caten
ilston
androulla
kamecke
gregorys
toudouze
gfatm
zagorsky
milca
fqhc
atkisson
saidullah
rajapakshe
husic
ulanhot
cahall
exasperatingly
tupay
markmann
meïté
presumptuousness
hökmark
agressor
tridel
harvy
pottering
gockley
schoolday
jalasto
sublets
eatinger
rubinger
valmon
pantomiming
sklaroff
whv
biotek
charlcombe
mulege
cairnwell
gelete
yedda
storrar
naion
viglietti
derocher
soat
tintinnabulation
gerstenberger
flywire
fleshier
kasandra
nealis
tatis
ccop
lippes
schlesselman
curnock
mettingham
encroachers
heckendorn
encierro
banknock
nyron
fantz
sideburn
dufus
finnkino
delron
impoverishes
bidded
hypalon
speechlessness
kukeyev
mosiman
lagoda
langas
wicus
huilongguan
contradition
lawdar
mandhir
otemachi
inclan
bigamists
krupicka
kneeshaw
freidheim
murderland
sumatrae
brookhill
owchar
keratoprosthesis
yousufzai
rearend
fahid
kuks
capts
khyam
dermont
addow
luach
balleza
rouiba
notnowcato
angueira
crissier
leshy
caringal
owly
wheke
pancaked
tosspot
fyle
zalon
alamieyeseigha
noiselessly
durty
greenstar
akitas
publi
terekeka
llandegai
hinderer
gageby
skena
gelvin
richlin
arkefly
veracode
ejupi
webmistress
ladybrook
cognis
soberness
moyross
marguiles
lineouts
schifilliti
magnit
jongbloed
zindzi
rampf
ausenco
girifna
svoronos
toddling
benhard
noblett
scotson
chenoy
elegba
aggreko
cigarmaker
pirls
taiohae
sonol
dukane
rospotrebnadzor
shakirullah
redbush
insertive
interwest
snowblower
sakhnovski
rajnesh
zecevic
strommen
escudier
dejonghe
deadlifts
dumbasses
hexie
vasella
slma
loscalzo
noorvik
sardarov
beccaro
faustmann
recessing
filicia
floorshow
valloires
mouraria
xanthomas
edfa
dorame
samthar
baldie
sandya
lightsabre
baldes
pingers
opodo
obamaism
glocker
njbiz
havlick
grushow
lambridge
banishments
targus
socities
anathematised
quiwonkpa
novair
fluendo
countersurveillance
liqa
omali
whizzers
bermant
mammaprint
gannochy
newspaperwoman
eayre
bakircioglu
oxonians
ezbet
citicoline
trevyn
löwitsch
westerlands
humitas
kwock
dalein
sedici
oceanco
liuqiu
aigs
experiement
faruqui
illian
punggye
organiztion
cruller
dalbavie
akesson
goofiest
crowsley
forugh
pointcast
braemore
bzx
grotnes
shadowmancer
scearce
spenkelink
gracian
terentiev
janadhikar
altnagelvin
kannika
patano
westat
callerton
mouttet
unversity
leiker
togeather
chathill
vodaphone
lenney
harcombe
uex
lajolla
rald
swackhamer
narcisco
movano
balkanisation
hudyma
ragle
kaylynn
lansanah
ferl
derreck
mboka
langarica
bockelmann
tomsic
parklet
loooooong
doesent
mondory
baudour
damsgaard
sedky
tamashek
tomass
buinaksk
mambu
brusk
earlysville
kunimasa
amale
fishings
champon
wagonlit
ztohoven
appall
ukranians
preordering
udston
meshram
baxby
medano
hoteling
udofia
dedridge
golfland
melwani
costellos
eej
lisanby
hessam
siccardi
tolcarne
niebling
sezibwa
garecht
brentry
qabb
jarritos
aulani
bcause
chiedozie
volanges
gdhi
prepacked
mingma
stereoscopes
bessis
bassens
larese
saoul
facteurs
horseshoeing
ccow
avrami
grattacielo
osaid
propagandising
jodelet
mcgaffigan
briseno
kurutz
sartore
ipsley
gosens
vegfest
mundle
nashawena
virot
crimper
autonomi
appreicate
oystein
naschmarkt
gutsiest
whitehair
bivy
svtc
minijack
sautin
hardel
cerrejonisuchus
dörflein
chaib
sandlots
schnetzler
eeda
pokaski
litigates
rowly
bretter
sérac
guusje
clumsiest
didelphys
baatz
katsavakis
josetxo
zickel
dominquez
gagliasso
sarbu
feebler
tcherezov
merched
orjuela
micek
ecletic
tomkiewicz
vtn
rajive
uriminzokkiri
scorpene
bucktoothed
beachings
manhandles
microamperes
simme
romanello
teeside
bichons
shigri
roia
cavagnaro
nothronychus
harraby
tonala
shedfield
matsumori
watersense
sailability
ecgd
ziganda
fiscalini
arbeia
levermann
todorovich
velosa
spedition
pestronk
eigel
otepka
aknoun
sudokus
comunism
batbayar
glovsky
chalmé
mamina
clonan
brietbart
undrawn
liftport
albiceleste
secondi
rottino
rummo
braunlich
ppss
velti
vandyk
impresive
timanfaya
islamicisation
divisionists
tierre
andartes
wronger
tomasevicz
spead
émigrée
floozie
dunchideock
guriel
acga
gulbin
baumberger
romanick
upconverter
boubakeur
romaguera
growney
solariums
tryng
ophiopogon
truffula
aviations
shellers
mcapi
rusan
sakabe
sillito
lombax
sneeringly
heathway
zlotnik
shacknai
happended
jaksic
lomnicki
mcnamer
imitable
woodgett
restif
twinjets
srednyaya
atassut
beaurocratic
ravensbury
calander
mayorkas
glowpoint
reanna
ceratinly
kildrum
horbaczewski
continetti
bucketloads
rokpa
kosair
breathiness
zenshin
pilsworth
samcor
huanghelou
zhongping
sajadi
graystones
humphryes
tional
huxleys
mccafe
engeman
mouhammad
neogen
gujerat
beatdowns
negasso
nomuka
nurnberger
asimos
proek
gidada
walkoff
kotil
fichardt
amisi
discala
passagework
unfaded
stolfo
iside
blethering
polverini
nohria
selespeed
wheego
jeansonne
katyushas
steinfels
gaudini
childproof
immedately
caesarism
mcfeeley
emcor
udj
sotterranea
kimbal
vandas
flatfooted
earthrights
fragale
maharanis
gamcare
rabar
bagnulo
gaulden
bernies
envirolink
lussick
brachen
baalke
brammeier
maandig
levings
emerito
plbs
unintelligble
emperical
shortcode
shirker
panteleyev
stricklen
perretti
lgvs
heastie
frearson
tumlin
preoccupying
daradji
marbley
wetzels
tenderhearted
kandamby
mohacs
atpg
maezawa
democratas
pamala
stegmayer
varces
crucifies
baltonsborough
holtam
dictu
purées
underkoffler
gavaghan
acerinox
flipflop
jockel
methlick
abogo
destory
moviestorm
altug
trister
electromobility
ifrica
eyeroll
plutoids
besets
riversway
grunters
clulow
telea
kennards
akiachak
idasa
aliakbar
rienk
statusnet
peiwen
dubler
bodzin
emch
mosebar
advs
wawr
suffuse
uyeda
punctal
foued
carribbean
sewardstone
hallym
darv
glossies
santia
farmgirl
gieschen
miscommunicated
hosenball
norfed
crashworthy
geekiness
mckintosh
hamdoun
abdoulkarim
asnis
maghami
centreforum
leffel
amalya
gareb
stashower
lanh
gloveman
montplaisir
lawshe
adach
fatfat
idoko
flashiness
eeig
yabulu
maraden
tuerck
representaciones
kiest
estai
caulks
netmedia
auldhouse
freking
holbury
unpretentiously
edwardses
milliamp
grovewood
curmudgeons
mediatory
amedi
rodiles
rantamaki
cymunedol
velton
liljenquist
sanberg
yesin
moussavou
mammies
penisula
irruptions
whjy
draggable
dincklage
rovani
microalbuminuria
squarks
polyak
kassulke
karro
dyantyi
stockselius
mcgleenan
efacec
caliri
croglin
caryopteris
multidenominational
ibwa
disgorges
pasuk
defjam
vidoe
getinge
parvenus
prabin
alamoudi
wolfberg
llangors
cavorts
polysomnographic
yelizarov
pastukhov
demijohn
twiddly
martinussen
arctotis
zaheen
vistar
nourisher
myfoxdfw
readerly
betdaq
withou
cappers
carnett
hemdani
castlefields
alderholt
katragadda
sanfords
midomi
handsomer
tahlil
incapacitant
emfs
vanderveen
minnoch
austinite
lynelle
wrang
heckroth
aepi
openworld
weast
uppies
hnefatafl
mathee
otting
hardwear
stroopwafels
oxys
calvery
mucoadhesive
higest
grammel
ncfa
manceau
hpvs
falinge
diemecke
emcf
sassier
sttr
abase
fangak
guisard
miedl
jianchuan
palatas
scatsta
barrey
soua
meselech
zigomanis
conjur
haimanot
acergy
benkiser
shatra
panjaitan
outfox
ederer
banwen
razoo
lianke
muzzio
millhiser
boehlke
beroni
gospelfest
popkiss
kubinec
investama
nestings
pikkarainen
pylle
itif
mataskelekele
mannava
njongonkulu
bookeen
gonsalo
butscher
frisks
sugarcoating
cartref
omigod
labaton
thirunelli
takeback
mayell
tharston
authoritie
stehn
zeydan
berchelmann
dugg
taitu
gietz
intertech
scooted
sufiah
ayahs
schiaretti
tenderloins
niittymaki
eradicators
lovgren
philleo
aotc
ciardha
altaie
medwatch
sunshield
elberg
kiilerich
seahenge
theisman
fosbrooke
powerwave
gergawi
defectively
javeed
edathy
matallana
suckage
montpeyroux
woolie
qinan
karins
selwa
baglin
whizzy
honingham
adiyaman
leemon
kaisersaal
mazuch
rondavels
finical
ruaraidh
shoestrings
optimax
swetz
tynesiders
commutable
lvarez
pactola
teckla
uaq
garavini
bargan
arnson
pictometry
budnitz
cought
mophie
galgate
daifallah
lifelink
kivus
michalakis
inaptly
beichman
energystar
reprogenetics
hardstanding
khloponin
floripa
mountz
nfwf
lingford
guanzhou
moonquake
digiorno
billingsly
partisian
slithery
hemion
piscicelli
distastefully
rainouts
guanlan
ferarri
plymale
fcip
dundonian
kildans
ipis
equivocates
ultimata
discloser
techron
gaddo
boosterish
ernesettle
liqing
unselfconsciously
diavolezza
beegle
merfeld
financieele
zoraya
healthnet
whaples
calando
seigne
yeremiah
knishes
mahjar
anapolis
traipse
hairnets
reynisson
sunraycer
coppolino
sarpei
warcrimes
lafonta
ijams
curtici
hassans
hial
mccomber
alagic
lanrev
headwalls
stitzer
sopping
aiful
khalilah
unmelted
vatansever
vogondy
pipewell
jinkee
carlyles
abdhir
shawbost
gloucesters
towndrow
galumphing
zwinky
artacho
icepack
maalaea
floriferous
sawtoothed
kargin
mrčaru
follwoing
undercounts
cutillo
bakhtyar
indigos
aobo
carvelli
syamsuddin
juppe
courrielche
superpages
depravities
debrum
partaker
philipino
guarenteed
ferreres
prebate
depressurised
zhelyazkova
thingvellir
gossipgirl
bouchy
nohe
wocn
berlato
perishability
uhls
softphones
sanitizes
kamine
mesinai
ersberg
dlink
sirbu
solair
luthman
sleeter
refalo
chemico
cefneithin
valone
oyuela
ketz
vunerable
gueguen
kstar
fadhili
conceiver
amezaga
wellinghoff
hdy
arsd
wounder
aberteifi
stantis
gustines
protract
doglegs
cevital
stoltze
matee
chunxiao
smouldered
poppiest
palsgraf
scarantino
zyskind
nishu
tightropes
aminatou
datai
aiptek
dogster
catchpenny
withall
fretta
brackstone
schlosshotel
jackknifing
ramnani
hassie
easer
habhab
tiffney
kogelo
coartem
swatton
teilhardina
wazzani
baupost
snowmageddon
bareness
roussey
alekseyeva
mandanipour
leskovar
bershka
cyberball
shcharansky
kaczur
kaddoumi
hetreed
fatton
indefinably
butterfill
cashcard
cuvées
detemir
anmar
joulwan
drachenberg
mackwell
collarbones
derner
belier
vallos
talil
requiescant
cavet
fwi
moubarak
lewies
knightian
consob
grazzi
ajib
donemana
fentimans
conservativeness
blands
ndoro
filadelfo
mohammadu
parodists
purtle
footstar
lahme
fioricet
miniaturizing
barraging
namanga
depoliticization
stromile
moggerhanger
newedge
glatthaar
saadon
lavandeira
debars
kelynack
spinned
vodden
kivo
luckovich
extemporised
blook
ballie
yodlee
takhta
isoh
puijila
caraballeda
nagakawa
szema
beserk
scianna
wetzsteon
demersus
billfold
jonkoping
conversationalists
noooooo
karron
bisel
milinkevich
yakobi
akenfield
morisette
yonghua
kiamesha
sightscreen
tosher
steeplejacks
overaged
sacia
hitar
waterhealth
cousseran
rundquist
pricesmart
pelekoudas
kapitula
chlorphenamine
ijj
shamble
berloni
ancop
karosta
fessia
menchville
stanchart
royte
khafaji
horsebox
gontarczyk
marida
kyllachy
slainte
icebar
posher
lennan
faduma
derossi
multitool
sitution
parlimentary
insectile
deaniana
greated
nvtc
hollingwood
scheidhauer
mcavan
qiuxia
mogilyov
enfamil
ibmp
lioubov
unemotionally
nidever
mugatu
mizutori
seaweb
korpikoski
edgerson
overcompensated
murderabilia
alemi
rivieri
urique
yaas
nmes
tchoyi
smooshed
sozen
lenes
misshaped
therrell
velayutham
indravadan
jacada
melodramatics
adventurously
zyad
chookiat
barsch
yashchenko
uncombed
distrait
ehsas
presswire
mahlman
strangulating
arviragus
gueret
kildress
dgcx
todini
agazio
etiolated
polytunnels
scade
tschiffely
ratlinghope
kammal
ballingdon
flaschen
ethington
mullainathan
pletka
luera
supergraphics
flansburg
ekber
naeve
herrara
btwn
uglovka
tifs
postions
wavebands
svahn
ludicrious
karmon
rmps
grescoe
vodenicharov
palaeoclimate
instituition
beuc
clabo
tevanian
castlestone
kibbles
qfii
noisemaking
divey
metam
slavering
swaddle
acpt
sheesha
oseo
cynlluniau
rackable
positioners
bladimir
kheradpir
amalraj
lahde
cpeo
mackness
uskmouth
primulas
kwezi
myko
panagiota
carancas
lrx
katzenmoyer
marchioro
leamore
litterer
siloed
distintos
dirigiste
hananel
cellou
goldreyer
mamunur
vnb
worldfocus
tavalon
yuanhua
tasti
yowling
marchiani
taked
fischbein
jenky
pilseners
vistor
duntrune
mcvities
moaveni
wannop
lundwall
chks
awaydays
debry
blockiness
tuthilltown
fccp
coverity
segafredo
kurtulus
powerpoints
vidharba
acutiflora
dewick
salinisation
tqi
sharifullah
vieaux
kupisz
codrea
rigzin
schelsky
eilein
pbra
pagents
ruesga
trifurcation
eyeson
aggar
tolmach
pharmacopoeial
iaop
wuermeling
tangriev
nonas
karrikins
disorientate
rollcall
diversitas
roveri
shahnazi
sagg
uplinking
yardarms
weiqiang
mideksa
chryste
glibness
brundibar
spookiest
meeka
ahri
orphanhood
egland
commentariolus
stirbois
farflung
booo
woolfalk
pacht
trezevant
roslund
chitika
mosiello
herbfarm
wholefood
miring
rosenker
pervenche
hartless
lofters
baidoo
hypermotard
adao
dundlod
kommineni
chulos
kreizman
staron
leekes
nesim
nieporent
miasmatic
bhuna
faughan
reddock
egalite
deckhouses
partouche
ilboudo
bizare
lemonick
inpact
appjet
kilcornan
cisler
aronsen
rtts
heinousness
gigantically
correctible
refection
baudisch
schoolfield
arbritration
jfcom
kehela
aptekar
poran
trigwell
stormhold
specifiy
securus
chengwei
knowlden
deleware
ritzau
maimbung
naleo
cuilo
dunivan
swissnex
varejo
sharts
seabound
prinya
bajoria
infousa
nappers
sakchai
convit
khristian
janszen
lascurain
buchtmann
gafisa
lionetti
brokk
lambrini
branney
timbersports
epernay
effel
senik
hundredweights
abdolah
rusciano
deodorizer
eisha
unsellables
blandit
sperrgebiet
cavana
lugansky
wesbite
rebny
campsfield
frontzeck
tlale
bisquick
keig
natureworks
mitica
awdurdod
elmen
ondieki
vollbracht
khugayev
rackhams
mcbrides
rathergate
verdini
beuzelin
deye
mikmaq
rulan
kahoe
clis
picsel
inelegible
uninflated
zulfikhar
picturephone
bragar
hespos
dotheboys
seither
kromkamp
windpark
panicos
esfri
baronesse
vulindlu
rapproachment
lalai
ewwww
hafizur
theif
blaber
izenberg
haggler
gaynair
huaynaputina
wyhe
teranga
goreux
newcastlegateshead
newspeople
uplifter
ftms
sheyda
hadcrut
rentech
yalof
brockel
irenee
spewak
ewh
admiralspalast
feachem
khadak
pisaroni
lascoux
wesch
jlloyd
togather
makeweight
yamon
saltiel
edwaard
miamians
cubistic
wassouf
jube
abouba
cullingford
launderettes
tetty
sadykhov
panich
ineluctably
endcaps
sarantos
belhassen
smartpen
aeterno
sabbatino
faraya
delise
wsff
nacel
dubielewicz
unisource
spiric
makali
echiverri
liddiment
flipse
boyishly
aiss
iochroma
highwall
trequartista
explantation
ocma
barsness
vaza
yakhont
elhage
prosafe
isold
belston
montclarion
rencurel
murkoff
krensky
pouching
tasing
sydd
gharani
mergermarket
dollface
ordan
crinkles
pacfic
wallal
guidera
depressurizing
tittmann
avolio
biodegradeable
cornock
cbac
kbx
salsinha
chasman
khabab
garvis
welfarist
olaparib
warnig
nonaggressive
kotelnik
marinaccio
commmittee
arriana
garlicky
paumer
bionaire
karash
skimpier
loynes
arlem
weyns
flatlining
mujangi
sadeddin
mccright
gulay
glamazons
whetzel
maasim
backlift
miguélez
serriffe
koulis
blachère
rawah
playle
mcmoon
imbrogno
dotonbori
dealmaking
gocke
rhosddu
nassauer
pingeon
lambswool
forakis
onsong
supperclub
ialdabaoth
melsonby
euzebiusz
lusterware
sassenach
hdrs
deployability
remkes
kaviani
fennecs
vitner
cramers
konzen
hadsell
zuheir
jesner
motherese
steingart
genocidaires
quartettsatz
unknowability
tuccillo
blankfield
bedington
adlène
guskin
standalones
ghawi
districtwide
microlite
bastrykin
maidis
aher
innumerate
gwyddelwern
fortunetellers
biologies
triaging
slackens
woodsmoke
assyrtiko
prme
khalf
brianstorm
oncourse
moravcik
centerback
farokhmanesh
mijatovic
siwy
commiserations
glammy
mayorships
arrache
muxlow
husinec
petrohué
skippable
coasteering
onesimo
chorizos
remanso
fulgent
innerspring
kour
misseriya
aiswarya
basrans
lovegren
aparthied
headcheese
codeveloped
nathani
icrossing
convienient
roussef
palmor
barria
bjartur
viñals
castonzo
godana
americ
purveys
gumbos
ministery
streambase
iscd
walkies
berkovitz
mooli
agatston
malooly
fauvergue
cybertech
unencapsulated
wackier
mukaddam
wolayta
tsheri
vandekeybus
chinchen
dossia
dongria
welted
inventec
walin
indiv
frizza
dinicola
lolesi
lindeborg
muntok
kielsen
burwitz
visger
chilthorne
qmm
mohammadzadeh
molumphy
esgob
quet
seper
sdrt
traynham
bromantic
tiarra
bilham
iliza
levain
kuronen
monoprint
aperitifs
masillae
schw
fombonne
ffor
eduviges
aflibercept
sequinned
frostee
bactrim
lifeco
awarness
tetzchner
aminur
flra
rtps
adle
majore
sjambok
jpay
kerson
enteritidis
albourne
auermann
onibury
beechdean
fcmb
masriadi
kringe
olinka
carolann
darvishi
itadori
dismays
timmel
darabos
stukes
kotera
leckonby
lashinda
tonking
internap
maxeke
icbt
sensodyne
bookmooch
vaterstetten
dejardin
jacksdale
lenscrafters
lilibet
sagansky
appliqued
hallford
catalonians
foldaway
xenical
pitcavage
inversnaid
pezo
surpisingly
raty
travelators
novocain
kyosai
ophiucus
arnezeder
kment
caribia
bugrov
fishkind
tinoisamoa
marpi
kollsnes
moldavite
salzhauer
linkline
inboden
mautby
kopelow
ifereimi
simmerling
khano
manats
bradken
recommitment
diaria
transcriptionists
tajbeg
suntower
tyrannized
samothraki
branam
soakers
comestible
belizaire
schrimm
sequella
faingaa
diverters
crackhouse
vascar
goghs
tonchi
superdeluxe
blondchen
nichirei
jamell
fessing
hasnawi
salvaggio
proctoring
biddlesden
steidel
snowscape
villingili
fisnik
xtract
kröpelin
hillings
universit
cyclen
energywatch
scorekeepers
larison
hautacam
biocode
dorneywood
impressionistically
melgoza
whittome
tottered
arambol
matchwinner
accet
abinader
tesman
attainability
tonkotsu
mumblings
koulamallah
crocosaurus
effeminately
remics
bouly
mitsuyasu
dunnocks
bochi
mcsmith
wildings
heymer
brackin
isaaa
pennan
intuits
shangjin
sleazoid
counterterror
sklerov
solnik
muccio
ticer
smoothers
vibrac
barbolini
bizrate
unblended
dervin
ptsi
macnamee
includ
patissia
arnish
norweigan
elloughton
vitalijus
gumpertz
asayama
erres
platell
wesport
openleaks
selimov
ducatis
belesis
kinuthia
galeote
neeka
kennametal
infida
kickingstallionsims
palmiero
vagabonding
lemere
tynged
trelawnyd
ravizza
massification
tiesi
gaurd
jeffrie
hastoe
loë
anchia
kourlas
cutuli
unzueta
ckin
overbudget
paglesham
campsa
cakey
wherley
vallina
airporter
ivereigh
bokun
kagasoff
hpq
lpns
nationa
dingemans
depletions
prearrangement
eissele
strasshof
lerose
remaps
beautyrest
corado
sadykov
chaowarat
lokuge
opelt
erkes
tagesanzeiger
monley
zietlow
keynan
cockshut
alcosense
hoovering
beragh
garotte
naivity
liveplanet
halfaker
calytrix
symondson
managerless
delegitimized
muqtedar
propafenone
adeyanju
mascalzone
hercher
adulated
rebhan
breslaw
spooge
centina
affadavit
raasch
cerrie
fulliautomatix
heitzmann
distractedly
kirroughtree
seasonals
dopest
hollibaugh
larque
thiranagama
panaroma
plantcutter
monnig
gilady
absoultely
blaggers
logsch
tassimo
borchester
fingerpaint
aripeka
saani
nupa
farrag
webmonkey
waidelich
bluefly
discoverd
sonmez
upsizing
muuse
cummertrees
khalef
jomhuri
laique
ciputat
punctiliously
jimale
mansoa
manswers
tafero
lunchbreak
buluk
renvyle
fessel
aponavicius
cherrybomb
ascheim
stepheson
bonite
rodenbeck
chivian
sinco
aziziyah
quaids
reiterman
vlastnik
digitalize
showoffs
stickpin
quere
yamamotoyama
lepley
underskirts
winterwood
clunbury
lucman
rassau
brandmark
culmstock
gidron
geotags
marungu
fanaro
youkhanna
masik
btps
haydel
kirrane
accordin
remonstration
kabuye
navfor
kallick
arnous
prytania
huahua
tapster
wja
sabeans
coporate
badingham
staenberg
tath
defectiveness
fencehouses
letup
bresaola
solyom
alprostadil
susanthika
pfotenhauer
lebenthal
summerscale
catfield
huiyan
malingerers
borderlescott
filos
nevelsk
garvanza
saanei
lippin
cear
kliniken
goggled
capering
compugen
karush
nghymru
gfsi
yursky
whitewashers
zogaj
corndon
silipigni
tenderizing
borderlining
pescocostanzo
cobá
lapatin
benedik
xensource
hijms
gralton
bütikofer
egleton
sejer
attebery
meadway
sarantakos
interbirth
leifert
homeside
aundre
naiz
yanyong
waldenstrom
vlieg
tantalized
ujlaki
bakiev
piovano
lolis
waterbender
dangly
deconfliction
tushy
ingelsby
goojje
chulkov
foully
passementerie
unspool
itihad
suadi
wielinga
obuchowski
hqa
brassneck
saddlebrook
samiria
hoschton
carcroft
hamams
leisle
tiririca
benetta
novey
junliang
bahrudin
sweetspot
possble
mavisbank
priapic
goberman
mayberg
siegfrieds
rollerblader
hodgett
maryetta
farar
dimissed
llanegryn
gerani
davidovic
superceding
hollywoods
okement
rechristen
kronman
varteg
carciofi
fontenette
stemler
mathathi
terrestial
beanworld
encombe
sambals
wescam
jenae
thurgoland
termism
harlemites
kicky
linkner
ghostbusting
winiata
coldsmith
bokeria
stais
updraught
khisa
planetspace
tagine
teria
woodpulp
ghaziuddin
caep
sterilgarda
pofalla
shmulevich
rheumatological
femap
finagling
spanjers
baharistan
goore
carlae
kadenbach
ôl
regim
brandenburgs
visionquest
pressingly
comitting
somebodys
hemani
astrue
cialente
iguassu
bogalay
slonaker
shachaf
nigrini
croûte
rindy
agiorgitiko
salue
piercingly
foxtails
tilers
exatly
aafjes
penningtons
ecotech
pullouts
prophete
gatornationals
molody
tambussi
sheriffhales
stintino
aberarth
vinnedge
ballylumford
anology
acham
milici
blitzers
kartagener
lcpd
reachlocal
viburnums
limerock
myfyr
gxi
trussing
prepara
acomplish
filegate
wischer
atripla
ciccotti
worldlink
tielke

miscomprehension
qtes
residenza
chokepoint
abatacept
rofo
irbesartan
supersizers
pitchout
banovic
temarii
cheste
mbulaeni
brossel
cilostazol
rackety
evangelizers
onvif
fuzzballs
mafokate
lirey
hamc
crct
swor
tranquillizer
vidiya
razlan
tribalist
transrectal
rudrani
clearedge
tmobile
roindefo
benchimol
haratine
pegylation
travelsmart
sostis
startech
cordsen
willingess
rosenbush
linquist
seculin
etravirine
revera
neoedge
geltzer
newfoundlands
couriered
newyddion
biuku
inegalitarian
philomont
vegter
stunnel
otah
einstruction
lipscani
erry
jaquiss
bahaji
oybek
murtabak
beaucastel
besluit
biospecimen
frico
kashiwada
georgiann
pdos
coronate
laxart
beloveds
preng
josphat
europarty
omurbek
habeus
buddwing
vistors
similipal
giourkas
bacchetti
larrikins
belluck
kowalke
borgesian
sysoyev
tulipan
ozcar
ustian
trindle
frazzle
travilah
sufaat
tameleo
nyseg
sterr
brimpsfield
mooradian
purke
suraphol
tradin
brookmont
churchwomen
pilau
aktc
neag
riseth
gaurantee
hillbillys
deforesting
gryon
lutens
lauerman
confabulated
greates
sealskins
veckatimest
demobilising
yopu
barrelhead
knocke
russianoff
prebon
pigskins
gelwix
bunglers
zongheng
gellers
harakah
zamar
blintzes
rajeeb
suleymaniye
whitmey
nikulina
saharia
homefinder
danqing
digimation
oilskins
vongole
puggles
narayen
chapas
udre
handiest
byambasuren
hanefesh
vsetin
scaasi
sherbon
helfgot
kushina
reiersen
nonperformance
kamanzi
teavana
backbend
inhibitive
mineralizing
merkaba
wfpa
outcalt
piromya
huffed
björnberg
kfl
intralocus
marksberry
batliner
brostek
toolbelt
deliquescence
pbsi
laiv
morens
onken
neutralinos
dandara
permissively
boeckling
jigang
floridly
mobey
miresmaeili
guarnizo
daraina
anythin
duyne
picamoles
ichan
tahirkheli
soundtracking
mycar
gamefowl
cosmetician
dorchen
excluders
sieden
aaaargh
tavoris
cartright
tsepo
dunta
appreciators
argentinan
scpr
continuos
kanew
sooooooo
csrp
gnango
skalli
saleisha
methoxycinnamate
desogestrel
misan
ofz
scapinello
connectomics
bokko
popfly
lillywhites
abbaspour
rainwear
sensata
ernies
ibnr
datafolha
cainscross
telent
dinstein
wors
chyzhov
errie
vaunt
ikililou
maiello
gipn
kerschbaum
buskas
crostata
storton
promiment
kropa
matheney
takuzo
beind
centralians
ozolinsh
rawreth
seehttp
weissenburger
balajti
guosen
sarafanov
dickler
tjian
videoconferences
sartz
reevaluations
philagrius
statewatch
badki
milrinone
penilee
indicitive
repostings
totonno
regnante
zigs
nuch
yodelers
redtape
queerest
mindbogglingly
lurpak
multidimensionality
calstar
kagermann
dcac
dimassa
fomepizole
monkou
facciola
croupiers
reato
laurelvale
bizunesh
piigs
pramlintide
slagged
lyddington
hanfling
muziic
reycraft
maintance
intelcenter
torrisholme
eslington
smeraldi
expeditors
shirli
groshev
avoiders
prgf
trefry
pegna
passholders
yamli
ortmeier
hassouni
jaylon
upshifts
grandtully
bicalho
larami
cordemans
biescas
litoranea
xoco
rry
sliter
cassens
sulkowski
kivanc
llanasa
skarv
spartathlon
chyba
leftow
cromolyn
nshr
detloff
follieri
techology
klatzkin
suuronen
thavasa
brontothere
bakhyt
hassmann
stodmarsh
nirajan
kliegel
montbrial
lafeyette
zador
raiano
reddens
leagal
elcott
ebbrell
waties
grellier
ragip
zumbrun
callahans
gasbuddy
rosalino
fungiform
forgetten
aleve
grimi
elvitegravir
daimi
elektrownia
censis
identies
transitionally
gypped
wowio
serialising
zofran
urbon
cakir
sambrano
killadeas
druz
gorniak
linhardt
disparately
estha
unitaries
subo
kipple
alterraun
annington
bilga
pintas
bamfurlong
mestrallet
halons
fleurant
llannon
dastgheib
arrasmith
synchronica
bluths
babafemi
acccording
transfiguring
nahra
cillit
dunnikier
famil
birkhall
ssrb
belchalwell
fictionalises
bernieres
aerobiology
knotz
darwinopterus
vaccarino
ashg
boxboard
yermakova
pazopanib
unthoughtful
titlists
meatiest
jaywalkers
thromb
domenichelli
benamor
beamen
camalote
kapstone
schilly
kidogo
mccolo
ugborough
febian
axc
thinkorswim
pipedown
seamill
actionability
unsheathe
seredin
mutallab
bonnieux
ohmann
baikonour
orentreich
razaksat
redenbach
tayor
blueridge
riffled
kamhawi
wisconsinite
killeeshil
schonberger
schlup
prueher
chartridge
skybet
wour
ansd
pennwell
dynaformer
levitte
ristau
beakes
altonaga
fidessa
ukravto
sunside
pickart
gelsthorpe
bigfoots
welting
kmetz
azkargorta
skhirat
convice
nixle
msde
hessayon
reavley
redhall
jaqui
hayyat
yuhe
ezee
rigley
hamadoun
nonpermanent
anaysis
grax
despondence
zattere
shirayama
bonvilston
solemnise
ickle
ljova
unplanted
caterhams
sertic
conceeded
traductor
treffert
mozena
poortgebouw
nguon
smolenyak
ababu
brittanie
muenke
djambala
remans
hayzlett
neveah
mazière
orals
lautenbacher
kirkos
glendurgan
vaers
pocketknives
cuccaro
talsarnau
brummet
agonises
categoric
pescatarian
jianrong
whti
papillote
psaier
pendoylan
akerlind
redinger
shovelful
dissapointment
cpss
lepowsky
boothwyn
bidez
bessent
jaquez
lhvs
hanting
certitudes
cbis
boxful
levai
tarty
redda
lindenlaub
ahwar
nashar
margining
mtis
stotland
sanglap
brightview
diepsloot
beirich
unwatered
nostoi
jerramiah
mindscapes
tallinder
fredersen
ambry
papuna
glimmerings
bitee
bioresorbable
jockstraps
wonderdog
spritely
nalaka
lapad
orent
overstrained
flagbearers
nacdl
keiwan
brookfields
standardises
comany
miesian
hendelman
cashley
kingstowne
thaek
wisty
coquis
tunesmiths
wirjawan
kulyash
corcrain
awaleh
isaq
benfro
boser
nasopharyngitis
fallaize
recepients
outdistance
ewenki
marescaux
leatherslade
geisen
mackel
clingerman
nongbri
dormandy
amayo
griebe
vinayagamoorthy
zahorsky
grapelli
sereysothea
jamesons
storgaard
sqf
vonteego
reistad
huti
troublingly
coday
phcn
axj
glenuig
bugling
footstepsinthesand
chiesi
laminator
piracha
sovani
preconstruction
confimed
ponomareva
koppinen
cyberstalker
refulgent
soberon
intuiting
repetitiously
katulis
zubaydi
gronstal
niangara

amcon
simod
kwhs
outmanoeuvring
erosi
characterological
khoram
illinoisans
plasmati
hezbullah
goldberry
vogiatzis
fakhreddin
minguzzi
roquel
montagano
laast
pelman
compeer
menchell
electrophysiologists
tamdan
gutow
gipslis
dusseau
heighted
zalaquett
rollmops
valat
ringstone
asure
howfield
assayers
knothe
llx
radvanovsky
uncork
ripol
nannyism
ilkla
roustam
preachiness
debator
cnit
delegitimise
copec
gelée
technium
hardgate
elgindy
meriño
schertzer
polimeks
ogola
similiarities
tourigny
kolender
interparty
neiger
moehringer
uppy
flameproof
kekule
elbphilharmonie
wappapello
enteroscopy
erofeyev
heijmans
txm
behooved
ubiles
hillaryland
wandera
khoshnevis
duvendeck
scroggy
nuzman
diamand
kaing
saliently
harome
rethoric
extradites
ambreen
gaint
gurda
killerspin
foredoomed
caffiene
apenheul
giscours
pasilla
mannamead
tagruato
nettesheim
broemel
multitier
gusky
callbox
pionirska
semones
brozak
llangwyllog
ternus
earnhart
coldren
sowder
midscale
boslough
stagni
konheim
ndeye
kultar
lynsted
tittles
moistens
hermesh
pikiran
pulkingham
disorganize
duska
vvips
anticolonialism
goldbaum
grabsch
javis
kleptocratic
caline
undiscerning
fourhorn
garah
oik
habenular
ordman
mcfd
sprucefield
verbless
fidanzati
edmodo
esophagectomy
foreperson
ginsters
yousufi
waghela
lubman
maslan
fasulo
aquafresh
biotechnics
bryning
kamkwamba
unsilent
sidorsky
connarty
grial
saben
huval
toori
manop
pontardulais
matsen
rawlence
rewires
felicidades
chromes
blackfriar
doubleshot
shippable
greenshoe
ethnocentricity
murstein
amharas
zucchinis
balladeering
myburg
brinscall
popaditch
embitterment
jkk
metelkova
whinnying
icfc
ynysangharad
eurohypo
salmasi
bisbe
wijkman
whippersnappers
chemoembolization
dahling
iraqs
priede
dhanuk
squalling
barandiaran
ahmm
glacés
babiker
challanges
microfibers
ballweg
fritzner
rentiesville
luterbach
bibeault
guilted
tognum
tanyon
hangartner
jahrhunderthalle
llanerchaeron
maggiano
benhaddou
schilbe
advair
getahun
behsud
ecolabels
easybus
haaz
muellers
storgata
bulik
nimma
millies
disrobes
pinhorn
assylum
ewingcole
telsa
demirjian
glyco
kornblut
lanzone
flatliner
heede
skean
kuza
nunoo
idcs
haitch
voletta
bellydancing
abdalqadir
longueurs
tharin
enviromission
salsi
opsoclonus
cheapie
precisly
kgal
siladitya
garachico
bartho
basd
corhampton
tranching
gulped
gleeks
candiates
enocean
enford
anomolies
radvision
polay
brenntag
foinaven
grinches
stylecaster
allostatic
zadrozny
goeres
zumoff
presutti
wuas
dorsen
parvanova
gohel
nofit
healthcorps
responsbility
estwick
zafrullah
cutchin
matjiesfontein
raffaelo
montalbo
powermeter
igrt
beckerle
pretorian
itemization
capriole
emetophobia
pcns
etappe
waldmire
nlmk
tranquilly
cumbayá
fistral
datatel
gelastic
flipflops
safing
busks
chidanand
hydrocolloid
schmeltz
kirigami
khangiran
raccuglia
zalmona
berrouet
haysman
subcabinet
erspamer
goneva
bumpurs
inflators
buduburam
kjlh
pocd
uninventive
dekar
itraxx
fecan
estefano
dahabi
ilois
depravation
aishwariya
boxmasters
sutiyoso
kowall
ntri
bacow
acdelco
volpicelli
dousland
groundsheet
electrocoagulation
yoeli
songful
ganther
nozette
bodgers
tolstoyans
mchaney
nout
rikhye
gibilisco
naujocks
julavits
roskilly
llandrinio
piccari
faehrmann
ambles
breus
oughts
zolmitriptan
songok
windborne
pennycross
azpiazu
heritor
didulica
rasmussens
nuttier
nursemaids
mindbending
wcrx
reorients
comradery
phsa
hintermann
folksbiene
suncatcher
maiffret
toymaking
dominionists
namoc
antinational
chatigny
iafis
mushatt
wasay
merediz
microbudget
methandienone
markmonitor
biris
unsterile
hotheadedness
teegan
hassanin
nced
harab
chroust
clanrye
santouri
awfull
roofies
sebti
chumpol
doffed
wett
hidta
bellmead
mattil
ruly
ubit
fensome
kazkommertsbank
belohlavek
shikano
pmpa
taxidermic
davening
rotger
hedblom
burakoff
quizlet
teemore
digitek
growdon
vaquitas
lulus
tianji
consitutional
dauti
portably
laspina
freudenheim
pmsi
amron
entringer
hureh
roscam
qlr
zephania
seatwave
cedarbaum
barnyards
colliseum
brucke
dacic
mcmunn
ampi
patyk
acfm
exper
guanghe
psychographics
rudha
zanny
antechambers
imoke
kaytee
christanval
occar
hickenbottom
whimsicality
bezzola
jalbani
hamriya
sorona
superdad
linsell
gonks
poignance
jikei
iavarone
northernhay
bukvich
tideford
chipmaker
hobbycraft
guitierrez
saborna
pregones
awea
nourbakhsh
lowedges
heteros
bonked
gruder
merrf
graymail
antiqued
frigon
norsar
gitonga
chernetsov
wimar
scaraffia
sponged
illgner
beeped
assinine
snowkiting
predoiu
unremarkably
armintie
valbusa
bonow
spherion
moshers
asuni
tidcombe
konnect
openmarket
kamiura
tritan
weaseled
seaco
aeolos
mvne
binghams
jaider
tomosynthesis
bibit
enec
premixes
garaway
imperitive
unreformable
dinlle
rathie
absconders
deverdics
falder
chinary
schaeder
maslach
bezbaruah
saddlebacks
papalexis
etinger
tygiel
scoffield
tescos
liepajas
vibskov
dogsleds
pingwu
capasa
kawasoe
maystadt
teamworks
vsq
hypercolor
handpumps
ferrieri
tepfer
adultfriendfinder
atá
vinegared
stonings
mcinturff
bachenheimer
myohyang
icings
callouses
zonin
sedimentologists
palls
folow
underreport
fuqi
rudry
nordbanken
aistrup
lauras
maggiotto
nemser
unhusked
shannahan
alliancebernstein
makishi
bozhko
ceely
hajim
wasy
wherstead
malmen
ecojet
eachus
rebooked
lazareanu
fourme
llansannor
cuates
maleczech
donggu
lemine
ziniu
pithier
transacts
derse
tsis
henyard
centralises
mcbarron
nexxus
vinney
prosthodontist
calcpa
silvero
intellicorp
sandfort
burzichelli
kangemi
thoenes
meraviglia
intertank
shortcakes
acasiete
bertling
intermap
arabised
areh
creamsicle
hydroptère
peevey
bazedoxifene
overreactive
thandiswa
tiggs
padanaram
scibelli
koralm
wazee
homegirl
visalli
backrub
cornellier
tirri
braquet
sublicensing
invermoriston
roomet
ulnes
križnar
lunchrooms
aomame
sollitt
triangulates
cubbyhole
loomstate
edgcote
woodlief
duram
marcopoulos
druggies
jongo
legals
seera
vantagescore
mixings
temata
steinemann
papay
deanza
goofier
bodinnick
microplane
aiguo
mavity
freewave
onexone
shellfishing
alaea
vakalis
tontoh
remodeler
waynick
celebreality
thiostrepton
baghead
awcc
ispu
seditionist
bklyn
sanjida
meddygol
oversteering
euas
jhar
understructure
matero
mahlerian
cnnc
entombs
dodgen
bradda
panafieu
germanakos
bassinger
metabank
matsigenka
chansi
hsci
gnutti
lorenda
shoddiness
grievers
papenfuse
duboeuf
grooth
fukoku
ettelaat
tudful
entune
ivors
holroyde
ferrucio
castorina
yamu
countermove
rosza
mayrhauser
blanchimont
roussimoff
belben
spareness
charone
hsba
pimecrolimus
sawade
sportime
perverseness
topamax
profitting
aitches
roveto
schmader
reidenbach
daises
unpadded
matsanga
hassink
natarov
edss
cheeburger
trossi
goerens
ezulwini
iloca
dehaas
engraft
bratten
bashira
nazy
millirem
bergbahnen
fisd
nadey
jausiers
smooching
auteurist
rokke
omozusi
silan
lockboxes
schlichtmann
hkse
perbix
agion
robh
randolfo
bahiya
dmj
kleeneze
skiller
iodice
carreto
guiderius
toomy
hekuran
diwanji
gemerden
reenan
vibrometer
brigader
nacods
overenthusiasm
pegmatitic
lotusphere
monikered
doubek
spagetti
schalm
quirine
deseine
broughtons
algers
botcher
dimeji
eldrup
midriffs
loddo
maclarens
skirlaw
indivual
vaughns
ohlmann
ladkin
marxhausen
pizzelle
ranchette
nazemi
swepstone
amost
smyer
jadidah
garfit
mutsaers
iwps
wertmuller
shurov
mmas
abilityone
weathergirl
debin
safair
westerhope
khareh
aeromobile
caccavale
primakoff
daniller
intentionaly
powdr
asberg
sealord
zootfly
hornett
gelong
natrajan
duquemin
ciroc
tischenko
medievally
seaberg
moralise
dolmas
epiphanie
krislov
rocsi
supernodes
ranella
lodgements
mcgettrick
edmondthorpe
paute
fearsomely
joggle
suffusing
céad
pritchards
kaleidescope
lumpiness
vorsprung
palmitas
disrepect
georgoudas
crawshawbooth
syncsort
geminids
roncalio
talluri
uou
slavutich
vanniyars
holmans
khorsand
everybodies
mykey
optionsxpress
poltair
tiecon
cwynar
luminously
décolleté
keepon
gahs
fornasier
kuzmanovic
llanaelhaearn
oxhill
barhoumi
effiency
marisabel
kozlow
letdowns
milloud
habiger
minerbi
huarango
ofoto
compan
mixable
bonyads
reprehensibly
rehim
oozy
unitymedia
morsell
djembes
isokoski
soile
sawina
cimade
neverov
jeran
dobias
fandy
pseudofolliculitis
alosi
daudy
adjustor
hmq
mounia
conducing
santostefano
schollin
canak
mainka
ritha
oberhuber
tontines
stenoses
gleacher
intergroups
schelske
radoi
twenge
wambold
hirami
rubianes
wilstead
zaiyu
hatchetman
pelotonia
sotiros
everthorpe
roosenbrand
geden
mcclover
buddist
behnen
owolabi
crichtons
tempelsman
seamore
deherrera
garnick
meida
microserfs
contarino
marinatto
sanajeh
kibayashi
clewell
lonsbrough
zatz
telephonics
wehde
cappuccinos
mandato
rosenschein
htz
duvergel
fistric
tarmacadam
ndq
kawaja
rcis
bindeshwar
identicals
hesledon
presleys
hibah
potterhanworth
wehrenberg
bullgill
chrispin
blanketly
oarnet
hummell
jmac
balhousie
carhenge
coghlin
gjonbalaj
coky
farve
zippos
schatt
triteness
hammerle
unploughed
lobelias
authonomy
huping
simitian
bradhurst
teiwes
gunesekera
alerces
szamotulski
marantha
piadina
tonique
inciter
murmuration
apem
egbuna
frassoni
stonebow
cetv
malry
levys
motio
coalfish
belaynesh
lewycka
lekstrom
loiacono
cleviprex
kobil
duplitzer
boyuan
cybermedia
handgrenade
chephren
humbrol
externalisation
tanavoli
moisturiser
elekta
norteno
suhrstedt
cazalot
cogliati
ordesky
intercasino
reequip
qalandia
zabihullah
copasetic
ipred
mastuj
okosun
eductaion
sanchaung
ifly
limeys
masticate
tstt
gdas
chenilles
wilmet
derange
chalit
monninger
eaman
malil
brentina
remortgage
alhough
tocolytic
yewande
miraa
binged
khayami
amtech
antalis
lavenders
zweli
selker
tbrb
plops
wheller
annalyn
desarrollos
fréquelin
reinvestigating
bedwetters
derivatively
mestdagh
eversleigh
petcoke
prinicpal
ponsor
okulitch
nimwegen
conmy
yaish
saleroom
indoctrinators
antihelium
touchiness
benatti
nelsonian
disapplication
magret
concerend
jihadia
taspo
edmundbyers
mingliang
burkin
bankend
bishri
templarios
bleah
eptifibatide
watercrafts
melanio
kriton
derogating
otellini
kimchaek
subfertility
situtations
arbes
konzerthausorchester
olusoji
fittv
vcenter
paywave
mruczkowski
aselefech
slowmotion
kimzey
prolexic
unimpeachably
masondo
ciaramella
hexcel
kouwe
nanina
shresth
calfire
kashina
osyp
zhanshu
reinfect
bbeb
gramling
supercycle
dandala
kushlick
upadhaya
massoglia
josephoartigasia
jermal
galorath
jingxin
boobytraps
shameka
nisreen
vertic
vados
arfken
huchthausen
ludie
ekathimerini
sliproads
formicola
guangzhong
khrzhanovsky
wiedrich
apsos
sosthene
woodsetts
rajasingham
bareham
ozian
laserman
vivifying
delker
medrad
proselytised
timewasters
whieldon
movahedi
forechecking
frangelico
clubcall
benemann
banderilleros
muscial
ulph
soilless
ciot
aertex
puliti
yelo
dengate
englishby
dessicated
gubu
chaturon
daep
saparmyrat
amflora
dăianu
haleva
raillery
maibach
kayleen
riscal
kerfuffles
lithang
wolkstein
cairness
csfi
qiuping
peschl
rayport
terranea
unembarrassed
murshida
nessuna
panchev
yorman
aspiotis
condeming
gabbitas
sharga
judin
ustekinumab
htar
ncbe
chepkwony
amscot
eilam
slashings
bellandi
surdna
brisard
oestradiol
kalyoncu
coyner
devilled
intevac
clesio
mahtook
humer
ukyp
pitarch
martensson
zoomlion
housers
amonst
shamsullah
dejon
nikpai
merchanting
scarsella
personna
khandahar
swibel
gosat
hounsdown
sommerset
stymieing
shennawi
regreted
hayon
moldering
eicc
mchales
agrichemicals
photosensitizing
xianrong
amortised
rupf
westi
spinneys
mojacar
patronisingly
bakio
oppposed
patchan
tunisa
payman
gyawali
yext
mitutoyo
complaisance
blagger
praj
creige
heartstring
prazdroj
crozes
plenteous
expatiate
mitchler
rauda
powerlite
asurion
delaminated
focuse
muscats
iolana
talaban
thirsted
kemple
bsic
inishturk
bierd
gradebook
monetised
frenchies
arenstein
machiques
vardas
atteberry
armento
walkathons
stromgren
unprogressive
fishlike
bottlenosed
pongpat
sarar
universty
leukaemias
dollase
dervock
porkies
crociani
nurith
althof
plighted
raunchiest
shimange
krepinevich
guyette
tsumori
wosa
jelko
tecnicas
slaczka
tussing
naïvete
whiling
tranzcoastal
stringless
megg
dzg
smex
buguma
mckimmie
murerwa
orated
sabban
sieze
subconcious
memorialises
bolshy
animalis
domene
professionalising
asfb
welzenbach
shwan
djerejian
killyman
curtailments
greeleyville
bazzana
benderloch
lislea
dimitriy
offerd
varki
sandherr
balasingam
netters
unbiasedly
mckinleys
lohier
sadrists
frazen
iotv
essr
jof
amens
spasticus
torbor
dismountable
thorniest
hwn
elmosnino
jonabell
eclips
rousson
mcquire
affrica
ratuvou
moorgreen
roniel
paque
alaves
lawtell
sparsest
serracchiani
anichini
countercharged
prooijen
ayverdi
porwal
gaisman
madcow
bmed
ugley
lifi
chearavanont
onmobile
bechtol
colb
robichon
subcommitee
golfed
roominess
hydrochlorofluorocarbons
silkscreening
siedow
sonders
agrifoods
cebada
zygos
nadene
poplack
unscratched
clusterf
nadj
robow
brunos
hygeine
genou
namesti
aerosteon
racanelli
axigen
kokk
muglad
sciandri
sinbo
vilaya
piquante
schuchard
portraitures
divvying
drillstring
hysell
razai
finamore
bolatti
yaqi
simpatia
pastéis
harbeck
helfet
lernt
alkham
waddoups
timberwood
oviir
comiston
viennoiserie
cromack
superfetch
brodjonegoro
fothen
insatiability
tomcar
senitt
cybersquatter
mcbrine
fesman
sandbakken
livewires
ranolazine
margon
bikeability
stevenses
npqh
trebon
groclin
oceane
otnes
geaux
boparan
tabraham
derryveagh
skrovan
lundwood
lifchitz
vhtr
harpie
sproxil
meleshko
toshin
chartz
ndmp
inconvertible
iressa
aulestia
valte
cogmed
abdopus
ocxo
monz
picnicked
pessa
inju
oakerson
qorbani
cavelossim
unbutton
deede
portknockie
inviter
leape
sportsview
reservedly
monshipour
deplaning
zaidel
sgca
savir
bluedog
cochinita
bourgmestre
nabp
solucient
rahabi
jumpstarting
profiterole
gayby
skysong
ayron
hainley
scotten
gubanova
baysinger
pfirter
glyne
monzer
peterstone
bitd
kramish
higgitt
implicity
rosselle
bowlsby
adhanom
reauthorizes
miskell
sovietisation
ebanos
kerkhoff
denuo
edwardsport
chowkidar
biha
firers
endline
seatguru
penon
prevously
kokilaben
diquigiovanni
bioprocesses
vaki
nukaga
damhead
deliziosa
silbergeld
kavaler
associado
bieldside
saslow
peoplexpress
ruijten
owlett
dualit
kayli
rosliston
beguildy
starborn
abdulayev
calvine
langsdale
vicriviroc
thoelke
ecojustice
jafaar
ohrnberger
carmeuse
etteh
gesticulate
hellabrunn
löwensohn
weiqiao
mrozowski
hnr
frontrunning
byv
numonyx
jazdy
gmcc
wiegele
mycogen
inexpensiveness
ostional
abdulhamit
dorfer
decolonised
bartolone
camre
balgreen
crapware
rhoton
greensfelder
vulgarisation
fordow
demarse
jerviswood
samaila
raey
bohmte
lemish
marginean
arabize
lathkill
subtherapeutic
substitue
cristovao
cranshaws
unputdownable
ilyena
dewynters
handango
doxil
weaks
abbotskerswell
screenprinted
numu
reengineer
milleville
esrock
ekuban
usura
nurudeen
stipp
aanenson
mediwake
postrevolutionary
vemos
pernier
xijun
mircera
aanensen
frediani
aparthotel
brunhart
oestrogens
fadillah
antibe
skelleftea
lopini
jeralyn
khazakstan
duplain
contestability
semde
sarfate
krasnovsky
enage
hanggai
wafic
sodahead
higuain
overages
mallarme
toroitich
diwrnod
grassing
bowd
rebuilders
jipijapa
summerscape
marchell
changming
advocat
mangeot
maddan
fangping
backwords
surono
mieka
schelbert
mordiford
hubers
tosetti
incalculably
changzheng
probings
icemaker
serialisations
odim
huascaran
nrta
chukkas
piepenburg
bodybag
adfl
sportsweek
pallières
kursumlija
armorgroup
destabilises
betutu
westren
transmeridian
tresch
tlaltecuhtli
shaley
eang
grubbed
balletmet
dawani
groenveld
haitises
mêlées
abdulatif
ghleann
unmuzzled
manganaro
yemma
corbey
nybc
changyou
resilin
watchfire
puddleduck
nafeesa
lagae
hanstveit
prw
protetta
beigh
lerebours
skyvision
dousset
plique
presagis
korki
taquitos
paysinger
arrick
redknee
protracting
cclrc
throug
mintues
ichord
monksfield
moredock
bezu
mardo
tavin
frolunda
indictee
nomal
nehls
sporks
dockmaster
hurungwe
flashplayer
alanda
jenab
neura
bpca
jisu
aquaphor
omalos
northstowe
headleys
halftimes
debattista
airclic
ailea
tiridate
buzdar
gogorza
schuth
chelopech
gruffness
eubam
mzima
qori
vaporises
arkaitz
commerz
mauric
kitwara
wolfbane
deaves
fotch
murgitroyd
kranitz
bergalis
leguizamon
somova
awatef
kegger
charde
biorefineries
sliwinska
schachte
schmidbauer
gareev
aptina
vanny
shuff
mashore
warings
jaffree
bougrine
charnov
lifebuoys
kolzak
mitrovich
amarr
plantsmen
aquadrome
ealam
medtrade
okmok
eqi
muslem
nakamitsu
haibel
dawon
patriarchical
beetling
sedler
falanghina
overindulge
aqal
odee
gracemount
alperstein
zerline
subby
fordlandia
bobbling
palygorskite
xiuli
tuppenny
cacciato
ligoniel
nagorski
radicati
gollnick
sedm
narcoterrorism
icenorum
lumosity
michaella
hayllar
surrexit
baumjohann
dtcp
goodguys
rackauckas
listmania
petpet
izy
wardeh
llanteg
molby
korobka
haifaa
mewling
weiyi
lubanski
vampish
scrunching
stenstadvold
itemizes
rosebrough
passout
zizkov
tanchon
isman
thongsuk
impenetrably
handsprings
chowders
cavu
chmsl
partitionmagic
beychevelle
drachkovitch
mobilerobots
rassouli
airgroup
ferley
aphibarnrat
beamz
ciullo
epitafios
tamwe
marauded
astringents
tveiten
safelight
germay
combustions
retrovirology
pazdan
ccbi
brous
cambusdoon
mersky
daofu
retronyms
forthlin
powertools
tanaz
nonja
yeasted
dematerialised
josephsohn
drumlean
skacel
hanaa
supersmart
accumsan
sagall
penniston
yaqshid
monsterpiece
shimu
pixantrone
olsat
discriminately
mcmeniman
kosmin
tolkienesque
harbourne
monitronics
yastrzhembsky
ghanan
patronelli
nwaubani
euripidean
dalke
lauzerte
soderick
xingwana
keahole
mbandjock
cluess
pyatov
babeland
faidley
tamsir
perwer
bazillions
puggioni
staycation
totalai
leopoldino
piedro
turbulently
glacéau
matsukevitch
hydrofracking
nttc
vigreux
herol
deltec
torregrossa
synott
falker
hillwalker
rotnei
fonteneau
acfe
vanhoozer
ndjeng
moussey
theman
khurma
quickr
muthama
jotspot
badescu
gegax
potterrow
prepak
montlucon
palely
desirae
internacionale
gartloch
maikano
altwegg
eyries
cajuste
probelms
sullum
schlack
marsalek
allbee
bourgs
curtness
kitzsteinhorn
chipchura
newthorpe
turnspit
comediennes
zugibe
forber
boneva
nuttiest
snowless
ciaron
llanbister
interet
edmison
lagazuoi
nekhoroshev
jochelson
kettuvallam
lacapra
sensio
macroberts
microdeletions
farmiloe
ranaghan
shaida
lynham
niemoller
bodipo
happart
niyombare
inkom
muscala
blaencwm
gebremedhin
oikocredit
sochacki
nnedv
makuti
subsiduary
twon
ovv
extolls
aserca
croaky
capestang
huaping
nemtsova
bettag
lirung
sporrans
xiaojuan
bayble
therre
taurid
krogers
lemonades
highlighed
negitive
patriarchies
spanbroek
kuular
anthropomorphically
huske
themseleves
barloon
luchow
kuhaulua
krauchanka
castellations
thingamabob
dozsa
bebidas
poseyville
jolicloud
riisager
irccs
doorframes
shaza
sitarski
waldhauser
mynx
lohberg
cameraphones
miscalculates
samoilovs
thoug
maiquetia
flylady
bijarani
shuaa
mediasentry
bumptop
fanciable
gaillet
rueger
lhalu
richmonders
tronchetti
kasarda
hubzone
waterproofs
sarrell
aveyard
garley
chumki
zhaohua
dunckley
sharkfin
allseas
shohan
siglin
intensivists
healthfulness
collecter
bingeing
springbett
hizbut
aboe
stroik
pipien
theophanis
bocar
maulawi
alecks
mckerr
diabetologist
garduno
smain
campath
seaboards
indridi
clogger
kajko
entu
ecch
clunkily
brinded
placerat
dudus
arglwydd
camenker
noteholders
bagneres
onpoint
masurca
jangbu
vroon
resevoir
congest
sigurdardottir
mosebach
vysehrad
dhra
bioequivalent
instutions
malago
laudably
gratifies
poligon
pimentón
neiss
sagebiel
iferouane
fraziers
bashings
drumahoe
naed
chasis
planeload
bolch
wyka
crystalize
winni
kimmerly
conspiratorially
wegrzyn
fastcraft
humourists
tiankai
pyan
sapaugh
bongiorni
yokocho
gyrotonic
wolpin
panderer
bosket
mahyco
ortigia
squaretail
comunicacao
metrozoo
porthpean
wenlan
halpine
moreman
nonobvious
sanea
bamboozling
gardley
appp
defenitely
ombudswoman
mahachi
remeliik
napha
beanomax
bedwetter
polumbo
vaugh
monsterous
trojka
deodorized
interational
xiva
weitman
fewa
bayakoa
hilber
islamicist
lodden
bzo
kamilah
buonafede
microchipped
pedicone
mekurya
intelisys
superko
baijal
jagot
dataspace
sscp
fopr
sinhalas
parisotto
vxs
wollam
toumazou
ruhal
gorgia
tunander
fahel
iqt
lunke
mandley
pinteresque
gloomiest
pimms
dogtag
onlys
alsheikh
securitised
pelttari
peformed
scibilia
zamagni
alcindoro
colleages
priestnall
spinnerei
assns
halic
kuvin
brankin
srgjan
benia
rembrandtplein
montanes
weath
brisconnections
rhit
authenticators
kgomotso
benaouda
hastilow
housetops
cabernets
strappy
hamouri
thiazolidinedione
bdmlr
caravaning
salumeria
senol
grabauskas
dowtin
apffel
qadbak
bandipore
clarcor
healthequity
scarfed
nwlb
jongjohor
maximun
pxs
gambols
dykeenies
sobah
katchit
wreathes
larchet
ozkok
boryokudan
abssi
ellacott
schoenwetter
maresa
summitting
orowan
psychologizing
unkindest
atondo
sifrit
elnaugh
antoura
parro
unmeritorious
bimatoprost
bektas
inconclusiveness
morlich
quinstreet
equatoguineans
glencrutchery
ffelp
pullig
khaemba
bacongo
gyenes
ladymead
pantomimed
grittiest
hamrlik
niace
farmelo
mokhzani
tnrc
rals
jafarian
rockwalk
chatila
eembc
castparts
maque
hutley
deutschendorf
rafeh
macuspana
swelim
khazi
titanics
aminov
devestating
krueck
glocer
sewta
donadi
bloodbaths
marlowes
transdniestria
sigfredo
proskurin
techshop
hotpots
hymens
standerwick
indentifying
krams
pearc
mandideep
lilybank
trackpads
leanest
pamm
ideations
verigy
swarmers
fedrigo
pones
narino
mutty
guéant
resouces
tincu
electrolyser
smeraldo
walport
panalpina
divens
groenink
lavalife
varoni
pegman
teverson
nenadovic
altoum
baaad
kyam
sponsorless
immunogens
teriberka
massmart
tourbillons
libreta
mtso
israa
zegart
capsulated
axarquia
kqet
aranzubia
belayed
cicconi
sedulous
homogenise
shorteners
logcap
agritech
coleite
deqin
youngling
apale
mautam
mdbs
tunworth
mollenkamp
ortak
advents
underseas
seedco
heimburger
jitesh
llanfaelog
ardena
cotecna
sanctities
langeled
bradworthy
pcec
senseo
dislodgement
thiérrée
dinnis
pedwar
ferrandiz
oxhorn
racki
teleprompt
varoom
alluringly
apdm
ghaemi
incomer
pincha
bonacini
orphanos
connesson
katrice
morukov
placemat
thayers
sindell
farver
numrich
ccan
zhengdong
ripani
predominently
semaj
chaaban
titizian
wegeneri
vavala
beidleman
friddle
kolumba
exeptions
zazz
whatshisname
peerbhoy
amerisourcebergen
crosscuts
brightpoint
bensted
institutionalizes
houmas
disregulation
muwaqqar
aidem
bockstein
perler
ngmn
gamco
counterprogram
opaquely
videoscan
mufamadi
trabaja
popelka
rignot
wiercioch
mobaeck
vasini
mandab
asbeck
basye
klompen
torsions
transnationality
penalver
goosegrass
galais
strinati
almarai
rajoli
zherebtsov
tzetnik
kucharek
mishandles
hatlestad
nderitu
civl
jarmin
mpiranya
sticklepath
enaam
stilkey
tahe
harrowingly
dfdr
fertilizations
urosa
delocalisation
lounes
farl
steamie
jangled
eastbay
kumawat
hypergravity
wyocena
marcarelli
hyperx
obstructionists
torjesen
drosten
alixpartners
loughmacrory
inserters
nailbiting
assura
davendra
tavleen
beddings
nosiviwe
nonimmigrants
khorfakkan
uncharitably
kauzlarich
surfeited
foreing
tobiko
barfuss
kenagy
khezri
hardnosed
piseev
controllably
detoxifies
aldhouse
cooden
egoic
safarova
abeliophyllum
beckettian
turen
multhaup
scaleup
jamarat
bejamin
holmbush
cultlike
robertsfield
naftalis
illimitable
ribeirao
decongestion
stonebriar
amareleja
pebblebrook
janks
juvenille
hidell
marshmellow
altovise
langleys
prateep
oaaa
culpas
gadio
restructing
mcsd
sicha
hutagalung
eadon
terner
wahala
sarahs
depilation
beatify
firswood
roseacre
dhital
yambuku
arugment
miere
bioindustry
versas
perranwell
polukhin
denouncer
monoethanolamine
eventoff
ripest
henretta
haythem
lucentis
curtsinger
yasseen
nutall
benyoucef
petrey
twynholm
dustmen
netscout
brüderle
strassel
laviera
pashos
docstar
dorer
vampyroteuthis
nonemergency
nepstad
naysaying
eletion
aberchirder
vodopyanov
prasquier
smmoa
feehery
ustari
embittering
reattribution
oserian
dorronsoro
tiblisi
jupi
iziane
luckly
ruralism
chatom
fuisse
aleki
becha
valaichchenai
degreee
klapheck
soheila
aldurazyme
bancaja
voumard
khalde
defier
daintiness
ataga
lambhill
futzing
ballcarrier
kanah
itsik
yunchang
nantmor
inconvience
saeqeh
buks
stakman
khromov
cliffton
dilemna
mcgennis
partsearch
inhospitality
arss
otcqb
allianoi
duffuor
raincheck
horlogère
hassebrook
piella
hnt
accessdata
keatons
zoback
mccastle
suker
binoo
firova
microbiologically
villified
unhinging
ndure
mastrick
rymarev
nonscientists
fayose
agerpres
oldfashioned
redpolls
anncol
torgan
odebolt
auy
titians
boxworth
guzara
impotently
peaceman
outflux
disquietude
solans
lvef
oxc
pieraccini
rituxan
sekikawa
taketsuru
kibris
pretaped
brutalising
pitchforth
postcomm
flyfishers
wolfsdorf
milners
ummmmm
chembe
bestfriends
lazes
lapiz
hermila
shopworkers
bodai
unitholders
kairouz
yoof
mitzvahed
bathija
evarn
nahdha
murar
poptastic
crossable
microtca
enarson
electical
barsosio
dikla
eisenhowers
maniace
stellick
smadi
ganol
weigman
yanowsky
nerica
lujambio
scoffers
resile
ashlock
crunchiness
picocuries
onemi
goatley
transillumination
kalie
couln
hurcombe
creaminess
roeding
sosp
opalka
cordiant
oltman
disbarments
mackmyra
zuera
gmarket
longlining
teshkeel
wtmd
brakefield
sinkan
nausée
haberturk
dcci
kleinmond
orionids
cytec
aracoma
gorkys
whorley
fengying
wetherhead
boulmerka
kawangware
grocholewski
boilover
vasteras
lardi
vantas
montclaire
barjo
amum
cnockaert
bearcroft
yesim
berlow
trinitarios
ssids
caponata
vcts
lebergott
bvoc
resturants
atomiser
yellowness
bellhaven
tofane
lydic
biostatistical
mintoo
freifeld
whataboutism
tatbir
casue
pishchalnikova
synex
faramarzi
emporiki
entropa
extoll
downderry
omnibox
lowenberg
oppenheimers
amdr
csim
sarnesfield
lovecats
celier
capless
dukic
brylin
studsvik
maceio
mlynek
jaising
silvaire
dravite
huelle
phoumsavanh
mugerwa
wharfside
tessina
daybed
larchfield
olweus
drahm
pristinely
verrecchia
kirr
xtronic
bbet
dillow
boulami
sissie
autodoc
khushhal
helfant
avranas
melmore
monello
stepkids
liau
angelich
paradisical
avaza
lafley
salzburgerland
contactin
reinstalls
allrounders
schulson
healthplus
cnbb
noerdlinger
kayyem
gorneault
trophys
karosas
mangyongbong
shaoshi
micrel
ziplock
vanclief
daftest
nowzad
kamaruzaman
wolfing
graubard
webcameron
supervalue
hakia
plosser
havertys
reclassifies
zhevago
portavadie
getson
nikishin
teshekpuk
kubar
lividity
garrington
cortexes
goldstraw
cebreiro
minic
mettey
jerheme
sulser
richtel
steliana
peeke
insanities
petin
nishigaki
lorenson
kopps
yawk
epithemiou
hazelgrove
fontas
piland
tiggelen
staion
kurkela
carinish
copperhouse
metropolitian
biddies
reteaming
raiya
surgutneftegaz
semitrailers
betulin
outclasses
suntanned
garbles
larae
saumitra
cartographically
ninfield
yipsi
trenkler
glühwein
esparanza
kitcheners
swines
uncast
gangbangers
sehba
berrini
shauny
prevas
zanmi
tomorow
burdeau
subtenant
bibbins
pinchukartcentre
polegato
sarmat
scura
janoris
dashiel
zoltek
zancudo
cchd
cherrix
ornellaia
sacheen
birak
reenlisting
benecol
burped
tahnoon
wmba
virgets
uppo
shafiqullah
velasio
aidone
diekema
purposly
shahbuddin
copelands
chomette
storchak
meckstroth
gouras
xiguang
trefin
zaromskis
leestown
gzm
hasfield
sedulously
udw
carnehan
deepesh
matusalem
lomban
latterman
doutre
yeoward
lamamra
rurua
mangosteens
inimitably
kahiye
dworak
kinchin
crra
reorganises
gamesalad
fangxiao
veikoso
coggles
benest
phisher
ganrif
dworski
leigham
pseudocyesis
huzaifa
mozhdah
ngetich
marumsco
plumerville
tributs
silkier
charaf
belarussians
guilbeault
pooing
dealed
archness
acknowleding
mihadjuks
lescroart
sarobi
nafzger
pywell
halfhill
winsberg
calculatedly
guwa
caveny
farofa
turquoises
shambrook
nyamko
landaus
arrestingly
ansong
koeller
cîroc
gookins
maytas
healthworks
myozyme
shuggy
acrassicauda
vichai
tahiraj
keymaster
britishisms
slowcoach
badesha
lionizing
hasanul
residentes
vorstenbosch
hourican
fausa
gallivanting
navah
shamateurism
owre
dysmenorrhoea
doucoure
pontneddfechan
paganica
futerra
hijgenaar
bicking
skatetown
deplane
heirachy
drehle
marcillat
ickleford
monologuist
harootunian
zyrtec
mujawar
mayala
acroyoga
glyncoch
sinervo
cremieux
khog
watersplash
fressange
cumbus
nurturers
klapow
outthink
cnpa
anite
youi
kedrick
inserra
gonggrijp
scarpitti
authories
chengwatana
haresfield
bargin
leive
antiseizure
federoff
feejee
falon
pawprint
jianhe
lozupone
receipe
munsen
usership
oenning
eshelby
wsta
gaziano
yingdong
majzoub
milbourn
shanor
rewild
kwanda
apolinário
eicken
osmun
kfn
ramiele
hoogstraat
familycare
kiranjit
coould
mudassir
concepció
bimmer
grender
mysupermarket
coffeecup
pelluhue
mindrum
vesali
victore
meshkat
woyda
adorer
auricchio
formigal
heulwen
mouha
unretire
imposingly
electricities
forewarns
haylofts
ressa
unprecendented
bolham
crimebuster
zakone
prita
bridgemary
liad
kutin
behlen
entreprenuer
varkonyi
kailee
husch
stotsky
wannous
automative
garc
saubert
orbotech
morewedge
mageau
comunicacion
uzbekneftegaz
thirunavukarasu
almondbank
crucifiction
miruts
liah
derisi
rozynek
nucs
despicably
handcycles
oltrogge
monashees
defendents
kinchloe
wishna
danovitch
chestnutt
oxborrow
biomechanically
camioneta
mutri
sekeramayi
avanir
cleveden
newburger
turbolinux
kerpan
kaelber
mentawi
hampikian
loengard
moisturize
aapis
getafreelancer
practicising
teven
coldra
alanah
epns
browman
gocompare
officemate
marvy
toptan
subnitens
bierbichler
muffie
poilus
grotbags
pendergraph
orbec
masinter
almaric
craws
nizeyimana
nonsmoker
mihailescu
innotek
kimaya
britweek
wpsi
zahradnik
flapdoodle
microbrewing
culton
tatter
represenative
dnsc
charlatanry
blechner
powertune
usupashvili
almast
susiya
klenke
handspan
wiveton
nanoshells
prawna
algenol
kundnani
ramatuelle
esterhuyse
practicle
lovability
blackwash
sanex
kenti
boulmetis
dunguib
bunagana
chimen
ovono
camposano
gouvia
argies
rollersports
josian
vigneaux
khamid
commiserated
izta
qeep
breaststroker
nightowls
akerley
coarsened
jianxiong
mochlos
bodow
kirtman
wilmorton
hurdia
huijin
kaldas
johara
glascow
ullenhall
shulkin
boufford
blackadders
tatsuzo
gianopoulos
raghip
nylo
thoughs
footstools
yoana
sgarlato
iswahyudi
saltchuk
kronthaler
klicka
georgos
churrascaria
coldman
chokepoints
almendarez
bartumeu
kwikset
lumus
sundyne
presho
housecleaner
unsleeping
ruminated
gergorin
numeros
eurekahedge
yames
cablefax
trovata
slattern
corbishley
timebound
redwan
amirkhanov
eurocypria
cataluna
mcrs
advantech
equalises
furnituremakers
fairminded
floormats
foud
wintzer
xinbo
clangs
irlene
ijl
autoeroticism
gissel
encourge
disappointedly
seyfollah
terrapower
nimham
evany
griffelkin
mdjt
adventuredome
intralase
crigglestone
christobal
searose
criffel
passetto
jurelang
oliu
kranton
kukje
mohsan
juhayman
ombale
hvacr
pides
deodars
cattie
youboty
cosset
feleppa
dirden
reinclusion
beghtol
bodorová
empathised
forbearers
bergamin
breckfield
balough
lyris
langhorst
besas
moctesuma
unpracticed
firoozeh
schauss
slighlty
heartrate
yishui
feec
forepeak
academyhealth
nuli
hawnby
waterous
terisa
vmworld
bellybuttons
idolators
gebreselassie
kidskin
microsieverts
postlewait
wintley
commvault
exocets
isquare
keesling
rosenmann
zabihollah
galves
mfsa
heumarkt
lereah
fackrell
zuby
louet
djorgovski
jiafu
ngay
dadang
altenkirch
froglike
sinisi
hotrods
serrette
jdams
devolites
schtonk
hdj
pocketable
luzuko
fiyaz
gouldner
fritto
mcbey
rhodie
tawila
serag
otcqx
sunnymead
savasta
receiverships
zhakypov
savala
pingyi
bhoopalam
shurlock
sementa
malapa
murungaru
fulcrums
funner
kollath
reground
barrado
mittman
dangelo
zimonjic
duquenne
kobashigawa
seegrist
betschart
convulses
mobos
rons
pscc
gattii
gwaenysgor
pahimi
chiclets
fornicator
kasanoff
brainbow
mortarboards
jimaní
kowitz
loomia
schawinski
tagro
khaung
chiropractics
kadakia
extreemly
maxxam
esporta
bendamustine
ballinagh
stourpaine
guirand
evenstad
jollimore
compounders
imla
cormie
beney
sticken
kenwith
medeia
taraqi
flogs
preregistration
tonet
hardwiring
boskovski
sophistries
sitanggang
changge
tabita
hankies
golflink
celian
selsky
fetida
milanes
mollon
abraxane
dingwell
optex
upperlands
zoglin
herard
crudités
ghiglione
gardea
hautlieu
fsos
persic
pinholster
nittve
kinclaven
fraudulant
fatalists
irisl
widespead
roetzel
odney
overindulged
caldeiro
cherrapunjee
nevruz
cinéastes
cohered
acott
fiolek
matchfixing
wartel
misanthropes
allinger
retrovirals
swordstick
tiribocchi
genske
vorobyovy
recoups
ledy
forecheck
rylee
cornucopian
sorenstam
scrofulous
kanik
landels
crossplay
lifeng
oafs
caraibes
smartceo
yaojie
nrepp
bondan
caffarella
iifl
indecencies
aloisiuskolleg
megal
fantasise
procedes
ebates
simopoulos
doonie
sugarsync
pileups
itula
exhortative
picassa
zhongwang
lopezes
nashwa
aquilano
casegoods
shuanghua
katel
renggli
pourmohammadi
reos
mkhwanazi
sabloff
rohrs
archetti
reconsolidated
coveri
nikkie
fumiyuki
malaitan
overemphasise
torwards
frothed
energyplus
habitrail
aboucherouane
kubic
bpss
stagestruck
riskmetrics
weightiness
doppies
cremes
calumnious
bermans
ovrebo
cadge
terzopoulos
sinkor
boomy
keilar
pylas
nobes
kaplon
kweder
winbond
housner
daynard
tesfa
stuczynski
whaa
georghiou
cocaleros
enterpriseone
kiewiet
decaë
interupted
shohet
mckellan
kassy
zhirong
serhant
shenmu
sharahili
handgrips
playready
pattiz
dionisotti
encorage
jingsong
autostream
frypan
biib
stagework
maday
sumari
sondag
herbertus
quadrivalent
lamneck
kambale
felicitously
icpr
boski
beyoglu
burhenn
georgeann
jianhai
gudgin
atron
verstandig
phans
deliz
carmenère
osguthorpe
osleidys
corrolary
huaqiang
rfic
watler
clingstone
bippy
afgoye
sakaria
zimeray
thiagarajah
summiteers
challock
wauquiez
ferronickel
chemonics
profoundness
auby
ylläs
cigarillo
ettlin
oilsand
fractioning
falklanders
pronouced
mahjoob
cummerbunds
freescha
abendrot
bovbjerg
ngoudjo
isue
cospedal
goslett
bistany
uze
urizar
rowta
lewandoski
brownface
bobzien
sportsdome
penford
unsurvivable
sorak
reimaging
kayt
wolfan
vandehei
tyren
cuoghi
devonwood
ljungquist
nypirg
nikzad
darshak
hillenmeyer
namvar
haltiwanger
huiguang
cavnar
decapitalised
kaysha
stawinoga
minski
chepkurui
gassiev
resna
avax
boundlessly
haqi
dauletabad
zinno
biorefining
laibson
jibed
ladettes
appf
teraflop
babyfirsttv
localness
fqhcs
montae
decamillis
wespac
slaley
positve
dhobley
shaplen
zusman
oluyemi
modd
bunkmate
mengert
indivudual
jhooti
denationalisation
unmilled
novec
sizhi
liliesleaf
diplopedia
merti
sandouville
sarava
hazelwell
petreus
artwatch
apemen
khagush
rascoff
mfrs
shopkick
knowin
garofolo
iourieva
exagerate
sukhera
currrently
jacie
dawalibi
tolstrup
clendenning
indanan
kocherlakota
unza
symud
pleaders
nasatir
ukunda
richemond
sarajuddin
manku
iztuzu
ventouse
lubatti
phobjikha
vitran
catalist
bargainers
cloudsat
acaz
prokesch
maruha
imfa
dovan
appnexus
technolgy
traduce
jousted
jammies
danilchenko
phlip
kokavil
rewriteable
multireligious
pepperstein
crosshatched
ilenia
mahboba
kawaoka
enigmo
urbanizations
exagerrated
cyh
xiaokang
jirous
rescissions
teshigawara
qaseem
booksmith
sivarasa
shearwood
rrac
malula
temping
schumacker
klarner
genevans
illman
douzable
bwy
vidiians
ushiba
howdahs
lasercomb
ultrahd
evrything
descas
bonacic
palonosetron
walus
seafo
christingle
fesmire
shurdington
amankwaah
crowlin
wilfulness
ndpvf
vuorensola
cyhoeddus
flourless
maglakelidze
lhanbryde
hamwee
offspin
cucolo
bogh
tumultuously
smus
versifying
horsewhipped
hayvenhurst
acoba
esag
gimm
nortek
randgold
mitsuji
hisey
parsonnet
iranamadu
despute
remoulded
flozell
flashmobs
blandishment
jikany
anoymous
alogbo
colouristic
pomeranc
ogborne
sqrl
hansabank
baghar
doolally
soceity
apgs
gombiner
ervins
penaluna
gellatley
bondam
aerolift
critisizing
affinion
sifteo
skorecki
amaryl
monoplace
miskelly
carello
jerbourg
avize
regifting
bosphorous
kristalina
streather
allonne
louisana
schivardi
fundora
pallmeyer
teacakes
carbaugh
demitrius
sbca
daintily
kajitani
golubovic
ostojic
wojta
teasels
denette
cusato
boudjellal
golpayegani
bourtzi
qqm
khazna
wangyee
casales
dorkbot
naureen
stefanoni
larsh
pasteuria
highhanded
barnouin
rennicke
lakdawala
chakir
wrotes
delloreen
horningsham
schwartzbach
yablokov
wikileak
meyiwa
napm
drapkin
calderglen
debruyne
jaelani
dehnart
transgas
owomoyela
cacio
osirix
barakaat
devyatovskiy
laddies
enitre
citymeals
gorllewin
minable
woerkom
haubner
wict
whata
hugeness
sommerfest
mappleton
yeargan
hotzvim
aproval
plooij
wingsuits
calotypes
vegetate
westerngeco
fimat
cchf
woodburning
mckechin
cptc
defanged
schroeders
ceutí
mudzuri
othersiders
rheinhardt
besham
tealeaf
sanjel
prosthetists
snla
hanadarko
lawsky
youthfully
djelimady
gwawr
greenyards
planai
sembra
syson
foleys
markerless
mavraides
camisoles
conniption
overbearingly
noblella
kumgangsan
icera
novazzano
hembrey
nussenzweig
asnodkar
ederney
biospecimens
acclimatising
uzowuru
propbably
accroding
miodio
cherin
deutschman
anvisa
groogrux
neilsons
swanny
gunky
splashback
snainton
jonhson
oralia
beshimov
manyatta
moneybox
kaytor
caprylic
squeezers
warzecha
gadgettrak
lipping
licosa
mosavi
kinnerley
tenido
wesminster
augmentin
talad
voevodin
reschedules
cyranos
rifapentine
knolle
nossaman
healtheast
mandrem
sébire
duprau
democratics
mobiltel
crusties
devachan
hallums
unmedicated
participacion
chromatographs
itqs
pusillanimity
magera
rowdon
hedebrant
cawse
gaberdine
unshakably
huttoft
boissiere
parygin
antispasmodics
kusaba
kurzon
chokai
alarp
echenoz
pamidronate
katlyn
fashionability
polarisations
udeen
pomerode
counterproposals
speigner
decorously
dearen
ejegayehu
parazit
niccals
addreses
mcewans
chiamano
latell
tiihonen
trents
qisheng
yerlan
tgfs
khaleed
armajaro
pushpamala
jiexiu
balzekas
bacterias
sokalsky
oesa
megahed
skimpole
saeeduzzaman
hvae
outsourcers
ecity
diarrhetic
shenkin
urgencies
patullo
edisonlearning
abusharif
behaviourial
villarin
atefeh
rebalances
scharmann
adament
glentress
somerstown
congesting
skeem
florange
caymanians
lipinska
sirva
trackman
levitre
oceanics
rucinski
icontact
miland
fecking
felcher
dhahrani
googel
yasuf
bexton
staffordshires
barkay
zierke
hnwis
boardley
heliovolt
sejjil
compliants
turbett
piontkowski
havarti
huaren
mitchelle
lecg
fathauer
ianucci
reisha
talf
whiled
haltli
juliusson
muszynski
onterio
lopinavir
serebryany
comfortless
vikor
matrimandir
gobal
dilatoriness
octandre
brentley
hisanobu
nyongo
tokbox
bucan
zotos
murl
ngiam
heege
dtech
accuray
citybound
dvoracek
sevenload
menaged
cosies
carbonetti
rolet
candes
responsibilites
tundo
thingummy
micromanager
lavernia
guilfoile
caliope
snowploughs
cojuangcos
durstine
skimped
organdy
zatonskih
harsley
talibi
navinchandra
takfiris
mayolo
whear
hoffmeier
cavorted
jailbreakers
daisie
kolay
folberg
ballyronan
relitigating
glyptodont
atempts
ritsch
aratu
brodman
birute
roongta
mendana
laggies
porreca
pacientes
camis
binkerd
litzman
kccl
milane
nutman
gaudiness
assult
carnavon
kyloe
castejon
sred
shimmel
overleaf
scotches
irdning
maltipoo
judaize
unintrusive
tuw
björgólfsson
farrin
trosglwyddo
polymetal
usse
breakfront
elsfield
taith
superphone
battison
skiwear
moureau
asbu
farzam
frecon
vulgarization
clodhopper
bengalooru
succulence
itmi
goodkin
bonam
bhatty
helsen
haykel
sepanlou
haled
shvelidze
donfried
antihyperglycemic
gricel
polisseni
sponger
rackrent
coopting
aridaia
kgale
sideload
seurasaari
champing
giardinelli
gresko
ruths
mercruiser
burrier
vres
thorstvedt
excercised
incude
productivist
lauvergeon
embarek
jeleva
acarajé
democrate
rutrum
monocropping
ngungu
taryam
innkeeping
khurmatu
shacking
dewchurch
bugginess
vanhorn
seineldín
transfigures
sgia
byoc
barion
planeloads
ruginodis
follwed
chechnyan
quadrado
aftr
craigiebank
deloge
csuci
invoved
aafs
heline
cebp
fstc
erlinder
kwatra
bojko
witloof
ballabgarh
gbomo
persuit
lapresse
kohane
ligocka
wavecom
eggbuckland
redworth
yorkgate
indego
rumma
camela
gradante
sosnick
threapleton
epogen
nebbeling
nedderman
backwoodsmen
decaprio
lulgjuraj
mcglasson
bartke
tagamet
hummm
klinton
purikura
selectorate
repped
ayesh
berisford
diemut
merguez
phaeno
orathai
cyalume
papadimoulis
durwin
segro
tetonia
chinless
nanz
vahn
tipsarevic
weafer
wuold
buragohain
algos
osterreich
hullet
gemdale
bibishkov
sanker
bobl
indentions
smietanka
gelernt
bakon
lenity
nuggety
bachiana
estra
lifebook
peloza
protogalaxies
ojama
shiza
tadross
packinghouses
makalambay
khorgos
omenn
whie
otse
apprentis
cogitate
kingsview
duscussion
shatri
korzec
ardlui
pacquola
abdiaziz
djiby
sesnon
hairies
gorgeousness
dybzinski
gthe
aboslutely
quicc
gaymers
grübel
pomper
backfields
eulogises
ohmygod
colourspace
alviri
woodbines
paycheque
refigured
moravek
sunalliance
asbm
oenophile
scci
osetia
kastenbaum
nonmotorized
nubanusit
meteorologic
jankowiak
tradedoubler
redcats
seminarists
grouts
politicker
schaps
needin
gheel
grubstake
apathetically
quadrillionth
trendlines
sinophile
interspliced
oganes
straitjacketed
jehanzeb
presvis
beuttler
samwer
hornback
mydlarz
peddicord
sphe
abelman
paetzold
vertin
mellier
leafletting
behbehani
pqi
legales
harmesh
endives
compunetix
updata
kenro
cuet
macalintal
loganton
ssta
obermeier
copito
bodymap
voluptuously
tubul
kraner
toncontin
chernyakova
ivis
blotner
biklen
beniquez
hallmates
bwlchgwyn
toranagallu
outhit
gulfi
ibstone
billia
alternext
securitizers
petrophysical
schipperke
komnas
hawleyville
kamerhe
magnetix
reflation
galdieri
pedreña
tolkienian
rosmino
miton
bissoe
supertribe
hegerl
saltimbocca
victimising
javin
cocklebiddy
aquascaping
withburga
annoucements
oysterman
bothwick
persdotter
salgaonkar
hellewell
overborne
bevmo
dervite
tcgs
khapra
evenley
apgi
dignityusa
posessions
gearstick
bageant
helmbrecht
neworleans
kerscher
cowdy
breakfasters
preventatives
lemsip
doory
lurgy
tanous
dowanhill
witnesham
zchn
foqa
firstlight
sparacino
taiex
xalisco
thorneywood
creely
mirfak
contumely
prewriting
alginates
tippens
mererid
dudarev
bryansford
ameco
worsall
wardon
weilding
leighty
arreguin
mrfs
pooed
trichter
perola
xiaotang
cppi
heinzerling
ubiquitousness
sssh
maaleh
koutouvides
budcat
feltheimer
burmarsh
caryle
archetypically
syncopating
birchover
edery
dunsky
bonvillain
scrutinies
gevor
antiheroine
lochrin
ashkirk
prieure
aerotel
superserious
wildcatting
canape
manouver
makiri
mcgregors
agom
zhenqi
azir
hitzler
gooner
mcaleney
dynaudio
banglatown
nodelman
meatpacker
possiblility
trevalga
orullian
ritek
bfsi
kovacik
exclaimer
selbe
reseat
overtax
enthral
shielings
cameraperson
beurskens
golaszewski
yelvertoft
satyriasis
langefeld
sayoud
hinden
exsisted
casanegra
jiren
arzhang
micromoles
aarabi
zalina
haughwout
barnow
graskop
aperio
demuynck
traceca
demicco
raptorex
techops
heebner
liberda
picocell
unpoetic
epidurals
hybird
vogelenzang
leevan
eutawville
insureandgo
evci
crackheads
dyron
zenjiro
mozal
godd
nellum
livingstones
perranuthnoe
sheltam
reshoudi
vigdor
whrs
nehmad
suberbiola
farimagsgade
pietrus
baronscourt
chunxiu
zuttah
fenside
fengzhi
maysam
incredibility
iipf
stroyan
yeg
cifas
cambogia
altynbek
lacelle
smallie
batiashvili
phree
omniride
khansaa
shrock
lawick
hilan
anticapitalism
sugarcanes
unextraordinary
dadfar
bryostatin
bhavik
patail
molski
psychoneuroendocrinology
squidgy
rubalcava
rcop
honga
whens
soohoo
abatemarco
sreekala
sasd
goriest
discretional
fillibuster
marlesford
cawing
charlady
drosos
sakip
preest
aquantive
coronil
mobilkom
shomrat
fasthosts
mccavity
aktaion
nuvvuagittuq
elsewere
wrenchingly
pihlak
bopped
oloo
agns
apoteket
stoystown
superlicence
sarac
wattages
iniguez
theofilou
donceles
fishfinder
didmarton
noogies
songul
chophel
nemicolopterus
zhuoxiang
immunise
drabber
lossio
thiriez
pyramiding
treadwear
versifiers
haghia
norster
purrfect
bazaari
gainsaid
packeteer
enticknap
hallelujahs
ambassdor
hovertravel
absoutely
threemilestone
tawengwa
provice
gottheimer
erlana
aeries
weiers
madior
noppadol
prso
corvalan
otsubo
outfight
feren
laurini
scarlata
galoot
repulsively
terreblanche
irinn
humpherys
glymour
multidiscipline
tigapuluh
sumino
geohaghon
chunmei
achenbaum
cirovski
plaw
villagereach
mylin
gratingly
neumanns
exablate
tuerk
baghran
wetterberg
katsucon
accoridng
camapign
datacentres
kinkala
philsophy
fydler
bureaucratized
antifraud
vilanculos
simap
stockholmers
graffigna
drinnen
crashlanding
provi
commodify
guihard
skibiski
psychobiologist
alteryx
tsouli
douchez
garuccio
kokish
ouallam
milchberg
aproved
tranquila
desloratadine
coreper
disfluencies
rogich
scabbing
frizzelle
scarefest
truchard
ynystawe
bandings
thelton
ganderton
excentricus
carnwadric
tigrana
tundi
reller
deveson
ricards
wxtr
jungstedt
dunkelberger
fotyga
cherifi
aserinsky
kirlin
weimaraners
nadiri
violy
catteries
spalled
sellathurai
spiciest
aboobaker
intertrade
tedstone
monowi
frazin
sejil
glucometer
remuda
lillien
qadaffi
filenko
recist
copycatting
bendet
beldangi
mossbourne
pjanoo
diapensia
kisaran
rudik
fionnphort
wenna
wouls
kenyi
orenthal
laguito
regualr
granjas
gaios
omada
sunmonu
pebbledashed
besenval
icub
toldeo
bodinieri
linnan
marusan
aprill
shanle
sinka
tenous
piracies
blackard
kocurek
geling
canpotex
egotists
ravishment
squirter
messano
gerhardie
hudek
hrudayalaya
bksh
monogamously
shikin
nazirpur
quadrathlon
airconditioners
chuggers
rizatriptan
boymelgreen
ladele
mountian
stratix
homeworkers
ophone
springview
unrelaible
hendeles
shabiba
insatiably
krcmar
lended
verrue
hazinski
gastrell
moqueca
hasbrook
sinisgalli
gouil
sexpert
hostellerie
kunowski
flattener
proclaimer
vonna
diversões
wichai
stilll
scharfman
asanti
unstirred
psycological
gushchina
crappiness
weila
pingeot
denevan
falaya
judies
swoopy
hullermann
belski
beefsteaks
bantham
grynbaum
gger
faasen
dinking
presgrave
woywitka
demurral
telekenex
beechhurst
sgam
kreuder
fronzoni
berkan
linkexchange
surgey
yeslam
helliker
lutts
inventiv
zuckoff
nujood
swigging
zmajevac
rejectionism
pushovers
brandz
mbhazima
hirotoki
feenan
hadithi
beleifs
regardt
luraschi
silnov
khanu
fahal
keis
demoro
hausch
nemukhin
ershadi
loiters
tramain
infosecurity
alligned
valinskas
tubay
stinziano
zega
franchisers
garbagemen
chikashi
remedium
salems
iristel
fagiolini
countersuits
perng
mullighan
interoperates
unserviced
xohm
tiantan
margis
brochin
phakdi
stensby
possingham
gdrive
shierson
eboue
bamji
racialised
colourways
mecchi
karnig
rampino
yosfiah
representativity
bankrolls
childwickbury
ujwal
rescuees
anaglyphs
delvon
blechschmidt
interupt
reglazed
ornateness
ascentis
wesbanco
disillusions
raitala
watsham
lewith
greep
bandoleer
vidarsson
forswore
mariajo
saké
dgat
bootlaces
enthusiam
birchin
slawek
roshambo
cordain
miscued
diazoxide
wride
judgmentally
hargeysa
djindjic
juszczyk
guja
jinglian
budoff
escalope
gpnmb
valesca
koldyke
kassal
motorscooter
thesame
trainwrecks
bikies
abich
liveperson
playdates
sleepwell
marjina
natascia
ponderosae
tueller
candil
shabati
bukky
murco
nimbyism
critizing
zalai
xharra
mashenka
mitek
bouzoukia
simplifly
ginzler
firouzabadi
alfacar
oarlock
roadwater
snss
waluyo
deerlick
flushable
dolloff
persol
penstemons
lidlington
sweary
schickner
dumezweni
newstands
galatoire
underfinanced
pistou
cornrow
lynell
renewableuk
aeronwy
rumrunners
tirpak
futurebuilders
torrenting
kyvig
banson
spoonley
ohkubo
dwina
diegos
grost
cabestan
garger
karmically
okayo
soundbytes
sakine
katesgrove
icea
kcha
undiscovery
paranormalists
netd
gitenstein
elgol
dadkhah
checchinato
dealogic
maryburgh
rudland
tencer
zeckhauser
sevruga
honoraries
madawi
tihnk
naqoura
fingle
hervas
gisi
marozzi
buldak
euthanization
khidasheli
venomously
dustproof
troha
apstein
margreet
pushpins
raudnitz
clachaig
kyaik
medcup
medupi
btselem
politcs
naret
voitenko
pervitin
godello
sekope
mspp
semiliterate
porchia
thunderclaps
reeducate
bepler
adrover
chickenshed
yachay
kinyongia
roseburn
dykers
daneil
cassandras
milota
zulily
kepplinger
lemmerz
irec
asre
bissoon
equusearch
kriston
prevous
voos
aberdein
borozan
wiscombe
yonadam
nimat
akhtiar
nons
studywiz
rideability
fter
gorslas
clottes
klepsch
jansens
pritts
kannat
collosal
malde
inhabitated
loghman
chalbi
autobytel
programer
creutzmann
enaliarctos
quinny
plasticky
dooo
hawr
halvah
minces
ffrancon
epeius
hiccough
ashkenazis
postbag
taule
insoll
supl
kiljan
popetown
skellow
ecoa
upholster
napalmed
bankton
diosmin
antopolski
yarom
mcos
stulen
overproducing
mirlande
throgh
floralies
boilersuits
borré
gevisser
kahwaji
balduzzi
milevsky
opiners
chirrup
olick
intensivist
shakepeare
roueche
scrimping
febuxostat
wijesekera
drinkstone
mavrou
chibli
dufallo
spso
delbello
aimen
cerino
golfclub
elate
kiwibox
milkin
guban
ngagi
mtfs
humbolt
alivio
raudenbush
housh
tickertape
winyates
sabbar
leardi
byrant
trackstar
adbullah
fanore
roopnarine
badders
carere
blanchon
yech
garech
selkoe
kurvers
yazigi
kuzak
possilbe
uncontactable
marylene
mtls
squidgygate
scowler
wheelarches
chaharshanbe
schlump
kilnsea
leatherbound
ediets
rediess
liagre
angelil
disenchant
asisi
anginal
absconder
hwg
hootenannies
pungently
jeremic
washpot
zortman
seddiqi
indiglo
palmerola
stalemating
shurtz
reburying
prayad
orthopedists
hardstandings
photocall
cortesia
janovic
gonjasufi
moredon
whetehr
finavera
karesh
siaca
massman
periera
kyuma
delepine
guiver
madhavikutty
endograft
vasarhelyi
cloudbursts
revaluing
wihongi
sandies
madut
ndahimana
ypb
mikelis
puppeted
kentrell
joueuse
qunli
ujima
gurtner
rodic
braig
cushnahan
gjelten
residental
arthroscope
cuckson
crfb
balila
audix
tehachapis
ircx
viragh
katiuska
hamzawy
rakiya
conaton
baydemir
camauro
terreiros
carepa
gurmail
fixs
iaci
squirreled
xochi
emptiest
snorkellers
matildae
dryships
inhope
parales
dimitrakopoulos
weinhaus
barkindo
jappy
marzaroli
trichloroethene
killerbee
worlaby
chilaquiles
limeuil
sandrak
bamp
pangkalpinang
pignatiello
canovanas
ogbeche
nekrassov
berresse
terrón
triantafilou
puska
bluechip
yaung
lyddon
schissel
kpaka
ojea
attuning
boshoku
catharpin
mortgagors
chikilicuatre
bedposts
urbanising
monchi
mosys
hamsi
guillebaud
dpni
aminatta
tribalistic
biogeosciences
sastrugi
racecraft
arriviste
boemo
iguatu
bonafides
schev
cardiogram
thromboses
harlo
confrère
scorey
festuccia
torcetrapib
pendrey
nlpc
niec
sharkman
bluebay
cpag
bresh
banciao
nachtrieb
dumbs
installaware
gawcott
broadline
altace
loxapine
sokoloski
greenburger
rampell
hmma
auradou
finansbank
schwenn
mulleavy
entitiy
goonhavern
gwava
alatar
vvel
cgx
vinovo
tanor
muamar
ticketholder
mislan
proble
demelo
nounou
bhumjaithai
conisbee
mackenson
radrizzani
touby
bradnock
aminzadeh
grassa
bctia
chavkin
adultress
zettabyte
endace
kreager
politicizes
genmar
shockable
lukeville
soccor
pesaresi
brightener
briger
andouillette
alstonefield
hyperekplexia
vidoes
putze
topcor
ceatec
baldfaced
xianglin
joette
walead
bikepath
reithofer
casaca
ilco
monotheisms
lalish
hydropolis
filmart
knupfer
unpayable
labon
distils
kurochkina
sucharita
glamorised
ustia
schuenemann
boissonnault
wikitude
encinar
mfuwe
cruzcampo
unresponded
gokcen
delouche
rashon
jiarui
klaric
compeau
neelys
bodymoor
virologic
rollens
ieke
ripperton
ampro
rezk
boughanmi
golvin
polarises
acharn
tarceva
dlife
belluso
freddies
townsel
gastronomia
dyde
suchon
dependancy
malintent
yarg
flimsily
dhanteras
attourney
novitzky
somatropin
gribbles
cutbirth
husseiniya
meanin
iphoneography
sukkiri
botty
unicar
lavandera
chinstraps
vadheim
katoucha
tuwaitha
cutié
steadings
ruven
deysi
millmount
kabai
payard
bluebloods
guisachan
rouba
kishigawa
plumbridge
squance
vivalda
anhar
busnes
levdansky
pontarelli
rfqs
rahmeh
muqattam
windchimes
joyousness
burnim
waterreus
sprick
worthville
dulmatin
amorn
cremades
urbik
gartenberg
cyberbullied
cowarne
hertzel
jontz
camd
speechify
colonialisation
chwarae
chesimard
miltner
suraci
andreevo
abacos
immunisations
mlynar
cardamoms
funkeys
shibulal
bucketfuls
fusillo
kubatana
candomble
genclerbirligi
shoplifts
carmenere
yurkov
unassailably
hanatziv
sakhan
actimize
snarkily
rafii
bilsen
techniquest
klaven
corble
hejma
craigforth
glynns
didja
chargin
skateland
elegants
eliyah
picarelli
geuk
nlng
sartorialist
burrett
janamukti
delucca
isaakyan
desig
smikle
wherryi
teewinot
boggiano
merebashvili
baccari
iwanyk
antibullying
puentevella
telikom
mawkishness
candystripes
slippages
biosense
crazytown
bcfd
fahimeh
beetv
smmt
prestat
donerail
underproduced
pohanka
sawhorses
lanzano
roner
matonga
straigh
grodstein
lahkar
piluso
tauwhare
tweenie
townfield
glenshane
jobster
gazenko
kuraray
gooda
petrichor
teithio
bomarito
hivert
dochfour
kopicki
iezzi
binoria
exteme
scillies
akrour
avivah
uncaptioned
murgo
refueller
tranport
neustrashimy
hasenfratz
dejeuner
eniko
limitlessness
hardlines
densey
tapies
hamngatan
mcowan
aramingo
creekbed
petronet
stournaras
shihong
jumaane
falasi
dashawn
artemisinins
leipsig
granadas
muder
manhattanization
deuser
outrebounded
downrated
claverham
tincey
puttanesca
bezhuashvili
naughtier
sdmc
edfors
grona
animatedly
sterl
releasers
adalsteinsson
goves
hazboun
govin
restiveness
globalhue
vlack
ceraso
curser
vadon
extrodinary
isce
tuteur
ghwb
hillpark
kocieniewski
psychoanalyzing
idealise
grasmick
cavalia
,why
crossmembers
hutching
kodes
sindou
bardala
jurika
frene
pylypenko
sabaugh
uscap
salarzai
essentialized
deressa
bucio
khimar
valerenga
priveleged
wfsc
kominski
pantyffynnon
islamics
lemonaid
heteroplasmy
oresko
commoditised
nijpels
shhhhh
gourville
fasol
mogden
culpably
azango
blackmill
loquet
liasons
believeable
kappl
plagiaristic
haradasun
moretaine
doger
gonis
nomvete
passionata
flaks
overpumping
snedker
solovtsov
macefield
himanta
madhes
hradek
patronisation
strenghten
luebbert
rightwingers
bonesmen
matovina
filiu
bachhaus
barasso
vanavara
whiteson
riiiiight
bordyuzha
nikolaides
mocka
spookiness
videojournalist
larosière
creditsafe
fetishisation
bumf
dropoffs
hamour
conlig
wilhelminian
titla
oleocanthal
relling
undergound
ecomagination
sieradzki
debellis
pliancy
matschiner
manilva
carradice
jianlin
neoris
rollyson
kotooshu
prospere
lavier
rakhmon
skarnes
westcotes
mayali
purecircle
chakales
arand
liac
cottageville
eastender
wogau
derawan
misogynous
mowaffak
bedjaoui
pokerbot
squaresville
kwawu
blogsites
mustaffa
sprizzo
tesha
retarted
tsala
employe
touchlines
complacence
stelmaszek
arnisdale
ackford
rcpe
naimatullah
tannochside
spareribs
transfix
hunta
prosperously
taraborelli
gavronsky
shirehall
tahawwur
ndungu
copti
ristuccia
mediacity
nicolien
biox
anaesthesiologists
makimono
huben
strul
zaccarelli
teibel
filipek
unfavoured
anatolevich
vitezslav
poketo
aurumque
nucifora
sponser
donckerwolke
salf
neuroradiologist
werners
pohakuloa
lazich
kwangmyongsong
amimoto
shld
fthiotida
erlick
pleasurably
opco
nyuon
lllt
pakey
ncdex
gediman
moonscapes
transylmania
veluppillai
jhangir
asirt
caip
vaccae
coracora
braslau
rendani
nanospheres
ferver
caplen
geodon
metalloinvest
bragas
lamby
tesso
zeroville
sitv
vantis
hopefull
polewards
cirumstances
beyrer
wermers
expresed
charel
fabros
swagel
bichot
edventure
benice
lagha
jemil
vitullo
teessiders
bruery
glaramara
assani
sepco
bronzetti
inverne
kalubowila
cotrell
idirect
eipr
weighlifting
siezed
rynard
afosr
zhifeng
eliyahoo
beyah
imperiousness
sunlounger
zeleza
cephalalgia
saxbee
loughlan
additude
nirex
zuckschwerdt
sititi
tegretol
vagit
gentrifiers
riverfronts
waldouck
mandaville
makashov
hairiest
nontropical
dornie
beltinge
nory
lakovic
nunchuks
demonlover
vaharai
abbatoirs
shimmered
microgen
broux
issuable
gecamines
mcgalliard
saih
clicksoftware
wildner
tarsa
annalie
splaine
eibhlin
fanchini
kousser
yingchun
evri
buttin
pepitas
iawn
bernell
cheapoair
manir
ajiboye
jemile
thamilselvan
sadovnikov
riddrie
maybaum
pfh
vecernje
dipuo
desease
vandetanib
connollys
voras
linchmere
rhosnesni
knowning
gaslit
spallone
dysentry
dorenko
artsiom
lindloff
kafia
yueda
sealock
saikawa
bywords
uncurable
icast
zengerle
kaemmerer
fusillades
missingham
hereditarians
agudio
mcdow
causeless
joklik
aprutino
cesars
sekeris
admitt
spowart
winterrowd
kemy
byamugisha
sensat
carthorse
qaza
huzarski
ballyskeagh
blockout
barage
assads
ceic
kroenig
consert
overgarment
shagpile
bremont
elvanfoot
korinna
cilfrew
bloche
developements
normile
highballs
viavoice
hennessee
agualusa
blinis
bartlit
xynthia
borsetshire
giquel
moradian
geohot
cardiocrinum
slesarenko
romelia
yoncheva
vinexpo
corretto
zugara
adjuvanted
loiko
saglik
badriya
rodberg
renditioned
lorio
beermat
retracement
cenzontles
crookhill
floca
klimpl
seikh
laxly
parron
menoken
gotu
farmfoods
famili
bdbd
birminghams
westlakes
tortolero
hmps
waymond
javdekar
goldup
kashflow
inciters
logicomix
gymreig
cogill
abbos
barbaran
intemperately
sedale
duwe
holodecks
stockage
socias
airball
metatheatrical
ravaya
preventitive
baxtergate
toltz
agway
chaple
pekkanen
goaa
everlyn
shifeng
projectable
mailpiece
aggrandise
markum
stevies
madelein
lipstein
conwood
coifs
eckrich
filous
tabra
woodlyn
overvaluing
futalognkosaurus
cosmopolites
serratelli
mikhalkova
sytle
dorricott
nummerdor
prawit
bullrushes
devonians
laperrine
lopiano
brussell
darui
reachmd
pottker
xiaofang
chawke
foody
aidala
steall
haithem
immunizes
jinking
atacks
fanan
madders
benchellali
okiro
habr
koznick
religios
yawney
dillonvale
gawkers
obviouly
ngedup
pyromaniacs
edou
clipson
hearby
comenzar
telpuk
dualstar
landra
michelene
joaillerie
expiries
moisturising
kinter
lauders
shouls
lankster
mclamb
scws
voeller
vapidity
rozabal
zhiyang
acbj
zinkin
vereecke
egeli
underhood
contrarianism
radicality
anwick
waverunner
cataphora
kennelled
sufferage
tonsuring
pipas
manouevre
yingpan
kipkelion
biodegrades
nitrofuran
pasona
cacheris
lokahi
tehani
crinis
memorialisation
biatch
wallbangers
dasey
guanta
idom
clyth
fluffiness
shoebuy
trusnik
serraglio
morecombe
servicemagic
spirk
metaldehyde
comco
oscs
narrenturm
samasource
hypotheca
horwell
shuqin
vikan
womanliness
clubbable
satified
sipps
lipotes
dancedancerevolution
rehabiliation
arcep
rondavel
vellanti
semisubmersible
eskendereya
pashler
sweetenham
heulog
crociera
antonetta
rety
cutkosky
jaoude
integrys
winemiller
singace
pastre
dyrdahl
drinkall
elefantasia
maddalone
astree
kilchurn
domzale
heshka
walkerburn
dabek
coneflowers
arbez
sarach
anxiolysis
koornhof
dilhan
pranknet
dupnik
levocetirizine
quippy
gallach
novosad
fatau
thme
hilander
yasim
snowland
reatards
baghdasarian
cruiseship
bedward
denverites
deadnettle
huisen
sestinas
cervenak
xiyun
disneyfication
leles
richwine
winghead
fedexia
secane
decsion
cheonggye
grisogono
berdovsky
böögg
mislabled
semc
aabid
seselj
ifwla
chouet
mölzer
barltrop
kolath
shettles
rivetti
tvac
nonethless
lobaton
paytv
extentions
invididual
piecha
kuzumaki
drivelines
biddinger
crestor
wakamaru
asphyxiates
otryad
gililland
jiantang
itune
maalox
rollnick
barquet
neointimal
permanet
parisella
retrenching
kuniholm
inurned
morbier
kokosalaki
foing
loepp
shanaze
chastang
sutent
dotmobi
akaev
pinehills
downview
succesor
siebers
fordo
fakahatchee
mankovitz
horava
controvesy
caamano
qaba
makovetsky
grieger
hellwege
gelfman
federalised
affenpinscher
pullitzer
chapoutier
shemshak
nefariously
shetlander
vergnaud
xross
mouland
navdanya
bufori
amdh
flimsiness
riyashi
robinzon
casualisation
yamahas
truckdriver
izt
differnece
bohrod
sesiwn
schmeer
tomasek
guseynov
firstlook
polytunnel
minhua
bleiker
yulianna
xtracycle
jimsonweed
muttar
cannister
tommey
alitha
siviter
gristede
verdel
taughmonagh
salwan
tayal
beutiful
zixiang
showpeople
ahmadiyah
aganst
cederbaum
datejust
newspring
wreyford
hypogene
pwns
zenkoji
djukic
supergran
janiga
harangozó
jserra
shifflett
borans
politicas
alreday
knehans
tjv
noroviruses
anythng
manhattanite
dehumidify
aguamiel
cutud
ahaa
unforthcoming
billiot
ephemerally
poochera
hiseq
bladesystem
haowei
lucques
partlett
shinder
fiegel
miskeen
allenport
rahav
nacotchtank
poising
whessoe
hansenet
brackenhurst
samething
wavemaster
ashgar
taloga
busienei
sucessfull
pantsdown
mendia
warefare
chinotimba
himali
rhidian
bargainer
poze
evolt
retrogrades
enteromorpha
switzers
caldero
vradenburg
maatta
brempt
codlin
kelci
pechonkina
neurosonic
dreckman
pulteneytown
tidjane
winforms
ralbovsky
inverewe
bayaa
registerred
bebes
oaps
mocumentary
mutel
basware
mcslarrow
anyhing
montavon
penwarden
marival
digiboxes
khanjani
nerac
telsim
benimon
fadilah
courances
mongella
yiqun
zezza
wazza
elburton
limehurst
songlian
mahmudiya
spaceworks
fitzer
scce
adlink
cullings
stormiest
lifeteam
loulis
implictly
prosection
littig
chronotherapy
phytonutrients
razzoli
chilevision
ikigai
bernhards
cnpp
mbita
blacklow
joley
karadzhov
bohanan
prescote
massas
logons
itapecuru
corsages
kabbaj
wimblington
jeanene
gatoroid
grzegorzewski
fabc
stoneybridge
weepie
mutlak
kapangan
refilwe
froing
chibuzor
baycare
ksor
voxofon
doenitz
mediagenic
toryglen
rabiaa
sibelian
planethood
kraayeveld
datasite
meistrell
moonbats
floreen
awah
harkrider
programes
woodfarm
malakhit
krinn
ettedgui
endorsees
fxfowle
inteva
entertaintment
csrees
atiyyah
irida
coze
leshnoff
ckers
kairelis
wurtland
ruegamer
hirschson
theirselves
ogbuehi
sighter
detatched
pioneerof
interspinous
lowlying
underseat
basaev
healthday
deliberateness
synacor
mycerinus
uzochukwu
maarsbergen
wellons
penot
fsai
gavottes
daguang
kookiness
vilk
pissaro
mervi
oldak
firyal
msmb
heartiness
brakey
winney
tyri
hallal
strombergs
pavs
spvs
dumex
metais
schweid
signorello
dianabol
astv
systematising
maharajgunj
woodwell
capodice
rustbucket
crosslands
pierceville
scotomas
mariahilfer
olusanya
oyneg
fantayzee
warungs
srdc
parkhall
succah
treese
kiambaa
vagar
cartney
pntr
chuanzhi
karamanos
kosk
doubleness
converteam
bisri
vulnerably
talone
adazi
overact
nassef
bayev
timochenko
curlier
tetrapak
jarolim
lienemann
hasenauer
inelegance
culturemap
hunching
freeper
linvoy
overdog
interraction
biale
insyte
garlanding
seductresses
culbin
sxrd
bedpans
befoe
afns
hatered
petaquilla
unremunerated
ditan
khubani
callian
voelpel
dolnick
averkamp
warrantee
gliebe
armholes
sylven
treyford
reeders
nudiflorum
reddall
addf
vedior
collora
darva
goepel
dumbfounding
summatory
igaya
xizhen
hatcliffe
egtc
pliskova
duing
corff
subsegments
tyms
motomi
amaturo
kroszner
saadane
xiuhua
rousted
haini
demoralizes
cedras
southhampton
cambers
kurvin
tamuna
ungrafted
spotlit
cripplingly
feyerick
musicdna
grovesnor
robaire
jouvenal
hfw
hargadon
huels
egadi
hemodiafiltration
brason
umiastowski
bettola
sensecam
zmijewski
garaging
cybersource
interloping
komlo
macerating
demonstates
berjon
starblanket
wapshot
absentmindedness
gruet
honorariums
ukla
tocai
quetzales
shapton
deisel
nanyn
pamer
nosanchuk
binki
kilgoris
haieff
trilantic
craycroft
strobeck
manag
unappealable
vilage
achiltibuie
brysac
rassan
stitchery
sovsport
wfdb
bastad
bascules
barbeito
norrkoping
zakher
somerfeld
loyaltyone
cruzin
pancreases
zocdoc
elterwater
hardees
allone
realscreen
corkerhill
pleitgen
libonati
doescher
hcng
gissara
abeykoon
khonji
wideouts
littleneck
fluky
makori
mscc
fomer
crusat
aristedes
vwap
citl
ahps
sudarto
defeatists
bermoy
blissfields
szucs
squillo
rinderknecht
ribowsky
llareggub
jenkens
nevan
jesselli
xishun
ouput
worktop
laskas
unscientifically
iacovou
mauerpark
narked
jayner
cosimi
moap
europejski
biopreservation
junkfood
nonrecognition
pietruszka
deglobalization
lclaa
tsukerman
hatswell
stephanotis
leptokurtic
ebertfest
kigozi
deahl
risedale
bukharans
forhan
rufolo
manageably
abseils
halilovic
shagger
sportech
congruently
illiberalism
hansgrohe
myren
westreich
bookrunner
kicinski
hadschi
seebacher
emersion
saylors
abayas
whimps
dayani
sladden
hodierne
muron
lutvann
khazir
nyamuragira
xianju
flowe
falsies
ardkeen
koeltl
arnesto
competitivity
vook
xisto
virtuozzo
artisphere
exceptionalist
rmeish
colodny
baldknobbers
aspinal
antireflective
morozs
leibrecht
lawzi
sidestick
pretendo
marze
kantaoui
hellshire
wonkish
castleblaney
acupoints
llanfrechfa
tegler
acams
hushes
pdna
tuttino
macchiarola
slenderest
ncls
fischmarkt
lovably
dejene
jelger
santibanez
uriona
creaturely
icomp
unexpended
bindschadler
elavon
arberth
cuiaba
panglossian
stoffregen
pikermi
epron
nirah
popline
kriikku
haean
lacosse
jines
ssdd
copaken
ziane
unzoned
datadirect
alenius
solskjaer
decissions
sevelamer
belsonic
poschmann
choas
overtired
bazira
lhtec
ofccp
stickk
bossou
rahlves
cegelec
wanqing
muslima
lmrda
speakerboxxx
abdulelah
saslaw
industy
difalco
carrigg
koronka
ccrn
gerboise
khudhair
orthotist
terwindt
renardo
subtil
riversimple
digex
alwall
authorites
smartcode
skeeball
kuijt
megrim
hamiton
chatrapathi
repellants
eimbcke
jablow
hookstown
kahoot
ahlemann
clarian
ackrill
theocrats
naidenov
truvy
beseiged
mtec
groupo
sabeeh
kravice
valleymount
wustlich
dermoscopy
subleasing
hirni
cangialosi
mutators
kasteli
contactpoint
fearnall
shalvoy
darline
dciaa
electroretinography
probem
bindeez
verkamp
norvir
fiberesima
kimeu
schulweis
tokoyo
katija
nehmat
davaar
ironshore
bisoli
cbcc
mohieldin
salinomycin
intihar
marynell
feichtner
baseley
amirshahi
kabarebe
iteself
drosselmeier
ngvs
dogzilla
reckard
charlston
chimpy
marksburg
melisse
bogdanich
webbwood
rzn
cornworthy
witanhurst
fuseproject
immunotherapeutics
dahlander
deps
hamblet
bratza
scaffoldings
kleisath
reath
hoverport
santaris
porteno
sunsweet
physios
mevel
sths
harmy
jarnigan
gillespies
alteplase
hillion
drewer
nonusers
labberton
rimoin
toepel
compartmentalise
vigário
chifflet
correze
altuzarra
beloglazov
vescio
pailler
realtively
vividus
abdolhamid
aacd
polycap
huibert
gnassingbe
compulsiveness
placeless
unadvisable
bayliner
marife
datamart
lackritz
oversell
flavoursome
unpeopled
gualdoni
gilliand
medinger
maketa
jaszi
minibonds
tapfuma
blarcom
worgu
briliant
sangtuda
abusadora
kimberle
kirna
ericcson
ithaa
staceyann
bohley
nony
fettig
sporen
rfog
directline
bloodgate
rubeck
saiid
downslide
jianxi
lhj
joachin
druin
suntrap
eligo
cowhig
xhp
derow
musclemen
takling
ultrium
llanmartin
cusma
zumodrive
soaries
niebler
geddings
djamil
braford
shalonda
precrash
goup
pcff
boubous
overpoweringly
lanzillo
delasin
lonquich
shavitz
swanke
stoffers
trifold
matusiewicz
reetika
baberton
himmelsbach
ichimoku
nwpa
kahmunrah
unpopped
amio
bostjan
shezan
glutting
juanas
dagres
rifamycins
stumpo
reanalyze
involtini
lemarquis
obenchain
edrisi
oneupmanship
jeake
dowes
kelyn
maretta
middelgrunden
ardia
plushy
rightfulness
manuokafoa
ultram
sipacapa
propylaia
rubico
wetbacks
sinnreich
forlenza
hibdon
cybersquatters
candelight
oversleep
suchak
bgas
contesse
protrayed
petryshyn
quotron
khemraj
sumir
subliterate
ouaddou
nwfz
islate
lahno
zinfandels
tidc
cieply
mehrin
vacquier
laingen
nonperson
pesamino
renze
nanomech
overaggressive
deutschkreutz
lempertz
discardable
walsleben
noncriminal
raleb
hoppes
mocek
whiteouts
ancrod
bronington
nasic
ardaloedd
sumatrans
fightmaster
escalettes
bialecki
pteropod
gulson
blackwatertown
withcott
mariqueen
walaa
onychectomy
mahayogi
knockback
nisbah
aberglasney
distruction
lymelife
auldgirth
mischievious
closerie
medra
diovan
sncr
militarizing
tattum
cyren
fulled
leatherjackets
latiker
greenthumb
burleith
sandeel
muhith
punctuational
minerality
wesbury
salavati
nautile
weingrad
creekmoor
ruds
corniglia
drooled
coovert
remeron
overwelming
inshas
biodigester
deseve
krogsgaard
talba
dormady
riduculous
chichvarkin
pfieffer
gwrtheyrn
etran
bgmea
fouchier
hmongs
blaszczyk
mysky
mastercards
aillagon
latinojustice
mtap
mandelker
pagola
gurjeet
,just
envira
oxpens
maconachie
gkss
nankivil
garaufis
katsuhide
brymore
fingest
alwiya
morfill
libertyland
romines
squaddies
gillete
insalaco
rusanova
pccp
sdku
byplay
conscientous
etxerat
suellentrop
maisemore
mayskoye
wellsite
wiracocha
garmoyle
kifaya
hyleas
tawanna
namiq
boultwood
randalf
speich
maigari
jyll
echarte
scharffenberger
microvessels
craigshill
culik
ieroklis
plene
quokkas
stamshaw
fedotowsky
sarande
nevils
vneshtorgbank
altentreptow
bloodiness
cinram
fullana
aysan
lipless
ezralow
jamkaran
chatzimarkakis
durabolin
farrellys
glyburide
nordenson
xhij
elist
marudai
kanjobal
ulatowski
tamarinds
inveigle
ancroft
provde
aaaai
datson
rossmiller
seitaro
zenzi
thundery
wtlr
beadboard
pinnochio
sandbot
kettl
beeney
hankered
sanil
perfectmatch
chrystine
intertechnology
crucell
oritavancin
zombiefied
makarezos
hamudi
giroir
sussing
sullenger
dinops
lepere
restoin
jinxes
premera
sabrah
snookums
elitebook
duverne
pfrda
voluptuary
sciolino
solarreserve
rubinfien
vistage
depaoli
manios
sesat
chirs
fridovich
barhom
bluehouse
highter
farbod
corkman
torquey
untraveled
qrh
blist
medos
mobsby
muscularly
prorsum
halischuk
sunsmart
daffin
bronzer
whizzkid
yowl
svitak
auslin
trallwn
wolviston
puempel
prosumers
bazian
appeldoorn
aanp
abdisamad
mapetla
amosite
halawani
youlus
babinsky
udderbelly
michellie
isatabu
gpsd
ramim
haliru
sumara
multiagency
kavos
aneez
tallini
garimpeiros
teeton
itss
alcantar
regazzi
gooses
shucked
snippiness
stufflebeam
owaissa
rockerz
zalingei
adjudging
kluth
conkle
chlöe
polsloe
sloma
gilberg
dangermen
bloomwood
engwall
gondorff
perplexingly
flabbergasting
indeedy
webloyalty
lumbres
faqiri
tibolone
autodata
wpni
inflamitory
sheridon
kidspeace
klean
urica
muelheim
boaty
sandidge
gelnovatch
haggie
finnaly
norinchukin
biogeographer
branekov
fiquet
addictinggames
elwen
barzanis
mashamba
hatemonger
egroups
desruisseaux
lapinsky
beeber
rivarossi
tiddlers
mcquary
shareece
nonparty
crogen
motiwala
gbmc
tomonoura
zywiec
tarali
citigate
mourtada
vannatta
daies
nnpa
ecast
lavicka
headquartering
risibly
dioko
roscas
disaboom
cuilapa
wincheap
mullappally
oritz
nonnuclear
bizer
kubbel
pierot
riderwood
tutut
crowdflower
jabran
mackreth
ataya
giftcard
podcasted
inate
xpressbet
kruzan
lunine
begats
polezhayev
kaloi
feminicide
peckish
durbeyfield
portballintrae
drunkeness
cribyn
rulemakings
psacharopoulos
kraynak
eukanuba
idaville
boomeranging
spml
chainstore
kecks
bonforte
taisir
witeck
terdell
americains
corriher
unvented
twilit
postludes
waylays
bootless
ukes
anbaric
renson
tambang
flanary
hairlines
eagach
uniteds
hinote
rearguing
piela
subutex
pavlak
kawaminami
kamakazi
visner
hamblyn
wychbold
succede
bottura
rwjf
mengtao
birleanu
aneisha
zhenming
ciliberto
xxxl
beljan
smarted
skiier
surpress
wubs
gijsels
obfuscators
handzus
pedu
methacholine
conked
rilya
isometrics
fillbach
ranstorp
moazami
jusu
fulminates
fiedorowicz
gulbadin
vinader
gatens
renewability
veddah
plumpness
hoeben
marchois
birnberg
rubinius
distempered
nijgadh
grainier
becos
burnhill
uzbin
swro
rasooli
straniere
ballygomartin
ibtissam
cetaphil
rouner
nicia
seiders
messman
danilson
buddism
melamede
feasey
mozar
dlouhy
herberth
kristovskis
laau
wittler
conspirative
mumbaikar
cybertrust
immunex
wakakirin
safranek
briault
fnar
firemint
aerobars
mbonyumutwa
lleucu
sidewiki
ahmadiyyas
klinenberg
tarnovsky
natashia
stainborough
tanoesoedibjo
oliveti
musakhan
fellgate
preoperatively
neckles
talbut
coedcae
sunnyhill
tswalu
hepburns
noden
behzti
ogwang
wedig
asinelli
nahoum
spreadex
kinnane
bogliacino
sullens
liinamaa
ohmer
lindseth
bielskis
liebefeld
lowhorn
rogard
hearthside
mickos
ergas
raedecker
brownshirt
ariva
filching
morgaro
cangiano
sjhs
zhengwei
pieh
potesta
shamsolvaezin
takemasa
vivary
bistahieversor
rathsack
dachigam
undetectably
pisanelli
kileen
mamary
reallocations
kiszelly
postconflict
lathering
orionid
garbanzos
delvine
fedbid
guberti
krankie
moorend
amerindo
bagsby
kahlke
mulvagh
barklem
vannet
sarber
yande
pharms
censurable
zuccoli
wadajir
pgis
weho
stamatakis
royles
remebered
brainport
bachvarov
spaceloft
huffstodt
hidehiro
sutherby
nctb
nondurable
sramek
masticating
lagno
mxolisi
nedal
nonracial
marinière
hansards
nadolig
wistron
pulverising
guidewires
pumpjacks
humanisation
roeske
batona
ballingham
golnar
epoll
cappadona
boomsday
habsburgian
jumbotrons
meritech
indoctrinates
bayrock
tilin
hichborn
demarlo
powindah
verloop
abreojos
sozio
liechtensteiners
guyhirn
ridiger
klimm
donto
babikov
ingoma
pxg
tedburn
winckless
demuren
narriman
rehome
killadelphia
bouderbala
crumbley
edidi
youssuf
kofax
hopu
villwock
reguzzoni
galimzyanov
cordano
chongos
vandome
centofanti
zwelinzima
camley
popmaster
edgeworthia
councellor
tuwaijri
fallbacks
pitso
nababkin
burno
zeine
jumaili
plotlands
whiffed
jabbi
numico
vasilyuk
seisint
patternmaking
pharmed
thigs
brindel
lagger
mashru
sonjica
neossat
onaiyekan
ingenuously
rectortown
pianc
tangun
donnenfeld
wanatah
oldakowski
lawver
softex
culpitt
wiginton
romanticisation
baltique
tenderers
shoreacres
hypophosphite
isavia
marumoto
hołowczyc
chritian
scotchmen
ofiesh
mundaca
lorincz
tryless
roundnose
discusss
reradiate
sakhai
iveri
fagina
malasia
petkovsek
streamflows
zvue
bortel
fliter
rahmawati
thür
lisses
ellegood
boaler
scuppering
minotaurus
muralie
tryson
quartino
rockhound
bjorg
kladis
smartwood
pirooz
ringera
foveran
ritchi
dumbly
prarie
donw
colisee
csae
flextreme
harshberger
scialabba
ziedan
hinstock
hochfelder
neaten
oludamola
truculence
markon
grandcentral
golinkoff
pasal
dandyish
atamanenko
aspiazu
rondini
americold
paralympicsgb
banktrack
farj
falorni
strasbourgeois
lecointre
busha
luddenden
fluckiger
tilc
pompousness
hofesh
isacc
moorlough
rearers
lajuan
yusko
stupenda
degreasers
stonebrae
quitoni
llinois
ustads
riiiight
underpressure
conqu
brunjes
solidness
roundarch
alvediston
cachaca
mowachaht
minchenden
conpiracy
gladiolas
devillier
methomyl
kudukhov
isango
katritzky
uznadze
sayyah
bingol
cubatabaco
phasellus
whle
oeh
arnebeck
absurb
adailton
xolani
divergencies
rüstü
bunir
halafihi
sallyport
riveras
fingerpicked
cashill
dendraster
peolpe
detica
yares
supi
tibotec
peptidomimetic
trenant
piotroski
salterforth
busradio
shimshi
afflelou
smeathers
coeliacs
bajin
creosoted
singpost
munai
sneakerhead
pentacostal
multitronic
shandel
riflemaker
shekleton
dedomenico
sensage
sediqi
deadlifting
runkeeper
hamda
enervation
westlane
weightiest
unseals
matarrese
fieldfares
blls
lindeth
nunam
mihaileanu
decathlons
okines
artlessness
geiers
makeable
jurisic
legwarmers
recutting
dynex
anraat
hyperthymestic
vitit
curlicue
yéle
rafayel
mmsp
tarrab
torrecampo
maylor
accessnow
qirim
kansal
recommenders
kimkins
byzantinus
banabans
voskuhl
silvernail
woolfall
ijmeer
auble
ferociousness
ruvell
inseperable
bernsteins
hennessys
hutchisson
myspacers
althorne
bullar
sahagian
fabrick
baybrook
fredenberg
haeberli
reppetto
latchem
yakhchal
independen
decho
mishelle
hellscape
cummulative
moneytree
sutterfield
freerider
elonu
pitonyak
shayana
opower
samdhong
mindlink
fortismere
palaeoanthropologist
callero
lewdly
injudiciously
bednets
crackup
rapenburg
exfoliates
supportiveness
bluepearl
zhenkang
schureman
mclovin
refreezes
unmetabolized
blancaflor
resendez
eery
montanino
khoc
limbered
tanser
paradores
ningrum
kammann
augustow
encap
schimdt
cloudscapes
brioux
movsas
fengate
ahto
appleyards
amatriciana
quarrata
babajian
finnane
skirboll
newstand
bumpersticker
cowhides
timakova
kapachinskaya
bolongo
ilshat
mcglinchy
kachur
bergfeld
nibc
tuluksak
hanchard
tompkin
proffesor
peacenik
cracktown
panthal
xiaoji
beguilingly
qosmio
verastegui
prodea
karagoz
biohybrid
mushikiwabo
raydah
dubut
godell
chidyausiku
sindicatum
flakiness
cardetti
angbwa
cederqvist
hedgecoe
guck
shahids
southtownstar
tostevin
scence
viars
croslin
bewerley
besseler
plastow
frolicked
cyberbullies
qigang
fortna
beligerant
desn
gurwara
descoings
cattiness
middlehaven
warshauer
swinish
paasch
bradach
ghorayeb
brookyln
varshalomidze
pidgeons
unweaned
netham
levemir
resubmissions
frns
crathern
bajak
eisenson
maskill
djup
audia
vicos
pitcaithly
cdls
germy
tostes
dandora
baussan
ahrons
eswt
kailani
divnich
attilla
zenprise
heibel
rudding
ubel
boshears
amorella
usuals
montra
islamaphobic
cpts
brnc
malbun
sdti
hangdog
chamon
unirule
swarzak
spasming
lazarro
lesaka
gulja
mainstreamers
roneo
banel
polyphenyl
shopkeep
territorialisation
acerbity
dulloo
mullner
anterooms
kajara
jaylene
pyaw
lowitt
kelbie
sloate
griffths
uocava
bhfuil
aslund
naughtin
erbistock
nantyffyllon
mouzannar
tapiche
brynsiencyn
overdress
ntdtv
ebbutt
edelkoort
jingying
imat
pozar
sheetfed
pimperne
nikoi
jousset
cosponsoring
shirwan
choric
heininger
aboushi
hilfiker
gladhand
lorigo
westmoore
stichbury
kneepads
meanspirited
fessed
baere
pastizzi
rowghani
krikalyov
akapana
hyperintensities
swingline
jusino
yazmin
ngige
nordmanns
guillaumes
redridge
dhuhulow
smirked
freetail
evenflo
lugwardine
splitt
ronreaco
bahiri
intracoronary
michihisa
drinnon
joud
bils
winair
seeboard
selliers
kiyemba
suitner
delys
sepracor
restuccia
corlis
urmeneta
chisipite
samoon
sopheap
merszei
brommer
gritters
shereef
belcaro
brostoff
nogliki
gestring
hohenfeld
digiovanna
boscaglia
sammich
beshenivsky
rinto
shalamcheh
champman
calcipotriol
garze
lattari
wanlop
biobricks
karell
kiteboard
laudati
carbones
vizor
brawns
disequilibria
assalamu
churchhill
rafshoon
circello
dohmann
frutarom
resubmits
totsky
enninful
losinj
distructive
rosbank
faher
donica
pereverzeva
cyflymder
swansey
mahiki
bacterially
fredj
anduril
kokocinski
sabrage
manicotti
embezzlers
massingill
bourgeault
plagerized
humba
devourers
subtlely
gunbattles
glamourpuss
mottel
sicelo
kipahulu
rowatt
ueps
meckseper
bubblicious
unbuttoning
khaplang
finchum
adknowledge
turnoffs
airdam
invenergy
meydenbauer
saglam
incriminatory
hedderson
sambódromo
acredited
vondeling
jiangang
pizzala
elmaz
yelding
janic
fancypants
facilites
gangel
blaichman
wolder
butturini
stalinesque
feener
parvaiz
yordy
piening
chenge
gormezano
absolutions
elegaic
prehypertension
ginno
burgdorff
itest
willemien
gipi
southerham
tatopani
nawc
runflat
aubain
imcomplete
ufip
aaoifi
gbadebo
jindi
wearability
microamps
simunic
vscs
nebulization
blyk
oscypek
espitia
quickcheck
vanacker
deß
hatemongers
bucheli
perniciously
rosow
araskog
legislatives
mearth
barnacre
unsegregated
mambetov
poblanos
dweeby
gason
dadwal
hexapus
schüle
pickus
kenjo
plax
marineau
thrumming
malual
clotheslining
videoing
bailers
bankok
demilitarise
goodo
thrums
picioane
novated
bronder
helcom
champurrado
infinate
celebrator
nadhmi
ollies
sylvest
fingerpainting
daid
chebii
llenarme
kirpans
bubnik
sonka
ugulava
pennyhill
chavot
sheekey
undismayed
paktribune
depoliticize
recountings
esrin
ngoepe
nyboe
finisar
mohammd
scamman
firsters
guellec
nahwa
pryors
tadre
sluss
onuci
adamy
ferbrache
sieci
lyophilization
dentdale
stratacom
misali
karwi
particulaly
buytaert
oneroa
zizmor
sadig
mohammadullah
alldritt
dentsville
spittlebugs
medcap
wovens
goaless
camana
pathologize
chodounsky
spreaded
foodstore
fairbairns
cropton
lorent
intellectualize
formstone
agustinillo
monkwood
haif
resynchronize
chubachi
tennman
muilenburg
caradonna
sinex
ingrowing
mtss
disembowelling
mahnut
pitofsky
coopervision
cappato
romaro
kenco
elmesthorpe
signle
goldenport
hallyburton
frmo
jariban
hrycaniuk
unintimidated
plebiscitary
draughtswoman
gruszynski
adega
naths
kleb
enersis
baradaran
frontlinesms
giddeon
dewstow
attalah
schachtner
whitleigh
subconciously
catsa
sullies
lamassoure
earliness
preemie
tourismo
revital
zemiri
bemko
ingves
felicities
sawzall
snediker
cumbes
krainer
karlic
stopzilla
fayston
dawod
gunashli
heizmann
brooksley
agropur
romancers
forterre
wejryd
shihe
irrisponsible
tootsies
llundain
omniflight
thorvaldsson
exemplarily
younkin
oubrerie
demtschenko
mattieu
sroda
gutkowski
benville
dobberstein
sixmilecross
uncongested
aveton
ansfelden
coloe
scratte
abdulraheem
bancard
hästens
vannessa
luggala
dethrones
hillgarth
camolese
sinak
culos
supremos
hennops
qingzhu
longlasting
hakims
strobed
ccpm
americare
iconnect
xta
barayagwiza
suminia
winces
gjedrem
backsplash
vandura
mstr
aquebogue
paciocco
treliske
biogeochemist
tearaways
plastiki
groovier
petfood
ingrida
genially
kaydee
kaeshi
pocketless
impetuses
khachapuri
eminating
budner
teplitzky
hkmg
vivaz
schieler
birnau
slavinsky
apiay
rouged
herlander
oldani
gilster
cremators
vagary
ldeo
blindsides
fisita
nanpean
mulvaneys
timeconsuming
prognathous
clarificatory
orthorexia
spacehopper
bartoshevich
msph
tongson
codetel
zahreddine
panthenol
sandvine
gazumping
milhollin
boding
mseleku
potpourris
bomana
beligerent
ilove
shakan
weddingwire
gianduja
mweene
vancouvers
landican
tsokos
rorting
levance
lameiro
gracemont
chaske
manservants
harlotry
whities
seche
usgif
commodifying
upsell
nmsp
psaras
donolo
mascalls
presbury
weisbecker
miltie
genencor
nrlb
plme
mattimore
dahou
imodium
zerai
longjohns
croeso
solat
unleased
waelsch
xavière
sackful
osinga
deepdyve
levkovich
illigal
sinotrans
portnoff
kurundkar
luesther
eardisland
shpa
brioches
slimmers
wallahs
thrasivoulos
shivambu
caparulo
harop
lampu
veals
onepass
schiesel
intraregional
cbrl
glenravel
offshored
lorus
sautoir
shereshevsky
mandache
stafon
billout
oapi
arpey
draganic
radox
shabecoff
empanel
llanbeblig
scqf
dumiso
buzztime
michalos
ludmil
nregs
hoons
dabbert
possition
preoccupies
romneya
lidget
theweek
anchorless
subsistance
borroka
thomasz
skycap
peschier
sagittarian
welat
saqafi
remigino
jibarito
slothfulness
myopically
gosi
pushbacks
carpati
amach
rocori
losantos
aquadome
ricciardelli
middelhoff
gilnahirk
neckless
morem
chiplin
fuhs
winka
insalata
schlub
khalvashi
materpiscis
bukoshi
vallese
cetc
microserver
charismatically
reish
porthminster
virshilas
cinematique
pfandbriefe
jingbo
nishimatsu
miasmic
callands
scandi
korrodi
asnd
cavalaris
beechams
octapharma
sahlan
doripenem
prtm
sphygmomanometers
empact
pickwickian
vhcc
osee
sirtris
goldsmithery
ingloriously
cuase
kobernus
telepod
jailings
floridiana
gradeschool
sharot
schmitzer
dismantlers
spauldings
multisensor
jobanputra
benumbed
busquin
shamban
maqu
preceived
hennum
seeqpod
thegrio
usdla
abary
wallersteiner
gaynet
glaskin
laleston
salomoni
crispiness
establsihed
wojtala
ingeo
issur
adenoidal
hret
darjina
khramov
adelfa
trewen
manzor
inzer
hemosiderosis
segneri
accredo
petronzio
nooney
divex
ignor
ghaidan
agrella
flaxington
septe
claxon
leszczynska
gaudoin
appeciate
daftness
sampsons
montenegran
unpassed
dazer
kookai
nabiullina
unlevered
wopmay
leadin
forgent
schlicker
flatty
ramsis
avdeeva
doornkop
topknots
financialnews
boily
dennise
lelay
tsbs
shysters
kargel
trenc
herschman
fiorilli
dantrell
rennaisance
carcraft
hunkering
hofferth
cornas
socialises
ogaryovo
ignatas
scoopers
gahler
ostholt
solitair
masorin
payi
cubbison
percovich
manibusan
alvardo
narcoanalysis
theoden
edicule
bataller
diehm
daikundi
zaluski
newsrx
monbazillac
vriens
pabulum
loftily
religiousity
shenson
saylan
effortel
cibulkova
goldmans
situps
overpack
cpma
pervs
scarse
vinashin
peformance
meichtry
exoduses
pmbus
levandoski
darnah
odigo
acsu
ftk
zuur
gawel
eleve
wvwv
wolanski
rereleasing
bioscientist
parenzee
vscp
buildin
depositaries
ragot
creedmore
carrville
perasso
spillar
bokum
marje
whatham
autotote
devitalized
temesgen
bagnal
otcs
surovell
sheepcote
toxt
triaud
zaborsky
cafarelli
cherkas
coretti
azertyuiop
ghundi
cahyo
bristed
krevey
twitchers
cannulas
paiano
campanale
holdingham
auteurism
bussman
vanquis
saremi
hammergren
robock
overcompensates
leidecker
ruault
ramezanzadeh
holleyman
exoticized
uduaghan
spagnuola
lomelin
trebicka
doffs
linkman
mereghetti
myofibroblastic
antcliffe
shimbo
nouzaret
wildridge
maket
peterchurch
bazzel
sunai
aaae
spotlessly
kayali
kamphausen
inexpressibly
talkleft
aeroman
youngstorget
chomolungma
clevlen
scien
bouchikhi
siracusano
sdtc
trunzo
banoffee
claimaints
anela
unwaged
conscienceless
mevlut
datcher
satoris
ahmedou
bakhodir
teashops
klausmann
bosky
beachgoing
motahhar
mefin
utton
brami
siknis
andreesen
nonexperts
eshbaugh
gamlen
dordain
corazzo
arthog
laboso
turgidity
famista
sadara
misdiagnose
attck
hansack
nisenholtz
mccaine
warlikowski
wingas
petajoules
rachou
fieldings
udwin
failer
abuk
inms
tshewang
khazaee
sipgate
drnovsek
xuenong
seamlessness
churgin
czene
reitzes
dehiwela
toget
oldchurch
mellits
cromitie
takanezawa
ecotours
delawareans
fierros
eshre
struckmann
unburdening
optenet
petards
talaton
corthals
mckerron
zaccai
sukardi
fanlike
anowara
demeksa
veeteren
anable
shotmaker
polyvinylchloride
sharrif
jacquemain
dunbia
rockish
weinbrecht
glamorizes
najmah
mendheim
rianto
pcit
mesarites
kealing
reapproved
prokovsky
utterby
frustratedly
ibcp
willowwood
airbursts
mekia
tarkov
pruszkow
nurdles
manipulatively
iwuji
weeford
esio
falik
hojjatieh
naulty
greenlining
octoshape
skenazy
wilcott
trewithen
roccat
sabate
lukusa
superclasico
intitiated
irham
preson
gpha
schnoodle
tanon
massequality
energises
feinglass
brickbat
vandaele
nyamwasa
fxc
brezoianu
luffman
chernyshenko
lpgs
kumakawa
duferco
bontempo
teresopolis
blancco
dogherty
imprtant
majia
armella
aarnink
interpet
multipronged
maich
psyching
mecl
syder
bassirou
hydrotreater
conlogue
fouettés
upsize
greenquist
iloperidone
gigajoule
ghezal
quevega
studioeis
swopped
allaben
raimes
xcite
taruta
vacs
hayemaker
mastec
purred
khademi
coppley
sheroo
makridis
rationalises
liveauctioneers
licadho
batterman
warburgs
adrenomedullin
influnce
steenie
utterer
harperentertainment
ishmail
layalina
horpestad
emda
perisho
balcazar
mcmeen
daubs
reconverting
incluing
nieboer
kalaloch
marvella
shugars
minamitama
ftvs
koduri
wagaman
marmari
healty
filmgoer
mirdamadi
chemel
poststructural
bankability
suparat
reclusorio
merdare
yasamin
haist
larasati
xtuple
methylnaltrexone
shengtai
gimferrer
vallverdu
sevket
omos
talkbackthames
kheifets
petruska
mundon
fitgerald
boed
astall
ptss
channeladvisor
distate
mirtchev
noseless
rumiana
englin
wexton
huaxin
jehn
campaining
daddys
yeman
bodycote
bluefins
risbridger
publicy
pottie
nby
wenbing
skorka
skyer
peacefull
zellmer
bartonellosis
desjoyeaux
huneck
ecoterrorism
ladenis
januray
ecclesiasticall
bhagwanpura
gvir
comacho
larsons
laparra
pixelvision
prosise
fengling
kreteks
uncorrupt
centeniers
wamuran
acciavatti
dunlins
sunderlin
clearys
stannah
smeller
vdap
otty
kirumba
babrow
swedan
naymick
cargin
stencel
believ
beltless
dacunha
haematococcus
namsa
scheimann
fskn
airmall
nannetti
zhongneng
opnet
gorske
kuonen
denderah
sportwear
nopat
henningsson
proprietory
shieldhill
sinorice
spideroak
collemaggio
harrodian
terrazo
fayres
egoistical
fugee
birnkrant
bioabsorbable
beetem
nyantakyi
precip
disuade
popwatch
soundbar
barbano
tesak
bearpit
fakeh
izzies
lcdp
douzinas
southrop
berdie
meikles
senkowski
osaf
melony
pgpf
zauderer
tumeric
stissing
appendectomies
sevcec
frémeaux
sahim
ashtree
guyonnet
cannibalising
trewyn
zinzi
audiotaped
jarjura
airong
fleetlands
outof
fircrest
velud
apsaa
hackey
gangbanging
divisons
easl
insipidity
aboutthe
ecollege
gamekeeping
dernegi
karimojong
subtley
anritsu
yanky
raghavachari
congradulations
piatrushenka
hommell
shiqiang
rhosllanerchrugog
bredekamp
nitrofurans
nutball
neuroblastomas
orcel
semiprivate
numerix
mychel
donyale
addenbrookes
mascarello
nonconfrontational
yevhenia
schottlaender
solimar
fairtest
tailby
khandkar
edmondstown
chassin
aquaintance
valedictions
chambe
lifelessly
travelcenters
hiddlestone
macosquin
sueppel
calabuig
kallasvuo
waggish
kiling
lubes
jufer
vmy
tbtf
whoopass
nomophobia
kopko
pampelonne
stanistreet
reicin
amerijet
predeployment
shadduck
legedu
avocent
konowalchuk
refuser
corrrect
njoki
edrm
mordashov
shockheaded
jingmin
medwed
scheld
abdoulie
brahmsian
tcpalm
semos
riformista
repuation
ebisawa
tingsek
anois
risedronate
qaiwain
saaed
reselect
bistec
ventisquero
marabe
smartpak
mossor
somewere
skupien
debbye
klencke
tengzhong
humanlight
dumo
gramacho
nordon
blys
gillogly
sophies
scrimp
roghun
donchery
dyskinetic
immunostimulant
macrs
ledare
mapel
tusiad
jouanno
smashie
longhauser
resurfacers
panopto
flambée
theam
alide
ctfs
cisero
landazuri
msce
schilens
fornasetti
silhouetting
weyne
cadahia
sinse
caffari
jerg
mutely
dubrovina
schlom
lafrieda
jaghatu
cedc
corvatsch
starsuckers
skuce
overbalancing
helados
readsoft
gundotra
misfold
holloran
protsyuk
foxxhole
montagnon
sytems
fbcm
hobnobs
funeka
orginated
drobner
letowski
manhatta
rashod
bouillons
shamseddin
valises
guilio
viar
trussoni
roszko
wosniak
regathering
harsono
metlox
naqelevuki
distortive
mujawamariya
minnaloushe
grevin
lofstrom
gosbee
convertable
mitbestimmung
kinoulton
wintrich
guylian
pitanguy
throndsen
gurewitsch
bakia
cedre
filmless
crenca
baning
vadasz
magnex
sandroni
trundles
akanda
restrictionist
hurtmore
fanbois
scvo
musleh
moqaddam
usenov
deracinated
roee
niflore
uexkull
pulzer
mesnel
yesui
sentis
jaidi
poeticism
babah
stodel
csii
kazandjian
berties
unblushing
tadian
ertha
sunner
baskins
taghipour
thrillington
sokolove
ossó
omdahl
kornblatt
menegatti
beggared
traicho
messan
payzone
hashwani
frenaye
lamber
undebatable
puigvert
teamgeist
clangor
shrider
nomatter
scansafe
reapplies
recurvus
westrop
bettley
consta
iraqui
bioresearch
killias
airstation
huamei
mezzos
hollingdean
thesps
lovelikefire
gilbody
eskra
ppif
mctaggert
eichmanns
rookard
plakias
dartmeet
franzblau
olafsdottir
ethelwold
poleska
smigiel
malles
kalff
masimba
linnington
sovietologists
dufka
parrottsville
drinnan
dibis
amaraweera
timonov
crumby
phrc
clueing
dekabank
anchorsholme
bonce
shannah
quetteville
shfl
boyl
msut
makoti
kolasin
knuckler
susanka
horita
mikulich
mckerrell
fjf
glanaethwy
crumbing
exfo
unveilings
escarole
nading
rosanvallon
tenability
thoise
ahmedzai
towerhill
ukcg
paquirri
aquaplane
thellier
peiro
chapnick
radojevic
grausman
zapresic
heifner
jaymar
alibaruho
firelines
hangama
aamva
choom
llanllechid
muezzins
cellcept
scientological
vishaal
thourgh
siradze
saguy
garryduff
maamobi
anrs
gomperts
diversifications
ignobel
certej
gassim
tourgasm
lumileds
shaib
fragrantissima
bldgs
strambi
myrtaj
lichtung
ardekani
kilberg
erbsen
probat
replan
skapinker
cameraless
soname
dreze
adcb
ccei
aeroports
covingham
minimoto
grutza
cunza
regassa
merletti
utrilla
norwitz
damed
bloodfest
worsteds
woznicki
ferstl
xceedium
kreuzpaintner
logorama
quizno
misregulation
facon
xiaohai
titterrell
puling
osinachi
hotting
serapes
aranesp
novera
eikelboom
dignatories
iccho
kievman
walkey
excessed
thikse
trefeurig
ryvita
fauchier
discolors
morero
withins
gaumer
omlet
irrationalities
cairnbulg
shawali
kassahun
patsies
oncale
favolosa
omgs
pataria
waterpod
snowblowers
obdurately
haimson
fallowell
skorts
undisplayed
slogs
goatherds
reboletti
eodromaeus
ilikai
noncritical
bearfield
ebonized
rizq
swingbridge
castelgandolfo
poolville
bhuttos
bouchart
percutaneously
goedecke
oreskovic
palecek
arkes
mítica
accute
yeandle
virrankoski
luvvie
skolimowska
hootin
libowitz
bulbed
avocadoes
neukoelln
mastroberardino
bahaullah
príncep
associes
competetion
bertagnolli
galchen
gallix
haberstroh
acupcc
ninkovic
aldersbrook
uricase
skort
oleochemical
tradeline
contergan
mogavero
mrbi
physiatry
lagreca
kelz
antiballistic
leapfrogs
urquart
shahpour
huetter
eqivalent
seike
lerwill
santoriello
jelavic
rogun
bedevils
wastrels
figaredo
falled
clickatell
aïnouz
pourandarjani
sensics
frankle
rillettes
ehlinger
telemedical
caterpiller
pleached
mokrzycki
porod
holczer
vomitting
elmqvist
filus
arthrex
stemberger
bellar
sheikdoms
holsbeeck
magnusdottir
waymarkers
unamed
dukach
kilford
hoffarber
encashment
carlick
rascom
naftna
dunningham
calvina
farba
pellestrina
philosophise
elenydd
goettler
fiskardo
mrmc
zhaoxu
kattar
sandelson
streetfront
otzi
stonewaller
clarida
untidily
puskepalis
assement
suhrid
lanphear
lovelessness
poular
dubon
carnavas
sharani
maccumhaill
dsci
timidness
mmrv
masbia
mikeno
yaxcopoil
microtargeting
pithawala
zappin
slurpees
vichea
rhencullen
salutory
careerlink
sandrino
intermeshed
rozanna
zatko
sabow
yussof
petoro
burkleo
midanbury
beijinger
lifestreaming
daytrips
immutably
sarfu
raffell
rubish
nambia
sexualize
kavinoky
predecesor
agrichemical
holtan
schanzkowska
gexa
willings
rehabilitator
luyn
stranges
wedberg
kohnke
vilchis
towelette
postcrypt
sirenomelia
usitc
ragheed
azzura
kuntzman
ebener
malreward
heloc
forefingers
marchesani
omung
leprevost
splenetic
laschet
hurted
xuejuan
twere
fleegle
lloegr
amedisys
enard
havenhurst
crittercam
acibadem
siegels
spreckley
materiaux
skiena
ljubojevic
prijono
inbursa
filianoti
adhiambo
dailycandy
canonmills
setten
oberhelman
nakameguro
runacres
bluebottles
withens
confucious
geoeconomics
ghadiya
kanguru
subdivisons
edcarlos
porscha
interpipe
arumi
cbhd
sanio
healthplex
moisturized
szybalski
counteractive
tedda
prepatory
aropa
thinnings
georgeanne
ilimaussaq
plexifilm
eventuates
finetune
ostrofsky
geocultural
gambatese
iuta
cornton
garaged
hallae
whoopin
resistent
brookyn
shtein
bolventor
rotel
unscarred
chappers
morganstown
machaidze
wellswood
pipper
olesz
mesg
afifa
oudkerk
clowned
naturalpoint
monets
bielinski
yatco
sympathic
eshraghi
suanzes
melverley
paxford
thuet
chrissochoidis
ulcerous
theriogenology
estenoz
ojomo
haddox
kirmond
winkers
gibus
dammika
rowardennan
quicksearch
yolink
simey
cacerolazos
amerex
swimm
lingustics
oddcast
delucci
therap
kidero
ihnatko
xtraordinary
gtps
smooha
caddigan
monastry
extraodinary
yiru
monkston
chakas
bebchuk
graversen
azoy
butcombe
hammarstedt
indepedence
rebora
clairborne
edst
shopowners
sirmans
lungarotti
stategy
suts
girlfiend
spuistraat
sferrazza
navarrette
samarco
ajang
iafeta
akli
yiannos
reviles
venkateswar
mezzetti
pelagics
sumler
vermicular
akridge
syphoning
dwoskin
sparklingly
zyban
ganush
tbaytel
siniya
koomey
bouzy
shakertown
telavancin
spatt
stancliff
misperceive
tiquan
shalaev
hamlins
soccerex
palagyi
tution
qibliya
uvarova
pabor
shuttin
lidoine
skillsoft
shamiana
falletto
comfortingly
etek
treseburg
hypoglycaemic
rumpke
cinghiale
clovenfords
postmortems
nkoulou
kouznetsov
gilltown
nonfamous
petitgout
alpheratz
hossfeld
awasom
financeworks
dinniman
betsan
embutidos
bolesworth
youmzain
adade
bhojwani
weizsacker
chirilagua
nutro
protectants
mepivacaine
brickie
inderfurth
minimalize
kingsisle
sitrick
massaya
naughts
purbrick
toyosi
gruentzig
moussem
worral
befuddles
policital
shadmi
braystones
mojopac
strycek
perseverence
reynholm
bruited
battue
cioppa
blts
bacame
solopower
schierholz
nagusa
cherkin
kummant
backboned
dediu
pinatas
turkoglu
undriven
wipeouts
huperzine
procyclical
twinity
mandiant
swingeing
motecuhzoma
goldwind
clamours
dvortsevoy
bootsnall
baleni
unregarded
danleigh
seinn
bstc
socgen
moudjahid
importune
yassmin
nakhuda
theyll
recommitting
patrinos
josl
polyface
lionshare
senderoff
tradebook
hoogewerf
abdifatah
rimers
farnoosh
membreño
sgreccia
sabrine
moynes
riverscape
bacteraemia
darrill
askmoses
joels
sprinklings
ruisi
marongiu
goldenwest
siela
antiliberal
icic
dangor
britoil
osiraq
centerra
girbaud
starkers
deadwyler
pleva
ampal
montauriol
aigrain
promover
artour
raunchiness
pectic
grotesqueries
veletta
mussallem
persily
browbeats
quinceanera
refighting
hosel
hollomon
rezart
bongoville
taeb
etien
folson
tirley
guangfa
islamaphobia
codpieces
sfms
ecbs
kulevi
herepath
perambulations
bagless
havanas
voronet
bostian
woodle
irelan
carmellini
cowels
litokwa
telesp
understandingly
dreibelbis
cayuco
digitalizing
samanda
dunky
chanuka
gishwati
schmincke
ezekwesili
amegy
flirtomatic
ramkissoon
rerate
rosseler
outdraw
ungeared
fastech
cerezyme
noreena
paranagua
normansell
gozney
dohms
cacophonic
stroka
skeldergate
kethledge
overclass
downlow
uglydolls
bilkey
curteys
manolopoulos
ulanoff
meetic
timble
takover
kolobkov
laarman
gindorf
pizzicati
labadee
mattiello
eshetu
rosinei
froelicher
ribband
vellupillai
radkey
loffler
jiayou
donose
packable
applauses
papadopol
dullards
naafa
shanghart
hashers
marybank
tronick
fudgy
ambudkar
uphaus
steussie
stockily
tsalikidis
phosa
fuschillo
ncomputing
calfornia
ramotswa
burud
premiair
retroelements
grebbestad
alouds
vishnyakova
highflier
hurlin
baynards
undistracted
phanindra
configurators
weaner
tiejun
valarezo
snorkeler
lungile
medulloblastomas
piteously
slightness
teepe
poliak
abdiasis
stemilt
funderburg
raisinville
bidri
ramsammy
elemer
cleaton
showiest
sluttish
bdps
enck
olad
microdisplays
telvent
parings
pinkelman
jelmoli
popinski
stericycle
apaporis
ntale
bartine
labourious
namdrol
catrambone
quantam
poggenpohl
mingfeng
crinkling
aabs
wildcru
iskenderian
mccurrie
totonicapan
rendine
roomates
marjani
punko
konbit
sivb
friedhoff
unpropitious
cliam
magull
sallinger
mykhaylychenko
adisorn
poniard
kargus
angelyn
sonsoles
wgcs
sinlge
cochleae
diefendorf
chairpeople
lonner
somak
rudys
aving
fiis
rattletrap
sansibar
osgathorpe
unoffending
thaksinomics
insurability
misnad
odilio
poptop
hfma
konuralp
abromowitz
gattas
mustacchio
cabelas
trotte
buckheit
zuwarah
lutman
railbird
washdown
casarotto
myps
fcit
kinesiologist
depersonalize
gressly
speaches
floorplates
sating
talwrn
nutbag
recapitalised
nietszche
makhneu
televion
lepisto
senes
camhs
jaho
toothman
cafard
netzeitung
umpg
depayin
adamsen
xiaojian
sringar
cryonicists
zraly
hirshorn
recapitalise
smis
internalising
kalocsai
fidgets
bestplaces
isolus
paglieri
basith
schlepping
marnò
rescap
vitria
sporer
ntakirutimana
carrozzieri
emiratisation
sieminski
agonise
neyroud
naposki
enplaned
lumiracoxib
siekierski
ansarul
chinny
shiniest
diraige
ddlj
mernit
yearlykos
kimhae
sentayehu
tŵr
wattegama
underpricing
taggar
snabba
regorafenib
hoogh
samll
tarullo
guisset
polverino
bookstart
pressplay
unpingco
cetraro
teenyboppers
deppa
sundelius
tubulars
ethie
lycees
fridkin
zavon
mildewed
nuriel
vilje
benissa
seydler
evillene
theocrat
spitted
tianli
defanging
goeken
guidara
petroplus
zackery
bombastically
daurade
balford
corruptibility
crispen
lemanski
unhedged
peniakoff
mahmoudiya
huuuuge
morozzo
kleman
bogash
emmers
proliferators
paleaaesina
tovish
zelikman
lasered
mallach
patission
idolisation
vosough
biancocelesti
stefanek
quatford
johndroe
pulsated
crosschecked
dalewood
tuila
nayel
palaj
kaumatua
nincompoops
dennisons
sehdev
fraijanes
scalf
razeh
heusdens
pollenca
strategising
chaundy
intensly
talayan
haggles
gianello
juerg
evanoff
beardwood
novolipetsk
haplocheirus
shatat
qoran
dulcibella
jaycox
sakiewicz
naharnet
gutte
reagor
perimenopausal
ootani
eyup
roslynn
skrenes
gilbarco
topolansky
wyddfa
dirtbikes
manceaux
foreshadowings
foists
rongsheng
dhlamini
satco
alpuri
sommerin
haaks
zurabishvili
kabobs
shatzer
pramono
plitmann
ephgrave
maqal
iksv
suprises
piezoelectrics
koite
wdh
praver
odroid
scrunchies
biocentre
urbany
iwatch
parrock
bosherston
naturellement
nigon
lurve
jissah
effenberger
tourgis
venkatachari
fessio
goemaere
chuffing
seditions
gleadell
chocking
seved
morosov
egelhoff
cryos
bhaichung
haatrecht
gasparovic
intranasally
melianthus
bancorpsouth
ahikam
gdss
delavallade
sanburn
mckeeva
edlp
philosophizes
riverboarding
kulma
meningococcemia
harlap
ladylove
oeyen
beguiristain
speckhard
lillyhall
regenmorter
mummenschanz
officiously
shovkovskiy
argles
gorbachov
yakking
eulau
zaab
zithromax
gleadow
refusers
aldarondo
dinnick
hevingham
impressario
caucas
yitzak
tomizo
ripasso
ahhs
bellinge
clnk
etecsa
turmes
mowhoush
hickner
stonborough
inveigles
faurie
chaplinesque
vallvé
mynediad
gerou
broders
jerren
flaccidity
brieant
alaha
erlebniswelt
ltac
theslof
felzenberg
zimmerling
pomazan
lillycrop
bhui
mascari
alltop
lry
psrc
oronoque
cltc
henvey
orientates
cleale
trendies
rabadi
salangi
scrunity
apptio
houndwood
butenis
bierer
reliford
zezelj
dejongh
nechirvan
fbar
sacremento
nadolski
mapusua
craford
gremm
debarquement
npis
jalet
fernihough
brutalisation
eshe
cannoned
ravelle
sovereignly
clambakes
beliving
robotised
aguirresarobe
doohickeys
kampelman
marcario
vivendo
barshefsky
gradualness
khorakiwala
korytko
squeegees
yidong
bellochio
sarad
guardbridge
tillekeratne
chanmugam
backpedalling
ooky
systmone
fonctionnaire
fdci
longham
tsds
sulphites
yould
abercanaid
microvillus
piskarev
arrogation
fiatal
trogdon
gestodene
deadest
gallmann
charlette
gorau
crov
hansenne
nonreflective
ezzouar
ledner
acrophobic
adefemi
hothersall
databased
mvela
modrica
bedsore
scheibitz
degi
agathonisi
bougher
readback
healthcentral
hscc
butros
vosovic
rheolaeth
zappy
bingde
wakeboarders
abigale
devondale
nitol
saccomanno
manguera
temptingly
chippokes
trackday
gaofeng
hapworth
stewardships
brussee
gwbush
urusemal
apalled
holober
kwasny
diamonique
nizuc
wellworths
slaatto
cibm
dhers
saudati
bohnsack
tchato
salahudin
naharro
gjxdm
leakproof
brushlands
alfrey
bjorling
seube
narraboth
guised
packetcable
hogsqueal
bracigliano
ecopower
cashcall
everus
mummifying
villet
seckman
accom
traductores
bankinter
amjed
chemchina
stetko
meridius
recapitulations
rabeca
responsiblities
vendio
bastardize
atiqur
personology
sketchier
shutler
oblon
kaempf
pimpled
cafm
kampmeier
choosier
antipasti
ideaworks
kidult
gadair
gahrton
yurinsky
omido
fielkow
willersey
almarcha
luksa
cheba
ukshin
zeltzer
baratunde
turbigo
serabi
endemically
ermir
stpaul
esigodini
schletter
haishu
cissel
stalzer
oenologists
paranoiacs
loflin
ranjitsinh
bekman
pper
thirlaway
rusada
lathered
liljeberg
hazak
rayhon
redacts
deyaar
mceleny
miskovic
unrecognizably
wennersten
heying
harverson
isum
encasements
ocen
storywise
peili
yijiang
nahcolite
vertegaal
tavaris
meditatively
septoplasty
deolis
sosh
mooragh
rockbottom
neurostimulator
cheroots
montanana
foodshed
hirers
unax
wimpenny
bouchers
persective
morjane
verg
ruettgers
trainbearer
pharmacogenomic
marull
chanock
cholish
underhandedness
tharcisse
macozoma
tenaciousness
statkevich
marnich
guildmates
escude
bugandan
saffronart
watchbands
cereblon
tokon
bitondo
zarghoon
cfed
loutrophoros
desensitizes
tauch
lungelo
jednak
guiseppi
whitebirk
evaristti
confino
constition
grbic
kesch
ventilations
pehe
mtvn
mtpd
libsker
sufganiyot
pressphoto
overexpose
lizhong
rohrig
roseires
moneysense
athfest
unbendable
penrhyncoch
disconfirming
vdacs
occaision
galila
murviel
yussif
stateparks
slawinski
lasante
gyrates
armstong
servie
charvil
lutron
mejorar
estlink
marinopoulos
ekwok
lonay
izmailovsky
ladhar
jonjon
cbsp
kayumba
macintyres
noyze
perfector
promontorio
joyalukkas
treilles
fossel
higuey
partizansk
sternthal
adegboye
troeger
niniane
bengen
zacho
sandbelt
carltons
megadose
lisnek
surrell
churov
sherida
austrialia
datavision
bendinat
ixys
damndest
tilberis
lynna
palel
chineseness
fhlmc
booktime
talt
magpantay
lifewater
tiuxetan
ambiguousness
tomeka
darkes
zidlicky
qouted
occassionaly
gigmasters
rontzki
shemona
disbarring
jelenic
kloet
gianadda
gorteen
tranum
matinale
ecobici
szish
keflex
distrest
frassanito
rafaqat
mturk
alliegro
elyzabeth
lamisil
tesoriere
caraveo
disconsolately
rawashdeh
mefou
rslc
innerscope
lipocalins
sidner
moneytalk
fundamentaly
hongkai
eicker
kesterton
motionx
communicability
cameraderie
dornes
gearwheels
efficientdynamics
rightholders
gelatins
treborth
rafeal
dhca
tampakan
khallad
gronholm
erte
wordley
pefect
raechel
icae
divalproex
predigested
galgorm
cauquil
schrek
phangan
solidarite
dyagilev
rolison
tarnya
zesa
rolapp
neyens
staylittle
tsirekidze
mvas
playfire
gcca
whenver
tillya
thirkettle
undemocratically
zakria
europrop
loreno
velthuizen
eigerwand
linenhall
spectris
istalif
cpex
honaunau
garofano
duggin
vaamonde
prople
belcon
dumbos
wichner
thielicke
westquay
sprackland
relection
stinted
gremikha
annitsford
vitually
zige
dambrauskas
fosbr
gáll
hebeler
gsps
exxxotica
diamandouros
mazzitelli
comverge
anguishes
bulstake
pcra
igiugig
zhengcai
wincent
défago
michniewicz
clearspring
fantasmagoria
alegent
grynsztejn
lici
spörl
fromms
courreges
crimeware
refired
sahnoune
lixiong
romarin
whatev
raghead
foodsafety
boudha
uludag
poofter
halpen
panss
horsefair
waterden
askary
exumas
electic
goetschel
vectron
babycakes
hanoon
draftswoman
atcitty
subrogated
livor
loope
hohnen
nurhasyim
athr
bastakiya
funster
retiled
gissin
marrazzo
oyebola
ffms
edendork
doretti
eckstut
bonchev
suncal
expensify
sandbæk
wavi
awps
volkswagon
meutia
sunarto
lorance
felner
fanara
whatsmore
bosendorfer
amamou
outwin
alexandrea
finlays
honeymooner
pezzuto
weyhrauch
genetree
undurraga
interpenetrated
dusing
dragin
emblazons
multiwall
segin
highpointing
dsns
glowinski
jetskis
mudbug
tatitlek
shengxian
surrealistically
allnight
sunisa
demaurice
piscean
gonged
willse
eichhof
tricresyl
chapan
vaidere
dudash
aijala
spazzy
kaurareg
nchez
ratlike
ikegwuonu
deigning
stsi
kunzman
ladygo
chulk
stickhandling
loterij
dhanjal
mutative
hostings
feeblest
barechested
betted
warmaking
gretch
offspinner
cogenerated
buckwell
girafe
primping
taxers
explosivo
tecnimont
rightsizing
kolinisau
vinicombe
levengood
cresselly
voil
kazmierski
sousan
cerrigydrudion
kheera
skyping
tamizdat
investement
phosphogypsum
hayleys
blommestein
rvia
barika
altimo
kestle
vishnoi
eisenhuettenstadt
shandan
baiquan
manerba
oberndorfer
fruchterman
methuselahs
weifenbach
keynell
olukotun
aitofele
corenet
blabbed
gardenstown
hosptial
bercken
reponsibility
sukanaivalu
ponn
neaton
gladisch
gizab
vieites
puco
aperitivo
ncja
dandois
trenchmouth
immobilises
tresnjak
apacs
pedialyte
alvárez
coha
maynez
nassirian
karchin
khomeni
cressage
microdosing
brenkus
tutorvista
yummies
shimali
breakoff
broadcroft
boomtime
babler
mariott
mastromarino
korting
rustamova
garbee
roumelia
ecodefense
nastygram
locklizard
juwi
nuxoll
bottarga
amouri
ashoor
restent
sbihi
superficialities
richins
kenk
vivaciousness
springlike
jinelle
hoppings
tesofensine
niemira
phaseal
nozzolio
ozel
turaka
halkida
luib
xiangfen
fragger
bigheaded
mccaysville
carnt
brashier
cronenwett
cerminara
lenker
brizlee
bronxdale
redhanded
synaesthete
klarik
oseland
namiquipa
sunnucks
bodum
lovings
meagerly
waughs
frijters
sagittis
hoehner
shutterbugs
tariko
doltish
sarnecki
selmeston
fraire
selectives
garot
prosecuters
nuhw
ribeau
seminude
tavant
daic
tommasso
graesser
pelzman
fallowed
kovi
baldrey
brunshaw
putschist
udic
unilluminating
ivington
hhsc
easterbunny
gentin
milenkovic
picadors
linganzi
bennecke
ognyan
komac
exergames
oleana
nicsa
lltc
cholakov
priyani
sierraville
demaçi
vuclip
grubinger
fazakas
abstral
numatic
radogno
keshavjee
apitherapy
kiejkuty
sandwood
brukman
crovitz
memjet
rucinsky
worawi
mythologiques
doozies
substorm
bosomworth
eurolink
bodged
kjt
pashka
hajeri
overemphasised
hamalainen
vershire
tennesee
sunspider
knuckleballers
warnken
fallibroome
avoncroft
plughole
liesman
miniaci
quantative
kazimira
hipkin
skulks
swainby
ubid
howlite
rangali
ydi
shoudt
almadraba
jungersen
cacciotti
hurtault
briccetti
slighest
brepoels
moniem
krovatin
uwink
leylandii
rouissi
arguido
nahn
darnedest
kulibayev
hejin
pfeffinger
mahonen
abina
willumstad
asiah
coersion
amedei
sandos
sirnaomics
ashqelon
pitocin
gemfire
luscomb
meralgia
avicenne
salvers
energiser
harrist
iatan
zaika
strini
salterbeck
busin
mitul
superbank
addictively
animoto
cobwebbed
dishcloth
hizzoner
freepost
lionbridge
mayner
energetica
shutoffs
suparak
camford
picciolo
spaf
notetaker
fontanez
medvei
edhouse
shriprakash
blenkarn
anavilhanas
backhandedly
leezer
bouzou
warholian
cosalt
mcmurrough
cordasco
guapore
parentally
devenny
pimpbot
boscono
wimbourne
qualitest
iglfa
prompters
zarefsky
swagged
hedquist
ojwang
acceledent
lcdc
carbost
tyurina
plutonomy
nypc
daffern
elementally
eeps
smichet
aslant
jgk
kicukiro
reprieving
dovehouse
seljalandsfoss
paradiski
barysheva
unworried
industrywide
tactfulness
wishfully
najer
southwaite
glistened
ostracodes
procare
batallions
bikker
cavis
moeliker
scudding
ocotber
hudbay
afreeca
vesilind
dryfe
norsa
nobuki
berewick
sevelen
tellado
gabardi
vasilija
boureima
mclemoresville
crounse
klane
kaluka
batayneh
picogram
sylfaen
hopple
demostration
ulusaba
zelve
twills
nemenhah
hockering
gaiennie
cimpor
jurki
olallo
seinajoki
malingerer
technophobes
eyepet
broadsoft
nursel
leogane
bookfest
wicklewood
azalina
jostles
clasby
artiga
gavle
backgrounders
benhassi
makunga
mouratoglou
bernand
chiodini
sybert
devanna
vassie
klarsfelds
bullionvault
arcticnet
minguez
causewayhead
rongione
helmetless
deathmaster
clawfoot
glosserman
griazev
meuller
bandawe
aldape
shiferaw
bubaque
issueing
battallion
manalastas
periclymenum
pentabde
jurney
gladdens
batsh
tolstoys
deflective
paoua
cheysson
allusiveness
arodys
skylift
commsec
inocent
chubbies
narrowings
tchotchke
djuan
hayali
kreissl
wimco
lamdin
hackings
appdata
beelzebufo
vanderfield
maxygen
affectingly
celiacs
gorier
praiano
nuvelo
gudiel
krivsky
miok
dreese
manevich
beresfords
halycon
saudan
outercourse
ungraspable
inquistion
volonte
langbeinite
fadinard
siegried
simonenko
riingo
barreleye
zalba
wqvga
calworks
cattedown
lny
nccd
zartog
ciea
eyewriter
kardas
requip
zhifei
whiteners
nicklasson
expediencies
grimaced
lyton
paradisal
gianfilippo
tahli
glenforest
jodat
chiampou
strops
accce
beysen
readjusts
nonpromotional
incumbant
desano
semiretired
hameeda
ripely
slaughtermen
zolberg
deviceanywhere
osud
moiben
outmanoeuvered
kapin
bracko
gladchuk
mvcs
besmehn
wnbd
valleta
callled
terrines
talledega
verbalizations
raizal
fruen
jalc
cossy
hegseth
schubertian
arooj
unnaturalness
wsvga
overduin
suherman
kleppinger
francioni
emilo
dedas
codependence
ballone
hathout
timoni
degaetano
methanex
stillwaters
jesuitical
spoty
muhamalai
hauwa
marcianos
russelsheim
sheinin
katriona
ruggedised
potrayed
aspiro
romesco
issaq
fatina
letisha
remobilized
casalinuovo
schagrin
doden
towergate
italophile
fettah
makhtal
dalser
lumbu
transposases
kopuz
isaacks
wildmoor
satelitte
kamerion
haisley
mexicanaclick
luvera
laramore
dihk
merix
khos
halfpipes
underlyings
brassell
weybread
shiranthi
mabro
ravensden
miltons
nafcu
valleywood
cyfnod
psomas
spacelift
saleman
nowaczyk
lhundrup
tyibilika
abrouq
speerstra
nonaction
snowbombing
kalimah
ichim
underperforms
chocolaty
bragger
dzingai
gencor
puccioni
passchier
rivele
tepozteco
pellman
masurier
rised
trileptal
winkowski
inarritu
chettleburgh
sithe
globa
topno
ronaghan
kinfolks
karawala
chutzpa
balaya
delayer
pompons
loewinsohn
jarzyna
misplays
involuted
mogulus
aviakor
cepol
dreamit
lacewell
chartplotter
cupful
cetain
tematico
bastareaud
beaucamps
childnet
eckblad
copythorne
catunda
machtinger
snapfinger
arinaga
zennie
girotto
misspeaking
friedbert
bizx
treganna
hopkinstown
concededly
rachlevsky
milehouse
kibblesworth
psychodramas
toooo
macklovitch
mescher
toerag
larky
kenaan
designjet
amsus
bakerman
gilbeau
uhart
rosskeen
somporn
altyre
biank
brierton
dummied
smooze
nemelka
stantons
szafranski
lvcva
setya
batle
spetember
skytone
tention
chengs
rededicating
lendoiro
jgg
trefler
loise
detag
revells
aerobridge
doodoo
knuppe
undriveable
dictat
breining
comeing
recevied
kepesh
yonts
leadburn
wouild
metabolises
doelger
straeon
thinkings
nsda
relegious
wavier
easeful
cmedia
chicagos
daise
martori
coubert
yfantis
archaos
incitec
limato
gulftainer
poulenard
presant
exomoon
gajurel
arrellano
unbuildable
connecter
besteman
siributwong
clunis
labb
ghostman
amodiaquine
icebridge
superally
unregister
maharidge
korski
sobieraj
rebsamen
juliao
temor
rompers
waldfogel
highchair
ospc
omehia
ashleys
piontkovsky
apocolypse
seawaters
ultraconservatives
byworth
offredo
nyffeler
dernie
vivox
setaro
hosston
malinslee
tamgaly
grooviest
cadolle
abci
gainsville
wearies
tillary
bewer
nwas
felske
batiquitos
angban
compeletely
oglivy
egitim
twinship
westhoven
carway
pittie
brookmill
nowrouz
sekiyu
carasa
remondi
skillington
molat
racher
kunii
mapleville
giess
nauer
innellan
osfi
skoric
dasrath
barzinji
leixoes
gynae
stineman
thoght
angélina
allegience
glenkens
wahome
lealan
loremo
demissie
afficionado
brocal
offerors
mansouriah
cetinkaya
roskams
operationalisation
joichi
worldclass
beckville
kriangkrai
idarubicin
scuderie
fospero
ghazel
penetratingly
rejoneador
muzza
cenacolo
swack
thinkfun
midgeley
ajws
anninos
delish
vishy
misusers
srecko
schearer
penuelas
veva
rolldown
rabbiting
gibler
otone
boekhoorn
deloused
ghazvini
falkenborg
ezatullah
joads
microseries
scenarists
ugeux
waxcaps
vexatiously
campanologist
corneliani
scvngr
akeredolu
pozega
ameijeiras
senzeni
papenfus
needhams
karimuddin
audard
morrab
insi
sundus
nanosys
discouragements
fistfuls
prevaricate
anufriev
singlehood
dorito
amburgey
skyjump
sterlacci
performics
ciochon
shinwoo
schleimer
boths
endu
cyclicality
uceny
clinks
quadraplegic
splendorous
ledray
frenos
tagaeri
obern
croi
spcb
foodist
stojic
outspend
minzoni
juliets
cdpc
tomashova
diaristic
ipledge
unmarriageable
crisford
lifecell
venexiana
reans
durdu
rynkiewicz
fjerstad
kempshall
canana
batuman
kareema
stickgold
saime
virtusa
streetfighting
agentry
inexactly
cces
roust
bonachela
secateurs
pezeshkian
nobble
senggigi
sinet
alimentaria
slouches
euphemized
valez
rbbb
mugabo
shipholding
abouet
jpma
heker
lapidation
flugge
glaverbel
unrevealing
fiddlewood
steingrimur
bierset
pithoprakta
epalahame
addding
tuveson
dunhams
swix
dhirajlal
minex
maritha
experi
revee
equasion
aliber
gatesgarth
cosmesis
percale
urribarri
menches
partic
vayama
ulvskog
hcan
roselend
korset
benbridge
udderly
ired
sundiver
glyntaff
mosleh
briancon
capetonians
centile
guily
grisetti
strafes
accountings
decompressions
dumphy
tianjian
alingar
striffler
dberr
bishopgate
nordsjaelland
kiumars
sleepyheads
phonecard
skimpily
gelateria
ehg
minshan
cwele
guebert
sgic
deigo
uclu
eures
levchuk
toyako
hummelo
jaspersoft
quereshi
rohim
crocheters
blackburns
excremental
uaar
eaaf
frankies
olfactometer
kassow
sankurathri
wilka
unblurred
numpty
globalgrind
mechaly
blasini
tceq
macnissi
kawabuchi
pdcp
narrowneck
aeropress
timau
veling
zury
styger
lobintsev
sureyya
mmrca
michaeline
claira
genack
setiferum
priyanto
sangeen
brouder
newgale
unordinary
yamar
lattea
muvee
hussies
demarais
cushe
supermajors
manditory
quanity
khrushcheva
houat
danise
clermond
skidbrooke
nosratollah
valey
macdougald
tazmanian
incept
amreen
whobob
ivrs
grabenstein
fibroplasia
bulled
adamsberg
lievsay
newsite
beedis
csip
bicing
driscol
kinbuck
deceivingly
anoai
bicos
soory
paresthetica
xme
holsworth
rattanarithikul
bagdhad
encapsulations
tabbi
kreo
jambiya
flexipop
whatpants
sandlers
watman
uverse
dithyrambe
loughguile
alewa
shfaram
dfac
sanges
hasaka
steenstra
saladworks
aerotaxi
artio
pauntley
fonera
denhart
kleynhans
kenexa
mendelssohnian
oueslati
punishingly
phlx
taurel
belenkaya
phenomina
valorizing
muvi
xtar
shalam
marascia
daleys
splendiferous
sidestroke
pownalborough
clywd
sarries
dywer
roistering
khalikov
nogood
bearlike
frogurt
converstation
gardarsson
lessels
whg
hermoine
ciarelli
ordnances
stockwork
stiffle
wooder
fastforward
charreadas
dpps
bythe
unworkability
synygy
lutnick
gelotophobia
wrightspeed
ollmann
grungier
apiana
thumbwheel
reaganism
tangina
carcas
selfhelp
beltransgaz
tengizchevroil
maniq
mangunkusumo
intervet
vasselon
pelligrini
hunderds
hasbi
lunghua
russello
hywind
cipto
krupnick
eastment
moviehouse
jałowiecki
dezmon
neatherd
faarax
resettles
wispelwey
guantanomo
rodwin
sdam
pemble
karto
leiermann
slec
itched
bhattacharji
pasqualin
harrahs
rejectors
murrihy
mcmurran
vervets
gerloff
araouane
willardson
consu
ningsih
smatresk
ngamotu
froidevaux
muiravonside
thakoon
rumormongering
malletier
undiscounted
pingdom
isnardi
paraxylene
solises
unconferences
ronc
yishay
torrian
mortage
kainth
fortgang
siewers
arpoador
palisson
wilsonianism
arthroscopically
sornette
canicross
motoshima
puerility
textspeak
kheili
kumarasiri
vahland
randomise
margharita
mendiluce
lozo
knohl
theranostics
beshears
egelston
smoochie
vinelandii
ménerbes
ferronetti
amenabar
bioswale
weissler
boecker
novermber
bnabs
razzaque
castlefin
yound
ubani
nythe
mesonychoteuthis
maroga
thiosulfinates
etsuro
counci
brikho
garciamendez
billmeier
jelincic
weemer
boceprevir
koelner
osley
saldivia
osenton
ailis
purches
presorted
forlini
happed
gallagheri
stulz
fattier
yefremova
inovia
benha
dunderheads
slusho
tanerau
makelele
khomenei
lingamfelter
wirework
mcleland
chalupas
savins
riccarda
andrysiak
smartarse
swiftnet
jensenius
thierrée
ghassem
mercatali
phlebotomists
nightscapes
lsvd
linsday
curtsied
zollars
galliers
leonardis
ragpickers
chivery
quisisana
csob
ohab
swrda
conisborough
loadshedding
rattee
wdig
sitompul
krinsk
francaix
romstad
catholism
restrainer
kajillion
pnhp
jiacheng
lexmond
unguessable
strpce
smotrich
adraskan
multivendor
garlinghouse
greengross
duesenbergs
rowehl
nimsoft
pornanong
parajuli
eskow
bookshare
itapagipe
treacheries
belhocine
aseptically
inoffensively
wippert
netview
vscc
annoymous
tuffrey
felan
nirwan
missippi
fougeres
bubblewrap
imitatively
krystine
vipps
asphalts
braynon
copithorne
fearin
sogaard
notecard
sauntered
bergeres
luxuriousness
pickpocketed
bustiers
rakitskiy
ebrill
unairworthy
auroch
sinkinson
zaiger
mojados
gimlets
maurita
elsag
enmeshment
homogenously
unharnessed
harpertown
trakys
diraimondo
jepkorir
felicito
sherril
burguiere
kdom
constituion
batphone
zwillinger
streetsblog
broccolini
ndlea
picures
netminders
nmtv
pamplet
rankest
blackerby
paramotoring
bendersky
lashwood
wickerham
speiss
schoepges
emdeon
amau
safiullah
boundlessness
zürs
priefer
interbranch
bicuspids
feigley
hennicot
healtheon
akorn
ogunnaike
placanica
orexis
downlights
diffuseness
bossaball
overbilled
gryta
elborough
lemvo
tsedaka
asbat
ebookstore
greaseproof
jtdm
derrynoose
shyra
plotnik
cannato
cichan
debarring
gangchuan
kvor
chiyome
swaray
rpra
kreisman
tamboli
calfo
karstadtquelle
skvortsova
lizeroux
milhau
bingtuan
spti
readathon
upsurges
jdz
vormann
kankas
taishun
mofos
clowance
quizzically
officejet
yatama
garold
oogieloves
inseminations
idmc
bordens
likhtarovich
zabinski
izal
pogorelich
sovietsky
oosh
kopetz
ormsgill
contintental
opy
phse
maletta
meirav
ntibantunganya
manolos
bushee
unfi
rossides
moag
shurfine
bellison
extraconstitutional
carrys
bejel
leaseable
visitengland
braillard
hearten
clodhoppers
nyotaimori
bejun
zeltsman
haroutunian
beardon
lonni
glomming
gowler
haghighatjoo
unreligious
leweck
saines
laplanders
sweetgreen
steepletop
tilelli
lencioni
chereau
ivanyi
metrosexuals
connswater
whatsisname
winkenwerder
selna
pitner
divisively
matutes
dikun
collossal
trounces
devillez
coedydd
senseable
applehead
stoneriver
beuret
montalvan
prelaw
lices
kiwanda
fickes
naputi
maif
caddonfoot
mostoles
huesmann
prammanasudh
hakuo
pomades
karmarama
ampac
bellisle
avidia
ticktock
becauase
heldenbergh
vereadores
culvahouse
galashki
middleberg
acquafredda
mendibil
falus
guenthner
starone
olner
tecc
gavlak
bridas
iteris
barzi
landlubbers
sunnegga
mclear
cartrefi
sivaraja
bowhunters
hongtong
glamorising
breteau
faygate
unreimbursed
jiggins
leucosis
perrotti
infovision
bowis
deanell
disenchanting
kostich
kaibil
skinniest
iaec
laserium
valukas
protien
extremeties
breandan
tvrs
rozhetskin
flybar
chaoin
nonuse
smichov
peljesac
garvins
fahmideh
princesshay
braider
fesitval
ascerbic
kruschke
percodan
haerter
butkovich
rhapsodizing
icfj
miraikan
feltes
tohyama
tatweer
bobsleigher
midcareer
circunstances
maynot
okky
kranish
shufelt
crisscrosses
gillislee
skelwith
flowy
wairimu
raspier
gofish
huasheng
muscardini
swinerton
nekritz
ocariz
muhlstein
nofsinger
khaaliq
lidbury
bellyaching
cordts
devron
kazulin
unfathomed
magdeline
sohaila
coppitt
fahrenhype
gbgb
bangkokians
wilches
skripka
benter
bureiko
pikin
explica
guglani
fimalac
castrozza
oculd
birlings
recept
magliaro
anbd
prous
agroscope
seromba
damante
ritan
pinderfields
unresisting
beyondtrust
gavalda
burkha
njoy
microgrammes
airplus
billinger
talabi
berechurch
swamphens
hatwell
chondroitinase
resposible
schweizerhof
bialis
brainier
maressa
candymaker
trêpa
baloji
vibroseis
trajes
jurietti
angioplasties
dehumanise
cubacel
ditommaso
ibers
shosanna
inverkeilor
dischargeable
bcfa
kipkosgei
gaviglio
tarpenning
uncrossable
bambery
teah
dornfeld
weusi
bernacki
wyffels
bludau
roboworld
hotplates
emilienne
bloodedly
tewelde
corle
pagitt
kirsanow
jamillah
fransi
madr
patrixbourne
multiemployer
guerci
micozzi
kolomoisky
canoodling
orrstown
jasani
pogosian
bety
cieri
boozers
tallackson
magicked
bazzell
sadgrove
chrisitan
skydrol
thickthorn
nhema
perh
cokal
sharnee
katsande
ajinca
potinière
dihydroergotamine
baconnaise
llwynhendy
doublechecking
caree
inrena
cancell
boshu
kushel
adali
ratanpuri
backwood
bourgass
glandford
worldteach
geminoid
rukmana
sabritas
theives
andizhan
campailla
khalilou
polce
luchese
duljaj
ikal
kratschmer
yoculan
dictats
nooy
lerouge
fraddon
barmitzvah
corruptors
pashby
ducket
lashgar
aleppan
hanovers
znbc
hooh
exculpating
steadward
skout
mondex
cpnt
bouncier
cabp
ineich
bargirls
uncooperativeness
oppositon
cocksworth
ywcas
hongling
chesterfields
hadee
talauega
sennels
mccoshen
wildlifedirect
quaked
snowslide
greaux
iula
niceic
abdow
ortegas
drouant
ccpoa
wittstein
jetskiing
voteing
currenlty
broide
heteroduplexes
sandefer
fenglei
domainers
thoumire
elss
voluntourism
tefera
unamusing
dibai
rioult
kainerugaba
scroggin
suppurating
moraima
mifamurtide
aldaniti
wtrg
tresize
guilvinec
marvine
igeneration
langsamer
redecorations
opinionating
xiumei
burgar
kazatchkine
hartas
dropsondes
liquidnet
mckensie
vivette
suplement
canavosio
samcef
geriatricians
romneycare
jonesing
pheaa
desvenlafaxine
maraahel
portmans
kaikkonen
devico
mavromatis
posesses
murdochs
sloans
serenbe
stolidly
dorato
micromania
prosinski
sharify
mcjames
iacovelli
costeira
muasher
gervay
isenhower
fieldman
florencecourt
smithgroup
forness
korkidas
lövin
haddacks
taei
rummana
gwanzura
tannert
jamille
gobb
abbu
benbaun
nvic
tcan
karlgaard
alexine
fakudze
hipbone
ppca
beaurocrats
restovich
manorexia
shunner
ulacia
yatch
torwoodlee
seminomas
samso
spunkmeyer
enersys
whiplike
responsed
kador
manuwai
sajeeb
ayiiia
stichter
quintiq
mastrov
brûlées
labron
kiljunen
degarelix
oaker
jancevski
rykestrasse
welioya
vigiani
hafsat
recongnized
grosnez
poshest
soldz
pulrose
steamfitter
assuncao
terpin
norgard
yanhuang
upgradeability
greenhaus
multipack
towerbrook
erdimi
seemore
statehouses
freefest
muscato
olrig
westernbank
divello
quartermile
kosner
thodey
integras
spaciously
maubisse
dashingly
gristedes
zaretskys
denihan
conquerers
altaira
sporns
etsuo
daydreamers
scheidemantel
abeed
bldp
creutzfeld
aodhan
malchi
yolky
ecornell
amser
trabolgan
guggenmos
haerizadeh
jugraj
xiaoshu
mcseveney
hawksford
punkier
diclemente
passangers
adaptogen
shorabak
anderby
strugglin
teradici
kazakstan
guoxiang
zvara
macloughlin
tsaf
despotisms
bucketing
battleaxes
nerveless
valfierno
jianbo
deciples
highpointers
indjai
salvadorians
garmirian
jaaa
cannibalizes
chaffed
romauld
boutonniere
trwy
ashwagandha
fosis
zvai
uygurs
metec
dillenberger
chizhova
gostomski
swiftboat
chout
genyen
paidos
polybona
ahmedullah
lakafia
forsen
amerithrax
zandio
eramet
gutschow
definitiveness
borque
kokee
tankink
freedive
qarar
snurfer
neutralizers
mazorra
naglaa
molissa
adek
gyari
jolomo
rindi
skivington
enomaly
abdinasir
wiva
rylko
faliva
supprised
redant
rifugi
plopper
stoicescu
sportsticker
latice
vitens
rowhani
dorce
subfusc
naviance
sinkfield
tolerence
abbeywood
maslon
telam
gnudi
sharie
mapd
madere
carpetbagging
kinesiologists
yaritza
favourability
jundee
abbeygate
gongga
barofsky
hakakian
taroni
fufilling
ritholtz
cwyfan
atonalism
mexicos
promenaded
llanfwrog
luthardt
théret
nixonian
paninis
calld
phocine
synchronicities
raciness
digiulio
mallaya
lizin
jamika
duschenes
nepstar
infantilized
witricity
taquito
zhenxiang
makete
tsuyako
glofs
ettien
lowside
neskowin
gutterball
afectados
multiparameter
rambly
marcetta
frissons
prugova
nutriset
treier
twitterati
amtek
florigen
toastie
kasmiri
hesch
fatass
dhiya
ruedy
ulgen
subero
hasun
foecke
saumlaki
uncarbonated
swaffield
simhan
newsum
schellhardt
uncap
wajeha
marchioni
crigger
hoofprint
mosside
bernandino
kesici
resculpted
chattergoon
unsheathing
galachipa
tengco
psco
groer
villify
foxell
xiaoke
tinkly
lwala
kontogiannis
hoddeson
devonshires
nonsupport
viceconte
ugoh
scaqmd
zwiener
hahahahahaha
delger
kanaykin
renqing
componentized
ganswindt
gurpal
kurer
sinosteel
savta
lezli
banderillero
tilki
madelena
eaws
buriak
keehn
gogulski
tormenters
ittoop
devro
northeasterners
throbbed
nysp
yakoob
uindy
gantman
latorraca
lefkos
damapong
shyann
repsect
barmal
susko
publishe
lebedko
guettler
cloughmills
springbourne
impellizzeri
valdovinos
zda
wassman
elesewhere
brunching
ibutilide
benuzzi
senoussi
naproxcinod
breakfasting
tstr
killone
soneira
melandra
nieberg
romashina
bastel
pálmadóttir
dédée
caveri
mastiha
violaters
steinhilber
fownhope
fona
crowleys
gurbuz
gougers
drewnowski
pennfuture
medc
boniver
ivanans
crampsey
pontrhydfendigaid
semerari
profounder
mandelblit
austrie
headlocks
kalustyan
celebair
kondewa
nadezhdin
feminazis
shabbiness
dreamier
mejdi
hefcw
gocek
patrina
mesalands
workweeks
cheaptickets
eikenberg
gilton
maveron
resourses
fesperman
gladdening
yevtushenkov
rochdi
horndog
samau
underpromoted
rotini
meadowfield
nokdim
chgs
lüttge
theend
fahdawi
wwpc
unstylish
boneau
rpet
vollman
udda
bureaucratised
pytor
kobna
terrority
woooo
addeo
akitsugu
kuchwara
departmentalized
synageva
estess
balstrode
harmann
graftech
afdm
reordained
dwomoh
wgii
wadiya
aldisert
aringo
bodyboards
matsinhe
pnrc
chestfield
hypersexualized
puckers
lodgenet
hittleman
bogied
hottes
rishell
ambela
prejudges
tangguh
pailleron
mtbc
beltra
mothecombe
filshie
speleonectes
hansruedi
azour
foursquares
cocamidopropyl
automobilist
adolat
giefer
flager
exactingly
rabach
beautyman
wyngarden
vesuviana
direnzo
charmayne
peepo
hartside
tipling
hornblum
brooklynese
pfertzel
platings
brunini
bograd
tanlaw
cerm
anandapuram
grechaninov
kavadis
deadbolts
hunnington
stripteases
iwh
ladbrook
revenuers
taklimakan
volozh
yumari
shaygan
dosidicus
hermansyah
amuay
schneir
caphosol
kamembe
benyettou
lauga
belnavis
mekdad
victoza
airgid
baronness
yonghui
lionheads
denigratory
prognosticate
magira
beome
mcluskey
indignance
sdax
demineralised
boricha
deteriorations
littlemoss
hematopathology
worktable
graveness
chiminea
¸
sivarajah
movietown
coggio
bannier
datascope
trabectedin
humoud
brodian
juling
prevelent
raynell
weijia
volgas
ballabon
babara
tolterodine
veeco
saizen
nafez
avantel
kfcs
slotter
terrian
muxidi
trunfio
digitial
neupogen
ikats
moutaouakil
landmesser
girishk
unsubscribing
motulsky
rakova
hemingson
southbay
steindler
shesol
heminger
individial
bendelack
refolded
shakirova
denegrate
gladen
burntollet
symeonidis
vaxgen
vannan
kiyora
kolambugan
tipples
neverson
herbstritt
jdd
commuity
yance
dragooned
oiliness
exorcize
albertos
letseng
spamer
egesborg
temptresses
greenprint
woolmore
caffine
runnning
roncone
marysa
enman
medex
tbhq
geeben
genotropin
abdelaal
illustrational
polcari
tigs
timbrook
cambanis
rachev
rightous
livadas
whytes
lifechangers
kover
tirumalasetti
trabocchi
derunta
astex
hackshaw
settis
idds
repudiations
durrand
niqabs
seguela
faciliate
antwuan
sudatel
outblaze
shmarya
mechele
sonador
nontransferable
favaretto
aihrc
bowlful
taslimi
geotrax
agencys
bobrun
verasun
acquited
takazawa
sunand
sitoli
periwigs
gourdie
tholan
undoctored
philarmonia
sailes
personalises
lawdragon
kovalski
qjm
ptac
alsalam
korobochka
ceaucescu
lionizes
ramunas
combinado
jaker
gurel
baikalfinansgrup
açai
salutatorians
blouson
pendon
scallywags
wearn
philistin
mcneff
sheaff
mavrud
vernick
owczarek
safah
yanovich
obiageli
bisignani
nerdly
unfasten
semiretirement
ddysgl
lambrinos
lisowice
schwegel
tarmiyah
hambrough
pepcid
sonotone
nehemias
ebuyer
vincci
chupin
noctilux
xiaoqiao
yevsyukov
kood
frostie
heires
counterphobic
castlecaulfield
temped
sadlo
westergard
miodownik
assh
anigo
misdialed
sharenow
brenninkmeyer
orsak
angop
disinvite
xeta
malmierca
ferkauf
chukiat
onochie
seaglider
matambre
ranjeni
filreis
coquese
wsts
wirelesshd
pivotally
optx
woome
sassard
woolfs
protectees
transue
cronic
psephologists
ajuwa
minca
gontineac
uncross
pcam
iggle
sedgeberrow
trovax
malmanche
conod
becel
steingard
freedlander
egomaniacs
whino
meinrath
risling
fasso
weedn
rizieq
rugrat
descalzo
pelindo
kindra
sourcemedia
metamucil
waemu
wtri
toolchains
photocure
clothbound
tatma
solimoes
gargled
burnmoor
pinnix
disagreeably
hamayun
okst
westrock
addd
somavia
identiy
cirrate
rhonheimer
obfuscator
melantha
boorstein
immunet
athanasopoulos
souchet
inplace
shumin
graspable
manikchand
cheywa
uncertificated
feike
webawards
strautmanis
istcs
pelite
shripal
fibrate
counterpointing
bitani
darbys
marlpit
ivuti
trabajan
sednaya
mandelkern
martletwy
ramaroson
clie
ultang
taffa
chigago
dancigers
constitutionals
ytterhorn
placation
unprecedent
coketown
bavani
unshaved
jackline
nver
pudenda
bengalese
gutseriev
garrion
spaven
toplou
arbain
loest
baulin
lionnel
unitil
briginshaw
khilani
kifner
monestary
loaeza
colleage
mwafulirwa
capitala
otherwordly
mahdaoui
daugman
sysomos
beirn
ontrac
cornum
tribeswoman
rightism
villagarcia
bansei
aned
ashaolu
faloria
misspeak
chikhani
tauter
barefooting
sinduhije
deddie
courageousness
yordenis
shenise
branin
arcigay
chervin
desvarieux
fviii
threequel
douglin
barbanell
marshchapel
pratkanis
goldschlager
soffia
shabbes
yossel
berkowski
horizontale
loked
retreive
kalukundi
donvan
salave
sheriffe
inamine
wojas
subparagraphs
dardick
breadmakers
jobbery
anouck
charlety
newp
tawazun
ishwor
hrmph
kirchgasser
neureiter
aprovecho
sular
palimbang
savnik
lepape
edcor
sandella
quicklaunch
melodiousness
tuitele
gurnon
fleetwide
virtualy
cnso
polysyllables
johnners
civilities
dhafir
gelula
prtg
imasuen
immergluck
jasvir
highlevel
sulukule
botcherby
marzoli
desertified
steets
actuant
autie
boisfeuillet
putrefy
fleri
fellings
hofler
mamounia
shribman
fastline
pudeur
azizollah
tallamy
wireweed
olika
sovrano
tightwire
squali
farecompare
fathalla
monoi
pastika
juddery
krylon
melanne
baggier
sirul
vartabedian
gontarek
brancy
poehlman
ketia
weisblum
cadger
parping
wusses
colliver
candoco
praedium
stobswell
readopt
adoringly
staniszewska
liyel
eidu
bouroullec
gotsch
karoun
oppurtunities
bittenbender
padf
kienitz
joyandet
sirieix
coue
prerace
insightfulness
gammans
matchy
evertonians
doggo
shafiqur
antipiracy
thorstenson
siadatan
gradiska
winsham
hamori
cwmdonkin
heinla
switalski
ccmt
tsoukas
corkish
kilshaw
forthriver
nightwatchmen
malnik
editorializes
whittakers
lizewski
vollis
gohari
skarzynski
bruell
starsia
teeder
morlands
northsiders
whettam
kihuen
garlaschelli
renel
knoops
sulfosuccinate
rowleys
stefane
eyong
gauselmann
demsky
bonaroo
luzius
seanor
linendoll
fatica
titters
arrythmias
dinedor
carapelli
tinterow
evem
ortolans
parhamovich
silfra
uxue
kringelbach
juandre
panamerica
doblo
temperence
yawovi
damjanov
penkov
tibballs
yukes
icall
hosseinian
ymddiriedolaeth
intercall
dancap
hafnarfjördur
tinaco
araqi
greengrocery
belam
scarpi
thierer
lactations
spocks
amarena
ponemon
knchr
specced
densify
knipl
teuge
bejam
jospeh
wasem
bodyguarding
gamelink
bocm
kyaukkyi
xueying
gwatney
kaopectate
daymer
redscout
denbow
roadsport
litlle
macallen
seremet
schruers
abriachan
mayanmar
imfc
mozartean
lunkhead
threadsnake
brobst
mahaday
masunda
kerbed
tranel
gooper
haxhia
epower
atchinson
discusssions
garsson
ysbryd
muscian
traitorously
techmeme
southerness
gaches
traditonal
cerisier
renza
savories
rhaglenni
jefrey
karraker
chiuariu
beaulah
ndiritu
geocoins
broerse
fiebiger
karlijn
cyndie
napps
superbitch
berès
hunke
slesin
siree
scard
wohlleben
binladen
vandinho
nset
toberoff
kappelman
iwv
newforge
piekos
beaudreau
molitika
guaranitica
computerise
rehfeld
hungriest
kamathi
strathconon
kerryson
levota
cusses
trius
lokke
smallbusiness
tresaith
hadramut
kosasih
reinbach
gurantee
kittelmann
holbox
shafiqul
germier
pennys
sitomer
guarch
mcgregory
competance
guennol
dislikable
ndamase
yoville
etms
nevine
gaev
laddered
sacrileges
gaehtgens
waybuloo
filete
harrhy
generaux
hydrex
tingya
footlockers
resseguie
truell
joksimovic
inded
raveché
sidya
gónzalez
doury
beavering
bishopmill
hardwearing
mainelli
murigande
hearld
excercising
michigans
batailley
haveland
powerleague
villonodular
hissyfit
diera
leucopenia
nuclearization
imprisioned
thorgeirsson
pifan
malulani
privleges
flyering
nicox
abrahim
unplowed
kazenga
mvezo
turbins
alpr
laroe
technostructure
raptakis
sphenodontian
apsf
monêtier
piferrer
xenotropic
santanam
keertana
roenneberg
bumpiness
furaha
ciana
schuringa
electrofunk
shiroo
unchastened
piolin
availabilty
smithsons
vetters
qiangba
netwars
multilaterals
hiasl
kamaljeet
hankie
thawat
mercel
mogull
fishville
dumitras
bosacki
edgiest
cgtp
fayfi
witchalls
jalazone
proenglish
splain
fraternite
evangelique
ostrower
faeth
auchenblae
crapanzano
kuijper
ehui
vanasek
airbuses
belcrest
kunonga
chizen
bentilee
fellowman
butterstone
ganko
bernadeau
kochis
magimix
glums
mirial
bonjela
ernsberger
austalian
bentzion
purho
okeafor
paller
jgd
kcpl
eroticised
titchner
ensorcelled
remortgaging
glossiness
dopplr
idloes
cuillins
ziada
clubfeet
deeda
weanlings
birdsnest
beamy
edmo
glovework
kwanzas
ketelsen
cnto
allagoa
grimbert
beckson
hladik
tycoch
robl
protaganist
craphonso
ratafia
rezun
roboform
bezy
forcast
itacaré
keirstead
konieczna
teeterboard
tartly
telenews
bandou
ekgs
drummuir
teaboy
flexnet
jingfang
poaches
medpage
misdiagnosing
calazans
reckford
moayeri
nocher
staement
federkeil
dowdney
xinfa
stokdyk
azizian
transandino
serajuddin
superbeing
shumilina
downcity
latourelle
diggerland
deluz
foinse
carenet
reiland
xiaokai
yunosuke
assegaf
renationalised
peschici
lobsterman
magoro
trusonic
menglian
découpage
albertonykus
rovito
shushing
tokia
marite
celerra
kwamena
unicharm
tracheas
cabbot
reitzle
andrucha
salbris
kheny
atomenergoprom
shantallow
unsucessfully
blondies
peiyan
yukihisa
transpartisan
bleachfield
ballcock
piaskowski
shahsavari
chatikavanij
mankoo
tansor
wintrust
ponciau
slipups
hammershoi
mbachu
morosely
nikethamide
reengaging
dodick
aaranyak
frousos
theine
flowsheet
fflint
adnewyddu
gurniak
saxaphone
sleaziest
sibc
appler
introducting
gravestock
emadeddin
squeakers
ricqlès
pomas
neida
gooseflesh
ynysforgan
unconflicted
genebanks
hapus
coatimundi
dolbeer
modelmaker
valrhona
reapportioning
fisseha
ssfa
coogi
ghiaciuc
qatalyst
tripps
kaminetsky
dmitrich
furzey
nebit
lndd
comped
somarriba
pumariega
noncash
nonexecutive
lonstein
froes
thinko
gurwitz
ponderable
whitling
hinphey
joselit
tomana
perfuming
ramsley
detailers
tutan
hairdryers
tatil
muenter
stofile
rocques
accurso
evitable
midfirst
okal
ilheus
sculfor
somashekar
wambugu
bluecher
lyter
woippy
ablin
kubasik
skinnygirl
eweida
shekinna
gluek
junkmail
ubelaker
lezana
gobshite
nibert
davonte
ssan
twinges
corrodi
krapyak
homesdale
shirtfront
cesarsky
appreciator
pitmon
hanggi
lilliam
laffita
kruglyakov
billionare
siroki
barbourula
softwear
xylar
juldeh
retailmenot
thamkrabok
mayola
mesquites
habre
whitewave
pergau
craigielea
yuegang
shalee
finco
broumand
reichek
laeng
zinicola
levadas
lutrus
thighbones
smallridge
spinnrade
vinocur
actovegin
yagil
biocity
determin
adbc
utrinski
aclt
presten
lumin
piomelli
dabbawalas
tamecka
pitavastatin
devinsky
vinehall
meeusen
dyllon
ditmore
roadtrain
amparan
searc
catheterizations
tangutica
somanahalli
chaston
haubert
probosces
schonborn
baysse
ribiero
escuinapa
biostar
blobfish
dhcc
japanes
mapondera
passionato
zeidabadi
dometic
charlsie
arnulfista
unmannered
dongguang
dignifies
sassiness
punnet
lasmo
earaches
couttie
railcare
pauriol
prognosticating
sivola
snamprogetti
tianqing
iadarola
biocryst
obiefule
yankeeography
panitan
akeim
medivation
greatpoint
desselle
yarwun
lertchai
convalesces
timebeing
bashert
veronicastrum
nonrecurring
fyke
gosder
moallim
afemo
ullinish
waszczykowski
huateng
nickleback
swissmedic
cadx
leopolds
dispensa
tivnan
goldplated
bjelajac
perelson
refudiate
douge
mikic
inquisitively
molotlegi
inebriating
getreidegasse
malarone
npbp
woodenly
mertaranta
patrão
abdominus
blazevic
biaudet
koring
oriza
pincham
gowlings
rockbeare
pretlow
flavanoids
barefeet
underdress
llanwddyn
syntroleum
roett
giveing
bianucci
gompo
imerslund
diyers
nongenetic
netstumbler
siloh
newsperson
raymarine
bipper
tvnewser
stubbles
brodner
traxon
teerlink
laticrete
huckfield
shezad
reisenauer
skycraper
kisselev
qcda
aharonoth
stamenkovic
guilian
hwat
groenwald
twiname
parkvale
cohill
bournewood
fitchner
heavenwards
suhada
jeilan
mukasei
leviston
groetzinger
oleuropein
nutsedge
eloqua
nutraloaf
tumim
brucheville
coffel
mocana
synfuel
reimprisoned
vyners
traducing
perpignani
hayekian
omland
whackjob
habur
toloui
priceminister
izzeldin
sakaba
riteway
shabah
stereotaxis
emhs
shilstone
wijesiri
maitham
ircica
gymnopedie
vacantly
aiport
dainese
hrj
dallies
digitaltrends
grandparenting
wellwisher
navegacion
kucharz
vyron
happythankyoumoreplease
panio
freebooting
wheedled
myrobella
recitalists
nusc
shivinder
molestors
addiopizzo
baringer
farboud
ronseal
borbor
dotori
revivify
doualiya
hieftje
kitzbuehel
massaman
fegs
psychs
randjelovic
sodann
semenchuk
shandre
romeny
kneecapped
stepanovic
crisson
hemiunu
ridland
macquin
mousseaux
memorizers
eilerts
hmss
splurging
rojanasunand
burea
halvey
sideswipes
amddiffyn
xocolatl
redpine
valj
suchit
krobot
underley
strieff
lewitinn
lampion
afotec
muhabbet
guatamala
swingy
allenheads
teamquest
eliaquim
bubbleman
delhaye
snipper
bouaouzan
tourister
edls
sedwill
houra
canco
drylaw
dargel
kancharla
deguire
shcherbakova
gattinoni
ntaba
lerab
gehlhausen
critising
poggiolini
seitel
bicyclers
mahdawi
baughen
kosovans
kremlinology
makeen
gepirone
xiangfeng
spermatogonial
pulecio
tesfamariam
milimeters
possessively
electrochem
unlatch
langrell
upperby
reidenberg
wcoh
slabbed
vever
pryd
clezio
laviada
klizan
tinklers
warsofsky
plann
macua
guidos
practially
hesitatingly
ondó
savjani
pomata
whitelake
orbik
fridrikas
scrobbling
asshats
abyssmal
kasikorn
pliev
nkong
milteer
longbotham
isnilon
whets
kushman
gcmmf
folkier
nyiregyhaza
seynaeve
karvy
conclusiveness
applier
firetide
henegan
summarisers
lpai
agwai
tarpishchev
rocketships
nationaly
firouzi
crosets
aquaplaned
paisnel
gurgler
pawelek
usoyan
sequitors
adou
travelpost
mounty
shiman
kosters
pralatrexate
wohlin
bonadies
zoratto
parness
skalak
hojjatoleslam
espanoles
outjumped
majoras
snickered
knipton
ashikawa
kuys
chalfonts
safrin
biesecker
gilbraltar
borgerson
bilion
busetto
ffsg
breakerz
teekanne
ambled
ophelie
niyam
rhios
gaerfyrddin
edmead
citropsis
klores
supplementals
afrikaaner
mukomuko
smeary
triwest
dobriskey
caddish
mahsouli
passalong
theone
inveneo
ululations
marhefka
koyie
gaaa
velvick
churchil
donze
jiranek
papermate
zaftig
margara
oneplace
perfectness
mitvol
vican
horgos
crestar
copays
kierland
revlimid
paivi
schneeweiss
duncon
deinterlace
boniello
wildworks
lymn
jandola
madrileños
chancelor
bushwalk
hunstville
synek
waygal
gekoski
caite
rackliffe
snfs
ukesa
cvijanovich
krivoshapka
hallerman
techtronic
morkov
uscybercom
shirkers
subtantial
suyono
sliproad
yatimov
zhongfu
jamling
saltado
standups
kardamyli
cassiers
biospectrum
anguishing
mawejje
hprp
penjor
smink
iskin
lucich
airier
zelston
multistars
sharawi
cognifit
cyrine
franulovic
geltner
viny
penston
crausby
khrustaleva
zambeef
nehrbass
gierach
ichita
osael
simonoff
colonoscope
kolmakov
fihlani
keekle
futurelab
conneticut
krivokrasov
nzabonimana
confrence
eischeid
duvenage
bayeaux
atieh
brockert
elwi
cirse
jicin
utay
abecasis
verbalise
bernholz
eflow
olvey
kamyab
farella
urmc
stonerside
moxxi
birdcalls
radmanovic
cuttery
pashin
seaburg
liquidy
lahariya
hsyk
nuerburgring
zajecar
ducksworth
nobelists
pudukkotai
xanthelasma
transexuals
withybush
twiddler
eyeshades
nastasha
peringer
ciav
chananya
nsima
internatinal
clomid
beňová
derichebourg
freear
unembedded
oatland
lassandro
andrau
hared
corrieri
gawdat
unilife
kirchman
pfpc
stoye
zhiguang
shieldfield
fallshaw
parod
naehring
panners
indemand
hinduist
melligan
bayik
harpviken
muzsikás
reconceptualize
chefetz
timbertop
dalmations
antiquiet
sinkerballer
seeligson
habimana
trende
talktime
minicars
portacabin
oppresive
logophile
ghorab
sinnington
milmoe
vigee
hagert
salos
qingsong
berkat
staythorpe
biggovernment
jinjin
angen
gravitional
roadwatch
seifi
newswise
oaty
kuupik
throg
riani
sterotypical
prandini
layerings
dishdasha
molpus
hasselknippe
kowarsky
hreidarsson
cabraal
crumbed
karasin
glenwright
aghan
farq
collegenet
nigari
derrill
potheads
palastine
skelemani
elsea
napw
justic
renante
srinivasarao
outbreed
aibileen
huttenlocker
farjo
obtener
githu
asecna
nrsf
biopure
dissolver
fetishizes
ineducable
quaidabad
kakalios
lebua
attackes
hirshon
techspace
kompak
colmado
fanatism
aberrational
hessi
letner
navigenics
suphachai
kokoo
pasborg
colapse
mayby
bobwhites
kasib
brandstaetter
marichu
paycut
automobilwerke
bertinetti
rabbat
remaning
dncl
deplaned
mojib
teasmade
layette
pilleth
accbank
nejib
tatlitug
blann
staiti
scanpix
encumbers
scarberry
shmuck
jugendorchester
bellwin
nonagenarians
autoscope
robinov
weiqing
isramco
lujic
lapot
ryynänen
mateelong
kneecapping
icaronycteris
merald
olian
superpoke
babbled
daciana
windowsmedia
radioclit
cohering
ctca
avoider
reiza
gilsey
malène
financo
bonisseur
flawlessness
galippo
aisenberg
ulcerating
mitrou
furstenburg
pleuropulmonary
disappearence
lefta
glenton
abric
reponded
choppier
stanks
tryba
luboslav
furjan
gladston
intersexuals
meagaidh
soyak
yosses
worring
dund
chamchamal
wwy
intissar
markuszewski
deregulates
aboi
cairndow
zygmantovich
deninger
glasgay
fraiman
metroaccess
thorek
propagandas
autisticus
fowden
kamisugi
safak
anouther
xiaofen
pharmacogenetic
drene
shooshtari
arvella
clydeport
zacatecan
woodstream
freshkills
habacuc
readiest
hawbaker
clingfilm
deglazing
sheikhi
tussie
procomm
delloye
muzdalifah
utilimaster
knar
spcas
tiffanys
fleetwith
kulvinskas
tcdc
fogell
peyi
ostergren
venville
subspecialist
tangjiashan
tejani
komphela
kleier
multishot
krasikov
jokhadze
apria
palamar
jafza
soer
overstimulating
bigbox
gindy
galaviz
jeba
isvaran
krames
grayswood
blumau
ballycraigy
tessimond
aramón
togneri
pottow
stonecarver
embezzlements
cossie
missan
nells
firehoses
enagaged
habichuelas
andrill
grimmette
motevalli
hocktide
coulters
reanalyzing
zoledronic
orha
assts
soothingly
tarting
oddbaby
davonn
marchex
sidak
abbasali
antumbra
pozzoni
ourselfs
lamura
redknap
bemahague
meyran
brichto
atradius
candids
greear
ganjgal
jenan
billado
boondoggles
cotchett
lohuizen
tarentino
arabov
shkelzen
rotflmao
frustated
pendrill
scicolone
resizer
cheeseball
alanssi
giula
immunohematology
kahlah
advamed
phalane
centerplate
broght
engenderhealth
alsingace
hannema
tabrizian
piselli
ojuland
glucophage
shrimpy
bioceramic
kaven
bluwiki
markstrom
urlicht
lavera
hyatte
mizhar
schmoes
dilemas
tantalo
smalto
redkey
gacc
clouden
phuea
schimmerling
solotaroff
amsheet
digal
secdev
allioui
wgw
wörndle
vargus
gragson
reflooded
webman
pogossian
apeal
classey
backrower
maszczyk
uigurs
butkovitz
perscription
bicyclettes
rwas
henebry
reticker
faulkenberry
cristabel
koering
butterbeer
shigekawa
parrita
histopathologist
scholarliness
starcevic
egans
teotitlan
outrated
huruma
gumwood
rysanov
cnaf
henock
cihaner
fedee
crosschecking
deathline
lehtomäki
gershengorn
sketchiness
caseinate
nmx
cougartown
wildern
fermain
zuercher
mapaction
laoting
dybul
fxdd
damiri
summersonic
bruemmer
stringz
walian
ventless
lurita
submarket
samax
permasteelisa
hatchards
contemplatively
billiam
jurovich
cirle
tymms
acerbically
pfbc
caidos
umpp
cammel
pushtuns
saurat
padlocking
trewoon
freewire
agilo
polacek
murate
massilon
wellchild
sulili
legitamacy
backface
ahner
mandaza
zoledronate
tzen
drably
sbrt
oberstolz
dijken
harrasser
principalist
foodservices
acourt
cahnge
mostazafan
dreidels
blacknest
mdca
besanko
sigsgaard
chudzik
ecumenicism
rechargers
dgamer
spaccanapoli
apols
ghoba
bucintoro
drozdoff
literato
eliots
tabbaneh
servanthood
immovably
elsewehere
quinley
gentek
lifenet
koterba
magnitka
gillers
herdy
kostelic
terremark
shanai
getler
reorchestration
neurosearch
wolferman
tunneller
reeh
ungratefully
sanwi
footwells
juicycampus
coultre
muranaga
caminer
groark
sibor
nicaro
baertschi
picaso
dunwood
ramtane
stepakoff
yno
robertsport
serfin
savageness
showkat
jeggings
malalignment
cervelat
nadell
uppance
mulanovich
bucatini
lintula
sterz
rixson
capitalsource
gooya
bendectin
sudbin
postberg
snower
tuiles
ybh
litchard
birchill
cnmv
offir
apoint
amelda
sxephil
pluckers
elijio
newtownbreda
smartened
abic
csfp
gaytri
shammo
millichip
ghazaliya
sakey
hectored
carbonex
wesb
kiton
macala
phahurat
phaophanit
masshealth
fetishised
hexvix
byki
atns
dibert
leewood
mcauliff
bacara
bpsd
mshtml
spasiuk
mothershed
overarched
bowdry
glorya
hokule
mosebacke
repping
fozia
dipsomaniac
lindenstrasse
thuo
rizek
nisanov
isayevich
collinses
conformities
guidroz
pageonce
croud
schellas
tattles
jrj
springle
jeglic
whirlies
pulcrano
mvuemba
inteligentes
fargana
stosic
handballers
washa
quaile
infogroup
calka
benone
sorbetto
seckerson
monogramme
spoletini
ethereally
ltbi
digestives
moygashel
treos
ningling
sheeter
abdulmohsen
jirjis
citizenm
cockburns
talanoa
bakowski
wastebaskets
sigard
renkl
ringgits
intinction
suraiyya
linstrum
mcphun
voorheis
ardah
hafnarfjordur
bulos
schwanke
paparella
meskhetians
blythman
joshipura
euram
sitaula
watchguard
attemping
oreka
bugarach
guirong
doaa
antiblack
acbar
meraas
leisk
biggam
tissainayagam
eurig
henes
strobo
gloopy
bicek
landsliding
acision
fedoke
footjoy
disupted
aberdonians
metarie
malangatana
kurlantzick
ivax
samphel
kilbroney
stah
jiankun
reoli
speechley
lasdon
veerasingham
franczyk
sulick
kvitashvili
dreu
vhg
accomplis
atwick
multireedist
edmont
debak
elevage
dinyar
varik
recenter
seewalchen
quiggins
picketer
pesalai
picaboo
bryd
craftswomen
anessa
chappelow
squirreling
wahlin
loiterer
etnyre
minewolf
guiza
sciencefiction
ketbi
skybar
sestier
welldone
ogtt
mordantly
wordsworthian
etex
mdaa
gsci
dagai
ampaw
nonparticipants
rennaissance
ecosecurities
scoots
jasny
ruzyne
dalgetty
characias
cabraser
coelen
embroils
speece
humanin
polcy
clairee
sughrue
defterios
phising
ghodse
cherkesov
chikankari
cocuzza
repudiatory
yeras
cherryl
rici
immortalising
obesogenic
ladia
dolgachev
bienenstock
cornrich
jekka
massivegood
streetstyle
lipkis
thwacking
kebby
seibald
shsp
thimbletack
cimatron
cashers
iimsam
overprotectiveness
dobrow
compper
dcca
dreibholz
depere
bleepy
lassithi
kraisak
petromin
mazzagatti
sember
oversimplistic
virture
zarir
commandite
mitsushige
forgoten
tihana
niada
ferentinos
loadholt
etoricoxib
sholle
laciga
ostovar
blask
stachyose
rumbaut
ritsuo
creger
tregor
zaidon
aragoncillo
scammy
euromediterranean
shionoya
contillo
sugahara
helsingor
woldemichael
hattons
troutwine
revia
kafle
brouilly
fohn
jevtic
crav
spungin
perotta
sundresses
gellin
dondurma
felippa
clinchers
fluidigm
madrs
shuanggang
farnood
kippenberg
nifaz
baglow
abdullin
speroff
journyx
servini
legside
genan
bresgen
gershoff
corncrakes
slavemasters
kittell
bersey
ameriks
samour
ukerc
mullinax
yazdanian
schoerner
solot
cranachan
wayuunaiki
kunzmann
melvich
grubel
maximilianstrasse
arrise
aubins
bondgate
waseley
tarby
briquets
kulovits
epad
kielar
solitariness
bedke
krevsun
hirschowitz
illner
opher
froths
canfin
hilford
offcut
tolia
gokool
enthrallment
buhriz
woehr
nmhc
platzl
mujahideens
aroop
rettaliata
phcc
polititians
grajewski
codders
gharraf
firescope
casarella
masnata
jeevanjee
searingtown
chiacchia
wabho
skateboarded
whitens
lakeforest
inclduing
zipora
qiyao
remortgaged
weisswurst
maingot
aquazone
pranced
anmin
kingibe
rmic
tealight
oustide
coggon
hazardously
prockter
hnwi
vesteys
taxin
cridersville
majungatholus
dooky
electros
cjones
cenedella
capellio
mchoul
rugare
lowermybills
craigcrook
powernext
francsico
stonely
shantham
vanderwall
absolem
temming
cubbin
vendeen
vitousek
pirih
wangai
alanne
chengping
arithmatic
cenic
nitb
fructis
meyerrose
pagentry
rosmira
bejing
menuires
thomma
echem
spitbank
sesheshet
damascenes
sternheimer
andjelkovic
shieber
korniloff
rasped
sattelberger
sandtrap
idataplex
wideroe
bridgforth
ossete
portloe
bwrdd
whelping
russan
tascon
mioduszewski
smulian
bumann
tormenter
chwe
midle
gelperin
bordage
neurohormonal
fieldrunners
phenergan
usgi
terrantez
mallegni
pourable
hutabarat
overcook
pecheur
decidió
hoft
doomers
ushikubo
destigmatize
cikobia
lianhai
externalised
pojama
cacee
wimmers
hrft
tyber
contructed
ryndam
conocophilips
jarmers
concon
baich
hjordis
ethopia
delphiniums
lsuhsc
bromm
talpatti
tysabri
schreider
bigchampagne
recomends
regilded
arcusa
allusively
sainey
weetos
emken
soghomonian
cartelli
exmple
logline
underarmed
wurmfeld
belgrader
dawut
dadnapped
menschel
scrummager
rubigen
geremie
bonks
pantukan
skalnik
sprinkel
simington
halyk
colwin
kunzi
salomi
lyalin
schlutt
folkenflik
munts
twilighters
craptacular
assasins
majnoun
brear
humpage
gangmaster
fidayeen
antiglobalization
pecor
delligatti
swingball
clocaenog
ramatam
sporthotel
argn
touchmark
kinglsey
yewdall
funpark
pointin
munkhbat
tianamen
kaysi
pinkstone
dashtop
shoulds
flello
ulleren
kanipe
iqlim
asnc
brikowski
butani
beilock
rennselaer
oenological
fauxhawk
parrotting
altinkum
rechnitzer
blahs
lukach
manvinder
antoher
capricolum
xianglu
otila
spinghar
gyppo
apdal
tother
herzel
fapl
bochniarz
gudjohnsen
tranmer
keyholders
bacarri
athenes
jinked
therafter
manglik
peader
perons
ecobot
pasinato
edenred
schodorf
siestas
artfl
vanfleet
wiseass
weckherlin
peaceplayers
medborgarplatsen
kosovich
sonenshine
liftings
wirginia
gruzen
uniban
wordworth
zomegnan
inflationist
tyna
polycephaly
hejlik
jeeja
grindingly
gerecke
collectionneuse
shiya
mortarman
matzka
mekongo
njawé
mcns
pillarbox
numbnuts
ushe
farnhill
overacts
zerihun
reichwald
korins
archimboldi
kalamian
yoky
colberts
panayides
hildalgo
isof
roundtown
debiopharm
rabins
kaskenmoor
patapons
coretech
fretton
eidarous
birchenall
knoechlein
uninterest
eusr
daliburgh
barberries
okole
sneum
otremba
balmuth
beleaguer
montbleu
homebrews
slavena
wbes
unsettlingly
bohinc
circomedia
sageman
autoglym
kappy
freegold
jarstein
freindship
jolette
heldrich
zaffarano
amadito
weibe
gü
djezzy
burlesquing
sauds
previs
kamolvisit
hovington
aahperd
hurewitz
travelle
lifebridge
fobt
sarahyba
xiangzhong
uyanga
barnevik
shadiness
whitmeyer
flyposting
koram
stickels
dehumidifying
churchkey
whitwam
kevrekidis
smokiness
pleace
tacketts
firstservice
tevaga
failsafes
pokrovskiy
smaby
crowdspring
poofters
plonka
vawdrey
supposely
totalfinaelf
somato
bobotie
schnurbein
dickoh
summeren
shuffield
sayee
barud
welchez
writeback
nikozi
meheux
labcoat
gulei
heards
ballhandling
brisiel
gybing
dhifallah
altenwerder
fazeelat
capotorto
youthworks
relkin
balthrop
rachwalski
tiani
daner
senmaya
gimv
resiting
djana
lieden
stabiles
woge
matland
mukoko
flewis
westgroup
andronikou
housesitting
muzzamil
sabhan
dannemiller
ancientness
suora
salvagni
hydrocephalic
spitballing
decribing
imbe
unhealthiest
muppalla
colyandro
joshing
headstamps
strook
lavenda
dimofte
lucken
pollner
oldstone
cmvm
oniel
kirkmuirhill
cutkelvin
pmetb
goleen
outlaid
ostagar
bilbow
maclennans
jinjie
insistences
colaton
mingjie
fatuously
enodis
cutifani
tampep
sansing
muneta
rudovsky
yanique
picenze
demya
teragram
candidancy
izale
troise
belhouchet
formella
mulligatawny
kyzylkum
halac
cerbalus
pogoing
autohaus
emblematically
janeczko
newsgator
sousanis
follas
toefield
waldrist
stancombe
chiffchaffs
strb
fügen
llado
andalasia
nienhaus
dayquil
cpal
pensylvania
balasz
gardent
globaltel
weisenthal
kipkemoi
baconator
ramuan
questin
suao
whorish
honeybaked
wikinson
cardhu
siochana
tsuyuki
samulski
zegar
troncones
cafero
yalie
betak
gastos
jiulin
kiosko
coltec
domit
ghadban
bowdlerisation
makutu
reintensify
patronises
tavullia
ashame
tresselt
eammon
unfortuntely
spinoso
horniness
tricep
leinenweber
laible
koepplin
hypersensitivities
zhengyue
rinet
picochip
relights
kulchak
naubinway
needletrades
disca
zakiyyah
jumby
ammends
bartholomaus
pelambres
qirbi
baymont
entonox
drinmore
masterbatches
lissome
stemlike
cosner
suhua
topcoats
eclypse
moppets
livian
knackwurst
oosterhouse
hmip
ffel
balestrino
vistan
teragrams
defrasne
ovariectomized
immergut
longwe
breastfeeds
amsted
fromson
superdawg
skapp
brochettes
chantix
logicacmg
calamos
grassfire
decathalon
chivalrously
aquitted
benzdorp
cognard
ydy
battleboro
groundlessness
hanane
savasana
canteloup
dalworthington
lubet
miharja
offler
cotif
dilullo
bookatz
kournas
kolirin
banita
francess
tarani
blaxell
janger
cloudsplitter
paintworks
mesocorticolimbic
kopta
honghua
nhengu
dansko
sotelco
turnill
hundres
miloševic
mestawet
lekka
waisea
exasperatedly
gollaher
dircm
iqua
eschenfelder
anythig
houwen
manslaughters
dormy
golbin
romeva
sotolongo
falstone
gethyn
zarabozo
vorsah
unrigged
ovadya
eurofighters
khames
stonebrook
buzenberg
logrippo
fiom
kuchner
duerer
celotto
scorebox
baggin
carlstroem
cotrone
ramchander
hezuo
matloob
sware
musicianly
granjeno
muneeza
broadridge
bodycheck
taunter
scrammed
ponterwyd
wickramasuriya
somnambulistic
ivanic
horkos
satia
replacment
cleversafe
cybersoft
ipri
epischura
serrand
keppra
outbraked
emblemhealth
soundworld
carcary
fardh
ryelands
nadcp
groused
believeing
windland
becamse
busansky
meredov
barbancourt
nevsehir
dustjackets
bennette
bourgeoning
litchborough
clonroche
meneguzzo
dannheisser
outr
bancvue
vertiginously
neotame
hussing
quiltmakers
ndayisenga
pennyburn
jablokov
wabano
mantuano
barbarically
ianello
rondin
chanukkah
pinborg
wyludda
himler
buoying
sundeen
branom
kapalka
schupbach
chebli
mishear
bichlbaum
roadpeace
sheinwald
saidiya
kesseler
marybelle
accessibilty
pillowy
futurefarmers
bernadin
belleroche
converj
janakiram
palinka
chinasa
edelheit
centerbeam
rickabaugh
aziga
marouelli
pariya
wildhack
tazhin
xiangming
kenehan
extrordinary
appleseeds
baglihar
crampy
brazaitis
bamff
beurle
propagandise
idenity
caulley
vodicka
lionelli
freakouts
pidgeley
heimuli
sorouh
currensee
capolupo
everiss
clickability
sombogaart
stilleto
mcgranaghan
moneeb
jerbi
futurex
perversities
kenville
firts
unkissed
gfms
forelocks
ctdi
frunza
sachtjen
waringin
superveloce
buglisi
gilian
casetti
suporting
huettner
shiptonthorpe
shatby
mckessie
ipdr
wrongheadedness
yemo
bensing
bodypump
goaler
sgas
antiauthoritarian
megilot
protegees
saich
tiemeyer
quagliata
mrowiec
cruikshanks
foregin
scheinmann
leisureplex
stegomyia
sandata
fezzani
opalinska
rynell
blackband
sihine
mcilorum
olushola
notrees
eiddo
helvetas
merdle
malvinder
shagang
marianito
llanelltyd
stamou
osnabrock
ceruzzi
baciro
ddinas
zerok
penaloza
tuirc
lipumba
figher
penalites
steamrollered
andijanis
schnelldorfer
castara
chirikure
backscratcher
brintons
gurp
commonside
rhoad
polkey
noall
witin
catoe
toay
balletomane
micromet
jamaldin
hythiam
petterssen
kotsiopoulos
hydroplaned
earmuff
noncompete
skimlinks
overemotional
sourani
estevao
texim
octabde
ouani
chudnofsky
kamiti
medela
rahmans
yahama
isaya
goethel
webwatch
resurges
tausif
nexplanon
cdrh
trecco
wdav
foxbar
tudweiliog
wayport
thoden
documenters
ballee
gurnell
unskippable
somberness
aleppian
ehrsam
pacleansweep
fesikov
espiner
barranger
voorwerp
bidwells
zester
stigsson
kinlow
sungwoo
bonsey
senfronia
reunifications
liswood
belorusian
lukovic
simorangkir
eways
dcsa
supernap
sevice
grybauskaite
ezenia
lidel
kastigar
marmotte
colberti
westhusing
noseclip
frimmel
ossobuco
bonders
wildblood
harithiya
viners
farouki
joshie
rixin
supermom
surgutneftegas
jhuapl
constructeur
pocahantas
goate
hites
bartestree
vermonster
ficheras
slowish
cudas
atlasphere
aggressed
jeanswear
carafes
adventurists
ortenzio
nides
disabatino
barcola
jobcentres
soapmaker
glattfelder
maaruf
titulos
evelia
komlosy
tundavala
frederking
blether
intravesical
pasik
enlli
bealle
uscm
manamela
hayslett
barbourne
serenella
unkles
mckool
numberland
attcks
goldstrom
amerigas
porgras
primp
oncofertility
cavas
goolrick
piedfort
riklin
levanzo
bouchnak
opton
moneyman
riboulet
osmans
jebson
ferwig
donatucci
bennun
labrooy
kozeny
elisapie
friere
ekejiuba
edmondsley
tayon
birnstingl
allocco
lalaland
corrupters
ipsco
narcoa
centrella
wohn
grippingly
rashbaum
ipti
jawboning
hicktown
turangzai
plecnik
campanulatus
iovino
crueller
sichi
mikalauskas
jobholders
aidesep
hydrocracker
bliny
starmine
ecolife
jarina
nicodemos
berlinia
shadowserver
methenolone
regene
deterent
cambrex
trengwainton
mescalito
certolizumab
leisinger
commes
huajian
muazu
inadvisability
ingored
pontificated
contigent
okuribito
taglianetti
sinemet
devard
infared
mavroleon
oleos
morvillo
bovanenkovo
namey
whon
margoyles
saladrigas
zizzle
repecharge
respectible
yemenidjian
rny
vystar
pandolph
nonpregnant
pattama
furillen
negari
contemporize
habineza
ganzorig
billerey
schörling
ellams
butik
quaility
shamefaced
vytex
mondli
zhaparov
behove
powersharing
vigeant
grancabrio
lonseny
adroll
peschek
skorupa
scalisi
revivers
emenalo
eisermann
broderson
musketier
cmca
condesending
bullmann
armelin
bingzhang
askett
progovernment
engleby
osculation
bablitch
casuality
corera
waddleton
zubrow
cuoio
saccos
horseplayer
postrock
gmoser
tipline
millionnaires
indrawan
shengjun
mobocracy
qinjian
tdindustries
aberation
haughian
otim
kingsdon
ndiema
valdebebas
vdara
visualon
shaid
jumunjin
heverly
ortloff
zubaz
urkullu
yankeeland
bakey
iaastd
jamake
evoh
midsections
jhonattan
raffield
infinately
glidescope
espaliers
crossharbor
dumbells
goldstrike
iddle
veddhas
brazer
sullington
santh
colonography
umdloti
physcial
providência
chokshi
madeo
ginster
freysinger
fedaia
studentuniverse
guohong
sadigursky
opland
bioinitiative
ehtisham
misapprehended
corsas
koplan
kayihura
suhay
aleris
bullwhips
rockwaller
ueapme
unitards
millenniumit
crunchgear
hormazd
biomarine
biscan
voalte
meikleour
kingsmills
barafundle
carestream
nbpts
friarton
palsied
yechiam
guastaferro
querulousness
payline
araghi
bellwethers
fulchino
delasau
sehring
gianaris
absssi
untypically
bisek
breedt
pittin
umda
ashurov
zeldes
bennick
penasquitos
panjiayuan
afirma
tshibanda
superfreakonomics
mombourquette
tossups
conversly
pechorsk
gluco
outbacks
norstrom
witheld
shawnie
bodyslammed
chunjiang
luftman
headcases
riverhouse
tslp
norlington
fuggi
washday
wegs
suvereto
frempong
dollys
bumperstickers
nutritionism
forouhi
ferwerda
bliar
brownlees
invega
wisgerhof
iacobelli
cheesey
llegan
shoket
esnard
plunks
kiarie
sechan
beltian
cosseted
asyla
ashimolowo
decoratum
marsdens
eldean
weldin
dabate
pagb
ferziger
rotundity
francl
tucan
mayreau
fontdevila
hollendorfer
uckington
mofilm
psme
mulee
ballyholland
autotrack
exquisiteness
reprecussions
vringo
trabulsi
dipsticks
visitdenmark
oreshkin
whiffing
nostalgics
flugsicherung
nondisruptive
unsayable
ursprache
rihn
konlive
mocad
kupchan
rosgosstrakh
juron
straussians
pleite
reax
philidelphia
surui
whaanga
moumneh
dieted
klouda
chinner
paraglide
proshare
bistricer
autier
numbe
vojnovic
avecia
steelhammer
wonderlust
disproportions
nagg
swifties
hissom
voulgarakis
skronk
dspca
kulala
szájer
golts
jailtime
flukey
greige
tomalley
multisystemic
citerne
farez
fishof
jeziorski
arcelay
creekstone
rightscale
haixing
petrack
australasians
katiya
limitlessly
aposhian
rezulin
quixotically
tolmoff
wapixana
sypolt
regionalize
pitchbook
switkes
folderol
travisano
thaty
pruzan
noritsu
myrichuncle
kheraj
tanyong
gerwel
villedrouin
kaminskas
justins
hammarstrom
kountouris
hansman
dinsha
wadesmill
realisms
moessner
gladestry
myair
wijdan
tweb
deat
karoutchi
kindlier
grotech
jabon
perrig
unoaked
droutsas
greensomes
laubman
shuteye
disport
delerme
jacomet
mountbattens
areli
prisum
quadruply
leapfish
passerbys
koranda
implemention
nanuq
boulters
gohara
chatzi
mecar
cmsb
mladjan
sannae
squillion
streitberger
imobile
kersels
gulliford
capellaro
obession
conygar
schummer
omotayo
caseware
recognisance
darwars
belluomini
carstanjen
sfiha
leyrer
noaimi
shiffer
mihelich
gaddie
acritas
peones
hattenberger
liepins
paraic
buyat
liederman
iskysoft
courteousness
yontz
botnar
kulur
strok
anapu
banksters
alynda
postively
monsall
swires
lahem
innata
lambroza
caraccio
stadiem
gavanon
cundo
beukelaer
jarosik
lumena
assumtion
canditates
ronza
trakatellis
outdrawing
tranferred
porzecanski
mandaloniz
larzelere
lucknam
loudcloud
valtrex
ashif
anasi
hupy
unhappier
measor
shouldve
equiptment
lomia
harbus
exigences
proshansky
myhome
bakhat
murrayshall
libidos
jennine
lammin
schnellhardt
lapcat
sheku
prehuman
handsaws
serotta
zetti
zaitoun
blvds
tardec
hkja
bathersby
shimell
bugat
senderens
liechtensteinische
trengrove
batawa
varois
fourquet
overpressured
fairooz
wildberry
zhevnov
mitsy
wefi
selham
caliskan
porbeagles
yabuli
stjernstedt
sandylands
rainshower
duesing
oksala
veingrad
muhairi
lamonaca
lakbima
shenon
miaskowski
bullshitter
baqouba
gestrinone
telaprevir
johnpaul
iffs
teenscreen
maponya
ened
inveigh
licuanan
mohommed
passell
doitt
alliums
unattractively
quixall
quaero
unrideable
zava
youlia
tsylinskaya
absorbine
arminas
farkhar
hebditch
krissie
tayvallich
disinterments
childern
labovitz
bizilj
krokodiloes
dumpsites
minnion
sazzad
stadtlander
drawsheet
kranjcar
ikle
gayetty
henshaws
timelier
humaya
mehyar
hdq
agnolotti
mustafah
ennahar
etbe
nouzha
defoor
sbec
grossetête
swimmable
barshay
valasek
brestyan
materialscience
reprocesses
goldway
allenspach
abhin
slutz
holmeses
fattan
banman
alonissos
lundt
houver
aversano
cyffredin
hameline
kibali
tradus
wilenchik
moragas
ruperts
paliotta
weaire
aviate
bahima
respirology
bakheet
giegold
arrondisement
proteon
jardini
biedron
cvsa
guyader
electroencephalograms
retools
prounced
almacen
avoide
ryefield
abdulmajeed
dissinger
saptakoshi
wimbledons
furure
bellchambers
firestones
atwar
deragon
rsdl
spottily
surmont
elkstone
poldrack
jeanney
pgmp
towerjazz
scamorza
exluding
onanuga
schennikov
selectboard
sibisi
camapaign
hargray
datapipe
tapson
jadan
alsac
nolley
barnev
kefallonia
flegal
akoo
aeras
dilello
travolution
houtzager
gamemakers
mastek
responsabilities
amarra
wibert
sarafpour
dassies
nightwatching
powerlet
qualifer
zakeri
phumelela
wysteria
mastel
onx
molfese
resits
yessenia
russoniello
hypocrit
karunarathne
straughter
kvitebjørn
crickmay
hierholzer
slobodyanik
aduaka
kirpatrick
statesperson
reinfused
adnec
bournmoor
millam
seafolly
tahai
pavilionis
includeing
littledown
shrawardine
husseins
medicates
sundowe
corgnati
kolega
hartge
symphonically
fragranced
finkin
ppz
rechanneling
suctioned
endoscopies
forkin
winstrol
nantgaredig
pljeskavica
karmani
prebensen
océans
redeploys
mslo
stackley
angelena
collicott
utapao
serghini
zooman
mousedeer
frankenfield
rijos
niess
sameday
chomba
decongesting
mouncey
boisterousness
fibreoptic
fleishmann
pewsham
inconsiderately
narfe
babybel
thorpedo
pushelberg
acquistion
gombeen
erasmas
texturized
dislikeable
snoozy
capitalinos
tokenistic
chaza
yeltsina
zhongjun
tonights
imaginengine
gehrt
sugarbeets
aisher
ortube
proddings
prosthodontists
unclip
nyetimber
stila
succoured
donovin
kajganich
ablum
ecotality
reenforce
gauthreaux
clanky
ocrs
pmda
moonshots
keheliya
goony
realacci
threnodies
rubingh
hallak
cheatem
dismayingly
poospatuck
lorgat
boback
discrimation
osito
unthinkably
heptaminol
trouvadore
galv
brangman
mazhari
benicar
alderminster
stokeleigh
snoots
forgeard
phinda
echaveste
misalign
rovinescu
parkies
abloom
guiffre
nding
zunaira
tugman
javen
podro
ubaydi
uncrossing
leaverland
zonis
kernals
molner
inverrary
bagis
sladky
icemakers
designworksusa
rozy
videanu
wormgoor
flavorsome
rittelmeyer
lahj
kerger
bussereau
cupla
balornock
ustadz
makoua
croskey
careyes
yusgiantoro
harwoods
alcober
highboy
atrap
omed
miasmas
frayssinet
chowhan
ctrp
lustfulness
doghouses
bottleworks
tahmid
ronika
bridgehaugh
decommissions
upline
dodt
highbrows
gamesbeat
kingsleigh
jessyca
embargos
megayacht
lodeweges
spendthrifts
hoggin
cremisan
sebby
cclp
slabaugh
hernial
sturzaker
hengli
dikul
trakr
alnoor
fluoridating
poris
hopsital
lailey
elsburg
bedran
penenberg
hiersche
betoken
bromances
uhai
megion
kalimullah
faulker
breay
pajovic
spokeperson
nasraoui
vinella
arku
monderman
deloreans
fose
kimjongilia
bestit
tianyun
brusseau
konopnicki
uninfested
lumeta
chinajoy
bouff
polistena
watban
renewably
nackington
lymbery
curbstones
santonastaso
careworker
expurgate
timolin
rungwecebus
geivett
vitec
rhena
dushyantha
weatherbeaten
cheatley
swatters
sayadi
xiwanzi
agreeement
luckert
sueda
randoph
reafforestation
ciuffo
balkanize
yanliang
tregynon
haggards
swendsen
cannard
rfsl
lalezar
ailie
chishimba
lefer
pround
netumbo
stavish
bordeanu
adjoua
benzar
ruilova
batraz
itsmf
tombstoning
hashimiya
cardiogenesis
ladybarn
wikisposure
threadworm
alertbox
mulsum
basik
lopeti
brewitt
scavos
wclc
cistaro
tlusty
wormelow
nipsco
ardency
elloway
palolem
lynetta
oddantonio
tacettin
schwope
rothchilds
trelise
slamfest
coxley
jlk
pletch
asymetric
rosscup
ulger
ifez
kibati
poleto
ochsendorf
neophilia
kleps
bosbach
morganroth
pinacothèque
gobelet
ilane
zhicai
serson
gaveled
rauball
painfulness
affogato
sturmius
kanisha
haily
decasia
spoilsports
yoyes
gorodyansky
bydureon
menear
gadberry
schoenefeld
saffman
tawfeeq
souayah
spdrs
zaiying
kaanan
shoegazers
incois
marocchino
fishersgate
papparazzi
unprosecuted
faltskog
kasradze
panderers
tirtschke
humpton
donehoo
inspiringly
aleek
celko
reveillon
swigs
waspy
hoarsely
unstoppably
krubally
muscati
salau
guéridon
froghopper
lagostina
angkhana
twombley
wristcutters
rahulan
duggans
sudeikin
mccrohan
exfiltrating
reikan
rhai
berrymans
redsky
stupors
siedo
llafranc
shirong
luleh
placelessness
blusens
smallmead
greencap
cencic
mcfarling
saviles
soutif
shawcroft
zanne
orlie
roadcraft
understatements
anybodys
belchamber
bessmertnova
jinxy
ingrediants
sehs
bowesfield
flox
lendor
grsm
mazzalai
geleta
muddiness
sorren
epitomy
zeits
buendorf
thatcherites
pantsil
jeuland
réne
augert
peguy
pitwall
garikoitz
artba
greenip
foodstalls
multibrand
neighs
occhiogrosso
broton
fleetness
collarette
almiron
makke
fouratt
barnish
yorked
amgad
dimetapp
misdescribed
bardoxolone
flightseeing
wenzlaff
medpedia
taigman
aatf
fontus
cyberespionage
raguin
tsra
weebles
tianwan
uprs
yamile
grammatiko
bluehippo
uhlin
gladish
goettl
cankaya
toughing
lueth
piccata
bazlul
basuo
zebov
compliantly
sonji
versari
kangerdlugssuaq
digeo
rooky
prescoed
graciani
simps
nakalipithecus
rehl
acef
pressboard
tómasdóttir
meph
splichal
hartstown
bommentre
buddying
brickarms
carful
cannibis
zquez
dmips
pinkness
swartruggens
rubberduck
accentuations
abstr
jachowski
overcomplex
antianxiety
casued
ameneh
paulhac
unremorseful
daphnes
patriarshy
bruyckere
moolen
chhachhi
danylyshyn
kwariani
truenergy
brutalise
utiger
broadmayne
khabarova
wylen
wapusk
kazachenko
weizhong
ivorra
insolvable
bakx
uncredentialed
elhorga
comprends
saloom
mputa
threader
inteview
vaudin
photomaton
squirms
trucs
sitorus
eywa
classifed
tjandra
mintek
gilbank
realtytrac
andalio
hennock
gustafer
dziekanski
hoofing
akok
olswang
sloanes
barzey
pazhwak
ekirch
christys
privelage
aissaoui
peaktime
yorkstone
killygordon
cheeriness
pymt
corduroys
bushies
ablaza
roellig
rinnai
zechman
jacci
isermann
kernoviae
gatsalov
pelletreau
manificat
juwono
ileia
baraniuk
trebert
behavioralist
minihane
amfa
oncomed
spectralism
mossessian
narry
adiva
unconvention
universityof
bodrato
vaeth
czapiewski
rapamune
yochi
mcgimsey
mindnumbing
tartlets
rissoles
jamous
cortiñas
kanapathipillai
davidenko
frakking
librio
pegden
involement
schriefer
compotes
farani
lavielle
fujitec
vandeurzen
chiselers
cibs
fleischaker
chervenak
cigital
piligian
npss
wreckages
seratelli
cariverona
gutin
boatsmen
thoraya
standeven
shellacked
subbasement
wasthe
mercilessness
asfari
borzou
intergen
nonqualified
foleo
schulting
moslim
luxxe
openxml
windblast
transmucosal
antiroll
andertons
namdeb
schaumburger
rahid
eebc
murrietta
hecher
modie
chlorinate
haiping
embargoing
smrc
shcool
antur
formulaically
progamming
khurais
qizs
burpees
changlin
pibs
sibaja
metolachlor
seifried
malaz
nacubo
vatea
tourson
eltron
styloctenium
inceman
actividentity
huybers
deian
nilsmark
arputham
cantrelle
lhundup
revpar
apwu
llike
elsenhans
sciencelogic
photocards
vecoli
utterings
greenbox
unshipped
krugier
erbyn
berrelleza
birrane
antojitos
jiuxian
acupoint
hedwiges
bryco
lavande
apeloig
kaburu
riverkeepers
tadin
invitingly
kayahara
nerlich
unmodernised
ayandeh
grappo
withi
popli
demey
taciturnity
hahadasha
golfnow
sophal
giangrande
resusci
oloyede
dimin
knowedge
loua
whooley
ideastorm
embroilment
gittelsohn
tickin
unicyclists
rithmetic
infragravity
winers
coolheaded
zeglinski
caubet
arjaan
herbenick
quinette
fotanian
looxcie
kostenki
hungerburg
latrina
beieve
shirred
prakit
tedwomen
loule
tscm
jazzmin
preborn
shengda
depersonalised
entriken
cyrela
hoodrat
oxitec
somsavat
lengsavad
rathnayaka
brailer
donchin
rayshawn
seada
schweizerhaus
nikkole
infy
stirringly
wriggins
aschiana
helenians
vilcanota
koshansky
singulair
gâr
sleazeball
rustenberg
jott
mascitti
daleh
verwaayen
paerdegat
nestboxes
robello
tetelbaum
defensenews
zangari
unnoticeably
enaged
craftier
chadashim
quarterhorse
fredlund
hawkei
nosiness
sdiri
bissap
dongkuk
faradje
muriano
shigemura
figliuzzi
comprehendible
julfar
fluked
unquenched
thaibev
feldmans
darington
genton
woodings
margolles
beardall
bhanbhagta
kraichnan
mankini
icluding
xjt
nuancing
sartorially
chomped
beltrones
jaggedness
musettes
kontorovich
inaguration
qama
kiobel
creampuff
kinsel
hornbaker
knetter
sasselov
heavner
bosto
penybryn
richen
tirawi
puncog
diems
famara
arrabbiata
boudella
arulkumaran
pietrafesa
olmetti
mudimu
tradeweb
dcvax
mbenza
russomano
kopitz
dappling
footsore
overtaker
proberly
ridker
neeps
trendspotting
segement
appauling
sarposa
dittohead
diondre
vicepresidents
sophisticatedly
washpo
wintersports
saddo
remata
reconveyance
khuzam
pesanggrahan
iotc
thalians
sarwono
orther
burocratic
makdissi
musoni
lavand
lusikisiki
carpentered
diyali
ncae
ruslans
pompus
artsdepot
uninstructed
knief
opko
tillbrook
groundrules
merseytram
concertgoer
cchp
mclurg
ahmadreza
guanyu
melim
casiokids
samsoe
pagesjaunes
ziprealty
winspeare
bentancourt
issiar
bhps
gaith
proveable
frithville
xora
nilbog
kackert
accoyer
vitalise
fathomable
crasnick
hartnagel
nonmalignant
conern
toyer
lakely
isbourne
unroped
guerette
gorshenin
starosel
walkscore
aptv
signi
netratings
boschker
mindmapper
sumate
bazaaris
kitner
bioflavonoids
mainy
jolibois
theuer
discourteously
anonymisation
sweetnorthernsaint
purloin
fornaio
numex
defibrotide
archibeque
kryptiq
duccini
testee
khella
ciclovía
hestrin
gearey
beitenu
jaysuma
mograbi
allaster
cyberterrorists
quickcam
stakeouts
immortally
microplace
dhekiajuli
grinnan
tatenen
homestanding
ikmal
rapidograph
icbf
kovanda
panobinostat
sandestin
francescani
mailo
lathams
faillace
namikoshi
mailout
nybot
stoyka
oberhammer
khandal
hernreich
tobashi
linheraptor
autoinducers
redzikowo
rumberg
nariratana
kiechel
eirlys
sulked
subeditors
medlyn
itzstein
urostomy
trhough
abovenet
everloop
procul
bulin
defrauds
drifty
nonaffiliated
vlerken
instanbul
reconceiving
alsomitra
wingerden
wooers
socca
multicomputer
gawade
systech
lttle
ghormach
goldfrank
kalaj
tradtional
nischan
gostev
prageeth
luhyas
aloko
royksopp
hargin
wynard
nirj
pollena
germanwatch
boringness
countersigning
wthout
cannibalisation
degenhart
chromeos
cathance
basterd
shiceka
amsterdammers
headbone
bomai
tafforeau
sherrerd
putrov
rupiya
friborg
zipless
liquica
carollers
smin
underdressed
pocognoli
socpa
vedan
tortoiseshells
radhamma
smyllie
travestied
waithman
anot
antinarcotics
omrf
hunstad
canesta
egotistically
mutineering
xinhu
siamas
comingled
patricelli
amock
tarabella
agbaria
blogfather
collateralised
sucide
dragnets
boodai
afib
nycholat
shafta
touton
counterpoised
billauer
deluging
adonde
abuya
hamos
golimumab
omble
giessel
dentressangle
fehb
rivisondoli
wolosky
catastrophists
regietheater
hempson
aniversary
wotus
lilliefors
tegas
melinka
airworld
khaji
jibreen
appup
vongo
honicknowle
guidall
afribank
haycroft
distends
dvn
mandikian
nnabuife
highflyers
lorgues
lazreg
ciftci
altech
yarima
ordzhonikidzevskaya
leocata
brachiosaur
menary
zoloth
shopaholics
hpnotiq
tworkowski
hysteroscopic
nprs
qqqq
pantless
omax
pirlouit
stonard
nungwi
popovsky
monical
vigay
presevo
intrepids
sufiya
timebombs
ukrainain
elgarian
donguy
whooped
seabight
decisons
shupp
christenbury
parguera
klerman
shortman
tauruses
sayala
millimoles
eikerenkoetter
ffiv
penchée
putrescent
fassitt
tobasco
chunka
councelor
mabele
agnoletto
sunshiny
fantauzzo
babying
tiffiny
twiglets
pevey
ecogra
springcm
bizonal
talibani
orcopampa
madewell
teten
marghescu
triner
usairways
hemmerdinger
handbagged
valenstein
lagerquist
gyotoku
mesinger
noshaq
differenct
colcci
treviglas
loaners
shaimaa
leonardslee
globonews
nonmonetary
sprightliness
supportes
tiziani
kimberli
walski
uriri
kuehler
mamsurov
belway
wagnerites
roycrofters
takeouts
glugs
agood
homogenising
gotsis
profundities
hnba
hemlington
milgard
parthasarthy
herwitz
siksik
ferrino
necrophiliacs
subsquent
undiplomatically
fahimi
bavents
zvarych
olfat
ibisevic
arnvid
llangunllo
pushpanathan
blagged
knpc
teollisuuden
mykhailychenko
adnoddau
plumps
firebreathing
kazn
klocwork
bighearted
taxmen
alili
dhuluiya
greybeards
rubgy
jedforest
arbennig
traiter
bardsea
alloro
pergam
coet
yumilka
homasote
magcloud
cryptographical
ourt
korhola
fvrl
maccubbin
stockbury
goelitz
sedas
humidify
simultanously
bethleham
durda
waitlists
pricefalls
hrmc
sheafs
literarian
groznyy
olling
pfeffernüsse
flightwise
bastardizing
tinei
machelle
weissglass
rabalais
banyai
lizst
centralistic
filterless
trerulefoot
matraman
copolyester
kolluri
sefs
svatos
babchenko
darrol
berlinde
tongrentang
behanding
gisozi
automatist
brovina
lifland
splainin
mindshift
adaleen
peguera
uneg
proteccion
zhaoyu
superfighter
bernuth
giornalistica
mentell
gerstman
nger
prospectivity
ruhullah
smarm
telekomunikacja
dangcil
lacoochee
kiloliters
firesafe
epipens
polydoros
nestinari
ngaba
outplays
polihale
llwynywermod
sextortion
stradsett
hankton
crockart
tollroad
swannack
sangermano
zavecz
franciscos
cheuvront
drobnick
oveisi
kingsknowe
rasate
akbars
seronera
maldoom
chickening
thymes
boerger
louizos
kibitzing
marlbank
pfingst
tarikat
mapco
xiaonei
emix
libaux
undersampled
tiandong
kendrapada
ekpemupolo
posessed
soroptimists
eots
goldfever
uffindell
fawbert
amparai
licensers
statemen
norbourg
dhahir
jeannetta
huissiers
daragahi
pontifications
wsbr
mournings
misselling
fittleton
rockits
campcraft
bronne
acox
opeta
kivuvu
ndfa
jegal
intergrity
chittock
shukron
perpetuators
woulnd
valthaty
muncipal
yarnfield
humburg
latifiya
evryone
trcs
underpay
piccarreta
liptapanlop
eleo
xiangjiaba
kanyabayonga
crozatier
dorli
mulvee
playacting
footgear
dutto
norak
zimmy
ionawr
alphasat
franker
pesaturo
binegar
detchant
kscc
bluenext
cacchione
oatridge
hameedullah
mixbook
poltics
jolynn
donehey
kolbeck
mullineaux
humprey
coskata
aleikum
kovilakom
dufournet
maccone
flexibilty
fulminations
incyte
annmaria
yeide
tanbridge
wilkowski
dekdebrun
mafco
kurowska
ungrudgingly
diogelu
xeko
hooty
rattiner
hegghammer
wattad
tavernelle
surrouding
discouragingly
tdra
bonorino
montos
satarov
unenforceability
kgra
basijis
dailys
triveniganj
auguin
gowthorpe
interregnums
resonse
stierle
blustered
suspicionless
sarnowski
recoverpoint
leghold
fukumori
hapened
somafm
gerr
johnetta
technomic
parachuter
serdamba
entj
troughing
iccvam
muscare
sulamita
baksaas
bertan
overachievement
schomaker
cosigning
dissemblers
vilely
vacuousness
sevent
raddle
lundbom
ogoun
schonthal
degasser
univerisity
taqueti
corvalis
paquis
yonko
sungen
filigrees
apicultural
seamheads
decareau
rohlfing
speedferries
kkim
nesheiwat
gironcoli
redlener
ayral
babouche
orogbemi
penchants
clerisy
christodora
grotti
mikucki
tsujino
chancres
obscurant
pseudowire
dreves
editrix
auchencairn
stadd
klegg
kathwari
azzoni
beguinages
steo
redbuds
contentpolis
dayday
brachioplasty
salties
lepelletier
hrysopiyi
caav
docketing
medvedtseva
ochandiano
amorette
scarweather
amercan
mashar
dowles
hoesel
desbrosses
stravinskian
ahmadian
simkus
employement
hubrich
kujat
penyberth
planetsolar
izatullah
vinasat
granneman
chortles
yusufov
chaisteil
bancfirst
joybubbles
cerebrus
eqypt
hapsford
jaekelopterus
pieronek
witih
reiach
brylawski
mruk
treichl
gouves
loseling
tilbian
healthwise
chillier
dullish
leawere
benefical
culpan
kabulov
ladanian
corien
hrebenciuc
wavegen
delaurentiis
boardmembers
sukeena
bijur
stupefyingly
germaphobe
surliness
elber
sauven
daylit
samawi
ellestad
natour
bezafibrate
zylo
ecvam
sychnant
hospitalities
mortland
kalea
zync
kristia
mussed
chassaing
litein
constitional
noltland
refurbisher
virostko
troublespots
crispins
genitally
grisolia
podrug
deadlocking
alies
mahmod
puliyankulam
kayed
fluvastatin
desided
tehrangeles
macanga
desmoteplase
korsunsky
bargeman
uncontrived
feistier
delucas
trevan
morgeson
samoas
asociated
cocottes
rambøll
niedzviecki
duplicitously
ellershaw
tenatively
isssues
annoushka
bigotries
tptb
gropman
morinas
trlica
abayev
novellis
pacewildenstein
trisynaptic
muhidin
cavatelli
gasline
drawcards
perjures
christensson
houghteling
internatonal
gruffud
ingraining
basiron
abacuses
wippler
kingshouse
sarmas
muellner
tefe
dhirgham
tercek
cfdb
enmasse
greilsammer
medsafe
glem
wellby
disaggregating
argentieri
dogons
siefker
colombos
counterlife
nyph
empl
blaabjerg
galgiani
reacquires
aizumi
dueber
goldenblatt
disheartens
exhalted
snowplowing
vmpc
quiraing
borowik
swots
hefez
maeba
microstar
kikunae
tunefully
ikrima
letizi
psdf
luzmila
borght
spangly
aelodau
sorcher
kotowska
pfuj
broquet
dondrup
kindess
pendery
roadhead
odorama
caribean
vincento
mgic
inlanders
fived
tandle
wasendorf
doubletwist
balmullo
tewfiq
solidays
ikhana
awfi
limbourne
wishlists
filloy
sekulovich
yucks
chyandour
dangerman
bottlecaps
zhiyue
ganiel
jeunehomme
consquences
squashy
scivee
shayon
mealer
pulic
brégier
fatteh
portugalia
leatha
linskey
tenuousness
kaushansky
reget
natil
kamarulzaman
navizon
grammercy
midrise
flamekeeper
shontelligence
schopfer
imposer
ismatullah
onychonycteris
grabovoy
ferness
aujali
pakol
overdiagnosed
nüvi
yach
kulicke
jbjs
expectorants
liase
gametech
puffett
canonise
cerrudo
trpčeski
hinners
phibro
mahoganies
happpened
temelin
spookier
hatpins
motorbiking
jedrzejewski
ubaidi
grats
amsha
anbin
venglos
chunhong
lucketti
bernardette
kibebe
dbis
tulaganov
melanzane
bleg
wenxia
kunskapsskolan
armspan
ferrassie
esupport
drabo
refold
brusher
sleb
makarewicz
timra
topstitching
bertschin
llafur
matory
jiegu
kodmani
dutney
trueblue
musambasi
fsbs
sandalo
batterymate
myclimate
nompumelelo
bihn
zooppa
sleazier
milburne
santoli
vulcanologists
ekaette
mondrians
fruitflies
cusanelli
bonanos
fahle
internationalising
penttinen
serrill
narky
pluvius
oveson
daelemans
stertz
mentary
jakovljevic
hesperonychus
georeactor
glivec
charonda
zentmyer
shopsy
kiteboarders
arnelle
cadougan
kanjana
chibbaro
carcavallo
stoeckley
bouska
hosszu
raouraoua
nucleaire
essawi
spinesi
longsdon
guildhe
hainje
builing
demeritt
sompop
gemeos
girds
mollee
heskel
nonserious
outboxing
abhorent
dentally
allwin
mizin
canynge
ecec
schulters
lulic
ccpi
cbtf
louisianians
ayarza
akouala
devecser
untangles
bidonville
doofy
heronsgate
akhigbe
naste
ivesiana
interactif
carritt
highstar
gmdc
aquaduck
marlaine
riogrande
avalonbay
baldheaded
floodlighted
wiederin
sajawal
distateful
vespera
mgive
groneman
intels
britanni
brambridge
cheerier
olivotto
nazarenas
onvia
harofeh
aramide
smashup
vogli
chopsocky
astudio
wnan
mylonitic
senstive
kitwood
yannic
birkhimer
decending
greschner
massier
medicalize
ibfan
bowlingual
railtracks
chandrasekara
heyzer
ccctb
abcam
annularity
remin
dispair
lardelli
orombi
scog
bovim
brimo
takoe
spermiogenesis
iuh
jugos
peijs
kiradech
stroupe
veic
acupunture
phomolong
galovic
spectrolab
unspooled
tobosa
jollett
nakagaki
escrowed
manscaping
samanvaya
ducheny
sharmon
hucksterism
bacas
lenzini
shemwell
ceisler
phillion
pondlife
northcross
xiangying
mohoni
laque
piquets
gpro
federacije
splurged
giavazzi
highhandedness
lillas
pscu
staubitz
bursik
peevishly
atest
bursk
vulgarians
connétables
virutally
chastan
slouchy
strokemaker
cartesio
sheutiapik
dormie
sghir
blelloch
blacktopped
disneyfied
fareri
cuspid
ltfc
paischer
weplay
undrilled
sonenshein
bjayou
cphi
wilnelia
apostola
certisign
preprimary
pepall
blacksands
redferns
omir
enciu
waern
alwasy
borouge
tanjim
scantegrity
gtsi
carrez
bigum
betwee
sadoon
csango
shanny
coursesmart
teera
dhargye
punchcard
babygirl
unmodern
shelties
steltzer
elsaffar
istedgade
loosies
ippl
ehic
lozito
orchila
vouk
conspriacy
frigia
malinvestment
tiffee
karawan
wiznitzer
carcassés
iaus
gingerman
zurb
guyland
brastemp
accomplishements
masterbatch
silverknowes
blogotheque
barlanark
stampless
vehle
kkgn
ringhals
sempe
masura
nvz
baverman
chareau
churchstanton
popma
ivari
overides
alexsander
employess
grindhouses
pescetarians
passata
auque
ogundele
icop
viccellio
bransgrove
maarohanye
pcip
krenar
horiyoshi
luxoft
uninvested
succesive
greenoak
sermonising
twitterers
bakwin
soneva
tonaghmore
westernise
goosed
delectorskaya
tysa
cartus
toppel
mnari
norlund
boutiette
simental
commitees
salaciousness
glassverket
hostias
gietzen
demidec
relitigation
schizopolis
valemont
roaccutane
unquotable
camau
bernazzani
gavard
fischell
mozhan
librilla
battaile
chhouk
dooce
multiton
consipiracy
expertos
alvart
cayzac
arkleston
maumbury
hikal
disjointedly
deutag
pyschology
bobois
betweem
moruzzi
fettling
schoolhill
amatore
triallists
mogor
pedestrianise
apkarian
perriers
janiya
sawadi
arbourthorne
wayyyyy
delwa
orginizations
counterspace
atyushov
reacquainting
weegie
peray
mahmad
bavouzet
iret
averment
baako
tagliafico
prequalify
landaker
munayyer
vratil
lompo
penfriend
gauntt
bratter
tunewiki
iochdar
lluberes
menerba
somervale
flatteries
afren
plie
loac
unallotment
strite
resortquest
woodmark
breare
hopfinger
vercorin
calzati
bluemountain
nukri
intoxicatingly
quanitra
snts
tallie
stricklands
sisia
mabunda
selectone
hanscombe
nurestan
daulby
tangel
multisided
dilaram
uninhibitedly
migala
pattni
paleochora
sarvestani
hasak
jelacic
facekoo
ipsl
kwis
olabode
tson
basw
repor
perrodo
berec
macronix
ciller
drudgereport
evault
nakagin
mpdu
tamasi
sangwa
mallarangeng
phoberomys
bfsu
zatar
camello
constanly
birring
fijilive
hilborne
peplums
mouline
vazon
ozaeta
unseats
ljr
harss
amifampridine
janofsky
cervasio
ayagi
wwdp
feldmeier
crabber
superantigens
lepor
hobsbawn
bannwart
braithwell
thomley
buethe
nuvasive
remmeber
hallgarth
alcombe
gordmans
calpains
yanoff
clape
stoltmann
manadel
lenzner
brugnetti
kuzubov
lhps
plebanski
rizkallah
chimutengwende
potawatomie
superfetation
gonzalvez
severances
slurpy
schnaars
nettlesome
punchbag
mutiu
uhaul
cassimere
orecchiette
outsprint
translogic
depinho
razzing
pakha
summerlands
eurodocsis
supercentres
wreathe
ohtahara
daylength
proglide
seabord
blutch
cobbetts
yazici
scharioth
watg
anastatic
hagwood
lasarow
lavastorm
mcquitty
kueng
gunay
futuris
rpix
homoeopath
sampathkumar
malloth
sanquin
yawner
klaehn
anrig
alent
turnbaugh
burey
greencoat
hererra
sloppiest
garriot
sigon
salehuddin
valmond
diezani
marionberry
ganczarski
pedophiliac
kechichian
woodsburgh
keeve
idutywa
boardex
supremist
nonparticipation
klingner
outstay
pervaz
kocca
deihl
plexes
lingoes
proenca
hoskings
outsourcer
valkova
skeezy
panzner
bansley
telehandlers
nreca
voluteer
tabloidism
secong
newpoint
streethouse
baudrecourt
englebrecht
delamarche
vanaskie
payd
jimly
cellartracker
allmighty
laddonia
weills
entrie
kasrah
creciendo
igwg
chopticon
whipton
perkiness
harrells
airsick
fibbed
evertonian
mitfords
scoiety
rehnberg
naaah
litoff
arraycomm
stacelita
daysi
popinjays
claypotts
slatternly
salcer
bachian
snivel
papuc
uncollectible
vederson
japhy
coronograph
kamlabai
kgotla
beanstalks
dyszel
cutups
techpresident
sukkari
morlok
competions
almudevar
rajkovic
spinmaster
barware
deleasa
etnia
raeff
hoopster
jarko
sipsmith
divesture
iyp
overengineered
pollesch
simulsat
octuplet
decend
hujum
perat
gridrepublic
fineries
riziki
tollerance
volitile
merchantile
longney
contois
bismullah
swiki
niba
fingringhoe
ospraie
allieu
sacdalan
sandstedt
jabots
unpreventable
navaro
pogemiller
itpc
videocracy
kinnucan
monning
fukushiro
fawal
dicked
dobies
quintiliani
sartoria
perfoming
viotia
freerolls
slaff
rigotto
sysview
tagliamonte
khoudary
youseff
myfoxla
commmercial
vigenin
aizenman
kotalik
homeloans
niggemann
pramit
somary
drycleaning
kazuharu
omaid
chaunce
thekkekara
emailers
budges
ashgabad
elastane
reorganizational
pilypaitis
halbertal
hfba
jepleting
petrokazakhstan
monewden
cvetan
ellerston
stanya
fetion
meidrim
towthorpe
unbossed
shaariibuu
coxhealth
blackhouses
satcon
lordshill
fendant
wheelarch
daneshill
incompetance
honigstein
handtools
fridson
electonic
hiter
matrosskaya
strohecker
wealthtv
waiganjo
lombino
ulfers
macae
ducketts
herawati
angstadt
reconception
ahlone
decieving
fillippo
fascinatin
lantra
receipted
dearnaley
weiguang
mytikas
myerberg
samei
honeychile
jubbly
karoki
parween
indocyanine
brakewoman
onbase
spectralink
discombobulating
bibf
ridling
bosniaherzegovina
politial
schue
multiven
lyrids
fehbp
kitzen
waterlog
panouse
ibsc
grimason
boyat
giantism
ploteus
relized
recenty
calem
thougth
deselecting
blacklistings
zochonis
saidie
hafter
grucza
tailormade
saminejad
wilb
overell
pazmino
telegeography
champetre
hightly
targedau
buid
intrado
weblo
cybercafés
fishtails
tadmarton
modrak
centera
rowberrow
riverford
penetralia
mayanga
obergurgl
chbeeb
sniders
goklany
soldera
bazbaz
schnarch
sharrocks
slotradio
navini
gyrocam

brainbench
chalvedon
picknicking
lepsis
kostrikin
audf
voluspa
privilige
tazim
hamermesh
schuening
lyapin
sukkoth
kinetin
saisi
thornloe
fereti
cinching
cityspace
lasak
samji
brutoco
shaquanda
bingjun
engmann
akhond
escalades
atlal
mightiness
wieben
harik
interscan
salerosa
saloli
knipling
ruemmler
villen
verrry
greenlick
inkless
microcsp
altata
fisman
sudik
cripa
elkwood
suping
flitcraft
seasonably
neumunster
revich
ndeti
marleys
dancelike
qushan
slimmon
fosamax
dumatrait
murliganj
junling
haynos
fishfinders
overanxious
fuschia
becom
mykita
songsak
dissapoint
efficently
farrage
adsb
schilf
tagme
cuttone
economix
brocke
netani
greenlights
swordfights
meão
kambaksh
plantlike
dustbuster
rebtel
sipress
frowd
shawwa
bronzefield
sirree
thottathil
zaras
krafts
pulverise
azuz
zirkelbach
multidetector
ansv
edemocracy
paradichlorobenzene
lowenhaupt
mhenni
stretchiness
bansky
frostick
ghaffarian
hodginsii
khamtay
tieback
gabaix
neonode
miloscia
keera
eneough
tsem
asiacell
kurtzberg
quicklink
bowermaster
yuai
dworzak
superintends
kolade
subcribers
summersgill
seantrel
darfurians
upskilling
tagalong
bobbyjo
katam
guellal
hulatt
marketingprofs
zuberbuhler
cagaptay
gorsey
touboul
detron
cotey
victimizations
katial
crossfires
craycraft
estaleiros
sultriness
ghioane
nethers
gonzalves
paduch
rumaihi
kitcat
polictical
chimenea
dazing
brandcenter
amgs
rubeis
inswingers
hovhannisian
kalko
macguff
hobbyhorses
songstresses
holzberger
pivoda
visability
eqb
sonystyle
eilt
folkson

barabasi
debauching
flowerings
hillgate
russneft
pedair
petchkoom
hiros
graikos
whooshes
stockline
guennoun
hourse
gartenfeld
devaul
donaldsons
aktis
superfluities
nymphomaniacs
distachyon
goofily
sreb
medenica
harakas
backett
rapelye
chegini
nlga
paramos
sgorio
witoelar
gasfields
skillicorn
habiby
cotchford
vprotect
billeaud
overclaimed
emberlin
befouled
sherifi
tinhorn
sheptock
nasn
moenchengladbach
pamfilova
uniflex
normando
panteli
anile
deranging
ngwesaung
hilliest
knobbe
askenazi
breann
funemployed
supriyanto
ivannikov
asefi
zvulun
deadens
epcos
thiefs
hastalis
telephonists
duhau
neuborne
slowes
sowles
malaney
medlam
limotive
durup
upshire
luxuriating
manganyi
bhunu
bedeviling
shibanova
calvins
euri
dalemain
inexactness
condole
nagreen
zalpa
inspectional
exoo
ciacco
plackemeier
barti
gordel
repotting
usherettes
alefacept
flaska
mzonke
acccess
saicm
unironic
sugarmann
nicassio
elvian
chaowai
baudy
manorcunningham
okram
scrunchie
llion
braband
blakewater
pontiki
grunin
casalese
sexualizes
mccollom
polderman
heyler
sireau
roualeyn
bakyt
agita
unipersonal
cirulli
roshia
losana
unprescribed
csotonyi
riberio
shooutout
jalli
fcsc
acjs
silek
demanders
perfusionists
bocadillos
confrere
marzolf
jurrasic
zhengqing
habitues
asge
easliy
moneychanger
learnedness
birdingasia
coucil
pauze
schoeneman
watring
bedspring
ferdl
capocollo
snwa
atrophying
wafik
broadbased
bensadoun
zolghadr
medge
hennagan
guiter
kennya
leetle
stereographer
rusas
possesed
mamitu
ruihua
massenhoven
subtenants
disseminations
ranbeer
slightingly
rsvps
uninterruptable
schartner
mcgilvrey
luteinising
drowsing
whiteaway
wysing
egana
ddit
holohoax
srygley
kikaya
neece
rakefet
morhaim
albitreccia
sabtu
skillett
casselle
illigitimate
empathises
ddrc
karanfil
keyra
crothall
xingquan
sacla
colting
staunching
duponts
duea
jinadu
scer
brezna
cyntaf
seshasayee
lampel
gjeldnes
ryakhovsky
bauli
hallissey
stepanski
abgal
‟
derrykeighan
chapek
secularising
syllabub
sirait
pouladi
suwarno
haidle
subijana
flannelette
donativum
lieck
lenell
jnet
alyoshin
youngor
trustingly
fumusa
harllee
eacute
gutschmidt
schenz
stokeinteignhead
bhebhe
molitoris
soopers
khatak
wannen
despoilers
bedsprings
malboro
soelistyo
remling
inebriety
votomatic
returneth
myxer
brehat
skinsuits
nimocks
raymor
oeics
nncc
macayeal
njha
pathlow
smilar
pajeros
valdimir
dmjm
poularity
strenously
okula
kifayatullah
keyesport
campagin
neiko
fishwrap
disabilites
sidneys
ryehill
deerhorn
kalidou
uncoerced
coarsen
ghasiram
schlief
chaplins
techster
mangoma
bolivares
roshanak
lobbyism
transcrypt
messian
gitsham
gtac
mianheng
cimolino
blasphemously
magnificant
freeley
hagues
fogt
weierman
higney
payslip
coddles
focalin
mangalitsa
dixielanders
indelicacy
sidiqi
witheford
syphillis
netnames
morethan
colemanballs
xianmin
minqin
dekuyper
omra
froylan
gocco
shemali
shillelaghs
grayness
ynghylch
completition
calladine
redlaw
smolla
rottler
tochman
sidikhin
mptf
jurlique
attiyeh
jarinje
celebreties
brokenburr
calyptogena
renslow
kolini
nimax
sirias
kyaiklat
homebodies
bhubneshwar
drawled
shipit
mesiti
immigrationworks
splittist
lideres
nusoor
oncor
samarjit
pireos
tromode
mesaverde
themistokles
sassiest
liebergot
artlessly
kelvinhall
ayaa
overharvested
bolde
subsegment
pauffley
karuba
bohling
tamarside
londolozi
goosefish
ocfa
nonsexist
asigra
magnoni
burkas
pakeerah
iepe
mollycoddling
jerzyk
republicanos
sunamerica
bestride
movielink
invernesshire
esys
ometto
campeao
leathwood
cpmiec
diale
aqn
slogger
mellins
schloendorff
dembner
dcjcc
primex
henkels
anitkabir
newlandsfield
shiaism
snots
michéal
huther
crimdon
comert
heske
publinx
inveighs
moreinfo
rosindale
fleamarket
bhac
kliem
plonking
myller
vucinic
naddi
captura
kniffin
latavious
commentariat
ritti
marayati
unreel
slutcracker
nangahar
enervate
hobnobbed
impishly
trampy
bickhart
gossipers
bowbridge
farcial
braue
xixianykus
incy
grygiel
slaughterman
taliesen
anteon
zelon
salym
hensch
timetree
laforme
colisseum
cardiotocography
jaquess
wiand
sedran
focalpoint
anowar
toptable
kalaweit
oeic
denstad
maduaka
meken
drajat
humbrecht
seditionists
niederman
kakabadse
jasiewicz
mgat
khristina
clerically
zangmo
nonsubscribers
jainist
turchak
breezeblocks
giustozzi
scaremongers
springform
blamer
shwemawdaw
alkalay
repricing
kraine
cahps
wanatka
boncore
carlops
chapelotte
laveist
groshong
mandean
grish
grabsky
slinkard
itemise
tither
telenovella
szara
heinously
mylonakis
kinshasha
hownam
chungs
streetfighters
exhorbitant
valkovich
puremovement
génocidaires
lotsof
freckleface
schuver
jwf
toplitz
accupuncture
baranka
techzone
jolbert
naraine
svedka
cinderblocks
unsal
masciola
hedgcock
manlangit
herczog
tamanaha
umpcs
shushes
wvas
qanooni
pronuclei
rheeder
wanding
yukitoshi
harquail
oshea
dauth
lockview
ioulis
tasimelteon
effies
austalia
greenmeadow
hallewell
abshero
inklusive
takeshis
inswing
mweelrea
looki
reprioritization
antwun
insulinotropic
propsals
schive
mipro
cobner
sayra
lanntair
zayets
czarnikow
banotti
majete
boeving
moubamba
seemant
caranta
inbreed
harmut
maue
vandervalk
abiomed
ryori
hollinsclough
benzecry
unusally
weaponary
shirtmakers
dantesque
bvute
wüsthof
iattc
kruijff
miilion
unconsenting
perenially
hainline
counterpunching
götgatan
bonnart
prinzing
fineprint
flipkey
allegros
clearvision
klehm
lorkowski
raymour
fujisue
kasriel
sagit
baladin
gsce
schoenhut
misgovernance
vobora
sydneysider
gardaworld
threatend
radiat
esolutions
bloviation
wiliness
neumar
enzon
yongliang
udzungwensis
tetangco
teutuls
filmaka
climactically
shoosh
weinschenk
oddbins
multidisciplinarity
youthnet
nagamitsu
getliffe
confederado
hiddush
surpreme
piaba
semiformal
alkatraz
chorltonville
yuqiang
niesr
rpts
kcals
dognappers
pluggedin
giarraputo
senofsky
unmanipulated
dimario
italianisation
mingarelli
hafed
patrizzi
scheraga
pky
mozzies
moynagh
stohrer
shireland
multiplus
nadac
yukons
vakhtin
appaled
glj
lamingtons
perfomer
naeemia
zhakiyanov
sorriest
sashays
lamposts
delfos
firewatcher
osmanovic
langdorf
assit
orlob
satrec
amsellem
carupano
teresas
pinco
mietzner
defintiely
taweelah
moustiques
albain
raxibacumab
winstorm
enkoping
bromfenac
demailly
tudoresque
retamales
qiuhong
perfom
amicone
resika
kloza
fieldglass
trafficing
violinistic
ayro
kumbaro
dirtgirlworld
pilarcik
drumpark
cicheng
shefferman
trockadero
qingchun
viperin
blankmeyer
retore
ballantines
matamoe
tzd
tranghese
woodrough
seatmate
shoemark
practicies
burgt
unmistaken
sakow
elew
derw
okike
bankcorp
saltshaker
crespina
snunit
towning
citysocialising
civan
itälä
someo
ramaala
wrdf
tyrnauer
truthfull
kularb
paulistanos
rastagar
brattish
pitango
espaliered
scorcese
whitmill
fehrer
begert
galens
apprehensible
yarelis
kadyrbek
ticketline
zagg
ciwf
sandbo
everhardt
chewiness
ahdr
mmod
vmotion
fmris
brawlin
bosire
sheeny
krivokapic
najiba
portlemouth
bewigged
parsai
radient
kampgrounds
maoyuan
cherita
strathkinness
fervant
krusell
lamorde
wattville
openhydro
jesli
avesco
fleetingness
sitex
weightlessly
isandra
emiro
menb
itati
kiplings
atpa
limbos
klonoff
knowlingly
lefsetz
holzel
sabahuddin
duckhorn
kipsiele
warbots
gyalo
wifehood
miscalculate
attensity
chpa
kraeger
absl
loganberries
trillionths
ladygrove
bishow
barcha
journe
schlondorff
protalix
capb
omiyale
goeman
iréne
penholder
quintiliano
mujati
bogarin
veldmeijer
kavenna
shaoxiong
mispoke
tougias
fetz
estephe
pontio
cinelatino
hradcany
intactivists
shortchange
ocdetf
cortella
jethani
tegic
magheramason
dakkak
burring
meziani
twer
upseting
sonicare
apme
daugher
allbeit
intrepidus
avonex
karaoglan
bintree
caroused
sadkhan
abdelbasset
kaliese
napili
scherza
mythologising
guneratne
orringer
piecyk
aondoakaa
vbieds
glannau
stormready
cirg
kapilow
gaffikin
blutt
ryabkov
babock
rreef
handwrite
prizefights
reyad
mediteranean
buntain
spicey
nextfest
mcduffy
pakhalina
spiridakis
nyanchoka
brandesburton
overlit
echemandu
clarizio
bmds
gardels
jihah
carmens
gerneral
anyidoho
ortica
nevling
claybrooks
lazslo
bassanini
uproars
outstays
coiffured
bibring
uncalculated
orentlicher
nevetheless
essawy
basaran
clinard
jcj
keten
constantí
disposers
optomec
ineichen
houseago
krzanich
boasters
expensiveness
freshour
ramstead
railfuture
rondy
reteam
krolle
tdca
sekita
mitiku
feriani
tradional
nidhal
metaplace
schallenberger
italias
innocense
greengauge
bedbound
ponied
impark
panted
entrate
bunkmates
dunlewey
poac
lucasz
zhur
ganef
unstocked
hueter
franchuk
lorinczi
sizomu
friedelind
bouchons
lmax
bountifully
sentencer
sichenzia
elction
simcyp
taismary
cardfile
unadvised
hritik
expertice
sclar
nsofwa
diamondstone
jouyet
baudrand
tkacik
wingels
ethnik
fulfords
stongest
xtension
heavrin
chitron
intoxica
ashizawa
goldbart
benalcazar
ysursa
cwla
pyatachok
nnemkadi
koshiishi
ahanotu
bulyanhulu
bonacich
gheni
twlight
palavicini
nekvasil
lannen
magson
tubelike
propert
heliocentrics
rusol
traing
bocker
aviall
slathering
authers
aryanised
imlil
shamefull
hanash
wyser
pagunsan
clubmates
rinta
chieftan
buscaino
rccl
granatt
allweddol
mcgillivary
samirah
nonindustrial
dharmana
padgitt
peroid
quandts
gallarda
famm
isnaji
cedano
posman
goetzinger
proe
radhakant
lazydays
yark
riversource
watarrka
skils
ruehle
kunzer
murchinson
wrigleys
hotwater
chiwenga
sosen
supersizing
dermalogica
uncrackable
stakhanovites
prickers
taniwal
giveway
bombassei
maloin
greenfaulds
uncustomary
steinhorn
clouting
pniewska
softspoken
sashayed
ramms
wuan
pagella
fscp
origene
vaisanen
arthro
guzm
cbcf
blahous
weyco
vasogenic
cobank
hkmex
haimov
mght
bridgeclimb
flexsys
poussins
cardinology
jnci
dedem
kohavi
doostang
scribendi
sillen
graziana
fischhoff
italk
pple
wyndford
monetate
kanakuk
kursinski
hutchby
bootstrappers
nonpracticing
ecamsule
rajay
peltason
chej
fieler
icrier
bosmajian
footsbarn
glassful
otpp
rtfo
matyszak
chatani
trivets
flashfloods
jiale
yingxia
rhosgadfan
birbalsingh
suky
untalkative
catcalling
runcom
chopine
ramdeen
kabanicha
beastliness
zarubezhneft
payslips
cerdin
reinbolt
rebating
productivities
daunay
speedskaters
llwyngwril
chinaco
tatzmannsdorf
chipmakers
blockson
incoherences
sathyaprakash
icosium
biocatalytic
bisogni
ozlem
sharnford
nekrotzar
wallbuilders
clab
cannito
montagner
areligious
cscb
stickiest
azmatullah
mumbler
bilro
doddery
klepfer
intrafamily
eltrombopag
leissner
zhengfei
muratovic
medika
amatitlan
niederhuber
roulades
shifren
braeckel
blowsy
ucedo
gagas
greivances
galynker
hardup
paiewonsky
stewartfield
worls
fufills
womma
monacolin
laios
enawene
dhahi
ayovi
drugmakers
sherwyn
danii
astrochemist
aznavorian
suruj
mougenot
cagatay
hdhp
nzimbi
tufin
aristocratically
saddique
plommer
morgenavisen
fetherstone
penstone
perlson
reaggravated
sessegnon
herkel
tarcy
chelsio
calie
warninglid
meidi
clri
playcalling
heitzler
revak
tsimane
questnet
mynor
imagineable
voggenhuber
greenfuel
cwellyn
obaida
partical
tulipae
canjuers
poolstock
backfoot
griffth
lepselter
sherbets
herdan
lifecasters
uzzo
claycourt
propganda
valerien
realmuto
dossen
biomechanic
vrijenhoek
furnisher
kuleto
diservice
fcbga
cavolo
holnest
vcast
ryuteki
solden
slossberg
tansky
lisset
homelite
sterndrive
shieff
conjunctural
hammerstrom
nafaa
duena
simanimals
stalevo
asmatullah
glandyfi
imperturbability
clra
tsundue
gollen
ladieswear
suntans
aurang
diamoutene
clayish
buoi
abercastle
handwash
hottovy
piermaria
woodhatch
mastromonaco
flexjet
jftc
debbage
blythin
oneapp
fiaa
grishenko
gaffed
budwood
freehouse
jubril
cestari
warbly
craftsmanlike
koniaris
ogutu
mullivaikal
turneth
sobp
jinmao
cortaro
fictionalising
seddons
hilmo
meronek
triperoxide
ooooooh
weidensaul
cwdc
smoothening
palefaces
chelgate
constructedness
solwhit
magomaev
truchot
ramsy
moider
okaying
uncataloged
wegge
witchel
forschungsgruppe
infratest
lumio
ladypool
persina
arader
zilic
snappily
pikers
netlogic
kimilsungia
inflamation
ptech
trunki
vbrick
bradville
lolled
bangaldesh
nyheim
polycationic
stannett
bellyful
shinguards
bakana
trupe
uneatable
kneads
austens
pgrp
leberkäse
blatanly
ittel
enpro
redeclared
yolette
zakira
tricoire
handschu
ldraw
reinemund
mcguffins
miklaszewski
effaces
ixabepilone
sahafi
araghchi
fatmire
trawscoed
catastrophies
cohr
tahuamanu
bottlerock
gottemoeller
greedier
anahuacalli
bleazard
coonelly
kombewa
floridita
nutted
steinseifer
stenzler
byrddau
forestethics
zirkin
geliang
gromia
lolito
corelis
seamons
cefntilla
mhec
klatsky
mamillius
unclogging
inauri
ukaj
bruyette
hechtman
jolinda
guch
munlochy
presort
wlgc
marleix
batmanghelidj
sturley
linbo
slushies
akerfeldt
sioeng
religeous
shuras
ideaology
meistrich
bullwood
ritche
shadingfield
tutv
partipants
megatrain
maiziere
dunakin
themelis
krassimira
stojadinovic
merland
orating
saintsations
schräder
letard
mikayelyan
heatherden
toomebridge
marcoola
unparented
seabolt
wittert
dufourg
nyiso
pasedena
goldfire
unsustained
nechi
cheifetz
finighan
dsrl
penneteau
kiedrowski
sajnani
nobia
blakean
grandvalira
tetherless
breana
vivisect
bengay
rheumatologic
glitzenstein
ozkaya
panang
newsmaking
racioppi
boffey
récemment
pyant
walvin
khamisa
hoopman
poveromo
monagle
tilter
excitment
exlusive
massouda
nsms
vanderhye
faguibine
cayford
armyworms
bridling
opportunites
polyjet
mariuz
pigasse
schonwald
kuettner
acuson
mpinganjira
madilyn
vereador
aftersales
facilty
gathy
snowier
reallocates
cockwell
jadriya
backplates
chunlin
airhorn
relativities
jewlery
delettrez
lubiprostone
cheresh
coronari
bushill
pollick
topmouth
bertodano
steinlauf
pildes
evensky
waeli
galw
leighanne
kvaal
boender
unlaced
hemostats
labatts
arrogates
nguan
musah
nozadze
facilely
duologues
hexworthy
prakarn
kinlochard
airer
wildean
makkawi
hortas
goicolea
gamemill
corumba
arné
touzaint
similac
sickled
fattens
hmmph
devani
stabbins
arrata
howlingly
tenían
gelbman
taelor
bainwol
hambden
shubhada
maniquis
coutadeur
kavon
diamox
eyms
maxalt
titchy
pepelyaev
francel
luckwell
intelect
assurity
transnationals
shorelands
volleyers
whitgiftians
polioviruses
matambanadzo
storozynski
suitsupply
tailandia
raskind
eboda
gleasons
bydlak
sabali
unassumingly
postnuclear
publicaly
oilpatch
bekay
elusively
forrister
dopage
hahahah
toppert
somila
narbeth
savuth
nosecap
doonhamers
supové
aydogan
pumwani
bental
neckbrace
acoustiblok
ekmani
ventastega
provett
kilchrenan
higiro
kippford
mckilligan
mastis
bonnano
westaby
dovgan
lilliman
pucky
quelccaya
milltimber
miswrote
sulitzer
seikel
denneboom
ermelino
clarabridge
clacket
porterage
tekebayev
jotischky
forswearing
zhaorong
waithera
overbought
mcgarrybowen
filyaw
kamrava
korabelnikov
fluty
spottings
teleni
chelius
doumbe
renjen
beyton
squiggling
khouzam
barnavi
brighteye
rinos
gecov
cutecircuit
lukacevic
ferez
megayachts
verardo
overpaint
ngaujah
kolek
rollups
dafur
netdragon
marari
cochleas
lantiq
asipp
votewatch
capathia
weedkillers
reductil
zyflo
seting
efunds
whuppin
aurothiomalate
cscp
cacb
kadhafi
surving
pockmark
estulin
chadderdon
calavo
detikcom
bitancurt
brejc
sherak
smolyansky
begovich
katembo
lahsa
guynn
aplogize
eribulin
parkstead
marget
burkee
reckermann
salvias
winnows
masche
kherington
beated
overachieve
leimberg
veltrop
croudace
lisov
mbpd
swidnica
glossily
cassivi
centrefolds
salge
niklaaskerk
readapt
avega
chowdury
mossawa
krakhmalnikova
netwitness
morgantini
mbbl
ibasis
densen
timl
uhlir
dohop
quiddities
ismm
brugs
overgraze
piffling
wendrich
teenaids
lucases
mumbrella
tribalists
taab
cuvaison
armegeddon
themos
underrating
cozido
sicknote
emzar
sukin
ecofuel
shanbag
lorenze
rotovator
ymddygiad
dolara
pochoda
semisweet
shlaudeman
zamost
esref
breadbaskets
kankariya
mcguires
dockham
soltesz
haqqanis
stapelton
wodge
masterspy
schriro
ghobash
kelut
cefp
chiyangwa
crookedly
makhalina
minikin
stinting
mescalin
groaners
gyorffy
cakmak
levave
hatzadik
pahk
apéritifs
rathbones
recharacterized
okonogi
zhuravli
moeti
basbaum
marchet
keiles
obscurer
gestamp
aslamazyan
leverets
nassery
ocurring
rapke
youa
totalview
maouche
essner
particularize
shufti
thighmaster
cocilovo
sawasdee
comapred
bibey
toywatch
hoggish
dussuyer
murell
eurodam
humetewa
owlshead
feleke
altarum
kocoras
voluptua
basico
neala
fomula
kanamit
misfortunates
aprl
asieh
eleborate
guisti
eterovic
louloudis
manwani
genomewide
crazyness
rationalities
intracorp
intergration
tillander
squirty
spendable
kollie
metalogix
prommer
kromberg
fussible
presuppositionalism
botflies
ambivalences
cattoni
boutaud
trymedia
dslextreme
disjunctures
gossart
tapha
briskness
zoelen
flightiness
clussexx
grinchy
papadopolous
caixanova
simonaire
endicia
lockerley
asila
marasa
stess
edox
hailiang
intercell
cuifen
januvia
vny
marquiss
majdic
dumal
enmei
multipacks
jazic
rajdev
festivalgoers
bowdenii
pumpy
rizman
kommunisticheskaya
stevioside
bridgitte
onionskin
kinnego
dereje
woofy
morue
morila
ebace
janise
clarisonic
charap
kuresoi
kaltman
sezone
beaconside
giulietto
panjagutta
cangandala
istream
rustamiyah
krimm
travelscope
parthenolide
talkswitch
sheshunoff
gred
sondi
inglee
maawg
witrh
affreux
forrey
colono
tennesseean
simels
kugelman
selsun
lunchmeat
kirkholt
wahhaj
jaffari
mohm
torcher
boschen
balikun
wkmk
hrqol
grangent
trumpery
unfollow
gwastraff
chamath
pinp
timmendequas
schee
ustin
lesportsac
weyermann
sixpoint
enshrouding
freesheets
belapan
pepped
ellex
reagrupament
zakharia
stiffing
amnis
symbicort
guerinot
robreno
swiecicki
schaftenaar
eathorne
afobaka
maxium
oghene
khashan
mochas
biotherapy
citydance
palaeoclimatic
garousi
kusnyer
klafter
delectably
acitivity
concretize
oedd
drumaville
howatson
caiguna
ocie
oryem
osodo
ippudo
wygod
sosnovski
sockman
enayatullah
tsoumeleka
vanderbei
jestina
bruenn
duderino
zitoun
democratizes
skyscapes
upcycle
duhn
bhawna
ooohhh
tsavorite
unace
manaseer
scarely
sharabati
feddis
vvx
operatics
eldrige
repetitiousness
pellicani
dalenberg
tabbaa
zubok
sblc
beatragus
bernia
fhcs
trebic
udelnaya
ezor
blubbery
injuriously
carrows
overengineering
ruig
abidemi
mooo
jiandong
weyeneth
indiscriminating
penanti
demotivation
areawide
bodjona
abdoulay
chapelfield
norbord
pearlberg
unfcc
rynku
curcillo
darbonne
lesc
edsp
djemil
aboutboul
maizy
swiftbroadband
candelon
danderhall
ecomotive
canahuati
mafara
gyromancer
hestor
asgaroladi
paatsch
smrz
arangement
yuken
niomi
codepink
verrastro
zenani
petriv
grawl
malayasia
collusions
hayasaki
intrastat
toyobo
upaid
upolo
köbel
vosawai
jabuticaba
myanmarese
issers
haydenfilms
everymen
litterbugs
kidshealth
windowbox
skistar
khalda
vectras
jorges
suzhousaurus
felci
overexert
slobbish
diringer
harpic
epassports
vocino
ragbir
markino
poilâne
scalora
menseguez
mellard
enunciator
kutigi
intransigently
lvcc
azzaoui
nsid
rotw
barocas
krevel
maelog
regadenoson
superlambanana
hrcc
dsgi
nplex
achub
deskilling
chestertons
mangalaza
chesko
birotte
schlenkerla
awbridge
telengana
lumberjacking
transshipments
powermat
gateford
souha
frehner
blochairn
alouf
nyka
weintal
schlossberger
presious
quantex
svyaz
quicklogic
beddawi
gabow
nedrailways
governers
saajid
disincentivize
nayed
vié
stielow
jeffe
seifollah
aldenderfer
oponents
militans
benylin
crepeau
menilmontant
feasability
superlambananas
arbc
enxco
seccession
waqaseduadua
gyllenhall
outraced
kyri
sestanovich
sazhin
underworked
ramdat
parous
akapusi
beardies
flipbooks
graby
heckenberger
imigrants
pavord
bouyant
mully
lifescan
manguso
oakside
terrorization
karliner
ariaudo
palestinan
unintellectual
flagellifera
scolnick
purloining
celebracadabra
vergnano
clougherty
condomine
wholewheat
thundersprint
parsvnath
jonigkeit
batabano
dickersin
icabad
hyrbyair
shawnae
brunnock
bachmayer
monart
kalkhoff
esfi
personalties
exhortatory
metatarsalgia
turkmeni
zilberberg
shopmobility
thorong
rebiana
microcultures
sieni
voast
qahtaniya
sloviter
gromoll
onich
kerpoof
kamens
egunkaria
xtep
meschke
herritage
abbasiya
uncollectable
unnnecessary
decine
baramia
markwest
bioventures
digitaleurope
ptwa
itat
chromadex
arbas
neofascists
tidningarnas
betonsports
mmae
mayewski
pinelake
yonty
janelas
ghabra
freakiest
maculan
leurbost
denik
kawuki
disctrict
corticeira
fancily
cransberg
nyakairima
vyatkin
corpoelec
gradowski
goursaud
sokolovskiy
kervella
rietdijk
beukering
weith
schuchat
principalists
posessing
ogis
sorc
arnup
vagli
chardara
califorina
stephansen
qassemi
smolko
moleleki
wallone
annike
rothlisberger
esensten
mdex
garetto
picholine
plentitude
mcbains
mondesire
knuckleballs
clammer
sarikaya
transitionary
cheyanne
meiping
todung
mygrid
maseda
nontherapeutic
canup
unflaggingly
yvenson
zophres
entech
obidos
espalin
cosiness
segell
prezza
darges
phoners
solenni
shaoqiang
monther
blimunda
hosani
vomitoxin
lifesized
egms
chucklehead
neonopolis
tonsillectomies
depositers
publicat
mqg
milevski
travoltas
harbarth
ciganer
chsra
vallado
blyskawica
shaanan
blankie
nasirli
bellara
clawlike
madheshis
mozier
scandle
pinderhughes
ubcv
visotzky
eichstaedt
newaz
gorczynski
artna
faggen
miklasz
fiercly
musotto
antiangiogenesis
ravishes
nakal
jeronimos
eunavfor
kildan
childproofing
jarnell
timberly
sehd
escholar
marchants
compustat
kalooki
frechter
mirlitons
terasem
poltermann
distractible
elowitz
whichford
konoike
graib
modry
gatsas
infuriation
razaaq
istrabadi
koschnick
ishare
souldier
vrts
cheno
olw
brivati
hajjis
matsouka
chunqing
ynysmaerdy
rsoi
bions
altfest
tibisay
binaghi
waehler
iitt
kassabova
outreached
bindaree
patner
averbach
interactivecorp
ballar
mohtaram
cybercriminal
masive
willbe
ortakoy
hellsgate
handcarved
usrbc
branstool
talentmanager
defibrillate
iigep
photopass
realtionship
erso
yueling
nyssen
babon
oosterdam
avisar
brightleaf
supossed
wickus
tornqvist
snowboots
rambourg
skyburst
spamount
oshiomogho
abdesalam
nyff
karamira
joypads
mangola
tissi
bratschi
colcoa
hovensa
farjestad
hatoon
tomishige
mcdermont
hayball
hillmans
scemama
shoubra
schnebly
tyf
ciriani
euroskeptic
krajcir
radanova
avermedia
hygrade
peakes
phoun
gilleran
theit
dellamonica
bootcheck
tassled
iftas
grishchenko
haeuser
minfile
cavna
rouseff
sbme
canbyi
rieveschl
teargassed
geyen
yaguas
psilakis
transmogrify
spoonfeed
collegefest
prepubescents
alteri
qxm
xuetong
mouyokolo
walmarting
zionazi
kerschen
emobile
sciammarella
predetermining
riorden
medflies
basdevant
raee
erksine
vanessi
salwak
olivadotti
aortas
gardenburger
zhiliu
risanamento
masterpeace
genetti
postholder
lisanelly
vundu
ramekin
hinzpeter
konigssee
krzyzewskiville
lyv
segert
webworks
halikarnas
pavones
xianling
expoland
emmigration
striaght
mobuto
frease
sunhat
dabinett
caviare
uare
francombe
worrywart
bcuz
milwyn
gennum
lebrons
labbey
tchuto
odamtten
dinging
cannaday
mongelli
lejuene
hansma
epecially
guinnesses
arberg
skydived
spruiell
corval
bernadet
onebeacon
hightened
musicophilia
heikkila
surfwatch
goubuli
mallarach
acroos
chopines
mohawked
tigerish
equifinality
thenew
tostao
eyebolt
vignaux
turkovich
dolefully
cricks
spaceage
viewty
persbureau
proration
tranquillizers
unhelmeted
friesan
reggiolo
cumaru
centralwings
merok
kiddoo
bellieni
vewy
zakum
peladeau
mochipet
layt
globalscape
cyromazine
stoate
gastronomist
sotc
fofonov
losina
ohrp
chirayath
taxidermia
derbenev
neuroreport
hologic
mohebbian
eilber
wanyonyi
rasai
kuramata
keers
rosenhead
svoray
demetro
prachatai
asdi
vilvorde
digifest
flabbiness
bronchospasms
tradingscreen
carleto
silvinho
mesages
lofquist
ralsky
dictatorially
pepsiamericas
hörst
zabir
yealands
molleindustria
actorly
gisiger
spallino
guarnaccia
jetsuite
spatchcock
undoubled
mayesbrook
jellyman
sibos
nnanna
tagliariol
kreiling
dohertys
radiotelescopes
denninghoff
okurut
thirstier
lytel
villata
mosalla
jelassi
novagold
garns
fayrouz
hangai
checkposts
olshey
mchappy
aberley
huzi
aahh
deminer
muzarabani
aujeszky
brassage
langoustine
booe
limnitis
vignetta
noggins
kalins
quoter
gutherie
temperamentals
obarzanek
kupferschmidt
newdick
dudzik
dalmer
cibelli
privatklinik
uplyme
macgillis
sliminess
lasermotive
pelites
joergen
opions
aesculap
dupey
intelligensia
nonfamily
euphorically
vingtenier
coasties
imrf
michigami
epicures
vinen
portscatho
murrary
mallavi
adag
enmesh
akinaga
cybersitter
amicability
esseen
kiyosi
posistion
rjdj
woofs
ligashesky
superscape
stiches
opsource
tiea
kraul
buffleheads
sumardi
veleko
hessie
greenbridge
burayev
bubenik
hywell
faxfleet
myyahoo
bellbottoms
mariacarla
republicrats
demilitarizing
skycrapers
kutol
philoktetes
hfss
wulkan
alcat
gönner
fatwah
messagers
monstrousness
mondot
pentex
megaresort
bioactives
geotargeting
orkis
dogfooding
emblazoning
saqar
chinext
ossur
photosynthesising
bonomy
alogliptin
jasperson
scrotums
onefs
grimsson
lorah
pethica
toregas
shitake
oakport
longtailed
reitinger
dentista
knockwood
freefalls
dorpalen
hidef
pased
golesworthy
hokazono
lymphoplasmacytic
stanilas
overwrap
holers
accustoming
barghouthi
untarred
epolicy
fintage
panforte
ecards
draytons
elfrink
revivial
shaheeds
linday
substracting
mohebi
crljenak
lindani
happinesses
gioda
pithiness
sciaf
nesbett
wilkommen
takeyh
bizo
sniggers
strba
collegeweeklive
sbss
matsugen
liekly
zings
völsungs
caroma
reimaginings
ruskies
caliguri
formable
westwoods
swathing
vnpt
azaouagh
pacheh
enviably
pegloticase
luaus
stutman
kertz
koplewicz
hurre
similien
tdic
niniek
anghenion
seaquake
jetlite
reddicliffe
lilliane
pranali
attact
songgang
tremosa
faming
tsopei
enunciations
dmfc
nuong
alexsandra
vaval
shoehorns
lapenta
greycat
atml
luai
automotrice
aurness
lucilo
hny
questrom
herringthorpe
alowing
interferance
bayhill
soputan
mahroof
laakmann
hoylman
gainsharing
qcn
overdrawing
conspicuum
beiersdorfer
geminid
farahanipour
mcshan
astone
demonhead
wiggington
yudashkin
mooallem
sevelle
davisco
civb
bellanaboy
idress
grard
waelkens
argonautika
bozano
posion
ceber
shirecliffe
giliomee
guillette
trimpley
lepera
curatives
richmonds
epilim
suffredin
valantin
neykov
flaster
panamas
lacefield
rapidnet
jokonya
overgenerous
arteba
flotus
microentrepreneurs
volinsky
graviora
behney
sciora
pilarski
reodica
stiltz
expections
hobbyism
arrearages
almin
kiniry
chockstone
tabarie
mourenx
cherating
queyranne
butterflied
lemasson
warmblooded
whouley
holson
skurnik
bertazzoni
golovinsky
knipschildt
moonwalkers
jeraldine
fireballer
slym
herstik
techiques
deutche
goggling
wangled
lazienki
yellowhammers
swoony
matalib
interrante
worht
unenthused
aragao
somnambulists
affably
selectwoman
pastilong
butterflyer
thickheaded
abouo
sicky
gedmin
rhomobile
guardium
oedenberg
iyman
gemmayzeh
gomang
zerona
wilkus
playden
latchingdon
macur
xansa
mogahed
intralipid
polticians
mushak
tweedside
yendry
rainshowers
olatz
billal
eshaghian
halzle
kaltschmitt
gastronomically
clunks
altermodern
venetis
eyasu
smögen
racepoints
buyable
schincariol
birne
moorlach
ponosov
waddah
hansjoerg
crescendoes
picarones
extorsion
lewander
ufov
govindaraju
banjoko
karageorghis
suboptimally
fractionalized
boudoirs
neverdie
michah
bodeli
overscaled
dudette
birac
pribilofs
zivin
merkato
pasnap
uklanski
engesser
brittenham
snazzier
sakkie
julissi
kathalijne
ebrc
tearjerking
surfaid
hattenstone
euroweek
zilmer
seniora
lutzke
merrey
entilted
belfor
graphomania
colacino
vavae
mouallem
papf
alvarezsaurs
sannakji
caguan
fidow
zoheir
tobalaba
nins
felinski
kaliati
gallivant
fouty
inbicon
kenmuir
ahlerich
jaschan
layde
masmoudi
cluttons
philiosophy
heher
albarino
zimmermans
liasson
freeside
boxton
glennerster
yeaw
partent
podany
kirkstile
doruma
combustive
overpopulating
frair
cloners
landamerica
facep
smts
yakobashvili
migrane
hemat
updo
wessberg
abbeymead
grohol
overett
quantites
avac
fileds
codeplay
commercialises
shaxson
ellegirl
unconducive
descry
cacuaco
girion
bootlicker
naturalisations
telegdy
gianvito
lambraki
elee
gaddaffi
kalinigrad
alapont
phoneix
multijurisdictional
gabali
nynke
prorate
winichakul
cadeo
ansuman
vereecken
harport
faynan
rashidat
wost
matali
yonke
shinh
shufflebottom
mathmatics
archconservative
expecto
massau
kuparinen
ugaas
panchai
kendyl
cronian
critised
postini
gräßle
glavany
importan
rpsgb
wellbores
savoyarde
matulef
interwrite
altmejd
banez
ruched
backlines
jesrani
flixter
zawr
redos
rovia
veeraphol
oorja
bankowsky
nmol
keali
armario
goaltore
cartvale
majestical
karkhanis
yanney
kemess
kheirkhah
unicorporated
magel
nemtin
katsnelson
romansky
padwal
breidis
husnain
infantilizing
chihuri
burkini
znaur
pasanda
weonards
trappatoni
bannos
expostulated
fizzed
chirisa
lavasseur
rapsons
mevhibe
drezen
kinatay
muxima
daugther
dimbola
sappiness
actvities
alook
henegouwen
entabeni
mineralize
fetv
kuczek
seemlessly
hoosegow
iberworld
wenra
mudrak
dubiecki
signators
litsky
kormendi
lewter
brecciation
backslang
smartfusion
trelleck
equinet
delamore
shrivelling
abousfian
nimke
ultralase
hamdaniya
sadusky
redc
mikhel
bayovar
iosafe
cmai
parlano
orexigenic
mandokhel
acidifies
skelbo
muira
ovenstone
gigauri
starkell
stiflingly
leadship
akalitus
jaynee
millichamp
dermarr
tingyi
sunrocket
soulbook
pegfilgrastim
dezarn
oculists
jerseylicious
incarnadine
overfeed
stepen
girlishness
devolo
fortuity
carem
strefford
labier
nhem
kennebeck
ozil
findler
retroplex
disinfects
milbauer
yones
auew
granbassi
sensored
aghaly
moukarzel
mikhalchuk
etelecare
alands
faultily
veline
rlsb
oked
munyeshyaka
accoutable
sharim
wanabee
créditos
imitrex
methodologic
hgcapital
inconsequence
floriston
udovicic
nolta
elghanian
marlana
pescante
interims
elfan
moneyglass
bashkirova
devern
gerova
gerassimos
singlemindedness
kovio
hochhauser
palestinien
delijani
nordictrack
leasers
sanitisation
ansca
winnabow
djinguereber
haselrieder
mcquater
plankensteiner
rezso
flisk
clampdowns
unbuckling
kochnev
uzomah
manweb
shinbashira
junwei
finallists
rubenhold
ibmers
natually
fredom
matui
cadidate
couer
canchola
arbedion
ellsberry
usariem
openvpx
mbiyozo
hervs
biyela
sunstruck
chowen
arieff
anobile
happenes
migrainous
stremmel
essakane
zymogenetics
obsequiously
gruberman
lutfy
cabb
floaties
emex
fuleihan
quaida
jawann
suurhusen
sauvaire
finnin
rondelli
unlived
mabbs
offic
smelliest
drawls
khwaza
thuwal
mujaji
autocenter
socoby
mirassou
kingswells
biocom
dustiness
unrestrictive
ebif
contractural
naghavi
dragoo
pastiching
evolv
lascahobas
deplace
barkos
psda
steuernagel
notic
devincentis
zweifach
faunia
ammaccapane
migingo
angelilli
unevolved
kolahoi
wapenaar
roctober
rpii
milkwall
migaloo
tecau
gimigliano
romanee
mecox
bulhan
bvudzijena
leibo
chainani
towerblock
gaddhafi
plumbly
roumen
doohickey
curtsies
keithen
thurmon
katehis
apparenty
caerfai
playón
jeremain
deactivations
eriez
singita
jennah
sideswiping
nonhierarchical
isdale
lautsi
paranjoy
schmieding
clienteles
tarzans
pillowtex
moleketi
sgfc
wsib
camex
valextra
tubia
etkes
nardos
harrabin
debski
hiccuped
gurhan
sirivannavari
kohrt
hyalite
arlenis
permier
korotyshkin
airpass
sharebuilder
mpio
issacson
zaldarriaga
latek
teching
hanman
giannola
tachia
ajeel
feltenstein
tyisha
sigalert
rescoring
kilolitres
ahady
chanae
moseman
largley
wioth
avorio
reichensteiner
effectivley
gilbern
glycopyrrolate
herbas
delicateness
muessig
closeouts
famour
mahers
schauerte
upim
fcib
troweled
fantawild
randox
microprocessing
tusar
boorn
jordanie
rafalowski
fulstow
subsidaries
mistubishi
odikadze
télécoms
paduano
gechter
pagrotsky
wimbleball
lipizzaners
flouncing
tumbril
interferer
woodfire
murarka
fraisthorpe
finatawa
ermei
speliopoulos
tourre
mccheese
listrac
cushwa
wonderwoman
apet
angiogenin
deafeningly
oluwatosin
shioiri
havergate
fahleson
goreme
experteer
barmbrack
taillamp
shefield
papooses
jauslin
butterflying
fibi
moonblood
desferrioxamine
imanyara
bradlow
belyayevka
rybakou
abdenour
almalik
voicemale
esenboga
sadequee
dorpel
renyel
unground
cheeto
beargrease
hayeses
mobileiron
nymphets
cafi
drizzles
efds
spigner
hoohah
hitchcox
schwaninger
tyrella
wilsdon
shanghaiist
faschingsschwank
miskiewicz
regimenting
mellone
yeilded
celmer
sherdley
gyno
visitar
civista
escobares
zuiderdam
nylan
throughways
ccer
muhammara
azlynn
zakki
noteably
diplay
fktu
lydall
franzino
nsaba
vegetating
panell
indohyus
caslin
westbay
gilissen
bloorview
köstinger
ponytailed
faize
mikeska
carrowreagh
hegemonism
tullycarnet
gamecrush
belmontes
florigene
amaretti
clearplay
kelbaugh
telehandler
kamuda
capcity
emnity
irradiators
rwindi
magentis
kolosoy
michalson
milchem
timesaving
misnumbered
ralia
glommed
lovergirl
ternovskiy
toughed
processer
swima
cutbill
sunwin
ayham
rocquier
teperman
notcutts
microrobots
moiseyeva
blejer
gafney
kerkwijk
mmhi
pellom
schnitzels
anthropomorphising
shehada
extravagent
fazzari
tibc
burbles
ululating
dichato
matchwood
blecksmith
judgmentalism
predessor
garodnick
triptorelin
redrafts
wehrs
insubstantiality
galacticos
peugot
harbash
mgib
lastres
bensayah
kloppenberg
mandaree
shread
ahdab
cein
ordoña
guangpu
hallgate
gerein
simester
peevy
harteros
ambanis
psykter
rejab
zuhairi
morkos
succot
haroldson
chaddy
prath
peevishness
bodnick
wpte
abston
silveiras
dysport
dammage
smolansky
identita
mizengo
fieldsend
laneast
dirgham
laruffa
shchennikov
huafeng
mosaicos
desalted
tokyoites
kesal
nesv
navnit
kumbala
hatcheck
corporatists
imanaka
snouffer
reev
expereinced
voecks
crissie
dispraise
medpac
hussainy
rumenov
taffetas
entrepreneurships
banio
loveseat
hawthornthwaite
caragabal
topdog
misezhnikov
glaceau
efficency
cammo
instution
pfanzelter
beeche
saltend
menkerios
biondolillo
lamost
soleas
keydata
makwan
alladale
rebarbative
blobbies
koutstaal
bicommunal
dowjones
nonideological
chartwells
coinfected
ugma
thecb
shlemon
susurrus
presurgical
jaelyn
corruptness
showplaces
bamberski
falks
unendingly
kopke
emanuella
blackwaterfoot
rhes
sandrow
apocalyptically
fcfe
fripperies
climatechange
najdat
masic
rafeek
caseley
mumbaikars
sportcoats
rürup
bursten
snowpocalypse
tofurkey
suramericana
fraternised
underinflated
dcal
meiller
barchiesi
flöge
ennenga
shortterm
coughtrie
vergin
skride
bonby
tajzadeh
palino
isilda
patriach
petrobangla
energix
sauipe
realdvd
medit
mistruth
iprint
eink
lavonda
wurtman
spfw
jnpr
suiciders
illela
unopenable
hirut
prayitno
goldleaf
nduja
unfiled
faudra
shantell
bluitgen
unadoptable
baytsp
telemanagement
hussainiya
deruiter
forlan
rebok
matchmake
horsepowers
adapidae
maccurtain
leish
daille
kasinga
gronwall
foreseable
waingroves
boyishness
higlett
concertedly
amerli
lisogor
evencio
berdouni
paggett
effors
weathertight
fawda
corporeally
mccambley
xiongfeng
panks
mcadie
srinigar
pacojet
cabdrivers
nbac
wrighster
asterley
featherlight
possesions
sigale
accessorizing
thrombolytics
distrubution
whinger
pubwatch
myplace
kyzer
aganga
sedney
etchart
zigiranyirazo
apcom
aracelis
blanchards
ookie
mediocracy
figh
markai
cotehill
hycor
bollworms
ratanga
adriean
rizan
slominski
neurolaw
gialanella
duloch
soliani
universalised
furless
orizon
euthanased
citera
jeffer
jellyby
namy
massoumeh
gladed
palivizumab
curam
rathana
glezerman
avorn
feeb
carmelle
gianaclis
preplanning
ofex
ciolino
meisenbach
hyakuri
peisey
brrrr
pharmacyclics
ctts
womanity
incapital
filthier
balivanich
rbai
shulian
chorines
jilali
nyhuis
jcvd
arcenas
campaniles
heriditary
dnx
tacu
catnap
leagas
strengh
liquored
accame
adeena
invidiously
casscells
hirsche
liubinskas
hamastan
pricewatch
saccharose
mindark
ulh
brandbergen
provincien
warheit
tewary
jugendamt
kwashi
immiseration
massachusets
wgnr
moszynski
karake
bogorad
marcelis
mwiraria
nocino
hammersteins
eljay
avowals
declaw
interpretively
guangqian
omischl
classiness
rollino
hasnaa
unperformable
ntic
drycleaner
varvitsiotis
litzinger
jirtle
excellance
shavendra
nipponkoa
stradbrook
skretny
prii
lassaad
sanjaagiin
horsting
bouan
squamiferum
touchette
waterlogic
wynds
spaceline
outpoll
soliel
tarisa
avoidably
aignasse
mpxpress
nontheatrical
clutterers
gribetz
oofnik
bensko
tastelessly
loamanu
postracial
rugmark
epicness
striemer
speading
lepu
driftnets
ashikari
lanerie
healthone
worthman
fresian
damart
livieres
mcluskie
telvision
baghdadis
unclogged
leftfielder
spindoctor
sercial
kitahata
clusterin
egozy
bruco
roubik
aveeno
maouloud
distrubuted
songlin
qayoumi
asilisaurus
estefani
dowdie
natonski
asanda
goset
rosapepe
fructify
lambertye
loiero
bratwursts
elyea
gargash
benotman
alecs
nightshirts
greenend
osanloo
jentzen
cortazzo
jaxer
colur
bimkom
mangalica
xiluodu
mahesa
craignaught
stoneburg
boulat
satsumas
sliimy
snowflex
sentimentalizing
lapwood
clanks
lifang
mectizan
fromanger
pradelle
douch
kazmunaigas
cylc
mdsp
alternitive
flamineo
lilico
ayalas
mighall
gafas
dorfeuille
kwegyir
samodurov
spayd
clott
procup
strumwasser
brotzu
rabbatts
outsoles
cubavision
kaminkow
chengyun
serfas
clopyralid
endsor
mboungou
irrawady
purposelessness
culty
domainer
eligibilty
kronemer
bannis
merifield
dahyun
fruttuoso
saudabayev
galev
nbta
efrag
demary
gerston
savviness
patson
lockhard
baugo
superciliousness
turrall
johe
acsis
laurs
arive
haselbech
hoarafushi
bset
millichap
ecause
brentor
chauffer
devoteam
plasmarl
kkoh
hitco
maybachs
magagula
regenstrief
estimo
brtish
korphe
sumidouro
ofari
guarinos
jonetta
dorka
breadmaker
mekonen
ensenat
czwartek
eyeware
kraditor
taktshang
kanaskie
simena
agedashi
lamendola
mcaughtrie
weaselling
cinamon
deitche
weterings
llanerfyl
stanfields
priyadi
motomasa
witmore
zarrouk
scantier
kenkel
shiona
wilmart
leachable
truffled
versionone
newaj
feigenholtz
interoil
unaustralian
stralman
waagaard
otera
ntma
titilation
benahavis
bouhali
texi
sarsekbayev
sukhpal
marksaeng
gargles
nygh
owlish
connot
taghazout
tinks
onanistic
punchcards
penhallurick
drummey
undercapitalised
blogtalk
collegesports
gimlette
visk
pannur
radlauer
sauciness
antonellis
micoach
nondisabled
husavik
faichney
seliverstov
blanchardville
chochiyev
thumbsticks
clemenstone
turnabouts
dibartolo
embrya
purposedly
steenland
hogsthorpe
cidac
glamorise
sccf
emasculates
rudolphs
kalatozishvili
lfas
maimings
quetzaltepec
relize
huarui
latson
vanzi
pipelay
wrxs
orad
syneron
reweighting
raramuri
dgen
trustedid
mwapachu
mullowney
hius
milions
undisrupted
shirlow
waili
brondanw
enrapturing
mobeen
ingy
intracompany
jolliest
algalita
iannarelli
recultivation
musnicki
clickstar
iakovou
kiehne
palazhchenko
gorik
alvester
laddha
idealises
chinesepod
helibase
nicolazzi
recapitalizing
tenaska
discussio
achievability
infratel
caiden
ronalda
katuwal
cetrorelix
abosch
medrich
rothienorman
goyas
sanquan
batmans
yuanta
reweaving
degrand
pirogi
streetcred
allcomers
distracters
tuiga
centilitre
cytopenias
palestinain
srebenica
eulogists
arcserve
ungentle
chicom
perrons
travelbag
freespirit
suler
nigrelli
dunderhead
horribleness
ballyhegan
shomon
boxbe
miniati
inoubli
teeba
mayhugh
bemuses
dolez
buffelsfontein
evista
appropriator
sturgell
ponv
reraise
langlee
fremon
myface
javea
convulsively
exelixis
fickenscher
fangzhuo
schastlivy
maunalua
enform
vinohradech
hjerpe
causon
tortelloni
neurofocus
multimineral
isakowitz
culatello
besito
poping
hemakuta
dapdune
sleddogs
hawkswood
dendê
freada
frailer
brydekirk
earsplitting
electrocardiology
tucknott
oracene
kucharik
bodeker
sazon
embarkations
gandanga
bedcovers
nect
frankens
daggash
glezen
jananayagam
taichiro
piñeres
mimobot
peterside
ongame
worgret
sedoc
indefensibly
chemosensitivity
eghan
zlotnikov
shopstyle
biocchi
sinoti
cuttance
culverhay
thngs
bargery
giggler
lonesomeness
stuver
westlea
characid
ctam
sukow
alemseged
packetvideo
skiiers
strope
ecohomes
skoula
dissoluteness
shtarkman
schain
rmds
serramenti
ocaranza
marginalises
falconhead
backload
bennite
malefane
adino
schoolbags
fairham
tizz
tatelman
russh
abdale
krima
chhatisgarh
miragliotta
enterra
dabancheng
kotsopoulos
schumachers
fatmata
mednet
cadivi
fairthorne
nympheas
mariatorget
hollydale
shivek
scarrott
postimpressionist
gutteral
maber
rostenberg
dawran
cumbersomely
eilde
xjl
szala
koshary
canstruction
faling
seachd
loppington
jheryl
maliau
yenque
roae
ravani
bittker
surpising
inartful
machipisa
jenike
ipls
leic
sielaff
instrumentalisation
kaec
makeyevka
qeb
ikanos
qingyu
autovaz
jacksie
dissembler
pickerings
cherishable
ellalan
duplat
mauricia
druggable
murshad
huffner
christeson
laughead
recommerce
egendorf
carancho
imaginitive
wendice
aaaasf
blitzkreig
riposted
safie
occular
peelable
perold
monforton
jeannemarie
deping
walentas
hypervigilant
adularia
idou
strachen
mislow
mmlp
samuelle
ledburn
apolgize
skatelab
karfiol
balqees
sfha
okaukuejo
specchia
dulcote
levaquin
pallino
wayleave
macmorrow
feakins
lefkaritis
ʼs
assasinate
naughtily
jebidiah
nightlong
szczechura
solterra
asagai
krege
seond
ekahau
margalis
esquenazi
starquakes
perence
pressé
searchinger
desal
betra
nejdi
honickman
sorrowed
sotp
homocide
citycard
priveledge
velayudam
eicta
causeing
hammo
krever
heroles
thint
artier
samantar
pacot
clotfelter
gadirov
hasset
lottridge
coombeshead
miszewski
frnt
episiotomies
concilliatory
sitanshu
krasney
nexavar
safework
djurkovic
hypponen
boyling
pembertons
persisent
appletini
remonde
palaikastro
redgraves
qabalan
yenesei
fulfiling
olsons
rackenford
automaking
colsons
shander
nairas
riechers
irongray
plateful
stumpel
ezratty
gilets
kelloholm
fishler
fedderly
geurtsen
ownself
paillou
lenser
wojdan
throughline
snowshow
shackman
marwad
zarchy
resiliant
westermarkt
coachable
lanlard
kenspeckle
fistic
brasilinvest
captivatingly
manicuring
aszure
spoonfeeding
spoofy
remotec
vainest
plaints
housebuster
nilli
cellufun
anandita
manjgaladze
gudelsky
khaledi
osmometer
xtabay
torcross
classlessness
rbds
cerrillo
caisley
lashkars
fonkoze
martillac
kenechi
glenrosa
aliph
ilangakoon
calfrac
talling
slemrod
peterstow
shnur
midweeks
aidablu
benneweis
rumsfield
muqim
airikkala
rittenband
annouce
biorepositories
rogat
zetts
feixiong
huskier
soleman
smedegaard
mcphilbin
nvca
dustcart
ladyzhensky
kismaayo
philinte
moelleux
synbio
eldrin
vargen
schoenbach
rawai
besian
aleviate
blueray
yaoting
concelebrate
adebibe
mortein
magliocchetti
nrse
hartmayer
garani
lupberger
frownland
solandt
aftc
kwol
irag
eaiser
takatof
unpc
livigni
oberender
superweeks
gwanas
gilver
kelya
gentilhommes
buttonless
balletomanes
clanged
glunt
guérot
eurocrats
rowantree
sadaam
charltons
falahi
kavsadze
vereniki
snobbishly
perpetuals
sunquist
holzhauer
murdie
saggaf
afft
reimported
gensym
dastageer
offshot
krumnow
kensie
earbox
howen
amnuay
kirland
wardropper
brelis
golnaz
gousse
clammed
scotairways
nelyubin
bifold
hraun
gwennan
svenssons
flwyddyn
hollyscoop
tuesley
aljibe
mumpsvax
scambio
flomax
apoligise
muharib
cerminaro
zomboid
mainlining
incurious
sakeasi
vaul
monnezza
sanguino
sureau
overrating
misclassifying
catchgate
sheinkman
demurring
capricia
allieviate
creppy
paranormally
nadjim
amtrack
predicta
ewanrigg
huegel
ricupero
aimc
kasotc
garthamlock
macknik
dalayman
refurnish
integralis
lawrentian
akilisi
mediene
aldates
nadex
thyroids
hopeline
korian
surveilance
mobilisa
assetz
juulia
palatschinken
slaiman
trasolini
cumner
underachieve
ponsero
joergensen
dovel
compaines
rimadyl
tazeen
celebrini
putins
tocom
alomst
talh
esele
worldwise
waterwatch
swiftcover
reasonover
kradjian
overiding
weinhofer
vingoe
ndep
fleurival
loudeye
freshminds
laoula
cornicing
tajikstan
nadoolman
icesi
sieck
footware
niebur
mastergate
qurani
tamanisau
reults
tabaski
gorane
dawick
mulsim
glenmavis
potables
rittmeyer
granzow
faer
fengchun
maingate
pentregarth
corraled
silverhawk
scotlandwell
coem
leineweber
sterio
abydosaurus
mengcheng
calculatingly
goodramgate
latté
sanctimoniously
alyssia
albertbridge
aperion
aknin
maestripieri
catholocism
onlies
saltiest
mathwin
bladelike
tailford
cononish
arond
evotec
militamen
smollan
zemlin
possuelo
prigs
underexpose
mchedlishvili
laparoendoscopic
gettable
seides
imediate
bahmanzadeh
hanwang
samadpour
boyn
gruop
xinmi
ventilla
kriewaldt
praus
critcal
encash
diferences
ojiambo
ahmadshah
amendt
freakier
tuder
dcpd
wasifi
crowbarred
rubbly
cpnb
hoheisel
gatluak
outdrive
crashingly
fullani
bournigal
fiesty
undocks
kazmin
echakhch
whiterocks
sparkpeople
submeter
wondrich
worapoj
smashburger
tipsheet
businesseurope
understan
andew
ruusunen
gunnoe
pallasades
cressex
zuckerbrot
novolog
attornies
stemnet
comani
peaple
stakeholding
mannahatta
stuhler
dzidziguri
ethnie
elyaniv
liberadzki
ramalama
preddie
cassley
carmenita
jitterbugging
hrinak
dataline
specialisterne
hogel
shaquelle
munah
rubisch
draghounds
benazepril
valeting
qbic
plymtree
tidespring
spco
unstarted
sharghi
feedstuff
hegglin
shashiashvili
aiqing
gurrutxaga
langerman
chappill
ahlenius
launey
grunke
mahahual
tassles
baqee
hashmark
kofte
mufson
jerraud
maithan
perking
verkuyl
depressurizes
tweezing
medjuck
brantes
uncorking
demoura
gnvqs
jinshanling
bushs
artfire
unnuanced
bowlplex
gappah
froghoppers
assuit
paciullo
lutzner
facg
jungled
superflash
thiksey
keiding
lkq
djhone
pavletic
mmbbl
coric
mobilians
atie
kalyakin
vigilent
yanelis
rahua
chevvy
stengle
intini
joseloff
freydank
muncrief
carkhuff
senomyx
gsfl
grussendorf
iwebema
cairos
rouyanian
outragious
urdahl
negociant
moobs
chenot
indh
serwan
waines
suers
youxian
cullip
smokejumping
krustyland
lixion
disconcerts
kolodkin
zouaghi
picerno
ciboodle
chinnaiyan
ifight
dimentia
coronagraphic
rednock
geerhart
fafen
teruyo
mascarades
moonpies
wriggly
mallomars
exceptionnels
nudger
ufap
mumper
cyberweapon
numbersixvalverde
vtti
aydilek
sunsat
lieberose
umos
cancelable
eackles
urvilleana
rompler
grouplets
bbowt
testily
souhami
clifft
ingestive
ruthanna
amimo
cognovision
stonegarden
farouche
khadevis
pajcin
danionella
tomory
kibitzers
prettying
selover
bookishness
leroys
schwizer
homeister
hatzigeorgiou
norenberg
natexis
unimplementable
tashkin
receving
sewering
dinatale
growhow
angkorean
avonbourne
janaury
calsonic
merkies
jernhusen
outlandishness
moeletsi
cubillan
grenot
choreographically
dusc
taquerias
headstands
habab
calang
morphey
iyayi
oktob
mallaiah
ezzeldin
xincai
gesticulates
hantuchova
presidentin
pokéwalker
zovirax
trividic
fogeys
mahoningtown
pentathol
landhuis
kahramanmaras
elyce
disporting
graverobbing
jadelle
balthaser
gobbledigook
yaichiro
imprezas
whackos
nosepiece
unspeaking
blardone
vixia
nazal
isnr
hayesfield
referals
tamaiti
hansley
peske
euthenasia
neoware
slivovica
mahelona
sleekness
shawfair
stanekzai
kurkowski
ramrodded
nowgam
drijver
cheesier
berba
potempa
avero
romoff
protess
apog
maruto
sawaf
tjia
overnighted
novakova
duboin
yld
jacquline
zerkin
diciccio
fuelcell
sorena
verao
awwwww
coomarasamy
greenhalghs
humalog
fanson
bobyshev
semipublic
torygraph
baedekers
gready
chatlogs
rosenwasser
shamansky
yiampoy
narochnitskaya
externalising
zongliang
dobrzyn
surewest
stupified
blackawton
hiban
spse
mameshiba
schklair
maviglio
profitmaking
datone
karpati
ecobuild
mcsp
trigem
neison
outswing
klutts
brasel
axcelis
coptics
superenalotto
paleoecologist
investible
terzigno
zogaib
faife
yarhouse
darmin
thabethe
newfangle
hannspree
magliari
wendorff
nesdb
ningqiang
swanning
girassol
pettes
greasiness
ncpd
albet
nikolos
skopintsev
nohilly
onovo
odundo
fleshiness
daccache
ceglar
arghandiwal
marende
hoomes
mccuddy
branley
darawshe
innsuites
kronzer
unsuprisingly
bvca
tournaire
merzi
whannel
longar
sponsers
sensationalizes
olenick
symptons
sufen
arabshahi
irga
leiserowitz
meltingly
knörr
griffard
disinvest
wakiihuri
domincan
dewsnap
wileys
gosselins
netsu
perserve
nuture
remould
rbct
laforte
tscc
thinkstation
brichambaut
sasch
berkun
divos
tsikurishvili
qihua
sallenave
pendergraft
supernotes
sliderocket
sayyeda
bdelloids
headen
reijtenbagh
atflir
steinacker
pted
guozhu
enterprisedb
bersoff
crampin
vareniki
daril
nexentastor
rokit
montbrook
vilardo
bursík
tazers
norries
plested
seeram
sterilisations
hallem
lormel
mewett
baqui
candlelighters
gymslips
pahlka
lukomir
ponnary
incipiently
geohazard
questionning
limpness
ketteridge
surfcontrol
materiels
járóka
paisleys
rehypothecation
elladas
schwendeman
milksop
yobbos
baynote
rosehips
magrez
kalaris
squiffy
nassari
touitou
gvaramia
nefedova
montañes
bowdens
raihani
jonzi
heagerty
teneyck
socheata
jiraporn
kipipiri
lonestars
amcas
cpdc
heitzeg
fmes
shivalingappa
gaurds
pollis
jenrry
cassileth
giorgianni
reconstitutions
roshek
frogh
unchivalrous
kinninmont
mnemba
incu
consentino
mikari
tazed
csincsak
demutualise
sledders
earthshare
scruposa
visia
consious
heightism
oriau
makol
bingum
tailgater
profitted
porage
vmtv
cagy
geekiest
srichand
gudim
johnnes
speci
cheeps
hauri
sheirer
disapprovals
molyviatis
mbrg
invs
hyperreactivity
nieuwstadt
guidepoint
inmarko
foeme
bodymedia
yaneth
smaland
commingles
cepii
roskamp
mischak
sekouba
cravinho
ahpa
kuschynski
hatchwell
morilov
aromatherapist
elanco
medupe
destocked
profero
hodzic
xiuping
mtagwa
igmar
balčytis
keme
uahs
chassay
jacolby
efectivo
whinning
paderina
syabas
unpresentable
tromethamine
juls
wasikowski
ballig
audronius
sophmoric
impermissable
posterboy
zavalloni
paricalcitol
mcgills
herlambang
mtpc
lcvg
ubermensch
dalakliev
globalwarming
tjh
ohtsubo
mccolls
frikkin
newsarchive
zaiem
surour
postwatch
klecka
dragages
wingding
obituarists
yaneza
hipocrisy
songlike
acridone
cslf
titnore
eyf
mushangwe
minidisk
cioppino
shelda
gwaed
papur
guileful
melitone
wicaksono
fiana
wizman
comfortability
mizens
expences
nfps
teodori
jursa
overexerting
perambulators
chedgrave
kyani
toris
myfoxdc
binski
kohona
slapdown
dramaturges
williamette
alakozai
dutifulness
bnim
gradated
malera
eyermann
neidermayer
zaoua
pyschological
meditec
taiseer
bougainvilleas
kibungan
anglomania
mcalley
dawasa
windale
anhri
cyndia
altekruse
locums
capmark
whittal
zillionth
essek
baynunah
seguiriya
kvisle
somemore
jtwros
whitemark
cedomir
cortizone
galou
létard
xenoposeidon
biomacromolecules
mcgauchie
mijail
verbalizes
hizbi
smyre
tuanpai
kuruvchi
ouzts
mawsley
yuanxi
ungheresi
kapasi
aesha
garoowe
undisplaced
harrover
bagmen
bezemer
pageboys
saout
campkin
reglan
scalin
kount
rohais
osklen
karus
besty
summiteer
ramoneda
lassin
skybitz
belim
lairy
idith
maisanta
erikas
pfeiler
superpass
quirpon
leutenbach
wormery
eousa
torchin
millito
bootees
vectibix
ruchel
dednam
felcsuti
belkhir
whispernet
wallenbergs
pipefitting
anaika
anchalee
nanoworld
cbra
anybots
baiardi
koutoukas
unaccommodating
zdanowicz
bexington
manouevres
imbiber
qaddura
lituma
mascarenas
beneski
tablespoonful
digestifs
biryanis
rubbishes
nonenglish
rolax
staggies
kenwin
wychall
melanosporum
mondoloni
annihilationist
recoupling
birchleaf
snarked
qaemshahr
toledos
tacori
interestin
laquita
romanoffs
sabillasville
terroist
nuvu
frischling
hushion
manciet
magnetise
szabelski
rexnord
sidorkin
dayem
sadvakasov
byggmark
spiropulu
etexilate
smokable
gearshifts
lesiewicz
derx
preregistered
hinatsu
crostini
avello
hearthrob
sarukhán
akhmedova
trigell
boquerones
kassman
wilderstein
aabout
unicel
schupak
legitmacy
breasclete
uncompassionate
macecraft
panasas
lonyae
ojek
turbiville
locksbrook
ardfield
xeloda
dunganstown
alinean
baggages
lurling
hardins
bringham
slackman
flackett
pentalina
heytvelt
aoukaz
lockner
pacificare
goldgeier
lightshows
debrided
gonadotrophins
plagnol
burdale
verryn
wholley
schelvis
managerially
interestes
mazoka
tschütscher
quantros
sirm
rarin
inviolably
siejewicz
emerman
ransdorf
heartwrenching
oplc
eihi
allbut
tratos
dykhovichny
droukdel
chapatti
adirondak
sarcophaguses
guarascio
swoozie
extensis
schweickhardt
pamulang
merigo
jfsc
scratby
araud
labaf
mitulescu
severeal
hoguet
grandpop
worldperks
borsheim
biocraft
sekurus
herodotean
mewshaw
brovold
cannt
worawut
nomadix
ramonov
dazé
peascod
kopczak
lorra
cockily
sanitec
medl
oceanliner
crtical
orsedd
baqr
ghoramara
suheila
baldeon
trioval
turnstyles
adwar
interglobal
kerekou
impared
tilery
gracetown
imerman
combover
tuffnell
pacala
bolay
smutny
genson
nonparticipating
pasman
nyamwisi
hectically
versys
bridgemont
huckstep
pesc
cosmeceutical
firminger
goodboy
conviently
jujie
lortab
djelkhir
kobeh
melipeuco
kearnan
degrange
unsensational
garantee
libéreau
miscounts
bassis
baochang
maydays
elfont
greediest
salesrooms
sovreign
emminent
achaval
snits
insequent
kineticism
jashanmal
bemedalled
lalt
libassi
seapine
achieveable
dooey
intiative
pedelty
elkinson
seluk
hty
afalava
brulée
narcotraffickers
styrenics
swadi
setliff
myelomas
affo
maliwan
lariam
outdueling
mozaffarian
isscr
zuddas
fraioli
soppressata
mosimane
unattainability
patalano
hackleback
rogness
mitler
bickersdyke
trencrom
zilka
electio
nodine
kapò
uspga
haot
currock
kowey
marinaded
gurri
gurpegi
ebita
schoner
telum
jamame
hedric
zhongxue
deice
apwr
zetterstrom
ereli
nlea
trendrr
ariaal
daem
pleinmont
overanalyzing
ungass
fatherlessness
carender
simphony
bogacheva
taimani
naksa
enrapture
gurvansaikhan
heroique
haplin
inama
corvaja
grillers
teixera
betzold
superfruits
mastrud
shihhi
thanya
pellum
nivose
rebidding
pinvidic
zichal
overmatch
geiges
grassia
usajobs
barmulloch
levisohn
grovels
shadiest
ehuman
gagon
jowder
belani
tilmouth
färberböck
bukantz
nowzari
ryers
rutterschmidt
anacap
microcitrus
shentel
cutup
rakhman
riyadus
hickin
avolar
upasani
amyloids
ibéria
mpsv
tlab
sahwat
eriswell
szele
risottos
amzallag
bombadier
velicka
knouse
indepence
audobon
cassidys
meletis
biostatisticians
sledder
stranglings
maggay
internationl
ccgi
hryvnya
zipit
cmfs
interwork
felberbaum
denak
schonbrunn
exorcized
azafrán
tabd
nunno
kallström
cranmere
hirliman
hellekant
dsam
pilsners
cocaethylene
jhoulys
alipov
westmann
berce
sickener
akme
txtmob
khatija
ghastliness
calyxes
vehrencamp
inidividual
uzumba
zarnowitz
tarfusser
neafc
inovative
marshevet
nebuliser
srrt
tirofijo
ghalam
roovers
factae
llovet
vasiljevs
jiefu
memecan
landahl
riyom
parellel
unscanned
molodkin
seesawing
navaros
igougo
konenkamp
popeo
repilado
clerkland
bioquarter
longmenshan
hessenford
trematopid
muqdadiya
nonin
berryfields
kranju
methamidophos
mistatement
globix
iaccoca
sepeda
kaiserschmarrn
opeb
rhyddfrydol
bhangal
azmir
scailex
delawar
bwakira
caseosa
glied
madisha
naftel
publicty
podolny
twelth
dishu
professio
hanao
hemstreet
hinant
supermans
nimotuzumab
genewatch
fooey
santus
onthank
mauby
pankratova
fractiousness
maisi
galbo
eint
krinos
migranyan
lukats
raeth
iscan
miscik
latinobarómetro
magnell
collectiveness
accountably
stetz
europolis
dowloaded
botmaster
gobsmacking
unbackable
fyw
mandery
incandela
thribb
tooway
gorfodol
tillsley
bouzaid
blatancy
ahcc
pactiv
sectretary
underoos
berst
frovatriptan
kilolo
bifengxia
feuchtwang
orlovac
dhoki
helda
inspirers
arvinmeritor
melahat
lifeguarded
gustinetti
duac
suebwonglee
transistion
kitas
ifund
markettools
fussier
playpens
streches
martinsa
hastier
simulataneously
shaif
lewand
denationalised
tokujin
humaidan
balluch
fhit
hensey
gallardos
genarlow
aranoff
retentiveness
zeyad
lautem
yantorno
yvonna
caracortado
aresty
slinkys
porthemmet
gurkas
hampreston
trannie
mudbath
soasta
bécasse
closeby
subpanel
kissogram
athwal
ghising
ptsc
meshoe
bentprop
vafi
limned
rostang
ajuste
ambulanceman
dthe
vinnik
culicivora
indigineous
senges
khazai
schomerus
fastsigns
vygon
ibase
maddiston
lockerroom
gorick
sagalevich
bedeck
dualeh
bisheshwar
hersha
kaboose
moghimi
ingosstrakh
yannan
mudawi
rheinecker
expelliarmus
upturns
medee
kaaot
vidala
scheppach
atheel
subindex
dishearteningly
faild
arabba
medicalizing
jahmar
cussins
gisagara
nahabedian
arnhart
dragonwagon
nonexempt
ellcock
suedes
unreturnable
peckhams
bigmachines
katwal
drewiske
tuxes
mophead
noncircular
sathyavagiswaran
silab
moldonado
tidemark
overinvestment
takhe
sfwmd
sconser
handballed
goldex
fossetts
collop
magomedali
kupers
accurev
gabbing
lozes
ebookers
tetiaroa
swanagan
jevremovic
cbssportsline
slingplayer
csmfo
ifsl
wikholm
glicksman
brynu
alnajjar
sukru
georgetta
theodoratus
locharbriggs
vandebosch
nullis
dinitia
environmen
tuckered
anthros
saleve
unmasculine
savvier
vendex
fundamentalistic
awino
zeibert
norgen
momanyi
diiorio
subira
otherway
ultrasone
cyanurate
roedy
conisholme
viget
invitiation
matalqa
shoeprints
bmgf
firelighters
gurria
flewellen
tuaminoheptane
sukkwan
competeing
forbore
rweyemamu
sangean
alieva
rentfrow
eurocents
kohtz
colaneri
naggingly
kintnersville
janulis
chuluun
jampacked
roadm
antitussives
respironics
suzaan
yoennis
bdkj
theirself
abdikarim
arow
elmworth
oteh
saniora
pronexus
mangou
mephistophelean
yingxiu
ophthalmics
rohrssen
vitrol
wittenbrink
kalisa
pfiffner
menteshashvili
nwoye
nbgi
scottevest
youtrack
holdway
trotanoy
plenel
meatout
whichis
bctgm
thirdhand
heii
heltzel
mwewa
pelmet
langsdorffi
elleke
oberacker
jamychal
twitterer
nances
impella
cummersdale
mitting
excrutiating
grieser
cailleteau
weinfurter
demaundray
gorsel
eocarcharia
addlethorpe
wajma
khamanei
strengthener
resprayed
mingwei
topcroft
medvedow
ulsi
unignorable
zevran
brandejs
abernyte
kules
tavui
daulerio
intellichoice
klaers
mokarrameh
ccsso
doneger
irabagon
midsentence
unnat
institure
bradburys
soapmakers
genart
moaa
preds
normalises
shvidler
jackon
unsubtly
rayburns
electrocomponents
pravada
doreena
cackled
burban
unigol
iriekpen
strummy
waaah
hanzaki
orobator
kouider
zelika
shallah
acores
phoner
chronophage
deadpanned
roybridge
keidar
housekeep
unsocialized
ibbett
schinkels
baraawe
rasiej
kasoulides
holweg
macchietti
jiverly
resoundly
eacho
mnuchin
logocentric
aviculturists
lineberry
tafreshi
smartpass
kwalik
chattiness
robbrecht
witheringly
discomfit
nistelrooij
mcaveety
flymo
snappiness
deklerk
breguets
gronant
kabalan
itsunori
religeon
shuldiner
avenatti
ravenscraft
instranet
punking
settlin
modiba
pastides
thimbleful
eurocommerce
overfamiliar
qliktech
beleskey
ikee
anacor
snacky
symo
iparadigms
soerensen
errin
sharow
louisianna
edmier
yday
josanne
ruching
sinokrot
ferer
hurezeanu
ockwell
oathall
biegun
intermodality
jorisch
debasements
mcarthy
mardirossian
zubayda
bearup
pizzaiolo
karpaz
ghigliotti
nayna
umbehr
megeve
elwesii
aclidinium
onhand
klie
soltanieh
eular
spehar
nowcom
itilleq
steidtmann
barnets
medvec
mirium
dulio
posturings
biodetection
thinkg
feamster
dourly
roubideaux
reinjuring
deathbeds
houte
speechs
dreadfulness
moskitia
flytech
fioritura
soniat
deifies
digsby
multichambered
aubourg
primaried
kawneer
seamicro
quadrantids
dresse
jonnes
crosscourt
prescher
perondi
sweig
palkot
unwrought
bagleys
recanalization
cestria
aluizio
snakefish
campier
captials
mckubre
talanted
dayyer
unironically
manipulativeness
moorclose
wozza
popguns
everyscape
winings
treacey
zhihe
doninger
bluerun
pivnice
geraldina
clcv
crdc
onforce
friendfield
bnic
gushingly
fiances
neovius
tsakhia
nurhayati
trehane
lachelier
mayersohn
schafernaker
cuillier
grenham
holmrook
francaviglia
lashawna
accessorised
rollick
yunsheng
corkboard
symlin
baylay
angelfood
orgalime
florbetaben
chiem
vroomen
goip
mudbank
mojahed
bolnick
rieken
fatique
téléthon
ecodiesel
kentz
blairdardie
wonderkids
owlsmoor
cerebrally
qualles
valimo
bradaigh
lavander
sextupled
senseney
adpc
greencroft
lauched
hopke
mmfm
battlewagons
yedoma
turridu
signeul
essaydi
carumba
manook
rafeedie
berendo
epay
khomein
overbidding
dodn
souheil
replenishable
jové
sandelford
gavil
mponeng
zieve
upconverting
formanchuk
comissioner
lesiba
caturday
shefrin
anggraeni
ellenore
alshamsi
spiderlike
faliure
darlo
bruar
baoliang
stiteler
khushali
franscell
oinks
estupinan
spigit
kleptocrats
pivni
unbeautiful
detamble
demonstation
suie
sapina
horowicz
hardmen
sicap
tubbing
wigtoft
neuringer
splurges
streetscene
tantular
betokened
ecris
qianfeng
policylink
vyarawalla
entrecanales
marefat
werfenweng
macconnachie
okonjima
defusion
mayben
tittsworth
smalldon
wolfard
futuristically
volkens
folb
pillco
collecters
debon
rubbin
kulakhmetov
monachino
cachuela
butterkist
rolta
misheck
unattentive
quilombolas
deria
tattooer
chaurasiya
trescot
helderman
uncoils
murriel
undaria
gogear
artsbeat
pelisson
finagled
schiffahrts
starchitect
hisser
lousaka
baianas
nerdier
hollitt
dorokhin
sensable
xenios
houshold
schuttler

taxloss
demijohns
kanasaki
marakele
gostowski
headhunt
bazzy
destabilizer
kirwans
scoyoc
brunnstrom
slipman
moritorium
siddikur
undeployed
challandes
bernardins
stepladders
ladling
metavante
portmouth
guertler
nlets
okuwaki
kavoussi
konterra
keynsian
etblast
ceoe
lapiro
whooo
rosema
erikssons
ducille
cynuit
saretsky
progen
paraguyan
motele
devaluate
poutala
winterkill
fastpath
danesmoor
bucklesham
lectorum
dygalo
meatheads
cratchits
kanel
pitham
ciurciu
multrees
manufactuers
bonesman
utsc
greengages
zumwinkel
bayhealth
guilelessness
niiler
limpidity
renfred
evpl
freee
opmd
jawole
capoccia
envirofit
anitelea
nelan
nimbu
tahhan
reconcilliation
bitkom
zepagain
caronna
electicity
thummer
refire
canestrari
wahiya
grumeti
checkdown
pacan
baltray
athttp
dotasia
tencha
sadangi
snptc
forsbrand
fungurume
corbusian
fehrnstrom
savuti
dryish
lurn
huvafen
strangulations
uchea
seagirt
artrock
nawbo
pantywaist
werx
vandamm
cajani
skana
mugaritz
diabetologia
aïoli
luebbe
cretz
laroquette
heireann
lumpp
kanit
lamel
brenco
octeon
pharmasset
suwalski
universitie
whoes
georgoulis
schurrer
zafferano
eejits
shaggier
ejiogu
thandeka
fuzhan
minohd
ddtc
ecip
herteman
aravit
pragna
gonsalvez
hathwar
izat
whimpered
wiggans
hotblooded
sapard
desposits
repik
krengel
cedillos
tarries
vetinary
campness
schackenborg
lowed
mocny
whalid
feeks
grabarkewitz
paravati
bortin
midgame
patroled
peloux
gwerth
unmailed
kunick
regualar
heatherside
nyeu
capaldo
reichenthal
owie
brassière
pearline
crotts
patongo
bloodsoaked
walkon
pirarucu
stiil
gastner
sajil
changepoint
grindea
taimuraz
supersensitive
pierantoni
ambulating
cheselbourne
alqaeda
lehmans
urbn
conniptions
peninah
waltzers
quangel
qualifing
scrimmaged
horray
washko
palazzoli
rosemarket
mantillas
bambis
handywork
osterweis
korchemny
bateyes
vivari
shamama
maelstroms
guiora
shokouhi
nerma
reinfected
japenese
praefcke
skrewed
surpressed
ubiparipovic
manifa
gruppetto
santaros
effler
solemnising
bodycare
lilianna
dominiczak
duked
waddled
inconsideration
ruksana
fillibustering
carlsons
gappers
codys
xinpeng
vandevorst
cumper
spytty
kurella
amoni
brochier
borzi
cevs
robino
enviorment
grillework
peacable
nypro
dulic
roxford
junren
ongeri
peccadillos
supping
guffawing
toretti
blackfordby
durana
discusting
miserablism
rapaciousness
seabrooks
chevez
träsch
lobberts
paternalistically
sobero
aboutreika
grcic
abertoir
eradicable
antismoking
coyles
stormclouds
insolia
arctics
mittee
mephitic
beginnning
cairenes
rivasi
kingsly
stenvall
mezzolombardo
abdelbaki
anglophilic
metaj
eletropaulo
ahney
articulacy
keyboardless
semionova
requetes
medx
deskin
injuncted
oohh
loadsamoney
fscc
nakanda
oteley
kiddos
triebe
sidetrip
facilties
ruitenbeek
tugba
rocketbelt
lebovits
backscratching
telephonically
beinne
duckler
vivona
subtyped
unpenalized
berlais
aeis
floridean
soukupova
afce
videocameras
savviest
speedbumps
balmossie
ergots
adpt
meineck
possibley
mangor
niederbrock
koelliker
shanth
munyurangabo
dreamforce
bacus
ditchwater
haussmannian
sessilee
detrol
matanovic
tricor
benattar
intertanko
doai
swaner
deoderant
kulzer
dupay
danthine
alderwoods
xethanol
rewrapped
hyperview
effed
isoardi
alderston
undersubscribed
godsmark
deconstructor
dcgi
intermittence
naprosyn
buckfire
prempro
slummy
backstrokers
fornas
puttable
inconsequentiality
wimpassing
placedo
bahnam
governmentwide
senao
unfertile
drinkability
elberse
nebti
nemyria
moooi
vorce
krisworld
maugh
huanglongbing
esskay
intino
fumus
krimpets
scolese
oweis
ghappar
caffell
cianflone
mamajek
moneymen
fenk
zabari
twosomes
phwa
rapradar
baulas
bungard
stonger
actifed
getresponse
ahrari
baubigny
allowability
cosponsorship
tunneys
inhumanities
eyjafjallajokull
szeleczky
cableone
dttc
trwam
nutra
gotton
beaworthy
moddershall
arrangments
maurito
follitropin
enquest
adow
ongenaet
macpro
doorstops
hydroprocessing
nassawango
schroepfer
fazilet
scalex
peróns
mandelas
stoever
whinnies
granizo
hardick
roehrs
corvington
sharsha
skywatching
pensthorpe
bloodmobile
stereography
loftleidir
megalopolises
mokedi
ramonas
ndira
horter
constitutent
funguses
merlots
swanswell
agtmael
prunings
sabarsky
cynicus
intials
blums
strothotte
washingtondc
nataline
archimage
gobero
guadalix
eupd
vpotus
nnoli
koryolink
jicky
bohora
beckfield
pelkie
potasnik
wigmakers
frankman
vardiman
flameouts
riedlinger
olaleye
pylade
howeth
modiselle
siglio
jasmyne
aviacon
paglicci
technomedia
shorstein
piggin
getfriday
abkar
inshes
donihue
opurum
eglitis
conrades
panuke
cedulas
mileusnic
ntpp
dvornikov
reclaimable
efner
marfrig
everdene
hyperextending
kotzur
kreth
tamberino
zetumer
afls
meydani
sharmans
hainsey
gompel
ellay
blsa
haskanita
nadcap
greyest
brackpool
cowlam
unconciousness
traxis
fibrocytes
flamboyancy
ecoterrorist
hejab
overnment
zerya
violaris
disgreements
staler
perniciousness
budennovsk
supremecy
erway
highstein
noblis
fohrer
laurissa
lapka
batofar
meler
waverman
sözer
eclipsys
luvaglio
ppci
feliksovich
tecu
londell
pintxos
ntdc
centurians
queered
courtlands
ticia
amnewyork
labandeira
amsprop
mchp
walhi
tanzie
goloco
belviso
sachinidis
sankaralingam
microdiscectomy
soyfoods
cogitations
sullenness
wycech
wuer
amburg
adaobi
moskovski
probers
concia
makhmour
ferrús
dndo
unfished
spherix
kasukuwere
yonnel
hornbeak
gombosi
mulid
absolwent
savoi
unutilised
legendry
avodart
derwenlas
tsahi
jurow
doriel
landrys
futaleufu
galeo
buitendijk
guardiani
tajuan
sabemos
kaidanow
romark
nsimba
adegbile
sakkas
assarat
haverson
covisint
marrington
duntley
tarence
sidles
relativley
gauldie
mastrogiovanni
peopleperhour
pramaggiore
markwalder
intc
bauditz
mussing
forminte
tatnam
kaemmer
bioport
heric
gitahi
footbo
exuse
komodos
taromenane
sukiyabashi
kovarsky
guebre
bestrode
pnwer
buschman
tarracino
univeral
preheats
tortu
willoughton
niezgoda
edinbugh
singlemindedly
saynow
chitou
hydrofluorocarbon
brightkite
unbooked
dhuru
jamilia
harwit
oubaali
adwait
tietmeyer
onsides
elmasri
crispest
mingyong
borkovec
efata
melhorn
pubcos
christmasy
fritta
attou
schaldemose
mindjolt
competitve
ipulasi
kapana
ttha
longserving
kopinski
bakhta
swepco
cytopenia
raquil
dilscoop
precollege
beruit
microcircuitry
overspends
geenen
liverpools
arghistan
seifzadeh
khpal
mufeed
contentfilm
himeself
sékouba
jayasooriya
ramkishan
masaliyev
santanas
ultrazoom
bassiouny
motherlover
paglialunga
xarelto
nougats
aconex
pirez
reconnections
wildbore
mowle
wainiha
alati
fsap
pophams
povaliy
grapnels
popovers
timone
kokinis
raggedly
alphatech
strowbridge
phuck
mackems
schmear
goaling
iannello
binliner
ecsp
chipotles
hucklesby
hatefulness
cynde
dobberpuhl
wppa
altenbrak
acclimates
hebronites
catster
tanard
soluable
elsholz
shirvell
torraca
dailymed
benriach
sasajima
philinda
neinas
vinitaly
ninoska
betgenius
shoppable
haveron
tertz
squan
nadezhdy
potowmack
espressos
antiunion
multiplo
craftiest
aigfp
stavinoha
suhur
moebus
zawadski
bilello
uhhhhh
lednock
fezza
undigestible
ragozina
paramhamsa
kolditz
stoneyard
weglarz
megaten
wananchi
vnas
chritianity
prolongued
gaiger
montias
riads
turbinton
baseera
underexplored
koloskov
squba
rosendaal
ponces
rajnoch
salyards
atsg
sheepdrove
pokemones
abedian
glionna
peckers
amirahmadi
minnigaff
isdr
lizzio
gtel
aglitter
blemishing
tankful
kalapara
maydwell
shodeinde
interiano
sobon
effervesce
drumkeen
gildred
pustay
bournside
benedi
seaclose
cbol
kulchitsky
dabah
ilora
lemole
chhewang
fullalove
mansyur
exg
aderans
outwear
mckennis
counterinsurgent
lasering
spackling
bangbang
zly
senetor
ninebanks
derico
kausfiles
stacchini
sirimanne
ralpha
ntsebeza
shastar
obermayr
hamptworth
mbagala
drunkorexia
cardoons
cereso
fallwell
astrovan
tabaro
springland
hiim
asantes
vorsteher
weate
garanca
talboy
clemes
motech
biodel
amankwa
weijing
strenous
unspectacularly
cdphe
judical
chauhdry
storminess
dalhuisen
boardsource
actonel
cappellacci
keryx
directcompute
buturo
sibleyras
tqd
scin
fbml
mycock
insdorf
suner
crewneck
braggarts
kivuitu
monsma
fellous
sdunek
affliations
ucpa
washcloths
yowls
malei
wizzart
millponds
sahebjam
villepique
pigheadedness
gannushkina
cherner
setterstrom
tegegne
whem
concertmistress
vanderhook
tsehaye
extrahop
recladding
healh
sciens
bramhill
jhagra
acass
chervochkin
ponying
phished
secularise
colateral
agasint
fibresand
bellatti
roopam
snowmobiler
eikos
laynie
donowitz
yafeng
laugardalsvollur
thinsulate
tdz
herme
buddys
dobens
hepatologist
wagha
stoessinger
spaulings
afsaruddin
goodwills
homicidally
resuce
techtown
carring
pracharaj
duddleswell
deheng
fantle
poczobut
gammarth
diligences
suprime
jokhio
mojitos
portaledge
moyesii
haansoft
boeglin
sphr
denburg
folfox
eroticizes
endris
sudjatmiko
alayo
kenmoor
bioserve
mlinaric
dustier
innocous
carizza
windbags
hipocrites
afetr
iridotomy
tinetti
cihlar
hairbrained
xertigny
chengcheng
hamingson
marsano
clamed
ghei
cozze
reecie
rushern
daynile
tetelestai
cydf
olaroz
nhongo
stacom
sublicenses
takkula
hardyment
freekicks
hannahstown
zaluar
umdf
nmfa
baracouda
awdah
jpra
cravenly
biocatalyst
pevenage
geving
flaxseeds
macaronis
uralsib
tearne
voilá
privacies
bizy
nosegays
pensioning
anaesthetise
sowings
seinfelds
santuccione
develoment
larché
sucden
gumbiner
guccis
alpf
maruziva
eccu
teklanika
sklarew
borlotti
moneypak
bedfordia
blipping
laup
tineid
inadvisedly
huriwa
austriamicrosystems
culv
stormier
stankevicius
lashun
facius
trumpian
aelvoet
parcour
growning
sospan
brandeau
probablility
scrote
clonie
dufry
hargroves
depietro
floodline
hudspith
fiveforthree
distincly
calster
hakimzadeh
hygge
lowara
ramde
meckling
darlyne
hubayshi
taikonaut
musicial
americanise
dionisios
coovadia
tyjuan
saintia
tolerantly
geekier
heisterberg
doomsters
snowmachine
cityengine
inlcudes
travagli
lawbooks
bisazza
talamona
silverlode
ptns
zerain
thicot
deyong
cosmeceuticals
outpitched
supersleuth
abli
blancarte
landrovers
bleichroeder
ttwo
noninstitutional
decencies
hodas
saxenian
dicovered
koetsu
irresistibility
soyabeans
skytypers
gasaway
nunziato
dornenburg
copenhageners
petrojarl
tril
strawflower
doodlers
barthmaier
buatois
ripply
mcglowan
bombsites
himm
ravinet
nalbone
itoo
glassybaby
painterliness
balbardie
baddoo
rptp
smarthinking
galisteu
slix
wristlets
biodegrading
razeq
aquinos
miskella
frankenfood
otelixizumab
homebirths
chabbert
dovima
cristan
associaton
pirinski
barondes
muxworthy
denos
wholesomely
pimentos
duporte
worthingtons
farrokhroo
prickliness
bienemann
verzaubert
colaninno
hautzinger
sanish
nikons
suslick
nexaweb
underrepresent
zetar
sedivy
zeyada
moshayedi
skvorecky
dragoneer
repole
abortively
praire
acras
theose
rickerson
teplica
thte
saulters
habeb
perkiomenville
axinn
freska
chaptalisation
damola
johansens
klunchun
electromedical
tumwa
nuble
budenberg
filmyard
beeland
youthwork
bilwi
lemondrop
hiccupping
pipersville
denzo
rohnke
necessay
throneroom
grullon
awardwinning
afrobarometer
mundinger
afni
voteless
qpos
goldsteins
applebees
scarselli
needier
cortefiel
nashan
offguard
leppink
brynford
aawc
shoulderless
jeanloz
altnet
daiquiris
manufactuer
gillepsie
vowlan
andanson
cdai
limitative
hervik
flewin
unfroze
libeco
kompan
marsali
pillcam
assumpcao
qddr
formfactor
stecf
champoluc
rheo
csrwire
pawloski
mobiletv
acthar
vaji
boltby
sorries
stonking
leasco
omnipoint
basepath
multisource
priyadarshana
crimbo
genderism
stroytransgaz
whistlefield
rashedi
arcadey
strategise
gulfcoast
neptec
applebome
merkinch
breachers
bloused
chiliboy
inluded
helta
duvie
sgms
organistion
plowes
toyomichi
remeasurement
jarrai
derrico
schmill
lipes
mudhafar
gruaud
mirembe
marhsall
trakh
gruntled
scaynes
wistman
mavrakis
smolyaninov
paczki
pianalto
orumiyeh
osoria
persley
guttenburg
xhb
fellates
leacon
prelanding
munizaga
infringments
popka
mastracci
bendiga
sneakiest
curzan
ekstein
timidria
unho
environement
preposterousness
mergler
songhurst
honsik
spaliviero
turky
kalimantanensis
theorys
kanekoa
riihimäen
fuddled
medialand
abdolali
babayi
pijanowski
fiorile
akab
findin
impanel
iaato
messineo
azizulhasni
sleepier
issakov
annonced
mitina
biometrically
granich
kalpoes
prestigiously
weatherproofed
nargile
cathall
tamilselvan
debitel
lamestream
bracketbusters
slieau
soltner
almany
conventioneer
eluay
aicraft
secen
budelmann
audioguide
cahyono
mulal
perfomances
unlubricated
rabbity
cuting
godzik
industriebank
casciani
cytyc
rechargeit
fengqi
memberclicks
loropetalum
neulasta
bananna
queffelec
hardiyanti
rostek
véfour
tourk
topicals
peacocked
lovaza
vlna
gyori
quatar
bostonnow
navile
edal
crog
criticsed
mangoush
mahmoodi
sjoholm
garness
entsminger
comé
wellwishers
gichon
bagnat
meetmoi
gatsometer
setmariam
mitat
vangundy
stalbaum
unterach
begona
attmore
drakopoulos
decomissioned
buoncristiani
hellsberg
sawday
olumuyiwa
truckies
nomics
loook
slesnick
gabyshev
backhander
ajillo
jendeki
kawhmu
idealogue
atalon
nbsk
edrik
luksch
johnnycakes
qpx
predetermines
pcad
dussap
kaimar
trishaws
biofields
edinaldo
singstore
michelins
okulov
shirrel
sspca
glucocorticosteroids
scourer
madhuku
participacoes
degrippo
tholet
daspu
naoms
slonina
moroko
manusky
koeneke
pelegrino
granquist
jabbie
empreendimentos
climatique
wisterias
rburgring
parkwest
mantriji
winyard
nichicon
thinspiration
gastao
pushpinder
otherworldy
thinkvantage
vaccarello
braidhurst
autlan
carterfone
indama
nwlc
clfr
earnout
henigan
escarra
dysphasia
debica
kenfield
infirmières
aleasha
elysha
butterballs
brynien
campanilla
studly
iraw
feleknas
barehand
committeee
rumailah
chillas
laffoy
angelakis
hepatitus
tofutti
schnittker
kehrmann
scentsy
chockfull
toulou
bordat
sinornithomimus
lezlee
padmawati
potray
carrison
coarc
cytel
pyret
holographs
seussian
rozsival
uncomplainingly
pontycymmer
opsm
meringer
croiter
helioseismic
natelson
mokuena
gophone
dissappears
delce
ascetically
cadan
hospitalising
kingseed
receipients
stockbreeders
releaded
porfolio
blaiming
roskovec
squired
topilejo
ladened
shopland
serran
flightpaths
suryana
indisciplined
liliam
darroze
carrefours
grobel
gilthead
heles
shikov
holywells
bourrées
vlassakis
henehan
craigowl
downsizes
parralel
businesslink
maxick
guarida
sucumbios
gremmo
infopro
breakingviews
dsas
kayum
fernàndez
eqr
difonzo
wisehart
wmmm
yauatcha
opciones
goutal
mccourts
glowers
ehnert
turkheimer
dennee
sizakele
hitsman
tuggy
agley
relevations
mformation
cuneyt
celluci
carloss
jorvorskie
queasiness
dubout
gazzano
postrace
firstplus
gagnidze
uspaskich
glencove
ipoker
loppers
rackleff
akkus
bomke
podimore
fantus
ballyhackamore
treyarnon
yardages
nexcen
acdf
dedworth
belguim
gosek
chanon
fitba
coastally
raiter
haasara
yohane
badreya
doxsey
jomhouri
birdcall
jephte
pretzelmaker
dizikes
horwitch
chappatte
bumaye
jellylike
badui
ottobar
incontrol
kipngetich
burte
shinkay
taxer
grandads
draisey
cytori
kupono
plash
barberas
souless
sideco
pogam
yaleglobal
tarrion
serialists
ditzen
mevs
chamberfest
filipovich
qinghu
papun
musicalized
jjimjilbang
holocost
highflying
gattari
emptily
snode
jenefer
haasteren
michitoshi
chongwe
abakarov
atbc
trubowitz
belchatow
guobin
losyukov
requoted
skrebowski
chandiramani
fenroy
merise
körbes
superbot
pankshin
umarji
petrivna
mpinga
matress
chiuri
pearlson
luchko
chumbucket
shneerson
endako
magennises
pliage
palframan
douda
uscp
phillipino
thameur
snobbism
chebotayev
softnet
hessert
wallbrook
shafritz
unfortuante
fipr
witcham
labrot
hagmaier
omahans
awlaqi
utest
pescod
kinger
marinacci
adulteresses
militarise
bitez
bumpier
symbologist
undiscouraged
seenan
leyvas
chenevière
stressy
portantino
pogoed
zayi
comebacker
ignagni
delusionary
kazakhstanis
wasbrough
feaunati
hundleton
fiondella
kossler
vitreomacular
adrasan
statisics
halfshaft
kurani
cobu
bartend
prinicipal
dishcloths
vogster
doublehanded
barkwill
heapy
lumpish
mefa
wge
bunbeg
hutomo
qdg
conservatee
podleski
corallium
leopardskin
geed
icariin
zequeira
asshiddiqie
celynnog
innovest
cornley
canix
masunga
obnova
scarifier
nikahang
rainproof
pernil
cozily
qadirpur
microdistilleries
muntasser
bukstein
rotblit
uehiro
bancgroup
vodoo
zhihu
rapelay
ashmanskas
marcassin
hunzeker
imiglucerase
supernerd
nuzzles
batjargal
doldrum
njea
friedrichstadtpalast
forestside
worell
bolie
easyway
hotbutton
netdoctor
beschizza
zonally
vsia
metroeconomica
megalomanic
sheperdson
castleside
overexpanded
curvin
becas
ascuaga
delegitimisation
becomingly
verndale
baheng
trenesha
smilovic
vylka
bottner
ngalande
milanetto
bundlers
khalezin
bfms
goluboff
denaturalisation
zargham
travaris
figel
nilaveli
dangos
fussel
supriatna
eurovegas
nationbuilding
lahiani
asmerom
shuaiba
gagarinskaya
smethers
vachhani
banguela
galron
greenaction
kharrar
witchunt
seariver
electively
supermaket
lepeltier
quesnelle
wudil
pucon
agualeguas
wigix
heathcliffe
anafon
abinanti
kotsolis
soweth
jcra
gresswell
capey
slingin
ferromolybdenum
brewski
accomarca
eacts
westfarms
marver
saltmine
treixadura
grumblers
hamfatter
battels
dubuclet
cheeseboro
snored
grafft
raibert
murcar
hossenfelder
madumarov
ploof
recyclates
devolutionary
xiaoshuang
unappeasable
visionworks
schweiss
médiatique
hyperstar
mulvehill
noruz
azun
speechtek
tedlar
comice
siminoff
dashboarding
superkings
lated
futurebrand
thornwillow
liebesleid
josen
fazes
mecaplast
pokalchuk
sabrett
kuvan
makvan
chinawhite
unremitted
rygb
liveblog
sinkerball
kormendy
pérol
hayre
burncoose
rfis
mfcc
warstler
cackler
coopmans
superrich
attarian
bookstand
paffenroth
shelp
dettloff
polehinke
saimira
idva
kibir
ecbt
kelbessa
humanistically
overbuild
redeposit
shanara
swaleside
gwac
kozmino
prangs
bouthaina
ostini
palazzuolo
kahlow
jadel
boreks
carpetright
ramshackled
yourslef
mangieri
yanli
cobranded
chupina
frontrow
guiltiest
vellis
decabromodiphenyl
rhapsodized
fiberoptics
deductable
coreconnect
balvinder
gembo
gullestrup
tiegel
cotorro
gwartney
rayyis
bosserman
paluska
machista
ishmel
remunerating
pceu
rivenburg
playsuit
perrucci
graffy
momlogic
styriarte
viviene
kumbayah
mamaa
licencee
pioro
innosight
morcambe
garishness
priciples
rdoba
peraliya
hisi
cheeping
shaojing
fadika
trhat
boute
quietens
ogalo
baraldini
queensridge
lybian
blustein
ltachs
mystick
tegnestue
jetamerica
fukubukuro
shanoff
skiworld
womenomics
blinatumomab
svartedal
rovera
verace
oetz
houseparents
artunduaga
underpayments
sooam
grasberger
twitterature
ngare
paperport
ibes
airfinance
aufiero
killalea
kadence
kwaa
inkersall
northface
sliger
fontainebleu
janmohamed
askariya
tekeste
mcleer
donabe
seiser
stalter
acutest
narmin
appelate
proselytizes
bphc
techtonic
moledina
gilula
berinsky
outgunning
cnci
etilefrine
galvagni
conservatorships
underrecognized
mazatl
torturously
carosella
iabg
karunatilaka
tosson
ifin
jazar
migalski
lumm
dotta
bitterer
remarketed
gxc
zijderveld
delonghi
lavita
codiscoverer
incomm
mendaciously
immad
galban
tayeng
miquelle
woodfold
elbot
siscoe
fluoroscopes
vobs
reseachers
southest
pescovitz
reesman
olympico
weaking
fultons
shopfloor
blackavar
fitterer
blackmores
ernakulum
naawp
kalca
dobner
bearsuit
rspp
anapol
aboudou
aughenbaugh
furmaniak
canellis
piggish
rettew
palheiro
phenominal
urrugne
bahoz
brijnath
ricket
jarosch
frazzles
mauffray
sospeter
mortty
zingarese
philou
lumanu
soooooooo
daboub
kelepi
unjam
mackenzy
weely
guadagnoli
dolcezza
leidolf
gmps
vitual
trafficks
milimetres
podladtchikov
abhey
grappas
gourna
wanma
woulld
poct
saidman
youare
kazahstan
screwcap
cliton
whrrl
proplem
buenaflor
guilet
sakuji
imass
prouve
thonglor
nubby
thyroxin
interpeted
terrenas
kremikovtzi
raelyn
haircutter
smokery
recharacterize
schipani
cainero
mitsis
impertinently
atributes
debkafile
plasticised
xpresso
scillian
ifsec
ionx
hpsa
eruygur
hacdc
nondiabetic
terabeam
facchino
impassivity
poreba
ignorami
primly
kkbc
niswanger
unreconcilable
resposibility
jacarepagua
moonpig
mediacenter
paulovich
molsons
waterskier
liuhua
imperilling
hollyhill
levolor
haswa
cosigner
blackcraig
izere
klebanoff
smetters
sabihuddin
draad
netronome
susah
dominoe
rauhouse
martitegi
deetjen
adhamy
mcelheran
counterstrikes
auxvasse
elimane
boukpeti
gogrid
deftera
eoir
cordycepin
abdullahu
gasify
highberger
elmos
cattrell
fatuity
aesp
waterzooi
identica
asylmuratova
danjean
lunardini
giardinello
sararogha
mhpa
depastino
rubiana
tanesha
wvi
meissnitzer
merieux
neather
lubricin
hohlt
pomanders
xiulan
naragon
grizzles
worick
lighthizer
karaokes
muminovic
mollik
villavaso
kuzuhara
inertly
picciola
usabc
naqash
renzaho
bogler
careone
yannos
shinskie
frankurt
godsons
espel
panigrahy
chooch
pixmania
kludze
deltana
republicrat
gregoraci
boundaryless
banic
stockinged
dataviz
yershon
lungen
osmanoglu
africanness
unprofitably
peewees
charasse
ilyaas
joggler
linaclotide
shemari
adaptiv
merrylees
mulhauser
hankers
goorin
voudouri
chaiten
sizzurp
breeziness
brightsolid
mudhaffar
camoapa
kve
amstrup
yurendell
elitest
chronoswiss
shools
prettifying
dolares
mcghan
dabblings
grittily
rajith
stonex
sweetface
lefthanders
toddrick
stableyard
claxons
monacos
walmex
iceton
ranneberger
labovitch
tryg
nalecz
interindustry
marlabs
cafetière
islita
aniboom
panzanella
yusak
thankyouverymuch
fusspot
swyddogol
survelliance
bancwest
noly
varatharaja
nestell
ymck
christodoulides
fastovsky
mclaughin
épater
dunkerly
gplus
doumbouya
slideluck
chermak
sheeesh
paulites
mariluz
cecep
chainsawing
grumbly
goozex
counteraccusations
gelbaum
reakes
corkscrewing
vranjes
unvanquishable
schoolteaching
gooners
tamberi
budby
dearn
bissan
strenth
preformance
pucino
unrecommended
lipnic
proofreads
wailoo
towneplace
goalbound
buld
dicon
ferenci
balloonsat
shirkat
kurlan
debrock
cussin
pfiefer
qoryoley
machinegunner
paganistic
crimelord
horejs
takva
kirschling
dedapper
deradicalization
zacar
fooball
cabnet
fishiness
koepfer
nuclears
bendolph
susanu
meitzler
venezuala
jerini
tajammal
przysiezny
shrim
chompers
aciphex
protonix
zapak
jasmer
towl
clarkey
bellera
continential
amidror
skylighted
hoschek
nettled
falshaw
hainford
bamroli
ganss
jitian
clearcube
sureesh
alelo
ubale
ketchups
pattypan
abudi
dargham
exacty
liposculpture
revotes
outway
unbuilding
gyala
tvss
oceanariums
narquin
mildenberg
rescanning
sighters
anodina
funfest
vergilio
maypray
aurakzai
stillit
subhiksha
pharmatech
tartes
allatt
gottschling
zerohedge
boriana
godesses
spunt
engrosses
redelivery
rushmer
alhaurin
compartamos
valmor
chikwava
sooting
figgures
atiur
jairazbhoy
mustiness
dorint
sheerest
gardenstone
shapeliness
spronk
khandoshkin
shadoxhurst
kroening
eventally
staticky
bissey
vukelic
bathiudeen
utilicorp
uncomely
mandefro
jewbelation
youe
osteoinductive
knuckling
westthorpe
vaugn
cullivan
dgas
dartanion
longly
cizelj
trimont
gujurat
spinsterish
inexhaustibly
applicances
mazombwe
diputed
pélisson
tripath
hagglunds
fiorinal
rhuallt
quida
acenta
begood
hursday
schaitberger
broadwoodwidger
wakanoho
avature
inestroza
vestberg
loasby
dibbuk
cfmc
adlair
finching
udovich
dispensible
replastering
abjures
rooftopcomedy
netwon
vandenbosch
parlevliet
decrem
hkac
meide
gigatonne
descibing
minuteclinic
shankhill
ounsdale
nocc
glowy
akasheh
geekish
kanetsu
faryadi
ingratiatingly
madhat
malpede
crashgate
postill
despensa
railworkers
kibisi
shinwaris
norvasc
vitriolically
tavakolian
seeable
designbox
bouazzi
wilcrest
faggins
horvilleur
pfitsch
lightposts
svaty
bresky
libiran
paquerette
opertunity
cavilling
hilwa
gendex
blankfeld
dexcom
sutjipto
rothholz
saligman
mclatchie
asfur
cruch
codirectors
zipursky
babeau
gjetja
rooijakkers
kidsworld
stenske
desrve
inmar
smartnet
hatherill
trinlay
reeti
ahronovitch
zakumi
masochistically
chopteeth
escajeda
flexcube
montevina
varod
baccar
manale
dininny
yeilding
saiburi
reak
noortman
innergex
efraimsson
kaime
claren
shalesmoor
dapd
seasonale
fertik
mondesi
kayford
lingvall
homever
volcy
sirenuse
unheeding
nrai
superfresh
artccs
zhenghu
loomans
antimacassar
steinwand
trepidations
avascent
gasten
ossetra
ngwynfa
drived
bloomsberry
naiop
fridriksson
abuelaish
alabamans
vahafolau
zicatela
heremans
casteja
actioncoach
scoobie
paultre
motoshige
lunday
servive
landeg
valvematic
dartez
themseves
fruman
kittenz
aircastle
diacap
prpl
husock
mondegar
isavuconazole
pawlcyn
jonason
dehumidified
bleiman
oddpost
tzus
necesitas
ritterbusch
pluot
kickboard
zhitkeyev
ilakaka
brufau
citifinancial
soundlessly
tcis
rinnes
italease
ustoa
coatroom
viracor
calvelli
mystifyingly
constableville
roskos
astuti
innholders
polares
jichi
mamadouba
asessment
sabatoge
overthought
ealry
amrika
dashevsky
breden
averdieck
tabarzadi
nvocc
samlor
tamarasheni
remmen
tartiflette
solzen
cabindans
worktops
nawani
mcelvoy
forcers
wipperman
snuffle
sexperts
suboh
cantonian
handpicks
hawlati
jankowitz
furloughing
taffi
valee
hamdallaye
asselta
shimakawa
bcls
designlab
sheepwalk
brickies
imapct
lunchpail
sandmeier
glaschu
biodynamically
kahunas
futureworks
havranek
crottin
dragani
subdelegation
rabbu
accordian
juares
febrary
rowold
revisitations
latag
carefirst
desertlike
mortaud
buttonholed
wallchart
yummie
oilier
zenei
ieoh
kulman
matterface
atavisms
poreotix
schwarzenbergplatz
johncox
mitisek
klempert
apprehensiveness
washability
serverbeach
martavious
spatharis
emere
poggetto
greenkeeping
thiermann
mukhlas
aifms
kastanienallee
minimart
bronxworks
longcore
concommitant
phobes
zyvox
lüke
eddying
stockouts
rohozinski
meiendorf
vinaigrettes
lindbloom
jgto
amimon
blockparty
brucknerian
fssa
prystowsky
mcenany
bashfully
mischeif
bloes
trydydd
ziegenbein
zauberman
kovalsky
lifegem
gibellini
unguentarium
kassinger
amerco
wojak
homegrid
birkel
roodenburg
technogenesis
nonpaying
ritger
valancia
caunce
schapen
mereside
borbely
bogucka
trounstine
transfixes
lvfs
grossbart
terlato
cordelle
convera
wyg
anticrime
hwadae
zyla
omesh
sehrish
paktya
johnthan
nondomestic
nesrine
badeel
swda
moeakiola
hiott
kraam
okudzeto
oyon
camner
eyeghe
aaaaah
nateglinide
breastcancer
verheyde
tsiamis
mobilereference
twittered
cpfa
jeremiads
krainin
inrushing
solargen
redeveloper
berylson
atunrase
mettraux
esbenshade
parfaitement
najarra
tvel
ampelmann
questioningly
parlamentu
mayahi
slaght
phidippides
johnmccain
grandnieces
tenerelli
koumans
menyn
dilorom
durrel
chear
zuckerburg
dietlin
kloefkorn
streckfuss
imigration
delance
bryjak
opirus
trusina
ravallion
hafize
togheter
kholoud
denery
breadknife
cabl
ceejay
korad
verhees
bubye
iratxe
lafranchise
sychdyn
telapak
choza
ombaka
retik
margram
eutef
pangandaman
synagro
daiga
mufj
iifm
underhills
assabah
washingtonienne
lvads
tuhabonye
traduttore
radeke
bhuddist
corbas
uplinger
naczi
worriers
smooches
quintuples
cloudforest
brenzel
liroff
backburning
csibi
siluva
shotspotter
ogalde
kershope
lusciously
chicola
npfs
candotti
sabyinyo
wackjob
teachstreet
nanosphere
coaltion
trester
seesawed
edfund
abbamonte
roseobacter
nanofibrils
perkowitz
csapo
elizabetes
boilard
sinol
ammendale
mcanthony
mismanage
pendeli
groggily
ugel
eiast
fcrn
siggers
susia
rougue
ozpetek
caiso
summerstrand
pakradounian
jetlagged
burenstam
sillakh
sadeeq
barbian
unblinkingly
souplantation
yanovski
dichloroethylene
caerffili
steinger
eale
microsensors
toolin
kudela
manugistics
bohnert
savoriness
cetta
tuzee
kallstrom
iqoqi
maistry
deitchler
rugelach
bahcesehir
atpdea
stockcross
ausma
abrades
messerly
ingushetian
tianyong
durá
unwrinkled
latti
kinani
hollahan
jembere
unpressed
cousar
redchurch
prepack
collators
waterthorpe
alhazmi
forswears
navellier
bergrin
abercombie
jje
playlisting
vaisman
hildren
esmc
outselves
rounsaville
fawdry
lanoo
kasyanova
hymon
mingy
uruba
insinkerator
tamagni
borroughs
gonorth
siwarak
gillier
themslves
leondis
accountholders
zyed
fehsenfeld
paltenghi
amms
maggiemoo
brasswork
sasic
integ
stornaway
beautifies
bloqueo
mcwhirters
fransiskus
brucennial
pustilnik
reckart
gribenes
bonitasoft
bluewaters
trygstad
piegza
cestaro
preciosity
foringer
barfed
albes
kurczewski
bartnick
threadworms
shoebills
ichsan
searchme
bowane
ajristan
guardpost
felisberta
ceviches
zhongjie
noodler
ypma
sterndrives
polyglandular
yof
ryanne
wittingham
meliden
unspools
moniotte
domers
slobodchikoff
genao
dedryck
sirot
tokarska
xiaorong
dharmu
arfordir
unclench
aspillaga
bogles
espelage
bletchingly
steampunks
assest
innie
hyatts
geekspeak
mulish
urbanise
nwagbuo
mgallery
kukava
imbroglios
nmls
castaignede
ringhofer
moncivaiz
chiffonade
paskoff
owhali
anatosuchus
chimichangas
ferrández
sereys
simontacchi
jasinowski
televizor
searchmonkey
guidobaldi
newera
scolnik
kogalniceanu
plent
litovitz
broadtail
yiqian
fomunyoh
tricoteuse
squidge
bijagós
emip
soother
wacaday
kipas
folllowing
pensant
nafie
semifinished
lasciviously
palringo
preloran
pyranopterin
espineli
pomery
shelanski
rgensen
hingson
berzina
jamilya
hardcash
gurspan
rury
toromocho
dider
raymone
muindi
aryzta
crnas
alexiy
marzell
maloch
tommasina
nirat
faiola
xtradb
centruy
pooches
congresscritters
xiaoya
snowscapes
finvoy
faoa
seyhun
kiesle
encourged
wots
bégaudeau
westquarter
omnistar
hazime
liwski

tearless
hounam
grarup
neimann
pontina
unhurriedly
critisised
hireable
khoder
pushtoon
karnats
meyinsse
chicharito
leegomery
vincis
intermediating
jungbauer
jinjun
algarin
testiness
moqadem
dogcatchers
smoger
terribleness
phileleftheros
shufen
scheirlinckx
noski
shoretel
spinvox
moreleigh
janyk
xhelo
drivecam
avacado
landver
richarson
thesmar
roslind
mcgrogan
estara
wanogho
filek
realties
andariese
fullhurst
senyszyn
iaem
korsa
poppyfields
mukuni
weymarn
harbinder
pelak
gardey
synj
cancelations
iordanova
denegrating
hoplessly
hdsa
incompatability
vsos
katami
ashal
violacein
jaxtr
falkengren
piacenti
denitto
hodara
syha
beagling
keilty
exxonmobile
grasscroft
antholis
chanteuses
prosy
embas
sadistica
whitledge
trépardoux
superagent
equitana
woertz
kovanen
josko
kunisch
sidling
suldan
cwlp
doliner
drevno
vreed
cageless
paramax
erekson
timboroa
broompark
dekoda
eclinical
frought
misappropriates
pihlaja
kotecki
neocolonialist
loconte
marineros
bartfield
villaneuva
decieved
revas
orstad
blutarsky
redolence
exultations
billary
cousen
purewave
cwrdd
fiering
gridwise
sturner
vietcombank
najih
newlink
kracow
contintent
semkiw
cvijanovic
xtca
brahimaj
alvheim
ironists
nonkosher
dreda
ehrenkranz
revealled
millenials
gimzewski
mankinds
atheistical
netcentric
pixelate
termers
gonave
snowie
brooklyner
jettas
siemionow
caravansaries
braynard
bintu
pudlowski
disagreeableness
introna
fauskanger
asheesh
waffler
cellura
poltiical
kawalek
alawiya
bushelman
quantez
shikwati
citzens
skouries
padeswood
kibbitz
pecorella
beijie
hearting
dohner
reho
ninots
authentec
demorrio
muscly
zerbin
sabria
gastroplasty
unitedairlines
hafit
starbar
bonnema
saleban
burgaud
techtonics
billong
telereal
privette
diflucan
serviceperson
organogram
vytorin
healthfully
boycot
roadtrips
ahya
rosenannon
trinklein
cismesia
moscows
komsan
lenarčič
sevinc
repetoire
bioidenticals
zinch
lemisch
lieving
innoculation
golli
ponsanooth
woodpiles
morningness
batat
fayman
dicocco
incurrence
wolflike
crez
expec
katseanes
impracticalities
minnig
malandain
smartsource
shrubsall
montsame
paulas
ballcap
swerdloff
poldek
lokichoggio
cdcc
suroosh
taxact
differentness
recind
vergette
celadrin
overregulation
convencion
zenkov
palmiter
chênevert
caffet
mapless
cpff
vellenga
malmer
earflap
garufi
tarrifs
safway
bachoco
vkl
mingott
cherisse
patteri
krooks
walgate
karalus
crispies
frieston
ergh
mountpottinger
bachuil
genderen
obbligatos
munatones
wenchong
heilborn
multitaskers
unhão
nahavandian
prefeasibility
shmendrik
rottet
ecds
pastrick
poulner
tumbly
morrical
hanlen
vogelhuber
cunth
kamras
traumerei
contined
sammour
evridge
guglielmina
clafoutis
rozza
marshua
bitonti
polaszek
malinski
deltacom
laffineur
glantraeth
udrs
putrescence
vitolio
diang
cpcu
inbounding
spraypainting
fordhall
titzer
penmanshiel
cedarlane
staig
hodosh
duchoň
knockeen
shindigs
americast
casiple
hechizado
steinel
culbokie
economised
faridoon
charsada
acquaintences
overextends
edgbarrow
cavicchi
traude
interfamilial
familyfun
margoth
controlls
greyrock
melaugh
palagio
pnut
caucusgoers
biema
frankendael
delbeke
yobes
giannakou
midlarsky
mulkerrins
bdms
kliot
quinapril
schlepp
digitalcameras
studiedly
gibbin
kierstead
unoffended
opressed
mapumental
cabilly
honten
ladyga
steenburgh
toothsome
eaks
readspeaker
giantkiller
biomechanist
bernsee
webforms
tarnasky
throwleigh
immunodiagnostic
paskett
rsus
panayotopoulos
ukec
ghilarducci
manylion
jandel
lenowitz
broxden
spendin
htib
stragglethorpe
demobilizations
kirikkale
resiliently
seaon
elterngeld
nozizwe
ejeta
discontinuations
mistras
loopiness
rabett
mhoire
whittacker
mierzwa
carabante
haughney
mislabels
becaome
serodiscordant
alexan
guac
cotmanhay
aabey
danishes
rtaa
japanimation
kissable
dayni
plottings
braestrup
backhanders
skovde
arbitraging
tiebacks
jenessa
ultraorthodox
applecart
tokitaizan
barreales
poujadism
mojdeh
hafidz
makova
katsikogiannis
blankstein
overmedication
thirer
munene
xsight
prexisting
talkman
harbuzi
sililo
khudadat
podkański
rcpo
petcock
disinvested
vinken
boppy
veillard
brussells
boobytrapped
coevals
spiegelworld
undercapitalization
kropper
makover
dornhorst
exprience
giedd
chaswe
tgfbi
pannone
dropp
pehn
sekayu
ashr
tooooo
liepman
jemiah
schlaeger
reroofing
josico
nansel
breakthough
miotto
whittick
mlbpaa
hypocrits
bausor
denay
flatmo
guerron
wanyoike
shoretz
mbodji
cummard
jossinet
phemister
mabrouka
fagre
leanora
wagnall
contemptable
leting
carajas
barichello
dunguaire
profectus
brettkelly
nicb
photgrapher
memeti
ibarguen
volumn
happned
presss
comentators
emass
twitting
robischon
traxo
sorrier
openzone
typcial
pirret
jerred
mattani
junying
employeed
portayed
skoura
gasless
wyplosz
moulard
ohny
shelterless
abullah
jiazheng
hapuna
jiashi
pelzig
netminding
succed
remberg
karkus
kunk
galanteries
assigment
mmgg
dechrau
innospec
toplists
vulkano
edmans
dhurgham
middagh
zcam
holstege
taxprof
oldhams
niblets
mtus
leeane
gops
steamrolls
magliarditi
waggin
ardinger
klesse
scandalise
nutmegged
hohlmeier
longways
haqlaniyah
pbjs
matchwinning
meqdad
assaraf
matices
zeppilli
geertrui
houseowners
onyszko
subfertile
crispers
garyn
universalise
falacious
mangera
prodhan
watchfully
carronshore
houghall
canadell
tabbush
melch
mbom
nceh
glucocerebroside
petrisor
bunich
cravioto
kydes
microdisplay
rosendorf
fastt
bonnee
microfilter
waiola
mirrorstone
xolair
sunbathed
edemas
privatly
llinos
rashee
rogak
schaffter
communcation
muffit
baylham
aimhigher
noseguard
cadex
kagisho
wimon
stouten
masrour
ortuondo
brynberian
tamburrini
wollerton
devah
castrejon
serbedzija
simonich
spross
sequans
kleanthous
clickstart
peiyi
echavez
netone
expendability
dobbing
wryness
flexitarian
hinduness
adbowl
karadsheh
justfy
charoensri
wootan
superpredators
mallouk
klintworth
allerca
reappointments
sackers
obsurd
pertierra
barbadoro
evangelically
woznuk
carleon
csgn
wolens
nativeness
kneisky
hewko
kaigler
ihsanullah
talismen
eliahou
bhuto
saltin
itwas
nevels
unyieldingly
radiolina
noveau
casalena
zheijiang
follifoot
morfogen
waudo
flajole
davidovna
glng
learmount
umbilo
ohmigod
wagenmakers
centrais
canggu
amcat
feilchenfeldt
soccerbot
spreadeagled
weq
dazedly
shalleck
funo
barkely
humoresken
stiggy
velencoso
bubbas
stoled
ghairat
hetrosexual
ghag
watanbe
coproducing
moxam
dspp
feeva
klibanoff
eurodac
bernardito
newtowncunningham
floppers
rozzell
eveillard
rassak
sayyari
robomodo
bulatovic
gemal
muskal
tulkens
eurofound
greenwise
rajaraam
inspirationally
telefonos
artemesinin
teenies
balades
ledc
amraams
anways
harmoniemesse
kynt
zielbauer
burooj
adumbrate
pickaninnies
ciaravino
candlin
hurrahs
wescoat
kailuan
actuallity
pollaidh
nastas
molemo
kultgen
communicasia
triscombe
bapineuzumab
castington
weiting
xpedition
siwak
brisland
härtl
cottco
zhongde
massiter
imouraren
somach
pitkerro
mhanna
scalpings
ghos
whislt
nayeri
smip
guessoum
wingfields
crotchless
hoerbiger
geranios
leashing
nellies
elastoplast
surburban
securian
schake
fanfold
belarius
skulked
seediness
fordu
shimahara
executional
temporäre
funemployment
shamsan
cruthird
behravesh
coldiron
bringsjord
bakira
lierman
temelko
outdraws
seage
sainclair
olimpiyskyi
nosb
nneji
somersall
vuchic
logisitics
jiyane
siutsou
jackolski
castenada
chuqui
ibok
zigic
agreat
symtoms
torton
obssessed
leafleted
exstream
synesios
kybosh
fadeyechev
racinos
smirky
tsxv
fatmah
missett
miglin
remailing
busanga
gronauer
antifreezes
roughhead
deafens
scrimped
femininely
hamlili
mesterolone
westenburg
medikidz
sosban
nonblack
thrales
feray
hoggers
qurabi
senut
westsiders
kulczewski
gallaxhar
paraphenalia
confutatis
oput
ribay
kanders
furgeson
lprs
funsch
anatomized
corscadden
redrick
amfpa
disler
schute
lalmati
ditherer
linlathen
ganeshguri
runzler
remineralize
fulmination
corseting
fials
feess
bedframe
fascitelli
gerolymatos
khaliqyar
lansberry
meleka
frischenschlager
pivitol
damanik
intubating
ychwanegol
acgt
gildehaus
zakharkin
ramlet
lafree
verrrry
elsheikh
tuttiett
mashai
monetizes
wdcc
blazej
conversive
prounounced
mazare
kaitlynn
immmediately
kamenar
laninamivir
pouffe
discolours
heckaman
dacias
genevive
temaki
safarini
highlining
schattner
tawdriness
mprf
daslu
canniffe
eufloria
frattarelli
zeresh
mbada
londongrad
frises
tomorrownow
paulée
ecotaxes
ningrat
pachar
bjornar
hoarwithy
ewni
belabouring
messitte
vidattaltivu
airmiles
jumeira
cnha
homeservices
lapostolle
ctid
grazax
marando
frontward
decharms
iasia
jitteriness
dodgier
upswings
marinelife
blachley
barcud
safeways
kavanjin
donetta
satlof
skipinnish
bagur
cambronero
swaggered
rotarix
checkbooks
elcombe
agbank
aegisth
fiftyone
chapayevsk
sobro
unkoku
fairwell
steinhafel
heherson
gluckian
jenzabar
onetaste
sportmanship
lebsack
riippa
acomplia
newtech
landesbanken
dtto
razored
degenstein
andoga
spikier
agism
thtat
verminators
mcaleavy
temblors
thaddis
maplesden
mckneely
colichman
semiotically
motavalli
millerson
jeeb
badenhausen
weldments
sisli
foraying
szumilas
primelocation
polikoff
freezed
musem
garuba
villacarrillo
benlysta
siopa
ineffectuality
altink
ddoe
declassifies
mcclaskey
npsf
compartmentation
leared
santiz
dragulescu
illusional
midmonth
hoodwinks
bofo
ogolla
ahler
uncouthness
zynth
successories
limet
whibbs
fochler
aahoa
conatser
tibbermore
kirketerp
maydon
saintil
aggreed
uitsig
drumless
artisteer
schwencke
dungee
nonliterary
atalissa
swingmen
eithan
jinjer
crowston
professonal
ostalgia
zalul
odein
callconnect
okland
kufor
solidthinking
harira
alliegence
dejana
ffhs
kurlbaum
evildoing
supersuit
hortle
reflowable
heenes
costumery
uclear
litttle
spirtual
sherraden
iakf
argentian
shontz
pipperidge
dispenza
dramedies
illegalize
damüls
boroujerd
okerson
vinyan
slah
diffence
toddles
perschke
langdown
consumerists
llanganates
jealott
overpacked
mohagher
ameerjan
furushima
hooplah
littlehey
benalmadena
syrias
stoneyholme
almor
pursuasive
moddelmog
missel
iwuh
batjer
ksbc
traffiking
gorily
bhoyrul
omneon
mamoud
jogmec
autocentre
unbreakables
intermune
sittercity
prosectors
morganelli
rapex
clevin
itamae
pleguezuelos
composters
bathie
polydoras
mukhim
selenochlamys
qcdoc
rejer
bahave
arlinghaus
stylelist
multicurrency
matuszczak
muttathupadathu
kirschvink
bogeying
baudier
twistgrip
janicek
chubbier
samareh
birklands
aricom
distastefulness
arnwine
fawcetts
toorale
mantese
dannehy
momentus
kelsy
disaccord
hoerth
nogaideli
ballardian
lifepak
lochbihler
agwunobi
suretype
pily
eyrow
shibil
qflix
angelically
overcount
lasowski
wember
aunti
flamming
productization
santiagos
healthywage
partow
mistating
curliness
liangs
secac
fundholding
tribrid
adeoti
candance
chavula
inotera
sciencedebate
lillehaug
kirbo
lobmeyr
stebic
weyts
hajdib
abdellilah
roedad
itabo
nafri
shpigel
presedential
publitalia
keylee
lisamarie
clich
disgyblion
gandolph
perlmuter
delterme
helgelien
colontonio
hakura
sugishima
hovid
paddypower
lbbw
wildcare
mokane
belloto
sinatro
devalera
postitive
reethi
irobe
chapins
cornale
imrpoved
lincvolt
arived
idexx
matouk
papile
ventose
baldragon
pribanic
abayi
everydayness
piperlime
omama
bornaz
footcare
pritty
mangatepopo
forne
deltav
dzp
castlemead
micrometastases
guipure
santamans
noncommital
trisler
schutzman
reneker
klender
anung
corniest
neidig
unchronicled
vandell
nashel
wälde
fehrbelliner
skmc
udink
sensibaugh
orfani
poblah
nikishov
rookmangud
mandina
hectarage
khatem
uncashed
caricola
apearance
ooohh
timbercorp
dupattas
höweler
huntsmans
gandur
pollutive
athra
lovedean
hankiss
handleless
zimmet
gropers
megafan
stirewalt
portguese
depolo
katsusuke
barrella
sarafem
multiaxis
cloudlike
mudblood
abramenko
johnell
batchley
bosiljka
leesong
hydroxycitric
lipsman
hypnogogic
swishy
hostessing
angland
arenysaurus
schnoz
strads
tuaca
briberies
superchef
millgram
hallucinogenics
dooner
promisses
bosquez
erbelding
antiterror
dhonau
cupholder
xiotech
raynella
hoffi
lambke
watchability
moark
pettite
whitrick
smartbooks
michter
martner
bromhall
deputises
romiplostim
balsawood
nonlawyer
gardenview
scpp
cunit
bébéar
runnalls
cmim
jundal
frega
nuvigil
panichgul
thermoses
euroset
sanzone
craftsy
thurbert
spinco
antiheroic
bocarsly
festooning
ghazy
reprehensibility
dunadry
conell
opina
bagudu
lowermoor
lopsidedly
chynlluniau
pontfadog
sidaoui
sinnadurai
adspace
targetfollow
zooborns
iskandarov
chiso
meary
stylista
eameses
apointed
haindl
gorrill
mulrow
croonquist
exagerations
lokshina
proce
pinior
bisgrove
wersch
autralian
walaker
wimples
sapirstein
jital
roydell
ezchip
pranee
smilowitz
scorseses
stroeve
sumaq
extraordinarly
yuanlu
nonoperating
sugardvd
kaufmanns
henmore
bootprints
parlux
wheedles
socarrat
moughton
basketful
gahcho
korena
abusos
domicelj
schanzenbach
caveda
suhrab
helmstetter
fatalistically
vivara
sozzled
bubriski
huette
edscha
powersliding
pitchcroft
haffadh
timbol
aidsvax
monifa
vesturport
gravenstijn
chiggy
dineequity
lyubasha
clipa
webair
wennerholm
ahcccs
viniar
guillebon
corniness
zakay
citified
popule
shushed
patientpak
marranca
rolexes
laplata
wasteground
nadca
labau
bosnic
rabern
voltaren
maroteaux
cofman
hefted
chalencon
disentangles
uncessary
ninepins
salissou
shuklaphanta
armishaw
sohus
ricardas
chalcraft
minnesotacare
goave
netqin
lowit
britglyph
didymosphenia
gcwr
abellan
dismemberments
holtec
stobe
fenics
papayiannis
undertreated
beknazarov
koloma
accordent
wochner
sonatel
podhorzer
amerkhanov
metho
kroller
uhomoibhi
cvne
oblas
rumaker
cessy
nuissance
notc
assida
schwacke
easterside
ebly
tapfs
evfs
postmatch
theraflu
keirnan
dynadot
caglioti
beuselinck
hoggy
southgobi
electon
oneunited
kochersperger
lacoursiere
organis
beinish
khushnood
brenkert
bohlke
sikura
moessinger
labrit
crabbit
lvpei
upperman
chinsky
carying
metry
royana
lovley
izulu
scsep
manboobs
stomached
incresed
lamphier
nget
laidi
staled
misdealings
coporation
disavowals
wpz
quadrilha
meints
liptrott
prodanovic
purblind
hausfeld
tuggar
twitted
rgen
cecillon
vorobiov
pinots
homburgs
critise
guajana
uncurl
prets
medisin
moviebeam
yackel
giuffra
putrescible
bawds
occurr
kimerling
ergonomists
jhai
fgtb
koprivec
sukaina
rubberstamped
uduak
clankers
weatherbird
andreeff
trabado
tianshuo
craigholme
nicas
linkov
montecore
dunsire
deconsolidation
tjeldbergodden
plsg
mclenaghan
pennyweights
corsell
moronically
haulout
femp
staduim
palanquero
infernally
kuhlen
brasside
susar
anfam
unhooks
bofferding
alege
emerainville
etbf
plasco
auchnagatt
elizbeth
herit
pucllana
telexes
siriusly
woldegiorgis
shpakov
morso
rakotoarisoa
zbarsky
gomolka
cubanacan
wodarg
controllership
pasola
yonekawa
febraury
amzn
pmds
saviana
altemio
schandorff
ingeominas
verter
marawah
kipevu
largoward
nadirs
osisoft
fluzone
kutluca
brammah
holdzkom
unawarded
hmmmmmmm
hatzigakis
coatney
pulsenet
bowbelle
transdigm
tinging
chickera
investure
cascini
lozere
mcgurran
wiegandt
wartelle
contributorily
goldhap
troplong
vaios
shingly
denoix
garafalo
asru
luxuriate
cyrulnik
boconcept
wizet
gelee
climatecare
syclo
stayne
maliyah
indicatively
weiberg
manoucher
excisable
ammori
dagze
ruthsburg
sernageomin
kaixuan
twthill
jashon
whadda
cookshop
ndegwa
bzhania
cartiere
lrads
effeciency
portney
franzone
chegworth
schit
convatec
pcln
tsgs
yaowapa
wellink
undecideds
dichiara
samiu
thauer
xeroxing
rotbart
cloghogue
dioum
somewhow
bourlanges
levrat
poundsgate
studentification
scholanda
hiccoughs
stives
ezcorp
terrone
ofrece
lukewarmly
melcon
ryeish
heartstopping
medvedkin
wardani
morphis
norwine
fissiparous
glipizide
waveshape
dolcis
hausam
knyveton
delaronde
biomasses
reinspection
shelthorpe
tular
ohmar
zaliukas
sharnol
gorbatenko
castagnède
phins
gandah
dahei
nonexplosive
ceradyne
ennon
arrendondo
mackilligin
captively
samnick
jagodowski
ballinluig
bioproduction
zelek
villaldama
compaired
vamizi
locklair
novacor
beable
flexitime
climbdown
ondeo
krup
unflappability
geotech
prokurorov
anassa
subagent
multibillionaire
bogend
unobtrusiveness
monopolises
nfrc
meterologist
tearstained
rosstat
mesiano
polkomtel
purry
goulou
scorcard
irshaad
cogifer
beatboxes
geoeconomic
espandi
fsrc
llanellen
demane
eureko
perigueux
lownie
pharmasat
sosnovskiy
rootmetrics
barlev
itit
gottsegen
shimrit
dhanushka
eurovans
claborn
dishonours
woddis
amouee
bozzuto
koreng
smfg
yipping
schlaug
roher
muilleoir
bannering
schaler
dauterman
trewby
incretins
sugru
werleman
invincibly
zachares
nalukataq
commisions
carrard
shoutfest
dreazen
szymkowicz
skaret
igoeti
racegoer
kabasha
mccadam
amundi
coalson
rossan
shearsby
vociferousness
unfelt
cianfrani
niave
harpooners
kohll
alkqn
gicanda
ittersum
blaemire
gunhus
misuser
okadas
raharjo
krener
bemusedly
mejide
batkin
arlinda
ballat
asselah
keyfob
rafaat
tobolski
paletas
korissia
dereon
weaponise
abbruscato
hilburg
ollar
toloache
dulkys
caragol
trafficmaster
ionises
divorcées
bonusgate
grafer
fairyhill
laracy
jiggered
paellas
pallinghurst
outcall
shedloads
suppy
jalolov
promethei
reinflate
alperovitch
zutt
yourselfs
wfda
hinam
thingamajigs
lecjaks
stroehmann
directconnect
hettleman
armann
nkusi
volubly
apalara
twanky
wintuk
corjuem
bokke
chiselhurst
patdown
lannoye
rizzetta
proshares
kaymaz
aidh
jackstraws
tomou
pakem
kamula
newcott
nhmf
formworks
monyane
istanbulites
luxuriates
perneczky
rhapsodizes
poetsch
josifovski
tipuric
braywick
sominex
pennymac
busari
minatory
cloudworks
aloul
nouredine
narcissistically
mesmerizingly
sportvision
nmvtis
suvaddhana
hifn
parboil
mworia
spertzel
desset
scaringe
younoodle
wardhigley
garanzia
mulchandani
lizet
shaftel
runelvys
jelveh
finansinspektionen
competiting
oakleys
spitale
jawarharlal
aarik
tabat
bosomy
kazuyasu
nassiriya
chilcoat
bidil
tininho
firstsource
defering
breglia
saimo
jaumont
ostracisation
bramscher
hyperwords
bidis
gerisch
haskoning
implenia
ashkenas
voiceage
transer
ninties
bukharians
misstepped
walizada
elvert
transcervical
cheny
soffritto
phandu
miltiary
nagyvary
kaywa
izea
linkscanner
sorrillo
crailar
payden
stephanson
mcjobs
uekawa
brayboy
jeacock
noppharat
daglas
zevalin
jazmines
busho
dewaldt
tegwen
bischitz
donike
deshka
flashily
cavaney
tannura
paperno
nevas
cablecam
speakerbox
loncaric
thonged
ablity
zimansky
ruchazie
nestegg
juakali
mihaka
pluming
mckinnely
detemine
umida
kerasia
breidbart
paskaljevic
merrows
chameides
migente
khoshchehreh
guojie
clubmakers
malhuret
webid
roadsinger
harnar
crisping
rogaciano
dowload
kanok
grca
phonelines
mcilwee
motasim
dickoff
siutation
dildarian
omeje
kakarala
boci
beacopp
ihss
nattily
sedmak
shanay
observantly
awvee
kanavas
ezen
recalculates
degloving
marusa
jewna
shyrone
biffer
infobel
airness
newlook
sadaoui
vickilyn
hydis
varatharajah
himbo
feshie
quetame
unmannerly
owiti
wazhma
bittles
ochola
radmacher
dehoff
loglogic
gravelles
incudes
yodeled
offill
qalibaf
iaslc
incased
rowenta
moosman
hermitic
activevideo
giveback
chissick
akea
dezheng
hupfield
toccare
radiosurgical
superglued
viriginia
shiftiness
pulqueria
momah
villainously
piscatella
keker
anacomp
espenshade
puckhandling
komag
maoying
ultraliberal
javanfekr
undergear
undervote
stonhard
pupus
casgliad
luperon
chalkface
tseckares
rivenbark
chaobai
niklander
bedane
henthorne
protzman
utsw
pisor
snooth
qios
ginman
tirelessness
twse
nnoc
disinterring
chidlom
hongxiu
untrainable
zvecan
wibbling
phonautogram
whingers
nerdiest
demirtas
sollano
airhart
waterpartners
plavnica
fillinger
similaun
anqiu
avalons
archera
navetas
susham
twiddy
alizada
teviothead
lanehead
steris
mekelburg
tgscom
papacostas
daith
shimmied
bitgravity
surftech
kostunica
draut
scandel
judicialwatch
mongbwalu
riffy
balakhnichev
rossiiskaya
reaganites
gavello
gopie
discolouring
hoelter
backslid
shaylor
fatimi
meltz
humanscale
shisler
miklowitz
ringos
waitressed
recinded
pensiveness
tojam
senagalese
radziszewska
arelis
forcasting
kosslick
risikko
nxy
mdba
findout
cuisiner
europia
amsf
mirrow
financialisation
agho
adtr
fidelco
lawhorne
traicion
abramashvili
bosti
rockstroh
chalal
odsts
neuroprosthetic
brcm
hartunian
cattet
burnhead
rachford
hondura
muchlinski
custodies
wristy
vancouverism
equest
auchengeich
cocain
buether
abeche
skeate
mulva
whitecrook
ortutay
filsham
ismaeil
cemagref
nedge
shatzkin
trabing
diedrichs
zadravec
deggans
btwc
mzinga
kastens
stormonth
stenches
pedastal
krabbenhoft
amortise
chipolata
yashim
hisse
maduekwe
bramanti
sarangan
schable
shreves
bukowskis
fascade
theirry
cidem
torek
chiangs
pritzkers
strokosch
rewane
speechmaker
brunis
havce
kinkan
alléno
oxyglobin
devonald
crippens
falsetti
valyermo
tuani
atvm
steamrollering
dabbl
kondek
riddalls
barnholt
benty
meyrav
sorice
kasala
kamiz
verbij
kndl
artiness
sarur
frissora
mahawil
doeke
weisshaar
churlishness
redpill
ethnicly
bamali
mouldable
printex
supershuttle
guehrer
nanofibres
crowly
panjsheri
sheepridge
napali
shenfeld
charruas
wyllin
bciu
justham
unsaveable
huntsberry
esnc
nightlinger
hanessian
treworgy
contiue
shoff
baroquely
stonewashed
wanas
katine
twot
afrezza
foundem
pushta
kenedi
moulinot
stewy
gekkos
interplanted
liviero
witsell
gregorka
watai
gchat
scalvini
completionists
baztab
tichaona
kahol
beilke
dommer
kfy
charata
makim
kreole
rockler
disapointment
ravensbrueck
elniski
valena
timesselect
butties
gratta
mathiang
lehmon
sarhat
dunka
mimodrame
godwulf
boutcher
gultekin
eurocare
hangnails
superbreak
windspire
damber
raffone
videomakers
desirables
triacetone
yakes
jonae
hingerty
paerson
folksiness
zukang
pubco
nbcf
beiges
tyrannize
strathewen
marper
harjono
skywatchers
hartsuch
sfogliatelle
mouneer
bedder
towsey
kostry
indect
dancemakers
youlgrave
snowbarger
mesilate
pinau
edwords
rizkalla
maintian
brainwork
mistatements
famiy
tehranian
turjeman
deyanira
hustead
koltes
baillères
airfast
firstperson
gategroup
btop
kingara
cobrador
lespinas
computrace
zierold
sephaka
schmutter
buckroyd
khazakhstan
sayari
tollfree
milibands
cyberdissidents
trockener
connaghan
climed
alisande
pressurises
pltp
martime
gustaff
chouly
minnet
daitch
ilness
baifu
angelsey
predraft
nicklen
unabombers
filostrat
alhabib
trashier
sawblades
ocansey
prindiville
sitski
benot
bierenbaum
bunkbeds
henington
ghosties
junilistan
soliver
pwllglas
kovals
tsemberis
meidinger
guloien
grixby
playpump
katelan
kalentzi
hosseiniyeh
crediblity
méheut
prevx
goyens
somohano
letzig
yurchenco
sroc
opensaf
bandleading
keppens
eleuthero
amerispec
winance
helwing
javani
monologists
harouni
resplendence
imdr
gurkhali
gecina
mucheke
maheras
baqaa
leaners
headful
hawijah
whovians
zione
jardiniere
robon
rotateq
kamistan
sadeghiyeh
webisodic
mernier
coronell
brunache
strutts
sumanthiran
olevsky
jazouli
nubuck
aflutter
velmanette
loukoumi
heddlu
muthui
bundeskartellamt
katzoff
nikkita
mueser
eeks
oystrick
monotonal
vishing
sedyaningsih
overcorrect
frite
ruddier
eisenhour
documentarists
mollycoddled
soussa
chabanenko
tjaart
disquietingly
cominsky
birdfeeder
postimperial
klusman
hanikami
hostless
pécoul
emulsifies
yuwei
garrahan
schiliro
petersohn
witth
houssami
greifeld
bernita
chimonanthus
minshaw
billye
freezable
whitus
simer
disjointedness
courey
punal
watahomigie
amenoff
tooted
lowfat
caadp
updrs
bearnaise
elal
dpas
desme
sadoughi
tanowitz
huitt
grillner
carcelles
jekshenkulov
axela
baskent
thonis
iseh
hoardes
serebriakov
ganier
hempsall
kaheawa
klandorf
negociants
curesearch
aaoms
thicks
troytown
arimidex
peare
magaoay
spece
kampela
cerling
salchows
abcarian
terjesen
deadpans
miraculousness
paamco
discursions
duruer
schuemann
berocca
mawete
banro
knoxes
teligent
poonsawat
coombefield
amobee
ritchiei
distrubing
deservingly
stength
supected
aggrevated
dullsville
chonqing
execustay
sonnex
corcione
gazillionaire
visioneer
michelangelos
superstretch
luchina
avici
nighbert
scurge
chimalhuacan
asalache
limeback
actwu
kiffer
chieftans
copius
secretay
learnedly
servicepeople
presentiments
theallet
lechers
averatec
firesale
gcib
baghaei
dildoes
borgonuovo
slobbers
wmzq
ntshangase
welshwoman
laverents
repriced
geminder
chayevsky
bugli
linzie
burc
trigilio
krisjanis
glanrafon
cctb
hardhearted
daood
beczala
ambastha
slimly
waltho
kwakwa
rematerialized
njitap
angellotti
khanzir
milliron
linsdell
benbihy
grimsditch
karweta
shreffler
barberosgerby
phio
defrosts
satsukawa
dresslar
sekela
bmts
ergonomist
klepac
syyn
juryless
noluthando
abdolfattah
ievoli
avamar
amiad
koplovitz
vollweiler
silverfleet
hadjigeorgiou
buffalonians
robedaux
jermareo
pretreating
kobylanski
peptoids
falkand
htere
spbu
diyana
marvez
mayiga
tsatsouline
placatory
raghda
skystream
mashkel
nadelson
haushalter
capiche
milkey
nrtc
bargewell
pizzoccheri
ebike
longframlington
perfumier
rojales
politicshome
snsc
birdfeeders
futacs
gwenan
alllowed
nematzadeh
halsdon
sitagu
durrants
adkerson
marsot
hyperdrama
ngaus
pozder
refridgerator
zarlenga
hakhnazaryan
executability
haibao
orituco
snowmasters
avelox
dumenco
nonpigmented
brégançon
mvule
dawnnews
nitzanei
lightsome
fanselow
caseback
benca
fhlb
khaiwani
oraha
laiyan
genous
knec
azzaman
newbuilds
pcpt
poobahs
featherbeds
bulldozes
ecoterra
struckman
witlin
vitacress
gharu
katula
servidio
erdkamp
veitia
jakrapob
accrediation
retendered
youna
genae
unvested
rongwu
nicita
politcally
mcaboy
amergen
banrock
scmm
hamard
coltish
husselman
alworths
nativi
gartocharn
spetman
prudie
appdynamics
questair
llwyngwern
nevisport
lionore
mazelike
swoger
reclusively
grotell
saverton
condop
clevert
kolokoltsev
directer
trayner
faqiryar
takaratomy
majimbo
teensiest
prty
tacopina
gristly
concequences
ifton
laraza
stenico
cansecwest
tilp
yeghern
tehilim
komarkova
taec
ivoclar
cermis
superinsulated
migoo
millivres
burrer
spievey
telasco
cheiri
sewai
unworldliness
outgain
seyfullah
ribagorda
vivadent
parps
valainis
communties
ecybermission
hypermiling
rovnag
immobilisers
sefydlu
latrepirdine
luveniyali
snakebark
curioser
skittery
dementedly
bukiet
vanisha
redistributors
balsys
wineke
psychemedics
drabu
gheal
zamen
culican
sannier
purja
maniruzzaman
homechoice
drepanid
aguis
enyce
oversexualized
kuboya
thaller
petosa
multispace
malandros
bobolinks
circumcisers
mortine
zangabad
subserviant
devolutionist
swalmius
atryn
pyrotechnically
kirghizstan
gracelessly
mespil
vignaroli
mutualisation
harguindeguy
sexcapades
aaid
virut
shaodong
luukko
roduit
kojola
oultram
faulknerian
tyser
finlinson
apostasies
individualizes
smotrycz
ovnand
kaprawi
kunavore
nakayamai
nadaf
djurgarden
maleli
unty
relucant
thomert
fairisle
shirttail
cwtch
anykind
hickham
sympton
aanerud
undisruptive
automony
jokerized
ethnikis
forbort
mandarake
intrepidly
tesoriero
newsmarket
jorris
driton
camaradas
amstrong
disaproval
otion
reoffended
lavanderia
kreibich
smithfields
fountainview
footaction
protuding
melal
pulverizes
levram
mollifies
compeling
unbelieveably
clariss
wangers
walkowicz
dixe
cholitas
psncr
epit
clearstone
polacheck
roils
palal
daytrana
lazerow
forpadydeplasterer
careline
daybeds
somebeachsomewhere
dingers
cruiseliner
osuagwu
hojilla
causations
wallbanks
nilaja
rossinian
accesibility
repotted
parfumeur
zhovtis
muldersdrift
stoved
cianciolo
kirnon
helinet
sarova
chiroto
oleaga
jouir
pulgram
melott
kabayeva
runnign
abasse
dipnote
lunacies
schnuck
rullman
shimmen
reasses
grimier
novatec
sasses
airflite
natano
gmmb
charrisse
delanoe
sprogs
applaude
mcclarin
tealeaves
piry
rorrison
pressenda
voje
centry
mccurrach
jarrel
pennybridge
ostroy
vardzelashvili
snda
reeboks
spinningdale
soilders
vivify
mainsforth
polypills
achieveing
fenjves
loback
dyrs
junc
angrick
murabito
guarguaglini
ibovespa
krepps
golomt
ecuadoreans
shatwell
mogil
yueting
familylife
separovic
polymorphously
seliana
warier
brethern
rnla
atheltic
johathan
servicos
panelized
marentino
vorobei
makhosetive
dhandwar
weihl
mocospace
ceragon
beared
headcovers
sjaelland
kirkush
articulateness
relles
litomerice
gazidis
pantsless
eritherium
aerus
birinyi
thefunded
cyberpatriot
webste
regling
ruthledge
bunnytown
lowfields
chapagain
taichman
zucchino
danuel
libral
lazek
midprice
mblox
derraugh
hehman
hambastegi
tangelos
hernadi
entrepeneurs
boontje
touria
telenova
snoeren
sukhee
guttered
ikitelli
wellywood
uninstalls
nonconstitutional
riesbeck
oversizing
withut
silbertanne
haimoff
phonagnosia
wedad
knovel
perserverance
afflecks
backelin
golau
lorit
independentes
shankaranarayanan
lezaun
hymietown
impove
medicinema
partrick
selesnick
kachroo
mmrf
novacare
unfarmed
babida
cryor
bakshish
betzler
mimun
blatchly
bangrak
bessies
transdniester
mandsager
swsi
rancilio
econs
enviga
sireen
ʼa
wluml
bouzigues
lohuis
renationalise
hameli
säumel
miluska
warkentien
bethelehem
mekhennet
brooksie
strepsils
shibas
talibanisation
ensconce
hioe
overturf
erikka
balendra
miscione
unipublic
orangethorpe
chemoradiation
inkombank
mudpie
interfereing
staretz
molberg
marif
chebundo
zekic
bejeezus
lipkins
piggybank
emmanuello
sambili
petojo
speea
barzón
deleg
gillinson
cardiocirculatory
woelfl
cagen
emoted
cloude
riccetto
stancioiu
gaspoz
aledort
cavadas
jhawk
elevenfold
guetersloh
shirwa
gerardin
sukau
drumadoon
elkhonon
lotuslive
lopsidedness
construtora
kirkfieldbank
jonik
roters
scorchio
steigerwalt
idelphonse
jersusalem
musikvergnuegen
leichty
enginyers
qadissiya
dreariest
klamt
musion
yacaman
novellos
fighing
migraineurs
cavernoma
linel
katsas
hamiltonsbawn
allpoint
martignette
shaquana
undiversified
resmed
shirring
hesses
offwell
crimint
sportello
intergrate
poisonously
cabretta
saidia
khh
collatoral
desultorily
benecken
wanvig
ntera
freakiness
nexuses
unformulated
hmtd
korakas
cokehead
incorrigibles
plutocracies
zhenrong
bussanich
colombiere
schnook
cyrpus
heddiw
ussia
kanouni
skelta
adolescences
sprayings
referans
rassman
wiebel
bradnum
chortled
leccisi
rephotographing
vukmirovic
fearmonger
searchengineland
mulvenon
lodovic
speedone
rhosesmor
leighann
prough
delouise
nacom
feranec
plepler
gamier
greyshkul
calpol
abhazia
ecodynamics
provenly
safod
molea
beverland
kulveer
lukefahr
unshowy
kahlefeldt
ndoka
nesper
leichtman
cohns
lupito
snowling
suhartono
smicer
casualwear
tinderboxes
hakawati
rustiness
exploris
akkawi
witb
olefsky
roadcast
denoke
sarops
ghaddafi
overprivileged
trachtman
zhongren
aiusa
transcendently
farassino
epoxied
ncfm
odigram
blytheswood
elswehere
clendinen
knujon
beistline
kimemia
bouclé
histiophryne
juvéderm
metabo
palapas
tillydrone
glaxowellcome
avdic
willbros
fabulism
kwiecien
aquatically
htsql
nighmare
protopappas
everyblock
tafazzul
guogang
pelf
secci
aquaventure
rcma
asadata
ophelias
todra
siteminder
afjp
kalyx
isenstadt
costil
rart
bonert
uniformisation
zanclea
mujava
chalat
puzzo
meniscectomy
kongsgaard
belshe
witteles
wonderbox
hegele
bessard
coûteaux
wujiu
responsibity
manirumva
buliding
thordur
hansler
darkhovin
woggles
nabokovian
capinordic
informat
bcrf
ralling
ffis
shedders
oldhamstocks
headscarfs
inconsequentially
wernet
abex
barsocchini
naqba
microemboli
bedie
telemental
rutner
maryanna
apetite
zamal
shadchan
senegalais
aaiu
panasuk
odhav
fionda
kauch
sponsler
nesbitts
anaky
malleny
gannex
nysschen
citycentre
lambells
atwoods
zourab
sleekest
vrbo
worthit
sandate
nonrenewal
hurriyah
hertle
sheenan
upturning
ciljan
elps
unpoliced
fabulis
huwaida
druglords
slyngstad
mittelos
apim
sulaymon
attunity
epocrates
nueske
fissette
ipoa
frence
metrostage
limm
vizinczey
weisert
miriana
ochr
salcito
macconnel
rbnz
nozoe
bandimere
musawah
likierman
bellanaleck
recharacterization
momsrising
budreau
helness
tideless
schuffenhauer
blumgart
vinchuca
zuli
topstar
fillin
lungful
soulet
sandycroft
wtge
pseuds
hospicecare
charneski
iming
sharlie
knewstubb
debting
daliberti
surfboarding
himpler
flateyri
jajoo
drevitch
carbro
greendown
loneragan
phedi
bruhier
noninteractive
calao
soaped
ocularist
jetbook
nercwys
kaniewski
aftertreatment
vieled
hulkster
capcities
audur
referrred
motorheads
movlud
bambuck
leastways
kaczala
osotimehin
arpeggiating
kronplatz
najid
corbat
bruntland
imafidon
hicheur
intercivic
bpxa
moini
horrach
cnst
yoakley
candella
randock
egprs
tenbroek
imperas
detoxes
cloar
segolene
gowkthrapple
shotmaking
perambulating
deliquency
nicolussi
harebells
consultas
houtan
unharmonious
anatomised
demattei
wanblee
filimone
brzyski
inforcement
nhang
ullger
tinside
acknoledge
juvederm
sefydliadau
malmoe
palmucci
maune
keyrouz
ceccacci
aggreement
yankilevsky
truva
masawi
laughers
lumene
drainers
dannenbaum
bbpa
biotechs
bouys
oqab
bacanovic
swiger
goldkorn
procyk
heisted
fingerworks
stenfors
xuesong
cvas
throwdowns
viengsay
megadroughts
insituform
sluppick
biache
papaj
claudon
gonwa
famelab
clites
ubhi
nhma
azelle
bodysurfer
muxia
baisch
rospars
icandy
hublin
zhongjin
carletta
melican
figleaves
mgpi
rasheda
stepmum
axam
butchard
daudel
tchama
usuing
openmindedness
rodah
harbon
corsello
kerschner
cantoro
pattisons
sureka
remoulding
omlts
jamine
boubyan
blogrolls
deelites
vijaypat
hackitt
morogiello
felliniesque
dfis
hartdegen
moïsi
badula
commisars
cosmit
bicetre
bjorgen
darkmarket
nadimi
maingain
seldane
lodell
kwalia
bensaid
puigdollers
straatman
gobbets
sseldorf
linyin
kazek
fsps
parwin
hasanain
behura
affort
releaf
goldensohn
wygal
improgo
tredre
kovykta
hakapik
townsen
mckimm
citco
svrs
sugarplums
unglamourous
benfotiamine
shawqat
endarkenment
barroway
denae
podkoren
adsafe
eòrpa
tweakings
raciest
olaberria
shamefulness
maeue
doletskaya
arifur
carbeth
vergallo
makombe
coleburn
reithian
mbdc
gauzès
korsrud
brévent
sangki
anaren
beansprout
astani
penparc
muhaisen
carpooled
tickly
eressos
kostelecky
kinderhilfe
soumana
jüngst
troys
porpose
inviduals
jouvencel
shibly
ffls
frieson
medtral
swanier
biscaia
pairoj
griesbeck
lubed
chandratillake
chrysographes
trashiest
laviola
endodontist
graysmark
polystylistic
disipline
glute
mcdougalls
dovidio
alcorcon
spitty
timemachine
coquettishly
easypaisa
adjame
parette
revoltingly
abdusalomov
sabathé
wellburn
unfulfillment
unshackle
mosstodloch
leonera
quesenberry
sightscreens
shamefacedly
kantra
rengen
dareen
mockel
daffer
baseliners
xata
popps
excessivly
autarchic
southernlinc
readdressing
cannolis
vitolins
tthey
neuroinflammatory
backburn
vucevic
unstarred
helpmeet
mwakasungula
larcs
huxtables
counry
lievremont
swicegood
timurziev
leeker
geurin
innerwear
ramsier
toylike
carholme
dostoevski
hanefeld
gwec
mcalexander
bunscoill
peebler
lunasa
peatfield
rorys
hankerchief
makfax
zalika
ankaragucu
markdowns
ooida
toughies
torbothie
burmania
commandingly
haryasz
siepr
alini
gunchester
dodginess
phrazes
magomedsalam
brisenia
terroris
klibanov
excells
edradour
fdep
cholesterols
qardash
clwr
processess
hedgefund
derogates
nrfc
shehong
zania
bertolaso
glanaman
navle
kilmahog
fourviere
desses
zpmc
atrianfar
hogeg
boethin
unwillingess
roadloans
smolkin
jellis
eusec
mutakabbir
glindon
caerwedros
kignoumbi
recalcitrants
rhubodach
bitrix
contrave
hoshinoya
seawatch
kusuhara
algenon
corobo
whitebird
whiteys
obair
skelp
monumentensis
lovvorn
posiva
rafd
klingstubbins
ultimatley
contagiously
minstry
knibbe
devido
natika
thangarajah
brackenfield
jingly
duncarron
faley
loeffelholz
erha
gilbeys
tenden
cauterisation
kurkul
kathee
gutberlet
tsurikov
caduet
igancio
deicer
graybeard
assocs
eglu
difficut
electrosmog
ardana
zeltiq
melquan
pentecostalists
awkwardnesses
conflct
braillenote
sevki
slippered
loadspace
armagan
autosuggest
potstickers
parasaran
bryanna
volkhard
loipa
shabbier
brinkmeyer
jccf
pregancy
surgury
kayelekera
hapis
heavan
oref
mouin
eaccess
dalreoch
dibattista
kuhfahl
clifts
damario
scrounges
céladon
wallstrip
utsire
lashanda
rlam
bakone
goldstock
humayoun
rasuli
birkrigg
fagerstrom
lscc
ofterschwang
beardsall
antivaccine
ellidge
bodis
mcilduff
percudani
iread
spraggs
suporters
zeyuan
mccullochs
annelis
algranti
allissa
mahaiwe
bidtopia
sermeq
transmodern
bootman
boloney
peisner
slackjawed
gummelt
infastructure
mordechay
picca
stuggle
powerbuoy
berings
kryptops
kildren
wittelsheim
doughtery
fessy
egeler
ieua
cheerlead
gambolling
popieluszko
drudging
rickham
saikua
yellowhorse
goddamit
bacote
mobel
daviz
sahayata
overvalues
greenworks
borke
nnlc
notarianni
krejcik
chinitas
desem
norat
vilne
rudko
househould
speciousness
dawdled
condemed
dogfighters
complicatedly
belaboured
mjunction
xsel
dowski
portabello
aberdale
dikler
cnpv
bavuma
kussman
flysheet
settop
rewardingly
backcombing
promotores
inovation
noront
pelmets
crossly
outqualifying
wolan
zensational
debilities
fictionalise
mondrago
freeriders
optiks
hortscience
mikhalevich
edirol
divyang
repentence
timane
ongarato
homu
rapprochements
dishrag
shadsworth
pulce
forign
curic
gjepc
truanting
mckinlaigh
waldseemuller
desmangles
fenz
surescripts
gerbarg
unfulfillable
broadbents
whitchester
kaskelot
kukly
macbookpro
trendex
kilmadock
cranapple
webtrust
mugavin
blazesports
watzman
beermats
upsweep
thinifers
calenick
mollinsburn
greninger
naqibullah
bafi
theatregoing
tedlow
caitie
pejovic
tdis
janiot
indemnifies
exfoliant
shalash
lethargically
lobke
petillon
clientless
volcic
xizhong
calister
nmas
tarlau
schwegman
mergis
wittbrodt
greenloaning
kingarvie
appealled
minallah
patarini
boretto
thorkelsson
cordara
echorouk
shutan
fuqiang
leukocidin
bookstock
kingsberry
asnes
durovic
implementability
minexpo
leukaemic
sdst
salaciously
mayfa
anacetrapib
schooltime
trands
muzicant
spents
jonzon
tusing
nellysford
ivys
supremists
crbt
covec
philanderers
ophthamologist
velveteria
marleigh
plesser
wymott
valueact
macgillvray
azerbaidjan
classiebawn
scavino
unrepressed
prevedouros
ebidding
looj
gaganjeet
soundmen
metallidurans
yielders
sherelle
sloggi
sagong
joydens
poliza
transtec
lumison
monadhliath
thierman
ballagas
okonak
underemphasized
sprayable
masduki
verenium
decarnin
qingtongxia
shopbop
kalogiannis
woodfree
wull
inson
mosae
yeppers
mrgfus
carphedon
neithardt
sewin
evca
azafady
stringbag
loske
sucursal
agrast
cphd
epscor
tasmagambetov
chrysalises
hanad
kurtaj
nobbled
graët
scarfing
eighton
treavor
unhygenic
ibol
parrotta
dimebon
hindolveston
boubekeur
cetos
motaleb
titherington
pual
deviney
unglue
balanoff
usnavy
bikavac
ltro
gombocz
boniva
polene
prender
waterdale
phsc
helmsburg
tiyapairat
cadhay
ieca
stratyner
sicinski
sukhinder
poncino
goeppingen
ncrg
seriouly
lscd
barbaso
aacca
mohammadou
novari
accoutred
satsias
extraordinariness
welnetham
soudant
tanberg
temistocles
pelczarski
buyt
piplica
dnpa
zgh
nayfeld
toumai
marylynne
kleeberger
scioneaux
kendric
backbite
jeanbart
manzanos
tinch
disbenefits
guesstimated
reguard
ruloff
chint
esveld
sandercoe
hissen
kmmg
csmu
glulisine
arsher
khakrez
suffrajets
buzzwire
lxk
illegibly
witthauer
hhla
menú
stecco
danli
gymslip
overstocks
reemphasizing
putsborough
jamarca
clynder
adolesc
bohac
arpeggi
xinggang
areopagitou
hidajat
enaje
iostar
mooncraft
whiteknight
paulinia
autostar
coburns
annicelli
yhoo
seediest
kozinsky
debtx
lifeimi
repressiveness
mupariwa
usie
clitsome
uaeu
initital
donnellon
rawstron
nsrp
buale
kohsar
hulugalle
velaux
bolberry
neubig
sodomising
polomka
communcations
multipane
katrena
pinkstinks
atousa
bizony
berlinerblau
cagdas
abduljabbar
rockiness
kuligowski
bancyfelin
faragallah
walper
desplanques
vialet
mutahar
husari
ghazvin
breezers
lineth
chantell
hatefull
borick
beeford
tiffins
mylswamy
interracially
allegaert
tadelakt
ethoxylate
keggers
lunyov
strazzullo
akapo
mcintrye
malaak
colombet
jeha
brendell
cazo
bunkroom
auville
hsaio
kasowitz
jribi
babydolls
chatrath
cilluffo
startingly
buhrmann
bryanboy
sheppy
glase
derryck
recenly
gamesmen
dumpleton
nostalgist
tollerate
postin
arender
fateev
intersts
silhan
deferr
odato
discordantly
rowanfield
logoed
faggy
serzone
amurao
lavone
gathuessi
kibwana
stroz
nufarm
mijke
feagans
dellaqua
ranner
concientious
nealson
genesia
siewierski
kurobuta
rosyln
kabwa
natinal
aubeck
muntarbhorn
extensification
lxp
tarrif
nemawashi
tilberg
pyos
azcom
pageflakes
roderique
acrylamides
topland
ulimately
ksha
rangon
alessandrin
laeticia
investools
kusnitz
regifted
debaty
morphix
wilenius
servranckx
condotel
croquetas
morbello
hoier
gaoith
labvantage
grindings
tadena
faisaliah
shantee
siderurgica
cinématheque
esbi
glasto
gadonneix
uninsurance
selic
rowlatts
pietta
flummox
persey
horkan
gleit
bellyflop
flexibile
mokoro
poped
paceline
ackowledged
idefense
kedric
straplines
watercar
stuggling
harrowell
cubera
forequarter
acuma
thongdee
uraemic
onwubiko
lyuda
distrigas
mehat
bhere
nothaft
agreeability
nondependent
dellasega
jiwen
altira
craniomaxillofacial
orianne
hammans
eggcup
collateralize
shakealert
karters
guesstimating
newirth
teaspoonfuls
lifka
auwal
minilabs
tatoos
quatchi
robeez
orhttp
nantie
smithwicks
seinfield
doornbusch
nochimson
buzard
bolthole
polydextrose
umprum
gsic
kamn
likably
jegathesan
enourage
pieth
pongsu
tiggywinkles
lowenbrau
anticlimactically
rivinius
sexaholic
fraccari
andrad
rilwanu
growthworks
connette
hopeland
hillarycare
baigrie
shurrab
laspada
gretsky
mcguffy
buffler
drunkest
itablet
goig
steamiest
forterra
pilic
tranquilisers
goerge
alshehri
fountainbleau
enterline
equiduct
suppposed
totzke
dimbo
rizgar
stimpmeter
alzbeta
delzell
trodding
saraghina
mulrain
promissed
aripuana
tankus
bairsto
clabecq
reproachable
securitisations
moshkovich
lemosho
stamatia
jayyousi
mitsumoto
includng
seedat
relgions
urucum
gabell
prozanski
montellier
yesica
palmeirim
cobent
sheneman
npts
anerobic
plokhov
rejiggered
gladiatorum
bulaki
ripson
gnma
qibs
thesp
ungphakorn
enimont
prerecording
programers
smokeable
samadashvili
martinat
macerator
bhukya
worklight
vertrek
carahsoft
opcab
argant
rendl
zipperstein
tariffed
sherle
cpmf
poad
jokery
trahtman
savarino
reciept
tendal
keilitz
polten
vounder
scibona
beribboned
gyürk
fonatur
akuno
loughview
giunchigliani
obvi
gladkiy
meirs
beijingers
mancillas
elfering
aktionsgruppe
millworks
oxycyte
spragens
afica
joxel
dopirak
fuerstman
flowbee
buildling
arulanantham
propitiously
natee
schaerr
taxine
afida
lionshead
abdulhussain
goodguide
factless
xingtong
aysar
chalor
jalfrezi
furreal
skimps
onyia
minnix
catam
rubboard
compèred
janiero
carufel
mitraclip
kardamili
hypolito
savander
ronnen
feagan
malcolms
reflexologist
birthmother
bielicki
emkay
lovetta
theplatform
dimetra
munninghoff
wynston
smartdrive
cordevalle
grayhawk
nationalises
songping
bryncrug
ritze
roht
psem
marbridge
pejak
fjellner
bernerd
altogther
wihda
chipeur
climan
glicksberg
pfspz
sopheak
nextpoint
homoet
viklicky
degise
uchishiba
demarches
critien
oddson
knifeman
hazam
lifechurch
crons
advaiya
perroncel
nordeide
sirignano
gabari
lupski
rilonacept
kasuba
frednet
cabbed
spaccia
goosse
sklamberg
stelt
obousy
gregorich
mousketeer
pirtea
whalebones
hahnium
poidatz
mushonga
macguineas
wijenayake
conradian
undraped
tashichho
thosands
munthir
mandlikova
gollogly
jaiyen
nexxt
hoshiyama
telecommuncations
plapinger
chrisopher
webchats
amanor
deflators
mbrace
springsoft
paladina
uzcategui
uyttebroeck
aleshia
saffarzadeh
egelstaff
odiousness
enertech
garbhan
misdial
scarceness
demetric
fromages
yerman
altshul
surveillances
dashty
stunder
prytherch
hanqin
cowlicks
jackton
ubergizmo
narte
delny
hospitalizes
strews
jessberger
vulindlela
tresper
megahy
absard
cynghorau
esrey
tumori
furballs
lowys
kadkhoda
heddell
lesil
maulings
simpley
ghota
cormet
whitehat
mammalodon
unbuckle
manfo
atteveld
chaimbeul
kaulkin
bellkor
degaris
endoscopist
polaner
caulkin
illycaffè
hyken
hiesinger
talarion
kasbahs
zunga
xtremedata
honomichl
kandarpa
yixi
carlby
defusal
colish
lagdo
khazaal
hoverman
resnicks
enrgy
bezabih
maryka
donana
jumpstarts
soloski
bowhouse
wozniewski
demonises
panthar
unbend
macuxi
unbolt
dorpen
abutu
swith
dochow
conata
cipf
beclomethasone
cyberlab
noncognitive
czop
havil
fagged
kurtag
hillhall
crofelemer
gretkowska
weatherize
gnps
abili
bluelithium
schoepp
muumuus
thickish
millerston
lalov
buhrman
totted
lenain
knickman
obah
shgc
chlordecone
yjb
wuyep
circularise
saquib
courjault
idga
tuscano
stropped
yadavaran
bousted
mevacor
warmenhoven
ecumenicalism
lelaina
krysko
caesareans
dehorned
genuflects
trilene
lufrano
alaneme
picafort
popout
squealers
repressurize
coakes
underware
peenemuende
duplicitious
balasubramanium
puiforcat
streetdancing
nanoflares
moaners
eyeteeth
herdwicks
petrolleri
passaged
beznosiuk
pretorious
npap
wrighty
oforka
currnet
assistan
antiaging
bhol
francileudo
landstar
limns
foulkrod
nerines
nowosadzki
endocyte
retchin
jideonwo
detangling
suface
winfields
sirotta
sacrafice
ghilad
rawkins
panoptica
kryzan
reminiscient
sapar
guoman
mubenga
dought
geneses
chocat
tutukaka
ajok
puplic
patasse
mahanay
gunshow
nanobio
kruszyniany
degiovanni
kambarata
unforgivingly
dockter
anothr
vannuchi
hudsonalpha
perovich
maziotis
jammyland
treeman
crosshands
sadick
duckhouse
sarvey
barod
keelhauled
rght
luhrman
terroism
fedflix
venoy
konstatin
shapings
shengyang
coastie
gruca
cuete
leacach
hardfacing
rashim
playpumps
kymry
odendahl
methold
relending
mariet
chruszcz
dalmation
mosakowski
gaitley
lisnarick
yanci
biniak
carrim
saurez
soffin
dfferent
backhill
philagrafika
tching
herreria
libdeh
waxholm
trieschmann
isoglossa
cambar
hirschkop
freaney
postindependence
fontis
houndshill
vastest
rimondi
pandeya
edinbane
tayab
designtech
wassit
lianwei
vaccarelli
salpetriere
sógor
womanless
sibl
cadec
huwwara
nergard
uchibori
wondershare
histroical
stubner
influents
zoepf
weathercasters
lazarovici
lewannick
bringewood
samtech
picaridin
wiancko
slitty
salivates
bombmaking
kundor
enemo
naptip
dewer
shukoor
mahachai
sivananthan
timblin
svanoe
icelandics
goerlich
xianliang
organovo
freshbooks
synterra
haminu
cerza
movellan
coolfin
bakhmina
axten
gynaecomastia
coastkeeper
federalistic
yordany
quana
aristede
urfer
wenzek
bibipur
alerion
devloping
kusile
liasion
ksiazek
murwanashyaka
kolaj
cotarelo
breznican
lamoureaux
cemt
unbudgeted
guangyao
costumiers
ratzmann
aufdenblatten
clannishness
rustock
kenidjack
siegsdorf
tiredly
depersonalizing
rosendin
akdag
ripi
pourers
politicked
titantic
oponent
mokoka
sambucetti
gudaibiya
roenne
prelimary
sabaj
jampolis
zootv
taramosalata
olopade
contorni
bodgit
starite
yateman
ewallet
ecography
kiriakidis
suraqah
orangs
unificationism
giridharadas
aulc
dominatrixes
gesturetek
calçotada
orajel
zwiefka
mansewood
klingholz
stjames
outgrossing
thackwray
devassa
dogpiled
swormstedt
oppostition
despites
bounciness
standfest
soupcon
irisys
glicker
bourdoncle
voegtlin
cyclamates
kopane
gobbet
rudimentarily
fornicated
etidronate
ouvi
candidiate
powerlist
woessmann
damnatus
tozzoli
osetra
despagne
guirguis
wisecracker
morineau
kloska
clsoe
waverers
maroda
luhuo
passailaigue
runty
magnabosco
schoenecker
relativisation
craneway
candleholder
srokowski
fumigations
mingji
tellkamp
terrawatt
rashanda
medicins
hugya
blindley
dimson
underexploited
lumberjills
lachter
borui
eastcastle
marichka
jakartans
juurlink
kaffi
firelink
krolak
confield
aggrandised
webload
silverdust
broxted
gamman
momument
ezatollah
knur
bacille
kinkiness
thunell
theya
sametto
recission
kiselo
mfou
ispan
eigensinn
goeff
makarapa
easyhotel
govenrment
mancation
iacoboni
mackeral
lcpc
cafetiere
hershon
postgrads
shurpayev
tamro
augar
eickhout
octobers
demke
appreared
vinalon
fizan
minicamps
durfy
reboarding
jalandar
comepletely
bilderbergers
requalifying
dicipline
chickenhawks
moamoa
tiné
ministerships
anniverary
prompan
shaub
myllyrinne
speis
hankus
dührkop
khataba
portnaguran
tibula
zenter
triscuit
edetate
whoonga
najmeh
cratic
excerise
tepel
advo
zummar
jollier
genlyte
smyrnium
juola
getlein
huziak
manjaca
insidiousness
rasilla
delpuech
buruca
ghx
helyg
pajatén
broomloan
provoste
funtwo
jubilance
girmay
dergachev
magomadova
quieroz
nalbandov
jaakonsaari
legspinner
kapito
mathais
hitn
kabinga
ghneim
rembold
indianopolis
evli
wirat
yoostar
baghmati
tibber
nateq
choupana
hlth
baiqi
unbuttons
pennett
numhauser
szczerbowski
unequals
eplp
tilborgh
dairese
sharkbait
uejf
braises
mcclennen
nhin
quitlines
moroseness
resat
tasawar
nkan
cozzie
assymetric
daphnee
natb
clopay
kopatz
sacramoni
tmrw
supremicist
gwir
grigoli
winterflood
leblancs
petrano
tky
strimmer
chingoka
schumpert
immaculee
rouwenhorst
abdolvahed
mbilu
decadently
essola
moisturisers
edham
ieep
ramsus
lushest
wtwt
spongecake
ouevre
atually
aftel
stofan
bullheadedness
hanescu
pediatrix
footling
mohmad
kyree
alpharadin
bellco
wiratchant
souman
ballyoran
himelblau
saravanapavan
pasticceria
ceku
emmigrated
nonrepresentative
wigging
pedde
lakic
manoussi
nonlegal
mifeprex
ceratizit
schupf
mzalendo
runyonesque
kamiko
neurochem
eduventures
limewoods
suspcious
feelingly
hoogesteijn
weliveriya
gssc
fitfinder
kamagra
floxx
prapas
photinos
lebioda
konicek
sentimentalised
linpac
poyon
lingholm
roadford
ronbo
souaid
ularu
minkova
chrx
ozzies
overexaggerated
abitova
swimmy
samanez
szaky
koche
rewarmed
lebherz
kwatsi
celing
olofi
roullier
silverspoon
jawzjan
stangs
vanpooling
ensuites
bulletproofing
somodevilla
birmanie
playnormous
mysimon
flinstones
rahimpour
unparalled
elkinton
omda
amscreen
disect
inaugurals
jelmini
woodstoves
lakhdaria
ruimin
writeoffs
snoozefest
bowcock
goodhall
reinholz
yatauro
azk
interhome
bulbrook
carnkie
undercoating
binmen
reidt
kanshin
lauture
chondral
skelhorne
upgradability
carten
ribadier
cosigners
papell
breadwinning
disharmonies
pepu
caucasion
wishfull
bemand
baige
olivennes
elaha
haralambous
ahmard
sugababe
hmap
fataki
embarrasingly
boesche
arleo
chentouf
hertforshire
overtrained
miell
ghayasuddin
freshpair
penel
carmichaelii
bexon
homerless
wilhelmson
aturu
indianan
oldemiro
venkatasubramanian
eorpa
staerk
healthconnect
ekanga
voisinage
boquhan
lossed
voyeuristically
solena
todday
sobis
mundulea
lanoka
kidsfest
abdusakur
skolrood
balnagask
novicki
innovis
faifili
stranton
joselin
procyclic
persuation
pécrot
tonik
chemaly
zachanassian
ferrexpo
baiyangdian
stoelting
teletrac
inalterable
visionland
overholtzer
costanoa
whitie
kaffiyeh
toqueville
scarlite
turst
darvocet
mewbourne
cristalli
rangzieb
championsip
farmaceutici
petrolheads
zhikharev
teisuke
mosaid
schlow
hyperlens
ockelford
berkett
ksentini
cheeseheads
pennslyvania
gouts
konecki
ashray
reengagement
mogadon
prepaying
neugeboren
stefhon
hengjiang
shireman
crusing
dcrp
scheuneman
hussani
guyomard
spellbrook
abdabs
shiftas
dawra
infracore
herenstraat
kosmidis
subandi
marmie
galloy
amosu
hyomandibula
stretz
parkfields
ipath
hufbauer
mitia
hufanga
ebonised
lankey
cerrejon
mainar
gudmundsdottir
vinyes
newhailes
ycg
ouvriere
tracheotomies
altarock
deboning
wisenheimer
highish
bedlinen
cpea
espnhd
caslen
hussmann
uaua
bovensiepen
tunit
troncale
hornqvist
lbbc
iberiabank
mkhondo
tuttis
piggybac
twtc
kungayeva
papastavrou
iddy
pprs
spurk
placeman
postglobal
lards
microalloyed
fattiness
berlinecke
klair
kness
coge
dehm
michole
xingjiang
chikari
noncardiac
mutiah
tressie
sprengers
algore
omidi
chongyuan
foreigns
broatch
skibine
yeonan
egality
rocken
thorbeck
rathfelder
generalisable
hemingby
flouncy
menichetti
avonwick
rellys
botherer
raggedness
rootsier
paktiya
morua
lessini
portégé
zelenitsky
mutally
roudnitska
stoldt
yodle
nitec
kinsmon
obliqueness
cutrera
heuermann
zhone
acapela
albade
nicvax
indacaterol
ciarrapico
zamrak
kacar
vomitous
undcp
frappes
astho
cmmc
speedballs
safelink
mashatile
londiani
skivvies
lateralised
nonreturnable
geekchicdaily
pouted
scruffier
predix
raniya
kyba
deterrant
aminopyralid
lember
elica
surefooted
preened
addling
anchorwomen
tomabechi
ingibjorg
birthler
nanosensor
connnection
mattas
overeater
falstaffian
bidh
baralla
theola
busaba
mocktails
tiptoed
intereted
obamae
ishiaku
nhms
wringers
codispoti
miyase
rybarczyk
cspn
millons
rompres
hogh
zottola
chanceless
furudate
larot
barzansky
saiccor
yardenit
shareowner
stratoni
baodong
rubicund
expectorate
zebro
nelva
aproximadamente
santizo
garbarski
dogileva
instituion
kousseri
schwark
wincc
duaa
dunie
megaresorts
lejune
kildow
dittoheads
ccsi
andelson
werer
haynesfield
ngawun
trupanion
themelves
dinkas
tendenza
perpetuator
adetomiwa
dvur
nucleators
townwide
ilpa
pathwork
mesclun
iuda
winstel
sèze
qingyao
killiow
stepgrandmother
dozzi
hillebrecht
raducioiu
zegerman
individualise
rusbridge
arpel
yovia
gbci
hazmieh
miaoke
marander
intitiative
ninfo
dordick
illegimate
tabarre
gsol
reddox
provate
vornamen
battara
nareau
prestowitz
hyma
wouldent
diminshed
sajko
mudders
molosh
teola
ekulona
gavea
kotlarsky
crdt
camaleon
encarnacao
bonizzi
shunhe
lthe
ameringen
hatayama
smithberg
comapnies
hedgeable
arngask
ogoegbunam
absorbingly
neuhouser
fadipe
ramekins
amnd
broadweave
deodorize
abduallah
kotchian
saucing
estheticians
canaletes
arcelik
fourtrack
chiefswood
digimax
speckhardt
emmissions
ivanovi
mahammed
masiel
quilombola
solvej
cornor
monarrez
maccy
cnev
misallocated
ministerially
expectorated
utegate
daqduq
tragardh
lipoplasty
rezaie
passionel
missable
indulkar
recoated
elkhounds
minaldi
bansen
aldemar
kettaneh
berkshare
accumlated
bakchich
vcxo
hurwit
gillilan
blondness
stavenger
lopucki
eastonville
cabrach
innisbrook
zotinca
dariani
frenzie
aquamarines
isqed
hamshere
steelite
northsix
cymunedau
leeville
luftschifftechnik
duckface
dalsace
balogna
deglet
zazzi
droi
mouswald
mirazon
amjid
undercooking
erulin
nagarro
abuza
assisant
swirral
corioni
yonica
mesopredators
reifman
machart
iciness
welshofer
exceedances
solkin
foreswore
cannellini
sadjadpour
buryatsky
garnetts
haidy
songsmiths
schwartzes
sauerberg
corrimony
roszkowska
sleazebag
varathan
pricol
jady
qlipso
jadwa
meharg
emeny
failiure
diddles
massood
outproduced
queslett
lipperhey
ngonyama
monix
krepela
ciparick
idolatory
ruwaili
colladay
lpsa
blathers
newsweeks
jwn
horsetrading
esrailian
retrospectivity
smalti
khetaguri
tweenage
bmad
faligot
peynier
rosemore
shaloub
weltzin
triparty
tuyn
moamen
digiwalker
duathlons
spoonbread
deustche
heedful
genium
sportsters
arrika
naulleau
speakerphones
bulcha
tavarres
serbis
sorest
unassimilable
dicatorship
boluk
rabonza
plihal
opperating
americorp
aguera
kavosh
dobley
sampas
semifreddo
lantagne
bardacke
haloacetic
solicitously
tigiev
akikiki
antinazi
aptilo
callay
teppco
recip
mgscomm
mahb
wpuld
eletronuclear
ideh
beaterator
cawp
mccoskrie
darrtown
submarining
maquire
sagus
cubbedge
lefterov
saverino
ambered
fesq
stomas
mavinkurve
corperate
algeta
zyskowski
entec
rolheiser
ioma
yuksekova
ostel
maghrawi
freeconomy
goam
leferink
bodensteiner
vardanega
shabbaz
elams
richies
refoua
crosslegged
lojka
latara
portaloo
baycol
nonliterate
genbutsu
serbinis
abbing
ironkids
echavarren
shihuangdi
xuemin
hsic
hooghsaet
fonner
greystar
coutiño
courset
dollarized
mcgonnell
borodavkin
tanzaniteone
lundeby
romashkova
caballa
malkki
strawberryfrog
queada
mclonergan
holmbridge
sliceable
lahyani
cauna
prostatectomies
arrivistes
japannext
fossilise
caroming
loxam
heimbuch
unoticed
feinblatt
sasiprapha
overinterpretation
supercroc
ellstrom
weaksauce
priviliged
ausberry
danahy
viswas
mojiva
overdesigned
kapya
dosova
idri
yongda
crunchpad
climens
blalack
tabbat
aquathon
nfwi
greenshoots
murdishaw
overeducated
laylin
windeler
dragonlike
sbtb
decomissioning
preelection
nbis
bullmastiffs
kahikina
markstone
muxlim
chijoff
toweling
cabazitaxel
tabare
gascard
nondairy
tzipori
motorbiker
letkemann
intellectualised
gorings
bronzers
klipspruit
recogntion
hurdlow
artfest
cnnturk
kufel
pyrg
senbahar
xkss
huskiness
eurosurveillance
baoshun
pernicka
ceola
emack
badria
chlamydiosis
jctc
holeta
fareena
calev
supplementaries
kitsmarishvili
kronholm
dumbya
halkerston
stanforth
stonesby
stanziale
ketterson
tejocotes
mujahadin
daintry
carbonator
sportservice
hardluck
amlf
americanlife
daredevilry
imerovigli
fieldview
shatskiy
roflumilast
walum
aldermans
salvayre
sicklen
ddss
overinterpreted
giantkilling
tomaševski
meterology
aderman
alsmost
bartone
mugyenyi
ondres
wisenbaker
indebtness
saichon
civilan
sukamto
hypercompetitive
suiciding
uncomfortableness
souvenier
gribbell
zhenglong
magaha
chernoy
bahamondes
plook
metalith
abfab
radjou
alianca
halshaw
kingweston
pipelayers
evanses
stremlau
northminster
aundrae
chiantis
intellectualizing
susdorf
usonians
marabese
dehen
kliesch
rehrl
pogmoor
doniyorov
wisan
berinstein
storum
newsfutures
segee
cyrila
phel
allowd
sutha
imagemakers
amorphously
mendelevich
rüter
highsides
clandestini
midmer
fixham
extradicted
coccoon
reassort
halamandaris
personol
chenalho
rehding
inminban
chastize
profepa
staska
levitts
androsia
golbeck
stracchino
mitk
attemtps
dendias
supercolonies
quaffed
oshitani
dergarabedian
sceme
apimondia
vehvilainen
sibani
ghaida
mohmmed
shopgirls
hkdl
recyclate
nonattainment
cfy
kaztransoil
deliu
caucaus
kuchling
ivuna
trafficlink
dicale
paerl
ihad
nuradin
anash
offspeed
vizzard
dunnion
raimer
urbandaddy
monieux
jumpdrive
forein
stoitchkov
underdiagnosis
turangalila
jangmadang
underspending
hejji
kaouk
focued
brakebill
roise
appennine
sonntagsblick
gielan
bardale
batarseh
cleardebt
avemar
escolastico
chatline
ziketan
ukrspetsexport
entomo
tarne
arova
hylenski
ogadeni
laghmani
zurmat
surcease
netherfields
kuzniar
ponderousness
wearsiders
tribromoanisole
daunis
winklepickers
guanjun
surip
izecson
wnion
sahebi
defaqto
ayouba
hotted
tsdb
sunwear
straphanger
wittur
abandoner
comeup
roithová
ropke
pranjic
manouvering
javins
anchen
donavin
sahner
wasso
hlavka
nichter
dillendorf
audri
delicta
pizzaz
disinvestments
gilmours
rollonfriday
liberalisations
vuthy
ccjs
newhills
sanguineous
situtaion
rodecker
mufa
killham
consumeristic
idigbe
flustering
bloggery
baarsma
salvadoreans
camillagate
arrogants
summervale
uggen
magradze
fannies
changizi
shamni
migrators
shaquan
ziprin
suppler
expells
doise
springham
rcpch
khalib
babynames
measey
buddists
zaluska
delaema
glander
gachassin
corss
newhill
substanceless
miquale
nooz
mcra
prehab
zucula
jahmaal
fedspeak
dotcoms
whoopers
permenter
gerten
radhouane
ostracising
lemrick
aliquo
nardicio
wanlaweyn
thomana
fangupo
bilerico
pantev
ganllwyd
rainswept
misunderestimate
liuda
omniport
spewer
randich
doranne
tensest
nonvocal
hooshmand
yesteday
governent
xingguang
waifish
lazarowich
jatto
manorhaven
kgotso
seley
quckly
natex
smerling
isins
barfrestone
seperatists
karban
crockard
intersolar
haselour
batoned
einfall
centilitres
dayshift
allegedy
ethelston
klinz
kashmoula
dellorto
multicarrier
sophiline
backlighted
dakarai
airp
nepco
shinbones
brodys
soigné
schvaneveldt
atlapulco
nwigwe
mckaigue
calascione
emuzed
mirii
fleetest
nasbo
askjeeves
facilitations
exultantly
asustada
ijza
kulski
vaniqa
bridgestones
ostentatiousness
efie
bryncir
nickodemus
slcg
ganjoo
apoligized
unburdens
samkos
upcourt
blindart
bagcho
lacount
jetsetters
enfora
dispise
kungyangon
allice
thibadeau
outdrawn
cwmp
unliberated
sameur
ovenware
prescreened
fountainebleau
cauleen
ankershoffen
meadowlake
leftrightleftrightleft
geosequestration
batirov
pruce
pocks
kotkai
braaid
naesp
tuctuc
periapt
atvi
jokiness
bizley
gdba
candylion
crawforth
challice
keion
medpro
hidenao
gesté
coextinctions
centrefield
chanrai
lamal
neuroendocrinologist
geartronic
mañuel
stramongate
kratsa
bonio
alphabus
ziesel
efalizumab
slipstreams
mckalip
mcilhinney
recertifying
poell
petrolhead
nurre
ruthian
thuddingly
zippi
kenetic
syafi
freasier
jimani
demichele
radosta
eurispes
privvy
spacesaver
badisco
handwerg
trostel
coldingley
jumhuriya
kitzman
ashbritt
freezeproof
overscale
hopless
nfrn
hesl
muxton
carred
rockwells
meglen
sabbaghian
hometree
screwcaps
chrisafis
okolloh
hording
vranac
privledge
pollmächer
tchouk
rufinamide
tasy
kornstein
jotwani
runnig
krahnen
gardephe
newn
trocchio
jxb
galadi
horrow
marcelus
bialys
lacena
jerret
lifetree
dragland
seabase
callifer
fairtlough
poulation
humberts
wilcove
youwang
wesonga
tuscani
ruqaya
subheadline
azrack
pcis
hazier
labná
striplight
blindest
manduka
donyelle
togu
capitalone
kormas
cypriniform
karrubi
insys
bobsleighing
grönfeldt
sconset
zubrowka
catchline
fetai
ganeden
dpmd
miniweb
dirico
ushcc
anually
blio
bradsby
choicer
oberzan
eliash
scuplture
elrio
dehydrators
rengstorff
pottelsberghe
nukui
jabrill
mdladlana
doofs
rockowitz
hajela
cellarman
swydd
papou
jawdropping
grunwell
laundrettes
frowick
lachappelle
kilburg
morger
treseder
elsynge
vitsoe
backorder
challaborough
schmelling
irva
nenashev
tharps
stokker
critten
fogginess
bfas
petrogal
instructionally
ufdg
kukki
escénica
outstretch
desipio
jakeli
shevach
saltzburg
sentimentalities
pacome
teetotaling
tangoing
rmif
blairites
schottlander
pezzaiuoli
tadaka
debaucherous
experiece
ohsaka
hopechest
wncg
kenshu
sunspel
merisotis
amerge
mandlenkosi
vlahovic
mulad
elysabeth
averkiyev
longboarders
leukine
muenchner
meytal
myman
thumpy
nagatsuma
averge
woodruffs
abuu
geotechnique
macabe
waterpipes
orgainzation
kolchinsky
fayers
mcclenachan
grevett
shebar
aftre
wehrey
mcinness
glenglassaugh
treuting
nazila
mizzle
milashina
vedrine
wussies
exoticness
goldstaub
guvava
leibson
goodsync
georgallides
remitter
stottie
snakebit
calagione
aviapartner
datek
zelenetz
govekar
tillikum
psephological
neigborhood
wrapp
chaddick
odoptu
morenike
dabbashi
bazlur
sightspeed
affectless
chountis
tenthani
washkewicz
millbeck
siusi
faberman
bradac
maturen
falcarinol
syncardia
bozdag
pisanio
guardianfilms
eurimene
hemmis
karsai
chelbi
martellini
kabwela
rrvs
camblin
semirural
proxauf
microsft
bmra
giersz
timelag
yildiray
nebbett
rivercentre
husick
rhoca
aahpm
jakhrani
mettee
disatisfied
recoupable
phipa
parkerization
matthe
bolognaise
disagress
nameth
snogs
hadarim
speeddate
unemphatic
jilli
cemita
tarke
muqimyar
exterran
homeserve
menking
jiangbo
torpig
caloiaro
deregistering
nyombi
gangplanks
hanasaka
lepad
skypeout
stielicke
ukriane
plextronics
ameriserv
honkytonks
khazova
mukhu
thalesraytheonsystems
ciminera
opinoin
propulsively
mynytho
recarte
geee
barnado
roofthooft
kwana
tichfield
waaaaaaaaay
charterholder
sargentiana
inquorate
teerathep
koshwal
treesnake
knuckledusters
concom
snyderwine
amusedly
angolagate
karpowitz
rrip
monkeyed
ogunsola
aurangzaib
meikhtila
prevoius
reemphasizes
dbmotion
tarmacs
surtaxes
incongrous
raij
earhole
remorselessness
chintheche
lacatus
hilgenbrinck
biodiesels
riechmann
hboi
fottrell
smatter
hronek
utherverse
busurungi
eforms
prolem
gruer
ketumile
kennford
repressively
yasith
remaind
adelene
darchau
plastinina
kergan
frba
vucci
jbala
rezonings
agadem
resiliance
nonpolluting
investrust
youwriteon
actable
equens
sunspace
vathana
glentrool
sigm
edhar
jobsites
schlefer
forysth
kassianos
commericals
wdfc
sacrifical
wichelstowe
sullivantii
veev
iboxx
administaff
allighan
vacilando
waldin
saipov
carbonfibre
babied
eliasoph
pollycarpus
honeybuns
dastmalchi
marrige
phocuswright
pintal
privatebank
jeremys
prusty
medsker
joncour
zaniest
abdulahat
nebbishy
sical
serevent
buckwald
ermanii
viamedia
makeshifts
zangar
tredup
sombrely
mouen
djau
operagoers
superwomen
tiramisù
perniciaro
merrel
wheddon
arousability
shibatani
maierato
kashlinsky
shotting
cirrincione
moorsley
loughhead
satmars
ansoft
germophobic
grandpuits
hyperinflations
pandorama
hemscott
donorship
litinsky
qualifiy
sarking
sdac
bronchoscopic
breuss
ramindra
bookstaber
notarantonio
crady
gazillionth
bittorrents
wimping
neagles
pampeago
suffocatingly
valoria
jonquils
uwem
hastreiter
marinone
daleside
pauc
yanira
cetrulo
schonbrun
benmussa
restios
clinginess
ogoo
kirktoun
lustrons
ashvale
arluck
yding
laserlike
alliteratively
oberlies
nauffts
forgivness
henredon
gutsier
rushfield
abuzayd
tacomas
dokhan
bosideng
mahoud
tayburn
glassgold
weckerman
bandic
ultralingua
smarthome
cmro
venerdi
bactroban
toddies
abrt
duststorms
centaline
rivetingly
twanged
wracks
elektrarne
midnatsol
photospread
naeema
boorishly
incontact
scheinert
patchier
featherlike
ajlouny
strengthing
altens
lifedrive
neatened
apachecon
glenochil
isbc
mudsnail
bistecca
schulp
shaat
nobilmente
junming
chicoma
hsmai
hamieh
osteologist
parrini
amerenue
lokken
groch
diby
mechtronix
rooflights
olaszliszka
micromedex
lablanc
broersen
ossendrijver
trosten
gaviscon
fishbowls
preza
beltany
initialing
recrafted
herawi
garlik
shueyville
hadian
ecouen
filise
moellers
yemenese
hokuyo
overexerted
badee
aseltine
derbyn
badertscher
plageman
underdose
minnieville
quiterio
penality
danahar
blommer
fdml
ribbins
airscarf
masterbeat
ochang
photogs
addisu
goodbaby
sefularo
nopalitos
spadoro
cleggs
harryville
marlias
unposed
curanipe
dulken
utiashvili
niederaussem
unpurchased
richardon
kamlish
enzler
astroturfers
maaden
boraie
milthorpe
cotteswold
dimango
haefele
uchitelle
outstandings
candleriggs
showeast
cakewalks
sipunculids
zinkan
peponi
gruffer
willhoite
verimatrix
bootstrapper
bossanyi
cyno
vorapaxar
hydrochlorofluorocarbon
bhulaiya
touchier
oohhh
spiccia
bulyga
rattoides
rockiest
yunsong
websurfing
regelous
fresquez
rumfitt
playnetwork
councilwomen
celestially
yoduk
liqour
fenves
edifecs
grandmom
ngumbi
phenomonen
nsti
funambol
underdocumented
meshuga
happpy
irise
cucuzza
goicochea
galatasary
rangwala
ploegh
cornetta
schleiff
toulemonde
lozowy
subsonics
tenorist
teletrax
thathe
varaut
muwrp
tabtab
komatsuna
mcgrigors
fratti
rivarly
womba
wimper
mortifies
desseigne
affliate
brandons
waterwells
rescources
ladram
bellando
wintonensis
aoci
perol
mlib
jebran
nyangoma
npta
villifying
bemf
baumgold
teliris
glendermott
madalin
digestions
berlex
prodisc
aarto
stellent
bedrolls
yepsen
seckin
clampers
eljahmi
astapovo
petalotis
cashmeres
locy
fujirebio
idid
januarys
reinvade
ampatuans
lazevski
fcfc
buildability
microneedles
flatworld
guangxin
sirisak
djemma
kaide
opportunies
contemporized
neuger
sakhizada
snuggs
mmna
bordell
travelsafe
consomme
delizie
dimeola
narcoleptics
talkes
koniambo
kondanani
ninestiles
veihmeyer
baisalov
homeliest
guyford
wijsenbeek
implys
mcgookin
passholder
godlingston
kuwadzana
qinghou
marketgait
capol
digennaro
homotaurine
unequivically
interbanca
squawky
laywers
goana
codjo
ukrtransnafta
sutureless
halaweh
delgadina
sharpstein
leily
mirenda
morkunas
clintonesque
airpatrol
cullagh
bromsberrow
degressive
compells
windcheater
psephurus
outdates
russomanno
glaciei
cartoneros
eileanchelys
corvel
kickapps
juvenility
qoba
lysek
thiab
knudstorp
xinliang
armorsource
quikpak
looniness
wonderbread
mezain
taskent
serostim
portugeuse
ctwg
sweileh
shorebreak
hystericalady
sitesearch
asaps
nonpotable
wagnerians
malpani
ambulancia
fischietti
futron
viticella
pasierb
unbelivable
crima
poppens
stonewater
vanecko
dohany
koelbel
sentman
zirp
lyovochkin
erento
jastrzembski
iguaran
fishtailed
tilstock
quartarone
bogdanchikov
tehseen
tashia
rolark
rhestr
newcastles
hallab
sleaziness
unbreachable
hktb
pulickel
irrs
difi
cushnan
concetration
härstedt
libbers
boseley
industies
thorps
embratur
megapiranha
modernises
vagabov
kolarska
brightonian
freedomcar
gilvarry
willowford
heiken
yalcinkaya
rpro
lorriane
intergovermental
monteblanco
piringer
primally
dunard
oglaigh
coutnry
symmetricom
foodhall
underwhelm
nerb
shahreen
riesberg
sibbing
nemertes
sensorless
socializers
worldwinner
palwasha
avancer
tambra
kortekaas
bravelle
deschenaux
halozyme
mudslingers
lameda
damaskos
merrist
nicoderm
chowing
mceachen
harisa
flammini
wahdan
bartron
zims
flowerless
riexinger
majaw
incovenience
tabermann
tamperproof
reallity
wideorbit
larkrise
dosky
dipirro
ndfs
taliglucerase
cramster
gonnerman
gajbhiye
emling
shayma
tlsa
eliota
srisamutnak
gulworthy
kislov
slackistan
kirkendoll
riesenbeck
disengenous
reolysin
comcare
magied
kräutler
urbansim
htey
tadjedin
pakastani
caucusus
bigoli
jurcina
frontload
yeywa
charpoy
caramelizing
languorously
greycon
trialpay
walth
penchard
ziegelman
snauwaert
donnici
misclassifications
prakesh
sorian
netfires
verbillo
ashun
mifumi
klaris
darfour
subcommanders
kintra
dileepan
hafith
paulusma
zerp
auditees
reniec
monjayaki
strubegger
hybášková
taquería
xpertdoc
cigdem
childrenshospital
begrudges
increasinly
blats
crissman
leered
suwal
eschewal
plantswoman
hanses
casmoussa
etant
rodrigez
ajarian
sharhabeel
dilettantish
minnijean
saydiya
tartlet
videocasts
deaville
yielder
sikich
rauseo
pompy
wasent
raage
designware
brka
santisuk
efforst
misbahul
klauk
streetwars
henkelmann
drusillas
paillettes
genise
unconsoled
karaganov
pcmm
actionaids
stubberfield
maekyung
madelynne
borsani
hsinchun
filmforum
bestas
treelines
staunched
kräusel
ogac
denplan
chlorophyl
gueorguieva
hamoodi
putis
malphrus
ciska
abusada
backseats
tembu
mysogyny
zeitchik
wanden
libson
loebel
tsaritsino
burmis
disapearing
unrenovated
gkj
haizao
tombolas
capdevilla
soeul
tissanayagam
nucatola
huanqiu
larget
jctd
conferee
nondrinkers
lyndy
livek
windburn
asell
lstr
sunders
dukem
uneducable
stanwich
rahmaan
forkas
chilliness
kapetanakis
unisom
kambriel
nawagai
rissient
besly
jeangerard
narmeen
whatling
citzenship
toase
glenaan
notetakers
bagnone
bibba
cnnradio
corrugate
dawaa
monopolism
hyfforddiant
hookwood
snowboarded
redmonk
henricksons
elmu
denodo
ghafir
deeson
pitkeathley
nairns
renuzit
healthpoint
yangguang
payack
zabola
japex
ampad
giammanco
bestriding
teag
celerier
disatisfaction
loquaciousness
rothnie
degrey
visualsonics
horwits
gubden
hallandsås
unpriced
seventhly
solaraid
grosskreutz
belcombe
garica
follick
sekoff
alkylates
tariceanu
wilkis
cianchetti
surachet
beachell
niedzielan
eyesocket
buzzeo
saubade
buffaz
zojirushi
mcfe
azmak
coricancha
hateg
graffitists
dismantler
uncorks
chavance
silim
schabort
legent
shafiqa
gtaiv
rtpj
klaidman
printy
feles
governements
transeuropean
zippity
samanna
tagliero
nachalat
rockresorts
getups
glucocorticosteroid
cordoza
anyadike
arbess
goffney
liquefier
bedruthan
janesky
stetser
reschio
gongwer
cracolici
charbroiled
duskey
assailable
adamatzky
abasov
faceboook
gudel
rosefish
vizza
cafolla
pacificist
azuaje
sagent
limtiaco
directnic
casteneda
citifield
elektrobit
overinflating
wheedon
pellicori
preauthorized
chilmanov
brexton
googlemail
kennestone
colaw
poussepin
greendog
snowdonian
aagl
chakai
cortec
aurandt
megaplexes
cdars
bigshots
ballybeen
joojoo
mgps
suppling
sizewise
entekhab
holestone
santagati
saifun
peavoy
kilicdaroglu
teleplus
nonevent
coinings
dismisal
ndambuki
bioparco
vists
fortunatly
stibb
sarahpac
artt
paxo
intx
whyley
havazelet
bazargani
harmeyer
azedo
tomuraushi
ecologo
kagans
sayable
tsiklitiria
boness
navtraffic
develpment
tenderise
wwhi
poulis
kurdsat
schuele
marcinowski
macdissi
rankling
paulite
postolos
tappah
gousis
whizzinator
utma
toftwood
fratboy
barbarianism
youtubing
temporize
marylynn
compeition
astrakan
hrbaty
oldag
archetypally
mumbere
dayon
vancheri
yazji
karinna
ipredator
menstruates
moneduloides
wolson
encouragment
lincoff
jasman
veitel
mosharekat
buyuksehir
bulkiest
derosario
ohsumi
grapey
juxtapositioning
alsoswa
canabis
mpoc
unconstricted
birah
folmsbee
wharfdale
taghreed
oumma
anthonia
yoink
expediters
cheenath
luminas
wanrooy
onglet
cityboy
poppadom
musuems
taurand
stadco
smartops
issenberg
nitties
catalhoyuk
bushiness
burok
branzino
razorwire
ccusa
lfepa
stenild
idilbi
ballymany
sherell
haffield
bapela
outterside
qindao
mistargeted
olders
grugan
bazetta
nanx
changwu
patsalides
teisha
zimmerly
chambost
scowled
masalskis
jaggernauth
enfeeble
tousle
allsteel
dratel
erkesso
guliani
qingguo
abdirisak
letko
gonorrheal
kaenel
cilybebyll
predeliction
rissani
simolke
beglov
barrise
vannina
gloatingly
pathography
montrey
lickhill
bedzin
maniadakis
neufeldt
sapozhnikova
ogorodnik
educationdynamics
jobmatch
cbiz
tapella
spellmans
bluntest
externalizes
cavewomen
bucalemu
ffrom
staibano
melaye
enchautegui
fuger
heinricher
orgnization
krumper
reyka
millpool
colanders
sollazzo
acoustiguide
vogele
maconomy
lubke
fottorino
costopoulos
mangoni
jasmon
shabayeva
yagihara
ncoil
consultees
janácek
suceeds
norowzian
strongish
bergene
bojs
likkle
hovater
mccrosson
visionart
mavric
digu
ceemea
tredoux
chavvy
turchinov
nudler
indarjit
backstrokes
digitalise
zhixue
mcvarish
agaoglu
effectivness
argutifolius
auchtertyre
schwartzwald
uberuaga
longdistance
malitia
theatergoing
mcgoran
depledge
netmotion
girlishly
countermands
bumblers
stroble
glai
betor
galthie
lammons
ukibc
edmistone
sportcoat
nfda
clatters
ioflupane
ripolin
gellir
adacel
kabluey
kadogo
nanomagnets
asadero
prochoice
shafar
folksay
lefkovitz
firstgiving
moonlike
novorossiisk
tradelect
unflashy
decopac
wyngate
avrio
equistar
uhmmm
latexes
iftc
boente
bbut
canellos
saccomano
risio
abondoned
playfoot
gudenrath
treesa
alpenhorn
minehart
kibbutznik
villaluna
carbonating
afghanistans
tsipi
gainous
scutiny
embued
pinkhassov
nestbox
chwilog
kiteboarder
groppe
stuporous
farmstay
norrises
suroso
rapprochment
sulikowski
nikolich
lousiville
safleoedd
burullus
lassco
baniyaghoob
hafemeister
krutz
hallhuber
coloradobiz
kutyin
clinicals
jiggering
affiars
mehsuds
ghlaschu
seidle
reidford
agajan
foderaro
vixs
lontchi
wizer
louiselle
nibsc
saggitarius
renetta
brittenum
sudapet
uchel
severfield
umaro
rolfson
hungerhill
dunda
appolloni
stripclub
mcgilloway
chakothi
wessi
nyange
smpc
jereissati
engagment
zokkomon
musbach
ituango
frienship
sporogenes
vinar
lyas
dolmabahce
vatapá
mciff
cashable
pancaking
garndolbenmaen
copers
hockwell
sawab
madoffs
photosharing
jarislowsky
shaheeda
privatizes
maylander
yanquis
carouser
grundys
weatherized
cultybraggan
annemarieke
taele
cbai
recievers
decis
soleirolii
hitson
itida
fortyish
escano
pilegaard
fakkah
zayouna
machetanz
polsham
waggled
fruska
dtvs
mroueh
bostanci
supermaxi
headcollar
ugobe
nyccah
kondengui
atrivo
eliezrie
billboarding
alousi
sablich
fites
shanae
blose
nilgun
perthcelyn
nonscientist
unwaxed
lemerand
bodhráns
jamalapuram
lusti
hoareau
chubukov
watandar
berdymukhamedov
mariale
impromtu
sagittarians
roanhead
birkins
buchthal
tottendale
mawlamyinegyun
kaauwai
fresnedo
omnova
lazzaris
lickona
cragged
magentas
tsatsa
cheesemongers
mediasphere
blackglama
croser
polcheewin
charnwit
nahc
backaches
murisi
lobosco
lauretti
obering
brezovan
shanthakumaran
cybermentors
grashow
actualising
borkan
thakuria
cavuoto
toisa
cpmp
meddwl
padlo
righthander
winklevi
demandtec
recalculations
mermoud
dimare
acknowleging
indonesias
liveblogging
headier
ghuneim
strey
mélida
negociations
aphrodisiacal
schloegl
nitasha
weaponisation
wielechowski
knekt
marican
fabish
fierberg
shaunie
lycatel
ruohola
opcc
beefburger
runningwolf
goatish
commonfund
knews
distinta
tauxe
masseroni
catizone
pronating
milarsky
aggrevating
caglayan
troman
inseminator
chands
lonliness
gladd
lyudmilla
didima
sgis
kaynan
prasco
plessinger
camilio
pancks
farinetti
kalingrad
gumbasia
salahadin
cannavan
visant
uniters
bodysurfers
barolos
phurbu
brightscope
spekman
fasihi
​​
natsvlishvili
shahrad
gujiao
gafi
wellbank
tuchin
eliette
scadden
smiliar
leavesley
imperi
zambarloukos
guzzone
petcharat
ghorak
rhoten
maradei
diger
gsms
petrzalka
neronha
bibbe
kfaed
cahi
oberli
roseworth
rubaish
ostergard
defenestrate
destructuring
befouling
surles
bertelsman
delissio
kassoum
gekkeikan
wodehousian
torick
korzenik
podila
gospelaires
spurtle
tvoi
venemous
kabukuru
sunergy
greenergy
krechetnikov
killke
digiallonardo
sandaig
mccleese
infrastucture
muscadines
laloosh
pucciariello
atandwa
chugay
stcherbina
afba
ehteshami
shikapwasha
cdnetworks
avanafil
priniciples
ozdil
prequalifying
ahlman
dunsdale
novich
markhams
vultaggio
assymetrical
dinosphere
advancedmc
safana
vigilence
peachment
kienbaum
dachan
kimme
littlehale
ardith
meditteranean
wnav
scotoni
upcrc
servic
gannons
daurov
holymen
takamanda
gramatan
wewer
conscionable
carlye
wysopal
sharkawi
bumbled
topfen
walpert
hirshson
essmann
fffm
orocobre
watersound
embitters
milanowski
andaloro
totaliser
bistline
sexercise
mhaiskar
gigamon
wanjek
antiestablishment
bmss
aitha
bobsleighs
aghanistan
icims
sonejee
kahmann
záborská
risal
younousmi
gajdosova
medja
econonomic
neurointerventional
rfmos
forsaw
newhope
kpene
poust
freesias
baldermann
poleaxed
playita
necesidades
hestitation
mlac
facchina
purpuse
hadly
theatergoer
gonzalito
alraedy
llai
hiropon
fullenwider
northstone
valleycrest
allena
ksis
dunain
rumangabo
beleiving
sweeped
piniero
nassan
debbies
boecher
hefling
ethereality
warnemunde
hearson
proctologists
dwifungsi
ilaskivi
galioto
shlep
capesius
amidol
thommes
administation
incentivisation
nadich
godager
themselvs
lazowska
leevees
oodaaq
attenion
bankster
vatubua
stuggart
yelped
overslade
sweded
whitworths
idama
wgbr
balistic
strenuousness
sluiceways
carbonneutral
midsong
oboeist
carlan
hautelook
mundaneness
labalaba
rainforested
stoeffler
portelet
benfluorex
gristwood
userplane
sonderlager
someof
ritish
jny
ticas
qingmei
merridy
dulcificum
nationstar
delhiites
allmon
wpeo
spriegel
vălean
brejcha
zhuangwei
lineswoman
epoisses
locatell
doubledecker
revaccination
terez
teamcenter
crisises
ocegueda
clems
causewayside
narrowmindedness
sesama
reddihough
balkinization
chiad
quinzani
othaim
ldls
wonderings
laraba
legro
acomplishment
casana
gabling
politicalization
orrisdale
gnpoc
nostalgists
verkhovtsov
horseplayers
altynai
thoughtfull
fobert
foodshare
slugfests
nonattendance
victoryland
cytoxan
teramachi
calculous
picciotti
lepofsky
sistersong
ukpabio
pizango
etextbooks
borinsky
ruffel
stauts
sxswi
ellise
refranchising
salamabad
somper
ecovative
gemany
stabilizations
hummler
commercialbank
flacons
azeffoun
charrin
yots
broadreach
raheleh
goldmacher
kelash
kiswa
dumaux
iglinskiy
earier
temedt
disbenefit
unrelievedly
vittozzi
buitelaar
mieles
oatibix
balsera
glamourised
candymakers
collicutt
arellanos
lifemark
tagines
laizer
nabakov
drakensburg
vindi
bucheri
sightedly
harmeling
hernadez
malthace
hulcup
panfilova
lilikoi
tradmark
tufaro
undertreatment
backcombed
homepride
straiges
wegh
fleuranges
gcmhp
faulkners
paravan
laughingstocks
lentol
zerofootprint
fanm
ffyrdd
slagheap
marinez
thougts
blogospheres
susceptable
porny
caried
sitelines
rosg
parwaz
merilees
cereproc
resx
emergin
cmls
annerson
dickons
cogitating
honeymead
tarnovski
coquillard
solrun
harfenist
kondas
nutrioso
connaitre
binalshibh
freetel
debarati
clocky
tomsett
fidis
intruiging
fabiszewski
tinnies
condoles
intelligable
ecologics
maddo
mowie
fautino
crystalens
trigonatus
linctus
sonys
nathenson
relock
lossada
werema
vandellos
kovick
propanolol
wimpish
jtx
funuke
telltales
siefkes
mahlum
templehall
gnaoui
palmatier
sisolak
transpetrol
surján
arolia
boyloaf
splaining
wernli
judaisation
soukhovetski
kruer
marquetta
ghtunes
quizzle
clippie
grandclaude
chryssie
eknaligoda
taie
protoge
gullets
cavusoglu
birdbaths
esaa
holoband
voluntears
perb
shofars
graymark
radicalizes
westaff
edemariam
metastasise
schweddy
syphoned
waleses
workstreams
iceburg
luzolo
tugra
langenhoe
rediculousness
lepping
swishes
portakabins
quaters
xyratex
sjoblom
pekahou
greengairs
masaro
titanothere
sarji
cwlf
britneys
nangrahar
varshons
lottes
citadines
fayek
artefill
nakhabino
reprocessors
tolitoli
amechi
ballero
rohanna
galactically
premising
dataplan
recardo
contast
foodmaker
swannee
oganyan
stoneyhurst
bathstore
sirichai
stalisfield
indelicately
avrdc
nonscheduled
aitan
gureshidze
villagrasa
shebdon
skosh
walliscote
anegasaki
brookenby
fisherpeople
waivable
gnjidic
bookhammer
sbgi
budney
bunnin
conculsion
polek
buyvip
gabbling
younggu
transfomers
izraeli
milleniums
barayeva
zhura
medishare
righs
ramsburg
czechvar
decof
propell
sayliyah
kitja
labinger
plotty
oncophage
brockbridge
aound
plessi
risg
mcpadden
tanswell
landaburu
pureeing
farries
gamersfirst
rouiller
adhab
bargylus
simplegeo
cachers
salikhin
unlatching
abdelgadir
lfrs
maqaleh
khosti
forgia
kocik
shequida
debone
musicnet
shilleto
lonoff
rutf
jornaleros
dinnes
michelozzi
wiedmaier
iresearch
internati
rouček
zalkin
kilpatric
rrez
abdollahzadeh
pickhardt
biondich
chipstone
obare
sembiosys
karabey
muaskar
butlering
redistributionist
cfib
bodging
gaudiest
motivala
taposiris
kearin
overwatered
varischetti
tenners
turkina
trebay
phema
cockier
scrappily
tywain
zougam
pantuso
urbanely
tamarra
mynachdy
jayz
cupet
blurbing
vivaldian
tochterman
mikonos
laganosuchus
maidenhill
adultcon
kiwayu
bryshon
smlc
mitsuka
phlo
overfinch
mainero
mitteleuropean
diprivan
olevia
dimmel
pappardelle
malignantly
langour
häusling
babec
dogwatch
guguletu
ezrati
zafiropoulos
pociask
kacyiru
iacovone
roudier
bolívars
svnt
koziej
syntocinon
hillblazers
rvus
greenwhich
wallagrass
pereria
chnages
uparmored
fadai
priggishness
povoledo
phindile
saimone
demattia
otelli
faddists
zackham
thurnherr
mackynzie
thieblot
whiffling
achugar
braincells
tombliboos
stoccareddo
ozin
sandcat
bulkan
gogarburn
amphistium
karinen
bailgate
unwearable
yaodu
henhouses
gualeguaychu
kotlarz
zohor
rogol
wahington
electrochromics
ordower
debossed
butrym
kandhas
xmega
nickeled
abouit
devloped
wristing
gambala
propostion
cyberchondria
methedrine
serviettes
sumarsono
niquitin
canalys
transfats
gokita
hollopeter
belevedere
shulte
freeload
danetre
devaan
dotloop
devrouax
juchau
stautberg
virusbarrier
streleski
niketown
headlice
mazuronis
conceeding
manshiet
honoury
duvanov
alirezaei
underrotated
classens
gilbraith
skypower
transplantology
ramniklal
herrarte
sfgh
qoq
readyreturn
peetu
ilts
debix
byrraju
anonymizes
newstrack
perifosine
alomari
popster
triumfalnaya
stratt
bespattered
telogis
branshaw
schwamm
babaker
quadrantid
laksman
oystercard
strompolos
broadsiding
pharmacutical
mitsutomo
viliv
flautre
keenes
muttemwar
slingback
priorty
ayva
dengir
repected
monitering
jees
schory
yehezkeli
tomilson
fathur
tatooed
benquerenca
ocrelizumab
tryscorers
interbolsa
klnlf
overcommitted
holidaybreak
homegirls
sheskey
flunkeys
bjerk
overpraised
muhren
cobholm
dimitrakis
machinimas
koshalek
shujaaz
annamay
birdguides
pgad
shenghuo
megatooth
esepcially
stanion
sheehans
vpak
makaio
jelbert
convertirse
plamegate
uncuffed
towfiq
alagem
retrigger
drivesavers
avrett
matche
mctamney
apmi
philosopical
dongier
sangfroid
tremolando
secb
iafp
oliveiro
gandelsman
uxorious
illinformed
laeven
farmacy
unreeling
franprix
westine
streymur
ducate
stephfon
ovca
soshnick
zuqar
zormat
cordylines
multipin
afcis
cousseau
hittable
itemising
kaparo
parsol
rohrback
valdebenito
ozem
mexicanalink
qmy
biscombe
hualan
thunking
bankserv
particularised
rohrman
pellettieri
famvir
pilferer
impastoed
drycleaners
weblike
khachidze
bereano
narcotizing
cordelli
counterprotest
cooil
karzi
khazaei
sedums
stasse
cafergot
linagliptin
luethi
godbee
chercover
shaweesh
schimpff
ehrenheim
plecas
loidl
inconspicuousness
siquiera
sunopta
cozying
gilleo
tokers
boutwood
bagci
unhitch
wihin
sinanian
vegesna
lezmi
winegarner
amgylcheddol
interlex
qarmat
yetagun
infoterra
deothang
partenon
spywareblaster
hilia
helferty
pinesdale
ecogen
diffidently
loehnis
phonecaption
pdns
aetr
radwaniyah
garpozis
schnipper
chloraseptic
ballypatrick
rabbae
renationalising
depreciations
resuscitations
maels
guissou
holmgaard
gaffield
deerstalking
diffenbaugh
thatiana
spett
tabajdi
cupcakery
rysavy
intubations
homegroups
trepel
mariacka
swarmcast
unsuprising
splatstick
chaudri
gbks
sathnam
buyology
godsake
popovec
kocol
usulutan
demarquette
differece
assayag
cochairman
kidiaba
kaixian
sweatbands
pemuteran
inve
uniformally
balathal
piccante
empassioned
zehme
werhane
duckstein
bestayev
hematol
frascella
rothfeld
zlotnick
tanyang
chernovsky
restring
marnhac
koeck
nubani
attacted
bavette
elran
ungovernability
quorate
soulflayer
lapook
meteab
gorvy
superfit
firstpage
bcwipe
mashouf
ermann
kozmann
synergized
annyas
beachfronts
saibao
duchampian
clincal
gruters
cliett
upswell
wenhold
langenegger
garano
hollebon
capmed
sfantu
xeroxes
euphemised
restuarants
oysho
costena
greenwire
lawrimore
reponding
oceanium
fidaxomicin
wahidullah
jillions
ylon
buyside
ballyreagh
watercross
nexity
ginnelly
dorwart
braefoot
bourgnon
brookhollow
plevin
camtek
filardo
dumpings
kotchneva
abhirup
kiesl
neatnik
almihdhar
potocari
borlange
canogar
nwj
cavic
expectency
fander
medieaval
zipcars
ieta
wooddell
rensink
rottenest
elinkine
ingelsson
louai
haggi
veyrac
tarryl
misrouted
glenney
rapers
writetothem
yazilim
bruuns
chsw
pojaman
arrambide
tripplett
feretti
reinsure
repoted
bernadetta
jdem
mediterranian
gyem
shouild
cavileer
tonsorial
poitical
grantleigh
capehorn
moeckel
aldrige
gidu
mainassara
burguieres
bettadapura
nosediving
nonsecure
facebooking
garsztka
speacial
snacker
aluu
nextlabs
smilebox
nasiha
futurechurch
horberg
menapace
rabaska
tuleh
bvii
couloumbis
eleider
demchak
labourhome
redcom
nhsmail
gremolata
finf
naame
khnata
sidoides
beetaloo
deking
denouncers
odabashian
daughterly
eperjesi
tittering
expectance
furberg
stanleybet
nardil
knakal
gadafi
strenghts
apolosi
imprecisions
counterprotesters
epaf
klimaforum
loathesome
blackcircles
overruff
routon
ketek
mehtas
aremissoft
believs
johanesburg
artemije
bradco
podber
soulliere
mangiaracina
lipshultz
dgif
weirong
kazeminy
lmra
leecia
adsm
duhhh
wyebridge
tobyn
drilldown
ipekci
aparatus
landier
trefechan
westraadt
microtrend
scarpaci
prisioners
tosoh
inwhich
detorie
bakman
girardo
hosang
bertonneau
moue
minxy
nutrional
unsightliness
hereditaries
transactors
housebreak
cabaluna
multination
roadbuilders
wyett
tamte
mecher
massounde
megaliner
mzikayise
kurcz
marraccini
inexhaustable
tashnick
mccrackens
packway
semisoft
leishan
cyclacel
metsi
lissen
dpfs
titillates
wynen
lovan
sekonaia
zedginidze
bedmates
paymah
crystallizations
tinkertoys
lorings
volc
nattans
tmti
dénériaz
puurunen
cattoi
reinspected
oldmill
dorofeev
bestattung
whetsel
noubissie
uors
mendiratta
qriocity
marsack
injuried
yanza
verino
bascara
champers
gridlike
brossette
rebeiro
lampam
grandparental
acgh
tietong
elmlea
tirin
odowd
jeapordy
hansjorg
swandel
bakhurst
penedes
hamoui
ndanusa
corporan
glenearn
steinkuehler
zainaba
kaveny
ermmm
zeidner
piccaninnies
elsenheimer
villaroger
giudia
stramash
haakanson
gnarliest
seminoff
ediacarans
zuchowski
aijun
aosda
encumbent
elektromotive
slackly
jbar
gurule
duzan
smoulders
siddy
nanogenerators
dokku
quicklook
ecwr
preusser
bunchrew
pyestock
ashtari
kurton
beninois
razaullah
skulason
balchunis
siyathemba
overshirt
iguaçú
tigereye
telhami
nacey
astroglide
gillbanks
strudels
gorinsky
codatronca
welikanda
gharawi
sensationalists
warnaweera
exercized
zareth
sedillot
balindlela
nosimo
unadmitted
saõ
ntawukuriryayo
ikeguchi
lurma
microbrewers
overwheming
kolokotroni
lubrani
shenyu
dockrat
onepulse
gemzar
addaction
velders
svengalis
machluf
ivell
birdine
fixie
kapuya
foxytunes
uberalles
blamelessness
voyatzis
calguns
kapnick
mohlala
hftp
distribucion
zubo
heinrick
ghawas
champps
respectfull
sunniness
vilday
welterwight
szen
ovec
efects
prosecutive
intervac
ocobamba
pspgo
careforce
shopwatch
posho
binationalism
rubinald
mcgree
khiyami
seckinger
borgna
teitlebaum
inumerable
confrimed
prattens
bamboozles
customises
fleeters
porfido
stoody
aahhh
sorsby
nsmb
prepme
garavand
islamofascists
unhackable
riboli
higgo
koelbl
ursala
kilamanjaro
pischinger
hamalian
jipson
ervell
bracale
cravendale
lubel
wojtal
cpls
haydan
camerawomen
kokoris
audies
burkhas
gounden
nshamihigo
cemf
daaras
foroyaa
rockson
repat
rhiwderin
kenscoff
hoselton
flewitt
oacs
sherrys
eyestorm
slavisa
langbord
switt
meseberg
madnesses
kelami
phiroz
zitty
pslc
urbancic
belabors
infuence
bullsharks
nephropathic
perdikis
rougle
srur
chowrasia
istanbullu
dustball
roofspace
quets
staginess
cellu
chadirji
madover
bolderson
vicitms
filtrated
meiff
kielt
jarich
smallprint
coehlo
womenkind
geovax
wolking
marketscope
pestival
chuet
avouris
sahiron
stanifer
malovic
symeou
enraptures
instructively
trillionaires
phandroid
govindini
musademba
nerco
curc
tunesia
manlier
domeier
chupeta
gurewich
tutka
pwnd
tohidi
sneesby
busmann
cimavax
probook
creperie
tallish
strecher
winalot
bialobrzeski
guaderrama
stamell
gretar
sadeer
salmide
chelvan
ovab
inkd
tcell
schnozz
manful
middlemoor
armures
paiche
edemar
sajmiste
yuanjie
uninsightful
inaudibility
streetwork
mcclintick
dmhc
cowpat
paterakis
zeku
helane
moestafa
illogicalities
lispy
andreen
bucala
swraj
zaidman
vilimoni
scandalmonger
rolihlahla
sidetur
guamuchil
electons
navaras
kauahikaua
mmis
nhmc
minitruck
nageeb
wyma
yuansheng
incompentent
yakasai
bommarito
shaoxuan
beliver
seedbanks
pijbes
smartbike
toecap
beliefe
halfwits
qutaiba
caffein
collecion
ultrabithorax
antzas
dizdarevic
lapdancer
cahnged
attaya
tuhakaraina
lailatul
tlil
haverstick
poggione
ardous
demined
hunko
sesamoids
winzeler
gensch
antlike
clickwheel
hanify
patricidal
elkon
expostulate
bespeaking
aspesi
charcutier
nemesysco
judicia
ibbc
iobridge
okunade
dogfaces
frostily
dusik
maggotts
minutaglio
westsider
acccept
egate
nazenin
witlings
bancells
mclehose
mexoryl
megret
gerstenzang
umshini
extemporizing
jamanak
erfle
sukova
mesmerises
ballaquayle
diabetologists
chisanga
detoxed
ressner
astmh
pugnaciously
mercinary
anbyon
compromisers
powerreviews
lapandry
gŵr
toderasc
mochamad
forgetable
razeen
idiotbox
mehmedovic
scotching
dwimoh
tirtoff
casaburi
horselike
therasense
desexualized
cmpp
qiam
allof
rendich
mondeos
gornstein
temporall
kneeldown
pristavkin
boireau
namotu
vanceinfo
compactrio
kerrea
otherton
jawing
huebener
fleetbroadband
sheriffhall
lhoknga
myska
billionnaire
iscol
blinkevičiūtė
seattleite
senpaku
vasts
gearlever
xrep
hashbrowns
tightfisted
eroticize
counterpunches
immie
boshers
denuclearisation
dojack
rubberstamping
jimmyjane
bunging
frez
moseleys
achamore
tirozzi
geltsdale
verkooijen
sakhile
oberton
mussburger
ssma
eeob
overwatching
margenthaler
raffie
bastarde
drukier
paradera
senal
altagamma
rosile
careeer
onmedia
peszka
fullbridge
auctiva
romeike
cyberethics
adcps
masoff
ahumado
ardeonaig
stevenote
setlock
picklock
seatons
absamat
spiri
landco
vocho
urre
aircaft
unlaid
brekkie
nonappearance
goldenballs
bendler
meddon
falor
baitings
atttempt
shushufindi
milbrodt
fitb
sthat
squaremouth
dutia
peacedrums
krohmer
otri
presh
overanalyze
zaldana
mudwort
photofinish
liquorish
rimjingang
ngruki
uncremated
duschl
alalam
dahlie
whimp
melentyev
shakour
identfied
kabbara
eldene
recuperada
tremelimumab
unwarrantably
kajran
guatemaltecos
kreitzburg
asplundh
chaats
manour
xusheng
sefik
adiba
lightpost
unifirst
agriturismo
shedder
wpnsa
calamandrana
emmaneul
exploitability
pubmatic
swaco
tankel
numerex
guidiville
accually
hajiri
ellite
buchko
aldai
proteolix
listserves
specktor
turiscai
badei
belfonte
diabulimia
memfis
mccuin
sidhoum
clotheshorse
nadasi
dannemark
pikser
sheqi
grare
cavinder
democratice
dretzin
keumgang
silha
balkind
completeley
tsukigawa
grynspan
ziaullah
lazurus
paksitan
tsuris
dongbang
rashesh
ensnarled
gawp
pittelkow
giffels
avondo
vanderwagen
kefah
jannarone
arraigning
onebox
erfani
chepkok
detriot
kangeroo
coppess
metalers
schwermer
tavasoli
twito
lidor
implats
khalig
antoniotti
abrass
releif
padoh
swedishness
lanaway
reliv
unsavvy
straighterline
unbuttered
willborn
soapdom
ebot
aome
salloukh
wescorp
valencias
khetagurovo
sbab
athanassiou
alexovich
dolinger
yerkebulan
gracelands
auvert
afrikaaners
bjoerling
birthland
wfmi
schepel
seideman
brutta
gezellig
zajic
freidgeimas
edelca
wolfhill
edenside
mutawakil
energycap
dryburn
mialo
aizue
skavysh
diliegro
fanfou
slome
narins
cirv
keluak
crossick
grapeville
pensieroso
beiqi
nurhadi
sezmi
maillots
pssr
beriault
atambaev
jusman
encrusts
shantal
samadani
graceway
aahp
citypoint
gastronaut
leathered
happies
aggeler
quimeras
cadue
panaf
adml
buzova
eletronic
anseo
discriminatorily
whilden
prelapsarian
duxelles
jmpr
degracia
chiberta
tadai
frontperson
guoco
sojern
nematullah
sigtarp
languard
wetherly
teesmouth
ruthrieston
medassets
ecvet
offerring
medicalised
imtoo
wallcharts
gihad
serendipities
rumin
hmmmmmmmm
topcashback
numerologically
ambegaokar
wongpuapan
vantassel
auchwitz
inkley
taurons
cepollina
fledgewing
rozenblit
facevsion
tereshkina
purssell
chatterboxes
pizzotti
mahgreb
ingoldisthorpe
vpas
mandarinate
slts
fgic
bambinos
efamol
internext
annenbergs
chandley
podles
yarnbombing
joyti
quadband
wgts
bypasser
gillinov
herheim
sociably
plebians
avtoframos
stueber
depetro
khadam
northwesterners
naturiol
pdcf
battilocchio
basavich
durach
tomasita
schaede
corodemus
chockablock
foyleside
libber
fixins
traumatise
cynnal
leffman
northeasterner
domolailai
fravel
accaoui
everygirl
muddiest
tieger
elsbernd
etrade
cracklins
backcheck
jerika
mozie
schuiling
spagat
solarize
railbelt
plantadit
dobui
haugo
parouse
labreche
tandoors
oestergaard
kentrail
lifecar
flaggs
oldner
hacan
restoril
millmore
deltalina
paladar
thorhallsson
outpowered
karamojo
mazroui
durabrand
lavely
centropa
farahar
trendmicro
explusion
electrovaya
zegveld
goebels
bebside
agilely
esapi
alikbek
meghalayan
adventitiously
regardin
khial
senelec
hhonors
boninite
goddijn
baikeinuku
bananaz
couvering
uuac
rynhold
ushguli
restaffed
enthrals
underscan
bartner
gattlin
okum
akator
sagario
mangetout
chabalier
dematerialising
matchima
spencelayh
behme
oravetz
tankering
trunkless
sentimentalize
unsated
curnyn
slabby
altegrity
tavinor
macquisten
owoo
shiat
shapovalova
entrechats
epigenomes
inculded
moellering
oversharing
nanotechnologists
clukey
grumbar
pötsch
clarvoe
inquries
geldermans
plaun
schulzke
meadfoot
shaller
superpark
sinovac
syamsuardi
dizayee
penknives
unreflecting
moelyci
masaiti
bluestonehenge
polensek
butterhead
polledo
burbine
dentention
banahene
steare
kahre
rammo
prody
yllescas
chames
deskilled
nikitta
monetising
cardas
harborfront
exhilarate
philogelos
torchi
contran
shreiber
goualougo
studyblue
schwammberger
icestone
hagworthingham
uhrlau
lacore
nirere
sozar
goginan
sabouni
saunooke
dassarma
vhda
overtaxation
nammco
abloh
vaxjo
movetis
thinkway
lousiest
peita
ullett
saydnaya
adline
anthoula
schlindwein
hoopsters
landcruisers
rezeigat
fabrizzi
covestor
farabaugh
valuating
flipflopping
jilma
marconiphone
idenitified
lldcs
timesys
equipt
adlene
niemela
puddester
shynaliyev
fatayer
shamaqdari
hudok
gaymon
identy
ziems
shiprepairers
manawi
bajil
sprd
centronia
riddall
gazarov
mademoiselles
arrse
azrouël
eymer
svox
sampallo
electrifications
lapore
horseboxes
hypres
sinskey
rushka
overinterpreting
argenbright
toothmarks
wrinn
butkov
novodevichye
bastardising
boubakar
towncar
avicena
jaekle
guggul
efinancialcareers
oakmead
nopr
guaino
gunnarsdottir
freakery
vinoly
suppli
aulsebrook
boştinaru
overnighter
densborn
obamba
nexpress
whoremonger
poras
kisik
gillotts
interdealer
thordal
ingenix
popkins
tulaichean
probl
rnln
sunbaked
muhlestein
orexigen
psychopharmacologic
kulstad
chanice
torgovnick
borishade
rahmanov
firsties
penate
heygood
ifart
acred
massgeneral
castlecourt
mailrooms
cacg
grosfeld
pcpcc
erulemaking
barrica
brancalion
takaso
roadblocked
mohommad
cornichons
alphameric
ormando
mirasierra
whiskys
tranchemontagne
tongayi
roebke
erevia
ancester
yanlin
gorefield
klaß
hardfought
reenergizing
botequim
seiphemo
yayha
ostm
tiguas
hausding
duoyuan
undertows
basima
dambazau
disqualifiers
nambarrie
sacramentalism
califonia
imagenation
butterell
anastazja
pressreleases
verheggen
shaap
bronowicki
uwezu
mortor
winthers
wyvil
seyferts
cellmark
bigaud
ethar
soveriegn
tulipani
dealtime
ghulab
appartments
calphalon
idiotarod
onepoll
feinsod
senterfitt
surrick
caloia
appelby
studabaker
businesse
insteps
salsalate
grandholm
tokofsky
pidgley
saiger
handmer
dimishing
capoco
backhanding
myrone
bluemke
adminsitration
severence
yorky
globrix
buhara
hynor
cadging
alicudi
quezadas
cylchgrawn
eurospeak
fromlowitz
jaked
tattled
restaino
springstein
japery
tregolls
flakier
suniva
arabias
clumpiness
insectosaurus
ceasars
vankor
giltburg
schillerstrom
cajolery
poliglumex
rosettastone
carmondean
nisene
heineke
meidel
violance
dabbahu
broadnet
zbot
scherrenburg
ocsw
wyszomirski
enshroud
contura
dced
kanninen
adezai
ohca
wantanabe
multiwave
sequa
tupak
kratovac
monai
finetto
akusekijima
esms
maritally
glascoe
gecad
frankcomb
peruvemba
zitka
pakul
garagistes
packa
ncdt
yevgenii
squarcini
dissaving
pellens
budgeteer
unpressured
transcendant
workovers
yanggang
kitaka
termist
greeve
neuroarm
guangwei
geerling
folklike
roebroeks
ncbm
hoshiko
jigmi
keyun
hellholes
aquent
fayot
monagh
swellhead
outkicked
footbath
roenigk
laschenova
feklistov
rockabillies
lichtveld
renovacion
luques
haematomas
hohlbaum
qdii
respec
venlaw
brandable
snowmachines
superking
bikeathon
sucrerie
sukhirin
knrm
polulation
nwcu
takeway
terlet
ucunf
kameisha
bolillos
avghi
ccrif
underweighting
daraghmeh
sulek
choosiness
lemunyon
loadman
nonpunitive
misplaying
arulanandam
fiestaware
zvents
schwam
mothetjoa
tcherassi
edesa
rotina
dijla
halilbegovich
sansern
yotsukura
kowk
fuelers
kblb
hormozi
sohonet
luchey
steinwedel
agcenter
qualifiying
remaine
cput
qustions
stoudermire
sharpy
millirems
etumba
reverand
kauranen
hermene
baneham
parentline
ibrisagic
wallboards
darfuris
zedar
fanaika
lobotomist
bijal
allshouse
accommodator
remoteview
kalymon
summercase
wachenfeld
vozdovac
smashingly
haigwood
componentsource
pizzahut
calamander
summerlong
spartoo
grazin
prevaricator
iannicelli
currenttv
fordhook
bbdc
haterade
areng
krainy
freedome
branchini
millsport
ribery
adug
lathallan
exigente
lasok
lorinc
dcpi
albader
hochedlinger
crowdpleaser
dobrik
studwork
sashaying
aopo
proglio
beizer
spirax
prochnik
ruvin
guernesiais
kwing
pbsg
arkstorm
ilano
vawts
viasystems
acknowleges
shoeprint
unflustered
wilkman
sealyhams
linjun
neigbours
upex
seeclickfix
spelga
eskelsen
intenational
dabaghi
wisecracked
selnes
forbiddingly
tomihisa
bliadhna
tasovac
toeloop
bodhar
sartison
mcmoore
hideway
stattersfield
reibman
stogel
coninue
esdm
microbusinesses
facination
foresworn
aftertastes
hackenburg
finnbar
thorsgaard
prgs
skalicky
mipomersen
rvot
globlex
beatboxed
dignes
anouma
stealthwatch
zimasco
linnel
hyperaggressive
magnana
civette
tyrees
fernandis
pmvs
egly
markelle
mounis
jonnier
widescreens
pycnogenol
dazl
oldways
accustoms
woozley
streetworks
glencaple
schweichler
urtiaga
goldsim
undersupply
lipidology
uitslag
hauter
sapientnitro
ilinois
hoxsie
chadband
beninoise
goudswaard
tamweel
zabin
oluwaseyi
flexbook
velafrons
sauvion
beynat
gmai
mutilators
unifab
dokht
mehrjoui
surestart
tendinopathies
btween
divinia
douna
faintheart
insor
garringer
nolvadex
numberof
atacand
lavarra
kribs
tchotchkes
icepacks
alagno
lateraling
haydnesque
monobrow
maizar
lightish
killifer
ghoula
jarus
uppercrust
bereave
musikapong
richardsen
chicest
nuwer
macilwaine
chards
underthrown
conzo
zephyros
tecce
istrobanka
dirienzo
citrines
shipster
untrendy
synopsize
ieah
lissom
chozick
aksentije
liveatc
digenova
capoor
rafli
stalinistic
unvalued
clinked
sarghoda
mebroot
cyllid
esfehan
zavoral
martt
mirrorbit
kufeld
heixiazi
shnider
mullahy
dorko
maresco
pentremawr
skodas
cifg
campsmount
flavorite
hosl
banwart
railteam
bernadito
tomashoff
janjigian
rmst
yagawa
talae
whitopia
feau
cristofanilli
imangali
colmers
kilbert
gackenbach
crassest
highnesse
zhizhou
retrolental
ogbulafor
teether
hairo
barket
decrescendos
adorama
reinagel
ngog
perkey
microcenter
gossom
outtara
echolot
masalin
downshifted
rosenheck
superjets
nemwang
apparenlty
elisetta
lindners
hafsah
extenstion
tulipmania
ahart
cembalest
ayatolla
irangate
gubaz
zavagno
fnpt
snowies
schwaig
tzarev
croel
palumbi
aelon
kajen
twitterfeed
fekter
priorites
zekeria
varmland
fiercewireless
bubkes
kamonyi
expiating
peekyou
annapurnas
ishitani
eilenfeldt
ahmadinajad
pelargonic
chawalit
zarich
restavec
lnat
doretha
linyekula
ncafp
terie
nexbtl
bancsystem
cybs
rangeworthy
elopements
ronks
slurps
chesnel
devraient
repigmentation
torgovnik
karnit
shekelle
olice
represenation
zhonggui
establishmentarian
conquerable
bujnoch
psittacosaurs
hiccupped
vanderboegh
walleen
surtitle
sippers
rechichi
revalation
makuei
fumba
cammaert
lampariello
thredup
bathon
makhenkesi
bavituximab
plusio
carandente
benfatto
trimega
nextradiotv
clev
gilboy
pannabecker
vahia
coquillette
bratic
brudney
californiavolunteers
paatero
stultify
toqué
kavaf
ileret
incierto
kansho
micromanages
ostle
xunlight
bardella
isleib
omarius
skrastins
mokwena
akhgar
rutino
fulmore
docusoaps
nonathletic
lippis
dipex
karayilan
sacheri
afesip
buonaiuto
taavo
hesseldahl
luisel
putsches
paven
hartvigsen
bezark
innundated
indianans
sivanesathurai
chelopechene
cyberweapons
bouard
daury
colitti
nonplayer
dailynk
neurotech
capels
respray
lichtenwalner
investorrelations
virtuallogix
jakucho
schmölzer
dlcc
kleberson
rotech
sceti
obtrude
roombas
kholwadia
cicale
playng
yucumo
ingleson
osenovo
humanitaria
trilion
kashper
kuthe
anzhen
alvine
unangst
aethlon
clintonite
jazziest
progams
jianjiang
junkshop
nungaray
yabunaka
oskoui
golubchikova
magomedtagirov
dangana
elshaug
ncredible
tisin
ouisa
lisenby
nuraini
ruecker
ebie
onodi
somnambulant
fallu
unjustness
dvorovenko
consalvos
tlapehuala
hootan
lisis
relinquishments
tenaculum
palek
fricassée
comilang
reconceive
firtree
rosegarten
collucci
ejehei
unblindfolded
prestedge
voordewind
viewerships
urli
malgieri
ababeel
yumyum
yandicoogina
shouwang
atabani
imbeds
tottingham
widsets
vasterling
bleckman
tomljanovic
khameini
innumberable
argouges
rhissa
xlx
vexingly
dervi
gullino
malasian
cuemaster
boxter
wherewithall
alekno
schorle
nigori
vangard
unloosed
lekach
harnek
prepurchase
sacn
endter
sielicki
amanjena
pelevine
microcalcifications
amatrudo
exoneree
smwf
abereiddy
bettane
webbies
riverbay
otpc
palmenberg
muolo
ackert
refinetti
busness
councelling
nasiganiyavi
schlubs
morizo
beikou
shokhin
dysfunctionally
deerstalkers
frankensteinian
bencivengo
dusks
anghie
bisoi
ringlike
engressia
frontloaded
prattles
dimwittedness
onetel
ratnesh
flanz
aggrieve
delcour
nadali
moosajee
szrom
progessive
minczuk
reinject
indevus
antiriot
bortolini
microculture
oufit
kevern
crosswater
metrinko
wlaschin
wijeyadasa
antiphishing
roséan
korge
siguiriya
ajwright
yellan
shellsuit
lazie
acdi
spadeful
austrlia
skived
reqall
girnius
swigged
vivra
alaitz
duhnke
guaging
ricelands
zerrouk
surk
hailpern
batsuits
broughs
dissemblance
papaye
respo
setoff
repubs
fudger
whiteways
totobiegosode
kingwill
dawki
sintonia
lejnieks
meei
strenk
freddiemac
counteroffers
hemwall
pyenson
aaqil
mcclear
hunstable
reasonalbe
jeromey
acofp
gabarone
hyperventilated
autoexpo
peppersmith
bibbings
judiths
humorlessness
raydel
nonsuicidal
chamie
jaggedly
bettaney
fortissimos
lutker
navic
lasis
riddile
birzer
shayeb
shanzhen
auctionbytes
medvecky
ahic
härter
superheats
taneal
houseowner
solino
deynes
wensveen
cmtl
blackhearted
transoft
scarify
sauerwald
moreish
antigenics
fanciness
pisaturo
nigussie
oppourtunity
bochinche
flunkey
raynesway
reinikka
saadeddin
mvumi
feltmate
kleptomaniacal
activequote
differenet
regasified
bestrides
basardah
fulsomely
coiley
onama
meganne
sentras
batom
chilcombe
disembedded
trunkful
playcast
tanaiste
inspiriting
blairon
sneeky
huangci
knaupp
wiid
euroclassic
reimport
ghullam
finl
nared
socìetas
gachoka
jackmanii
trikilis
declarable
rismondo
basbous
godé
fatahian
oilrig
sdcp
furriners
downtowners
pompholyx
farbio
vangala
suhardjono
inmage
nordson
jianfang
sqo
dcmd
scossa
cozar
suctions
pigmenting
monstering
meglioranza
sweatheart
inkpots
benthien
oneof
vernola
alore
berettas
hogwart
ffcb
salac
cayler
doonies
autopacific
winiecki
nondefense
southstreet
allscott
demichiel
overprice
scorchingly
berru
wajsman
sebonack
givebacks
trickily
mullaithivu
kotting
perambulate
cavelike
nicandra
craigour
guenot
formidible
powerlabs
perkinses
aabpara
apls
obayomi
jolita
shkurtaj
vanezis
fssd
kapetanovic
scheft
alyanak
retrains
leskin
posole
palansky
puigpunyent
devay
bieniawski
nutmegging
albertrani
hadjuk
goryunova
jumbish
khales
jegher
femara
defensman
avapro
savient
spoofery
dharoor
wolensky
uogb
yhency
cuminestown
tresspass
pensilva
chikwelu
percuil
safyan
overinvolved
reactivations
woodfalls
varelas
muhtathir
zabrocki
pentons
ligocki
jensvold
vaneman
mullee
tolemaida
macrosty
unrushed
chariman
jauntiness
nnpt
nogy
horible
asssociation
kawale
bollore
prommers
redounds
freeborough
maglis
mensun
stratstone
vikuiti
neurocare
bustami
marenzi
dalmane
illmitz
francises
undershorts
angwenyi
amadine
dabic
veddy
hctz
gruessner
bifeng
recaap
coccaro
diminuitive
bachur
enpa
charleses
cyberpatrol
jannuzzo
walkstation
roppo
polysorbates
sianel
challengingly
iverse
karkkainen
greyman
posedel
barfe
neilia
jinal
aftermarkets
hajda
turismos
citkowitz
zoidis
trotskys
ungraciously
midlist
magneux
kamysz
sinosure
wihtin
lesha
repellers
hybritech
fulston
blitzkriegs
balconette
fitchet
outscores
virtis
dipenta
inteq
rigl
oftsed
neindorf
hindcasting
corehead
dorleac
vukic
kolola
pfis
shovelin
baszak
melanee
englemon
wbmd
bowlhead
insm
nolhga
keynoting
laussucq
unstowed
medbøe
belayet
lishen
cheesegrater
subcription
tracewell
tambunting
adolygu
harshav
faruqee
allahdadi
sissified
theera
tmsuk
shugak
omniscan
chevettes
pikestaff
phambili
rhinefield
kuchenbecker
lukonin
lifestreams
polarn
ebps
boxercise
texmex
blondish
shishkhanov
wallender
derrion
rusticate
nsia
olfson
sameem
whizzkids
mazria
bestir
lakner
kenshoo
meghir
krauthamer
reannounced
blousy
zirok
photiadis
smia
reinvasion
simbex
musicforthemorningafter
gothberg
feasters
genteelly
flometrics
knickerbox
feuillatte
tulchan
kitcho
quietman
copans
wythnos
bravinlee
sanctimoniousness
mathez
szyf
mamuyac
nonroutine
kamikatsu
ingla
mizani
gpic
puedan
goadsby
istore
krowne
northacre
deganya
kayishema
spielers
internalises
muhummad
anthuriums
bankengruppe
iped
heiligman
portuguse
cgdc
rosseel
infantilize
schenone
elbagir
beccan
yasouj
hudgell
callfire
deerbolt
leiblum
ineloquent
cussedness
haslauer
gotic
molindone
thecall
feminising
akipress
esquerre
suriyasai
agnitio
neelofar
madrilena
folkloristas
overeats
whizzgo
majit
logomark
dunlopillo
mcleodusa
schear
nonvegetarian
turkstat
riversley
renoirs
gadekar
tigertext
ristelhueber
hubbie
shaprio
hoffinger
sylvor
liepzig
ersek
comorans
souryal
politition
zere
brideau
celizic
pmti
glasfryn
weinhart
thumbplay
duperreault
madencilik
plestis
shaiq
shabih
soapies
antimodern
redevelops
negligees
torarica
tagai
brothman
naggy
igcs
hahahahahahaha
rohtenburg
electorial
longheld
rooseveltian
overbook
thermie
abfs
dothard
sintu
echouafni
noshki
mediacurves
boosbeck
readytalk
larrowe
wanabe
appetisers
baumgaertner
deminish
xiuyu
leftfoot
nakhle
cloudveil
kohail
strentz
germophobe
trotignon
ledgemont
naoma
ghosheh
headcover
drotar
preatoni
dynamicops
chantra
elchlepp
fanteni
ususual
celebutantes
alguire
leivinha
haemost
sagerman
shraeger
colliford
gardler
brutishly
sendlerowa
autoban
cherkos
smull
golfballs
wafertech
mucinex
reitell
logicvision
zatloukal
gleichmann
lazca
artigarvan
wooky
idyllically
odabash
bbumba
rosheuvel
efit
elridge
taea
driade
mcanelly
hitsp
gelowicz
talebzadeh
steppingley
kontakthof
vrae
satava
vinacomin
shpiel
techsnabexport
kaptel
hyperinflated
duessel
vileda
dombrovski
masscap
moonfest
kheirandish
supersecret
pyszczynski
fenkel
proposterous
eastell
koufman
zierath
nabintu
chorcha
løkkegaard
cusos
heicklen
vlock
artwerk
schwabian
thanachart
wbenc
depinto
usaec
fenproporex
mirakle
kozodoy
arreaga
kforce
wepler
tsalikov
lapica
pezula
tavaria
joudi
corviglia
softgels
kapahulu
karber
cretons
scrapbookers
twinbill
mansouria
communit
ciza
kriwet
bdhf
breakish
pacesetting
weisbuch
shogan
vivisecting
schramma
hampnett
gameboys
mcarabia
lairige
strokeplayer
svase
bibelots
rivermark
belltel
applera
acomplished
njvc
septicaemic
felini
rmhcsc
rawood
exubera
curnutte
vechicle
bajorek
vashishth
volumetrics
gongsheng
grumpiest
ramlow
peratallada
eservglobal
troublé
getequal
unartistic
lytal
strausses
pedri
machiavellians
pazel
ridded
ecotect
corveloni
minimorum
shamburger
chambar
perusals
landgasthof
trammeled
plexxikon
derinda
dalpiaz
toffy
zafon
amirkhanova
marzola
mutayri
dtas
repot
bassuk
villification
suncare
khomeinist
luari
lollygagging
fireplug
halaco
triessl
tendonectomy
falby
gallahue
ehnac
inflationism
dardentor
flueckiger
lickteig
fettled
domenika
barme
mummiform
ivideosongs
unopinionated
rubeor
intransparent
vercoutre
profiteered
domenichini
monolines
gentled
overapplied
hejna
maching
ballbearings
finers
skycourts
millitants
schlüer
homecomers
grommek
gamt
franzo
keiaho
rugasira
drčar
casulaties
groov
noonmark
elderflowers
grauls
battlemind
ecosmart
softkinetic
besmirches
clubgoer
prediabetic
ibbeson
depouilly
multileaf
nebulousness
gorenflo
corace
mahanna
ramadin
carbinoxamine
pacquaio
gobbetti
foldover
furbearer
windsocks
dabab
wollack
hladowski
opinium
spykee
convivia
devauchelle
suul
dallasites
shareprice
kulinski
inglemire
dinola
kevane
schwehm
magovern
contritely
youthlink
begginning
batterberry
biodigesters
gedow
kocken
afonwen
luging
neratinib
kenzler
borqs
corrance
qureia
malcontented
rilin
pcati
prizant
tanora
anshakov
hied
bridgefield
locavores
falafels
markhouse
uspt
sombreness
babycare
witthoft
jijun
cheesing
therkildsen
haegele
madaleine
tastemaking
abvs
awooga
gemlike
tahdig
effertz
autralia
calingasan
kechik
friggen
netshitenzhe
koskas
taleon
sarnez
gunewardena
polymethylene
microbloggers
hostelworld
lyublinsky
kwanchai
mythbusting
askale
utsteinen
barimo
taybarns
fibia
dissatisfactory
comvita
consisently
mccarra
drugan
kabbal
irurita
janangelo
soliant
copemish
grimmelmann
unhinges
dubitsky
changey
ifanc
sectra
romich
bronis
mazard
tognozzi
friebert
afue
descripton
dovale
zuska
yasman
kollins
jalai
afirm
ballyowen
pochat
turrent
komisaruk
homoeopaths
advertisng
econsult
shelver
frix
fuyao
electioneer
onmessage
minibond
chumminess
endulge
pastings
beanywhere
winpenny
kohrman
weals
funkified
champine
hugheses
umguza
gähwiler
viskase
geospatially
matw
testiment
jzj
borther
xaiver
splutters
gicheru
diddams
lincolnian
gtec
cwik
lexcycle
vljs
writeing
wheatleigh
styrenic
tabiou
shough
erhlich
skimboarders
amantaka
calestous
pesacov
stolzius
medallia
depletable
resing
biomatrix
constantiner
unexpressive
trilliant
sooreh
xybernaut
kulakowski
chunlei
boffeli
travelodges
scootie
hubberts
spongelike
mirandized
mirandize
alss
bulabula
mojaddidi
choudhri
beva
tardies
fuoss
barankitse
shidlovsky
shanavia
pitie
estrellatv
borukhova
thoroughman
gywnn
goerges
goldenrain
wtfc
uspca
firb
plunky
chanc
emptyhanded
mumc
nwanze
cgnpc
steckman
opthalmology
boomeritis
minova
humdingers
sketchiest
stoutz
brodcast
caschetta
sarnies
engrossingly
anounce
concience
noud
compx
debottlenecking
kanharith
fynd
lansbergen
ucberkeley
whiteburn
sikkens
alphacat
switchovers
dbfx
locricchio
chind
vanned
cruddace
supporte
lilos
nickiesha
rosebys
wellawatta
cryobank
homosexualist
redbaiting
kimberlina
piks
abrham
stoskopf
zenergy
transparência
ayelen
aboudi
lejuez
stieren
dunnhumby
farmout
ngobese
poltician
hoedemaker
househusbands
acik
karakus
dillenbeck
mowlavi
belluci
uchc
doup
subzwari
availablility
borchelt
visudyne
wimbeldon
idlet
macchione
beytenu
qaddumi
tajique
fromong
gonul
jihaad
eucerin
elkarra
solarcentury
ortegon
arciniaga
alavian
headd
fireworx
menedez
softlines
seroka
oncampus
chmela
deserie
rybeck
colee
supramax
baleira
rvers
chitr
dziuk
molaschi
malinoski
dellape
barcalounger
narcisstic
godah
miltant
naudero
marily
kostow
olmer
eqecat
foulstone
kasav
adahi
athat
cutchins
rigaudo
nancledra
chtr
nosherwan
faaf
tjmaxx
unbuild
boffetta
coopi
seipei
tchuruk
ngosi
histrionically
lalomanu
kachinsky
falconnet
quiter
mcalear
pinked
malalane
navajoa
vakapuna
raduka
peacher
honeybears
sudby
defalcations
gabot
giessibl
solovic
ebulliently
garone
yanshen
rerras
junny
origamist
kopenawa
gunel
tariffing
humongously
owassa
squeo
foryd
northwester
forgetfully
lochardil
macromutation
unsanitized
thanklessly
sadec
delamontagne
fremer
readle
fruz
goventure
hollandois
bulgurlu
tseycum
usdx
descenza
ncpdp
dogheads
frenze
betaseron
preferreds
lating
revoy
nusoj
dispirit
staadt
palled
ebitdar
milyo
tappolet
aurn
wellesian
quaintest
sherbedgia
behin
citroens
marcellos
donio
axiotron
zinkann
peform
xojet
woodpigeons
ratnesar
karisa
outspark
cynhyrchu
jurados
tuputupu
repasts
youcan
syman
shimel
amtote
zerola
devoré
sukhois
cbocs
rúm
geona
srlc
nuvox
plumchoice
gingersnap
sheinkopf
lellenberg
seic
prexige
machinelike
fighers
lastinger
plethodontids
caotang
tarpin
abreham
hipposonic
magezi
moghaddas
mishak
retkofsky
bogacz
holtzbergs
gomart
kurniasari
walnes
izdihar
enpei
grappolo
evanzz
ieak
picosulfate
fredenburg
welfling
blinkoff
reserveamerica
hempsell
cigler
cvik
koelling
onica
tranquilised
sinnette
earthecho
merran
nuumbembe
juking
cadus
rushcroft
neftegaz
hazmiyeh
schmadtke
refought
mossmorran
sinovel
calcifies
malary
gracko
trangressions
wickenberg
kazinform
akoi
kcic
keerthisena
sandboard
machalek
galeai
spherification
clixtr
mascini
cervalis
tracesecurity
revoz
excoriations
distroyed
amrican
arduaine
etxeberri
hasagawa
shagrir
tannaghmore
redways
anihilation
olow
cachay
domnitz
memorystick
labyrinthian
enegy
saadiya
binyuan
blazic
riofrio
gokay
tachyarrhythmia
borgny
sedmoi
shokusan
griebenow
ejk
sunsetting
roofbox
tomt
gqt
incisoscutum
gotkin
ottica
stovroff
kavenagh
zaimoglu
lewisdale
grimshaws
imazapyr
knautz
yonemori
anthropomorphizes
congel
cysylltiad
sierwald
junade
trupish
isrl
madivaru
mewl
pretti
elomar
courters
benningfield
borsodchem
buvette
kalist
intrieri
propser
vellay
rashkin
milliohms
technopromexport
wuffli
tamug
kerevan
perscribed
squirrelled
kernerman
darbee
fowzie
teeshirts
samels
soothers
weiger
plasari
gallick
diginity
girhotra
komansky
inveigling
belorus
soild
morandy
caralyn
stoley
hawlata
razeghi
didik
alasin
suaveness
maxner
joset
edenwood
adminstered
abstractors
invincea
pennyless
anado
jurgielewicz
aona
magnun
lunchers
kemkers
latifiyah
hupond
dominiak
sambhi
shepik
khupe
spykers
cliffhanging
absently
eabis
midpines
domori
peterbrough
revivifying
mickum
nonstatutory
invisage
melaney
hutcher
paliivets
netshops
quigo
turny
lamak
ghurkha
earcup
skab
springfree
mpumi
sigurdardóttir
depoliticizing
muayyed
repeate
leising
narzan
suseno
burkland
elkady
scratchpads
toeholds
timecard
abhra
nerakhoon
lowbrows
oponyo
ghayas
comandantes
gmwda
gobie
esayas
lerners
zuora
klasko
kokayi
cinematch
exobiologists
luobei
bouga
delusionally
cvrs
chetra
biaxin
reichgott
requiris
woolite
counrty
namad
ameril
tembagapura
mockey
snickometer
fabby
eirich
adineta
bogogno
kubrik
interract
qchat
eliaz
nonaccredited
brudevold
glenbranter
laglio
wnats
havebeen
atayi
lowgar
polyradiculoneuropathy
klutzes
hearsts
moshen
xiaoquan
sebastain
rousingly
discoved
jouzel
tradepoint
alahuhta
birtwisle
girotra
questionned
vasby
hazina
borgerding
pcga
seakeeper
wangjialing
borntrager
magdaline
renumerated
stubbled
harneys
atfa
mayakovskiy
rumery
jacunski
reibstein
intraracial
peolple
nicaraguense
responsibily
prasidh
klaber
blairo
ferrufino
gallactica
tareck
sahaku
arciuli
uplc
nofas
yurov
blancher
worklessness
milhaupt
tanayev
moudgil
langert
vinopolis
nejma
dnsr
xega
wilgoren
amplex
levkoff
mittelplate
gissa
mattusch
akeroyd
cataclysmically
ignjatovic
siegelbaum
myspacer
defog
bowab
aseza
rousell
africian
beraza
cretul
stabaek
schéhérazade
heatproof
ciatti
raushenbush
scathed
quadrozzi
convis
labeur
precentage
romers
insightec
mplms
brasilero
chanton
sheffy
aliwa
dhamala
acelera
thilmany
tainsh
jędrzejewska
novitas
qdd
expatriots
bridalwear
niquero
cernota
novoseven
normobaric
guoming
sumroo
sorah
poed
cthomas
kibumba
glendronach
nucletron
sneiderman
krusoe
zephirin
dmard
deskbound
tradgedy
bistis
chaweng
nthi
pathologized
cumulous
youmail
brazzale
karunaratna
admc
nordvig
gcobani
proselytisers
tukuafu
tritely
looing
loyle
killinochchi
smbl
proceeed
nijmeijer
supplementally
audika
todger
spiridellis
tenderfeet
clynelish
memorizable
smattered
dycks
litterers
wirelesses
keville
nasibov
casalotti
zuccato
losec
xalatan
messiri
rafterman
upskill
shousheng
ierardi
epicerie
waliur
trouillebert
aquaterra
amstelhof
replants
kazlas
daedone
vandebroek
guiltier
moninder
marfork
interlog
gumblar
talafar
rattin
boooo
peopole
hausknecht
skarpnord
thembisa
chmc
smru
inutiles
pongpaiboon
tarrell
ddisgyblion
vetterling
nizich
rltv
hinni
retirment
longlists
metronorth
sellgren
posterboard
crescencia
cloues
cleberg
claimd
salomao
casterline
ayrshires
stuffin
repossessor
dabelko
overcalled
shinnojo
mocktail
noerdin
chekroun
colusso
furtiveness
oberwetter
mafura
enterpise
drymonakos
zacharda
krasdale
jusy
sluder
bekke
fareeha
casesa
assisters
sulkily
archirodon
borho
reputationdefender
nickelberry
rejigger
rhinocerous
wbmc
possis
schipol
lindelani
hatzes
fhlbank
mrtyu
liversage
caucchioli
intellecutal
kirtzman
rdq
ffostrasol
nownownow
élitist
vianet
shakiba
recirc
khakoo
emergance
turndorf
ebco
malane
jarg
backpedals
crommie
saamiya
treehuggers
buffetts
joele
dezman
osaghae
kilburns
cemea
kalogera
madaen
ukad
henaac
straggles
kattegatt
cirtek
mayhle
efvs
mogran
newschools
bierling
garwe
minezaki
pogram
tegge
ioco
searchingly
wasiqi
salky
kalloo
followes
jiskairumoko
fejerman
krajisnik
dupler
ranchita
ftrans
mccarthyites
ambulancemen
balmorals
potatoland
jowharah
transportion
belkas
caly
colbon
jimador
nidorf
jreissati
unplucked
saarwellingen
tomcito
sunich
whackjobs
adesa
schops
dalesio
piyasvasti
tsetan
demarius
chimdi
kronkite
finz
musahars
ballycolman
wybie
stottlemyer
amranand
pargman
anandalingam
danhi
shabwani
linkon
depthless
camers
puigdevall
xacti
relandscaped
gerel
potpie
racsa
olotu
cepia
plumrose
ridgelys
driveaway
mozafar
vidlak
muhire
officiators
wuerker
zunshine
underlinings
blisset
guoying
palmerino
pepparkakor
machipongo
gashaw
philpots
senkel
standale
ferensway
zebrano
colemen
ceysson
ovulations
satterberg
eoms
kavee
flightview
lebanons
solemly
fangman
cffi
finel
maundering
negotations
gieringer
farouqi
challahs
juliénas
lehotsky
rubinowitz
stridex
twitterrific
venezualan
edingburgh
tallyrand
econoboxes
taglieri
sheiman
mualem
gpy
pommies
credulousness
invaluably
toukie
arrowpoint
lmrabet
ervasti
arsu
citröen
liszewski
osunsanmi
overcommitment
schisgall
giraldez
shamalan
aquisitions
trigen
vanrooyen
powerfulness
wilbrod
limbourgs
savou
masakela
kalvitis
macwillie
jennerjahn
terrico
demolli
ciic
catrinas
alamolhoda
yalies
minibook
shibis
wintemute
haijiao
padillo
juenger
sciquest
mccarton
wierzel
hazelhead
ggtase
brefs
bendich
hamane
sheratons
kulat
chisi
plakun
cdus
rafea
liguasan
mcdermotts
uncontrolable
hulkenberg
legistlation
shilou
salkantay
darkazanli
unmistakenly
fatuzzo
touristed
snorters
braison
andolina
noven
hegsted
surgicare
cadapan
griddled
bousada
plcm
usibc
ebersman
moontoast
turballe
hmoud
samares
blusters
shubaki
cowells
rumberger
derestricted
announed
germaphobic
taggett
botticellis
margol
mammuth
erace
nalge
firemark
refugios
montepeque
gratuitousness
wetters
greendykes
sharil
kinno
sitrin
fneish
mabhena
badush
rendleman
cadged
oveur
kroizer
trelliswork
chagares
ratners
hekhsher
unloyal
aspec
morledge
kamuntu
biesty
habibiya
robeks
refinancings
paprikas
grrm
toking
computerising
corodeanu
milkings
kildary
skyfari
kidswear
bushcricket
bollixed
wallcovering
schoenwald
ffgs
analytix
birkenstocks
sefland
arvanitakis
bernholtz
joltid
dellia
elcoteq
decends
buckmire
teamwear
belby
establisment
sufyian
innoculate
ameriques
procuraduria
yunhui
innotech
ruwaida
liolios
cheapy
rutowicz
vigoa
instituions
amouage
aniasi
casias
lennmarker
yonto
sauteing
rouas
spruiking
inmobiliario
foriegners
extemporize
cancercare
winehouses
numberi
diddums
nonggang
bucklo
brauncewell
undercounter
glybera
zharmakhan
sibeam
stoutmire
osscube
unreplicated
mohau
disneyesque
croute
inescapability
groebli
dollarisation
appendino
bernabè
cmit
tiii
dipity
contabilidad
humanick
burtone
sheddon
verex
kalishnikov
climbie
texturizing
sikku
recurrance
gringas
hipnotic
exquisit
gogava
qeybdid
hashash
defillippo
musonye
darunta
rahmouni
geocacher
autocentres
changingworlds
embelished
leitenberg
ureb
nealey
baringdorf
notarizing
klimts
dementium
kafoteka
barnshaw
outracing
jounalist
querce
websafe
joek
anzueto
parayil
bicyle
dijlah
rwdi
glaspy
donnent
monigan
naadac
olitsky
guirane
humungously
wallstrom
legistation
excuted
slotboom
gavilon
wnes
mpowerment
plastech
klawonn
sticka
aspentech
bronques
ballestra
grevill
platoni
asss
khuria
skittishness
jonagold
intolerence
ciarrocca
smoaks
versacold
yalennis
linza
zabini
rubiner
zhixiong
establishement
deliv
lavatorial
layva
episcopals
tagong
poluleuligaga
habid
garwick
tshepang
superhead
spainhour
urband
neutraceutical
rhynchophis
aspirus
koelmel
immunising
distaghil
blacklaws
tingled
drwy
mevlud
shelledy
breteche
zhenmin
mooches
lieberfarb
damione
spirko
imbrasas
lazaroo
sugery
chusak
canvin
bashmilah
cabieses
opande
spaisman
coulsfield
goidel
smooter
burklo
virgoe
lylle
ikhwanweb
redprairie
khodos
unrequitedly
destoyed
interheart
mingqing
schoolcenter
therminol
hightails
sokhom
zhanar
coshed
limbering
oxytricha
apwg
luochuan
defenselessness
zeyda
bootiful
bevine
enviornmental
freecom
pitoitua
oberwaltersdorf
wraggs
arnautovic
silima
sicked
tapili
yamgnane
asymetrical
sewnarine
fassihi
milberger
haeggman
deresse
freidoune
pelosse
chirashi
venissieux
adamji
vogueing
comor
conchord
kobau
garavan
giannandrea
panameno
bamdad
dufresnoy
tymor
elbmarsch
xchanger
helgren
trabbi
khaymah
levya
moujik
benezette
akhlaghi
pietropoli
infosurv
multicourse
smartwool
throwout
altimas
drefelin
hxm
delvings
foria
barcaple
debtholders
boissonneault
petalas
sorial
munyakazi
nationalmannschaft
cambadélis
circustances
escrows
keilen
kissenger
lifecam
sourer
chumleigh
meciar
bigeyes
chailleach
moralised
hymenoplasty
agbami
keyaron
zeebroek
glimmung
bombardini
imune
balinda
ffriddoedd
harach
djinnit
etirc
sparts
forbeswoman
neumos
trickler
borwankar
braais
kerper
peforming
inoculates
fleita
aguet
abdelfettah
khajavi
commy
fioriti
loller
firebag
rtog
chakaipa
louks
shabout
tekori
coway
ursuleasa
loudham
foulards
vonderhaar
goetzl
xcalibre
blochs
habetz
tawafuq
houjian
jerie
hypergrowth
heathcare
wishnie
nextar
stst
mitalipov
camdeborde
wellenberg
owyang
llrw
schouppe
otkritie
counterproductively
truley
saridakis
aubey
lillestrom
trevers
stepgrandfather
virnetx
ailman
globeleq
ibercaja
reveiew
salmansohn
strm
asaib
yilishen
eliadis
jaquemet
liscano
yehiya
csbr
ingelise
slochd
brandix
orthofix
junhe
mariaca
saood
yoffee
dubens
mynbayev
sprawler
restabilize
sfcs
khadraoui
adrenalized
simbolon
bescos
kinghan
yachtbau
tsigdinos
trokel
skywalking
yefang
corbato
midelton
datoo
killt
dajabon
tcdl
masisak
kiminori
raleighs
minging
timesand
klebitz
nauls
coffland
fielea
curtley
mailshot
macroy
rebloom
deicers
sundkvist
buyanov
hatboxes
brookses
unhealthier
rathina
luparello
moraal
oukaimeden
jodean
devyne
callgirls
papple
zeromax
kurzyna
scratchiness
flinter
sebc
touble
crockpot
foofy
kusasi
azilah
slicking
bntm
dinerral
zimmerle
sardonicism
hippiedom
wtkn
veddahs
refashions
wiebenga
stavins
darmer
nbsc
freefloat
dufficy
subashini
sysytem
kilquhanity
representivity
simman
positons
powerblock
outstaying
roeckel
flowmaster
eckholm
gyorgi
hasibuan
naipospos
brittlestars
ayeli
huijia
transapical
housam
dettra
pekins
gwrs
malayev
bocevski
llorenti
swingby
paddywagon
wavery
maniraptors
janaagraha
krueckeberg
schmall
nizri
martore
epbs
burred
levulan
biumo
windbaggery
tharrington
carped
lomonte
wizzit
mcgory
kasserman
denouements
rcrc
sertig
percoco
saidur
trendspotter
lupetey
priniciple
nuthetal
gasprom
unphased
unhorse
karwoski
beringe
arboc
bonterra
intiatives
stiglic
mokambo
skilcraft
monsees
mateljan
levantado
splodges
artsquest
aforge
mwaniki
huachen
perogies
videoscape
aveion
breezier
serykh
congree
bandwaggon
zerefos
canonisations
tumminello
yurkiw
weisburg
samoens
lichy
abukhater
sioban
cruciverbalist
extortionately
crocked
dubik
baaaack
charnvit
egerson
predevelopment
potbury
accg
trainride
clottemans
wizbit
trundell
lifevest
lammermuirs
ghosthorse
hassas
basateen
gavora
hasman
shaoyong
sumaidaie
ocloo
opressive
skjodt
waltic
nuvi
coneway
anthropomorphise
hungai
perfformiad
mavimbela
vetrazzo
australopith
tamuly
luchezar
adivce
advanstar
apruzzese
bowoto
leyzaola
ireporter
eriam
vonkleist
taveau
houselife
intradivisional
defensives
monacolins
keevill
somfy
raychel
nonprime
maestracci
pgds
yergeau
blueeyes
kostelac
touradji
avtec
carvello
turnd
bumpf
seref
multihued
pulmoddai
adriyatik
neutze
trinajstic
areheart
tdbfg
scdi
monocrop
gemütlich
rummaneh
peterken
dozoretz
ultralounge
schuhbeck
sharyati
agunnaryd
laquinimod
ricigliano
bewilderwood
tripitikas
huthi
chakushin
kremlinologist
unclamped
glaciergate
techine
dardennes
zahwa
sapergia
musngi
canoscan
tizza
jived
sztykiel
pishchalnikov
salathe
scholnick
microcurie
copaque
mobilizers
safecrackers
menotropins
ziso
naimah
zenithoptimedia
intelliquest
sobelman
adjunctively
cloudwatch
guocun
bourzac
clearsky
thornsbury
cibils
panaderia
changewater
myocardin
denene
madziva
mezuzahs
abdalati
inexorability
bioculture
eitf
vitaioli
vickroy
nonstrategic
morotopithecus
turken
rawland
yibi
dmitro
sigurjonsson
merighetti
ghaire
bouteloup
beugre
apsco
oddsmaker
rustically
sunnyville
hpmc
linkery
pottermania
turiano
cimpl
netsanet
muirtown
ffpe
underbidding
empirix
farrey
lahoris
kubr
letford
garapa
capful
purevia
clavichordist
diamanté
mcmannis
oldstead
jahon
scunny
schnegg
ddess
gransport
irace
ardiansyah
cosn
bellweather
binit
imjingak
sinabang
grandkid
casasnovas
mediahub
redjeb
rfec
zacharakis
raufi
futureless
tomatoe
phillabaum
foodwatch
atomises
merlau
passacantando
myxo
winzenried
ayung
maryjo
mumtalakat
neigbouring
panden
shaminder
choragus
katesbridge
sockless
sebasco
woehrle
vernae
famly
dllr
chemnutra
malangré
hajis
jarwan
manorohanta
plaschkes
felidia
confidents
whitnell
repremanded
hahnfeldt
pfannberger
xianting
ohiri
negotation
suleimaniya
cashcade
schofer
feczko
shriller
roulez
nearn
carnaroli
gronberg
keilholtz
wattenbarger
prometea
vasileff
videoplay
apodeictic
dispossesses
detered
blogpulse
lambreaux
kinsell
forthside
sicortex
cartoonlike
nontenured
kyama
mancrunch
mavrinac
unsaddling
zloch
bendien
dustups
ibrayev
mamool
turcas
recentered
idoit
popularisers
shopsavvy
gawked
fritzel
biblarz
thowing
sharrers
mailander
cerron
tolins
marqueis
rejiggering
devestation
gramanet
powerfuel
expe
palliatives
broussards
iditarods
ulsh
currid
hérita
verduno
natsal
eroticization
easynews
ziha
wiranti
scoopful
niyitegeka
hallowich
rehospitalization
devincenzi
nadarasa
sabriya
desexed
pollman
savaya
nsofor
eisin
ubisort
belatacept
shabbiest
changsan
axid
norber
delievered
gebregeorgis
tukssport
khedafi
goldbar
outift
koloroutis
sheikholeslami
borkenhagen
savlon
widemarsh
siarad
hdaci
zushan
porshe
giddon
carrizalillo
intersec
kervorkian
ferk
easybib
shamsia
barrasa
mhuintir
filopoulos
swordmaker
ureilites
profesionals
leigha
austraila
eldrick
eggbeaters
dickers
kotzian
highquality
emary
calderaro
adako
mowaffaq
devhub
farthermost
lostness
abdulmalek
barrista
giboney
softbook
colesbourne
paln
waginger
nydegger
xpressions
palancas
fobb
schoolmarmish
showcourts
greenbergian
americinn
romick
semagacestat
lloydspharmacy
espinos
recommits
acquaah
brouwersgracht
rcnc
yanobe
jossen
koenigssee
onts
sawadee
chelminski
clubpenguin
undented
garske
zacaria
uchizono
sniggered
houseal
khanjari
kozerski
kalfan
kagaba
antimalaria
troutner
raspin
neuropsych
klowden
longpen
bermillo
hatzius
calvay
epigonion
dissatisfy
icue
willebois
collydean
hungerburgbahn
risoe
tornoe
tanae
romanengo
broekema
denbrock
txeroki
lanum
immitation
inevitabilities
wahaca
ntag
exida
reuinted
frbny
lopresto
platteau
bythell
asociation
neighouring
lapera
naumannite
kollmer
competitiors
taghrid
summerell
incrementalists
eerier
kolpack
sheinton
baytree
marivent
phonthong
suglia
archaelogist
hoeksma
overcut
strelsin
silverley
lichtenheld
darelle
askernish
segatti
elandsrand
priszm
luciene
infragistics
chorost
mindflex
nwando
rydze
tunebite
paediatrica
cubasch
wakfs
hasselback
akill
atmopshere
reborning
sophoclis
plassman
galliagh
sloshy
jaibi
ostick
plasticy
schmiedt
sanca
reportin
wolfango
ethirveerasingam
grami
econic
msif
mckerracher
semapimod
klocko
carabineri
incandescently
mlilo
evilest
norcros
bitsakis
throttleman
anthonisen
mccoskey
triblocal
notal
besilate
radya
descarte
bionovo
firelighter
clopping
placemen
ruchat
demisse
paratek
menzes
ehring
schmahmann
chermont
merholz
stockpot
aronhalt
valentenko
tulles
pidsea
thsn
mujirushi
sereboff
thurairaja
sealcoat
elosegui
crohns
pecentage
whirrs
xitao
graniero
gawlo
earthwards
simerini
orthosilicic
justanswer
riggwelter
aorund
brouhahas
debrosse
andruss
cityroom
akba
toneelhuis
thanee
kambwili
selesky
dropback
jesenik
curnutt
herdswoman
annisul
enoughness
occhiuto
biblioburro
parodically
kresch
challege
cowtan
cannarozzi
yucho
telnic
sholley
louisian
rieuse
tradmed
trevemper
bedritsky
qualies
holtzinger
airworks
hernquist
bahts
pappelallee
erehwon
simspon
jetboil
tabouleh
xhelili
jocic
pursifull
schuchman
vilstrup
amross
decifer
markwardt
abazov
phaswana
differin
vitrano
toprol
bartoshuk
guardistallo
adulating
septermber
bezirgan
anisi
daulaire
showhome
tafani
gunked
ptec
boobed
karoke
antitrafficking
kyriakopoulos
stulman
nickl
nightastic
guibovich
belskus
kamynin
kalisha
centurys
billionares
froebelian
cardie
utilisima
jackovich
barnosky
weister
bertinetto
finaldi
crowlas
mingxuan
clearable
flarer
sandpapered
lobbygate
aqqaluk
wayfield
waterflood
baseej
govloop
churlishly
vismitananda
lodenius
racerback
foddering
mangara
barotraumas
levale
deferrable
blackfriday
andex
sanminiatelli
narjes
haggiag
lvns
acorp
muyi
gcash
overdecorated
hibhib
sponseller
stedim
suncadia
rebanding
usbf
gospelly
uncurled
zommer
pedicurist
luem
falahat
nanasi
previtera
globaldata
wintertons
bevon
dailylit
chunlong
homevestors
spinkai
addustour
parrog
simulus
petrosaurus
popke
smri
passback
communitie
wassana
sabans
oligoastrocytoma
talbieh
goucha
tchatchouang
ndebeles
mellace
magnetti
penalities
jaroslow
felpausch
cdrp
mechira
lagrassa
millesime
ntegrity
vistec
ramjeet
mukhisa
andraz
ickburgh
liebovitz
dotts
offroading
snowcover
supertex
tuev
imposimato
eutsler
cibernet
kazman
laermer
fordcombe
dtna
scheuering
senghennydd
wassom
melness
mantarraya
cravotta
kattie
croddy
girlier
alphine
stacul
brangaene
dramady
guinazu
drugscope
jasiel
windrem
dlugach
sevenzo
risius
sezno
nyingi
reflexologists
dorritt
mejstrik
unblind
devaluating
summerfare
fetoscopic
dancemaker
hosseinieh
frightener
shenita
shakr
adalaide
bianchino
spiritclips
nqetho
ctmm
kickstands
stodge
brosens
aseff
metroplitan
zues
peagram
wigdortz
yoshitani
supergrasses
valstar
xiuyun
pannekoeken
presentationally
kullgren
sooud
pupillages
humenik
autoeurope
lindia
homebuy
teladoc
latzer
fraxa
kamimoto
kortnie
kamarck
earthsat
klaeden
harethi
kogito
oschner
staney
speciosissimus
gubmint
fireable
worrad
jackknifes
pouchy
anticlimatic
gyrobike
chelada
haeji
tjibbe
rolontz
burnishes
bosdet
winspit
tarciso
sliderule
stultification
applebys
cutietta
lusciousness
mariton
lupson
autom
renegotiates
prbi
pauperisation
copperwork
stateliest
restringing
jurdan
liechenstein
engineeering
pndd
cotonniers
qateh
kagwene
sayansk
scrumping
behuria
hvps
mizera
escobars
chewey
kempski
ceraweek
arres
serritella
tinkerings
cerrejonensis
cken
wyeths
prupas
hoosh
troadec
tuntable
trendalyzer
ruebel
maykin
duckfield
kopczynski
blus
etymotic
wahner
teamaker
syafii
biohackers
prauss
bernoskie
amwal
andary
oclaro
slurped
lanzman
musicane
awrt
estrangements
setterholm
churchbury
tendrich
eruditely
telk
crovie
cominetti
swaybar
arogant
zeelander
lamana
buthaina
wosb
kermabon
sieracki
deward
gphone
mozafari
gjorgje
edig
deviceatlas
unibomber
wolsley
kasulke
robotlike
galab
tazu
hawfinches
gaouette
dierberg
soneda
dranginis
sharisse
sufferd
goldentree
mepolizumab
fradette
shirtsleeve
disillusionments
ddaear
fatemah
sunlamps
peachcare
slickwater
vlcd
spluttered
viacord
berryden
petionville
extintion
gulladuff
pheap
alvery
leonidis
aquacity
flacking
grinspun
ilkham
wenglish
rabena
ridao
yordas
buonaguro
levans
cevipof
aosc
mannello
duzce
nantcol
zdunek
helico
slong
entrup
whiteleas
stringari
indeck
jochumsen
wellstream
mataponi
niere
mwesigye
meroi
tremblois
kaleme
biadillah
rapporteurship
mosala
cuiv
posegate
chhon
valentins
opthalmologist
panmunjon
maryl
multiconfessional
domjan
harlans
marzluff
pointsplus
peoplefinders
simulative
entraction
zguladze
fuyushiba
gerami
energías
pernin
fortfield
nurowski
lwara
ravishankara
apologias
alberstadt
travesser
yennifer
uccio
cannongate
fitisemanu
braveboy
disastor
pregant
infusers
pantymwyn
vaco
highleigh
ilcho
rochez
vapourising
underplanted
cenckiewicz
shadjareh
plasticene
ramierez
drca
bline
doje
symingtons
hermelink
pedatzur
lionette
piacentile
sandpapery
cubbyholes
knofel
kalogeras
feiger
organistation
tweetups
ovalles
chainwide
shoptaw
lifesouth
vandrevala
barbis
satyadeo
broccolino
rabdhure
keylime
pohjamo
sadean
joudeh
unpartnered
furfari
superstocks
bargu
aquilar
colbin
prenups
buddakan
helsington
powerpacks
lipworth
moutsopoulos
intereste
tavelli
gibgot
vegitation
allaho
mashatu
buulo
dynacast
olagbaju
schwerk
solventless
wigga
haymans
arthenia
banze
deddeh
garzarelli
spahic
elwak
bjog
wideboy
naderites
esraa
greenmead
deflations
durado
microcentro
yawei
cordani
normacot
atishoo
oportunities
fleischacker
trahn
amireh
falceto
tonnere
rensberger
aerni
timberhill
nooooooo
nisgs
munarriz
consignors
adcommunal
shulong
maralee
incessent
raddad
shabes
elfred
rushgrove
aherf
mamduh
hizzy
lucratively
ayalde
deskphone
progammes
quailed
rhymesmith
militzok
skytown
rattanakiri
sudoko
supportsoft
kwasniewska
tungesvik
totengco
airmedia
catholes
hirshler
baldeosingh
degrease
hittisau
elshof
danovich
narenda
shaali
trepidatious
overperformed
jamiyah
klun
axtmann
variani
rephotographs
longroyd
outdistances
nowais
munud
moggies
shirtwaists
thuronyi
farw
otls
jorgan
fallafel
goatherder
kosterhavet
narvaiz
kacelnik
xinzhu
exonerees
khurts
ollusion
yaffle
lilts
zabaglione
denmans
kumkapi
kemfert
moonshadows
alessando
audouy
korsts
dioli
gendell
slebs
photiades
bramman
gueliz
slideout
biosaline
segements
xenicibis
rolos
nithish
polybag
skittling
wicoff
bamieh
sightholders
hauntworld
dirouilles
pratices
yakker
abcn
eurocrat
sayedan
fulmor
mahmidzada
luja
eritoran
underperformer
tremopoulos
weddy
djibrine
suttirat
midwifes
medplus
noncapital
cheesbrough
madhoun
juknevičienė
rhinodoras
kazai
respresent
lanxon
mesomorph
shahpoor
gasgoigne
falkands
chemtob
alestra
regualtions
kurshid
macropetala
sensoy
firends
pantalons
studdal
czisch
salinated
balefully
intrax
yarrowford
emruz
monochromatically
peschanski
oudah
trencherman
hafedh
koitz
chitiyo
espley
scambos
ahmady
morells
talek
fqr
melome
soonercare
ogechi
arrowbear
mpes
fukahori
lasondra
hethersgill
santions
cucs
flatteners
futuresonic
lehnertz
ngoun
matovic
mckelheer
ciprianis
eough
wiita
kuncel
duddleston
prudenti
zenti
maarty
rttemberg
firinne
drummerless
adelir
dizzily
bullishness
siadar
pauperized
conwoman
nblsc
dragonwave
cinespia
vathia
outmuscled
bailenson
nyongesa
burfi
soremekun
pccy
hedonics
connahs
balyoz
drhp
harsimran
caihong
muyin
lagrue
pfaeffikon
disinformative
twitterings
maalla
amerock
tassagh
jilo
babybjörn
defencelessness
morrisonn
cevik
chukri
apegga
hottoni
wreathing
tokuichi
sanahuja
atepa
hardbitten
pluspetrol
sesji
virginmega
azadiya
iluvien
hazzazi
mirina
merico
steneck
enomatic
thixendale
dominca
untransparent
behroz
bigbelly
hyacynth
niqash
shapelessness
tarascio
darias
returfed
insurmountably
cappellazzo
firther
imaginis
champigneulle
signvideo
mijac
malkenhorst
sherjan
pgcil
conforama
bavarois
yonghao
zubeda
glenturret
hadjicostis
burnikell
altaqi
haigs
aanma
bollwage
pongthep
kitsyn
nonpersons
keigher
barbaris
mashtal
chabraja
iwanski
knockings
mjallby
aldeanos
qpod
langrock
schweinshaxe
raschhofer
cocoavia
gershowitz
cheapies
targeter
boppre
grmn
bollix
obopay
lendell
zeppenfeld
reproachfully
doffcocker
doubleline
kellestine
onevu
virginiamycin
ppuc
dormeuil
ciggy
guilliani
vanderbeken
usaction
ssed
trimeris
hitachino
jbwere
nvta
sentencings
corbieres
herkert
whatshername
gawanas
satyana
bovt
gozman
incontinently
kikambala
pravachol
payloader
reamin
tuddy
cothron
sengamalam
fahrman
apidra
dolatabadi
sagapolutele
waldholtz
bobińska
kapron
chepkemei
ariail
goodone
prelec
vivix
habersetzer
raptly
envenomings
lizarbe
valhallians
yogurtland
rudahl
neigbourhood
raner
stablize
nacds
ijegun
mindorashvili
barigye
ajdarevic
beddar
biotch
subeliani
yablon
colostomies
shpigelman
embarressment
joisey
smaghi
trilokpuri
garoña
wynick
headshaking
taherkhani
asiantaeth
cbtl
fuzhen
ubci
omache
denkaosan
tarmiya
filewich
baiman
konczal
ukaid
beesands
didas
starcite
molyvos
vandierendonck
salpigidis
jackmans
dawdles
scentific
temozón
frother
lipc
milvina
lonedell
jaffin
ufot
hissong
kleinfeltersville
ronneberg
naseef
janian
pharmaca
kooza
clozaril
wojdakowski
korostelev
gastrique
kondratowicz
nulogy
tkos
hajizada
kwaje
noncoercive
nimalan
kruglik
generose
telbivudine
penello
unliterary
laurson
unrelaxed
sagrillo
biswamohan
grachvogel
myard
circulans
khusbu
voguish
beiliu
lubinda
iget
inartfully
ajuba
rancy
distributers
rauchway
unsually
wizardy
workingwomen
chmelar
ganakas
unappeased
lital
salera
outdating
serice
nanogel
ebrahimian
combita
jermin
audioboo
slettedahl
wories
sreemathy
imbongi
aslyum
bartended
jermel
mussbach
minibridge
micromanagers
cozaar
shemmari
breznitz
reiteralm
harsant
gafor
dewji
chilin
ndesandjo
bugliari
labovich
peepshows
apotheek
talibe
pettinella
hufner
gylve
responible
hipoteca
unefon
microbusiness
capato
mdri
haike
udmf
novabay
ammonds
provokers
submeters
gutgsell
rewrapping
gagara
liegey
bouillard
apey
measley
kotrikadze
zeidi
okarma
flaxby
brochstein
ricciardella
azares
trongs
scrooges
horsy
wallenhorst
tinyes
adorableness
lucimara
dracup
czapnik
propulse
sillick
vanmatre
schaumber
kylies
compartmentalizes
organsation
emeklilik
tamadon
lackies
tidi
scrunchy
bartran
chargrilled
gouray
messchaert
massud
boteco
griethuysen
sauvignons
bicurious
chuanhui
ivds
guernesiaise
baldanza
crvo
ollivanders
talisma
walstrom
mitham
bourbonette
beideman
bcpd
powerphase
winnning
newstrom
terible
durgham
lobala
khromova
rubane
veliyev
jehmu
sasparilla
shorja
chimbalanga
digitises
dhahiri
weixiong
ozmo
stegbauer
onsm
recyclings
kilpatricks
accuvote
smartbox
akhunzada
cabalettas
belobaba
vyborny
bourhane
egros
woodforth
thambwe
nazeeh
daimlerbenz
owlia
bushweller
noritsugu
baicker
ormers
raggie
lynly
agamez
lostant
cogsville
wellpet
zawislan
wqma
chandleresque
underspend
kuhnhenn
kubbeh
morghan
sicced
qubaysi
revies
cabei
perorations
realtions
mcclorey
gribiche
aerophobia
shahwan
webanywhere
hablen
pluthero
shindaiwa
teleopti
dohoney
bochert
cuddlesome
allera
bendross
dongpeng
kilninver
aestheticizing
ebondo
expropriates
dsit
geisst
sinot
adelaine
pwyllgor
ambulette
nadama
stanfordville
dipersia
fagerlind
jazera
honeyborne
thermoteknix
keatsian
qinming
gandus
umtri
shahrudi
twitterverse
inclue
milbanks
fillman
zisblatt
kajwang
weikle
wolfgarten
redearth
teseq
fernworthy
lefor
akse
curlicued
wetsus
guidette
bakhchanyan
bosci
blobbing
planès
saghiri
laugable
chesshire
pravex
internationa
beckam
wordily
northhampton
miljo
bornhak
gravley
ageron
koperski
blamelessly
linkia
talaris
mcintoshi
bawdier
theise
fdas
sistare
naghmi
ipico
novostey
twighlight
lackadaisically
hirbet
lewiner
bedwar
questionaires
apeing
fielakepa
theyear
akuei
colonoscopic
kinam
xmpie
kisby
famlies
lukianov
brandent
alimta
kutlay
sirchia
eitm
patrico
hasbun
barbequed
crackback
zalesne
urbanizaciones
suceptible
dunaire
failled
sathre
toniu
overwhemingly
mountainscape
insensitivities
alimentum
backbends
clearpoint
tepperberg
barkoff
sultanzoy
klerks
rudesheim
cabassol
ifire
greenfleet
bobulova
ticketweb
coinless
delwart
tppf
intital
bosland
roehrkasse
lashof
afterwork
miroff
drevon
acatzingo
killavullan
covich
unaffectedly
baechtold
maelle
unsingable
madland
petrominerales
yonus
barafu
audiocast
genteq
siglar
farham
slapsticky
kaletra
eurusd
lueshing
androgel
changgyeong
dairygold
euraque
avtur
memela
fischerandom
kinkier
seitaad
kopecki
ourisman
crescendoing
ahmadenijad
kerbstone
gayego
alpharma
rptn
radosevic
downshifters
trevo
naghshineh
emmad
bluekai
prestipino
petreikis
koswara
asloan
ballakermeen
shaqa
kamagata
studioworks
kolish
disparagements
yatkin
esbjorn
socitm
komid
absord
movewithus
decadron
aygul
piraha
inorganically
kinmond
spaffords
asmbs
romcoms
nshmba
akpd
furmanski
khardo
panard
guideone
courroye
lahouti
laborte
hibba
mdms
stevil
hardegree
mamaliga
genexpert
cifta
zonderkidz
popularism
talusan
paycuts
feanny
sorini
eeeh
liccy
gopers
beertje
ohios
reshipped
budathoki
mtvnhd
honghai
koifman
carruba
trilevel
cautionable
honglin
unnos
allerslev
wickramanayake
elementeo
symplicity
rtuk
hornedjitef
geolo
vccp
reismann
framlington
vpsos
pygar
ratably
healthly
paasschen
boxclever
peschka
awpr
misadministration
serenic
peaceniks
homina
holiman
xingxiang
entrepreneurially
terpeluk
colbertaldo
consumated
simponi
letroy
bioc
shbak
medlink
sitefinder
sordelet
yaling
kapandriti
customizegoogle
sennitt
rotavator
gilburne
esmeray
gingersnaps
tealights
optomistic
roelfs
lovieanne
buzzmetrics
membathisi
monasch
trifield
kraisintu
musni
facelessness
selenological
kispert
cohabitated
iekeliene
virigin
mulemo
sakvarelidze
riscassi
smallhold
damians
cityfile
textainer
sisic
sowood
spriggins
abstentia
haberg
lving
banatwala
nishma
platkin
shnewer
shyanne
zubaid
bocsa
memeory
deitra
undebated
sharkawy
jolkowski
neuvax
casnocha
nansan
pansion
vacumn
confoundingly
palnackie
draemel
mochammad
algosaibi
unbreached
macbeths
ameican
devaud
gerlan
bluephoenix
suliaman
mahfoudh
medr
bateses
korinek
volberg
cusak
hrouda
jaziya
thuring
nagayuki
xiuqi
knockemstiff
givner
ludivina
midpack
taramasalata
gerrad
slashfood
doxsee
nechak
ogonis
criticizm
mogilino
myvu
decarbonising
eact
uatp
alacchi
dropshots
prometic
evain
doubledown
bannermans
glamourising
anecdotage
demolishers
compari
coloseum
khadambi
flâneurs
kuppinger
glitzier
heartware
nexbus
abashilov
camowen
lsgt
imarex
sisavangvong
gimpl
mcconell
thuras
reefat
mevushal
cacaphony
sedore
amantino
tillysburn
touchtunes
semsey
cdoc
thieren
ofice
obaigbena
mantraps
mashishing
parvizi
kuittinen
hayli
radition
haouari
chrisi
giek
testolini
nonmetropolitan
gluteals
kronur
xinda
kpatcha
lrri
klepach
linkbee
glante
frengo
karubi
myobloc
otuam
redivide
ciggie
consituency
moukheiber
resculpting
qcue
destoryed
blinged
sailab
bouhnik
apiarists
woild
holimont
alameh
veremko
fiwi
smci
unscary
otabenga
caleca
frerking
pjhq
kiww
bobsguide
tvnotas
lietch
gvidas
meteorically
matynia
wiessmann
manferdini
jamundi
geolearning
hicox
dobrish
oppresion
lanice
dacquoise
trosclair
brocher
windpipes
fulbrights
guayaberas
gepetrol
varnagy
kashief
kosanov
raineach
sandhoke
ungloved
whippe
squarest
raspiness
revelaed
okech
zhongzhuang
somaia
dudettes
andrikienė
natik
ramanarayanan
darnowski
gentlemanliness
adhp
khaidarov
cerebal
plutzik
santigie
fallibilities
schaeuble
clariano
dodgily
funkwerk
ashiestiel
goosestepping
neugent
orinoquia
mauel
structual
ritazza
ameo
solamere
kopište
lizzimore
basjoo
dmva
lynnea
lukoshkov
waterwings
synchronoss
itgi
ralfini
pretsch
autho
nourian
sageview
hassanien
hedigan
tavai
corsendonk
siriwan
jaybo
verdiem
fuljenz
yarmash
sweetners
molak
marcalo
ispor
glru
encams
youba
helyn
wateridge
dragonballs
muschett
accountholder
clemencies
gemmologist
frometa
europejskiego
porstmouth
waterspace
consumate
fossilizing
mussawi
harkess
uncanniness
gyory
zwahlen
affilliate
dramesi
hiroichi
frogmarched
presentability
whatsits
nefe
sweere
szavay
carvo
wanjin
sangota
rinconcito
gobey
fomentation
fondues
beore
trakin
awwwwww
liftboat
benata
magande
pignoli
wormuth
mbow
dyrell
mariches
kasdin
tolou
parliamenary
ecoterrorists
brushoff
wuethrich
dundreggan
buckberg
bhagidari
mcgarrigles
yamith
targed
understeering
immortalists
swecker
astropolitics
newfel
laudonio
lowcost
leadframe
glorney
hestitant
eythorsdottir
jianling
mhrp
keag
guerze
cidery
cococay
mccolly
mellau
wolchok
lbsf
beefburgers
denaturalize
numayri
ssrt
twork
bergstroem
portalatin
brunellos
oddsac
traffiq
satisfication
clawbacks
publc
narcotized
hartpence
colombopage
detoriating
brassiness
stivanello
candelora
fidanque
matricidal
extened
liveras
ibrohim
einfochips
misetic
teamworking
inzlicht
rockwater
hoedowns
koteswar
tackier
solih
fleapit
apetit
handleys
pebsham
trustafarian
qumranet
nardy
lunak
impenitence
nonhlanhla
molgaard
oversample
boguslawa
nxstage
roadsweeper
therapod
inderal
dewitts
tutition
dapuzzo
lawleys
ombré
merzouki
tootling
demostenes
lotoro
repya
susica
anbarasan
flightlink
troj
winghouse
raiments
btmu
mienis
jasira
flagyl
tittilating
mizban
rodean
mingli
ludila
sosostris
brachiosaurs
grammaticas
hartshay
masiko
liptsin
amanzi
flyouts
almery
vatnajokull
edilov
maliah
olubayo
bookin
solaces
ceiops
shiach
budzik
rageful
suppon
trichologist
courtoom
antisatellite
menetrier
opne
pludermacher
plettner
hennegau
ablynx
consquence
woollven
nyalenda
etvs
hasnan
edwan
mmce
voxant
duhul
risgaard
ickiness
heoa
nesirky
alishayev
reoccurrences
roisín
duckinfield
jeanswest
arbd
harpoonist
alcos
chabangu
komolafe
dadush
bearwalker
qisda
tabakovic
boumeester
ghettoise
quavas
aattou
sobowale
metallireducens
jhoni
preemies
vagni
pcsu
zigged
viklang
sterzel
finucan
swithins
shaull
kuglin
scylletium
gokova
plevan
ruun
prefight
fairmindedness
maplecrest
blairtummock
chatbi
buyline
bwea
catrall
projectwise
europhiles
zeale
britanick
canniness
guenet
zachari
adsu
targetpoint
huveaux
breazell
balaran
lupardo
misconfigurations
urstadt
eurofor
supersaver
nedl
pantastico
kurgapkina
baaack
nudell
audioid
sypien
amrdec
udey
mcgruther
pâquis
ibmt
misbehavers
callix
homogeny
toystore
futala
krissa
chansamone
berrydale
parachuters
mogilner
etess
malfeasant
lemore
ngconde
houseing
deewa
brewhouses
vatandoust
semore
blackheaded
persent
kamenetzky
groovaloos
norphel
misaligning
landefeld
mcnaughten
shaibal
revenews
roundham
cssn
belf
cocoas
novogrod
buywithme
nutrigenetic
kakavas
fountainwell
dafarch
aristarkhov
klvana
mohassess
fambrini
ketterman
juqua
petrohawk
tommasinianus
israelies
nishar
edelstenne
breathalysed
buzea
osteoconductive
wiill
stts
unwieldily
nnimmo
attorneygeneral
falcos
nextview
attackable
anoth
slutkin
deodorizers
racisim
sarracini
nhsbt
emoney
sagaria
ganmukhuri
petrotech
delury
harbormasters
skierka
zipingpu
anfrel
groundsheets
zeum
ahbabi
hobeau
luvsandorj
ıs
dinman
scurrilously
ainamoi
lorho
offlee
fitur
parashumti
deathers
unshrouded
invigoratingly
eirwyn
magisterially
noninstitutionalized
disend
spanishness
jianqun
sholam
bonrepaux
ballyearl
loopallu
chengli
yssouf
callaways
airnow
frien
elseware
khairudin
blowzy
nemos
himebaugh
pulks
soberer
popula
brawndo
brigandry
dicciani
dourness
kusstatscher
hefton
visitng
shalgam
photographable
afpd
outguess
unpretentiousness
seglins
uncompetitiveness
methilhill
fridd
selvaraju
vauzelle
bigton
haemoglobinuria
unrunnable
luhmuhlen
mayagna
castresana
reenforced
jalepeno
anaesthetising
botner
wallbirds
druzin
pumpgirl
awilco
overlimit
ossum
jetfuel
winegarden
metreleptin
ahmen
sententiousness
bacm
druckerman
lanitis
maryscot
pohler
grantshouse
smogs
bluu
exitoso
sexsomnia
fleysher
rightfielder
salving
dietsmann
sweidawi
hornists
qarantina
glenkirk
motionbuilder
greensome
ibet
qafco
dragonas
raviolo
ortigue
tuilière
cicconetti
histed
manuever
intimas
shcherbachenko
pericard
thundercrack
pyscho
cammisa
harres
poshness
yakini
clergerie
mashele
fimognari
purnells
mexicles
schatzel
beschen
alloudi
leocorno
mckhann
obsene
klosinski
toder
groundwell
merryday
vistakon
freakfest
entrace
radiotherapists
troubleshoots
setember
transdniestrian
quizes
shostack
exarchia
mwamwaya
rehanging
nauiyu
carlesii
shehrbano
flatshare
honts
haick
oaug
kyliekonnect
cliental
crenson
marere
musila
meii
starbrook
choroidopathy
azadpur
farras
winmark
frailness
anonymiser
restauranteurs
guardans
suporn
derden
lemler
peycheva
tobji
mawlynnong
kipng
sedlock
gesticulated
ghariban
daloz
unsureness
padhraic
radivojevic
maegle
poscente
unmooring
cadelo
bourjade
breitburn
rebaine
assps
mistal
aggress
ecowaste
mogt
argaric
sueng
usselman
reprogramed
grassless
vahrenholt
goeldner
mcraith
jcpenny
halamka
sisif
maeena
voepel
bitatawa
muchembled
sizeism
mause
vannatter
boutboul
fench
elecricity
tsumami
swaibu
nigmatulin
ascentium
seelische
urwand
schütrumpf
gulhati
hacktone
summerleaze
treuille
kappelhoff
amlani
linegar
thenia
europeanize
funning
muia
bilalian
neomagic
pygott
moudeina
martson
garrards
mtds
gautum
bergendorff
plgf
hords
basilevsky
norkom
cryne
elgas
elektrim
biscuity
yef
tuwairqi
baktash
schweer
ecobee
kraftfoods
unembalmed
autopark
natcen
blogworld
rappahanock
tuxedoed
improverished
sittar
northala
enlighting
alexzander
aneurisms
nuriya
explicatory
toai
maasailand
bonter
hornecker
leonne
ramzee
zalaznick
grasscourt
moviles
studentessa
alixandra
micardis
fontelles
cacophany
rephase
gurmu
ballgowns
babbitts
badoit
souan
frigatti
misunderestimated
srsl
tital
weitemeyer
brembeck
pedery
mcnearney
comoglio
minkwon
jaseem
tastykakes
akca
akissi
heussner
hurtfully
sourouzian
tremulousness
lilke
esmart
peiying
appauled
bittersweetness
indicent
curosity
schevchenko
bivvy
unikko
hjw
kokonas
brysam
younesi
qualman
saulny
ledua
nafjan
falsest
yarning
tedindia
adamonis
teitipac
dublanica
chipless
godri
veronda
threeasfour
charlieticket
zider
qlikview
sauquillo
darios
skedee
reyeses
fléchard
podsmead
soliday
allden
prowell
casketed
hanhua
supon
lagree
mutalik
caymen
slotmusic
kerti
skeevy
hypercompetition
pakay
degraffenreidt
danishmand
xwbs
matombo
redmans
arshid
natirar
anticounterfeiting
sigurgeir
gorbushka
debentureholders
onton
macauslan
oosthuysen
babblings
grummitt
rahmanipour
fredia
boudaries
aldabran
pressclub
margusity
badylak
kaleidoscopically
safelayer
stojnic
wargotz
patsis
grivich
haladas
arciaga
dongdajie
sutaria
vouchercodes
seddiq
judelson
magnevist
eminate
ilincic
tauiliili
burght
ohland
kasco
tuzantla
centraliser
buyseasons
maratier
yankiel
yunwei
conditons
druart
cosandey
whirpool
stadum
mvume
jowly
tsakalakis
abingdoni
conedison
apirat
searage
stoesz
scvs
oldbrook
clublike
suradji
calasan
rehang
kotsos
sccb
guilliaume
proo
obstructiveness
sarnie
opportunties
shepperdine
montervino
achany
acouple
miezis
perlet
rybinski
sorl
rayappu
haims
angelson
readynas
pouvons
corkscrewed
kawasakis
napatech
reffert
bloomgren
talaiasi
confederating
andreau
kabimba
unpunctual
guranteed
tbilsi
dratted
pizzicatos
paustenbach
kenol
aerium
sofroniou
mecury
obstáculo
maritial
tubigan
burig
mcrl
beatmaking
weirdoes
portch
maskery
goateed
petroliam
memeorandum
contortionism
gwylim
anuvab
mofidi
gurmai
nawr
pouquelaye
jeke
tahboub
garagey
nutsie
swedelson
sherringham
mojgan
gemaldegalerie
rnld
wambua
sarnow
láidir
menduh
newriver
creyts
babakir
runggye
hateboer
génoise
vinography
gennette
korkishko
comninos
brkovic
constituional
nassfeld
sayeg
sizeably
steepler
evalu
nonkululeko
scaysbrook
humanware
gerberas
sonyma
umalat
ltci
indieplex
zanjero
bertsche
nikolskoe
durnian
plisco
ridonkulous
serbanescu
hairapetian
sixthman
prooth
kracik
renesys
shortchanges
hydrofracturing
humanties
zhongyin
tracky
lyreco
pleadingly
attoub
overweigh
blogads
paedos
bisaro
panagaris
thueringer
oostlander
khudari
tradewell
winchman
supersexy
torfeh
bachvarova
oxonica
sanye
guger
benedikz
eskinazi
akeju
cermelli
philogene
dadier
ngarlejy
leavengood
stented
starflex
anawratha
frailest
shriti
schmancer
varfolomeev
clattenberg
pozzilli
dohmh
champo
bussers
filesharers
thronton
ffffound
wcpb
shabang
polick
repackagings
shewmaker
overburdens
fascinators
geordieland
assaluyeh
wanawake
ovali
boepd
eagleeye
armsden
sipila
hipsterism
denktash
nakra
briede
vantone
empanelling
sugv
ubiquitious
hyperworks
vneshekonombank
welltec
patreus
ecoc
bordt
zantaz
sodra
elebash
muslimat
manbeck
furtney
woodshedding
politicing
hippen
ditsi
olbas
disgruntle
geoana
suozzo
droppeth
jalaladdin
kongresshaus
skraastad
francophonic
egenhofer
kathreya
nimco
loyality
preseasons
rublyovka
kalapathar
strege
cdnx
lymari
enfeebling
tendil
grinny
htpp
shisheng
kierantimberlake
acors
rfmo
overfills
pickren
senkakus
pratury
zepnick
privitization
hasanali
nutbar
jagdeesh
bewitchingly
greenbuild
ecumen
caracollo
revatio
etchebest
hairclip
koedinger
daffron
cardean
toolo
wagonr
delaitre
tomsoni
photodisc
charcol
trumpauer
aurigo
culduthel
tosheva
hijran
lcbp
marajuana
solmar
cervelas
havng
transmogrifying
rescreened
cracke
batdyyev
milivojevic
asteroseismic
palstaves
gabla
georgeous
weegh
madzongwe
peleton
undramatically
cisri
albinder
foamex
overcounted
hoik
amgott
drozda
hawra
rochinha
farat
autolyzed
avmt
bochao
zibiah
trevelgue
cherdchai
cmht
hartlage
roren
nottm
nightbook
weekending
menlow
reuseable
matere
saaka
castlebeck
canonicals
rejecter
bernoff
jibilian
dincer
metametrics
madano
lamerat
hanusz
diserens
leovy
blatcher
reborns
alafco
aproned
peffley
gardberg
ectel
ecomony
hyperstudio
kaptsova
aboody
ivara
jpmorganchase
nearman
turkalo
lochmoor
smartstax
degreen
belaiz
molinaroli
luedke
durborow
ragingly
usglc
krkonose
tiszalök
huynen
auchenkilns
cafecito
twistings
parsenn
irreproachably
sensoji
gooood
annwyl
thouht
islamise
mieras
kanaley
wogaman
clingers
housedress
minuk
kiefner
civitarese
jumbolair
ayerra
ultraframe
ingenierie
mailpieces
commmentary
heftily
jermiah
oudated
solomont
conspirata
cringingly
gastright
mcconigley
reheats
vitton
plewka
hrapmann
vicunas
aftermatch
thadd
pento
orgainization
graessle
gearchanges
tabibian
parvulescu
goolen
whinges
erindi
irias
solesbury
fanniemae
lolls
coustenis
waverer
rumgay
springmeyer
itelligence
caoyuan
scotting
strandlof
hongni
arbittier
groundsmanship
muchoki
hellooo
chuckchi
shatilla
shampooed
makharinsky
mesters
mcgavigan
dudenhoeffer
persecutionis
downcourt
maletic
manouri
tramal
kierre
kontz
riddock
umare
foodstamps
openstage
stopgaps
zenab
undescribable
bariza
pridefully
calaboz
lopina
brooksher
benarroch
eirwen
demornay
papaer
brosolat
esafety
seamlessweb
supsect
mogge
pilferers
azulgrana
ghoulishly
llins
wettermark
coccaglio
arclid
kinglass
djellabas
reincarcerated
gunthardt
orpaz
irreconciliable
chhatradhar
douvall
mandazi
mannakee
borkowsky
countr
embalms
labout
wasicky
nonvolcanic
riffling
sabeeka
saddamist
antistate
chermoula
skrzypiec
lucard
cittareale
vaniel
effuses
barcamps
baaji
kolecki
bongha
bajramovic
ncdp
oenoke
mimedx
iahv
runscoring
stammerers
bruene
hoovered
danuza
oiks
muhame
jusr
rigidified
schinwald
togai
almorexant
sathyamurthi
cdebaca
elfert
liabilites
denoyer
ultralong
klutziness
exergen
mcclenney
chaussade
cullina
zibakalam
ziegfield
peop
schirmeister
kenneled
bodycon
moneydie
ensconcing
bujie
dragga
flašíková
synbiotics
bolno
dnot
phien
glamourise
relected
miankova
potot
wintermantel
suhn
sartiano
mosle
squawker
glencadam
gummett
wwmd
jeyapaul
ggers
antimacassars
radov
transaven
sentor
syangboche
multidisc
greycoat
heinert
chairty
maalem
kctr
bergstad
albicelestes
rasjid
cherrey
memorialcare
smoothstone
henpecking
kerkhofs
gagloev
barview
tutoyer
pistacchio
anthoussa
ratney
romanucci
motorboy
cubavera
ionatron
equired
flagellator
wintersweet
caringly
rauden
nvoad
farmstand
asiimwe
ganlea
roodman
waingankar
strangulate
rolito
litowitz
kozachik
knowledgably
kopites
freudiger
rechannel
agresso
carterets
tuksal
cortivo
decato
jirgl
fooler
madigans
expd
flanaghan
gloms
houtryve
arduousness
saintlike
podeschi
transperency
plyed
responisble
gravedigging
knafeh
diagramed
squawked
endebted
vondrak
malenkikh
iwinski
ofmdfm
neesom
greenorder
kurchaloi
bahmanpour
ghantoot
saklad
efran
depoliticise
halbfinger
saravanakumar
mengiste
ferf
cabined
fontham
wingstreet
bachmans
cology
theoneste
bareroot
netgain
knake
mgtf
venturewire
tightlipped
jujuan
vertigineux
seguranca
guendelsberger
weisgarber
refrescos
hatalsky
clerge
tapsfield
unseriousness
entinostat
payperpost
jasdev
habshan
spritzing
jaslo
shmelka
abduweli
nonpartisans
gaznavi
groundfire
snowfest
tarpinian
poddala
keppa
synagis
gassani
jumaily
barcap
cimarusti
megahn
coulsden
lpld
nhli
sourcer
cirkovic
buffardi
cseu
asgeirsson
tweeker
macguffie
dominent
levasa
maloni
neeves
jehuu
pollitz
fullfillment
holmstead
jotters
heretically
countermoves
miday
swarowski
headblade
riemschneider
horizion
fueltank
dymovsky
grundon
latsko
arlaten
sesquipedalianism
rubefacient
assac
nacianceno
sandyhills
unrecycled
enginuity
baudains
schouwenaar
bisno
searchwiki
druidsynge
dontarrious
moukarbel
smocked
amawi
emptywheel
pradaxa
abutaleb
licensures
intacs
unaccaptable
sandrakasi
villongco
halek
unshockable
protsch
ecig
mendick
worringly
campsen
jaunarena
ukaegbu
smetacek
preprandial
kamoshita
poujadist
resistere
westfeld
xigaze
prakken
victrex
geater
torgay
smarminess
stuckler
oldster
gerhartz
szulik
benseddik
sedarat
plasticware
brainpan
eader
plodders
calisi
phuensum
kikinzoku
wangling
arsey
gibala
tantalizes
doroshow
bellmen
merrylee
eelmaa
rivère
klitschkos
unemployement
oinofyta
uncategorisable
gernat
tsagaropoulou
owever
meritz
ansawdd
eodt
commu
mabala
quintasket
simansky
davilmar
vipre
utahans
mutsekwa
giazzon
veveo
syatem
omnikrom
vidino
deyanat
watte
subpeona
kupelian
mazier
lefar
segelström
gulmira
saddlebow
montcoal
lythcott
celox
kranidiotis
viscogliosi
carianne
dudack
bilchik
castlelike
capretta
liipfert
reliablesource
chidzonga
nilico
chikurubi
guiller
drillholes
yadegari
advantange
tencate
commi
kobliner
tsvangarai
halkia
bondaruk
outride
bachmanns
ruangkit
hydromel
ronksley
laronidase
aurlandsfjord
kutten
hamadah
doghmush
jinwei
tarangul
moinina
fibrillating
olsher
guatanamo
hamiltion
dulberger
donly
abreva
nanceen
comatosed
reacquaints
magnequench
ostaz
cahr
garlon
knoa
futuresource
pennrose
alvac
tinkebell
glandwr
recuperations
zhuravel
jingtao
stiking
francios
biotherapeutic
shov
rodericks
gyopo
vergiat
incredimail
lodeve
ahhhhhh
combivir
alouni
ziercke
sunlamp
hywyn
nuval
futatsuki
kneiss
madwed
mccoin
markshausen
linhope
norit
unnegotiable
biodomes
eredvi
lagudi
pricha
dusia
htng
talkasia
dnaprint
qrx
sickout
fasciani
palcic
sagardia
anzack
praisner
unprecidented
genine
suppresion
rehage
bennitt
gugelot
blaim
allida
rephasing
alertme
stasinowsky
widey
eastcroft
tragara
idose
agilysys
wassuk
mitsubushi
kingy
unrefreshing
unabating
danien
mcdonaugh
guttierrez
luijten
kadom
wilderotter
riuven
widspread
nienow
dagvadorj
kipman
grzesiek
noxiously
climateworks
cabourne
edemir
elkoff
johm
aquivaldo
sabaah
mmcf
pavluk
airwair
pristiq
traoui
iolotan
norhayati
autodialers
hendersin
hyperbolizing
substain
nureki
irineos
niknam
menthols
hakonarson
suffredini
leyrit
cardoz
ilchman
pratali
paczuski
unbury
crookhall
naturalizer
bardhaj
lahim
pleiotrophin
copehill
vengoechea
nasief
botanika
calvan
redstones
tapitsfly
maners
ingoldby
violanti
deveopment
degregory
chabba
arthrotec
immoderation
lousing
laggers
xiaosu
cèpes
tranchina
mohidin
bedmate
contucci
knestout
reboosts
cotgrove
saruni
oaxen
tofranil
abike
detian
kurkin
cutesiness
addage
mmbbls
antisocially
cyclamic
tayde
shahda
daniszewski
robberson
opondo
fougerite
mannos
gigapix
miczek
furhman
swia
pébereau
warco
snoasis
stamatiou
tropicalism
reprivatized
hiemenz
versaci
eastone
tombliboo
minskip
revoting
nicr
fishier
scoveston
fallenius
khammas
storimans
griffithsin
outcoached
cassellis
garorim
weightroom
ngowi
sfogliatella
martinko
tinharé
landesgericht
livepc
tynell
sledded
opporunity
abboccato
blackminster
fumbler
ramsha
hankla
evaulated
applabs
blasing
jadco
quadruplicate
offi
perelshteyn
annuled
trupo
chistmas
autismspeaks
santiphap
lekon
geosentric
sandioriva
rothbort
conney
moralistically
hipe
poulterers
capacchione
dunlevie
prevaricates
ronez
iboa
haemonetics
prodesse
chockful
fremeaux
ipbx
najai
leggier
teason
narubin
tricoteuses
klyve
kohinur
rushaway
talbiyah
huffnagle
aites
haidl
gaouaoui
gaoxing
mcclenathan
crapulous
raczko
dersa
rasist
imbeni
swithenbank
collingwoods
brynle
galactico
celum
zvaigzde
downsizings
jubur
serphin
seree
acqualina
idylic
bsja
kisielius
rasenberger
aliecer
perinpanayagam
testar
dagsa
zamarra
rambunctiously
sarlis
durepos
chatwell
aadnevik
surfwise
reorientate
riisgaard
saperia
lentran
bittlestone
twrs
pentabromodiphenyl
fmsa
unbureaucratic
paisarn
lochren
neltner
dogtime
kashmirs
lorelie
anemona
thunborg
hulya
sayidat
redevelopers
anasthetic
cumins
afrh
supersensory
regicidal
spls
nebbishes
modero
pekli
kinevane
ezat
indebting
icenhower
eprize
pedestrianising
rosiland
bronicki
terroists
bresland
clangorous
knicknamed
haoge
atrocites
ekazhevo
satinsky
contemptibly
adversly
kleinubing
peerindex
invideo
destigmatizing
grenny
geele
tommye
chekwa
comtempt
nyhart
qureishi
simchon
tunjang
deked
kenndal
onglyza
chunyuan
ciccolo
wirayuda
barnados
motorports
tsuyuzaki
infs
maiolini
intskirveli
corkers
aloudat
chiazi
unagreed
awir
boughten
gemkow
merentes
politco
berlioux
macaree
confernce
yiasoumis
lhamon
persiankiwi
metlin
turbinado
jancic
hetze
shunichiro
stonghold
gavaldon
lorrenzo
elier
wikispeed
himilayan
diddled
soakaway
sinofert
horsewhipping
mursyid
eggrolls
ruskins
bargehouse
zweigle
jarecke
shikanda
hostpur
stodginess
genuair
mikoliunas
cwcc
krinitz
jando
sarasponda
vivadixiesubmarinetransmissionplot
mobileone
mathiason
newbo
yette
naema
spanis
neddylation
sufferes
goatlike
helimed
jameah
cathell
mcparlin
mountaire
hoggets
destoop
mediterranen
shinrock
zarattini
silió
cornutt
woodbird
hochar
convieniently
undercharged
hymnlike
bhonsala
eidur
montanera
muellbauer
wrongfooted
lehmiller
mediabank
collahuasi
schiera
démarches
shetrit
mewhinney
feminis
kuljic
timeslips
ketino
rohaya
parachini
blakebrough
faler
literalized
bkmu
scruse
astelit
repositionable
golba
qassas
nammi
supranationality
neuroaid
speliotis
braconi
pimex
restak
koscik
maternities
relearns
benit
gootee
drozak
rockpoint
madit
lachhiman
jontel
pymont
needlessness
lupoi
corrosively
hoell
mihelic
amercia
oybike
tnsalp
abandoment
naple
mcneilage
ringfenced
lindenauer
econimic
cerovic
muehlen
trubus
thurday
refermented
rymans
sandhaven
belkovich
berserking
sadiyah
stanmeyer
cacy
snocountry
besylate
schizophrenically
pennsylvannia
traiman
zatuliveter
caputy
mpam
abramsohn
prma
ameritox
nonk
pyromaniacal
videogamers
wenaweser
tigerwoods
masklike
mougey
spindling
isais
osode
bleik
totia
obstinancy
epromos
latibeaudiere
knaul
samsungs
westernising
tagetik
womacks
madkins
bonarrigo
zhaxi
economicas
evetually
maiali
chrysanthe
imberger
frischkorn
vieshow
srdp
egde
zemek
marianjoy
strakhanovich
sexualising
cashline
marafi
beader
dismang
sibum
malagasies
stuffbak
senturk
inglesbatch
interlocken
luoland
matarasso
incanting
sleeveface
ballinalacken
zhurov
kangtai
boaretto
friendo
keroche
uvse
sporepedia
mccelland
mulsims
ketrin
sudayrah
reapproach
bondian
thermopower
dowsed
boagiu
abawi
burdzhanadze
marcinkowska
myoe
digitalising
patk
mohaqeq
obang
yetiv
tytle
britel
yrbs
unwarrantedly
dmepos
costan
penegoes
relocking
smartnav
alaixys
jeremiahs
bactec
noncontributory
winterize
iabf
yuthasak
scartel
engelson
waybright
haladjian
aduro
bellantoni
cisos
sahf
vodone
islamicists
homeform
fragueiro
moben
minap
tupolevs
subardjo
tongkor
gibben
handcream
pinakin
hurlston
claf
michelito
simrit
treays
gewgaws
giavanni
bridlepath
pioquinto
desgagne
rhogam
powerseraya
paydown
mesadieu
mawrey
chedia
mspot
ballasteros
eluxury
paracetemol
unintimidating
whored
lazerine
oenophiles
narcoterrorist
fwab
lawlessly
lizan
gasmasks
peretu
fridjonsson
criminalists
lenelle
outgo
ndga
rojewski
curleys
sevey
morolica
mollan
nugo
colazzo
matinenga
lolapps
speeks
jamecia
ilanaaq
shemayah
unol
minisub
puchala
sarkozi
sainato
cahoy
gbar
glendoe
cheaping
telasi
hastingwood
nuctech
palella
vermot
wxco
bladderwrack
zetters
vibey
suaad
hwkn
ibahri
whitetailed
jeles
offsetters
polyfilla
truffer
techflash
multiproxy
previewers
kindrochit
airpot
minikus
balante
fizzer
handprinted
demulling
rudrakumaran
bugaled
hlep
uncomprehendingly
actogenix
ogbuke
unipac
nghien
neckpiece
bhojak
shoptalk
sermonized
againg
giacomotto
stasenko
tousa
revitalises
vitalising
dattakhel
evaw
getsemaní
nbcam
loami
palestian
cvis
textphone
fakhrizadeh
tobman
pasovic
gintung
trevell
cartoonishness
vicitims
karrington
emnid
essoin
electrathon
folasade
abdulhussein
bamsey
dhurki
baasch
aggravatingly
vijit
mambe
qabel
luwei
friguia
baguilat
knittle
unfastening
lettinga
caiani
sydrome
txtr
wasila
fundraisings
cannava
ingc
sterilants
stovetops
linty
hryvnas
mstf
belmain
unwaivering
carpers
petlyuk
broschart
cencus
slatten
neuroeconomist
bibical
muganga
catfighting
suttar
afpp
superheroism
marketriders
nimalka
uncosted
harazim
abdulbaset
milllion
ciosek
slipperier
grocock
antimilitary
robocalling
popaj
doners
lynnae
senaida
binjie
fayson
kidstart
mashangva
suleimania
calter
curagen
breakbone
yulex
giammattei
ximin
unsticking
pullia
overspilling
petties
haziz
ichinokawa
arsad
kuchins
tamarkan
jilleanne
unpursued
reinvestments
chivvy
piersons
parowski
offsites
gutteres
goewey
stren
toolbag
computerlinks
spierdijk
czwg
paitoon
powmill
farasi
goeltz
alexsei
ppera
kaczmarska
stadtmueller
shanking
robbeson
echerer
pova
lightweighting
cookalong
stepehen
repsonsible
megabank
strim
shivender
beatifying
sauciest
szigetvar
pixs
rolodexes
pillin
gothenberg
sketchwriter
barnabei
flossed
clitoraid
zulum
dramaturgically
bendicks
scios
ceredase
streppel
dessange
wasna
ljdam
ilsac
yermolai
suppositious
purakayastha
rollig
aloui
wuite
hibell
cytotec
uyilankulam
peacoat
premeasured
locair
lemuroid
eastlack
petrizzo
malachowsky
turiansky
volskaya
tribeswomen
ratuva
childie
hooning
murban
mowforth
yatedo
sandblaster
intergovernment
jajab
marilson
daffiness
clapgate
gairns
shampine
schneyer
beehner
retimed
stalement
unprecedently
cicerones
iclaprim
islamicised
sejdic
chakladar
precancer
sovietize
xiaoqiu
countersue
chitown
infonetics
lipsticked
kissels
zubari
treaders
wollensky
centerparcs
lamazou
nodjoumi
upconvert
magelli
takaesu
dalessio
contrafund
fnih
schwapp
zesiger
kathay
bewhiskered
fukomoto
salmesbury
zimbawean
fiscalia
erbst
klyuka
candying
kleisterlee
giorgobiani
ojougboh
schriock
susic
mezedes
throbbin
presicce
redlawsk
lijian
petroperu
becchia
kettlebrook
mahamood
yanuar
stateswomen
nonelectric
andrettis
thanamalwila
castleland
webphone
jungly
ostros
pandorans
vaubel
dodel
unspayed
kalafut
herdson
ymarfer
chigirinsky
myfyrwyr
bifeprunox
jackhammering
seasearch
kiyah
blackpole
hendarman
mascho
massaguet
devasish
atssa
moralisation
cvti
ganguzza
snorkeller
cppp
gootman
qingtai
muzhda
alambo
loeillot
suvanjieff
calicchio
brnjak
punjana
dukascopy
sparcely
mylifebits
ismel
shompole
calbeck
wulfeck
rebif
bodytech
airson
maltiness
treaster
boudrow
draguhn
flanken
pukhova
npoiu
infobright
assination
norani
meijaard
aronen
drattsev
accton
starchase
mewed
meowed
suleymanoglu
fanlo
shammies
bannar
nyehaus
sebaoun
ladenson
depoliticisation
lladro
churchard
montelago
waalkens
premeditating
treadell
caubul
nightstands
sverrisdottir
rosofsky
kruimel
arvedlund
merkes
drasdo
khinchagishvili
intersector
attram
magnanini
gleckman
diame
dsnet
sebasti
medpoint
halterneck
mccaffree
unformat
vatanka
gallazzi
pspca
baghtu
blagging
kentallen
giannasi
contolled
roaringly
switcharoo
horami
equivlent
xtremes
nceo
portentious
conron
circannual
sarkany
datascape
subtance
burky
arikian
peruke
fndd
selimaj
clearancejobs
surveils
dryfhout
glbc
mainous
pdex
soelden
chigas
blotchiness
sbordone
nagpaul
malinowsky
kravi
wejustgotback
misclassify
contrino
bletchly
rieves
esmir
gyns
aiting
massouma
sharhan
portentousness
kauvar
demier
milblog
oluwakemi
threesixty
leshawn
irtiza
owenses
blahniks
premcor
rheumatol
idealogues
norback
gellein
stromstad
gelitin
feeing
ciborowski
rimage
bjerkan
tenterhook
vanalkemade
anticorrosion
spillius
javaux
hoarde
datacards
hezam
papangelopoulos
definably
bienaime
shrewder
smallhouse
jellen
staybrite
contal
mkhuseli
ryzuk
dinertown
chrisochoidis
giltwood
buyukada
reinsel
adebari
moudry
procomp
sweatiest
yeatts
erradicate
tweetmyjobs
orlewicz
trashiness
nesterovic
sagolla
chilliest
matevz
faoin
sheephouse
grundhofer
focuss
econobox
alner
salba
talascend
clatto
supurb
votin
yosypenko
enmeshing
equivocator
simcere
blasim
oobr
alante
consumtion
blackarmor
panyard
komie
elebert
digitalbridge
ambrozy
snowmaiden
paramés
doncasters
hurman
cuttingly
uaefa
hardhitting
jeselsohn
albiglutide
dalka
dragutinovic
ballhawk
shikaki
spectactors
asokoro
swansfield
infosoft
confits
lacedarius
fraunfelder
gastinger
freyman
osterhoudt
adampan
interflex
paytons
lekuton
danhostel
habitue
vatalanib
bernadac
marchfield
duso
vondrell
tooni
samander
scario
langoustines
prologic
forechecker
gembe
bettyann
workboots
gissen
hysta
midlantic
zoombak
cirali
washbag
hararians
mowaa
zalan
rambunctiousness
adrean
interaxon
sampil
shtayyeh
yesanguan
leftwinger
roettig
chascona
wurg
kouddous
okenyodo
dolfor
smorodov
shahedul
smta
polyglutamate
harjap
watcharapol
avantgard
crankier
lungstrum
exhi
mohrhoff
loogies
tyrrany
ubig
mushiness
odmhsas
kasunic
kickable
kochneva
schwadel
pflaumer
triplicates
rajalaxmi
beardley
wilverley
sexualise
gullibly
baryalai
refusnik
flowerlike
jetsetting
sacsayhuaman
baruffaldi
turbanned
unstapled
ateke
buyelwa
ogushi
gooshays
twitvid
paible
strating
internetworldstats
restacked
kovalov
netburn
filedby
barbic
fruitwood
tasho
padiri
quartaroli
starent
mufleh
conferance
gorat
eidelson
rubendall
favory
frauding
suhad
sconyers
cornips
veugelers
gottsch
oosterveer
pulsepoint
cierge
sitcen
denigris
guestimates
montrouis
jhpiego
recriminatory
perónist
mwanda
pogie
boudjenane
tutima
bavelier
coolabi
gorol
tuilevuka
gartung
ahole
muehle
nahass
tcheky
tdwi
serviceberries
vivaciously
friedly
novazyme
heimbeck
beachgoer
hoferlin
hopscotching
ossoff
junquillal
igiv
forsythias
tiney
brindas
intentness
odep
habuba
howeve
bandstocks
marcee
ritzler
reyum
salkini
zafonte
henseler
vemic
mawene
rahlir
jacquiline
apture
unig
mailat
laffel
demilitarising
appositely
litchis
dipert
zeinat
spevak
sayef
denesha
shorefields
lapdancing
vieria
gumshield
gisby
baerbel
dustings
commonweath
ahilan
kamitatu
guotuan
ebrie
callipygian
coama
zorbeez
especiallly
disclosable
ruttgers
fuligni
beldini
kutbi
natche
bohaty
drappier
myvatn
toureg
antiskid
wierman
arrearage
sükrü
dellasala
whinged
blocklike
parrothead
elminate
qamzi
farls
taneisha
youssra
kleynkunst
altr
gewgaw
wiesman
mexus
holmlea
tyrany
kuentzel
waledac
cgnu
superlotto
comtan
nocker
vygaudas
bobat
tightknit
sajjid
coai
filerman
kambakht
nanoradio
cybersafety
tripati
goehner
rainawari
ogelsby
paksane
rabeder
shigetaro
heffers
coregulation
kausea
tillabery
chongshi
suchada
sitzes
partan
breba
manzke
tagliolini
revalidating
sneakerheads
pęk
jcbs
studenski
disotell
whitendale
baraniak
linsker
sunbrella
ehouzou
jedburghs
ignus
outpassed
puertorriquena
moxom
daneshouse
gehlert
tzaban
khatua
loughrin
wornat
punkass
veghte
harcar
spoofable
bambarger
ispca
barnatan
garavoglia
plauge
accusatorial
continung
hasmat
nyangweso
wideford
hairclips
castrodad
feminise
naftogas
relis
nieburg
kurina
lthough
esthesioneuroblastoma
nakli
anzalduas
majhu
vasty
christoforous
morganthau
allaga
eggless
recks
parayno
petropars
rosneftegaz
ipys
elins
bulletholes
contines
outstreched
capouya
opthalmic
kojedal
invervar
rozlyn
zibibbo
panhypopituitarism
athermal
raskoff
shaea
rosendall
thorbjarnarson
kenyas
ghri
thaiss
mcgoey
gechem
officemates
grodsky
pittsbugh
vitorio
manahattan
moisturise
niglio
nyaumbe
prialt
tardun
exfoliants
helveta
sickipedia
delawari
relet
awny
trcc
untravelled
protaganists
engergy
healthyliving
gaesong
jaskiewicz
wsna
rafanan
politians
tapui
bloodies
solidcore
sabbatucci
yezerskiy
ritzes
arunma
tulsiani
sriharan
loyan
braining
bagdasaryan
consignia
pecr
uviedo
silentnight
tevot
iecex
instedd
melvinia
evanne
kinesthetically
hyperseal
unreeled
keckly
stolidity
brackmills
jaacks
indespensable
lissavetzky
zillia
overinflate
terorism
antonione
cerimon
ministeries
definites
sobue
bleated
myrthen
addidas
maftoul
jenay
hriz
harperson
drubbings
zangrilli
estling
kongwe
bacskai
ragheads
silcon
companied
tesema
napwa
arcinazzo
lagmore
hydromassage
jaljalat
phax
betokening
uncurling
scheyder
larrigan
zetlan
kully
urumiyeh
novitec
turnround
minorty
adew
gradillas
unseemliness
letterier
wcccd
oout
tolee
prediliction
tsapelas
krent
carwashes
tsatsi
condidate
ropper
gutíerrez
onyejekwe
ryness
gróbarczyk
klappa
wvvi
bedoin
biodyl
odec
hria
pozdniakova
declasse
salmones
bornhorst
realtech
oakfields
pellant
elleithee
dubrowski
fashing
braney
wiewel
thorleifsson
gazgireyeva
izlar
asig
dearmon
statland
barjon
manouvre
fezs
openhpi
clergyperson
centrastate
bodegon
scrabbled
aldeasa
rathgama
eiag
reopro
economywatch
westerterp
bluenog
fbj
bacaro
acucar
unavailingly
rethorst
cuvs
ghiglieri
paleopathologists
whitmanesque
serlet
actioners
repeatedy
rogacki
grimp
sedrakyan
casodex
heucherella
kilfinan
unanalysed
dicroce
snowworld
mykonian
ahlaam
kosseff
oeur
jinguo
surpressing
goetterdaemmerung
heithold
masnaa
iarfhlaith
colifata
mulrine
daewoos
memoli
homex
tidland
barncroft
veljkovic
traikov
jolena
wisneski
abraxxas
haymet
overscheduling
zemedkun
victum
gosin
excruciation
plumpers
talve
bliese
mielgo
neurobics
gewen
mangul
atempted
banchetti
mckellips
arwady
scarbro
socialmedia
ceannaichean
hoens
tpss
frankovic
kotyli
hyperventilates
cribbar
schmoozer
lezark
jibouri
gaddopur
mahalak
networkable
scalelike
reivews
dyfs
arrotino
pontignano
debategraph
sawc
aryani
dazzo
bullyboy
masstige
adbe
lustau
farruco
expatiated
serivce
gemperle
freakishness
alathara
tintero
unrealisable
minidresses
xtrax
umhoefer
porfolios
inveroran
brisbee
kiplingesque
chorman
stationwagon
roastee
aqueel
servicepersons
devoloped
micrus
kazmierz
shoulod
soulen
mandolines
maharashtran
jobsohio
tshing
tehm
eldery
avnon
lrdc
desmar
rostekhnologii
zitacuaro
elshani
edap
gymastics
mergent
wishtan
mereway
parkgoers
imars
nasulgc
jcaa
arito
ouatah
waterbugs
whif
matrex
spiffs
employeers
orotund
impossibe
felitti
nehad
fosil
cigui
hammerskin
gaxa
szczech
daleep
aloulou
przygodzki
oyw
amatista
trubion
postmarket
callisthenics
jielian
iovs
becora
khaindrava
woodier
blindsnakes
bpex
smokier
karahalios
vitone
glaberson
aaadt
nonvoters
anwarullah
clarky
olimpa
cherkizovsky
villainized
lancot
barzanti
errachidi
mcgetrick
dundonians
hamengku
yolton
swartzbaugh
levav
bouzar
yousefzadeh
bystolic
cardizem
proventia
yerawada
dieteman
fructifying
samarasan
triump
photoreal
kmpg
slatyford
supershort
djbouti
saproxylic
schertler
cusí
zadorozhniuk
equitorial
farelly
britiain
ʼal
oceanlinx
sayerlack
hespanha
jolil
hoyoux
massaud
lomitapide
shucker
ginis
citifx
loshak
rosekind
stba
manakhah
svest
varet
bademosi
korst
samreen
cobbolds
fonarow
neighours
kinnesswood
planemos
memora
schoppa
slatalla
giragosian
rncc
fleshpots
isocs
urtis
moevao
onlocation
appathurai
yiin
techconnect
retailiation
militzer
schlabowske
zargani
battenbergs
birdell
spazzing
mizejewski
billerud
klitzman
prograf
angloamerican
ashamedly
wishner
wirec
rustagi
biliousness
topbas
dabchicks
torrentes
postmus
roundscale
infinitis
hornabrook
badden
skochinsky
plainclothed
exoticisms
saralyn
tervalon
securecode
configuresoft
multidrive
rightsflow
fzco
mulraine
mundipharma
nonvascular
nägel
breathily
bodywear
appals
bowlder
cityteam
glistering
fereira
martons
inrap
ardelan
pentech
noaman
ilight
baggish
angoitia
wmrs
mohammeds
covia
celzijus
wiltel
stepleton
hermidas
kipkalya
barlyn
zambas
raeesi
citzen
iturgaiz
protesteth
pcdc
overinvolvement
pelekas
lagnieu
daisher
funjet
piasentin
boulangeries
podoski
bobbys
ultradns
impracticably
cathdral
kedrosky
rasheem
perunicic
prostition
erka
bryarly
pricelock
ornais
moneycorp
ameristeel
hughen
xianfu
mngadi
ventavis
fewsmith
frash
rothweiler
gabart
falkenrath
pakpour
sietes
ouches
youtek
jonases
bugaighis
primet
hoegel
dybkjær
dodkin
construciton
khouloud
crezdon
mindblowingly
hadnott
glangwili
paddlings
hajicek
widepread
rulespace
ectodysplasin
dandeker
zeitgeisty
perries
shpetim
amadiyah
keydrick
wehlener
showjumpers
spron
danehy
terveen
chodak
compart
virually
semons
mariell
methaemoglobinaemia
bohrs
oleszek
triathalon
prodution
dalecki
toeman
gentiva
truedelta
upheavel
riekstins
bouramdane
yelpy
canaloplasty
kodjovi
zirakashvili
locurto
jarandilla
rafis
cachagua
dahlstrand
bitterballen
mcclellen
wassersug
overindulges
stentiford
clema
sautéeing
unklesbay
tregroes
dalguise
aspatore
schlippenbachii
immobilizers
raktim
shamaa
chanterlands
saddos
resharper
boytoy
foresty
jelous
diesotto
multicandidate
kokaral
taregna
menachim
fayeds
ventur
gintare
lgiu
microproducts
tafesse
noonen
delinski
noebels
jaunted
naheem
selcan
indiewood
washir
nurallah
manufacturered
canonising
solpadeine
grazhdankin
scences
zanevsky
snoozes
marrietta
hamrol
kantes
preperations
infotag
brittini
securency
fuul
aeberhard
asteriod
roomstore
lukeba
russsian
eloul
pfaffmann
rislund
bronces
sploshing
swep
ralton
ligth
empoverished
boomburbs
nonsequential
lhcf
polania
mokango
thanawala
sinopharm
kryshtanovskaya
agrobiological
firmwide
willikers
kerbow
jarris
argumosa
attampt
hochbrueckner
eversons
ephelia
securtiy
netlearning
freons
asantha
fifith
piwna
pigneto
siddiqah
carifest
healthcheck
intoduced
beshbarmak
zauba
retiling
deganit
stevely
fruitlets
envolve
linardos
repell
hosteling
insouciantly
onliest
noiron
cardes
greaseless
doleuze
madewa
butuo
siral
ciragan
slurried
airbeds
tabloidization
stewmaker
obamanomics
vansteenkiste
goates
meieran
zoodsma
nasrawi
recue
mesches
oevp
minerally
venkys
kahlid
kondaurova
ordianry
likeably
aquarids
horelick
hypermarché
singularitarians
goken
tarasiuk
fluffles
jakusz
smetanka
tirus
minmin
sterzenbach
tournment
ameron
insititutions
otex
bohland
petrolifera
malecha
salerooms
tressed
lifewave
telenorba
deceuninck
shujie
crêperie
chadiza
sexbots
jeanice
ukieri
robatayaki
malenotti
inititiative
olch
danilkin
labrandon
prosten
artparis
mohne
cochetti
endp
dasient
sarinana
schamel
investindustrial
wwxx
jawanza
superpowerful
targacept
albanes
toffa
feinsilber
afbs
scaffolders
shidane
krejca
utimately
bookbags
fremlins
apsell
firestarters
healthsource
rachofsky
bistrots
tzavela
bellavitano
agreeed
freemantlemedia
maxcom
eloui
hotelicopter
allenna
mcgowne
torgard
oulo
blaven
zebley
tenerian
jacquelynn
theinternational
marzella
vandenberge
leuckert
straigt
cetron
cyfartha
kolba
koukidis
luftig
prestart
dorkiness
mohabir
nauss
sanitoa
khameni
eurosceptical
atwah
adung
yubaraj
zierdt
hereos
slaoui
biowaste
sandimmune
speto
tomase
collander
ssing
buttenweiser
horsholm
winkling
consommés
questex
watchstander
lehmen
tversity
sigaty
karydis
heikin
humorus
kourula
arechabala
devines
dehghanpisheh
bratich
dinowitz
karason
qadari
jacobsons
tonjes
filipiová
restitching
shoddier
scrivani
tibbatts
defoliates
unholstered
unmoveable
hulthén
moaser
gyrowheel
inflationists
neckpieces
deleston
gegenheimer
errrrr
amunategui
kingon
naciria
maehr
pulchritudinous
waitperson
appelberg
amselem
swanlike
spicerhaart
findaproperty
bartop
eucharides
bigongiari
pomaded
gieg
cleanshaven
halotherapy
orlane
anakara
khanafeyeva
thitinan
karugarama
hakimian
vnus
retched
maysfield
cvos
bearce
bioinnovation
jailen
bookeeping
cooed
addante
mafileo
miscegenated
fervidly
caroon
samouk
patelco
benzodiazapines
haufiku
tandikat
feldzer
quing
muheisen
kiwaukee
tilem
sarejevo
schieb
castledown
stuttery
ineris
annozero
raidrs
phaal
garzones
yongling
foldershare
anzelmo
handicam
zimnoch
atabayev
aicr
covention
jingled
petroliana
hotornot
jerritt
seprafilm
symbiocity
ganchi
brynaman
nazami
kavkasia
oceanhouse
legutiano
bestowers
returing
cantick
multicharacter
sycolin
aphorists
fritti
jacomelli
toradol
mujihadeen
onzo
turitzin
denica
whaaaa
narusinsight
champley
shangjie
ignobly
overemphasising
nessers
latsky
acknowlegement
sbics
ctitf
ctsp
cebalo
luxim
mival
flotations
fobney
johnsondiversey
brokovich
iguapop
zelevansky
boccio
knightwood
brahams
rogerses
glassel
hartzer
murado
huneycutt
kersen
himelf
targeters
pecong
tunwell
prinster
constiuency
trihatmodjo
collás
vaidi
grimbsy
nasbla
nakhumicha
vöslauer
olaide
gwella
liqiang
filmically
valhi
morters
orlovic
hilgeman
grizzling
ipev
farnen
governmnet
feyling
hutsby
viennetta
zippori
hushkits
biofit
rusen
ndonye
callwave
monomaniacs
busineses
andreano
degorski
whipsawed
mhaith
proyecciones
uppish
viread
defnitely
intracacies
knable
mohin
forseeably
eunick
presdient
hintlian
mtgo
energysmart
kalemeh
talez
razanamahasoa
deceipt
vainstein
watercube
bastes
necesitan
breysse
powerscreen
bashari
krysti
ncore
nalbach
nathe
orencia
divorty
fussbudget
anusat
groenig
phard
teachman
chandrasekera
chamni
moquet
unaroused
aspics
halaas
rrsat
erlys
collacott
ysbryda
downcounty
ishitsuka
velling
rital
hagers
djabal
hometrack
mionet
dymanic
jandreau
attiq
jannetti
oibda
maesbrook
guilbe
sudack
khaidarkan
overhelming
aquatec
arenivar
licuados
ladsous
xiuqin
boomgaarden
archuletta
schooltube
adforton
mashek
bisimwa
elshadai
sldf
mahimahi
openscape
mcewens
pouplin
zingman
dhody
seaney
tantalise
girao
nachito
afairs
nawlins
vasilyan
mecki
gwledig
marzuk
mazetier
visitied
inoccuous
techau
sonelgaz
brothy
porkie
huiyong
pannack
bahng
felow
infighter
fussily
ilao
thecompany
committeed
sakkinen
troester
unprecented
metalcrafters
rollaway
saddams
raqia
ezpass
priestgate
yenakiyevo
ghostbar
bellord
bagginess
dalonte
birckhead
gridwork
afrims
glasfiber
gelcaps
lashonda
lmxb
gloder
korenmarkt
turbocharge
soyle
sctf
pixlr
serfass
westerhaus
watb
clarencefield
scudded
circadence
lowgate
ferlaino
vaananen
moneghan
clayhidon
eeze
visen
viciano
kiyofumi
frazell
frontgate
cherigat
kabban
crcd
securesphere
medicore
schulle
krabak
bustleholme
bebawi
delonas
streetwalking
presedo
missier
dupper
fwix
xosha
zestril
conside
lmhr
spottier
speyers
lgpa
moview
loped
classfied
complicities
frosses
scheungraber
chieppa
thouands
sanitiser
staggerford
jurrell
silking
galewski
carveries
naffa
farnquist
frisoli
perdut
stovitz
drowsily
abstinance
korpal
disclike
betterinvesting
predicitions
yoslan
quotability
photoswitches
grpr
moviedom
maullin
maddness
stalwartly
starchitects
chhom
finkley
distroy
chainstores
cwmnïau
zuckers
kudrycka
desterrados
shanmugaraja
lakesha
shaone
shermanesque
euille
medhekar
kowsari
ahhhs
minsiter
lrpl
romanens
whelbourne
chelbat
penelec
kalsoum
basille
amgylchedd
collectivise
tohave
boycs
hustinx
colagreco
hrsd
viñedo
bluelinx
palmason
seillier
kaernten
predictify
shirzai
shahrestani
bialowitz
msrps
kickabouts
lattman
loosener
jabugo
izosimov
obonyo
baumfree
priot
researc
salubi
raviglione
darlingtons
soukar
tripit
donorgate
ghindin
wifeless
vppa
destinationcrm
miwg
whalemeat
unchoreographed
auriculas
canapes
bardavid
gimick
insipidly
apercus
numeiri
radioheads
chamberlins
peoople
fosberry
eaglerider
irhabi
chepkurgor
squeegeed
soulez
reacquaintance
volkl
conery
ciaglia
laedc
pietravallo
spokesdog
steelbox
devashard
esteruelas
anderon
mediabrands
arbey
olmützer
kreab
agfc
garimpo
mcnenney
sabatine
rabiyah
kumzar
musicid
pilgims
homefree
gepford
limeside
lousiness
prisi
adenekan
lazrak
guclu
wolever
restages
bleum
virtualise
tscp
maternite
kuken
snackwell
derrys
aippa
stibal
overcommit
atircm
zolfa
murderousness
peckolt
xdms
midgate
gorza
ruentex
sarachek
rotondella
ratkovich
rudoren
shoucair
shmyrev
hachamovitch
swoopers
remodulin
zahorian
mathmatically
pisarro
perfervid
cofinanced
accutronics
zomig
inferrence
verboven
cornishness
cavendar
toddled
gothically
manuli
wopping
snarkier
hmmn
milenov
brightseat
bardoni
auldyn
tinkerbells
mccarricks
preplan
matijasevic
deschaux
sedonia
dusic
susceptability
yokoshi
tifatul
batties
dellwyn
fritada
akond
defuelling
durica
varbusiness
loilo
striplights
papastamkos
budiriro
meeeting
mackert
unrestrainedly
franic
stampolidis
metascores
adivar
immensities
sanostee
chedraoui
latricia
silaigwana
joselio
villemont
callans
fridjof
activinspire
reimplanted
midstocket
fitties
vayl
khanbhai
sensualists
omelyanchuk
radiesse
devriendt
filp
kennedies
gabitril
overhype
maquin
cofinancing
shemara
kidzapalooza
belpietro
commmitted
kwamain
efyrnwy
gressman
brodigan
semtek
adirs
qoh
oetiker
stellaservice
cpix
freightcar
matria
nasier
arboriculturist
bozize
transhipments
tanrikulu
crossruff
superstuds
rquez
kalkut
ragoonath
antipolitical
fortunado
argiano
petrobas
rumsas
elisco
cadora
jelleff
ncal
sbsa
travelalls
furuvik
magalski
ratcatchers
mennello
doorbuster
yunchuan
zellij
coldsore
introducted
phytopharm
buerki
twizzler
scharfen
lellouch
arahuay
parvizian
mcconochie
langstein
tradegy
pinkwart
hirschkorn
foulers
amazongate
schüll
censoriousness
rebook
eruvim
zenkel
whiffy
tsarukaeva
kirabo
trevard
biogenetics
pingeton
enfeldt
callowness
zarein
miseducated
ahmedzay
esfandyar
boxloads
klipa
caovilla
heimeroth
enraku
dearland
soundin
trunnell
zuoming
ilyukhina
grupetto
midwifing
anerican
nondeductible
espd
fuscia
trubek
menashri
prigozhin
goodmail
lelands
oilexco
wotter
undedicated
kadannappally
pillaiyan
vilaro
uralchem
memish
cashedge
khorramshahi
mailouts
miccolis
amendoeira
lavarreda
sardjito
awarders
trenchancy
ddal
wfes
curruption
dhoble
quattrochi
nosebag
interestng
tamasy
endesha
hanawon
datillo
arktikum
lavapies
protrayal
protectins
telemonitoring
tackiest
strokemakers
goldinger
unexcusable
intelliquote
sunnyhillboy
hyperic
yuanshao
turnpoint
gsoh
spinelessness
corralejas
mildrid
clomping
paisal
ricked
putbacks
sbrana
choiceodds
cnaan
monics
raihana
sebnem
enobled
studentcam
schuppert
endboards
kafashian
sujak
infinia
duplessie
altinay
starchitecture
kirubakaran
dentzer
calvanico
galamsey
mukisa
duoji
payrise
batoning
ustelecom
aplace
chloraprep
yachters
sekulich
suddeth
fengwei
bicentini
kmarts
kliuyev
nickolds
medbery
coronaries
bielas
gaull
gonxha
mindray
sibio
sohil
fradgley
fastlicht
sperrazza
berez
eprivacy
gingeras
noticer
anabtawi
naiem
talyor
volksrant
mussolino
abeloff
rubare
thirtymile
pulchritudo
nahiyan
mchutchison
khovanschina
uxbal
dakwar
kookaï
lovenheim
gudavadze
midence
eatr
vanderzee
sickliness
nixonland
resiled
schwanewilms
ossatron
smidgin
doodlings
indebt
saugy
rasula
zachy
uniongyrchol
orcoyen
kingdee
komarom
lecarre
viennas
lemlem
mirixa
lovallo
isaby
hightlight
fcram
hoariest
idell
screeming
thenca
apptec
dorsher
zebtab
zesn
exergame
netquote
fistfighting
besir
apakan
nikolajeva
viatronix
leftwingers
numnah
ajdar
aborning
gualano
charbit
kleptomaniacs
houssels
mishara
abseilers
hradilek
grayken
pesko
wheek
metalor
dasko
bazaarvoice
downsizer
kechele
batcho
nonmineral
manteros
gatier
millésimes
okasan
mulben
waple
brasiliera
shmuger
profetica
theravance
bialo
honsel
fcso
tssc
louvrier
schleisner
hujjaj
maschek
gardebring
brashest
autopart
bcbgmaxazriagroup
frabjous
peacor
northton
shoulderpads
coporations
hspice
veriface
dryvax
julys
multilateralist
goosing
usce
goudiaby
graffagnino
muhedin
manata
sweetmaker
channelweb
grauso
tremin
wmk
cenes
kimbrose
pozze
dybbuks
nozkowski
marraud
medjools
demosphere
sunvisors
chotaro
condemming
zouerat
sorton
zetterman
xinyong
fulgosi
klenz
teeranun
wisened
chmielinski
norek
swooshes
alsaud
rossiskaya
plues
aahad
beecken
hollihan
koezuka
rzo
hagiu
woofing
lacourière
yottabytes
zerog
damuth
popick
cruely
payge
loopier
laxmanananda
ciwidey
kebler
aneiros
sherrow
unterkofler
jameos
akritidis
gonone
shanetta
lacinato
breema
farechase
idtgv
biospherics
mirise
noboby
mecaniques
hillendale
mishits
lotharios
sanglah
psbr
sleepwalked
siblinghood
barioz
vasakronan
bordwin
scoleri
failling
inevitible
jabaal
nuzzled
crouzat
javedanfar
overspills
kathon
pengxi
rimowa
beauge
paraquad
teyssen
mesis
chlg
asheninka
sarcona
cegh
martineck
wilens
pomponi
pitcaple
haphazardness
mallay
bodywash
zoonation
folketrygdfondet
pinheaded
tierny
milien
ponderland
pongracz
controladora
musberger
sterksel
cocis
narcotraffic
taegan
pollanen
technolgies
ganek
chechan
ogguere
ozumo
prisonners
japansese
glinert
conditon
stayaway
blackann
onesphore
jiaomei
agboluaje
cratos
hirko
haukohl
monotherapies
bergian
bucardo
trowelled
viermetz
juniti
possoni
kasselman
bonnani
aumento
démodé
tatoyan
kasonga
jesmer
cavness
affymax
liddells
yurgens
cahlin
backbiters
mcdouble
saulino
bertagnoli
zwakman
ziccardi
supremecist
heeeeere
nighties
kulbicki
masline
lalena
affonço
yumasheva
ecuries
yhis
maici
bernasko
sahidulla
klyberg
nannys
heilveil
rosengaard
uprighting
reindustrialisation
nordaas
raisinets
tweneboa
mitrichev
gawks
qasmani
bikesafe
unspooling
khukhashvili
mélisse
exective
stvp
securer
duerrenmatt
fitel
tifanny
noray
afeared
nyanzale
specia
shalson
bhattacharjea
myisha
franzoia
suberb
catheline
krahom
seditionaries
stockmarkets
jokanovic
ltcfp
orygen
birkmeyer
jnbridge
unfavorability
sumbandilasat
neelmani
technocentre
beaunier
bossler
brownism
harbourage
mhhe
tbps
staner
depersonalise
shopfitter
albless
birchers
sheeplike
mitsuma
ramsaur
kostyrko
tailgated
breadlines
cepko
jivanjee
nenuphar
uramin
revolutionises
kwikchex
multihomer
siheyuans
wardieburn
barakula
parrasch
kanpachi
crewcuts
sebio
nikpay
pappers
lengkeek
salares
jaffry
thorseth
saltness
lotery
beversdorf
tulashboy
shiavo
caravaggios
elmay
amenah
licy
vovan
gleysteen
osbrink
ekay
simmond
bukata
borovay
madisyn
aulisio
thinset
dakim
realtive
altantic
incommunicative
wadsted
nimbys
werbel
scarpato
karith
kohlenberger
antipollution
sengstaken
munqeth
ogut
babbio
vunga
zakiur
iciest
mussoni
befriender
efdi
ptpi
pelma
transitting
tarino
tindy
sculptra
dalcin
dugel
sanzari
lumleys
polkinhorn
campaiging
baroncini
pellerito
kaloyeros
skycaps
kozari
watchcon
nukular
cunat
religously
nhsa
nomineee
bibic
alpuche
ritualize
riotto
stauncher
boulodrome
hairst
caiz
trattorias
silvercorp
tidmington
wigren
etcheberry
tennessees
vaghar
gaoke
batemen
charlow
bellevarde
bobbly
goodear
yazdovsky
eeurope
cryopreserve
broubster
jaynarayan
fordian
guilelessly
indarra
questcor
cannabalism
lehmacher
encashed
higgerson
marketside
actiontec
wraxhall
sycophantically
renasant
skyforest
toothlessness
toshishige
cooneys
lovborg
waggles
metelsky
inayet
netessine
pooks
thassa
cummis
lookbooks
tpsac
collpase
monicans
mikul
tscl
ausn
dext
sweatsuits
daklak
peerlessly
playtimes
westfaelische
maneouvres
woobie
iezzo
pethau
metascientific
unsentimentally
cruxton
nigera
simplyhired
afican
semneby
backorders
malikyar
swansborough
ozd
blackweir
becareful
varlan
balbeggie
cfats
simpe
bouilhou
alwaki
kalentieva
dardai
electrorock
shipwide
cosily
withn
retino
gridlocking
iggs
denuclearize
gubbeen
mauerfall
windchills
ugss
tananbaum
luskentyre
mhangura
venkataram
wondeful
kalite
kingmaking
strozzapreti
happeneing
antipaxos
fransje
chargoggagoggmanchauggagoggchaubunagungamaugg
edies
giricek
finnebrogue
homeboykris
tristone
mobus
kriese
schrauben
adilgerei
shmira
inarticulately
thumbdrives
cegedim
philomen
lawnbott
fornalutx
filev
poulters
poofed
wardwick
epistolatory
zanoun
abingworth
untaru
hilldrop
altamarea
megabreccia
googletalk
motionflow
thuyen
fugitt
obray
ferrai
cricieth
imangi
tranferring
willyum
pufferbellies
olajos
isaichev
intellectualise
financiarul
meejin
corty
rickeys
czarism
srifa
lookng
kopera
geoengineer
twistier
hundreths
ixer
frightning
pollins
cssv
renfors
oiba
loglines
guoqi
franceschelli
econmic
prodemocracy
pacl
leeuwenburgh
hawketts
nonsensicality
tylerton
elshinta
cheathem
gedco
spilak
klisch
purchasepro
gonvick
moussaid
gogii
kurgo
novostei
moletta
dargate
fabianski
pyatykh
hibees
niccolls
vetrone
zottoli
inglenooks
eicu
rexrode
warrenty
rikardo
unerotic
immersiveness
lichstein
punnery
stigger
anonymise
wanyeki
metahaven
stetch
holosko
mantau
tironi
bechtolf
joedy
ruedrich
ettner
monolaurate
hearron
mutemwa
netex
manzitti
galwan
pitchwoman
hotelera
seereal
gwq
tankan
rakus
akallo
dyatchin
coupledom
subtrochanteric
sumayya
nightjack
ryzik
hardmeyer
jacketless
dsda
moble
leruo
iraninan
resorces
filaq
tsfs
virilion
flamboyán
wrotto
xiaflex
bettio
holzheimer
unhired
zahera
idealogies
predo
quixotes
chirst
laurionite
buisnesses
hirshenson
lescault
adventuristic
wraithlike
dossing
tabouli
luege
iwave
rivellini
semicon
coshes
viehbacher
volpp
nonfuel
latunde
gamalath
mouquin
cappex
pingsha
reupholstering
gulty
msosa
ndarc
olazo
speedpark
flovent
quintupling
rybovich
pyttel
bankwatch
truimph
helocs
semalaysia
clerkly
ropelike
bradke
nrda
elagolix
monokroussos
presentaciones
lepauw
cowpats
dorb
trenfield
matiullah
mpal
brcd
himax
nosamo
excrutiatingly
microfun
noymer
blossomy
qayara
bjorge
nighthawking
tamisha
kalenjins
garosci
helvacioglu
biosurgical
unemployability
mcclaran
floxin
hyaric
chiezo
colascione
massucci
autonational
baseco
godat
juchitan
bertamini
mainds
kolontár
noncommittally
aptr
cottrol
videophiles
irifune
barakani
megadrought
bolotbek
biolley
mosbah
ijewere
naqura
bunless
preinstall
dunnichay
fortresslike
lovenkrands
rivello
unrisked
stoemp
ccould
heptullah
siggs
ykhc
puhalo
mirapex
tomnahurich
mcstays
ferarro
trouserless
leanse
goerg
jacquards
hromadka
noncumulative
unroaded
impieties
mikkelsons
sweenie
blumengarten
saleemul
percentagewise
jizzini
frantoio
zinetti
lunera
kalaye
tumbrels
kenber
struhl
baheer
langenhahn
hairsbreadth
decarbonizing
mundanities
redepositing
videosphere
muqdad
donmoyer
chaleff
yhf
osteosarcomas
nasuni
gettng
crescenza
slingboxes
collardi
saeedullah
misremembers
grandmougin
bgrc
modfather
avenbury
skanled
survitec
grossmarkthalle
videoboards
yondelis
fugnido
sarewitz
bonxie
knobble
nerdfest
stribley
tarekegn
hoogie
rhenaniae
lentine
epir
lautin
rekso
leaue
jasikevicius
jobsworths
hsss
kachikwu
vanwalleghem
dontell
dizzied
rothgery
foxiness
chalamish
littwin
whpc
caniato
coolish
incuriosity
proassurance
kcfa
vandergaw
andarabi
govermnent
manhoods
discofied
tumbaco
vyomesh
toweringly
testagrossa
dolmades
yehle
dussindale
thwacked
delerious
inscrutably
tapuaenuku
elegent
terravista
adventuresses
northumberlandia
saiedi
bramleys
osmanova
hyperactivated
cultiver
muhle
echoworx
murderes
bordean
boodman
egnazia
paffendorf
zakhu
schels
pitaro
britspeak
skinflats
phonautograms
penine
storediq
tionna
yaourt
eurosur
abasing
gerrymanders
sidled
koblas
nexterra
eefting
sdsers
moonfleece
puela
hedgelaying
woja
matieu
nlcr
viebrock
doumgor
aeronatics
kochon
siriwat
gloppy
perceptics
dunseath
veeps
mxim
capeless
costcutters
purhonen
harmanis
dorheim
bollene
lidove
smartconnect
keyna
crunchtime
poeteray
palmprint
huante
unconventionals
teresitas
cascal
lavee
incompetencies
campign
antitheft
sulub
discombobulate
larkcom
rotable
shurah
taneshia
sabraw
brunstock
pourdastan
timetrial
gewanter
fidc
unionbancal
intenders
unpeaceful
koelzer
maizes
corfman
tunza
ngirabatware
postler
sudr
rozak
aple
fecally
oncoplastic
kaching
macrogenics
makhar
syscon
sexify
nagdi
kubesh
fruad
mandoo
unticketed
rimmell
buratha
sleeting
smolianoff
dionyssos
ambrook
macaleese
knells
dimiss
medigene
bddk
vigilanteism
cioloş
iofina
kirkorian
megaliters
saladdin
reappoints
manich
varsos
sugarcoats
premeir
gudiberg
matzger
meville
collateralizing
nijah
beirutis
fishbowldc
preparty
cardlock
snorer
thirtysomethings
lochview
kát
bhumipol
aquatech
inarticulacy
raffele
bedzyk
sporormiella
defoggers
weithaas
promphan
hantho
unconcernedly
seattlites
nasari
elating
screecher
charpai
eceiza
beinert
schapp
nelco
easyoffice
downblending
dabaga
suthin
wergs
vouchercloud
persichini
fertonani
runflats
tagney
penpole
tarratt
harmening
balletboyz
qeis
onziema
houseproud
respess
casglu
spidcom
brtain
akinyi
katwala
wooziness
dynapac
nccg
linescore
chrysanthis
ladau
stuller
nowhatta
okosuns
troesch
portgual
bluelines
schnare
kaneoka
yarkas
cescau
askarieh
otjivero
kalashov
ramoin
creeth
afterparties
excelstor
thecentre
khulani
rabjohn
biovex
zvinavashe
rosty
composedly
adulate
weydert
squeri
scorable
rafin
atbs
nuedexta
kegels
sawrymowicz
flabbergast
mistrals
molsoncoors
singlehop
blooped
grambau
cortèges
ohley
dangerious
citma
procaccianti
relevation
pubpat
razzed
oveneke
osvs
pancanadian
lenane
giulianotti
outsiderness
teaster
negedu
costerton
propagandised
kanektok
unimpressively
sordidly
elslander
arall
ritvars
saudino
tupra
politeo
premeditate
basbas
konzelmann
yakcop
aurus
funsters
hezlett
rakishly
eviscerations
meritcare
underutilisation
christofis
filtronic
huiskes
pashmul
romeyer
coywolves
dezell
rigerous
housesit
panjiva
boatful
metamorphosised
hulky
roures
bilingües
snerdly
trusso
draig
headquarted
viafara
altringham
oppotunity
laudicina
berisa
underrepresents
broadbrush
conningsby
emfesz
overcash
outproduce
olke
gudni
akbulatov
fruitseller
ashecliffe
amortising
policitian
mommer
snis
shoeboxed
inteko
unconstitional
tsoukalis
mezzaroma
loveshy
ecycling
djibrilla
experinece
keithly
irenas
saghal
rcni
gubner
bangaldeshi
grandt
escentuals
dorros
sayafi
galavotti
actec
glinted
seminfinals
dendoncker
cryle
guachos
cicalese
reeg
illiniois
themal
isenstein
zarkasih
matternes
inzerillos
fetishising
mowj
dasmunshi
stallergenes
sevugan
dolsi
nawaja
naieve
dabovich
stephannie
polydrug
unieuro
bashiqa
bajur
cscmp
hypomineralization
nspca
cockscombs
reeher
vigliatore
bvrla
schlomach
mathrani
jurisidiction
jaralla
stojiljkovic
haskells
inflammed
gauranteed
seebs
sideroom
kiddingly
kanneh
squirrell
shekhovtseva
crooking
wfic
econergy
chemoperfusion
soullessness
cavolina
sophistications
guaceto
zotova
reimchen
youngins
abaas
excitebots
shadeless
shwam
armload
cbry
bjerga
trizzino
sawula
belluscio
withrington
scrawlings
tranquillizing
disinviting
techonology
zhengang
countertrend
loooove
lummy
wintonbury
revention
dzongu
mohtashim
torax
lycourgos
snakeshead
handpresso
aureos
stecoah
zubiria
shanas
djalo
bpoc
dipippo
santhera
buckweed
youtie
heathhall
ofcc
interservices
cheongsams
rodenberry
commodifies
archaelogists
repesa
overfunded
mamalahoa
ghayyur
naaco
jammaz
averbeck
scanbuy
tahita
sithanen
bilaal
radkin
citrano
nonessentials
kulasekera
hsse
grumann
fishermead
ploddingly
schratter
crimpers
phillp
danat
trendspotters
joman
hawklike
neighed
guadamuz
stoeckle
tianyin
maccray
vincor
weldmesh
regieme
steriliser
srdf
robindale
selfserving
geneive
pukey
knuffke
travelwise
dalmain
tradionally
bemuse
charlack
sherika
ambasssador
ddct
pokiest
peulla
solowij
sheenah
chakrabortty
waheedullah
baaria
postconsumer
hiiran
stolly
auberg
subprimes
vanhoenacker
snowshed
lovasi
herrling
barbicide
quemere
xuren
satays
abdulrazzaq
narusova
warinner
makombo
dharkenley
restrictionists
investees
foryourart
secla
raizals
kaijima
begrimed
ejide
jumpiness
kirtankhola
fifteenfold
kleercut
abosolutely
chintzes
klyph
bransome
groundforce
andriod
irandokht
obono
biegert
centerist
erikkson
mmia
vesicarius
flonase
lotrel
armfuls
hosiden
tavernise
woodenness
wsbtv
jabrin
dualview
electile
royak
eurand
pafilis
joffrion
honeyben
ursitti
steepling
seimens
reservationists
hankham
mornati
toped
hoeffner
beaconfield
representan
eadington
nevres
cerattepe
vanderhoek
indefatigability
fenwal
surbaugh
wiedmeier
perserved
birgun
plastinate
futs
ereading
thillaiyampalam
ognyanova
unliveable
zlaten
kumutha
rinzen
scharbach
undershoots
goldensource
musslewhite
quaden
nastaran
osmak
fartusi
saponaro
chernofsky
disinform
dilsaver
ipelegeng
selosse
trifactor
bivings
areshian
gyurcsany
nethken
confernece
tuerlinckx
hammaren
trofile
steinbergs
wobblers
reesha
outgross
northgatearinso
gangjee
mukantabana
meshkati
messianically
kejuan
celebriducks
weteringschans
kuzo
misdating
shlapak
selecky
rodocker
chebaa
abednico
scalamandre
compañeras
buttsbury
friedensen
montepellier
madle
utek
facette
pyrolyzing
fieuzal
laproscopic
yashkin
jermyns
helmsleys
nozipho
mcpate
bobó
alexes
jacobelli
mainiero
tsumani
neognathous
mcommerce
wirahadi
boersch
banser
mudpools
plungington
agroterrorism
burnison
beachem
fmcgs
mankiev
buchris
ihrsa
lahza
duchak
chartkoff
mmvii
arrrrrr
forlornness
gonchor
apotheosized
magnoliana
jaurez
cyrenians
chiofalo
abfd
gasòliba
nonlife
eobs
abdolsamad
forbeslife
chouchan
sidefooted
beautful
runstrom
uncynical
gymuned
thinkequity
schendler
palchak
wiltons
pitters
haigood
trisenox
colbost
jneid
giacaman
kellari
rebanded
isinbaeva
simpcw
unrecognisably
wabbes
alamzeb
overhit
sinese
muner
repacholi
wodicka
demou
barrickman
perving
auburns
educat
contageous
whithead
mortages
understatedly
submarkets
kleibacker
selction
donnish
schaltenbrand
heartscore
shamless
rowdily
paleja
genarro
oarswomen
freekin
cdhc
goldstuck
opeing
tuxedoes
raucousness
technocracies
suppported
tyranical
pannacotta
perimiter
shrivers
khinsagov
schiavocampo
meadham
noshing
extramadura
minesh
granades
forhead
gimmies
sheib
khury
gøtzsche
ceril
atebits
midfields
teethers
closey
zammo
medcath
slosser
ladoucette
africanamerican
carryforward
hanenburg
tradi
ellenese
crammers
cjackson
financiación
wildmill
sidekan
surrend
atomising
fabrazyme
frisée
boullard
rhythmless
phcg
tvert
propanganda
smigelski
scrappiness
ghadaffi
weihuang
greenwold
boebinger
kwsc
ereck
nabali
glenzer
drumshoreland
zubairu
leichtfried
servce
transation
comittment
choksey
ampinga
motorcyle
quadbike
starbrite
bcame
goksel
germophobia
daddow
depresion
maleza
progenics
goosestep
donghui
daddona
weilheimer
norred
pitcairngreen
antcliff
bollihope
vocabularly
sachsgate
kortlander
premediated
agoraphobe
gorovikov
pierogis
dobrinski
singlehander
kamais
ludzidzini
rapkay
balero
eviter
scarletts
isria
napierala
newwest
pericak
aharonovitch
snugs
ercument
schoewe
alykhan
interntional
sgitheanach
lambrew
boneheadedness
rhadigan
pwer
bianna
taranabant
luluwa
fosmire
agahozo
wickliff
airrion
lianfang
pangalangan
sukhodolsky
docilely
clearence
sendell
minnite
pannek
blubster
towables
ferraiuolo
bossone
barosso
winterizing
slrk
karcic
heára
producted
redounded
shovelfuls
fullabrook
hairbands
nonstock
dadswells
alemparte
ozor
bangkwang
schlössl
slaloming
niswander
guzovsky
millennarian
ratsat
foreighn
inarticulateness
audenaert
heuze
lambikiza
koechler
bespangled
gtce
vaction
contrairement
rajapaksas
pieczynski
unforseeable
rumbatap
mediterannean
aviran
docksides
catavento
modert
italpetroli
demari
odioworks
mcgougan
schroedel
abery
montemerlo
spasic
eglis
dunkenhalgh
islanova
pixelmags
rhoys
kempinska
cleansheets
fregola
cortman
gerbrandt
viacell
granulates
cobeaga
sunbathes
snorty
greywalls
sbfa
vakas
cilurzo
difx
assiduousness
wbztv
mcln
avona
standhill
aisenbergs
morroccan
colllege
underspent
visitphilly
kyndall
untinted
mmscmd
unremedied
buddied
poretz
pogoplug
phangnga
gudhe
groundedness
gibrilla
virkus
educationguardian
chemu
garofalini
sahd
extortive
sotherby
wienermobiles
hotdoggers
umiker
pallozas
dueles
terprom
kaboni
plotts
vukelich
kazakhastan
sideeffects
vanska
accolate
heesom
adgenda
hardbat
skeens
silgan
melbreak
flouquet
zokwana
narika
rhoi
ebridge
yge
mushailov
miiro
imbibers
conjurings
sengalese
bombardiere
andrin
selamet
griesmer
battino
zumas
mananged
sevices
ingrouille
karnell
fotp
rzss
fakhriya
wyndgate
zaradic
photoquai
banej
itrust
gardenwalk
adaniya
ethiopa
microtrends
dincin
iftars
phrasavath
labl
laicite
asarch
beanos
exended
tombini
ileka
xuehui
soury
differntly
lystedt
shocktoberfest
thinky
asiantaethau
reinaga
willenburg
essan
geuder
jianyin
truckles
nayereh
almezaan
meadowes
isupport
sengis
dziak
demerjian
dvbt
sajat
underripe
higazi
rhostryfan
moconews
caler
beneift
klaussen
emvall
somjai
opatowek
sealeyi
hfmweek
glangrwyney
vilhelmson
accumen
nbjc
ladgate
karlekar
batschelet
delegitimizes
wumart
lawlis
qanta
boredomresearch
hendawi
inoffensiveness
lonley
lipoff
revaluate
drillhole
earlyish
ecssr
thurlo
exurbia
matrimonially
mobbers
berlingieri
foreshorten
journeywoman
wintles
qtrs
girlfirend
hipping
frontloading
allisa
rebeaud
cézannes
pdii
pizjuan
zellnik
unamendable
temporising
jafry
overanalysis
zaiwalla
topsiders
headshake
petroci
nodl
burnstone
altmans
manuevers
pemberly
alliedbarton
brucey
secetary
fullcircle
khomeinism
sashenka
andalgalornis
wonkiness
beckitt
borthwicks
banrural
unadvisedly
mlynarczyk
jouault
moromizato
umming
botvin
schultis
thase
muffing
sliska
mamere
echikson
codero
brutishness
soilihi
kitterick
bloodlettings
rhinestoned
icej
niesiolowski
anandappa
relativise
blockfront
frova
camozzato
phyiscal
skillfull
mickleson
qahwash
draaisma
piconewtons
chernovetskiy
bashier
superliminal
genthner
consititution
columist
thurin
buzaigh
pnvs
guestworker
lvam
devocht
demokrazia
techteam
wildnerness
gonch
neoclassicals
reductivism
crasnianski
panwa
snjezana
vasterbotten
medding
lerille
amezkua
playacar
ferrells
ucsmp
navisworks
kesselheim
betsworth
snowsheds
tolossa
reebie
rajaan
buddi
murkily
coronia
seiniger
janal
prebeg
defecit
whitetips
agassa
escuder
torlot
omaya
molehunt
concretisation
marchinko
ballotine
reailty
tudworth
delahoyde
cornioley
roenning
sarcinelli
celmo
mulbah
spritzers
zees
goffer
jerilynn
tenedor
rebic
gershenz
revanchists
pakista
memy
rojansky
soulcraft
lamno
mottern
dantherm
restis
embonpoint
deblase
elesin
olic
cpdo
kavran
mimnagh
gibaut
hudlow
guestlogix
decolletage
allrich
cloffocks
lashell
pietrucha
freerunners
echocardiograph
vargos
voloder
deche
compartmentalising
kupferschmid
inveralmond
erdy
soovin
laskhar
biocompatibles
winbourne
particpates
handsom
certegy
tambasco
ixg
buffenbarger
cindee
sspo
breathalysers
forys
odweyne
rabuck
saeka
erturk
bocot
tssam
callori
elnar
déshabillé
americanhumane
coralled
journies
spado
lafranchi
swapp
cardelle
deschamp
kynikos
parkgrove
nrtee
ldtx
danstrup
fleurent
deazley
griffall
foerst
øn
velleron
mensik
nesmachniy
dannette
concepcíon
counterinsurgents
tambyah
eurodata
drumwright
pineoblastoma
ghanouj
mayernik
morentin
kanaha
gesac
malagon
westerhever
doorknocker
peacful
thirdforce
shoeshiners
quesney
dogwhistle
alape
hoornweg
consolidants
williamsen
ftaap
lippiett
politkovsky
moqattam
jewsons
dechra
tranform
tregele
educap
wearyingly
florally
mugunga
lafauci
brundell
backloaded
wittenauer
bydesign
screamy
thealby
cauthron
leavenheath
swartzberg
sdrm
omigosh
nomnation
kødbyen
baytril
sassing
sidko
swopping
traschel
rhsc
gailen
mcgrand
itins
vinuta
patuano
prezant
puhn
mygig
seenigama
chasteness
shamila
mcclammy
lekae
extenuated
tartines
lingeringly
incomings
shwarma
kianoosh
belken
nagatacho
altcourse
bundley
heisting
gionatha
ellaktor
basrawi
zainey
unapplied
ilean
zaineb
simclar
italain
devonside
muehlhausen
rochkind
guarnera
quanis
renewdata
horakova
zoocheck
disentitlement
sooki
boichuk
guantanmo
bwambale
zirh
comroe
amersterdam
kurty
fcpo
lawil
ganca
prodcution
mihn
wagster
galperina
landfast
abdelmunim
bpop
streetballer
foreswear
austhink
incriminations
superchip
hulaween
radiotime
feay
airmagnet
securum
overtons
brancott
bourk
burdeyna
unfug
duchez
smooshing
chaiwan
pennbury
npwt
tmep
outred
monfredo
ndirangu
wzaa
cryptarithm
reimportation
fasanenstrasse
orback
bholua
promens
englon
organizatin
unentertaining
seatmates
hcahps
igaly
quadricep
drawbaugh
mispricings
teyba
safeware
altanta
weedons
quaterly
redecard
reconviction
thomasis
sedq
tuomanen
woojin
copule
drival
wegher
bionova
enormousness
willemsorde
tabesh
dreeben
misdescriptions
trubeck
rosenwaks
balmacaan
remeha
brodyaga
peppier
skapa
congueros
bayanat
kaiserstrasse
gatorback
beouf
crablike
disx
tapizar
corcept
divay
labégorce
airmailed
degression
rickrack
ummmmmm
qvarnstrom
templand
vegatable
dimitrouleas
alteris
gitlen
shopfitters
bipro
abufarha
smari
kenyetta
polygraphers
denuclearized
ballengee
steadicams
giuggioli
bleiweiss
iluh
kimat
schockenhoff
rotenstreich
visarts
unstrap
teenangels
kowalenko
rafacz
natual
subfractions
unblushingly
cobern
vittrup
margreiter
sannwald
netprice
surfy
qkl
ingénues
giammario
akekee
mujadidi
arounf
zfps
misfiling
micronetics
nazak
ozsoy
wattlebridge
haralambidis
juritz
ethienne
reynecke
cléac
altafaj
splinternet
izunaso
flitters
polticial
serapong
cawt
snaefellsnes
supermileage
waltemeyer
trammo
vanersborg
vadino
norrback
sjowall
momemtum
fadzai
miltants
shunra
fnirs
mezie
trüpel
snackers
sunchoke
recrafting
peijun
accomadate
cusden
mullhouse
xuming
baruffi
niggled
centrafrican
moufarrige
qunar
blavod
abudwaq
bodysnatching
kneafsey
pbfa
arthrogram
gubareva
ibarlucea
swiming
unfortuanately
unintelligence
israt
marenariello
alltami
amshold
selectorial
cerus
soligas
worsely
uwimana
biopat
buddenhagen
pietrowski
carmex
whizzbang
dinc
foxbury
melanine
terlo
supedi
panjshiri
broadcastable
agensys
ziobrowski
shoddiest
semass
educaiton
nataphon
splitboard
parring
ungracefully
rubacuori
nuclearelectrica
stigmatises
underutilizing
blacksite
meaby
hazelbank
acidini
hachamah
fraxel
kingsmarkham
sunporch
arsalai
diaster
murazzi
bedazzling
diming
gonchen
koroshetz
loirston
kongkiat
discplinary
muiruri
watergun
briber
motar
patick
tufino
beejays
decampment
froufrou
nasho
siasa
wyddgrug
sflg
hagoel
huuhtanen
pettier
gamawan
recordbreaking
wotcha
miniskirted
fidgeted
zemaj
fileman
wdowiak
counterplots
hoick
reaganesque
gaggles
celestae
multitiered
laurans
lumenis
aqmi
nvla
prosystem
welcare
dibiasio
desalinisation
pacificor
irens
hwyr
banjarsari
worksurface
eltek
gfcis
rhinoconjunctivitis
saracoglu
phoung
spraggins
mcmonigle
hewad
hongfang
holdbacks
briefel
richette
kalchbrenner
polygraphed
sauteeing
kaltschmidt
afelee
alfaraj
sreet
essenburg
eggland
coolhunter
fetchet
supersites
rindner
gouro
gruntal
haueter
meyrelles
kaplicky
dizzywood
vivaki
wipfli
monemvassia
alsumaria
khamisiyah
jiuhe
schlemko
loewith
rttnews
grassian
fileshare
cabangbang
glink
seafowl
hydrospheric
bougatsos
incentivises
handwrote
muhammaed
recraft
pibworth
umarin
waladi
schene
rabhan
idtvs
lauckner
selbin
tasktop
steuerle
otellos
sebire
poeni
raisor
reenrolled
feart
finkenbinder
jennerex
bergtheil
warblings
walcoff
chungtak
wftf
degremont
caeathro
galphay
zhila
pratter
morocca
tieas
dinker
penzeys
viastore
icontrol
fashingbauer
berelowitz
spiceball
singlar
silkbank
iacopelli
retrovir
swfr
mashiter
tratman
manandafy
laylaz
lidoderm
noorin
oback
nonstriking
anesthetised
holstering
supergrade
eafrd
indigenious
cardmember
pedre
pugel
androgynously
footbaths
miara
jhamar
iftf
hornall
indpendence
exantus
gainsays
unsaddled
farmhill
chalendar
ukrainskaya
matonoha
bagsy
yalikavak
tamnamore
keyssar
zvegintzov
smilovitz
cloddish
nonnatives
pötschke
raineth
ambisome
catheryn
helsper
stambecco
pupster
zinkia
powersite
apaolaza
constructionskills
mazurian
grasz
avitat
annell
simpletuition
ronaldos
counterpulsation
goldhay
businesss
dubson
omnovia
amazonencore
thermofisher
wonked
availity
cravers
poreless
gsam
mactec
quedamos
stressfulness
antonovs
csbg
molsa
pvos
nyag
lakhvir
rustlings
unbloodied
footbed
oudéa
aliadiere
sodefor
nextgov
spohrs
disgrees
tortise
deracination
foelsch
aircruise
contempories
tankie
adapid
fahimuddin
unidentifed
yobbish
bakhtyari
expresstoll
ebanking
tcab
ghettoising
enshrouds
yavala
missakian
snellin
ikenson
trifectas
khinshtein
tyen
comentator
flyballs
intergrating
satorius
taxand
metcalfs
slimiest
volodko
washash
anayat
corenblith
grobelaar
grandtop
traavik
hobijn
yuanwei
kobren
mushore
courtman
overprescribing
dhiyaa
mainey
tercica
oosterlinck
maplets
penans
culatra
motloung
mudiwa
nauseate
lohans
piatco
macheras
horsefall
esmerling
chuntering
happpen
baswell
leighnor
kalkidan
gulestan
paleokastritsa
vegliante
marrage
dejour
yionoulis
groveman
masachapa
cvcp
farking
undershooting
bockheim
rajasingam
snidey
korkman
backslides
goldsithney
demureness
frozan
lobotomise
schawbel
escapable
tetramisole
prebiotin
simandle
polizzotto
dqg
foamers
previdi
beltranena
ordnungspolitik
rooked
digicable
greensun
makhauri
ebex
amorousness
aldermore
crasser
swarner
dillihay
italee
datca
kirkwoods
dubendorf
norampac
copeley
caviars
rafte
thykier
normanbrook
cubicin
avoding
kharabadze
berjaoui
cerist
corsock
toussas
misleader
januaries
piquantly
wasman
ovidsp
kilali
demaster
holtgrave
eremian
frappuccinos
charcoaled
collywobbles
dorvil
stuthman
wingos
nafd
frysinger
consternated
propertyshark
jindals
sirulnick
asimow
khicks
geomicrobiologist
schuknecht
trøim
reyaz
rosiest
olivan
golarz
vivane
stachelberg
metatools
aeromech
retek
remirez
barbastelles
serhal
obamania
sprucewood
blfs
graythorp
universitites
univited
kyrgystan
battlenet
horizontales
rumaitha
maiers
rexite
greenbergs
ltach
lnas
rittenberry
rhji
sulafa
enox
immenent
weyerbacher
debilitates
trimel
snuffling
rocap
backscratchers
leinsterman
pricewaterhousecooper
misogynism
wisconson
dedmond
suyitno
dadge
mthalane
pinenuts
kuritzky
shanawaz
jaylyn
confe
rottenstone
stollar
debtline
gritta
pambianchi
groocock
bredberg
europeanising
llandyssul
capitamalls
ajack
habbah
asdale
presbyopic
swyddogion
allinges
kyleigh
carcache
stracher
pinoncelli
theleme
ngeno
messagero
moodey
cefin
gaddafis
salaway
gershfield
vical
doah
lionstone
alverado
nichollsia
satsair
hannstar
unconvential
sarchal
sommersdorf
disallowable
milovanovich
coloproctology
kumang
undercofler
israelsen
clubcards
soveriegnty
catain
prescotts
sitilides
misrepresenation
snuggies
titouan
irrevelent
brattiness
vickory
euroccp
bedf
coloradoans
luqu
pgcb
karimou
anipals
asats
giyas
laddingford
paleros
rukhi
smidgens
moncoutie
schnarf
apropros
leahurst
chindo
nobilia
redetermine
scanted
westry
defourny
jerrid
flatulate
feifdom
eightsome
alipac
buckly
kazani
dirkzwager
cummine
sexlessness
rutfs
vould
bifma
nazarali
pesis
hoock
paperstone
presilla
gommes
ornek
kroners
bredeson
lahya
sponginess
rahwa
aynte
kotsovolos
gamex
rietjens
mummifies
reoffering
incresing
semiautomatics
fiscuteanu
abrecht
ruffler
raiff
overcounts
luvvies
eeriest
mingxiang
pichaya
smushing
ozonoff
pelisek
cessar
vadakan
dinked
poggia
roros
busienss
grouchiness
sansabelt
upstretched
skvarla
werrick
dispiritingly
petrikin
suebu
koruk
jekabs
pensham
garritt
noncore
woloshin
creditcard
supertax
porousness
adipec
pstd
azotam
tiltons
ghadhban
frechon
handys
andriole
kocab
brainshark
kraushofer
unlimted
intouniversity
driulis
foucrault
junnier
weatherworn
ringfence
gribkowsky
tumer
decorex
underpant
aitzol
mobmov
lochburn
preferisco
travisty
mathad
dowm
privets
woolhandler
nooky
kuklina
boutall
khamkhoyev
bridgeable
buatta
secondees
tawafiq
saucily
debbane
bubbies
rapska
gemmells
tablespoonfuls
masbah
raddled
convinction
adelos
palenik
kesington
muniruzzaman
unibanka
sunchips
jasmijn
unmanageability
mlda
spivvy
surpless
equably
synaesthetes
thirtyish
discombobulation
nabq
petchabun
duthy
chlorox
leslea
polute
metaldyne
eerp
dailiness
dioskuria
abiam
aretakis
vreth
whae
guant
liasing
maydew
campatelli
qorey
nuview
carmindy
dristan
citrone
petersilia
meagerness
oilworkers
hemgesberg
bllack
podgers
amplats
ingly
feihe
wiemar
goolkasian
opfermann
bunja
autorisée
flatrate
kaddafi
ampitheatre
tranformed
kudrina
shukat
ochakovo
marcouch
chathrand
stosny
sekoba
inbs
rudgate
snobbiest
demoff
supermouse
tunnocks
jamuana
preannounced
hossien
stodir
guegan
owsinski
valicenti
aquaintances
muhanad
lazovic
mujaheed
anano
kosachev
wanny
kavcic
moede
breska
ettelbrick
coulds
pozon
szkotak
walderstown
delafon
nickolson
oepa
kalinic
flabbier
khoshjamal
sherland
lavance
fillup
dunkelberg
rokafella
pbpc
ccni
gabig
suwayrah
shmatikov
ruffly
testwork
valesco
sunblocks
douyon
growbag
gorkss
piredda
hooches
rycart
bathursts
seiffer
sharlip
germicides
chhiring
kurskis
capka
penyak
zdravkova
poitevent
mounga
leposavic
nachreiner
sumiden
mondre
underfund
haoyu
americhip
bucketed
alimera
litigous
enamour
tawian
indictor
heartstart
onguard
stetched
kennette
unhulled
vaccinates
disobliging
dostoevskian
bardrick
shihadeh
magodonga
replated
untersteiner
glasslike
bangadi
illinios
parging
dundale
libbertz
amsafe
meudwy
delyagin
easom
aboubaker
nasonex
hashani
radder
pomposities
rodny
kirchnerismo
barncastle
semdinli
hezza
inablility
solitoki
powere
narcotrafficker
pitboss
lazerson
muit
gottula
oinker
bestball
guzzles
teravision
rushdan
rubenfire
infrasource
klimkiewicz
cnmg
surprizingly
chinodya
remilitarizing
xiangang
reconsult
dorback
doft
southerlies
leinhart
bernardinis
obselidia
tchepalova
mashery
overseeded
labroue
ripsnorter
remcom
gunningham
crarae
filskov
bierschenk
usnik
piereson
jeetay
carryback
finneyi
hottopics
cyberhomes
smithline
fylingthorpe
rykner
terziu
argandab
yenikapi
nomaguchi
derecognise
borain
ocober
sparaco
trifari
baranwal
nocas
slobby
hommels
sliddery
jouwe
wenkui
chiennes
yachties
ingnorant
abermorddu
lekander
nourizad
diffe
bekamenga
rocabado
oguike
martrydom
sapic
affilate
mitayev
meindertsma
warduni
biotex
kritzler
vanairsdale
tegtmeyer
devasted
chopinesque
moistest
ulve
kouvelas
stringfellows
threets
zombifies
jeht
paslode
alasdhair
standardaero
empatic
maryss
threedimensional
playsuits
siguier
babbitty
godsakes
kayange
pandermalis
strautins
saharawis
technopak
conleys
oroweat
leggie
sesti
defago
hasids
gazzale
herculex
transcipt
ractliffe
millefeuille
habberjam
gantvoort
bilary
lankier
centech
zaynar
sanluis
iongh
tzofit
mussab
amreit
weslye
odelin
minstral
slopeside
soundcast
sorowitsch
parkmobile
ginormica
snortland
meritocracies
indecorously
retsinas
notchy
charmz
souix
hyberbole
drowing
reprice
decitions
saltie
xerion
mokara
whitecollar
chrysostomides
corcodilos
haniyah
subindexes
ezeagwula
zaneis
erbey
bullmer
beckets
bradely
pfcu
broodingly
skurygin
hourmadji
giade
yerkin
maddahi
kukmin
scbu
zorigt
niftiest
maeyens
leratong
wettstone
meanwhiles
hickses
efast
manservisi
mithoefer
robinowitz
czechoslavakia
styczynski
perreten
milltir
turbomentor
emrg
sharashidze
verrucas
straatjes
caladiums
idleaire
arboform
mpombo
gougère
tuszynski
sunaoka
swiryn
nonresponders
poisonville
baloga
biovest
reidl
jamalca
gameela
ngarua
hrpc
fsbi
chignons
varnadore
dashikis
cirstea
strobbe
reconcilation
firesuit
profert
landsite
stymying
clubbish
caceras
recision
acbp
darif
grooviness
biscotto
lkcm
securitise
skyped
avivit
potholders
ntaf
mesmerisingly
fime
sensibleness
mwaka
basata
leabrooks
inquira
mingkang
eskdaleside
klovstad
quanjie
obessed
raschker
pluckily
headwrap
choruss
hiratzka
viruslike
eathyn
gesm
nettlesworth
oestrogenic
honaman
shister
multinationally
aykley
irep
roasty
supplimental
martocci
homelier
chipkin
miamian
parenty
shipworkers
disinformed
ferosh
curtseying
compactdaq
uncertainity
vsla
napbc
sangqu
norks
lubenow
niesha
deterrance
suhar
tebbits
proell
mauras
succeeed
kuando
onky
brandade
overfarming
laurys
ipga
mcld
chernack
droser
naftiran
qaedat
scaap
cankles
netidentity
blcs
ueberlingen
sparkleberry
afforable
communitywide
endosurgery
schloter
jhung
recognisers
sunswept
pheby
amael
nutrimetics
purvin
shivaun
canetto
cylone
ultrapar
protoype
janerio
dragus
threadcount
giday
hammerly
lesleigh
zhikai
apalisok
forrell
howdens
vardag
propmaster
bikestation
optoma
jinsoo
garstein
damaseb
ballboys
ahanger
helstein
californa
sowash
dbouk
magati
anastassopoulos
suranyi
youngistan
stateliners
sonitus
thebault
simeonidis
maryton
birkhahn
prudentially
thigpenn
matkovsky
inflammables
dugatkin
treneer
tigresse
baluja
datatech
aubrayo
puréeing
catrachos
vallillo
verplancke
moreys
brickcon
unimagineable
hindell
fundementalist
incisionless
jichun
vanthan
gagfah
lindele
lazards
lashway
stepback
saintpaul
aleklett
dollet
acronymous
deschaine
cwmgwili
chapero
hazimeh
belachew
clearflow
armorlite
rakipi
fiduccia
picpa
videosurf
cyberteam
hemicycles
urspelerpes
shoulld
sukhotsky
ecet
mularski
newjack
solimene
abowitz
mowmacre
barnich
genego
gursewak
safieddine
probono
dadms
allested
marolla
perspired
middlesworth
ouderkirk
mohrbacher
sanidas
oseira
kinkeade
atgofion
sayette
suller
ribollita
mindiashvili
gratins
gdula
karolczak
czinger
proselytiser
nonconvertible
proganda
hospi
vishnevski
sluggerrr
kumiki
blendstock
busaidy
duchaufour
rebonding
mcgeehin
renkel
bodysurf
halfvarson
tesarz
mulkearns
alyami
ziswiler
lumpar
overweights
yewlands
deconcentrated
stichelton
boggins
warser
diffenderffer
stonelike
pillayan
vetheuil
dealbase
yinxing
militarising
zagaris
profligately
jgbs
ruales
timofejevas
miscoding
drinkaware
mambili
wmam
rowlestone
llez
halona
noninterventionist
levalbuterol
konashenkov
spikol
radionet
maneouvre
dataupia
niakhar
anarchically
dileu
nvds
adzick
masculinised
crankiest
rxamerica
knafel
renfew
testifiers
achacollo
amorphousness
natsheh
boscamp
depature
crookfur
diyer
milway
aichr
ristra
backchecking
ptown
prolith
hollywod
gamercize
balce
carnesky
gaffel
costell
suryakusuma
geeez
nielssen
myocet
beauvier
emtech
diffculty
dubbelman
shifflet
safestore
yapacani
marlenka
vectura
galymzhan
jupiterresearch
qinnan
tiggywinkle
regreen
gadish
saffan
ooip
mockable
whitehand
cesareans
greisman
glushak
djeljosevic
classie
numbly
dazheng
coughtrey
surmelis
belgiums
yway
caramelizes
lombarte
bundtzen
obongo
formigenes
navias
baaaaack
harrumphing
iovate
revazi
sangdrol
nonmusicians
hausers
daccord
magarotto
goeteborg
passported
unimmunized
wiedemer
smitka
derrières
gfoeller
exhibtion
squanderer
pongsapat
chemmedchem
cheekiest
djoghlaf
smalll
anthropedia
affronti
cepgl
policyarchive
gottliebsen
rapnik
trasti
bastiman
krautchan
ketts
continuted
klironomos
golembeski
kriesch
brogliatti
pepperoncini
livlihood
zinkevich
frohlick
perkus
heayweight
shanghua
culmone
atakol
mugals
stablized
cultureless
mazzari
honni
porrit
sawaneh
matchball
externalise
cerruto
lachelle
comsys
jalander
kolanda
whitehaugh
momentously
mismeasurement
rosilyn
mcbarnette
greasier
madziar
murugiah
gossy
medchi
systemising
kavira
travelcare
montine
hezbolla
rachline
karkos
golodryga
manzeck
nicar
qidi
curtsying
mcdougals
toniolatti
straght
leplae
shabbazz
oraquick
baghurst
nulo
goldbelt
figawi
shrauger
selmayr
wierdos
diagraming
yesilyurt
lexiscan
leglise
yoos
swindal
agnellis
vitabiotics
babygro
iteere
wecu
infolinks
optos
enflaming
adapation
madrilenians
eigenharp
rejuvenations
zeglin
wackily
gaggioli
corkage
obih
hillaker
vallerie
touchstar
ifty
reyhani
ruhulla
counterreaction
dmfcc
dussey
jeds
investco
tanginess
jannuzi
lehnhardt
richochet
entiled
ooking
drewal
feelgoods
spirts
debriefers
sinecatechins
unitisation
meesh
toireasa
primped
guilting
ephremidis
pooneh
backcourts
whiplashes
gulfsands
cutaş
jurcic
naiditch
kablooey
daftly
wahabist
humouredly
zhilian
kassaye
chapattis
makumbe
amené
tlhagale
chaparhar
snootiness
uncorseted
unwontedly
fewcott
efore
proscout
wsox
macherio
bozilovic
xtet
interogation
ebwy
toecaps
siobahn
datamentors
whippey
immitate
herszenhorn
uruzgani
lollypops
orumieh
methodone
belyaninov
bareth
highman
cinepop
tossiat
berghold
bmvss
topchi
pentrechwyth
bricky
trilogue
untenably
adwent
mofi
elixhauser
paitson
roués
lephone
vindec
abubakir
amidoamine
vantrease
mariellen
hyperviolent
pernando
gasifying
motney
tipperty
maxlife
dhillion
tursunbai
ameel
overreliant
souters
unsterilised
sarjono
sharrad
mckemy
ripkin
oaters
postconcussion
pynoos
casagranda
proclo
mitrice
ibeam
daila
aukett
crappers
hechts
blings
blashaw
kragnes
garrulousness
polderbaan
pollyannaish
ancestoral
ablondi
nkonyeni
mjos
hereabout
hulver
wainger
scaremonger
birdieing
idearc
chirkunov
beninson
ddce
restituyo
stambolic
boomlet
migranes
departees
ipanemas
sortun
baskas
stanching
stacho
loestrin
donnica
delevoye
sharespost
glaiel
harabin
chezelles
eurfyl
klitchko
nerin
ogorek
offman
dorber
downlisting
instictively
jacamo
haematologists
nmsdc
deanthony
deadheaded
zagala
foncia
hachikian
greenfort
musicmaking
spiessens
desertxpress
bilharzias
homoerotica
sunsail
khushtov
nazyr
pandorapedia
distinquished
fnbo
blaszko
dunnellen
kaissi
ffilm
urfan
rosbrook
caughman
courneya
franquelis
rightwinger
overseeding
endomorph
sgurrenergy
abudullah
howsare
onsat
constructech
nonsinging
alupo
lavachet
delcher
hauslaib
incovenient
suniya
savey
apesteguia
shiree
yoandris
helex
soliloquizing
polyphenolics
levoxyl
presstek
narcotraficantes
bestinvest
bolinhos
zoomsystems
comvest
serte
listel
kelaidis
kenderick
salel
carnebone
alchoholic
belwind
severenergia
glimse
heilicher
gartzen
brookover
tuvey
uninvented
pittam
iurato
underwritings
endometrin
trezona
woys
casassa
mindlab
annotative
lammel
krawczynski
submersing
themn
valcent
thinkuknow
singledom
taharka
possati
redplum
penayo
sardelis
farecast
subramaniyan
continuingly
tuilleadh
elhami
throughfare
nozhkin
oversaturating
smoriginas
predicatable
chooky
wynott
swiftboated
ameerul
valades
pouters
tujague
gfirst
visionless
camileon
rancatore
kaeson
gryaznoi
moelgg
avaition
stiwt
ferumoxytol
cudgelled
jeggle
cebt
cargurus
klausenpass
earings
kittredges
prorates
caonima
macvean
overcritical
patzold
mareya
fromageries
battut
dreena
somet
yisrayl
hamouly
chiego
craiginches
pléthore
shojin
bhith
waitering
sufferred
etouffee
felgtb
drra
ditalini
weaber
upwash
boutih
janikhel
schwieterman
tryanny
heeschen
kleivan
luxuriated
katzev
mauad
podding
jumeriah
rosetto
energey
amortizes
primobolan
ishiya
mcilhone
penalisation
gotopless
jounalists
kalmijn
azocar
kebbeh
artown
yabuno
hydrolized
misimpressions
catrow
roseisle
loarie
guzzled
chheang
spped
gecc
apah
photoscape
selody
machisu
wolesley
peblig
lssi
kagona
marziah
lajko
astonishments
counterdemonstrations
quillagua
wangmene
hakiwai
nipro
gardenweb
artmosphere
ashoori
makhubela
zhushu
islambad
subfunds
szonda
maňka
desolately
latests
litwinowicz
shamol
odiously
bufwack
trustworth
afmxa
whiddett
woomer
sauvey
lewitsky
ficara
perridge
passalaqua
sayedi
nyeshia
overreporting
stablised
weathernews
stabex
obamamania
kaeda
gynormous
jaemin
axcient
pmle
mammotome
qeada
pisf
altamed
stepovers
uscinski
defronzo
kéchichian
skeoge
acccused
windiness
gouldings
favourities
zhichun
guysville
woolfitt
rosani
oludeniz
fedscoop
radfar
hemrick
hooksiel
crissakes
guagliardo
allodin
leningrado
centamin
dhaifallah
paspa
abdulamir
patulea
microprudential
fredrix
colimon
elaws
avihai
krents
glassless
absolutisms
sharespace
gweld
buttriss
summmit
theramin
couldve
insideline
uygar
intriging
bahkshi
cldf
cbhf
calgreen
unrenewed
nonadjustable
micrsoft
fenstersheib
letna
middlecroft
schauland
begic
duralex
genara
lardaro
nanoengineered
ecchr
nonbiodegradable
bressman
korupensis
holmeside
dizziest
filppu
gaglio
gnashed
rudkovsky
authorizers
virgnia
lochfield
cemusa
beclouded
potshow
azizova
insectariums
unshadowed
panasci
lesane
micosoft
permeti
piecrust
blunderers
interestedly
stigmatism
rustier
mahida
deyon
jonikal
skovbo
beevis
ryotei
smolke
malast
ikechuku
decertifying
cyberharassment
sidefoot
sipio
underthrew
kemery
bigmore
yubamrung
pedwell
knoxy
speechmakers
mtop
ticketable
oldoinyo
chemjor
wamiq
schalfkogel
longenbaugh
rasmala
perkstreet
exosolar
sainbury
guradian
baladiat
kacou
norson
paaswell
annoucing
unguaranteed
mellqvist
lexon
filesoup
darnovsky
euronaval
quadbikes
rawod
fraenzi
mahaman
nwoga
assoication
citimortgage
musir
postelection
bloomie
ezzedin
mielles
oscarcast
donʼt
kickboards
nchabeleng
panickers
kenoy
chascomus
cricitism
ipodtouch
lifechat
riskily
dopy
tophoven
léonid
lebi
mtan
blcu
ortique
maxlinear
bamut
anlaysis
newva
bijagos
cholnoky
koksal
bircken
usbourne
barnfields
ddmi
ambergis
bittova
iberri
ambaye
retendering
schoenholtz
rawboned
burberrys
ghayoor
dongshen
dansaert
akhondzadeh
shuttlebus
pavlopetri
mentouri
bjornsdottir
cloistering
eykel
eemea
molnia
cinnamond
cesan
egington
firtina
datakhel
corgiville
yeatsian
kelberman
abdellahi
overbaked
eljvir
melonas
xpak
naqu
hoffbrand
ymwneud
volberding
debronkart
fondazioni
intollerant
zareer
rangier
melgren
recalibrates
dimunitive
kerimli
xhale
mhora
potbellies
seminerio
reclaimation
shyy
rerating
mortillaro
chenghu
parisel
mccreet
metje
chulov
bssf
cantrel
boehle
grogginess
eeds
atherothrombosis
guediawaye
depositos
krawchenko
truckfest
meringolo
dcsp
panarina
jsea
pijls
bocom
grwp
havillands
unleveraged
barakett
owener
smagula
tsod
phraselator
trevisanato
arkeia
garbino
balilty
bacalzo
degooyer
hosteria
sensitivies
gateses
pcar
hdma
zonegran
laabidi
tomisue
phemt
panhandled
groharing
klimentova
degauque
dafri
kazachkov
rushie
targetman
miljevic
generra
upscales
afix
wetli
domanic
mclinn
sherita
exhanges
gronvold
lamouche
anahtar
reinfecting
paedophiliac
bijoor
neivua
rubensteins
budgeter
wineglasses
moszczynski
admp
djavad
visitorial
hemingwayesque
nanosilver
oncothyreon
presumptiveness
carpeneto
everflex
blogospheric
dhyanapeetam
eschatologists
olivant
weatherunderground
ciggies
nandigna
borrman
lowriding
dostam
baltiska
lisped
hasira
tweeness
ghoulishness
immigrationist
brigyn
policians
framwork
bogdal
hammeri
screechers
drizin
shakeys
inabilty
kirksanton
goodfood
persbo
walletpop
salaberria
vermet
cadidates
hyperv
tarassenko
hawalas
drnc
vistes
tangly
wordscraper
abderrahime
sunlin
theodosopoulos
unsigning
schwartzenegger
steepbank
siebenaler
idjits
handwritting
provactive
underexposing
acabq
prewash
rimsza
goning
goldford
lomondside
davender
rijkman
jobseeking
zulfu
beerntsen
tsnas
underfire
sorgdrager
racecards
nattiv
sadriya
acquital
promperu
freeradical
guldimann
cogentrix
tzekos
babalawos
nosedives
tendar
mubaraks
hahahahahahahaha
spickernell
ipcom
ergneti
echan
daivd
camarota
casciato
qare
guentner
foxhollow
oswold
guodu
cmlp
buttree
timipre
resurgance
mulyasari
piehole
zinchuk
knobkerries
bupkes
assemi
dibens
dryman
chimneyed
masterlink
duddies
busemeyer
staceys
enouraged
agaporomorphus
dettre
toysrus
multiculturally
islamising
goldvarg
montalbini
elhaj
kyalo
kamlani
hysear
lietenant
galanz
troublespot
bardakjian
ditziness
verticalresponse
drakakis
sarcs
hirakubo
shainova
vizner
scintillo
stockebrand
mochomo
manclark
aborad
massolo
stimuvax
nirkh
cated
ksnd
slowpokes
aksh
plastinates
intothe
necarne
antiapartheid
moruti
uglified
podimata
mpshe
cfso
felisbret
mountainscapes
hadhramout
shtik
seedcorn
arizonia
winklers
somocurcio
lichtensteins
insuremytrip
dutchified
ropeik
bothfeld
tchp
teamsheets
nationalbanken
brussles
laugar
technologized
supperstone
corridore
stomaching
wuxiu
stidolph
comodity
flyspecking
hostmark
allensmore
rolfsrud
maridadi
vlsci
carrem
koumiss
responsbile
buctzotz
prasow
legimately
hersen
wibbles
podlike
pigeonroost
siwik
americone
dirndls
lameduck
qadoura
reoperations
caldrons
tolek
hubertz
poppadoms
tushishvili
walkaways
todoli
practi
prewashed
timebank
brainchildren
mauia
schankweiler
tunelessly
gavio
winkett
vilmorinii
decarbonise
pepperball
drywalls
aduku
careerwise
geldmacher
bollant
lirhus
wmpt
sattui
speechome
sunkuli
majestyes
popogrebsky
dworken
hafida
skrivanek
microdrones
baltrunas
khaldan
audrin
vanselow
deneroff
dadak
kascak
afbi
montecastillo
haraf
qtip
liveuniverse
isnora
maulidi
straightjackets
radilla
micropilot
remache
yongyue
mazeikiu
annouces
strangfeld
strategizes
vicorp
unrationed
massaglia
tbvi
ceramtec
eiris
potlines
laskoski
longwei
perdent
piriapolis
studenka
prolink
sovereignists
ganglands
rampike
esperion
intitially
schmuckler
beanfeast
schrenzel
nonelectronic
bamarni
conversative
peterkins
rejecters
staropromyslovsky
headiest
aquirre
lipodissolve
zahim
steelfab
gotze
lhotellerie
flns
staffrooms
faughey
outrush
purewire
georgaris
emoi
enchancing
artiss
chiaiano
beeden
reseacher
mceniry
spongey
cannoning
irrate
lickspittle
skory
innexus
lintuan
lazeric
capaign
braxis
quiffs
sarapiqui
brüner
pooterish
klimley
excelerate
mycolors
addex
rhydycar
staale
saklikent
posma
trupia
airington
idong
peerwani
veryfine
polyhydroxy
bvps
mosleys
tereu
famliy
technlogy
homestall
tzafrir
kosayodhin
barhopping
byoe
timberon
winzar
melazzi
flogos
plainclothesmen
shpl
litty
montalte
penningroth
carnivorism
barkeeps
themost
refroze
ultraclean
zwcad
bingai
daokui
capricans
tanjin
kharbash
griffee
botherers
phort
loscher
brombergs
schoolbuses
attacts
mooched
hillheads
monnat
kuźmiuk
anted
avish
vsevelod
topdown
pashminas
midichlorians
élitism
sonnega
bahukutumbi
assosa
gichuki
maleyev
yoco
petronis
splinterheads
sacristía
szejna
grolar
fladung
ersland
cksw
fellus
ceasfire
weened
satid
tanpinar
cianna
debaggio
potholder
martissant
cheesmond
cyberonics
floyde
pedophelia
leanord
naiive
stanched
ardura
minuites
datatec
schenkelberg
libberton
sidm
alspac
uncoachable
loiron
overprescription
skinput
cuzick
mangolte
jehani
mananger
menrad
pettiti
kharzeev
dealflow
ratagan
gambera
yellowbook
urofollitropin
preowned
earthend
saccacio
trentmann
comish
zhenglan
shtreimels
baylake
nyfix
sutherst
olima
vaquillas
ciorciari
boulygina
spallen
rhubarbs
forbearances
vaujour
descottes
daintith
capelet
maesydre
mackenzi
kabelis
homier
regaldo
planadas
primecare
cananda
treaments
brunnera
fehlhaber
defillo
khabaronline
aramnau
nnec
carrousels
francisley
steriods
pevehouse
dairakudakan
rafiullah
chrust
ncsli
microbanking
documentor
indabas
carsickness
bahran
weilded
schaben
jonney
dmes
sunnywood
superboat
rotisseries
stancl
malvey
mpay
norbolethone
daszak
kazillion
semitool
thehotel
primadonnas
photochromatic
varland
wulfman
penacilin
sharek
religionism
donielle
cingulated
dezzi
millmead
torbati
peiry
spaly
lyngdorf
resoled
tlbb
mitzner
interveiw
braodcast
fairpensions
pavees
kaschalk
boarland
sanfield
slendertone
khev
usablenet
occurrs
schlagenhauf
governemt
peepolykus
elctions
icesheets
haltime
tagliata
carrillos
ishkhans
dibadj
stiti
styres
mcelmurray
hacu
péchenard
libtard
rhydlewis
losinski
sudekum
cheatom
spinnato
phaup
manoguayabo
shaum
afssaps
bossaso
parziale
lobon
girolle
reau
destounis
sapristi
propagada
chodan
rashakai
curty
tangoed
batzer
calvanese
eremic
exiler
griel
closeminded
wikicrimes
agoraphobics
riddiough
sandalow
lobenstine
factorys
krutonog
khodari
wemi
tamelen
vanrell
kohestani
roddrick
allanna
thenm
shewsbury
deutschemark
nisenthal
bushara
kamchybek
ermira
primondo
layaways
mccally
dihle
opiod
biosurgery
fratzke
schnetter
gerrell
telphone
fractionators
nighclub
vardes
nirja
grebby
ilyse
houselights
maninger
rémoulade
wsii
petruzziello
chicherit
csssi
fabrico
greenpalm
gransmoor
tronconi
bouland
booys
sceince
cmtx
schellens
raydiance
sarpourenx
mavni
wilfie
defered
monosol
myogen
schaghen
obsenity
chakiwara
whirs
dlan
skyscout
onone
misrouting
pyworthy
rhônes
northeasters
prevatt
frigstad
seetoh
inwest
exces
deju
futhey
nimol
eliraz
derossett
woolsery
lacusovagus
manless
lbpd
emrouz
sherez
roadwarrior
exagerrating
bioflex
laysha
godzillas
waivered
massler
joseva
overpays
nitsana
purdis
ocklynge
overuled
fourpiece
profeet
federbush
headscratching
embarrassedly
protiviti
zhesi
uniko
fagiuoli
viccei
zagha
helpdesks
threadhead
flexibilization
fassold
looniest
invigilation
vaciago
kilkeary
channick
ismayl
geofence
stockhill
oakmark
osaze
pakkoku
estacao
whodat
kizilay
barovsky
igss
netcu
diptheria
mareer
haringay
firmount
renationalize
overeaten
anatomize
froglife
traipsed
interpretors
khetrapal
canl
pikelets
resplendently
gilyeat
zabib
endocrinal
obenhaus
bioprosthesis
gufeng
lowlier
deisher
multisector
baddock
ynni
blaenannerch
huitian
rocquaine
brodeck
jayanarayan
brigend
jilal
breakables
shkedy
phwoar
kravat
proverbio
jelleyman
chelokee
tojirakarn
goudin
qomo
earthcraft
agrc
shantsev
mangasaryan
suncream
faceful
besharov
chailert
amibitious
desensitise
oulmers
frothiness
lesers
pissoirs
trickel
servicement
swampier
drotske
vivon
seocnd
idjmg
aftewards
dravucz
caixacorp
spirtos
elizabete
pissaladière
drivelling
gharab
counterinsurgencies
necci
siefer
thahabi
qinsheng
haidan
kasteler
polititian
tipsword
penodol
choike
ancesters
kyabakura
wrep
htcia
lydmar
dykhoff
kabiller
icban
navjit
beccause
nixzaliz
unprosecutable
ceter
sipkins
nankabirwa
omfug
amoro
therapods
phildelphia
meltons
systec
telexed
unchlorinated
raquenel
wilderhill
carbombs
wieseman
sedgh
recolonising
olsiewski
braje
sollo
kiefaber
nerudova
simonsohn
powl
chipolatas
dorith
sleaves
platystele
twinsets
spaeder
mattle
raechelle
semsar
paciolan
mosqueta
dieperink
jingoists
antiglare
issifou
vastic
sweltered
shrills
colsey
hestness
doerries
motorino
iwbs
beyersdorf
ralliers
inweh
agendia
habinek
reinstitutes
dunmire
overstimulate
ambue
khannas
tultitlan
nightguard
csag
sleazeballs
gmtc
gansert
burakowski
beersbridge
maribou
valras
damscus
aairpass
bigtent
munichs
ahdyar
knudsens
annointing
hogeveen
bekbosunov
laquinta
amurdag
bannaby
abramorama
evanka
demeanours
scoraig
shway
tilkin
armagost
subconcussive
breakfield
giattino
sodhani
maccaull
nauseates
aspirationally
niloo
silverrock
muyambo
zoladex
derelictions
ithought
vfinance
uranishi
pinneys
zumobi
porogen
vincz
aaprp
playwin
fermon
rocketi
slotte
portovaya
gwraig
wuzheng
grŵp
quntar
cowcliffe
chldren
disabuses
downington
angmo
govx
reconvicted
buddeke
gyalzen
hicker
covenas
aromasin
shebab
stonewash
personratings
marcuson
bayti
miscalibrated
businessowners
leeburn
byalalu
marberger
wiergate
microbrewed
krystofer
quada
ajorlou
versveld
supatra
rachinel
genepax
cahillane
akinbola
cuende
tribewanted
boughn
maslovskiy
renshon
maidencombe
joguet
simoco
plyer
comisiynydd
drumrolls
agland
yermoshina
hypervirulent
willenken
kuryla
kellyton
shovkovsky
venzuela
bassell
xaltepec
unpegged
marzok
chutian
conservera
michaelwood
seeminly
warthan
protzmann
azrouel
trombitas
dremiel
intercomm
beedenbender
toeless
elishia
shipside
heleta
woollier
gotschall
taliafero
sannitz
glucometers
countryish
asimco
swinbank
bentovim
redward
barall
nazereth
litner
agrobacteria
whitcome
vircom
upender
golsan
corespondent
benguerra
geoge
forver
hounsome
worldfirst
botanicas
rmts
tyrannised
sunned
pelavin
niembro
lehmkuhle
multifetal
granddads
personalizable
carbonare
ganoush
jurijus
stultifyingly
crabcakes
bulok
acierno
papahanaumokuakea
bollingbrook
cortesio
katoey
mukhrovani
yanyun
bisol
lanxiang
losty
yeohlee
cardiotocograph
harnet
heryawan
rashidan
dazzlement
monkmoor
dyspraxic
gatkuoth
centage
feminem
sawl
ilyushins
skidoos
pizzetta
dostoyevskian
jamye
muehlbauer
lmvh
inergize
hmbana
boeh
virigina
goochie
tepetlán
wlcsp
ncsp
bodytalk
kretzulesco
khristine
electrity
gremillet
trefz
katasila
gulets
rohra
grigorjev
bedrails
kaliss
transf
amoralists
cannier
sigheh
lled
fujeirah
devestated
republician
coreg
samachablo
anyansi
biunno
briefers
sukkahs
accessportal
whitestrips
angawi
ruscetti
sheillah
bauerly
bunkrooms
awri
mocassins
lassegue
volpenhein
mungin
reitemeier
chukudu
kalcheim
greenestreet
iosono
elfstrom
backhauling
xpec
capial
bataoil
lebeauf
frothers
melloan
panzhinskiy
turtelboom
playfull
slathers
cugnon
exectutive
cctm
charasmatic
hajizade
tairia
labourist
izzatullah
successul
overleveraged
overley
philtjens
cerfontyne
ovx
mutahida
younghee
hoil
threatre
jizzax
prevalant
stauble
pilled
tpye
ververs
gerler
kjustendil
chavancy
antigravitational
zakinthos
remeasure
haugabook
yaftali
infotrieve
ladish
mediamark
jalynn
prutsman
walzes
valabik
pmfm
dawidiuk
vnaa
headguard
tennies
biart
facebooker
sanm
wackermann
shmotkin
btter
tweini
jersualem
partl
dressier
alkhalifa
rendevous
sophisicated
lorenzos
briskets
spinasse
fondants
tamarah
darknesse
whoppingly
brushings
caviling
jinxin
mcilmoyle
jadaa
stickings
iftaar
layas
pechacek
voropayev
pirb
zaio
jiangying
hakurk
oxtails
mournfulness
preferido
bicsi
stverak
coniker
makawa
rodopoli
nafpliotou
ablauf
zenima
dishrags
kriechbaum
chipiro
ascarrunz
lmar
liquin
arapoglou
subsonically
hairshirts
holevas
wiyono
scamsters
blackmount
ampim
spiffed
munaim
cyrte
orgell
monastary
zimbabawe
agressors
gpif
scuttler
trucost
arnestad
mosehle
savci
pricings
divoll
adigwe
drumintee
exercizing
villainize
khadidja
dulaymi
konsam
harperstudio
maccario
orlowska
linkout
visalam
kawaler
egstad
cagnon
kalameh
mouselike
glaab
margett
pancini
minins
commisioners
russoti
chyzh
kalvis
ngcc
thaut
layettes
muynak
shoulderblades
taribavirin
imidiwan
medvin
sunshields
wouuld
productize
cepl
zanupf
broked
zarghona
toptier
jeffrion
denckla
quandrangle
chiropracter
plaisier
bazzetta
nxtcomm
studenty
talkboard
quaak
martavius
petrosun
volpara
thavisouk
corogeanu
lighthart
europeanness
schenkar
collerton
monteiths
heppleston
elsina
pinstripers
borocz
lafetra
marmoreal
hotpsur
sarsembayev
erraji
neidstein
mahne
sendar
powfoot
chanthaly
shiyah
rayssac
hyperendemic
tschuggen
recapitalising
rendeiro
cimzia
drinsey
tavoletta
ilhota
dubchak
mcelman
hessy
ovais
abujihaad
deglaze
anghelache
sgobba
nonstory
intellectualising
oduoza
kissables
hensal
rewarm
woofter
lumpier
cheesily
pacier
deall
neuralstem
haplessness
caifornia
ghatan
zitan
apalachi
shustek
ofran
balakhani
adoyo
fallico
nanocenter
februay
yoicks
rosbifs
pcapa
rafaelov
bulgargaz
stollard
diebert
speedwalk
fawningly
jokily
tygard
vallings
librandi
kurrum
motorbikers
harded
halfsies
burgler
kovalyk
thombs
worktables
boogiemen
taikonauts
qalah
weseman
sadibou
unfriending
pennyslvania
fwice
fryett
giampilieri
behnsen
housely
jiazhi
ebok
janean
disabusing
mehdar
faultfinding
birdstrikes
soonr
tobola
langenhan
sumirago
nahirny
shorina
dobransky
tcdt
mcalpines
noninterest
leonardelli
macugen
ludwina
parveena
marberg
biofuelwatch
dombek
dustpans
accupressure
casee
nyia
hajem
pief
chubbiness
vífill
moistureloc
maroone
chiringuito
jetico
flippage
galmo
schuyff
reinstatment
envivio
demythologized
kadaria
onecat
howt
sensipar
sherpalo
meadowgate
nurdi
casac
rétromobile
mistiness
greensource
pirko
laforgia
hottle
crapy
trenchi
etxaburu
nagydij
toffel
woundedness
bdcp
buccini
outraising
bruesewitz
scacchorum
lovatelli
arteriotomy
tregonetha
stuewe
onother
depoliticised
senties
natthawut
kenndy
roozrokh
bodyattack
leeholme
washton
wedginald
mohebian
syden
ehmcke
protocluster
explorelearning
ghashghavi
carida
matussek
homeira
alsabah
mazyck
comissao
bassoff
hubdub
malamed
neophobic
zarnke
napack
swogger
mukonoweshuro
maxit
errosion
raveloson
lahaleeb
jericevich
gadgeteers
umraniye
rusenko
sotware
chevillot
kettels
santarchy
chitting
bindschedler
kamajor
shmooze
apparu
reissa
mureithi
ravas
sharaud
fajinmi
rsantiago
assload
salicyclic
kuchinoerabujima
kontraband
wysocky
goodmark
lawncare
kryvobok
leibig
guindulungan
expostulating
vahradian
unexhibited
holbeins
schoech
detectible
pamelor
fireams
zation
stitchings
spoilery
elegist
mohamadi
sacktor
boomslangs
zemlianichenko
premesis
thita
vinopal
lascola
protrader
legrice
demauro
wazzu
riebel
rottweiller
sauts
sabeg
servigistics
tomlinsons
kalkwerk
hininger
pluth
longdowns
subtone
angiodynamics
catastophic
favorables
caravanner
anip
wyshak
sulmasy
okuyan
kekhvi
carpiagne
lutropin
chruscinski
garell
thingummyjig
sereena
hammerings
tielve
amangalla
nellans
ungainliness
betro
consience
strenghtened
wernig
ehrlichs
physicial
xueyong
undecisive
britrail
kramim
chevely
sulistyowati
bibp
flurried
prapawadee
jaroch
kittaka
kamienski
pamon
bakol
mailstream
brif
battams
muttonchop
campingaz
sylwadau
kyndiah
rollenhagen
psaros
igglepiggle
critcized
memolli
flumotion
shoar
sousi
houngans
fpns
replaster
soulstress
osmek
antipodium
validis
lakhvinder
autospy
yogabugs
complainin
yanito
viperous
recapturetheglory
muney
mdtv
kemach
mismarked
humourlessness
pitigal
haileyesus
bolchover
turturice
demonination
albondigas
zinged
honn
ghuzlan
identigen
brdl
gatza
sipla
finacially
coudreaut
kubango
smartshops
tyrannobdella
khaiber
nondas
shasun
sankary
kipre
emamul
firststep
imaki
ringles
paukner
sorosky
separado
statelier
spunkiness
biocontrols
strahs
scintillometer
tches
cacchioli
mahnic
hytest
extradiction
nimoo
benfits
lelouche
masgouf
gangsterish
boltholes
musana
semaw
ddaw
floraholland
exemptive
primous
paiboon
reconrobotics
atttacks
frothier
pwajok
plusha
posties
penwithick
hochheiser
yongyoot
cuigezhuang
chisnell
portego
fretfulness
amvrakikos
integrilin
carrotmob
mutashar
burgat
lelieveld
emmorey
pseudophedrine
abdual
oilrigs
wzab
uptightness
aaaaaaaa
gorbey
bootlicking
spykes
asacol
monitorship
bionumbers
distaster
wesselius
argippo
brunners
mahdieh
sceptism
grotke
hostaged
thebritish
wieandt
perille
snowshoer
dedvukaj
reemploy
bribers
orfinger
orine
latchin
thisclose
bunf
armloads
endundo
ffransis
fettucine
backstabs
rasunda
rgus
lorillards
unmetabolised
yolanta
saffra
bartabas
hommos
sabresonic
qiyada
worldblu
harehill
prating
employeer
punditocracy
burqua
dahlvig
smeco
tikos
zippier
keroack
bonset
benmehidi
teeshirt
madunina
dsec
kazamias
midlem
closeting
qudra
lanzillotti
orthoaccel
lidvall
ehrsson
starkjohann
vezie
capoulas
ingolfur
seave
sifc
karenzi
ebison
callpod
kardava
pettily
periph
powerpac
thummalapally
hcry
weissglas
meridio
bowlines
jaksto
fulcra
ribnovo
topoff
machipanda
flytes
thirstiest
natchiappan
reffet
zarren
gwell
mozarella
naatha
towfighi
sofabed
tsuneoka
annelisa
supman
athers
globalizers
pirandellian
irené
prizen
zaafaraniyah
multicity
cornhusk
harestone
lhbs
verbraak
trialogues
borrus
ringford
davd
rawbers
czin
pilkadaris
soakings
tanzy
hilarous
mchann
hengda
egglike
multipiece
groinal
fragrantly
homeworking
komanoff
buzzmachine
graubuenden
nilab
gutermann
ruaro
metrick
yotei
paparazzis
bogusness
monba
ehenside
algebris
ultz
thumpingly
budhwa
gluteoplasty
alwara
fiscardo
stukalova
parizek
valenica
shoraka
beggaring
rattue
formidability
gergaji
wazungu
leostream
trasks
junious
cyberwarriors
irritatedly
delduca
degutis
shuftan
megaliter
hamiel
sshe
avalere
mcguinnes
laropiprant
wonderlich
berkowicz
slotsplads
innovene
nstein
maurauding
malesko
fundrasing
wardrobing
buzás
lobis
barerra
underprice
suson
mausner
sabaudin
cramant
gracer
chechyna
zarkin
janati
unmagical
contactus
kabari
chykie
wimped
inghilleri
quibdo
teachin
equiment
schmatta
faceguard
ceyla
rsrm
khojir
trendily
cambridgshire
sciencefest
bimont
steptoes
chipidea
fogy
naïr
holidaymaking
diess
trancendence
safeminds
sandaled
roarie
firstbrook
sunshiney
qorwk
niederungen
juanqinzhai
zentek
jamac
bjalcf
quenin
runouts
fetishise
molikpaq
kryolan
garrana
butterbaugh
jahic
muhajer
talibs
sexagenarians
disbelievingly
eminonu
dicator
kunzelman
merriness
abidingly
dottino
cantellano
commoditisation
coconspirator
jerath
hippyish
princeridge
imperturbably
doermann
ploghaus
buget
azher
níos
arboriculturalist
hpti
underweighted
wesabe
zakanitch
packrats
verismic
subscore
rehersal
odze
mckaughan
paprec
cristals
brownsell
surepayroll
itip
arym
deinstitutionalized
butah
uttr
pleged
studiosystems
sakanaka
dantewara
ciliv
geosteering
surapol
amnio
triyaningsih
jonal
zweben
yahuza
spigarelli
esophagi
apercu
grenson
depraving
nonsmall
makistos
suspision
breconridge
bookswim
abbassian
merigot
reconditions
towcar
gudal
condemend
kitabata
yoss
armao
mirrorlike
birthparent
lifestage
vlos
merriot
czepiel
muskin
boatpeople
nonchemical
velocci
cyfan
dranko
twitterfall
mckersie
dodders
liverail
sztorc
misruled
heidgen
youseph
slivovice
kosmopoulos
endorectal
humevale
ribbonlike
redtag
grefenstette
trucktown
colsten
snapnames
nochten
haithman
galeao
parites
sellotaped
rozett
cgaq
makaridze
omnipeace
realdolls
willinge
crewelwork
zitserman
streambox
silmi
zamzow
silguy
culbertsons
ministate
nutritiously
jakobshorn
governmant
apdp
cotis
thjat
digusting
ivascu
eperon
batalona
barishnikov
ofpra
interupts
rochlen
reissman
mazuryk
schnupp
certifica
menacker
rohdes
manifester
noncompeting
eujust
tranformation
schuchter
economistes
aripov
cathlyn
musaab
megary
jankovskis
husbandless
sppi
thulagi
escalopes
rooda
inadvisably
tabizel
unacclaimed
baathism
eclerx
purpled
improvemnt
nontariff
consciouness
moec
aboutorab
abousamra
liazid
rutlege
baupin
sarotte
televerde
southernness
epidem
kuwai
chharia
richieri
teeuwissen
bepotastine
bouhail
geldolf
wolynes
diadkova
pelerins
ceud
tcpi
jence
gonazalez
keanae
qualfied
pirouetted
maunderings
khachatourian
vaccariello
upscaler
vasilevskis
crasset
toensmeier
snugli
ollson
hpra
scld
icescape
myfi
cubicularis
biop
mdtf
basargin
batelle
brewley
nidri
viracept
sexbot
devictor
sirard
amenagement
dyches
kepak
diddymen
rabidity
mancin
tonnato
pressgrove
galliver
vixxen
vaalco
fumblings
khvichava
djodjo
adorability
mabrook
corebrand
ipel
kaelke
fuerzabruta
schoolwear
discernably
deutschemarks
boyarchuk
recellular
jubak
sogebank
micoperi
izbasa
haetzni
sahri
munkenbeck
megaband
koukkula
jonthan
bigpark
jagusch
overconcentration
petrecca
strempler
teint
kabando
confidance
radzicki
thicko
backflipped
triantaphyllides
atgwu
nonslip
ismailzai
hoves
resilence
nasto
insideradvantage
canori
devit
gearknob
irrefragable
ulpd
rigths
exercisetv
trackies
unassailed
strenghtening
shakhshir
visionmaster
padera
meachin
sivuqaq
merdian
hellevang
koetting
microlender
pinters
acquiror
manizha
naftidrofuryl
ledia
creakiness
fastco
zuckermans
mogmog
econohomes
dulvy
zaryn
lishchynska
paperwhites
socialthing
escorza
khamene
cajanek
geiszler
tonked
espring
leggitt
liaigre
kasetsiri
bragagnolo
crowdstar
samknows
treillage
emulative
ehlmann
steptext
besifloxacin
llinares
pafumi
kofmehl
macaulayite
disarmanent
sdvosb
sporanox
philanthrophy
multistrand
insourced
tropiquaria
adnexus
bobroff
bubl
nehst
passats
phuthuma
emmisions
expungements
rootball
chatlines
gcy
guariniello
helljesen
boultinghouse
remarket
pcrd
impaneling
jocelynn
bougourd
brownite
bacigal
germanier
indys
buveuse
uncoached
dhliwayo
undersexed
pautsch
daniluk
africanised
sindani
increas
buybuy
rikhvanova
christains
tamfourhill
waakye
homoine
aberhonddu
hurón
itʼs
bergsund
hallac
betj
iranophobia
yayale
smellier
gancheng
harbuck
gochar
floorpans
beverely
bamfo
openvibe
carnglas
presciption
progesterones
fasahat
tresspassing
sulkiness
groft
muuto
prescheduled
gigle
beatrisa
nanopatterning
naeng
jailable
eavey
enchaine
beyou
rawitz
disected
sennowe
diogelwch
terino
osna
netflights
caherlistrane
edpr
modric
fairfull
unroadworthy
kariamu
vermuelen
agcom
ballestros
esot
spiriva
wirtschafter
crisislink
breakins
zermeno
eurolat
merjos
cringey
machiavellis
obesogen
ngpl
scrra
weikang
tarriffs
soundoff
tavalaro
bedevilling
fragomen
lahudood
cipralex
sepich
koleston
olukemi
overcorrecting
slopey
bbam
stemagen
chiadzwa
marketised
pongsudhirak
unarranged
cintec
morgansen
ruez
eurobancshares
saadun
serviceably
feting
kellagher
overmighty
abergils
haemorrhaged
balzacian
hosue
glasslab
crespadoro
interivew
corsentino
mowasalat
sourest
oktapodi
galvanises
coaliton
compsych
uncollared
boutrous
asppa
hureira
banpro
karabus
cambr
gulvin
allrighty
coffeeheaven
sarene
inhi
lovenox
weigner
spethmann
pvsa
teraelectronvolt
mcop
chinse
eggstravaganza
warmish
redrobe
celebrites
dustmann
roeselii
marraffino
arking
heithaus
rosarie
autex
harvati
recessionista
brodnitz
datablog
rebooking
takeh
krema
galatica
ripston
cooklin
woollands
multistrategy
zubiate
concertacion
ambroeus
marsey
hireright
urazov
tackies
dazhalan
tweaky
muscley
cameronism
antiwhite
postform
mokgadi
schwarzenneger
croff
closests
misiaszek
thaumastos
wheeeeee
tanic
sempell
butkevicius
manisco
embyros
mellifluously
chiroubles
moerk
ramraja
sercel
malesan
caleton
hrer
arnaoutakis
cappucinos
counterpanes
sisha
evenett
surkhakhi
cermony
konskaya
hoitink
recarpeted
lmod
varqa
goleizovsky
dcsnet
carsberg
knackering
mohhamed
agitatedly
raybin
bupkus
ocensa
abyd
cpcn
avalanched
saril
kejun
hief
ravjaa
balkholme
adamiya
freedomnomics
shaunte
holiff
artworker
dunetz
naspe
unscrutinized
oleandra
unbuckles
takete
darlys
uniqlock
otbs
galluci
unsheath
schellhammer
freson
crossparty
mineseeker
unmolded
kadhimiyah
håkensmoen
pettiest
kiogora
swaid
arhuaca
tobold
lanxade
ghirga
rissole
xifaxan
silkiness
thermoplasty
getwellnetwork
chuanfu
vicoprofen
artworkers
sahlstrom
corrag
gengsheng
crocketford
alikozai
woooooo
tabankin
dipalermo
gildenhorn
concilliation
troughed
yolaine
emmins
exchequers
papalo
synplicity
timpsons
busniess
randlay
remitters
vivenne
knifings
haileselassie
vechile
ibrado
finanical
rigside
elasticised
brightwells
endrik
buttry
eyptian
foodmaxx
sopped
consumerland
nyweide
vlahides
teuben
mellizos
retrofest
amunition
farreaching
torbit
mellila
kuvaas
krabacher
eloxatin
hekkema
thanenthiran
mobiler
kymani
akqi
xingdou
illamasqua
newgard
zonolite
fendry
pakhomenko
tenschert
brusiloff
bombiviridis
trubey
muralee
almondine
vovkovinskiy
wauchob
maccalla
dessources
dinte
vicous
marole
lifesource
zhangazha
criticalblue
fatboys
jwad
dingzhi
bakoyanni
marjam
heimoff
roeca
kucharska
singsongy
hospitalise
kuersteiner
testrake
veyance
unkeepable
absoluetly
munire
draznin
amercians
bodypainted
pittilo
aiada
doskocil
fiftyish
fujayrah
inists
sweetcakes
adacher
casva
hexaflouride
thissara
tortorice
etchebarne
richels
hser
wolwedans
ception
seniat
menahi
joschi
bousses
karolides
paazab
gantly
regusci
mazarron
stainthorp
aguirres
safle
mondro
impossibilty
predecesors
trotts
pakpahan
sanaria
turmi
neibert
billpay
rittie
achabeti
ieci
breeziest
doueh
hypoxico
pzena
cannada
dcypher
jerrianne
barrenwort
pargneaux
linselles
moamer
grockit
weejuns
dudenhöffer
ghafor
crapness
rechecks
scruffily
krmc
diino
duravit
airbed
bidzos
timebends
biesk
flaux
skirsgill
andollo
miscikowski
arnika
miteb
alkmonton
museles
widevine
soetrisno
pittsinger
ermonela
mediasurface
yanguan
kotlyakov
transcantábrico
respon
disincentivise
yulis
duroville
htfc
dusika
ifec
larence
azimbek
dathorne
nbra
reedijk
polygot
wheelnut
boedihardjo
thiazolides
rideon
quatela
nondrug
soltow
sergeac
gropings
chippac
willockx
cagily
downswings
ruias
ticor
burnsong
alexin
cannucciari
godsday
wasch
okeowo
indistiguishable
beyonc
offiical
phadia
pisasale
healthspace
bachatas
taupes
embler
ridiculas
gaszynski
strbske
siyoung
odwaga
chimore
sedapal
betrixaban
kimenyi
semuels
gelbmann
sarsam
chigishev
jursidictions
savain
zylinski
suddely
diefenderfer
thecurrent
dejevsky
dlesk
shinnecocks
toowomba
tortiously
griffeath
scruffiness
morizio
inestimably
nellas
jittered
barkema
jetzer
kushler
aecio
picowatt
reevoo
slsp
nobbling
nciia
flosses
vascellaro
publis
salteñas
occupanther
vallandry
actigraph
fattiest
vasper
toorock
sbaraglini
sureno
zennstrom
pumalin
brumit
aschieri
grampp
laclair
noncarbonated
adminster
sarwal
noof
dalfen
agem
hogberg
caffera
hassanain
kubrickian
revolutionarily
stotlar
dqed
ehrenhauser
orderd
alonside
coproxamol
brotherish
eletronics
andalsnes
trsl
bjorgvin
frostier
bedazzle
cancelmi
simendinger
jamesetta
lewtas
squishier
ladny
pinenut
ricot
delectables
vasconez
tumultous
switaj
maltam
wisewindow
habip
aliviane
crimereports
promac
usarec
slinga
bernfield
weaponising
ltat
driveling
meienhofer
makus
constrasting
loprinzi
barneses
refigure
ipzs
quibblers
blauwet
narrowish
sooliman
fasps
ragoon
mbuzi
chappelear
swantee
workwithinwork
talkfest
tecco
lakiesha
bdrs
akoh
ashja
ofhis
visocchi
spudding
yousouf
mahvelous
janece
hewgley
pengana
cablemas
momentousness
theatricalized
jagh
disaffecting
accustions
toama
mothershead
abck
decarbo
clovenstone
clms
mutualised
agnelet
rattal
amoes
governates
camonetti
gargurevich
dimento
heith
karolewski
tfank
penpoll
franklen
energos
inderstand
scintillated
tarpy
tsujiura
railrider
succcess
greasestock
takanishi
shoemate
delemere
tavendale
implacability
hallbauer
lfec
carryon
dimished
riniker
gilbar
krikler
cinton
murkiest
krstajic
gerodimos
eastler
dubais
forlines
ygal
enfys
anagh
cresitello
honeydukes
linkohr
oliker
koczi
dreich
perik
folfiri
peaceloving
dellibovi
turbow
mabanga
goozner
meinshausen
agoumi
airwick
asssessment
maldanado
pachtman
broan
kmsa
basirat
dwtc
taioseach
filc
foreca
wahh
glencanisp
pepke
surenos
bizuneh
haemorrhoid
linkenholt
zongjin
latecoere
debenedetto
katzburg
ventilates
irgcn
avmed
shakshuka
weindruch
cappas
prescibed
overanalyzed
ceglarek
quian
apeh
hebblewhite
ploughmans
hammudi
petesch
kissas
yageo
albiev
simec
chiranuch
spielbergian
cappio
onismor
condolances
webdale
kucik
dangamvura
akec
somini
kallweit
fractionals
aminda
samoura
odce
shenergy
datanálisis
hagelauer
goldbergian
murdani
loewinger
riddens
connetion
yanadi
scrumworks
truckdrivers
bedazzlement
trilaterally
omwami
ergezen
loonier
goedhuis
yokokume
cinciripini
zacinto
bergeman
tribalization
breeanna
qizheng
clydes
oussekine
delrae
protectio
isayas
limbones
mightly
metastorm
obano
vanjoki
massify
fréchon
outling
shakili
wojtak
keffiyah
sunesis
upcounty
hugoson
loutzenhiser
mallins
dncc
oriard
churchley
targanta
pement
micrometastatic
umholtz
dekom
idalgo
geogia
floderus
aredia
taona
cicpc
realclearmarkets
sloaney
hulnick
genr
mshini
energyguide
blueant
poulicek
spokesong
kransco
gatefolds
unphotogenic
cockfighters
anastagi
stowells
thomashow
hartcourt
kazza
kebony
bogliolo
pornified
egpyt
samarskoye
lagani
khosrokhavar
inextricabilis
foliofn
merksamer
ploddy
pimkina
coutlangus
minikit
xopenex
herdes
umarzai
yetty
threathen
allaw
spyrus
hvlp
nvcjd
stigson
covault
larizadeh
nessinger
kiejman
morparia
baquer
timbit
renaissancere
rhaid
divorcés
nakarin
nonrealistic
wepco
humilation
tokoza
hottrix
pantsman
vithy
thobes
misdials
hualon
abdiwahid
xinpei
selari
solastalgia
rodeohouston
nonken
kudrik
electronuclear
comag
courtesty
petruschke
methar
kudina
unshackling
shriven
mahaley
soggier
paikan
leakiest
achrafiyeh
furtherwick
rishe
caringo
softcat
bourride
khudzhand
purità
perod
kurnev
newsbites
duraflame
reupholster
keeril
mintzlaff
isotoner
staycations
nekunam
tomazin
banaa
bradmanesque
daunts
vogelheim
beetge
gladedale
exceptionals
ssst
bermudans
keyhani
zwillenberg
flashguns
nbbs
paintballers
emmalee
aftershaves
gwybod
micta
krejcir
duhy
puscau
carcelle
canegrowers
techserve
tfets
snitty
deliberators
fatuousness
kumuka
skovsgaard
montgri
fishmore
icli
ansol
lendable
invesment
superproducer
saleslogix
whipkey
barraques
sprigged
santine
oedi
desensitising
antiviolence
arséne
drakeley
pugnaciousness
geleijnse
ijcic
arhaus
mwampembwa
basketfuls
fingerwork
adultry
pekgul
bukaty
delus
gpaa
tpct
papastavros
mcclam
underpriviledged
effient
levenston
ltpa
ihda
daypack
orbusneich
antidiabetics
malysz
grelier
shumer
nonsports
profligates
baibakova
bandelli
limberis
corelation
pantperthog
hockenos
jorrick
shieks
ghahraman
crashy
meckfessel
aweau
eservice
jomba
curvacious
overcapitalized
condroyer
mshda
novespace
ushar
ubcp
fampridine
yogli
teeples
crystina
kgoroge
fassbind
brookly
erteszek
echterhoff
bonefishing
sturmia
fenay
panathenian
momjian
mustansiriyah
antonick
aaslaug
bfei
elsbury
batheja
duperval
rozner
photgraphed
scharman
earnse
vosgerau
seventythree
notini
wedgy
rackswitch
ballyarnett
ganthier
cognetas
zacny
reciva
suljic
amfibus
orwoll
siggil
lareo
utahamerican
wardheer
yukky
cruiselines
treelined
hbsc
chitau
leitmotivs
defexpo
shoupe
zhengs
fotoweek
fastpoint
yingpu
dqd
douzeniers
amcol
gervich
edrych
preisendorfer
sharpcast
unopen
capparell
funduk
jobserf
vilebrequin
wvcm
cyberagent
policer
foroohar
multicountry
cheeking
simens
jaunary
sarbox
hcam
fnsea
munwha
karokhel
punchiness
derschau
shilda
peverly
velaglucerase
callau
khadhar
swyddfa
squirrelpox
horsepool
chesting
supramaniam
motshabi
sfta
dumbiedykes
peskowitz
bathmat
ballywillan
yugraneft
rattly
winterscheid
bybox
craftmatic
gambriel
alauya
vizplex
dekosky
supportors
kleptocrat
effusing
denamrk
restino
smajlovic
colarado
milligauss
reithmayer
theyt
spraints
portugual
xsite
guarentees
disario
lavrakas
pedrone
shiane
montazah
malakpour
bounciest
kaeslin
denudes
kilnacrott
pfaender
suraev
levigne
flaen
godfinger
chérèque
perfomers
ubbeston
marchinhas
kilinochi
chatanooga
genebach
cambricum
ovodda
disapperance
vervoordt
woodycrest
parsh
montefiores
badza
lagattuta
superthin
amaizing
deperately
vspc
roadtest
vtss
vieregge
dodard
dudzinski
vastani
castellito
tullygally
kinoki
karokhail
palsey
milipol
leafier
benbassat
demil
tolpeko
gilleard
mcosker
digitalchalk
razzetti
hyzaar
algesiras
valueoptions
tdameritrade
partenariats
nadhem
deitchman
merlone
ellaone
sottish
wehrwein
iflo
puritanically
seduccion
weepiness
polycotton
osuri
withdrawls
dreki
penasquito
laraby
bimetallics
windmueller
stilettoes
ahmud
kovachik
joepa
winkled
toursim
paressant
kozima
stupple
nestwatch
chirtoaca
madridistas
dlodlo
rachmanism
achos
flunkie
almaleki
jiau
adlakha
moderow
cosit
mattscherodt
lafraniere
azadian
gmpta
lemenager
resecure
tucsonans
schouwenberg
jabareen
altoumaimi
parraguirre
citysafe
rikleen
tersigni
lcdx
someonelse
immunotec
hueppe
kittila
skorykh
realini
prolia
perer
ehambe
hearbeat
yijinjing
zeitgeists
libow
yanny
stylewatch
corter
isaps
ukio
humayra
bosselman
groveled
speid
monjeza
direness
tuleev
aurp
cartizze
malayappan
blockheaded
multisymptom
hfas
cloque
sanderijn
gkpi
kaskeala
ichill
kysar
shrinivasan
bingemann
martevious
bivb
representitives
tajarin
doind
shuyong
bialkowski
kuryanov
portec
teitzel
tefap
mikaya
rabineau
laphil
junsai
everbridge
vapidly
huldahl
hinkles
reichbach
clattery
amadie
korisha
penkivel
neuhart
subseqent
arabise
theary
shamsudheen
menomune
aomar
nnal
ramassage
footytube
gayboy
lopex
mercadito
boostrom
funnymen
mcmurdie
glozier
waney
raiber
shambled
askk
opaques
localnet
thisthatandtother
mmpl
asnawi
ribeirinhos
heredad
janies
nuvuk
cheishvili
lohafex
lokeris
hanooti
nhanh
unitholder
rotozaza
tratner
webtech
wpy
amerifit
optitex
röcken
ojani
gallantree
mcbl
abdukadir
encouter
rathakrishnan
wangoi
pubilc
crafer
gylfe
tebutt
strugling
delrish
colomendy
negusie
ydri
hryvna
beautysleep
nxea
enfeebles
ogboru
rukwanzi
ghostworld
poernomo
cooncil
dramis
ovulates
vermund
kuyl
matemwe
nongreen
portovesme
psycopathic
volkwein
apsītis
onhollywood
warrantees
injunct
enrd
tokyoite
amerus
sieć
photoswitch
shdema
coalco
kroloff
crustastun
cullan
ctna
justifing
satawu
oleoylethanolamide
decriminalises
mehne
breat
borysik
labarda
colboc
llwyr
hillsmere
milbook
cvision
kdhe
bleeper
medmerry
genasense
oblimersen
homedics
baoanan
millest
unsoiled
nyias
rukshana
coastbound
twcn
procrit
crymble
cantankerousness
mattich
adiana
diloreto
fritolay
snifters
yately
whupped
ridouane
joycie
frontlist
timberwest
unstimulating
braslow
recalibrations
vitrola
thiopurines
underyling
dilnawaz
neurocase
signability
lighteners
schreiberg
tlachinollan
macin
phsyical
ritualizing
drawdy
newschaffer
opalach
sweady
nwec
struldbrugs
purply
yatooma
sexted
ffirth
kristic
utoyo
johnstonebridge
maringo
charnota
republicon
anesthetizes
pirla
chieming
compretta
westerback
flatish
coolmax
shamarr
bosfor
zwingle
judisch
ihamuotila
saitas
werren
nkadimeng
xinqiang
nativistic
ringbinder
idesign
raliegh
bosanko
hillan
misell
crepps
aracinovo
dankerode
helcio
guanliang
rockspring
felito
theede
welching
julong
ncin
maasbommel
nontextual
wharrie
gwyrdd
trumark
windups
kizs
snapvine
sillanpaa
defensics
striplings
selka
anastasijevic
redtops
imamovic
trendwatching
overtasked
garaicoa
edles
mycoupons
servicechannel
arbaiza
dishonesties
geolocators
urbanbaby
podowski
prausnitzii
cosying
apuc
triboelectrification
shinewater
electroshocks
happell
brittannia
icenogle
rsquo
machery
eicholz
chiropracty
tavey
stumblebum
mydicar
nonsecular
acuerdate
undule
muthumudalige
stinkiest
breglio
sebarenzi
woronzoff
scheana
supercomm
mexecutioner
jpatrickbedell
healthplans
tarica
larowe
comsumption
ultraportables
hijji
yrg
comodi
gmita
despouy
omot
ntsiki
smarmily
bolinaga
envirocab
llun
brimmeier
edctp
poufs
klackenberg
fdaaa
mickah
bryder
truckling
vatagin
alpargata
efficiacy
anstine
palistine
taxcut
lagrell
masaood
leevers
ospca
truvo
martinrea
ehhhh
shuval
classily
dishevelment
effeciently
fastbreaks
pairts
bleick
lokker
alkhatib
pulwarty
fedral
nyuki
unpredicatable
manguzi
thorsteinsdottir
piretti
bloodymindedness
horizontina
stucked
demandingly
doback
gimnastic
yuce
overfish
ulyatt
wowzers
campest
petray
prouser
upback
epratuzumab
greeners
francelino
wbrt
screwiness
gismervik
economizes
jaheel
alkhair
bcny
euthanising
edgwick
sciton
stuttle
veggetti
laydee
vrus
toters
xiexia
lauden
cunagin
portentously
galantuomini
vaclavik
nortenos
cueller
aramini
trown
polytec
gyar
lacoe
puffier
isafjordur
moistly
aberhafesp
chavarro
technolog
hurlow
oshins
nasbe
noncommunicative
apnoeic
temares
coachroof
guayakí
caramazza
alderden
chaderchi
allstetter
cosseting
mersley
balfanz
muskateers
folchi
racivir
hargon
fruiter
topseos
deantonio
shuzhong
buzzarté
lhergy
zinwa
miscalls
bagmet
salaams
saveock
locksets
rochom
preszler
tpus
schwankert
bvmw
teamlease
thosand
bekkay
kadakin
crushproof
restorick
ucap
kitabat
fohe
briso
temporise
dipuccio
souare
alhuda
tranched
ameris
jolean
chengue
kipred
dahdal
wmz
cohera
grotenhuis
impishness
herbalgram
vlassi
dubiotech
dayboat
midcycle
admax
uncarpeted
mobiclip
flushers
ohhhhhh
mascaraque
fargione
zongfu
sovreignty
calavia
mertinak
huzzahs
hardscapes
vohr
wonil
zestfully
semisubmersibles
planktos
bramhaputra
odimba
skulkers
gonxhe
elegua
blamers
yaer
strangerer
harasym
morcenx
eyg
illegitamate
exploitatively
tschannen
stuffier
poliwood
affifi
parlyament
huttary
simlab
raaum
ibandronate
nanofilm
nioxin
onegeology
nonsupervisory
egemonye
spotkick
wilmhurst
leslyn
lylia
bonier
holzhammer
edsinger
cowberries
ciolos
naysay
huijser
guleff
killenard
replating
cosabella
distorters
simerly
interiew
shoreh
barankin
marben
bartech
sunaryo
haygate
bluehybrid
saccoccia
bunnyland
dezcallar
plattin
sidewind
mogaka
multipanel
cushenberry
superbrat
sorillo
heckbert
yanying
audaciousness
tarabarov
rumormongers
cnbv
hotelplanner
nallamothu
jarillo
yuewei
earcups
raynaldo
oktoberfests
pentrebychan
timecards
sturminger
lochsie
stojanowski
nsmt
gastar
jovetic
bajilan
loafed
laffa
namrood
szpiner
marvet
seebrig
glenborrodale
bonvissuto
souchard
companionably
uryadova
insiderpages
applica
horaire
bastuerk
feoh
sellek
adakhan
decarr
kapatos
ketra
datasphere
abbasgholizadeh
poppema
behalves
salahat
falteisek
saksin
verticalnews
shockumentaries
peskoff
jabbarin
semmelhack
chewables
wearden
zimmerstrasse
mazdzer
schneiderhahn
pearlmutter
securitymetrics
funiture
ngodup
surabi
dangour
theier
natcho
pethokoukis
apichart
kuljanin
bendas
poltically
hardricourt
amanresorts
tabery
dynamex
piconewton
trogolo
dalsass
bilsborough
flantz
sotheara
kolaches
morgunbladid
decodeme
wiseacres
ouranoupolis
androsova
covali
cèpe
flightsuit
edmondus
sonosite
gawping
caergeiliog
kandlbauer
pettem
torotrak
myregistry
potholers
snoozed
fawza
superlawyer
haroutounian
needly
clavenna
benesova
christodolou
sweder
cancerbackup
torita
gazundering
chilren
bachchans
giddiest
embrassing
funez
spaunton
albenda
platais
icier
maamari
ntes
gillane
helitech
solerno
ddwy
postpubescent
versweyveld
effusiveness
raikabula
bigged
haddah
ishkanian
monicagate
weatherstrip
sicr
teleatlas
igrc
misspeaks
hartbreak
crownvetch
logroll
plauged
mashakada
firex
turneresque
rapaciously
yearout
forewarnings
kuratani
vraalsen
duartes
shebly
ﬁ
ecolodges
barthau
touze
crotonville
titstorm
mainshill
matrafi
devanei
hannick
reconciliate
geotec
norgle
sutel
sevene
direko
samalut
acclarent
cohabitees
nahawa
mugira
fingerhuth
beleiver
staffies
ghufron
infometrics
honerable
innovasjon
vaporetti
herdlicka
uncertaintly
athlinks
piccino
marbourg
abdramane
yitta
journeyers
easycar
clairoix
baldisserri
emblaze
suksan
sculleries
moayedi
thriftily
mataban
kulju
morlin
dunavan
seani
kuratomi
canniest
writhings
grandbaby
carthorses
ijet
tadier
rehabcare
czamanske
straussy
stretchmarks
underwhelms
etkins
chinitz
nvrs
shvitz
dougill
immunoadhesins
janneys
videocam
schoepke
triptik
mezuza
utvi
smartening
heuwer
spellbind
hafstrom
wislocki
neej
yangaroo
jessies
zigging
mhashu
farstone
kishkovsky
szorenyi
menochet
zuitube
kirkhams
garganelli
beztu
mcelholm
mmviii
innoculated
comissiong
fidelite
spherics
wellbeck
coremetrics
zuanic
gluttonously
wascals
booooo
caucases
péchiney
jadedness
malawai
magnetites
matchet
fogelsonger
regionalise
schave
tmst
coffen
hamisu
neilyoungi
texai
pixeled
petatlan
ghulan
proir
tarantinos
axsys
whomes
mwando
pianissimos
explotar
membrez
shiffler
jeruselem
opthamologist
juvic
quinnett
lowyck
bydd
sundareshwarar
cablinasian
prescripted
atttack
alsation
fallibly
cacutt
overprotect
dishabille
heavican
raiken
jeroboams
piccillo
hollowly
shiryaeva
uchannel
raile
tackeray
cyberthreat
bookrunners
demeurent
outfielding
matuzalem
forestweb
tolerx
migaud
pharmanex
polykoff
democ
calcuations
microcantilever
achfary
midwifed
drumbrae
wwxt
atsutoshi
arrugadas
butteries
apoliona
chivan
dogwalk
marivi
crescimanno
blacklegged
germanika
slingbacks
mukit
forcasts
nexar
ankunda
geigers
belohlávek
arkland
pedott
topcu
comercials
nelton
barrachnie
ifpte
culinarian
nundroo
brandied
trasviña
cuases
dpic
psychologising
topolanek
futureit
trokavec
urogynecologic
photoshoppers
shiia
perparim
cyclamineus
yadvinder
reawoken
propps
gottdiener
bexxar
weliweriya
departee
fldr
delubac
mewies
lisotta
furlined
drumskin
muntinglupa
egre
surive
celltrion
calato
houttuin
howdle
witkos
wjzw
euromax
tanezumab
damco
giornetti
biolife
matisses
frieds
stormily
maribavir
heptathlons
bluring
charitible
swallowable
bengdara
msdc
centuro
adetula
tolvaddon
akaretler
cakar
khiel
ryonbong
mcdarby
letiecq
honigsberg
puddifoot
afriq
zeif
ballotting
baldaro
educaton
sperian
tundergarth
responsibile
calpirg
souping
nutbags
chressanthis
soparrkar
penymynydd
quietening
bloviators
tarre
wardhouse
disunite
globalmedia
denerley
pomerai
ecrg
airaudo
nobriga
dispaly
hcrc
gradd
ndongou
adblocking
reaffirmations
contentnext
vainuku
stranocum
producton
blugirl
darbelnet
heidelbaugh
milborrow
farabow
yaros
papaflessia
vietnames
pownell
gijima
alcwyn
ozat
kuoy
baltschug
tegryn
mcneela
darpakhel
planalytics
jospe
wallmart
ofman
completism
breul
witlessly
goodmon
fasinating
collidge
grissini
booksmart
zikria
orjiakor
accuvant
hofshi
czlowiek
unroch
masculinisation
sunber
kimbisa
waggishly
adfer
southernism
dilallo
megatrade
cadas
priestlands
rosciano
honein
changelessness
arbora
ramattan
masseron
hironao
nø
ccbn
ausmin
atfer
kerven
brannoch
snogged
dexheimer
calamitously
brasell
cnla
unlet
cayennes
fdls
ezj
iniciative
bolnore
insurrecta
adickman
boultings
peripherique
marcavage
ndvf
falnama
manettino
opinionators
thinnish
mitrova
milarch
oirish
moraski
jrti
blaenafon
snatchings
souffrant
emmission
terenteva
balouchi
gentzkow
pamams
opprobium
rtdd
dwon
sweidan
mazzaschi
crowstepped
scramp
scroogenomics
natonal
woodlarks
jaabari
betacarotene
chepchugov
teborg
dehumanises
borgstedt
xianghong
ortrie
heatlh
bankcards
tynesha
bablock
andwele
stelara
unattainably
vansville
kloes
furlotti
tromps
cheslock
minxes
flyhalves
harnal
lasjan
preh
valizadeh
allars
boogyman
gwernymynydd
gimara
microfluidizer
hardily
saydia
bakkavor
ventenac
luncheonettes
coshquin
recoat
broadworks
valteri
satpol
sparest
reynié
meeja
disovered
bodgan
constitutents
opiods
rtpark
inchmarlo
opdebeeck
nonvenereal
orgiva
gunshops
revani
oracy
whoof
overdramatize
genwal
sasbout
gilberthorpe
weissenkirchen
shirtdress
monterio
llanfilo
tabreed
baczkiewicz
helmly
yushau
dolmio
konkatsu
schtum
freckly
kubzansky
inverarnan
slomer
volkening
thabani
itsma
mignardi
sentate
hansala
frenki
blumenthals
xevo
tasat
pawky
pgmol
kieny
felcman
visualforce
allgaeu
virgis
drasek
dampha
fairoak
ysios
mistele
forsight
balcons
roitstein
inuendos
nicoe
godineaux
bonnerichthys
bezner
galuba
worldfamous
emiew
decarbonised
cleron
whuh
ladman
mroue
tsirbas
sustiva
clarinex
attemsi
marinhos
carrageen
machulis
meea
mulongoti
giradi
zayim
whirred
kibbutzes
aquascapes
ayap
chapparal
schnuelle
bananana
puckrin
alfridi
bysouth
waterwell
lushes
overinvested
kowtows
detangle
usbg
congolose
risbury
pdpt
ffom
pehub
smuck
nfsp
simring
keyamo
omoyele
dabinderjit
hamparian
pixillated
blubbing
spaetzle
sönksen
wilcomes
disabilty
pamart
correx
glamorisation
zurnal
ncaf
rauterkus
prognosed
lenett
lenkei
karrasch
verfuerth
delcath
rehad
segerstrale
distrubed
puamau
webslices
boedker
ahogada
mershin
elmaghraby
copling
paneque
wowtv
purnhagen
coplink
cagin
hellotxt
malaren
dobrovic
glasgows
duann
nylag
philhower
catcalled
pinapple
surburb
kopetsky
hypos
teenhood
backordered
kokoi
iapv
khastoo
kvinta
stageworthy
armanis
kurbos
benchlike
tyhypko
xiaochao
filches
ngakoue
banyjima
linkups
gliadel
ealim
tschorn
mudzingwa
amberry
rissing
jozic
gemmayze
verenda
politcians
enviromentalist
meleady
habarugira
desfosses
biosource
srob
slanket
garazh
galafassi
nollette
dasanayake
deitzler
yewen
nster
alcr
laluz
talgarreg
restfulness
nmhh
immunodiagnostics
befogged
samray
zanchini
charniele
dbes
taimour
llok
omgeo
salava
kuriansky
oldington
ledue
flunisolide
msim
zarudneva
balir
aaric
varvasaina
uncapturable
dakake
beatlemaniac
dhusa
stingingly
blankies
khaddafi
hadman
vendormate
elasmar
sgpt
byamba
luvians
leejohn
marcoci
playfighting
kabballah
carmaking
mcnorton
khankhel
collateralisation
frenck
wallbox
glimmered
puligal
mersiades
billionairess
friskier
branstrom
horndogs
paulann
nunnelly
katsenelson
kulchy
abdirisaq
grislier
mylie
hodac
buschow
schnozzle
goldia
tolchuck
diease
zometa
echosign
tindyebwa
lahovnik
chamil
cefx
vahidnia
dlask
bandarin
rhani
meningosepticum
heilberg
ecclectic
presidnet
scalici
dentistas
pouffy
lavinge
staysafe
kossangue
gorblimey
furriness
gourgeon
weeren
discimination
labordi
execrably
ogunjobi
disater
sceney
pichan
superviser
vekic
fraility
reslizumab
jernvall
hillo
pluckiest
ireports
rowinsky
mkvi
meglomaniac
giradeau
laicize
watina
cohibas
iecee
pulmonx
fintor
echazu
jugulars
laygo
eght
bioservices
cornisha
zondas
greycrook
franzetta
syndric
espeed
unclassy
chadsmoor
sabril
raffinee
impingements
smartsearch
shipmans
mbithi
chunquan
mililtary
toremar
obamacons
furanones
vayrynen
rizaj
liquescent
expessed
totvs
hellmans
saifulislam
srbska
edfs
attcked
carbfix
migereko
hillstead
wyhs
spoofers
hosepipes
ikhwanis
policical
yongon
stoneyhill
charwood
gittus
valvulopathy
dedring
dustiest
jankovec
colcrys
unsoaked
longyis
mikhalkin
mndp
digney
splashpower
homeaid
santano
berragan
agnieska
seanez
azcueta
watanagase
planemakers
frontwomen
xingwei
freebrough
swieringa
lonning
rhinogs
michni
gillece
toppenberg
chainsmoking
chattily
currenciesdirect
seasonless
elwynn
singlemost
barazza
wakodo
demaci
keyani
muntadar
plimsoles
hembrook
gruenenfelder
ninewah
portege
interopnet
binikos
globic
papco
wernyol
clydesider
klish
ironpants
ultradense
wanded
ikena
kéréon
pesznecker
waycott
comras
seveali
jerrald
mcconneloug
nopporn
speding
arduthie
adirus
kopping
martiza
oncy
rowlson
hamhanded
lenchwick
ngsa
panoff
attebury
capozziello
sarvananthan
malkovitch
dalís
michaeljohn
imrg
skrtl
wurmbii
ziaulhaq
saposnik
silecchia
softic
shopworker
gengnian
skycouch
bovett
treasurable
industrializes
chicozapote
bromidic
nervewracking
rococco
tasir
adjagas
gasparoni
homewear
beanballs
chiense
inkaterra
veganic
kuppermann
suspensefully
badrah
trémolat
decling
doomster
hursts
cigarrettes
altitudinous
inconvienient
levra
spokeman
yushenko
aquapalooza
unluckly
repsonsibility
magrittes
edtp
neckbands
waitng
grimmel
kapris
triscuits
anaemically
feinsmith
montuschi
molndal
anastassiades
woring
anabl
rebublican
ehrenthal
likel
grunenthal
couthy
tnav
ontarget
upperline
shucksmith
bouverot
arfeuille
stefanello
ffdm
minimills
cnsv
hellowell
hvms
naug
mapjack
munizi
agenices
nonnarrative
saviotti
aleyna
letterbreen
swiftie
transcosmos
mcaleavey
mamand
baetge
carolos
downhiller
streetlinks
cediranib
sooka
oatt
kartoum
idacorp
macchiatos
xonacatlan
boylepoker
lavorante
coutino
bahrampour
wearings
gnomen
chironis
rauzi
ostomates
cieneguillas
merchdirect
diguido
yuzheng
bullocking
ginevan
postprimary
soyun
junsho
misconnected
anleu
thundr
tortilleria
beache
sunhats
hardouvelis

souz
avtr
excutive
stenquist
ruchbah
schlip
sviblova
qureish
helil
osorto
yesodey
galvins
hypercore
craffonara
karbouli
sebrina
lolololol
donnachadh
romatet
mdvip
moderateness
edidin
facebookers
thinprep
royere
koping
diasaster
berdymukhammedov
nicoson
hellfires
essaioi
unctuously
meditor
siamwalla
pourous
sordillo
gcla
aerogarden
sexts
regrind
gidel
knolton
ketchmark
kucova
redcarpet
marwoto
zahary
siroty
pwsa
zwally
ozden
kjellander
elctricity
dunifer
rheos
kowaljow
policitians
ttsi
yolie
cusanero
stanislavskian
gurthrö
provent
brownsberg
dehavenon
stephanopolous
brookgate
baribault
hecklinski
qaneh
minivehicle
spiegelstraat
lenette
zaninovich
bacharan
expan
bodybags
garrida
divebombing
trichelle
jamshad
jankins
smailovic
clevage
wvcs
visisted
becomin
demine
pentrehafod
rahimian
nassri
toosh
kichler
bratke
smythers
masoomi
poborsky
sítv
honny
pottered
ishac
cloggy
uchytil
umitaka
kyohwaso
mmboe
kimjang
quianna
tufft
keenyn
federalise
touchtronic
prouzel
terorists
stoelwinder
magdziarz
proliance
deloittes
tecchannel
poddars
socialst
melafind
wathne
mkiva
yoggie
killstreaks
whipsaws
puoy
eopa
cloudcomputing
faultlessness
managerialist
nuvia
magerman
eivers
overpromotion
yanbaev
cupas
buczak
datafinity
sebban
meretsky
mutawassit
centertel
tekah
voase
lafico
gourgel
pashton
xinhau
schaepe
jaquan
updegrave
parttime
mpeta
attactive
inconcert
abgr
pomahac
sedenquist
hypocrasy
baniata
trussle
axiant
catchily
brovik
phama
pleet
siphokazi
mughniya
mesz
ctsb
disloyally
zajaczkowski
environemental
traeg
pinnies
verc
kryostega
mjsbigblog
itinerans
jaggs
abelino
icpw
surfas
mccullar
hyperdynamics
bobier
dialoguer
dodgiest
chritians
bgca
iavoloha
langleywood
krtk
tyrka
mccanne
mammels
saccharolyticus
kameen
chisley
awyr
versfelt
nyone
sapropterin
pureplay
mosaka
divertingly
basejumping
khilanmarg
dankgesang
qalyan
scheucher
delahooke
anmyeon
lyutenitsa
madyun
hrubes
parsely
meimou
kozhemyaka
shuchat
industrail
oxfeld
ncbw
gulalai
schierhuber
wriglesworth
bregazzi
wisnefski
zieser
ynon
sexpots
hyperdunk
manbert
chitengo
ptcb
ahlering
zagaja
serotsky
tanatside
abassan
tulshiram
banchetta
stuelpnagel
inurnments
envestnet
ysern
selloffs
fretlight
roitz
stelma
xinmiao
noteholder
jacomini
dimicco
recidivistic
montengro
sakihito
spaceview
almasmari
mavy
purnick
kakule
shaffrey
parches
superstein
helsinn
lovobalavu
lotensin
nchv
jailyard
offsping
sarler
causeyside
fazullah
skylounge
coryells
demoralises
shubart
raisiny
kutelia
feedmill
tromping
pressue
alecha
cakelove
pancrelipase
pirozzolo
ehrnfelt
runningen
falteringly
azoz
metselaar
hollee
activits
zannie
boltman
aryashahr
proventil
lonsway
rvrc
zhongce
preauthorization
boikarabelo
marazul
khalisadar
schlievert
leasebacks
xueping
mchendry
butit
canfa
oplinger
nevrkla
hamdaniyah
seghatchian
jermantown
menick
maštálka
buydown
analysists
oxbo
szabos
kerusch
colleluori
raincity
konchalski
slobodzianek
imagesat
crossbanding
ivania
yaghoob
meain
selecao
toked
kirstens
savvidi
kaywin
odidi
remanufacturer
ibhs
garacad
vaijanti
nellor
dyax
nzsx
souflias
fishworks
ponten
emcore
luckraft
braidholm
lipsen
bartecko
batterings
mohallim
amoke
weyel
eees
xomba
erhman
malenky
lavieri
bizarrerie
lancz
partnerless
ances
needlesticks
ansfield
reciepts
emeter
eviler
busalacchi
jeukendrup
alcarez
erdbrink
vegeterian
moamar
gptw
mauvernay
slipchuk
kaival
prayerfulness
carlozzi
bintel
kukic
hasecic
hindujas
kleptocracies
uptrends
seaorbiter
dettling
brigalia
repella
kipros
talmidge
cluness
billʼs
haddows
musicpass
soetikno
kornitzer
pittburgh
impactive
mironyuk
goteberg
attenboroughs
pescaíto
aholes
bremzen
tpma
scordo
hepatologists
econned
efejuku
structurer
supporta
urssaf
ddyfi
grudi
onlin
shoplocal
osvath
maheson
vermeesch
swotting
lastrella
muwonge
couriard
iproduct
straddler
neuromedicine
bezard
marchiony
miceage
dosmukhamedov
transmyocardial
barbequing
biemer
markbygden
institutionalises
avmf
fizziness
untether
hotpicks
baldiris
corperations
carollton
pasteurising
tsking
civial
ringstrom
orangette
gallicas
cocucci
ultraswim
jvania
mustert
sixapart
volkwagen
viravan
moiree
amtower
dpuc
schwirtz
randeree
lacole
iwish
pellacani
oddicombe
merryweathers
cighid
overdrinking
cdha
serrailler
abdelbasit
ferrah
devonna
wayto
bejach
heinken
bmxs
miligrams
vivagel
mson
cynnwys
poltroons
farmaceutica
inimicable
muawia
bristleworms
tatsoi
gegenschatz
quipster
pnemonia
pettry
pareos
philipshill
truculently
xius
bloomsberries
handbasin
keehner
gammick
poutch
hourig
cheptai
armelie
rogering
credem
isailovic
lealamanua
quamut
przedmiescie
carmoisine
harriger
mediametrie
kastrinos
aulbach
zumra
criscio
oversweet
mcnenny
sheshinski
shasteen
rbts
collegeview
carveout
towerblocks
genuflected
godfilms
robsons
toolmarks
tomlan
schieman
glencarron
reagonomics
collagists
kennes
uncomplicatedly
thongkongtoon
salesclerks
decimos
mircosoft
overtrading
sabaawi
syrahs
concupiscent
adede
acidless
robet
parquette
septmeber
akrum
datebooks
paradors
arauquita
bucossi
comdemned
ephemeres
dachelet
queiró
telcagepant
goddaughters
rodgin
mentaly
stepneys
laczko
commoditizing
oversaturate
choriogonadotropin
shtum
taubate
nonregulated
kassid
bekking
knutti
pickable
vaubaillon
kardasz
redbulls
tabacón
leathering
ironfire
whipsawing
inmans
tweezerman
bcbgmaxazria
fishapod
bruha
ojdanic
urell
chibaya
mursaleen
tumpach
opunohu
prizel
scrace
poliform
duncow
retrogenes
giovannucci
catnaps
schratz
endplayed
taigs
iniparib
gereida
ecoupled
giantkillers
esmon
harway
charicature
kovell
fidaa
roadworkers
aaphp
whitelees
collapso
yirrell
llywd
manging
reminescent
qatami
bussinger
tomskneft
undercharging
incumbants
comanies
etyen
huayong
knobkerrie
puedpong
canalettos
shiastan
levittowners
pories
dechane
toffeemen
poeu
baleegh
traficking
mmwave
buerck
muré
nonpoisonous
meddyg
palisi
sinuplasty
bluffness
boumelha
virent
hatefest
kavishe
mostaghim
markiet
carelink
machie
overbright
johnann
coralling
octabromodiphenyl
untaxable
gyegu
responcible
herita
exhibiton
ramchurn
mazyek
beuracracy
humanises
geither
hyfryd
meatiness
vetrovec
gruppioni
edof
uloth
rampersaud
knockbacks
overmastered
pearton
boysenberries
resole
adamkhel
officiale
morestead
pishtacos
overproduces
sacrilegiously
reciprical
logorrheic
cohabitations
iannacone
ppan
nutrisse
waicu
curfewed
onanist
aboudihaj
scratchmann
glubok
mossbay
rialtas
garibotto
dalcetrapib
zwak
taliking
cantens
zizola
mntx
millimole
owamagbe
trentside
jackalberry
tacambaro
prudishly
muraca
bilefsky
consulations
xway
philanthrocapitalism
neverlost
gellideg
baverez
debens
lifto
siddharam
nasariyah
maasz
nezamuddin
apotheoses
mwangura
papura
saymeh
kufrah
lodrick
itronix
moldowan
lefotu
meglomaniacal
mohtaj
ptcda
aghassi
vanderbuilt
dagunduro
stason
pacn
misconnections
tydtwd
turnto
geob
reassortant
ryhurst
nelso
amortizations
nktr
aousc
rizai
freris
prochymal
outsted
drelich
turocy
vourloumis
farchnad
gubernick
chaping
sadker
penfed
collegeamerica
macdorman
simsch
starbucking
maryfran
mohanram
tooher
cotliar
mccrann
lvhs
triaminic
vbvoice
copehagen
midgrade
shawni
birkelbach
geoapi
coifed
disapearance
pressie
hulhumale
unsown
bilks
powerlessly
acuras
hamchetou
budburst
ukic
wetherhold
keinon
minues
congressdaily
goolding
bosasso
lafeuille
propertyfinder
ncib
pelem
testimoney
naswa
microsavings
bondioli
nephros
anbaris
ecycle
telocation
lawniczak
fanball
ummersen
warmists
faveri
irrestible
popkey
solodyn
entrail
stagebound
stylefeeder
placke
rozerem
doomsaying
kwangba
bpom
highchairs
corteges
punnishment
reihill
pedofile
steeber
necastro
britnee
muzhakhoyeva
winkey
babyboomers
proscia
bibw
derrieres
zandanshatar
boecke
stybel
wintek
websurfers
inury
schornagel
hiree
thwap
unpunishable
islamicism
sniffly
bontan
tryscoring
penisa
ekaitz
strihavka
detatchment
febreeze
gonnot
cchq
wilcha
paraag
rhianne
ensha
fanok
darious
hachigian
schoose
anticollision
poolton
seekin
multimillions
taeye
melmed
whiplashed
seegars
thonhofer
rakieten
blueworks
umsted
istreamplanet
disintermediating
kezman
abdirahim
olhao
braciola
leathard
satnavs
hackhurst
polyphenon
responcibility
boinking
stooging
standardbearer
soussou
lajdziak
tippex
athome
persisters
iwerddon
creachadoir
wildstone
citlalli
weissflog
mexcio
staford
hasanovic
sundem
ythe
perfidies
pielenhofen
chrz
clucked
hensrud
ishum
akgun
shuneh
indeginous
gapkids
twinstead
callater
microcephalics
mipdoc
hungriness
tsco
rerp
vmag
overperforming
epsco
acoem
auditel
marinza
adeley
gallichio
farci
kalanke
shakiri
koblik
apointment
galgael
lacelike
ivonete
girjet
uffner
photofiltre
taware
petville
jethrow
reliberation
kalvarisky
xixin
worki
staffline
pgmob
balilo
overdesign
shailagh
keissler
alminova
mussayab
vtms
shilled
sulock
theanyspacewhatever
mourby
libé
sonterra
corefirst
xoie
gilgore
innovata
torbinsky
deinstitutionalize
capolingua
randels
bowster
lashinsky
pikulski
roushill
douched
ithin
visanet
ballarò
groesfaen
kameaim
widmeyer
protrays
ellabell
cetnik
buidlings
aigcp
gidani
cabinent
mailshots
shoptime
deyton
swankiest
rocknoceros
glamorises
vaunts
jacobses
lamplit
jeacocks
mediakit
rondstadt
paquay
digipass
frisselle
satchu
vervotte
snowblindness
irfe
arbitraged
slingy
superpremium
heisbourg
coziest
frienemy
bioforce
ogryzko
yasawas
witterings
knobkerry
wouldd
senie
sebirumbi
larynxes
rescreening
sadeghieh
buscall
qualstar
uncollateralized
dyudya
misfield
intot
madrids
rongkun
zepf
oppertunities
bierma
plocnik
stinke
cushiony
boxler
kayitesi
gullable
correctitude
towungana
michcon
lösche
postlaunch
fxall
abdominally
bloemraad
rebeuh
rebwar
medla
akwesi
califronia
recogniseable
morganchase
ritsaert
grinney
trussells
rafiqa
revvy
koonings
teclas
shelbourn
juanicó
philipina
zdanowski
darsley
chunxia
kabei
andriol
tryers
dfls
jesselson
waziriyah
hirael
nagaa
gravitz
choedon
echline
carticel
fatteners
athron
yendys
soaringly
crackel
verderosa
glossiest
listpage
mesalamine
minujin
hanaka
recrowned
schervish
richfood
affy
breki
ackard
larazotide
huahong
gudgel
coroneos
weatherizing
ahoua
mislay
ancramdale
bordoli
sacopee
outburts
varjabedian
komanduri
distortedly
cavalluzzo
stifado
gildings
hollfelder
rushmoredrive
spectular
redelmeier
godhwani
weissfluhjoch
tyreman
mcginly
jackness
thinglab
mamlouka
hypertargeting
fledgelings
fullcourt
ecomomic
ryklin
kaufelt
rouches
teamorigin
southalls
djugashvili
meillier
resonsible
indomitability
excape
netherhampton
crummack
mazri
corun
rmmc
declassé
webernian
crestliner
baruchin
ryohin
schönhaus
sucessors
rhymetime
crgt
scalemp
edmos
barough
administrable
kallmyer
groupons
recapitalisations
resitting
karamagi
chittka
overmanning
carclaze
blaschak
galouzine
daychopan
mauksch
cmdbuild
clockwatch
fruteau
jatas
cmws
robata
wellmeadow
growt
freakshows
zinurova
calliflower
cannisters
semitruck
decsions
latifullah
jingchao
childres
kalarikkal
vcnetwork
cusinato
kayabukiya
gottcha
carseat
architzel
twittersphere
ellinais
itpro
vicosa
fridy
logistex
ecda
blackhills
aramendi
pareco
searer
unstack
cyflwyno
otherized
nairashvili
nonallergenic
unrented
mangelsen
eurpean
mckeirnan
frasinetti
willimas
bangstad
kawesqar
fiending
msxi
artumas
visvader
inishfree
canadain
hemopurifier
aritonang
donohoo
hezballah
samolis
kaminskis
mexder
mugniyeh
jobid
minimall
oaklea
restasis
eanet
spichern
glowered
highters
thobeka
txtloan
janets
sehc
sutisoft
dutartre
busybodying
vitarte
rongwo
emergis
scamwatch
gnezdilov
telefusion
dyton
nutballs
fobts
familian
lauriski
hoketsu
dispised
gheith
sterilite
smellers
parsani
pechenik
semaconnect
baghdis
trachuk
cutress
vanderlugt
potful
drauniniu
colorectum
mongerer
spallina
keepaway
expecations
inceased
loveluck
grumping
atill
longevinex
willerson
hirgigo
peoiple
umberson
clevudine
pulper
ciolfi
atherothrombotic
nhpau
outsanding
finzels
earthkeepers
ketziot
slicethepie
borowich
mceowen
comicconnect
damnjanovic
rebkong
palmgreen
badl
bhgh
scrocca
squalour
hammocking
tihwa
sciame
mandey
bially
zorzoli
wasington
diplomacies
henglong
lethendy
loyens
mwda
neigboring
calerie
sevylor
szemberg
destablizing
kreuzinger
chandebise
starvelings
babaoglu
golanski
healtcare
spiritueux
nakatsuji
elephantitis
kanstantin
uninvite
snugged
pharmamar
passur
ruogu
hpna
davaajargal
biscaglia
bumster
evaulation
choppergate
shierholz
shanmugarajah
unfortuntately
manoschek
souleimane
eskilson
sweeta
pasoans
sophomorically
szakacs
intelegent
guffawed
mahaweel
inlayed
deniably
staba
corrigans
rhapsodize
mistfrog
quikcard
nonchronological
dawdler
wybar
nightster
ignominies
keychest
mcilree
kurodake
ranit
brondyffryn
chotai
maedgen
mavuno
andipa
dicketts
reincarnationist
tuayev
cadfund
lynnerup
tbed
cyberdefense
ameriyah
womick
ghazaliyah
gissendanner
scenerios
dooking
pantigo
nonteaching
congratualtions
waufle
bariay
purevision
elmalich
mortgagebot
hellobeautiful
slyest
devaughndre
drumossie
superhospital
bellardo
kocchar
heafield
nanzenji
bizwiki
taddonio
relistor
electropsychometer
maritas
claustrophobically
fanah
coonie
yaoping
mcneley
birrificio
befoul
deloire
gauffin
pongcharoen
nardolillo
applanix
munning
drec
crucet
rossanna
rikhotso
kiswana
thanksusa
trailblazed
denegre
dazel
trafnidiaeth
xoma
superferries
lightful
zaafarana
provea
ahemed
pansea
madhosingh
khulumani
goodhealth
audeon
estruc
accordign
dunkling
meddlin
urumuqi
nuren
birthmothers
asiainfo
shrops
campagn
everyboy
hissene
egnal
gurgl
ccmrf
efama
seldens
sohair
cevis
fujihata
seex
msgi
amerigon
americanising
collission
corrons
chowkay
produ
mcanderson
quotational
lipot
blatstein
aucott
breaktimes
chatterers
delahoy
rohsenow
flipswap
krahenbuhl
jourquin
georesources
dressner
slimier
potupchik
cpst
clubley
sleety
ungenuine
vdel
rogalsky
janullah
blipped
sourgens
oters
remainig
destor
tamelander
chausseestrasse
piroli
nabco
brzak
afdhal
sriprakash
karakoc
torregaveta
moraca
funwall
shurbaji
kupetz
goltzer
gelband
departmen
zaffarese
tscharnke
chivvied
underskilled
cenziper
englightenment
bresnitz
mapule
percussiveness
knog
blwyddyn
cinquin
yacon
zhenliang
ofrasio
sfari
auyang
claragh
shioi
huelsken
myfoxatlanta
teixiera
ananenkov
shoukhrat
rameck
ducroux
makondo
grueneberg
politov
mlotshwa
roeloffs
incented
moneris
banlaoi
laregely
derserves
clader
hmmmmmmmmm
unpicks
lakshi
kangho
herapin
krystexxa
gurgly
militarty
podair
noooooooooo
tayberries
bouffants
kornum
losan
starvest
ppda
hypocritcal
khaisman
rubashkins
demofall
sweethearting
trasylol
freedmans
mvis
goldmanite
motobike
oneda
sicolo
figueruelas
scriptapalooza
dorrier
advertently
bougerol
schoonveld
larushka
phaseouts
fiterstein
oremland
geomatrix
schlagman
wisegal
holyer
newhampton
engery
fiddaman
lightpole
hasiotis
gasm
milnot
pearlised
kaloga
lancor
riqqa
romaan
sardone
thorthormi
disidente
quaterback
pozzale
oularé
rhythym
beglitis
highdeal
fiftyfold
afghansitan
magnatude
tabosa
ulzheimer
kjaerulff
knifelike
corraling
pratfalling
disolving
hahvahd
romeyka
shites
makhzoumi
seggie
devicevm
licuado
newboys
hanchet
norooznews
völkers
dndn
bulicame
helmsmanship
diﬀerent
unreels
greenscape
ivimy
negal
leaud
bhumibhol
zenobians
skett
kvakhadze
hiliary
domboshawa
grovelled
jeré
alphama
pliszka
childeren
sustenna
evidian
bistate
lendle
arhabi
steigrad
zentaris
acromas
shoegazey
surgenor
jalkoti
startrans
badoian
carmelitana
dreihaus
unilens
illhaeusern
clintonism
gridlocks
rassweiler
vaernes
rmcm
nonjudgemental
nullriver
caipirinhas
stuffily
navyblue
fjera
sotio
kerness
robell
tamamoto
paefgen
metronomically
caison
vanter
undercapacity
pashkin
bankroller
iniatives
couglin
nassem
unhygenix
rumaithi
szejnfeld
torrentially
bognanni
tellies
gennara
blynyddol
bokelberg
antoney
arrivers
sophana
tushes
abaurrea
qiqihaer
jatoba
miswired
sssd
quandries
margets
unmountable
paolicelli
rueing
vukaj
sagid
gwragedd
technogroup
goye
gamecubes
creandum
wardag
nanoantenna
encinosa
araton
mulvin
decentralizes
tromsoe
technologizer
musayib
reprotoxic
￡
nigbur
monzingo
counterassault
koopersmith
mcclesky
menactra
levenwick
hanaan
voelckers
greastest
bitchiest
inditement
muehlberger
beiong
theatening
mellas
lavinthal
inititative
intersquad
represenatives
roughhewn
hallglen
juliens
intreccio
magallan
tibilisi
langinger
djelic
stroia
atlhough
wiswe
brickbeard
alyawarra
borneon
pbsct
dacra
cheiffetz
privilaged
aigurande
tranquillon
frapin
syphers
polygrapher
underproductive
veleva
shawcor
stuas
backpeddling
koerfer
mezcals
closley
namenda
ferrarone
etailers
nabanga
parmlid
karastan
gougères
oversulfated
lubéron
ehlke
narseal
salihovic
shikse
sumptious
safetly
billyball
colmant
licalzi
mullahcracy
canetta
singiser
aswirl
jarbawi
scapicchio
dicotomy
profauna
raclin
bronger
hahaya
cygnids
upstreamed
muthalib
rqi
racak
socalist
hamantaschen
aqeeq
borovic
breathalizer
blackington
abtan
vandemeulebroucke
sodomise
zagornyi
cibulas
bohlig
kalaje
guilleuma
islamofacism
bastardise
vantrix
kochanov
texaplex
orcl
heulyn
diangi
librans
toybina
mingtai
foucheux
nyoraku
trixibelle
wanggang
ibison
courtway
listerners
cariatide
trishas
resynchronise
carpinteyro
masciale
seriocomedy
witech
pinksy
counries
ndemo
emtriva
losson
holbeache
whiped
sugarbabes
mondane
pediped
zubakin
ifshin
warech
baramullah
llowes
magtira
lukehart
ergos
swopes
yaam
cloddy
omniums
iruretagoyena
biberaj
naward
nikolce
shklyarov
kasraoui
yakexi
causualties
bewleys
inbody
gehle
calandriello
birp
kloeden
vidailhet
ferrety
burlier
demagogical
fergi
phinnaeus
dayami
tibetians
kleinbard
ponturo
visund
lightproof
mouneimne
carbonnieux
ichael
teeniest
pysche
welie
rittenmeyer
olaim
nytol
goldworth
centiliters
veiroj
eural
pokertek
forham
elidel
stiffelman
amrami
mcleister
unscrambles
guriceel
dipso
sanani
wheaty
pushbikes
domaille
generosities
uncowed
postretirement
apollus
granularly
trugs
puttered
ghaws
humilated
clabburn
camft
mstation
masne
uscom
indemnifications
bizcom
bimes
stheeman
pathela
konoba
mcitp
marwolaeth
barrowed
sarachandran
cherman
mcswiggan
tognarelli
amnestying
sawtimber
greehey
troyak
breakfeast
apiafi
ofthis
djoker
gulenist
abinbev
yahye
bullhooks
phuntso
motaeb
flybus
kanacevic
hooha
mandzukic
safder
bobrauschenbergamerica
redlihs
monthairons
unarrested
mcphaden
hammertoe
radnedge
drico
reinflated
mudathir
mindup
haozhou
latinode
daejan
orasure
thrombogenics
pithiest
unfastens
danglard
dvortsovaya
nympholepsy
immiediately
litheness
bolshaw
nonremovable
akshayuk
megace
exalgo
ccsn
cmarket
tuckup
sosinsky
filipini
atatra
lichnosti
becharam
fousing
millberg
supersector
baveja
scansnap
shallying
cosr
uncensorable
wory
donig
rungan
fouhami
olasewere
remindful
myriah
crickard
bogatin
ipcm
pornification
naddeo
hakkies
topnote
majoriy
breiwick
subverters
spde
felux
krauel
watherstone
develo
serping
koukoulas
amdl
imigrant
muttonchops
tomnacross
piersen
standoffishness
ballieston
cplg
trublood
burklow
kettenring
outshout
auctomatic
iveth
yusifiyah
radioss
camarthenshire
matshidiso
xlif
schnobrich
dyddiol
ahmidan
olympitis
skalleberg
sharwoods
joerres
netcomm
belarussia
jarawas
cowardness
tummies
maridueña
marwyn
leevy
oosterdok
tocquevillian
videoplaza
kosachyov
bitorrent
sloooow
ricinine
husarska
luned
poweful
staglin
sculco
penhaul
hostopia
requir
zelyaeva
honourables
hasselstrom
sairr
razrs
collobrières
compny
waynetta
klontz
kerick
mopi
optimark
semiabstract
fnia
accelerographs
pluots
takishita
hostry
ayoreos
microblogger
fausey
blahoski
crowngate
uncurls
kourakos
kerzel
tooman
babyfaced
overated
cbay
sportcombi
farrows
calanchini
mellowest
trollhattan
volach
freijo
damanged
inhand
frubes
daetoo
onvoy
declairing
mavroidis
imporoved
redniss
fukino
cering
luthan
gywn
bellinis
filgo
ameresco
imcl
melya
dasenbrock
proscar
vpso
intertalent
elitetorrents
jallal
cirtain
tessieri
piperade
bluehill
beruwela
indolently
ghenghis
thromboprophylaxis
prstore
amabassador
quintais
khunkitti
chichakli
tyrannise
progessing
stepgrandchildren
toeachizown
intoduce
fifton
englan
kabum
slopers
pasw
gwag
rostas
menstrually
accoun
solartaxi
paillette
provacateur
migliano
untoned
nefzi
seonaid
unstageable
evolet
septuagenarians
schmoozed
botney
ineptitudes
unpotable
bamfords
lebovitch
yarzeh
edgler
hifx
baldarelli
weeing
palmary
interupting
twinspires
sadowy
tazzyman
plantscape
nvicp
yassaie
dutchness
preprepared
shafan
counterdemonstration
boertmann
glaría
jancek
meiquan
goughie
zulehner
standardbearers
vizit
plumo
rabouin
shimkovitz
udalagama
tongkou
zurzolo
rathr
bexa
licketyship
siumu
highjackers
kruszynski
geobra
pacenza
faeza
sornosa
gallowtree
farookh
huitson
calabacitas
ibeer
greenslate
zaney
helitanker
souvenaid
basravi
simpers
nobl
arkeith
renkart
rhapsodically
beautee
gazgireeva
acuo
keshishyan
eskiimo
zerotruck
skyfuel
naroth
wienert
bessye
netequalizer
premerger
georgiopoulos
nutopian
devlieger
sinapse
sharlon
sexstone
penichet
angiotech
barandica
arciga
aaden
sherifat
sowetans
endotheliotropic
nsduh
pontificators
chappaquidick
abkazia
goodstadt
barnies
flexidiscs
korinne
szapary
sigersons
prokopanko
cggveritas
wilhern
trusdell
jantzi
caltroit
inattentively
buiness
kvarnen
vistica
mcmuffins
nuvis
kodas
karakoy
naifs
meglomania
snappiest
garmont
jints
apocolyptic
panger
ramic
hojjatollah
nonsuited
tzeo
charlena
gereffi
educability
mushada
hudetz
hardheadedness
pagnoncelli
jovicevic
exoticize
nanah
wapakman
judys
stratex
congresss
admix
bekkouche
basumatari
frostings
nspf
mahlyanov
bartnoff
bruckshaw
searchings
intalio
samcam
corretora
menday
benifited
muonelo
rubot
coachloads
mashreqbank
ceausescus
insited
strangleholds
servheen
heska
mustchin
welldynamics
ajilon
fomenter
steppingstones
poqui
cortadito
antitobacco
listmaking
incant
osbaldo
edbi
schmutterer
salorio
feebates
rochsoles
gowning
reprivatised
cenicola
uroda
annnounced
toifilou
islandsbanki
barix
makuxi
cachetes
auchleven
mplsound
forseeing
vandvik
navane
loftman
pepperstock
leaguewide
gottis
reinemer
francouer
joevan
harbertonford
timesmachine
upwaltham
locf
abuhamza
rwsl
taggen
huppi
ptds
mascioni
mojtabai
caukin
excellus
mxenergy
miralax
pixxi
pourfar
microbrewer
coprorate
affilliation
trexel
zapien
starey
gossipmongers
bonning
openpeak
seatholders
discerningly
pannaway
räth
sensme
goathorn
yakutumba
websky
vidaza
opiated
bedenbaugh
neurohr
quaegebeur
unsearched
abdurajak
dugat
bamlet
sbux
jeager
soard
humaidi
chatshows
rashpal
oratz
cnnstudentnews
nabokovs
moneytoday
hipocracy
iconomou
reoganisation
liabilty
congess
giancoli
xcaliber
petroliferos
agrisa
miszuk
secuirty
wotorson
abobe
greenhealth
sensless
gronbeck
stratusphere
elciego
misgoverned
ekaya
mccartt
empaneling
bretherick
zinnov
rodnick
opolo
azzarella
irela
starroc
mouthrinse
holylands
cryolife
veppers
salatto
charai
teychenné
unpractised
rgis
chemsitry
atrovent
sachino
dudeperfect
kutschbach
koure
unibrowed
ejaf
brazille
xuebin
tivit
bydlo
imiev
mobie
yunchao
fidolia
hamdiya
razoronov
dosman
ostasz
alboher
hakas
scullys
ovacik
straehle
benayer
obendorf
joosse
hlaa
masselis
photocoagulator
ablard
ronnette
aankoop
honaryar
kohlíček
hargeaves
dunned
comdisco
narks
nadt
jaroenrattanatarakoon
snaffled
couoh
rashbrook
anascape
kayonga
damam
henningfield
figlar
svrluga
sadatullah
olenn
obrycka
mandozai
mastah
kandoo
nicom
itacare
admistration
sugal
saynez
macroplastique
sanitisers
khreis
beauvilain
bamattre
strongarmed
tirgoviste
pecoc
marinich
cruysse
pultizer
knoell
textese
floeter
disports
eatmon
chwaer
paycheques
ballysally
xanthohumol
szkutak
stablest
yossie
conking
mazaika
gadari
scottà
jelden
smeargate
islamberg
professiona
toudic
terrrorism
bootcut
euopean
kiondo
apaydin
donckier
altiere
kuettel
manorcare
yeardye
oreign
sluttiest
giddyap
loecknitz
rspa
ignomy
demeuse
loway
turgeau
depomed
bindmans
orderbook
brenkley
accomplised
ustaoglu
cabreras
posang
palermitans
dankness
thefrisky
retroaction
kamoa
ravdin
whirlie
masonary
mantris
obam
sargen
webbington
stefanides
einich
relati
sivley
releived
suscribe
muraka
saegheh
sedor
miskicked
covd
inpart
assemblys
mapplebeck
recapper
exfoliator
solazzo
yogaworks
paynet
pensants
kulhan
erenga
galick
morgage
linkwise
cazzulani
lacek
intercontinentals
workamper
skypein
nonelected
bioarts
standbridge
mcil
fendering
lisneal
kuretich
alikhel
friedgood
olwine
tinglingly
carseats
horist
drisko
terrorisation
solagh
skachevsky
robotuna
diivory
rewrap
schleuter
ribcages
foofwa
drillfield
goriness
stressman
pianosoft
hurner
licentiously
olekas
underlit
monogramming
ltvs
grabbin
widmyer
johannesdottir
markwells
sinafasi
kapitannikov
casasanto
pohorylle
chirrups
yamanya
avcen
yarusso
nairz
fishcross
guarrasi
valutation
lamouret
burggren
kilinc
aaim
correra
zarazua
tunçay
soumache
calebasse
innovaro
ikhtilat
naake
steingold
nuqui
fetullah
ghausi
alwash
muvunyi
mittromney
alapini
viewspaper
kapidex
deborrah
colorito
lounibos
cpaj
stng
adecn
okayish
posilac
immitating
disfigurations
inzalaco
deathcare
khajepour
recyclage
louiville
lenval
daraio
grasstops
schertwitis
dubbya
hirter
skittered
tslf
yenilmez
preannouncement
arangements
bootay
remediates
gaidhealach
sackfuls
futureu
unconservative
ashleymadison
vancsik
biafore
favas
interferers
moonridge
pompas
mrkonjic
dormmates
gruny
caputured
longhaugh
dugue
urbanowski
pentling
mansiz
zuritsky
wischmeyer
adjustors
loudhailers
dbsi
fdns
mohammud
gildroy
mandleson
bruegels
pamfili
audon
powazek
gabbers
xuyu
oshoosi
kgia
hitschmann
leisurecorp
unbolting
polanksi
wymeersch
spectrial
sculpter
ringingly
savicki
tamileela
frostiness
octavias
multigeneration
earleywine
cowett
grubtown
soporifics
myfoxny
constitutionalize
abhore
fenceless
falleur
nigol
lotting
motari
ridgell
burches
ariizumi
ngosso
rukhadze
kenlaw
komejan
repentances
scmg
dolmus
infc
kothgasser
mutsinzi
rothensteiner
abthrax
obesandjo
dgmt
shishito
thisyear
pogea
changdu
sabree
screenwash
skiver
botul
rasker
velensek
lppv
melexis
rotomolding
turtons
cathys
simkoff
youtubed
eaee
veloza
gassick
nebrich
yaers
schertle
earleir
shigeie
girliness
ardagna
fonet
recr
dentinger
synchrophasors
nmcf
reinjury
kerrington
setkiewicz
fuzeon
zostavax
quarantinable
knabusch
skaalen
vachara
mwelu
ascanelli
tosteson
balkanised
saltcellar
duplicities
merendon
mesterhazy
mompoint
nataro
impoco
pinfeathers
arcsa
reinterview
lombardis
hidrocapital
palestians
durao
turgunov
misn
xinchao
finncap
ishmaelia
gsces
childraising
fenerbache
kenmark
ruices
dignataries
darunee
camwell
bassarath
druillenec
schmitzberger
condems
forcément
lowensohn
palestines
boyos
desparation
fifis
funloving
cyberdefence
clubbier
bissonet
svento
kleinfield
tuohys
digitallife
luoshui
peranteau
svetkey
bevi
racily
forodesine
addlepated
unicare
slaggy
noviye
antirejection
fernstrom
hannides
urbutis
partyism
rugambarara
edgily
bigstockphoto
milns
boersig
bonxies
tayet
kronmiller
pfaeffle
quarterhorses
energid
kivutha
fiercesome
achillion
shakibi
adawe
tedactive
fleak
broadpoint
caprasse
prescriptives
pasqualati
herzon
sigvaris
giampapa
weakend
neurovision
chinaaid
toxically
wahadat
readyness
hipotecaria
hilchey
bohua
clearport
quelynah
ukcisa
mougeotte
roft
umberella
skiercross
fücks
vdoe
zelnorm
larrson
shishmanian
bakeoff
juluca
schrode
murithi
calvyn
unstuffy
kremlinologists
gussi
itrip
bregg
goung
odones
hasselhof
mudfest
glugging
nahhh
candidtate
baltierra
aarika
nonflowering
nakasai
fandila
qibing
hmmh
langemeier
multiculturality
vidricaire
technewsdaily
sarasi
aksyutin
camellos
loyds
rameys
oehm
disintermediate
suduko
ntumi
taparko
overlearned
outpunched
dyann
vicitm
obamacan
accessorise
kabateck
tstorm
spregelburd
moreillon
beatlemaniacs
mismanages
sabertoothed
alliyah
arrowing
nyeholt
abdullahs
pearley
accessorizes
lacarte
onesy
somolia
sommermeyer
glunimore
noncancer
zaidy
farkers
hshieh
guejito
dailybeast
euromarket
supplementarity
tresemme
chibale
longhaven
korolyev
arfat
superciliously
beserra
farinet
feic
nachchikuda
gurstelle
buttkicker
technophilic
perdziola
mamhoud
junqiu
murfi
thinc
spyhole
pieron
cammillo
unplumbed
townhalls
shufflewick
peopple
penuwch
royzman
schmaljohn
galmes
centralpark
emop
rhayel
doobee
semonin
gdsn
punsters
spinally
nyicff
simatovic
oskal
enthusastic
armh
schumannesque
telecommuter
brunwin
lucianna
faassenii
teethmarks
huibin
gadding
trenks
insectlike
enanga
lasikplus
profazio
etfc
acambis
autopay
repubic
tulgan
anzorreguy
instigative
textgate
lucullan
lumizyme
libresco
spainard
medexpress
gymdeithas
kalnas
vivenzio
beasting
chrisler
mckenith
techmedia
glenfeshie
shmuely
dankberg
myhouse
braugh
sgiliau
piacente
violete
fostamatinib
dentley
ucking
systemm
wygle
pantomine
conoci
sasisekharan
chuis
callooh
geokinetics
shanghaiese
tinakorn
witalec
donaca
carrianne
naturallycurly
troudi
leventhall
hartrey
commoditize
lanais
zigas
moshed
klaitz
rydill
luescher
kramerbooks
bimalendra
groezinger
haagarorum
schmitts
amillia
paykulliana
stroebele
anatomizing
aconites
truska
nasery
kozhinov
vampirish
zetia
metalsa
wevl
miedel
seadown
showtown
breastbones
bruderman
koubriti
bioinspiration
olympianism
partygoing
sturiale
acresford
enoe
nawur
pixable
truemors
bronrott
aytac
khaili
ringfencing
castrission
silverglade
glufosfamide
delegitimising
grungey
hoppier
phullan
tinactin
vaifanua
exhilarates
motavizumab
plisch
czyzewska
hightops
kcdl
delevigne
sunblest
adzharia
kingennie
hirthe
overexploit
mdus
reaseach
ostmarks
grossnickel
mcgie
unprovocative
iopa
chandoo
goverance
depiano
oooohhh
amanjit
recuitment
scooterists
hpcmp
abstact
kazmar
albat
sahed
lantejuela
chhean
katios
himmelrich
huettel
zentella
balkiz
underlayers
basio
sonagas
poisioned
jetfighters
tpmc
imnam
debruine
goalkeepr
reimposes
robbersons
lubriderm
slobbered
hawrysh
sinjari
blockish
serveware
stabilty
visilizumab
underwired
ibotirama
bcbsma
llandulas
grabowicz
teplizumab
exciteable
herringtons
vannas
confeniae
sunspree
tullet
huiqin
chivvying
artsingapore
nmec
abbateggio
ricicles
ekhlas
boonlert
valden
pointclear
stepstool
nibbies
rihal
expansys
barwanah
oranienstrasse
misallocating
mezaache
tobalske
scarton
mokafive
khristoforov
perkier
lecouls
baobob
expatiating
nayomi
obler
makarere
salivated
shaiken
amanpuri
ageorges
doucin
bmhc
bleepin
wobs
umaima
sathiyamoorthy
cheapish
tutted
subpostmasters
mutagamba
tyrihans
russain
keynsianism
enagas
ftserver
literalizes
telkämper
baichuan
catchlight
orsbon
umutoni
kuzar
maysaa
wahono
llanmorlais
suitemate
mmddyyyy
televized
kajla
reinterviewed
michelles
apalta
indovations
lenamore
pantelic
humpreys
mutallip
qeqm
chamness
nakali
ecopack
azte
taiyanggong
shabaneh
angulana
uhlirova
mcgc
noroxin
potiers
fundtech
sevillanos
klingebiel
gulfstreams
unreflectively
milosovic
kingsbrae
frojdfeldt
rocklike
hickmore
cotrimoxazole
bambur
miscoded
gasayev
lamph
engalnd
mingzhong
txting
persahabatan
peijian
taraha
bnpparibas
solland
physican
tieup
tentpoles
rhizotron
pulidevan
svacina
knappskog
noneducational
metabolon
transnacional
undoc
weijiang
vyroubova
chiapanecan
tykerb
ogilvyone
berthu
tushie
drad
lacerates
cdwr
schatzie
fishington
digao
habby
southeby
egco
misaddressed
orwel
endoscopists
olympiysky
mulesed
bloviator
baukus
charytin
cowtow
beloserkovsky
shanthan
yusufiya
janadriyah
occold
voison
extenet
destablize
nantkes
murhammer
brudenells
lcor
inititiatives
venturesource
stopoff
uraguay
eloshvili
lobbyed
subpeonaed
lils
ocfcu
crustier
advisen
abgenix
coldron
koutsomitis
yaming
intoa
rezmar
birthin
emmenthaler
mccormally
cicic
dicate
catilin
alía
abdelrazek
schragis
schnyer
gcaqe
uglification
campains
gyawu
saqeb
repug
eyewonder
handanovic
apolitically
enlargment
handrolled
nkwali
galgudud
extemporisation
scamble
bugsby
hojatollah
tremblor
votron
nauseaum
warkwickshire
anaman
treescape
hakeemullah
narcissitic
mubarrak
pantelakis
crouchie
brushers
adzhubei
cynlais
dureing
mohsini
pisted
yoghi
iqualuit
unindustrialized
psychogeographer
kromek
hélyette
dumfounded
governmet
opolot
uacn
paniguian
naxton
vanslyke
pesner
souzas
venokur
mhrn
papakyriazis
fayram
upsettingly
slidey
scienc
calagna
lewai
biomechanists
masoner
dyfatty
voglibose
visualcv
gyntaf
dianca
robertses
capitaliq
swette
sencar
leamond
rocephin
elswood
decontaminates
arriani
ciirc
combinenet
fransiska
tuchis
tillekaratne
kidults
manheru
khusruf
messenging
schoolmasterly
grandbanks
crapuchettes
alevites
shafika
kalpen
urqhart
optionable
barylick
sewip
gedaechtniskirche
pointework
recriminate
houken
chwm
diamantine
controlee
tarkong
jebby
proseries
lham
festivites
backsplashes
denegrated
regionalcare
smelkov
lenow
birsak
clinix
dorrity
kabwegyere
mmadi
spokesbird
dizzier
jakuchu
hebbron
mountainkeeper
hamdia
serdula
activties
kulei
alior
otologics
bloking
fassier
livein
radezolid
munekata
rosio
ummel
shunto
babajob
neigbour
kudair
barrenetxea
katakolon
underdoing
kervick
uniprise
sovietov
roshandel
pilotin
jambur
barefooters
schoerke
recuiting
majozi
kapah
barkau
santuary
lambregts
worksource
alaistair
nytb
smellovision
bordiu
guehenno
rabinov
easdown
beatt
veico
zainy
alterg
kingsfort
sevcenko
gaztanaga
geovision
fidelistas
schefft
akasako
sivilla
pellizzaro
nordpark
euphemise
bryanlgh
marlynn
rockharbor
fondebrider
sickbeds
abdom
adwok
quidnunc
turnbough
madeoy
schiffauer
gdsm
thaib
dodou
rosenblit
glyncoed
eichwede
rascall
nonepileptic
pontieu
mddi
labolt
missons
softish
keyunta
mccutheon
propsect
xuefan
hidded
silerio
cornelson
thanawiya
wgpc
procyclicality
individua
duaghter
ballcaps
beaudreault
kwikpoint
ultrachrome
meatpaper
unlear
sooooooooo
goodleaf
bhattasali
rockstrom
friskiness
reconaissance
feltgen
scaloppine
telmap
contiero
mawlavi
helbers
bahina
amercican
mazzorbo
shanni
econolodge
elimated
inessentials
gloabl
atws
uncurbed
delsym
attendents
smokewood
napss
nmba
farler
hermange
kazahkstan
ceawc
storas
biothreat
insulet
safraz
solgar
luddin
worldgate
ephrons
prisonlike
brigadeiros
teamhealth
chutchawal
cramsey
mdingi
thown
tagholm
pureology
treftadaeth
theey
schrecongost
xolela
anticruelty
meagreness
marinières
taganana
ispwich
maryani
chongquing
rugy
bliman
teamates
zilhão
divalicious
uplighters
tinkled
flessner
renditioning
lumbertubs
donosti
hdcs
magnicaballi
quily
transperineal
shawley
klapötke
ballymacormick
trilegiant
strategyone
tvws
caerhendy
fetchingly
zargana
bracalente
scrummy
airelles
blurrily
barbered
enforement
shamblers
cpsg
keiana
nampak
vanech
freeto
kneuer
cnsx
granatino
sihk
quainter
golddigging
krumenacker
straighttalk
fostok
chekara
dübel
hoplamazian
nicoise
loczi
zulkipli
aaeu
walubi
poye
corluka
rickell
shantan
khalizad
bretonside
accouterment
jahmi
niedzwiedzki
cellsearch
fieriest
robotization
wehliye
toplines
businss
skirmett
dreumel
whomped
aiesha
smartsip
pairat
atendido
stemsource
kingate
rushcard
zestimate
glantzman
pyinmagon
iksanov
jillo
democarcy
iraqna
shirel
akinsiku
rojelio
lemminkainen
hellooooo
cadeddu
kwandang
nfsc
penitentiaire
moyeen
stiffling
tschepikow
spags
tictac
giannarelli
romanno
rudins
siblani
mongel
osri
allll
naeemah
hrones
kreczmer
industralized
schoenhoft
shager
tavolacci
apondi
economywide
levensohn
pjanic
zaryen
lowerhouses
marcovitch
postroom
kaduskar
patalinghug
hyperactively
mvoe
indictors
refundability
osct
surefootedness
baete
brastoff
maldic
vertafore
resposibilities
pwerau
charelle
basketcase
monribot
regall
kirchik
reliastar
mohelim
beppino
ftsc
benbrahim
kodithuwakku
socialvibe
caravanette
fowzia
nrmla
wellmans
patwant
dimetos
idiazábal
purechoice
marantis
richlite
prepaired
enviros
capozza
davaco
ngage
golum
samancor
sherlockians
katuka
jenabi
stomaphyx
awajun
sikorskys
bevies
clothkits
yerro
tatics
belneftekhim
xaviars
yidis
egullet
grassier
dewaal
demaliaj
xekt
vechey
ahamdi
rasff
freih
txtspk
kiffians
inventables
ralstin
yormark
jostes
furbelows
falleros
anahad
rimler
perceptable
facua
pranged
macconachie
loneman
hiom
appologizing
runnell
keystudio
machira
ripstik
durng
sluszka
masebe
bukima
adoped
accelera
flywhoosh
overdramatized
tsabar
invo
expressionistically
stupefies
manary
odlaug
annings
izbet
mushwana
wearever
thuggishness
acvc
trups
ponsky
norsat
mudala
princessy
lasku
motesanib
budreikaitė
irawaddy
steinkohle
tarriff
dilulio
lausell
moralizer
orsus
denino
eidhr
kohorst
hosc
hirsts
behins
grommit
jaffey
arghanj
missoup
nsmd
dibeneditto
gammerman
olstein
faulques
conselyea
chekir
freneticism
salduz
agace
foodiest
reykjavic
gewirz
esterling
charungvat
notionals
pineyro
meesawat
vadlamani
cosmegen
sonepar
mestrovich
scdea
jealosy
scudellari
norng
helloooo
nutropin
repellently
surfware
bonemeal
thimote
baichu
weizsaecker
danias
pnps
szymanczyk
magnetation
orand
bassole
saludes
vilayphonh
okao
slakes
beirao
qosh
odfs
vidals
coeptis
wrongfoot
partscore
overfond
publicker
allsort
rafsky
kolinko
kholiquzzaman
wrongfooting
taameer
birmingam
mccai
labantu
aemi
telecharge
washaun
cafardi
denhead
mccarthyesque
velz
jayabalan
anandvan
bonyongwe
testamatta
mccuiston
hariklia
hayesbrook
grrreat
karskens
midyette
tukkers
vitreal
nordt
yorkwood
michol
laubert
grafenwohr
geronemus
worshipfully
kronthal
midfa
rachwal
judu
jiricna
neigbors
mangouras
nitech
acholis
barkdull
kalasho
teamate
outnet
kanayev
maftuh
pecot
ntuf
biktimirova
freehanded
malfeasances
glympse
cacfp
weisbaum
nykjaer
cyzer
cinebistro
moonwalked
hyperpartisanship
twittery
kerulos
nonforcing
pluimer
condolezza
bernshteyn
waining
pullaway
abergorlech
hermening
elsbree
prepays
tokwiro
diffenbach
deleverage
dewerstone
consessions
enjoli
estenson
streetline
revealling
suspened
declaimers
ostrovany
featherstall
chunfu
stoere
furores
photomural
dvani
quintessencelabs
mavignier
gecho
bruisingly
chonji
alexiades
tininess
pitiably
tmts
srere
boesendorfer
claburn
corhan
rachyal
narrain
mangova
dalfampridine
andrezza
amenorrheic
rinckel
anousha
ganaches
possibiliy
jazziness
wazirstan
pellecchia
slud
seinfeldian
youngtrigg
prefferred
valodia
flouris
roiss
cuisia
supandji
cleanaway
gutsiness
tullymore
zietek
rheinheimer
faddishness
kookier
kuhlenthal
censoriously
washingon
overcapitalised
umbilically
homesellers
auzmendi
poucha
hamisha
ratfishes
ossenbrink
chengeta
lifechanging
hynod
shampan
follensby
hatian
linforth
catawampus
bacarella
beaston
cuddliness
seyfer
sandwichs
scrabbles
buchak
kinnel
federlein
sanquer
despont
dummond
unrenewable
whitemoon
neveda
olare
sverak
nogoodnik
cpwr
scherdel
schirato
macapaar
ribhi
solipsistically
neupro
gassiyev
nanoproducts
delorm
saunby
insinuative
floppier
overstuff
messerich
unerstand
appeasements
kerastase
investers
stilyan
junbo
nortech
gerakis
reenvisioned
bamdev
ferragu
ellerin
microban
perchuk
abbenante
godbehere
sanzin
beleagured
makanga
floof
refulgence
cloughan
joky
colebourne
nilsestuen
overemphatic
terjem
grigaitis
ballymacilroy
sowerbys
passporting
morrills
drywaller
hipath
cleareyed
lygaid
ogemdi
nushin
chiota
notchup
babywear
mofunanya
coucous
crisphead
cicione
bkkiff
zugu
goobey
cataio
mhura
nonlawyers
riskind
adedy
rainsbury
galouzeau
ifob
ouramdane
sangmo
wubbe
lesport
salarno
sudsing
jarheads
bassily
governmetn
tipoffs
sunhe
treasurys
blobfest
shengchang
inderpreet
jeromie
daneshjou
eathquake
munificently
maleter
schöllgen
lindomar
hubschman
borsheims
pruthviraj
accouting
overaker
anderst
salmawy
anoh
lateisha
hediati
taiano
wiczyk
asassination
kaskida
jütte
sunée
musueum
somatotrophin
cardmembers
npaw
pedral
nonproducing
cogane
hofnung
weekened
bouwe
gjeli
kaessmann
mirimichi
gerking
hartoch
gratsos
rousting
opentech
korodi
bhumidhar
cluzaud
smartridges
bealefeld
tasnadi
aquilion
cronins
kaalbye
habani
chhiri
bijilo
gabris
miracco
buckmans
skidpan
artise
carbonators
huajie
bisbort
collaspe
veissid
ventureone
touti
auchenbothie
hegemonistic
ellsmore
protheses
labeeb
spritzy
amazonico
volounteer
kevkhishvili
intralymphatic
digpal
mccafés
forgit
mahardika
steeliness
repke
latibex
wefel
quailing
greyfields
ringcube
clucky
netenyahu
laidre
mittals
anggoro
ttpa
krale
luzyanin
mirle
gennarelli
visably
magante
beachner
prillaman
nyawera
choic
harpootlian
hardarson
yellowhair
falcondo
kassimeris
herschaft
airspan
buckely
faifi
divella
konat
hokuryo
decourrière
picalm
shortner
helenas
zyzak
latoni
porthaven
prith
knuckleheaded
ifun
dfrl
novachuk
jereon
nawcwd
nmrd
vinoteca
negc
niroula
attitiude
moneer
qualnet
pellengahr
cinemagoing
cupecoy
plangency
faroqi
marykay
crns
cwel
prisioner
braunecker
downtick
polonetsky
mahria
sampradya
dobusch
jabaar
ondák
sadulayeva
stolman
brassic
appts
sicardy
cahe
strengthend
raikia
peachie
optiver
obsai
sculpher
validas
opnly
sarpe
isentress
blimpish
heidenfelder
nikahnama
battaglino
gelaterias
marcelled
drooper
condemmed
resuced
dataframe
econony
scotand
houillon
maubach
diemu
borgzinner
bihm
rubbishness
ipilot
galvus
ideapaint
poomacha
erecord
stabalize
americraft
zamili
okari
polivy
brownsman
naaqoos
tellas
conjugality
cenb
karinto
multicounty
telcommunications
evercare
yings
intradivision
shurgard
plaquenil
ccrkba
derichs
euromin
athersys
hmgcr
kolkena
bogany
airlessness
salsicce
dobyne
powertek
tranier
ecnomic
ahuacatl
valiyeva
lamell
yiddishisms
chsi
exchage
chennan
bicay
cesnauskis
sweatiness
alumbaugh
subastas
azedine
ireal
digusted
bahill
attaturk
tenene
megraw
zhouyuan
whx
farceurs
bivl
gridworks
brazilan
groundbreakingly
wherabouts
zhentou
yahyavi
grochan
cammies
vilotte
vitravene
parwiz
flashflood
juicily
kodinji
sarksyan
imedeen
horrevoets
suvit
friedlos
jhihben
diallers
unmeltable
kompromat
ghankay
cabrerra
kizingo
bociurkiw
kontora
rogombe
basoglu
pallinsburn
ghowr
michgan
recriminating
orwick
avadesh
vaubecourt
galanthophiles
sehlinger
kirtas
cantler
ahmedy
wieghart
tereshuk
jakeb
rifaqat
sublety
smialowski
vreeman
mallak
fanueil
timlett
relpax
rizzatti
shour
spatule
iranair
guidestone
mushoriwa
gingernut
twistedness
olnick
diversoes
kulayigye
wahbe
gaueko
kerrins
safeauto
dongda
anjimile
kalbag
andey
akiiki
rootmusic
isakhan
impre
forestfarm
hadco
succe
prashan
mccaffer
deerow
cousinage
footbeds
chuzhda
dukies
messom
defrays
unideological
piteira
nariyah
cushingberry
perservere
waterpik
joensson
hirees
wispry
denita
unwealthy
hollwood
radiocor
ozbolt
geust
malkins
nurtingen
optaflu
jayyus
kriegsteini
warmdaddy
playday
karegar
teeu
xrank
jokiel
schusters
supermetals
jurkowitz
anamitra
procampo
faustyn
slaugher
farrs
chicking
thickburger
linksland
schlenger
gurgled
zahren
gramms
nilc
connnected
homestands
applogic
wijnstekers
subsalt
castaner
oyinda
terracino
leafiness
simove
swilled
ladek
hudack
shangba
jenesis
maymount
pescheux
poptag
avrl
besogne
baselice
patraeus
lanoxin
dccl
bucktrout
eprescribing
burnsley
wieruszewski
wingshooting
boulevardiers
meyercord
sidlauskas
unharassed
couragous
portenoy
hilariousness
smedleys
karvellas
talel
blanksby
sheilding
abcess
fontenet
konopiste
tokeer
destructionists
shimbum
teache
dosara
verrus
hcat
greengen
marcellous
staywell
cmpid
novacea
pricewise
corngold
nagley
mcare
poncing
typetalk
medov
frentic
lickable
carvahlo
maycol
prickled
addditional
bullhook
loggans
singsongs
epcon
disaters
unpins
craignair
panathanaikos
ranel
roqaya
nussberger
sesne
vucicevic
picat
restaffing
goelzer
jhones
bassolet
curviness
geysering
setley
poice
gueriguian
wearier
connectability
novarka
triefus
mislingford
saifain
mlgpe
urasia
siderperu
starkopf
codies
harrath
bullot
fluffball
courion
sarwe
gizzie
fbfc
hobmeier
emberly
noilea
domansky
unbruised
goettsche
shanava
amdani
joylessness
quangoes
ganascia
dgccrf
atomosphere
snickerdoodles
mehrats
plentz
stolzman
estroff
orgn
folotyn
petee
peplin
portschach
canterna
kildean
airballed
provience
vargason
narongsak
apari
bcii
dorozhko
electrosensitivity
komsic
cisionpoint
anfavea
schuykill
offhandedness
unfalteringly
metee
modde
derrow
miniority
smartsave
strumph
thiels
nccmh
calderons
sukanaveita
gtdi
ozgul
practicar
mofetta
autoport
rocksugar
newsaper
athere
popluation
hustai
superlow
alltwalis
citreon
leperre
businees
gossipmonger
trender
norbom
lawworks
lefkosa
muscillo
yuster
baaps
cypc
werst
saariselka
floppiness
reinares
hollywoodized
pricedoc
marmaro
psilos
etreppid
asriran
nextlevel
louwagie
ratnieks
qadissiyah
fujito
sinola
matesanz
offbreaks
proverty
wepman
karoub
oligofructose
hefron
satanovsky
liquidiser
ahhhhhhh
inadquate
victums
readvertised
twitterpated
korfiatis
sangakarra
vooz
senyukov
incindiary
mccauslin
xmark
potera
tudclud
overdeveloping
pathologise
hawiyah
orphange
foggier
slipperiest
bjorgolfsson
mutanabi
movenda
holewinski
supermice
ncipher
kutka
nevr
cheekier
multidose
hookergate
wittkamper
buidheann
pasteurise
geners
jeonghee
chardenoux
beirniadol
hairsprays
fedeski
galicki
houssain
prepster
manuma
bentolila
connectwise
minaudière
akikusa
untraded
bloolips
darneille
visschedyk
paee
turbolink
zipse
tustain
bartuska
gerbehaye
potenial
kokolopori
popwrap
servetas
fotoflexer
menye
rightousness
layshock
greman
kashuk
unbritish
pghm
lemtrada
fosrenol
frangialli
daylite
olmsteads
wisekey
widowerhood
debilitatingly
diabled
mtia
odwan
adelaider
powerchords
youthnoise
tryl
stricks
emmaculate
ignominous
allaithy
unatural
ezzie
baldovie
lurvey
bossnapping
arbesman
beinhauer
bowran
squareenix
mallorcans
trevizo
ionta
pubicly
adriá
tribendimidine
xtream
mothae
terab
teufelberger
counterbid
seevaratnam
budowsky
relious
prebooked
brinkhill
beavans
thandekile
pulikovsky
shaylin
rotterdammers
jichan
writeprint
tutman
rpcc
footholes
galileos
botticella
edlow
nordam
photoacoustics
areitio
hopewood
mobitelea
protopic
jondishapur
whistance
torralbas
dogchannel
pazarbasioglu
divestures
salow
busone
fvo
evogene
improptu
ctdc
tipul
tranformers
bevvy
goukoye
coylumbridge
ojom
hilfy
zaggy
scarpered
ideologised
sealane
misdescribe
junoesque
lashman
lpbp
rubberlike
norsigian
wooliness
kahumbu
turbohercules
toryalai
michican
abdiwali
aleksandrow
techspeak
kožušník
akunne
gyllander
screpis
airasiax
carbasalate
hibson
moleac
lawnservice
doulatabadi
scaillet
kandau
enviormental
assualted
josem
hoodle
ayapata
siport
pivarnik
zacari
mackinson
jetés
beylerian
glenfair
shakran
popzilla
clangy
rossiters
kreke
venneman
careworkers
windall
couck
proquad
ntsanwisi
danailova
rollitt
leitenberger
reunifies
fortressed
vinelli
angelsoft
escapeway
vondrasek
anpaa
chornet
steamier
hominess
wiski
streetwall
guérineau
experieced
pietton
giammalvo
rivky
lugubriously
dechiaro
paparizov
bioprospectors
platzerwasel
stakkato
glespin
bhajis
jumbojet
convenership
knightz
ceneta
avalide
augustijnen
jaskula
perambulated
rinsky
brodre
donston
skumanick
thirgood
opam
nagat
sheahon
senokot
alket
wenping
pathologising
rangina
chuckies
apunta
giansily
bucaresti
strippable
ghurkhas
flavorx
tamworths
intraoffice
waaaah
spindleruv
elfassi
hamptonne
traums
unfortunantly
vandrei
devost
phsycological
voxtec
zophei
shanakill
roeb
kazlow
ruszin
eyla
forgivably
lomeiko
kiboshed
kontonis
intest
vahidov
redeliver
bogleheads
gelig
pallemaerts
oleoducto
hlavni
friendsreunited
qingnan
maides
therebucks
rodhams
falzano
infopoints
ecweru
expeditures
landcape
hardister
danilow
akery
tumpey
mickleborough
marciani
expediture
autojumble
scaldings
hedgie
ghurkas
akuseki
vasogen
dokin
forecariah
kanyanta
terrachoice
vanderstraaten
accavitti
kuijen
tokasz
michaelsson
bamboozlement
ahnold
kehm
missonis
dirrrty
barthomeuf
rfet
sussanna
kloda
nitibhon
pccd
solanezumab
erye
abdikarin
poellnitz
shepphird
newsdrive
mosny
repolishing
ipodjuice
iniscarn
kurfurstendamm
wellchoice
skrutskie
rullis
ontroerend
mujibar
ukayroc
gulash
shoker
unshocked
blushingly
diamondworks
giottos
husten
juxt
eftc
freydkin
asbahi
zorislav
famalies
knowledgeworks
zazza
snorkeled
spidle
hitrans
rfos
condenast
ashikbayev
zobi
treiki
paulitz
civlian
chiacchiera
defibrillating
jodka
gummint
lobbestael
dembina
grulke
motorokr
beginnin
bookstands
vydrin
kolloen
shoomp
ravenstruther
devasting
londonwide
spectratone
powerpath
dossers
fmct
genomas
titfer
athro
skined
othodoxy
pasttimes
mâitre
tuim
cheeseboard
ducre
thyangboche
purc
dhaliwals
agbash
websoft
shalwars
starchiness
teson
medjet
broujerdi
maravel
gurpinar
apovian
ronsons
varoga
lador
unkechaug
qatalum
kanterman
ishoy
shershnev
sawants
hankee
drollness
kafataris
petrostate
newcombs
funisia
bbka
vascutek
schweicker
zhanybek
schlepped
herchenbach
nashers
harbut
zimprich
panarese
supermum
mcsteamy
kobrinsky
emachine
scanimation
qati
mufas
murrayburn
zafrin
gopaleen
raibin
visitpa
isbrae
chilate
masterstrokes
shipleys
michelberger
conoly
gehmacher
falsecypress
cruses
schraft
zhouqu
tecnobrega
ultrasmall
stategies
wche
meilaender
lebonheur
bransons
meeing
caplans
meadowlane
wyless
michelfelder
farberware
mouthings
wittberg
perluss
eidak
kellaris
smyle
euthansia
kinerney
zeitouna
viacorka
alnur
milebush
affrontery
moshekwa
cuencame
bvocs
vucciria
preneed
neosphincter
gorbi
henigson
encite
kimpson
appelius
verizonwireless
uncaringly
contourglobal
nuami
annuality
earlam
gorkana
aderito
guitron
propac
nucleur
ydanis
suchana
hualpen
amscan
emebet
twiglet
lolloping
haltiner
kushiage
jenx
cribby
spinspotter
ekas
ouandja
timmens
knipschild
persude
gaukhar
opencalais
accuvein
kalkunte
alsema
strander
anjeanette
peuhl
enimies
setena
kochanska
dimmings
hammas
britcliffe
pongara
fishfingers
majlaton
rpls
honigsbaum
backstoppers
guzzardo
griskevicius
goffee
aliceanna
majeda
danactive
detrani
dancebrazil
moyai
upskirting
hogmany
hernshaw
caravillas
penri
wannell
matyszczyk
fillips
tatch
carrozzerie
immeidately
bhotmange
shandies
veros
qiheng
turgidly
comedytime
isohata
nasdtec
dumigan
shahkrit
deschryver
absy
clincs
norrod
arburúa
demisch
mcmanmon
statememt
polygonati
disapoint
heimbrock
cerian
quadfather
damagh
kramper
henselwood
jakubczak
latip
nirapathpongporn
intellgence
wangita
cgpme
reimprisonment
ptpa
welcomely
instrumentalised
nonactors
kongzhong
ychwanegu
wallaert
reusables
mamytov
foilage
surveillence
janys
krisada
houlin
cowmeadow
eclinicalworks
sadnesses
ozalp
thwak
fnba
frehse
lionise
wylds
tripso
sunflag
aidells
westsound
oculography
freefone
polytechnos
rechnik
vancocin
qsearch
ysgrifennu
qotbi
benoquin
delashaun
acpos
tyrannising
respresentative
trilbys
castlebank
batishchev
urbanfetch
museli
iupat
pailor
coolatta
beceem
mmrp
lanthorne
mellegard
komondors
galns
fronded
demaré
brenhinol
cobnut
draftspersons
torisel
chandeliered
kivlan
köpfer
shortcourse
rapidarc
mizroch
kupalba
christianist
hashiya
somech
odeke
antoci
oreopoulos
akhmal
catalona
resaurant
grossner
lequatre
snicks
berthillon
driblets
sleuthed
niaspan
captiol
videoplus
rodriguezes
channelvision
owlstone
downthread
sledger
chidamabaram
berllan
vasold
draftspeople
enelow
musland
dudayeva
bumai
admendment
naivette
doorstepping
harpersport
panjandrums
vassilaki
restacking
cardownie
lowenstine
illums
bolighus
nkorea
orbimed
haulouts
chincua
mingchun
tfah
pourang
werlen
outjumping
lecturn
peetey
schulein
jovia
inema
airtour
silverbrow
dfmo
mulfinger
snobberies
estandia
trinis
hibor
rhapsodises
organisatio
kirumira
krajinovic
lornamead
stockli
stodelle
coertze
grafflin
gensheng
kezer
behary
briamonte
sportweek
hojat
puschnik
mbywangi
nonurgent
noranside
pozycki
zelnickmedia
chapchai
pilafs
iskrov
mythologise
gurassa
chayya
muslum
tanshi
frappés
kilver
shappert
burbie
ignorace
instutitions
hikawera
bikaye
jayded
dogley
superprotonic
azarya
domash
leafblower
nasjrb
manoochehri
mapledene
tusla
unilateralists
vabishchevich
suffereing
cooltouch
navelbine
chatelperron
fourqurean
repreve
anthrozoos
hwlffordd
fidos
marinading
sharanova
champix
embattlement
rtsm
ignors
gelbakhiani
farrers
ulasewicz
switchball
sempervivums
morbey
negotiatiors
touhidul
wischik
perkily
rubrobacter
vaziev
leftest
trusera
pentabus
varagona
genuises
cordarone
vanch
koketsu
solerebels
jihadjane
nightclubber
cueman
cuebas
kobylka
mitshubishi
legalizations
kusurin
pridwell
shejaia
gronke
stovies
singapuri
mayilvaganam
kilaly
rathgael
bobinsky
fecklessly
sukhu
fatkin
chimpcam
helicoper
slutski
winvian
symbyax
tecp
severley
finocchiona
proffy
peronnet
marilys
rbda
weisglass
chinanet
ahei
sinneth
arrosto
naftz
runzheimer
herried
nolbert
ldcm
thiacloprid
bankrupty
delossantos
whiterashes
fikse
ancic
regressives
toueg
fondiller
suws
phillipses
ewropeaidd
killorin
fcmc
molaioli
kishinami
pinkowitz
xianjiang
delarco
crewmax
cargenbridge
listners
ambroses
frohoff
godette
maushart
eructations
koitka
tuttomercatoweb
froguts
sangiamo
finreg
vijn
catanzarite
crimesider
porthmadoc
lienholders
remenber
schorno
abbf
celebritydom
smithie
morrision
diminshing
millum
dualogic
skateable
bychowski
janovec
kolson
ribbink
impelsys
smartswitch
gwallt
elevance
ukrtransgaz
lsoas
ygf
ballsed
assurers
altabef
laagan
eagon
faugeron
neuherberg
naruk
thorbjoern
rethatched
malbranche
pontificator
fitpatrick
dadiba
kabulis
trinations
kanburi
praekelt
khanaqa
rumsfeldian
mapaches
cuckow
moneysaving
tenderizes
lewitter
baghad
havron
nfpp
monogamists
snootily
companero
auguring
kwikstep
kazempour
sanick
bcimc
domota
ecotarian
lithely
sideliners
ingenuities
phelpses
psybt
wessis
rasananda
verhovek
siclari
vamosi
ducalcon
fuele
tepotzlan
efexor
wintersteen
sysyem
christope
maksharip
chitman
singletree
intratribal
gridsure
factures
auerback
mantrip
senak
eidner
heimerich
salhiyeh
comstocks
fungoes
pertot
amesquita
obssession
petraco
geeser
krichefski
summitry
dullas
poursaitides
guevaras
oinking
girlanda
killmann
medvedevs
sundecks
starlix
cyberbritain
kowalksi
keesecker
ahpi
medivaced
dubula
luchinat
youings
martignetti
helvoetsluys
supernote
tajeddine
tajideen
struski
uitkijk
hqv
miserabilist
naqaash
yaqing
ikramuddin
dunafon
malombo
aminos
abheek
dellapina
pasqualucci
zongker
redenominate
bongate
genevievette
sebastianelli
romec
juctice
scordi
polsfuss
tavaroli
indissociable
magenheimer
artecoll
infracapital
insenstive
lazed
jianjie
kneesocks
timoc
dizzies
playerauctions
alampil
protraying
lawate
translucently
guica
hurtfull
nomc
plumania
prsident
koumoutsakos
kidtopia
billmeir
obamaland
rwcl
breedam
loviglio
greenpath
sakewitz
faghihi
marquice
quicktrim
gerardis
tempestuousness
savoree
kiepersol
hummmm
grandiloquently
difficulities
karamanou
adroddiadau
levonelle
laliyev
gauguins
chisale
ifbc
yacovone
foodnet
nurdling
schienle
chomos
natiello
deradicalisation
kibibi
oelschlager
finestein
arnoni
secoyas
ayestaran
shortcircuiting
marqueece
uninnovative
recano
moufid
anticensorship
dataquick
willburn
matuk
nonabrasive
casadaban
petitenget
menyoli
overexuberant
smeeding
leidinger
multispeed
dunkleberger
instructus
sholders
capuozzo
mouctar
slinked
axxent
euphemizing
msdsonline
afghanaid
danho
guzzinati
swissness
soraghan
rubberneckers
reinjure
detroyed
eygpt
birton
rijnveld
friedson
actimmune
somalinet
synchrorev
igold
bakhtaran
hughett
fidelina
byoo
belohorská
dingier
cutchet
htgc
unvaryingly
ehealthcare
evermay
presskits
soetwater
uncreased
schachtschneider
peccerelli
hatchfield
paradee
wollock
latinworks
husien
thirtyfold
greany
wwoofers
tarnok
benchenaa
cauterised
érection
saltholme
metzgers
knaff
pricedout
bsst
ncctg
vranesh
wiliest
ljajic
jamahiriyah
biolase
tautges
aetheist
ecomomy
camerson
bernoth
hispanicize
farmerie
tyburski
paladares
easdaq
perfromance
kronprinsensgade
ghandehari
rechavia
ruzga
northernness
lifewatch
mendelowitz
jheranie
gassant
ziesig
yosts
skuza
kurtulan
understocked
falacrine
jobmother
europlace
kodwa
ellouise
dalloul
transgenomic
bredhoff
netback
dbjrg
bomgardner
goirigolzarri
cherrell
devogue
vity
ammond
perese
palnu
masket
techtronics
kargozaran
breikss
nvva
mvoto
traipses
kostric
deepseated
kamwenho
hortonwood
rinnie
lilestone
hqz
wholeheartedness
nerison
souhayr
leasings
ntetema
bouzegza
nakhooda
demilles
indomitably
upgradings
innotrac
sivilia
assiter
wembo
ipers
ungreased
abnegate
kerusso
keeshin
mhsi
rdeb
jollied
hidings
handsy
pinnawela
kabessa
filak
yuguo
capraesque
incompletes
tanshin
deeken
burnice
homden
geeslin
kuuki
benthal
reportability
gerthe
deather
offficial
antlerless
infelicitously
thorensen
avav
feinsteins
demogod
sobaski
oppotunities
amaani
resport
koors
girondine
trowsdale
speidi
vspring
brimingham
schambelan
cludo
rozenberga
millheiser
malingered
schoolmarms
tivos
vazire
nonvirtual
wisertogether
fiordellisi
fastpencil
trennert
evena
brittanee
patdowns
flusters
cheffy
ghazvinian
wiroj
covetable
duabi
zaniolo
weemote
slavco
herbawi
rybski
gurtman
caiping
stracener
assalouyeh
monandry
brittainey
lipsmacking
geberth
lipitz
lewdest
nightwave
rabiatou
urogynecologists
backheeled
skyra
chebil
snippers
bromund
kasemeyer
wanisha
farked
radez
rozzelle
whazzat
subflooring
stanculescu
lykawka
kristjansen
lisfannon
bcbsa
austrain
hordon
sajjil
onaiza
skyzone
canimar
battlehawk
crocky
streeteasy
dhakla
gonese
comfier
anadys
weissfluh
youngjohns
prucher
billons
wadhera
maliva
baenen
zeitels
intellitools
schuettler
wamogo
mylanta
dissembles
porthault
cetrone
schurke
neoral
pscw
normatov
radetich
sjon
kalahar
willowdene
sheikhattar
frusterated
encaged
abjorensen
galineiro
intoxilyzer
mukda
naitonal
espree
treponemes
unretirement
lovasik
contiued
nitobi
bluechoice
shrode
newidiadau
nanthikadal
geoghagan
kulesa
tehranis
ardesta
constabile
dadless
engand
healthchoice
shichor
rodaway
caudalie
gottelier
singapor
carrollsburg
reichers
mersyside
sshh
herztier
sandigo
dkos
arguidos
banaj
schubertii
sudnik
walensky
makeunder
tagwerk
huefner
intangibly
sourander
leible
mushroomy
woerthersee
throughball
lyndra
asteres
akbn
ctbf
midstation
kestahn
magnext
cranachs
nadhira
schulers
critisise
adapated
backloading
masback
filandro
grueneich
girolles
rouseau
sellenger
clackett
lamach
enfd
gourlie
sukree
foolhardily
jesien
kapuku
provectus
malomir
rehabbers
dorband
wyckoffs
selakano
lamford
dervisevic
omvig
electrifyingly
rasbury
likudniks
deseeded
gmms
hingeline
chanvit
multistop
borochoff
conviser
suburu
moyersoen
gellionnen
competive
htvl
sweetner
demarkus
beancounters
lauher
prupim
frenziedly
hyperthymestics
nisid
smartchip
travniki
cuell
arttactic
miclyn
stratospherically
shoubaki
extremisms
zold
cheltzie
outguns
influenzas
committtee
shafdan
safey
quecedo
chardavoyne
gaskett
omnivorously
mippa
bekheet
ideeli
hopworks
perrogative
bactine
ambiently
perucchi
jashbhai
novations
johnmark
varnishkes
yuldash
weyinmi
handycams
jiashen
dusanj
stryderman
qianhui
supercolliders
indage
yangzte
tatanashvili
turbes
brryan
crapster
sopheon
grinchmas
crûg
reissmann
pacwind
airbeam
pmog
boralex
bivash
khybar
amberwatch
tenges
rangae
dishion
snapps
okkarides
expurgating
resentence
mayclem
conniburrow
barington
gaohua
ayear
waterwalls
barbatelli
ventarron
braima
cenkos
unsprayed
rashana
authentidate
dennises
sfoglia
ouzký
naissaare
elhadad
detora
camcording
,,and
perrywood
medicene
vinnakota
cluckers
unaffecting
eleminate
elemans
chupinazo
resane
advantis
vanoy
hoofbeat
maquieira
cannick
haertl
trimas
tunzelman
jailhouses
yoes
lobbiest
rewelded
orubebe
cherrypicker
reassorted
disposible
cureently
carryforwards
enflames
betzl
slimeballs
krupitsky
mertha
tranfers
wippich
weatherseal
pemier
paraskevopoulou
ftseurofirst
valencic
kharr
yuzaburo
pendaries
tamagi
kroeff
jorts
angelholm
bifab
laynes
niordc
gillettes
mollins
tyrna
gmitter
syntometrine
ichet
jaschinski
gybels
metaversum
ghilardotti
setaside
karacic
buelvas
amzing
buteco
bryantseva
wayfare
rihoy
reslo
rodzwicz
jewmongous
unreassuring
pagetti
chaoda
nondepressed
growns
compariosn
heiba
sndk
cooverman
ambie
enerdel
tuag
soetero
shojai
oceanaire
frizzies
amazements
otheguy
ifng
tortorello
avego
decmeber
stolarik
unaffordability
purtee
thattungal
takanakuy
pleštinská
bruintjes
greenagel
sghair
planika
plugholes
ukama
kufar
potichnyj
twords
abbatangelo
frieght
treter
endof
hussains
taluri
pyramind
janusian
bogenschutz
glenwild
gaibhre
movielike
namgla
floormat
sharedealing
loseing
cineasia
lipozene
glenbrittle
nossik
abderahman
feverishness
sacrfice
shukovsky
heismans
thillet
heizel
outjump
channey
tziolas
fynvola
overspecified
reabrook
duzant
lvlt
ghostliness
intermatic
pratyay
abbeel
jauntier
iraquara
brodell
chasemore
mochaccino
gisladottir
chacaliaza
ninglick
mataharis
boffing
stonephace
sirny
gaska
kfrs
sogginess
iwsr
datanalisis
eatocracy
venirauto
eyeview
fotonation
icoty
feuling
laiti
merseysippi
sipili
atpi
landgrabbing
eighthly
krupoviesa
easypay
shtf
privlege
cogstate
miserandino
bacquelaine
niewoehner
competiveness
indaparapeo
zimulti
mncube
krums
giardinetto
monastically
burgic
grisliest
langostinos
kajese
oladeji
sharhabil
blastland
multihit
squibber
tkdl
icll
oenophilia
guardedness
peshmergas
hlatky
malpai
weppner
nonplayoff
prishantha
baldhill
mawkishly
jalazun
gombar
ellingsworth
ostomies
hishmeh
tunnacliffe
cecato
washignton
mcknelly
todat
resurgency
visipics
caldesi
jackhammered
doosh
tinyiko
freaken
vallandingham
gallingly
tadiello
longua
berrong
fablon
veljovic
boujis
compasionate
madderson
parises
berezovchuk
jagolinzer
michaelene
provoq
sidesplitting
ensconsed
sisina
ecbm
lutula
chunkiness
alcmi
michenaud
banales
relaese
fanwell
elvinger
zypora
palaeobiologist
prejudgments
waterers
creevekeeran
dramtic
dfhs
zombielike
yangzonghai
tomaszczyk
fedorak
magginas
kytril
niering
matterley
bluegiga
dissappointing
cadario
faffed
roguishly
thuerk
gursey
eckner
chemnet
sementelli
burglarizes
fosterers
manjon
mobilevillage
dupioni
vucovich
tashiyev
kongantiyev
ululate
padriag
reclosure
kazmier
gargaro
friga
prossor
garnice
hollinswood
outfoxing
decentrally
shelper
kollapen
dorjay
fragilities
unfulfilment
awayday
stockworks
trimbles
babushkas
rogulski
superbright
skimpiness
kaison
deseree
ebayer
cuejdiu
teeradej
iciar
pasulka
bollain
adrants
nuamah
graffami
sodruzhestvo
borguna
jarding
katersky
downlist
kobland
undercompensated
nakornthap
whittern
kircaldy
dottel
sipix
gilhart
newsgatherers
meskini
ukrsotsbank
cochochi
wouldl
sosland
boigon
eyeclops
postsurgery
metaforic
resilent
uttern
antitax
powerleap
ffyona
drybulk
cins
yawners
zweigwhite
petkeeping
gagola
ingrediant
conspicous
initialling
dickmans
yurkievich
hyperaccumulation
raied
leaderhip
torbjornsen
jubah
sufferring
winikoff
aiyub
newsfox
sahro
semiprofessionals
swaybacked
especilly
jchr
unlearnt
tuppenceworth
marshae
nahleh
yenier
samuelsons
cydnabod
smellies
patrece
feldenkreis
sharklike
tanimu
biotrickling
plinks
vinegrad
autocheck
bullishly
nmgc
ntap
rythym
aguru
conserative
hoppock
partneriaeth
kuranyi
stoneyfield
nonactive
efficiences
arazo
hydebank
mosolova
katiucia
heliosphera
unpersuasively
dihad
teichberg
zalasiewicz
uninsureds
simplifiers
friaa
karches
merindol
kvaskhvadze
rabjohns
dscm
burbled
mielcarek
autismlink
hidek
weaponizable
waterbuses
pulmicort
unnammed
szepmuveszeti
aaph
lambrias
sleeptracker
motiondsp
weatherspoons
serebra
coyaba
saau
countersuing
evenually
obain
bannsiders
dlugi
consuption
kornblau
optistruct
comensurate
gloer
womanisers
gerassimenko
andreaus
overclaim
vybornov
braner
ketric
tulcingo
massareene
jeswani
jimison
adesioye
fishbar
kronmal
lapdap
hojris
evelis
lachenbruch
wlq
waspe
enterting
polyunsaturates
copfs
bertko
greenbee
cheeriest
gredley
thudded
warholesque
yusopov
shakrai
shoulong
incured
fresch
alliah
redomiciled
niegel
tonium
awaad
carrutherstown
aretos
songhouse
postley
meltaway
unlikelier
trooptube
gocaj
easycruiseone
cryus
depken
shvydkoi
underplanting
reprioritize
daleo
oceanteam
finshing
lefelau
bistrotheque
smackdowns
exhbition
mossbrucker
myxopyronin
salopettes
getzenberg
nondaily
surfies
enviromentally
canalaska
defensing
tachwedd
rydlewski
biasin
foodstocks
bajolet
anolik
vukojevic
preuitt
theato
carniverous
helicoptors
guilkey
accomadation
noyac
gmyrek
smishing
sifrits
arzoxifene
surathani
hotwash
guhar
beckwiths
flarion
somoa
glitteringly
cynta
regenerist
namemedia
lnet
culloton
ruszczynski
flexitarianism
captrust
snorers
blemishless
policyowners
correoso
tentlike
septmber
cybertecture
janotka
microbridge
piliang
comback
yongye
centsables
deindustrialising
seoulites
cemitas
proaganda
graziose
dragonlab
gadiv
woldman
phasuk
louies
hirchson
chhaupadi
invigilating
syringed
misdemeanant
karalekas
mcex
chromalloy
assauge
taepo
anydoc
craggan
gaesser
wecare
bequelin
blogers
aiam
clickpad
sprycel
ucbh
thambo
shinpads
ghcn
bolcar
ghassim
olanoff
huntersfield
jebaliya
postrio
legaltech
myant
laddism
pvnews
grindale
narcomantas
taaramae
cambelt
foget
thermotechnology
reisfield
kwait
indescretions
anasara
flappable
adready
nayernia
molinette
juwad
falic
poisen
kaaran
chiorean
bekic
aurigid
dayclub
personailty
effots
checkfree
gulfiya
handspeed
bensignor
parascending
kuthep
stupidty
highstown
cisas
betrán
bantleman
kausik
chemabwai
newarker
bendict
thrr
lincare
hyperglycaemic
crnkovich
bumbuli
boizel
redattore
deutschneudorf
flemm
subesequent
telephia
unicomm
hcws
oculofacial
tverskoi
leitzmann
artl
syncfusion
mistime
dialyze
fartun
spitzers
uimc
cenx
nannied
farache
magenn
abdurraheem
wehdorn
lyonheart
clinc
eshkashem
valmiro
preists
nnuh
deffense
shbair
medzini
ymwelwyr
andrini
dhaene
lubing
insolated
nermeen
ciasa
bapen
porreco
electrocaloric
headguards
sunners
sambodromo
mflow
mazrouie
ashwaq
madeshi
skains
appeasment
moosekian
stallkamp
koumura
moelmann
mambisa
amoaku
profmedia
charania
khudzhamov
willowbrae
gruppuso
pyland
wotou
windparks
kuleba
confrim
excelcomindo
rsponse
oldborough
trozzi
cclr
mezyk
tamerza
acresso
nontreaty
beades
quac
purescale
equasis
sychronised
chiappero
nardizzi
basketballing
psvt
muhid
nicastri
okbridge
deepish
dryvit
folllowed
noblin
eleasha
oredered
illiberality
yooman
hlavackova
shavoo
bedcover
menkhaus
pneumovax
kaczyńskis
hellems
ezdi
karlenzig
etoa
beleno
kutnick
superdrol
petracek
unentitled
wernke
doolittles
financialy
carlshamre
malwal
midcounty
segei
medspa
justyne
liebermans
jalkh
sherre
cubatao
guarneris
sclub
kidger
neorest
sariyah
webbley
soldatino
strojan
antimonarchist
retrocessional
demoraes
roeslani
iprospect
manior
grueter
treleigh
senoglu
reminicent
ruigang
selfconscious
tourie
mermel
harfst
colchicums
kamennoye
boczkowski
kesayev
mcmafia
neoucom
sawczuk
sanjaa
debiec
tennessen
beinazir
isarel
beelines
maciulis
retouchers
inneralpbach
sacrédeus
bravermans
photiou
kalosha
underwhelmingly
woodlin
gracewell
rosettabooks
pinchinat
milette
mcnaulty
patternings
favourtie
kassou
bushite
rollermania
nicoles
parasiticides
profferred
kuanjie
wysham
geofences
stiftungsrat
plansponsor
volumns
korobko
emerling
bollom
ruqin
zulman
fanista
cribiore
sacai
elusys
mastrojanni
cbae
lebasse
antigang
glenogil
timebanks
compazine
pfeg
bjoerndalen
larghi
mosic
textonyms
pangaribuan
chilangos
fitriana
openlands
bearnson
cholestrol
transmogrifies
galdamez
courcoult
maxxed
tissy
leibsohn
tavassolian
hssv
kilwilkie
feyness
dukker
wichterman
feldwehr
pursuiter
rabasse
khurmato
emfl
kurinsky
prissiness
loute
mcube
tenereillo
lithesome
dessima
bandelow
unslaked
cervixes
rehill
cxos
bejiing
kudwa
noodled
palillos
hotcourses
wusthof
xaki
youfei
drumgold
photonix
expeience
coarsens
bluegem
rightious
siwula
boughter
umicevic
brainlessly
bourada
roohul
molvar
altchek
adeela
bakfiets
zhuzhu
laughlan
mehas
transelec
touxagas
denove
reyeb
spytalk
camptosar
tagget
bofe
halatyn
furlanis
diwedd
tantalised
zoph
padco
tregoyd
charewicz
dexcel
uimm
ghafoori
choccy
arnum
fluery
racr
pdpn
flyp
liquavista
osaar
rasshan
insean
cokeheads
choren
crocoseum
counterprotests
spinboldak
vartazarian
swanmines
miadich
roguishness
pozan
zuola
dextrously
seniorlife
mornie
radioterapia
torland
stommes
restituting
zeligson
peditto
puntarelle
techoperators
pigpens
afores
carfizzi
ellingtonian
bhuji
hotelkeepers
zarki
kilchuimen
agundez
decapua
hollidge
wolfblock
chourouk
awasi
strangehold
slangman
honeybell
nicolita
glastir
drenewydd
vitriole
ziaee
gcapp
blns
badio
aaxa
yurtkuran
panagopoulou
omalanga
czekalski
elsesser
ziruk
kremchek
yuchanyan
voikkaa
picoplatin
turbull
carise
granchildren
slyvia
gragan
aburedwan
etholiadau
spastically
valdeci
gielsdorf
dshe
housecleaners
esuri
axelbank
syllabubs
welchol
centergy
jarski
bloodsteel
eletricity
kayiranga
markwalter
zuleger
janusek
berenbeim
kresak
zimrin
berrones
slappable
ellemford
gitwe
czarnezki
kraehenbuehl
portmeiron
aberly
boathook
yitai
sugule
arcomadrid
fleamarkets
findwell
chuene
posawatz
vredevoogd
gehlbach
karnavian
darlingscott
pecsenye
cucurull
lenkowsky
hbts
dionnes
diagnositic
meelia
frapa
autotask
valtech
fcea
algenis
pendyrus
minoza
babyboomer
leynse
bebar
souleman
edraak
guility
aquatheatre
susurration
vancl
rashaud
breitzke
dimbort
tearfulness
heartens
resk
spikiness
dhamankar
smartgauge
ecoguide
pesälä
bernstone
lifethreatening
serrah
amom
zimplats
downhillers
habtu
yaguarete
remmereit
chyrons
dipkarpaz
babcia
blumke
ortayli
ramunno
dobnik
kathleeen
hedkvist
sumisho
commtech
munzu
faultiness
hitchcon
carapetyan
smajic
desrosier
fetherolf
heidtke
sankhare
spylaw
scarpinato
tting
novilleros
genomatica
isales
takham
thcream
ayou
lvpecl
matok
kandic
unpartisan
murphrey
outmuscle
rakotovahiny
pucic
hudnell
matchams
chandrapati
hepatits
cardies
meridium
devores
whiskering
profnet
scragged
fatstock
rathon
cremant
unequitable
wasserhövel
oclassen
gorevan
lamole
cochina
groundbreakings
thrummed
shender
toyoto
jalrez
reynierse
miklin
mobilo
lccj
selular
toolmark
yamnarm
mycal
tarduno
jetpod
conguillio
crinolined
hulverstone
imaps
freewheeled
orrf
ltip
awaa
ioec
nonsporting
authorties
midteens
koshering
odekerken
mlbp
gardenswartz
leydecker
chiantishire
manouevring
ilfs
downpayments
tongqi
mccartneyesque
mohammedawi
syndroms
underalls
tumbrils
inums
shamrayev
malualkon
accountablity
eglwysfach
leitert
brunzema
hobet
sesow
laddishness
taroczy
minifestival
stellmann
unforgetable
lumengo
autocatalysts
wyko
thmselves
torce
smithamundsen
buhusi
panjiawan
stoeckli
cleret
mummbles
accorinti
becsey
qawas
olesky
rosokhrankultura
raemoir
ntcd
saewyc
maguigan
perpertrators
exploting
jasenovec
carparts
shirtcliff
yinquan
ccur
factotums
oddment
elsinga
ringdroid
esner
waterhorse
demerse
kilfennan
helath
dendrobaena
scattaglia
lsquo
carzy
yousefzada
superest
boemio
saflex
nominatable
sterlini
eaken
dreamport
xprotect
mlynky
neofonie
kennerson
pencarnisiog
hostaria
nisoor
prosun
nitkin
kutznetsova
jaidan
wrapover
shlyakhturov
manpowered
skinful
caspit
tenerians
unted
misfielded
avereage
hyperpartisan
caizergues
zooamerica
dragoshi
famesque
ampyra
anworth
semenggoh
bozzella
themm
bachina
laoisa
ecotown
irreverant
manida
askews
npbc
jaamia
subsititute
bmxing
usteleradiology
theatened
naftowe
gazownictwo
ungenerously
gitcho
feguson
rtvelo
ncna
gastons
kittas
tronical
amankila
ivorys
fredrichs
netbenefit
knockbracken
waspishly
xianchen
pricetags
jpel
mayercraft
arcanely
natso
malsin
abdulbasit
hspr
gubrud
opport
whethe
caregroup
palestineans
shareowners
tolva
eleia
rachvelishvili
langr
thingamabobs
karokh
onise
ehrlick
bioelements
ringeisen
dalbar
drync
equiniti
eurek
misunderestimating
coolbox
platenik
ugresic
wideroos
noodge
lyko
mirken
mceg
khabari
lamastra
kelsee
onecommunity
torihama
subprimal
kolke
hollock
sonchai
demandforce
talebani
hothousing
mtuze
preannounce
dulevo
hgsi
herzilya
beermakers
ritzberger
vinikoor
sarwakai
petumenos
flexeril
sasazu
depandi
englad
kriess
freebirth
prosty
copperthwaite
demurrals
asfg
jayousi
timorousness
desagneaux
rubbered
ketchman
faintheartedness
newertech
alexiadou
auringer
hiltachk
powershots
dowloading
vandegriff
goht
twonk
jinye
flickerings
epzicom
ndundulu
baytowne
xingyao
rafaels
rogueish
kolodiejchuk
omio
inadequates
prepregnancy
alarco
franni
schoenkopf
caddishness
gisolfi
sakur
yockel
orose
libanori
gensemer
westcon
ignatowicz
jackups
sherzinger
rowlen
stöss
flynet
pottruck
pochettes
olhaye
ferrelli
dzud
systen
earnhardts
trusk
kahlers
xiaochu
clearone
lossmaking
washway
dessertspoon
desteno
fedup
gestevision
unasul
bandukwala
souty
overhyping
buttonholing
norrel
kfsl
afourer
teethe
wildthings
vreeke
oskana
judici
holdalls
duebendorf
sentix
reweave
chiffoniers
hamminess
dahiyah
felbinac
whitepark
snuffly
hingo
chakerian
goranin
llywodraethwyr
kaskazi
mildewy
fishn
unlabored
aguilla
centralisers
eltish
consid
kitrell
chowdah
limitus
hasabe
catepillar
gradinger
pachico
adscs
circumventions
scelles
grundies
gralow
presumptous
colibria
securit
dagle
holdbrooks
talban
candidte
methyltrienolone
mtdf
dexton
avastar
democtratic
srslabs
penotti
joesbury
amurs
weskott
perecentage
fracases
pelusi
srmf
fxdc
becalming
deglazed
subach
multiheaded
cordesville
tupelov
kazilek
decliners
kamlapur
dottiness
stingier
nokias
amhad
oxandrin
anavar
gball
mchughs
magliochetti
dolney
sendme
ungallantly
chichontepec
lungcancer
belpoliti
toture
chengwen
tremblingly
posniak
mylincoln
acrosss
kablia
broë
chisnau
yedl
eilperin
putvinski
microneedle
pituitaries
deconditioned
signina
droeven
gerwing
approched
mpet
acheis
iopener
aeoniums
culinarians
earthstone
nonplaying
pesquidoux
cephos
tpmdc
borgesen
fidgetiness
mettin
pandemicflu
weiselberg
cherbak
bernel
econorthwest
juddi
governemnts
pidx
lonce
swapclear
naeimi
norregaard
ropel
naypidaw
geziry
jacquemod
scrutineered
amiridze
schpoliansky
arbeed
rompza
cemach
conformability
zvirgzdauskas
ezeilo
arrancame
marquell
unionising
alexico
trooien
moffic
mulvi
darouiche
nonambulatory
bamajam
paddleboarders
wellses
constitutionalisation
transgastric
untaxing
dmtc
coplandesque
hcom
fabulosity
bositis
conferenc
nemchin
twestival
fitten
galparsoro
oxymoronically
wallrath
crewton
laskett
acquatic
guarante
reskilling
ndrn
davaughn
pterri
joltingly
kidology
stonley
vrwc
alaso
balkanise
baitzel
nighthunter
monopril
imaginit
ofthese
borchering
fashu
bhekumuzi
jaboulet
wallowers
yanxiang
mobilex
mouthbreathing
canburg
mfpc
viering
mehltretter
koppman
nitelite
antoinne
diliberti
vieing
tyquan
coisir
clamminess
tacoli
applico
hermalyn
mishta
solnick
foscarinis
euroferries
demys
ecba
vicentillo
overtrain
berghorn
sageer
eurovignette
shelepina
ccips
myrgren
stirches
stanski
carrascalao
mascs
titanyen
nerding
masergy
careplus
vanore
panglin
nesoya
holleis
illford
chequebooks
candymaking
isau
kikkerland
embangweni
junhasavasdikul
overate
matwalli
notthingham
hypocrital
plaud
microbleeds
bicknese
noncompulsory
hopgrove
marvak
telazol
potashnik
nccnhr
wobmann
bootroom
overdependent
sloa
fascadale
pippens
palying
jubliant
compitition
lincolnesque
rnst
commentaters
ikhyd
chapping
mastalir
boxiness
sizably
typar
wolohan
ifric
buhrle
worah
satawa
hipbones
kurdin
unclenching
hafeman
drewesii
prefinished
impsat
mcgarvin
mcgillian
trutherism
chamelecon
pointroll
pubugou
sobai
dorfmans
lecos
spacebook
drearier
pliés
verrusio
deliever
kujundzic
montesque
corchero
achutan
behavin
bezengi
guiming
wethersby
gaisensha
paradossi
stiltwalkers
yerushalmy
trundler
divadom
bukiewicz
kunders
thones
songyou
ceans
tutenkhamun
israeil
kerrigans
pudner
otouto
moracchini
petiteau
astilbes
qforma
techlab
swffryd
carmelli
dopilka
unshed
homless
resolut
boomgard
pathologised
qhse
overpromoted
pooched
bidz
rumiz
astea
olesiak
limara
spatterings
btig
stakehold
medsphere
peacfully
musiwave
skivers
tugiyo
assegued
youssoufou
strycova
klundt
geckel
shebani
liuping
physiognomists
cosignatory
sheffiel
lisnave
topcider
misogynic
shriving
indago
koschinsky
theresea
kombuisia
senturia
kaneshige
roomiest
victs
straigth
kratochvilova
nonsupernatural
compasion
sidron
ekingi
catalyn
falbr
amtd
reminick
shamrat
rajapaska
lambeosaurs
hubalek
nonanswer
adulthoods
oihana
willenbring
technometrica
seabasing
gutseriyev
noncorporate
ehiem
saltone
studentcity
isarescu
shozu
rivertowne
teetotalling
caramelise
knezovich
mottinger
ewarts
stepgrandson
chattier
pigged
corprate
soundtown
dalixia
addenbroke
dibinga
giftwrap
saffia
castagnetta
nantongo
nonguaranteed
conventry
distastful
shatterbox
whitbreads
outpunch
bijapurkar
pospiech
proconsolo
kuznicki
llloyd
pontdue
piglike
krzynowek
murketing
guyliner
manocherian
gerties
uhsm
mohanned
relativising
jonckers
nedder
hissers
skordas
americatown
phongthep
strongstry
tedia
fluking
headily
romaeo
thinkstrategies
lomox
anytone
bigdeal
owlery
talkboards
dutv
wainrot
rakitic
eastdil
sirnik
radsan
oztekin
ghadanfar
havlova
quiverful
testwell
scanscout
killins
sovx
hungtington
tumakov
fykes
zavesca
cowboyish
taskila
reckson
mspmentor
stayover
kaass
blyther
hesselbart
wilusz
malaisha
mapathon
fisketjon
irisguard
énarques
hamli
kesayeva
garmser
sigmundsdottir
windoz
decreation
delibasic
hachilensa
weifu
naturex
petrocco
ochinko
scrumming
merar
mingkai
ffynnongroyw
planco
cakarel
verifica
cytotoxics
brezenoff
rrazz
sudairis
upshifted
semifictional
noncombative
sawafta
infantilisation
hartocollis
unefón
ratemycop
immunomedics
grotelueschen
trichloramine
gerbron
mcmeikan
twistedly
poorish
airyhall
ahsas
sungate
melser
gadael
bppf
unmonumental
nellson
banwait
invariables
lafemme
abdulgader
smobile
gilroes
cueta
weigandt
lomsadze
kourlis
hardstark
henick
cholewinski
nucelar
seepe
pelpuo
laleena
fishs
wenes
unrelentless
firststrike
expencive
klimavicius
abadinsky
chemaf
chasni
jeylan
raffah
ulsterwoman
pennequin
kruskopf
ipeer
tarraf
triyono
masserene
megaships
bohos
elians
cvat
emambakhsh
gruendemann
oafishness
karaghiozis
follw
alertus
wickramsinghe
filty
mirescu
barnesy
naffness
newstalkzb
skordelli
overegging
remortgages
lulek
boutari
sjekloca
tokman
fawole
kalonge
lovetsky
jakobsdottir
intergy
basestations
liittschwager
abusrd
brightsmith
orsuto
adref
canaval
portered
megabanks
saighan
finicial
stulla
theword
skolar
alvita
ghoryan
clunked
christmasses
thimerosol
desserich
europcr
dockser
trebble
slart
mereham
talalla
finace
mudroom
burritoville
eslake
soberingly
behid
imperical
harringey
komossa
miyasaki
nakedcapitalism
pozdnyshev
fahrenthold
chimerix
mydamnchannel
vpci
sbdcs
wrecklessly
westernone
hollekim
colaprete
nomfet
industralised
lunkheaded
bueser
beckstoffer
palaiokostas
kobylarz
carslberg
oweinat
coreco
derveaux
cogitated
jocked
maecha
ryavec
utlimately
supermagnet
cyberoperations
pinces
pushdo
jannice
photoespana
glittenberg
repulican
ironiya
pursuading
globalgap
cesmat
evangelicism
spafinder
tchale
buchori
kaelo
kozhov

gudele
gladiatorus
sanez
flipshare
civilianisation
firek
egoscue
phaneesh
brayn
gocycle
spoonists
gulsun
aestheticize
sermonise
kondic
ekasala
helitankers
newlins
budgetting
younsi
funktional
jaffre
utccr
gyrion
erold
gandas
disinflationary
tenjune
xpressprint
lafig
clerkish
gargonza
amidan
kolevar
eatings
nextage
helthwyzer
niehuus
rashun
attorny
mothing
livant
hocha
calafati
bouillabaise
ergil
motorguide
deathbowl
kopeloff
downclimb
bookcrossers
antonias
kartin
consumerfed
pourquery
tzds
atlanticists
asmedia
besmillah
makansutra
mikulecky
commercia
angiddy
hirable
nantana
suzella
mapeley
sixti
tankosic
lscp
kirsta
nonconsumption
nuturing
fujimorism
securitising
flublok
zizhuyuan
comunications
hollinquest
costermans
printables
awail
wheezed
pyonyang
miksche
mileaf
buppies
eurotelecom
govnerment
curnick
sarbayev
rightfield
lumie
teatimes
nplate
safco
wobbliness
szul
keysaney
piratbyran
cumbiamba
ruppy
ricucci
cottoy
darce
czitrom
butterwegge
mohssin
gloser
jazzes
curatorially
thankgiving
brazlian
liant
anaide
wonderfall
sevc
badnaban
tristans
sulskis
seliverstova
flokati
bpgc
sorabi
asaish
wokorach
socialable
partnernetwork
glunk
nmrx
manazel
ennico
kinro
nattavut
dietziker
noodlings
dadoun
pettazzi
unspecificed
txtng
alburn
normany
extenstive
maarib
upgrowth
niklison
kickings
graffius
impressionability
poitevien
rittgers
bawb
edrms
baltanas
tairu
cygielman
aution
terissa
treehole
incredulousness
fueding
russound
kabemba
apharwat
jodphur
rsbp
martabe
goalward
qeemat
wutke
sateh
healther
tremolandi
zhuping
makone
ntdoy
piranesian
imomali
fachette
pammie
voytenko
dsgv
haferman
mollart
giorgy
najja
asheley
valups
roseen
whitticase
alesco
acorss
fawziya
hatefilled
fonté
eastbanc
arenot
saffre
arthroscopies
lapucci
detoc
tapjoy
gastronomists

abent
smolinsky
spisto
visitflorida
tiaoyutai
gweithwyr
shabdarbayev
kuechle
indata
behesti
woozily
sumray
mmbo
sokudo
kachine
hupert
talecris
namerow
thistly
peyankov
pondel
homefires
kermits
rssd
dursts
scrappier
gnaas
zewdu
stubholt
migl
launchbox
buchaille
jerera
cteep
gillygate
cosmetiques
mullahey
tuffers
slepnev
epiderm
solantic
torjanac
vslas
pelletised
porini
comraderie
caramelises
duans
roguy
lehwald
fehilly
zimonline
atiqi
huorn
komesaroff
taxs
carfare
mittleider
summun
confiteria
losties
kanunnik
cyxymu
shular
driis
mccrickard
pluff
volpes
pittances
katsarou
delgardo
schoenebaum
shantia
winnner
asadoorian
overcaffeinated
happyton
kadetsky
megasites
gondas
jirong
ticketnews
cockshaw
cubias
bannissement
kuduru
talkathon
yrcw
lecarpentier
munsef
posiblity
winpro
jibriel
xianchun
mcblain
clendaniel
chessum
angelisa
castlebrae
nassiriyah
drillable
ingérence
madakhel
genkin
devilishness
varicoceles
protoconsciousness
evoras
semenkovich
princella
listerial
ayurvastra
horua
rumbak
zonko
bradshawgate
enternships
chestertonian
nervier
thanapol
zachelmie
barasia
cheonsu
locasio
encoring
splotching
wintergirls
delbianco
kamaron
gfig
stci
bellalago
beushausen
birdbrains
foodtech
suleimaniyah
goshdarnit
cimab
jamere
lalchandani
olaciregui
almspeople
yuwali
emelye
mncn
abbyasov
trinton
nainima
bestthinking
policinski
stovepiped
allick
elagina
wiedl
konyn
machanga
solartech
descouensi
coffeepots
roomfuls
mayhane
machikhel
myfaves
masoum
paniza
bentleyforbes
elemendorf
mithering
nogee
hillfarm
gordyn
okeelanta
haslehurst
theemithi
dzierzanowski
niveen
bedoni
knottier
chanterie
miniucchi
buscato
sungwon
goolag
lmag
unsnarl
spectable
kichel
potholm
disraelian
palacerigg
coronor
gloton
newenergy
meggyesi
moniquet
datagate
drapey
arterra
tandrup
summersell
cieba
pospech
blotz
backcare
antalina
ruchei
edgedale
maliti
weaked
denrées
arshadi
niehus
julfest
passionflowers
tillema
romenskaya
muntazer
takashio
teamsite
nihiwatu
lesak
yubraj
veltliners
veneficum
rusalca
anthrasimias
forticom
jheng
valencio
rouhalamini
minwax
deunta
eiderdowns
goedjen
jumptap
badric
interceed
linburg
resue
abdalhadi
canden
hurder
microbia
iljans
kaleja
governmen
libanan
ubran
dowgielewicz
fuzziest
varlotta
carepages
chuckin
puthukudiyiruppu
kourtni
hovatter
detyen
languir
adviceuk
plca
baczkowski
drcm
totico
dwis
compnies
zelenkova
tohani
gphin
dubovi
usbsf
tribalised
maqtari
ghadafi
hhrs
intrinsa
eberli
ostenberg
soaraway
sancrosanct
texass
brownen
premeditatedly
saucepot
cityvista
atifa
wilfing
wastefull
duchowny
yelpers
sestaret
kruszewska
floridans
rogeriee
photofits
antza
moyad
simitra
rationalizer
sliwka
zianet
servents
dollor
agritalk
rechler
lhag
edesia
eyadou
cherylyn
ﬁrms
parasiliti
laxest
superweeds
zacatecanos
frontpoint
pedlosky
splittism
deroch
waiflike
extemporise
stridulations
kadaré
terrorisim
eichele
phoslo
jihangir
zestier
mundorff
strugging
mynmar
sedotti
desertscape
cheongdamdong
chernof
slyusarev
nulf
lauryssens
krizmanich
declinism
machreth
solomeo
wordle
vindico
slumdogs
humilate
hydrapak
grandclément
parapropalaehoplophorus
passsed
targui
alsanea
pacquement
shaed
caisi
elemis
monteleon
creepfest
xoft
upshots
genery
khadivi
liando
vidana
rukmangad
blaymire
mathiew
trzeszkowski
schimelpfenig
brechet
zhoucheng
baugham
melera
verschure
prority
jarping
kontarinis
gypsyism
soodeen
transracially
stuker
catwalking
maislin
wsri
jreri
dornbracht
betani
mocal
repercusions
glyver
yahyo
bnrt
chysler
rajprasong
mompreneur
pennypot
batakovic
chouquette
duesler
mangelwurzels
poethlyn
douple
bezzubenkov
mimmick
newsosaur
famis
ajorlu
updraughts
vanica
nvó
cornettos
eonnagata
shaboo
bouchar
haimowitz
casanas
padmanathan
hironoshin
mushey
investgation
punchless
ibaceta
catney
sheehys
loglisci
unerasable
gehua
biznet
modirzadeh
leuh
azade
splattery
aastrom
sparlin
humad
acrophobe
sallard
kastanek
mailbu
pazornik
onieva
slingjaw
franticness
gureckis
copolla
okropiridze
ritziest
giulani
bikinians
bolofo
reprioritizing
storewide
torreal
craigielaw
crouin
huppuch
daranee
mokhiber
rsvped
cabrnoch
qutenza
khvalynskoye
quivery
shigal
clintec
abeysundara
pffs
lizak
laminusa
jcct
lvas
medialunas
arnots
bufali
hadzidakis
quanxin
maeslant
healthteacher
vrijdagmarkt
kotwani
svns
woldt
lyron
takirambudde
transitcenter
ratemds
berekely
thirstily
halusa
craythorne
paupiette
kannam
obetta
clct
girifushi
shiratsuka
goldshield
ikejiani
monkerhostin
waldmeir
donatelle
abrevaya
blistery
stockrooms
ooten
kalkilya
caplis
postrelease
orthodoxly
mdst
pitchforked
sokolin
nosazi
colat
jeapardy
mitschele
optium
shalaan
shehanshah
ellenbecker
varshavyanka
schulzes
radco
izaga
merscom
feistiest
krindjabo
returnables
hospenthal
manyani
powerbases
headiness
asankya
farakhan
eldonian
innovent
electromotors
longeurs
nuhanovic
intersouth
georgens
kissman
larfaoui
harges
pvcu
cheatam
rehabiltation
ufland
gyeonghui
qutbal
meleca
beibars
chermen
abdukhadir
iois
immodium
poeticizing
hyvonen
decosse
herriges
mallandaine
cosmotv
trajal
lankapuvath
overoptimism
chrysohoidis
vujadinovic
biomerieux
beleivers
wimhurst
bleecher
subserviently
nonart
footlik
almds
sevastopulo
ginandes
humanbeing
nonowners
chinses
aarone
hulser
weatherwoman
fatas
damnosa
wheatgerm
megaraid
galatolo
ssoe
hubbies
mccrimmons
financialservices
sausan
milwiki
gamboling
atender
outhalf
amaroq
exillon
rafert
tactis
priobskoye
llywydd
strozyk
mormoris
spiegelfeld
generousity
mybo
dogeared
lucato
montamat
abdilahi
fvpf
dicharry
harridans
xbot
arbinet
nooristan
wantchekon
shaleum
chocoholics
hankerin
shigakogen
moscaritolo
schweiterman
idenix
speac
pazdur
mouthrinses
expatriot
scrunches
osgoi
demiglace
firstflight
ceiron
gxowa
yolan
kayrol
lbdr
beluah
shaggily
furlatt
trochez
devakumaran
superglass
gerstenblith
sqw
sadafi
keelen
sicherer
expostulates
esfandyari
nombo
bozhong
coloroso
womblike
nonjury
karadjian
gloriousness
clasketgate
durgahee
frontbenches
thalesnano
gogola
snildal
karradah
machnig
rabeling
revus
bairey
vizjak
palistinian
predicter
collegeconfidential
grassmayr
losingest
mereenie
piong
systinet
denit
smartcar
batzeli
tiumen
phfa
gormez
hursting
kither
insmed
hassina
sterilises
karukinka
gavelled
stonner
mobayi
pspn
toyloy
transatel
bayoil
regenia
tangerini
bravata
zwirko
natrecor
caravantes
marset
zubasnabar
ainain
wizansky
hulihan
leendertz
abeckaser
massarray
trollops
soooooooooooo
shoushtari
putaway
seeen
edul
lovain
dopier
distration
rosenholtz
loreille
techlink
sasmaz
loebinger
reiging
bonagrass
iupa
smykal
etoken
railbirds
concusion
wawruch
digene
forsees
etinde
marakon
gastrotomy
kielich
yuewen
roffler
follath
jerriais
schizophragma
dasic
buppie
sacharov
peochar
hertzke
bargaal
marguarite
duboef
constantis
compeat
jadaf
mgas
loumia
hiridjee
willowcroft
railpass
unwigged
downgradings
ottenhoff
strandheim
hadjimichael
ozgo
komkommer
honeybadger
rephased
respun
hicl
nereyda
veltre
treebhoowoon
duirng
sharethrough
delre
ballough
reroof
perevi
limra
maestrecampo
eorpach
thearon
forestries
zarandeado
grimana
pixol
findomestic
naryanan
neimo
preseren
violari
polymedica
cmops
mascola
kaltbaum
flatscreens
inkdata
fateyev
ashker
sisso
furda
norlandia
cprg
contnue
saroeun
crystaltalk
budaors
yessayan
magwire
promming
mobissimo
nwamitwa
planemaker
dinampo
mojie
redpants
umberta
enchainé
mealbox
parliamentarisation
summerley
damiran
machanic
kalivas
rengin
ganswein
gjana
credu
isturiz
unreachably
mcpaper
sherkan
alure
emrose
bioterrorists
debussian
eurorealist
secretarygeneral
incease
irrelevence
stemis
winsky
retailored
gbod
airdefense
lulejian
revarnished
dreweatts
krajnak
scoffingly
zengana
cardpoint
boogieing
debrecin
vorley
dohse
khidi
staidness
coicou
garabitos
wadir
hadibo
rutskoi
maureau
morkūnaitė
hoofy
alkifah
chioce
staloff
fireraising
ultrarich
undercook
iseminger
commentateurs
beanpoles
jamaatul
juleanna
mcwalters
venofer
twinsuk
larium
hilots
rakotoarinivo
nishta
burtonshaw
operationa
nozad
maciborski
dankest
ungallant
rippleffect
bonstein
ideaconnection
asrgs
wilpower
makahs
axcan
fredik
haredit
soltanifar
unpragmatic
basestation
battaglio
auffhammer
minendra
scoreable
tickover
pgba
qualita
petraske
balsille
garrera
sneineh
meleta
talampanel
woolliness
rfma
bkis
rummyroyal
blayn
receeding
misalliances
scruntiny
qole
socheat
hfdc
offp
chemobrain
boosaaso
aebleskiver
kitschiness
slatterns
afriat
bulletproofed
finaid
kundai
quinceaneras
sammys
kabati
kambeitz
swingingest
risonare
marinates
finkelshteyn
cieszynski
xeomin
kharadze
donathan
blumka
asiko
handblown
excellium
cavco
kabealo
nfzs
macedone
streetcorners
babiera
krepon
ramages
pulles
blanas
turkomens
provopoulos
superhigh
udenze
applehans
deyun
ibbo
catalfamo
pocknell
ppgi
rpcl
goodhind
lynchpins
gobbing
madail
gueridon
heped
islamicise
oders
karaiskos
silkily
buesing
kabakumba
hsmr
muehling
kollwitzplatz
jande
pocp
hindsboro
fclo
mccarson
braininess
astringently
zwetsch
quavery
ravussin
fundementals
angiomax
queasily
ojul
patientline
crabbiness
martiz
yturralde
jalang
solvesborg
hortsman
balber
shintos
priorties
meteast
leciester
priklopil
chaorach
admetech
victem
baciagalupo
jurade
reinherz
draskovics
tamau
loungani
sphp
metalline
martern
foodbuzz
junketing
chanomi
mofya
blogdom
dongfan
altintop
octagenarian
watercoolers
arpil
rahate
ncys
innefective
bellying
cochetel
oversexualization
flittering
downdraughts
abduhl
kennnedy
watchetts
rebundled
dornic
milevskyi
chengelis
engrain
loretha
altiparmak
belto
curanderas
smwm
liberalizes
hellga
jassat
demarzo
ngxoli
gerlis
butmalai
execunet
windsurfed
exected
mgma
eaggf
sarway
ultrasonographer
scoill
ontak
laalou
orginality
surfcomber
suchlicki
soulquarian
seffrin
earthport
houplain
lazere
cyruses
pasich
trimac
kazanjy
vernakalant
ropeless
muchdi
honberg
calongne
tutterow
enstar
sheratt
luftballoons
shinbrot
unnessarily
mcclanan
andwas
hernried
prophy
corporat
schield
unclenched
mobots
nusairi
hollensteiner
meidt
chimmaya
hanunuo
tamica
helicoptering
surpirse
myhal
himeslf
stroumillo
kannady
llanederyn
menstral
woelfle
peruto
arocatus
annoor
niemuth
decailly
unjamming
hogendoorn
carlike
muzdalifa
voigtsberger
kuloglu
hoors
velocix
goalside
vasys
finchatton
coffaro
zhuwawo
andrelated
myatts
truscan
laulan
jariah
rotflmfao
newsevents
billgates
dranove
risibility
qpsa
queenies
pauperizing
challeng
cynnig
chadborn
daelman
kaplanoglu
airwire
simpfendorfer
fleischli
meuniere
zeltia
mardomsalari
kamlari
peripherality
malhorta
pierazzo
comitment
silajdzic
smolarski
swithering
radovcic
twmc
rosselkhozbank
przybyla
datacash
kabatu
blynyddoedd
gasifies
exploracion
daliesque
minocha
hempels
pennenvironment
fawcette
guarinisuchus
bingers
currupt
sasseen
parapattan
banteaux
mhip
cinebarre
dudina
yugosphere
stiltwalker
chessen
dubit
dotmed
dahaf
helrich
wishfulness
iacopi
gongmeng
pamperin
tallymen
tulawie
reverance
sunoo
wakpa
vollet
ekawat
economis
blaid
pecnik
bebear
chimtal
gerogia
bickson
avroko
binkos
stooksbury
chinula
maxxaudio
dcar
datscan
gandrange
wearingly
nadesapillai
horribilus
magavern
zatorre
wheelgate
viprinex
strohschein
deonta
barchana
estepp
rouman
afssa
mingenbach
almahata
campaignin
dailin
misvalued
samanthas
klitscher
sirantha
couvillon
vouvrays
schlosstein
palmour
tromped
nextone
nazirhat
bioform
pollers
twitscoop
hrebejk
gibbers
wieske
planéte
coachin
chàvez
kenerly
qedumim
merily
medfusion
mirwald
cytox
dŷ
aizhixing
iapac
ofits
tsagaev
deconcentrate
skyseer
arway
tezal
esary
rabenold
mellamphy
camisón
mattaliano
shibor
marrapodi
cachel
rossminster
mceneany
cethromycin
reconstructors
airil
hibe
moderaters
schenter
anfac
vixenish
coolcullen
taunya
dzhakishev
olymic
falcondrone
gulbai
fondamente
mazurak
khuloud
pbteen
grindler
unwhipped
kulcinski
heterodontosaurs
siegessaeule
prosinecki
charisa
riness
manssor
tcktcktck
nandrive
dularge
marazziti
gethi
monocracy
snarfing
wolkind
jaffeholden
strummers
totties
brulant
jmma
dgacm
fcag
fastiggi
mansionization
hospiscare
floorcoverings
rovit
durrence
obamessiah
sanregret
underpays
healthcard
zhanjun
rhones
koolhaus
hatworld
mubai
ocelote
fimmvorduhals
felco
ffliw
electrosensitive
priebatsch
hightailed
birreria
emster
gussying
arjuns
shirtdresses
borthers
abajian
badros
invirase
wellnet
detzel
oxby
lesnicki
photoshow
appenines
urunana
belyeu
menji
homeusers
unfashionability
inemi
badiani
leakier
ylan
pennisula
neumuenster
gravatai
kaukenas
synnwyr
insensative
spravedlivost
proximagen
fisca
lumeng
wilhemsson
polakovs
eglseder
overbred
socialiser
hadil
franticly
buffings
brashers
youngsun
sadove
sokou
bokas
cordozar
nucks
madobi
lupanare
sajar
intelence
swarbrook
graessler
dammerman
trony
munkley
gadabouts
viazzo
computors
coughter
raceco
allopass
amplifon
gladinet
volnard
avaliani
apmss
menvielle
naroua
shumeet
sainbayar
hanakamp
roseberg
alsol
trugreen
kuntoro
nightclubbers
laox
viitasalo
janetzky
settting
afdo
bickelman
dicier
handsewn
huipu
datafeed
maidlow
seiman
marchioli
dragomán
dilyn
spacie
wigoda
victimises
consituents
iradi
felipo
cuciurgan
memushi
hfsa
munnoch
arachnophobes
curations
miskick
marokvia
nzonzi
shengen
bluesign
progession
mossbawn
unshakeably
wafering
ghdx
minggao
pityingly
caranobe
usband
wafflers
vuth
drumlike
gorgadze
hopefullly
integre
cattlewomen
eborall
meritocrats
realny
snoxell
tarenflurbil
knuchel
sheliah
kedersha
tearily
trhis
siliconware
nonsampling
angore
grundvig
monterubio
ukli
burkwoodii
yibao
namwamba
memolo
dogwalkers
extemporising
warpspeed
phillipon
superzooms
requete
malbecs
hilterman
rakatan
carnick
basuta
trbic
gentic
unblinkered
godean
cundle
guilliams
misruling
navindra
johndrow
therapuetic
denuccio
ymlaen
mothersole
saltry
poppyscotland
astoundsound
outlaying
thinkbroadband
etzkorn
gulfood
twittelator
satchit
nlihc
nowling
arlingtonian
siezure
majonica
ondracek
cenzic
povall
unsinkability
sassenachs
koterec
undercapitalisation
kizilkaya
chauzy
kovalevska
zzzzzzzzzzzz
chinaamc
labioplasty
yugonostalgia
irregularites
coldstreamer
trymaine
cupless
olphen
mabledon
minicucci
kriegstein
sonorously
tawnies
tazered
iddles
techonline
trygvesta
randzio
aquavits
khinda
pemgroup
gerontocratic
rawanchaikul
tactico
jermie
conciliations
baimoensis
braques
kepkay
oteha
andalucians
trebing
hedayatollah
flimflammery
siennas
burghes
acaai
kevelos
finnfellow
perinatologist
muliana
qifa
wcdi
arapata
rouder
raulini
maleos
prezista
trickski
tarogi
waxworm
ivhs
petzen
guilermo
niyaki
realtionships
wolridge
parliamant
moneyspinner
saddah
eliès
gorund
goehler
lambah
dörentrup
twihard
gearrannan
pereplotkins
gravner
learningexpress
housewifely
maskiot
akhileshwar
gravagna
faiez
ueckert
gillingstool
pecoul
hirawi
mulpuru
bellozanne
zanner
khomeinis
cerphe
urgen
dinnae
funkyzeit
noldus
tailândia
marlines
azzarelli
johnasson
markridge
aezs
calderan
efinancial
unknowables
tekturna
jookie
daintier
intere
loungy
quoran
marshalik
embroilled
jaoshvili
screenline
testim
eggcups
mietto
waxter
monjaraz
coldhams
eurocracy
levian
calimocho
haband
twinem
talenfeld
weisfeldt
jsda
castellinaldo
undisc
neurostar
lagravere
godapitiya
rajohnson
lederhandler
camardo
alberquerque
mopup
chiclero
breckconnect
balatony
hommen
microlenders
acjw
faizaan
ilaris
llanrhuddlad
clinto
nonmuslims
martier
raptiva
parrin
kumbwada
rhiann
vitalo
grncarov
chevanton
runarounds
nobama
gttc
eurostoxx
iorworth
backpedalled
sinsuwong
rothwax
giezen
twangiza
djenane
shyte
ebct
brynllys
icepower
daozheng
unfond
cristocea
imado
carloz
nicois
nanomagnet
lemacks
kinkri
fennebresque
certificados
manufaturer
voldermort
iopc
curseen
hscrp
reumayr
strimmers
imponderabilia
capline
benecio
wainting
cibers
daschund
unpardonably
trafeh
emmitted
enria
simcor
amarger
nflx
qualit
trevia
caplon
eteraz
borosage
unamimous
cytokinetics
hailers
stryde
virtuosically
detheridge
tarifs
madhwani
buddenberg
mwangaguhunga
stickless
unconsulted
ivatury
fontanillas
maradonas
magern
jumpshots
baskis
celldex
ecovoyager
somaliweyn
sabauddin
conar
esye
akkaz
shanghi
tablescapes
betao
slarke
nonurban
lelliot
junquiera
executiv
ooer
stepha
schoneborn
moanings
yeezys
celgard
wiedenhoeft
nonsedating
munlo
lerga
ossies
jalaledin
knitzer
baudelairean
supremicists
intelegence
coaltrans
elahian
sences
aarin
zavadskaya
zéribi
housholds
bossasso
achievo
hervi
mcelwrath
romaniw
fosback
firmdale
mpdus
boliva
hammily
oetsch
ngci
grillings
scamster
achelpohl
harkett
mouthers
verhoff
crystalox
tarado
redinbo
minibars
plunz
kentwan
panicing
loserdom
muhamud
sufferable
vollebaek
burtless
powderject
geldorf
costive
unrecyclable
schnable
grogin
jobbies
newgent
redialing
xpac
swica
froideur
scambaiters
seropyan
lohachara
habibinia
skyroom
shiral
athashri
shukarno
posners
dakdouk
viznar
outswinging
carryalls
tescopoly
christoffers
mulyo
pgeo
snohetta
ruholamini
uyttendaele
fhlbs
freeflowing
suddeutsche
spiffier
maldeikis
craigengillan
hypercharged
snarry
griscti
sudnick
soberest
hlang
carradines
meixin
arriaran
secuity
dourer
mesenbourg
tawke
stoniest
cordex
seretide
depersia
cmpmedica
cmpb
rillette
attabi
magincalda
harild
mozypro
formenting
syste
airboarding
healthpro
szebeni
heartspring
skiiing
dumanski
jeneen
nycity
mahbob
womanize
rubatos
corrpution
karbovanec
omnisource
amcore
tonery
spaciness
flummoxing
knsy
ambulence
carmat
chazi
delsman
challem
achata
taukafa
altzheimer
bundred
truckmaker
braught
kurzak
wavebob
sartwelle
ristroph
orelans
syatt
sheddocksley
univesrity
cpzs
gravesides
pharmachem
adeiladu
haberdasheries
aplf
couldwell
fineran
nonswimmers
otarola
legaue
evacuators
mpitsang
fereos
esnol
mainegeneral
libala
afghnistan
qnexa
benoits
seviour
yenegoa
fineable
vermögensverwaltung
psam
facpm
phytonutrient
nurik
crispbreads
semilong
meresamun
tombol
andews
unfurrowed
bjoergen
millionares
anesta
tornagrain
stolzfus
emphasys
yelstin
goubaud
anjale
hawkishly
minuting
chifundo
hullis
lambeosaur
wloszczowa
sensemaya
sandalled
iversens
kleinhendler
westfalische
belabed
tchico
pentref
enmeshes
watanabes
idustry
reaggravating
predeccesor
marenberg
yahe
weimeraner
premajayantha
pennsyvlania
lightningcast
dickover
paccheri
kodzo
biggoted
undersupported
schulkin
identidy
foppishness
scrummagers
siphan
govermment
eqii
niknejad
donerson
eckaus
duncrue
shaghaghi
drcc
latisse
wackadoo
antisa
farcus
aerothermodynamic
aprilla
vuic
subaiya
mlam
boinod
lletty
bostelman
melonguane
orbicule
beewolves
reassortants
truanted
huler
orgias
cazmo
nyiregyhazi
upfitter
repored
yakumi
deptartment
minidisks
greedheads
documenary
gunowners
intitials
overfamiliarity
piquot
reposa
consecuences
helathcare
gunterman
shimanaka
steaz
cottoning
geraghtys
ikarian
fentora
kinstellar
selzentry
imposssible
phormiums
charrieriana
whump
boguchanskaya
boonekamp
corecommerce
zigun
naglazyme
galsulfase
scardelletti
sabatello
phippses
shadowood
nonfactor
felmlee
yanire
porking
misallocations
wiould
degroof
pcubed
kisiis
mucoadhesives
ablikim
ehsi
biriotti
misadvised
kirkston
asawin
rxnorth
legeros
giganticism
judaise
presentencing
juiceless
claireece
hauda
chelseas
volex
loderick
sarwary
litigiously
jacksplace
musati
ogunniyi
webathon
perbacco
unwinable
schmechel
dancejam
wayser
wirex
yakhdan
longhini
wafels
avinza
cutwail
girat
ziraba
jopari
thruppence
ssnc
droba
ngwira
sakalis
sukhova
cssiw
sokunthea
aptuit
islamaj
ameriville
laiping
dilxat
holencik
eeewww
masese
seafronts
longcourse
cholerton
toblerones
labañino
dushkina
fazalur
claymates
shambas
pelletiere
lucoff
chutkan
maximedia
bierbower
majimboism
redrag
bithel
hinterlaces
conveyers
amons
perplexedly
razziq
yaggy
takabuti
healthscape
popcast
tukuls
jaiani
travella
unseasonally
loofahs
skion
yozzer
duveens
gomedia
sspt
menks
critcally
andriesen
stretchier
shawana
ommar
dragseth
koziarski
lasama
imoogi
tingvall
klarissa
vidir
gtgp
tuvshinbayar
unneutered
poisened
hanesyddol
myride
merluzzi
mydroilyn
preseident
profounders
alafoti
stollsteiner
alfies
vendanta
killinochi
interuptions
poortown
metiner
bearishness
sousvide
psittacosaur
retured
casarett
lanita
aaahs
kimberlain
iaro
bischel
shortcircuited
foreced
yelsky
hematopoeitic
malliouhana
zakouski
couthard
marjority
mochileros
eavenson
popcorns
arkans
pochepa
inititive
dugle
srla
universecity
kampang
gloddfa
tangee
tanigue
wreghitt
boltonian
bournmouth
drapper
taekjip
hilgart
fullham
beefcakes
overeaction
stremel
alariachi
tarmachan
hasab
monaveen
youngwoo
kalimanzira
orneriness
fusf
mantuo
diavolina
hebior
rebaza
chelson
pyrolitic
cvvm
rompel
magliozzis
kossek
scobleizer
tyab
marghoob
exent
cavalho
spontex
itvplc
outrushing
cpfs
munajid
babymaking
underclothed
felloni
mcmurrey
kliemt
manghera
wichy
kinglas
aljabri
outgang
legalisations
jennilyn
toquam
charmil
multihyphenate
connivers
brighthaupt
minocin
monickers
argentini
hveragerdi
bartmess
grivko
contradictoriness
tacticts
ayouni
crystallex
impm
nietsch
shouk
uninstructive
pdois
haynor
karadzhova
atci
samsami
opuc
kloske
gillert
neverthess
stoiko
echenard
lagreat
tonier
bourtree
swaggerers
cloman
umgd
hejl
prewired
husbandly
eminant
deflationist
grytsenko
rodearmel
albosinensis
miomi
bellio
evox
ncade
literalizing
slanker
pteroptyx
sanudi
drmo
sulaikh
cannistra
eqipment
yvras
freh
cateura
weskeag
tycko
magwilde
raxit
renationalizing
monfried
bolakoro
wiederer
blunderings
gladwellian
swedishamerican
darstein
geissbuhler
geronzi
viviant
fierek
melquisedet
westmuir
chillaton
bellshaw
microprojects
sylviana
waszak
acqusition
clothilda
labeij
stansbery
optumhealth
khumbanyiwa
sentimentalizes
worldheart
tackman
qteros
smartrend
softsoap
pedrocco
sherbin
festgoers
crimebeat
shustak
maureece
europewide
wdrs
dysautonomic
bhagh
harpst
retailleau
kamide
ubinetics
cuona
archfoe
yingxi
mehlberg
chafkin
podrían
oduma
pieminister
macroalbuminuria
pinballing
poleyeff
angery
rheolau
defendory
plantsmanship
rajdamnoen
kemoko
mzembi
tedmund
stansall
mcfarley
käch
majolo
wadongo
muhe
wudnt
fliegelman
shedroff
overkeen
unloveable
anticalins
negotiatior
vellanoweth
noticin
manglings
coldstreamers
cahalin
stückl
dunmores
vanillylamide
wansee
santona
joshes
highl
grosveld
masieri
lahud
puglise
khatiya
mcbrady
ausp
quiffed
pandalabs
witlessness
nonbanking
relion
epaule
stasya
stoopball
stronati
ensequence
maufacturer
winski
gschlacht
ratoons
racinger
mateyka
berezowitz
taulafo
talanx
rudwan
myhres
frankfinn
overdrafted
welsummer
collaboratorium
tokusei
pohlschmidt
kazakoff
dalehouse
plice
gillim
mavizen
boritzer
aaaaaah
mikolášik
cendra
winnercomm
appbrain
ghappour
floodplane
staffy
allidina
mirós
mostaque
tantus
clienti
realstar
incents
sutiexpense
ziagen
lerue
lemnis
distractedness
ionix
sadlon
peformances
coppolla
taepung
maximiser
detkov
moiser
sobaru
eckern
hollywould
scapagnini
seymourpowell
vapnyar
skabba
sporicidal
moota
eddye
bearnageeha
levatich
kigawa
grassfires
wriststrong
briggeman
pdbs
unfresh
ffch
snowcaps
bialka
stajner
staycationers
moualem
wedick
disis
gochfeld
bhall
difinitely
arrar
pendy
contrac
smiljka
primarys
buckmiller
gourneau
anzorena
marcouiller
scriptpro
securest
herbein
brideaux
flotel
displaymate
baabaa
pharmas
tecaccess
gricia
economc
ginobli
litepanels
kingzett
atropo
hammerling
overmanned
nusseirat
vanwart
foraminotomy
arington
witzenburg
mamond
sinesis
leszcz
gwinett
thomand
oakervee
poundings
jieho
grippier
pushier
mersman
boogying
kwadzo
adrenalyn
eorann
nasutra
actigraphs
vasotec
prinivil
dsdha
alternadad
krissoff
meerwala
kiffy
cortera
birgfeld
ricasa
harmohinder
gadarif
pihlstrom
chocka
dotorg
lumpectomies
banxico
livecity
finlo
venezeula
fallbarrow
featherbrained
keiy
bergmanesque
unflappably
cornova
agossi
campiest
golagha
sunw
nusta
swedroe
kagda
kissine
realview
laudamiel
mediumsized
internaitonal
longevialle
leighninger
harshed
stokking
oceanico
guilano
poghisio
chaisaeng
estrace
juthamas
andrunache
nrbs
twdc
glitziest
hukins
astorina
beachbum
moushaumi
langlees
gantumur
chialvo
tantalises
sekaggya
gruebel
flowerbomb
baraclude
olimpos
mitoji
ineffecient
retière
jeannene
cowhey
trippled
sciencexpress
cubita
villiard
gbms
ceidiog
nativeenergy
inopportunely
deante
yanoviak
pretreat
mosterd
zeljka
strianese
busches
zypad
andrewandrew
mamozai
bergsrud
citytime
shalwitz
arthouses
crotzer
subaccounts
sunscape
jegley
tracleer
nyregion
brugnaut
loanmod
woelper
transdniestr
schoenbohm
shimari
tappable
dwpi
neomedia
delusioned
pruhealth
raneen
relan
simplifydigital
brydes
urgup
nosologist
grocki
deeker
kaldenbach
fotowatio
geophysic
promptu
bridis
biosys
lscb
antiamericanism
nonwork
synapt
lengele
undersung
yawningly
stennack
mareli
hanian
viveurs
goldmeier
abosede
gloveless
grassfed
lievense
comeek
americredit
minqiang
pdks
schkolne
euroyen
athoi
willibrod
amnor
sweer
formfitting
diweddar
nichanian
mandere
algt
boutris
iqms
atiz
pfgbest
knuutila
kairy
estreller
stepinski
netcord
kandids
sharieff
kimmick
braml
garrotting
napeo
comisiwn
pollaro
proietto
direxion
waghaz
citgroup
salhiya
christofore
nonuniformed
chipaumire
rehabilitacion
rovaris
hanakee
paternos
aeol
lanau
muttawa
yuchai
preaubert
wakeups
folletts
hypocrates
convivially
reado
magarshak
alavarez
dimbos
peoplesupport
altropane
netspend
mingquan
bluenoses
nonhormonal
prebil
comng
pepperjam
altunian
cristinas
oversells
mbet
qannik
tundidor
esipova
cpdos
cutsie
tigerskin
sonodynamic
phantasmagorias
salord
thars
prowar
almli
fasto
danielovitch
seniorcare
jibao
plocker
bolgiano
kampachi
qudoos
cyfle
recanvassing
marqueze
deryan
irlando
bamcafé
duii
chasseuil
rackow
kurnosova
feugère
matoo
dieugenio
sholz
runbacks
pesapane
bureaucratise
ariannol
sebangau
kubatko
dispossesed
balaresque
meaklim
kharji
ngure
punctiliousness
wisard
humanizer
jinpan
glicks
truppi
magestic
yosbany
jangl
icjb
monastaries
allweather
rodalquilar
sollman
premeal
ptolomey
kefta
alsobrooks
sandyknowes
dupaul
cluizel
giulesti
critcizing
lelly
rahala
reiffenstuel
rhywbeth
penzes
insurrectionism
itagui
shaladi
welched
vlahou
reineccius
wennemer
khorafi
volumizing
filmmagic
lutex
halfax
sakelarios
lepse
xigaoxue
shaolian
ooof
hebu
brennig
velsor
fristrup
hayz
muchauraya
dessaline
alaksa
responsibilties
pontoh
nederkoorn
pomsox
grimiest
sourceone
ventureworks
sfsf
weadon
reavealed
denoument
sanhuan
peltor
sentrysafe
rydbergs
gmed
rightsignature
rosnano
bezielle
trilbies
visitpittsburgh
nakanowatari
unbusy
facilitie
csosa
cutomers
niepoort
kazmunaigaz
bifas
bmsn
speakaboos
snicked
yeowart
miuntes
wolowicz
shatswell
sphb
sanctimonius
panfried
mirnehad
nonelderly
lancearmstrong
winiarz
journalis
kohly
kricker
tirua
motznik
delusionist
snicko
gluhwein
liberalness
chillaxin
boyev
ridgback
sizeist
uniqema
cisel
leimsider
enouth
probablys
disintermediated
penkair
bitchery
acephalic
rnsa
intelisano
ettalhi
snowsuits
quanities
microfilters
theamerican
mowings
gallanagh
zimiles
faugere
bioenergia
fstd
saechao
abseiler
rvucom
disappoinment
huiet
mppi
demitrios
zosyn
okiharu
intermet
coltellacci
wohlschlegel
icepicks
karambir
oktem
simrill
lhabu
alokozai
kursman
heshu
duadji
polkes
farfur
proboszcz
vricella
buerhle
bahdon
multidecade
cocksureness
ogap
skiied
praill
dicecco
syaifudin
daunivucu
rabuor
haartez
underdosing
aganda
rowark
fincad
abisaab
peribere
llabres
murenzi
altegris
buechley
nimblest
fetc
oestmann
noguiera
sellz
contemptous
avtc
kandao
komombo
unappealingly
krenwinkle
stranack
wirya
simulcrypt
leviten
sebagh
faceing
dejarnett
knocknagoney
kohyama
midsemester
camellones
shawnda
iotova
stapert
ecotarium
adconion
fasanaro
catryn
chakrabati
appinions
stuebe
precontractual
tlaa
asiate
kosciuszki
whoda
catylist
gptv
rendlen
saslong
adfusion
milstd
dubrock
buzhala
fortrex
gazump
chargeoffs
realtysouth
durned
croftlands
kunsman
shaktu
kennedyesque
laudin
wanzhi
gallups
sierpina
paylocity
grandparenthood
ballyard
schneberger
silvesters
garimpos
abdrabou
cupcakestop
lemorin
laicité
zekria
cadetii
mcintyres
bubbliness
kahakuloa
javacool
sayare
toutanji
alimar
decaid
bowins
vashee
zagre
webrangers
jiam
autorite
edci
cybots
higginsen
distaval
carvajales
hybridpower
sukeyasu
brizzio
biodrama
labarbe
yaaqob
summize
ecoark
snpl
kaziu
mostof
dissaprove
gwertzman
kustoff
eforts
monarcy
motorexpo
kressen
sumeth
brilinta
fuggy
cukurca
hangabehi
swipeit
direz
eckoh
agostine
embaressed
moschofilero
cuggino
lightener
deffered
cyberpolice
chantaco
egoavil
dumstorf
estabillo
krezel
dustcarts
gentianes
cphpc
rodenator
californina
politkovskaja
veddw
embarrasement
ascofare
beckmen
drthom
lapada
littlies
multihulled
bookbuilding
mahvish
gutturally
gaunter
bramow
guzelimian
leinin
umerzai
poipoi
islamshahr
phuah
drawerful
opportun
walkus
lacors
harringon
planey
vengenance
hofelich
cvff
offenheiser
lowham
gordonian
blakovich
krenke
afghanstan
takavesi
freewest
amercans
secamb
playgolf
cptr
zibel
mergener
chinches
seuer
couba
starclass
seghal
latosha
inalterably
silagy
confortably
tamboen
jmba
catrell
nkuku
dutybound
magagnini
felesky
jezibaba
cramdown
folabi
gvtc
gordys
rugero
lunchmeats
rexcorp
leedle
conflab
crankbait
gerds
klion
hazut
repenteth
kyriat
wonderingly
cdmrp
tabbakh
barick
haisong
djemah
escritt
tikvat
smize
hollowest
surdin
jacoub
noncomedogenic
antaviliai
omerovic
baissour
kaboudvand
klarich
colbertist
bodenmiller
groeninx
tekura
golotsutskov
sollitto
mercuro
gbubemi
alfoneh
bogot
enpocket
sofugan
towergroup
seviche
hiruy
annahof
fratantoni
goiabeira
buzzd
slaughterings
glenzier
inceasing
bogenberger
dalaro
disaffections
outduel
herminator
marlaud
fitzerald
trichlorethylene
cbkn
spunbond
reshetin
becquelin
biotherapies
tilburn
ottomeyer
punamiya
barakova
skordis
marszal
zélindor
shiit
jeremis
vumilia
camgian
nealley
portie
serhani
rickarby
marandola
uauy
thepkanchana
rozzers
darkmans
paracuelles
readys
demerges
campara
botchergate
levinsons
giubbilei
edusoft
jacqua
dualchas
bodhnath
bildman
sitcommy
ankudinoff
udovicki
rearrests
fundholders
uncategorically
pontolillo
nken
keyontyli
reprogramme
gittrich
obamafest
glitterbest
zhadobin
toktumi
plumerias
copney
swilcan
corpselike
schwenkler
teulere
gammex
pennese
vaghari
shakhova
deutschebog
pauperised
bdrc
homecall
wrily
cytos
bracek
lindrup
winsomely
bettocchi
sibbach
hozelock
castion
wellsphere
businesselite
gabbatt
septwolves
alailima
rublein
kuehnlein
idox
bsharpsonata
superstruct
shwani
debriefer
whyles
retaped
crescendoed
wedner
hietikko
aaraji
sidanko
schenkein
extrabudgetary
pingg
corticeiro
unilin
goedbloed
koreanness
shishas
duffles
dadhwal
antenatally
jounalism
hiec
decrepid
venardos
epals
russoli
coldiretti
lichened
fdtc
awdc
corrolla
akoskin
uafm
tanys
mfhs
avruga
groehler
matejovsky
jecca
eathen
nwnw
sheeeesh
saltshakers
republicca
goettle
monfortino
lerck
femgineers
microgenres
drzal
atatcks
fudds
magicard
wyeside
notbe
reincarceration
translatlantic
menawi
mutungo
gaidica
gotvoice
fbma
kishenganj
bathaa
backpain
gorlier
semakula
galumph
cramdowns
modelworks
compugroup
groskinsky
foudland
sdvizhkov
tamaela
shalvata
glenmachan
howgills
bernsdorff
shernhall
merseysider
nonreportable
halabjah
fashionableness
killaloo
svia
floersch
countermajoritarian
chengetai
hemmingwell
clocklike
cornici
gostic
realitywanted
pastic
mintwood
austock
swindonians
fadumo
offseting
menrath
squeeking
hypocaloric
meehans
exculpates
showily
loiederman
sevmorneftegaz
ridong
blgm
dewormed
smashups
dyomushkin
kocijancic
leacann
thaobh
foofaraw
sacrified
legbone
shurqat
queuer
smartcool
rychleski
zrihen
unselfconsciousness
muguti
hydroenergy
responibility
resel
tetchiness
saffiotti
iobe
youssry
shillibeers
overindebted
villement
giorla
mohammmad
gadw
fayant
governmment
gabeler
eisenbraun
pepic
gorbold
leisureville
collombat
obsta
dowdier
kraskin
snowballers
hysenaj
dolfino
ecott
bornfree
fergany
leverence
mavaddat
grewar
resetters
furiouser
carhaulers
dwurnik
fetishises
pretape
frankest
jenkinses
rhuhel
begner
leefolt
jenev
mayed
banlieus
todner
freakily
magunga
februarys
pushiest
neoclassics
noncommunication
schui
kuehnert
eindoven
squitiro
kallakis
simantha
pratheepan
hickorytech
nacil
cyberactivists
healthtrust
lijit
agiles
hirsuteness
powerbuoys
superchic
jesdanun
gulliani
sluglike
gortex
pentapeptides
powerpad
buckleysandler
purbecks
comforce
mksm
guadalupanos
merkavas
khizeh
kharatian
emoly
goffard
belinde
vinals
hoovler
bakkevig
sigwalt
ajones
hermangarde
villicana
vietman
hungy
uhpa
thostrup
esotouric
audibled
vladymyr
chevedden
timbercon
unczur
inamed
dalati
nonvitamin
karaganis
spatisserie
heinitzburg
sparapani
sentencers
chvotkin
pordy
norpramin
firrantello
swooningly
museminali
oivind
overprescribed
monoply
liddard
goalkick
bejzat
schuey
neohoodoo
activitists
regionalising
tienna
cortelco
pevero
cradlepoint
cromme
pharyngolaryngeal
holters
alvarsson
guidettes
jcrew
pixon
arshed
outhitting
icpas
istabl
switerland
kodindo
eilu
adhyaksa
shrooming
stridence
adcolor
hantro
maurading
gvep
pharmavite
shuffett
somprasong
mvaas
prevor
sharpatov
ahhing
fourwinds
spaghettini
meing
irelend
cciced
gunyon
brincko
machingura
unsubordinated
brilliantined
bitsberger
easytown
milenkovich
fofanny
agleam
tataei
homoepathic
vartys
cspro
bagl
dowsey
megabudget
majescomastek
housemade
abdhul
chikez
ufizzi
avator
grawunder
artrip
paudorf
knockbreck
aestheticising
ayanoglu
irbd
cantania
sinawatra
taxachusetts
mitsubish
ecopy
sabip
paskiewicz
unanswerably
armagen
lengfelder
allerdene
pertoldi
cosmopulos
taddio
katiforis
langdeau
babkas
rationalizers
menance
souléymane
huslin
menveo
franceye
reedlike
montmelo
uscategui
ctagg
agrofuels
packnett
obstreperousness
cinemanx
kocen
recchio
woebcken
menupages
bedmaker
affrunti
nuegados
dukette
darzins
constituants
clickables
lbjs
finuoli
lovrien
rhude
schöppner
viilo
wanabees
surenas
shalley
auconie
zhengzheng
chalcroft
vottero
offlimits
weedless
sovie
betreiben
mizuni
muyale
agjobs
panjwin
callvantage
shadowboxes
buzhinsky
concernd
taffety
tskj
abbaya
mypublisher
kahumba
sehee
folksier
haszeldine
mutrif
barbach
bedforshire
roglic
subsidizer
olarn
dippings
perell
mawatheeq
sunkids
easdon
healthview
efmr
chunter
cirendeu
lovesounds
gottliebova
intelispend
parnaz
senizergues
lancovo
numide
rewatchable
gerdak
furadan
olsenboye
geyskens
ibaviosa
instantaneousness
bouchiha
nleomf
picolotti
waltonen
jesture
chunping
alevels
caluco
swashed
defenestrating
makaela
chitrabon
shallman
onhttp
otarian
sangzao
pollyannas
negotions
dialaflight
porgo
norenzayan
nebras
genderblind
craggier
dumsday
croisieres
openhouse
errupt
abeling
pieraerts
kendar
agovino
yisan
weatherbill
lybba
embyro
vidosevic
madeover
pansiyon
chipkar
delgates
chalghoumi
stapely
basqueness
murasawa
remunerates
hanick
docca
ndubi
ieni
suhy
dundees
hweg
healthfood
fillpot
presinal
overinterpret
isotis
kimanthi
sloter
dandana
sprinboks
januzzi
stodgily
synflorix
rabanel
namibrand
premsingh
embeda
artventure
viccaro
moalin
gaisanov
caleen
linktone
peterkiewicz
transperformance
kudlacek
mauritsson
carpluk
gagor
siemieniec
romatic
amazins
lavasier
cafedirect
llif
locca
supertour
brooming
bestsellerdom
shaqil
braintech
lightcycler
senesac
excercize
salleras
sonthoff
fingermarks
schoendorfer
undivulged
schmerge
alhajji
goryachko
alcaino
sermonize
endorsable
egmi
alldred
cotweet
shinguard
lshc
underthings
wedag
preboarding
carhaul
guindilla
elitek
jardee
bribesville
candlewax
betsafe
alayan
crisfar
aquafarm
drayna
fenyn
hamahara
tranquillising
liquidised
bollgard
abhorant
segaram
duwisib
kondaurov
innerpreneurs
boxroom
viap
delegitimised
stuerzinger
shalina
micropulse
woodburners
legalizers
deulbari
palies
noncurrent
timeliest
esikia
neflix
ddeddf
grässle
yahnke
wakings
gudmunsson
sciple
dadur
andrezj
izundu
mudawar
preservación
tashiev
fabrizius
mbcn
sanjayas
triossi
shengying
vakacegu
halahuni
kanamma
fyfyrwyr
numaniya
partnerworld
govdelivery
delapp
yribe
muntaser
budgetarily
greenopia
tricast
ngler
soprovich
livng
welcomingly
gilliot
cupchik
droptop
hashimzada
sarzanini
eroshevich
mentgen
fajarina
wolozin
sweetshops
abeibara
thaweesak
hauptli
zehfuss
schmocker
drehers
jurisprudentially
kantstrasse
mexicola
kabbia
octf
aelod
zaitschek
holslag
ragil
seamoor
yusanto
spryly
durabook
yamileth
mcdermed
vardai
auwarter
morphotek
wimpier
lamielle
cimatrone
creditreform
baybio
squidlike
warmist
douetil
empoyees
incessently
scppa
nukhazhiyev
pockmarking
rypkema
budweisers
besuited
estrasorb
matlary
sisvel
grandstander
intermissionless
enzos
photodna
ensus
pipart
tacnav
investorplace
ziegelmueller
discomfits
teeson
discourtesies
scardapane
mmcfd
djidonou
zdralek
owiso
cemetries
siwakoti
roussouw
lykendra
hulahan
tezampanel
galzerano
bliemeister
tarculovski
stuppy
bestirred
clarklewis
discrace
dressiness
zuffo
peales
chidlren
jfin
ucdmo
maoxian
thatwas
resperate
rosprirodnadzor
belcom
stoddards
swistowicz
eatingwell
gankhuyag
wamai
lativa
arsema
lounguine
hyundais
telik
szello
flakiest
widyalankara
baumohl
priveledged
karzais
mcgrathnicol
zibari
redpeg
nonathletes
skiby
croony
graduat
gedarif
imaginero
tongswood
makowka
ngers
diemtigtal
terer
cusati
foradil
typosquatters
munith
gameplans
frowzy
triozzi
schokko
moviehouses
laznicka
modishness
enewsletters
bitomsky
severcorr
kardel
mistimes
perfectability
meaulte
peformers
romazzino
evrony
monachs
jeitawi
mercher
polictics
uncrustables
ivobank
smartcenter
mizners
investisseurs
mendendez
raisingkids
anselmetti
androidguys
kjaerholm
rathode
kvell
mbrt
snippety
flageollet
duanna
waddya
sekitani
mbomio
shipquay
bhailal
deflon
phlebologist
ehly
srebnick
unstrapped
probléme
lamperth
denty
koik
bestofmedia
economicaly
louring
chioco
tapey
mamouri
manslayers
jiabo
feckner
capla
farimex
fanuzzi
keybanc
bspm
jaromin
gabrysia
dyarbakir
underresourced
katorza
jalalludin
laperrouze
tftd
geosa
hadlaq
fltc
caucasin
camaign
cepelak
arboriculturists
rymers
succomb
brainlock
folksters
bowmark
superlarge
unexcitable
werrin
trively
najlah
vegging
overcalculated
honex
nonroster
cblc
appworld
linseeds
dandey
mirarchi
givernment
niangadou
georgeou
brownites
landsource
amicas
amariyah
ensky
futuresex
dandling
romash
schuerger
peadophile
sadeek
hobrough
deveoped
ireo
alwad
pervent
starpharma
pointiest
peacemonger
bedecking
cherenfant
blockdot
turnas
frockcoat
rameriz
eggos
joester
takash
nitrofen
kirtonkhola
bankrutpcy
kinleigh
grunander
jinqiang
rotundone
alberes
hoehler
sczech
fedroff
fakka
saakashvilli
shahidur
ariannu
geravand
millana
brainforest
canyou
murfee
casopitant
virganskaya
miscontrolled
chalupsky
jagging
metsblog
paparrazi
jaakke
nickolenko
koumis
fassotte
crocop
shosteck
nationalsecurity
farmstands
sustainabilty
kuwik
shaabab
tongyao
rightsize
whaps
courics
crosstabs
windenergy
masonson
reupped
shengjie
rayaam
innerlight
usdjpy
steudle
giebelhausen
akahane
opont
brichet
viptera
wetly
refrom
almajiri
nunemaker
nitzkydorf
harrenstien
addm
tianmenshan
parrson
wickelgren
miatas
chipmaking
nanomoles
clampi
djabar
biddlecome
backcomb
hawkamah
lotempio
sheese
goldbogen
goldhawks
myot
constution
grogans
phonetapping
imigran
wimpiness
conservers
wesla
gweinidog
colonics
britsoft
weetjens
clapometer
lyssenko
suhandi
constar
cotcher
fullol
altamuskin
kovida
sankov
bohstedt
nosir
pbxpress
hladyr
nimrawi
firoozi
nivalin
americhoice
flurrying
simonia
intracommunity
unsatiable
retailvision
homosexualists
leisman
viehland
azda
schwarznegger
audacities
ioanes
fiim
swaba
dipeso
valdan
unthawed
houseroom
scatback
damous
anberber
woundingly
paslow
finsley
schnaidt
toelke
stressfully
diamonbacks
scrawnier
andriamananoro
solamar
birillo
frape
everdream
blugerman
canolbarth
denbigshire
motiani
limons
chhoekyapa
ldar
ballacloan
huco
miljen
muhmand
gripers
xiongbing
rungi
thuer
orsow
kaewkamnerd
strizhev
reestraat
mawji
basei
sadrau
wemos
fulbridge
dillute
preldzic
guskiewicz
tsuneyo
intelligroup
nadaam
egotastic
tepidity
lorencin
equitisation
jitloff
skimpiest
legetic
sagovsky
ayoo
bourdonnec
biodeisel
salesgirls
laviv
puchase
brotherwood
garicano
medcare
lourda
vanderhill
exomos
vaniak
nicma
myrddyn
suweidi
ecstasea
thamanya
vergnoux
safenano
topups
shaherose
overridingly
mtoko
commisssion
krkic
bytemobile
micunovic
zaffuto
pamodzi
monib
mouphtaou
exonerative
gilgoff
wormersley
vorkapic
aloxi
atutxa
vesselbo
cauterise
skomal
mediaedge
refraim
proscuitto
zannel
torries
harthiya
serener
butzke
wittkopp
raea
joyella
hehea
sneery
tarlike
freilla
kittlaus
nallely
uxua
mercadeo
ohmed
nesvold
skiptracers
oligarchial
grmovsek
lashay
chavarin
nafee
nankani
mailee
destigmatized
kajer
steffanoni
shofield
simonovis
graveses
stoba
nontribal
baleri
laysiepen
sheepmeat
plushly
techinsights
malariologists
leeae
jinren
lontscharitsch
launsky
estefans
milovidov
trpceski
cucurto
sheepshearing
novantrone
elecnor
thoumieux
imrali
isports
peckerar
xyzal
merler
wiredsafety
fatheree
republicn
ccdo
mauly
ldbrain
pascuale
decitre
embezzelment
overparenting
resegregated
sopexa
trayless
kiyawa
ferraccioli
mirthala
foodspotting
fittall
manchay
kanyen
cdii
rekulak
roscar
sellergren
privateness
fujimoristas
sulphonylureas
crosscage
lowgrade
silversneakers
yodler
paulsin
cndr
madagascans
bramell
committted
oncourt
roesing
addlington
cflp
motodev
skcin
mulverhill
coddler
miaohe
shortlanesend
kanatzidis
tianren
diulio
cnoa
orrante
mhia
tonankai
gelil
amanatullah
spyk
kafalas
bearhop
caurier
sorgato
stroehlein
textplus
chuvashov
swaptree
annc
problaby
robohm
designcon
deviatovski
dreisler
botach
californicate
sushinskiy
chocalate
dejac
hosptal
volosky
cashay
verazzano
zimberoff
mahmoodzada
spasmed
belongia
dcip
sinkable
cardtronics
biancoceleste
ppdi
malpasse
bambling
mytongate
proxes
redsell
efforting
photocalls
nonclassified
airprox
centenery
manhali
lasensky
bilions
tancer
sosinski
sinuousness
bancruptcy
cosmeticians
curbsides
ozguc
overbill
noshehra
doht
fasslabend
rhoni
ukcmri
vawd
bjorkestra
amygdalas
dillweed
streetbrand
mobilebeat
aurilla
feuchtgebiete
agdur
protetion
unbylined
petroline
monistere
tramontozzi
hyytia
shortbreads
gauntness
victom
tzampazi
unumb
execuitive
cmev
ticketleap
fulminator
onseong
initatives
azzoli
kettelkamp
rahmin
clonking
rubefacients
llidi
cubukcu
eliphante
xhua
prixes
yuexing
thirteenfold
giley
woroud
geniesse
seonag
tschofen
melodramatist
druba
rositano
magyarosi
tubruq
deurwaarder
janoyan
almight
presgraves
cawrse
bethworks
druglink
cornah
zorome
honnington
cuvees
keybridge
skyports
segalini
mausoom
nekhaychik
lisnagelvin
droopiness
mpel
kahayla
reawoke
splashbacks
civilans
fcpf
datolo
cryans
catastrophize
zeulner
chvc
schoolboyish
campanologists
aquasplash
shopdropping
iwonder
videomaking
antiseptically
armenistis
flexibilisation
michelletti
klueger
crofthouse
reservationist
microfleece
nsengimana
emanuello
mojadeddi
wafty
breaktrough
fragmin
ecel
huvane
hildburg
skirr
arzerra
zainabu
mirosoft
neelsville
saintula
registrational
toporoff
kolymsky
diogel
weidenhamer
gellerman
delapoer
soedarmo
breea
mohaned
oversalted
hŷn
svonavec
nigec
weitzen
sabhnani
fluffiest
kirmanto
sheduled
amsec
nooriala
gyrru
bernann
volac
abdourahim
raithatha
heakin
babyliss
rodabaugh
lymphopoietin
inautix
sekur
sopera
tedxeast
prakosa
safeen
sovietskiy
orcun
ibfd
voong
darparu
hosseiny
westhrin
surrended
sandostatin
iritani
florinef
avgousti
bidegorry
shaanika
flocoumafen
travelportland
wolraich
werschkul
fabtech
bermanzohn
bedstuy
hammertoes
crabaugh
compasspoint
burnoose
triccas
afbnp
trillon
taiy
illegall
modernica
prefunded
subservicing
superefficient
nacr
irascibly
orcy
ecochic
kittoch
roona
suitemates
sassily
undersells
liphardt
magrao
basyir
ladaris
cockups
nonblacks
dulsori
unknowably
mangassarian
gratl
khodori
bestel
omazic
morayniss
excessing
eniel
baimurat
shortgate
zeszutek
verwilst
ballcarriers
nasry
kabaha
deminishes
safeseanet
abdisalan
corevalve
wynx
rushaid
baofang
barnacled
intercytex
tjoka
bratzel
touadi
lawenda
mtlqq
wheeland
crystalising
kösters
voped
gotzis
gulity
arbabzadeh
kodner
ebank
slickened
ibeo
mandoon
bbag
cauterising
almeira
vulovic
riase
uttecht
reteach
noridian
issott
dentaquest
properity
iskoot
densign
mckenzy
togarashi
bcbst
caufman
kuryama
logvinov
adimora
midcalf
soooooooooo
kozlowsky
nalci
spectracef
peterses
oikomi
maksin
aliferis
intersegment
vetoryl
motterlini
guadian
enzhu
tacoda
sweazy
malbran
yankess
askana
schweiter
barrowing
repoxygen
hibey
yogman
visioneering
llds
cheriegate
rixie
rastetter
blameable
berkenfield
kanowitz
yurk
cocktales
angelette
sarikas
yekaterinberg
exaggerators
molczan
sungar
belhoul
dannals
roundbush
larotonda
cheskis
hoiem
myrup
voraciousness
siblin
mulleted
retainability
aqazadeh
slavishness
bushco
tasali
ozgen
blabbers
bamat
eshaunte
cbrj
cavates
megaclub
flyblown
gaulthier
luzkhov
lantheus
nicholashayne
sayedabad
methu
fatula
zaback
knutmania
wunderkinds
descriminated
dapartment
kardous
pratomo
dujkovic
chedham
sleepiest
whitts
roubatis
athenaeums
opportuntiy
cbnc
savageries
starkevičiūtė
whitwham
nessiteras
frontcourts
ganeshas
skeered
lebida
uptegrove
milnesand
kuwaits
mecary
alexza
grisliness
commodites
dharmapalan
friestad
raafa
haymills
overwhlemed
pursestrings
fanhood
halldora
ooil
mungers
vautrey
chunxiang
holtslander
bonczek
parkatmyhouse
alderwick
wimik
hogervorst
lucquin
abdiqadir
ribstein
yakutugol
scovilles
baybasin
actemra
quellenhof
deeanna
danday
asrv
ognianova
volpendesto
vetcogray
slipenchuk
epiduo
reputationally
rorshach
beanee
dingfu
ivvr
opprtunity
nazek
enior
rapdily
rocketbook
meterologists
tushies
sautés
wybodaeth
youssifiyah
cavm
moeny
zerina
bosinger
obag
sfiso
aminosalicylates
travelscene
rrev
wondertime
indusrty
luotuoshan
replikins
sceats
kemigisa
lingerfelt
pacheo
ringwalk
postponment
decalred
styleless
gieseker
gimcracks
treatery
covar
palmchip
inovis
ienm
peberdy
countesswells
predjudices
birthweights
qioptiq
kafayat
wawrzynski
tertsakian
teixeria
bastianello
brauser
expotition
tepotzlán
boardies
polically
meltoff
sorur
explainin
ferrlecit
domboshava
dataworks
conrol
ehcr
jdcc
stoleshnikov
songfests
petursdottir
haakan
flation
hussell
betac
gghc
covalt
doumato
mgcs
scanties
suyatno
uninterupted
gjenero
ségalot
vitamind
libforall
megabuck
propeack
fuhrerbunker
arijon
rondeli
oudea
shivalika
wrapit
shufro
saparmurad
freshfarm
bouskill
greyber
eurolist
mobclix
schappacher
sovio
baghaturia
applink
shorters
tweeze
trustnet
naidex
janetos
sheffielder
jcwi
coldbloodedly
hoobrook
nesnplus
stiksel
jotr
gventer
kinnings
teochow
rightie
pancost
detillion
babyyeah
bibbers
aneg
nkombo
sauvagere
anham
implmented
jharkand
tamishia
adimab
zlinux
jabob
ifcj
ritchhart
mantsch
heeeere
yunessun
nagareda
eleviate
responsble
kiconco
klebolds
cravenness
witlox
moezi
borghoff
ngaruiya
snackable
rohmat
depatment
photoframe
kubaisa
marktest
mbeke
raydi
biersdorfer
atyr
retraceable
alrajhi
lybrel
nfumu
temik
mennill
shareese
aisight
asdso
mcalorum
superphones
kfhp
ipsonar
wohle
heinbockel
dinasaur
puct
farhani
nerazzuri
loveing
entrepeneurship
subowo
enrst
gardenless
magazzeni
okusanya
primness
dipersio
milenky
preppiness
kruess
slfc
schwingel
rulemaker
offthe
rdcm
idiz
myvouchercodes
staidly
ehealthinsurance
calambokidis
avandamet
xlhealth
hatherall
okeyo
stuiber
hellmold
huseein
armanino
legasse
mersbergen
veteto
tegni
gonek
overdorf
tauting
slavelike
maguwu
mishura
boniwell
kulibaev
nightdresses
hawkpoint
chandrakasan
tfwa
garshelis
qadus
hanci
decarbonized
luckson
frownfelter
weece
namina
cunsolo
harsheim
spritzed
emergengy
communitization
linak
nilpferd
sibello
guyaux
chalkpit
tastily
economi
azilect
patchworking
ammart
shamshatoo
grugel
supermiddleweight
renane
schurenberg
motcomb
unitedlex
disported
zarkovich
bondareff
acquiesence
oitavos
bardino
moygannon
dystopianism
fadeaways
chuvit
lobukhin
zargun
screenagers
yrfa
metwest
inaugeration
udemezue
krajacic
murm
rtea
ainardi
brainstems
smartplant
resown
nethawk
illahun
purivatra
papalexopoulos
dreamiest
conomic
rouyet
xxviiith
birded
ligorano
yinghong
cogcc
laipson
anpac
legarie
sherbino
sakakeeny
rakosky
yazel
wagemakers
erbol
bueti
paradysz
cornw
bimha
jungala
demartinis
maialino
zovath
caplat
creditex
turver
soays
mannato
felhi
slipskin
emmans
babyfood
girthy
mathendele
ballydonaghy
schallhorn
evote
embrassed
usprotect
rudebusch
eveyrone
konyk
ammor
marcelles
ejii
helloooooo
mcgivering
alchoholism
sonfield
nomaan
chenia
tjarnqvist
danneker
extavia
wtert
gibbscam
hnilicka
sarandrea
moletress
manjang
bdti
atoki
ambitous
maldron
oktavec
asmundsson
bennathan
biojet
keefauver
mouillefarine
backslap
sixteenfold
marketaxess
varasano
peoplecare
sgarabhaigh
multicard
sameere
laserline
delao
ofeibea
youselves
oneriot
dreamtown
ioli
liskey
satellier
kelsoe
weitekamp
butchness
ignoramous
tumai
mimb
prepandemic
carmountside
zanzinger
steelriver
moneyback
drabbest
hadijat
relishable
siyavus
chiavaroli
capelles
fordrough
vatuvoka
colllingwood
galichia
denegration
salkovskis
degideo
lusuardi
myfortic
kijivu
wotring
barkor
clolar
mehp
edgelab
heavyhandedness
forard
kruzich
pizam
mundow
heshmatollah
kingspark
requelme
gillibrands
pregis
yaverbaum
rajivan
riverso
wifelets
handlova
swiebodzin
sheronick
kultida
jumale
frothily
wadnaha
remail
suddock
felzer
karadogan
papped
gausi
thainstone
rolofylline
wasbir
depoliticising
rippons
spiehs
tambaro
szczygiel
guduric
hradecka
puckery
siliconsystems
snarkiest
rcgm
jumhuriyah
roddymoor
smuggs
whitmyre
contompasis
transis
drevets
schnure
wolflin
koyle
merkels
tarnstrom
colclasure
plasil
antimiscegenation
ghalanai
catroga
blansett
ultramarathoners
busefink
toumaz
uthemann
floatables
miuro
rethemeier
vanquishers
xcellerator
yeffet
wagtendonk
lofalk
diminsh
jfit
kueffer
stroppiness
imboela
satphone
cvrg
admati
loyko
moany
inestrosa
mileyworld
delapidated
peerialism
tirschwell
sheppeck
genender
tripartisan
korndorfer
ragatzu
lemaine
dongrong
ngarongo
kayse
canally
kipness
biofach
bettinga
olibeau
travelsphere
gerasimowicz
sirls
aleece
schooyans
puckishly
frederikson
telecommutes
boozin
scoblic
reinarman
eulala
kremo
shetgaonkar
sipkin
cameronbridge
demutualisations
gueros
veyrons
anticrisis
ravelian
buttheads
spalga
tiwonge
novaquest
decarl
dinned
greeland
quershi
wazne
cruisy
itlay
aidcp
incomptence
rozmarin
dinela
tumorgrafts
biogeneric
guiltlessly
sentel
bridgelux
ministy
photronics
myscene
baharvand
amankora
nuron
cuhna
jinzhao
illeagal
amaney
charchian
poupi
cdxc
khoula
cribsheet
jovtchev
gelées
windstars
behrad
pollastro
galoots
uberior
micras
libaud
trueful
charfauros
marketpulse
delorian
vucitrn
hoelzl
cynnie
shawler
vefour
mislaying
chinaillon
canlyniadau
thinkpiece
abdulkhaleq
microsofties
bothof
diversifier
ozery
goodhartz
ricking
lloydstsb
mittys
pollyannish
portmead
atayeva
deflationists
fishbowlny
opensea
chybowski
goplo
mozakka
yarish
multitasks
yaghoobian
sperle
crossmans
eurong
usiba
timorously
nlst
tawassoul
pongy
mnlu
prueitt
wuerttemburg
involontaire
chuckleheads
prendegast
ongoin
nannis
aerias
civitello
shpt
unctuousness
herpetophobia
maastrict
ipof
menerbes
nintendos
starmaking
saifan
phoniest
lochrane
copenhangen
wildbeast
funseekers
unhealable
misconnection
numpties
sostanza
annapolitan
gotova
tenojoki
trueform
substudy
alioli
ddpa
worlsey
frostfrench
generalissima
kopanya
phoneys
alteratives
impassiveness
svvs
handu
atemschaukel
preachments
fsrp
micato
popolzai
zahidov
ashooh
backdale
fougera
monumentalizing
aranyosi
brentside
afinitor
hantzis
augignac
neujahrskonzert
chistians
updos
hargy
mohammara
overegged
hazeldell
agwara
tipsily
beehaus
presniakov
bizanski
gioffredi
kisseberth
cloakware
psychopharmacologists
missery
gêm
itrk
littlestar
saydabad
laiskonis
vonona
brimbles
ambiq
altaroma
prampero
gelpe
armellin
miskimmon
lugacy
mebaa
oetzi
harns
centrebacks
gayhart
pfoten
deleb
braehmer
desurvire
petrák
dujanah
fouettes
domeyer
bevelacqua
okechi
cinner
ghebremedhin
hanoman
pulino
puladi
waldera
mexes
junifer
expobank
sherona
frappucino
kerchers
edwight
ngunyi
numerati
gawrieh
keppert
recyling
plumy
mrct
julfikar
mazrooei
sikelele
unpeel
ccggy
miniaturise
schretter
brussow
museumnacht
pencalenick
duzgun
tarsin
bmwed
vappie
seventysomething
maturén
wonthe
ghanea
europeanise
powwowing
saod
krebes
mbec
lecourtier
hisave
niezgodski
afghanization
xtify
mishori
simine
cyberthief
nardell
suitter
ebio
ultradome
icengelo
kierstede
chayanda
beniston
magrakvelidze
pregnacy
ambercrombie
tsep
epus
bertrands
pasick
undermanning
skellett
innoncent
manochehr
thewhole
nlsi
mergasur
twea
lasharie
vinotherapy
shriberg
misrecording
nathee
zelio
ustp
craigrownie
okume
craphole
bergthold
suntalk
importatn
damgard
pescoluse
donguan
offroader
dufan
llanspyddid
zivana
troyers
swapceinski
inhospitably
qongqothwane
bureacracies
terriost
koshlyakov
chintzware
wackjobs
valensa
gadlin
tidemill
marketforce
djouadi
bennites
turbinator
millesima
ddisc
sarkozys
tbar
songkitti
bottlenoses
sickey
pribetich
delamo
ojon
kruman
yelovich
rehomes
smartridge
hendarso
bulovic
mogulof
gfcm
pitville
montari
kassinove
kaddatz
rammeloo
zahhar
budino
corlato
bellafiore
agurs
yongze
unbrushed
doepfner
casteu
bascher
tecsis
esterhammer
openhanded
ampareen
galleano
longers
gorwitz
penrhosgarnedd
nejia
mrkt
bionj
banksys
kucy
huenke
castignoli
callxpress
lyophilizer
sternii
jaaskelainen
bordry
mehrer
mlyako
mahami
charvez
caginess
pigmyweed
negbi
scabrously
mollura
rhotert
gindalbie
amerca
luedicke
hyggelig
nonutility
corraleja
spacehoppers
hspda
delisio
jenden
scane
campie
rhor
voyennykh
perezes
prvni
agbere
ehleringer
roadley
shabaa
wapava
ditomaso
lionising
lafollete
callusing
regurgitative
candacraig
kasisopa
ifcn
oladokun
abdulbari
crystalizing
vardakas
alvidrez
swaggert
perhach
unberth
plonks
kinnair
floobs
expendible
djahid
declassifed
girdner
saleapaga
chelski
lahouri
contitution
exiqon
qristyl
eccelstone
gfes
gonso
sentimentalise
engwirda
ciesco
janury
griffinger
tosas
grubesic
paralyzingly
headscarved
konje
ambelopoulia
birbili
zoerner
dannels
bleepers
boessenkool
swirlie
hanono
jrct
toadied
skoby
dexis
debtload
carkeel
laubhan
ychydig
megastardom
trusch
tovel
americn
tilhill
coachload
hirabe
betchley
terravina
iggles
tradelines
poyry
efjohnson
ncayiyana
marmoleum
cordings
schorner
commmunist
shelterer
mehmedinovic
canuteson
talkiness
boutoille
birbragher
nonoffensive
lefurgy
queenmaker
flosser
cadungog
summerdance
billips
gambolled
gindara
wedgetails
husseiniyah
girfriend
eacom
witnessess
borha
rodhouse
trevarrian
gaucherie
brinkmanns
njar
fasttech
landikotal
davymarkham
saglie
arclin
showgoers
sahaab
gazumped
tailies
ecomonic
thriftiest
overprovision
atwinkle
kifuji
methyr
cashpoints
garganas
wetang
jekabsone
pandaw
wanye
rozansky
dorger
leadbeatter
dtvpal
susanyi
kiesner
beituniya
noncaloric
hampsters
boqiang
kinneson
loetz
shinawatras
springwoods
swaddles
habbash
hseni
shafranik
istodax
bodhrans
wyand
bebenek
nasami
mokhtarian
quattara
colllins
ojelade
clearbridge
murugathasan
flégère
balack
alcoentre
spigler
docklow
twanda
prefeminist
kdolsky
provolo
nashawtuc
fonduta
shipperlee
itexpo
wrzesniewski
kayci
cacophonies
recultivate
friendlessness
aildenafil
compumentor
jeranimo
advar
edcuation
friedes
vriesman
spycam
tweeple
owentown
chebanenko
mogin
palipane
curtainless
sunburning
oathes
hoehnke
spinmasters
ubse
exhuberant
khvaja
cyberthreats
gyffredinol
hohenschoenhausen
cloudmont
campanulas
lowballed
zumbrunn
lemaricus
rahaim
abdulhakeem
niederhofer
manei
cpii
alaeldin
boatmakers
neurotronics
cotgreave
clivias
kimmeter
stenny
slushing
yxta
adgey
antidepression
pazur
weariest
onyeri
belghazi
ghalamnews
rstandard
pilk
boudjenah
geeya
estopinan
farasat
perdigao
qliance
numed
tenhaaf
ofour
sadrs
khrista
delac
scratchier
usbank
growmore
vakidis
kouremenos
convergex
eigerman
superegos
ponderers
sahidullah
majercik
progestagen
winkingly
peaco
villatuerta
adrees
kurmanov
rezae
cagoules
femip
maurcio
khakasia
geomet
surlier
scalcione
pemberthy
anastazia
yudkoff
ulset
halilhodzic
rsst
kameezes
baggan
profitstars
bahuga
khusa
pegum
skinkers
lattig
apono
westendarp
beaties
makovich
abdalaziz
atunyote
biocheck
swarns
pusses
iphs
ectomorphs
propects
momslikeme
mifs
untethering
delhusa
aardt
goddell
grassett
reoffends
pomarici
olness
relaise
lueking
kairen
sitlika
twantay
cytopia
oponion
allendes
cyberslacking
doonhamer
quallion
stmts
zelenovic
chungchong
ehlenfeldt
crossflo
svemo
kavlico
willebaldo
spievak
visicu
feldmar
hopscotched
recompetition
delcined
hidayati
temcor
primewest
gacesa
landmined
mutinously
mudsnails
phizzy
balague
lenova
bankrollers
adcirca
primesight
starcat
soubiane
gatignon
noncovered
spinnler
goacher
accedo
emiley
taumua
bagherian
feezell
pekurny
nontraded
spudman
reproted
weidenmier
postpartisan
cousteaus
abilitynet
stagged
pikulthong
eckerberg
usuf
valayanmadam
lilamani
shortsightedly
thelast
stinken
skymarket
fireballing
scampish
offier
skillend
sleazefest
oversimplication
rifes
godsick
aluise
ranaivoniarivo
blottnitz
scrofani
stiffie
newsmag
anait
citicard
habeebullah
burningly
tretherras
perelstein
technomarine
steinreich
derms
boccie
raphäel
dibler
tempkin
unhoped
stricklan
ikamva
gundegmaa
sincavage
cocklers
esclapez
flyde
poliburo
sectorally
arritola
mooly
bruchsaler
nachawati
absolutepoker
cohabitators
khimm
welmoed
goetschl
koubi
sofronios
hrabik
reinsures
chaffins
plainspeak
synctv
consummations
stavreva
zavi
jaabar
objectvideo
nurkin
vandemoortele
dukkah
qaddoura
wingcon
selega
bashinsky
saddiqi
genocided
homeys
orphanidea
himnself
esmerian
ghambir
dropcard
mncwango
abulhoul
gorringes
gfsr
hubnik
blubbers
secretrary
kahau
experienc
thxa
naupa
preheim
nykanen
tussionex
indepabis
labuanbajo
foodcalc
decherney
varnai
intertain
obiol
trialx
cuttable
strongeagle
tocto
korkoneas
vicepresidential
kampars
milkpep
nasacort
middled
sojurn
omenetto
laprevotte
bimingham
crummier
courtiour
rabinor
chantecaille
underreact
barthmuss
payplan
moyao
tohamy
whipcrack
najmedin
notel
goshdarn
shakhsiyah
hasanein
chanceller
worldpac
overconsume
kobsak
stoyer
mckiver
vondran
koernig
wilben
musicskins
milliion
gelbin
dalha
alcolac
iapso
redxdefense
mccaughley
untransferable
adgie
treadwells
ameriya
gopilal
nsemi
landina
bakhty
samecki
malomuzh
zoneperfect
carrazana
ameircan
eventfulness
abelcet
exportability
matoian
herrle
madatyan
trenn
aarron
dlar
weiqun
lasecki
shlemov
paleokostas
sledai
raucus
shammary
monstered
weltons
galliwasps
zapor
zickefoose
chanate
doani
edeen
giftwrapped
vebas
unstitch
cuddlier
tzun
aytat
kvasnicka
litef
temodar
screenburn
securityholders
bethenod
neocate
unbottled
tobaccowala
prynu
gacula
amornwiwat
overcoated
krenning
radeta
bornt
claustrophobics
wackadoodle
kingshall
toube
roadmate
ffilmiau
wetbikes
perfectville
cabler
buffetted
rajaprasong
akbaraly
haythorn
verhoeve
houriya
rottinghaus
janiece
tayub
goware
coiffeurs
euroskepticism
sideswept
droeger
beigler
faatau
zeppetella
deysbrook
preinstalling
caua
reappreciation
chaeruddin
muzzaker
bachgen
frugoli
deductability
resynchronisation
brni
unfamiliarly
creamiest
ginco
issmp
aminiasi
pangeo
kargha
mvbase
christandl
spotxchange
romski
desperance
untch
hilliary
mompreneurs
jakabok
golser
sitdowns
sprado
unshuffled
brti
yardenna
goaltend
pannullo
mckerchar
kalkay
crosschecks
dinb
agrisure
nosers
laceless
haihua
schlecks
finacee
brazoban
stryland
ordemann
chartouni
bollingbroke
bonat
reneuron
uchucchacua
eschenbacher
untame
schiffbauer
ogreish
stubbylee
josmer
nonturbo
misfields
selleca
cronosoft
tocheri
yannakis
catrack
richlist
kabkabiya
orbotix
yocca
punternet
camae
ariesen
gisy
naftzger
chezzi
mrugala
basteiro
ehya
uvat
classness
sanitises
madginford
wuryanto
worthersee
dimunation
elsohly
ordovas
goodair
seasteads
pomerols
izhaki
tailspins
towork
unfancy
abaete
suann
ʼi
geniom
gologone
preema
porkulus
ouamba
gontmakher
anemically
trma
cryptoportico
cvilak
aglycons
tazeem
truronian
armaris
nasdq
aguettant
alderly
kavulich
citrucel
cerelink
neuronetics
gonter
colpy
glah
brayed
bocchia
jinhang
zeedyk
stainmaster
datafeeds
giammo
vechicles
montobbio
bluedorn
mainardo
confiming
hempey
pearlized
danielczyk
cadwraeth
heddatron
sauat
mauviel
catapault
zaiqing
intuniv
schistomiasis
tullydonnell
vacine
annihilatory
reort
digeplayers
crawleyside
auchmill
lachar
genscape
onges
somaxon
zhilyaev
shafkat
quartely
lenarcic
akylbek
kwangchul
kramberg
highfliers
rashevski
matsuhita
marmillion
maryin
tavelman
cervelats
ferronniere
dahllof
putzes
subasi
schandelmeier
astromaterials
assulted
temblador
wahishi
bengtol
eyeris
grandmet
teresea
bannapot
tyland
shohin
pirilli
englobal
fazalullah
shirtlessness
misbahuddin
jumpsuited
tribulete
cillier
arcalyst
superswarms
shanshal
humantarian
scalinatella
gecker
mcelheron
almirida
waxings
venezula
cgibin
muffi
competitior
handbuilding
kopchinski
azzaz
dotcomedy
etol
newheights
interfear
reflate
ioactive
ddydd
talevi
essakow
tetherow
adnitt
cloughie
birkenfield
colussy
zoppe
goldkamp
collexis
gamarekian
establishmentarians
lanap
reann
zawislak
superscooper
whitny
slogar
tesfamichael
parentsconnect
worobey
baijiao
loathsomely
knickmeyer
spazzes
tunstill
noncyclical
balgrayhill
confesercenti
vitino
dogumentary
janusiak
ndjai
birgin
basyouni
solictor
hpas
pixieish
muhyadin
goverenment
bacterin
hammouri
saftler
hughstan
monetaire
undelineated
interferred
globaltrans
ulitmately
regathers
traumeel
kreeps
railplus
portaloos
duilia
gamidov
clippered
wiedinmyer
quavis
neswick
chmm
jabalee
zendani
ditsa
tianliang
authentium
snackfoods
grauvogel
genilson
trustcenter
hoerling
superelite
flittner
bicyles
terrritory
lehecka
niwar
pintxo
kachka
preopening
isratine
drabelle
bluestring
headlingley
muraguri
sexcapade
brovetto
sedovic
malteses
exurbanites
innocuousness
eberwine
pressganged
abadoned
mcnesby
overmedicalization
prkr
huthis
blaringly
reemphasise
reyl
pickiest
respraying
sepil
llares
shmura
morhouse
thorplands
akmad
hougardy
cottigny
gulkis
zorpette
kaltoft
katokichi
fatcats
cornblatt
unharried
frownies
eldridges
fairfx
landgrab
institutionality
scheiwe
ardanaiseig
prht
zislis
fertittas
reivich
kadii
squiring
rodins
pramudwinai
whoopla
blutrich
softbrands
invigorator
crudité
gwom
ziolo
ffrindiau
jenard
tyruss
prehearing
desquesnes
antufiev
casber
avichal
fulvolineata
conando
forzley
overshare
pulmonarias
hammadou
hanoians
footit
solaren
chestang
vyxsin
freiden
lermon
stablise
mashadani
rubacky
fadillioglu
murmurous
dannye
manorway
litvan
goatie
rabits
gramke
orlandella
redenvelope
viransehir
immersively
boussin
malehorn
siochána
fastac
micast
drymades
klingvall
grillmaster
tomasevski
mizher
profusions
negrohead
gucciardi
sandbergs
prevette
megaports
jianglong
spaclub
adelis
prabaharan
panagiotopoulou
vermon
trmb
larvin
jennelle
jempson
ziouani
eppert
robotopia
inseams
neukum
baracky
eyelike
shuddery
sophoan
gosinski
repondents
rothlin
dasy
konowaloff
monzur
mikelle
fineliving
swide
tanier
guaranted
alicart
chantlike
colberto
tecua
alphamosaic
kimondo
zuca
iogear
mexp
stonkingly
delwit
frankeny
shuweihat
fortebio
aylene
tachosil
colica
cyborgian
borgogno
sunbleached
megafights
rihannas
versimilitude
successs
technogeeks
aerocrine
alphalab
chinemelu
krason
shrillest
arthrocare
breininger
damne
javorn
spinmeisters
shaokun
ambay
costafilm
aqualab
handyperson
hynick
passkeys
nhsta
robier
coxmoor
textme
hupton
teendom
usbln
agensky
yonglan
baseships
haberkamp
trelinski
fraggings
neosoul
cvmp
savouriness
keresey
jinc
ciecc
burkus
leisch
quib
millert
barrelful
maclarty
bamrah
penbury
lobbenberg
traumatises
squishiness
behenji
nebulisers
noxema
ezon
moseying
goergens
scudettos
samalout
sagaciously
pizzella
beems
borisch
shiray
decaff
wombell
snowfort
schlafengehen
enthrallingly
sofronis
gamelogic
giannobile
cadwyn
kennebrew
icddr
snacked
chonail
pomajambo
elpistostegids
morael
powerseller
latecoming
schaloske
wolfed
thrupenny
cleggy
mutiga
rachow
chfn
tegrity
somayli
pecheux
azarmehr
dayya
bordine
heeran
evolene
dhasmana
garikai
ateb
workrights
rovian
deglamorized
georgelle
oddes
mutaa
offfice
hassaine
lostroh
truffling
dimunition
epedemic
outthought
parenton
ubachs
ftfm
bluejohn
ertong
jerkbait
harmie
tharan
arancam
tertrais
boltless
kensy
capablities
gondolo
zoomy
renkes
gleefulness
gorens
madban
vizzavi
stonefrost
thrifting
jazzmutant
uithoven
teraelectronvolts
bter
warwak
blear
reprioritised
haziest
riedo
mitschek
ohja
gribbins
unevacuated
nowacek
wursts
melograno
knapke
hasanin
punnets
kamore
gadgetwise
credenzas
horomones
haiders
sensia
unknowledgable
montioni
enung
aplix
emmar
bluffy
reindert
muyuni
kilomters
magram
yannas
glucotrol
matasano
vdrs
greitemeyer
celeberity
rummery
brudnick
comack
popcake
afghanastan
cottagey
tavlin
pafuramidine
ayyalusamy
henss
shinwatra
riccadonna
lubit
runless
refettorio
yuban
hvhc
ruefulness
winterisation
ngemelis
imberman
neigborhoods
narel
kisamba
holidome
rutab
biggots
melenie
owour
gerui
vietnamwar
kameroff
mushier
boobis
allue
klosk
skooba
yabroff
faigenbaum
bridgeen
siomara
radzius
convergency
scandelous
yourmoney
leismer
strohal
everpower
exbs
heddings
masoumian
toree
vinelike
dasaad
makaibari
crummiest
muzijevic
sugich
etene
bleatings
shoetops
hseq
semperian
leporati
pafr
sarrafi
tweentribune
jirovec
bonitatibus
fossgate
plantiffs
lockfield
achosion
righful
billfolds
dimascio
drywallers
beligerence
hincker
svetlik
baiyaa
wauck
acrodea
lannett
taous
knicknacks
joshpe
bluchers
lawnmowing
bromptons
rebuying
shaniece
scappati
ormesher
ascendia
neglectfully
paradyne
moussin
ahlfeldt
autothrust
almatis
greendimes
cychwynnol
cprit
corcell
fattahian
brownstoner
kallinis
ministates
prawdzik
unsusceptible
gueant
aminpour
muers
akhilgov
prmkt
trememdous
huffmon
albinger
unenticing
sahenk
rivertime
aweidah
basagoitia
issaias
susno
corwm
castillas
birchfields
upswelling
finmin
bloombergs
vkernel
gaohe
reynar
fiese
kiltmakers
marleine
coppolas
heech
zilliontv
houssoy
robna
poliakine
illiteracies
bouveries
thadee
jilbabs
andimuthu
wadian
julmiste
diffi
tenreyro
jamarkus
sterigenics
rhonnda
snbc
schlachet
boukous
bamboccioni
khoory
récitations
rhiannan
mastronarde
gaveling
istiqbal
nacarat
dumebi
bierhanzl
aryasova
normil
spalton
badiozamani
ecomedia
devistated
atiha
elektrobay
elfine
selfsufficiency
gamemanship
advisery
corall
sahmarani
ugnivenko
stting
mcroskey
parzych
twitterview
vickki
cestyll
mtrx
unhealth
silverpop
collpased
rehersals
buchanek
sjfc
railomo
beligum
sainvil
gotaland
broyd
amlen
klaussner
phuyal
capered
blacksell
scrybe
monzavous
axert
celandines
demonstraters
kyphon
reybier
jüst
didarul
montealtosuchus
zardar
metelitsa
defensless
salloway
kathat
trenell
dricks
kinseys
wahedi
woodfuel
guestrin
sadikoglu
wonduruba
superinsulators
iscoe
hulteen
undercharge
overstayer
beginin
strokeless
oniya
corepharma
savp
hasabu
courchinoux
pollermann
ctei
chlopak
muaid
symlabs
scouarnec
syagen
overexcitement
mehrat
cdhe
foreyt
organlike
nberg
retreaters
mulyanto
solarmagic
phobot
saintbridge
eversham
torezolid
huffily
rdova
viselman
sorgeloos
nuvio
ithat
venktesh
garafolo
dawald
knickel
orllewin
epayables
lafemina
guinsburg
worldtech
vivitrol
retureta
consective
obaidah
sambucas
overstuffing
jatna
tussel
optison
daikondi
macovich
bordelet
sidanco
compenents
mygas
maaouiya
pellegrine
lokhi
raniga
beddia
novogradac
hastelow
speedware
pinzone
loggos
abakr
mahgerefteh
underinvested
rosendael
picnicker
stoitsov
chicer
stroch
clenell
hinderlider
philant
nstep
regieoper
wharrels
celacade
ouidad
stanos
neutraceuticals
runnebaum
mahlak
skeikh
paesa
ageis
jsfs
smokeries
darapladib
kingliness
jukowski
donadello
demontez
battersbee
egyptain
erox
costetti
rushenden
deligates
daogah
piemaker
zueco
starriest
percec
groenvold
vibiano
murphi
wiiitis
zhejian
milktoast
scoltock
energiekontor
seapak
smartshop
narredu
sceintists
itij
nfvcb
ixic
cyclicals
patentholder
withrawal
welber
isln
ugartemendia
pontocho
hadizatou
mlim
ganyang
yarovoi
vegatables
eftychiou
smolanoff
tresierra
gwacs
gscb
talinda
liquider
vestoids
junping
joylessly
jintropin
gbgc
elaborator
bedair
nonhospital
mcnosc
mexicanness
dannijo
mishavonna
hcais
appenine
trevine
chadiha
suslik
galitzia
mererani
innnings
coccidiostats
yeskey
wendesday
reclast
barelegged
twigging
protrade
constitition
masterofthehorse
kuritzkes
wirelines
kosawa
metalrax
duopolistic
crowells
penpower
wqh
frizzed
keithan
rabines
sjostad
sapphist
scrabbly
gandarilla
inacol
rosgill
taketv
catheterized
zinging
dahlitz
natshe
adapatation
yudelquis
cutrara
multiculturism
erqs
lton
interbanking
schimenti
dhongchai
pongpaibul
marceles
thyrogen
ramsower
chiumento
motuara
cyberstates
campsies
kolch
mesalles
tenrehte
weeky
lythic
delannon
taojin
spoe
gastronuts
galbreth
tacugama
detaines
fifg
guildtown
clackety
gainsboroughs
tarrantino
carmelized
allweiler
engzell
noresco
quois
cubreacov
thiruvenkadam
kimla
iwanowicz
treaury
ploger
weediest
mangalik
prefiled
lowballing
corken
conab
attisso
survivaball
llsh
adahdad
ancellotti
ihry
linuxlink
bioforensics
qashqavi
spottiness
likudnik
bohatyryova
villagey
ppmo
daskas
cowcher
mrag
horizen
ouldn
schrider
biedrins
medicalise
oberschledorn
flulike
gwynfa
dummermuth
entertaing
maggioncalda
velathri
osipoff
foqaha
souviron
microcephalia
nurdle
purvanov
estiatorio
hongke
ronsky
scaleability
hypercapitalism
icrime
starluck
mohmoud
everfi
lescol
gianino
sarcmark
privilidge
adeeba
headfake
kandeepan
thirugnanam
mistier
catrini
steinmayer
otiso
polpette
unloseable
apoaequorin
ungless
dubak
underattended
fleetnet
kelimbetov
boomj
plummetted
politiicans
nghaerdydd
seruga
dworetzky
counterpiracy
khulud
dahianna
eyjolfsson
mckamie
chakhkiev
pachecho
tongon
celm
rubinomics
ficando
flaam
canare
aelric
ljubco
catapaulted
lauterback
rezgar
taksta
explorit
lobying
agenstvo
szubski
maxximo
laquanda
kazhakstan
sleekit
byronesque
taikong
salmina
aafaq
safecall
amotosalen
weedbusters
orgasmically
dprc
ayira
diguglielmo
sayekti
widmay
steelier
scienceinsider
ureneck
visciously
barentsobserver
bukasov
daabas
ibrain
dirgelike
sumnall
cassiotou
toebben
mileycyrus
mitee
forzoni
globby
hawkishness
rheumy
hydref
chrylser
nakahiro
arbani
yizhousaurus
rosteck
fewn
enviromentalists
danchick
corix
praticò
turginovo
megawave
kozena
antitakeover
hematide
drennec
sybron
rhombopteryx
staffie
rejeski
adavantage
ghoram
baage
chinarat
milhorat
masisa
wilpons
dabryan
leanback
flycast
steenvoorden
shalo
alnc
barsade
perimekar
calportland
juqueri
aruond
sanjae
pliss
moomilk
mintec
jancovici
bambale
ditkoff
abdolahi
cavadino
deweycheatumnhowe
lambinon
swineflu
xtion
nordfinanz
wybot
schoeder
mayuran
swagging
soglasiye
lahmajun
fordwat
cupfuls
newbeauty
fintur
sotudeh
retriggered
braindistrict
mikelic
girsky
yuppified
timberlakes
kucharova
pertusio
salviatino
buraida
juette
ooomph
felicetta
yumurtalik
schefenacker
rosiness
harkatul
tabakman
amelior
mommywood
kagamé
melaas
imdc
cizhong
assomo
lightfooted
zwitserloot
groundhandling
koskimaki
wavra
widness
unguardedly
ecuadorans
ghwell
altmanesque
amerisur
guvamombe
lydianne
reykavik
flector
yerks
trevessa
selanikio
arrogent
quarantillo
abdelwahhab
demutualising
majestik
shafayet
torkhum
pacewicz
hagees
poussier
northwater
fountained
atronic
ystafell
chersich
nmpp
ramono
scupltures
seascope
episkin
goussous
delphon
chicness
assetts
senseman
norrona
hollmen
mrina
determinato
littmoden
abashova
knowetop
shamdinan
baghdassarian
hopscotches
tungsha
judders
iurgi
lumberingly
sinkevicius
wilberding
sprightlier
kapani
woodrick
budulis
mauriceo
faqirullah
balmier
nakivale
cockhorse
clickagents
llifogydd
simplemail
unrefusable
zoolights
mobilitrix
palastanga
windcheaters
tutssel
voudrai
politicain
resignees
firefront
tovagliari
mountleigh
arogance
preregister
solarmission
greyser
bronzeoak
salanti
intrepidness
bippu
xenazine
tibnah
trammelled
chengbo
valcyte
gooier
moerdiono
sukarna
towes
evolta
vinovation
adenotonsillectomy
nurko
corproate
mergerstat
matekitonga
uncharming
islamified
soliloquising
boardwear
sansalone
gawkiness
pffffft
vtuner
wyithe
burshteyn
waisanen
overincarceration
hopsack
brainquicken
wajar
cappellani
sinahala
smackheads
pswr
seleshi
bedinghaus
augill
waikupanaha
wizemann
gesel
yorking
bronwylfa
antisocialism
shiberghan
innotrust
ambrozaitis
growbags
bokks
salarkia
mediamarkt
inprisoned
mohlenkamp
sitecatalyst
sundloff
marketsource
orvarsson
vistory
luxalpha
kajouji
pirnhall
crustiness
pedaller
allarton
misspend
thaix
gajon
kapolczynski
robocroc
abdygany
interestd
markenson
idividuals
vanadia
wirestone
kampeter
stiffies
akonix
zweletemba
sandeela
nendick
mypunchbowl
demetrus
stoilova
answorth
cityryde
grandner
yanghui
nascetti
umemployment
kerkstra
baikalsk
nextio
skinceuticals
iband
shellys
kalinak
oktrends
becici
motoboy
iserman
hafeth
fruhis
congileo
squeamishly
windshift
maktoums
qualye
adultvest
schnidejoch
acounted
chenevey
headage
beccalli
kazutsugi
queiro
bromirski
farraher
nanobusiness
zolqadr
scarbrow
hyunjin
kengeter
mutler
liuetenant
reynen
levistre
sawalich
idoya
avports
tejdeep
newscale
sutow
dramexchange
gejon
overboiled
tekelioglu
kushners
jobatey
schrapnel
sasomsub
yongyut
katalia
salstein
qubeka
pairolero
nedds
radojicic
muzer
icebag
shakif
geogheghan
shanafelt
ridgeworth
counternarrative
echometrix
alcoholically
tribalisms
barganing
intellipharmaceutics
moeb
merepark
furors
sharrief
baskan
greehouse
westernzagros
uplit
ramblingly
econonmy
sourcemap
sedensky
psirri
backcast
mcrel
graumlich
keneshbek
malezhik
offshorable
golftyn
nilled
tranquilising
medstory
cedare
czjzek
agys
billl
isletmeleri
hennaed
inefficiences
silverwear
fallica
kasereka
vemdalen
ehlis
nemecz
jjohnson
glitner
nonadherent
blokeish
hoelzle
grunstra
pilgrimmage
thermocool
attaboys
antiwhaling
sirci
prostitues
tyrelle
spanakos
perusers
vongs
formbook
yowled
basyan
tinkerman
cockiest
seraphically
emoze
supriyatna
molori
stiffled
bruijns
goodlyburn
kulbel
kovacova
vangent
hagenbuckle
casenergy
dreno
iragi
postapartheid
oceg
bersell
mafeje
tatley
salesladies
kurzarbeit
daibes
darea
fasfous
vangeline
debowski
emplyees
bronxites
volumizers
mezzomo
borgula
bestcovery
janvey
simpon
hosbrook
insideflyer
tradeables
spokewoman
malungisa
hotelclub
nonpetroleum
lacers
readius
zirko
eslc
powerlong
bungeed
stoptech
fejtö
mostart
gilletts
hudacký
gerspach
khais
fundementalists
zahlmann
smedile
weihrer
nisly
risken
ctem
yukna
spotco
joebert
lunchable
yinfeng
zimrights
sumco
kallari
vimac
ogide
sabitsana
neatt
lacanche
mittersteig
tauxemont
churms
neugut
schiffren
symmetrel
manaeesh
stratecast
shamso
isotec
bechta
melquisedec
frieburg
montmayeur
moldomusa
markeljevic
accera
nipbl
quinlisk
sakhakot
unipaas
gargula
louvard
nonhomicide
volksparteien
rizwanullah
shoutalong
zakhari
yacqub
fiscalis
ficos
temudjin
arbx
shaimiyev
limbaughs
dopps
renesola
zsweet
transdnistria
showboated
kandyland
desensitises
ijjas
triartisan
aliffi
gawds
pinacci
vasquezes
sangprapai
carcich
flytower
newcasters
acolades
runnan
matathia
marsilli
contemporizing
blowy
botiga
vewwy
firstdirect
fusidate
shamsiddin
noboard
misoft
inphi
rightmedia
spirita
mayrhuber
beckylyn
relphorde
claimimg
dragao
squanderers
tourondel
bittani
njambi
garanimals
projectory
bamboozler
simé
trikini
ccvp
foodex
perogordo
chontos
itwould
supprelin
bctga
qidfa
hedenstrom
truffade
fursty
snowblades
shate
ahktar
disraught
tarbett
buhary
subasri
nonstudent
wembridge
kadamovas
rapidata
wibautstraat
steepish
darimont
nonhealing
tackily
laatz
zaelke
lokoff
corrpro
greenblade
arixtra
grundie
vanouver
bennigans
exactpro
eckelt
powercast
vfend
dases
barcaloungers
pozition
khahar
dzhennet
mahrus
hernieder
lanzkron
penderels
pakhrin
tiarza
hardnut
raghzai
clanked
quegan
viarengo
dimissing
skinniness
tiremaker
arcanto
prajak
anticar
umle
dyssynergic
tigergate
exfoliators
sefydliad
antibribery
kersteen
baghram
ophra
euthanizations
requ
wawas
akule
carnhill
oesterman
genevers
tusty
vandrey
anjun
dorotan
qalqilia
fanguy
depilate
kulacz
zonebridge
hatriot
corcella
countres
ghreadaidh
worrywarts
eudemons
modishly
tomblike
geille
maswik
muhmud
inaccuarate
mccrobie
ksander
schlaeppi
kustendorf
govens
wendtland
nelstrops
jolyn
stancanelli
vorovoro
harbourers
qualisystems
heidgerd
fezziwigs
sukkariyeh
beccio
mymedicalrecords
osmaan
lavanga
grilikhes
laborc
hevenu
rietmeijer
maneuverer
lifevests
gollies
landgrabs
raddo
accelerade
jablonowo
eddc
ngenera
zellwegers
physicans
shwak
kundtz
coolblue
atyniad
wilgenbusch
teltsch
rachmadi
danatus
kósáné
clomazone
zolensky
securites
kleinfelter
aclara
elosu
sojewish
murefu
brasiliero
montegufoni
nosiest
telecommuted
argetine
ryvkin
ulitimately
shirian
falujah
kamami
chazzie
zuykov
quaddafi
lemeshow
teixeiras
dostis
lonngren
adct
whodathunkit
beachwatch
succeds
ilabor
maobama
lenczner
shemel
hardern
zegrean
caidan
xterasys
modiface
jibbed
continuin
taneka
fuzesi
imcompetence
nintento
pharmatelevision
wehran
precooking
schwabish
provado
niklason
diegue
pitale
rockwellian
mitigator
münchau
gbis
blazak
gruach
airknight
thalomid
augustawestland
stonefaced
vesterdorf
ratanakkiri
lebedevs
jerseyeans
starbirth
millrock
bragnalo
timonthy
allvin
neuromedical
delnaet
fingerpoint
midset
khorsheed
stalest
mucousal
mingkwan
wastall
squeegeeing
kazmunaygaz
afeworki
inexcuseable
byrs
liverpoool
coslov
easylife
birdbrained
arune
recarpeting
adlea
krausman
jigdal
ratifiable
compozr
snowlines
melvillian
engellau
mtendeni
mediatex
tonsai
chargeholder
melnikow
siniyah
battiness
feseha
moussed
withagen
htibs
pastelería
shenneika
biobusiness
johnannesburg
brattons
naween
lindeperg
trunkload
unprintably
hyperrational
bearnes
piturca
fujimorismo
holografika
bugati
velcroed
mustier
majedie
hygate
sadikovic
boltuch
promarket
sillice
pakstani
tounges
mikolashek
gnaizda
luttmer
neurologics
teeuwisse
hookipa
birdflu
hartenbaum
debrowski
trowelling
miclot
sandick
gillens
kinksters
berlamont
todwong
counceling
ceberus
secrtary
attitutde
autobox
bcri
wiganer
biothreats
khodi
kinzett
jinkosolar
withdrawel
beotch
gormlessness
kirston
seitou
demcratic
myser
womencount
uncloseted
lassaline
casseroled
piggles
traiterous
komaci
findingdulcinea
delamor
postbop
depersonalising
cuihu
brisc
catastophe
aleksanian
hypermilers
sharati
hambrey
mcnairs
bromenshenk
poertschach
compudyne
thickos
miljas
zukoski
sagalowsky
babouches
czomba
dmitrovic
friedmanite
bioalliance
bacarro
kahiem
jurkovac
newsconference
larusdottir
discorse
pavluchenko
ufberg
hsip
arrabiata
alhaq
mammosite
telemetrics
patricot
scme
healthit
infotrends
kimre
insurence
khalina
arcua
mulner
chassard
bcbsm
gurtovoy
chinaedu
haentjes
kirrage
fesenjan
tltc
tilletts
netprospex
aquaclass
paktiawal
outterson
nextenergy
nawasi
chassés
jadva
vinzavod
adolor
opde
partono
hofinger
regivaldo
manuitt
kulasi
tylette
zuanazzi
blousons
jsna
honered
panzirer
exorbitance
nobeltec
misraje
lewkowitz
mnister
rsvping
yirgacheffe
huluplus
peua
rippee
jeyarajah
lumsdens
nonlocals
rcht
mommys
cottees
anabolics
gourinchas
chilewich
turkeltaub
rimensnyder
soljacic
bulgrin
aranxta
perlwitz
actressy
midstage
dissaray
winsomeness
voelkischer
beslow
wodges
kaganda
joaqin
advancers
rantao
bapm
willliam
cubiche
kappatos
lanong
paraschuk
prepetition
kallop
yolimar
panosh
kambakhsh
aquastar
ghalyoun
reahard
declinist
spitals
markinor
trembow
lamebrain
westates
phoan
multilateralisation
heterodontosaur
grifts
scheinbaum
playact
acdl
chanonat
mangersta
gwenigale
healm
envivo
frerot
jazzlike
foseco
vacationeers
smartmeters
dalbadin
bednet
xxib
translumenal
noscar
copperwynd
xaviere
amitiza
shangbao
kidderminister
lobodzinski
hydrator
snarlingly
benzaclin
blaentillery
kkottongnae
myawadi
tricksiness
ecoatm
disipate
dynavax
peisinoe
wouhra
immemory
kienan
soltwedel
prolla
auru
hildebrandts
gontebanye
rentabiliweb
hatcheted
biorubber
greden
huysse
lustman
mallicki
katewell
enmired
hummous
vedemosti
waea
fazlic
hebranko
jinzhan
meleri
magacin
seghesio
ranexa
martinette
shlm
superpops
preordain
schwelb
treestand
mhrt
rowzee
overtreated
presdent
shirell
mediaplanet
gairnshiel
hyojung
nonseminomas
indignations
jjones
tautest
danesford
xiangbin
slickster
winlatter
annoint
housebuyers
matchar
nagaski
traply
kumming
soarian
renthop
kriner
explorist
mediasite
sirloins
hipsterdom
quievrain
lucinella
helpped
chiddix
unviolated
dfcon
mixim
tourino
grubbiness
budeprion
mooed
roseblade
japarov
roundpoint
mczs
intarcia
yishion
djingarey
eartag
loropeni
mamonekene
alakrana
cikeas
hyperliterate
ecocare
buttie
greenmarkets
barrathon
rancourous
schnoll
awearness
mikhalov
ornskoldsvik
hohenhaus
ketover
hallenborg
muhanned
habibian
cuprinol
axcell
geranger
sirit
hiroshimas
cherrybank
foladi
milligrammes
loveably
boonyakiat
mazeina
tallahasee
satalia
npsas
marcuccio
rehospitalized
betik
lplayer
zhenling
nogga
rmaileh
zighy
kahmed
espouser
oildex
fadam
krupskaia
visalberghi
overallocation
zyrianov
adenin
eunie
anastenarides
couget
compassionless
bertzi
mirsayafi
officescan
pasticheur
khatiashvili
handilift
camgymeriad
mossvale
treskerby
irudayaraj
pamboukis
viselli
drinktec
sajudis
schlubby
wahey
creditwatch
testoterone
budianto
landshare
arashima
stairsteady
irimpen
zaafaraniya
royalities
emerilware
distractable
cannibalises
panzella
protasiuk
meknassi
listining
grandeau
fecitt
schwendimann
carrasquero
gaich
drexen
madelain
paystubs
telegrah
vitacco
manvar
gorowitz
confuciusi
hedonistically
brisbanites
shortle
killertal
wharington
rashy
palmeter
tomatoey
thinkspain
ucyclyd
buphenyl
throughts
supposdly
houreld
liffen
karlah
fuggin
patchworked
climatization
guochao
shrimplike
lawnwood
givony
aptivus
livinghomes
pitjantjatjarra
coalbeds
wakaresaseya
laloi
azuelo
propinvest
krafcik
eligon
daehlie
bluegene
machmouchi
cokas
chundering
stengthen
ctcm
psychopharmaceuticals
hagase
suthanthirapuram
flowriders
codels
sipam
qaraqe
askalani
zautcke
auroi
albanna
fibrex
hadjidemetriou
whisteblower
visioncare
bannat
skielik
steriod
graybeards
refineria
shoffman
pilaro
babadiya
mangkusubroto
pressprich
drowart
maffulli
monticciolo
stenvinkel
lazzeretti
arizonians
greenforce
optisolar
belajac
gorree
tseliso
heeman
leavisite
albattikhi
pulungan
donmez
haulfryn
twitchhiker
lightfair
güllner
obletz
atfaluna
sambuaga
competa
headdon
threatended
jabbouri
kartashyan
munyard
muataz
consequnces
thaniah
intollerance
durands
dekatherms
zazzara
palinistas
schwanenberg
independe
kuckelkorn
mostoufi
finlandisation
zelinger
zareian
bheil
ballyarnet
fleschner
mayweathers
sebik
saberton
tangoe
jezabeel
aproaches
karstetter
athero
settelen
jaghato
sentanta
ltcg
lefist
comsumer
abenaa
sigurjon
yellers
hasaj
qanan
inister
fortifiers
askeraser
matxin
technlogies
bridgewaters
sulphonylurea
rokonuzzaman
dellara
magheralave
phly
sackable
underexamined
amenties
silberztein
dalog
overscheduled
doofuses
vinacapital
overambition
wonkier
avitzur
parentdish
dumighan
anticpated
yankus
tingman
basine
tyagachev
fanrocket
darori
pavlata
numbnut
leanachan
farance
kasirer
etherealness
yatabare
harrott
obstreperously
lrmr
bapras
gilarski
uzeta
willnot
jerichos
zerihoun
khec
barbwires
racheting
lasseigne
ouwbsm
sweetspire
lyglenson
inaccessability
kranzbach
cuissardes
uncoolness
indepented
gnashes
ditlow
inoma
summonded
bbmg
bohndorf
juress
nonveterans
prodeco
gradates
felgner
tebidaba
bourneside
kozloduj
atock
unbankable
fedecamaras
grammatis
paseornek
otuoma
carsebridge
bigal
marxloh
yaker
farruquito
welagedera
pitcox
katchkie
cassama
brijit
shmatte
papakipos
bfoe
riddence
bendickson
loungey
titarchuk
bault
klasnic
homaira
scagell
canadean
mabina
farmelant
mehmoud
feaga
gvss
mkts
fmtvs
fesik
sendoffs
baybak
telops
geybel
moseys
toegther
nourmand
xibrom
uncrashable
imcas
buzziest
partyer
pharmacuticals
kushins
pakistanies
pirici
clarisia
jomana
froeb
feruzzi
camerapeople
nossell
netblazer
milband
dancetown
fueld
kodima
compensational
erignac
jamiyyat
iiat
grouchily
pokerpro
langerhorst
yazicioglu
mccarragher
spinmeister
riazat
madaí
meadwell
gridskipper
crunkleton
stroot
aftere
damianova
leingang
chiroux
maqueen
dharsi
urbaneye
frelick
hipperson
overcautiousness
ozaka
maradonian
hatties
scotgold
svedang
coresoft
cloudspotter
flummoxes
roundbox
rhyw
bäte
bilbro
constella
stiffeniis
alfasuds
lipham
levx
chorush
responsys
unomedical
waterproofer
dhevi
failor
bhbc
scientest
aqlaam
libanes
vegtables
schran
hariyadi
cfpo
hopleys
valuefirst
nonedible
madrilenos
egdf
raisonée
poujadists
attitudinizing
farwolaeth
aficio
laturnus
moeai
recharacterizing
protohuman
osbo
baymon
chrobot
schoeppner
strenghth
donnar
widish
kitaeva
rezistans
landskroner
ladleful
refarming
victems
corros
risktaking
elmerton
swiller
hepped
sunwave
pson
desoky
sheffie
blunderingly
passikoff
suwyn
kvetches
vukusic
innovas
plude
judases
ecocampus
fwank
qunnipiac
reinspect
fianza
losable
unpredecented
wondwossen
moadel
knomo
viec
incongruousness
tugrik
ultraperipheral
versar
deregulator
taifook
encysive
harazin
pauga
karpreilly
vertuno
nadem
underdue
galvestonians
bejmuk
blefari
curascript
demspey
zakotnik
hcvp
damges
virgance
centano
sabretoothed
cyoung
pfanstiehl
sausser
chippiness
jeang
galvanizer
larmenier
mwyaf
murria
kosutnjak
falesly
kleintjes
gullibles
borrini
rentar
multitrillion
coutee
yachana
porici
overborrowing
tornelli
antinoro
safcol
yurkanin
jelana
pennybaker
malinow
mccarthyistic
lazeez
laubsch
samudro
charties
jailal
haydom
chkb
respall
sklaver
illnois
fluno
romdeng
marshay
xterras
waltiea
cummines
pubgoers
mycka
palous
roukos
guineabissau
einreinhofer
thehub
kafwain
dryweryn
cofiwch
hipermart
purgar
osteoperosis
odyessy
zepped
hynny
jiqin
arbayeen
lunne
artily
shcontemporary
porcellato
kremes
zhivilo
anderselite
swinder
invigilated
antinozzi
spude
kaiserball
paraphenylenediamine
caseby
deductables
convnet
perrou
technolo
esmailyn
bakhshayesh
vitalmiro
potterat
binbags
parlos
valires
debrazza
nainakala
umbulharjo
ebidta
rootphi
uselton
pischa
gact
cannedy
skandium
bondzio
larrubia
kitulagoda
vozoff
catarivas
chrisanne
seignon
sufering
stoneview
herwerth
lobira
fabulistic
paaso
alltell
sansert
plymbridge
syncora
yoomi
nagelmann
perruccio
snugger
schiralli
jamnicky
leverndale
carmonas
verhaaren
cagefighter
clarient
newsier
loadsa
zherka
trianing
lysova
kirollos
crumblier
overpraise
gongxin
stiverson
minutage
appolicious
fiostv
khateri
dolorfino
oostdyk
taskstream
satbariya
goussev
radwaniya
renegotiable
twihards
martinezes
naikuni
mmfs
saxobank
wuebbles
kaidarashvili
inedibles
kanmi
rakovich
radlo
tranquillised
loudwell
portmsouth
htsi
slackerdom
fallat
perupetro
khayankhyarvaa
saltboxes
ringmistress
silagyi
texis
cutera
keriako
drymala
genvec
gnashers
ségo
teacupful
jarlais
potson
maroot
akhlas
polq
caucci
famillies
hangtag
shahruddin
semaphoring
isebrook
boroughwide
suwung
pelters
axeda
stealthiest
hakamies
euple
stepic
goulborn
iishiba
intefered
tarenghi
boozehound
capitolism
matzbacher
overkalix
boulesteix
ppifs
amphastar
talkmobile
haughtiest
qvar
sauturaga
democracts
barsuglia
kukuwa
chuanlin
mirander
wolvie
dukuzov
stemwinder
tauf
submolecular
humilating
bianchet
tufegdzic
cachaças
usacheva
ahhed
aseefa
robbinschilds
cagc
shortcovers
kabalu
xiangchen
zadworna
suseptible
coffeecake
stratege
someboby
darrie
tummie
regranex
ringscan
zoëtry
subro
calpastatin
karlsten
perfluoroelastomer
clomp
lobuje
fuentez
careercast
custardy
breating
premchaiporn
glamming
schapps
snugness
hemcon
decarta
gottino
qaissi
bamcinématek
kiranchi
uchena
geomôn
atmgurus
nonstaining
enterpreneurs
ominousness
jolliness
yeeee
quarterhouse
zhuoma
coordinatior
rilvan
zinszer
kanilai
patrycia
bratts
zulifqar
akilov
furmedge
yohnka
netxtreme
abota
markunas
barhum
netmethods
fosko
bullfeathers
heartstealer
kaluma
zurik
idotic
ridgeling
faurlin
bovenzi
azziman
bogomilsky
thamesbank
mormom
extracare
netsationals
coreflood
heppelmann
garbages
unwrecked
yoursel
nahyans
sivelov
escmid
kuyam
jerkier
cartmail
cueno
boliviarian
cliseam
lemonette
wincingly
secfinex
lanzate
axilrod
dhiaa
muschaweck
soldinger
motivaction
jatania
dfob
hassabi
ratholes
kashmula
fenglian
wanh
jermoluk
steckart
unexpectly
vanderhaar
pennsylania
huffling
aslc
elmalan
mackerron
delvac
umwelthilfe
meevee
sycrest
arbaje
blazered
celious
kazimiyah
grapegrowers
mauquoy
chraplyvy
danuri
snaptrack
broes
jezic
nonmuslim
bahgdad
sucharow
baddoch
abenaqui
westerburger
deltapodus
rubdowns
ripcords
luemba
tamco
dogubayazit
changbao
feroli
ieae
gollums
winkelreid
tepilo
brandless
lincl
redeterminations
clearpad
dhuluiyah
zafaryab
barbotte
hedgefunds
begov
achmadinejad
horberry
swashing
embuggerance
kwedit
mandiyu
hufft
perishingly
mushraff
vavreck
prelicensure
caprail
scatigna
amhras
njbpu
haydos
contech
ceasers
ibeanu
gressett
bieze
capitilize
passportmd
rozencwajg
lethola
loefgren
stastical
wintrob
jardo
spatting
tinklings
yuhnke
duxfield
deluzio
brokerswood
marktplaats
adgas
mackes
azamiyah
nondisplaced
augustavas
ghelardini
mcgautha
mukoni
allouettes
marckwardt
oboma
remobilize
qayoum
hoines
hyperfocused
wampach
dasaro
verbio
zwikel
shutterly
wailani
anzemet
bradburd
marijuna
enduing
abduljabar
fratboys
homaizi
portmuck
nusym
makda
loutishness
ramminger
khowlan
malgir
changpin
schwertz
freakanomics
kiwanga
karacadag
perent
millsteed
tawteen
winkelberg
noorderhaven
abedine
coccoluto
uncustomarily
cyfeirio
navr
mudavanhu
needlman
remobilizing
albaisa
alexanco
aurangajeb
montsoleil
naffest
contracyclical
longstick
hydrovac
substitues
multiplanet
hyprocrisy
moffle
slowworms
absnet
hairmax
favourties
websdale
retrenches
bonusses
jockying
nekzad
devington
ringbacks
mastertones
telzrow
riverrink
maalula
cibani
unluckier
tomrrow
fmaily
trigt
rawanda
ceralyte
capezzali
ileene
turkcan
fcoj
kearsage
chronix
deadpanning
cassman
radiohole
mcskillet
intergovernmentally
ventrassist
unconvertible
morlat
fugel
cges
romanc
geckeler
mkda
meyskens
mauffrey
orgal
dicers
kjaersgaard
hextalls
rodgeriqus
puffet
brainwashers
sypris
rahodeb
wheeden
plattel
esmailin
uibel
rasfer
yurisel
dahok
sablic
tijanis
lechlitner
garbuja
quickeys
solarfun
congu
kawash
sherter
baseem
puissantes
schraufnagel
dictorship
mentallity
redtv
bustar
banally
danehurst
fthomas
dikky
fairl
marscher
whyke
hazeldon
ssdnow
novamont
accountabilty
acountable
huajin
khawad
spinwam
smarttrade
biosite
chiacgo
majoda
handbagging
calilfornia
delmendo
civitano
miyeegombo
saieg
mussig
flashpots
checksfield
shenyin
osanga
wordell
wrister
retoric
celsis
scogs
razumova
xingchang
mikova
inquiringly
superpotent
stampar
healthroster
foodsafe
fienstein
bejjani
thundershower
kututwa
kutano
burgeons
oponnents
speedwork
conita
pronovias
cherrymount
adventis
decoufle
semeta
monomaniacally
abuelhawa
alcotest
transcallosal
olago
jimmied
kijafa
renzer
hoshor
gulowsen
zamari
lindoff
moyard
godswmobile
janusson
tikkas
hartnady
schottenfeld
outpolls
scotiamocatta
jimela
ballyhooing
ruymen
palestini
enlander
stempler
bestpic
iguidensis
ereleases
abbamondi
simonka
staehr
adlen
baecke
trajedy
rennó
volling
subletter
guyt
samaraneftegaz
magora
khaleeq
guantun
laskawy
sukhjeet
pinsentry
gadur
lambdon
elmores
zaanin
wemheuer
varsallone
bikable
underutilize
hyperaware
sluggards
sharyar
kuchinski
mossialos
stainken
loster
suddoth
akkeron
franfurt
spawling
kemkes
girneys
krati
shanwell
zabradli
nonleague
shilsky
communisim
kuadey
anticopyright
cedain
liposuctioned
fotl
massacusetts
keedwell
effeithiau
kunselman
lerrick
bledaite
caners
benoin
philoxenia
spiffiest
recomplete
hollicombe
padiet
rahmaniyah
zurbatiyah
zimpapers
weatherpeople
momentoes
xience
yokata
lanear
unsavoriness
wynaendts
freedarko
peshwari
mcphilips
oggle
swomley
rowhome
zamarai
aawas
kizzi
thriftier
recompete
qunshan
belbacha
pebo
kingold
sanctioner
vetsfirst
ohhhhhhh
sarlot
miloh
blackmont
incrementality
berdieyinne
bolkenstein
camperships
gamecorp
romuzga
spandikow
termansen
prebirth
rajt
poisining
syscan
dvbe
massawe
vietnamisation
jnem
booksonboard
sainjon
gorff
refusniks
marrujo
feeman
bahceli
allaux
denounciation
exceutive
ellinghorst
hilfman
lukken
dimassimo
mikeyy
parver
forclosure
majorgeneral
dolydd
xueju
unchilled
lepro
spellbindingly
palestinains
vidusha
verchot
shaltz
raleys
klangsang
jinou
themeless
erlenbusch
fintecna
estrostep
gerszberg
mainolfi
imazon
sirajudin
petitte
joyrich
turaqistan
eecl
roadmonkey
jigwan
neyhart
moralioglu
suweon
psychoanalysed
stimilus
mcqueeny
gusov
mcguffee
wauwinet
cgpl
tribole
putzing
machievelli
rakeen
raheim
kasambala
shelk
biocentury
willomitzer
kalban
velissarides
hakskeen
synerject
thrivers
zakarneh
dughmush
washingotn
facebooked
deeres
discreditably
decommitment
seandel
frakkin
schoolgirlish
biven
prodigally
wisked
glurdjidze
naziq
pqri
headtorch
wickramatunge
niftier
liverpol
iiwa
hekmatullah
noncertified
ratanamorn
hoppert
palpitates
tdtv
barbituates
petitbois
matone
closantel
nuvomedia
poldma
aruh
sheigra
diamondware
hunosa
dkrw
biffed
venk
unromanticised
slinkiest
igive
apologizer
magrino
onebiggame
passafiume
zivancevic
reaccelerate
zerista
haelterman
cajou
govement
nuclearisation
mordenti
roumel
theright
uncollateralised
celg
tutana
shuala
takkt
verhaagh
worldlier
sentrus
alceus
semaya
dogubeyazit
raingear
zugaza
lazette
tursunova
glassings
schaffart
hritz
efilecabinet
grischuna
salicath
abderamane
mombaur
toevs
mercedeses
prvt
duffed
sakong
jouney
rydning
calingaert
dsti
defife
electriccommander
lasseur
spamford
singlehurst
adelsons
urogynecologist
kanikka
firchau
scbm
midsixties
airfreighted
tenbrink
looooove
ooohs
baymiller
kayakoy
dovebid
chatikavanich
repreive
alavaro
vujanic
charlottenstrasse
bohuslan
stephanowicz
kerkyasharian
dawsy
alberson
chindogu
kohmann
thruyou
sobelle
lasaro
nondiscretionary
spauling
calzon
immsi
razadyne
unauditable
exousia
claustrophobe
mangiola
awtan
clafouti
resomation
sammadar
rulemakers
russmann
signifi
qtech
togoimi
microphage
liftboats
kirchmaier
catienus
augi
zakout
rdpr
tenneh
mpdi
ethirajan
ellacoya
droukdal
mmscf
nisantasi
motormouthed
mehiläinen
dockdogs
joric
antelava
susol
esmin
epuron
cushier
shaplin
fudgey
coffinettes
swertz
reamined
dumers
invironment
parchet
gorbechev
racialize
vladovic
willmake
kronors
worrisomely
guarrantee
olgay
cinet
tedsters
bushit
soyjoy
bicksler
aurs
provencare
emapa
katzourakis
kaumeyer
charmlessly
reindicted
jerril
mediterra
shork
bouncily
accordioned
angerame
raddho
wetroom
hawlicek
atragene
devistation
norbank
selfsufficient
tatarowicz
murier
urni
dincel
kommounistiko
toshirou
sawasaki
liepold
biamp
gabridge
mujuthaba
pennay
kwapa
ecrehous
trackie
carregosa
siripan
brickles
matthewses
stelluto
castelbello
cloudshield
operat
tresset
alguera
meritocrat
wahabbist
basturk
kolpin
llani
unindicated
hopleaf
teenee
kaziboni
kuvshinova
tritest
mamarbachi
xianguo
doodycalls
colonscopy
retroscope
coverflow
berkwitz
ministrokes
fongwan
gluckson
sidakan
epitiro
kaloyanides
supri
coundley
antireflux
bercero
swartzman
britians
caritiana
uanble
geam
venjah
preggo
chammari
unoprostone
ermatov
nukala
cronyist
siraz
hospitol
oursleves
twitchiness
hockeysticks
confabs
arestat
jalaeipour
ennstone
changewave
malebogo
predigital
auturo
britflick
clake
slenk
pureit
vigent
arkal
apof
bingsheng
unifest
downtrends
apero
kamaaina
togged
nawabad
cyberbox
kommunalkredit
fdac
artemisias
handwarmers
newlaithes
tighest
adrenalised
rambaldini
cliatt
lajcak
trizetto
megastomias
prochazkova
wurud
retout
ipayment
rogles
cresyn
ahmadzada
deleveraged
carlinsky
giambelluca
disarmers
kermanian
cyncial
mcaliley
arayama
popultion
qualeh
zarbakht
understimate
gapay
raceable
kajuri
cereplast
exenberger
whimpy
haneline
crimminal
wegryn
klatches
travelsupermarket
frikken
metsavaht
merrywell
yoobamrung
palmitoleate
quidel
westlawnext
succcessful
unjaded
vlyf
katzke
astronergy
mlawer
changson
auwaerter
ocsober
grienke
baloloy
redeemability
spands
nayaf
chaffoteaux
techstreet
alshabab
aestheticised
shipitko
eifan
gladhanding
mannigan
decollete
zuzin
kottmyer
novogrudek
concertinaed
unpresidential
osteoplasty
helioslough
marquest
amardev
mishitting
burgans
lagzdina
dramtically
bonnant
filebound
quadramed
fieldside
interfaceflor
overspreading
hulf
ickies
genaudio
nonpaper
pursuiting
ashcrofts
osmonaliyev
berentson
ratshitanga
rentel
betablocker
gwefan
dimmery
cholestorol
rhaglen
safflowers
sallenger
rushid
gorgiashvili
lumgair
karyagin
shokubai
dipal
tolerancy
akpele
burliest
skoblikov
goldins
vincentric
maqar
pallancata
michalchyshyn
multaka
mensil
madurell
inshriach
gargoyled
newshole
becasse
geegee
shanman
djousse
carouses
bridion
gubenatorial
yasmann
melodeo
dualmode
ohlhaber
hosselkus
asliddin
asteco
glci
womanized
ngarambe
sarigerme
zests
frenna
botkier
atheron
chernomorneftegaz
dracaenas
postwork
hatchel
shillaker
cyberdisplay
hernst
sandelman
etnz
lightborne
jarrom
jinduicheng
anoraky
malealea
imundo
bowriders
ozhan
neglegent
kusmer
skreba
shajoy
lewaravu
underprotected
fyock
bibliowicz
stoppardian
kazombiaze
fortyfold
ballein
moundros
taoping
rhenigidale
cobbes
brezhnevian
pelaccio
effortfully
bergermeer
nonjews
cagnazzi
wonkily
sendups
ontheir
moelleken
chiames
sardeha
zimbawe
kocker
orbuch
goldengrove
heijokyo
waterloos
zaarir
orexo
yermack
rotert
playfootball
wasseem
filsoufi
bozard
threatexpert
theken
robinul
mough
committedly
whinnied
misna
vigal
boezio
alprin
muniwireless
hochgurgl
keppelhoff
cogeval
tigher
dynamise
qurbi
stabalized
kookiest
ticoll
soscia
duekoue
medicide
vaark
woodburner
allybar
lhakhangs
antii
reaccelerated
kiriakakis
jaheem
saadnayel
hydrodome
nongamers
koverman
tanoos
choongh
boordy
pressive
chawkay
anderies
prisna
rones
desperaux
vivaty
moneea
joszt
scrappiest
brocko
derar
unproud
adesta
tbsc
atomredmetzoloto
polymedix
mpuc
lumera
chemosurgery
handrolling
noncareer
umiyuki
gurbantunggut
elithis
thandisizwe
outpitching
kikuyuland
elinogrel
garibyan
gatorfest
thabault
sanbona
bracketologists
hycrete
sbrefa
hersonski
krudy
irlan
agued
budihari
kérastase
everlyne
trippiness
crystalise
besseberg
arwain
verbiscer
skretteberg
taccetti
irigonegaray
hmag
ostir
sheikhly
marzouka
poortmans
speisman
legowo
rogered
dragooning
gruenstein
ballahutchin
binangun
fugazzeta
camahort
hudziak
conservitive
undeland
entura
zeravica
netchoice
treesje
kalief
mackriell
gaccio
cpei
serd
leontaris
musinski
masaud
smileybooks
drollest
patrak
dalvinder
myfoxphoenix
underlap
xcell
nonwinners
gpsi
natterings
storzer
bernasek
corway
ingorant
treiser
knbt
hallandsåsen
seramas
dwikhondito
swanned
asifi
wyndal
bowcutt
tinseth
midtower
viggiu
puhleeze
racetrax
marota
belluardo
musiccares
palanzo
ngdt
casorso
yertysbayev
healthsmart
lecka
rasilez
proesch
dpri
licsw
sourpusses
saghand
canoodle
keoghs
myrthil
proreader
mulhaupt
sanukite
bnsp
worldbeaters
achtner
vopium
blondeness
groople
groupabout
charcon
leidenberger
theochari
luckhaupt
nefariousness
uyeki
ipth
mychoice
sloganeer
innoculations
punchestowns
insipidness
crasbo
choeden
myespn
balcas
waldmani
scanlife
spoonable
pichichero
compnaies
ivad
kralyevich
schwellnus
senjaray
citac
ceau
mereilles
forclosed
glycomark
dissatified
prepsters
springier
iufms
veeva
sixmilewater
enyele
zebrugge
macksoud
glitazones
kontilai
hevly
shlein
wiswedel
dudl
hodmezovasarhely
clonings
ohva
aspidistras
lemack
castiglionis
peripatetically
ajras
stonecastle
caramanlis
delagates
seeit
pirotti
wajba
billlion
cagelike
hugins
starcaps
boggier
grosholtz
bakkom
kortuem
kvivik
yhey
nusing
canyonside
aliriza
talbooth
lmga
aasmundstad
utrechtsestraat
viradouro
inaccordance
ekeli
leadres
mutobo
netcents
buangan
convington
enell
roshydromet
kozelka
foodvest
sangduen
skarssen
prelapse
paramiltary
climatesmart
elaa
ziecker
paksitani
javers
leshno
bouncebackability
hueh
miqdadiya
jakari
chindex
catcote
tangwanghe
idrissu
durrua
superlong
hazuka
windyhall
merked
rezconnect
aserf
abdelhai
mwelwa
hertzmann
ivanplats
aneres
himbos
vilaceca
celebrityhood
roofbeam
typcially
mabthera
espad
bajillions
sgeis
chdos
dellaporta
rippeth
doim
leanore
eblex
laundromatinee
billotti
raishbrook
mengqian
landsnes
cydcor
eslr
dicastro
sukhdave
stavitskaya
memolink
thomassey
visipaque
badurdeen
yesawich
urnov
kreidel
ibrahimia
postcrash
mijoro
mannuzza
noninflationary
alizyme
cuunjieng
stupek
hadlington
polanksy
medstrat
krygyzstan
siyamak
destablise
noteless
giftable
ultreo
banmiller
beringea
territo
lamic
gutlessness
raychoudhuri
seddi
akaz
hwnnw
hbac
rendells
nantas
hootnick
kickouts
reyda
pogrebniak
mutumbo
rhapsodise
cuzon
zabaleen
beroiz
razorgator
iacolino
iotum
paline
stylemark
qiujiang
nonincumbent
scholly
hostelbookers
famulare
annualisation
pitots
cendon
downlighters
gerrero
firly
tahdia
hascup
diettribe
ppdg
namuncura
decentralises
artope
colagiovanni
zaranek
constiuents
prusci
vanhool
circlelending
saidd
awyren
jading
mootral
smithbucklin
ickier
businessworks
shuana
okema
honeman
karamouzis
erdaoqiao
chepulis
lyndin
lockness
greencycle
idelogical
floweriness
jerime
cardica
sleepness
gastelu
zimondi
sundher
nicodeme
fastballer
chifunyise
zubeyde
lorich
opensides
gwsc
kamrob
skwerl
staco
rpis
iljazi
sandeels
brandelli
forwar
competefor
acqusitions
sulkovsky
blueskies
jannett
burutin
ukccis
paekdu
nealry
grandos
dmec
vilhjalmsson
feraz
yaldara
chickcharnie
chaupad
birkmire
kagro
cciee
kurbegovic
geplak
swiston
portpin
schewel
njpa
igodigital
freidan
hsopital
protho
steeliest
khawari
akinaka
chirring
timebanking
mohamedain
ivotronic
segelstein
junkier
homaid
sorros
smsi
mediaconnect
strmecki
clownishly
pampe
nutlike
nonautistic
ostrolenk
soundabout
billingses
pyeritz
nonforfeitable
nerby
nedanovski
sandlofer
ouirgane
margoles
pubby
chicanna
harerimana
geosentinel
elomire
utilites
podhajski
cnpem
spinale
huxlin
frable
verbrugh
machil
predecisional
tibbert
scarved
namugala
fasick
turaihi
counterstrategy
corupt
choubina
kleinhaus
automart
danyong
loundy
avanex
frnakly
chler
vooks
ronnee
phurnacite
accoya
thuggishly
dustlike
brynmally
pttow
bossenger
qaswarah
wipsi
enterprisewide
szakaly
datalabs
awwwwwwww
northless
perezcano
hilam
urds
krausner
entralled
slepicka
gyurmey
buildouts
unfrock
medjumbe
ericsdottir
schme
mooore
levetan
twoway
monitary
horillo
untheatrical
hunshi
understandin
hillon
micturating
parkus
polydopamine
scotchguard
tchilaia
lungundu
vdec
amerah
yorkshirewoman
eyeshadows
hunkiest
puppyish
idcg
tweenagers
husani
fardosa
scrogie
tureli
homemanager
undereye
ifthikar
unsexiest
vodpod
brutzman
promacta
evco
niaki
chyngton
cavvy
wraig
euroconsult
jetbus
houmous
stepter
cbeex
faiure
benchings
gundogan
wesal
herubel
rainsoaked
ostensen
applbaum
bejaysus
interlinings
tritch
zemlyansky
deconto
jamiaa
cosport
limco
credant
yepiz
harralson
pizzola
salwens
sevenhills
peasents
sieno
fmlc
niekirk
fatless
traditonally
sentimentalising
curosurf
nsrt
cossetted
honeytraps
tickett
preformulation
ibone
ddod
rellas
luxlash
mcaneney
nosovice
bedsharing
gordji
bodystep
yogarasa
planinic
bombmakers
lastonia
microdistilling
knca
pohoryles
imagenetix
wootens
daxas
ronilson
sirilal
muhlenkamp
befera
veterens
underhit
ipct
otana
sarjo
cagier
zekelman
chernoi
suzettes
taride
windwood
blingy
khalass
missimi
affirmitive
wolaner
luduena
nanodragster
feeneys
edulink
townfoot
lixun
boomerangers
dimieari
bijeel
gollywog
schiestel
twickers
snivels
valtchev
muhlhauser
mourayan
marakis
mtsc
mvrs
stosberg
dominka
hoggie
latture
dhanju
resoling
reinjecting
myungji
elitech
hoellwarth
nanev
unjudgmental
belth
owczarski
bnac
choedak
vaidisova
contemptuousness
bunol
finalcut
zayuna
aocl
accorn
collasped
costcutting
viawest
sciclone
dourest
chwaraeon
hargey
alchol
xcz
democratiques
milinovic
dinsmores
condis
dunclug
discrimated
smeland
mousad
zlatanovic
djebbar
parkervision
deftest
socialight
moldow
hoopty
wojtecki
pianin
fizzier
envigorated
hpsas
mayelikohan
pordon
trochet
fiberweb
comdexvirtual
repucci
benrock
fusnesau
dapidran
gyorfi
wilsterman
wichcraft
mickeyd
yangpyong
laneve
mcely
creemos
jdimytai
qianfo
tolerent
sagey
elfenworks
fbars
marose
nellcote
drudger
airportal
essangui
teachfirst
operatin
lewinksy
mosalikanti
shije
goinggreen
allegeldy
overexpand
chepkemboi
nteziryayo
untradeable
worldcard
daltonian
nonsquamous
mangiantini
ikebal
cherquenco
grandluxe
arzou
odlozil
decoff
vampirical
foxily
kpnc
soxman
ponne
guinnea
showal
uncommercialised
timberlawn
furtw
steiker
bankman
ghazawi
raisian
dajohn
breedveld
deedie
arfaa
jockish
ultratravel
huldisch
macrolane
sathiyan
vancisin
forkful
bedevilment
bankfirst
rosebrock
lefraks
cwaf
saelee
silkiest
ainkawa
aixtron
suspose
trattorie
greengo
ontex
uncensured
dermatologically
bohanna
obenschain
funfit
medflash
skippyjon
prefete
iscn
ymgynghori
scantest
icmeler
wangsness
wextrust
superquadra
eheart
mapasua
besecker
petercam
mazraq
infocrossing
burkheiser
svanes
sladjan
herszberg
giribone
lizardy
squibbed
jaisham
hapn
pigd
grishkoff
bioenterprise
gynnig
alleron
mckeeve
barnholtz
amvac
hamsik
patineur
financal
peignoirs
eisiau
minzolini
telepiu
crosstech
inferometer
reminyl
janielle
nobbly
domainsbyproxy
overamplified
petroenergy
smfm
stagging
eskovitz
misiura
leafers
shaqs
befouls
shahabeddin
customiser
optimor
chupka
tannachy
pmscs
scarjo
molica
sehnert
zzzs
levittowns
panetteria
dahalani
aspercreme
jidori
preganant
reblackpool
gherzi
aujila
yansha
cryobanks
roughneen
assocations
swifly
pobanz
edicson
azmal
lykourgou
damchoe
hmics
wireforms
gardesana
hokiness
bernardaud
ayudas
multaq
naraki
gaudiello
societythe
shopsins
goodbrand
liram
towheaded
chantana
thumpings
mzili
wozzy
irizarri
geeing
enotah
aappo
thatje
tuttman
darxia
chowns
checkett
irksomely
vahtera
bourzai
daycoval
azazeel
nannelli
punchiest
druidale
kimbriel
forgaard
discoverx
mcellrath
hyclak
centrico
paragallo
daryatmo
unscrutinised
uapd
harebreaks
natalo
vanhooydonck
churnings
lasorsa
peracchia
albinski
dianyuan
rohd
traited
muhtaseb
gromicko
writeroom
spenhill
suppoters
hornbarger
gohir
morquecho
dirshe
turchet
wicab
mazamanian
supergraphic
graap
liftline
bouncebacks
kowalcyk
baraou
prisions
mushraf
arthroscopes
hyvarinen
flammia
norklun
katiria
teleconferenced
overfloweth
pokrovnik
yatskievych
klaviter
jamjoom
pipsqueaks
plechner
tdvcodec
illionois
hubsi
rushwaya
mccraty
cironi
blta
comapanies
zhancheng
oppponent
dribbly
hayab
skedaddled
rogowicz
unstowing
frienemies
firstcity
faridany
accure
gammagard
harussani
yalen
salaheldin
arrm
rosnani
kacandes
chuffs
sohat
wynfrey
alcobas
miserabilism
dogtopia
oogjes
storyvault
damange
frazeur
pitnick
spigaroli
pedestrianize
releaved
eurotaxglass
qaderzadeh
underoccupied
merkushev
borval
goeas
krasn
smallball
irizuki
coalmen
barladeanu
speet
daugter
bertucco
knowehead
tiresomeness
yelnikov
solian
coquillat
yeondoo
namouh
forida
neopeltolide
matatizo
diory
liugui
unflushed
snowshoed
hogli
landrell
mwakasungura
batwomen
antithrombotics
jessicas
frincke
allback
anecdotalist
slobbing
unholster
chantrill
tellam
kanzus
leichtenstein
showhorse
corgentum
fsmt
entrancingly
policharki
deparments
nonunionized
vanillas
glir
nsqip
nohpat
iurc
enovate
cherveny
telergy
nonphysicians
pamos
wrange
ewusi
vobile
sweatpant
guolong
disb
zajtman
selectadisc
genocidaire
hamidzada
altamore
chumra
clacked
squassoni
mcgriddle
faridullah
drobisz
asets
aloun
gertmenian
reticient
constituences
nqobizitha
reciben
hbtc
bahrains
shult
somadikarta
falewicz
pwnc
benightedness
pornthiva
camiro
qiodravu
cdhp
depegged
helldorfer
racebrook
chemchemal
xeriscaped
rushower
imataca
oossanen
ıf
moenig
lafeet
hendzel
binui
naabzada
spectical
ngere
hospodarske
hausenblas
aircap
comcam
gombossy
fronteer
retrogene
liggers
fuschias
bagfuls
bauert
nasirzadeh
wiimotes
longmay
watermaster
dunievitz
treanda
incrimental
atoyebi
rawitch
orock
bendett
blackberrying
invigilate
hajdasz
orbaum
usuary
etlin
michellod
midevil
howr
phillipina
smooched
andarko
eyesmart
dywed
izaurralde
sukant
primeurs
tixylix
boardfest
goulios
unstopable
sedaqat
majome
soflens
thuba
ryaguzov
furiousness
guereda
pizzette
moronuki
budesliga
jpsk
luminere
dweud
governership
doneghy
basyrov
kristianto
gwneud
weathy
jppm
gmtn
provamel
inkubation
kandaharis
canawati
overstyled
appsec
eathai
elasha
penggen
gimer
shoebomber
elanbach
eyewatering
northoff
tysson
beseechingly
glospace
megamovie
samure
bomboniere
deshar
muhrcke
fannying
leveretts
sabeckis
cadna
willdorf
posioned
lattif
outdoorswoman
misamore
atxalandabaso
stjc
neiderer
piccaso
unlacing
sharick
khourshid
sidearmed
lifflander
burrelli
xiaoduan
frazzini
presepi
abdennadher
hypermesh
renagel
maronian
wavelight
underheated
fstr
tauntings
adamovicz
katoro
theut
trumba
glenhill
unlaundered
togelius
cellai
pontarsais
juliber
fanatec
emotionalize
freewrite
snbts
barnao
bestpractices
preibus
pulbere
trayport
ccsbt
caccavella
calosha
wirehouse
gklavakis
crongeyer
xantrex
phytoglycogen
haeley
superprime
mcgrorty
kossie
sabcs
axene
panflu
nhow
reactiv
silverfern
sandretti
nemakonde
landesklinikum
demythologise
baniulis
spunkiest
choska
anahiem
andrianony
cspl
flemke
milosa
quimet
pgic
stearing
pontzer
chadrick
rogasch
bidoon
undermotivated
fashionair
scarpette
overalled
acquifer
razoring
meangingful
exhaustedly
rastafarism
trichopoulou
azadnagar
baccellieri
rorb
weintz
shemy
datcp
bwcabus
yabad
reorientating
politicains
rampager
trebevic
strulovici
bohigian
leesfield
gibh
sorbera
wladyka
clincial
constitutency
tiime
gloersen
disneylands
farisa
sinsuwongse
splitty
nextmap
truecompanion
giacchetto
setlow
smirkingly
atisreal
loorz
yaquby
gearworks
mycoop
bertilson
stangler
bashika
owlishly
pregancies
eyjolfur
kanoto
senocak
skarrild
mantecal
silversol
solaicx
rietze
cozied
bubser
tinhay
crouchley
slipcovered
aurik
christoyannis
cequent
antiprostitution
dfeb
wollschlager
knuckleballing
macroeconomically
srisook
jiyad
unight
emafo
ccbp
pagnamenta
eacha
gawell
giarra
washbowls
unaffectionately
kehoes
metropulos
xvala
frohwein
showbizzy
reappeal
congresscenter
caravanos
consigner
humaidhi
uhlenhopp
nashvilles
racusin
alvogen
waayeel
healthtalk
skunking
gcec
kotchen
mazuka
danaei
kolltan
himandhoo
garbutts
combinatorx
fettuccini
widewell
tivoed
robala
komlosi
haloumi
stirlitz
travelmaster
lassise
homoerotically
pejcinovic
carruades
switcheroos
turkewitz
perithia
dealertrack
quilvest
moatassim
alinia
meshad
yamoun
arysta
deinhardt
imageware
febbo
tshkinvali
fortrans
runions
kochifas
veiwers
rolufs
baragar
vineys
ladderlike
yinhong
zenie
dojaka
heartrendingly
djordjevich
hitlery
madrenas
bronczek
gianfranceschi
karabak
cistuses
webctrl
grivnov
burleys
farrakan
gywneth
daisycutter
zenns
flouch
llawn
pierzchala
metatartaric
contituency
tranquilise
phadungsil
mccamy
kebkabiya
botwinik
semcken
kenneway
khyali
wwoofing
castron
mikoczy
ahfc
glasers
begolly
deposer
partnerka
spendy
gnvc
blixseths
thanatologists
closeknit
lladd
virtuoz
apoligised
ammenities
shimaa
neosa
chatterati
toumeh
paraportiani
daypacks
paavonen
sauey
roofscapes
cmft
prosound
sittert
masibambane
mccammack
mccauliffe
oshel
nydj
nesbo
kimotong
dshi
gelatos
spielbergs
mikulska
misdefense
qleibo
ammuntion
jachnik
totallyjewish
honsberg
oxilp
einum
olass
salaisons
ruesselsheim
telepan
komanski
flautt
limbad
ruessi
repolling
neafsey
stuggles
ixempra
upfitting
brokercheck
lacalamita
detonative
youbet
cempa
prekaze
swaggeringly
stursa
sampong
spitzauer
buinevicius
nortin
elgammal
miguelez
dulford
drpic
tegwyn
kerstie
loubeau
sokaluk
gokce
crme
scheri
budgeters
myday
housseini
fiream
hellishness
soussana
witchiness
septemer
itallian
thenuwara
antiproliferation
schappler
microplanet
khogiani
giancarla
tepav
incestuousness
straayer
sqaud
sielemann
touranment
ethanols
mineer
anamarie
ncjfcj
tavulares
walkiria
tanjiashan
fiotakis
enterprisingly
pushinka
exámenes
ownit
timeing
zojoji
diseasome
felana
megwa
harjedalen
grapplings
errami
rullah
diepolder
nasfaa
ferniehill
thingummybob
mahabeer
forklifting
dikoy
creanza
marzah
urethras
wirecutters
tpps
villagevines
postpay
steirteghem
gabura
parcio
zirandaro
emcp
presssure
ramussen
pimpi
pabell
boudrot
satomiae
rearly
bardaweel
lowde
prattled
rodota
securitizer
toates
widlife
karibuni
woolich
egelien
fibropapilloma
artstorm
dekelboum
guantanimo
hallihan
cherkov
makili
doryman
wahanda
mbpa
magarsa
davore
arroni
reoffenders
kotevski
heximer
behud
ticey
palatably
enginee
paradisis
ruzgas
consolingly
buesaco
isakovic
westins
zinnanti
suaya
aytre
jspca
bageecha
nardine
guardala
leifland
pummelos
smartsense
hydeskov
pitzarella
marchadier
miksanek
counterrevolutions
connextions
allmenus
siig
flexbar
facehunter
stalkerazzi
teeni
miscuing
woger
ersie
kickz
keslassy
biotown
mehboba
voied
deconsolidated
lintala
smugs
overweaning
hillraiser
lailvaux
presliced
zonias
rimli
snedegar
doyennes
fetterhoff
kruegers
cnossen
faizasyah
bagila
newsmith
mwakio
gubta
mujawayo
onerously
semisecret
schaye
alisara
briitish
gargagliano
vizioncore
handstitched
rohmah
lovies
opencrowd
libertiny
gaľa
delleney
biocrude
ashstead
kausal
neidl
lunchbucket
dharmawardena
bellieve
ultcw
lvaro
littleheath
kyriad
pursuiters
mbbi
slyer
dihok
boggild
lacquement
aspinalls
osenat
pynter
laksin
smartstart
manarin
bogaards
sibotshiwe
untackled
endgadget
unthreateningly
deodorise
mohammaden
wedepohl
handyphone
abotsway
uncinematic
caviola
liquidware
lanarth
overcollateralization
mourino
nfus
crocetto
cheptais
melmount
motzer
boespflug
jakstas
talkfests
cresapartners
shmear
powergenix
hardekopf
narcoterrorists
reflationary
provexis
postfight
amselle
uplinq
muirhall
leeney
zeuli
trusim
acciardi
tessas
plotholders
flenner
krosby
halfnight
openband
bougette
obagi
sebrle
newstex
mcwar
mbmg
angemi
moseyed
fearmongers
lumpu
palanco
ahour
sinutab
sodian
sheikhli
sanofipasteur
cinephilic
lothstein
soubie
zerno
bkhm
cystography
usocial
sermonised
obhi
ammoudi
lousier
ereckson
paracletes
krpshtskan
giganomics
ccording
seasonique
perlroth
fieriness
mugginess
barthelemey
xenoport
metrogel
tranc
teguest
muslimyar
multiclient
foccacia
kosaisuk
menopur
hizer
schelberg
barnlike
cebri
tosay
turnow
analex
powerskin
sidie
schoeber
ahijado
zlhr
cushty
igadd
fpies
putumattalan
oglesbee
selfesteem
mallipo
subisidies
afrough
justwhistledixie
cycnical
cfnc
woollies
biben
simultaneoulsy
kreishan
uoya
adapoids
krovvidi
patelis
yalayalatabua
brasato
connerys
funkily
surendiran
comora
kanmen
gasby
heglin
ameripath
fanswarm
lydiatt
metalmaster
chiroma
windemuth
lukacz
splashier
krentcil
cataouatche
yastremskiy
mcclanaghan
megamerger
shticky
artherosclerosis
grueskin
tendy
lignieres
gallmeyer
girthed
homman
jebur
rahati
wahayshi
bluetraker
yospe
tuduv
infantilising
dufflet
balony
adrianto
appcraver
rubbled
escapia
mostari
belche
gülenists
khalidis
weitzberg
souhail
usfm
chirpiness
proffessors
musolini
stonily
nothingburger
pengassan
mantrips
karele
ungenteel
wepower
pipline
panoramically
dinnin
bravey
subeh
poerschke
dzemaili
slentrol
pallipat
unequivical
questek
pointment
tawain
donnybrooks
abaar
jambia
moerheim
pithed
emalie
ontier
underbidder
shourong
kaldoun
authenticom
capak
bluedogs
soquem
junyao
kenzero
chocbox
asmq
dabovic
boersen
videocan
skelid
pingzhong
carrycot
cranfleet
bonsib
ncreif
waterseal
eucalyptuses
vallish
hafeezuddin
brodricks
christianists
machler
ariaan
psychopathically
luxmanor
bayari
frivol
scheeler
embarressed
stickability
reganomics
twttr
tystiolaeth
ndds
dagmawit
davise
startegy
jetin
saidenov
unbendingly
derbyites
blathered
jamayel
mccullars
theopold
bulgers
mastoris
multiseason
cyfarfod
levano
darrenkamp
telecardiology
bpsi
nytta
stadelmayer
contemporised
fouhse
lunkheads
evites
firsthealth
clywch
trvs
coratti
reletting
esconsed
broffman
stonefire
lorans
newspics
karlbaum
penaflorida
northburn
hurkey
ctiy
tinselled
marketsandmarkets
lilang
hirtes
rcent
rougeou
contribued
fobis
erwiah
dellavigna
unstainable
wurstfest
palpitate
oomc
neopharm
groenefeld
brainlessness
screentonic
viisage
hotelchatter
lequn
kemala
reesby
diemtig
ellisman
mdco
vrsn
mamade
invastion
unambivalent
haileys
trivan
sternick
rooperi
serff
herrgesell
wieczynski
superlabs
bidabe
hfth
uzumcu
cojan
ignia
abstraktes
strenthen
aqualine
cialdea
jeyapalan
negronis
paudert
zhenyao
atiende
magnetom
hymning
bonadie
trollopian
thagi
schnetler
esrp
stathakopoulos
nachterstedt
boipeba
ultrasecret
panshir
ecocidal
katania
blendini
recalcine
momager
saphris
rediculas
diavlogs
ovcon
susick
jihadic
coneybury
hethmon
shirm
rishwain
kireker
mirell
sheeren
eramerica
luxehills
touchiest
yalon
banglaore
rommi
lenigas
megaquake
mascone
coruption
smartjog
gollinger
babeh
pendall
misek
subaqua
dallavalle
voreloxin
dithery
gonacon
janumet
vettes
staylor
senternovem
exce
hitleresque
rocen
digrf
ramondini
bushway
cbase
ihilani
hirtenstein
slakteris
unete
hilarides
serices
ticca
shockin
pedegg
cinep
transtac
appma
aqah
prudentiel
rostar
illuminatingly
flimflammed
boogied
hyfforddi
plzensky
eithinog
otesaga
aspling
pollalis
indigovision
blachard
porciello
tereas
eshki
dalgish
akafuku
petroquimica
tunlan
edaily
oorjapac
punjani
sagie
cnrd
hometowne
conced
silga
multigame
brinlee
surepress
statemented
jharkhali
pasquarello
gerianne
pastapur
gartain
lamonaco
bluehenge
sonc
pipho
mischon
tradetech
bunnets
penjore
anshur
sondervan
ebels
facai
biostructures
klenert
bagrock
nutrarev
dusoulier
arjowiggins
skylogic


================================================
FILE: assignments/word_transform/eval.vocab
================================================
tiene,has
habían,had
entendido,understood
clase,class
harry,harry
pluma,pen
guerra,war
tan,so
dios,god
le,you
estés,are
marea,tide
mr,mr
el,he
jerry,jerry
puedo,can
coño,cone
marca,brand
debió,must
diferente,different
tras,after
rival,rival
películas,films
ésta,this
piel,skin
intención,intention
show,show
ir,go
os,you
aumentar,increase
país,country
marcador,marker
perfecta,perfect
ben,ben
presión,pressure
pasada,pass
deje,leave
dia,day
dólares,dollars
porque,why
maldita,damn
locura,madness
fotos,photos
hinchar,swell
regresar,return
alto,high
chico,boy
soberanía,sovereignty
aquella,that
hables,speak
poder,power
tomado,taken
verde,green
nube,cloud
playa,beach
mercado,market
nadie,nobody
contrario,contrary
olvidar,forget
jodido,fucking
altavoz,speaker
pobre,poor
oigan,hear
viuda,widow
vivo,alive
verle,see
creí,believed
malas,bad
hubiera,would
perra,dog
muestra,sample
bienvenidos,welcome
calcetines,socks
dónde,where
teléfono,phone
huele,smells
clientes,customers
sería,would
biblioteca,library
paciente,patient
ruido,noise
pasa,happens
diplomático,diplomatic
llamaba,called
prosperar,prosper
nosotros,us
vas,go
emergencia,emergency
sucia,dirty
desastre,disaster
david,david
pensar,think
real,real
humano,human
vuelvas,return
estaría,be
comprar,buy
red,net
sea,be
ray,ray
presa,dam
ganado,won
sexo,sex
oficina,office
recibir,receive
maravilloso,wonderful
dura,hard
estupendo,great
depende,depends
bastardo,bastard
media,half
pedazo,piece
unas,nail
ojalá,hopefully
banda,band
metros,meters
siente,feels
posibilidad,possibility
inevitable,inevitable
batalla,battle
señorita,miss
peor,worst
naval,naval
buenas,good
completamente,completely
sientes,feel
paso,passed
callejón,alley
observación,observation
perfecto,perfect
flor,flower
imposible,impossible
hagan,make
conversión,conversion
trasero,rear
diez,ten
línea,line
c,c
buena,good
adelante,ahead
ee,ee
otras,other
voz,voice
mofeta,skunk
política,politics
ah,ah
nombres,name
maestro,teacher
ablandar,soften
dará,give
encantado,charmed
cállate,quiet
ocho,eight
fuimos,went
fiesta,party
quedo,remain
sentí,felt
cansado,tired
oro,gold
abierta,open
cámara,camera
magnético,magnetic
ratón,mouse
seguro,insurance
como,as
imagino,imagine
guantes,gloves
espacio,space
otros,others
bailando,dancing
herido,injured
oportunidad,opportunity
bobby,bobby
robert,robert
uso,use
encontrado,found
manos,hands
ver,see
afuera,outside
habéis,have
quienes,who
iluminación,lighting
fácil,easy
menor,less
dirección,address
negocios,business
privado,private
lengua,language
informática,computing
mary,mary
tratando,trying
ejército,army
perros,dogs
cosecha,harvest
siempre,always
vienes,viennese
cabra,goat
gana,desire
empieza,starts
deben,should
vengo,come
tuvo,had
dolor,pain
tuve,had
efecto,effect
quedado,left
llegue,arrived
caluroso,hot
organizado,organized
quede,stay
estarás,be
eso,that
hijos,children
tuvimos,had
vergüenza,shame
alegra,happy
gobierno,government
caro,expensive
oscuridad,darkness
investigación,investigation
mike,mike
dinero,money
hacia,toward
dulce,sweet
siéntate,sit
parecer,seem
vistazo,glance
historias,stories
vender,sell
roja,red
gallo,rooster
vayan,go
chicos,boys
contrato,contract


================================================
FILE: assignments/word_transform/train.vocab
================================================
catedral,cathedral
escúchame,listen
accidente,accident
té,tea
gorda,fat
regresa,returned
negación,denial
pato,duck
precisamente,accurately
imagen,image
persona,person
pistola,pistol
donde,where
café,coffee
negocio,business
quería,wanted
pensaba,thought
espectáculo,show
seguridad,security
juvenil,juvenile
venga,come
alrededor,around
eres,are
robo,stole
especial,special
solos,alone
olvidé,forgot
árbol,tree
danny,danny
hicimos,did
ay,oh
noche,night
regalo,present
entiendes,understand
disculpe,sorry
es,is
impulso,impulse
interactuar,interact
cerebro,brain
cosas,things
supuesto,supposed
reina,queen
baile,dance
ayudarme,help
traído,brought
escuela,school
diario,daily
tu,you
gran,great
principio,beginning
dejas,let
vuelve,returns
voluntad,will
favor,favor
personal,personal
directo,direct
tal,such
lobo,wolf
inmigrante,immigrant
semanas,weeks
base,base
interior,inside
preguntar,ask
pasé,pass
tejer,weave
lector,reader
oigo,hear
piedra,stone
madre,mother
hoy,today
caballero,gentleman
sistema,system
familia,family
podía,could
examen,exam
restaurante,restaurant
conveniencia,convenience
cara,face
hora,hour
empleo,job
pista,track
pronto,soon
año,year
millón,million
pasará,happen
bob,bob
domingo,sunday
hacerme,me
maravillosa,wonderful
brutal,brutal
ciudad,city
come,eat
billy,billy
incalculable,incalculable
deleite,delight
debido,due
mala,bad
estúpido,stupid
libre,free
contacto,contact
enamorado,love
desde,since
pasar,happen
bailar,dance
verano,summer
prima,premium
date,date
mano,hand
cine,cinema
bonito,beautiful
consecutivo,consecutive
conocer,know
sermón,sermon
señoras,ladies
tigre,tiger
señora,mrs
recuerdas,remember
cuarto,room
vez,time
aquí,here
repugnante,disgusting
estoy,am
verás,see
dio,gave
ganas,forward
amigo,friend
tendré,have
química,chemistry
verdadero,true
cansada,tired
cocido,cooked
cual,which
cielo,sky
policía,police
padre,father
dando,giving
asiento,seat
toque,touch
agente,agent
isla,island
cuántos,many
nena,baby
entender,understand
instante,instant
iglesia,church
suerte,luck
luego,then
perfectamente,perfectly
animal,animal
corazón,heart
gracias,thank
prefiero,prefer
creía,thought
renta,rent
delgado,thin
bañar,bathe
estuviste,were
continuar,continue
la,the
llevaré,take
comienzo,start
mujeres,women
vea,see
creen,believe
control,control
cabrón,dumbass
mitad,half
arena,sand
absolutamente,absolutely
mata,bush
doy,give
conejo,spider
ti,you
detrás,behind
hablamos,speak
anna,anna
encuentro,meeting
perdona,forgives
mayor,higher
ganar,win
trabajando,working
gay,gay
encontró,found
conseguir,get
peter,peter
funciona,works
preciosa,precious
esperen,expect
hacemos,make
haré,do
velocidad,speed
vecino,neighbor
crimen,crime
posición,position
bosque,forest
nuestro,our
hecho,fact
sr,mr
tenía,had
saliendo,leave
ángeles,angels
nutritivo,nutrient
final,final
nota,note
asunto,issue
nos,us
carga,load
talento,talent
segundos,seconds
apenas,barely
explosión,explosion
alma,soul
vaqueros,jeans
mujer,woman
otra,other
idea,idea
abogado,attorney
rayos,ray
crudo,raw
acuerdas,remember
anillo,ring
mente,mind
parte,part
mal,wrong
proyecto,draft
chaqueta,jacket
listo,ready
onda,wave
tommy,tommy
lados,sides
había,was
buenos,good
importante,important
dama,lady
aeropuerto,airport
irresistible,compelling
siento,feel
corriendo,running
oscuro,dark
mirar,look
edad,age
salgan,leave
papá,dad
tardes,afternoons
tío,uncle
fantástico,fantastic
memoria,memory
camisa,shirt
confianza,trust
perder,lose
nueva,new
comida,food
momentos,moments
vamos,go
cuento,story
estupidez,stupidity
teológico,theological
nuestros,our
amo,love
cama,bed
sois,are
dijiste,said
ninguno,any
sorpresa,surprise
sucio,dirty
tarde,late
ciudadanía,citizenship
crucero,cruise
detente,stop
pulmón,lung
cinturón,belt
siendo,being
traje,suit
cuidado,attention
niño,boy
tenga,have
intentar,try
enseñar,teach
extranjero,foreigner
llamas,calls
tontería,nonsense
mierda,shit
tomar,drink
bien,well
lastimado,hurt
locos,crazy
militar,military
motocicleta,motorcycle
acá,here
sí,yes
calor,heat
libro,book
ya,already
dar,give
junto,together
nivel,level
idiotas,idiots
profesor,professor
unos,some
horrible,horrible
hacerle,make
deseo,wish
sostener,sustain
odio,hate
días,days
despierta,awake
relámpago,lightning
ser,be
acaba,just
todo,all
quedarme,stay
estará,be
mucha,much
vidas,lives
basta,enough
enorme,huge
religión,religion
querida,dear
pongo,put
creo,believe
llegamos,arrived
empresa,company
podré,can
diablo,devil
demonios,damn
verá,see
pregunto,ask
visita,visit
socorro,help
feliz,happy
bar,pub
temprano,early
piscina,pool
a,to
exactamente,exactly
bicicleta,bicycle
intento,attempt
código,code
objetivo,objective
culpable,guilty
gustó,taste
miles,thousands
doble,double
jack,jack
dejó,left
encontraron,found
ponga,put
partes,parts
filete,steak
común,common
maestra,teacher
ves,see
cebolla,onion
resto,rest
iba,going
vena,vein
tienes,have
ceño,frown
fusil,rifle
tranquila,quiet
pienso,think
próxima,next
llevan,carry
hablan,speak
espada,sword
r,r
drogas,drugs
usar,use
frustrar,frustrate
llevar,carry
muchachos,boys
democracia,democracy
medicina,medicine
navidad,christmas
lluvia,rain
bella,beautiful
esperanza,hope
animales,animals
dejaste,left
sola,alone
grandes,big
comenzó,started
exacto,exact
esperaba,expected
bonita,beautiful
charles,charles
especie,species
biblia,bible
ey,ey
humanos,humans
trata,about
duda,doubt
muy,very
majestad,majesty
cambio,change
estar,be
habría,be
límite,limit
honor,honor
comienza,begins
mortalidad,mortality
lista,list
muchacho,boy
prisión,prison
tome,take
mono,monkey
cuando,when
rey,king
durante,during
contento,happy
ejemplo,example
volveré,return
técnico,technician
buscar,search
fuerzas,forces
difícil,difficult
vaya,go
jurisdicción,jurisdiction
francés,french
cuesta,cost
cuántas,many
tv,tv
castillo,castle
cinco,five
cambiar,change
realmente,really
baja,low
regreso,returned
hace,does
decirle,tell
fatiga,fatigue
viene,comes
computadora,computer
viernes,friday
tenido,had
bebida,drink
suena,sounds
limpio,clean
ha,has
grande,big
juicio,judgment
quedan,are
mojado,wet
cambia,change
hijo,son
papel,paper
jugar,play
carrera,career
trabajar,work
especificar,specify
debí,should
frente,front
escritorio,desk
cariño,sweetie
matarme,kill
necesitas,need
hombres,mens
mansión,mansion
educación,education
idiota,moron
futuro,future
planta,plant
pagar,pay
compañero,companion
estados,state
cosa,thing
pendientes,earrings
llevó,wear
estas,these
taxi,taxi
quieren,want
pápa,pope
sofá,couch
mas,more
especular,speculate
hubo,was
ideas,ideas
débil,weak
querido,dear
mejor,best
vino,wine
coordinar,coordinate
sostenible,sustainable
california,california
ocurrió,occurred
intercambio,exchange
comenzar,start
chicas,girls
oye,hears
viste,dresses
fui,was
usa,uses
disculpa,sorry
direcciones,directions
distancia,distance
diablos,devils
gordo,fat
pocos,few
diga,tell
toda,all
haber,have
srta,ms
hablado,spoken
victoria,victory
príncipe,prince
últimos,latest
multitud,crowd
ve,go
elección,choice
alguien,someone
tengas,have
pensando,thinking
prueba,proof
debes,must
importa,matters
petición,plea
casa,house
cumpleaños,birthday
actualizar,update
tenemos,have
usted,you
pudiera,could
loco,crazy
médico,doctor
beber,drink
eh,eh
estan,are
jake,jake
respeto,respect
freno,break
camino,path
razón,reason
sol,sun
cuerpo,body
motor,engine
recuerda,remember
pareces,seem
depositar,deposit
miren,look
seguir,follow
guapo,handsome
escritor,writer
quieto,still
brazos,arms
haces,do
empezar,start
entra,enters
cuál,which
presidente,president
armonía,harmony
oiga,listen
pedido,order
intelectual,intellectual
necesario,necessary
dedos,fingers
punto,point
alemán,german
granizo,hail
salud,health
irás,go
guapa,beautiful
sandalia,sandals
pruebas,tests
elefante,elephant
favorable,favorable
darte,give
preocupes,worry
llega,arrives
uds,you
muertos,dead
ningún,any
horno,oven
darme,give
flores,flowers
entrar,enter
formas,shapes
enemigo,enemy
llorar,cry
lamento,lament
hola,hello
johnny,johnny
pared,wall
gusto,taste
propio,own
todos,everybody
salió,left
amar,love
encantaría,love
extranjeros,languages
republicano,republican
tuyo,yours
será,be
podido,have
estamos,are
gratis,free
cliente,client
llegó,arrived
caucho,rubber
debía,should
sido,been
abrigo,coat
excelente,excellent
naturaleza,nature
blusa,blouse
música,music
probabilidad,probability
estrella,star
san,saint
cascada,waterfall
terminar,terminate
depredador,predatory
sra,mrs
sarah,sarah
puerta,door
busca,search
seleccionado,selected
jardín,garden
libros,books
ciencia,science
encontré,found
amas,love
pues,well
escuchar,hear
mataré,kill
pobres,poor
pequeña,small
pez,fish
llama,call
hacerlo,do
sociedad,society
creerlo,believe
tratar,try
ponte,ponte
alquiler,rent
sir,sir
lanzamiento,launch
caso,case
inherente,inherent
max,max
información,information
película,movie
aun,yet
aceptación,acceptance
los,the
museo,museum
solamente,only
pasando,passing
departamento,department
tuya,yours
iré,go
ajo,garlic
humor,humor
sigues,follow
invencible,invincible
predicar,preach
decisión,decision
autobús,bus
avión,airplane
zona,zone
de,from
conocía,knew
casi,almost
héroe,hero
digo,say
tenedor,fork
esperar,wait
pelaje,fur
garganta,throat
conmigo,with
eddie,eddie
eran,were
largo,long
confiar,trust
movimiento,movement
lámpara,lamp
nieve,snow
tesoro,treasure
hermanos,brothers
quedar,stay
novia,girlfriend
fuera,outside
inspector,inspector
lee,read
damas,ladies
irse,leave
podrás,can
par,pair
completo,full
anoche,night
especialmente,especially
fin,end
mejores,top
rico,rich
muerta,dead
fondo,bottom
sé,know
amigos,friends
toma,taking
quieres,want
vacaciones,holidays
irnos,leave
universidad,university
buscando,searching
veinte,twenty
vida,life
das,give
alegro,glad
bolsa,bag
joven,young
bebé,baby
caminar,walk
pie,foot
estabas,were
john,john
llegar,arrive
detective,detective
programa,program
hice,did
somos,are
entiendo,understand
habrá,have
apuesto,handsome
calma,calm
hombre,man
vuelto,turned
marcha,march
tipo,kind
amarillo,yellow
quédate,stay
arco,bow
mami,mommy
definitivamente,definitely
techo,roof
carro,car
irme,go
tema,theme
estén,are
llegué,arrived
colocación,placement
casado,married
interesante,interesting
articular,articulate
delante,ahead
veras,see
prisa,hurry
sentir,feel
tenéis,have
medio,medium
significa,means
poner,place
piensas,think
decir,say
cuentas,accounts
después,after
azul,blue
arrepentirse,repent
siéntese,sit
propiedad,property
algo,something
perdido,lost
montaña,mountain
daré,give
uno,one
frágil,fragile
noches,nights
loca,crazy
hacer,do
rostro,face
ambos,both
belleza,beauty
bronce,bronze
capitán,captain
supongo,suppose
pidió,asked
nuevo,new
muerto,dead
hubieras,had
familiar,familiar
mirada,look
prometo,promise
trabajo,job
razones,reasons
querer,want
piso,floor
giro,twist
semejanza,similarity
costa,coast
agradecer,appreciate
saberlo,know
estuvo,was
circulo,circle
oí,hear
puerto,door
tú,you
repente,suddenly
barco,ship
fotografía,photograph
hogar,home
hacen,make
mí,me
terminado,finished
minutos,minutes
ustedes,you
resulta,result
jóvenes,young
ego,ego
tambien,also
dejen,leave
empezó,started
cargo,position
comandante,commander
almohada,pillow
hago,make
caballo,horse
demandante,plaintiff
canción,song
profesional,professional
escena,scene
elegible,eligible
mayoría,most
tribunal,court
comentario,remark
iremos,go
habló,speak
dice,says
morir,die
porqué,why
piensa,think
descansar,rest
potable,potable
trato,treatment
tuviera,had
cocina,kitchen
club,club
ahí,there
reunión,meeting
sal,salt
sean,are
espiar,spy
gracia,grace
calle,street
reloj,clock
ayudar,help
ropa,clothes
calles,streets
beso,kiss
tarjeta,card
mark,mark
francia,france
fracción,fraction
hará,will
geometría,geometry
debajo,below
trampa,trap
perdone,forgive
puta,bitch
chispa,spark
viviendo,living
jefe,boss
bajar,down
intimidad,intimacy
esposa,wife
jabón,soap
casas,houses
ironía,irony
propósito,purpose
personas,people
muelle,dock
bote,boat
pero,but
esta,this
matar,kill
abuela,grandmother
niebla,fog
camión,truck
sale,leaves
plato,plate
oyes,hear
inocente,innocent
dan,give
pide,asks
única,only
referir,refer
hizo,did
revólver,revolver
atención,attention
injusto,unfair
ésa,that
gustan,like
equivalente,equivalent
mi,my
van,go
aburrido,boring
perro,dog
alcalde,mayor
entiende,understands
busco,search
bueno,good
dormido,asleep
nunca,never
precioso,precious
éxito,success
blanco,white
cuanto,many
encima,above
delicioso,delicious
tantas,many
álgebra,algebra
whisky,whiskey
perdonar,forgive
oh,oh
otro,other
foto,photo
escuche,heard
pájaro,bird
negros,black
robar,steal
trabaja,working
fortuna,fortune
al,to
relación,relationship
fuerza,force
llanta,wheel
embargo,embargo
abierto,open
palabra,word
serán,be
problemas,problems
thomas,thomas
con,with
grueso,thick
bill,bill
caliente,hot
bañador,swimsuit
dejes,let
aburrida,boring
alemania,germany
su,his
garantía,guarantee
unidad,unity
atrás,behind
temo,fear
inglaterra,england
salido,protruding
m,m
escucha,listen
disparar,shoot
además,besides
molécula,molecule
obra,work
ninguna,any
segundo,second
mía,mine
agradable,nice
listos,ready
claro,clear
vemos,see
palabras,words
sube,up
último,latest
noticia,news
cielos,heavens
felices,happy
dijeron,said
situación,situation
toca,plays
preocupado,worried
tensión,strain
todas,all
dave,dave
puertas,doors
volvió,returned
tocar,play
ayude,help
vieja,old
honesto,honest
parecen,seem
j,j
elaborar,elaborate
vuelo,flight
vacío,vacuum
entre,between
parecía,seemed
noticias,news
cartas,letters
amante,lover
esperando,waiting
entonces,then
cheque,check
aduana,customs
vayamos,go
espina,spine
ducha,shower
acusación,accusation
sigue,follow
mientras,while
retirada,retreat
orar,pray
absoluto,absolute
llevas,take
delincuente,offender
danza,dance
acabo,finished
tren,train
vendedor,seller
física,physics
masa,dough
pon,put
bautismo,baptism
dijo,said
bajo,low
divertido,fun
protestante,protestant
mataron,killed
s,s
nuestra,our
luchar,fight
nariz,nose
arcilla,clay
saca,removes
york,york
serás,be
conducir,drive
tranquilo,quiet
turno,turn
sano,healthy
gusta,like
minuto,minute
fea,ugly
era,was
dedo,finger
excepto,except
siquiera,even
amable,friendly
bravo,bravo
ayúdame,help
boda,wedding
oferta,sale
hija,daughter
adónde,where
dueño,owner
misión,mission
doctor,doctor
seguramente,surely
saben,know
paz,peace
repentino,sudden
cualquiera,anyone
epidemia,epidemic
tarifa,rate
equivocado,wrong
murió,died
serio,serious
veré,see
www,www
estimación,estimate
salga,out
dentro,inside
aqui,here
mamá,mom
destino,destination
cuello,neck
nuestras,our
puente,bridge
suficiente,enough
debe,should
experiencia,experience
embarazada,pregnant
chofer,driver
tienda,store
pantalones,pants
americano,american
paseo,walk
pone,places
honestamente,honestly
pata,duck
cambiado,changed
parque,park
partido,match
biología,biology
quedó,stayed
sangre,blood
baño,bathroom
hechos,acts
lado,side
primero,first
levántate,raise
hey,hey
escuchen,listen
diferentes,different
velcro,velcro
genial,great
quedarse,stay
china,china
está,this
arma,weapon
mis,my
verdad,true
filosófico,philosophical
patata,potato
templo,temple
novio,boyfriend
hospital,hospital
abuelo,grandfather
ocurre,occurs
vivir,live
oír,hear
suéter,sweater
deber,must
vete,go
sentía,felt
podemos,can
diciendo,saying
ventana,window
sentido,sense
librería,bookstore
general,general
quién,who
vos,you
verlo,see
escaleras,stairs
cuestión,question
tendremos,have
complicado,complicated
trauma,trauma
hermano,brother
semana,week
veremos,see
culo,ass
presuntamente,allegedly
millones,millions
antiguo,old
fe,faith
consejo,advice
molesta,bothers
turismo,tourism
has,have
intervalo,interval
edificio,building
gustaba,liked
oído,ear
decirme,tell
alex,alex
alguna,any
toalla,towel
dame,give
espalda,back
cerda,pig
cenar,dine
arrodillarse,kneel
di,gave
camboya,cambodia
mapa,map
venir,come
monasterio,monastery
vigésimo,twentieth
rueda,wheel
más,more
hablé,talked
diferencia,difference
nuevos,new
presente,present
alboroto,riot
enferma,sick
hablas,speak
saldrá,will
vd,you
corre,run
ante,before
imbécil,fool
darle,give
voy,go
echar,throw
enderezar,straighten
corte,cut
tengo,have
comer,eat
rana,frog
ataque,attack
años,years
contar,tell
vine,came
droga,drug
yo,i
peligroso,dangerous
necesitaba,needed
un,a
brillante,brilliant
última,last
ligero,light
por,by
primer,first
matrimonio,marriage
dormir,sleep
hablar,talk
soldados,soldiers
barrio,neighborhood
director,director
terminó,finished
pila,sink
vosotros,you
vista,view
quisiera,want
correr,run
diría,say
queda,remains
primo,cousin
luna,moon
broma,joke
nosotras,we
ok,okay
rápido,fast
jim,jim
hermoso,beautiful
pedir,ask
esa,that
james,james
patada,kick
bienvenida,welcome
viaje,travel
sabemos,know
hombro,shoulder
gente,people
unidos,united
londres,london
pido,ask
triste,sad
obispo,bishop
vuestro,your
tenías,had
quien,who
constitución,constitution
parece,seems
matado,killed
preguntas,questions
cargador,charger
demasiado,too
dije,said
correcto,right
irte,leave
digamos,say
público,public
están,are
acelerar,accelerate
saber,know
armas,weapons
linda,pretty
pelear,fight
estúpida,stupid
encanto,charm
estaremos,be
tendrás,have
sepa,know
conocido,known
si,if
cae,falls
dejo,left
muñeca,wrist
montón,heap
fundir,melt
venido,come
abajo,down
energía,energy
esto,this
tendrá,have
perdón,sorry
ahi,there
hiciera,do
correa,belt
pantalla,screen
agua,water
pequeños,little
ruego,beg
ocurrido,happened
henry,henry
tendrán,will
estación,station
bastante,quite
termina,ends
cola,tail
muerte,death
que,what
ayer,yesterday
panaderia,shop
boca,mouth
hacía,toward
b,b
haciendo,doing
caballos,horses
modo,mode
secreto,secret
verte,see
gato,cat
fábrica,factory
piensan,think
sabe,knows
mensaje,message
dime,tell
cierre,zipper
tampoco,neither
to,to
estado,state
llamada,call
d,d
muchas,many
ojo,eye
lapicero,pen
tanto,much
pierna,leg
acabar,finish
ojos,eyes
puto,fucking
cresta,ridge
comprendo,comprehend
grave,serious
debería,should
centro,center
mismo,same
viudo,widower
órdenes,orders
monstruo,monster
deberías,should
visto,viewed
piernas,legs
nada,nothing
señor,mister
correos,office
teníamos,had
borracho,drunk
estadio,stadium
encuentras,find
pueblo,town
clases,lessons
natural,natural
dices,say
proclamar,proclaim
fuese,was
olvides,forget
defensa,defending
estarán,be
supe,knew
carne,meat
antes,before
llave,key
manta,blanket
llaman,call
coge,grabs
través,through
izquierda,left
asuntos,issues
algunas,some
enfermero,nurse
quiénes,who
probar,try
cristianismo,christianity
leal,loyal
detalles,details
jugando,playing
sam,sam
cierto,true
placer,pleasure
pollo,chicken
pase,pass
mundo,world
miedo,fear
dos,two
aunque,although
hermana,sister
patrón,patron
puñetazo,punch
jamás,never
tony,tony
trago,drink
falda,skirt
explícito,explicit
televisión,television
sino,but
hay,are
finalmente,finally
decía,said
salida,exit
adentro,in
caja,box
hígado,liver
despierto,awake
escapar,escape
rica,rich
juntos,together
nervioso,nervous
papi,daddy
cerrar,close
dibujar,draw
negro,black
suya,his
todavía,still
anterior,underwear
seas,are
estuviera,was
incluso,even
mañana,morning
informe,report
tolerancia,tolerance
gloria,glory
contigo,with
teatro,theater
naríz,nose
hablando,speaking
américa,america
tiro,threw
pareja,couple
me,me
daño,hurt
cuidar,care
copa,cup
oso,bear
juro,swear
cantar,sing
arriba,above
libras,pounds
simple,simple
lugares,places
pudo,could
tendría,have
revisión,review
veamos,see
trajo,brought
volver,return
ellos,they
problema,problem
alemanes,german
son,are
diré,say
decirte,tell
ama,love
aire,air
opción,option
ministro,minister
veía,looked
vio,saw
naranja,orange
walter,walter
huevos,eggs
encontramos,find
amiga,friend
muevas,move
día,day
soldado,soldier
cabeza,head
lapiz,pencil
haga,make
habitación,room
fútbol,football
denso,dense
mantener,keep
perforar,drill
luces,lights
charlie,charlie
qué,what
tomó,took
campo,field
matemáticas,math
lleva,carries
bienvenido,welcome
cita,appointment
patrocinador,sponsor
queja,complaint
carta,letter
caer,fall
siete,seven
empujón,poke
viejos,old
estudiar,study
mil,thousand
orgulloso,proud
llamar,call
océano,ocean
ido,gone
poco,little
dientes,teeth
justicia,justice
dejado,left
viejo,old
lleno,full
salvo,except
posible,possible
lejos,far
dígame,tell
allí,there
cerdo,pig
rojo,red
intenta,try
quedarte,stay
carretera,highway
polvo,dust
del,of
parar,stop
nave,ship
juego,game
ciclomotor,moped
parís,paris
hubiese,had
las,the
p,p
causa,cause
conoce,known
alegar,allege
él,he
feo,ugly
haya,beech
vuestra,your
líquido,liquid
tonto,stupid
siguiente,following
sentado,seated
vestíbulo,hallway
pelea,fight
profesora,professor
menos,less
querría,want
cerveza,beer
bromeando,joking
respecto,respect
inmediato,now
mando,send
sólo,only
seré,be
economía,economics
lleve,carried
verla,see
esos,those
roma,rome
asesinato,murder
colegio,college
charca,pond
debo,must
pelo,hair
quizá,maybe
sábado,saturday
recortar,trim
leer,read
inmediatamente,immediately
capaz,able
aprender,learn
españa,spain
llamaré,call
viendo,seeing
olvidado,forgotten
mesa,table
officina,office
enemigos,enemies
mirando,looking
madera,timber
acción,action
aquel,that
acerca,about
tener,have
gustaría,like
actuar,act
ballena,whale
cena,dinner
solía,accustomed
deja,let
total,total
bus,bus
ave,bird
viento,wind
joder,fuck
mentira,lie
umbral,threshold
cayó,fell
compañía,company
operación,operation
tapa,lid
casarse,marry
amor,love
bomba,bomb
conozco,know
anda,walks
invención,invention
cuatro,four
sur,south
sabías,know
extraña,strange
llevará,carry
compromiso,compromise
sheriff,sheriff
espere,waited
volar,fly
tanta,much
contabilidad,accounting
rutinariamente,routinely
libertad,freedom
abre,opens
silla,chair
haremos,will
tomando,taking
sobre,on
precio,price
cinta,ribbon
para,for
aspirina,aspirin
motivo,reason
perdió,lost
totalmente,totally
digas,say
sus,their
señores,sirs
falta,lack
muere,die
zapatos,shoes
hiciste,did
recuperar,recover
permiso,permission
malditos,damn
io,io
electrónica,electronics
seco,dry
puntos,points
crees,believe
capa,coat
sigo,follow
guardia,guard
ágil,agile
ahora,now
nuevas,news
cerca,close
llevo,wear
pensé,thought
peligro,danger
en,in
brazo,arm
sombrero,hat
preocupe,worry
rato,while
responsable,responsable
michael,michael
inevitablemente,inevitably
podremos,can
cierra,closes
almacén,warehouse
extraño,strange
nombre,name
rosa,pink
déjeme,let
éste,east
hable,talked
dejar,leave
río,river
color,color
oeste,west
alta,high
juventud,youth
contribuyente,contributor
estudio,study
raro,rare
lucha,fight
pesar,weigh
pueden,may
nick,nick
pasado,past
aspecto,appearance
joe,joe
sucedió,happened
traer,bring
pijama,pajamas
you,you
escupir,spit
puesto,position
eras,were
vestido,dress
ángel,angel
adiós,goodbye
demás,other
hayas,have
sueños,dreams
cuchillo,knife
demócrata,democrat
sirve,serves
da,gives
aquellos,those
tiempo,time
cruel,cruel
valiente,brave
derecho,right
permite,allows
codo,elbow
equipaje,luggage
abrir,open
cabello,hair
papa,dad
graduación,graduation
leche,milk
periódico,newspaper
lago,lake
estufa,stove
salir,leave
puse,put
forma,shape
acto,act
roto,broken
luz,light
orden,order
conoces,know
cada,each
veterano,veteran
varias,several
mucho,much
tránsito,transit
vale,okay
plan,plan
también,also
jesús,jesus
sargento,sergeant
auto,car
chica,girl
prensa,press
continúa,continue
duro,hard
dado,dice
haz,make
durmiendo,sleeping
coger,take
inteligente,intelligent
preparado,prepared
pies,feet
estaba,was
tornillo,bolt
ellas,they
uh,uh
ley,law
diccionário,dictionary
verdadera,real
cálculo,calculus
vive,lives
según,according
viva,live
paja,straw
dé,from
asesino,murderer
mire,look
espíritu,spirit
una,a
coronel,colonel
jacob,jacob
cabo,cape
mira,look
tí,you
va,goes
servicio,service
carajo,fuck
tengan,have
entrada,entry
espera,wait
reservado,reserved
vuelva,return
cálmate,calm
respuesta,answer
bañera,bathtub
pedí,asked
steve,steve
recibido,received
fué,was
espejo,mirror
maldición,curse
nacional,national
quiere,wants
habla,speaks
the,the
culpa,guilt
lindo,pretty
valle,valley
sonido,sound
oficial,official
niñas,girls
cómo,how
esas,those
ayuda,help
lástima,pity
momento,moment
farmácia,pharmacy
campesino,peasant
tejido,fabric
george,george
llaves,keys
opinión,opinion
richard,richard
muchacha,girl
suyo,yours
mírame,look
error,error
dejarlo,leave
llamado,called
vendrá,come
dejé,leave
infeliz,unhappy
ladrón,thief
enfermedad,disease
mes,month
silencio,silence
vengan,come
banco,bank
enfermo,sick
infierno,hell
lión,lion
microonda,microwave
subir,up
pequeño,small
igual,same
normal,normal
llámame,call
apartamento,apartment
tumba,grave
rádio,radio
acaso,perhaps
pena,pain
tipos,types
enseguida,immediately
ahogar,drown
malo,bad
papeles,papers
sala,room
lugar,place
selva,jungle
alguno,any
sentimientos,feelings
puedes,can
gracioso,funny
simplemente,simply
dejaré,leave
frío,cold
profeta,prophet
pasó,passed
habías,had
autonomía,autonomy
sacar,take
alabanza,praise
padres,parents
cuenta,account
muévete,move
siga,follow
nueve,nine
colina,hill
sin,without
pecho,chest
líder,leader
así,yes
riesgo,risk
rodilla,knee
apología,apology
u,or
ayudarte,help
ni,neither
propia,own
llegado,arrived
tus,your
marido,husband
dieron,gave
acuerdo,agreement
este,east
puso,put
pago,payment
toques,touches
golpe,knock
suelo,floor
hambre,hungry
ridículo,ridiculous
tom,tom
desea,wish
necesitamos,need
interesa,interested
tres,three
preocupa,worries
ocupado,occupied
santa,saint
transmitir,transmit
tomas,shots
paga,pay
niños,children
cree,believes
aún,yet
supone,supposed
hasta,until
cuchara,spoon
pareció,seemed
arte,art
cintura,waist
cien,hundred
dicho,saying
hablemos,talk
adorar,worship
santo,holy
dr,dr
experto,skilled
puede,can
genio,genius
mar,sea
hagamos,do
he,have
juez,judge
ella,she
sueño,dream
refiero,refer
seis,six
vi,saw
testigo,witness
señoría,lordship
misma,same
hablo,speak
impuesto,tax
verme,see
hielo,ice
tenían,had
máquina,machine
vaca,cow
necesita,needs
realidad,reality
mundial,world
déjalo,leave
geografía,geography
inútil,useless
pan,bread
escribir,write
larry,larry
muchos,many
chris,chris
fuego,fire
hotel,hotel
existe,exists
maldito,damned
lavaplatos,dishwasher
sabía,knew
despacio,slowly
famoso,famous
mármol,marble
inglés,english
larga,long
acabó,finished
llame,called
aceptar,accept
decidido,decided
escrito,written
cerrado,closed
acabado,finish
botella,bottle
yendo,going
automovíl,car
salvar,save
recuerdo,memory
allá,there
increíble,amazing
fue,was
solo,alone
o,or
veces,times
terriblemente,terribly
volverá,return
coco,coconut
vienen,come
humana,human
perdí,lost
partir,from
siguen,follow
encontrar,find
déjame,let
basura,trash
oreja,ear
zoológico,zoo
meses,months
escuché,heard
estrellas,stars
araña,spider
duerme,sleeps
judaismo,judaism
estáis,are
pude,could
t,t
modos,modes
pueda,can
justo,fair
y,and
estábamos,were
arreglar,fix
han,have
acelerado,accelerated
cuándo,when
dicen,say
contemplar,contemplate
pregunta,question
jimmy,jimmy
tierra,earth
segura,safe
teniente,lieutenant
ello,it
paul,paul
águila,eagle
no,no
conocí,met
sabes,know
arroz,rice
les,them
washington,washington
varios,various
valor,value
tonta,dumb
llena,full
miel,honey
necesitan,need
sexualidad,sexuality
princesa,princess
tantos,many
horas,hours
gallina,chicken
central,central
menudo,often
halcón,hawk
costar,cost
deprisa,quickly
probablemente,probably
planes,plans
blanca,white
biografía,biography
evitar,avoid
ibas,were
tienen,have
voluntario,voluntary
esposo,husband
número,number
encuentra,find
conversación,conversation
cárcel,jail
te,tea
caballeros,gentlemen
veo,see
primera,first
irá,go
negra,black
n,n
subterráneo,subway
ei,ei
podrá,can
terrible,terrible
platillo,saucer
grupo,group
tía,aunt
estilo,style
recordar,remember
norte,north
coche,car
descanso,rest
principal,principal
demonio,demon
dile,tell
municipal,municipal
se,oneself
armario,closet
deberíamos,should
estos,these
sitio,site
entero,whole
metido,involved
oveja,sheep
barato,cheap
peso,weight
llevaba,took
manera,way
cualquier,any
árboles,trees
creer,believe
época,time
espero,hope
equipo,team
buen,good
trae,brings
mío,mine
soleado,sunny
jane,jane
llamó,called
próximo,next
fuerte,strong
resumen,summary
reglas,rules
necesito,need
soy,am
hermosa,beautiful
bebe,baby
felicidad,happiness
fragmento,fragment
intentando,trying
globo,balloon
vayas,go
derecha,right
vuelvo,return
sucede,happens
palo,stick
estaré,be
uva,grape
estás,are
abrumar,overwhelm
puedas,can
área,area
contra,against
vuelta,return
lágrimas,tears
estuve,was
frank,frank
historia,history
algún,some
europa,europe
esté,be
llamo,call
hicieron,made
niña,girl
donación,donation
mismos,same
quizás,maybe
radio,radio
algunos,some
mató,killed
planeta,planet
duele,hurts
ven,come
señal,signal
unir,merge
único,only


================================================
FILE: examples/02_lazy_loading.py
================================================
""" Example of lazy vs normal loading
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 02
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf 

######################################## 
## NORMAL LOADING   			      ##
## print out a graph with 1 Add node  ## 
########################################

x = tf.Variable(10, name='x')
y = tf.Variable(20, name='y')
z = tf.add(x, y)

with tf.Session() as sess:
	sess.run(tf.global_variables_initializer())
	writer = tf.summary.FileWriter('graphs/normal_loading', sess.graph)
	for _ in range(10):
		sess.run(z)
	print(tf.get_default_graph().as_graph_def())
	writer.close()

######################################## 
## LAZY LOADING   					  ##
## print out a graph with 10 Add nodes## 
########################################

x = tf.Variable(10, name='x')
y = tf.Variable(20, name='y')

with tf.Session() as sess:
	sess.run(tf.global_variables_initializer())
	writer = tf.summary.FileWriter('graphs/lazy_loading', sess.graph)
	for _ in range(10):
		sess.run(tf.add(x, y))
	print(tf.get_default_graph().as_graph_def()) 
	writer.close()
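
# A rough, framework-free sketch of why the lazy-loading loop above grows the
# graph while the normal-loading loop does not. ToyGraph and its add() method
# are hypothetical stand-ins invented for this illustration; they are not
# TensorFlow APIs and only mimic the "creating an op adds a node" behaviour.
class ToyGraph(object):
    """Records every op that gets created, like a graph definition would."""
    def __init__(self):
        self.nodes = []

    def add(self, a, b):
        # creating an op appends a node, whether or not it is ever run
        self.nodes.append(('Add', a, b))
        return len(self.nodes) - 1      # handle to the new node

normal = ToyGraph()
z_node = normal.add('x', 'y')           # op created once, before the loop
for _ in range(10):
    _ = z_node                          # "running" reuses the same node

lazy = ToyGraph()
for _ in range(10):
    _ = lazy.add('x', 'y')              # a new op is created on every iteration

print(len(normal.nodes), len(lazy.nodes))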

================================================
FILE: examples/02_placeholder.py
================================================
""" Placeholder and feed_dict example
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 02
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

# Example 1: feed_dict with placeholder

# a is a placeholder for a vector of 3 elements, type tf.float32
a = tf.placeholder(tf.float32, shape=[3])
b = tf.constant([5, 5, 5], tf.float32)

# use the placeholder as you would a constant
c = a + b  # short for tf.add(a, b)

writer = tf.summary.FileWriter('graphs/placeholders', tf.get_default_graph())
with tf.Session() as sess:
    # compute the value of c given the value of a is [1, 2, 3]
    print(sess.run(c, {a: [1, 2, 3]}))                 # [6. 7. 8.]
writer.close()


# Example 2: feed_dict with tensors that are not placeholders
a = tf.add(2, 5)
b = tf.multiply(a, 3)

with tf.Session() as sess:
    print(sess.run(b))                                 # >> 21
    # compute the value of b given the value of a is 15
    print(sess.run(b, feed_dict={a: 15}))              # >> 45

================================================
FILE: examples/02_simple_tf.py
================================================
""" Simple TensorFlow ops
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf

# Example 1: Simple ways to create log file writer
a = tf.constant(2, name='a')
b = tf.constant(3, name='b')
x = tf.add(a, b, name='add')
writer = tf.summary.FileWriter('./graphs/simple', tf.get_default_graph()) 
with tf.Session() as sess:
    # writer = tf.summary.FileWriter('./graphs', sess.graph) 
    print(sess.run(x))
writer.close() # close the writer when you're done using it

# Example 2: The wonderful wizard of div
a = tf.constant([2, 2], name='a')
b = tf.constant([[0, 1], [2, 3]], name='b')

with tf.Session() as sess:
    print(sess.run(tf.div(b, a)))
    print(sess.run(tf.divide(b, a)))
    print(sess.run(tf.truediv(b, a)))
    print(sess.run(tf.floordiv(b, a)))
    # print(sess.run(tf.realdiv(b, a)))
    print(sess.run(tf.truncatediv(b, a)))
    print(sess.run(tf.floor_div(b, a)))

# Example 3: multiplying tensors
a = tf.constant([10, 20], name='a')
b = tf.constant([2, 3], name='b')

with tf.Session() as sess:
    print(sess.run(tf.multiply(a, b)))
    print(sess.run(tf.tensordot(a, b, 1)))

# Example 4: Python native type
t_0 = 19 
x = tf.zeros_like(t_0) 					# ==> 0
y = tf.ones_like(t_0) 					# ==> 1

t_1 = ['apple', 'peach', 'banana']
x = tf.zeros_like(t_1) 					# ==> ['' '' '']
# y = tf.ones_like(t_1) 				# ==> TypeError: Expected string, got 1 of type 'int' instead.

t_2 = [[True, False, False],
       [False, False, True],
       [False, True, False]] 
x = tf.zeros_like(t_2) 					# ==> 3x3 tensor, all elements are False
y = tf.ones_like(t_2) 					# ==> 3x3 tensor, all elements are True

print(tf.int32.as_numpy_dtype())

# Example 5: printing your graph's definition
my_const = tf.constant([1.0, 2.0], name='my_const')
print(tf.get_default_graph().as_graph_def())
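
# A plain-Python sketch of the rounding behaviours that the division ops in
# Example 2 distinguish. The helper names are made up for this illustration
# and work on scalars rather than tensors. With the non-negative inputs used
# in Example 2, floor and truncating division give the same results, so a
# negative operand is used here to make the difference visible.
def floor_division(p, q):
    return p // q                   # rounds toward negative infinity

def truncating_division(p, q):
    return int(float(p) / q)        # rounds toward zero

def true_division(p, q):
    return float(p) / q             # no rounding

print(floor_division(-7, 2), truncating_division(-7, 2), true_division(-7, 2))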

================================================
FILE: examples/02_variables.py
================================================
""" Variable examples
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 02
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf

# Example 1: creating variables
s = tf.Variable(2, name='scalar') 
m = tf.Variable([[0, 1], [2, 3]], name='matrix') 
W = tf.Variable(tf.zeros([784,10]), name='big_matrix')
V = tf.Variable(tf.truncated_normal([784, 10]), name='normal_matrix')

s = tf.get_variable('scalar', initializer=tf.constant(2)) 
m = tf.get_variable('matrix', initializer=tf.constant([[0, 1], [2, 3]]))
W = tf.get_variable('big_matrix', shape=(784, 10), initializer=tf.zeros_initializer())
V = tf.get_variable('normal_matrix', shape=(784, 10), initializer=tf.truncated_normal_initializer())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(V.eval())

# Example 2: assigning values to variables
W = tf.Variable(10)
W.assign(100)
with tf.Session() as sess:
    sess.run(W.initializer)
    print(sess.run(W))                    	# >> 10

W = tf.Variable(10)
assign_op = W.assign(100)
with tf.Session() as sess:
    sess.run(assign_op)
    print(W.eval())                     	# >> 100

# create a variable whose original value is 2
# (a new name is needed: 'scalar' was already created with tf.get_variable in Example 1)
a = tf.get_variable('scalar_to_double', initializer=tf.constant(2)) 
a_times_two = a.assign(a * 2)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer()) 
    sess.run(a_times_two)                 	# >> 4
    sess.run(a_times_two)                 	# >> 8
    sess.run(a_times_two)                 	# >> 16

W = tf.Variable(10)
with tf.Session() as sess:
    sess.run(W.initializer)
    print(sess.run(W.assign_add(10)))     	# >> 20
    print(sess.run(W.assign_sub(2)))     	# >> 18

# Example 3: Each session maintains its own copy of a variable
W = tf.Variable(10)
sess1 = tf.Session()
sess2 = tf.Session()
sess1.run(W.initializer)
sess2.run(W.initializer)
print(sess1.run(W.assign_add(10)))        	# >> 20
print(sess2.run(W.assign_sub(2)))        	# >> 8
print(sess1.run(W.assign_add(100)))        	# >> 120
print(sess2.run(W.assign_sub(50)))        	# >> -42
sess1.close()
sess2.close()

# Example 4: create a variable with the initial value depending on another variable
W = tf.Variable(tf.truncated_normal([700, 10]))
U = tf.Variable(W * 2)

================================================
FILE: examples/03_linreg_dataset.py
================================================
""" Solution for simple linear regression example using tf.data
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import utils

DATA_FILE = 'data/birth_life_2010.txt'

# Step 1: read in the data
data, n_samples = utils.read_birth_life_data(DATA_FILE)

# Step 2: create Dataset and iterator
dataset = tf.data.Dataset.from_tensor_slices((data[:,0], data[:,1]))

iterator = dataset.make_initializable_iterator()
X, Y = iterator.get_next()

# Step 3: create weight and bias, initialized to 0
w = tf.get_variable('weights', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))

# Step 4: build model to predict Y
Y_predicted = X * w + b

# Step 5: use the squared error as the loss function
loss = tf.square(Y - Y_predicted, name='loss')
# loss = utils.huber_loss(Y, Y_predicted)

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

start = time.time()
with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    sess.run(tf.global_variables_initializer()) 
    writer = tf.summary.FileWriter('./graphs/linear_reg', sess.graph)
    
    # Step 8: train the model for 100 epochs
    for i in range(100):
        sess.run(iterator.initializer) # initialize the iterator
        total_loss = 0
        try:
            while True:
                _, l = sess.run([optimizer, loss]) 
                total_loss += l
        except tf.errors.OutOfRangeError:
            pass
            
        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))

    # close the writer when you're done using it
    writer.close() 
    
    # Step 9: output the values of w and b
    w_out, b_out = sess.run([w, b]) 
    print('w: %f, b: %f' %(w_out, b_out))
print('Took: %f seconds' %(time.time() - start))

# plot the results
plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data with squared error')
# plt.plot(data[:,0], data[:,0] * (-5.883589) + 85.124306, 'g', label='Predicted data with Huber loss')
plt.legend()
plt.show()
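
# A scalar, plain-Python sketch of the Huber loss mentioned as an alternative
# in Step 5. This is not the utils.huber_loss referenced above (its definition
# is not shown in this file); the function name and the default delta are
# assumptions made for illustration.
def huber_loss_scalar(label, prediction, delta=1.0):
    residual = abs(label - prediction)
    if residual <= delta:
        # quadratic near zero, like the squared error
        return 0.5 * residual ** 2
    # linear for large residuals, so outliers are penalised less
    return delta * residual - 0.5 * delta ** 2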

================================================
FILE: examples/03_linreg_placeholder.py
================================================
""" Solution for simple linear regression example using placeholders
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import utils

DATA_FILE = 'data/birth_life_2010.txt'

# Step 1: read in data from the .txt file
data, n_samples = utils.read_birth_life_data(DATA_FILE)

# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')

# Step 3: create weight and bias, initialized to 0
w = tf.get_variable('weights', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))

# Step 4: build model to predict Y
Y_predicted = w * X + b 

# Step 5: use the squared error as the loss function
# you can use either mean squared error or Huber loss
loss = tf.square(Y - Y_predicted, name='loss')
# loss = utils.huber_loss(Y, Y_predicted)

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)


start = time.time()
writer = tf.summary.FileWriter('./graphs/linear_reg', tf.get_default_graph())
with tf.Session() as sess:
	# Step 7: initialize the necessary variables, in this case, w and b
	sess.run(tf.global_variables_initializer()) 
	
	# Step 8: train the model for 100 epochs
	for i in range(100): 
		total_loss = 0
		for x, y in data:
			# The session executes the optimizer and fetches the value of loss
			_, l = sess.run([optimizer, loss], feed_dict={X: x, Y:y}) 
			total_loss += l
		print('Epoch {0}: {1}'.format(i, total_loss/n_samples))

	# close the writer when you're done using it
	writer.close() 
	
	# Step 9: output the values of w and b
	w_out, b_out = sess.run([w, b]) 

print('Took: %f seconds' %(time.time() - start))

# plot the results
plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')
plt.legend()
plt.show()
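
# A plain-Python sketch of what one gradient descent step on the squared
# error does for a single (x, y) sample, assuming the model y = w * x + b.
# sgd_step is a hypothetical helper written for this note; the script above
# lets GradientDescentOptimizer compute and apply the gradients instead.
def sgd_step(w_val, b_val, x_val, y_val, learning_rate=0.001):
    error = y_val - (w_val * x_val + b_val)
    # d/dw (y - (w*x + b))^2 = -2 * error * x ;  d/db = -2 * error
    grad_w = -2.0 * error * x_val
    grad_b = -2.0 * error
    return w_val - learning_rate * grad_w, b_val - learning_rate * grad_b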

================================================
FILE: examples/03_linreg_starter.py
================================================
""" Starter code for simple linear regression example using placeholders
Created by Chip Huyen (huyenn@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import utils

DATA_FILE = 'data/birth_life_2010.txt'

# Step 1: read in data from the .txt file
data, n_samples = utils.read_birth_life_data(DATA_FILE)

# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
# Remember both X and Y are scalars with type float
X, Y = None, None
#############################
########## TO DO ############
#############################

# Step 3: create weight and bias, initialized to 0.0
# Make sure to use tf.get_variable
w, b = None, None
#############################
########## TO DO ############
#############################

# Step 4: build model to predict Y
# e.g. how would you arrive at Y_predicted given X, w, and b
Y_predicted = None
#############################
########## TO DO ############
#############################

# Step 5: use the squared error as the loss function
loss = None
#############################
########## TO DO ############
#############################

# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

start = time.time()

# Create a filewriter to write the model's graph to TensorBoard
#############################
########## TO DO ############
#############################

with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    #############################
    ########## TO DO ############
    #############################

    # Step 8: train the model for 100 epochs
    for i in range(100):
        total_loss = 0
        for x, y in data:
            # Execute the optimizer op and get the value of loss.
            # Don't forget to feed in data for placeholders.
            # Store the result under a name other than `loss` so the loss tensor is not overwritten.
            _, l = ########## TO DO ############
            total_loss += l

        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))

    # close the writer when you're done using it
    #############################
    ########## TO DO ############
    #############################
    writer.close()
    
    # Step 9: output the values of w and b
    w_out, b_out = None, None
    #############################
    ########## TO DO ############
    #############################

print('Took: %f seconds' %(time.time() - start))

# uncomment the following lines to see the plot 
# plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
# plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')
# plt.legend()
# plt.show()

================================================
FILE: examples/03_logreg.py
================================================
""" Solution for simple logistic regression model for MNIST
with tf.data module
MNIST dataset: yann.lecun.com/exdb/mnist/
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
import time

import utils

# Define parameters for the model
learning_rate = 0.01
batch_size = 128
n_epochs = 30
n_train = 60000
n_test = 10000

# Step 1: Read in data
mnist_folder = 'data/mnist'
utils.download_mnist(mnist_folder)
train, val, test = utils.read_mnist(mnist_folder, flatten=True)

# Step 2: Create datasets and iterator
train_data = tf.data.Dataset.from_tensor_slices(train)
train_data = train_data.shuffle(10000) # if you want to shuffle your data
train_data = train_data.batch(batch_size)

test_data = tf.data.Dataset.from_tensor_slices(test)
test_data = test_data.batch(batch_size)

iterator = tf.data.Iterator.from_structure(train_data.output_types, 
                                           train_data.output_shapes)
img, label = iterator.get_next()

train_init = iterator.make_initializer(train_data)	# initializer for train_data
test_init = iterator.make_initializer(test_data)	# initializer for test_data

# Step 3: create weights and bias
# w is initialized to random variables with mean of 0, stddev of 0.01
# b is initialized to 0
# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)
# shape of b depends on Y
w = tf.get_variable(name='weights', shape=(784, 10), initializer=tf.random_normal_initializer(0, 0.01))
b = tf.get_variable(name='bias', shape=(1, 10), initializer=tf.zeros_initializer())

# Step 4: build model
# the model that returns the logits.
# these logits will later be passed through a softmax layer
logits = tf.matmul(img, w) + b 

# Step 5: define loss function
# use cross entropy of softmax of logits as the loss function
entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=label, name='entropy')
loss = tf.reduce_mean(entropy, name='loss') # computes the mean over all the examples in the batch

# Step 6: define training op
# using the Adam optimizer with learning rate of 0.01 to minimize loss
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

# Step 7: calculate accuracy with test set
preds = tf.nn.softmax(logits)
correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(label, 1))
accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))

writer = tf.summary.FileWriter('./graphs/logreg', tf.get_default_graph())
with tf.Session() as sess:
   
    start_time = time.time()
    sess.run(tf.global_variables_initializer())

    # train the model n_epochs times
    for i in range(n_epochs): 	
        sess.run(train_init)	# drawing samples from train_data
        total_loss = 0
        n_batches = 0
        try:
            while True:
                _, l = sess.run([optimizer, loss])
                total_loss += l
                n_batches += 1
        except tf.errors.OutOfRangeError:
            pass
        print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))
    print('Total time: {0} seconds'.format(time.time() - start_time))

    # test the model
    sess.run(test_init)			# drawing samples from test_data
    total_correct_preds = 0
    try:
        while True:
            accuracy_batch = sess.run(accuracy)
            total_correct_preds += accuracy_batch
    except tf.errors.OutOfRangeError:
        pass

    print('Accuracy {0}'.format(total_correct_preds/n_test))
writer.close()
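
# A plain-Python sketch of the quantity computed per example in Step 5: the
# cross entropy between a one-hot label and the softmax of the logits.
# softmax_cross_entropy is a hypothetical helper for illustration only; it is
# not the TensorFlow op and handles a single example rather than a batch.
import math

def softmax_cross_entropy(logits_row, one_hot_label):
    # subtract the max logit first for numerical stability
    largest = max(logits_row)
    shifted = [v - largest for v in logits_row]
    log_sum_exp = math.log(sum(math.exp(v) for v in shifted))
    log_probs = [v - log_sum_exp for v in shifted]
    return -sum(t * lp for t, lp in zip(one_hot_label, log_probs))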


================================================
FILE: examples/03_logreg_placeholder.py
================================================
""" Solution for simple logistic regression model for MNIST
with placeholder
MNIST dataset: yann.lecun.com/exdb/mnist/
Created by Chip Huyen (huyenn@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import time

import utils

# Define parameters for the model
learning_rate = 0.01
batch_size = 128
n_epochs = 30

# Step 1: Read in data
# using TF Learn's built in function to load MNIST data to the folder data/mnist
mnist = input_data.read_data_sets('data/mnist', one_hot=True)
X_batch, Y_batch = mnist.train.next_batch(batch_size)

# Step 2: create placeholders for features and labels
# each image in the MNIST data is of shape 28*28 = 784
# therefore, each image is represented with a 1x784 tensor
# there are 10 classes for each image, corresponding to digits 0 - 9. 
# each label is a one-hot vector.
X = tf.placeholder(tf.float32, [batch_size, 784], name='image') 
Y = tf.placeholder(tf.int32, [batch_size, 10], name='label')

# Step 3: create weights and bias
# w is initialized to random variables with mean of 0, stddev of 0.01
# b is initialized to 0
# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)
# shape of b depends on Y
w = tf.get_variable(name='weights', shape=(784, 10), initializer=tf.random_normal_initializer(0, 0.01))
b = tf.get_variable(name='bias', shape=(1, 10), initializer=tf.zeros_initializer())

# Step 4: build model
# the model that returns the logits.
# these logits will later be passed through a softmax layer
logits = tf.matmul(X, w) + b 

# Step 5: define loss function
# use cross entropy of softmax of logits as the loss function
entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y, name='loss')
loss = tf.reduce_mean(entropy) # computes the mean over all the examples in the batch
# loss = tf.reduce_mean(-tf.reduce_sum(tf.cast(Y, tf.float32) * tf.log(tf.nn.softmax(logits)), reduction_indices=[1]))

# Step 6: define training op
# using the Adam optimizer with learning rate of 0.01 to minimize loss
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

# Step 7: calculate accuracy with test set
preds = tf.nn.softmax(logits)
correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))

writer = tf.summary.FileWriter('./graphs/logreg_placeholder', tf.get_default_graph())
with tf.Session() as sess:
	start_time = time.time()
	sess.run(tf.global_variables_initializer())	
	n_batches = int(mnist.train.num_examples/batch_size)
	
	# train the model n_epochs times
	for i in range(n_epochs): 
		total_loss = 0

		for j in range(n_batches):
			X_batch, Y_batch = mnist.train.next_batch(batch_size)
			_, loss_batch = sess.run([optimizer, loss], {X: X_batch, Y:Y_batch}) 
			total_loss += loss_batch
		print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))
	print('Total time: {0} seconds'.format(time.time() - start_time))

	# test the model
	n_batches = int(mnist.test.num_examples/batch_size)
	total_correct_preds = 0

	for i in range(n_batches):
		X_batch, Y_batch = mnist.test.next_batch(batch_size)
		accuracy_batch = sess.run(accuracy, {X: X_batch, Y:Y_batch})
		total_correct_preds += accuracy_batch	

	print('Accuracy {0}'.format(total_correct_preds/mnist.test.num_examples))

writer.close()
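
# A plain-Python sketch of the accuracy bookkeeping in Step 7 and the test
# loop: count the rows whose highest-scoring class matches the one-hot label,
# sum those counts over batches, then divide by the number of examples.
# count_correct is a hypothetical helper, not part of the script above.
def count_correct(pred_rows, label_rows):
    def argmax(row):
        return max(range(len(row)), key=lambda i: row[i])
    return sum(1 for p, t in zip(pred_rows, label_rows)
               if argmax(p) == argmax(t))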


================================================
FILE: examples/03_logreg_starter.py
================================================
""" Starter code for simple logistic regression model for MNIST
with tf.data module
MNIST dataset: yann.lecun.com/exdb/mnist/
Created by Chip Huyen (chiphuyen@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 03
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
import time

import utils

# Define parameters for the model
learning_rate = 0.01
batch_size = 128
n_epochs = 30
n_train = 60000
n_test = 10000

# Step 1: Read in data
mnist_folder = 'data/mnist'
utils.download_mnist(mnist_folder)
train, val, test = utils.read_mnist(mnist_folder, flatten=True)

# Step 2: Create datasets and iterator
# create training Dataset and batch it
train_data = tf.data.Dataset.from_tensor_slices(train)
train_data = train_data.shuffle(10000) # if you want to shuffle your data
train_data = train_data.batch(batch_size)

# create testing Dataset and batch it
test_data = None
#############################
########## TO DO ############
#############################


# create one iterator and initialize it with different datasets
iterator = tf.data.Iterator.from_structure(train_data.output_types, 
                                           train_data.output_shapes)
img, label = iterator.get_next()

train_init = iterator.make_initializer(train_data)	# initializer for train_data
test_init = iterator.make_initializer(test_data)	# initializer for test_data

# Step 3: create weights and bias
# w is initialized to random variables with mean of 0, stddev of 0.01
# b is initialized to 0
# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)
# shape of b depends on Y
w, b = None, None
#############################
########## TO DO ############
#############################


# Step 4: build model
# the model that returns the logits.
# these logits will later be passed through a softmax layer
logits = None
#############################
########## TO DO ############
#############################


# Step 5: define loss function
# use cross entropy of softmax of logits as the loss function
loss = None
#############################
########## TO DO ############
#############################


# Step 6: define optimizer
# use the Adam optimizer with the pre-defined learning rate to minimize loss
optimizer = None
#############################
########## TO DO ############
#############################


# Step 7: calculate accuracy with test set
preds = tf.nn.softmax(logits)
correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(label, 1))
accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))

writer = tf.summary.FileWriter('./graphs/logreg', tf.get_default_graph())
with tf.Session() as sess:
   
    start_time = time.time()
    sess.run(tf.global_variables_initializer())

    # train the model n_epochs times
    for i in range(n_epochs): 	
        sess.run(train_init)	# drawing samples from train_data
        total_loss = 0
        n_batches = 0
        try:
            while True:
                _, l = sess.run([optimizer, loss])
                total_loss += l
                n_batches += 1
        except tf.errors.OutOfRangeError:
            pass
        print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))
    print('Total time: {0} seconds'.format(time.time() - start_time))

    # test the model
    sess.run(test_init)			# drawing samples from test_data
    total_correct_preds = 0
    try:
        while True:
            accuracy_batch = sess.run(accuracy)
            total_correct_preds += accuracy_batch
    except tf.errors.OutOfRangeError:
        pass

    print('Accuracy {0}'.format(total_correct_preds/n_test))
writer.close()
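
# --- Illustrative sketch (not part of the original starter code) ---
# A pure-Python rendering of what Step 7 computes for one batch, assuming
# one-hot labels: count the rows where the arg-max of the prediction matches
# the arg-max of the label. `count_correct` is a hypothetical helper
# introduced here for illustration only.
def count_correct(pred_rows, label_rows):
    def argmax(row):
        return max(range(len(row)), key=lambda k: row[k])
    return sum(1 for p, y in zip(pred_rows, label_rows)
               if argmax(p) == argmax(y))

toy_preds = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
toy_labels = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
# Rows 0 and 2 match. Summing these counts over all batches and dividing by
# the number of test examples (n_test above) gives the accuracy.
toy_correct = count_correct(toy_preds, toy_labels)
print(toy_correct)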

================================================
FILE: examples/04_linreg_eager.py
================================================
""" A simple regression example using eager execution.
Created by Akshay Agrawal (akshayka@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 04
"""
import time

import tensorflow as tf
import tensorflow.contrib.eager as tfe
import matplotlib.pyplot as plt

import utils

DATA_FILE = 'data/birth_life_2010.txt'

# In order to use eager execution, `tfe.enable_eager_execution()` must be
# called at the very beginning of a TensorFlow program.
tfe.enable_eager_execution()

# Read the data into a dataset.
data, n_samples = utils.read_birth_life_data(DATA_FILE)
dataset = tf.data.Dataset.from_tensor_slices((data[:,0], data[:,1]))

# Create variables.
w = tfe.Variable(0.0)
b = tfe.Variable(0.0)

# Define the linear predictor.
def prediction(x):
  return x * w + b

# Define loss functions of the form: L(y, y_predicted)
def squared_loss(y, y_predicted):
  return (y - y_predicted) ** 2

def huber_loss(y, y_predicted, m=1.0):
  """Huber loss."""
  t = y - y_predicted
  # Note that enabling eager execution lets you use Python control flow and
  # specify dynamic TensorFlow computations. Contrast this implementation
  # to the graph-construction one found in `utils`, which uses `tf.cond`.
  return t ** 2 if tf.abs(t) <= m else m * (2 * tf.abs(t) - m)

def train(loss_fn):
  """Train a regression model evaluated using `loss_fn`."""
  print('Training; loss function: ' + loss_fn.__name__)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

  # Define the function through which to differentiate.
  def loss_for_example(x, y):
    return loss_fn(y, prediction(x))

  # `grad_fn(x_i, y_i)` returns (1) the value of `loss_for_example`
  # evaluated at `x_i`, `y_i` and (2) the gradients of any variables used in
  # calculating it.
  grad_fn = tfe.implicit_value_and_gradients(loss_for_example)

  start = time.time()
  for epoch in range(100):
    total_loss = 0.0
    for x_i, y_i in tfe.Iterator(dataset):
      loss, gradients = grad_fn(x_i, y_i)
      # Take an optimization step and update variables.
      optimizer.apply_gradients(gradients)
      total_loss += loss
    if epoch % 10 == 0:
      print('Epoch {0}: {1}'.format(epoch, total_loss / n_samples))
  print('Took: %f seconds' % (time.time() - start))
  print('Eager execution exhibits significant overhead per operation. '
        'As you increase your batch size, the impact of the overhead will '
        'become less noticeable. Eager execution is under active development: '
        'expect performance to increase substantially in the near future!')

train(huber_loss)
plt.plot(data[:,0], data[:,1], 'bo')
# The `.numpy()` method of a tensor retrieves the NumPy array backing it.
# In future versions of eager, you won't need to call `.numpy()` and will
# instead be able to, in most cases, pass Tensors wherever NumPy arrays are
# expected.
plt.plot(data[:,0], data[:,0] * w.numpy() + b.numpy(), 'r',
         label="huber regression")
plt.legend()
plt.show()
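
# --- Illustrative sketch (not part of the original lecture code) ---
# A plain-Python check of the piecewise rule used by `huber_loss` above,
# assuming scalar inputs. `huber_loss_scalar` is a hypothetical helper
# introduced here for illustration only. Note that this form is twice the
# textbook Huber loss (t**2 rather than 0.5 * t**2); it mirrors the code above.
def huber_loss_scalar(y, y_predicted, m=1.0):
  """Quadratic for residuals within `m`, linear beyond it."""
  t = abs(y - y_predicted)
  return t ** 2 if t <= m else m * (2 * t - m)

# Small residual (|t| = 0.5 <= m): quadratic branch.
small = huber_loss_scalar(3.0, 2.5)
# Large residual (|t| = 3 > m): linear branch.
large = huber_loss_scalar(5.0, 2.0)
# The two branches agree at |t| = m, so the loss is continuous there.
boundary = huber_loss_scalar(1.0, 0.0)
print(small, large, boundary)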


================================================
FILE: examples/04_linreg_eager_starter.py
================================================
""" Starter code for a simple regression example using eager execution.
Created by Akshay Agrawal (akshayka@cs.stanford.edu)
CS20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Lecture 04
"""
import time

import tensorflow as tf
import tensorflow.contrib.eager as tfe
import matplotlib.pyplot as plt

import utils

DATA_FILE = 'data/birth_life_2010.txt'

# In order to use eager execution, `tfe.enable_eager_execution()` must be
# called at the very beginning of a TensorFlow program.
#############################
########## TO DO ############
#############################

# Read the data into a dataset.
data, n_samples = utils.read_birth_life_data(DATA_FILE)
dataset = tf.data.Dataset.from_tensor_slices((data[:,0], data[:,1]))

# Create weight and bias variables, initialized to 0.0.
#############################
########## TO DO ############
#############################
w = None
b = None

# Define the linear predictor.
def prediction(x):
  #############################
  ########## TO DO ############
  #############################
  pass

# Define loss functions of the form: L(y, y_predicted)
def squared_loss(y, y_predicted):
  #############################
  ########## TO DO ############
  #############################
  pass

def huber_loss(y, y_predicted):
  """Huber loss with `m` set to `1.0`."""
  #############################
  ########## TO DO ############
  #############################
  pass

def train(loss_fn):
  """Train a regression model evaluated using `loss_fn`."""
  print('Training; loss function: ' + loss_fn.__name__)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

  # Define the function through which to differentiate.
  #############################
  ########## TO DO ############
  #############################
  def loss_for_example(x, y):
    pass

  # Obtain a gradients function using `tfe.implicit_value_and_gradients`.
  #############################
  ########## TO DO ############
  #############################
  grad_fn = None

  start = time.time()
  for epoch in range(100):
    total_loss = 0.0
    for x_i, y_i in tfe.Iterator(dataset):
      # Compute the loss and gradient, and take an optimization step.
      #############################
      ########## TO DO ############
      #############################
      optimizer.apply_gradients(gradients)
      total_loss += loss
    if epoch % 10 == 0:
      print('Epoch {0}: {1}'.format(epoch, total_loss / n_samples))
  print('Took: %f seconds' % (time.time() - start))
  print('Eager execution exhibits significant overhead per operation. '
        'As you increase your batch size, the impact of the overhead will '
        'become less noticeable. Eager execution is under active development: '
        'expect performance to increase substantially in the near future!')

train(huber_loss)
plt.plot(data[:,0], data[:,1], 'bo')
# The `.numpy()` method of a tensor retrieves the NumPy array backing it.
# In future versions of eager, you won't need to call `.numpy()` and will
# instead be able to, in most cases, pass Tensors wherever NumPy arrays are
# expected.
plt.plot(data[:,0], data[:,0] * w.numpy() + b.numpy(), 'r',
         label="huber regression")
plt.legend()
plt.show()


================================================
FILE: examples/04_word2vec.py
================================================
""" starter code for word2vec skip-gram model with NCE loss
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 04
"""

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector
import tensorflow as tf

import utils
import word2vec_utils

# Model hyperparameters
VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128            # dimension of the word embedding vectors
SKIP_WINDOW = 1             # the context window
NUM_SAMPLED = 64            # number of negative examples to sample
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 100000
VISUAL_FLD = 'visualization'
SKIP_STEP = 5000

# Parameters for downloading data
DOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip'
EXPECTED_BYTES = 31344016
NUM_VISUALIZE = 3000        # number of tokens to visualize


def word2vec(dataset):
    """ Build the graph for word2vec model and train it """
    # Step 1: get input, output from the dataset
    with tf.name_scope('data'):
        iterator = dataset.make_initializable_iterator()
        center_words, target_words = iterator.get_next()

    # Step 2 + 3: define weights and embedding lookup.
    # In word2vec, it's actually the weights that we care about.
    with tf.name_scope('embed'):
        embed_matrix = tf.get_variable('embed_matrix', 
                                        shape=[VOCAB_SIZE, EMBED_SIZE],
                                        initializer=tf.random_uniform_initializer())
        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embedding')

    # Step 4: construct variables for NCE loss and define loss function
    with tf.name_scope('loss'):
        nce_weight = tf.get_variable('nce_weight', shape=[VOCAB_SIZE, EMBED_SIZE],
                        initializer=tf.truncated_normal_initializer(stddev=1.0 / (EMBED_SIZE ** 0.5)))
        nce_bias = tf.get_variable('nce_bias', initializer=tf.zeros([VOCAB_SIZE]))

        # define loss function to be NCE loss function
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                            biases=nce_bias, 
                                            labels=target_words, 
                                            inputs=embed, 
                                            num_sampled=NUM_SAMPLED, 
                                            num_classes=VOCAB_SIZE), name='loss')

    # Step 5: define optimizer
    with tf.name_scope('optimizer'):
        optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
    
    utils.safe_mkdir('checkpoints')

    with tf.Session() as sess:
        sess.run(iterator.initializer)
        sess.run(tf.global_variables_initializer())

        total_loss = 0.0 # used to calculate the average loss over the last SKIP_STEP steps
        writer = tf.summary.FileWriter('graphs/word2vec_simple', sess.graph)

        for index in range(NUM_TRAIN_STEPS):
            try:
                loss_batch, _ = sess.run([loss, optimizer])
                total_loss += loss_batch
                if (index + 1) % SKIP_STEP == 0:
                    print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                    total_loss = 0.0
            except tf.errors.OutOfRangeError:
                sess.run(iterator.initializer)
        writer.close()

def gen():
    yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES, VOCAB_SIZE, 
                                        BATCH_SIZE, SKIP_WINDOW, VISUAL_FLD)

def main():
    dataset = tf.data.Dataset.from_generator(gen, 
                                (tf.int32, tf.int32), 
                                (tf.TensorShape([BATCH_SIZE]), tf.TensorShape([BATCH_SIZE, 1])))
    word2vec(dataset)

if __name__ == '__main__':
    main()


================================================
FILE: examples/04_word2vec_eager.py
================================================
""" starter code for word2vec skip-gram model with NCE loss
Eager execution
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu) & Akshay Agrawal (akshayka@cs.stanford.edu)
Lecture 04
"""

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
import tensorflow.contrib.eager as tfe

import utils
import word2vec_utils

tfe.enable_eager_execution()

# Model hyperparameters
VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128            # dimension of the word embedding vectors
SKIP_WINDOW = 1             # the context window
NUM_SAMPLED = 64            # number of negative examples to sample
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 100000
VISUAL_FLD = 'visualization'
SKIP_STEP = 5000

# Parameters for downloading data
DOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip'
EXPECTED_BYTES = 31344016

class Word2Vec(object):
  def __init__(self, vocab_size, embed_size, num_sampled=NUM_SAMPLED):
    self.vocab_size = vocab_size
    self.num_sampled = num_sampled
    self.embed_matrix = tfe.Variable(tf.random_uniform(
                                      [vocab_size, embed_size]))
    self.nce_weight = tfe.Variable(tf.truncated_normal(
                                    [vocab_size, embed_size],
                                    stddev=1.0 / (embed_size ** 0.5)))
    self.nce_bias = tfe.Variable(tf.zeros([vocab_size]))

  def compute_loss(self, center_words, target_words):
    """Computes the forward pass of word2vec with the NCE loss.""" 
    embed = tf.nn.embedding_lookup(self.embed_matrix, center_words)
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=self.nce_weight, 
                                        biases=self.nce_bias, 
                                        labels=target_words, 
                                        inputs=embed, 
                                        num_sampled=self.num_sampled, 
                                        num_classes=self.vocab_size))
    return loss


def gen():
  yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES,
                                      VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW,
                                      VISUAL_FLD)

def main():
  dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32),
                              (tf.TensorShape([BATCH_SIZE]),
                              tf.TensorShape([BATCH_SIZE, 1])))
  optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE)
  model = Word2Vec(vocab_size=VOCAB_SIZE, embed_size=EMBED_SIZE)
  grad_fn = tfe.implicit_value_and_gradients(model.compute_loss)
  total_loss = 0.0  # for average loss in the last SKIP_STEP steps
  num_train_steps = 0
  while num_train_steps < NUM_TRAIN_STEPS:
    for center_words, target_words in tfe.Iterator(dataset):
      if num_train_steps >= NUM_TRAIN_STEPS:
        break
      loss_batch, grads = grad_fn(center_words, target_words)
      total_loss += loss_batch
      optimizer.apply_gradients(grads)
      if (num_train_steps + 1) % SKIP_STEP == 0:
        print('Average loss at step {}: {:5.1f}'.format(
                num_train_steps, total_loss / SKIP_STEP))
        total_loss = 0.0
      num_train_steps += 1


if __name__ == '__main__':
    main()


================================================
FILE: examples/04_word2vec_eager_starter.py
================================================
""" starter code for word2vec skip-gram model with NCE loss
Eager execution
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu) & Akshay Agrawal (akshayka@cs.stanford.edu)
Lecture 04
"""

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
import tensorflow.contrib.eager as tfe

import utils
import word2vec_utils

# Enable eager execution!
#############################
########## TO DO ############
#############################

# Model hyperparameters
VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128            # dimension of the word embedding vectors
SKIP_WINDOW = 1             # the context window
NUM_SAMPLED = 64            # number of negative examples to sample
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 100000
VISUAL_FLD = 'visualization'
SKIP_STEP = 5000

# Parameters for downloading data
DOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip'
EXPECTED_BYTES = 31344016

class Word2Vec(object):
  def __init__(self, vocab_size, embed_size, num_sampled=NUM_SAMPLED):
    self.vocab_size = vocab_size
    self.num_sampled = num_sampled
    # Create the variables: an embedding matrix, nce_weight, and nce_bias
    #############################
    ########## TO DO ############
    #############################
    self.embed_matrix = None
    self.nce_weight = None
    self.nce_bias = None

  def compute_loss(self, center_words, target_words):
    """Computes the forward pass of word2vec with the NCE loss.""" 
    # Look up the embeddings for the center words
    #############################
    ########## TO DO ############
    #############################
    embed = None

    # Compute the loss, using tf.reduce_mean and tf.nn.nce_loss
    #############################
    ########## TO DO ############
    #############################
    loss = None
    return loss


def gen():
  yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES,
                                      VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW,
                                      VISUAL_FLD)

def main():
  dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32),
                              (tf.TensorShape([BATCH_SIZE]),
                              tf.TensorShape([BATCH_SIZE, 1])))
  optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE)
  # Create the model
  #############################
  ########## TO DO ############
  #############################
  model = None

  # Create the gradients function, using `tfe.implicit_value_and_gradients`
  #############################
  ########## TO DO ############
  #############################
  grad_fn = None

  total_loss = 0.0  # for average loss in the last SKIP_STEP steps
  num_train_steps = 0
  while num_train_steps < NUM_TRAIN_STEPS:
    for center_words, target_words in tfe.Iterator(dataset):
      if num_train_steps >= NUM_TRAIN_STEPS:
        break

      # Compute the loss and gradients, and take an optimization step.
      #############################
      ########## TO DO ############
      #############################
      
      if (num_train_steps + 1) % SKIP_STEP == 0:
        print('Average loss at step {}: {:5.1f}'.format(
                num_train_steps, total_loss / SKIP_STEP))
        total_loss = 0.0
      num_train_steps += 1


if __name__ == '__main__':
    main()


================================================
FILE: examples/04_word2vec_visualize.py
================================================
""" word2vec skip-gram model with NCE loss and 
code to visualize the embeddings on TensorBoard
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 04
"""

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector
import tensorflow as tf

import utils
import word2vec_utils

# Model hyperparameters
VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128            # dimension of the word embedding vectors
SKIP_WINDOW = 1             # the context window
NUM_SAMPLED = 64            # number of negative examples to sample
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 100000
VISUAL_FLD = 'visualization'
SKIP_STEP = 5000

# Parameters for downloading data
DOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip'
EXPECTED_BYTES = 31344016
NUM_VISUALIZE = 3000        # number of tokens to visualize

class SkipGramModel:
    """ Build the graph for word2vec model """
    def __init__(self, dataset, vocab_size, embed_size, batch_size, num_sampled, learning_rate):
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.batch_size = batch_size
        self.num_sampled = num_sampled
        self.lr = learning_rate
        self.global_step = tf.get_variable('global_step', initializer=tf.constant(0), trainable=False)
        self.skip_step = SKIP_STEP
        self.dataset = dataset

    def _import_data(self):
        """ Step 1: import data
        """
        with tf.name_scope('data'):
            self.iterator = self.dataset.make_initializable_iterator()
            self.center_words, self.target_words = self.iterator.get_next()

    def _create_embedding(self):
        """ Step 2 + 3: define weights and embedding lookup.
        In word2vec, it's actually the weights that we care about 
        """
        with tf.name_scope('embed'):
            self.embed_matrix = tf.get_variable('embed_matrix', 
                                                shape=[self.vocab_size, self.embed_size],
                                                initializer=tf.random_uniform_initializer())
            self.embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words, name='embedding')

    def _create_loss(self):
        """ Step 4: define the loss function """
        with tf.name_scope('loss'):
            # construct variables for NCE loss
            nce_weight = tf.get_variable('nce_weight', 
                        shape=[self.vocab_size, self.embed_size],
                        initializer=tf.truncated_normal_initializer(stddev=1.0 / (self.embed_size ** 0.5)))
            nce_bias = tf.get_variable('nce_bias', initializer=tf.zeros([self.vocab_size]))

            # define loss function to be NCE loss function
            self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                                biases=nce_bias, 
                                                labels=self.target_words, 
                                                inputs=self.embed, 
                                                num_sampled=self.num_sampled, 
                                                num_classes=self.vocab_size), name='loss')
    def _create_optimizer(self):
        """ Step 5: define optimizer """
        self.optimizer = tf.train.GradientDescentOptimizer(self.lr).minimize(self.loss, 
                                                              global_step=self.global_step)

    def _create_summaries(self):
        with tf.name_scope('summaries'):
            tf.summary.scalar('loss', self.loss)
            tf.summary.histogram('histogram_loss', self.loss)
            # because we have several summaries, we should merge them all
            # into one op to make it easier to manage
            self.summary_op = tf.summary.merge_all()

    def build_graph(self):
        """ Build the graph for our model """
        self._import_data()
        self._create_embedding()
        self._create_loss()
        self._create_optimizer()
        self._create_summaries()

    def train(self, num_train_steps):
        saver = tf.train.Saver() # defaults to saving all variables - in this case global_step, embed_matrix, nce_weight, nce_bias

        initial_step = 0
        utils.safe_mkdir('checkpoints')
        with tf.Session() as sess:
            sess.run(self.iterator.initializer)
            sess.run(tf.global_variables_initializer())
            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))

            # if that checkpoint exists, restore from checkpoint
            if ckpt and ckpt.model_checkpoint_path:
                saver.restore(sess, ckpt.model_checkpoint_path)

            total_loss = 0.0 # used to calculate the average loss over the last SKIP_STEP steps
            writer = tf.summary.FileWriter('graphs/word2vec/lr' + str(self.lr), sess.graph)
            initial_step = self.global_step.eval()

            for index in range(initial_step, initial_step + num_train_steps):
                try:
                    loss_batch, _, summary = sess.run([self.loss, self.optimizer, self.summary_op])
                    writer.add_summary(summary, global_step=index)
                    total_loss += loss_batch
                    if (index + 1) % self.skip_step == 0:
                        print('Average loss at step {}: {:5.1f}'.format(index, total_loss / self.skip_step))
                        total_loss = 0.0
                        saver.save(sess, 'checkpoints/skip-gram', index)
                except tf.errors.OutOfRangeError:
                    sess.run(self.iterator.initializer)
            writer.close()

    def visualize(self, visual_fld, num_visualize):
        """ Run 'tensorboard --logdir=visualization' to see the embeddings. """
        
        # create the list of the num_visualize most common words to visualize
        word2vec_utils.most_common_words(visual_fld, num_visualize)

        saver = tf.train.Saver()
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))

            # if that checkpoint exists, restore from checkpoint
            if ckpt and ckpt.model_checkpoint_path:
                saver.restore(sess, ckpt.model_checkpoint_path)

            final_embed_matrix = sess.run(self.embed_matrix)
            
            # you have to store embeddings in a new variable
            embedding_var = tf.Variable(final_embed_matrix[:num_visualize], name='embedding')
            sess.run(embedding_var.initializer)

            config = projector.ProjectorConfig()
            summary_writer = tf.summary.FileWriter(visual_fld)

            # add embedding to the config file
            embedding = config.embeddings.add()
            embedding.tensor_name = embedding_var.name
            
            # link this tensor to its metadata file, in this case the first NUM_VISUALIZE words of vocab
            embedding.metadata_path = 'vocab_' + str(num_visualize) + '.tsv'

            # saves a configuration file that TensorBoard will read during startup.
            projector.visualize_embeddings(summary_writer, config)
            saver_embed = tf.train.Saver([embedding_var])
            saver_embed.save(sess, os.path.join(visual_fld, 'model.ckpt'), 1)

def gen():
    yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES, VOCAB_SIZE, 
                                        BATCH_SIZE, SKIP_WINDOW, VISUAL_FLD)

def main():
    dataset = tf.data.Dataset.from_generator(gen, 
                                (tf.int32, tf.int32), 
                                (tf.TensorShape([BATCH_SIZE]), tf.TensorShape([BATCH_SIZE, 1])))
    model = SkipGramModel(dataset, VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE, NUM_SAMPLED, LEARNING_RATE)
    model.build_graph()
    model.train(NUM_TRAIN_STEPS)
    model.visualize(VISUAL_FLD, NUM_VISUALIZE)

if __name__ == '__main__':
    main()

================================================
FILE: examples/05_randomization.py
================================================
""" Examples to demonstrate ops level randomization
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 05
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

# Example 1: session keeps track of the random state
c = tf.random_uniform([], -10, 10, seed=2)

with tf.Session() as sess:
    print(sess.run(c)) # >> 3.574932
    print(sess.run(c)) # >> -5.9731865

# Example 2: each new session will start the random state all over again.
c = tf.random_uniform([], -10, 10, seed=2)

with tf.Session() as sess:
    print(sess.run(c)) # >> 3.574932

with tf.Session() as sess:
    print(sess.run(c)) # >> 3.574932

# Example 3: with operation level random seed, each op keeps its own seed.
c = tf.random_uniform([], -10, 10, seed=2)
d = tf.random_uniform([], -10, 10, seed=2)

with tf.Session() as sess:
    print(sess.run(c)) # >> 3.574932
    print(sess.run(d)) # >> 3.574932

# Example 4: graph level random seed
tf.set_random_seed(2)
c = tf.random_uniform([], -10, 10)
d = tf.random_uniform([], -10, 10)

with tf.Session() as sess:
    print(sess.run(c)) # >> 9.123926
    print(sess.run(d)) # >> -4.5340395
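
# --- Illustrative sketch (not part of the original lecture code) ---
# A loose pure-Python analogy for Examples 1-3, using the standard `random`
# module rather than TensorFlow. This is an assumption-laden analogy, not a
# description of TensorFlow's internals: each `random.Random(seed)` object
# stands in for an op-level seed together with a fresh session.
import random

# Like Example 1: one generator keeps advancing its state, so successive
# draws differ.
toy_gen = random.Random(2)
first, second = toy_gen.uniform(-10, 10), toy_gen.uniform(-10, 10)

# Like Examples 2 and 3: a generator re-created with the same seed starts
# over and reproduces the first draw.
restarted = random.Random(2).uniform(-10, 10)
print(first != second, first == restarted)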
    

================================================
FILE: examples/05_variable_sharing.py
================================================
""" Examples to demonstrate variable sharing
CS 20: 'TensorFlow for Deep Learning Research'
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 05
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import tensorflow as tf

x1 = tf.truncated_normal([200, 100], name='x1')
x2 = tf.truncated_normal([200, 100], name='x2')

def two_hidden_layers(x):
    assert x.shape.as_list() == [200, 100]
    w1 = tf.Variable(tf.random_normal([100, 50]), name='h1_weights')
    b1 = tf.Variable(tf.zeros([50]), name='h1_biases')
    h1 = tf.matmul(x, w1) + b1
    assert h1.shape.as_list() == [200, 50]  
    w2 = tf.Variable(tf.random_normal([50, 10]), name='h2_weights')
    b2 = tf.Variable(tf.zeros([10]), name='h2_biases')
    logits = tf.matmul(h1, w2) + b2
    return logits

def two_hidden_layers_2(x):
    assert x.shape.as_list() == [200, 100]
    w1 = tf.get_variable('h1_weights', [100, 50], initializer=tf.random_normal_initializer())
    b1 = tf.get_variable('h1_biases', [50], initializer=tf.constant_initializer(0.0))
    h1 = tf.matmul(x, w1) + b1
    assert h1.shape.as_list() == [200, 50]  
    w2 = tf.get_variable('h2_weights', [50, 10], initializer=tf.random_normal_initializer())
    b2 = tf.get_variable('h2_biases', [10], initializer=tf.constant_initializer(0.0))
    logits = tf.matmul(h1, w2) + b2
    return logits

# logits1 = two_hidden_layers(x1)
# logits2 = two_hidden_layers(x2)

# logits1 = two_hidden_layers_2(x1)
# logits2 = two_hidden_layers_2(x2)

# with tf.variable_scope('two_layers') as scope:
#     logits1 = two_hidden_layers_2(x1)
#     scope.reuse_variables()
#     logits2 = two_hidden_layers_2(x2)

def fully_connected(x, output_dim, scope):
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE) as scope:
        w = tf.get_variable('weights', [x.shape[1], output_dim], initializer=tf.random_normal_initializer())
        b = tf.get_variable('biases', [output_dim], initializer=tf.constant_initializer(0.0))
        return tf.matmul(x, w) + b

def two_hidden_layers(x):
    h1 = fully_connected(x, 50, 'h1')
    h2 = fully_connected(h1, 10, 'h2')
    return h2

with tf.variable_scope('two_layers') as scope:
    logits1 = two_hidden_layers(x1)
    # scope.reuse_variables()
    logits2 = two_hidden_layers(x2)

writer = tf.summary.FileWriter('./graphs/cool_variables', tf.get_default_graph())
writer.close()
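
# A framework-free sketch of the get-or-create behaviour that
# tf.variable_scope(..., reuse=tf.AUTO_REUSE) gives fully_connected above.
# Assumption: _toy_registry and toy_get_variable are hypothetical names used
# for illustration only; they are not TensorFlow API, and the real
# tf.get_variable does far more (initializers, dtype and shape checks, ...).
_toy_registry = {}

def toy_get_variable(scope, name, shape):
    key = scope + '/' + name
    if key not in _toy_registry:
        # first call under this scoped name: create the entry
        _toy_registry[key] = {'name': key, 'shape': list(shape)}
    # later calls with the same scoped name get the same object back
    return _toy_registry[key]

toy_w_first = toy_get_variable('two_layers/h1', 'weights', [100, 50])
toy_w_second = toy_get_variable('two_layers/h1', 'weights', [100, 50])
toy_w_other = toy_get_variable('two_layers/h2', 'weights', [50, 10])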

================================================
FILE: examples/07_convnet_layers.py
================================================
""" Using convolutional net on MNIST dataset of handwritten digits
MNIST dataset: http://yann.lecun.com/exdb/mnist/
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 07
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time 

import tensorflow as tf

import utils

class ConvNet(object):
    def __init__(self):
        self.lr = 0.001
        self.batch_size = 128
        self.keep_prob = tf.constant(0.75)
        self.gstep = tf.Variable(0, dtype=tf.int32, 
                                trainable=False, name='global_step')
        self.n_classes = 10
        self.skip_step = 20
        self.n_test = 10000
        self.training=False

    def get_data(self):
        with tf.name_scope('data'):
            train_data, test_data = utils.get_mnist_dataset(self.batch_size)
            iterator = tf.data.Iterator.from_structure(train_data.output_types, 
                                                   train_data.output_shapes)
            img, self.label = iterator.get_next()
            self.img = tf.reshape(img, shape=[-1, 28, 28, 1])
            # reshape the image to make it work with tf.nn.conv2d

            self.train_init = iterator.make_initializer(train_data)  # initializer for train_data
            self.test_init = iterator.make_initializer(test_data)    # initializer for test_data

    def inference(self):
        conv1 = tf.layers.conv2d(inputs=self.img,
                                  filters=32,
                                  kernel_size=[5, 5],
                                  padding='SAME',
                                  activation=tf.nn.relu,
                                  name='conv1')
        pool1 = tf.layers.max_pooling2d(inputs=conv1, 
                                        pool_size=[2, 2], 
                                        strides=2,
                                        name='pool1')

        conv2 = tf.layers.conv2d(inputs=pool1,
                                  filters=64,
                                  kernel_size=[5, 5],
                                  padding='SAME',
                                  activation=tf.nn.relu,
                                  name='conv2')
        pool2 = tf.layers.max_pooling2d(inputs=conv2, 
                                        pool_size=[2, 2], 
                                        strides=2,
                                        name='pool2')

        feature_dim = pool2.shape[1] * pool2.shape[2] * pool2.shape[3]
        pool2 = tf.reshape(pool2, [-1, feature_dim])
        fc = tf.layers.dense(pool2, 1024, activation=tf.nn.relu, name='fc')
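        # tf.layers.dropout takes the fraction of units to drop (rate),
        # not the fraction to keep
        # note: self.training is a Python bool read when the graph is built
        dropout = tf.layers.dropout(fc,
                                    rate=1 - self.keep_prob,
                                    training=self.training,
                                    name='dropout')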
        self.logits = tf.layers.dense(dropout, self.n_classes, name='logits')

    def loss(self):
        '''
        define loss function
        use softmax cross entropy with logits as the loss function
        compute mean cross entropy, softmax is applied internally
        '''
        with tf.name_scope('loss'):
            entropy = tf.nn.softmax_cross_entropy_with_logits(labels=self.label, logits=self.logits)
            self.loss = tf.reduce_mean(entropy, name='loss')
    
    def optimize(self):
        '''
        Define training op
        using the Adam optimizer to minimize the loss
        '''
        self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss, 
                                                global_step=self.gstep)

    def summary(self):
        '''
        Create summaries to write on TensorBoard
        '''
        with tf.name_scope('summaries'):
            tf.summary.scalar('loss', self.loss)
            tf.summary.scalar('accuracy', self.accuracy)
            tf.summary.histogram('histogram loss', self.loss)
            self.summary_op = tf.summary.merge_all()
    
    def eval(self):
        '''
        Count the number of right predictions in a batch
        '''
        with tf.name_scope('predict'):
            preds = tf.nn.softmax(self.logits)
            correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(self.label, 1))
            self.accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))

    def build(self):
        '''
        Build the computation graph
        '''
        self.get_data()
        self.inference()
        self.loss()
        self.optimize()
        self.eval()
        self.summary()

    def train_one_epoch(self, sess, saver, init, writer, epoch, step):
        start_time = time.time()
        sess.run(init) 
        self.training = True
        total_loss = 0
        n_batches = 0
        try:
            while True:
                _, l, summaries = sess.run([self.opt, self.loss, self.summary_op])
                writer.add_summary(summaries, global_step=step)
                if (step + 1) % self.skip_step == 0:
                    print('Loss at step {0}: {1}'.format(step, l))
                step += 1
                total_loss += l
                n_batches += 1
        except tf.errors.OutOfRangeError:
            pass
        saver.save(sess, 'checkpoints/convnet_layers/mnist-convnet', step)
        print('Average loss at epoch {0}: {1}'.format(epoch, total_loss/n_batches))
        print('Took: {0} seconds'.format(time.time() - start_time))
        return step

    def eval_once(self, sess, init, writer, epoch, step):
        start_time = time.time()
        sess.run(init)
        self.training = False
        total_correct_preds = 0
        try:
            while True:
                accuracy_batch, summaries = sess.run([self.accuracy, self.summary_op])
                writer.add_summary(summaries, global_step=step)
                total_correct_preds += accuracy_batch
        except tf.errors.OutOfRangeError:
            pass

        print('Accuracy at epoch {0}: {1} '.format(epoch, total_correct_preds/self.n_test))
        print('Took: {0} seconds'.format(time.time() - start_time))

    def train(self, n_epochs):
        '''
        The train function alternates between training one epoch and evaluating
        '''
        utils.safe_mkdir('checkpoints')
        utils.safe_mkdir('checkpoints/convnet_layers')
        writer = tf.summary.FileWriter('./graphs/convnet_layers', tf.get_default_graph())

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver()
            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_layers/checkpoint'))
            if ckpt and ckpt.model_checkpoint_path:
                saver.restore(sess, ckpt.model_checkpoint_path)
            
            step = self.gstep.eval()

            for epoch in range(n_epochs):
                step = self.train_one_epoch(sess, saver, self.train_init, writer, epoch, step)
                self.eval_once(sess, self.test_init, writer, epoch, step)
        writer.close()

if __name__ == '__main__':
    model = ConvNet()
    model.build()
    model.train(n_epochs=15)

================================================
FILE: examples/07_convnet_mnist.py
================================================
""" Using convolutional net on MNIST dataset of handwritten digits
MNIST dataset: http://yann.lecun.com/exdb/mnist/
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 07
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time 

import tensorflow as tf

import utils

def conv_relu(inputs, filters, k_size, stride, padding, scope_name):
    '''
    A method that does convolution + relu on inputs
    '''
    with tf.variable_scope(scope_name, reuse=tf.AUTO_REUSE) as scope:
        in_channels = inputs.shape[-1]
        kernel = tf.get_variable('kernel', 
                                [k_size, k_size, in_channels, filters], 
                                initializer=tf.truncated_normal_initializer())
        biases = tf.get_variable('biases', 
                                [filters],
                                initializer=tf.random_normal_initializer())
        conv = tf.nn.conv2d(inputs, kernel, strides=[1, stride, stride, 1], padding=padding)
    return tf.nn.relu(conv + biases, name=scope.name)

def maxpool(inputs, ksize, stride, padding='VALID', scope_name='pool'):
    '''A method that does max pooling on inputs'''
    with tf.variable_scope(scope_name, reuse=tf.AUTO_REUSE) as scope:
        pool = tf.nn.max_pool(inputs, 
                            ksize=[1, ksize, ksize, 1], 
                            strides=[1, stride, stride, 1],
                            padding=padding)
    return pool

def fully_connected(inputs, out_dim, scope_name='fc'):
    '''
    A fully connected linear layer on inputs
    '''
    with tf.variable_scope(scope_name, reuse=tf.AUTO_REUSE) as scope:
        in_dim = inputs.shape[-1]
        w = tf.get_variable('weights', [in_dim, out_dim],
                            initializer=tf.truncated_normal_initializer())
        b = tf.get_variable('biases', [out_dim],
                            initializer=tf.constant_initializer(0.0))
        out = tf.matmul(inputs, w) + b
    return out

class ConvNet(object):
    def __init__(self):
        self.lr = 0.001
        self.batch_size = 128
        self.keep_prob = tf.constant(0.75)
        self.gstep = tf.Variable(0, dtype=tf.int32, 
                                trainable=False, name='global_step')
        self.n_classes = 10
        self.skip_step = 20
        self.n_test = 10000
        self.training = True

    def get_data(self):
        with tf.name_scope('data'):
            train_data, test_data = utils.get_mnist_dataset(self.batch_size)
            iterator = tf.data.Iterator.from_structure(train_data.output_types, 
                                                   train_data.output_shapes)
            img, self.label = iterator.get_next()
            self.img = tf.reshape(img, shape=[-1, 28, 28, 1])
            # reshape the image to make it work with tf.nn.conv2d

            self.train_init = iterator.make_initializer(train_data)  # initializer for train_data
            self.test_init = iterator.make_initializer(test_data)    # initializer for test_data

    def inference(self):
        conv1 = conv_relu(inputs=self.img,
                        filters=32,
                        k_size=5,
                        stride=1,
                        padding='SAME',
                        scope_name='conv1')
        pool1 = maxpool(conv1, 2, 2, 'VALID', 'pool1')
        conv2 = conv_relu(inputs=pool1,
                        filters=64,
                        k_size=5,
                        stride=1,
                        padding='SAME',
                        scope_name='conv2')
        pool2 = maxpool(conv2, 2, 2, 'VALID', 'pool2')
        feature_dim = pool2.shape[1] * pool2.shape[2] * pool2.shape[3]
        pool2 = tf.reshape(pool2, [-1, feature_dim])
        fc = fully_connected(pool2, 1024, 'fc')
        dropout = tf.nn.dropout(tf.nn.relu(fc), self.keep_prob, name='relu_dropout')
        self.logits = fully_connected(dropout, self.n_classes, 'logits')

    def loss(self):
        '''
        define loss function
        use softmax cross entropy with logits as the loss function
        compute mean cross entropy, softmax is applied internally
        '''
        with tf.name_scope('loss'):
            entropy = tf.nn.softmax_cross_entropy_with_logits(labels=self.label, logits=self.logits)
            self.loss = tf.reduce_mean(entropy, name='loss')
    
    def optimize(self):
        '''
        Define training op
        using the Adam optimizer to minimize the loss
        '''
        self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss, 
                                                global_step=self.gstep)

    def summary(self):
        '''
        Create summaries to write on TensorBoard
        '''
        with tf.name_scope('summaries'):
            tf.summary.scalar('loss', self.loss)
            tf.summary.scalar('accuracy', self.accuracy)
            tf.summary.histogram('histogram loss', self.loss)
            self.summary_op = tf.summary.merge_all()
    
    def eval(self):
        '''
        Count the number of right predictions in a batch
        '''
        with tf.name_scope('predict'):
            preds = tf.nn.softmax(self.logits)
            correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(self.label, 1))
            self.accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))

    def build(self):
        '''
        Build the computation graph
        '''
        self.get_data()
        self.inference()
        self.loss()
        self.optimize()
        self.eval()
        self.summary()

    def train_one_epoch(self, sess, saver, init, writer, epoch, step):
        start_time = time.time()
        sess.run(init) 
        self.training = True
        total_loss = 0
        n_batches = 0
        try:
            while True:
                _, l, summaries = sess.run([self.opt, self.loss, self.summary_op])
                writer.add_summary(summaries, global_step=step)
                if (step + 1) % self.skip_step == 0:
                    print('Loss at step {0}: {1}'.format(step, l))
                step += 1
                total_loss += l
                n_batches += 1
        except tf.errors.OutOfRangeError:
            pass
        saver.save(sess, 'checkpoints/convnet_mnist/mnist-convnet', step)
        print('Average loss at epoch {0}: {1}'.format(epoch, total_loss/n_batches))
        print('Took: {0} seconds'.format(time.time() - start_time))
        return step

    def eval_once(self, sess, init, writer, epoch, step):
        start_time = time.time()
        sess.run(init)
        self.training = False
        total_correct_preds = 0
        try:
            while True:
                accuracy_batch, summaries = sess.run([self.accuracy, self.summary_op])
                writer.add_summary(summaries, global_step=step)
                total_correct_preds += accuracy_batch
        except tf.errors.OutOfRangeError:
            pass

        print('Accuracy at epoch {0}: {1} '.format(epoch, total_correct_preds/self.n_test))
        print('Took: {0} seconds'.format(time.time() - start_time))

    def train(self, n_epochs):
        '''
        The train function alternates between training one epoch and evaluating
        '''
        utils.safe_mkdir('checkpoints')
        utils.safe_mkdir('checkpoints/convnet_mnist')
        writer = tf.summary.FileWriter('./graphs/convnet', tf.get_default_graph())

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver()
            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_mnist/checkpoint'))
            if ckpt and ckpt.model_checkpoint_path:
                saver.restore(sess, ckpt.model_checkpoint_path)
            
            step = self.gstep.eval()

            for epoch in range(n_epochs):
                step = self.train_one_epoch(sess, saver, self.train_init, writer, epoch, step)
                self.eval_once(sess, self.test_init, writer, epoch, step)
        writer.close()

if __name__ == '__main__':
    model = ConvNet()
    model.build()
    model.train(n_epochs=30)
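
# A plain-Python sketch of the shape arithmetic behind feature_dim in
# inference(). Assumption: toy_out_size is a hypothetical helper, not part of
# TensorFlow; it applies the output-size rules TensorFlow documents for
# 'SAME' and 'VALID' padding to one spatial dimension.
import math

def toy_out_size(in_size, k_size, stride, padding):
    if padding == 'SAME':
        return int(math.ceil(in_size / float(stride)))
    # 'VALID': only positions where the whole window fits
    return int(math.ceil((in_size - k_size + 1) / float(stride)))

toy_side = 28                                       # MNIST images are 28 x 28
toy_side = toy_out_size(toy_side, 5, 1, 'SAME')     # conv1 -> 28
toy_side = toy_out_size(toy_side, 2, 2, 'VALID')    # pool1 -> 14
toy_side = toy_out_size(toy_side, 5, 1, 'SAME')     # conv2 -> 14
toy_side = toy_out_size(toy_side, 2, 2, 'VALID')    # pool2 -> 7
toy_feature_dim = toy_side * toy_side * 64          # 7 * 7 * 64 = 3136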


================================================
FILE: examples/07_convnet_mnist_starter.py
================================================
""" Using convolutional net on MNIST dataset of handwritten digits
MNIST dataset: http://yann.lecun.com/exdb/mnist/
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 07
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import time 

import tensorflow as tf

import utils

def conv_relu(inputs, filters, k_size, stride, padding, scope_name):
    '''
    A method that does convolution + relu on inputs
    '''
    #############################
    ########## TO DO ############
    #############################
    return None

def maxpool(inputs, ksize, stride, padding='VALID', scope_name='pool'):
    '''A method that does max pooling on inputs'''
    #############################
    ########## TO DO ############
    #############################
    return None

def fully_connected(inputs, out_dim, scope_name='fc'):
    '''
    A fully connected linear layer on inputs
    '''
    #############################
    ########## TO DO ############
    #############################
    return None

class ConvNet(object):
    def __init__(self):
        self.lr = 0.001
        self.batch_size = 128
        self.keep_prob = tf.constant(0.75)
        self.gstep = tf.Variable(0, dtype=tf.int32, 
                                trainable=False, name='global_step')
        self.n_classes = 10
        self.skip_step = 20
        self.n_test = 10000

    def get_data(self):
        with tf.name_scope('data'):
            train_data, test_data = utils.get_mnist_dataset(self.batch_size)
            iterator = tf.data.Iterator.from_structure(train_data.output_types, 
                                                   train_data.output_shapes)
            img, self.label = iterator.get_next()
            self.img = tf.reshape(img, shape=[-1, 28, 28, 1])
            # reshape the image to make it work with tf.nn.conv2d

            self.train_init = iterator.make_initializer(train_data)  # initializer for train_data
            self.test_init = iterator.make_initializer(test_data)    # initializer for test_data

    def inference(self):
        '''
        Build the model according to the description we've shown in class
        '''
        #############################
        ########## TO DO ############
        #############################
        self.logits = None

    def loss(self):
        '''
        define loss function
        use softmax cross entropy with logits as the loss function
        tf.nn.softmax_cross_entropy_with_logits
        softmax is applied internally
        don't forget to compute the mean across all samples in a batch
        '''
        #############################
        ########## TO DO ############
        #############################
        self.loss = None
    
    def optimize(self):
        '''
        Define training op
        using the Adam optimizer to minimize the loss
        Don't forget to use global step
        '''
        #############################
        ########## TO DO ############
        #############################
        self.opt = None

    def summary(self):
        '''
        Create summaries to write on TensorBoard
        Remember to track both training loss and test accuracy
        '''
        #############################
        ########## TO DO ############
        #############################
        self.summary_op = None
        
    def eval(self):
        '''
        Count the number of right predictions in a batch
        '''
        with tf.name_scope('predict'):
            preds = tf.nn.softmax(self.logits)
            correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(self.label, 1))
            self.accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))

    def build(self):
        '''
        Build the computation graph
        '''
        self.get_data()
        self.inference()
        self.loss()
        self.optimize()
        self.eval()
        self.summary()

    def train_one_epoch(self, sess, saver, init, writer, epoch, step):
        start_time = time.time()
        sess.run(init) 
        total_loss = 0
        n_batches = 0
        try:
            while True:
                _, l, summaries = sess.run([self.opt, self.loss, self.summary_op])
                writer.add_summary(summaries, global_step=step)
                if (step + 1) % self.skip_step == 0:
                    print('Loss at step {0}: {1}'.format(step, l))
                step += 1
                total_loss += l
                n_batches += 1
        except tf.errors.OutOfRangeError:
            pass
        saver.save(sess, 'checkpoints/convnet_starter/mnist-convnet', step)
        print('Average loss at epoch {0}: {1}'.format(epoch, total_loss/n_batches))
        print('Took: {0} seconds'.format(time.time() - start_time))
        return step

    def eval_once(self, sess, init, writer, epoch, step):
        start_time = time.time()
        sess.run(init)
        total_correct_preds = 0
        try:
            while True:
                accuracy_batch, summaries = sess.run([self.accuracy, self.summary_op])
                writer.add_summary(summaries, global_step=step)
                total_correct_preds += accuracy_batch
        except tf.errors.OutOfRangeError:
            pass

        print('Accuracy at epoch {0}: {1} '.format(epoch, total_correct_preds/self.n_test))
        print('Took: {0} seconds'.format(time.time() - start_time))

    def train(self, n_epochs):
        '''
        The train function alternates between training one epoch and evaluating
        '''
        utils.safe_mkdir('checkpoints')
        utils.safe_mkdir('checkpoints/convnet_starter')
        writer = tf.summary.FileWriter('./graphs/convnet_starter', tf.get_default_graph())

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver()
            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_starter/checkpoint'))
            if ckpt and ckpt.model_checkpoint_path:
                saver.restore(sess, ckpt.model_checkpoint_path)
            
            step = self.gstep.eval()

            for epoch in range(n_epochs):
                step = self.train_one_epoch(sess, saver, self.train_init, writer, epoch, step)
                self.eval_once(sess, self.test_init, writer, epoch, step)
        writer.close()

if __name__ == '__main__':
    model = ConvNet()
    model.build()
    model.train(n_epochs=15)

================================================
FILE: examples/07_run_kernels.py
================================================
"""
Simple examples of convolution to do some basic filters
Also demonstrates the use of TensorFlow data readers.

We will use some popular filters for our image.
It seems to be working with grayscale images, but not with rgb images.
It's probably because I didn't choose the right kernels for rgb images.

Kernels follow the tf.nn.conv2d filter layout [height, width, in_channels, out_channels]:
kernels for rgb images have dimensions 3 x 3 x 3 x 3
kernels for grayscale images have dimensions 3 x 3 x 1 x 1

CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 07
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import sys
sys.path.append('..')

from matplotlib import gridspec as gridspec
from matplotlib import pyplot as plt
import tensorflow as tf

import kernels

def read_one_image(filename):
    ''' This method is to show how to read image from a file into a tensor.
    The output is a tensor object.
    '''
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_image(image_string)
    image = tf.cast(image_decoded, tf.float32) / 256.0
    return image

def convolve(image, kernels, rgb=True, strides=[1, 3, 3, 1], padding='SAME'):
    images = [image[0]]
    for i, kernel in enumerate(kernels):
        filtered_image = tf.nn.conv2d(image, 
                                      kernel, 
                                      strides=strides,
                                      padding=padding)[0]
        if i == 2:
            filtered_image = tf.minimum(tf.nn.relu(filtered_image), 255)
        images.append(filtered_image)
    return images

def show_images(images, rgb=True):
    gs = gridspec.GridSpec(1, len(images))
    for i, image in enumerate(images):
        plt.subplot(gs[0, i])
        if rgb:
            plt.imshow(image)
        else: 
            image = image.reshape(image.shape[0], image.shape[1])
            plt.imshow(image, cmap='gray')
        plt.axis('off')
    plt.show()

def main():
    rgb = False
    if rgb:
        kernels_list = [kernels.BLUR_FILTER_RGB, 
                        kernels.SHARPEN_FILTER_RGB, 
                        kernels.EDGE_FILTER_RGB,
                        kernels.TOP_SOBEL_RGB,
                        kernels.EMBOSS_FILTER_RGB]
    else:
        kernels_list = [kernels.BLUR_FILTER,
                        kernels.SHARPEN_FILTER,
                        kernels.EDGE_FILTER,
                        kernels.TOP_SOBEL,
                        kernels.EMBOSS_FILTER]

    kernels_list = kernels_list[1:]
    image = read_one_image('data/friday.jpg')
    if not rgb:
        image = tf.image.rgb_to_grayscale(image)
    image = tf.expand_dims(image, 0) # make it into a batch of 1 element
    images = convolve(image, kernels_list, rgb)
    with tf.Session() as sess:
        images = sess.run(images) # convert images from tensors to float values
    show_images(images, rgb)

if __name__ == '__main__':
    main()

================================================
FILE: examples/11_char_rnn.py
================================================
""" A clean, no_frills character-level generative language model.

CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Danijar Hafner (mail@danijar.com)
& Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 11
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import random
import sys
sys.path.append('..')
import time

import tensorflow as tf

import utils

def vocab_encode(text, vocab):
    return [vocab.index(x) + 1 for x in text if x in vocab]

def vocab_decode(array, vocab):
    return ''.join([vocab[x - 1] for x in array])

def read_data(filename, vocab, window, overlap):
    # read all lines up front and close the file before the endless loop below
    with open(filename, 'r') as f:
        lines = [line.strip() for line in f]
    while True:
        random.shuffle(lines)

        for text in lines:
            text = vocab_encode(text, vocab)
            for start in range(0, len(text) - window, overlap):
                chunk = text[start: start + window]
                chunk += [0] * (window - len(chunk))
                yield chunk

def read_batch(stream, batch_size):
    batch = []
    for element in stream:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    yield batch
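
# A small sketch of the sliding-window chunking used in read_data above.
# Assumption: toy_windows is a hypothetical helper that repeats only the inner
# loop on an already-encoded line, so it runs without a data file.
# Note that `overlap` acts as the step between consecutive chunk starts.
def toy_windows(encoded, window, overlap):
    chunks = []
    for start in range(0, len(encoded) - window, overlap):
        chunks.append(encoded[start: start + window])
    return chunks

# a 10-token line, window of 4, a new chunk every 2 tokens
toy_chunks = toy_windows(list(range(1, 11)), 4, 2)
# every chunk is exactly `window` long, and a line no longer than the window
# yields no chunks at all
toy_short = toy_windows([1, 2, 3], 4, 2)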

class CharRNN(object):
    def __init__(self, model):
        self.model = model
        self.path = 'data/' + model + '.txt'
        if 'trump' in model:
            self.vocab = ("$%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    " '\"_abcdefghijklmnopqrstuvwxyz{|}@#➡📈")
        else:
            self.vocab = (" $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    "\\^_abcdefghijklmnopqrstuvwxyz{|}")

        self.seq = tf.placeholder(tf.int32, [None, None])
        self.temp = tf.constant(1.5)
        self.hidden_sizes = [128, 256]
        self.batch_size = 64
        self.lr = 0.0003
        self.skip_step = 1
        self.num_steps = 50 # for RNN unrolled
        self.len_generated = 200
        self.gstep = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

    def create_rnn(self, seq):
        layers = [tf.nn.rnn_cell.GRUCell(size) for size in self.hidden_sizes]
        cells = tf.nn.rnn_cell.MultiRNNCell(layers)
        batch = tf.shape(seq)[0]
        zero_states = cells.zero_state(batch, dtype=tf.float32)
        self.in_state = tuple([tf.placeholder_with_default(state, [None, state.shape[1]]) 
                                for state in zero_states])
        # this line to calculate the real length of seq
        # all seq are padded to be of the same length, which is num_steps
        length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)
        self.output, self.out_state = tf.nn.dynamic_rnn(cells, seq, length, self.in_state)

    def create_model(self):
        seq = tf.one_hot(self.seq, len(self.vocab))
        self.create_rnn(seq)
        self.logits = tf.layers.dense(self.output, len(self.vocab), None)
        loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits[:, :-1], 
                                                        labels=seq[:, 1:])
        self.loss = tf.reduce_sum(loss)
        # sample the next character from the Maxwell-Boltzmann distribution
        # with temperature temp. tf.multinomial expects unnormalized
        # log-probabilities, so the scaled logits are passed in directly
        self.sample = tf.multinomial(self.logits[:, -1] / self.temp, 1)[:, 0]
        self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss, global_step=self.gstep)

    def train(self):
        saver = tf.train.Saver()
        start = time.time()
        min_loss = None
        with tf.Session() as sess:
            writer = tf.summary.FileWriter('graphs/gist', sess.graph)
            sess.run(tf.global_variables_initializer())
            
            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/' + self.model + '/checkpoint'))
            if ckpt and ckpt.model_checkpoint_path:
                saver.restore(sess, ckpt.model_checkpoint_path)
            
            iteration = self.gstep.eval()
            stream = read_data(self.path, self.vocab, self.num_steps, overlap=self.num_steps//2)
            data = read_batch(stream, self.batch_size)
            while True:
                batch = next(data)

            # for batch in read_batch(read_data(DATA_PATH, vocab)):
                batch_loss, _ = sess.run([self.loss, self.opt], {self.seq: batch})
                if (iteration + 1) % self.skip_step == 0:
                    print('Iter {}. \n    Loss {}. Time {}'.format(iteration, batch_loss, time.time() - start))
                    self.online_infer(sess)
                    start = time.time()
                    checkpoint_name = 'checkpoints/' + self.model + '/char-rnn'
                    # save on the first report, then only when the batch loss improves
                    if min_loss is None or batch_loss < min_loss:
                        saver.save(sess, checkpoint_name, iteration)
                        min_loss = batch_loss
                iteration += 1

    def online_infer(self, sess):
        """ Generate sequence one character at a time, based on the previous character
        """
        for seed in ['Hillary', 'I', 'R', 'T', '@', 'N', 'M', '.', 'G', 'A', 'W']:
            sentence = seed
            state = None
            for _ in range(self.len_generated):
                batch = [vocab_encode(sentence[-1], self.vocab)]
                feed = {self.seq: batch}
                if state is not None: # for the first decoder step, the state is None
                    for i in range(len(state)):
                        feed.update({self.in_state[i]: state[i]})
                index, state = sess.run([self.sample, self.out_state], feed)
                sentence += vocab_decode(index, self.vocab)
            print('\t' + sentence)
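
# A plain-Python sketch of sampling probabilities under a temperature, the
# idea behind dividing the logits by self.temp in create_model.
# Assumption: toy_softmax_with_temperature is a hypothetical helper for
# illustration; it is not used by the model.
import math

def toy_softmax_with_temperature(logits, temp):
    scaled = [x / temp for x in logits]
    peak = max(scaled)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]
toy_probs_plain = toy_softmax_with_temperature(toy_logits, 1.0)
toy_probs_warm = toy_softmax_with_temperature(toy_logits, 1.5)
# a temperature above 1 flattens the distribution, so less likely
# characters are sampled more often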

def main():
    model = 'trump_tweets'
    utils.safe_mkdir('checkpoints')
    utils.safe_mkdir('checkpoints/' + model)

    lm = CharRNN(model)
    lm.create_model()
    lm.train()
    
if __name__ == '__main__':
    main()

================================================
FILE: examples/kernels.py
================================================
import numpy as np
import tensorflow as tf

a = np.zeros([3, 3, 3, 3])
a[1, 1, :, :] = 0.25
a[0, 1, :, :] = 0.125
a[1, 0, :, :] = 0.125
a[2, 1, :, :] = 0.125
a[1, 2, :, :] = 0.125
a[0, 0, :, :] = 0.0625
a[0, 2, :, :] = 0.0625
a[2, 0, :, :] = 0.0625
a[2, 2, :, :] = 0.0625

BLUR_FILTER_RGB = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 1, 1])
# a[1, 1, :, :] = 0.25
# a[0, 1, :, :] = 0.125
# a[1, 0, :, :] = 0.125
# a[2, 1, :, :] = 0.125
# a[1, 2, :, :] = 0.125
# a[0, 0, :, :] = 0.0625
# a[0, 2, :, :] = 0.0625
# a[2, 0, :, :] = 0.0625
# a[2, 2, :, :] = 0.0625
a[1, 1, :, :] = 1.0
a[0, 1, :, :] = 1.0
a[1, 0, :, :] = 1.0
a[2, 1, :, :] = 1.0
a[1, 2, :, :] = 1.0
a[0, 0, :, :] = 1.0
a[0, 2, :, :] = 1.0
a[2, 0, :, :] = 1.0
a[2, 2, :, :] = 1.0
BLUR_FILTER = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 3, 3])
a[1, 1, :, :] = 5
a[0, 1, :, :] = -1
a[1, 0, :, :] = -1
a[2, 1, :, :] = -1
a[1, 2, :, :] = -1

SHARPEN_FILTER_RGB = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 1, 1])
a[1, 1, :, :] = 5
a[0, 1, :, :] = -1
a[1, 0, :, :] = -1
a[2, 1, :, :] = -1
a[1, 2, :, :] = -1

SHARPEN_FILTER = tf.constant(a, dtype=tf.float32)

# a = np.zeros([3, 3, 3, 3])
# a[:, :, :, :] = -1
# a[1, 1, :, :] = 8

# EDGE_FILTER_RGB = tf.constant(a, dtype=tf.float32)

EDGE_FILTER_RGB = tf.constant([
    [[[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]],
     [[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]],
     [[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]]],
    [[[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]],
     [[8., 0., 0.], [0., 8., 0.], [0., 0., 8.]],
     [[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]]],
    [[[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]],
     [[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]],
     [[-1., 0., 0.], [0., -1., 0.], [0., 0., -1.]]]
])

a = np.zeros([3, 3, 1, 1])
# a[:, :, :, :] = -1
# a[1, 1, :, :] = 8
a[0, 1, :, :] = -1
a[1, 0, :, :] = -1
a[1, 2, :, :] = -1
a[2, 1, :, :] = -1
a[1, 1, :, :] = 4

EDGE_FILTER = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 3, 3])
a[0, :, :, :] = 1
a[0, 1, :, :] = 2 # originally 2
a[2, :, :, :] = -1
a[2, 1, :, :] = -2

TOP_SOBEL_RGB = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 1, 1])
a[0, :, :, :] = 1
a[0, 1, :, :] = 2 # originally 2
a[2, :, :, :] = -1
a[2, 1, :, :] = -2

TOP_SOBEL = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 3, 3])
a[0, 0, :, :] = -2
a[0, 1, :, :] = -1 
a[1, 0, :, :] = -1
a[1, 1, :, :] = 1
a[1, 2, :, :] = 1
a[2, 1, :, :] = 1
a[2, 2, :, :] = 2

EMBOSS_FILTER_RGB = tf.constant(a, dtype=tf.float32)

a = np.zeros([3, 3, 1, 1])
a[0, 0, :, :] = -2
a[0, 1, :, :] = -1 
a[1, 0, :, :] = -1
a[1, 1, :, :] = 1
a[1, 2, :, :] = 1
a[2, 1, :, :] = 1
a[2, 2, :, :] = 2
EMBOSS_FILTER = tf.constant(a, dtype=tf.float32)

================================================
FILE: examples/utils.py
================================================
import os
import gzip
import shutil
import struct
import urllib.request  # urlretrieve lives in the urllib.request submodule

os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

from matplotlib import pyplot as plt
import numpy as np
import tensorflow as tf

def huber_loss(labels, predictions, delta=14.0):
    """ Huber loss for a single (scalar) residual; tf.cond needs a scalar predicate. """
    residual = tf.abs(labels - predictions)
    def f1(): return 0.5 * tf.square(residual)
    def f2(): return delta * residual - 0.5 * tf.square(delta)
    return tf.cond(residual < delta, f1, f2)

def safe_mkdir(path):
    """ Create a directory if there isn't one already. """
    try:
        os.mkdir(path)
    except OSError:
        pass

def read_birth_life_data(filename):
    """
    Read in birth_life_2010.txt and return:
    data in the form of NumPy array
    n_samples: number of samples
    """
    text = open(filename, 'r').readlines()[1:]
    data = [line[:-1].split('\t') for line in text]
    births = [float(line[1]) for line in data]
    lifes = [float(line[2]) for line in data]
    data = list(zip(births, lifes))
    n_samples = len(data)
    data = np.asarray(data, dtype=np.float32)
    return data, n_samples

def download_one_file(download_url, 
                    local_dest, 
                    expected_byte=None, 
                    unzip_and_remove=False):
    """ 
    Download the file from download_url into local_dest
    if the file doesn't already exist.
    If expected_byte is provided, check if 
    the downloaded file has the same number of bytes.
    If unzip_and_remove is True, unzip the file and remove the zip file
    """
    if os.path.exists(local_dest) or os.path.exists(local_dest[:-3]):
        print('%s already exists' %local_dest)
    else:
        print('Downloading %s' %download_url)
        local_file, _ = urllib.request.urlretrieve(download_url, local_dest)
        file_stat = os.stat(local_dest)
        if expected_byte:
            if file_stat.st_size == expected_byte:
                print('Successfully downloaded %s' %local_dest)
                if unzip_and_remove:
                    with gzip.open(local_dest, 'rb') as f_in, open(local_dest[:-3],'wb') as f_out:
                        shutil.copyfileobj(f_in, f_out)
                    os.remove(local_dest)
            else:
                print('The downloaded file has unexpected number of bytes')

def download_mnist(path):
    """ 
    Download and unzip the dataset mnist if it's not already downloaded 
    Download from http://yann.lecun.com/exdb/mnist
    """
    safe_mkdir(path)
    url = 'http://yann.lecun.com/exdb/mnist'
    filenames = ['train-images-idx3-ubyte.gz',
                'train-labels-idx1-ubyte.gz',
                't10k-images-idx3-ubyte.gz',
                't10k-labels-idx1-ubyte.gz']
    expected_bytes = [9912422, 28881, 1648877, 4542]

    for filename, byte in zip(filenames, expected_bytes):
        download_url = url + '/' + filename  # URLs always use '/', unlike os.path.join on Windows
        local_dest = os.path.join(path, filename)
        download_one_file(download_url, local_dest, byte, True)

def parse_data(path, dataset, flatten):
    if dataset != 'train' and dataset != 't10k':
        raise NameError('dataset must be train or t10k')

    label_file = os.path.join(path, dataset + '-labels-idx1-ubyte')
    with open(label_file, 'rb') as file:
        _, num = struct.unpack(">II", file.read(8))
        labels = np.fromfile(file, dtype=np.int8) #int8
        new_labels = np.zeros((num, 10))
        new_labels[np.arange(num), labels] = 1
    
    img_file = os.path.join(path, dataset + '-images-idx3-ubyte')
    with open(img_file, 'rb') as file:
        _, num, rows, cols = struct.unpack(">IIII", file.read(16))
        imgs = np.fromfile(file, dtype=np.uint8).reshape(num, rows, cols) #uint8
        imgs = imgs.astype(np.float32) / 255.0
        if flatten:
            imgs = imgs.reshape([num, -1])

    return imgs, new_labels

def read_mnist(path, flatten=True, num_train=55000):
    """
    Read in the mnist dataset, given that the data is stored in path
    Return three tuples of numpy arrays
    ((train_imgs, train_labels), (val_imgs, val_labels), (test_imgs, test_labels))
    """
    imgs, labels = parse_data(path, 'train', flatten)
    indices = np.random.permutation(labels.shape[0])
    train_idx, val_idx = indices[:num_train], indices[num_train:]
    train_img, train_labels = imgs[train_idx, :], labels[train_idx, :]
    val_img, val_labels = imgs[val_idx, :], labels[val_idx, :]
    test = parse_data(path, 't10k', flatten)
    return (train_img, train_labels), (val_img, val_labels), test

def get_mnist_dataset(batch_size):
    # Step 1: Read in data
    mnist_folder = 'data/mnist'
    download_mnist(mnist_folder)
    train, val, test = read_mnist(mnist_folder, flatten=False)

    # Step 2: Create datasets and iterator
    train_data = tf.data.Dataset.from_tensor_slices(train)
    train_data = train_data.shuffle(10000) # if you want to shuffle your data
    train_data = train_data.batch(batch_size)

    test_data = tf.data.Dataset.from_tensor_slices(test)
    test_data = test_data.batch(batch_size)

    return train_data, test_data
    
def show(image):
    """
    Render a given numpy.uint8 2D array of pixel data.
    """
    plt.imshow(image, cmap='gray')
    plt.show()

================================================
FILE: examples/word2vec_utils.py
================================================
from collections import Counter
import random
import os
import sys
sys.path.append('..')
import zipfile

import numpy as np
from six.moves import urllib
import tensorflow as tf

import utils

def read_data(file_path):
    """ Read data into a list of tokens 
    There should be 17,005,207 tokens
    """
    with zipfile.ZipFile(file_path) as f:
        words = tf.compat.as_str(f.read(f.namelist()[0])).split() 
    return words

def build_vocab(words, vocab_size, visual_fld):
    """ Build vocabulary of VOCAB_SIZE most frequent words and write it to
    visualization/vocab.tsv
    """
    utils.safe_mkdir(visual_fld)
    file = open(os.path.join(visual_fld, 'vocab.tsv'), 'w')
    
    dictionary = dict()
    count = [('UNK', -1)]
    index = 0
    count.extend(Counter(words).most_common(vocab_size - 1))
    
    for word, _ in count:
        dictionary[word] = index
        index += 1
        file.write(word + '\n')
    
    index_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    file.close()
    return dictionary, index_dictionary

def convert_words_to_index(words, dictionary):
    """ Replace each word in the dataset with its index in the dictionary """
    return [dictionary[word] if word in dictionary else 0 for word in words]

def generate_sample(index_words, context_window_size):
    """ Form training pairs according to the skip-gram model. """
    for index, center in enumerate(index_words):
        context = random.randint(1, context_window_size)
        # yield every target in a randomly sized window before the center word
        for target in index_words[max(0, index - context): index]:
            yield center, target
        # yield every target in a randomly sized window after the center word
        for target in index_words[index + 1: index + context + 1]:
            yield center, target

def most_common_words(visual_fld, num_visualize):
    """ create a list of num_visualize most frequent words to visualize on TensorBoard.
    saved to visualization/vocab_[num_visualize].tsv
    """
    words = open(os.path.join(visual_fld, 'vocab.tsv'), 'r').readlines()[:num_visualize]
    words = [word for word in words]
    file = open(os.path.join(visual_fld, 'vocab_' + str(num_visualize) + '.tsv'), 'w')
    for word in words:
        file.write(word)
    file.close()

def batch_gen(download_url, expected_byte, vocab_size, batch_size, 
                skip_window, visual_fld):
    local_dest = 'data/text8.zip'
    utils.download_one_file(download_url, local_dest, expected_byte)
    words = read_data(local_dest)
    dictionary, _ = build_vocab(words, vocab_size, visual_fld)
    index_words = convert_words_to_index(words, dictionary)
    del words           # to save memory
    single_gen = generate_sample(index_words, skip_window)
    
    while True:
        center_batch = np.zeros(batch_size, dtype=np.int32)
        target_batch = np.zeros([batch_size, 1])
        for index in range(batch_size):
            center_batch[index], target_batch[index] = next(single_gen)
        yield center_batch, target_batch


================================================
FILE: setup/requirements.txt
================================================
tensorflow==1.4.1
scipy==1.0.0
scikit-learn==0.19.1
matplotlib==2.1.1
xlrd==1.1.0
ipdb==0.10.3
Pillow==5.0.0
lxml==4.1.1

================================================
FILE: setup/setup_instruction.md
================================================
Please follow the official instructions to install TensorFlow [here](https://www.tensorflow.org/install/). For this course, I will use Python 3.6 and TensorFlow 1.4. You’re welcome to use either Python 2 or Python 3 for the assignments. The starter code, though, will be in Python 3.6. You don't need a GPU for most code examples in this course, though having one won't hurt. If you install TensorFlow on your local machine, my recommendation is to always set up TensorFlow using virtualenv.

For the list of dependencies, please consult the file requirements.txt. This list will be updated as the course progresses. 

There are a few things to note:
- As of version 1.2, TensorFlow no longer provides GPU support on macOS.
- On macOS, Python 3.6 might give a warning but still works.
- TensorFlow with GPU support will only work with CUDA® Toolkit 8.0 and cuDNN v6.0, not the newest CUDA and cuDNN versions. Make sure that you install the correct CUDA and cuDNN versions to avoid frustrating issues.
- On Windows, TensorFlow supports only 64-bit Python 3.5 and Python 3.6.
- If you see the warning:
```bash
Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
```
it's because you didn't install TensorFlow from sources to take advantage of all these settings. You can choose to install TensorFlow from sources -- the process might take up to 30 minutes. To silence the warning, add this before importing TensorFlow: <br>

```python
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
```

- If you want to install TensorFlow from sources, keep in mind that TensorFlow doesn't officially support building TensorFlow on Windows. On Windows, you may try using the highly experimental Bazel on Windows or TensorFlow CMake build.
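
The Windows constraint in the notes above can be checked from Python before you install anything. The following is a minimal sketch that uses only the standard library. The names `check_interpreter`, `pointer_bits` and `SUPPORTED_WINDOWS_VERSIONS` are made up for illustration and are not part of the course code; the accepted versions are simply the ones listed in this document.

```python
import struct
import sys

# Assumption: the versions this document lists for Windows (Python 3.5 and 3.6).
SUPPORTED_WINDOWS_VERSIONS = {(3, 5), (3, 6)}


def pointer_bits():
    """Return the pointer width of the running interpreter (32 or 64)."""
    return struct.calcsize("P") * 8


def check_interpreter(version=None, bits=None, platform=None):
    """Return a list of human-readable problems; an empty list means OK.

    Arguments default to the values of the running interpreter, and can be
    overridden to see what the check would report on another machine.
    """
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    bits = pointer_bits() if bits is None else bits
    platform = sys.platform if platform is None else platform

    problems = []
    if platform.startswith("win"):  # "win32" on Windows; "darwin" does not match
        if bits != 64:
            problems.append("64-bit Python is required on Windows")
        if version not in SUPPORTED_WINDOWS_VERSIONS:
            problems.append("Python %d.%d is not 3.5 or 3.6" % version)
    return problems
```

Calling `check_interpreter()` with no arguments inspects the current interpreter. On macOS and Linux this sketch always returns an empty list, because the notes above only state a version constraint for Windows.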

Below are simpler instructions for installing TensorFlow on macOS. If you have any problems installing TensorFlow, feel free to post them on [Piazza](https://piazza.com/stanford/winter2018/cs20).

If you get a “permission denied” error from any command, put “sudo” in front of that command.

You will need pip3 (or pip if you use Python2), and virtualenv.

Step 1: install python3 and pip3. Skip this step if you already have both. You can find the official instruction [here](http://docs.python-guide.org/en/latest/starting/install3/osx/)

Step 2: upgrade six
```bash
$ sudo easy_install --upgrade six
```

Step 3: install virtualenv. Skip this step if you already have virtualenv
```bash
$ pip3 install virtualenv
```

Step 4: set up a project directory. You will do all work for this class in this directory
```bash
$ mkdir cs20
```

Step 5: set up virtual environment with python3
```bash
$ cd cs20
$ python3 -m venv .env
```
These commands create a `.env` subdirectory in your project where everything is installed.

Step 6: activate the virtual environment 
```bash
$ source .env/bin/activate
```

If you type:
```bash
$ pip3 freeze
```

You will see that nothing is shown, which means no package is installed in your virtual environment. So you have to install all packages that you need. For the list of packages you need for this class, you can see/download the list of requirements in [the setup folder of this repository](https://github.com/chiphuyen/stanford-tensorflow-tutorials/blob/master/setup/requirements.txt).

Step 7: Install Tensorflow and other dependencies
```bash
$ pip3 install -r requirements.txt
```
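
To confirm that the install matches the pinned versions, you can compare `requirements.txt` with the output of `pip3 freeze`, since both use the `name==version` format. Below is a minimal sketch, not part of the course code: the helper names `parse_requirements` and `find_mismatches` are hypothetical, and the parser deliberately ignores anything that is not a simple `name==version` pin.

```python
def parse_requirements(text):
    """Parse 'name==version' lines into a {name: version} dict."""
    pinned = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue  # this sketch skips blank and unpinned lines
        name, version = line.split("==", 1)
        pinned[name.strip().lower()] = version.strip()
    return pinned


def find_mismatches(pinned, installed):
    """Return {name: (wanted, found)} for packages missing or at another version."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = installed.get(name)  # None if the package is not installed
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

For example, read `requirements.txt` into one string and the saved output of `pip3 freeze` into another, pass each through `parse_requirements`, and hand both dicts to `find_mismatches`. An empty result means every pinned package is installed at the requested version. Package names are only lowercased here, so names that differ in `-` versus `_` would be reported as missing.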

Step n: 
To exit the virtual environment, use:
```bash
$ deactivate
```

### Other options
#### Floydhub
Floydhub has a clean, GitHub-like interface that allows you to create and run TensorFlow projects.

# Possible set up problems
## Matplotlib
If you have problems using Matplotlib in a virtual environment, here are two simple ways to fix them. <br>
1. If you installed matplotlib using pip, there is a directory in your home directory called ~/.matplotlib.
Create a file ~/.matplotlib/matplotlibrc there and add the following line to it: ```backend: TkAgg```
2. After importing matplotlib, simply add: ```matplotlib.use("TkAgg")```
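
The first fix can also be scripted. This is a minimal sketch using only the standard library; the function name `write_matplotlibrc` is hypothetical. It overwrites any existing `matplotlibrc` in the target directory, so back up your own settings first if you have any.

```python
import os


def write_matplotlibrc(config_dir, backend="TkAgg"):
    """Create config_dir/matplotlibrc containing a single backend setting.

    Any existing matplotlibrc in config_dir is overwritten.
    """
    os.makedirs(config_dir, exist_ok=True)  # create the directory if needed
    rc_path = os.path.join(config_dir, "matplotlibrc")
    with open(rc_path, "w") as rc_file:
        rc_file.write("backend: %s\n" % backend)
    return rc_path
```

To apply it to the directory named above, call `write_matplotlibrc(os.path.expanduser("~/.matplotlib"))`.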

If you run into more problems, feel free to post your questions on [Piazza](https://piazza.com/stanford/winter2018/cs20) or email us at cs20-win1718-staff@lists.stanford.edu.